schihei 220864d52f Add feature to extract images as separate files

2025-04-24 21:44:49 +02:00

Mistral OCR API Documentation

This document describes the Mistral OCR API response format and shows how to work with it in your applications.

API Response Format

The Mistral OCR API returns a JSON response with the following structure:

{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}

Document Metadata

The metadata object contains document-level information:

| Field | Type | Description |
|-------|------|-------------|
| `title` | String | The document title, if available |
| `author` | String | The document author, if available |
| `creation_date` | String | The document creation date in ISO format (YYYY-MM-DD), if available |
| `page_count` | Integer | The total number of pages in the document |

Note that some metadata fields may be empty or missing if the information cannot be extracted from the document.
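
Because metadata fields may be empty or missing, it can help to normalize them once before using them. The following is a minimal sketch, not part of the API: `DocumentMetadata` and `parse_metadata` are hypothetical names introduced here, and treating empty strings as `None` is an assumption about how you want to handle absent values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentMetadata:
    """Hypothetical container for the `metadata` object described above."""
    title: Optional[str] = None
    author: Optional[str] = None
    creation_date: Optional[str] = None
    page_count: int = 0


def parse_metadata(ocr_data: dict) -> DocumentMetadata:
    """Build DocumentMetadata, treating missing or empty fields as None."""
    raw = ocr_data.get("metadata") or {}
    return DocumentMetadata(
        title=raw.get("title") or None,
        author=raw.get("author") or None,
        creation_date=raw.get("creation_date") or None,
        page_count=int(raw.get("page_count") or 0),
    )
```

Calling code can then test `metadata.title is None` instead of checking for both a missing key and an empty string.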

Pages

The pages array contains objects representing each page in the document:

| Field | Type | Description |
|-------|------|-------------|
| `index` | Integer | Zero-based page index |
| `markdown` | String | The extracted text content in Markdown format |
| `images` | Array | An array of image objects found on the page |
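
Since each page carries its own `index` and `markdown`, the pages can be stitched into a single Markdown document. This is a small sketch under two assumptions: `combine_pages` is a hypothetical helper, and a horizontal rule is used as the page separator purely for illustration.

```python
def combine_pages(ocr_data: dict, separator: str = "\n\n---\n\n") -> str:
    """Join the Markdown of all pages, ordered by their zero-based index."""
    pages = sorted(ocr_data.get("pages", []), key=lambda p: p.get("index", 0))
    return separator.join(p.get("markdown", "") for p in pages)
```

Sorting by `index` guards against pages arriving out of order; if the array is already ordered, the sort leaves it unchanged.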

Images

Each image object in the images array has the following structure:

| Field | Type | Description |
|-------|------|-------------|
| `id` | String | A unique identifier for the image |
| `image_base64` | String | Base64-encoded image data (only included if `include_images` is specified) |

Working with the API Response

Parsing the JSON Response

Here's an example of how to parse the JSON response in Python:

import json

# Load the JSON response
with open('ocr_results.json', 'r', encoding='utf-8') as f:
    ocr_data = json.load(f)

# Access metadata
title = ocr_data.get('metadata', {}).get('title', 'Untitled Document')
page_count = ocr_data.get('metadata', {}).get('page_count', 0)

# Access page content
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    page_content = page.get('markdown', '')
    
    print(f"Page {page_index + 1}:")
    print(page_content)
    print("-" * 40)

Handling Images

The Mistral OCR CLI provides two approaches for handling images:

1. Embedded Images

With the --images flag and without --extract-images, images are embedded directly in the Markdown as base64 data. If the response includes image data (requested with the --include-images flag), you can also extract and save the images manually:

import base64
import os

# Create a directory for images
os.makedirs('extracted_images', exist_ok=True)

# Extract images from each page
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    
    for img_index, image in enumerate(page.get('images', [])):
        img_id = image.get('id', f'unknown-{img_index}')
        img_data = image.get('image_base64', '')
        
        if img_data:
            # Remove data URL prefix if present
            if ',' in img_data:
                img_data = img_data.split(',', 1)[1]
            
            # Decode and save the image.
            # Note: the .jpg extension is assumed here; the actual image
            # format may differ.
            img_bytes = base64.b64decode(img_data)
            with open(f'extracted_images/page{page_index}_{img_id}.jpg', 'wb') as img_file:
                img_file.write(img_bytes)
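
The snippet above saves every image with a `.jpg` extension regardless of its real format. One way to choose the extension is to inspect the first bytes of the decoded data. The sketch below is an illustration only: `guess_image_extension` is a hypothetical helper, it covers just the JPEG, PNG, GIF, and WebP signatures, and the `.bin` fallback is an arbitrary choice.

```python
# Well-known file signatures ("magic numbers") for common image formats.
_SIGNATURES = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
]


def guess_image_extension(img_bytes: bytes, default: str = ".bin") -> str:
    """Return a file extension based on the leading bytes of the image."""
    for signature, extension in _SIGNATURES:
        if img_bytes.startswith(signature):
            return extension
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    if img_bytes[:4] == b"RIFF" and img_bytes[8:12] == b"WEBP":
        return ".webp"
    return default
```

You would call it on the result of `base64.b64decode(img_data)` and append the returned extension to the file name in place of the hard-coded `.jpg`.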

2. Extracted Images

Alternatively, you can use the --extract-images flag with the CLI to automatically extract images to separate files. This approach:

- Saves each image as a separate file in the specified directory (or output_dir/images by default)
- Updates the markdown to reference these image files instead of embedding base64 data
- Results in smaller, more manageable markdown files

Example command:

mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images

If you're working with the API directly and want to implement similar functionality, here's how you might do it:

import base64
import os
import re

def extract_images_from_ocr_data(ocr_data, image_dir='images'):
    """Extract images from OCR data and update markdown references."""
    # Create image directory
    os.makedirs(image_dir, exist_ok=True)
    
    # Process each page
    for page in ocr_data.get('pages', []):
        page_index = page.get('index', 0)
        markdown = page.get('markdown', '')
        
        # Extract and save images
        for img_index, image in enumerate(page.get('images', [])):
            img_id = image.get('id', f'unknown-{img_index}')
            img_data = image.get('image_base64', '')
            
            if img_data:
                # Generate filename
                filename = f"{img_id.replace(' ', '_')}.jpg"
                filepath = os.path.join(image_dir, filename)
                
                # Remove data URL prefix if present
                if ',' in img_data:
                    img_data = img_data.split(',', 1)[1]
                
                # Save the image
                with open(filepath, 'wb') as img_file:
                    img_file.write(base64.b64decode(img_data))
                
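                # Update markdown to reference the file. Use "/" so the link
                # is valid Markdown on every OS, and pass the replacement as a
                # function so re.sub does not interpret backslashes in it.
                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
                link = f"{os.path.basename(image_dir)}/{filename}"
                replacement = f"![{img_id}]({link})"
                markdown = re.sub(pattern, lambda _match: replacement, markdown)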
        
        # Update the page's markdown
        page['markdown'] = markdown
    
    return ocr_data
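
The function above builds the file name directly from the image `id`. If the identifiers might ever contain path separators or other awkward characters, you may want to sanitize them first. The following is a hedged sketch: `safe_image_filename` is a hypothetical helper, and the allowed character set is an assumption chosen to be conservative.

```python
import re


def safe_image_filename(img_id: str, extension: str = ".jpg") -> str:
    """Turn an image id into a file name that stays inside the image directory."""
    # Keep only the last path component, whichever separator was used.
    name = img_id.replace("\\", "/").split("/")[-1]
    # Replace anything outside a conservative set of characters.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).strip(".")
    if not name:
        name = "image"
    # Avoid doubling the extension when the id already carries it.
    if name.lower().endswith(extension):
        return name
    return name + extension
```

The result can be passed to `os.path.join(image_dir, ...)` in place of the name derived from `img_id.replace(' ', '_')`.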

Working with Markdown Content

The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications:

import markdown  # third-party Python-Markdown package: pip install markdown

# Convert markdown to HTML
for page in ocr_data.get('pages', []):
    page_content = page.get('markdown', '')
    html_content = markdown.markdown(page_content)
    
    # Now you can use the HTML content in your application
    # For example, save it to an HTML file (UTF-8 so non-ASCII text survives)
    with open(f'page_{page.get("index", 0)}.html', 'w', encoding='utf-8') as f:
        f.write(html_content)

Error Handling

When working with the API, you may encounter various errors. Here are some common error scenarios and how to handle them:

API Key Errors

API key must be provided or set as MISTRAL_API_KEY environment variable

Ensure your API key is correctly set as an environment variable or provided with the --api-key flag.
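
If you call the API from your own code, the same lookup order can be reproduced in a few lines. This is a minimal sketch: `resolve_api_key` is a hypothetical helper, and giving the explicitly passed value priority over the environment variable is an assumption about the desired precedence.

```python
import os
from typing import Optional


def resolve_api_key(cli_value: Optional[str] = None) -> str:
    """Return the explicit key if given, else the MISTRAL_API_KEY variable."""
    key = cli_value or os.environ.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError(
            "API key must be provided or set as MISTRAL_API_KEY environment variable"
        )
    return key
```

Failing early with a clear message avoids sending a request that is certain to be rejected.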

File Size Errors

File is too large (55.00 MB). Maximum allowed size is 52.00 MB

The Mistral API has a file size limit of 52 MB. Split larger files into smaller documents before uploading them.
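
A pre-flight check can catch oversized files before any upload starts. The sketch below uses the 52 MB limit stated in this document; `check_file_size` is a hypothetical helper, and counting 1 MB as 1024 × 1024 bytes is an assumption about how the limit is measured.

```python
import os

MAX_FILE_SIZE_MB = 52  # limit stated in this document


def check_file_size(path: str, max_mb: float = MAX_FILE_SIZE_MB) -> float:
    """Return the file size in MB, or raise ValueError if it exceeds max_mb."""
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f"File is too large ({size_mb:.2f} MB). "
            f"Maximum allowed size is {max_mb:.2f} MB"
        )
    return size_mb
```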

Rate Limiting

API returned error status: 429 - Rate limit exceeded

The API has rate limits. Implement exponential backoff and retry logic in your application:

import time
import random

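def api_request_with_retry(func, max_retries=5, initial_delay=1):
    """Call func(), retrying with exponential backoff on rate-limit errors.

    A rate-limit error is detected by looking for "429" in the exception
    message; any other error, or the final failed attempt, is re-raised.
    """
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "429" in str(e) and retries < max_retries - 1:
                # Exponential backoff with jitter
                delay = initial_delay * (2 ** retries) + random.uniform(0, 1)
                print(f"Rate limited. Retrying in {delay:.2f} seconds...")
                time.sleep(delay)
                retries += 1
            else:
                raise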
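
To see how long the retry loop above can wait, it helps to list the base delays it produces. The helper below is illustrative only (`backoff_schedule` is a hypothetical name) and leaves out the random jitter of up to one second that the loop adds to each delay.

```python
def backoff_schedule(max_retries: int = 5, initial_delay: float = 1) -> list:
    """Base delays (without jitter) used between attempts by the retry loop."""
    # The loop sleeps after every failed attempt except the last one,
    # so there are max_retries - 1 delays, each double the previous.
    return [initial_delay * (2 ** attempt) for attempt in range(max_retries - 1)]
```

With the defaults, the base delays are 1, 2, 4, and 8 seconds, so a request that keeps hitting the rate limit waits between 15 and 19 seconds in total before the final error is raised.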

API Limitations

- Maximum file size: 52 MB
- Supported file formats: PDF, JPG, JPEG, PNG, WEBP, GIF
- Rate limits: Depends on your Mistral AI account tier
- Concurrent requests: Depends on your Mistral AI account tier
- Image extraction: Some complex images or diagrams may not be perfectly extracted
- Language support: Check the Mistral AI documentation for the latest information on supported languages
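
The supported formats listed above can be checked locally before a file is sent. This is a minimal sketch that looks only at the file extension, which is an assumption; `is_supported_file` is a hypothetical helper, and it does not inspect the file contents.

```python
from pathlib import Path

# Extensions taken from the supported file formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif"}


def is_supported_file(path: str) -> bool:
    """Return True if the file extension is in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```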
