# Mistral OCR API Documentation

This document provides detailed information about the Mistral OCR API response format and how to work with it in your applications.

## Table of Contents

- [Mistral OCR API Documentation](#mistral-ocr-api-documentation)
  - [Table of Contents](#table-of-contents)
  - [API Response Format](#api-response-format)
    - [Document Metadata](#document-metadata)
    - [Pages](#pages)
    - [Images](#images)
  - [Working with the API Response](#working-with-the-api-response)
    - [Parsing the JSON Response](#parsing-the-json-response)
    - [Handling Images](#handling-images)
      - [1. Embedded Images](#1-embedded-images)
      - [2. Extracted Images](#2-extracted-images)
    - [Working with Markdown Content](#working-with-markdown-content)
  - [Error Handling](#error-handling)
    - [API Key Errors](#api-key-errors)
    - [File Size Errors](#file-size-errors)
    - [Rate Limiting](#rate-limiting)
  - [API Limitations](#api-limitations)

## API Response Format

The Mistral OCR API returns a JSON response with the following structure:

```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}
```

### Document Metadata

The `metadata` object contains document-level information:

| Field | Type | Description |
|-------|------|-------------|
| `title` | String | The document title, if available |
| `author` | String | The document author, if available |
| `creation_date` | String | The document creation date in ISO format (YYYY-MM-DD), if available |
| `page_count` | Integer | The total number of pages in the document |

Note that some metadata fields may be empty or missing if the information cannot be extracted from the document.
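Because any of these fields may be absent, it can help to normalize the metadata once before using it. The following is a minimal sketch, not part of the API or CLI: the `DocumentMetadata` class and `read_metadata` function are hypothetical names, and falling back to the length of `pages` when `page_count` is missing is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocumentMetadata:
    """Hypothetical container for normalized metadata."""
    title: Optional[str]
    author: Optional[str]
    creation_date: Optional[date]
    page_count: int


def read_metadata(ocr_data: dict) -> DocumentMetadata:
    """Read the metadata object, tolerating empty or missing fields."""
    meta = ocr_data.get("metadata") or {}

    # Treat an empty string the same as a missing value.
    raw_date = meta.get("creation_date") or None
    try:
        created = date.fromisoformat(raw_date) if raw_date else None
    except (TypeError, ValueError):
        # The field was present but not a valid ISO date string.
        created = None

    return DocumentMetadata(
        title=meta.get("title") or None,
        author=meta.get("author") or None,
        creation_date=created,
        # Assumption: count the pages array if page_count is absent.
        page_count=meta.get("page_count") or len(ocr_data.get("pages", [])),
    )
```

Callers can then check `metadata.title is None` rather than repeating `.get()` chains with defaults throughout the application.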

### Pages

The `pages` array contains objects representing each page in the document:

| Field | Type | Description |
|-------|------|-------------|
| `index` | Integer | Zero-based page index |
| `markdown` | String | The extracted text content in Markdown format |
| `images` | Array | An array of image objects found on the page |
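Since each page carries its own `index`, the pages can be reassembled into a single Markdown document. The sketch below assumes you want the pages in index order separated by a horizontal rule; `join_pages` and its `separator` parameter are hypothetical names chosen for illustration.

```python
def join_pages(ocr_data: dict, separator: str = "\n\n---\n\n") -> str:
    """Concatenate the markdown of all pages in page-index order."""
    # Sort defensively in case the pages array is not already ordered.
    pages = sorted(ocr_data.get("pages", []), key=lambda p: p.get("index", 0))
    return separator.join(p.get("markdown", "") for p in pages)
```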

### Images

Each image object in the `images` array has the following structure:

| Field | Type | Description |
|-------|------|-------------|
| `id` | String | A unique identifier for the image |
| `image_base64` | String | Base64-encoded image data (only included if `include_images` is specified) |
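Because `image_base64` is optional, a quick check of how many images actually carry data can save debugging time later. This is a small sketch; `count_images` is a hypothetical helper, not something the API or CLI provides.

```python
from typing import Tuple


def count_images(ocr_data: dict) -> Tuple[int, int]:
    """Return (total images, images that include base64 data)."""
    total = 0
    with_data = 0
    for page in ocr_data.get("pages", []):
        for image in page.get("images", []):
            total += 1
            # A missing or empty image_base64 field counts as "no data".
            if image.get("image_base64"):
                with_data += 1
    return total, with_data
```

If the second number is zero while the first is not, the response was most likely requested without image data.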

## Working with the API Response

### Parsing the JSON Response

Here's an example of how to parse the JSON response in Python:

```python
import json

# Load the JSON response
with open('ocr_results.json', 'r', encoding='utf-8') as f:
    ocr_data = json.load(f)

# Access metadata
title = ocr_data.get('metadata', {}).get('title', 'Untitled Document')
page_count = ocr_data.get('metadata', {}).get('page_count', 0)

# Access page content
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    page_content = page.get('markdown', '')

    print(f"Page {page_index + 1}:")
    print(page_content)
    print("-" * 40)
```

### Handling Images

The Mistral OCR CLI provides two approaches for handling images:

#### 1. Embedded Images

When you use the `--images` flag without `--extract-images`, images are embedded directly in the markdown as base64 data. If the response also includes `image_base64` data in each page's `images` array, you can extract and save the images manually:

```python
import base64
import os

# Create a directory for images
os.makedirs('extracted_images', exist_ok=True)

# Extract images from each page
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)

    for img_index, image in enumerate(page.get('images', [])):
        img_id = image.get('id', f'unknown-{img_index}')
        img_data = image.get('image_base64', '')

        if img_data:
            # Remove data URL prefix if present
            if ',' in img_data:
                img_data = img_data.split(',', 1)[1]

            # Decode and save the image
            img_bytes = base64.b64decode(img_data)
            with open(f'extracted_images/page{page_index}_{img_id}.jpg', 'wb') as img_file:
                img_file.write(img_bytes)
```
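The example above saves every image with a `.jpg` extension. When the data arrives as a data URL, the prefix names the actual MIME type, so the extension can be chosen from it. The sketch below assumes a prefix of the form `data:image/png;base64,`; `decode_image` and the `MIME_EXTENSIONS` mapping are hypothetical and should be extended to match the formats you actually receive.

```python
import base64

# Hypothetical mapping from MIME type to file extension.
MIME_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/webp": ".webp",
    "image/gif": ".gif",
}


def decode_image(image_base64: str, default_ext: str = ".jpg"):
    """Return (image_bytes, extension) for raw base64 or a data URL."""
    ext = default_ext
    payload = image_base64
    if image_base64.startswith("data:") and "," in image_base64:
        header, payload = image_base64.split(",", 1)
        # header looks like "data:image/png;base64"
        mime = header[len("data:"):].split(";", 1)[0]
        ext = MIME_EXTENSIONS.get(mime, default_ext)
    return base64.b64decode(payload), ext
```

Unknown MIME types fall back to `default_ext`, which keeps the behavior of the original example for plain base64 strings.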

#### 2. Extracted Images

Alternatively, you can use the `--extract-images` flag with the CLI to automatically extract images to separate files. This approach:

- Saves each image as a separate file in the specified directory (or `output_dir/images` by default)
- Updates the markdown to reference these image files instead of embedding base64 data
- Results in smaller, more manageable markdown files

Example command:

```bash
mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images
```

If you're working with the API directly and want to implement similar functionality, here's how you might do it:

```python
import base64
import os
import re

def extract_images_from_ocr_data(ocr_data, image_dir='images'):
    """Extract images from OCR data and update markdown references."""
    # Create image directory
    os.makedirs(image_dir, exist_ok=True)

    # Process each page
    for page in ocr_data.get('pages', []):
        page_index = page.get('index', 0)
        markdown = page.get('markdown', '')

        # Extract and save images
        for img_index, image in enumerate(page.get('images', [])):
            img_id = image.get('id', f'unknown-{img_index}')
            img_data = image.get('image_base64', '')

            if img_data:
                # Generate filename
                filename = f"{img_id.replace(' ', '_')}.jpg"
                filepath = os.path.join(image_dir, filename)

                # Remove data URL prefix if present
                if ',' in img_data:
                    img_data = img_data.split(',', 1)[1]

                # Save the image
                with open(filepath, 'wb') as img_file:
                    img_file.write(base64.b64decode(img_data))

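                # Update markdown to reference the file. Forward slashes keep
                # the link portable, and the lambda stops re.sub from
                # interpreting backslashes or group references in the new text.
                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
                rel_path = f"{os.path.basename(image_dir)}/{filename}"
                markdown = re.sub(pattern, lambda _m: f"![{img_id}]({rel_path})", markdown)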

        # Update the page's markdown
        page['markdown'] = markdown

    return ocr_data
```

### Working with Markdown Content

The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications:

```python
import markdown

# Convert markdown to HTML
for page in ocr_data.get('pages', []):
    page_content = page.get('markdown', '')
    html_content = markdown.markdown(page_content)

    # Now you can use the HTML content in your application
    # For example, save it to an HTML file
    with open(f'page_{page.get("index", 0)}.html', 'w', encoding='utf-8') as f:
        f.write(html_content)
```

## Error Handling

When working with the API, you may encounter various errors. Here are some common error scenarios and how to handle them:

### API Key Errors

```
API key must be provided or set as MISTRAL_API_KEY environment variable
```

Ensure your API key is correctly set as an environment variable or provided with the `--api-key` flag.
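If you call the API from your own code, you can perform the same check up front. This is a minimal sketch: `resolve_api_key` is a hypothetical helper, and giving an explicitly passed key precedence over the environment variable is an assumption rather than documented CLI behavior.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return the API key from an argument or MISTRAL_API_KEY, else raise."""
    # env is injectable so the function can be tested without touching
    # the real process environment.
    env = os.environ if env is None else env
    # Assumption: an explicitly passed key wins over the environment.
    key = explicit_key or env.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError(
            "API key must be provided or set as MISTRAL_API_KEY environment variable"
        )
    return key
```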

### File Size Errors

```
File is too large (55.00 MB). Maximum allowed size is 52.00 MB
```

The Mistral API rejects files larger than 52 MB. Split larger files into smaller documents before submitting them.
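Checking the size locally avoids uploading a file that will be rejected. The sketch below assumes 1 MB means 1024 × 1024 bytes, which this document does not specify; `size_error` and `check_file_size` are hypothetical helpers.

```python
import os

MAX_FILE_SIZE_MB = 52  # limit stated in this document


def size_error(size_bytes, max_mb=MAX_FILE_SIZE_MB):
    """Return an error message if size_bytes exceeds the limit, else None."""
    # Assumption: 1 MB = 1024 * 1024 bytes.
    size_mb = size_bytes / (1024 * 1024)
    if size_mb > max_mb:
        return (
            f"File is too large ({size_mb:.2f} MB). "
            f"Maximum allowed size is {max_mb:.2f} MB"
        )
    return None


def check_file_size(path, max_mb=MAX_FILE_SIZE_MB):
    """Raise ValueError if the file at path is over the limit."""
    message = size_error(os.path.getsize(path), max_mb)
    if message:
        raise ValueError(message)
```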

### Rate Limiting

```
API returned error status: 429 - Rate limit exceeded
```

The API has rate limits. Implement exponential backoff and retry logic in your application:

```python
import time
import random

def api_request_with_retry(func, max_retries=5, initial_delay=1):
    """Call func(), retrying with exponential backoff when it raises a 429 error."""
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "429" in str(e) and retries < max_retries - 1:
                # Exponential backoff with jitter
                delay = initial_delay * (2 ** retries) + random.uniform(0, 1)
                print(f"Rate limited. Retrying in {delay:.2f} seconds...")
                time.sleep(delay)
                retries += 1
            else:
                raise
```
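To see how long the function above can wait with its default arguments, the base delays can be computed directly. `backoff_delays` is a hypothetical helper that mirrors the delay formula without the random jitter.

```python
def backoff_delays(max_retries=5, initial_delay=1):
    """Base delays (without jitter) between attempts, in seconds."""
    # The final attempt re-raises instead of sleeping, so there is one
    # fewer delay than there are attempts.
    return [initial_delay * (2 ** n) for n in range(max_retries - 1)]


delays = backoff_delays()
print(delays)       # [1, 2, 4, 8]
print(sum(delays))  # 15
```

With up to one second of jitter added to each of the four delays, the defaults wait at most 19 seconds in total before the last attempt.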

## API Limitations

- **Maximum file size**: 52MB
- **Supported file formats**: PDF, JPG, JPEG, PNG, WEBP, GIF
- **Rate limits**: Depends on your Mistral AI account tier
- **Concurrent requests**: Depends on your Mistral AI account tier
- **Image extraction**: Some complex images or diagrams may not be perfectly extracted
- **Language support**: Check the Mistral AI documentation for the latest information on supported languages
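The format list above can be checked before a file is submitted. This sketch judges a file by its extension only, which is an assumption: it does not inspect the file contents. `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif"}


def is_supported(path) -> bool:
    """Return True if the file extension is in the supported list."""
    # Compare case-insensitively so "SCAN.PDF" is accepted.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```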