schihei 5e891ef461 Add comprehensive documentation and code comments

This commit adds extensive documentation to the Mistral OCR CLI project:

- Add API.md with detailed API response format documentation
- Add CHANGELOG.md to track version changes
- Add CONTRIBUTING.md with guidelines for contributors
- Enhance README.md with more detailed usage examples and troubleshooting
- Add proper docstrings to all Python modules and functions
- Update requirements.txt with development dependencies
- Improve setup.py with better metadata

These changes make the project more accessible to users and contributors.

2025-04-24 21:11:41 +02:00


Mistral OCR API Documentation

This document provides detailed information about the Mistral OCR API response format and how to work with it in your applications.


API Response Format

The Mistral OCR API returns a JSON response with the following structure:

{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}

Document Metadata

The metadata object contains document-level information:

Field	Type	Description
`title`	String	The document title, if available
`author`	String	The document author, if available
`creation_date`	String	The document creation date in ISO 8601 format (YYYY-MM-DD), if available
`page_count`	Integer	The total number of pages in the document

Note that some metadata fields may be empty or missing if the information cannot be extracted from the document.
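
Because metadata fields can be empty or missing, it may help to normalize them once before use. The following is a minimal sketch, assuming the response has already been loaded into a dictionary. `read_metadata` is a hypothetical helper (not part of the Mistral OCR CLI), and treating an unparseable `creation_date` as missing is an assumption made for illustration.

```python
from datetime import date


def read_metadata(ocr_data):
    """Return metadata with safe defaults (hypothetical helper)."""
    meta = ocr_data.get("metadata") or {}

    # creation_date is documented as YYYY-MM-DD; fall back to None otherwise.
    raw_date = meta.get("creation_date")
    try:
        created = date.fromisoformat(raw_date) if raw_date else None
    except (TypeError, ValueError):
        created = None

    return {
        "title": meta.get("title") or "Untitled Document",
        "author": meta.get("author") or None,
        "creation_date": created,
        "page_count": meta.get("page_count") or 0,
    }
```

Returning a plain dictionary with every key present lets calling code skip repeated `.get()` chains.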

Pages

The pages array contains objects representing each page in the document:

Field	Type	Description
`index`	Integer	Zero-based page index
`markdown`	String	The extracted text content in Markdown format
`images`	Array	An array of image objects found on the page
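
Since each page carries its own `index` and `markdown`, the whole document can be rebuilt by ordering pages and joining their text. This is a small sketch under that assumption; `join_pages` is a hypothetical helper name, and the blank-line separator is an arbitrary choice.

```python
def join_pages(ocr_data, separator="\n\n"):
    """Concatenate page Markdown in page order (hypothetical helper)."""
    pages = ocr_data.get("pages", [])
    # Sort defensively by the zero-based index; a page without one is
    # treated as index 0.
    ordered = sorted(pages, key=lambda page: page.get("index", 0))
    return separator.join(page.get("markdown", "") for page in ordered)
```

For example, `join_pages(ocr_data)` yields a single Markdown string that can be written to one `.md` file.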

Images

Each image object in the images array has the following structure:

Field	Type	Description
`id`	String	A unique identifier for the image
`image_base64`	String	Base64-encoded image data (only included if `include_images` is specified)

Working with the API Response

Parsing the JSON Response

Here's an example of how to parse the JSON response in Python:

import json

# Load the JSON response (JSON files are normally UTF-8 encoded)
with open('ocr_results.json', 'r', encoding='utf-8') as f:
    ocr_data = json.load(f)

# Access metadata
title = ocr_data.get('metadata', {}).get('title', 'Untitled Document')
page_count = ocr_data.get('metadata', {}).get('page_count', 0)

# Access page content
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    page_content = page.get('markdown', '')
    
    print(f"Page {page_index + 1}:")
    print(page_content)
    print("-" * 40)

Handling Images

If you've included images in the response (using the --include-images flag), you can extract and save them:

import base64
import os

# Assumes ocr_data holds the parsed response from the previous example

# Create a directory for images
os.makedirs('extracted_images', exist_ok=True)

# Extract images from each page
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)

    for img_index, image in enumerate(page.get('images', [])):
        img_id = image.get('id', f'unknown-{img_index}')
        img_data = image.get('image_base64', '')

        if img_data:
            # Remove data URL prefix if present
            if ',' in img_data:
                img_data = img_data.split(',', 1)[1]

            # Decode and save the image; the .jpg extension is assumed here
            # and may not match the actual image format
            img_bytes = base64.b64decode(img_data)
            with open(f'extracted_images/page{page_index}_{img_id}.jpg', 'wb') as img_file:
                img_file.write(img_bytes)
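
The snippet above always writes a `.jpg` file. If `image_base64` arrives as a data URL (the case the prefix-stripping step allows for), the MIME type in that prefix can be used to pick a file extension instead. The sketch below illustrates the idea; `split_data_url` and the `EXTENSIONS` mapping are hypothetical and not part of the API.

```python
import base64

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/webp": ".webp",
    "image/gif": ".gif",
}


def split_data_url(img_data, default_ext=".jpg"):
    """Return (extension, raw_bytes) for a base64 string or data URL."""
    ext = default_ext
    if img_data.startswith("data:") and "," in img_data:
        header, img_data = img_data.split(",", 1)
        # header looks like "data:image/png;base64"
        mime = header[len("data:"):].split(";", 1)[0]
        ext = EXTENSIONS.get(mime, default_ext)
    return ext, base64.b64decode(img_data)
```

A plain base64 string without a prefix falls back to `default_ext`, so the helper can stand in for the decoding step in either case.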

Working with Markdown Content

The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications:

import markdown  # third-party package: pip install markdown

# Convert each page's Markdown to HTML
for page in ocr_data.get('pages', []):
    page_content = page.get('markdown', '')
    html_content = markdown.markdown(page_content)

    # Now you can use the HTML content in your application
    # For example, save it to an HTML file
    with open(f'page_{page.get("index", 0)}.html', 'w', encoding='utf-8') as f:
        f.write(html_content)

Error Handling

When working with the API, you may encounter various errors. Here are some common error scenarios and how to handle them:

API Key Errors

API key must be provided or set as MISTRAL_API_KEY environment variable

Ensure your API key is correctly set as an environment variable or provided with the --api-key flag.
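
One way to apply this lookup in your own code is sketched below. `resolve_api_key` is a hypothetical helper; giving an explicit value precedence over the environment variable, and raising `ValueError`, are assumptions rather than documented behavior of the CLI.

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Return the API key from an explicit value or MISTRAL_API_KEY."""
    env = os.environ if env is None else env
    # Assumption: an explicitly passed key wins over the environment.
    api_key = cli_value or env.get("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError(
            "API key must be provided or set as MISTRAL_API_KEY environment variable"
        )
    return api_key
```

Accepting `env` as a parameter keeps the function easy to test without touching the real environment.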

File Size Errors

File is too large (55.00 MB). Maximum allowed size is 52.00 MB

The Mistral API has a file size limit of 52 MB. For larger files, consider splitting them into smaller documents.
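
Checking the size locally before uploading avoids a wasted request. The following is a rough sketch: `check_file_size` is a hypothetical helper, and counting 1 MB as 1024 * 1024 bytes is an assumption, since the exact unit is not stated here.

```python
import os

# Assumption: 1 MB is counted as 1024 * 1024 bytes.
MAX_FILE_SIZE_MB = 52


def check_file_size(path, max_mb=MAX_FILE_SIZE_MB):
    """Raise ValueError if the file at path exceeds max_mb megabytes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f"File is too large ({size_mb:.2f} MB). "
            f"Maximum allowed size is {max_mb:.2f} MB"
        )
    return size_mb
```

Calling `check_file_size("document.pdf")` before submitting a file returns its size in megabytes or raises with a message in the same style as the error shown above.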

Rate Limiting

API returned error status: 429 - Rate limit exceeded

The API enforces rate limits. When a request fails with status 429, retry it with exponential backoff:

import random
import time

def api_request_with_retry(func, max_retries=5, initial_delay=1):
    """Call func(), retrying with exponential backoff when it is rate limited."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Matching "429" in the message is a heuristic; prefer checking the
            # status code on the exception if your HTTP client exposes one.
            if "429" not in str(e) or attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = initial_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

API Limitations

- Maximum file size: 52 MB
- Supported file formats: PDF, JPG, JPEG, PNG, WEBP, GIF
- Rate limits: Depends on your Mistral AI account tier
- Concurrent requests: Depends on your Mistral AI account tier
- Image extraction: Some complex images or diagrams may not be perfectly extracted
- Language support: Check the Mistral AI documentation for the latest information on supported languages
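
The supported formats listed above can be checked before a file is submitted. This is a minimal sketch that inspects only the file extension, not the file contents; `is_supported_format` and `SUPPORTED_EXTENSIONS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif"}


def is_supported_format(path):
    """Return True if the file extension is in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `scan.PDF` and `scan.pdf` are treated the same way.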
