This commit adds extensive documentation to the Mistral OCR CLI project:
- Add API.md with detailed API response format documentation
- Add CHANGELOG.md to track version changes
- Add CONTRIBUTING.md with guidelines for contributors
- Enhance README.md with more detailed usage examples and troubleshooting
- Add proper docstrings to all Python modules and functions
- Update requirements.txt with development dependencies
- Improve setup.py with better metadata

These changes make the project more accessible to users and contributors.
Mistral OCR API Documentation
This document provides detailed information about the Mistral OCR API response format and how to work with it in your applications.
API Response Format
The Mistral OCR API returns a JSON response with the following structure:
{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}
Document Metadata
The metadata object contains document-level information:
| Field | Type | Description |
|---|---|---|
| title | String | The document title, if available |
| author | String | The document author, if available |
| creation_date | String | The document creation date in ISO format (YYYY-MM-DD), if available |
| page_count | Integer | The total number of pages in the document |
Note that some metadata fields may be empty or missing if the information cannot be extracted from the document.
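Because any metadata field can be absent or empty, it can help to normalize the object once before using it. The following is a minimal sketch of that idea; the helper name `read_metadata` and the fallback values ("Untitled Document", "Unknown", 0) are illustrative assumptions, not part of the API.

```python
from datetime import date


def read_metadata(ocr_data):
    """Return metadata with fallbacks for missing or empty fields.

    Hypothetical helper: the name and the fallback values are assumptions.
    """
    meta = ocr_data.get("metadata") or {}

    # creation_date is documented as an ISO date string (YYYY-MM-DD).
    # Keep None when it is missing or cannot be parsed.
    created = None
    raw_date = meta.get("creation_date")
    if raw_date:
        try:
            created = date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            created = None

    return {
        "title": meta.get("title") or "Untitled Document",
        "author": meta.get("author") or "Unknown",
        "creation_date": created,
        "page_count": meta.get("page_count") or 0,
    }


# Example: an empty title and a missing author fall back to defaults.
sample = {"metadata": {"title": "", "creation_date": "2023-01-01", "page_count": 5}}
info = read_metadata(sample)
print(info["title"], info["author"], info["creation_date"], info["page_count"])
```

Using `or` rather than a `.get()` default means empty strings are treated the same way as missing keys.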
Pages
The pages array contains objects representing each page in the document:
| Field | Type | Description |
|---|---|---|
| index | Integer | Zero-based page index |
| markdown | String | The extracted text content in Markdown format |
| images | Array | An array of image objects found on the page |
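Since each page carries its own zero-based index and Markdown text, the pages can be stitched back into a single document. The sketch below shows one way to do this; `combine_pages` and the horizontal-rule separator are hypothetical choices made for illustration.

```python
def combine_pages(ocr_data, separator="\n\n---\n\n"):
    """Join the Markdown of all pages in page order.

    Hypothetical helper; the separator is an arbitrary choice.
    """
    # Sort by the documented zero-based index in case pages arrive unordered.
    pages = sorted(ocr_data.get("pages", []), key=lambda p: p.get("index", 0))
    return separator.join(p.get("markdown", "") for p in pages)


# Example with pages deliberately listed out of order.
sample = {
    "pages": [
        {"index": 1, "markdown": "## Page 2 Content", "images": []},
        {"index": 0, "markdown": "# Page Content", "images": []},
    ]
}
document = combine_pages(sample)
print(document)
```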
Images
Each image object in the images array has the following structure:
| Field | Type | Description |
|---|---|---|
| id | String | A unique identifier for the image |
| image_base64 | String | Base64-encoded image data (only included if include_images is specified) |
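Because `image_base64` is only present when images were requested, code that consumes the response may want to check which images actually carry data. A small sketch, assuming the structure described above; `list_images` is a hypothetical helper name.

```python
def list_images(ocr_data):
    """Return (page_index, image_id, has_data) for every image in the response.

    Hypothetical helper for illustration only.
    """
    found = []
    for page in ocr_data.get("pages", []):
        for image in page.get("images", []):
            has_data = bool(image.get("image_base64"))
            found.append((page.get("index", 0), image.get("id"), has_data))
    return found


# Example: the second image has an id but no embedded data.
sample = {
    "pages": [
        {"index": 0, "images": [{"id": "image-1", "image_base64": "aGVsbG8="}]},
        {"index": 1, "images": [{"id": "image-2"}]},
    ]
}
inventory = list_images(sample)
print(inventory)
```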
Working with the API Response
Parsing the JSON Response
Here's an example of how to parse the JSON response in Python:
import json

# Load the JSON response
with open('ocr_results.json', 'r', encoding='utf-8') as f:
    ocr_data = json.load(f)

# Access metadata
title = ocr_data.get('metadata', {}).get('title', 'Untitled Document')
page_count = ocr_data.get('metadata', {}).get('page_count', 0)

# Access page content
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    page_content = page.get('markdown', '')
    print(f"Page {page_index + 1}:")
    print(page_content)
    print("-" * 40)
Handling Images
If you've included images in the response (using the --include-images flag), you can extract and save them:
import base64
import os

# Create a directory for images
os.makedirs('extracted_images', exist_ok=True)

# Extract images from each page (ocr_data is the parsed response loaded earlier)
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    for img_index, image in enumerate(page.get('images', [])):
        img_id = image.get('id', f'unknown-{img_index}')
        img_data = image.get('image_base64', '')
        if img_data:
            # Remove data URL prefix if present
            if ',' in img_data:
                img_data = img_data.split(',', 1)[1]
            # Decode and save the image
            # (the .jpg extension is assumed; adjust it if your images use another format)
            img_bytes = base64.b64decode(img_data)
            with open(f'extracted_images/page{page_index}_{img_id}.jpg', 'wb') as img_file:
                img_file.write(img_bytes)
Working with Markdown Content
The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications:
import markdown  # third-party package: pip install markdown

# Convert markdown to HTML
for page in ocr_data.get('pages', []):
    page_content = page.get('markdown', '')
    html_content = markdown.markdown(page_content)
    # Now you can use the HTML content in your application
    # For example, save it to an HTML file
    with open(f'page_{page.get("index", 0)}.html', 'w', encoding='utf-8') as f:
        f.write(html_content)
Error Handling
When working with the API, you may encounter various errors. Here are some common error scenarios and how to handle them:
API Key Errors
API key must be provided or set as MISTRAL_API_KEY environment variable
Ensure your API key is correctly set as an environment variable or provided with the --api-key flag.
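The lookup described above can be sketched as follows. This is an illustration, not the CLI's actual code: the function name `resolve_api_key`, the injectable `env` mapping, and the rule that an explicitly passed key wins over the environment variable are all assumptions.

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Return the API key from an explicit value or MISTRAL_API_KEY.

    Hypothetical helper; giving the explicit value precedence is an assumption.
    """
    env = os.environ if env is None else env
    api_key = cli_value or env.get("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError(
            "API key must be provided or set as MISTRAL_API_KEY environment variable"
        )
    return api_key


# Example: an explicit key is used even when the variable is set.
chosen = resolve_api_key("key-from-flag", env={"MISTRAL_API_KEY": "key-from-env"})
print(chosen)
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.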
File Size Errors
File is too large (55.00 MB). Maximum allowed size is 52.00 MB
The Mistral API has a file size limit of 52 MB. For larger files, consider splitting them into smaller documents.
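Checking the size locally before uploading avoids a wasted request. The sketch below mirrors the error message shown above; the helper names and the choice of 1 MB = 1024 × 1024 bytes are assumptions, since this document does not say how the limit is measured.

```python
import os

MAX_FILE_SIZE_MB = 52          # limit stated in this document
BYTES_PER_MB = 1024 * 1024     # assumption: binary megabytes


def size_error(size_bytes, max_mb=MAX_FILE_SIZE_MB):
    """Return an error message if size_bytes exceeds the limit, else None."""
    size_mb = size_bytes / BYTES_PER_MB
    if size_mb > max_mb:
        return (
            f"File is too large ({size_mb:.2f} MB). "
            f"Maximum allowed size is {max_mb:.2f} MB"
        )
    return None


def check_file(path):
    """Raise ValueError if the file at path is over the limit (hypothetical helper)."""
    message = size_error(os.path.getsize(path))
    if message:
        raise ValueError(message)


# Example: a 55 MB file is rejected, a file exactly at the limit is accepted.
too_big = size_error(55 * BYTES_PER_MB)
at_limit = size_error(52 * BYTES_PER_MB)
print(too_big)
```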
Rate Limiting
API returned error status: 429 - Rate limit exceeded
The API has rate limits. Implement exponential backoff and retry logic in your application:
import time
import random

def api_request_with_retry(func, max_retries=5, initial_delay=1):
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "429" in str(e) and retries < max_retries - 1:
                # Exponential backoff with jitter
                delay = initial_delay * (2 ** retries) + random.uniform(0, 1)
                print(f"Rate limited. Retrying in {delay:.2f} seconds...")
                time.sleep(delay)
                retries += 1
            else:
                raise
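Matching the string "429" in an exception message can be fragile, so a variant that catches a dedicated exception type may be easier to reason about and test. The sketch below is one such variant under stated assumptions: `RateLimitError` is a hypothetical exception your client code would raise on an HTTP 429, and the injectable `sleep` parameter exists only so the example runs without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def retry_on_rate_limit(func, max_retries=5, initial_delay=1.0, sleep=time.sleep):
    """Call func, retrying with exponential backoff and jitter on RateLimitError."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            # Give up after the final attempt.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with up to one second of jitter.
            delay = initial_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)


# Example: a simulated request that is rate limited twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429 - Rate limit exceeded")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_on_rate_limit(flaky_request, sleep=delays.append)
print(result, calls["count"], len(delays))
```

Errors other than `RateLimitError` propagate immediately, which keeps genuine failures from being retried.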
API Limitations
- Maximum file size: 52 MB
- Supported file formats: PDF, JPG, JPEG, PNG, WEBP, GIF
- Rate limits: Depends on your Mistral AI account tier
- Concurrent requests: Depends on your Mistral AI account tier
- Image extraction: Some complex images or diagrams may not be perfectly extracted
- Language support: Check the Mistral AI documentation for the latest information on supported languages
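The format restriction in the list above can be checked before a file is submitted. This is a minimal sketch that judges a file by its extension only, which is an assumption; it does not inspect the file contents, and `is_supported` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif"}


def is_supported(path):
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


# Example: the check is case-insensitive and rejects unknown extensions.
for name in ["scan.PDF", "photo.jpeg", "notes.docx"]:
    print(name, is_supported(name))
```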