# Mistral OCR API Documentation

This document provides detailed information about the Mistral OCR API response format and how to work with it in your applications.

## Table of Contents

- [Mistral OCR API Documentation](#mistral-ocr-api-documentation)
  - [Table of Contents](#table-of-contents)
  - [API Response Format](#api-response-format)
    - [Document Metadata](#document-metadata)
    - [Pages](#pages)
    - [Images](#images)
  - [Working with the API Response](#working-with-the-api-response)
    - [Parsing the JSON Response](#parsing-the-json-response)
    - [Handling Images](#handling-images)
      - [1. Embedded Images](#1-embedded-images)
      - [2. Extracted Images](#2-extracted-images)
    - [Working with Markdown Content](#working-with-markdown-content)
  - [Error Handling](#error-handling)
    - [API Key Errors](#api-key-errors)
    - [File Size Errors](#file-size-errors)
    - [Rate Limiting](#rate-limiting)
  - [API Limitations](#api-limitations)

## API Response Format

The Mistral OCR API returns a JSON response with the following structure:

```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}
```

### Document Metadata

The `metadata` object contains document-level information:

| Field | Type | Description |
|-------|------|-------------|
| `title` | String | The document title, if available |
| `author` | String | The document author, if available |
| `creation_date` | String | The document creation date in ISO format (YYYY-MM-DD), if available |
| `page_count` | Integer | The total number of pages in the document |

Note that some metadata fields may be empty or missing if the information cannot be extracted from the document.
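
Because any of these fields can be absent, it can help to normalize the metadata once before using it. The following is a minimal sketch, assuming the field names shown in the table above. `read_metadata` is a hypothetical helper, not part of any Mistral library, and it treats an unparseable `creation_date` as missing.

```python
from datetime import date


def read_metadata(ocr_data):
    """Return metadata with safe defaults for missing or empty fields."""
    meta = ocr_data.get('metadata') or {}

    # Parse the ISO date defensively: the field may be absent, empty, or malformed
    created = None
    raw_date = meta.get('creation_date')
    if raw_date:
        try:
            created = date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            created = None

    return {
        'title': meta.get('title') or 'Untitled Document',
        'author': meta.get('author') or None,
        'creation_date': created,
        'page_count': meta.get('page_count') or 0,
    }
```

Callers can then rely on every key being present and check `creation_date is None` rather than handling missing keys at each use site.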

### Pages

The `pages` array contains objects representing each page in the document:

| Field | Type | Description |
|-------|------|-------------|
| `index` | Integer | Zero-based page index |
| `markdown` | String | The extracted text content in Markdown format |
| `images` | Array | An array of image objects found on the page |

### Images

Each image object in the `images` array has the following structure:

| Field | Type | Description |
|-------|------|-------------|
| `id` | String | A unique identifier for the image |
| `image_base64` | String | Base64-encoded image data (only included if `include_images` is specified) |
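
Since `image_base64` is optional, code that walks the response should not assume it is present. As a rough sketch, assuming the structure described in the tables above, the hypothetical helper `iter_images` (not part of any Mistral library) lists every image along with whether its data was included:

```python
def iter_images(ocr_data):
    """Yield (page_index, image_id, has_data) for every image in the response."""
    for page in ocr_data.get('pages', []):
        for image in page.get('images', []):
            # has_data is False when image_base64 is missing or empty
            yield page.get('index', 0), image.get('id'), bool(image.get('image_base64'))


sample = {
    'pages': [
        {'index': 0, 'images': [{'id': 'image-1', 'image_base64': 'abc'}]},
        {'index': 1, 'images': [{'id': 'image-2'}]},
    ]
}
for page_index, image_id, has_data in iter_images(sample):
    print(page_index, image_id, has_data)
```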

## Working with the API Response

### Parsing the JSON Response

Here's an example of how to parse the JSON response in Python:

```python
import json

# Load the JSON response
with open('ocr_results.json', 'r', encoding='utf-8') as f:
    ocr_data = json.load(f)

# Access metadata
title = ocr_data.get('metadata', {}).get('title', 'Untitled Document')
page_count = ocr_data.get('metadata', {}).get('page_count', 0)

# Access page content
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)
    page_content = page.get('markdown', '')

    # The index is zero-based, so add 1 for display
    print(f"Page {page_index + 1}:")
    print(page_content)
    print("-" * 40)
```

### Handling Images

The Mistral OCR CLI provides two approaches for handling images:

#### 1. Embedded Images

When using the `--images` flag without `--extract-images`, images are embedded directly in the markdown as base64 data. If you've included images in the response (using the `--include-images` flag), you can extract and save them manually:

```python
import base64
import os

# Create a directory for images
os.makedirs('extracted_images', exist_ok=True)

# Extract images from each page
for page in ocr_data.get('pages', []):
    page_index = page.get('index', 0)

    for img_index, image in enumerate(page.get('images', [])):
        img_id = image.get('id', f'unknown-{img_index}')
        img_data = image.get('image_base64', '')

        if img_data:
            # Remove data URL prefix if present
            if ',' in img_data:
                img_data = img_data.split(',', 1)[1]

            # Decode and save the image
            img_bytes = base64.b64decode(img_data)
            with open(f'extracted_images/page{page_index}_{img_id}.jpg', 'wb') as img_file:
                img_file.write(img_bytes)
```
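
The example above always writes files with a `.jpg` extension. This document does not state which image format the API returns, so one option is to inspect the decoded bytes before choosing an extension. The sketch below checks the standard file signatures for PNG, JPEG, GIF, and WebP. `guess_image_extension` is a hypothetical helper name, and the fallback to `.jpg` is an assumption that mirrors the example above.

```python
def guess_image_extension(img_bytes, default='.jpg'):
    """Guess a file extension from the leading magic bytes of an image."""
    if img_bytes.startswith(b'\x89PNG\r\n\x1a\n'):
        return '.png'
    if img_bytes.startswith(b'\xff\xd8\xff'):
        return '.jpg'
    if img_bytes[:6] in (b'GIF87a', b'GIF89a'):
        return '.gif'
    # WebP files are RIFF containers with "WEBP" at byte offset 8
    if img_bytes[:4] == b'RIFF' and img_bytes[8:12] == b'WEBP':
        return '.webp'
    return default
```

In the loop above, the result could replace the hard-coded suffix, for example by building the filename from `img_id` plus `guess_image_extension(img_bytes)`.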

#### 2. Extracted Images

Alternatively, you can use the `--extract-images` flag with the CLI to automatically extract images to separate files. This approach:

- Saves each image as a separate file in the specified directory (or `output_dir/images` by default)
- Updates the markdown to reference these image files instead of embedding base64 data
- Results in smaller, more manageable markdown files

Example command:

```bash
mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images
```

If you're working with the API directly and want to implement similar functionality, here's how you might do it:

```python
import base64
import os
import re

def extract_images_from_ocr_data(ocr_data, image_dir='images'):
    """Extract images from OCR data and update markdown references."""
    # Create image directory
    os.makedirs(image_dir, exist_ok=True)

    # Process each page
    for page in ocr_data.get('pages', []):
        page_index = page.get('index', 0)
        markdown = page.get('markdown', '')

        # Extract and save images
        for img_index, image in enumerate(page.get('images', [])):
            img_id = image.get('id', f'unknown-{img_index}')
            img_data = image.get('image_base64', '')

            if img_data:
                # Generate filename
                filename = f"{img_id.replace(' ', '_')}.jpg"
                filepath = os.path.join(image_dir, filename)

                # Remove data URL prefix if present
                if ',' in img_data:
                    img_data = img_data.split(',', 1)[1]

                # Save the image
                with open(filepath, 'wb') as img_file:
                    img_file.write(base64.b64decode(img_data))

                # Update markdown to reference the file
                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
                replacement = f"![{img_id}]({os.path.join(image_dir, filename)})"
                # Use a function so backslashes in the path are not treated as escapes
                markdown = re.sub(pattern, lambda _match: replacement, markdown)

        # Update the page's markdown
        page['markdown'] = markdown

    return ocr_data
```

### Working with Markdown Content

The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications:

```python
import markdown  # third-party package: pip install markdown

# Convert markdown to HTML
for page in ocr_data.get('pages', []):
    page_content = page.get('markdown', '')
    html_content = markdown.markdown(page_content)

    # Now you can use the HTML content in your application
    # For example, save it to an HTML file
    with open(f'page_{page.get("index", 0)}.html', 'w', encoding='utf-8') as f:
        f.write(html_content)
```
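
If a single document is more convenient than one file per page, the per-page markdown can be concatenated first. This is a minimal sketch using only the standard library. `join_pages` is a hypothetical helper, and the horizontal-rule separator is an arbitrary choice, not something the API prescribes.

```python
def join_pages(ocr_data, separator='\n\n---\n\n'):
    """Concatenate page markdown in index order into one string."""
    # Sort by index in case the pages are not already in document order
    pages = sorted(ocr_data.get('pages', []), key=lambda p: p.get('index', 0))
    return separator.join(p.get('markdown', '') for p in pages)
```

The combined string can then be written to a `.md` file or passed to a Markdown converter in one call.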

## Error Handling

When working with the API, you may encounter various errors. Here are some common error scenarios and how to handle them:

### API Key Errors

```
API key must be provided or set as MISTRAL_API_KEY environment variable
```

Ensure your API key is correctly set as an environment variable or provided with the `--api-key` flag.
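
When calling the API from your own code, the same check can be made up front so the failure is reported before any request is sent. The sketch below is an illustration only. `resolve_api_key` is a hypothetical helper, not the CLI's actual implementation, and it assumes the `MISTRAL_API_KEY` variable named in the error message above.

```python
import os


def resolve_api_key(explicit_key=None, env_var='MISTRAL_API_KEY'):
    """Return the API key from an explicit argument or the environment."""
    # An explicitly passed key wins; otherwise fall back to the environment
    key = explicit_key or os.environ.get(env_var, '').strip()
    if not key:
        raise ValueError(
            f'API key must be provided or set as {env_var} environment variable'
        )
    return key
```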

### File Size Errors

```
File is too large (55.00 MB). Maximum allowed size is 52.00 MB
```

The Mistral API has a file size limit of 52 MB. For larger files, consider splitting them into smaller documents.
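
Checking the size locally avoids uploading a file that will be rejected. The following is a rough sketch: `check_file_size` is a hypothetical helper, and it assumes that "MB" in the limit above means 1024 × 1024 bytes, which this document does not specify.

```python
import os

# Limit stated above; assumed here to mean 52 * 1024 * 1024 bytes
MAX_FILE_SIZE_MB = 52


def check_file_size(path, max_mb=MAX_FILE_SIZE_MB):
    """Raise ValueError if the file exceeds the limit; otherwise return its size in MB."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f'File is too large ({size_mb:.2f} MB). '
            f'Maximum allowed size is {max_mb:.2f} MB'
        )
    return size_mb
```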

### Rate Limiting

```
API returned error status: 429 - Rate limit exceeded
```

The API enforces rate limits and returns status 429 when they are exceeded. Implement exponential backoff and retry logic in your application:

```python
import time
import random

def api_request_with_retry(func, max_retries=5, initial_delay=1):
    """Call func(), retrying with exponential backoff when the error mentions 429."""
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "429" in str(e) and retries < max_retries - 1:
                # Exponential backoff with jitter
                delay = initial_delay * (2 ** retries) + random.uniform(0, 1)
                print(f"Rate limited. Retrying in {delay:.2f} seconds...")
                time.sleep(delay)
                retries += 1
            else:
                # Not a rate-limit error, or out of retries: propagate it
                raise
```

## API Limitations

- **Maximum file size**: 52 MB
- **Supported file formats**: PDF, JPG, JPEG, PNG, WEBP, GIF
- **Rate limits**: Depends on your Mistral AI account tier
- **Concurrent requests**: Depends on your Mistral AI account tier
- **Image extraction**: Some complex images or diagrams may not be perfectly extracted
- **Language support**: Check the Mistral AI documentation for the latest information on supported languages
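
The supported-format list above can be enforced before a request is made. This is a minimal sketch that checks only the file extension, not the file contents. `is_supported_file` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the set simply mirrors the formats listed above.

```python
from pathlib import Path

# Extensions taken from the "Supported file formats" item above
SUPPORTED_EXTENSIONS = {'.pdf', '.jpg', '.jpeg', '.png', '.webp', '.gif'}


def is_supported_file(path):
    """Return True if the file extension is in the supported-format list."""
    # Compare case-insensitively so "SCAN.PDF" is accepted as well
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```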