# Mistral OCR CLI (Python)

A command-line tool for processing documents with Mistral AI's OCR capabilities, implemented in Python. This tool allows you to extract text and structured content from PDF documents and images while preserving the original formatting and layout.

## Features

- Process PDF documents and images using Mistral AI's OCR
- Extract text and structured content from documents
- Process local files or files from URLs
- Output results to stdout or to a file
- Convert OCR results to Markdown format
- Maintain document structure and formatting in the output
- Support for extracting and embedding images
- Metadata extraction (title, author, creation date)
- Page-by-page processing with optional single-file output

## How It Works

Mistral OCR CLI works by:

1. Uploading your document to the Mistral AI API (for local files) or providing the URL
2. Processing the document using Mistral's advanced OCR capabilities
3. Receiving structured JSON data containing the extracted text, formatting, and metadata
4. Optionally converting this data to Markdown format for easy reading and editing

The tool handles authentication, file uploads, API communication, and result formatting, making it easy to integrate OCR capabilities into your workflow.

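
Step 1 depends on whether the input is a local path or a URL. The following is a minimal sketch of that decision, assuming only `http` and `https` inputs count as remote. `classify_source` is a hypothetical helper for illustration and is not part of the CLI.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for http(s) inputs and "file" for everything else."""
    parsed = urlparse(source)
    # Require both a web scheme and a host so that Windows paths such as
    # "C:\\docs\\a.pdf" (parsed with scheme "c") still count as local files.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"


print(classify_source("https://example.com/document.pdf"))
print(classify_source("path/to/document.pdf"))
```

Under this assumption, a local file would be uploaded first, while a URL would be passed to the API as-is.
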
## Installation

### Requirements

- Python 3.7 or later
- pip (Python package installer)
- A Mistral AI API key (sign up at [Mistral AI](https://mistral.ai) if you don't have one)

### Installing from source

```bash
git clone https://github.com/yourusername/mistral-ocr
cd mistral-ocr
pip install -e .
```

Alternatively, you can use the build script which creates a virtual environment and installs the package:

```bash
git clone https://github.com/yourusername/mistral-ocr
cd mistral-ocr
./build.sh
```

### Installing from PyPI (coming soon)

```bash
pip install mistral-ocr
```

## Usage

### Setting up your API key

You can provide your Mistral API key in two ways:

1. Environment variable (recommended for security):
   ```bash
   export MISTRAL_API_KEY=your-api-key
   ```

2. Command-line flag:
   ```bash
   mistral-ocr --api-key=your-api-key [command]
   ```

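
The lookup behind these two options can be sketched as follows. This is an illustration only: `resolve_api_key` is a hypothetical name, and the precedence (flag first, then environment) is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the API key from the flag value, then from the environment."""
    env = os.environ if env is None else env
    # Assumption: an explicit --api-key value takes precedence over the
    # MISTRAL_API_KEY environment variable.
    key = cli_value or env.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError(
            "API key must be provided or set as MISTRAL_API_KEY "
            "environment variable"
        )
    return key
```

Passing the environment as a parameter keeps the function easy to test without touching the real `os.environ`.
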
### Commands

#### Process a document

Process a document file or URL:

```bash
# Process a local PDF file
mistral-ocr process path/to/document.pdf

# Process a document from a URL
mistral-ocr process https://example.com/document.pdf

# Process an image from a URL
mistral-ocr process https://example.com/image.jpg

# Save output to a file
mistral-ocr process path/to/document.pdf --output-file results.json

# Include base64 encoded images in the output
mistral-ocr process path/to/document.pdf --include-images
```

#### Convert OCR JSON to Markdown

Convert previously processed OCR JSON results to Markdown:

```bash
# Convert OCR JSON to Markdown
mistral-ocr convert results.json

# Specify output directory
mistral-ocr convert results.json --output-dir output_folder

# Create a single markdown file instead of one per page
mistral-ocr convert results.json --single-file

# Specify output filename for single file mode
mistral-ocr convert results.json --output-file document.md

# Include images in markdown (if available in JSON)
mistral-ocr convert results.json --images

# Extract images to files instead of embedding them in markdown
mistral-ocr convert results.json --images --extract-images

# Specify a custom directory for extracted images
mistral-ocr convert results.json --images --extract-images --image-dir images_folder
```

#### Process and Convert in One Step

Process a document and convert to Markdown in a single command:

```bash
# Process document and generate markdown files
mistral-ocr markdown path/to/document.pdf

# Generate a single markdown file instead of separate files per page
mistral-ocr markdown path/to/document.pdf --single-file

# Specify output directory for markdown files
mistral-ocr markdown https://example.com/document.pdf --output-dir docs

# Specify an output file path (implies single file)
mistral-ocr markdown path/to/document.pdf --output-file docs/result.md

# Save intermediate JSON and generate markdown files
mistral-ocr markdown path/to/document.pdf --json-file results.json --output-dir docs

# Extract images to files instead of embedding them in markdown
mistral-ocr markdown path/to/document.pdf --images --extract-images

# Specify a custom directory for extracted images
mistral-ocr markdown path/to/document.pdf --images --extract-images --image-dir custom_images
```

This command combines the `process` and `convert` steps, creating Markdown files directly from the document.

#### Version information

```bash
mistral-ocr version
```

### Examples

#### Process a local PDF and save the output

```bash
mistral-ocr process ~/Documents/sample.pdf --output-file results.json
```

#### Process a document from a URL

```bash
mistral-ocr process https://arxiv.org/pdf/2201.04234 > output.json
```

#### Convert OCR JSON to Markdown files

```bash
# Create separate files (one per page)
mistral-ocr convert output.json --output-dir markdown_docs

# Create a single file with all pages
mistral-ocr convert output.json --single-file --output-dir markdown_docs

# Create a single file with a specific filename
mistral-ocr convert output.json --output-file docs/paper.md
```

#### Process a document and generate markdown files in one step

```bash
# Generate separate files (one per page)
mistral-ocr markdown ~/Documents/research-paper.pdf --output-dir research_docs

# Generate a single markdown file
mistral-ocr markdown ~/Documents/research-paper.pdf --single-file --output-dir research_docs

# Generate a single markdown file with specific filename
mistral-ocr markdown ~/Documents/research-paper.pdf --output-file research_docs/paper.md

# Process a document and extract images to separate files
mistral-ocr markdown ~/Documents/research-paper.pdf --images --extract-images --output-dir research_docs
```

## Image Handling

The tool provides several options for handling images in the OCR output:

1. **No images**: By default, images are not included in the output.

2. **Embedded images**: Using the `--images` flag without `--extract-images` will embed base64-encoded images directly in the markdown file. This creates a self-contained document but can result in very large files.

3. **Extracted images**: Using both `--images` and `--extract-images` flags will:
   - Extract images from the OCR results
   - Save them as separate files in an images directory
   - Reference these files in the markdown instead of embedding the base64 data

You can specify a custom directory for extracted images using the `--image-dir` option. If not specified, images will be saved in a subdirectory called "images" within the output directory.

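
The extraction mode described above can be sketched roughly as follows. This is not the tool's actual implementation: `extract_images` is a hypothetical helper, and it assumes that each image `id` is usable as a filename, that the page markdown links to an image by its `id`, and that the base64 payload may carry a `data:` URI prefix.

```python
import base64
from pathlib import Path


def extract_images(page: dict, image_dir: Path) -> str:
    """Write a page's images into image_dir and return the rewritten markdown."""
    image_dir.mkdir(parents=True, exist_ok=True)
    markdown = page["markdown"]
    for image in page.get("images", []):
        data = image.get("image_base64")
        if not data:
            continue  # no image data was included for this entry
        # Assumption: the payload may be a data URI ("data:image/png;base64,...");
        # keep only the part after the first comma before decoding.
        if data.startswith("data:") and "," in data:
            data = data.split(",", 1)[1]
        (image_dir / image["id"]).write_bytes(base64.b64decode(data))
        # Assumption: the markdown links to the image by its id. The new link
        # is relative to the parent of image_dir, where the markdown file lives.
        markdown = markdown.replace(
            f"]({image['id']})", f"]({image_dir.name}/{image['id']})"
        )
    return markdown
```

With `image_dir` set to `output/images`, a link such as `![image-1](image-1)` would become `![image-1](images/image-1)`, which matches the default layout described above.
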
## OCR Response Format

The OCR API returns a JSON response with the following structure:

```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Document Author",
    "creation_date": "2023-01-01",
    "page_count": 5
  },
  "pages": [
    {
      "index": 0,
      "markdown": "# Page Content\n\nThis is the content of page 1...",
      "images": [
        {
          "id": "image-1",
          "image_base64": "base64-encoded-image-data"
        }
      ]
    },
    {
      "index": 1,
      "markdown": "## Page 2 Content\n\nThis is the content of page 2...",
      "images": []
    }
  ]
}
```

### Key Components:

- **metadata**: Contains document-level information
  - **title**: Document title (if available)
  - **author**: Document author (if available)
  - **creation_date**: Document creation date (if available)
  - **page_count**: Total number of pages

- **pages**: Array of page objects
  - **index**: Zero-based page index
  - **markdown**: Extracted text in Markdown format
  - **images**: Array of images found on the page
    - **id**: Unique image identifier
    - **image_base64**: Base64-encoded image data (only included if `--include-images` is specified)

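
Given this structure, turning the `pages` array into Markdown is mostly a matter of ordering pages by `index` and either joining them or writing one file per page. The sketch below illustrates the idea; `pages_to_markdown` and the output filenames (`page_1.md`, `document.md`) are hypothetical and may not match what the `convert` command produces.

```python
import json


def pages_to_markdown(ocr_result: dict, single_file: bool = False) -> dict:
    """Map output filenames to markdown text built from the "pages" array."""
    pages = sorted(ocr_result.get("pages", []), key=lambda p: p["index"])
    if single_file:
        # Join all pages into one document, separated by blank lines.
        return {"document.md": "\n\n".join(p["markdown"] for p in pages)}
    # One file per page; "index" is zero-based, so add 1 for the filename.
    return {f"page_{p['index'] + 1}.md": p["markdown"] for p in pages}


sample = json.loads("""
{"pages": [{"index": 1, "markdown": "## Page 2 Content", "images": []},
           {"index": 0, "markdown": "# Page Content", "images": []}]}
""")
print(list(pages_to_markdown(sample)))
print(pages_to_markdown(sample, single_file=True)["document.md"])
```

Returning a mapping instead of writing files directly keeps the conversion logic separate from decisions about the output directory.
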
## Troubleshooting

### Common Issues

#### API Key Issues

```
Error processing document: API key must be provided or set as MISTRAL_API_KEY environment variable
```

**Solution**: Ensure your API key is correctly set as an environment variable or provided with the `--api-key` flag.

#### File Size Limits

```
Error processing document: File is too large (55.00 MB). Maximum allowed size is 52.00 MB
```

**Solution**: The Mistral API has a file size limit of 52 MB. For larger files, consider splitting them into smaller documents.

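
If you script around the CLI, you can check the size before submitting a file. The following is a minimal sketch, assuming 1 MB means 1024 × 1024 bytes; `check_file_size` is a hypothetical helper whose error text mirrors the message shown above.

```python
import os

MAX_SIZE_MB = 52.0  # limit quoted in this section


def check_file_size(path: str, max_mb: float = MAX_SIZE_MB) -> float:
    """Return the file size in MB, raising ValueError if it exceeds max_mb."""
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f"File is too large ({size_mb:.2f} MB). "
            f"Maximum allowed size is {max_mb:.2f} MB"
        )
    return size_mb
```
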
#### Rate Limiting

```
Error processing document: API returned error status: 429 - Rate limit exceeded
```

**Solution**: The API has rate limits. Wait a few minutes before trying again or contact Mistral AI to increase your rate limits.

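
When calling the API from your own scripts, retrying with exponential backoff is a common way to handle 429 responses. The sketch below shows the general pattern only; `RateLimitError` and `with_backoff` are hypothetical names, and the delays are illustrative rather than values recommended by Mistral AI.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the waiting behavior can be replaced in tests or swapped for a version that adds jitter.
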
#### Invalid JSON

```
Error converting JSON to markdown: Expecting property name enclosed in double quotes
```

**Solution**: Ensure the JSON file is valid. You can validate it using tools like `jq`.

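
If `jq` is not available, Python's standard `json` module reports the same kind of error together with its position. A small sketch, where `find_json_error` is a hypothetical helper name:

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return None for valid JSON, else a message with line and column."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} (line {exc.lineno}, column {exc.colno})"
    return None


# Single quotes are not valid JSON, so this input is reported as an error.
print(find_json_error("{'pages': []}"))
```
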
### API Limitations

- Maximum file size: 52 MB
- Supported file formats: PDF, JPG, JPEG, PNG, WEBP, GIF
- Rate limits may apply depending on your Mistral AI account tier

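
For local files, the format list above can be checked by extension before a request is sent. This is a sketch only: `has_supported_extension` is a hypothetical helper, and an extension check says nothing about the actual file contents. URLs may have no extension at all, so the check is not meaningful for them.

```python
from pathlib import Path

# Extensions derived from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".gif"}


def has_supported_extension(path: str) -> bool:
    """Check a local file path against the supported formats, ignoring case."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```
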
## Architecture Documentation

For a comprehensive overview of the Mistral OCR architecture, including UML diagrams, sequence diagrams, and other visual representations, please refer to the [ARCHITECTURE.md](ARCHITECTURE.md) document. This documentation provides detailed insights into:

- Class structure and relationships
- Component architecture
- Process workflows
- Data flow through the system
- User interaction patterns

These diagrams are useful for understanding the system design, onboarding new contributors, and planning future enhancements.

## Contributing

Contributions to Mistral OCR CLI are welcome! Here's how you can contribute:

1. **Fork the repository**
2. **Create a feature branch**:
   ```bash
   git checkout -b feature/your-feature-name
   ```
3. **Make your changes**
4. **Run tests** (if available):
   ```bash
   python -m unittest discover tests
   ```
5. **Submit a pull request**

Please ensure your code follows the project's coding standards and includes appropriate tests and documentation. For understanding the codebase structure, refer to the [ARCHITECTURE.md](ARCHITECTURE.md) document.

## License

MIT