Add feature to extract images as separate files

2025-04-24 21:44:49 +02:00
parent 012755b7f4
commit 220864d52f
6 changed files with 197 additions and 17 deletions
+66 -1
@@ -13,6 +13,8 @@ This document provides detailed information about the Mistral OCR API response f
- [Working with the API Response](#working-with-the-api-response)
- [Parsing the JSON Response](#parsing-the-json-response)
- [Handling Images](#handling-images)
- [1. Embedded Images](#1-embedded-images)
- [2. Extracted Images](#2-extracted-images)
- [Working with Markdown Content](#working-with-markdown-content)
- [Error Handling](#error-handling)
- [API Key Errors](#api-key-errors)
@@ -113,7 +115,11 @@ for page in ocr_data.get('pages', []):
### Handling Images
If you've included images in the response (using the `--include-images` flag), you can extract and save them:
The Mistral OCR CLI provides two approaches for handling images:
#### 1. Embedded Images
When using the `--images` flag without `--extract-images`, images are embedded directly in the markdown as base64 data. If you've included images in the response (using the `--include-images` flag), you can extract and save them manually:
```python
import base64
@@ -141,6 +147,65 @@ for page in ocr_data.get('pages', []):
img_file.write(img_bytes)
```
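To check how much base64 data a markdown file carries before deciding between the two approaches, a small scan can help. The sketch below assumes embedded images appear as standard markdown image links whose target is a `data:image/...;base64,` URI; the helper name `list_embedded_images` is hypothetical and not part of the CLI.

```python
import base64
import re

# Assumed shape of an embedded image: ![alt](data:image/<type>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[([^\]]*)\]\(data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)

def list_embedded_images(markdown):
    """Return (alt_text, mime_type, decoded_size_in_bytes) for each embedded image."""
    results = []
    for alt, mime, payload in DATA_URI_IMAGE.findall(markdown):
        results.append((alt, mime, len(base64.b64decode(payload))))
    return results
```

Links that point at ordinary files are ignored, so the result only reflects images stored inline.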
#### 2. Extracted Images
Alternatively, you can use the `--extract-images` flag with the CLI to automatically extract images to separate files. This approach:
- Saves each image as a separate file in the specified directory (or `output_dir/images` by default)
- Updates the markdown to reference these image files instead of embedding base64 data
- Results in smaller, more manageable markdown files
Example command:
```bash
mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images
```
If you're working with the API directly and want to implement similar functionality, here's how you might do it:
```python
import base64
import os
import re

def extract_images_from_ocr_data(ocr_data, image_dir='images'):
    """Extract images from OCR data and update markdown references."""
    # Create image directory
    os.makedirs(image_dir, exist_ok=True)

    # Process each page
    for page in ocr_data.get('pages', []):
        page_index = page.get('index', 0)
        markdown = page.get('markdown', '')

        # Extract and save images
        for img_index, image in enumerate(page.get('images', [])):
            img_id = image.get('id', f'unknown-{img_index}')
            img_data = image.get('image_base64', '')
            if img_data:
                # Generate filename
                filename = f"{img_id.replace(' ', '_')}.jpg"
                filepath = os.path.join(image_dir, filename)

                # Remove data URL prefix if present
                if ',' in img_data:
                    img_data = img_data.split(',', 1)[1]

                # Save the image
                with open(filepath, 'wb') as img_file:
                    img_file.write(base64.b64decode(img_data))

                # Update markdown to reference the file
                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
                replacement = f"![{img_id}]({os.path.join(os.path.basename(image_dir), filename)})"
                markdown = re.sub(pattern, replacement, markdown)

        # Update the page's markdown
        page['markdown'] = markdown

    return ocr_data
```
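The function above always writes files with a `.jpg` extension. If the `image_base64` value carries a data URL prefix, the MIME type in that prefix can be used to pick the extension instead. This is a minimal sketch under that assumption; `split_data_url` and the `KNOWN_EXTENSIONS` table are hypothetical names introduced for illustration.

```python
import base64

# Illustrative mapping only; extend it for other image types as needed.
KNOWN_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def split_data_url(img_data, default_ext=".jpg"):
    """Return (raw_bytes, extension) for base64 data with an optional data URL prefix."""
    ext = default_ext
    if img_data.startswith("data:") and "," in img_data:
        header, img_data = img_data.split(",", 1)
        # The header is expected to look like "data:image/png;base64"
        mime = header[len("data:"):].split(";", 1)[0]
        ext = KNOWN_EXTENSIONS.get(mime, default_ext)
    return base64.b64decode(img_data), ext
```

When no prefix is present, the data is decoded as-is and the default extension is returned, which matches the behaviour of the function above.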
### Working with Markdown Content
The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications: