Add feature to extract images as separate files
@@ -13,6 +13,8 @@ This document provides detailed information about the Mistral OCR API response f
- [Working with the API Response](#working-with-the-api-response)
  - [Parsing the JSON Response](#parsing-the-json-response)
  - [Handling Images](#handling-images)
    - [1. Embedded Images](#1-embedded-images)
    - [2. Extracted Images](#2-extracted-images)
  - [Working with Markdown Content](#working-with-markdown-content)
- [Error Handling](#error-handling)
  - [API Key Errors](#api-key-errors)
@@ -113,7 +115,11 @@ for page in ocr_data.get('pages', []):

### Handling Images

If you've included images in the response (using the `--include-images` flag), you can extract and save them:
The Mistral OCR CLI provides two approaches for handling images:

#### 1. Embedded Images

When using the `--images` flag without `--extract-images`, images are embedded directly in the markdown as base64 data. If you've included images in the response (using the `--include-images` flag), you can extract and save them manually:

```python
import base64
@@ -141,6 +147,65 @@ for page in ocr_data.get('pages', []):
img_file.write(img_bytes)
```

#### 2. Extracted Images

Alternatively, you can use the `--extract-images` flag with the CLI to automatically extract images to separate files. This approach:

- Saves each image as a separate file in the specified directory (or `output_dir/images` by default)
- Updates the markdown to reference these image files instead of embedding base64 data
- Results in smaller, more manageable markdown files

Example command:
```bash
mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images
```

If you're working with the API directly and want to implement similar functionality, here's how you might do it:

```python
import base64
import os
import re


def extract_images_from_ocr_data(ocr_data, image_dir='images'):
    """Extract images from OCR data and update markdown references."""
    # Create image directory
    os.makedirs(image_dir, exist_ok=True)

    # Process each page
    for page in ocr_data.get('pages', []):
        page_index = page.get('index', 0)
        markdown = page.get('markdown', '')

        # Extract and save images
        for img_index, image in enumerate(page.get('images', [])):
            img_id = image.get('id', f'unknown-{img_index}')
            img_data = image.get('image_base64', '')

            if img_data:
                # Generate filename
                filename = f"{img_id.replace(' ', '_')}.jpg"
                filepath = os.path.join(image_dir, filename)

                # Remove data URL prefix if present
                if ',' in img_data:
                    img_data = img_data.split(',', 1)[1]

                # Save the image
                with open(filepath, 'wb') as img_file:
                    img_file.write(base64.b64decode(img_data))

                # Update markdown to reference the file
                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
                replacement = f"![{img_id}]({os.path.join(os.path.basename(image_dir), filename)})"
                # Pass a function so backslashes in the path are not read as regex escapes
                markdown = re.sub(pattern, lambda _match: replacement, markdown)

        # Update the page's markdown
        page['markdown'] = markdown

    return ocr_data
```
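
The function above writes every image with a `.jpg` extension. If the `image_base64` value carries a data URL prefix, the MIME type in that prefix could be used to pick the extension instead. The following is a minimal sketch, assuming the prefix has the form `data:image/png;base64,`. `split_data_url` and the `EXTENSIONS` table are hypothetical helpers for illustration, not part of the Mistral OCR CLI or API:

```python
import base64

# Hypothetical lookup table; extend it with any other types you expect.
EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def split_data_url(img_data, default_ext=".jpg"):
    """Return (extension, decoded_bytes) for a base64 string or data URL."""
    ext = default_ext
    if img_data.startswith("data:") and "," in img_data:
        # The header looks like "data:image/png;base64"
        header, img_data = img_data.split(",", 1)
        mime_type = header[len("data:"):].split(";", 1)[0]
        ext = EXTENSIONS.get(mime_type, default_ext)
    return ext, base64.b64decode(img_data)


# Usage with a made-up payload (the bytes are not a real image)
sample = "data:image/png;base64," + base64.b64encode(b"fake-bytes").decode("ascii")
ext, raw = split_data_url(sample)
print(ext, len(raw))
```

Strings without a data URL prefix fall back to `default_ext`, which matches the behavior of the function above.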

### Working with Markdown Content

The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications: