Add feature to extract images as separate files

2025-04-24 21:44:49 +02:00
parent 012755b7f4
commit 220864d52f
6 changed files with 197 additions and 17 deletions
@@ -13,6 +13,8 @@ This document provides detailed information about the Mistral OCR API response f
  - [Working with the API Response](#working-with-the-api-response)
    - [Parsing the JSON Response](#parsing-the-json-response)
    - [Handling Images](#handling-images)
+      - [1. Embedded Images](#1-embedded-images)
+      - [2. Extracted Images](#2-extracted-images)
    - [Working with Markdown Content](#working-with-markdown-content)
  - [Error Handling](#error-handling)
    - [API Key Errors](#api-key-errors)
@@ -113,7 +115,11 @@ for page in ocr_data.get('pages', []):

 ### Handling Images

-If you've included images in the response (using the `--include-images` flag), you can extract and save them:
+The Mistral OCR CLI provides two approaches for handling images:
+
+#### 1. Embedded Images
+
+When you use the `--images` flag without `--extract-images`, images are embedded directly in the markdown as base64 data. If the response includes images (requested with the `--include-images` flag), you can extract and save them manually:

 ```python
 import base64
@@ -141,6 +147,65 @@ for page in ocr_data.get('pages', []):
                img_file.write(img_bytes)
 ```

+#### 2. Extracted Images
+
+Alternatively, you can use the `--extract-images` flag with the CLI to automatically extract images to separate files. This approach:
+
+- Saves each image as a separate file in the specified directory (or `output_dir/images` by default)
+- Updates the markdown to reference these image files instead of embedding base64 data
+- Results in smaller, more manageable markdown files
+
+Example command:
+```bash
+mistral-ocr markdown document.pdf --images --extract-images --image-dir custom_images
+```
+
+If you're working with the API directly and want to implement similar functionality, here's how you might do it:
+
+```python
+import base64
+import os
+import re
+
+def extract_images_from_ocr_data(ocr_data, image_dir='images'):
+    """Extract images from OCR data and update markdown references."""
+    # Create image directory
+    os.makedirs(image_dir, exist_ok=True)
+    
+    # Process each page
+    for page in ocr_data.get('pages', []):
+        page_index = page.get('index', 0)
+        markdown = page.get('markdown', '')
+        
+        # Extract and save images
+        for img_index, image in enumerate(page.get('images', [])):
+            img_id = image.get('id', f'unknown-{img_index}')
+            img_data = image.get('image_base64', '')
+            
+            if img_data:
+                # Generate filename
+                filename = f"{img_id.replace(' ', '_')}.jpg"
+                filepath = os.path.join(image_dir, filename)
+                
+                # Remove data URL prefix if present
+                if ',' in img_data:
+                    img_data = img_data.split(',', 1)[1]
+                
+                # Save the image
+                with open(filepath, 'wb') as img_file:
+                    img_file.write(base64.b64decode(img_data))
+                
+                # Update markdown to reference the file. The link uses a forward slash so it
+                # stays portable, and the lambda stops re.sub from parsing escapes in the path.
+                pattern = f"!\\[{re.escape(img_id)}\\]\\({re.escape(img_id)}\\)"
+                replacement = f"![{img_id}]({os.path.basename(image_dir)}/{filename})"
+                markdown = re.sub(pattern, lambda _match: replacement, markdown)
+        
+        # Update the page's markdown
+        page['markdown'] = markdown
+    
+    return ocr_data
+```
+
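The function above always writes a `.jpg` file and discards any data URL prefix. If the prefix is present, it may be possible to use it to choose the file extension instead. The following is a minimal sketch under that assumption: `decode_image` and the extension table are hypothetical helpers, not part of the CLI or the API, and the sketch assumes the prefix has the form `data:image/<subtype>;base64,`.

```python
import base64
import re

# Hypothetical helper pattern: captures the MIME subtype and the base64 payload.
_DATA_URL = re.compile(
    r"^data:image/(?P<subtype>[\w.+-]+);base64,(?P<payload>.*)$", re.DOTALL
)

# Illustrative mapping only; extend it for the formats you actually receive.
_EXTENSIONS = {"jpeg": ".jpg", "png": ".png", "gif": ".gif", "webp": ".webp"}


def decode_image(img_data, default_ext=".jpg"):
    """Return (extension, raw_bytes) for base64 data with an optional data URL prefix."""
    match = _DATA_URL.match(img_data)
    if match:
        # Unknown subtypes fall back to the default extension
        ext = _EXTENSIONS.get(match.group("subtype").lower(), default_ext)
        payload = match.group("payload")
    else:
        # No prefix: keep the default, as the function above does
        ext, payload = default_ext, img_data
    return ext, base64.b64decode(payload)
```

Inside the extraction loop, the returned extension could replace the hard-coded `.jpg` when building `filename`, and the returned bytes could be written to the file directly.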
 ### Working with Markdown Content

 The OCR results are provided in Markdown format, which makes it easy to convert to other formats or display in applications: