Add feature to extract images as separate files

2025-04-24 21:44:49 +02:00
parent 012755b7f4
commit 220864d52f
6 changed files with 197 additions and 17 deletions
@@ -113,6 +113,12 @@ mistral-ocr convert results.json --output-file document.md

 # Include images in markdown (if available in JSON)
 mistral-ocr convert results.json --images
+
+# Extract images to files instead of embedding them in markdown
+mistral-ocr convert results.json --images --extract-images
+
+# Specify a custom directory for extracted images
+mistral-ocr convert results.json --images --extract-images --image-dir images_folder
 ```

 #### Process and Convert in One Step
@@ -134,6 +140,12 @@ mistral-ocr markdown path/to/document.pdf --output-file docs/result.md

 # Save intermediate JSON and generate markdown files
 mistral-ocr markdown path/to/document.pdf --json-file results.json --output-dir docs
+
+# Extract images to files instead of embedding them in markdown
+mistral-ocr markdown path/to/document.pdf --images --extract-images
+
+# Specify a custom directory for extracted images
+mistral-ocr markdown path/to/document.pdf --images --extract-images --image-dir custom_images
 ```

 This command combines the `process` and `convert` steps, creating markdown files directly from the document.
@@ -182,8 +194,26 @@ mistral-ocr markdown ~/Documents/research-paper.pdf --single-file --output-dir r

 # Generate a single markdown file with specific filename
 mistral-ocr markdown ~/Documents/research-paper.pdf --output-file research_docs/paper.md
+
+# Process a document and extract images to separate files
+mistral-ocr markdown ~/Documents/research-paper.pdf --images --extract-images --output-dir research_docs
 ```

+## Image Handling
+
+The tool provides three options for handling images in the OCR output:
+
+1. **No images**: By default, images are not included in the output.
+
+2. **Embedded images**: Using the `--images` flag without `--extract-images` will embed base64-encoded images directly in the markdown file. This creates a self-contained document but can result in very large files.
+
+3. **Extracted images**: Using both the `--images` and `--extract-images` flags will:
+   - Extract images from the OCR results
+   - Save them as separate files in an images directory
+   - Reference these files in the markdown instead of embedding the base64 data
+
+Use the `--image-dir` option to choose a custom directory for extracted images. If it is omitted, images are saved in an `images` subdirectory of the output directory.
+
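As a rough illustration of the extracted-images mode described above (this is not the code from this commit), the sketch below decodes base64 images, writes them to a directory, and rewrites the markdown links to point at the files. The function name `extract_images`, the `link_prefix` parameter, and the input shape (a mapping from image ID to a base64 string, optionally carrying a `data:` URI prefix) are assumptions made for the example.

```python
import base64
from pathlib import Path


def extract_images(
    markdown: str,
    images: dict[str, str],
    image_dir: Path,
    link_prefix: str = "images",
) -> str:
    """Write base64 images to image_dir and return markdown that links to them.

    Hypothetical helper: `images` maps an image ID such as "img-0.jpeg" to its
    base64 payload, which may start with a "data:image/...;base64," prefix.
    """
    image_dir.mkdir(parents=True, exist_ok=True)
    for image_id, data in images.items():
        # Strip the data-URI header if present, keeping only the base64 payload.
        payload = data.split(",", 1)[1] if data.startswith("data:") else data
        # Use only the final path component so an ID cannot escape image_dir.
        name = Path(image_id).name
        (image_dir / name).write_bytes(base64.b64decode(payload))
        # Point the markdown link target at the saved file instead of the bare ID.
        markdown = markdown.replace(f"]({image_id})", f"]({link_prefix}/{name})")
    return markdown
```

Here `link_prefix` should be the path from the markdown file to `image_dir`, so that the rewritten links resolve when the document is rendered.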
 ## OCR Response Format

 The OCR API returns a JSON response with the following structure: