Mistral OCR Architecture Documentation
This document describes the architecture of the Mistral OCR CLI, using UML-style diagrams to illustrate the system's structure, behavior, and interactions.
System Overview
Mistral OCR is a command-line tool for processing documents with Mistral AI's OCR capabilities. The system allows users to extract text and structured content from PDF documents and images while preserving the original formatting and layout.
The tool is structured around a set of commands:
process: Processes a document (local file or URL) with OCR and outputs JSON
convert: Converts the JSON output to Markdown
markdown: Combines the process and convert steps in one command
version: Displays the version information
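The command set above could be wired up with subparsers. The following is a minimal sketch, assuming the CLI is built on `argparse` (the source does not say which library is used). The argument names `source` and `json_file` are hypothetical; the `--images` and `--extract-images` flags are taken from the image extraction workflow later in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four documented subcommands."""
    parser = argparse.ArgumentParser(prog="mistral-ocr")
    sub = parser.add_subparsers(dest="command")

    # process: OCR a local file or URL and emit JSON.
    process = sub.add_parser("process", help="OCR a document and output JSON")
    process.add_argument("source", help="local file path or URL (hypothetical name)")

    # convert: turn previously saved OCR JSON into Markdown.
    convert = sub.add_parser("convert", help="convert OCR JSON to Markdown")
    convert.add_argument("json_file", help="hypothetical argument name")
    convert.add_argument("--images", action="store_true")
    convert.add_argument("--extract-images", action="store_true")

    # markdown: process and convert in one step.
    markdown = sub.add_parser("markdown", help="process and convert in one step")
    markdown.add_argument("source", help="local file path or URL (hypothetical name)")

    # version: no arguments.
    sub.add_parser("version", help="show version information")
    return parser
```

When no subcommand is given, `args.command` is `None`, which a caller can use to print the help text.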
Class Diagram
The following class diagram illustrates the main classes in the Mistral OCR system and their relationships:
classDiagram
class MistralClient {
+BASE_URL: str
+MAX_FILE_SIZE: int
+api_key: str
+session: requests.Session
+__init__(api_key: str)
+upload_file(file_path: str): str
+get_file_url(file_id: str): str
+process_ocr(doc_type: str, doc_source: str, include_image_base64: bool): bytes
}
class OCRResponse {
+pages: list
+metadata: OCRResponseMetadata
+__init__(pages: list, metadata: OCRResponseMetadata)
}
class OCRResponseMetadata {
+title: str
+author: str
+creation_date: str
+page_count: int
+__init__(title: str, author: str, creation_date: str, page_count: int)
}
class OCRResponsePage {
+index: int
+markdown: str
+image: str
+images: list
+dimensions: dict
+__init__(index: int, markdown: str, image: str, images: list, dimensions: dict)
}
class OCRResponseImage {
+id: str
+image_base64: str
+__init__(id: str, image_base64: str)
}
OCRResponse "1" *-- "1" OCRResponseMetadata: contains
OCRResponse "1" *-- "many" OCRResponsePage: contains
OCRResponsePage "1" *-- "many" OCRResponseImage: contains
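The response classes in the diagram map naturally onto dataclasses. This is only a sketch: the attribute names come from the diagram, but the JSON key layout, the default values, and the `parse_response` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class OCRResponseImage:
    id: str
    image_base64: str


@dataclass
class OCRResponsePage:
    index: int
    markdown: str
    image: str = ""
    images: List[OCRResponseImage] = field(default_factory=list)
    dimensions: Dict[str, Any] = field(default_factory=dict)


@dataclass
class OCRResponseMetadata:
    title: str = ""
    author: str = ""
    creation_date: str = ""
    page_count: int = 0


@dataclass
class OCRResponse:
    pages: List[OCRResponsePage]
    metadata: OCRResponseMetadata


def parse_response(data: Dict[str, Any]) -> OCRResponse:
    """Hypothetical helper: build the object graph from a decoded JSON dict.

    Assumes the JSON keys mirror the attribute names in the class diagram.
    """
    pages = [
        OCRResponsePage(
            index=p["index"],
            markdown=p.get("markdown", ""),
            image=p.get("image", ""),
            images=[
                OCRResponseImage(i["id"], i["image_base64"])
                for i in p.get("images", [])
            ],
            dimensions=p.get("dimensions", {}),
        )
        for p in data.get("pages", [])
    ]
    meta = data.get("metadata", {})
    metadata = OCRResponseMetadata(
        title=meta.get("title", ""),
        author=meta.get("author", ""),
        creation_date=meta.get("creation_date", ""),
        # Fall back to the number of parsed pages when the count is absent.
        page_count=meta.get("page_count", len(pages)),
    )
    return OCRResponse(pages=pages, metadata=metadata)
```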
Component Diagram
The following component diagram shows the high-level architecture of the Mistral OCR system:
graph TD
CLI[CLI Interface] --> ProcessCmd[Process Command]
CLI --> ConvertCmd[Convert Command]
CLI --> MarkdownCmd[Markdown Command]
CLI --> VersionCmd[Version Command]
ProcessCmd --> APIClient[Mistral API Client]
ConvertCmd --> JSONParser[JSON Parser]
MarkdownCmd --> ProcessCmd
MarkdownCmd --> ConvertCmd
APIClient --> ExternalAPI[Mistral AI API]
JSONParser --> MarkdownGenerator[Markdown Generator]
MarkdownGenerator --> ImageHandler[Image Handler]
subgraph "External Services"
ExternalAPI
end
subgraph "Core Components"
APIClient
JSONParser
MarkdownGenerator
ImageHandler
end
subgraph "Command Interface"
ProcessCmd
ConvertCmd
MarkdownCmd
VersionCmd
end
Sequence Diagrams
Process Document Workflow
The following sequence diagram illustrates the workflow for processing a document with OCR:
sequenceDiagram
participant User
participant CLI as CLI Interface
participant Process as Process Command
participant Client as Mistral Client
participant API as Mistral AI API
User->>CLI: mistral-ocr process document.pdf
CLI->>Process: run(args)
alt Local File
Process->>Client: process_local_file(file_path, output_file, include_images)
Client->>Client: upload_file(file_path)
Client->>API: POST /files
API-->>Client: file_id
Client->>Client: get_file_url(file_id)
Client->>API: GET /files/{file_id}/url
API-->>Client: signed_url
Client->>API: POST /ocr
API-->>Client: OCR results (JSON)
else URL
Process->>Client: process_url(url, output_file, include_images)
Client->>API: POST /ocr
API-->>Client: OCR results (JSON)
end
Client-->>Process: OCR results (JSON)
Process->>Process: handle_output(data, output_file)
Process-->>CLI: Success message
CLI-->>User: Display results or confirmation
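The two branches of this workflow can be sketched against the three client methods shown in the class diagram. The `OCRClient` protocol, the `is_url` and `run_ocr` helpers, and the `"document_url"` value passed as `doc_type` are assumptions for illustration, not the tool's confirmed implementation.

```python
from typing import Protocol


class OCRClient(Protocol):
    """The subset of MistralClient methods this workflow relies on."""

    def upload_file(self, file_path: str) -> str: ...

    def get_file_url(self, file_id: str) -> str: ...

    def process_ocr(
        self, doc_type: str, doc_source: str, include_image_base64: bool
    ) -> bytes: ...


def is_url(source: str) -> bool:
    """Treat anything with an http(s) scheme as a remote document."""
    return source.startswith(("http://", "https://"))


def run_ocr(client: OCRClient, source: str, include_images: bool = False) -> bytes:
    """Return raw OCR JSON for a local file or a URL."""
    if is_url(source):
        # URL branch: the document is passed to the OCR endpoint directly.
        doc_source = source
    else:
        # Local branch: upload first, then exchange the file id for a signed URL.
        file_id = client.upload_file(source)
        doc_source = client.get_file_url(file_id)
    # "document_url" is an assumed doc_type value.
    return client.process_ocr("document_url", doc_source, include_images)
```

Because the function depends only on the protocol, it can be exercised with a stub client that records calls, without touching the network.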
Convert JSON to Markdown Workflow
The following sequence diagram illustrates the workflow for converting JSON to Markdown:
sequenceDiagram
participant User
participant CLI as CLI Interface
participant Convert as Convert Command
participant Parser as JSON Parser
participant Generator as Markdown Generator
User->>CLI: mistral-ocr convert results.json
CLI->>Convert: run(args)
Convert->>Parser: load and parse JSON
Parser-->>Convert: OCR data structure
Convert->>Generator: convert_json_to_markdown(json_file, args)
alt Single File Mode
Generator->>Generator: Process all pages into one file
loop For each page
Generator->>Generator: replace_image_references(content, images, include_images, extract_images, image_dir)
end
Generator->>Generator: Write combined markdown file
else Multi-File Mode
loop For each page
Generator->>Generator: replace_image_references(content, images, include_images, extract_images, image_dir)
Generator->>Generator: Write individual markdown file
end
end
Generator-->>Convert: Success message
Convert-->>CLI: Success message
CLI-->>User: Display confirmation
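The difference between the two output modes can be shown with a small pure function that maps output filenames to Markdown content. The function name `render_outputs` and the filename patterns are hypothetical; pages are assumed to be dicts with `index` and `markdown` keys.

```python
from typing import Dict, List


def render_outputs(pages: List[dict], base_name: str, single_file: bool) -> Dict[str, str]:
    """Map output filenames to Markdown content for either output mode."""
    ordered = sorted(pages, key=lambda p: p["index"])
    if single_file:
        # Single-file mode: join all pages, separated by a blank line.
        combined = "\n\n".join(p["markdown"] for p in ordered)
        return {f"{base_name}.md": combined}
    # Multi-file mode: one Markdown file per page.
    return {f"{base_name}_page_{p['index']}.md": p["markdown"] for p in ordered}
```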
Combined Markdown Workflow
The following sequence diagram illustrates the combined workflow for processing a document and converting to Markdown in one step:
sequenceDiagram
participant User
participant CLI as CLI Interface
participant Markdown as Markdown Command
participant Process as Process Command
participant Convert as Convert Command
participant Client as Mistral Client
participant API as Mistral AI API
User->>CLI: mistral-ocr markdown document.pdf
CLI->>Markdown: run(args)
Markdown->>Markdown: Create temp JSON file if needed
alt Local File
Markdown->>Process: process_local_file(file_path, json_output_path, include_image_base64)
Process->>Client: upload_file(file_path)
Client->>API: POST /files
API-->>Client: file_id
Client->>Client: get_file_url(file_id)
Client->>API: GET /files/{file_id}/url
API-->>Client: signed_url
Client->>API: POST /ocr
API-->>Client: OCR results (JSON)
Client-->>Process: OCR results
Process->>Process: Write JSON to file
else URL
Markdown->>Process: process_url(url, json_output_path, include_image_base64)
Process->>Client: process_url(url, output_file, include_images)
Client->>API: POST /ocr
API-->>Client: OCR results (JSON)
Client-->>Process: OCR results
Process->>Process: Write JSON to file
end
Markdown->>Convert: convert_json_to_markdown(json_output_path, args)
Convert->>Convert: Parse JSON and generate markdown
Convert-->>Markdown: Success message
Markdown->>Markdown: Clean up temp file if created
Markdown-->>CLI: Success message
CLI-->>User: Display confirmation
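The temporary-file handling in this workflow can be sketched as follows. The `run_markdown` function and its callable parameters are hypothetical stand-ins for the process and convert steps; the point is that the intermediate JSON is removed only when the command created it, even if a step fails.

```python
import json
import os
import tempfile
from typing import Callable, Optional


def run_markdown(
    process: Callable[[str], None],
    convert: Callable[[str], str],
    json_output_path: Optional[str] = None,
) -> str:
    """Run process then convert, using a temp JSON file if no path is given."""
    created_temp = json_output_path is None
    if created_temp:
        fd, json_output_path = tempfile.mkstemp(suffix=".json")
        os.close(fd)  # The process step reopens the file by path.
    try:
        process(json_output_path)
        return convert(json_output_path)
    finally:
        # Keep user-requested JSON; delete only what this function created.
        if created_temp and os.path.exists(json_output_path):
            os.remove(json_output_path)
```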
Image Extraction Workflow
The following sequence diagram illustrates the workflow for extracting images to separate files:
sequenceDiagram
participant User
participant CLI as CLI Interface
participant Convert as Convert Command
participant Generator as Markdown Generator
participant ImageHandler as Image Handler
participant FileSystem as File System
User->>CLI: mistral-ocr convert results.json --images --extract-images
CLI->>Convert: run(args)
Convert->>Generator: convert_json_to_markdown(json_file, args)
Generator->>Generator: Determine image directory
Generator->>FileSystem: Create image directory if needed
loop For each page with images
Generator->>ImageHandler: replace_image_references(content, images, include_images, extract_images, image_dir)
loop For each image
ImageHandler->>ImageHandler: extract_image_to_file(image_base64, image_id, image_dir)
ImageHandler->>FileSystem: Write image file
FileSystem-->>ImageHandler: File path
ImageHandler->>ImageHandler: Update markdown to reference file
end
ImageHandler-->>Generator: Updated markdown content
end
Generator->>FileSystem: Write markdown file(s)
Generator-->>Convert: Success message
Convert-->>CLI: Success message
CLI-->>User: Display confirmation
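The inner loop of this workflow can be sketched in a few lines. `extract_image_to_file` follows the signature shown in the diagram, but its body is an assumption, as are the simplified `replace_with_file_links` helper, the `![alt](image_id)` reference format, and the use of the image id as the output filename.

```python
import base64
import os
import re
from typing import List


def extract_image_to_file(image_base64: str, image_id: str, image_dir: str) -> str:
    """Decode one image and write it under image_dir; return the file path."""
    # Strip an optional "data:<mime>;base64," prefix before decoding.
    if image_base64.startswith("data:"):
        image_base64 = image_base64.split(",", 1)[1]
    os.makedirs(image_dir, exist_ok=True)
    path = os.path.join(image_dir, image_id)
    with open(path, "wb") as fh:
        fh.write(base64.b64decode(image_base64))
    return path


def replace_with_file_links(content: str, images: List[dict], image_dir: str) -> str:
    """Point each Markdown image reference at its extracted file."""
    for image in images:
        path = extract_image_to_file(image["image_base64"], image["id"], image_dir)
        pattern = re.compile(r"!\[([^\]]*)\]\(" + re.escape(image["id"]) + r"\)")
        # A function replacement avoids backslash escapes in Windows paths.
        content = pattern.sub(lambda m, p=path: f"![{m.group(1)}]({p})", content)
    return content
```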
Activity Diagram
The following activity diagram illustrates the overall process flow of the Mistral OCR system:
graph TD
Start([Start]) --> ParseArgs[Parse Command Line Arguments]
ParseArgs --> CommandCheck{Which Command?}
CommandCheck -->|process| ProcessCommand[Process Command]
CommandCheck -->|convert| ConvertCommand[Convert Command]
CommandCheck -->|markdown| MarkdownCommand[Markdown Command]
CommandCheck -->|version| VersionCommand[Version Command]
CommandCheck -->|none| ShowHelp[Show Help]
ProcessCommand --> InputCheck{Input Type?}
InputCheck -->|Local File| UploadFile[Upload File to API]
InputCheck -->|URL| ProcessURL[Process URL Directly]
UploadFile --> GetSignedURL[Get Signed URL]
GetSignedURL --> ProcessOCR[Process with OCR API]
ProcessURL --> ProcessOCR
ProcessOCR --> SaveJSON[Save JSON Results]
SaveJSON --> ProcessEnd([Process End])
ConvertCommand --> LoadJSON[Load JSON File]
LoadJSON --> ParseJSON[Parse JSON Structure]
ParseJSON --> OutputModeCheck{Output Mode?}
OutputModeCheck -->|Single File| CombinePages[Combine All Pages]
OutputModeCheck -->|Multiple Files| ProcessPages[Process Each Page Separately]
CombinePages --> ImageCheck{Include Images?}
ProcessPages --> ImageCheck
ImageCheck -->|No| GenerateMarkdown[Generate Markdown Without Images]
ImageCheck -->|Yes| ExtractCheck{Extract Images?}
ExtractCheck -->|No| EmbedImages[Embed Images as Base64]
ExtractCheck -->|Yes| ExtractImages[Extract Images to Files]
EmbedImages --> GenerateMarkdown
ExtractImages --> UpdateReferences[Update Image References]
UpdateReferences --> GenerateMarkdown
GenerateMarkdown --> SaveMarkdown["Save Markdown File(s)"]
SaveMarkdown --> ConvertEnd([Convert End])
MarkdownCommand --> CreateTempJSON[Create Temporary JSON File]
CreateTempJSON --> ProcessInMarkdown[Process Document]
ProcessInMarkdown --> ConvertInMarkdown[Convert to Markdown]
ConvertInMarkdown --> CleanupTemp[Cleanup Temporary Files]
CleanupTemp --> MarkdownEnd([Markdown End])
VersionCommand --> ShowVersion[Show Version Information]
ShowVersion --> VersionEnd([Version End])
ShowHelp --> HelpEnd([Help End])
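The two image decisions in the convert branch reduce to a three-way choice. A minimal sketch, assuming the flags are plain booleans; `ImageMode` and `resolve_image_mode` are hypothetical names.

```python
from enum import Enum


class ImageMode(Enum):
    NONE = "none"        # Generate Markdown without images.
    EMBED = "embed"      # Embed images as base64.
    EXTRACT = "extract"  # Write images to files and update references.


def resolve_image_mode(include_images: bool, extract_images: bool) -> ImageMode:
    """Mirror the Include Images? and Extract Images? decision nodes."""
    if not include_images:
        # Per the diagram, extraction is only considered when images are included.
        return ImageMode.NONE
    return ImageMode.EXTRACT if extract_images else ImageMode.EMBED
```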
Use Case Diagram
The following use case diagram illustrates the different ways users can interact with the Mistral OCR CLI:
graph TD
User((User))
subgraph "Mistral OCR CLI"
ProcessDoc[Process Document]
ProcessURL[Process URL]
ConvertJSON[Convert JSON to Markdown]
CombinedProcess[Process and Convert in One Step]
ExtractImages[Extract Images to Files]
EmbedImages[Embed Images in Markdown]
CheckVersion[Check Version]
end
User -->|mistral-ocr process file.pdf| ProcessDoc
User -->|mistral-ocr process https://example.com/doc.pdf| ProcessURL
User -->|mistral-ocr convert results.json| ConvertJSON
User -->|mistral-ocr markdown file.pdf| CombinedProcess
User -->|--extract-images flag| ExtractImages
User -->|--images flag| EmbedImages
User -->|mistral-ocr version| CheckVersion
ProcessDoc -.-> ConvertJSON
ProcessURL -.-> ConvertJSON
ConvertJSON -.-> ExtractImages
ConvertJSON -.-> EmbedImages
CombinedProcess -.-> ProcessDoc
CombinedProcess -.-> ConvertJSON
Data Flow Diagram
The following data flow diagram illustrates how data moves through the Mistral OCR system:
graph TD
Input[Document Input]
API[Mistral AI API]
JSONStorage[JSON Storage]
MarkdownOutput[Markdown Output]
ImageStorage[Image Storage]
Input -->|Local File| FileUpload[File Upload]
Input -->|URL| DirectProcess[Direct Processing]
FileUpload --> API
DirectProcess --> API
API -->|OCR Results| JSONStorage
JSONStorage -->|Parse| MarkdownGen[Markdown Generation]
MarkdownGen -->|Text Content| MarkdownOutput
MarkdownGen -->|Image Data| ImageCheck{Extract Images?}
ImageCheck -->|Yes| ImageExtraction[Image Extraction]
ImageCheck -->|No| Base64Embedding[Base64 Embedding]
ImageExtraction --> ImageStorage
ImageExtraction -->|Image References| MarkdownOutput
Base64Embedding -->|Embedded Images| MarkdownOutput
subgraph "Input Processing"
Input
FileUpload
DirectProcess
end
subgraph "OCR Processing"
API
JSONStorage
end
subgraph "Output Generation"
MarkdownGen
ImageCheck
ImageExtraction
Base64Embedding
MarkdownOutput
ImageStorage
end
Together, these diagrams describe the architecture, workflows, and interactions of the Mistral OCR system. They can be used to understand the system's structure and behavior and to guide future development and maintenance.