Mistral OCR Architecture Documentation

This document describes the architecture of the Mistral OCR CLI. It uses UML-style diagrams to illustrate the system's structure, behavior, and interactions.

System Overview

Mistral OCR is a command-line tool for processing documents with Mistral AI's OCR capabilities. The system allows users to extract text and structured content from PDF documents and images while preserving the original formatting and layout.

The tool is structured around a set of commands:

  • process: Processes a document (local file or URL) with OCR and outputs JSON
  • convert: Converts the JSON output to Markdown
  • markdown: Combines the process and convert steps in one command
  • version: Displays the version information
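The command set above maps naturally onto subcommands. The following is a minimal sketch, assuming the CLI is built on Python's `argparse`; the source does not say which parser is used. The argument names `source` and `json_file` and the helper `resolve_command` are hypothetical. The `--images` and `--extract-images` flags are the ones shown in the image extraction workflow.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four subcommands described above."""
    parser = argparse.ArgumentParser(prog="mistral-ocr")
    sub = parser.add_subparsers(dest="command")

    # process: OCR a local file or URL and output JSON
    process = sub.add_parser("process", help="Process a document with OCR")
    process.add_argument("source")  # hypothetical argument name

    # convert: turn the JSON output into Markdown
    convert = sub.add_parser("convert", help="Convert OCR JSON to Markdown")
    convert.add_argument("json_file")  # hypothetical argument name
    convert.add_argument("--images", action="store_true")
    convert.add_argument("--extract-images", action="store_true")

    # markdown: process and convert in one step
    markdown = sub.add_parser("markdown", help="Process and convert in one step")
    markdown.add_argument("source")  # hypothetical argument name

    # version: print version information
    sub.add_parser("version", help="Show version information")
    return parser


def resolve_command(argv: list[str]) -> str:
    """Return the chosen subcommand, or "help" when none was given."""
    args = build_parser().parse_args(argv)
    return args.command or "help"
```

For example, `resolve_command(["process", "document.pdf"])` returns `"process"`, and an empty argument list falls through to `"help"`.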

Class Diagram

The following class diagram illustrates the main classes in the Mistral OCR system and their relationships:

classDiagram
    class MistralClient {
        +BASE_URL: str
        +MAX_FILE_SIZE: int
        +api_key: str
        +session: requests.Session
        +__init__(api_key: str)
        +upload_file(file_path: str): str
        +get_file_url(file_id: str): str
        +process_ocr(doc_type: str, doc_source: str, include_image_base64: bool): bytes
    }
    
    class OCRResponse {
        +pages: list
        +metadata: OCRResponseMetadata
        +__init__(pages: list, metadata: OCRResponseMetadata)
    }
    
    class OCRResponseMetadata {
        +title: str
        +author: str
        +creation_date: str
        +page_count: int
        +__init__(title: str, author: str, creation_date: str, page_count: int)
    }
    
    class OCRResponsePage {
        +index: int
        +markdown: str
        +image: str
        +images: list
        +dimensions: dict
        +__init__(index: int, markdown: str, image: str, images: list, dimensions: dict)
    }
    
    class OCRResponseImage {
        +id: str
        +image_base64: str
        +__init__(id: str, image_base64: str)
    }
    
    OCRResponse "1" *-- "1" OCRResponseMetadata: contains
    OCRResponse "1" *-- "many" OCRResponsePage: contains
    OCRResponsePage "1" *-- "many" OCRResponseImage: contains
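The response classes in the diagram could be expressed as plain dataclasses. This is a sketch under assumptions, not the project's code: the field names follow the diagram, while the default values, the JSON key names, and the `response_from_dict` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OCRResponseImage:
    id: str
    image_base64: str


@dataclass
class OCRResponsePage:
    index: int
    markdown: str
    image: str = ""
    images: list[OCRResponseImage] = field(default_factory=list)
    dimensions: dict[str, Any] = field(default_factory=dict)


@dataclass
class OCRResponseMetadata:
    title: str = ""
    author: str = ""
    creation_date: str = ""
    page_count: int = 0


@dataclass
class OCRResponse:
    pages: list[OCRResponsePage]
    metadata: OCRResponseMetadata


def response_from_dict(data: dict[str, Any]) -> OCRResponse:
    """Build the object graph from parsed JSON (key names are assumed)."""
    pages = [
        OCRResponsePage(
            index=p["index"],
            markdown=p.get("markdown", ""),
            image=p.get("image", ""),
            images=[
                OCRResponseImage(i["id"], i.get("image_base64", ""))
                for i in p.get("images", [])
            ],
            dimensions=p.get("dimensions", {}),
        )
        for p in data.get("pages", [])
    ]
    meta = data.get("metadata", {})
    metadata = OCRResponseMetadata(
        title=meta.get("title", ""),
        author=meta.get("author", ""),
        creation_date=meta.get("creation_date", ""),
        # Fall back to the number of parsed pages when no count is given.
        page_count=meta.get("page_count", len(pages)),
    )
    return OCRResponse(pages=pages, metadata=metadata)
```

The composition arrows in the diagram correspond to the nested fields: one `OCRResponse` owns one metadata object and many pages, and each page owns many images.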

Component Diagram

The following component diagram shows the high-level architecture of the Mistral OCR system:

graph TD
    CLI[CLI Interface] --> ProcessCmd[Process Command]
    CLI --> ConvertCmd[Convert Command]
    CLI --> MarkdownCmd[Markdown Command]
    CLI --> VersionCmd[Version Command]
    
    ProcessCmd --> APIClient[Mistral API Client]
    ConvertCmd --> JSONParser[JSON Parser]
    MarkdownCmd --> ProcessCmd
    MarkdownCmd --> ConvertCmd
    
    APIClient --> ExternalAPI[Mistral AI API]
    
    JSONParser --> MarkdownGenerator[Markdown Generator]
    MarkdownGenerator --> ImageHandler[Image Handler]
    
    subgraph "External Services"
        ExternalAPI
    end
    
    subgraph "Core Components"
        APIClient
        JSONParser
        MarkdownGenerator
        ImageHandler
    end
    
    subgraph "Command Interface"
        ProcessCmd
        ConvertCmd
        MarkdownCmd
        VersionCmd
    end

Sequence Diagrams

Process Document Workflow

The following sequence diagram illustrates the workflow for processing a document with OCR:

sequenceDiagram
    participant User
    participant CLI as CLI Interface
    participant Process as Process Command
    participant Client as Mistral Client
    participant API as Mistral AI API
    
    User->>CLI: mistral-ocr process document.pdf
    CLI->>Process: run(args)
    
    alt Local File
        Process->>Client: process_local_file(file_path, output_file, include_images)
        Client->>Client: upload_file(file_path)
        Client->>API: POST /files
        API-->>Client: file_id
        Client->>Client: get_file_url(file_id)
        Client->>API: GET /files/{file_id}/url
        API-->>Client: signed_url
        Client->>API: POST /ocr
        API-->>Client: OCR results (JSON)
    else URL
        Process->>Client: process_url(url, output_file, include_images)
        Client->>API: POST /ocr
        API-->>Client: OCR results (JSON)
    end
    
    Client-->>Process: OCR results (JSON)
    Process->>Process: handle_output(data, output_file)
    Process-->>CLI: Success message
    CLI-->>User: Display results or confirmation
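The two branches of this workflow can be sketched as a single function. The three client methods are the ones listed for `MistralClient` in the class diagram. Everything else is an assumption made for illustration: the `run_ocr` and `is_url` helpers, the `"document_url"` value for `doc_type`, and the `FakeClient` stand-in, which makes no network calls.

```python
from typing import Protocol


class OCRClient(Protocol):
    """The client methods named in the class diagram."""

    def upload_file(self, file_path: str) -> str: ...
    def get_file_url(self, file_id: str) -> str: ...
    def process_ocr(self, doc_type: str, doc_source: str,
                    include_image_base64: bool) -> bytes: ...


def is_url(source: str) -> bool:
    """Treat anything starting with http:// or https:// as a URL."""
    return source.startswith(("http://", "https://"))


def run_ocr(client: OCRClient, source: str, include_images: bool = False) -> bytes:
    """Local files are uploaded and exchanged for a signed URL first;
    URLs go straight to the OCR call."""
    if is_url(source):
        doc_source = source
    else:
        file_id = client.upload_file(source)       # POST /files
        doc_source = client.get_file_url(file_id)  # GET /files/{file_id}/url
    # "document_url" is an assumed placeholder for doc_type.
    return client.process_ocr("document_url", doc_source, include_images)  # POST /ocr


class FakeClient:
    """Hypothetical in-memory client that records which methods were called."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def upload_file(self, file_path: str) -> str:
        self.calls.append("upload_file")
        return "file-123"

    def get_file_url(self, file_id: str) -> str:
        self.calls.append("get_file_url")
        return f"https://signed.example/{file_id}"

    def process_ocr(self, doc_type: str, doc_source: str,
                    include_image_base64: bool) -> bytes:
        self.calls.append("process_ocr")
        return b'{"pages": []}'
```

With the fake client, a local path triggers all three calls in order, while a URL triggers only `process_ocr`, matching the `alt` block in the diagram.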

Convert JSON to Markdown Workflow

The following sequence diagram illustrates the workflow for converting JSON to Markdown:

sequenceDiagram
    participant User
    participant CLI as CLI Interface
    participant Convert as Convert Command
    participant Parser as JSON Parser
    participant Generator as Markdown Generator
    
    User->>CLI: mistral-ocr convert results.json
    CLI->>Convert: run(args)
    Convert->>Parser: load and parse JSON
    Parser-->>Convert: OCR data structure
    
    Convert->>Generator: convert_json_to_markdown(json_file, args)
    
    alt Single File Mode
        Generator->>Generator: Process all pages into one file
        loop For each page
            Generator->>Generator: replace_image_references(content, images, include_images, extract_images, image_dir)
        end
        Generator->>Generator: Write combined markdown file
    else Multi-File Mode
        loop For each page
            Generator->>Generator: replace_image_references(content, images, include_images, extract_images, image_dir)
            Generator->>Generator: Write individual markdown file
        end
    end
    
    Generator-->>Convert: Success message
    Convert-->>CLI: Success message
    CLI-->>User: Display confirmation
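The difference between the single-file and multi-file modes comes down to where each page's Markdown is written. The sketch below is illustrative only: `write_markdown`, the blank-line page separator, and the `_page_N` naming scheme are assumptions, and the input is taken to be page content whose image references have already been rewritten.

```python
from pathlib import Path


def write_markdown(pages: list[str], output: Path, single_file: bool = True) -> list[Path]:
    """Write page content to one combined file or to one file per page."""
    if single_file:
        # Single-file mode: join all pages, separated by a blank line.
        output.write_text("\n\n".join(pages), encoding="utf-8")
        return [output]

    # Multi-file mode: one file per page, numbered from 1 (assumed scheme).
    written = []
    for number, content in enumerate(pages, start=1):
        page_path = output.with_name(f"{output.stem}_page_{number}{output.suffix}")
        page_path.write_text(content, encoding="utf-8")
        written.append(page_path)
    return written
```

Called with `Path("doc.md")` and two pages, the multi-file mode would produce `doc_page_1.md` and `doc_page_2.md` next to the requested output path.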

Combined Markdown Workflow

The following sequence diagram illustrates the combined workflow for processing a document and converting to Markdown in one step:

sequenceDiagram
    participant User
    participant CLI as CLI Interface
    participant Markdown as Markdown Command
    participant Process as Process Command
    participant Convert as Convert Command
    participant Client as Mistral Client
    participant API as Mistral AI API
    
    User->>CLI: mistral-ocr markdown document.pdf
    CLI->>Markdown: run(args)
    
    Markdown->>Markdown: Create temp JSON file if needed
    
    alt Local File
        Markdown->>Process: process_local_file(file_path, json_output_path, include_image_base64)
        Process->>Client: upload_file(file_path)
        Client->>API: POST /files
        API-->>Client: file_id
        Client->>Client: get_file_url(file_id)
        Client->>API: GET /files/{file_id}/url
        API-->>Client: signed_url
        Client->>API: POST /ocr
        API-->>Client: OCR results (JSON)
        Client-->>Process: OCR results
        Process->>Process: Write JSON to file
    else URL
        Markdown->>Process: process_url(url, json_output_path, include_image_base64)
        Process->>Client: process_url(url, output_file, include_images)
        Client->>API: POST /ocr
        API-->>Client: OCR results (JSON)
        Client-->>Process: OCR results
        Process->>Process: Write JSON to file
    end
    
    Markdown->>Convert: convert_json_to_markdown(json_output_path, args)
    Convert->>Convert: Parse JSON and generate markdown
    Convert-->>Markdown: Success message
    
    Markdown->>Markdown: Clean up temp file if created
    Markdown-->>CLI: Success message
    CLI-->>User: Display confirmation
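The "create temp JSON file if needed" and "clean up temp file if created" steps can be sketched with the standard `tempfile` module. The `markdown_pipeline` function and its callable parameters are hypothetical stand-ins for the process and convert steps. The sketch assumes the temporary file is removed even when a step fails, which the diagram does not specify.

```python
import os
import tempfile
from typing import Callable, Optional


def markdown_pipeline(
    process: Callable[[str], None],
    convert: Callable[[str], None],
    json_output_path: Optional[str] = None,
) -> None:
    """Run process then convert, using a temp JSON file when no path is given."""
    created_temp = json_output_path is None
    if created_temp:
        # mkstemp returns an open descriptor; close it so the steps can reopen the file.
        fd, json_output_path = tempfile.mkstemp(suffix=".json")
        os.close(fd)
    try:
        process(json_output_path)  # writes OCR JSON to the path
        convert(json_output_path)  # reads the JSON and writes Markdown
    finally:
        # Only delete the file if this function created it.
        if created_temp and os.path.exists(json_output_path):
            os.remove(json_output_path)
```

When the caller supplies `json_output_path`, the JSON is kept so that it can be reused by a later `convert` run.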

Image Extraction Workflow

The following sequence diagram illustrates the workflow for extracting images to separate files:

sequenceDiagram
    participant User
    participant CLI as CLI Interface
    participant Convert as Convert Command
    participant Generator as Markdown Generator
    participant ImageHandler as Image Handler
    participant FileSystem as File System
    
    User->>CLI: mistral-ocr convert results.json --images --extract-images
    CLI->>Convert: run(args)
    Convert->>Generator: convert_json_to_markdown(json_file, args)
    
    Generator->>Generator: Determine image directory
    Generator->>FileSystem: Create image directory if needed
    
    loop For each page with images
        Generator->>ImageHandler: replace_image_references(content, images, include_images, extract_images, image_dir)
        
        loop For each image
            ImageHandler->>ImageHandler: extract_image_to_file(image_base64, image_id, image_dir)
            ImageHandler->>FileSystem: Write image file
            FileSystem-->>ImageHandler: File path
            ImageHandler->>ImageHandler: Update markdown to reference file
        end
        
        ImageHandler-->>Generator: Updated markdown content
    end
    
    Generator->>FileSystem: Write markdown file(s)
    Generator-->>Convert: Success message
    Convert-->>CLI: Success message
    CLI-->>User: Display confirmation
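The inner loop of this workflow can be sketched as follows. The function names and parameter lists come from the diagram, but the bodies are assumptions. The sketch assumes that pages reference images as `![alt](image_id)`, that each image is a dict with `id` and `image_base64` keys, that the payload may carry a `data:` URI prefix, and that `image/jpeg` is an acceptable default type when embedding raw base64.

```python
import base64
import re
from pathlib import Path


def extract_image_to_file(image_base64: str, image_id: str, image_dir: str) -> str:
    """Decode one image and write it under image_dir; return the file path."""
    # Drop an optional "data:<mime>;base64," prefix before decoding.
    if image_base64.startswith("data:"):
        image_base64 = image_base64.split(",", 1)[1]
    directory = Path(image_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / Path(image_id).name  # keep only the file name part
    target.write_bytes(base64.b64decode(image_base64))
    return target.as_posix()


def replace_image_references(content, images, include_images, extract_images, image_dir):
    """Rewrite ![alt](image_id) references according to the image flags."""
    for image in images:
        pattern = re.compile(r"!\[([^\]]*)\]\(" + re.escape(image["id"]) + r"\)")
        if not include_images:
            # No images requested: drop the reference entirely.
            content = pattern.sub("", content)
            continue
        if extract_images:
            target = extract_image_to_file(image["image_base64"], image["id"], image_dir)
        else:
            data = image["image_base64"]
            # Assumed default MIME type when the payload is raw base64.
            target = data if data.startswith("data:") else "data:image/jpeg;base64," + data
        # A callable replacement avoids backslash-escape handling in the target.
        content = pattern.sub(lambda m, t=target: f"![{m.group(1)}]({t})", content)
    return content
```

In this sketch the rewritten reference uses the path exactly as built from `image_dir`; a real implementation would likely make it relative to the Markdown output file.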

Activity Diagram

The following activity diagram illustrates the overall process flow of the Mistral OCR system:

graph TD
    Start([Start]) --> ParseArgs[Parse Command Line Arguments]
    ParseArgs --> CommandCheck{Which Command?}
    
    CommandCheck -->|process| ProcessCommand[Process Command]
    CommandCheck -->|convert| ConvertCommand[Convert Command]
    CommandCheck -->|markdown| MarkdownCommand[Markdown Command]
    CommandCheck -->|version| VersionCommand[Version Command]
    CommandCheck -->|none| ShowHelp[Show Help]
    
    ProcessCommand --> InputCheck{Input Type?}
    InputCheck -->|Local File| UploadFile[Upload File to API]
    InputCheck -->|URL| ProcessURL[Process URL Directly]
    
    UploadFile --> GetSignedURL[Get Signed URL]
    GetSignedURL --> ProcessOCR[Process with OCR API]
    ProcessURL --> ProcessOCR
    
    ProcessOCR --> SaveJSON[Save JSON Results]
    SaveJSON --> ProcessEnd([Process End])
    
    ConvertCommand --> LoadJSON[Load JSON File]
    LoadJSON --> ParseJSON[Parse JSON Structure]
    ParseJSON --> OutputModeCheck{Output Mode?}
    
    OutputModeCheck -->|Single File| CombinePages[Combine All Pages]
    OutputModeCheck -->|Multiple Files| ProcessPages[Process Each Page Separately]
    
    CombinePages --> ImageCheck{Include Images?}
    ProcessPages --> ImageCheck
    
    ImageCheck -->|No| GenerateMarkdown[Generate Markdown Without Images]
    ImageCheck -->|Yes| ExtractCheck{Extract Images?}
    
    ExtractCheck -->|No| EmbedImages[Embed Images as Base64]
    ExtractCheck -->|Yes| ExtractImages[Extract Images to Files]
    
    EmbedImages --> GenerateMarkdown
    ExtractImages --> UpdateReferences[Update Image References]
    UpdateReferences --> GenerateMarkdown
    
    GenerateMarkdown --> SaveMarkdownFiles["Save Markdown File(s)"]
    SaveMarkdownFiles --> ConvertEnd([Convert End])
    
    MarkdownCommand --> CreateTempJSON[Create Temporary JSON File]
    CreateTempJSON --> ProcessInMarkdown[Process Document]
    ProcessInMarkdown --> ConvertInMarkdown[Convert to Markdown]
    ConvertInMarkdown --> CleanupTemp[Cleanup Temporary Files]
    CleanupTemp --> MarkdownEnd([Markdown End])
    
    VersionCommand --> ShowVersion[Show Version Information]
    ShowVersion --> VersionEnd([Version End])
    
    ShowHelp --> HelpEnd([Help End])

Use Case Diagram

The following use case diagram illustrates the different ways users can interact with the Mistral OCR CLI:

graph TD
    User((User))
    
    subgraph "Mistral OCR CLI"
        ProcessDoc[Process Document]
        ProcessURL[Process URL]
        ConvertJSON[Convert JSON to Markdown]
        CombinedProcess[Process and Convert in One Step]
        ExtractImages[Extract Images to Files]
        EmbedImages[Embed Images in Markdown]
        CheckVersion[Check Version]
    end
    
    User -->|mistral-ocr process file.pdf| ProcessDoc
    User -->|mistral-ocr process https://example.com/doc.pdf| ProcessURL
    User -->|mistral-ocr convert results.json| ConvertJSON
    User -->|mistral-ocr markdown file.pdf| CombinedProcess
    User -->|--extract-images flag| ExtractImages
    User -->|--images flag| EmbedImages
    User -->|mistral-ocr version| CheckVersion
    
    ProcessDoc -.-> ConvertJSON
    ProcessURL -.-> ConvertJSON
    ConvertJSON -.-> ExtractImages
    ConvertJSON -.-> EmbedImages
    CombinedProcess -.-> ProcessDoc
    CombinedProcess -.-> ConvertJSON

Data Flow Diagram

The following data flow diagram illustrates how data moves through the Mistral OCR system:

graph TD
    Input[Document Input]
    API[Mistral AI API]
    JSONStorage[JSON Storage]
    MarkdownOutput[Markdown Output]
    ImageStorage[Image Storage]
    
    Input -->|Local File| FileUpload[File Upload]
    Input -->|URL| DirectProcess[Direct Processing]
    
    FileUpload --> API
    DirectProcess --> API
    
    API -->|OCR Results| JSONStorage
    
    JSONStorage -->|Parse| MarkdownGen[Markdown Generation]
    
    MarkdownGen -->|Text Content| MarkdownOutput
    MarkdownGen -->|Image Data| ImageCheck{Extract Images?}
    
    ImageCheck -->|Yes| ImageExtraction[Image Extraction]
    ImageCheck -->|No| Base64Embedding[Base64 Embedding]
    
    ImageExtraction --> ImageStorage
    ImageExtraction -->|Image References| MarkdownOutput
    
    Base64Embedding -->|Embedded Images| MarkdownOutput
    
    subgraph "Input Processing"
        Input
        FileUpload
        DirectProcess
    end
    
    subgraph "OCR Processing"
        API
        JSONStorage
    end
    
    subgraph "Output Generation"
        MarkdownGen
        ImageCheck
        ImageExtraction
        Base64Embedding
        MarkdownOutput
        ImageStorage
    end

Together, these diagrams describe the architecture, workflows, and interactions of the Mistral OCR system. Use them to understand the system's structure and behavior and to guide future development and maintenance.