Document Recognition with Hybrid OCR + Vision AI

Recipe for extracting structured data from invoices and documents using a hybrid approach: PaddleOCR for text extraction + MiniCPM-V 4.5 for intelligent parsing.

Architecture

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PDF/Image   │ ───> │  PaddleOCR   │ ───> │   Raw Text   │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
                      ┌──────────────┐             │
                      │  MiniCPM-V   │ <───────────┘
                      │   4.5 VLM    │ <─── Image
                      └──────┬───────┘
                             │
                      ┌──────▼───────┐
                      │ Structured   │
                      │    JSON      │
                      └──────────────┘

Why Hybrid?

| Approach | Accuracy | Speed | Best For |
|----------|----------|-------|----------|
| VLM Only | 85-90% | Fast | Simple layouts |
| OCR Only | N/A | Fast | Just text extraction |
| Hybrid | 91%+ | Medium | Complex invoices |

The hybrid approach provides OCR text as context to the VLM, improving accuracy on:

  • Small text and numbers
  • Low contrast documents
  • Dense tables

Services

| Service | Port | Purpose |
|---------|------|---------|
| PaddleOCR | 5000 | Text extraction |
| Ollama (MiniCPM-V) | 11434 | Intelligent parsing |

Running the Containers

Start both services:

# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
  code.foss.global/host.today/ht-docker-ai:paddleocr-cpu

# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/host.today/ht-docker-ai:minicpm45v

Image Conversion

Convert each PDF page to PNG at 200 DPI with ImageMagick:

convert -density 200 -quality 90 input.pdf \
  -background white -alpha remove \
  page-%d.png
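
When the conversion is driven from the same TypeScript code as the extraction steps, the command above can be assembled as an argument list. This is a minimal sketch rather than part of the original recipe: `buildConvertArgs` and its parameters are hypothetical names, and it assumes ImageMagick's `convert` binary is installed.

```typescript
// Hypothetical helper: builds the argument list for the ImageMagick
// `convert` call shown above. The density is a parameter so that a retry
// at a higher DPI only changes one value.
function buildConvertArgs(
  inputPdf: string,
  outputPrefix: string,
  dpi: number = 200,
): string[] {
  return [
    '-density', String(dpi),
    '-quality', '90',
    inputPdf,
    '-background', 'white',
    '-alpha', 'remove',
    `${outputPrefix}-%d.png`,
  ];
}

// Same command as above:
// convert -density 200 -quality 90 input.pdf -background white -alpha remove page-%d.png
const defaultCommand = ['convert', ...buildConvertArgs('input.pdf', 'page')].join(' ');
```

The array can be handed to Node's `child_process.execFile('convert', args)`, which avoids shell quoting problems with unusual file names.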

Step 1: Extract OCR Text

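async function extractOcrText(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:5000/ocr', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: imageBase64 }),
  });
  if (!response.ok) {
    return '';  // OCR unavailable: the caller falls back to the VLM-only prompt
  }

  // Join the recognized text lines in the order the service returns them
  const data = await response.json();
  if (data.success && data.results) {
    return data.results.map((r: { text: string }) => r.text).join('\n');
  }
  return '';  // No text recognized, or the service reported failure
}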

Step 2: Build Enhanced Prompt

function buildPrompt(ocrText: string): string {
  const base = `You are an invoice parser. Extract the following fields:

1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due

Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;

  if (ocrText) {
    return `${base}

OCR text extracted from the invoice:
---
${ocrText}
---

Cross-reference the image with the OCR text above for accuracy.`;
  }
  return base;
}

Step 3: Call Vision-Language Model

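async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
  const payload = {
    model: 'openbmb/minicpm-v4.5:q8_0',
    prompt: buildPrompt(ocrText),
    images,  // Base64-encoded page images
    stream: false,  // Return one complete response instead of chunks
    options: {
      num_predict: 2048,  // Upper bound on generated tokens
      temperature: 0.1,   // Low temperature for consistent output
    },
  };

  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: HTTP ${response.status}`);
  }

  // The generated text is in `response`; the prompt asks for JSON only
  const result = await response.json();
  return JSON.parse(result.response);
}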
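
`JSON.parse(result.response)` throws if the model wraps its answer in a code fence or adds a sentence around the JSON, which vision-language models sometimes do despite the "Return ONLY valid JSON" instruction. A defensive sketch follows. `extractJsonObject` is a hypothetical helper, not part of the original recipe, and it assumes the reply contains exactly one JSON object.

```typescript
// Hypothetical helper: pull the JSON object out of a model reply that may
// contain extra text before or after it. Assumes a single object per reply.
function extractJsonObject(raw: string): unknown {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model response');
  }
  // JSON.parse still throws if the extracted span is not valid JSON
  return JSON.parse(raw.slice(start, end + 1));
}
```

In `extractInvoice`, the last line would then read `return extractJsonObject(result.response) as Invoice;`. Ollama's generate endpoint also accepts a `format: 'json'` field in the request body, which may make this workaround unnecessary.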

Consensus Voting

For production reliability, run multiple extraction passes and require consensus:

// addResult() tallies a result under its hashInvoice() key;
// getMostCommon() returns the invoice with the highest count.
async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
  const results: Map<string, { invoice: Invoice; count: number }> = new Map();

  // Return the first invoice that at least two passes agree on, if any
  const findConsensus = (): Invoice | undefined => {
    for (const data of results.values()) {
      if (data.count >= 2) return data.invoice;
    }
    return undefined;
  };

  // Optimization: run Pass 1 (no OCR) in parallel with OCR extraction.
  // Only the first page is sent to OCR here.
  const [pass1Result, ocrText] = await Promise.all([
    extractInvoice(images, ''),
    extractOcrText(images[0]),
  ]);
  addResult(results, pass1Result);

  // Pass 2 with OCR context
  const pass2Result = await extractInvoice(images, ocrText);
  addResult(results, pass2Result);

  // Continue until consensus (2 matching results) or max passes
  for (let pass = 3; pass <= maxPasses; pass++) {
    const consensus = findConsensus();
    if (consensus) return consensus;
    addResult(results, await extractInvoice(images, ocrText));
  }

  // Return the most common result (the consensus result, if one exists)
  return getMostCommon(results);
}

function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}
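
`extractWithConsensus` calls `addResult` and `getMostCommon` without defining them, and the `Invoice` type is not shown either. One possible shape for all three is sketched below. The bodies are assumptions inferred from how the functions are called, not the original implementation, and the `Tally` alias is a name introduced here.

```typescript
// Assumed shape of a parsed invoice, mirroring the JSON fields in the prompt.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

type Tally = Map<string, { invoice: Invoice; count: number }>;

// Repeated from above so this sketch compiles on its own.
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}

// Record one extraction pass: bump the count if an equal invoice was seen.
function addResult(results: Tally, invoice: Invoice): void {
  const key = hashInvoice(invoice);
  const entry = results.get(key);
  if (entry) {
    entry.count += 1;
  } else {
    results.set(key, { invoice, count: 1 });
  }
}

// Return the invoice with the highest count; ties go to the earliest pass,
// because a Map iterates in insertion order and only a strictly higher
// count replaces the current best.
function getMostCommon(results: Tally): Invoice {
  let best: { invoice: Invoice; count: number } | undefined;
  for (const entry of results.values()) {
    if (!best || entry.count > best.count) best = entry;
  }
  if (!best) throw new Error('No extraction results to choose from');
  return best.invoice;
}
```

Because the hash covers only the invoice number, date, and total, two passes that disagree on `vendor_name` or `vat_amount` still count as the same vote.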

Output Format

{
  "invoice_number": "INV-2024-001234",
  "invoice_date": "2024-08-15",
  "vendor_name": "Hetzner Online GmbH",
  "currency": "EUR",
  "net_amount": 167.52,
  "vat_amount": 31.83,
  "total_amount": 199.35
}
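
A cheap sanity check on this output can catch some extraction errors before consensus voting: the date should match the requested format, and net plus VAT should equal the total. The sketch below is an illustration under those assumptions. `validateInvoice` and `InvoiceAmounts` are hypothetical names, and the half-cent tolerance is chosen only to absorb floating-point noise.

```typescript
// Only the fields the checks need; the full output object has more.
interface InvoiceAmounts {
  invoice_date: string;
  net_amount?: number | null;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical validator: returns a list of problems, empty if none found.
function validateInvoice(inv: InvoiceAmounts, tolerance: number = 0.005): string[] {
  const problems: string[] = [];
  if (!/^\d{4}-\d{2}-\d{2}$/.test(inv.invoice_date)) {
    problems.push(`invoice_date "${inv.invoice_date}" is not YYYY-MM-DD`);
  }
  // net_amount is optional in the prompt ("if shown"), so skip the sum check without it
  if (typeof inv.net_amount === 'number') {
    const diff = Math.abs(inv.net_amount + inv.vat_amount - inv.total_amount);
    if (diff > tolerance) {
      problems.push(`net + VAT differs from total by ${diff.toFixed(2)}`);
    }
  }
  return problems;
}

// The sample output above passes: 167.52 + 31.83 = 199.35
const sampleProblems = validateInvoice({
  invoice_date: '2024-08-15',
  net_amount: 167.52,
  vat_amount: 31.83,
  total_amount: 199.35,
});
```

A non-empty result does not prove the extraction is wrong (some invoices carry fees or discounts outside net and VAT), so it is better treated as a reason to run another pass than as a hard failure.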

Test Results

Tested on 46 real invoices from various vendors:

| Metric | Value |
|--------|-------|
| Accuracy | 91.3% (42/46) |
| Avg Time | 42.7s per invoice |
| Consensus Rate | 85% in 2 passes |

Per-Vendor Results

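| Vendor | Invoices | Accuracy |
|--------|----------|----------|
| Hetzner | 3 | 100% |
| DigitalOcean | 4 | 100% |
| Adobe | 3 | 100% |
| Cloudflare | 1 | 100% |
| Wasabi | 4 | 100% |
| Figma | 3 | 100% |
| Google Cloud | 1 | 100% |
| MongoDB | 3 | 0% (date parsing) |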

Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| PaddleOCR (CPU) | 4GB RAM | 8GB RAM |
| MiniCPM-V (GPU) | 10GB VRAM | 12GB VRAM |
| MiniCPM-V (CPU) | 16GB RAM | 32GB RAM |

Tips

  1. Use the hybrid approach: OCR text dramatically improves number/date accuracy
  2. Consensus voting: Run 2-5 passes to catch hallucinations
  3. 200 DPI is a good default: Lower loses detail, and higher rarely helps (retry at 300 DPI only if digits go missing)
  4. PNG over JPEG: Preserves text clarity
  5. Temperature 0.1: Low temperature for consistent output
  6. Multi-page support: Pass all pages in a single request for context
  7. Normalize for comparison: Ignore case/whitespace when comparing invoice numbers
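
Tip 7 can be sketched as a small comparison helper. The function names are hypothetical and the normalization shown (strip all whitespace, uppercase) is one reasonable reading of "ignore case/whitespace", not necessarily the rule used to produce the test results.

```typescript
// Hypothetical helper for tip 7: remove all whitespace and uppercase the rest.
function normalizeInvoiceNumber(value: string): string {
  return value.replace(/\s+/g, '').toUpperCase();
}

// Compare two invoice numbers ignoring case and whitespace.
function sameInvoiceNumber(a: string, b: string): boolean {
  return normalizeInvoiceNumber(a) === normalizeInvoiceNumber(b);
}

// 'INV-2024-001234' and ' inv-2024-001234 ' compare as equal
const matches = sameInvoiceNumber('INV-2024-001234', ' inv-2024-001234 ');
```

Applying the same normalization inside `hashInvoice` would let passes that differ only in case or spacing count as the same vote.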

Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Wrong date | Multiple dates on invoice | Be specific in prompt about which date |
| Wrong currency | Symbol vs code mismatch | OCR helps disambiguate |
| Missing digits | Low resolution | Increase density to 300 DPI |
| Hallucinated data | VLM uncertainty | Use consensus voting |