
Juergen Kunz 82358b2d5d feat(invoices): add hybrid OCR + vision invoice/document parsing with PaddleOCR, consensus voting, and prompt/test refactors

2026-01-16 14:24:37 +00:00


Document Recognition with Hybrid OCR + Vision AI

Recipe for extracting structured data from invoices and documents using a hybrid approach: PaddleOCR for text extraction + MiniCPM-V 4.5 for intelligent parsing.

Architecture

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PDF/Image   │ ───> │  PaddleOCR   │ ───> │   Raw Text   │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
                      ┌──────────────┐             │
                      │  MiniCPM-V   │ <───────────┘
                      │   4.5 VLM    │ <─── Image
                      └──────┬───────┘
                             │
                      ┌──────▼───────┐
                      │ Structured   │
                      │    JSON      │
                      └──────────────┘

Why Hybrid?

Approach	Accuracy	Speed	Best For
VLM Only	85-90%	Fast	Simple layouts
OCR Only	N/A	Fast	Just text extraction
Hybrid	91%+	Medium	Complex invoices

The hybrid approach provides OCR text as context to the VLM, improving accuracy on:

Small text and numbers
Low contrast documents
Dense tables

Services

Service	Port	Purpose
PaddleOCR	5000	Text extraction
Ollama (MiniCPM-V)	11434	Intelligent parsing

Running the Containers

Start both services:

# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
  code.foss.global/host.today/ht-docker-ai:paddleocr-cpu

# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/host.today/ht-docker-ai:minicpm45v

Image Conversion

Convert each PDF page to PNG at 200 DPI with ImageMagick (which needs Ghostscript installed to read PDFs):

convert -density 200 -quality 90 input.pdf \
  -background white -alpha remove \
  page-%d.png
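
The services expect base64 strings, so the converted pages have to be read back in page order. The following is a minimal sketch, assuming Node.js and the page-N.png naming produced by the command above; `pageIndex`, `sortPageFiles`, and `loadPagesAsBase64` are hypothetical helper names, not part of this repository.

```typescript
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';

// Extract the page index from names like "page-12.png"; returns -1 for other files.
function pageIndex(name: string): number {
  const match = /^page-(\d+)\.png$/.exec(name);
  return match ? Number(match[1]) : -1;
}

// Keep only page files and sort them numerically, so page-10 follows page-9
// instead of landing between page-1 and page-2 as a plain string sort would do.
function sortPageFiles(names: string[]): string[] {
  return names
    .filter((name) => pageIndex(name) >= 0)
    .sort((a, b) => pageIndex(a) - pageIndex(b));
}

// Hypothetical helper: read every page image in `dir` as a base64 string.
async function loadPagesAsBase64(dir: string): Promise<string[]> {
  const files = sortPageFiles(await readdir(dir));
  const pages: string[] = [];
  for (const file of files) {
    const bytes = await readFile(join(dir, file));
    pages.push(bytes.toString('base64'));
  }
  return pages;
}
```

The resulting array can be passed as `images` to the extraction functions in the steps below.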

Step 1: Extract OCR Text

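async function extractOcrText(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:5000/ocr', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: imageBase64 }),
  });
  // Surface HTTP errors instead of trying to parse an error page as JSON
  if (!response.ok) {
    throw new Error(`OCR request failed: ${response.status} ${response.statusText}`);
  }

  const data = await response.json();
  if (data.success && data.results) {
    // Join the recognized text fragments, one per line
    return data.results.map((r: { text: string }) => r.text).join('\n');
  }
  return '';  // No usable OCR text; buildPrompt then omits the OCR section
}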

Step 2: Build Enhanced Prompt

function buildPrompt(ocrText: string): string {
  const base = `You are an invoice parser. Extract the following fields:

1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due

Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;

  if (ocrText) {
    return `${base}

OCR text extracted from the invoice:
---
${ocrText}
---

Cross-reference the image with the OCR text above for accuracy.`;
  }
  return base;
}

Step 3: Call Vision-Language Model

async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
  const payload = {
    model: 'openbmb/minicpm-v4.5:q8_0',
    prompt: buildPrompt(ocrText),
    images,  // Base64 encoded
    stream: false,
    options: {
      num_predict: 2048,
      temperature: 0.1,
    },
  };

  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });

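  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status} ${response.statusText}`);
  }

  const result = await response.json();
  // result.response holds the model's text; this assumes it is bare JSON
  return JSON.parse(result.response);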
}
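
The snippet above uses an `Invoice` type without defining it, and `JSON.parse` throws if the model wraps its answer in prose or a Markdown code fence. The following is a minimal sketch of both pieces, assuming the field names from the prompt in Step 2. The field types and the `parseInvoiceResponse` helper are assumptions for illustration, not part of the original recipe.

```typescript
// Shape of the JSON the prompt asks for (field names taken from the prompt;
// the types are an assumption).
interface Invoice {
  invoice_number: string;
  invoice_date: string; // YYYY-MM-DD
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical helper: take the text between the first "{" and the last "}"
// of the model reply, parse it, and validate every expected field.
function parseInvoiceResponse(raw: string): Invoice {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model response');
  }
  const data = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;

  const text = (key: string): string => {
    const value = data[key];
    if (typeof value !== 'string' || value.trim() === '') {
      throw new Error(`Missing or non-string field: ${key}`);
    }
    return value.trim();
  };

  // Amounts sometimes come back as strings ("119.00"); coerce those to numbers.
  const amount = (key: string): number => {
    const value = data[key];
    const parsed = typeof value === 'string' && value.trim() !== '' ? Number(value) : value;
    if (typeof parsed !== 'number' || !Number.isFinite(parsed)) {
      throw new Error(`Missing or non-numeric field: ${key}`);
    }
    return parsed;
  };

  return {
    invoice_number: text('invoice_number'),
    invoice_date: text('invoice_date'),
    vendor_name: text('vendor_name'),
    currency: text('currency'),
    net_amount: amount('net_amount'),
    vat_amount: amount('vat_amount'),
    total_amount: amount('total_amount'),
  };
}
```

Under these assumptions, `extractInvoice` could return `parseInvoiceResponse(result.response)` in place of the bare `JSON.parse` call. A thrown error then marks a pass as failed rather than letting a malformed result into the consensus vote.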

Consensus Voting

For production reliability, run multiple extraction passes and require consensus:

async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
  const results: Map<string, { invoice: Invoice; count: number }> = new Map();

  // Optimization: run Pass 1 (image only, no OCR context) in parallel with the OCR request.
  // Note: only the first page is sent to OCR here.
  const [pass1Result, ocrText] = await Promise.all([
    extractInvoice(images, ''),
    extractOcrText(images[0]),
  ]);

  // Add Pass 1 result
  addResult(results, pass1Result);

  // Pass 2 with OCR context
  const pass2Result = await extractInvoice(images, ocrText);
  addResult(results, pass2Result);

  // Check for consensus (2 matching results)
  for (const data of results.values()) {
    if (data.count >= 2) {
      return data.invoice;  // Consensus reached
    }
  }

  // Continue until consensus or max passes
  for (let pass = 3; pass <= maxPasses; pass++) {
    const result = await extractInvoice(images, ocrText);
    addResult(results, result);
    // Re-check after every pass so extraction stops as soon as two results agree
    for (const data of results.values()) {
      if (data.count >= 2) {
        return data.invoice;
      }
    }
  }

  // Return most common result
  return getMostCommon(results);
}

function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}
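
The consensus function calls `addResult` and `getMostCommon` without showing them. One possible sketch of these helpers follows, assuming they tally results by the `hashInvoice` key. `HashableInvoice` and `Tally` are hypothetical names introduced here, and `hashInvoice` is repeated only so the block runs on its own.

```typescript
// Only the fields that hashInvoice reads; the full Invoice type has more.
interface HashableInvoice {
  invoice_number: string;
  invoice_date: string;
  total_amount: number;
}

type Tally<T> = Map<string, { invoice: T; count: number }>;

// Repeated from above so this block is self-contained.
function hashInvoice(inv: HashableInvoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}

// Record one extraction pass: bump the count for its hash, or start a new entry.
function addResult<T extends HashableInvoice>(results: Tally<T>, invoice: T): void {
  const key = hashInvoice(invoice);
  const entry = results.get(key);
  if (entry) {
    entry.count += 1;
  } else {
    results.set(key, { invoice, count: 1 });
  }
}

// Return the invoice seen most often; ties go to the earliest pass
// because a Map iterates in insertion order.
function getMostCommon<T extends HashableInvoice>(results: Tally<T>): T {
  let best: { invoice: T; count: number } | undefined;
  for (const entry of results.values()) {
    if (!best || entry.count > best.count) best = entry;
  }
  if (!best) throw new Error('No extraction results recorded');
  return best.invoice;
}
```

Because the key covers only the invoice number, date, and total, two passes that disagree on `vendor_name` or `vat_amount` still count as a match. Adding fields to the key makes consensus stricter but harder to reach.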

Output Format

{
  "invoice_number": "INV-2024-001234",
  "invoice_date": "2024-08-15",
  "vendor_name": "Hetzner Online GmbH",
  "currency": "EUR",
  "net_amount": 167.52,
  "vat_amount": 31.83,
  "total_amount": 199.35
}
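
In the sample above, net plus VAT equals the total (167.52 + 31.83 = 199.35). That relationship can serve as a cheap sanity check on each extracted result. The sketch below is an optional addition rather than part of the tested pipeline; `amountsAreConsistent` and the one-cent default tolerance are assumptions.

```typescript
// Check that net + VAT matches the total within a small tolerance.
// Amounts are compared in integer cents to avoid floating-point drift.
function amountsAreConsistent(
  amounts: { net_amount: number; vat_amount: number; total_amount: number },
  toleranceCents: number = 1,
): boolean {
  const toCents = (value: number): number => Math.round(value * 100);
  const diff =
    toCents(amounts.net_amount) + toCents(amounts.vat_amount) - toCents(amounts.total_amount);
  return Math.abs(diff) <= toleranceCents;
}

// The sample output above passes the check.
const sampleOk = amountsAreConsistent({ net_amount: 167.52, vat_amount: 31.83, total_amount: 199.35 });
console.log(sampleOk); // true
```

A failed check does not prove the extraction is wrong, since some invoices include discounts or fees outside these three fields, but it is a reasonable trigger for another pass.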

Test Results

Tested on 46 real invoices from various vendors:

Metric	Value
Accuracy	91.3% (42/46)
Avg Time	42.7s per invoice
Consensus Rate	85% in 2 passes

Per-Vendor Results

Vendor	Invoices	Accuracy
Hetzner	3	100%
DigitalOcean	4	100%
Adobe	3	100%
Cloudflare	1	100%
Wasabi	4	100%
Figma	3	100%
Google Cloud	1	100%
MongoDB	3	0% (date parsing)

Hardware Requirements

Component	Minimum	Recommended
PaddleOCR (CPU)	4GB RAM	8GB RAM
MiniCPM-V (GPU)	10GB VRAM	12GB VRAM
MiniCPM-V (CPU)	16GB RAM	32GB RAM

Tips

Use the hybrid approach: OCR text as context improves number/date accuracy (91%+ overall vs. 85-90% for VLM only)
Consensus voting: Run 2-5 passes to catch hallucinations
200 DPI is a good default: higher rarely helps (try 300 DPI only if digits go missing), lower loses detail
PNG over JPEG: Preserves text clarity
Temperature 0.1: Low temperature for consistent output
Multi-page support: Pass all pages in single request for context
Normalize for comparison: Ignore case/whitespace when comparing invoice numbers
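
The last tip can be sketched as follows, assuming normalization means uppercasing and removing all whitespace. `normalizeInvoiceNumber` and `sameInvoiceNumber` are hypothetical helper names.

```typescript
// Hypothetical normalizer: uppercase and drop all whitespace so that
// " inv-2024-001234" and "INV-2024-001234 " compare as equal.
function normalizeInvoiceNumber(value: string): string {
  return value.replace(/\s+/g, '').toUpperCase();
}

// Compare two invoice numbers after normalization.
function sameInvoiceNumber(a: string, b: string): boolean {
  return normalizeInvoiceNumber(a) === normalizeInvoiceNumber(b);
}
```

Applying the same normalization inside `hashInvoice` would let passes that differ only in case or spacing count toward consensus.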

Common Issues

Issue	Cause	Solution
Wrong date	Multiple dates on invoice	Be specific in prompt about which date
Wrong currency	Symbol vs code mismatch	OCR helps disambiguate
Missing digits	Low resolution	Increase density to 300 DPI
Hallucinated data	VLM uncertainty	Use consensus voting
