# Document Recognition with Hybrid OCR + Vision AI

Recipe for extracting structured data from invoices and documents using a hybrid approach: PaddleOCR for text extraction plus MiniCPM-V 4.5 for intelligent parsing.

## Architecture

```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PDF/Image   │ ───> │  PaddleOCR   │ ───> │   Raw Text   │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
                      ┌──────────────┐             │
                      │  MiniCPM-V   │ <───────────┘
                      │   4.5 VLM    │ <─── Image
                      └──────┬───────┘
                             │
                      ┌──────▼───────┐
                      │  Structured  │
                      │     JSON     │
                      └──────────────┘
```

## Why Hybrid?

| Approach | Accuracy | Speed | Best For |
|----------|----------|-------|----------|
| VLM Only | 85-90% | Fast | Simple layouts |
| OCR Only | N/A | Fast | Just text extraction |
| **Hybrid** | **91%+** | Medium | Complex invoices |

The hybrid approach passes the OCR text to the VLM as additional context, which improves accuracy on:

- Small text and numbers
- Low-contrast documents
- Dense tables

## Services

| Service | Port | Purpose |
|---------|------|---------|
| PaddleOCR | 5000 | Text extraction |
| Ollama (MiniCPM-V) | 11434 | Intelligent parsing |

## Running the Containers

**Start both services:**

```bash
# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
  code.foss.global/host.today/ht-docker-ai:paddleocr-cpu

# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/host.today/ht-docker-ai:minicpm45v
```

## Image Conversion

Convert PDF to PNG at 200 DPI with ImageMagick:

```bash
convert -density 200 -quality 90 input.pdf \
  -background white -alpha remove \
  page-%d.png
```

## Step 1: Extract OCR Text

```typescript
// Send one base64-encoded page image to the PaddleOCR service and return its text.
async function extractOcrText(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:5000/ocr', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: imageBase64 }),
  });

  const data = await response.json();
  if (data.success && data.results) {
    // Join the recognized text fragments, one per line.
    return data.results.map((r: { text: string }) => r.text).join('\n');
  }
  // Fall back to an empty string so the caller can continue without OCR context.
  return '';
}
```

## Step 2: Build Enhanced Prompt

```typescript
// Build the extraction prompt; OCR text is appended only when it is available.
function buildPrompt(ocrText: string): string {
  const base = `You are an invoice parser. Extract the following fields:

1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due

Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;

  if (ocrText) {
    return `${base}

OCR text extracted from the invoice:
---
${ocrText}
---

Cross-reference the image with the OCR text above for accuracy.`;
  }
  return base;
}
```

## Step 3: Call Vision-Language Model

```typescript
// Shape of the extracted data; field names match the prompt in Step 2.
interface Invoice {
  invoice_number: string;
  invoice_date: string; // YYYY-MM-DD
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
  const payload = {
    model: 'openbmb/minicpm-v4.5:q8_0',
    prompt: buildPrompt(ocrText),
    images, // Base64 encoded
    stream: false,
    options: {
      num_predict: 2048,
      temperature: 0.1,
    },
  };

  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });

  const result = await response.json();
  // Throws if the model returned anything other than pure JSON.
  return JSON.parse(result.response);
}
```

## Consensus Voting

For production reliability, run multiple extraction passes and accept a result only once two passes agree:

```typescript
// addResult and getMostCommon are helpers that are not shown in this snippet.
async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
  const results: Map<string, { invoice: Invoice; count: number }> = new Map();

  // Optimization: run Pass 1 (no OCR context) in parallel with OCR extraction.
  // Only the first page is sent to OCR.
  const [pass1Result, ocrText] = await Promise.all([
    extractInvoice(images, ''),
    extractOcrText(images[0]),
  ]);

  // Add Pass 1 result
  addResult(results, pass1Result);

  // Pass 2 with OCR context
  const pass2Result = await extractInvoice(images, ocrText);
  addResult(results, pass2Result);

  // Check for consensus (2 matching results)
  for (const data of results.values()) {
    if (data.count >= 2) {
      return data.invoice; // Consensus reached!
    }
  }

  // Continue until consensus or max passes
  for (let pass = 3; pass <= maxPasses; pass++) {
    const result = await extractInvoice(images, ocrText);
    addResult(results, result);
    // Check consensus...
  }

  // Return most common result
  return getMostCommon(results);
}

// Two results match when invoice number, date, and total (to 2 decimals) agree.
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${Number(inv.total_amount).toFixed(2)}`;
}
```

## Output Format

```json
{
  "invoice_number": "INV-2024-001234",
  "invoice_date": "2024-08-15",
  "vendor_name": "Hetzner Online GmbH",
  "currency": "EUR",
  "net_amount": 167.52,
  "vat_amount": 31.83,
  "total_amount": 199.35
}
```

## Test Results

Tested on 46 real invoices from various vendors:

| Metric | Value |
|--------|-------|
| **Accuracy** | 91.3% (42/46) |
| **Avg Time** | 42.7s per invoice |
| **Consensus Rate** | 85% in 2 passes |

### Per-Vendor Results

The vendors listed below account for 22 of the 46 invoices.

| Vendor | Invoices | Accuracy |
|--------|----------|----------|
| Hetzner | 3 | 100% |
| DigitalOcean | 4 | 100% |
| Adobe | 3 | 100% |
| Cloudflare | 1 | 100% |
| Wasabi | 4 | 100% |
| Figma | 3 | 100% |
| Google Cloud | 1 | 100% |
| MongoDB | 3 | 0% (date parsing) |

## Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| PaddleOCR (CPU) | 4GB RAM | 8GB RAM |
| MiniCPM-V (GPU) | 10GB VRAM | 12GB VRAM |
| MiniCPM-V (CPU) | 16GB RAM | 32GB RAM |

## Tips

1. **Use the hybrid approach**: OCR context improves number and date accuracy (85-90% VLM-only versus 91%+ hybrid in the comparison above)
2. **Consensus voting**: Run 2-5 passes to catch hallucinations
3. **Start at 200 DPI**: Higher densities rarely help on clean documents, and lower ones lose detail
4. **PNG over JPEG**: Lossless compression preserves text clarity
5. **Temperature 0.1**: Low temperature for consistent output
6. **Multi-page support**: Pass all pages in a single request so the model sees the full context
7. **Normalize for comparison**: Ignore case and whitespace when comparing invoice numbers

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Wrong date | Multiple dates on invoice | State in the prompt which date to extract |
| Wrong currency | Symbol vs code mismatch | OCR helps disambiguate |
| Missing digits | Resolution too low for small print | Increase density to 300 DPI |
| Hallucinated data | VLM uncertainty | Use consensus voting |
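
## Sketch: Consensus Helpers

The Consensus Voting snippet calls `addResult` and `getMostCommon` without showing them. The following is one possible sketch of those helpers, not the original implementation. The `Tally` type, the `normalizedHash` function, and the tie-breaking rule are assumptions made for illustration; the `Invoice` fields follow the Output Format section and are declared here so the sketch is self-contained. The hash also applies the "Normalize for comparison" tip by ignoring case and whitespace in invoice numbers.

```typescript
// Assumed shape, taken from the Output Format section.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical bookkeeping entry: one distinct result plus its vote count.
interface Tally {
  invoice: Invoice;
  count: number;
}

// Normalize before hashing so " inv-001 " and "INV-001" vote for the same result.
function normalizedHash(inv: Invoice): string {
  const num = String(inv.invoice_number).replace(/\s+/g, '').toUpperCase();
  const total = Number(inv.total_amount).toFixed(2);
  return `${num}|${inv.invoice_date}|${total}`;
}

// Record one extraction pass and return the vote count for that result.
function addResult(results: Map<string, Tally>, invoice: Invoice): number {
  const key = normalizedHash(invoice);
  const entry = results.get(key);
  if (entry) {
    entry.count += 1;
    return entry.count;
  }
  results.set(key, { invoice, count: 1 });
  return 1;
}

// Pick the result with the most votes; ties go to the earliest one recorded,
// because Map iterates in insertion order.
function getMostCommon(results: Map<string, Tally>): Invoice {
  let best: Tally | undefined;
  for (const tally of results.values()) {
    if (!best || tally.count > best.count) best = tally;
  }
  if (!best) throw new Error('no extraction results recorded');
  return best.invoice;
}
```

Because this version of `addResult` returns the running vote count, the per-pass consensus check inside `extractWithConsensus` could be written as `if (addResult(results, result) >= 2) return result;`. Whether to hash on more fields (for example `currency`) is a design choice: a stricter hash reduces false agreement but makes consensus harder to reach.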