Document Recognition with Hybrid OCR + Vision AI
Recipe for extracting structured data from invoices and documents using a hybrid approach: PaddleOCR for text extraction + MiniCPM-V 4.5 for intelligent parsing.
Architecture
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PDF/Image   │ ───> │  PaddleOCR   │ ───> │   Raw Text   │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
                      ┌──────────────┐             │
                      │  MiniCPM-V   │ <───────────┘
                      │   4.5 VLM    │ <─── Image
                      └──────┬───────┘
                             │
                      ┌──────▼───────┐
                      │  Structured  │
                      │     JSON     │
                      └──────────────┘
Why Hybrid?
| Approach | Accuracy | Speed | Best For |
|---|---|---|---|
| VLM Only | 85-90% | Fast | Simple layouts |
| OCR Only | N/A | Fast | Just text extraction |
| Hybrid | 91%+ | Medium | Complex invoices |
The hybrid approach provides OCR text as context to the VLM, improving accuracy on:
- Small text and numbers
- Low contrast documents
- Dense tables
Services
| Service | Port | Purpose |
|---|---|---|
| PaddleOCR | 5000 | Text extraction |
| Ollama (MiniCPM-V) | 11434 | Intelligent parsing |
Running the Containers
Start both services:
# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
code.foss.global/host.today/ht-docker-ai:paddleocr-cpu
# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
-v ollama-data:/root/.ollama \
code.foss.global/host.today/ht-docker-ai:minicpm45v
Image Conversion
Convert PDF to PNG at 200 DPI with ImageMagick (one PNG per page):
convert -density 200 -quality 90 input.pdf \
-background white -alpha remove \
page-%d.png
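Because the pages are written as numbered files, a plain lexical sort would place `page-10.png` before `page-2.png`. A minimal sketch of restoring page order before the images are encoded and sent on; `sortPageFiles` is a hypothetical helper, not part of the recipe, and it assumes the `page-%d.png` naming used above.

```typescript
// Hypothetical helper: keep only files matching page-<n>.png and
// order them by the numeric page index rather than alphabetically.
function sortPageFiles(files: string[]): string[] {
  const pattern = /page-(\d+)\.png$/;
  return files
    .filter((name) => pattern.test(name))
    .sort((a, b) => Number(pattern.exec(a)![1]) - Number(pattern.exec(b)![1]));
}
```

Files that do not match the pattern are dropped rather than sorted to the end, which keeps stray files in the output directory out of the request.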
Step 1: Extract OCR Text
async function extractOcrText(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:5000/ocr', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: imageBase64 }),
  });
  // On an HTTP error, return no OCR context instead of parsing an error page as JSON
  if (!response.ok) {
    return '';
  }
  const data = await response.json();
  // Join the recognized text lines into a single block
  if (data.success && data.results) {
    return data.results.map((r: { text: string }) => r.text).join('\n');
  }
  return '';
}
Step 2: Build Enhanced Prompt
function buildPrompt(ocrText: string): string {
  const base = `You are an invoice parser. Extract the following fields:
1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due
Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;
  if (ocrText) {
    return `${base}
OCR text extracted from the invoice:
---
${ocrText}
---
Cross-reference the image with the OCR text above for accuracy.`;
  }
  return base;
}
Step 3: Call Vision-Language Model
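// Shape of the JSON the prompt asks the model to return
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}
async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
  const payload = {
    model: 'openbmb/minicpm-v4.5:q8_0',
    prompt: buildPrompt(ocrText),
    images, // Base64 encoded
    stream: false,
    options: {
      num_predict: 2048,
      temperature: 0.1,
    },
  };
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }
  const result = await response.json();
  // With stream: false, the generated text arrives in result.response
  return JSON.parse(result.response) as Invoice;
}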
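`JSON.parse(result.response)` throws if the model wraps the JSON in extra text despite the "Return ONLY valid JSON" instruction. A minimal sketch of a more tolerant parse, assuming the reply contains a single JSON object; `parseJsonObject` is a hypothetical helper name, not part of the recipe.

```typescript
// Hypothetical helper: parse the span from the first '{' to the last '}'
// in a model reply, ignoring any text before or after it.
function parseJsonObject(raw: string): unknown {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model response');
  }
  return JSON.parse(raw.slice(start, end + 1));
}
```

In `extractInvoice`, this would stand in for the direct `JSON.parse` call. Ollama's generate endpoint also accepts `format: 'json'` in the payload, which may make the fallback unnecessary.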
Consensus Voting
For production reliability, run multiple extraction passes and require consensus:
async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
  const results: Map<string, { invoice: Invoice; count: number }> = new Map();
  // Return the first invoice that at least two passes agree on, if any
  const findConsensus = (): Invoice | undefined => {
    for (const data of results.values()) {
      if (data.count >= 2) return data.invoice;
    }
    return undefined;
  };
  // Optimization: run Pass 1 (no OCR) in parallel with the OCR request.
  // Only the first page is sent to OCR here.
  const [pass1Result, ocrText] = await Promise.all([
    extractInvoice(images, ''),
    extractOcrText(images[0]),
  ]);
  addResult(results, pass1Result);
  // Pass 2 with OCR context
  addResult(results, await extractInvoice(images, ocrText));
  let consensus = findConsensus();
  // Continue until consensus or max passes
  for (let pass = 3; !consensus && pass <= maxPasses; pass++) {
    addResult(results, await extractInvoice(images, ocrText));
    consensus = findConsensus();
  }
  // Fall back to the most common result when no two passes agree
  return consensus ?? getMostCommon(results);
}
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}
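The consensus function calls `addResult` and `getMostCommon` without showing them. A minimal sketch of how they might look, assuming results are tallied by `hashInvoice` and ties go to the result seen first; both bodies are assumptions, and the `Invoice` type here lists only the fields the hash uses.

```typescript
// Only the fields used for hashing; the full invoice carries more.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  total_amount: number;
}

type Tally = Map<string, { invoice: Invoice; count: number }>;

// Repeated from above so this block stands alone.
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}

// Assumed behavior: count how many passes produced the same hash.
function addResult(results: Tally, invoice: Invoice): void {
  const key = hashInvoice(invoice);
  const entry = results.get(key);
  if (entry) {
    entry.count += 1;
  } else {
    results.set(key, { invoice, count: 1 });
  }
}

// Assumed behavior: return the invoice with the highest count.
// Map iteration follows insertion order, so ties go to the earliest result.
function getMostCommon(results: Tally): Invoice {
  let best: { invoice: Invoice; count: number } | undefined;
  for (const entry of results.values()) {
    if (!best || entry.count > best.count) best = entry;
  }
  if (!best) throw new Error('No extraction results to choose from');
  return best.invoice;
}
```

Because the hash covers only the invoice number, date, and total, two passes that disagree on `vendor_name` or `vat_amount` still count as a match.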
Output Format
{
"invoice_number": "INV-2024-001234",
"invoice_date": "2024-08-15",
"vendor_name": "Hetzner Online GmbH",
"currency": "EUR",
"net_amount": 167.52,
"vat_amount": 31.83,
"total_amount": 199.35
}
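The output can be sanity-checked before it is accepted. A minimal sketch, assuming all three amounts are present and that net plus VAT should equal the total; `validateInvoice` and the 0.01 tolerance are hypothetical choices, not part of the recipe.

```typescript
// Same shape as the JSON above.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical checks: returns a list of problems, empty when none are found.
function validateInvoice(inv: Invoice): string[] {
  const problems: string[] = [];
  if (!/^\d{4}-\d{2}-\d{2}$/.test(inv.invoice_date)) {
    problems.push('invoice_date is not YYYY-MM-DD');
  }
  if (!/^[A-Z]{3}$/.test(inv.currency)) {
    problems.push('currency is not a three-letter code');
  }
  // Assumed tolerance of 0.01 to absorb floating-point rounding
  const sum = inv.net_amount + inv.vat_amount;
  if (Math.abs(sum - inv.total_amount) > 0.01) {
    problems.push('net_amount + vat_amount does not match total_amount');
  }
  return problems;
}
```

For the example above, 167.52 + 31.83 = 199.35, so the amount check passes. An invoice that does not show a net amount would need the arithmetic check skipped.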
Test Results
Tested on 46 real invoices from various vendors:
| Metric | Value |
|---|---|
| Accuracy | 91.3% (42/46) |
| Avg Time | 42.7s per invoice |
| Consensus Rate | 85% in 2 passes |
Per-Vendor Results
| Vendor | Invoices | Accuracy |
|---|---|---|
| Hetzner | 3 | 100% |
| DigitalOcean | 4 | 100% |
| Adobe | 3 | 100% |
| Cloudflare | 1 | 100% |
| Wasabi | 4 | 100% |
| Figma | 3 | 100% |
| Google Cloud | 1 | 100% |
| MongoDB | 3 | 0% (date parsing) |
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| PaddleOCR (CPU) | 4GB RAM | 8GB RAM |
| MiniCPM-V (GPU) | 10GB VRAM | 12GB VRAM |
| MiniCPM-V (CPU) | 16GB RAM | 32GB RAM |
Tips
- Use hybrid approach: OCR text improves number/date accuracy, especially on small text, low-contrast scans, and dense tables
- Consensus voting: Run 2-5 passes to catch hallucinations
- 200 DPI is a good default: lower loses detail, and higher rarely helps unless digits go missing (then try 300 DPI)
- PNG over JPEG: Preserves text clarity
- Temperature 0.1: Low temperature for consistent output
- Multi-page support: Pass all pages in single request for context
- Normalize for comparison: Ignore case/whitespace when comparing invoice numbers
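The normalization tip can be sketched as a small comparison helper. This is an assumption about what "ignore case/whitespace" means in practice; `normalizeInvoiceNumber` and `sameInvoiceNumber` are hypothetical names.

```typescript
// Hypothetical normalizer: drop all whitespace and uppercase the rest.
function normalizeInvoiceNumber(value: string): string {
  return value.replace(/\s+/g, '').toUpperCase();
}

// Compare two invoice numbers while ignoring case and whitespace.
function sameInvoiceNumber(a: string, b: string): boolean {
  return normalizeInvoiceNumber(a) === normalizeInvoiceNumber(b);
}
```

Applying the same normalization to the invoice number before hashing would let passes that differ only in spacing or case count toward consensus.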
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Wrong date | Multiple dates on invoice | Be specific in prompt about which date |
| Wrong currency | Symbol vs code mismatch | OCR helps disambiguate |
| Missing digits | Low resolution | Increase density to 300 DPI |
| Hallucinated data | VLM uncertainty | Use consensus voting |