feat(invoices): add hybrid OCR + vision invoice/document parsing with PaddleOCR, consensus voting, and prompt/test refactors

2026-01-16 14:24:37 +00:00
parent acded2a165
commit 82358b2d5d
4 changed files with 380 additions and 109 deletions

@@ -1,129 +1,250 @@
# Bank Statement Parsing with MiniCPM-V 4.5
# Document Recognition with Hybrid OCR + Vision AI
Recipe for extracting transactions from bank statement PDFs using vision-language AI.
Recipe for extracting structured data from invoices and documents using a hybrid approach:
PaddleOCR for text extraction + MiniCPM-V 4.5 for intelligent parsing.
## Model
## Architecture
- **Model**: MiniCPM-V 4.5 (8B parameters)
- **Ollama Name**: `openbmb/minicpm-v4.5:q8_0`
- **Quantization**: Q8_0 (9.8GB VRAM)
- **Runtime**: Ollama on GPU
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PDF/Image │ ───> │ PaddleOCR │ ───> │ Raw Text │
└──────────────┘ └──────────────┘ └──────┬───────┘
┌──────────────┐ │
│ MiniCPM-V │ <───────────┘
│ 4.5 VLM │ <─── Image
└──────┬───────┘
┌──────▼───────┐
│ Structured │
│ JSON │
└──────────────┘
```
## Why Hybrid?
| Approach | Accuracy | Speed | Best For |
|----------|----------|-------|----------|
| VLM Only | 85-90% | Fast | Simple layouts |
| OCR Only | N/A | Fast | Just text extraction |
| **Hybrid** | **91.3%** (42/46 test invoices) | Medium | Complex invoices |
The hybrid approach provides OCR text as context to the VLM, improving accuracy on:
- Small text and numbers
- Low contrast documents
- Dense tables
## Services
| Service | Port | Purpose |
|---------|------|---------|
| PaddleOCR | 5000 | Text extraction |
| Ollama (MiniCPM-V) | 11434 | Intelligent parsing |
## Running the Containers
**Start both services:**
```bash
# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
code.foss.global/host.today/ht-docker-ai:paddleocr-cpu
# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
-v ollama-data:/root/.ollama \
code.foss.global/host.today/ht-docker-ai:minicpm45v
```
## Image Conversion
Convert PDF to PNG at 300 DPI for optimal OCR accuracy.
Convert PDF to PNG at 200 DPI:
```bash
convert -density 300 -quality 100 input.pdf \
convert -density 200 -quality 90 input.pdf \
-background white -alpha remove \
output-%d.png
page-%d.png
```
**Parameters:**
- `-density 300`: 300 DPI resolution (critical for accuracy)
- `-quality 100`: Maximum quality
- `-background white -alpha remove`: Remove transparency
- `output-%d.png`: Outputs page-0.png, page-1.png, etc.
## Step 1: Extract OCR Text
**Dependencies:**
```bash
apt-get install imagemagick
```typescript
async function extractOcrText(imageBase64: string): Promise<string> {
const response = await fetch('http://localhost:5000/ocr', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ image: imageBase64 }),
});
const data = await response.json();
if (data.success && data.results) {
return data.results.map((r: { text: string }) => r.text).join('\n');
}
return '';
}
```
## Prompt
## Step 2: Build Enhanced Prompt
```
You are a bank statement parser. Extract EVERY transaction from the table.
```typescript
function buildPrompt(ocrText: string): string {
const base = `You are an invoice parser. Extract the following fields:
Read the Amount column carefully:
- "- 21,47 €" means DEBIT, output as: -21.47
- "+ 1.000,00 €" means CREDIT, output as: 1000.00
- European format: comma = decimal point
1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due
For each row output: {"date":"YYYY-MM-DD","counterparty":"NAME","amount":-21.47}
Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;
Do not skip any rows. Return complete JSON array:
if (ocrText) {
return `${base}
OCR text extracted from the invoice:
---
${ocrText}
---
Cross-reference the image with the OCR text above for accuracy.`;
}
return base;
}
```
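The amount conventions quoted in the prompt (`- 21,47 €` is a debit of `-21.47`, `+ 1.000,00 €` is a credit of `1000.00`) can also be applied in post-processing, for example to check a number the model returned against the OCR text. The sketch below is one possible way to do that; `parseEuropeanAmount` is a hypothetical helper, not part of this commit, and it assumes a dot is only ever a thousands separator and a comma is the decimal mark.

```typescript
// Hypothetical helper (not from the commit): convert a European-formatted
// amount string such as "- 21,47 €" into a signed number.
function parseEuropeanAmount(text: string): number {
  // Treat the Unicode minus sign like an ASCII hyphen, then drop everything
  // except digits, separators, and the sign (spaces, currency symbols, ...).
  const cleaned = text.replace(/\u2212/g, '-').replace(/[^\d.,+-]/g, '');

  // Optional sign, integer part (with or without dot-grouped thousands),
  // optional comma followed by the fractional digits.
  const match = /^([+-]?)(\d{1,3}(?:\.\d{3})*|\d+)(?:,(\d+))?$/.exec(cleaned);
  if (!match) {
    throw new Error(`Unrecognized amount: ${text}`);
  }

  const sign = match[1] === '-' ? -1 : 1;
  const integerPart = match[2].replace(/\./g, ''); // remove thousands separators
  const fraction = match[3] ?? '0';
  return sign * Number(`${integerPart}.${fraction}`);
}
```

Inputs that do not fit this pattern, such as US-style `1,000.00`, are rejected with an error instead of being silently misread.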
## API Call
## Step 3: Call Vision-Language Model
```python
import base64
import requests
```typescript
async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
const payload = {
model: 'openbmb/minicpm-v4.5:q8_0',
prompt: buildPrompt(ocrText),
images, // Base64 encoded
stream: false,
options: {
num_predict: 2048,
temperature: 0.1,
},
};
# Load images
with open('page-0.png', 'rb') as f:
page0 = base64.b64encode(f.read()).decode('utf-8')
with open('page-1.png', 'rb') as f:
page1 = base64.b64encode(f.read()).decode('utf-8')
const response = await fetch('http://localhost:11434/api/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
});
payload = {
"model": "openbmb/minicpm-v4.5:q8_0",
"prompt": prompt,
"images": [page0, page1], # Multiple pages supported
"stream": False,
"options": {
"num_predict": 16384,
"temperature": 0.1
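// Fail early on HTTP errors instead of parsing an error body as an invoice
if (!response.ok) {
throw new Error(`Ollama request failed with status ${response.status}`);
}
const result = await response.json();
return JSON.parse(result.response);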
}
```
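The prompt asks for JSON only, but a vision-language model may still wrap its answer in a Markdown code fence or add a sentence around it, in which case a direct `JSON.parse` on the response text would throw. The following is a minimal sketch of a more tolerant parser; `extractJsonObject` is a hypothetical name, and the approach (take everything between the first `{` and the last `}`) assumes the reply contains a single JSON object.

```typescript
// Hypothetical helper (not from the commit): pull a single JSON object
// out of a model reply that may contain extra text or Markdown fences.
function extractJsonObject(raw: string): unknown {
  // Remove Markdown fence markers (three backticks, optionally tagged "json").
  const unfenced = raw.replace(/`{3}(?:json)?/gi, '').trim();

  // Assume one object per reply: slice from the first "{" to the last "}".
  const start = unfenced.indexOf('{');
  const end = unfenced.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model response');
  }
  return JSON.parse(unfenced.slice(start, end + 1));
}
```

The return type is `unknown` on purpose, so the caller still has to check the fields before treating the value as an invoice.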
## Consensus Voting
For production reliability, run multiple extraction passes and require consensus:
```typescript
async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
const results: Map<string, { invoice: Invoice; count: number }> = new Map();
// Optimization: run Pass 1 (no OCR) in parallel with OCR extraction
// Note: only the first page is sent to OCR here
const [pass1Result, ocrText] = await Promise.all([
extractInvoice(images, ''),
extractOcrText(images[0]),
]);
// Add Pass 1 result
addResult(results, pass1Result);
// Pass 2 with OCR context
const pass2Result = await extractInvoice(images, ocrText);
addResult(results, pass2Result);
// Check for consensus (2 matching results)
for (const [hash, data] of results) {
if (data.count >= 2) {
return data.invoice; // Consensus reached!
}
}
// Continue until consensus or max passes
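for (let pass = 3; pass <= maxPasses; pass++) {
const result = await extractInvoice(images, ocrText);
addResult(results, result);
// Stop as soon as any result has been seen twice
for (const data of results.values()) {
if (data.count >= 2) {
return data.invoice;
}
}
}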
// Return most common result
return getMostCommon(results);
}
response = requests.post(
'http://localhost:11434/api/generate',
json=payload,
timeout=600
)
result = response.json()['response']
function hashInvoice(inv: Invoice): string {
return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}
```
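The consensus code calls `addResult` and `getMostCommon` without showing them. One plausible shape for these helpers is sketched below. The `Invoice` interface is an assumption inferred from the fields listed in the prompt, `Tally` is a hypothetical alias for the map type used above, and returning the updated count from `addResult` is an illustrative choice rather than something the commit specifies.

```typescript
// Assumed shape, inferred from the JSON fields requested in the prompt.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical alias for the map used by the consensus loop.
type Tally = Map<string, { invoice: Invoice; count: number }>;

// Same key as in the recipe: number, date, and total rounded to cents.
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${inv.total_amount.toFixed(2)}`;
}

// Record one extraction pass and return how often this result has been seen.
function addResult(results: Tally, invoice: Invoice): number {
  const key = hashInvoice(invoice);
  const entry = results.get(key);
  if (entry) {
    entry.count += 1;
    return entry.count;
  }
  results.set(key, { invoice, count: 1 });
  return 1;
}

// Return the invoice with the highest count; on a tie the earliest one wins.
function getMostCommon(results: Tally): Invoice {
  const entries = Array.from(results.values());
  if (entries.length === 0) {
    throw new Error('No extraction results recorded');
  }
  return entries.reduce((best, next) => (next.count > best.count ? next : best)).invoice;
}
```

Because the key only covers three fields, two passes that disagree on `vendor_name` or `vat_amount` still count as matching; whether that is acceptable depends on which fields downstream code relies on.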
## Output Format
```json
[
{"date":"2022-04-01","counterparty":"DIGITALOCEAN.COM","amount":-21.47},
{"date":"2022-04-01","counterparty":"DIGITALOCEAN.COM","amount":-58.06},
{"date":"2022-04-12","counterparty":"LOSSLESS GMBH","amount":1000.00}
]
{
"invoice_number": "INV-2024-001234",
"invoice_date": "2024-08-15",
"vendor_name": "Hetzner Online GmbH",
"currency": "EUR",
"net_amount": 167.52,
"vat_amount": 31.83,
"total_amount": 199.35
}
```
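For the invoice object, a few cheap checks can flag suspicious extractions before they reach consensus voting. The sketch below assumes all three amounts are present (the prompt marks `net_amount` as optional) and that `net_amount + vat_amount` should equal `total_amount` to within half a cent; `validateInvoice` and the `Invoice` interface are hypothetical and not part of this commit.

```typescript
// Assumed shape, inferred from the JSON fields requested in the prompt.
interface Invoice {
  invoice_number: string;
  invoice_date: string;
  vendor_name: string;
  currency: string;
  net_amount: number;
  vat_amount: number;
  total_amount: number;
}

// Hypothetical helper: return a list of problems (empty means it looks plausible).
function validateInvoice(inv: Invoice, tolerance: number = 0.005): string[] {
  const problems: string[] = [];

  // The date must be in YYYY-MM-DD form and be a real calendar date.
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(inv.invoice_date);
  if (!match) {
    problems.push('invoice_date is not in YYYY-MM-DD format');
  } else {
    const year = Number(match[1]);
    const month = Number(match[2]);
    const day = Number(match[3]);
    const date = new Date(Date.UTC(year, month - 1, day));
    // Date.UTC rolls invalid days over (Feb 30 becomes Mar 1), so compare back.
    if (date.getUTCFullYear() !== year || date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) {
      problems.push('invoice_date is not a real calendar date');
    }
  }

  // Net plus VAT should add up to the total, allowing for float rounding.
  if (Math.abs(inv.net_amount + inv.vat_amount - inv.total_amount) > tolerance) {
    problems.push('net_amount + vat_amount does not match total_amount');
  }
  return problems;
}
```

For the sample invoice above, 167.52 + 31.83 matches the total of 199.35, so the amount check passes.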
## Running the Container
**GPU (recommended):**
```bash
docker run -d --gpus all -p 11434:11434 \
-v ollama-data:/root/.ollama \
-e MODEL_NAME="openbmb/minicpm-v4.5:q8_0" \
ht-docker-ai:minicpm45v
```
**CPU (slower):**
```bash
docker run -d -p 11434:11434 \
-v ollama-data:/root/.ollama \
-e MODEL_NAME="openbmb/minicpm-v4.5:q4_0" \
ht-docker-ai:minicpm45v-cpu
```
## Hardware Requirements
| Quantization | VRAM/RAM | Speed |
|--------------|----------|-------|
| Q8_0 (GPU) | 10GB | Fast |
| Q4_0 (CPU) | 8GB | Slow |
## Test Results
| Statement | Pages | Transactions | Accuracy |
|-----------|-------|--------------|----------|
| bunq-2022-04 | 2 | 26 | 100% |
| bunq-2021-06 | 3 | 28 | 100% |
Tested on 46 real invoices from various vendors:
| Metric | Value |
|--------|-------|
| **Accuracy** | 91.3% (42/46) |
| **Avg Time** | 42.7s per invoice |
| **Consensus Rate** | 85% in 2 passes |
### Per-Vendor Results
| Vendor | Invoices | Accuracy |
|--------|----------|----------|
| Hetzner | 3 | 100% |
| DigitalOcean | 4 | 100% |
| Adobe | 3 | 100% |
| Cloudflare | 1 | 100% |
| Wasabi | 4 | 100% |
| Figma | 3 | 100% |
| Google Cloud | 1 | 100% |
| MongoDB | 3 | 0% (date parsing) |
## Hardware Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| PaddleOCR (CPU) | 4GB RAM | 8GB RAM |
| MiniCPM-V (GPU) | 10GB VRAM | 12GB VRAM |
| MiniCPM-V (CPU) | 16GB RAM | 32GB RAM |
## Tips
1. **DPI matters**: 150 DPI causes missed rows; 300 DPI is optimal
2. **PNG over JPEG**: PNG preserves text clarity better
3. **Remove alpha**: Some models struggle with transparency
4. **Multi-page**: Pass all pages in single request for context
1. **Use hybrid approach**: OCR text dramatically improves number/date accuracy
2. **Consensus voting**: Run 2-5 passes to catch hallucinations
3. **200 DPI is optimal**: Higher doesn't help, lower loses detail
4. **PNG over JPEG**: Preserves text clarity
5. **Temperature 0.1**: Low temperature for consistent output
6. **European format**: Explicitly explain comma=decimal in prompt
6. **Multi-page support**: Pass all pages in single request for context
7. **Normalize for comparison**: Ignore case/whitespace when comparing invoice numbers
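
As an illustration of tip 7, the sketch below compares invoice numbers after removing whitespace and folding case. Both function names are hypothetical, and the normalization rule (strip all whitespace, uppercase) is an assumption about what "ignore case/whitespace" means here.

```typescript
// Hypothetical helper: canonical form of an invoice number for comparison.
function normalizeInvoiceNumber(value: string): string {
  // Remove every whitespace character, then fold to upper case.
  return value.replace(/\s+/g, '').toUpperCase();
}

// Two invoice numbers are treated as equal if their canonical forms match.
function sameInvoiceNumber(a: string, b: string): boolean {
  return normalizeInvoiceNumber(a) === normalizeInvoiceNumber(b);
}
```

The same normalization could be applied to the invoice number before hashing, so that passes differing only in spacing or case still count toward consensus.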
## Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Wrong date | Multiple dates on invoice | Be specific in prompt about which date |
| Wrong currency | Symbol vs code mismatch | OCR helps disambiguate |
| Missing digits | Low resolution | Check that pages were rendered at 200 DPI or above; try 300 DPI for very small print |
| Hallucinated data | VLM uncertainty | Use consensus voting |