# Document Recognition with Hybrid OCR + Vision AI
Recipe for extracting structured data from invoices and documents using a hybrid approach:
PaddleOCR for text extraction + MiniCPM-V 4.5 for intelligent parsing.
## Architecture
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PDF/Image   │ ───> │  PaddleOCR   │ ───> │   Raw Text   │
└──────────────┘      └──────────────┘      └──────┬───────┘
                      ┌──────────────┐             │
                      │  MiniCPM-V   │ <───────────┘
                      │   4.5 VLM    │ <─── Image
                      └──────┬───────┘
                      ┌──────▼───────┐
                      │  Structured  │
                      │     JSON     │
                      └──────────────┘
```
## Why Hybrid?
| Approach | Accuracy | Speed | Best For |
|----------|----------|-------|----------|
| VLM Only | 85-90% | Fast | Simple layouts |
| OCR Only | N/A | Fast | Just text extraction |
| **Hybrid** | **91%+** | Medium | Complex invoices |

The hybrid approach provides OCR text as context to the VLM, improving accuracy on:
- Small text and numbers
- Low contrast documents
- Dense tables
## Services
| Service | Port | Purpose |
|---------|------|---------|
| PaddleOCR | 5000 | Text extraction |
| Ollama (MiniCPM-V) | 11434 | Intelligent parsing |
## Running the Containers
**Start both services:**
```bash
# PaddleOCR (CPU is sufficient for OCR)
docker run -d --name paddleocr -p 5000:5000 \
  code.foss.global/host.today/ht-docker-ai:paddleocr-cpu

# MiniCPM-V 4.5 (GPU recommended)
docker run -d --name minicpm --gpus all -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/host.today/ht-docker-ai:minicpm45v
```
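
Containers can take a while before they accept requests, so it may help to wait for them before sending the first document. The following is a minimal sketch, not part of the recipe: `waitForService` and `Probe` are hypothetical names, and the probe itself is supplied by the caller. Ollama's `GET /api/tags` endpoint is a reasonable probe for the VLM container; this recipe does not document a health endpoint for the PaddleOCR service, so any probe for it is an assumption.

```typescript
// Hypothetical helper: `probe` is any function that resolves to true once the service answers.
type Probe = () => Promise<boolean>;

async function waitForService(probe: Probe, attempts = 30, delayMs = 1000): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await probe()) return true;
    } catch {
      // Connection refused while the container is still starting: retry
    }
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe for Ollama (GET /api/tags lists the locally available models):
// const ollamaReady = () => fetch('http://localhost:11434/api/tags').then((r) => r.ok);
// const ready = await waitForService(ollamaReady);
```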
## Image Conversion
Convert PDF to PNG at 200 DPI:
```bash
convert -density 200 -quality 90 input.pdf \
-background white -alpha remove \
page-%d.png
```
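
To get a feel for how large the rendered pages are, the pixel dimensions follow directly from the DPI. This is a small worked sketch that assumes A4 pages (210 x 297 mm); `renderedSize` is a hypothetical helper, and the exact size ImageMagick produces may differ by a pixel due to rounding.

```typescript
// Pixel size of a page rendered at a given DPI (1 inch = 25.4 mm).
// A4 (210 x 297 mm) is an assumption; pass other dimensions for Letter etc.
function renderedSize(dpi: number, widthMm = 210, heightMm = 297): { width: number; height: number } {
  const toPx = (mm: number) => Math.round((mm / 25.4) * dpi);
  return { width: toPx(widthMm), height: toPx(heightMm) };
}

console.log(renderedSize(200)); // { width: 1654, height: 2339 }
console.log(renderedSize(300)); // { width: 2480, height: 3508 }
```

Going from 200 to 300 DPI increases the pixel count by a factor of 2.25, which is worth keeping in mind when processing time matters.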
## Step 1: Extract OCR Text
```typescript
async function extractOcrText(imageBase64: string): Promise<string> {
  const response = await fetch('http://localhost:5000/ocr', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: imageBase64 }),
  });
  // On an HTTP error, fall back to an empty string so the VLM pass can still run
  if (!response.ok) {
    return '';
  }

  const data = await response.json();
  if (data.success && Array.isArray(data.results)) {
    // One recognized text line per result entry
    return data.results.map((r: { text: string }) => r.text).join('\n');
  }
  return '';
}
```
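
If the OCR service returns something unexpected, a stricter parser avoids runtime errors when mapping over the results. The sketch below only assumes the `{ success, results: [{ text }] }` shape used above; `ocrTextFromResponse` and `OcrLine` are hypothetical names, and any other fields the service may return are ignored.

```typescript
interface OcrLine {
  text: string;
}

// Narrow an unknown JSON payload to the { success, results: [{ text }] } shape used in Step 1.
function ocrTextFromResponse(data: unknown): string {
  if (typeof data !== 'object' || data === null) return '';
  const { success, results } = data as { success?: unknown; results?: unknown };
  if (!success || !Array.isArray(results)) return '';
  return results
    .filter((r): r is OcrLine => typeof r === 'object' && r !== null && typeof (r as OcrLine).text === 'string')
    .map((r) => r.text.trim())
    .filter((line) => line.length > 0) // Drop empty lines so the prompt stays compact
    .join('\n');
}
```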
## Step 2: Build Enhanced Prompt
```typescript
function buildPrompt(ocrText: string): string {
  const base = `You are an invoice parser. Extract the following fields:
1. invoice_number: The invoice/receipt number
2. invoice_date: Date in YYYY-MM-DD format
3. vendor_name: Company that issued the invoice
4. currency: EUR, USD, etc.
5. net_amount: Amount before tax (if shown)
6. vat_amount: Tax/VAT amount (0 if reverse charge)
7. total_amount: Final amount due
Return ONLY valid JSON:
{"invoice_number":"XXX","invoice_date":"YYYY-MM-DD","vendor_name":"Company","currency":"EUR","net_amount":100.00,"vat_amount":19.00,"total_amount":119.00}`;

  if (ocrText) {
    return `${base}
OCR text extracted from the invoice:
---
${ocrText}
---
Cross-reference the image with the OCR text above for accuracy.`;
  }
  return base;
}
```
## Step 3: Call Vision-Language Model
```typescript
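// Shape of the JSON requested by the prompt in Step 2
interface Invoice {
  invoice_number: string;
  invoice_date: string; // YYYY-MM-DD
  vendor_name: string;
  currency: string;
  net_amount?: number; // Only present if shown on the invoice
  vat_amount: number;
  total_amount: number;
}

async function extractInvoice(images: string[], ocrText: string): Promise<Invoice> {
  const payload = {
    model: 'openbmb/minicpm-v4.5:q8_0',
    prompt: buildPrompt(ocrText),
    images, // Base64 encoded
    stream: false,
    options: {
      num_predict: 2048,
      temperature: 0.1,
    },
  };

  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: HTTP ${response.status}`);
  }

  const result = await response.json();
  // With stream: false, the generated text arrives in `response`
  return JSON.parse(result.response) as Invoice;
}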
```
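
Calling `JSON.parse` directly on the model output fails if the model wraps the JSON in Markdown fences or adds a sentence around it. The following is a minimal sketch of a more tolerant parser. `extractJsonObject`, `missingFields`, and the choice of which fields count as required are assumptions made for illustration, not part of the recipe. Ollama's `/api/generate` also accepts a `format: 'json'` request option, which may reduce the need for this.

```typescript
// Parse the span from the first "{" to the last "}" in a model reply that may be
// wrapped in Markdown fences or surrounded by explanatory text.
function extractJsonObject(raw: string): Record<string, unknown> {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model response');
  }
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('Model response is not a JSON object');
  }
  return parsed as Record<string, unknown>;
}

// Assumption: these prompt fields are treated as mandatory; net_amount is only "if shown".
const REQUIRED_FIELDS = ['invoice_number', 'invoice_date', 'vendor_name', 'currency', 'total_amount'];

// Names of required fields that are absent, null, or empty
function missingFields(obj: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter((f) => obj[f] === undefined || obj[f] === null || obj[f] === '');
}
```

A pass with missing fields could be discarded before it enters the consensus vote, rather than counted as a valid result.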
## Consensus Voting
For production reliability, run multiple extraction passes and require consensus:
```typescript
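// The helpers below are a minimal sketch; adapt the matching rule as needed.
type Tally = Map<string, { invoice: Invoice; count: number }>;

// Two passes match when these key fields agree
function hashInvoice(inv: Invoice): string {
  return `${inv.invoice_number}|${inv.invoice_date}|${Number(inv.total_amount).toFixed(2)}`;
}

function addResult(results: Tally, invoice: Invoice): void {
  const hash = hashInvoice(invoice);
  const entry = results.get(hash);
  if (entry) entry.count++;
  else results.set(hash, { invoice, count: 1 });
}

// Returns the first result seen at least `needed` times, if any
function findConsensus(results: Tally, needed = 2): Invoice | undefined {
  for (const { invoice, count } of results.values()) {
    if (count >= needed) return invoice;
  }
  return undefined;
}

function getMostCommon(results: Tally): Invoice {
  return [...results.values()].sort((a, b) => b.count - a.count)[0].invoice;
}

async function extractWithConsensus(images: string[], maxPasses: number = 5): Promise<Invoice> {
  const results: Tally = new Map();

  // Optimization: run Pass 1 (no OCR) in parallel with OCR extraction
  const [pass1Result, ocrText] = await Promise.all([
    extractInvoice(images, ''),
    extractOcrText(images[0]),
  ]);
  addResult(results, pass1Result);

  // Pass 2 onwards use OCR context; stop as soon as two passes agree
  for (let pass = 2; pass <= maxPasses; pass++) {
    addResult(results, await extractInvoice(images, ocrText));
    const consensus = findConsensus(results);
    if (consensus) return consensus; // Consensus reached
  }

  // No consensus within maxPasses: return the most common result
  return getMostCommon(results);
}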
```
## Output Format
```json
{
"invoice_number": "INV-2024-001234",
"invoice_date": "2024-08-15",
"vendor_name": "Hetzner Online GmbH",
"currency": "EUR",
"net_amount": 167.52,
"vat_amount": 31.83,
"total_amount": 199.35
}
```
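
In the example above, net and VAT add up to the total (167.52 + 31.83 = 199.35). That relationship can serve as a cheap sanity check on extracted results. This is a sketch under the assumption that invoices follow net + VAT = total; `amountsConsistent` and `Amounts` are hypothetical names, and invoices with discounts or other adjustments may legitimately fail the check.

```typescript
interface Amounts {
  net_amount?: number | null;
  vat_amount?: number | null;
  total_amount: number;
}

// Returns true when net + VAT matches the total within `tolerance`,
// or when net or VAT is absent and the check cannot be made.
function amountsConsistent(inv: Amounts, tolerance = 0.01): boolean {
  if (inv.net_amount == null || inv.vat_amount == null) return true;
  const diff = Math.abs(inv.net_amount + inv.vat_amount - inv.total_amount);
  return diff <= tolerance + 1e-9; // Epsilon absorbs floating-point noise
}

console.log(amountsConsistent({ net_amount: 167.52, vat_amount: 31.83, total_amount: 199.35 })); // true
console.log(amountsConsistent({ net_amount: 167.52, vat_amount: 31.83, total_amount: 189.35 })); // false
```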
## Test Results
Tested on 46 real invoices from various vendors:

| Metric | Value |
|--------|-------|
| **Accuracy** | 91.3% (42/46) |
| **Avg Time** | 42.7s per invoice |
| **Consensus Rate** | 85% in 2 passes |
### Per-Vendor Results
| Vendor | Invoices | Accuracy |
|--------|----------|----------|
| Hetzner | 3 | 100% |
| DigitalOcean | 4 | 100% |
| Adobe | 3 | 100% |
| Cloudflare | 1 | 100% |
| Wasabi | 4 | 100% |
| Figma | 3 | 100% |
| Google Cloud | 1 | 100% |
| MongoDB | 3 | 0% (date parsing) |
## Hardware Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| PaddleOCR (CPU) | 4GB RAM | 8GB RAM |
| MiniCPM-V (GPU) | 10GB VRAM | 12GB VRAM |
| MiniCPM-V (CPU) | 16GB RAM | 32GB RAM |
## Tips
1. **Use hybrid approach**: OCR context helps most with numbers and dates (91%+ hybrid vs. 85-90% VLM only in the comparison above)
2. **Consensus voting**: Run 2-5 passes to catch hallucinations
3. **200 DPI is a good default**: Lower resolutions lose detail and higher ones rarely help; retry at 300 DPI only if digits go missing
4. **PNG over JPEG**: Preserves text clarity
5. **Temperature 0.1**: Low temperature for consistent output
6. **Multi-page support**: Pass all pages in single request for context
7. **Normalize for comparison**: Ignore case/whitespace when comparing invoice numbers
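
The normalization in tip 7 can be sketched as follows. `normalizeInvoiceNumber` and `sameInvoiceNumber` are hypothetical helpers; stripping all whitespace and uppercasing is one possible rule, and it could also be applied to the invoice number before hashing results for consensus.

```typescript
// Collapse case and whitespace so "inv-2024 001234" and "INV-2024001234" compare equal.
function normalizeInvoiceNumber(value: string): string {
  return value.replace(/\s+/g, '').toUpperCase();
}

function sameInvoiceNumber(a: string, b: string): boolean {
  return normalizeInvoiceNumber(a) === normalizeInvoiceNumber(b);
}

console.log(sameInvoiceNumber('inv-2024-001234', ' INV-2024-001234 ')); // true
console.log(sameInvoiceNumber('INV-2024-001234', 'INV-2024-001235')); // false
```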
## Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Wrong date | Multiple dates on invoice | Be specific in prompt about which date |
| Wrong currency | Symbol vs code mismatch | OCR helps disambiguate |
| Missing digits | Low resolution | Increase density to 300 DPI |
| Hallucinated data | VLM uncertainty | Use consensus voting |
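
For date problems in particular, it can help to verify that `invoice_date` really is a valid `YYYY-MM-DD` calendar date before accepting a result. The sketch below uses a hypothetical `isIsoDate` helper; it only checks the format and calendar validity, not whether the model picked the right date from the invoice.

```typescript
// True only for a real calendar date written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const [year, month, day] = match.slice(1).map(Number);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Date.UTC rolls invalid dates over (Feb 30 -> Mar 1), so compare the parts back
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isIsoDate('2024-08-15')); // true
console.log(isIsoDate('15.08.2024')); // false
console.log(isIsoDate('2024-02-30')); // false
```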