update

2026-01-16 16:21:44 +00:00
parent 3c5cf578a5
commit 15ac1fcf67
13 changed files with 873 additions and 805 deletions
--- a/readme.hints.md
+++ b/readme.hints.md
@@ -77,56 +77,73 @@ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \

 CPU variant has longer `start-period` (120s) due to slower startup.

-## PaddleOCR
+## PaddleOCR-VL (Recommended)

 ### Overview

-PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It provides:
+PaddleOCR-VL is a 0.9B-parameter vision-language model (VLM) optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding of the page.

- Text detection and recognition
- Multi-language support
- FastAPI REST API
- GPU and CPU variants
+**Key advantages over PP-Structure:**
+- Native table understanding (no HTML parsing needed)
+- Support for 109 languages
+- Better handling of complex multi-row tables
+- Structured Markdown/JSON output

 ### Docker Images

 | Tag | Description |
 |-----|-------------|
-| `paddleocr` | GPU variant (default) |
-| `paddleocr-gpu` | GPU variant (alias) |
-| `paddleocr-cpu` | CPU-only variant |
+| `paddleocr-vl` | GPU variant using vLLM (recommended) |
+| `paddleocr-vl-cpu` | CPU variant using transformers |

-### API Endpoints
+### API Endpoints (OpenAI-compatible)

 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/health` | GET | Health check with model info |
-| `/ocr` | POST | OCR with base64 image (JSON body) |
-| `/ocr/upload` | POST | OCR with file upload (multipart form) |
+| `/v1/models` | GET | List available models |
+| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
+| `/ocr` | POST | Legacy OCR endpoint |

 ### Request/Response Format

-**POST /ocr (JSON)**
+**POST /v1/chat/completions (OpenAI-compatible)**
 ```json
 {
-  "image": "<base64-encoded-image>",
-  "language": "en"  // optional
+  "model": "paddleocr-vl",
+  "messages": [
+    {
+      "role": "user",
+      "content": [
+        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
+        {"type": "text", "text": "Table Recognition:"}
+      ]
+    }
+  ],
+  "temperature": 0.0,
+  "max_tokens": 8192
 }
 ```

-**POST /ocr/upload (multipart)**
- `img`: image file
- `language`: optional language code
+**Task Prompts:**
+- `"OCR:"` - Text recognition
+- `"Table Recognition:"` - Table extraction (returns markdown)
+- `"Formula Recognition:"` - Formula extraction
+- `"Chart Recognition:"` - Chart extraction
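The request body and task prompts above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; `build_request` and `TASK_PROMPTS` are hypothetical names introduced for illustration, and the payload simply mirrors the JSON shown above.

```python
import base64

# Task prompts as listed in the README; the dictionary keys are illustrative.
TASK_PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def build_request(image_bytes: bytes, task: str = "table", mime: str = "image/png") -> dict:
    """Build a /v1/chat/completions payload for a single image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "paddleocr-vl",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The image is sent inline as a base64 data URL.
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                    # The task prompt selects what the model should extract.
                    {"type": "text", "text": TASK_PROMPTS[task]},
                ],
            }
        ],
        "temperature": 0.0,
        "max_tokens": 8192,
    }
```

The resulting dictionary can be serialized with `json.dumps` and posted to `/v1/chat/completions` with any HTTP client.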

 **Response**
 ```json
 {
-  "success": true,
-  "results": [
+  "id": "chatcmpl-...",
+  "object": "chat.completion",
+  "choices": [
    {
-      "text": "Invoice #12345",
-      "confidence": 0.98,
-      "box": [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
+      },
+      "finish_reason": "stop"
    }
  ]
 }
@@ -136,19 +153,16 @@ PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It pr

 | Variable | Default | Description |
 |----------|---------|-------------|
-| `OCR_LANGUAGE` | `en` | Default language for OCR |
-| `SERVER_PORT` | `5000` | Server port |
-| `SERVER_HOST` | `0.0.0.0` | Server host |
-| `CUDA_VISIBLE_DEVICES` | (auto) | Set to `-1` for CPU-only |
+| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
+| `HOST` | `0.0.0.0` | Server host |
+| `PORT` | `8000` | Server port |
+| `MAX_BATCHED_TOKENS` | `16384` | Maximum number of tokens vLLM batches per step |
+| `GPU_MEMORY_UTILIZATION` | `0.9` | Fraction of GPU memory vLLM may use (0-1) |

 ### Performance

- **GPU**: ~1-3 seconds per page
- **CPU**: ~10-30 seconds per page
-
-### Supported Languages
-
-Common language codes: `en` (English), `ch` (Chinese), `de` (German), `fr` (French), `es` (Spanish), `ja` (Japanese), `ko` (Korean)
+- **GPU (vLLM)**: ~2-5 seconds per page
+- **CPU**: ~30-60 seconds per page

 ---

@@ -193,6 +207,43 @@ npmci docker build
 npmci docker push code.foss.global
 ```

+## Multi-Pass Extraction Strategy
+
+Bank statement extraction uses a dual-VLM consensus approach: two models extract the transactions independently, and their results are compared.
+
+### Architecture: Dual-VLM Consensus
+
+| VLM | Model | Purpose |
+|-----|-------|---------|
+| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
+| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |
+
+### Extraction Strategy
+
+1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
+2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
+3. **Consensus**: If the results of Pass 1 and Pass 2 match, extraction is done (fast path)
+4. **Pass 3+**: Further MiniCPM-V visual passes if no consensus is reached
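The control flow of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the extractors are passed in as plain callables, the `date`/`description`/`amount` field names are taken from the sample response table, and both the `max_passes` limit and the rule that a later pass succeeds when it matches any earlier result are assumptions.

```python
from typing import Callable, List, Optional

Transactions = List[dict]


def normalize(transactions: Transactions) -> list:
    """Reduce transactions to a sorted, comparable form (field names are assumed)."""
    return sorted(
        (
            str(t.get("date", "")).strip(),
            " ".join(str(t.get("description", "")).split()).upper(),
            round(float(t.get("amount", 0)), 2),
        )
        for t in transactions
    )


def extract_with_consensus(
    primary: Callable[[], Transactions],  # stands in for MiniCPM-V visual extraction
    table: Callable[[], Transactions],    # stands in for PaddleOCR-VL table recognition
    max_passes: int = 5,                  # hypothetical upper bound on passes
) -> Optional[Transactions]:
    first = primary()   # Pass 1
    second = table()    # Pass 2
    if normalize(first) == normalize(second):
        return first    # fast path: both models agree
    seen = [first, second]
    for _ in range(max_passes - 2):
        candidate = primary()  # Pass 3+: another visual pass
        # Assumed rule: accept a result that matches any earlier pass.
        if any(normalize(candidate) == normalize(prev) for prev in seen):
            return candidate
        seen.append(candidate)
    return None  # no consensus within the pass budget
```

Comparing normalized tuples instead of raw JSON keeps cosmetic differences, such as spacing or letter case in descriptions, from blocking the fast path.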
+
+### Why Dual-VLM Works
+
+- **Different architectures**: Two independent models cross-check each other
+- **Specialized strengths**: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
+- **No structure loss**: Both VLMs see the original images directly
+- **Fast consensus**: Most documents complete in 2 passes when VLMs agree
+
+### Comparison vs Old PP-Structure Approach
+
+| Approach | Bank Statement Result | Issue |
+|----------|----------------------|-------|
+| MiniCPM-V Visual | 28 transactions ✓ | - |
+| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
+| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
+
+**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.
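Because PaddleOCR-VL returns tables as markdown, Pass 2 needs a markdown-to-JSON step. A minimal sketch of such a step is shown below; `parse_markdown_table` is a hypothetical helper, it assumes a simple pipe table with one header row like the sample response, and it does not handle escaped pipes inside cells.

```python
import re
from typing import Dict, List

# Matches separator cells such as "---", ":---" or "---:".
SEPARATOR_CELL = re.compile(r":?-+:?")


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Convert a simple pipe table into a list of row dictionaries."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header = [name.lower() for name in rows[0]]
    records = []
    for cells in rows[1:]:
        # Skip the header separator row (|---|---|---|).
        if all(SEPARATOR_CELL.fullmatch(cell) for cell in cells):
            continue
        records.append(dict(zip(header, cells)))
    return records
```

Values are kept as strings here; converting `amount` to a number is left to the caller so that locale-specific formats can be handled explicitly.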
+
+---
+
 ## Related Resources

 - [Ollama Documentation](https://ollama.ai/docs)