update
117
readme.hints.md
@@ -77,56 +77,73 @@ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \

 CPU variant has longer `start-period` (120s) due to slower startup.

-## PaddleOCR
+## PaddleOCR-VL (Recommended)

 ### Overview

-PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It provides:
+PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

-- Text detection and recognition
-- Multi-language support
-- FastAPI REST API
-- GPU and CPU variants
+**Key advantages over PP-Structure:**
+- Native table understanding (no HTML parsing needed)
+- 109 language support
+- Better handling of complex multi-row tables
+- Structured Markdown/JSON output

 ### Docker Images

 | Tag | Description |
 |-----|-------------|
-| `paddleocr` | GPU variant (default) |
-| `paddleocr-gpu` | GPU variant (alias) |
-| `paddleocr-cpu` | CPU-only variant |
+| `paddleocr-vl` | GPU variant using vLLM (recommended) |
+| `paddleocr-vl-cpu` | CPU variant using transformers |

-### API Endpoints
+### API Endpoints (OpenAI-compatible)

 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/health` | GET | Health check with model info |
-| `/ocr` | POST | OCR with base64 image (JSON body) |
-| `/ocr/upload` | POST | OCR with file upload (multipart form) |
+| `/v1/models` | GET | List available models |
+| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
+| `/ocr` | POST | Legacy OCR endpoint |

 ### Request/Response Format

-**POST /ocr (JSON)**
+**POST /v1/chat/completions (OpenAI-compatible)**
 ```json
 {
-  "image": "<base64-encoded-image>",
-  "language": "en" // optional
+  "model": "paddleocr-vl",
+  "messages": [
+    {
+      "role": "user",
+      "content": [
+        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
+        {"type": "text", "text": "Table Recognition:"}
+      ]
+    }
+  ],
+  "temperature": 0.0,
+  "max_tokens": 8192
 }
 ```

-**POST /ocr/upload (multipart)**
-- `img`: image file
-- `language`: optional language code
+**Task Prompts:**
+- `"OCR:"` - Text recognition
+- `"Table Recognition:"` - Table extraction (returns markdown)
+- `"Formula Recognition:"` - Formula extraction
+- `"Chart Recognition:"` - Chart extraction
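
A minimal sketch of how a client might assemble the request body shown in this hunk and read the result back. The function names `build_request` and `extract_content` and the keys of `TASK_PROMPTS` are hypothetical; only the payload fields, the model name, and the four prompt strings come from the README, and the response path assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import base64
import json

# Prompt strings are taken from the README; the short keys are hypothetical labels.
TASK_PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def build_request(image_bytes: bytes, task: str, mime: str = "image/png") -> dict:
    """Build a /v1/chat/completions body for one image and one task prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "paddleocr-vl",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The image travels inline as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                    # The text part selects the task.
                    {"type": "text", "text": TASK_PROMPTS[task]},
                ],
            }
        ],
        "temperature": 0.0,
        "max_tokens": 8192,
    }


def extract_content(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


if __name__ == "__main__":
    body = build_request(b"placeholder-bytes", "table")
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as a POST to `/v1/chat/completions` with any HTTP client; the sketch leaves transport out on purpose.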

 **Response**
 ```json
 {
-  "success": true,
-  "results": [
+  "id": "chatcmpl-...",
+  "object": "chat.completion",
+  "choices": [
     {
-      "text": "Invoice #12345",
-      "confidence": 0.98,
-      "box": [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
+      },
+      "finish_reason": "stop"
     }
   ]
 }
@@ -136,19 +153,16 @@ PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It pr

 | Variable | Default | Description |
 |----------|---------|-------------|
-| `OCR_LANGUAGE` | `en` | Default language for OCR |
-| `SERVER_PORT` | `5000` | Server port |
-| `SERVER_HOST` | `0.0.0.0` | Server host |
-| `CUDA_VISIBLE_DEVICES` | (auto) | Set to `-1` for CPU-only |
+| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
+| `HOST` | `0.0.0.0` | Server host |
+| `PORT` | `8000` | Server port |
+| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batch tokens |
+| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |

 ### Performance

-- **GPU**: ~1-3 seconds per page
-- **CPU**: ~10-30 seconds per page
-
-### Supported Languages
-
-Common language codes: `en` (English), `ch` (Chinese), `de` (German), `fr` (French), `es` (Spanish), `ja` (Japanese), `ko` (Korean)
+- **GPU (vLLM)**: ~2-5 seconds per page
+- **CPU**: ~30-60 seconds per page

 ---

@@ -193,6 +207,43 @@ npmci docker build
 npmci docker push code.foss.global
 ```

+## Multi-Pass Extraction Strategy
+
+The bank statement extraction uses a dual-VLM consensus approach:
+
+### Architecture: Dual-VLM Consensus
+
+| VLM | Size | Purpose |
+|-----|------|---------|
+| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
+| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |
+
+### Extraction Strategy
+
+1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
+2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
+3. **Consensus**: If Pass 1 == Pass 2 → Done (fast path)
+4. **Pass 3+**: Additional MiniCPM-V visual passes if Pass 1 and Pass 2 disagree
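
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the names `parse_markdown_table`, `normalize`, and `extract_with_consensus` are hypothetical, the row fields (`date`, `description`, `amount`) are borrowed from the sample response, and the rule that a later pass "wins" once it matches any earlier pass is a guess at what consensus means after Pass 2.

```python
from typing import Callable, List, Optional

Row = dict  # one transaction, e.g. {"date": ..., "description": ..., "amount": ...}


def _cells(line: str) -> List[str]:
    """Split one markdown table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(markdown: str) -> List[Row]:
    """Turn a markdown table into row dicts keyed by lowercased header."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    headers = [h.lower() for h in _cells(lines[0])]
    rows = []
    for line in lines[1:]:
        cells = _cells(line)
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(headers, cells)))
    return rows


def normalize(rows: List[Row]) -> list:
    """Reduce rows to comparable tuples so formatting differences do not block consensus."""
    return sorted(
        (
            str(r.get("date", "")).strip(),
            str(r.get("description", "")).strip().upper(),
            f"{float(r.get('amount', 0)):.2f}",
        )
        for r in rows
    )


def extract_with_consensus(
    visual_pass: Callable[[], List[Row]],  # stands in for MiniCPM-V (Pass 1, Pass 3+)
    table_pass: Callable[[], List[Row]],   # stands in for PaddleOCR-VL (Pass 2)
    max_extra_passes: int = 2,             # assumed cap; the README gives no limit
) -> Optional[List[Row]]:
    first = visual_pass()
    second = table_pass()
    if normalize(first) == normalize(second):
        return first  # fast path: both VLMs agree
    seen = [first, second]
    for _ in range(max_extra_passes):
        candidate = visual_pass()
        # Assumption: agreement with any earlier pass counts as consensus.
        if any(normalize(candidate) == normalize(prev) for prev in seen):
            return candidate
        seen.append(candidate)
    return None  # no consensus reached
```

Comparing normalized tuples rather than raw dictionaries is a design choice of this sketch: it lets `"Gitlab Inc"` and `"GITLAB INC"`, or `-119.96` and `"-119.96"`, count as the same transaction.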
+
+### Why Dual-VLM Works
+
+- **Different architectures**: Two independent models cross-check each other
+- **Specialized strengths**: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
+- **No structure loss**: Both VLMs see the original images directly
+- **Fast consensus**: Most documents complete in 2 passes when VLMs agree
+
+### Comparison vs Old PP-Structure Approach
+
+| Approach | Bank Statement Result | Issue |
+|----------|----------------------|-------|
+| MiniCPM-V Visual | 28 transactions ✓ | - |
+| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
+| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
+
+**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.
+
+---
+
 ## Related Resources

 - [Ollama Documentation](https://ollama.ai/docs)