feat(docker-images): add vLLM-based Nanonets-OCR2-3B image and Qwen3-VL Ollama image; refactor build/docs/tests to use new runtime/layout
readme.hints.md
@@ -2,12 +2,18 @@

## Architecture

This project uses **Ollama** as the runtime framework for serving AI models. This provides:
This project uses **Ollama** and **vLLM** as runtime frameworks for serving AI models:

### Ollama-based Images (MiniCPM-V, Qwen3-VL)
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection
- GPU auto-detection

### vLLM-based Images (Nanonets-OCR)
- High-performance inference server
- OpenAI-compatible API
- Optimized for VLM workloads

## Model Details

@@ -24,18 +30,24 @@ This project uses **Ollama** as the runtime framework for serving AI models. Thi
|------|---------------|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |

## Container Startup Flow

### Ollama-based containers
1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks whether the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
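
A minimal sketch of this entrypoint flow (illustrative only; the model name and wait-loop details are assumptions, not the shipped script):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative default; the real image sets MODEL_NAME at build time
MODEL_NAME="${MODEL_NAME:-minicpm-v}"

# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait until the API answers
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 2
done

# 3./4. Pull the model only if it is not already present in the volume
if ! ollama list | grep -q "$MODEL_NAME"; then
  ollama pull "$MODEL_NAME"
fi

# 5. Keep the container running by waiting on the server process
wait
```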

### vLLM-based containers
1. vLLM server starts with model auto-download
2. Health check endpoint available at `/health`
3. OpenAI-compatible API at `/v1/chat/completions`
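
For example, readiness can be verified from the host once such a container is up (port 8000 is an assumption; adjust to your mapping):

```bash
# Probe the health endpoint, then list the served model
curl -sf http://localhost:8000/health && echo "vLLM is ready"
curl -s http://localhost:8000/v1/models
```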

## Volume Persistence

### Ollama volumes
Mount `/root/.ollama` to persist downloaded models:

```bash
@@ -44,9 +56,16 @@ Mount `/root/.ollama` to persist downloaded models:

Without this volume, the model will be re-downloaded on each container start (~5GB download).

### vLLM/HuggingFace volumes
Mount `/root/.cache/huggingface` for model caching:

```bash
-v hf-cache:/root/.cache/huggingface
```
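
For instance, a full run command combining the cache volume with a port mapping might look like this (image name and port are illustrative assumptions):

```bash
# Persist the HuggingFace cache across container restarts
docker run -d --name nanonets-ocr \
  --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  <nanonets-ocr-image>
```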

## API Endpoints

All endpoints follow the Ollama API specification:
### Ollama API (MiniCPM-V, Qwen3-VL)

| Endpoint | Method | Description |
|----------|--------|-------------|
@@ -56,192 +75,23 @@ All endpoints follow the Ollama API specification:
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
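
For example, querying the Ollama API from the host (the model name is an illustrative assumption):

```bash
# List models that are already pulled
curl -s http://localhost:11434/api/tags

# One-shot generation request
curl -s http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this document.",
  "stream": false
}'
```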

## GPU Detection
### vLLM API (Nanonets-OCR)

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
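
The same override also works at run time, without rebuilding the image (image name is a placeholder):

```bash
# Force CPU inference for a single container run
docker run -d -e CUDA_VISIBLE_DEVICES="" -p 11434:11434 <minicpm-image>
```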

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
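
A sample request against this API (the served model id is an assumption; query `/v1/models` for the actual value):

```bash
# OpenAI-compatible chat completion via the vLLM server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanonets/Nanonets-OCR2-3B",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64
  }'
```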

## Health Checks

Both variants include Docker health checks:
All containers include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant has a longer `start-period` (120s) due to slower startup.
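
The check above targets the Ollama port; for the vLLM images the equivalent probe goes to `/health` instead (port 8000 assumed):

```bash
# Manual equivalent of a vLLM container health check
curl -f http://localhost:8000/health || echo "not healthy yet"
```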

## PaddleOCR-VL (Recommended)

### Overview

PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

**Key advantages over PP-Structure:**
- Native table understanding (no HTML parsing needed)
- Support for 109 languages
- Better handling of complex multi-row tables
- Structured Markdown/JSON output

### Docker Images

| Tag | Description |
|-----|-------------|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |

### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**
```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```

**Task Prompts:**
- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction

**Response**
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batch tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |
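
These can be overridden per container, for example (image name is a placeholder):

```bash
# Run on a custom port with a smaller GPU memory budget
docker run -d --gpus all \
  -e PORT=8080 \
  -e GPU_MEMORY_UTILIZATION=0.7 \
  -p 8080:8080 \
  -v hf-cache:/root/.cache/huggingface \
  <paddleocr-vl-image>
```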

### Performance

- **GPU (vLLM)**: ~2-5 seconds per page
- **CPU**: ~30-60 seconds per page

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`

## Troubleshooting

### Model download hangs

Check the container logs:
```bash
docker logs -f <container-name>
```

The model download is ~5GB and may take several minutes.

### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2 → done (fast path)
4. **Pass 3+**: Additional MiniCPM-V visual passes if there is no consensus

### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: PaddleOCR-VL is optimized for tables, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

### Comparison vs. the Old PP-Structure Approach

| Approach | Bank Statement Result | Issue |
|----------|----------------------|-------|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |

**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.

---

## Nanonets-OCR-s

@@ -254,7 +104,7 @@ Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)
- ~10GB VRAM

### Docker Images

@@ -305,7 +155,7 @@ Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.

### Performance

- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~8-10GB
- **VRAM usage**: ~10GB

### Two-Stage Pipeline (Nanonets + Qwen3)

@@ -332,6 +182,76 @@ docker start minicpm-test

---

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **Nanonets-OCR-s** | ~4B params | Document OCR with semantic output |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: Nanonets-OCR semantic extraction (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2 → done (fast path)
4. **Pass 3+**: Additional MiniCPM-V visual passes if there is no consensus (see the sketch below)
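
A shell sketch of this consensus check, assuming both servers are reachable and two hypothetical helper scripts (`extract_minicpm.sh`, `extract_nanonets.sh`) that each print normalized JSON for a page image:

```bash
# Pass 1 and Pass 2: run both extractors on the same page image
pass1=$(./extract_minicpm.sh page.png)    # hypothetical MiniCPM-V helper
pass2=$(./extract_nanonets.sh page.png)   # hypothetical Nanonets-OCR helper

# Consensus check: compare after sorting keys with jq
if [ "$(echo "$pass1" | jq -S .)" = "$(echo "$pass2" | jq -S .)" ]; then
  echo "$pass1"                           # fast path: both models agree
else
  echo "no consensus, scheduling additional MiniCPM-V passes" >&2
fi
```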

### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: Nanonets-OCR-s is optimized for document structure, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>_<runtime>_<hardware>_VRAM<size>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md` (an example build command follows this list)
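
For example, a variant following this naming scheme could be built like so (the file name below is hypothetical):

```bash
# Build a hypothetical GPU variant that needs ~10GB of VRAM
docker build -f Dockerfile_nanonets_vllm_gpu_VRAM10 -t nanonets-ocr:latest .
```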

## Troubleshooting

### Model download hangs

Check the container logs:
```bash
docker logs -f <container-name>
```

The model download is ~5GB and may take several minutes.

### Out of memory

- GPU: Use a lighter model variant or upgrade VRAM
- Need more GPU memory: consider a multi-GPU setup

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)