feat(docker-images): add vLLM-based Nanonets-OCR2-3B image, Qwen3-VL Ollama image and refactor build/docs/tests to use new runtime/layout

2026-01-19 21:05:51 +00:00
parent b58bcabc76
commit 08728ada4d
14 changed files with 1492 additions and 1126 deletions


@@ -2,12 +2,18 @@
## Architecture
This project uses **Ollama** and **vLLM** as runtime frameworks for serving AI models:
### Ollama-based Images (MiniCPM-V, Qwen3-VL)
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU auto-detection
### vLLM-based Images (Nanonets-OCR)
- High-performance inference server
- OpenAI-compatible API
- Optimized for VLM workloads
## Model Details
@@ -24,18 +30,24 @@ This project uses **Ollama** as the runtime framework for serving AI models. Thi
|------|---------------|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow
### Ollama-based containers
1. `docker-entrypoint.sh` starts Ollama server in background
2. Waits for server to be ready
3. Checks if model already exists in volume
4. Pulls model if not present
5. Keeps container running
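A minimal sketch of what this entrypoint flow can look like (illustrative only; the actual `docker-entrypoint.sh` in the repo may differ, and `MODEL_NAME` is assumed to be provided as an environment variable):

```bash
#!/usr/bin/env sh
# Illustrative sketch only -- the real docker-entrypoint.sh may differ.
set -e

# MODEL_NAME is assumed to be provided via the image or environment
MODEL_NAME="${MODEL_NAME:?MODEL_NAME must be set}"

# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait for the server to be ready
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 2
done

# 3./4. Pull the model only if it is not already present in the volume
if ! ollama list | grep -q "$MODEL_NAME"; then
  ollama pull "$MODEL_NAME"
fi

# 5. Keep the container running by waiting on the server process
wait
```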
### vLLM-based containers
1. vLLM server starts with model auto-download
2. Health check endpoint available at `/health`
3. OpenAI-compatible API at `/v1/chat/completions`
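A quick smoke test against a running vLLM container might look like this (assuming the server listens on vLLM's default port 8000; the model name is a placeholder):

```bash
# Health check (assumed default vLLM port 8000)
curl -sf http://localhost:8000/health && echo "vLLM server is up"

# Minimal OpenAI-style request; the model name below is a placeholder
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanonets/Nanonets-OCR2-3B",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 32
  }'
```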
## Volume Persistence
### Ollama volumes
Mount `/root/.ollama` to persist downloaded models:
@@ -44,9 +56,16 @@ Mount `/root/.ollama` to persist downloaded models:
Without this volume, the model will be re-downloaded on each container start (~5GB download).
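For illustration, a typical run command with the persistent volume might look like this (volume and image names are placeholders):

```bash
# Illustrative -- volume and image names are placeholders
docker run -d --gpus all \
  -p 11434:11434 \
  -v ollama-models:/root/.ollama \
  <image-name>
```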
### vLLM/HuggingFace volumes
Mount `/root/.cache/huggingface` for model caching:
```bash
-v hf-cache:/root/.cache/huggingface
```
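An illustrative run command for a vLLM-based image with the HuggingFace cache volume (the image name is a placeholder):

```bash
# Illustrative -- the image name is a placeholder
docker run -d --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  <image-name>
```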
## API Endpoints
The exposed endpoints depend on which runtime serves the model:
### Ollama API (MiniCPM-V, Qwen3-VL)
| Endpoint | Method | Description |
|----------|--------|-------------|
@@ -56,192 +75,23 @@ All endpoints follow the Ollama API specification:
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
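For example, pulling and then inspecting a model through the API could look like this (the model name is a placeholder; see the Ollama docs for the full request schemas):

```bash
# Model name is a placeholder
curl http://localhost:11434/api/pull -d '{"model": "minicpm-v"}'
curl http://localhost:11434/api/show -d '{"model": "minicpm-v"}'
```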
### vLLM API (Nanonets-OCR)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
## GPU Detection
The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:
```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```
This forces Ollama to use CPU inference even if a GPU is available.
## Health Checks
All containers include Docker health checks:
```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:11434/api/tags || exit 1
```
The CPU variant has a longer `start-period` (120s) because of its slower startup.
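For the vLLM-based images, the same pattern would presumably target the `/health` endpoint instead (sketch, assuming the server's default port 8000):

```bash
# Assumed equivalent check for a vLLM container (default port 8000)
curl -f http://localhost:8000/health || exit 1
```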
## PaddleOCR-VL (Recommended)
### Overview
PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.
**Key advantages over PP-Structure:**
- Native table understanding (no HTML parsing needed)
- 109 language support
- Better handling of complex multi-row tables
- Structured Markdown/JSON output
### Docker Images
| Tag | Description |
|-----|-------------|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |
### API Endpoints (OpenAI-compatible)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |
### Request/Response Format
**POST /v1/chat/completions (OpenAI-compatible)**
```json
{
"model": "paddleocr-vl",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
{"type": "text", "text": "Table Recognition:"}
]
}
],
"temperature": 0.0,
"max_tokens": 8192
}
```
**Task Prompts:**
- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction
**Response**
```json
{
"id": "chatcmpl-...",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
},
"finish_reason": "stop"
}
]
}
```
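A shell-level example of the same request, sending a base64-encoded page image with the `Table Recognition:` task prompt (illustrative; assumes the server defaults listed under Environment Variables below):

```bash
# Illustrative; assumes the defaults below (port 8000, model paddleocr-vl)
IMG_B64=$(base64 -w0 page.png)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"paddleocr-vl\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,${IMG_B64}\"}},
        {\"type\": \"text\", \"text\": \"Table Recognition:\"}
      ]
    }],
    \"temperature\": 0.0,
    \"max_tokens\": 8192
  }"
```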
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batch tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |
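An illustrative run command overriding some of these defaults (the image tag is a placeholder):

```bash
# Illustrative -- the image tag is a placeholder
docker run -d --gpus all -p 8000:8000 \
  -e MODEL_NAME=PaddlePaddle/PaddleOCR-VL \
  -e GPU_MEMORY_UTILIZATION=0.8 \
  -e MAX_BATCHED_TOKENS=16384 \
  -v hf-cache:/root/.cache/huggingface \
  <paddleocr-vl-image>
```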
### Performance
- **GPU (vLLM)**: ~2-5 seconds per page
- **CPU**: ~30-60 seconds per page
---
## Adding New Models
To add a new model variant:
1. Create `Dockerfile_<modelname>`
2. Set `MODEL_NAME` environment variable
3. Update `build-images.sh` with new build target
4. Add documentation to `readme.md`
## Troubleshooting
### Model download hangs
Check container logs:
```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory
- GPU: Use int4 quantized version or add more VRAM
- CPU: Increase container memory limit: `--memory=16g`
### API not responding
1. Check if container is healthy: `docker ps`
2. Check logs for errors: `docker logs <container>`
3. Verify port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration
Build and push using npmci:
```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Multi-Pass Extraction Strategy
The bank statement extraction uses a dual-VLM consensus approach:
### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |
### Extraction Strategy
1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2 → Done (fast path)
4. **Pass 3+**: MiniCPM-V visual if no consensus
### Why Dual-VLM Works
- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when VLMs agree
### Comparison vs Old PP-Structure Approach
| Approach | Bank Statement Result | Issue |
|----------|----------------------|-------|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.
---
## Nanonets-OCR-s
@@ -254,7 +104,7 @@ Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~10GB VRAM
### Docker Images
@@ -305,7 +155,7 @@ Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
### Performance
- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~10GB
### Two-Stage Pipeline (Nanonets + Qwen3)
@@ -332,6 +182,76 @@ docker start minicpm-test
---
## Multi-Pass Extraction Strategy
The bank statement extraction uses a dual-VLM consensus approach:
### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **Nanonets-OCR-s** | ~4B params | Document OCR with semantic output |
### Extraction Strategy
1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: Nanonets-OCR semantic extraction (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2 → Done (fast path)
4. **Pass 3+**: MiniCPM-V visual if no consensus
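A minimal sketch of the consensus check (illustrative; `extract_minicpm` and `extract_nanonets` stand in for project-specific helpers that turn a page image into normalized transaction JSON):

```bash
# Sketch of the two-pass consensus check. extract_minicpm and extract_nanonets
# are stand-ins for project-specific helpers that return normalized JSON.
PAGE=page-1.png

pass1=$(extract_minicpm "$PAGE")    # Pass 1: MiniCPM-V via Ollama
pass2=$(extract_nanonets "$PAGE")   # Pass 2: Nanonets-OCR via vLLM

# Sort keys with jq so formatting differences don't break the comparison
if [ "$(echo "$pass1" | jq -S .)" = "$(echo "$pass2" | jq -S .)" ]; then
  echo "Consensus reached after two passes"
else
  echo "No consensus -- run additional MiniCPM-V passes"
fi
```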
### Why Dual-VLM Works
- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: Nanonets-OCR-s optimized for document structure, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when VLMs agree
---
## Adding New Models
To add a new model variant:
1. Create `Dockerfile_<modelname>_<runtime>_<hardware>_VRAM<size>`
2. Set `MODEL_NAME` environment variable
3. Update `build-images.sh` with new build target
4. Add documentation to `readme.md`
## Troubleshooting
### Model download hangs
Check container logs:
```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory
- GPU: Use a lighter model variant or a GPU with more VRAM
- If a single GPU is not enough, consider a multi-GPU setup
### API not responding
1. Check if container is healthy: `docker ps`
2. Check logs for errors: `docker logs <container>`
3. Verify port mapping: `curl localhost:11434/api/tags` (Ollama) or the vLLM `/health` endpoint
## CI/CD Integration
Build and push using npmci:
```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
---
## Related Resources
- [Ollama Documentation](https://ollama.ai/docs)