# Technical Notes - ht-docker-ai

## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with the OpenAI format; example below)
- Built-in quantization support
- GPU/CPU auto-detection
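For example, once a container is up, the same server answers both native Ollama calls and OpenAI-style calls. A minimal check, assuming the container publishes Ollama's default port 11434:

```bash
# Native Ollama endpoint: list locally available models
curl http://localhost:11434/api/tags

# OpenAI-compatible endpoint served by the same Ollama instance
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "minicpm-v", "messages": [{"role": "user", "content": "Hello"}]}'
```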
## Model Details

### MiniCPM-V 4.5
- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
### VRAM Usage
| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

`docker-entrypoint.sh` performs the following steps (sketched below):

1. Starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks if the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
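A minimal sketch of that flow, assuming the entrypoint uses the standard `ollama` CLI; the actual script may differ in details:

```bash
#!/bin/sh
# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait until the API answers
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# 3./4. Pull the model only if it is not already in the volume
if ! ollama list | grep -q "minicpm-v"; then
  ollama pull minicpm-v
fi

# 5. Keep the container alive on the server process
wait
```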
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
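A typical run command with the named volume in place; the image reference is illustrative:

```bash
docker run -d --name ollama-ai \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <registry>/ht-docker-ai:latest
```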
## API Endpoints

All endpoints follow the Ollama API specification (example request after the table):
| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
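A representative non-streaming request against the native API; the base64 payload is a placeholder:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this image.",
  "images": ["<base64-encoded image>"],
  "stream": false
}'
```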
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
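The same choice can also be made at run time rather than baked into the image; the flags below are standard Docker, and the image names are illustrative:

```bash
# CPU-only, even on a GPU host (mirrors the ENV line above)
docker run -d -p 11434:11434 -e CUDA_VISIBLE_DEVICES="" <image>

# GPU variant: pass GPUs through so Ollama can detect them
docker run -d -p 11434:11434 --gpus all <image>
```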
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
The CPU variant has a longer start-period (120s) due to slower startup.
## PaddleOCR-VL (Recommended)

### Overview
PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.
Key advantages over PP-Structure:
- Native table understanding (no HTML parsing needed)
- 109 language support
- Better handling of complex multi-row tables
- Structured Markdown/JSON output
### Docker Images

| Tag | Description |
|---|---|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |
### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |
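A quick smoke test of the two GET endpoints, assuming the default port 8000 is published:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
```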
### Request/Response Format

**POST /v1/chat/completions** (OpenAI-compatible):
```json
{
"model": "paddleocr-vl",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
{"type": "text", "text": "Table Recognition:"}
]
}
],
"temperature": 0.0,
"max_tokens": 8192
}
```
Task prompts (example request below):

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction
**Response:**
```json
{
"id": "chatcmpl-...",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
},
"finish_reason": "stop"
}
]
}
```
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batched tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory utilization (0-1) |
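Defaults can be overridden per container; the image tag follows the Docker Images table above, and the registry prefix is illustrative:

```bash
docker run -d --gpus all -p 8000:8000 \
  -e MAX_BATCHED_TOKENS=8192 \
  -e GPU_MEMORY_UTILIZATION=0.8 \
  <registry>/ht-docker-ai:paddleocr-vl
```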
### Performance
- GPU (vLLM): ~2-5 seconds per page
- CPU: ~30-60 seconds per page
## Adding New Models

To add a new model variant (example commands below):

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`
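A hypothetical flow for a new `foo` variant following those steps (all names are placeholders, and the real `build-images.sh` target may look different):

```bash
docker build -f Dockerfile_foo -t ht-docker-ai:foo .
docker run -d -p 8000:8000 -e MODEL_NAME=<hf-org>/<model> ht-docker-ai:foo
```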
## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`
### API not responding

1. Check if the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|---|---|---|
| MiniCPM-V 4.5 | 8B params | Primary visual extraction |
| PaddleOCR-VL | 0.9B params | Table-specialized extraction |
### Extraction Strategy
- Pass 1: MiniCPM-V visual extraction (images → JSON)
- Pass 2: PaddleOCR-VL table recognition (images → markdown → JSON)
- Consensus: If Pass 1 == Pass 2 → Done (fast path; see the sketch below)
- Pass 3+: MiniCPM-V visual if no consensus
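A minimal sketch of the consensus check, assuming both passes have already written their JSON transaction lists to disk (file names are hypothetical):

```bash
# jq -S normalizes key order so semantically equal JSON compares equal
if [ "$(jq -S . pass1_minicpm.json)" = "$(jq -S . pass2_paddleocr.json)" ]; then
  echo "Consensus reached after 2 passes"        # fast path
else
  echo "No consensus; scheduling pass 3 (MiniCPM-V visual)"
fi
```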
### Why Dual-VLM Works
- Different architectures: Two independent models cross-check each other
- Specialized strengths: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
- No structure loss: Both VLMs see the original images directly
- Fast consensus: Most documents complete in 2 passes when VLMs agree
### Comparison vs Old PP-Structure Approach
| Approach | Bank Statement Result | Issue |
|---|---|---|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
Key insight: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.