# Technical Notes - ht-docker-ai

## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection
## Model Details

### MiniCPM-V 4.5

- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
### VRAM Usage
| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

- `docker-entrypoint.sh` starts the Ollama server in the background
- Waits for the server to be ready
- Checks if model already exists in volume
- Pulls model if not present
- Keeps container running
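
A simplified sketch of this flow (not the actual `docker-entrypoint.sh`; it assumes the default Ollama port 11434 and the `minicpm-v` model name):

```bash
#!/bin/sh
# Start Ollama in the background, wait for the API, pull the model if missing.
ollama serve &

until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1   # wait until the server accepts requests
done

if ! ollama list | grep -q "minicpm-v"; then
  ollama pull minicpm-v   # only downloaded when not already in the volume
fi

wait   # keep the container running on the ollama serve process
```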
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
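
For example, a typical run command with the volume attached could look like this (the image tag is a placeholder; use the tag you actually built or pulled):

```bash
# Persist pulled models in the named volume "ollama-data" across restarts.
docker run -d \
  --name minicpm \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ht-docker-ai:minicpm   # placeholder image tag
```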
## API Endpoints
All endpoints follow the Ollama API specification:
| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
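
A quick smoke test of the generate endpoint might look like this (assumes the container is published on localhost:11434, the `minicpm-v` model has been pulled, and a local `page.png` exists; `base64 -w0` is the GNU coreutils form):

```bash
# Ask MiniCPM-V to describe a local image via the Ollama generate endpoint.
curl -s http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this image.",
  "images": ["'"$(base64 -w0 page.png)"'"],
  "stream": false
}'
```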
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```
This forces Ollama to use CPU inference even if GPU is available.
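
The same effect can be achieved at run time by overriding the variable, for example (placeholder image tag):

```bash
# Force CPU inference by hiding all CUDA devices from Ollama.
docker run -d -p 11434:11434 \
  -e CUDA_VISIBLE_DEVICES="" \
  ht-docker-ai:minicpm   # placeholder image tag
```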
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
CPU variant has longer start-period (120s) due to slower startup.
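
The state derived from this health check can be inspected on a running container, for example:

```bash
# Print the current health status (starting, healthy, or unhealthy).
docker inspect --format '{{.State.Health.Status}}' <container-name>
```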
## PaddleOCR-VL (Recommended)

### Overview
PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.
Key advantages over PP-Structure:
- Native table understanding (no HTML parsing needed)
- 109 language support
- Better handling of complex multi-row tables
- Structured Markdown/JSON output
### Docker Images

| Tag | Description |
|---|---|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |
### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |
### Request/Response Format

#### POST /v1/chat/completions (OpenAI-compatible)

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```
Task Prompts:

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction
#### Response

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```
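
End to end, a table-recognition request can be sent with curl roughly like this (a sketch assuming the server is published on localhost:8000 and a local `table.png`; adjust port and file name as needed):

```bash
# Send a local image to the PaddleOCR-VL server for table recognition.
IMG=$(base64 -w0 table.png)   # base64-encode the page image (GNU coreutils form)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "paddleocr-vl",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }],
    "temperature": 0.0,
    "max_tokens": 8192
  }'
```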
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batched tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |
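
These can be overridden at container start, for example (placeholder image tag):

```bash
# Run the GPU variant on port 8080 with a lower GPU memory cap.
docker run -d --gpus all \
  -p 8080:8080 \
  -e PORT=8080 \
  -e GPU_MEMORY_UTILIZATION=0.7 \
  ht-docker-ai:paddleocr-vl   # placeholder image tag
```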
### Performance
- GPU (vLLM): ~2-5 seconds per page
- CPU: ~30-60 seconds per page
## Adding New Models

To add a new model variant:

- Create `Dockerfile_<modelname>`
- Set the `MODEL_NAME` environment variable
- Update `build-images.sh` with the new build target
- Add documentation to `readme.md`
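
A local build of the new variant would then look roughly like this (`Dockerfile_mymodel` and the tag are hypothetical names used for illustration):

```bash
# Build and tag the new model variant from its dedicated Dockerfile.
docker build -f Dockerfile_mymodel -t ht-docker-ai:mymodel .
```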
## Troubleshooting

### Model download hangs

Check container logs:

```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`
### API not responding

- Check if the container is healthy: `docker ps`
- Check logs for errors: `docker logs <container>`
- Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Multi-Pass Extraction Strategy
The bank statement extraction uses a dual-VLM consensus approach:
### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|---|---|---|
| MiniCPM-V 4.5 | 8B params | Primary visual extraction |
| PaddleOCR-VL | 0.9B params | Table-specialized extraction |
### Extraction Strategy
- Pass 1: MiniCPM-V visual extraction (images → JSON)
- Pass 2: PaddleOCR-VL table recognition (images → markdown → JSON)
- Consensus: If Pass 1 == Pass 2 → Done (fast path)
- Pass 3+: MiniCPM-V visual if no consensus
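
The consensus check itself can be as simple as comparing normalized outputs; as a rough sketch (assuming each pass writes its extracted JSON to a file and `jq` is available):

```bash
# Identical key-sorted JSON from both passes means consensus; otherwise re-run.
if diff <(jq -S . pass1_minicpm.json) <(jq -S . pass2_paddleocr.json) > /dev/null; then
  echo "Consensus reached - done after 2 passes"
else
  echo "No consensus - running an additional MiniCPM-V pass"
fi
```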
### Why Dual-VLM Works
- Different architectures: Two independent models cross-check each other
- Specialized strengths: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
- No structure loss: Both VLMs see the original images directly
- Fast consensus: Most documents complete in 2 passes when VLMs agree
### Comparison vs Old PP-Structure Approach
| Approach | Bank Statement Result | Issue |
|---|---|---|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
Key insight: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.
## Nanonets-OCR-s

### Overview
Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.
Key features:
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)
### Docker Images

| Tag | Description |
|---|---|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |
### API Endpoints (OpenAI-compatible via vLLM)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
### Request/Response Format

#### POST /v1/chat/completions (OpenAI-compatible)

```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
### Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```
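
A request using this prompt can be sent the same way as for PaddleOCR-VL (a sketch assuming the vLLM server is published on localhost:8000 and a local `page.png`; the shortened prompt here is illustrative):

```bash
# OCR a page image with Nanonets-OCR-s through the vLLM OpenAI-compatible API.
IMG=$(base64 -w0 page.png)
PROMPT="Extract the text from the above document as if you were reading it naturally. Return the tables in html format."
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanonets/Nanonets-OCR-s",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}},
        {"type": "text", "text": "'"$PROMPT"'"}
      ]
    }],
    "temperature": 0.0,
    "max_tokens": 4096
  }'
```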
### Performance
- GPU (vLLM): ~3-8 seconds per page
- VRAM usage: ~8-10GB
### Two-Stage Pipeline (Nanonets + Qwen3)
The Nanonets tests use a two-stage pipeline:
- Stage 1: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
- Stage 2: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)
GPU Limitation: Both vLLM and Ollama require significant GPU memory. On a single GPU system:
- Running both simultaneously causes memory contention
- For single GPU: Run services sequentially (stop Nanonets before Qwen3)
- For multi-GPU: Assign each service to a different GPU
Sequential Execution:

```bash
# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
```
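
Stage 2 can then be driven against the Ollama API, roughly like this (a sketch; the `qwen3:8b` tag and `statement.md` file name are assumptions, and jq 1.6+ is used to build the payload safely):

```bash
# Wrap the Stage-1 markdown in a prompt and request JSON-formatted output.
jq -n --rawfile md statement.md '{
  model: "qwen3:8b",
  prompt: ("Extract all transactions from this bank statement as JSON:\n\n" + $md),
  stream: false,
  format: "json"
}' | curl -s http://localhost:11434/api/generate -d @-
```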