# Technical Notes - ht-docker-ai

## Architecture

This project uses **Ollama** as the runtime framework for serving AI models. This provides:

- Automatic model download and caching
- A unified REST API (compatible with the OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection

## Model Details

### MiniCPM-V 4.5

- **Source**: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- **Base Models**: Qwen3-8B + SigLIP2-400M
- **Total Parameters**: 8B
- **Ollama Model Name**: `minicpm-v`

### VRAM Usage

| Mode | Memory Required |
|------|-----------------|
| Full precision (bf16) | 18 GB VRAM |
| int4 quantized | 9 GB VRAM |
| GGUF (CPU) | 8 GB RAM |

## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks whether the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running

A minimal sketch of this flow appears after the health-check section below.

## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model is re-downloaded on each container start (~5GB download); see the full `docker run` example after the health-check section below.

## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |

An example request follows the health-check section below.

## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.

## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant uses a longer `start-period` (120s) due to its slower startup.
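Putting the port mapping and volume mount together, a typical invocation of the GPU variant might look like the following. The image reference `code.foss.global/ht-docker-ai:latest` is a placeholder, not a confirmed published tag; substitute whatever tag your build of `build-images.sh` produces.

```bash
# Run the GPU variant with persistent model storage.
# NOTE: the image tag below is a placeholder -- replace it with the
# tag you actually built or pulled.
docker run -d \
  --name ht-ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/ht-docker-ai:latest
```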
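The startup flow described above might be implemented roughly like this. This is a minimal sketch under the assumptions documented here (the Ollama CLI is available in the image, the model is `minicpm-v`), not the actual `docker-entrypoint.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Start the Ollama server in the background.
ollama serve &

# 2. Wait until the API answers.
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# 3 + 4. Pull the model only if the volume does not already contain it.
if ! ollama list | grep -q minicpm-v; then
  ollama pull minicpm-v
fi

# 5. Keep the container alive by waiting on the background server process.
wait
```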
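As an example of the `/api/generate` endpoint, the following asks `minicpm-v` to describe an image. It assumes the container is reachable on port 11434 and that a local `page.png` exists; images are passed as base64 strings per the Ollama API:

```bash
# base64 -w0 is GNU coreutils; omit -w0 on macOS.
curl -s http://localhost:11434/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "minicpm-v",
    "prompt": "Describe this image.",
    "images": ["'"$(base64 -w0 page.png)"'"],
    "stream": false
  }'
```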
## PaddleOCR-VL (Recommended)

### Overview

PaddleOCR-VL is a 0.9B-parameter vision-language model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

**Key advantages over PP-Structure:**

- Native table understanding (no HTML parsing needed)
- Support for 109 languages
- Better handling of complex multi-row tables
- Structured Markdown/JSON output

### Docker Images

| Tag | Description |
|-----|-------------|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |

### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```

**Task Prompts:**

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns Markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction

**Response**

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batch tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |

### Performance

- **GPU (vLLM)**: ~2-5 seconds per page
- **CPU**: ~30-60 seconds per page

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<variant>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`

## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container>
```

The model download is ~5GB and may take several minutes.

### Out of memory

- GPU: use the int4 quantized version or add more VRAM
- CPU: increase the container memory limit: `--memory=16g`

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach.

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
3. **Consensus**: if Pass 1 == Pass 2 → done (fast path)
4. **Pass 3+**: additional MiniCPM-V visual passes if no consensus is reached

A sketch of the consensus check follows below.
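A minimal sketch of the first two passes and the consensus check, assuming the Ollama container is reachable on port 11434 and the PaddleOCR-VL container on port 8000 (a local setup; adjust hosts and ports). The real pipeline compares parsed JSON rather than raw strings, and Pass 2's markdown-to-JSON conversion is omitted here:

```bash
#!/usr/bin/env bash
set -euo pipefail

IMG_B64=$(base64 -w0 statement-page.png)

# Pass 1: MiniCPM-V visual extraction via the Ollama API.
pass1=$(curl -s http://localhost:11434/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "minicpm-v",
    "prompt": "Extract all transactions from this statement page as JSON.",
    "images": ["'"$IMG_B64"'"],
    "stream": false
  }' | jq -r '.response')

# Pass 2: PaddleOCR-VL table recognition via the OpenAI-compatible API.
pass2=$(curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "paddleocr-vl",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG_B64"'"}},
      {"type": "text", "text": "Table Recognition:"}
    ]}],
    "temperature": 0.0
  }' | jq -r '.choices[0].message.content')

# Consensus: identical results end the run (fast path);
# otherwise additional MiniCPM-V passes would be scheduled.
if [ "$pass1" = "$pass2" ]; then
  echo "consensus reached"
else
  echo "no consensus -- run Pass 3+"
fi
```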
### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: PaddleOCR-VL is optimized for tables, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

### Comparison vs Old PP-Structure Approach

| Approach | Bank Statement Result | Issue |
|----------|----------------------|-------|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |

**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.

---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)