# Technical Notes - ht-docker-ai

## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with the OpenAI format; example below)
- Built-in quantization support
- GPU/CPU auto-detection
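For example, once a container is up, the same server answers both native Ollama calls and OpenAI-style calls. A minimal check, assuming the container publishes Ollama's default port 11434:

```bash
# Native Ollama endpoint: list locally available models
curl http://localhost:11434/api/tags

# OpenAI-compatible endpoint served by the same Ollama instance
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "minicpm-v", "messages": [{"role": "user", "content": "Hello"}]}'
```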
## Model Details

### MiniCPM-V 4.5
- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
### VRAM Usage
| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

`docker-entrypoint.sh` performs the following steps (sketched below):

1. Starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks if the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
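A minimal sketch of that flow, assuming the entrypoint uses the standard `ollama` CLI; the actual script may differ in details:

```bash
#!/bin/sh
# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait until the API answers
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# 3./4. Pull the model only if it is not already in the volume
if ! ollama list | grep -q "minicpm-v"; then
  ollama pull minicpm-v
fi

# 5. Keep the container alive on the server process
wait
```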
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
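A typical run command with the named volume in place; the image reference is illustrative:

```bash
docker run -d --name ollama-ai \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <registry>/ht-docker-ai:latest
```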
## API Endpoints

All endpoints follow the Ollama API specification (example request after the table):
| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
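A representative non-streaming request against the native API; the base64 payload is a placeholder:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this image.",
  "images": ["<base64-encoded image>"],
  "stream": false
}'
```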
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
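The same choice can also be made at run time rather than baked into the image; the flags below are standard Docker, and the image names are illustrative:

```bash
# CPU-only, even on a GPU host (mirrors the ENV line above)
docker run -d -p 11434:11434 -e CUDA_VISIBLE_DEVICES="" <image>

# GPU variant: pass GPUs through so Ollama can detect them
docker run -d -p 11434:11434 --gpus all <image>
```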
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
The CPU variant has a longer start-period (120s) due to slower startup.
## PaddleOCR-VL (Recommended)

### Overview
PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.
Key advantages over PP-Structure:
- Native table understanding (no HTML parsing needed)
- 109 language support
- Better handling of complex multi-row tables
- Structured Markdown/JSON output
### Docker Images

| Tag | Description |
|---|---|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |
### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |
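A quick smoke test of the two GET endpoints, assuming the default port 8000 is published:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
```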
### Request/Response Format

**POST /v1/chat/completions** (OpenAI-compatible):
```json
{
"model": "paddleocr-vl",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
{"type": "text", "text": "Table Recognition:"}
]
}
],
"temperature": 0.0,
"max_tokens": 8192
}
```
Task prompts (example request below):

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction
**Response:**
```json
{
"id": "chatcmpl-...",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
},
"finish_reason": "stop"
}
]
}
```
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batched tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory utilization (0-1) |
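Defaults can be overridden per container; the image tag follows the Docker Images table above, and the registry prefix is illustrative:

```bash
docker run -d --gpus all -p 8000:8000 \
  -e MAX_BATCHED_TOKENS=8192 \
  -e GPU_MEMORY_UTILIZATION=0.8 \
  <registry>/ht-docker-ai:paddleocr-vl
```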
### Performance
- GPU (vLLM): ~2-5 seconds per page
- CPU: ~30-60 seconds per page
## Adding New Models

To add a new model variant (example commands below):

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`
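A hypothetical flow for a new `foo` variant following those steps (all names are placeholders, and the real `build-images.sh` target may look different):

```bash
docker build -f Dockerfile_foo -t ht-docker-ai:foo .
docker run -d -p 8000:8000 -e MODEL_NAME=<hf-org>/<model> ht-docker-ai:foo
```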
## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`
### API not responding

1. Check if the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|---|---|---|
| MiniCPM-V 4.5 | 8B params | Primary visual extraction |
| PaddleOCR-VL | 0.9B params | Table-specialized extraction |
### Extraction Strategy
- Pass 1: MiniCPM-V visual extraction (images → JSON)
- Pass 2: PaddleOCR-VL table recognition (images → markdown → JSON)
- Consensus: If Pass 1 == Pass 2 → Done (fast path; see the sketch below)
- Pass 3+: MiniCPM-V visual if no consensus
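A minimal sketch of the consensus check, assuming both passes have already written their JSON transaction lists to disk (file names are hypothetical):

```bash
# jq -S normalizes key order so semantically equal JSON compares equal
if [ "$(jq -S . pass1_minicpm.json)" = "$(jq -S . pass2_paddleocr.json)" ]; then
  echo "Consensus reached after 2 passes"        # fast path
else
  echo "No consensus; scheduling pass 3 (MiniCPM-V visual)"
fi
```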
### Why Dual-VLM Works
- Different architectures: Two independent models cross-check each other
- Specialized strengths: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
- No structure loss: Both VLMs see the original images directly
- Fast consensus: Most documents complete in 2 passes when VLMs agree
### Comparison vs Old PP-Structure Approach
| Approach | Bank Statement Result | Issue |
|---|---|---|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
Key insight: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.