
# Technical Notes - ht-docker-ai

## Architecture

This project uses Ollama as the runtime for serving AI models. It provides:

- Automatic model download and caching
- Unified REST API (compatible with the OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection

## Model Details

### MiniCPM-V 4.5

#### VRAM Usage

| Mode | VRAM required |
|------|---------------|
| Full precision (bf16) | 18 GB |
| int4 quantized | 9 GB |
| GGUF (CPU) | 8 GB RAM |

## Container Startup Flow

1. docker-entrypoint.sh starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks whether the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
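
A minimal sketch of an entrypoint implementing this flow (the real docker-entrypoint.sh may differ; the model tag `minicpm-v` is an assumption for illustration):

```bash
#!/bin/sh
set -e

# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait until the API answers
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 2
done

# 3./4. Pull the model only if it is not already in the volume
if ! ollama list | grep -q minicpm-v; then
  ollama pull minicpm-v
fi

# 5. Keep the container running by waiting on the server process
wait
```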

## Volume Persistence

Mount /root/.ollama to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model will be re-downloaded on each container start (~5GB download).
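
For example (the image tag is assumed for illustration; use the tag you actually built or pulled):

```bash
docker run -d \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ht-docker-ai:latest
```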

## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/tags | GET | List available models |
| /api/generate | POST | Generate completion |
| /api/chat | POST | Chat completion |
| /api/pull | POST | Pull a model |
| /api/show | POST | Show model info |
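
For example, a chat request with an image (the model tag and file name are assumptions; Ollama accepts images as base64 strings):

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "minicpm-v",
  "messages": [
    {
      "role": "user",
      "content": "List every transaction in this statement.",
      "images": ["'"$(base64 -w0 page.png)"'"]
    }
  ],
  "stream": false
}'
```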

## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
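
Conversely, exposing a GPU to the container still requires the NVIDIA runtime at run time, for example (image tag assumed):

```bash
docker run -d --gpus all -p 11434:11434 -v ollama-data:/root/.ollama ht-docker-ai:latest
```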

## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant has a longer start-period (120s) due to slower startup.
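
The reported health state can be checked directly:

```bash
docker inspect --format '{{.State.Health.Status}}' <container-name>
```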

## PaddleOCR-VL Overview

PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

Key advantages over PP-Structure:

- Native table understanding (no HTML parsing needed)
- Support for 109 languages
- Better handling of complex multi-row tables
- Structured Markdown/JSON output

## Docker Images

| Tag | Description |
|-----|-------------|
| paddleocr-vl | GPU variant using vLLM (recommended) |
| paddleocr-vl-cpu | CPU variant using transformers |

## API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check with model info |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | OpenAI-compatible chat completions |
| /ocr | POST | Legacy OCR endpoint |

## Request/Response Format

### POST /v1/chat/completions (OpenAI-compatible)

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```

Task Prompts:

- "OCR:" - Text recognition
- "Table Recognition:" - Table extraction (returns markdown)
- "Formula Recognition:" - Formula extraction
- "Chart Recognition:" - Chart extraction

### Response

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```
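
A complete round trip can be exercised with curl (host, port, and file name are illustrative):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "paddleocr-vl",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 statement.png)"'"}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }],
    "temperature": 0.0,
    "max_tokens": 8192
  }'
```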

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| MODEL_NAME | PaddlePaddle/PaddleOCR-VL | Model to load |
| HOST | 0.0.0.0 | Server host |
| PORT | 8000 | Server port |
| MAX_BATCHED_TOKENS | 16384 | vLLM max batch tokens |
| GPU_MEMORY_UTILIZATION | 0.9 | GPU memory usage (0-1) |
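
These can be overridden at run time, for example (image tag assumed):

```bash
docker run -d --gpus all -p 8000:8000 \
  -e GPU_MEMORY_UTILIZATION=0.8 \
  -e MAX_BATCHED_TOKENS=8192 \
  ht-docker-ai:paddleocr-vl
```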

## Performance

- GPU (vLLM): ~2-5 seconds per page
- CPU: ~30-60 seconds per page

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>`
2. Set the MODEL_NAME environment variable
3. Update build-images.sh with the new build target
4. Add documentation to readme.md
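
A hypothetical sketch of such a variant (the base image and model name are illustrative assumptions, not files in this repo):

```dockerfile
# Dockerfile_mymodel - hypothetical new variant
FROM ht-docker-ai:paddleocr-vl
ENV MODEL_NAME="your-org/your-model"
```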

## Troubleshooting

### Model download hangs

Check container logs:

```bash
docker logs -f <container-name>
```

The model download is ~5 GB and may take several minutes.

### Out of memory

- GPU: use the int4 quantized version or add more VRAM
- CPU: increase the container memory limit: `--memory=16g`

### API not responding

1. Check if the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags` (Ollama) or `curl localhost:8000/health` (PaddleOCR-VL)

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus

| VLM | Size | Purpose |
|-----|------|---------|
| MiniCPM-V 4.5 | 8B params | Primary visual extraction |
| PaddleOCR-VL | 0.9B params | Table-specialized extraction |

### Extraction Strategy

1. Pass 1: MiniCPM-V visual extraction (images → JSON)
2. Pass 2: PaddleOCR-VL table recognition (images → markdown → JSON)
3. Consensus: if Pass 1 == Pass 2, done (fast path)
4. Pass 3+: additional MiniCPM-V visual passes until consensus is reached
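
A sketch of that control flow, assuming hypothetical wrapper scripts (extract_minicpm.sh, extract_paddleocr.sh) that call the respective API and emit normalized transaction JSON for a page image:

```bash
#!/bin/bash
# Pass 1: MiniCPM-V visual extraction
pass1=$(./extract_minicpm.sh page.png)
# Pass 2: PaddleOCR-VL table recognition
pass2=$(./extract_paddleocr.sh page.png)

if [ "$pass1" = "$pass2" ]; then
  echo "consensus after 2 passes (fast path)"
else
  # No consensus: run additional MiniCPM-V passes until two results agree
  pass3=$(./extract_minicpm.sh page.png)
  if [ "$pass3" = "$pass1" ] || [ "$pass3" = "$pass2" ]; then
    echo "consensus on pass 3"
  fi
fi
```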

### Why Dual-VLM Works

- Different architectures: two independent models cross-check each other
- Specialized strengths: PaddleOCR-VL is optimized for tables, MiniCPM-V for general vision
- No structure loss: both VLMs see the original images directly
- Fast consensus: most documents complete in 2 passes when the VLMs agree

### Comparison vs Old PP-Structure Approach

| Approach | Bank statement result | Notes |
|----------|-----------------------|-------|
| MiniCPM-V visual | 28 transactions ✓ | - |
| PP-Structure HTML + visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL table | 28 transactions ✓ | Native table understanding |

Key insight: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.