# Technical Notes - ht-docker-ai

## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection
## Model Details

### MiniCPM-V 4.5

- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
### VRAM Usage
| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

- `docker-entrypoint.sh` starts the Ollama server in the background
- Waits for the server to be ready
- Checks if model already exists in volume
- Pulls model if not present
- Keeps container running
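
A simplified sketch of this flow (not the actual `docker-entrypoint.sh`; it assumes the default Ollama port 11434 and the `minicpm-v` model name):

```bash
#!/bin/sh
# Start Ollama in the background, wait for the API, pull the model if missing.
ollama serve &

until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1   # wait until the server accepts requests
done

if ! ollama list | grep -q "minicpm-v"; then
  ollama pull minicpm-v   # only downloaded when not already in the volume
fi

wait   # keep the container running on the ollama serve process
```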
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
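
For example, a typical run command with the volume attached could look like this (the image tag is a placeholder; use the tag you actually built or pulled):

```bash
# Persist pulled models in the named volume "ollama-data" across restarts.
docker run -d \
  --name minicpm \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ht-docker-ai:minicpm   # placeholder image tag
```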
## API Endpoints
All endpoints follow the Ollama API specification:
| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
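
A quick smoke test of the generate endpoint might look like this (assumes the container is published on localhost:11434, the `minicpm-v` model has been pulled, and a local `page.png` exists; `base64 -w0` is the GNU coreutils form):

```bash
# Ask MiniCPM-V to describe a local image via the Ollama generate endpoint.
curl -s http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this image.",
  "images": ["'"$(base64 -w0 page.png)"'"],
  "stream": false
}'
```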
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```
This forces Ollama to use CPU inference even if GPU is available.
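
The same effect can be achieved at run time by overriding the variable, for example (placeholder image tag):

```bash
# Force CPU inference by hiding all CUDA devices from Ollama.
docker run -d -p 11434:11434 \
  -e CUDA_VISIBLE_DEVICES="" \
  ht-docker-ai:minicpm   # placeholder image tag
```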
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
CPU variant has longer start-period (120s) due to slower startup.
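
The state derived from this health check can be inspected on a running container, for example:

```bash
# Print the current health status (starting, healthy, or unhealthy).
docker inspect --format '{{.State.Health.Status}}' <container-name>
```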
## PaddleOCR-VL (Recommended)

### Overview
PaddleOCR-VL is a 0.9B parameter Vision-Language Model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.
Key advantages over PP-Structure:
- Native table understanding (no HTML parsing needed)
- 109 language support
- Better handling of complex multi-row tables
- Structured Markdown/JSON output
### Docker Images

| Tag | Description |
|---|---|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |
### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |
### Request/Response Format

#### POST /v1/chat/completions (OpenAI-compatible)

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```
Task Prompts:

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction
#### Response

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```
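
End to end, a table-recognition request can be sent with curl roughly like this (a sketch assuming the server is published on localhost:8000 and a local `table.png`; adjust port and file name as needed):

```bash
# Send a local image to the PaddleOCR-VL server for table recognition.
IMG=$(base64 -w0 table.png)   # base64-encode the page image (GNU coreutils form)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "paddleocr-vl",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }],
    "temperature": 0.0,
    "max_tokens": 8192
  }'
```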
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batched tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |
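
These can be overridden at container start, for example (placeholder image tag):

```bash
# Run the GPU variant on port 8080 with a lower GPU memory cap.
docker run -d --gpus all \
  -p 8080:8080 \
  -e PORT=8080 \
  -e GPU_MEMORY_UTILIZATION=0.7 \
  ht-docker-ai:paddleocr-vl   # placeholder image tag
```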
### Performance
- GPU (vLLM): ~2-5 seconds per page
- CPU: ~30-60 seconds per page
## Adding New Models

To add a new model variant:

- Create `Dockerfile_<modelname>`
- Set the `MODEL_NAME` environment variable
- Update `build-images.sh` with the new build target
- Add documentation to `readme.md`
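
A local build of the new variant would then look roughly like this (`Dockerfile_mymodel` and the tag are hypothetical names used for illustration):

```bash
# Build and tag the new model variant from its dedicated Dockerfile.
docker build -f Dockerfile_mymodel -t ht-docker-ai:mymodel .
```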
## Troubleshooting

### Model download hangs

Check container logs:

```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`
### API not responding

- Check if the container is healthy: `docker ps`
- Check logs for errors: `docker logs <container>`
- Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Multi-Pass Extraction Strategy
The bank statement extraction uses a dual-VLM consensus approach:
### Architecture: Dual-VLM Consensus
| VLM | Model | Purpose |
|---|---|---|
| MiniCPM-V 4.5 | 8B params | Primary visual extraction |
| PaddleOCR-VL | 0.9B params | Table-specialized extraction |
### Extraction Strategy
- Pass 1: MiniCPM-V visual extraction (images → JSON)
- Pass 2: PaddleOCR-VL table recognition (images → markdown → JSON)
- Consensus: If Pass 1 == Pass 2 → Done (fast path)
- Pass 3+: MiniCPM-V visual if no consensus
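
The consensus check itself can be as simple as comparing normalized outputs; as a rough sketch (assuming each pass writes its extracted JSON to a file and `jq` is available):

```bash
# Identical key-sorted JSON from both passes means consensus; otherwise re-run.
if diff <(jq -S . pass1_minicpm.json) <(jq -S . pass2_paddleocr.json) > /dev/null; then
  echo "Consensus reached - done after 2 passes"
else
  echo "No consensus - running an additional MiniCPM-V pass"
fi
```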
### Why Dual-VLM Works
- Different architectures: Two independent models cross-check each other
- Specialized strengths: PaddleOCR-VL optimized for tables, MiniCPM-V for general vision
- No structure loss: Both VLMs see the original images directly
- Fast consensus: Most documents complete in 2 passes when VLMs agree
### Comparison vs Old PP-Structure Approach
| Approach | Bank Statement Result | Issue |
|---|---|---|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |
Key insight: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.
## Nanonets-OCR-s

### Overview
Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.
Key features:
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)
### Docker Images

| Tag | Description |
|---|---|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |
### API Endpoints (OpenAI-compatible via vLLM)

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
### Request/Response Format

#### POST /v1/chat/completions (OpenAI-compatible)

```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
### Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```
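
A request using this prompt can be sent the same way as for PaddleOCR-VL (a sketch assuming the vLLM server is published on localhost:8000 and a local `page.png`; the shortened prompt here is illustrative):

```bash
# OCR a page image with Nanonets-OCR-s through the vLLM OpenAI-compatible API.
IMG=$(base64 -w0 page.png)
PROMPT="Extract the text from the above document as if you were reading it naturally. Return the tables in html format."
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanonets/Nanonets-OCR-s",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}},
        {"type": "text", "text": "'"$PROMPT"'"}
      ]
    }],
    "temperature": 0.0,
    "max_tokens": 4096
  }'
```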
### Performance
- GPU (vLLM): ~3-8 seconds per page
- VRAM usage: ~8-10GB
### Two-Stage Pipeline (Nanonets + Qwen3)
The Nanonets tests use a two-stage pipeline:
- Stage 1: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
- Stage 2: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)
GPU Limitation: Both vLLM and Ollama require significant GPU memory. On a single GPU system:
- Running both simultaneously causes memory contention
- For single GPU: Run services sequentially (stop Nanonets before Qwen3)
- For multi-GPU: Assign each service to a different GPU
Sequential Execution:

```bash
# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
```
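
Stage 2 can then be driven against the Ollama API, roughly like this (a sketch; the `qwen3:8b` tag and `statement.md` file name are assumptions, and jq 1.6+ is used to build the payload safely):

```bash
# Wrap the Stage-1 markdown in a prompt and request JSON-formatted output.
jq -n --rawfile md statement.md '{
  model: "qwen3:8b",
  prompt: ("Extract all transactions from this bank statement as JSON:\n\n" + $md),
  stream: false,
  format: "json"
}' | curl -s http://localhost:11434/api/generate -d @-
```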