Technical Notes - ht-docker-ai

Architecture

This project uses Ollama and vLLM as runtime frameworks for serving AI models:

Ollama-based Images (MiniCPM-V, Qwen3-VL)

  • Automatic model download and caching
  • Unified REST API (compatible with OpenAI format)
  • Built-in quantization support
  • GPU auto-detection

vLLM-based Images (Nanonets-OCR)

  • High-performance inference server
  • OpenAI-compatible API
  • Optimized for VLM workloads

Model Details

MiniCPM-V 4.5

VRAM Usage

| Mode                  | VRAM Required |
| --------------------- | ------------- |
| Full precision (bf16) | 18GB          |
| int4 quantized        | 9GB           |

Container Startup Flow

Ollama-based containers

  1. docker-entrypoint.sh starts Ollama server in background
  2. Waits for server to be ready
  3. Checks if model already exists in volume
  4. Pulls model if not present
  5. Keeps container running
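
A minimal shell sketch of this flow, assuming the default Ollama port and the MODEL_NAME variable described under Adding New Models (the actual docker-entrypoint.sh in the images may differ):

#!/usr/bin/env bash
# Sketch only -- not the literal entrypoint shipped in the images.
set -e

ollama serve &                                    # 1. start Ollama server in background
SERVER_PID=$!

until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 2                                         # 2. wait for the server to be ready
done

if ! ollama list | grep -q "${MODEL_NAME}"; then  # 3. check whether the model is already in the volume
  ollama pull "${MODEL_NAME}"                     # 4. pull it if not present
fi

wait "${SERVER_PID}"                              # 5. keep the container running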

vLLM-based containers

  1. vLLM server starts with model auto-download
  2. Health check endpoint available at /health
  3. OpenAI-compatible API at /v1/chat/completions
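
For reference, a typical vLLM launch for this kind of image looks like the following; the exact flags baked into the Nanonets image are an assumption:

python3 -m vllm.entrypoints.openai.api_server \
  --model nanonets/Nanonets-OCR-s \
  --host 0.0.0.0 \
  --port 8000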

Volume Persistence

Ollama volumes

Mount /root/.ollama to persist downloaded models:

-v ollama-data:/root/.ollama

Without this volume, the model will be re-downloaded on each container start (~5GB download).
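
A full run command with the volume attached might look like this (the image tag is an assumption; adjust it to the tag you actually built or pulled):

docker run -d --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/ht-docker-ai:minicpm-v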

vLLM/HuggingFace volumes

Mount /root/.cache/huggingface for model caching:

-v hf-cache:/root/.cache/huggingface
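
The equivalent for the vLLM-based Nanonets image (again, the image tag is an assumption):

docker run -d --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  code.foss.global/ht-docker-ai:nanonets-ocr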

API Endpoints

Ollama API (MiniCPM-V, Qwen3-VL)

| Endpoint      | Method | Description            |
| ------------- | ------ | ---------------------- |
| /api/tags     | GET    | List available models  |
| /api/generate | POST   | Generate completion    |
| /api/chat     | POST   | Chat completion        |
| /api/pull     | POST   | Pull a model           |
| /api/show     | POST   | Show model info        |
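
Quick smoke tests against the Ollama API from the shell (the model name below is an example; use whatever /api/tags reports):

# List the models that are already pulled
curl http://localhost:11434/api/tags

# Chat completion with an attached image (Ollama expects base64 data in the "images" array)
curl http://localhost:11434/api/chat -d '{
  "model": "minicpm-v",
  "messages": [
    {"role": "user", "content": "Describe this image", "images": ["<base64-encoded image>"]}
  ],
  "stream": false
}'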

vLLM API (Nanonets-OCR)

| Endpoint             | Method | Description                        |
| -------------------- | ------ | ---------------------------------- |
| /health              | GET    | Health check                       |
| /v1/models           | GET    | List available models              |
| /v1/chat/completions | POST   | OpenAI-compatible chat completions |
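
For example, assuming the server is published on port 8000 as in the pipeline section below:

curl http://localhost:8000/health
curl http://localhost:8000/v1/models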

Health Checks

All containers include Docker health checks. For the Ollama-based images the check probes the /api/tags endpoint; the vLLM-based images use /health instead:

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11434/api/tags || exit 1

Nanonets-OCR-s

Overview

Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.

Key features:

  • Based on Qwen2.5-VL-3B (~4B parameters)
  • Fine-tuned for document OCR
  • Outputs markdown with semantic HTML tags
  • ~10GB VRAM

Docker Images

| Tag          | Description                                    |
| ------------ | ---------------------------------------------- |
| nanonets-ocr | GPU variant using vLLM (OpenAI-compatible API) |

API Endpoints (OpenAI-compatible via vLLM)

| Endpoint             | Method | Description                        |
| -------------------- | ------ | ---------------------------------- |
| /health              | GET    | Health check                       |
| /v1/models           | GET    | List available models              |
| /v1/chat/completions | POST   | OpenAI-compatible chat completions |

Request/Response Format

POST /v1/chat/completions (OpenAI-compatible)

{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
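
A shell-level example of posting a page image end to end (the port and file name are assumptions; the model name matches the request body above):

# Encode the page image (GNU base64; -w0 disables line wrapping)
IMG_B64=$(base64 -w0 page.png)

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}},
        {"type": "text", "text": "Extract the text from the above document as if you were reading it naturally."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
EOF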

Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.

Performance

  • GPU (vLLM): ~3-8 seconds per page
  • VRAM usage: ~10GB

Two-Stage Pipeline (Nanonets + Qwen3)

The Nanonets tests use a two-stage pipeline:

  1. Stage 1: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
  2. Stage 2: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)

GPU Limitation: Both vLLM and Ollama require significant GPU memory. On a single GPU system:

  • Running both simultaneously causes memory contention
  • For single GPU: Run services sequentially (stop Nanonets before Qwen3)
  • For multi-GPU: Assign each service to a different GPU (see the multi-GPU example below)

Sequential Execution:

# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
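
Multi-GPU Execution (device indices and image tags below are assumptions):

# Pin each service to its own GPU so both can run concurrently
# Nanonets-OCR via vLLM on GPU 0
docker run -d --gpus device=0 -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  code.foss.global/ht-docker-ai:nanonets-ocr

# Ollama-based image (serves the Qwen3 extraction step) on GPU 1
docker run -d --gpus device=1 -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/ht-docker-ai:minicpm-v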

Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

Architecture: Dual-VLM Consensus

| VLM            | Parameters | Purpose                           |
| -------------- | ---------- | --------------------------------- |
| MiniCPM-V 4.5  | 8B params  | Primary visual extraction         |
| Nanonets-OCR-s | ~4B params | Document OCR with semantic output |

Extraction Strategy

  1. Pass 1: MiniCPM-V visual extraction (images → JSON)
  2. Pass 2: Nanonets-OCR semantic extraction (images → markdown → JSON)
  3. Consensus: If Pass 1 == Pass 2 → Done (fast path)
  4. Pass 3+: MiniCPM-V visual if no consensus
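
A shell-level sketch of that loop, using hypothetical helper scripts for the two extraction passes (they are not part of this repo):

# extract_minicpm.sh and extract_nanonets.sh are hypothetical wrappers around the two APIs
pass1=$(./extract_minicpm.sh statement.png)    # Pass 1: MiniCPM-V visual extraction -> JSON
pass2=$(./extract_nanonets.sh statement.png)   # Pass 2: Nanonets-OCR -> markdown -> JSON

if [ "$pass1" = "$pass2" ]; then
  echo "$pass1"                                # consensus: done after two passes (fast path)
else
  ./extract_minicpm.sh statement.png           # Pass 3+: further visual passes until agreement
fi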

Why Dual-VLM Works

  • Different architectures: Two independent models cross-check each other
  • Specialized strengths: Nanonets-OCR-s optimized for document structure, MiniCPM-V for general vision
  • No structure loss: Both VLMs see the original images directly
  • Fast consensus: Most documents complete in 2 passes when VLMs agree

Adding New Models

To add a new model variant:

  1. Create Dockerfile_<modelname>_<runtime>_<hardware>_VRAM<size>
  2. Set MODEL_NAME environment variable
  3. Update build-images.sh with new build target
  4. Add documentation to readme.md
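
For example, a hypothetical new Ollama-based GPU variant could follow the convention like this (file name and tag are placeholders):

# 1. New Dockerfile following the naming scheme:
#    Dockerfile_newmodel_ollama_gpu_VRAM16
# 2. Build it locally for a quick test
docker build -f Dockerfile_newmodel_ollama_gpu_VRAM16 -t ht-docker-ai:newmodel .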

Troubleshooting

Model download hangs

Check container logs:

docker logs -f <container-name>

The model download is ~5GB and may take several minutes.

Out of memory

  • Switch to a lighter variant, e.g. the int4-quantized MiniCPM-V build (~9GB instead of 18GB VRAM)
  • If a single GPU is not enough, consider a multi-GPU setup and assign each service its own GPU

API not responding

  1. Check if container is healthy: docker ps
  2. Check logs for errors: docker logs <container>
  3. Verify port mapping: curl localhost:11434/api/tags

CI/CD Integration

Build and push using npmci:

npmci docker login
npmci docker build
npmci docker push code.foss.global