# Technical Notes - ht-docker-ai

## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection
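Because the API is OpenAI-compatible, existing OpenAI clients can be pointed at the container. A minimal sketch, assuming the default port mapping (Ollama serves its OpenAI-compatible endpoints under `/v1`):

```bash
# Chat via the OpenAI-compatible endpoint; "minicpm-v" is the model name
# used elsewhere in this document.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minicpm-v",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```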
## Model Details

### MiniCPM-V 4.5

- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
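Once the container is running, the pulled model's metadata can be inspected with the Ollama CLI (the container name is a placeholder):

```bash
# Prints architecture, parameter count, quantization, and prompt template
docker exec -it <container-name> ollama show minicpm-v
```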
### VRAM Usage

| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
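To check whether the host GPU has enough memory for a given mode, a quick query (assumes the NVIDIA driver is installed):

```bash
# Report total and free VRAM per GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```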
## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks if the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
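A minimal sketch of such an entrypoint is shown below; the actual `docker-entrypoint.sh` in this repository may differ, and the `MODEL_NAME` default is an assumption:

```bash
#!/usr/bin/env bash
set -e

MODEL_NAME="${MODEL_NAME:-minicpm-v}"

# Start the Ollama server in the background
ollama serve &

# Wait for the API to answer
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# Pull the model only if it is not already in the volume
if ! ollama list | grep -q "$MODEL_NAME"; then
  ollama pull "$MODEL_NAME"
fi

# Keep the container running by waiting on the server process
wait
```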
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
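A complete `docker run` invocation with the volume attached might look like this (the image reference is a placeholder; use the actual published image):

```bash
docker run -d \
  --name ht-docker-ai \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <image-reference>
```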
## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
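For example, a non-streaming chat completion; since MiniCPM-V is a vision model, a base64-encoded image can be attached via the `images` field of a message:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "minicpm-v",
  "messages": [
    {"role": "user", "content": "Describe this image", "images": ["<base64-image>"]}
  ],
  "stream": false
}'
```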
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```
This forces Ollama to use CPU inference even if GPU is available.
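Conversely, the GPU variant needs access to the host GPUs at runtime, e.g. (the image reference is a placeholder; requires the NVIDIA Container Toolkit):

```bash
docker run -d --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <gpu-image-reference>
```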
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
The CPU variant has a longer start-period (120s) due to its slower startup.
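The current health status can be queried directly:

```bash
# Prints "starting", "healthy", or "unhealthy"
docker inspect --format '{{.State.Health.Status}}' <container-name>
```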
## Adding New Models

To add a new model variant (see the Dockerfile sketch below):

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`
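A hypothetical `Dockerfile_llava` following this pattern might look like the sketch below; the base image and file layout are assumptions, so mirror the existing variant Dockerfiles in this repository:

```dockerfile
# Hypothetical variant Dockerfile; all names are illustrative
FROM ollama/ollama:latest

# The entrypoint pulls this model on first start
ENV MODEL_NAME=llava

COPY docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh

EXPOSE 11434
ENTRYPOINT ["/docker-entrypoint.sh"]
```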
## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```
The model download is ~5GB and may take several minutes.
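If the download stalls, it can be retried manually inside the container:

```bash
# Re-run the pull; progress is printed to stdout
docker exec -it <container-name> ollama pull minicpm-v
```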
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit with `--memory=16g` (see the example below)
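The limit can also be raised on an already-running container; Docker requires the swap limit to be adjusted alongside the memory limit:

```bash
docker update --memory=16g --memory-swap=16g <container-name>
```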
### API not responding

- Check if the container is healthy: `docker ps`
- Check the logs for errors: `docker logs <container>`
- Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
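Without npmci, the equivalent plain Docker commands are roughly as follows (the registry path and tag are assumptions):

```bash
docker login code.foss.global
docker build -t code.foss.global/<org>/ht-docker-ai:latest .
docker push code.foss.global/<org>/ht-docker-ai:latest
```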