# Technical Notes - ht-docker-ai
## Architecture
This project uses Ollama as the runtime framework for serving AI models. This provides:
- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection
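Because the API is OpenAI-compatible, requests can use the familiar chat-completions shape. The sketch below only builds the request body (the URL and port are Ollama's documented defaults; sending it requires a running container):

```python
import json

# Ollama's OpenAI-compatible endpoint (default port 11434).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> str:
    """Return an OpenAI-format chat completion request as a JSON string."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return json.dumps(payload)

body = build_chat_request("minicpm-v", "Describe this image.")
print(body)
```

POSTing `body` to `OLLAMA_URL` with `Content-Type: application/json` would return a standard chat-completion response.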
## Model Details

### MiniCPM-V 4.5

- Source: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- Base Models: Qwen3-8B + SigLIP2-400M
- Total Parameters: 8B
- Ollama Model Name: `minicpm-v`
### VRAM Usage
| Mode | VRAM Required |
|---|---|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks if the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running
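The wait-for-ready step boils down to a polling loop. A minimal Python sketch of that logic (the real entrypoint is a shell script, and the probe here is faked; in practice it would be an HTTP GET against `http://localhost:11434/api/tags`):

```python
import time
from typing import Callable

def wait_for_server(probe: Callable[[], bool],
                    timeout: float = 60.0,
                    interval: float = 0.1) -> bool:
    """Poll `probe` until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Fake probe standing in for an HTTP health check: "ready" on the third poll.
attempts = {"n": 0}

def fake_probe() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3

ready = wait_for_server(fake_probe, timeout=5.0)
print(ready)  # True once the probe succeeds
```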
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```shell
-v ollama-data:/root/.ollama
```
Without this volume, the model will be re-downloaded on each container start (~5GB download).
## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
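As an illustration of the native API, a request body for `/api/generate` can be assembled like this (a sketch using Ollama's documented `model`/`prompt`/`stream` fields; sending it requires a running container):

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

req = build_generate_request("minicpm-v", "Summarize this project.")
print(json.dumps(req))
```

With `stream` set to `False`, Ollama returns a single JSON object instead of a stream of chunks.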
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant has a longer start-period (120s) due to slower startup.
## PaddleOCR

### Overview

PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It provides:
- Text detection and recognition
- Multi-language support
- FastAPI REST API
- GPU and CPU variants
### Docker Images

| Tag | Description |
|---|---|
| `paddleocr` | GPU variant (default) |
| `paddleocr-gpu` | GPU variant (alias) |
| `paddleocr-cpu` | CPU-only variant |
### API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with model info |
| `/ocr` | POST | OCR with base64 image (JSON body) |
| `/ocr/upload` | POST | OCR with file upload (multipart form) |
### Request/Response Format

POST `/ocr` (JSON):

```json
{
  "image": "<base64-encoded-image>",
  "language": "en"
}
```

The `language` field is optional.

POST `/ocr/upload` (multipart):

- `img`: image file
- `language`: optional language code

Response:

```json
{
  "success": true,
  "results": [
    {
      "text": "Invoice #12345",
      "confidence": 0.98,
      "box": [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    }
  ]
}
```
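The request and response shapes above can be exercised from a client like this (a minimal sketch: the image bytes are faked, and the response is a sample matching the documented format rather than a live server reply):

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, language: str = "en") -> str:
    """Encode raw image bytes into the JSON body expected by POST /ocr."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "language": language,
    })

def extract_texts(response_json: str) -> list[str]:
    """Pull recognized text lines out of a successful /ocr response."""
    data = json.loads(response_json)
    if not data.get("success"):
        return []
    return [r["text"] for r in data["results"]]

# Fake bytes stand in for reading a real image file.
body = build_ocr_request(b"fake-image-bytes", language="en")

sample_response = json.dumps({
    "success": True,
    "results": [{"text": "Invoice #12345", "confidence": 0.98,
                 "box": [[0, 0], [100, 0], [100, 20], [0, 20]]}],
})
print(extract_texts(sample_response))  # ['Invoice #12345']
```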
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `OCR_LANGUAGE` | `en` | Default language for OCR |
| `SERVER_PORT` | `5000` | Server port |
| `SERVER_HOST` | `0.0.0.0` | Server host |
| `CUDA_VISIBLE_DEVICES` | (auto) | Set to `-1` for CPU-only |
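How a service might resolve these variables with the documented defaults (a sketch, not the actual server code; the environment is passed in explicitly so the logic is testable):

```python
import os

def load_config(env: dict[str, str]) -> dict:
    """Resolve service settings from environment variables, using the documented defaults."""
    return {
        "language": env.get("OCR_LANGUAGE", "en"),
        "port": int(env.get("SERVER_PORT", "5000")),
        "host": env.get("SERVER_HOST", "0.0.0.0"),
        "cpu_only": env.get("CUDA_VISIBLE_DEVICES") == "-1",
    }

# In the real service this would be load_config(os.environ).
cfg = load_config({"SERVER_PORT": "8080", "CUDA_VISIBLE_DEVICES": "-1"})
print(cfg)
```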
### Performance
- GPU: ~1-3 seconds per page
- CPU: ~10-30 seconds per page
### Supported Languages

Common language codes: `en` (English), `ch` (Chinese), `de` (German), `fr` (French), `es` (Spanish), `ja` (Japanese), `ko` (Korean)
## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`
## Troubleshooting

### Model download hangs

Check container logs:

```shell
docker logs -f <container-name>
```

The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit: `--memory=16g`
### API not responding

- Check if the container is healthy: `docker ps`
- Check logs for errors: `docker logs <container>`
- Verify port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```shell
npmci docker login
npmci docker build
npmci docker push code.foss.global
```