# Technical Notes - ht-docker-ai

## Architecture

This project uses **Ollama** and **vLLM** as runtime frameworks for serving AI models:

### Ollama-based Images (MiniCPM-V, Qwen3-VL)

- Automatic model download and caching
- Unified REST API (compatible with OpenAI format)
- Built-in quantization support
- GPU auto-detection

### vLLM-based Images (Nanonets-OCR)

- High-performance inference server
- OpenAI-compatible API
- Optimized for VLM workloads

## Model Details

### MiniCPM-V 4.5

- **Source**: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- **Base Models**: Qwen3-8B + SigLIP2-400M
- **Total Parameters**: 8B
- **Ollama Model Name**: `minicpm-v`

### VRAM Usage

| Mode | VRAM Required |
|------|---------------|
| Full precision (bf16) | 18GB |
| int4 quantized | 9GB |

## Container Startup Flow

### Ollama-based containers

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks if the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running (a sketch of this flow is shown below)
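
The flow above could be implemented roughly like the following entrypoint sketch. This is an illustrative outline, not the exact `docker-entrypoint.sh` shipped in the images; the wait loop and model check are assumptions about how such a script is typically written.

```bash
#!/usr/bin/env bash
set -e

# 1. Start the Ollama server in the background
ollama serve &

# 2. Wait until the API answers
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# 3./4. Pull the model only if it is not already present in the mounted volume
MODEL_NAME="${MODEL_NAME:-minicpm-v}"
if ! ollama list | grep -q "${MODEL_NAME}"; then
  ollama pull "${MODEL_NAME}"
fi

# 5. Keep the container alive as long as the server runs
wait
```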

### vLLM-based containers

1. vLLM server starts with model auto-download (an example launch command is shown after this list)
2. Health check endpoint available at `/health`
3. OpenAI-compatible API at `/v1/chat/completions`
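
For reference, a vLLM OpenAI-compatible server along these lines can be started as shown below. The exact flags baked into the Nanonets image may differ; the context length here is an illustrative assumption.

```bash
# Serve Nanonets-OCR-s with vLLM's OpenAI-compatible API on port 8000
python -m vllm.entrypoints.openai.api_server \
  --model nanonets/Nanonets-OCR-s \
  --port 8000 \
  --max-model-len 8192
```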

## Volume Persistence

### Ollama volumes

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model will be re-downloaded on each container start (~5GB download).
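
A complete run command might look like the following; the image name is a placeholder, and the port and volume choices mirror the defaults used elsewhere in these notes.

```bash
docker run -d --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  --name minicpm-test \
  <minicpm-v-image>
```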

### vLLM/HuggingFace volumes

Mount `/root/.cache/huggingface` for model caching:

```bash
-v hf-cache:/root/.cache/huggingface
```
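
For example (image name again a placeholder):

```bash
docker run -d --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  --name nanonets-test \
  <nanonets-ocr-image>
```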

## API Endpoints

### Ollama API (MiniCPM-V, Qwen3-VL)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
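
For example, a vision completion against `/api/generate` looks roughly like this (the base64 image payload is elided):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "minicpm-v",
  "prompt": "Describe this document.",
  "images": ["<base64-encoded image>"],
  "stream": false
}'
```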

### vLLM API (Nanonets-OCR)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
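
Quick smoke tests, assuming the default port 8000 used elsewhere in these notes:

```bash
curl -sf http://localhost:8000/health
curl -s http://localhost:8000/v1/models
```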

## Health Checks

All containers include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```
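
The reported health state can be inspected from the host, for example:

```bash
docker inspect --format '{{.State.Health.Status}}' <container-name>
```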

---

## Nanonets-OCR-s

### Overview

Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.

**Key features:**

- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~10GB VRAM

### Docker Images

| Tag | Description |
|-----|-------------|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |

### API Endpoints (OpenAI-compatible via vLLM)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
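
Such a request can be sent with curl; `request.json` here is just an illustrative file holding the payload above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json
```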

### Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```

### Performance

- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~10GB

### Two-Stage Pipeline (Nanonets + Qwen3)

The Nanonets tests use a two-stage pipeline:

1. **Stage 1**: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
2. **Stage 2**: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)

**GPU Limitation**: Both vLLM and Ollama require significant GPU memory. On a single-GPU system:

- Running both simultaneously causes memory contention
- For single GPU: Run services sequentially (stop the Nanonets container before starting Qwen3)
- For multi-GPU: Assign each service to a different GPU

**Sequential Execution**:

```bash
# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
```

---

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **Nanonets-OCR-s** | ~4B params | Document OCR with semantic output |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: Nanonets-OCR semantic extraction (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2 → Done (fast path)
4. **Pass 3+**: Additional MiniCPM-V visual passes if no consensus is reached (a consensus-check sketch follows this list)
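
A minimal sketch of the consensus check, assuming each pass writes its extraction result to a JSON file (the file names are hypothetical):

```bash
# Compare the two passes with key-sorted JSON so formatting differences don't matter
if diff <(jq -S . pass1.json) <(jq -S . pass2.json) > /dev/null; then
  echo "Consensus reached - done"
else
  echo "No consensus - running an additional MiniCPM-V pass"
fi
```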

### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: Nanonets-OCR-s is optimized for document structure, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>_<runtime>_<hardware>_VRAM<size>` (an example build command is shown after this list)
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`
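
For example, a new variant could then be built locally along these lines (the Dockerfile name and tag are illustrative):

```bash
docker build -f Dockerfile_mymodel_ollama_gpu_VRAM12 -t ht-docker-ai:mymodel .
```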

## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```

The model download is ~5GB and may take several minutes.

### Out of memory

- Use a quantized or smaller model variant to reduce VRAM requirements
- Consider a GPU with more VRAM or a multi-GPU setup

### API not responding

1. Check if the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Nanonets-OCR-s on HuggingFace](https://huggingface.co/nanonets/Nanonets-OCR-s)