# Technical Notes - ht-docker-ai

## Architecture

This project uses **Ollama** as the runtime framework for serving AI models. This provides:

- Automatic model download and caching
- A unified REST API (compatible with the OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection

## Model Details

### MiniCPM-V 4.5

- **Source**: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- **Base Models**: Qwen3-8B + SigLIP2-400M
- **Total Parameters**: 8B
- **Ollama Model Name**: `minicpm-v`

### VRAM Usage

| Mode | Memory Required |
|------|-----------------|
| Full precision (bf16) | 18 GB VRAM |
| int4 quantized | 9 GB VRAM |
| GGUF (CPU) | 8 GB RAM |

## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks whether the model already exists in the volume
4. Pulls the model if not present
5. Keeps the container running

A minimal sketch of this flow appears after the health-check section below.

## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model is re-downloaded on each container start (~5GB download); see the full `docker run` example after the health-check section below.

## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |

An example request follows the health-check section below.

## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.

## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant uses a longer `start-period` (120s) due to its slower startup.
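Putting the port mapping and volume mount together, a typical invocation of the GPU variant might look like the following. The image reference `code.foss.global/ht-docker-ai:latest` is a placeholder, not a confirmed published tag; substitute whatever tag your build of `build-images.sh` produces.

```bash
# Run the GPU variant with persistent model storage.
# NOTE: the image tag below is a placeholder -- replace it with the
# tag you actually built or pulled.
docker run -d \
  --name ht-ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  code.foss.global/ht-docker-ai:latest
```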
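The startup flow described above might be implemented roughly like this. This is a minimal sketch under the assumptions documented here (the Ollama CLI is available in the image, the model is `minicpm-v`), not the actual `docker-entrypoint.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Start the Ollama server in the background.
ollama serve &

# 2. Wait until the API answers.
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1
done

# 3 + 4. Pull the model only if the volume does not already contain it.
if ! ollama list | grep -q minicpm-v; then
  ollama pull minicpm-v
fi

# 5. Keep the container alive by waiting on the background server process.
wait
```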
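As an example of the `/api/generate` endpoint, the following asks `minicpm-v` to describe an image. It assumes the container is reachable on port 11434 and that a local `page.png` exists; images are passed as base64 strings per the Ollama API:

```bash
# base64 -w0 is GNU coreutils; omit -w0 on macOS.
curl -s http://localhost:11434/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "minicpm-v",
    "prompt": "Describe this image.",
    "images": ["'"$(base64 -w0 page.png)"'"],
    "stream": false
  }'
```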
## PaddleOCR-VL (Recommended)

### Overview

PaddleOCR-VL is a 0.9B-parameter vision-language model specifically optimized for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

**Key advantages over PP-Structure:**

- Native table understanding (no HTML parsing needed)
- Support for 109 languages
- Better handling of complex multi-row tables
- Structured Markdown/JSON output

### Docker Images

| Tag | Description |
|-----|-------------|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |

### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```

**Task Prompts:**

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns Markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction

**Response**

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batch tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |

### Performance

- **GPU (vLLM)**: ~2-5 seconds per page
- **CPU**: ~30-60 seconds per page

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<variant>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`

## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container>
```

The model download is ~5GB and may take several minutes.

### Out of memory

- GPU: use the int4 quantized version or add more VRAM
- CPU: increase the container memory limit: `--memory=16g`

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach.

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
3. **Consensus**: if Pass 1 == Pass 2 → done (fast path)
4. **Pass 3+**: additional MiniCPM-V visual passes if no consensus is reached

A sketch of the consensus check follows below.
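A minimal sketch of the first two passes and the consensus check, assuming the Ollama container is reachable on port 11434 and the PaddleOCR-VL container on port 8000 (a local setup; adjust hosts and ports). The real pipeline compares parsed JSON rather than raw strings, and Pass 2's markdown-to-JSON conversion is omitted here:

```bash
#!/usr/bin/env bash
set -euo pipefail

IMG_B64=$(base64 -w0 statement-page.png)

# Pass 1: MiniCPM-V visual extraction via the Ollama API.
pass1=$(curl -s http://localhost:11434/api/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "minicpm-v",
    "prompt": "Extract all transactions from this statement page as JSON.",
    "images": ["'"$IMG_B64"'"],
    "stream": false
  }' | jq -r '.response')

# Pass 2: PaddleOCR-VL table recognition via the OpenAI-compatible API.
pass2=$(curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "paddleocr-vl",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG_B64"'"}},
      {"type": "text", "text": "Table Recognition:"}
    ]}],
    "temperature": 0.0
  }' | jq -r '.choices[0].message.content')

# Consensus: identical results end the run (fast path);
# otherwise additional MiniCPM-V passes would be scheduled.
if [ "$pass1" = "$pass2" ]; then
  echo "consensus reached"
else
  echo "no consensus -- run Pass 3+"
fi
```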
### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: PaddleOCR-VL is optimized for tables, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

### Comparison vs Old PP-Structure Approach

| Approach | Bank Statement Result | Issue |
|----------|----------------------|-------|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |

**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.

---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)