# Technical Notes - ht-docker-ai

## Architecture

This project uses **Ollama** as the runtime framework for serving AI models. Ollama provides:

- Automatic model download and caching
- A unified REST API (compatible with the OpenAI format)
- Built-in quantization support
- GPU/CPU auto-detection

## Model Details

### MiniCPM-V 4.5

- **Source**: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- **Base Models**: Qwen3-8B + SigLIP2-400M
- **Total Parameters**: 8B
- **Ollama Model Name**: `minicpm-v`

### VRAM Usage

| Mode | Memory Required |
|------|-----------------|
| Full precision (bf16) | 18 GB VRAM |
| int4 quantized | 9 GB VRAM |
| GGUF (CPU) | 8 GB system RAM |

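If you need to manage the model cache by hand, the Ollama CLI inside the container can be used directly. A sketch (`<container-name>` is a placeholder for whatever you named the container):

```bash
# Pull or update the MiniCPM-V model manually inside a running container
docker exec -it <container-name> ollama pull minicpm-v

# List the models currently cached in the mounted volume
docker exec -it <container-name> ollama list
```
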
## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to be ready
3. Checks whether the model already exists in the volume
4. Pulls the model if it is not present
5. Keeps the container running

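A condensed sketch of that flow (not the literal `docker-entrypoint.sh` from this repo; it assumes the model tag is supplied via the `MODEL_NAME` environment variable mentioned later in these notes):

```bash
#!/usr/bin/env bash
# Sketch of the startup flow above -- the real entrypoint may differ in detail.
set -e

ollama serve &                                   # 1. start Ollama in the background

until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1                                        # 2. wait for the server to answer
done

if ! ollama list | grep -q "${MODEL_NAME}"; then
  ollama pull "${MODEL_NAME}"                    # 3./4. pull only if missing
fi

wait                                             # 5. keep the container running
```
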
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model will be re-downloaded on each container start (~5GB download).

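A full run command with the volume attached might look like this (the image tag is a placeholder; substitute whichever ht-docker-ai tag you built or pulled):

```bash
# The named volume "ollama-data" keeps /root/.ollama across container restarts
docker run -d \
  --name minicpm \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <ht-docker-ai-image-tag>
```
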
## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |

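For example, a chat request against `minicpm-v` with an attached image could look like this (a sketch; host, port, and the prompt text are just the defaults used elsewhere in these notes):

```bash
# Non-streaming chat request with one base64-encoded image attached
curl -s http://localhost:11434/api/chat -d '{
  "model": "minicpm-v",
  "messages": [
    {
      "role": "user",
      "content": "List every transaction you can see in this statement.",
      "images": ["<base64-encoded PNG>"]
    }
  ],
  "stream": false
}'
```
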
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.

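Note that the GPU variant still needs the GPU passed through at runtime. A sketch with placeholder image tags:

```bash
# GPU variant: requires the NVIDIA Container Toolkit on the host
docker run -d --gpus all -p 11434:11434 -v ollama-data:/root/.ollama <gpu-image-tag>

# CPU variant: no GPU flag; CUDA_VISIBLE_DEVICES="" keeps inference on the CPU
docker run -d -p 11434:11434 -v ollama-data:/root/.ollama <cpu-image-tag>
```
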
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant uses a longer `start-period` (120s) because of its slower startup.

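To read the resulting health state from the host (a sketch; `<container-name>` is a placeholder):

```bash
# Prints "starting", "healthy", or "unhealthy"
docker inspect --format '{{.State.Health.Status}}' <container-name>
```
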
## PaddleOCR-VL (Recommended)

### Overview

PaddleOCR-VL is a 0.9B-parameter vision-language model optimized specifically for document parsing. It replaces the older PP-Structure approach with native VLM understanding.

**Key advantages over PP-Structure:**

- Native table understanding (no HTML parsing needed)
- Support for 109 languages
- Better handling of complex multi-row tables
- Structured Markdown/JSON output

### Docker Images

| Tag | Description |
|-----|-------------|
| `paddleocr-vl` | GPU variant using vLLM (recommended) |
| `paddleocr-vl-cpu` | CPU variant using transformers |

### API Endpoints (OpenAI-compatible)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/ocr` | POST | Legacy OCR endpoint |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "paddleocr-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Table Recognition:"}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 8192
}
```

**Task Prompts:**

- `"OCR:"` - Text recognition
- `"Table Recognition:"` - Table extraction (returns markdown)
- `"Formula Recognition:"` - Formula extraction
- `"Chart Recognition:"` - Chart extraction

**Response**

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "| Date | Description | Amount |\n|---|---|---|\n| 2021-06-01 | GITLAB INC | -119.96 |"
      },
      "finish_reason": "stop"
    }
  ]
}
```

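Put together, a table-recognition call against the running container might look like this (a sketch; it assumes the service listens on port 8000 as configured below and that `page.png` is your input image):

```bash
# Send one page image to PaddleOCR-VL and ask for a Markdown table.
# "base64 -w0" is the GNU coreutils form; adjust on other platforms.
IMG_B64=$(base64 -w0 page.png)

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"paddleocr-vl\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,${IMG_B64}\"}},
        {\"type\": \"text\", \"text\": \"Table Recognition:\"}
      ]
    }],
    \"temperature\": 0.0,
    \"max_tokens\": 8192
  }"
```
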
### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `PaddlePaddle/PaddleOCR-VL` | Model to load |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
| `MAX_BATCHED_TOKENS` | `16384` | vLLM max batched tokens |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |

### Performance

- **GPU (vLLM)**: ~2-5 seconds per page
- **CPU**: ~30-60 seconds per page

---

## Adding New Models

To add a new model variant:

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target
4. Add documentation to `readme.md`

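For local testing before wiring the variant into `build-images.sh`, it can be built and run directly (file and tag names below are placeholders):

```bash
# Build the new variant from its dedicated Dockerfile
docker build -f Dockerfile_newmodel -t ht-docker-ai:newmodel .

# Quick smoke test: start it and check that the API answers
docker run -d --name newmodel-test -p 11434:11434 ht-docker-ai:newmodel
curl localhost:11434/api/tags
```
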
## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```

The model download is ~5 GB and may take several minutes.

### Out of memory

- GPU: Use the int4 quantized version or add more VRAM
- CPU: Increase the container memory limit, e.g. `--memory=16g`

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`

## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```

## Multi-Pass Extraction Strategy

The bank statement extraction uses a dual-VLM consensus approach:

### Architecture: Dual-VLM Consensus

| VLM | Model | Purpose |
|-----|-------|---------|
| **MiniCPM-V 4.5** | 8B params | Primary visual extraction |
| **PaddleOCR-VL** | 0.9B params | Table-specialized extraction |

### Extraction Strategy

1. **Pass 1**: MiniCPM-V visual extraction (images → JSON)
2. **Pass 2**: PaddleOCR-VL table recognition (images → markdown → JSON)
3. **Consensus**: If Pass 1 == Pass 2, done (fast path)
4. **Pass 3+**: Additional MiniCPM-V visual passes if there is no consensus

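A minimal sketch of the consensus check between the two passes (the file names are hypothetical; the real pipeline compares the parsed transaction lists in code):

```bash
# Canonicalise both extraction results (sorted keys) and compare them byte-for-byte
if diff <(jq -S . pass1_minicpm.json) <(jq -S . pass2_paddleocr.json) > /dev/null; then
  echo "Consensus reached after 2 passes"
else
  echo "No consensus -- running additional MiniCPM-V passes"
fi
```
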
### Why Dual-VLM Works

- **Different architectures**: Two independent models cross-check each other
- **Specialized strengths**: PaddleOCR-VL is optimized for tables, MiniCPM-V for general vision
- **No structure loss**: Both VLMs see the original images directly
- **Fast consensus**: Most documents complete in 2 passes when the VLMs agree

### Comparison vs Old PP-Structure Approach

| Approach | Bank Statement Result | Notes |
|----------|----------------------|-------|
| MiniCPM-V Visual | 28 transactions ✓ | - |
| PP-Structure HTML + Visual | 13 transactions ✗ | HTML merged rows incorrectly |
| PaddleOCR-VL Table | 28 transactions ✓ | Native table understanding |

**Key insight**: PP-Structure's HTML output loses structure for complex tables. PaddleOCR-VL's native VLM approach maintains table integrity.

---

## Nanonets-OCR-s

### Overview

Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.

**Key features:**

- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10 GB VRAM (fits comfortably in 16 GB)

### Docker Images

| Tag | Description |
|-----|-------------|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |

### API Endpoints (OpenAI-compatible via vLLM)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |

### Request/Response Format

**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```

### Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```

### Performance

- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~8-10 GB

### Two-Stage Pipeline (Nanonets + Qwen3)

The Nanonets tests use a two-stage pipeline:

1. **Stage 1**: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
2. **Stage 2**: Qwen3 8B extracts structured JSON from the markdown (via Ollama on port 11434)

**GPU Limitation**: Both vLLM and Ollama require significant GPU memory. On a single-GPU system:

- Running both simultaneously causes memory contention
- For a single GPU: run the services sequentially (stop Nanonets before starting Qwen3)
- For multi-GPU: assign each service to a different GPU

**Sequential Execution**:

```bash
# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
```

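A sketch of what the Stage 2 call can look like, feeding the Stage 1 markdown to Qwen3 over the Ollama API (the model tag `qwen3:8b` and the prompt wording are assumptions, not taken verbatim from this repo):

```bash
# Stage 2 sketch: ask Qwen3 (via Ollama) to turn the OCR markdown into JSON
MARKDOWN=$(cat statement.md)

jq -n --arg md "$MARKDOWN" \
  '{model: "qwen3:8b",
    prompt: ("Extract all transactions from this bank statement as JSON:\n\n" + $md),
    stream: false,
    format: "json"}' \
  | curl -s http://localhost:11434/api/generate -d @-
```
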
---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Nanonets-OCR-s on HuggingFace](https://huggingface.co/nanonets/Nanonets-OCR-s)