# Technical Notes - ht-docker-ai

## Architecture

This project uses **Ollama** as the runtime framework for serving AI models, which provides:

- Automatic model download and caching
- A unified REST API (compatible with the OpenAI format; see the example below)
- Built-in quantization support
- GPU/CPU auto-detection
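
The OpenAI-compatible surface means existing OpenAI clients can point at the container directly. A minimal sketch, assuming the container is running with the default port 11434 published and the `minicpm-v` model already pulled:

```bash
# Query Ollama's OpenAI-compatible chat endpoint (port and model name per this document)
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minicpm-v",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```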
## Model Details

### MiniCPM-V 4.5

- **Source**: OpenBMB (https://github.com/OpenBMB/MiniCPM-V)
- **Base Models**: Qwen3-8B + SigLIP2-400M
- **Total Parameters**: 8B
- **Ollama Model Name**: `minicpm-v`

### VRAM Usage

| Mode | Memory Required |
|------|-----------------|
| Full precision (bf16) | 18GB VRAM |
| int4 quantized | 9GB VRAM |
| GGUF (CPU) | 8GB RAM |
## Container Startup Flow

1. `docker-entrypoint.sh` starts the Ollama server in the background
2. Waits for the server to become ready
3. Checks whether the model already exists in the volume
4. Pulls the model if it is not present
5. Keeps the container running (see the sketch below)
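
A minimal shell sketch of that flow — illustrative only, not the project's actual `docker-entrypoint.sh`, and it assumes the `MODEL_NAME` environment variable described later in these notes:

```bash
#!/usr/bin/env bash
# Illustrative sketch of the startup flow; the real docker-entrypoint.sh may differ.
set -e

ollama serve &                                     # 1. start the server in the background

until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 1                                          # 2. wait until the API responds
done

if ! ollama list | grep -q "${MODEL_NAME}"; then   # 3. check for an existing model
  ollama pull "${MODEL_NAME}"                      # 4. pull only if missing
fi

wait                                               # 5. block on the server process
```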
## Volume Persistence

Mount `/root/.ollama` to persist downloaded models:

```bash
-v ollama-data:/root/.ollama
```

Without this volume, the model will be re-downloaded on each container start (~5GB download).
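
Put together, a full invocation might look like the following; `<image>` is a placeholder for whichever variant tag you actually build or pull:

```bash
# Run detached with the Ollama port published and models persisted in a named volume
docker run -d --name ollama-ai \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <image>
```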
## API Endpoints

All endpoints follow the Ollama API specification:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/generate` | POST | Generate completion |
| `/api/chat` | POST | Chat completion |
| `/api/pull` | POST | Pull a model |
| `/api/show` | POST | Show model info |
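
As an example of the native API (as opposed to the OpenAI-compatible endpoint shown earlier), a non-streaming generation request might look like this:

```bash
# Non-streaming completion via Ollama's native generate endpoint
curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "minicpm-v",
    "prompt": "Summarize what an OCR pipeline does.",
    "stream": false
  }'
```

For vision input, Ollama's native API also accepts an `images` array of base64-encoded images alongside the prompt.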
## GPU Detection

The GPU variant uses Ollama's automatic GPU detection. For CPU-only mode, we set:

```dockerfile
ENV CUDA_VISIBLE_DEVICES=""
```

This forces Ollama to use CPU inference even if a GPU is available.
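
For the GPU variant, the container also needs GPU access at runtime. With the NVIDIA Container Toolkit installed on the host, that is typically done as follows (`<image>` is a placeholder):

```bash
# Expose all host GPUs to the container so Ollama's auto-detection can find them
docker run -d --gpus all \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  <image>
```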
## Health Checks

Both variants include Docker health checks:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:11434/api/tags || exit 1
```

The CPU variant has a longer `start-period` (120s) due to its slower startup.
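
Docker surfaces the result in the `STATUS` column of `docker ps` and via `docker inspect`; for example:

```bash
# Print just the health state reported by the container's HEALTHCHECK
docker inspect --format '{{.State.Health.Status}}' <container-name>
```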
## PaddleOCR

### Overview

PaddleOCR is a standalone OCR service using PaddlePaddle's PP-OCRv4 model. It provides:

- Text detection and recognition
- Multi-language support
- A FastAPI REST API
- GPU and CPU variants

### Docker Images

| Tag | Description |
|-----|-------------|
| `paddleocr` | GPU variant (default) |
| `paddleocr-gpu` | GPU variant (alias) |
| `paddleocr-cpu` | CPU-only variant |
### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with model info |
| `/ocr` | POST | OCR with base64 image (JSON body) |
| `/ocr/upload` | POST | OCR with file upload (multipart form) |
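
A quick liveness probe, assuming the default port 5000 from the environment table below is published:

```bash
# Expect a JSON payload describing the loaded model
curl -s http://localhost:5000/health
```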
### Request/Response Format

**POST /ocr (JSON)** — the `language` field is optional:

```json
{
  "image": "<base64-encoded-image>",
  "language": "en"
}
```

**POST /ocr/upload (multipart)**

- `img`: image file
- `language`: optional language code

**Response** — `box` holds the four corner points of each detected text region:

```json
{
  "success": true,
  "results": [
    {
      "text": "Invoice #12345",
      "confidence": 0.98,
      "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
    }
  ]
}
```
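
Illustrative calls to both endpoints, assuming the service is published on port 5000 and a local `invoice.png` exists:

```bash
# JSON endpoint: embed the image as base64 in the request body
# (-w0 disables line wrapping; GNU coreutils base64)
curl -s -X POST http://localhost:5000/ocr \
  -H "Content-Type: application/json" \
  -d "{\"image\": \"$(base64 -w0 invoice.png)\", \"language\": \"en\"}"

# Multipart endpoint: upload the file directly under the `img` field
curl -s -X POST http://localhost:5000/ocr/upload \
  -F "img=@invoice.png" \
  -F "language=en"
```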
### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_LANGUAGE` | `en` | Default language for OCR |
| `SERVER_PORT` | `5000` | Server port |
| `SERVER_HOST` | `0.0.0.0` | Server host |
| `CUDA_VISIBLE_DEVICES` | (auto) | Set to `-1` for CPU-only |
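
For instance, a CPU-only run with a German default language might look like this (`<image>` is a placeholder for the image name in your registry):

```bash
# Force CPU inference and override the default OCR language
docker run -d -p 5000:5000 \
  -e CUDA_VISIBLE_DEVICES=-1 \
  -e OCR_LANGUAGE=de \
  <image>:paddleocr-cpu
```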
### Performance

- **GPU**: ~1-3 seconds per page
- **CPU**: ~10-30 seconds per page

### Supported Languages

Common language codes: `en` (English), `ch` (Chinese), `de` (German), `fr` (French), `es` (Spanish), `ja` (Japanese), `ko` (Korean)
---

## Adding New Models
To add a new model variant:

1. Create `Dockerfile_<modelname>`
2. Set the `MODEL_NAME` environment variable
3. Update `build-images.sh` with the new build target (example invocation below)
4. Add documentation to `readme.md`
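
As a rough sketch of step 3's outcome, the build for a hypothetical `llava` variant might be invoked like this; the model name and tag scheme are assumptions, not taken from `build-images.sh`:

```bash
# Hypothetical build target for a new variant; names are illustrative only
docker build -f Dockerfile_llava -t ht-docker-ai:llava .
```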
## Troubleshooting

### Model download hangs

Check the container logs:

```bash
docker logs -f <container-name>
```

The model download is ~5GB and may take several minutes.
### Out of memory

- GPU: use the int4-quantized model or a GPU with more VRAM
- CPU: increase the container memory limit, e.g. `--memory=16g`

### API not responding

1. Check whether the container is healthy: `docker ps`
2. Check the logs for errors: `docker logs <container>`
3. Verify the port mapping: `curl localhost:11434/api/tags`
## CI/CD Integration

Build and push using npmci:

```bash
npmci docker login
npmci docker build
npmci docker push code.foss.global
```
## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)