---
## Nanonets-OCR-s
### Overview
Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.
**Key features:**
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)
### Docker Images
| Tag | Description |
|-----|-------------|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |
### API Endpoints (OpenAI-compatible via vLLM)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
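Before sending documents, it is worth probing the server. A minimal sketch using Python's `requests` (the `localhost:8000` address comes from the pipeline notes below; the rest is standard vLLM / OpenAI-API behavior):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM address; adjust to your deployment

# Health check: vLLM returns HTTP 200 with an empty body when the server is ready
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("healthy:", health.status_code == 200)

# List the models the server is actually serving
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])  # expect "nanonets/Nanonets-OCR-s"
```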
### Request/Response Format
**POST /v1/chat/completions (OpenAI-compatible)**
```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
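The response follows the standard OpenAI chat-completions schema that vLLM implements, so the OCR output arrives as the assistant message content. An abridged sketch (field values illustrative; `created` and `usage` fields omitted):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "nanonets/Nanonets-OCR-s",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "...extracted markdown..."},
      "finish_reason": "stop"
    }
  ]
}
```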
### Nanonets OCR Prompt
The model is designed to work with a specific prompt format:
```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```
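Putting the request format and prompt together, a single-page OCR call might look like the sketch below (the server address, file name, and `ocr_page` helper are illustrative assumptions; the prompt string is the one above, verbatim):

```python
import base64
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM address

# The recommended Nanonets prompt, verbatim from the section above
PROMPT = """Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>."""

def ocr_page(image_path: str) -> str:
    """Send one page image to Nanonets-OCR-s and return the markdown it produces."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "nanonets/Nanonets-OCR-s",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": PROMPT},
            ],
        }],
        "temperature": 0.0,
        "max_tokens": 4096,
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

markdown = ocr_page("statement_page1.png")  # hypothetical file name
```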
### Performance
- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~8-10GB
### Two-Stage Pipeline (Nanonets + Qwen3)
The Nanonets tests use a two-stage pipeline:
1. **Stage 1**: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
2. **Stage 2**: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)
**GPU Limitation**: Both vLLM and Ollama require significant GPU memory, so deployment depends on the available hardware:
- Running both services simultaneously on one GPU causes memory contention
- Single GPU: run the services sequentially (stop Nanonets before starting Qwen3)
- Multi-GPU: assign each service to its own GPU
**Sequential Execution**:
```bash
# Step 1: Run Nanonets OCR (converts page images to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (structured JSON from the markdown)
# minicpm-test is the Ollama container, which also serves Qwen3
docker start minicpm-test
# ... extract JSON ...
```
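An end-to-end sketch of the two stages, reusing the `ocr_page` helper from the request example above. The Ollama model tag `qwen3:8b`, the extraction prompt, and the output schema (`date`, `description`, `amount`) are illustrative assumptions; `/api/generate` with `"stream": false` and `"format": "json"` is Ollama's standard non-streaming generate API:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama address from the pipeline notes

# Hypothetical extraction prompt; adapt the schema to your documents
EXTRACTION_PROMPT = """Below is a bank statement converted to markdown.
Extract every transaction as JSON: a list of objects with
"date", "description", and "amount" keys. Return only JSON.

{markdown}"""

def extract_json(markdown: str) -> list:
    """Stage 2: ask Qwen3 (served by Ollama) to turn OCR markdown into structured JSON."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "qwen3:8b",  # assumed Ollama tag for Qwen3 8B
            "prompt": EXTRACTION_PROMPT.format(markdown=markdown),
            "stream": False,
            "format": "json",  # ask Ollama to constrain output to valid JSON
        },
        timeout=300,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

# Stage 1 (Nanonets running, Ollama stopped): markdown = ocr_page("statement_page1.png")
# Stage 2 (Nanonets stopped, Ollama running): transactions = extract_json(markdown)
```

On a single GPU, interleave the two stages with the sequential start/stop pattern shown above.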
---
## Related Resources
- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Nanonets-OCR-s on HuggingFace](https://huggingface.co/nanonets/Nanonets-OCR-s)