---
## Nanonets-OCR-s

### Overview

Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.

**Key features:**

- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)

### Docker Images

| Tag | Description |
|-----|-------------|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |

### API Endpoints (OpenAI-compatible via vLLM)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
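
A quick way to verify the service is up is to hit the first two endpoints before sending any documents. This is a minimal sketch, assuming the container is reachable on `localhost:8000` (the vLLM port used by the two-stage pipeline below); the base URL is an assumption, not a documented default of the image.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port; adjust to wherever nanonets-ocr is published

# /health returns 200 once the vLLM server has finished loading the model
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("healthy:", health.status_code == 200)

# /v1/models should list nanonets/Nanonets-OCR-s when the server is ready
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print([m["id"] for m in models.get("data", [])])
```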
### Request/Response Format
**POST /v1/chat/completions (OpenAI-compatible)**

```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
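
The same request can be assembled in a few lines of Python. This is a minimal sketch, assuming the server runs on `localhost:8000`; the file name `page_01.png` is a placeholder, and the shortened prompt stands in for the full recommended prompt shown in the next section.

```python
import base64
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM endpoint
IMAGE_PATH = "page_01.png"          # hypothetical input page

# Encode the page image as a base64 data URL (fills in the "..." placeholder above)
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "nanonets/Nanonets-OCR-s",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Extract the text from the above document as if you were reading it naturally."},
            ],
        }
    ],
    "temperature": 0.0,
    "max_tokens": 4096,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
markdown = resp.json()["choices"][0]["message"]["content"]  # the markdown transcription
print(markdown)
```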
### Nanonets OCR Prompt

The model is designed to work with a specific prompt format:

```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```
### Performance

- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~8-10GB

### Two-Stage Pipeline (Nanonets + Qwen3)

The Nanonets tests use a two-stage pipeline (a minimal Python sketch of both stages follows the Sequential Execution example below):

1. **Stage 1**: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
2. **Stage 2**: Qwen3 8B extracts structured JSON from the markdown (via Ollama on port 11434)

**GPU Limitation**: Both vLLM and Ollama require significant GPU memory. On a single-GPU system:

- Running both simultaneously causes memory contention
- For a single GPU: run the services sequentially (stop Nanonets before starting Qwen3)
- For multi-GPU: assign each service to a different GPU

**Sequential Execution**:
```bash
# Step 1: Run Nanonets OCR (converts to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (from markdown)
docker start minicpm-test
# ... extract JSON ...
```
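
The sketch below shows what happens between the `docker start`/`docker stop` commands above: Stage 1 calls the vLLM server on port 8000, Stage 2 calls Ollama on port 11434, and on a single GPU all Stage 1 calls would be completed (and the Nanonets container stopped) before Stage 2 begins. The helper names, the `qwen3:8b` Ollama tag, and the extraction prompt are illustrative assumptions, not the project's actual test code.

```python
import base64
import json

import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # Stage 1: Nanonets-OCR-s via vLLM
OLLAMA_URL = "http://localhost:11434/api/generate"      # Stage 2: Qwen3 via Ollama


def ocr_to_markdown(image_path: str) -> str:
    """Stage 1: convert one page image to markdown with Nanonets-OCR-s."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "nanonets/Nanonets-OCR-s",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Extract the text from the above document as if you were reading it naturally."},
            ],
        }],
        "temperature": 0.0,
        "max_tokens": 4096,
    }
    resp = requests.post(VLLM_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def markdown_to_json(markdown: str) -> dict:
    """Stage 2: extract structured JSON from the markdown with Qwen3 via Ollama."""
    payload = {
        "model": "qwen3:8b",  # assumed Ollama tag for Qwen3 8B
        "prompt": "Extract the key fields from this document as JSON:\n\n" + markdown,
        "stream": False,
        "format": "json",     # ask Ollama to constrain output to valid JSON
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return json.loads(resp.json()["response"])


if __name__ == "__main__":
    md = ocr_to_markdown("page_01.png")  # run while the nanonets-test container is up
    data = markdown_to_json(md)          # run after switching to the Ollama container
    print(json.dumps(data, indent=2))
```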

---

## Related Resources

- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Nanonets-OCR-s on HuggingFace](https://huggingface.co/nanonets/Nanonets-OCR-s)