---
## Nanonets-OCR-s
### Overview
Nanonets-OCR-s is a Qwen2.5-VL-3B model fine-tuned specifically for document OCR tasks. It outputs structured markdown with semantic tags.
**Key features:**
- Based on Qwen2.5-VL-3B (~4B parameters)
- Fine-tuned for document OCR
- Outputs markdown with semantic HTML tags
- ~8-10GB VRAM (fits comfortably in 16GB)
### Docker Images
| Tag | Description |
|-----|-------------|
| `nanonets-ocr` | GPU variant using vLLM (OpenAI-compatible API) |
### API Endpoints (OpenAI-compatible via vLLM)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
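Before sending documents, it is worth probing the server. A minimal sketch using Python's `requests` (the `localhost:8000` address comes from the pipeline notes below; the rest is standard vLLM / OpenAI-API behavior):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM address; adjust to your deployment

# Health check: vLLM returns HTTP 200 with an empty body when the server is ready
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("healthy:", health.status_code == 200)

# List the models the server is actually serving
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])  # expect "nanonets/Nanonets-OCR-s"
```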
### Request/Response Format
**POST /v1/chat/completions (OpenAI-compatible)**
```json
{
  "model": "nanonets/Nanonets-OCR-s",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Extract the text from the above document..."}
      ]
    }
  ],
  "temperature": 0.0,
  "max_tokens": 4096
}
```
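The response follows the standard OpenAI chat-completions schema that vLLM implements, so the OCR output arrives as the assistant message content. An abridged sketch (field values illustrative; `created` and `usage` fields omitted):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "nanonets/Nanonets-OCR-s",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "...extracted markdown..."},
      "finish_reason": "stop"
    }
  ]
}
```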
### Nanonets OCR Prompt
The model is designed to work with a specific prompt format:
```
Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>.
```
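Putting the request format and prompt together, a single-page OCR call might look like the sketch below (the server address, file name, and `ocr_page` helper are illustrative assumptions; the prompt string is the one above, verbatim):

```python
import base64
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM address

# The recommended Nanonets prompt, verbatim from the section above
PROMPT = """Extract the text from the above document as if you were reading it naturally.
Return the tables in html format.
Return the equations in LaTeX representation.
If there is an image in the document and image caption is not present, add a small description inside <img></img> tag.
Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>.
Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number>."""

def ocr_page(image_path: str) -> str:
    """Send one page image to Nanonets-OCR-s and return the markdown it produces."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "nanonets/Nanonets-OCR-s",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": PROMPT},
            ],
        }],
        "temperature": 0.0,
        "max_tokens": 4096,
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

markdown = ocr_page("statement_page1.png")  # hypothetical file name
```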
### Performance
- **GPU (vLLM)**: ~3-8 seconds per page
- **VRAM usage**: ~8-10GB
### Two-Stage Pipeline (Nanonets + Qwen3)
The Nanonets tests use a two-stage pipeline:
1. **Stage 1**: Nanonets-OCR-s converts images to markdown (via vLLM on port 8000)
2. **Stage 2**: Qwen3 8B extracts structured JSON from markdown (via Ollama on port 11434)
**GPU Limitation**: Both vLLM and Ollama require significant GPU memory, so deployment depends on the available hardware:
- Running both services simultaneously on one GPU causes memory contention
- Single GPU: run the services sequentially (stop Nanonets before starting Qwen3)
- Multi-GPU: assign each service to its own GPU
**Sequential Execution**:
```bash
# Step 1: Run Nanonets OCR (converts page images to markdown)
docker start nanonets-test
# ... perform OCR ...
docker stop nanonets-test

# Step 2: Run Qwen3 extraction (structured JSON from the markdown)
# minicpm-test is the Ollama container, which also serves Qwen3
docker start minicpm-test
# ... extract JSON ...
```
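An end-to-end sketch of the two stages, reusing the `ocr_page` helper from the request example above. The Ollama model tag `qwen3:8b`, the extraction prompt, and the output schema (`date`, `description`, `amount`) are illustrative assumptions; `/api/generate` with `"stream": false` and `"format": "json"` is Ollama's standard non-streaming generate API:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama address from the pipeline notes

# Hypothetical extraction prompt; adapt the schema to your documents
EXTRACTION_PROMPT = """Below is a bank statement converted to markdown.
Extract every transaction as JSON: a list of objects with
"date", "description", and "amount" keys. Return only JSON.

{markdown}"""

def extract_json(markdown: str) -> list:
    """Stage 2: ask Qwen3 (served by Ollama) to turn OCR markdown into structured JSON."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "qwen3:8b",  # assumed Ollama tag for Qwen3 8B
            "prompt": EXTRACTION_PROMPT.format(markdown=markdown),
            "stream": False,
            "format": "json",  # ask Ollama to constrain output to valid JSON
        },
        timeout=300,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

# Stage 1 (Nanonets running, Ollama stopped): markdown = ocr_page("statement_page1.png")
# Stage 2 (Nanonets stopped, Ollama running): transactions = extract_json(markdown)
```

On a single GPU, interleave the two stages with the sequential start/stop pattern shown above.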
---
## Related Resources
- [Ollama Documentation](https://ollama.ai/docs)
- [MiniCPM-V GitHub](https://github.com/OpenBMB/MiniCPM-V)
- [Ollama API Reference](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Nanonets-OCR-s on HuggingFace](https://huggingface.co/nanonets/Nanonets-OCR-s)