diff --git a/changelog.md b/changelog.md
index a51455f..55eacb7 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,5 +1,15 @@
 # Changelog
 
+## 2026-01-19 - 1.14.2 - fix(readme)
+update README to document Nanonets-OCR2-3B (replaces Nanonets-OCR-s), adjust VRAM and context defaults, expand feature docs, and update examples/test command
+
+- Renamed Nanonets-OCR-s -> Nanonets-OCR2-3B throughout README and examples
+- Updated Nanonets VRAM guidance from ~10GB to ~12-16GB and documented 30K context
+- Changed documented MAX_MODEL_LEN default from 8192 to 30000
+- Updated example model identifiers (model strings and curl/example snippets) to nanonets/Nanonets-OCR2-3B
+- Added MiniCPM and Qwen feature bullets (multilingual, multi-image, flowchart support, expanded context notes)
+- Changed README test command from ./test-images.sh to pnpm test
+
 ## 2026-01-19 - 1.14.1 - fix(extraction)
 improve JSON extraction prompts and model options for invoice and bank statement tests
diff --git a/readme.md b/readme.md
index a24f3b2..fa73b8d 100644
--- a/readme.md
+++ b/readme.md
@@ -2,7 +2,7 @@
 Production-ready Docker images for state-of-the-art AI Vision-Language Models. Run powerful multimodal AI locally with GPU acceleration—**no cloud API keys required**.
 
-> 🔥 **Three VLMs, one registry.** From lightweight document OCR to GPT-4o-level vision understanding—pick the right tool for your task.
+> 🔥 **Three VLMs, one registry.** From high-performance document OCR to GPT-4o-level vision understanding—pick the right tool for your task.
 
 ## Issue Reporting and Security
@@ -15,7 +15,7 @@ For reporting bugs, issues, or security vulnerabilities, please visit [community
 | Model | Parameters | Best For | API | Port | VRAM |
 |-------|-----------|----------|-----|------|------|
 | **MiniCPM-V 4.5** | 8B | General vision understanding, multi-image analysis | Ollama-compatible | 11434 | ~9GB |
-| **Nanonets-OCR-s** | ~4B | Document OCR with semantic markdown output | OpenAI-compatible | 8000 | ~10GB |
+| **Nanonets-OCR2-3B** | ~3B | Document OCR with semantic markdown, LaTeX, flowcharts | OpenAI-compatible | 8000 | ~12-16GB |
 | **Qwen3-VL-30B** | 30B (A3B) | Advanced visual agents, code generation from images | Ollama-compatible | 11434 | ~20GB |
 
 ---
@@ -29,7 +29,7 @@ code.foss.global/host.today/ht-docker-ai:
 | Tag | Model | Runtime | Port | VRAM |
 |-----|-------|---------|------|------|
 | `minicpm45v` / `latest` | MiniCPM-V 4.5 | Ollama | 11434 | ~9GB |
-| `nanonets-ocr` | Nanonets-OCR-s | vLLM | 8000 | ~10GB |
+| `nanonets-ocr` | Nanonets-OCR2-3B | vLLM | 8000 | ~12-16GB |
 | `qwen3vl` | Qwen3-VL-30B-A3B | Ollama | 11434 | ~20GB |
 
 ---
@@ -38,6 +38,13 @@
 A GPT-4o level multimodal LLM from OpenBMB—handles image understanding, OCR, multi-image analysis, and visual reasoning across **30+ languages**.
 
+### ✨ Key Features
+
+- 🌍 **Multilingual:** 30+ languages supported
+- 🖼️ **Multi-image:** Analyze multiple images in one request (see the example below)
+- 📊 **Versatile:** Charts, documents, photos, diagrams
+- ⚡ **Efficient:** Runs on consumer GPUs (9GB VRAM)
+
 ### Quick Start
 
 ```bash
@@ -83,21 +90,22 @@ curl http://localhost:11434/api/chat -d '{
 
 | Mode | VRAM Required |
 |------|---------------|
-| int4 quantized | 9GB |
-| Full precision (bf16) | 18GB |
+| int4 quantized | ~9GB |
+| Full precision (bf16) | ~18GB |
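+
+### Multi-Image Example
+
+Multi-image requests use the same Ollama `/api/chat` endpoint with several base64-encoded images in one message. A minimal sketch; the `minicpm-v4.5` model tag and the chart filenames are placeholder assumptions (check `ollama list` inside the container for the actual tag), and `base64 -w0` assumes GNU coreutils:
+
+```bash
+curl http://localhost:11434/api/chat -d '{
+  "model": "minicpm-v4.5",
+  "messages": [{
+    "role": "user",
+    "content": "Compare these two charts and summarize the key differences.",
+    "images": ["'"$(base64 -w0 chart-q1.png)"'", "'"$(base64 -w0 chart-q2.png)"'"]
+  }],
+  "stream": false
+}'
+```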
 
 ---
 
-## 🔍 Nanonets-OCR-s
+## 🔍 Nanonets-OCR2-3B
 
-A **Qwen2.5-VL-3B** model fine-tuned specifically for document OCR. Outputs structured markdown with semantic HTML tags—perfect for preserving document structure.
+The **latest Nanonets document OCR model** (October 2025 release)—based on Qwen2.5-VL-3B, fine-tuned specifically for document extraction with significant improvements over the original OCR-s.
 
-### Key Features
+### ✨ Key Features
 
-- 📝 **Semantic output:** Tables → HTML, equations → LaTeX, watermarks/page numbers → tagged
+- 📝 **Semantic output:** Tables → HTML, equations → LaTeX, flowcharts → structured markup
 - 🌍 **Multilingual:** Inherits Qwen's broad language support
-- ⚡ **Efficient:** ~10GB VRAM, runs great on consumer GPUs
+- 📄 **30K context:** Handle large, multi-page documents
 - 🔌 **OpenAI-compatible:** Drop-in replacement for existing pipelines
+- 🎯 **Improved accuracy:** Better semantic tagging and LaTeX equation extraction vs. OCR-s
 
 ### Quick Start
@@ -116,7 +124,7 @@ docker run -d \
 curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "nanonets/Nanonets-OCR-s",
+    "model": "nanonets/Nanonets-OCR2-3B",
     "messages": [{
       "role": "user",
       "content": [
@@ -131,7 +139,7 @@
 
 ### Output Format
 
-Nanonets-OCR-s returns markdown with semantic tags:
+Nanonets-OCR2-3B returns markdown with semantic tags:
 
 | Element | Output Format |
 |---------|---------------|
@@ -140,13 +148,14 @@
 | Images | `<img>description</img>` |
 | Watermarks | `<watermark>OFFICIAL COPY</watermark>` |
 | Page numbers | `<page_number>14</page_number>` |
+| Flowcharts | Structured markup |
 
-### Performance
+### Hardware Requirements
 
-| Metric | Value |
-|--------|-------|
-| Speed | 3–8 seconds per page |
-| VRAM | ~10GB |
+| Metric | Value |
+|--------|-------|
+| VRAM (30K context, default) | ~12-16GB |
+| Speed | ~3–8 seconds per page |
 
 ---
@@ -154,7 +163,7 @@
 The **most powerful** Qwen vision model—30B parameters with 3B active (MoE architecture). Handles complex visual reasoning, code generation from screenshots, and visual agent capabilities.
 
-### Key Features
+### ✨ Key Features
 
 - 🚀 **256K context** (expandable to 1M tokens!)
 - 🤖 **Visual agent capabilities** — can plan and execute multi-step tasks
@@ -204,7 +213,6 @@ curl http://localhost:11434/api/chat -d '{
 Run multiple VLMs together for maximum flexibility:
 
 ```yaml
-version: '3.8'
 services:
   # General vision tasks
   minicpm:
@@ -259,10 +267,10 @@ volumes:
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `MODEL_NAME` | `nanonets/Nanonets-OCR-s` | HuggingFace model ID |
+| `MODEL_NAME` | `nanonets/Nanonets-OCR2-3B` | HuggingFace model ID |
 | `HOST` | `0.0.0.0` | API bind address |
 | `PORT` | `8000` | API port |
-| `MAX_MODEL_LEN` | `8192` | Maximum sequence length |
+| `MAX_MODEL_LEN` | `30000` | Maximum sequence length |
 | `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0-1) |
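+
+For example, to trade context length for a smaller VRAM footprint, override the defaults at container start. A sketch; the values shown are illustrative, not recommendations:
+
+```bash
+docker run -d \
+  --gpus all \
+  -p 8000:8000 \
+  -e MAX_MODEL_LEN=16384 \
+  -e GPU_MEMORY_UTILIZATION=0.85 \
+  code.foss.global/host.today/ht-docker-ai:nanonets-ocr
+```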
 
 ---
@@ -283,7 +291,7 @@ This dual-VLM approach catches extraction errors that single models miss.
 ### Why Multi-Model Works
 
 - **Different architectures:** Independent models cross-validate each other
-- **Specialized strengths:** Nanonets-OCR-s excels at document structure; MiniCPM-V handles general vision
+- **Specialized strengths:** Nanonets-OCR2-3B excels at document structure; MiniCPM-V handles general vision
 - **Native processing:** All VLMs see original images—no intermediate structure loss
 
 ### Model Selection Guide
@@ -291,10 +299,11 @@ This dual-VLM approach catches extraction errors that single models miss.
 | Task | Recommended Model |
 |------|-------------------|
 | General image understanding | MiniCPM-V 4.5 |
-| Document OCR with structure preservation | Nanonets-OCR-s |
+| Document OCR with structure preservation | Nanonets-OCR2-3B |
 | Complex visual reasoning / code generation | Qwen3-VL-30B |
 | Multi-image analysis | MiniCPM-V 4.5 |
 | Visual agent tasks | Qwen3-VL-30B |
+| Large documents (up to 30K tokens) | Nanonets-OCR2-3B |
 
 ---
@@ -309,7 +318,7 @@ cd ht-docker-ai
 ./build-images.sh
 
 # Run tests
-./test-images.sh
+pnpm test
 ```
 
 ---
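+
+Once the containers are up, a quick smoke test using the standard model-listing endpoints (ports per the tables above):
+
+```bash
+# vLLM (Nanonets): OpenAI-compatible model listing
+curl http://localhost:8000/v1/models
+
+# Ollama (MiniCPM / Qwen): installed model tags
+curl http://localhost:11434/api/tags
+```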