feat(cluster,api,models,cli): add cluster-aware model catalog deployments and request routing

2026-04-20 23:00:50 +00:00
parent 83cacd0cf1
commit 4f2266e1b7
55 changed files with 3970 additions and 1630 deletions
# 🚀 ModelGrid
**vLLM deployment manager with an OpenAI-compatible API, clustering foundations, and a public OSS model catalog.**
ModelGrid is a root-level daemon that turns any GPU-equipped machine into a vLLM-serving node. It manages single-model vLLM deployments across NVIDIA, AMD, and Intel GPUs, exposes a unified **OpenAI-compatible API**, and resolves deployable models from **`list.modelgrid.com`**.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                            ModelGrid Daemon                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────────────────┐  │
│  │  GPU Manager  │   │  vLLM Deploy  │   │  OpenAI-Compatible API    │  │
│  │  NVIDIA/AMD/  │──▶│  Scheduler    │──▶│  /v1/chat/completions     │  │
│  │  Intel Arc    │   │ + Cluster Base│   │  /v1/models               │  │
│  └───────────────┘   └───────────────┘   │  /v1/embeddings           │  │
│                                          └───────────────────────────┘  │
│      └──── list.modelgrid.com catalog + deployment metadata ───────┘    │
└─────────────────────────────────────────────────────────────────────────┘
```
## Issue Reporting and Security
For reporting bugs, issues, or security vulnerabilities, please visit [community.foss.global/](https://community.foss.global/). This is the central community hub for all issue reporting. Developers who sign and comply with our contribution agreement and go through identification can also get a [code.foss.global/](https://code.foss.global/) account to submit Pull Requests directly.
## ✨ Features
- **🎯 OpenAI-Compatible API** — Drop-in replacement for OpenAI's API. Works with existing tools, SDKs, and applications
- **🖥️ Multi-GPU Support** — Auto-detect and manage NVIDIA (CUDA), AMD (ROCm), and Intel Arc (oneAPI) GPUs
- **📦 vLLM Deployments** — Launch model-specific vLLM runtimes instead of hand-managing containers
- **📚 OSS Model Catalog** — Resolve supported models from `list.modelgrid.com`
- **🕸️ Cluster Foundation** — Cluster-aware config surface for standalone, control-plane, and worker roles
- **⚡ Streaming Support** — Real-time token streaming via Server-Sent Events
- **🔄 Auto-Recovery** — Health monitoring with automatic container restart
- **🐳 Docker Native** — Full Docker/Podman integration with isolated networking
### Manual Binary Download
Download the appropriate binary for your platform from [releases](https://code.foss.global/modelgrid.com/modelgrid/releases):
| Platform | Binary |
| ------------------- | --------------------------- |
| Linux x64 | `modelgrid-linux-x64` |
| Linux ARM64 | `modelgrid-linux-arm64` |
| macOS Intel | `modelgrid-macos-x64` |
| macOS Apple Silicon | `modelgrid-macos-arm64` |
| Windows x64 | `modelgrid-windows-x64.exe` |
```bash
chmod +x modelgrid-linux-x64
```

```bash
sudo modelgrid config init

# 3. Add an API key
sudo modelgrid config apikey add

# 4. Browse the public catalog
sudo modelgrid model list

# 5. Deploy a model
sudo modelgrid run meta-llama/Llama-3.1-8B-Instruct

# 6. Enable and start the service
sudo modelgrid service enable
sudo modelgrid service start

# 7. Test the API
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```
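Once the daemon is running, any OpenAI-style client should work against it. The sketch below builds and sanity-checks a chat request; the model name assumes the Llama 3.1 deployment from step 5, and `YOUR_API_KEY` is a placeholder:

```shell
# Build the request body first so it can be validated before sending.
PAYLOAD='{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [{"role": "user", "content": "Say hello."}]
}'

# Verify the JSON is well-formed locally.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send it to the daemon (requires the service to be running):
# curl http://localhost:8080/v1/chat/completions \
#   -H "Authorization: Bearer YOUR_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

Adding `"stream": true` to the body requests Server-Sent Events streaming, per the Features list above.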
### Deployment Management

```bash
modelgrid ps                   # List active deployments
modelgrid run MODEL            # Deploy a model from the registry
modelgrid container list       # List all configured deployments
modelgrid container add        # Interactive deployment setup wizard
modelgrid container remove ID  # Remove a deployment
modelgrid container start [ID] # Start deployment(s)
modelgrid container stop [ID]  # Stop deployment(s)
modelgrid container logs ID    # Show deployment logs
```
### Model Management
```bash
modelgrid model list        # List available/loaded models
modelgrid model pull NAME   # Deploy a model from the registry
modelgrid model remove NAME # Remove a model deployment
modelgrid model status      # Show model recommendations with VRAM analysis
modelgrid model refresh     # Refresh registry cache
```
### Configuration
```bash
modelgrid config apikey add    # Generate and add new API key
modelgrid config apikey remove # Remove an API key
```
### Cluster
```bash
modelgrid cluster status # Show cluster state
modelgrid cluster nodes # List registered nodes
modelgrid cluster models # Show model locations across nodes
modelgrid cluster desired # Show desired deployment targets
modelgrid cluster ensure NAME # Ask control plane to schedule a model
modelgrid cluster scale NAME 3 # Set desired replica count
modelgrid cluster clear NAME # Remove desired deployment target
modelgrid cluster cordon NODE # Prevent new placements on a node
modelgrid cluster drain NODE # Mark a node for evacuation
modelgrid cluster activate NODE # Mark a node active again
```
### Global Options
```bash
--help, -h # Show help message
```
## 📦 Supported Runtime
### vLLM
High-performance inference with PagedAttention and continuous batching.
**Best for:** Production workloads, multi-GPU tensor parallelism, OpenAI-compatible serving
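This diff trims the vLLM configuration example that used to sit here. As a sketch, a single-model deployment entry mirrors the `containers` schema shown under Configuration below; all field values here are illustrative:

```json
{
  "id": "vllm-llama31-8b",
  "type": "vllm",
  "name": "Primary vLLM",
  "image": "vllm/vllm-openai:latest",
  "gpuIds": ["nvidia-0"],
  "port": 8000,
  "models": ["meta-llama/Llama-3.1-8B-Instruct"]
}
```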
## 🎯 GPU Support
### NVIDIA (CUDA)
**Requirements:**
- NVIDIA Driver 470+
- CUDA Toolkit 11.0+
- NVIDIA Container Toolkit (`nvidia-docker2`)
```bash
sudo systemctl restart docker
```
### AMD (ROCm)
**Requirements:**
- ROCm 5.0+
- AMD GPU with ROCm support (RX 6000+, MI series)
```bash
sudo amdgpu-install --usecase=rocm
```
### Intel Arc (oneAPI)
**Requirements:**
- Intel oneAPI Base Toolkit
- Intel Arc A-series GPU (A770, A750, A380)
Configuration is stored at `/etc/modelgrid/config.json`:

```json
{
  "containers": [
    {
      "id": "vllm-llama31-8b",
      "type": "vllm",
      "name": "Primary vLLM",
      "image": "vllm/vllm-openai:latest",
      "gpuIds": ["nvidia-0"],
      "port": 8000,
      "models": ["meta-llama/Llama-3.1-8B-Instruct"],
      "env": {},
      "volumes": []
    }
  ],
  "models": {
    "registryUrl": "https://list.modelgrid.com/catalog/models.json",
    "autoDeploy": true,
    "defaultEngine": "vllm",
    "autoLoad": ["meta-llama/Llama-3.1-8B-Instruct"]
  },
  "cluster": {
    "enabled": false,
    "nodeName": "modelgrid-local",
    "role": "standalone",
    "bindHost": "0.0.0.0",
    "gossipPort": 7946,
    "sharedSecret": "",
    "advertiseUrl": "http://127.0.0.1:8080",
    "heartbeatIntervalMs": 5000,
    "seedNodes": []
  },
  "checkInterval": 30000
}
```
### Configuration Options
| Option | Description | Default |
| ------------------------- | -------------------------------- | ----------------------- |
| `api.port` | API server port | `8080` |
| `api.host` | Bind address | `0.0.0.0` |
| `api.apiKeys` | Valid API keys | `[]` |
| `api.rateLimit` | Requests per minute | `60` |
| `docker.runtime` | Container runtime | `docker` |
| `gpus.autoDetect` | Auto-detect GPUs | `true` |
| `models.autoDeploy` | Auto-start deployments on demand | `true` |
| `models.autoLoad` | Models to preload on start | `[]` |
| `cluster.role` | Cluster mode | `standalone` |
| `cluster.sharedSecret` | Shared secret for `/_cluster/*` | unset |
| `cluster.advertiseUrl` | URL advertised to other nodes | `http://127.0.0.1:8080` |
| `cluster.controlPlaneUrl` | Control-plane URL for workers | unset |
| `checkInterval` | Health check interval (ms) | `30000` |
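As an illustration of the table above, a config that raises the API port and rate limit might look like this (assuming options left unspecified keep their defaults):

```json
{
  "api": {
    "port": 9090,
    "rateLimit": 120
  }
}
```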
## 🕸️ Clustering
Cluster mode uses ModelGrid's internal control-plane endpoints to:
- register worker nodes
- advertise locally deployed models
- persist desired deployment targets separately from live heartbeats
- schedule new deployments onto healthy nodes with enough VRAM
- proxy OpenAI-compatible requests to the selected node gateway
- exclude cordoned or draining nodes from new placements
Minimal setup:
```json
{
"cluster": {
"enabled": true,
"nodeName": "worker-a",
"role": "worker",
"sharedSecret": "replace-me-with-a-random-secret",
"advertiseUrl": "http://worker-a.internal:8080",
"controlPlaneUrl": "http://control.internal:8080",
"heartbeatIntervalMs": 5000
}
}
```
For the control plane, set `role` to `control-plane` and `advertiseUrl` to its reachable API URL.
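Concretely, a matching control-plane node might use the following (the node name is illustrative; the secret must match the workers'):

```json
{
  "cluster": {
    "enabled": true,
    "nodeName": "control-1",
    "role": "control-plane",
    "sharedSecret": "replace-me-with-a-random-secret",
    "advertiseUrl": "http://control.internal:8080"
  }
}
```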
Set the same `cluster.sharedSecret` on every node to protect internal cluster endpoints.
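Any sufficiently random string works as the secret; one way to generate one, assuming `openssl` is installed:

```shell
# 32 random bytes, hex-encoded: a 64-character secret for cluster.sharedSecret
SECRET="$(openssl rand -hex 32)"
echo "$SECRET"
```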
Runtime state files:
- `/var/lib/modelgrid/cluster-state.json` for live node heartbeats
- `/var/lib/modelgrid/cluster-control-state.json` for desired deployments and node lifecycle state
## 📚 Model Catalog
ModelGrid resolves deployable models from **`list.modelgrid.com`**. The catalog is public,
versioned, and describes:
- canonical model IDs
- aliases
- minimum VRAM and GPU count
- vLLM launch defaults
- desired replica counts
- capabilities like chat, completions, and embeddings
**Default catalog includes:**
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `BAAI/bge-m3`
**Custom registry source:**
```json
// models.json
{
"version": "1.0",
"generatedAt": "2026-04-20T00:00:00.000Z",
"models": [
{
"id": "meta-llama/Llama-3.1-8B-Instruct",
"engine": "vllm",
"source": { "repo": "meta-llama/Llama-3.1-8B-Instruct" },
"capabilities": { "chat": true },
"requirements": { "minVramGb": 18 },
"launchDefaults": { "replicas": 2 }
}
]
}
```
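As a sketch of consuming this format from a client, the snippet below extracts canonical model IDs from a local copy; against the live catalog you would fetch `https://list.modelgrid.com/catalog/models.json` with `curl` first (the sample entries here are illustrative):

```shell
# Write a tiny sample catalog in the format shown above.
cat > /tmp/models.json <<'EOF'
{
  "version": "1.0",
  "models": [
    { "id": "meta-llama/Llama-3.1-8B-Instruct", "engine": "vllm" },
    { "id": "BAAI/bge-m3", "engine": "vllm" }
  ]
}
EOF

# Extract the canonical IDs (python3 used for portable JSON parsing).
IDS="$(python3 -c 'import json; d = json.load(open("/tmp/models.json")); print("\n".join(m["id"] for m in d["models"]))')"
echo "$IDS"
```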
Configure with:
```json
{
"models": {
"registryUrl": "https://your-server.com/models.json"
}
}
```
```
modelgrid/
│   ├── drivers/        # Driver management
│   ├── docker/         # Docker management
│   ├── containers/     # Container orchestration
│   │   └── vllm.ts     # vLLM implementation
│   ├── api/            # OpenAI-compatible API
```
## License and Legal Information
This repository contains open-source code licensed under the MIT License. A copy of the license can be found in the [LICENSE](./LICENSE) file.
**Please note:** The MIT License does not grant permission to use the trade names, trademarks, service marks, or product names of the project, except as required for reasonable and customary use in describing the origin of the work and reproducing the content of the NOTICE file.
### Trademarks
This project is owned and maintained by Task Venture Capital GmbH. The names and logos associated with Task Venture Capital GmbH and any related products or services are trademarks of Task Venture Capital GmbH or third parties, and are not included within the scope of the MIT license granted herein.
Use of these trademarks must comply with Task Venture Capital GmbH's Trademark Guidelines or the guidelines of the respective third-party owners, and any usage must be approved in writing. Third-party trademarks used herein are the property of their respective owners and used only in a descriptive manner, e.g. for an implementation of an API or similar.
### Company Information
Task Venture Capital GmbH, registered at District Court Bremen HRB 35230 HB, Germany.
For any legal inquiries or further information, please contact us via email at hello@task.vc.
By using this repository, you acknowledge that you have read this section, agree to comply with its terms, and understand that the licensing of the code does not imply endorsement by Task Venture Capital GmbH of any derivative works.