# 🖥️ ModelGrid — UI Concept

**A browser-based operations console for ModelGrid, served by the same daemon that already exposes the OpenAI-compatible API.**

This document sketches the user interface that will sit on top of the ModelGrid daemon: what it shows, how it is organized, how an operator moves through it, and how it stays in sync with a running node or a small cluster. It is a concept, not a final spec — the goal is to lock the shape of the product before any frontend code is written.

The structural idioms (tabbed top-level views, route-origin awareness, embedded ops dashboard on a dedicated port, API-first with a thin UI on top) are adapted from `@serve.zone/dcrouter`'s Ops dashboard. ModelGrid's UI should feel familiar to anyone who has operated dcrouter, while staying grounded in ModelGrid's own domain: GPUs, vLLM deployments, a public model catalog, and a cluster of gateway-capable nodes.

## 🎯 Purpose & Audience

- **Primary user:** the operator of one or a few ModelGrid nodes. Often the same person who provisioned the GPU host and ran `modelgrid service enable`.
- **Secondary user:** a platform engineer wiring ModelGrid into an internal AI platform who needs to manage API keys, audit deployments, and watch request traffic.
- **Not an end-user chat UI.** Consumers of the OpenAI-compatible API keep using their own SDKs and tools. The browser UI is for operating the fleet, not for prompting models.

The UI should collapse gracefully from a full cluster view down to a single-node, standalone deployment, because both shapes are first-class in ModelGrid's `cluster.role` model (`standalone` / `control-plane` / `worker`).

## 🧭 Top-Level Information Architecture

URLs follow `/{view}` for flat views and `/{view}/{subview}` for tabbed views, matching dcrouter's routing idiom.

```
/overview
  /stats
  /configuration

/cluster
  /nodes
  /placements
  /desired

/gpus
  /devices
  /drivers

/deployments
  /active
  /history

/models
  /catalog
  /deployed

/access
  /apikeys
  /clients

/logs      (flat)
/metrics   (flat)
/settings  (flat)
```

Rationale for the split:

- **Overview** is the landing page — one screen that answers "is the fleet healthy right now?"
- **Cluster / GPUs / Deployments / Models** are the four nouns an operator actually reasons about when running ModelGrid. Keeping them at the top level matches the CLI verbs (`modelgrid cluster`, `modelgrid gpu`, `modelgrid container`, `modelgrid model`) so muscle memory transfers.
- **Access** consolidates the authn/authz surface (API keys today, user/OIDC later) into one place, the way dcrouter groups `apitokens` and `users` under `access`.
- **Logs** and **Metrics** are flat because they are cross-cutting streams, not noun-scoped tabs.

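
The `/{view}/{subview}` scheme can be sketched as a small route table for the client router. The tree mirrors the one above, but the shape (a `Route` type with a subview list and a `resolve` helper) is illustrative, not a committed API:

```typescript
// Illustrative route table for the /{view}/{subview} scheme.
type Route = { view: string; subviews: string[] };

const routes: Route[] = [
  { view: "overview", subviews: ["stats", "configuration"] },
  { view: "cluster", subviews: ["nodes", "placements", "desired"] },
  { view: "gpus", subviews: ["devices", "drivers"] },
  { view: "deployments", subviews: ["active", "history"] },
  { view: "models", subviews: ["catalog", "deployed"] },
  { view: "access", subviews: ["apikeys", "clients"] },
  { view: "logs", subviews: [] },     // flat
  { view: "metrics", subviews: [] },  // flat
  { view: "settings", subviews: [] }, // flat
];

// Resolve a path like "/cluster/nodes". A bare "/{view}" falls back to the
// first subview so deep links and the nav rail always agree on a tab.
function resolve(path: string): { view: string; subview: string | null } | null {
  const [view, subview] = path.replace(/^\//, "").split("/");
  const route = routes.find((r) => r.view === view);
  if (!route) return null;
  if (route.subviews.length === 0) return { view, subview: null };
  return { view, subview: subview ?? route.subviews[0] };
}
```

A fallback like this keeps "open `/overview`" and "open `/overview/stats`" identical, which is the behavior the tab strip implies.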
The navigation chrome itself is a persistent left rail on desktop, collapsing into a top hamburger on narrow viewports. The selected view is indicated there; subviews surface as a tab strip at the top of the content area.

```
┌────────────┬──────────────────────────────────────────────────────────────┐
│ ModelGrid  │ Overview ▸ Stats          Configuration                      │
│            ├──────────────────────────────────────────────────────────────┤
│ Overview  ●│                                                              │
│ Cluster    │  ┌─ Fleet Health ─────────────────────────────────────┐      │
│ GPUs       │  │ 2 nodes • 3 GPUs • 4 deployments • api OK          │      │
│ Deploys    │  └────────────────────────────────────────────────────┘      │
│ Models     │  ┌─ Live Traffic ──────────────┐  ┌─ GPU Utilization ─┐      │
│ Access     │  │ 42 req/s   p95 820 ms       │  │ ▁▂▄▅▇█▇▅▄▂▁       │      │
│            │  │ ▁▂▃▅▇▇▅▃▂▁▁▂▄▆              │  │ avg 64%           │      │
│ Logs       │  └─────────────────────────────┘  └───────────────────┘      │
│ Metrics    │  ┌─ Deployments ────────────────────────────────────┐        │
│ Settings   │  │ llama-3.1-8b   running    2/2   nvidia-0,1       │        │
│            │  │ qwen2.5-7b     running    1/1   nvidia-2         │        │
│ node: ctrl │  │ bge-m3         pending    0/1   (no capacity)    │        │
│ v1.1.0     │  └──────────────────────────────────────────────────┘        │
└────────────┴──────────────────────────────────────────────────────────────┘
```

The footer of the rail surfaces the local node's identity (`nodeName`, `role`), the daemon version, and a small link to the API base URL — equivalent to how dcrouter surfaces its runtime identity in the sidebar.

## 📄 Per-View Sketches

### Overview ▸ Stats (landing page)

A dashboard of the things that an on-call operator wants to see in under two seconds:

- **Fleet health band**: green/yellow/red status tiles for nodes, GPUs, deployments, API.
- **Live traffic**: requests/sec, p50/p95/p99 latency, error rate. Sparkline for the last 15 minutes, streaming from `/metrics` and a server-pushed channel.
- **GPU utilization strip**: one micro-sparkline per GPU, colored by VRAM pressure.
- **Deployment summary**: the `modelgrid ps` output, but clickable. Each row deep-links into Deployments ▸ Active.
- **Catalog drift**: a small callout when `list.modelgrid.com` has newer model entries than the node's cached catalog.

### Overview ▸ Configuration

A read-only rendering of the resolved `/etc/modelgrid/config.json` with section headers (`api`, `docker`, `gpus`, `models`, `cluster`). Operators can copy the JSON; editing config is intentionally kept to the Settings view (or the CLI) to avoid a "two sources of truth" problem.

### Cluster ▸ Nodes

Mirrors `modelgrid cluster nodes`. Each row: node name, role badge (`standalone` / `control-plane` / `worker`), advertised URL, last heartbeat, GPU inventory summary, status (`active` / `cordoned` / `draining`).

Row actions: `cordon`, `drain`, `activate` — the same verbs as the CLI. Hitting an action fires the corresponding control-plane call and shows an in-row toast on success.

```
┌ Nodes ───────────────────────────────────────────────────────────────────┐
│ Name       Role            Advertised URL               Heartbeat        │
│ ──────────────────────────────────────────────────────────────────────   │
│ control-a  control-plane   http://ctrl.internal:8080    2s ago  ●        │
│ worker-a   worker          http://wa.internal:8080      3s ago  ●        │
│ worker-b   worker          http://wb.internal:8080      41s ago ◐        │
│            [cordon] [drain]                                              │
└──────────────────────────────────────────────────────────────────────────┘
```

### Cluster ▸ Placements

A live map of where every deployed model is currently running, read from the control-plane's placement state. Grouped by model, with a column per node. Cells show replica count and health. This is where the operator answers "where did `llama-3.1-8b` actually end up?".

### Cluster ▸ Desired

The companion to Placements: the desired-state table. Each row is a model with a target replica count. Rows can be added (`cluster ensure`), edited (`cluster scale`), or removed (`cluster clear`). The reconciler's pending work is surfaced as a diff badge: e.g. `+1 replica`, `moving from worker-b → worker-a`.

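
The diff badge could be derived client-side from the two tables. A minimal sketch, assuming desired and placed state both arrive as replica counts keyed by canonical model ID (the field shapes are illustrative, not a defined API):

```typescript
// Illustrative: compute diff badges from desired vs. placed replica counts.
type Replicas = Record<string, number>; // canonical model ID → replica count

function diffBadges(desired: Replicas, placed: Replicas): Record<string, string> {
  const badges: Record<string, string> = {};
  const models = new Set([...Object.keys(desired), ...Object.keys(placed)]);
  models.forEach((model) => {
    const want = desired[model] ?? 0;
    const have = placed[model] ?? 0;
    if (want === have) return; // reconciled: no badge
    const delta = want - have;
    badges[model] =
      delta > 0
        ? `+${delta} replica${delta > 1 ? "s" : ""}`
        : `${delta} replica${delta < -1 ? "s" : ""}`;
  });
  return badges;
}
```

A "moving from node X → node Y" badge would need per-node placement data rather than totals; this sketch only covers the replica-count case.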
### GPUs ▸ Devices

Mirrors `modelgrid gpu list` / `gpu status`, rendered as a card per GPU: vendor, model, VRAM free/total, driver version, temperature, current utilization, and which deployment is using it. Cards stream their utilization via the realtime channel; no full page reloads.

### GPUs ▸ Drivers

Status per vendor (NVIDIA / AMD / Intel): driver installed? version? any known issue? Includes a button to run `modelgrid gpu install` — but since the install flow is privileged and interactive, the UI only kicks off the CLI walk-through in a terminal session rather than trying to reimplement it in the browser. A small "copy the command" affordance makes this explicit.

### Deployments ▸ Active

The core operational table. One row per active vLLM deployment:

- container ID, display name, model, GPU bindings, port, uptime, request rate, error rate
- status pill (`running`, `pending`, `restarting`, `failed`)
- row actions: `logs`, `stop`, `restart`, `remove`

Clicking a row opens a detail drawer with sub-tabs:

- **Summary** — the effective container config and the scheduling decision that landed it on this node
- **Logs** — a live tail (SSE)
- **Metrics** — request latency histogram, token throughput, VRAM occupancy
- **Events** — a timeline of lifecycle events (scheduled, pulled image, started, health check, restart, stopped)

### Deployments ▸ History

Deployments that have been stopped or removed, with the reason and the last-known logs. Useful for a post-mortem on a failed deploy.

### Models ▸ Catalog

The current catalog resolved from `list.modelgrid.com`, with a "refresh" action that calls `modelgrid model refresh`. Each entry shows canonical ID, aliases, capabilities (chat / completions / embeddings), minimum VRAM, default GPU count, and a `Deploy` button. Deploying opens a small form that mirrors `modelgrid run`: target node (or auto), desired replica count, optional env overrides (e.g. `HF_TOKEN`).

A visible "source" badge marks whether the entry came from the public catalog or a custom `registryUrl`, so operators can tell at a glance which models the cluster will actually trust for auto-deploy.

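
The deploy form maps naturally onto a small request payload. The shape below is illustrative (the daemon's actual endpoint and field names are yet to be defined), but it captures the three fields the form collects:

```typescript
// Illustrative deploy-request payload mirroring the Deploy form fields.
interface DeployRequest {
  model: string;                // canonical catalog ID
  targetNode: string | "auto";  // specific worker, or let the scheduler pick
  replicas: number;             // defaults from the catalog entry
  env?: Record<string, string>; // optional overrides, e.g. HF_TOKEN
}

// A tiny validator the form could run before submitting.
function validateDeploy(req: DeployRequest): string[] {
  const errors: string[] = [];
  if (!req.model) errors.push("model is required");
  if (!Number.isInteger(req.replicas) || req.replicas < 1)
    errors.push("replicas must be a positive integer");
  return errors;
}
```

Keeping the payload this small is what lets the same submit path back both the UI dialog and `modelgrid run` scripting.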
### Models ▸ Deployed

Shows the union of what is running across the cluster, with replica counts, keyed by canonical model ID. This is the view a developer asks the operator for when they want to know "what models can I hit on this endpoint?". It is effectively a pretty rendering of `/v1/models`.

### Access ▸ API Keys

Mirrors `modelgrid config apikey list`. Columns: label, prefix (first 8 chars), created, last used, status. Actions: `generate`, `revoke`. Generating a key shows the secret once in a modal with a copy button, then never shows it again — the same contract as dcrouter's API tokens.

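
The show-once contract implies the table only ever holds the prefix. A sketch of the masking the UI could apply at generation time (the key format in the example is assumed, not specified anywhere):

```typescript
// Illustrative: after the one-time reveal, only this masked form is ever
// stored or rendered, so the secret never round-trips to the table view.
function maskApiKey(secret: string, prefixLen = 8): string {
  // Keep the first `prefixLen` chars so operators can match keys to clients.
  return secret.slice(0, prefixLen) + "…";
}
```

The daemon should enforce the same rule server-side (return only the prefix after creation); the client-side mask is just belt-and-braces for the reveal modal.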
### Access ▸ Clients

Placeholder for per-consumer rate limits, quotas, and request labels. This view is explicitly future work; it renders as "not yet configured" until the daemon exposes client records. Listing it now reserves the IA slot so it doesn't have to be retrofitted later.

### Logs

A unified tail across daemon, scheduler, and deployments, with filters by source (`daemon`, `scheduler`, `deployment:<id>`), level, and free-text. Streamed via SSE. A "pause" toggle freezes the view for reading; a "download" action exports the current buffer as NDJSON.

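
The three filters compose into a single predicate over the streamed entries. A sketch, with an assumed `LogEntry` shape (`source`, `level`, `message` are illustrative field names, not a defined wire format):

```typescript
// Illustrative log entry and composable filter for the unified tail.
interface LogEntry {
  source: string; // "daemon", "scheduler", "deployment:<id>"
  level: "debug" | "info" | "warn" | "error";
  message: string;
}

interface LogFilter {
  source?: string;              // exact source match
  minLevel?: LogEntry["level"]; // this level and above
  text?: string;                // case-insensitive substring
}

const LEVELS = ["debug", "info", "warn", "error"];

function matches(entry: LogEntry, f: LogFilter): boolean {
  if (f.source && entry.source !== f.source) return false;
  if (f.minLevel && LEVELS.indexOf(entry.level) < LEVELS.indexOf(f.minLevel)) return false;
  if (f.text && !entry.message.toLowerCase().includes(f.text.toLowerCase())) return false;
  return true;
}
```

Running the same predicate over the paused buffer is what makes the "pause" and "download" actions consistent with the live view.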
### Metrics

The `/metrics` endpoint rendered as a small set of charts (request rate, latency, error rate, VRAM occupancy, model throughput). This is deliberately lightweight — serious monitoring is expected to come from Prometheus scraping `/metrics` into Grafana, and the UI says so with a link to the recommended dashboard snippet.

### Settings

Editable configuration, grouped to match the config file:

- **API** — port, bind host, CORS, rate limit
- **Docker** — runtime, network name, socket path
- **GPUs** — auto-detect toggle, per-GPU assignments
- **Models** — registry URL, auto-deploy, default engine, auto-load list
- **Cluster** — role, advertise URL, control-plane URL, shared secret, heartbeat interval, seeds

Edits write through the daemon's config API (to be defined) and reload without a restart wherever possible. Settings that require a restart are marked with a `restart required` badge, and the UI surfaces a single "restart daemon" action at the top of the view when any are pending.

## 🛤️ Key User Journeys

### Deploy a model from the catalog

1. Operator opens **Models ▸ Catalog**, filters for chat-capable models with VRAM ≤ 24 GB.
2. Clicks `Deploy` on `meta-llama/Llama-3.1-8B-Instruct`.
3. A dialog appears with target node (`auto` / specific worker), replica count (default from catalog), optional env (`HF_TOKEN`).
4. On submit, the UI calls the control plane (`cluster ensure` + `scale` under the hood). The dialog closes and the new row appears in **Deployments ▸ Active** in `pending` state.
5. SSE updates walk the row through `pulling image → starting → running`.
6. A toast links to the deployment detail drawer for logs.

### Add a worker node to an existing control plane

1. Operator opens **Cluster ▸ Nodes** on the control plane.
2. Clicks `Add node`, which opens a helper that pre-fills the worker's expected `cluster` config block — role, control-plane URL, shared secret — and exposes a one-liner install command.
3. The operator runs the install command on the worker host. The UI does **not** SSH into anything; it just hands out the exact snippet.
4. Once the worker's daemon starts and registers, the new node appears in the Nodes table with its first heartbeat. The helper closes automatically.

### Rotate an API key

1. **Access ▸ API Keys** → `Generate`.
2. Name the key, pick a scope (today: single scope; later: per-model).
3. The secret is shown once in a modal, with copy-to-clipboard and a clear "you will not see this again" note.
4. The old key row gets a `revoke` action. Revoke is a confirm-then-apply flow because it will break live traffic.

### Investigate a failing deployment

1. **Overview ▸ Stats** shows a red tile: `1 deployment failed`.
2. Clicking it drills into **Deployments ▸ Active**, filtered to `failed`.
3. Open the row drawer → **Events** tab to see the lifecycle timeline.
4. Jump to the **Logs** tab for the live tail. If the deployment is down, fall back to the last 500 lines from its event buffer.
5. From the drawer, `restart` retries the deployment; if it fails again, the `Summary` tab shows the scheduling decision so the operator can see whether VRAM, GPU pinning, or image pull is the root cause.

## 📡 Realtime, Auth, and API Contract

- **Realtime updates.** Metrics, logs, GPU utilization, heartbeats, and deployment state changes stream over Server-Sent Events. A single `/v1/_ui/events?topics=...` endpoint is preferred over per-feature sockets so the browser holds exactly one connection. WebSocket is reserved for bidirectional features (e.g. an interactive install walkthrough) that we do not need in v1.
- **Auth model.** The UI runs behind the same daemon process as the OpenAI-compatible API, on a dedicated `uiPort` (default `8081`) to keep the data plane clean. Login uses a session cookie; the first-boot bootstrap seeds an `admin` user with a one-time password printed to `journalctl -u modelgrid`, the same way dcrouter prints its initial `admin`/`admin`. SSO/OIDC is a later add-on.
- **API contract.** Every UI action maps to an HTTP endpoint on the daemon (`/v1/_ui/...`). The UI must not talk to any private internals directly; this keeps `@modelgrid.com/modelgrid-apiclient` (a future sibling to `@serve.zone/dcrouter-apiclient`) able to do everything the UI can do, from scripts.
- **Origin badges.** Similar to dcrouter's `config` / `email` / `dns` / `api` route-origin model, ModelGrid should tag each deployment with its origin: `config` (seeded via `containers` in config.json), `catalog` (auto-deployed from `models.autoLoad`), or `api` (created via the UI/API). Origin determines what the UI allows: `config`-origin deployments are toggle-only, while `api`-origin deployments are full CRUD.

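
The origin-to-permissions rule could live in one small table shared by every view. The action names are illustrative, and the `catalog` row is an assumption (this document only pins down the `config` and `api` cases):

```typescript
// Illustrative: what the UI lets an operator do, keyed by deployment origin.
type Origin = "config" | "catalog" | "api";
type Action = "toggle" | "stop" | "restart" | "remove" | "edit";

const allowedActions: Record<Origin, Action[]> = {
  config: ["toggle"],                     // seeded from config.json: toggle-only
  catalog: ["toggle", "stop", "restart"], // ASSUMPTION: auto-deployed sits in between
  api: ["toggle", "stop", "restart", "remove", "edit"], // created via UI/API: full CRUD
};

function canPerform(origin: Origin, action: Action): boolean {
  return allowedActions[origin].includes(action);
}
```

Centralizing the table means a row action, a drawer button, and the API client can all consult the same rule instead of re-deriving it per view.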
## 🧱 Implementation Notes (non-binding)

- **Web component stack.** Match the dcrouter OpsServer approach: component-per-view under `ts_web/elements/<area>/`, a tiny SmartRouter-style client router (`ts_web/router.ts`), and a single `appstate.ts` as the store.
- **Packaging.** Follow dcrouter's module split: `@modelgrid.com/modelgrid` ships the daemon and the UI bundle; a future `@modelgrid.com/modelgrid-web` can carve out the web boundary if the bundle grows large.
- **Dark theme default** (black background, high-contrast foreground) to match dcrouter and the expected server-ops environment. A light theme is a later toggle.
- **No server-side rendering.** The UI is a static SPA served by the daemon; all data is fetched through the API. This keeps the runtime surface small and makes the UI-less `curl` story identical to the UI story.

## ❓ Open Questions

- **Edit config from the UI, or keep it CLI/file-first?** Current lean: the UI is authoritative only for API keys, deployments, and cluster actions. Config editing is exposed but optional, with the CLI still the canonical path for reproducible installs.
- **Do we expose a model prompt playground?** Nice to have for smoke tests, but it blurs the operator/consumer line. Defer to v2.
- **Cluster-wide vs per-node view.** On a worker node, should the UI show only local state, or proxy the control plane's cluster view? Current lean: workers show local-only state and link to the control plane for cluster views. This avoids split-brain confusion.
- **Access control granularity.** API keys today are coarse (all or nothing). A future model might scope keys per deployment or per model. Reserve the column in the Access ▸ API Keys table now.

## 🛑 Out of Scope (for this concept)

- End-user chat or prompt UIs for the OpenAI-compatible API.
- Billing, quotas, or usage-based pricing dashboards.
- Multi-tenant isolation beyond per-API-key separation.
- Anything specific to non-vLLM runtimes — the UI assumes the v1.1.0 reorientation around vLLM as the only first-class runtime.