diff --git a/readme.ui.md b/readme.ui.md
new file mode 100644
index 0000000..da72dd8
--- /dev/null
+++ b/readme.ui.md
@@ -0,0 +1,392 @@
# 🖥️ ModelGrid — UI Concept

**A browser-based operations console for ModelGrid, served by the same daemon that
already exposes the OpenAI-compatible API.**

This document sketches the user interface that will sit on top of the ModelGrid
daemon: what it shows, how it is organized, how an operator moves through it,
and how it stays in sync with a running node or a small cluster. It is a
concept, not a final spec — the goal is to lock the shape of the product
before any frontend code is written.

The structural idioms (tabbed top-level views, route-origin awareness,
embedded ops dashboard on a dedicated port, API-first with a thin UI on top)
are adapted from `@serve.zone/dcrouter`'s Ops dashboard. ModelGrid's UI should
feel familiar to anyone who has operated dcrouter, while staying grounded in
ModelGrid's own domain: GPUs, vLLM deployments, a public model catalog, and a
cluster of gateway-capable nodes.

## 🎯 Purpose & Audience

- **Primary user:** the operator of one or a few ModelGrid nodes. Often the
  same person who provisioned the GPU host and ran `modelgrid service enable`.
- **Secondary user:** a platform engineer wiring ModelGrid into an internal
  AI platform who needs to manage API keys, audit deployments, and watch
  request traffic.
- **Not an end-user chat UI.** Consumers of the OpenAI-compatible API keep
  using their own SDKs and tools. The browser UI is for operating the fleet,
  not for prompting models.

The UI should collapse gracefully from a full cluster view down to a
single-node, standalone deployment, because both shapes are first-class in
ModelGrid's `cluster.role` model (`standalone` / `control-plane` / `worker`).

## 🧭 Top-Level Information Architecture

URLs follow `/{view}` for flat views and `/{view}/{subview}` for tabbed
views, matching dcrouter's routing idiom.
+ +``` +/overview + /stats + /configuration + +/cluster + /nodes + /placements + /desired + +/gpus + /devices + /drivers + +/deployments + /active + /history + +/models + /catalog + /deployed + +/access + /apikeys + /clients + +/logs (flat) +/metrics (flat) +/settings (flat) +``` + +Rationale for the split: + +- **Overview** is the landing page — one screen that answers "is the fleet + healthy right now?" +- **Cluster / GPUs / Deployments / Models** are the four nouns an operator + actually reasons about when running ModelGrid. Keeping them at the top + level matches the CLI verbs (`modelgrid cluster`, `modelgrid gpu`, + `modelgrid container`, `modelgrid model`) so muscle memory transfers. +- **Access** consolidates the authn/authz surface (API keys today, + user/OIDC later) into one place, the way dcrouter groups `apitokens` and + `users` under `access`. +- **Logs** and **Metrics** are flat because they are cross-cutting streams, + not noun-scoped tabs. + +The navigation chrome itself is a persistent left rail on desktop, collapsing +into a top hamburger on narrow viewports. The selected view is indicated +there; subviews surface as a tab strip at the top of the content area. 
+ +``` +┌────────────┬──────────────────────────────────────────────────────────────┐ +│ ModelGrid │ Overview ▸ Stats Configuration │ +│ ├──────────────────────────────────────────────────────────────┤ +│ Overview ●│ │ +│ Cluster │ ┌─ Fleet Health ─────────────────────────────────────┐ │ +│ GPUs │ │ 2 nodes • 3 GPUs • 4 deployments • api OK │ │ +│ Deploys │ └───────────────────────────────────────────────────┘ │ +│ Models │ ┌─ Live Traffic ──────────────┐ ┌─ GPU Utilization ─┐ │ +│ Access │ │ 42 req/s p95 820 ms │ │ ▁▂▄▅▇█▇▅▄▂▁ │ │ +│ │ │ ▁▂▃▅▇▇▅▃▂▁▁▂▄▆ │ │ avg 64% │ │ +│ Logs │ └─────────────────────────────┘ └───────────────────┘ │ +│ Metrics │ ┌─ Deployments ────────────────────────────────────┐ │ +│ Settings │ │ llama-3.1-8b running 2/2 nvidia-0,1 │ │ +│ │ │ qwen2.5-7b running 1/1 nvidia-2 │ │ +│ node: ctrl │ │ bge-m3 pending 0/1 (no capacity) │ │ +│ v1.1.0 │ └──────────────────────────────────────────────────┘ │ +└────────────┴──────────────────────────────────────────────────────────────┘ +``` + +The footer of the rail surfaces the local node's identity (`nodeName`, +`role`), the daemon version, and a small link to the API base URL — +equivalent to how dcrouter surfaces its runtime identity in the sidebar. + +## 📄 Per-View Sketches + +### Overview ▸ Stats (landing page) + +A dashboard of the things that an on-call operator wants to see in under +two seconds: + +- **Fleet health band**: green/yellow/red status tiles for nodes, GPUs, + deployments, API. +- **Live traffic**: requests/sec, p50/p95/p99 latency, error rate. Sparkline + for the last 15 minutes, streaming from `/metrics` and a server-pushed + channel. +- **GPU utilization strip**: one micro-sparkline per GPU, colored by VRAM + pressure. +- **Deployment summary**: the `modelgrid ps` output, but clickable. Each + row deep-links into Deployments ▸ Active. +- **Catalog drift**: a small callout when `list.modelgrid.com` has newer + model entries than the node's cached catalog. 
### Overview ▸ Configuration

A read-only rendering of the resolved `/etc/modelgrid/config.json` with
section headers (`api`, `docker`, `gpus`, `models`, `cluster`). Operators
can copy the JSON; editing config is intentionally kept to the Settings view
(or the CLI) to avoid a "two sources of truth" problem.

### Cluster ▸ Nodes

Mirrors `modelgrid cluster nodes`. Each row: node name, role badge
(`standalone` / `control-plane` / `worker`), advertised URL, last heartbeat,
GPU inventory summary, status (`active` / `cordoned` / `draining`).

Row actions: `cordon`, `drain`, `activate` — the same verbs as the CLI.
Hitting an action fires the corresponding control-plane call and shows an
in-row toast on success.

```
┌ Nodes ───────────────────────────────────────────────────────────────────┐
│ Name        Role           Advertised URL              Heartbeat         │
│ ──────────────────────────────────────────────────────────────────────── │
│ control-a   control-plane  http://ctrl.internal:8080   2s ago   ●        │
│ worker-a    worker         http://wa.internal:8080     3s ago   ●        │
│ worker-b    worker         http://wb.internal:8080     41s ago  ◐        │
│             [cordon] [drain]                                             │
└──────────────────────────────────────────────────────────────────────────┘
```

### Cluster ▸ Placements

A live map of where every deployed model is currently running, read from
the control-plane's placement state. Grouped by model, with a column per
node. Cells show replica count and health. This is where the operator
answers "where did `llama-3.1-8b` actually end up?".

### Cluster ▸ Desired

The companion to Placements: the desired-state table. Each row is a model
with a target replica count. Rows can be added (`cluster ensure`), edited
(`cluster scale`), or removed (`cluster clear`). The reconciler's pending
work is surfaced as a diff badge: e.g. `+1 replica`, `moving from worker-b
→ worker-a`.
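The replica-count half of that diff badge can be sketched as a pure function over desired vs. placed state — row shapes and field names below are assumptions for this concept, and the "moving" badge is omitted for brevity:

```typescript
// Illustrative sketch: compute the reconciler's "diff badge" for a row in
// Cluster ▸ Desired by comparing the target replica count against what
// Cluster ▸ Placements currently reports. Shapes are assumptions.
interface DesiredRow {
  model: string;
  replicas: number; // target replica count
}

interface PlacementSummary {
  model: string;
  running: number; // healthy replicas across the cluster
}

function diffBadge(desired: DesiredRow, placed?: PlacementSummary): string {
  const running = placed?.running ?? 0;
  const delta = desired.replicas - running;
  if (delta === 0) return ""; // reconciled — no badge shown
  const n = Math.abs(delta);
  const unit = n === 1 ? "replica" : "replicas";
  return `${delta > 0 ? "+" : "-"}${n} ${unit}`;
}
```

Because the badge is derived, not stored, it stays correct as placement events stream in; the UI never has to track reconciler progress separately.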
+ +### GPUs ▸ Devices + +Mirrors `modelgrid gpu list` / `gpu status`, rendered as a card per GPU: +vendor, model, VRAM free/total, driver version, temperature, current +utilization, and which deployment is using it. Cards stream their +utilization via the realtime channel; no full page reloads. + +### GPUs ▸ Drivers + +Status per vendor (NVIDIA / AMD / Intel): driver installed? version? any +known issue? Includes a button to run `modelgrid gpu install` +interactively — but since the install flow is privileged and interactive, +the UI only kicks off the CLI walk-through in a terminal session rather +than trying to reimplement it in the browser. A small "copy the command" +affordance makes this explicit. + +### Deployments ▸ Active + +The core operational table. One row per active vLLM deployment: + +- container ID, display name, model, GPU bindings, port, uptime, request + rate, error rate +- status pill (`running`, `pending`, `restarting`, `failed`) +- row actions: `logs`, `stop`, `restart`, `remove` + +Clicking a row opens a detail drawer with sub-tabs: + +- **Summary** — the effective container config and the scheduling + decision that landed it on this node +- **Logs** — a live tail (SSE) +- **Metrics** — request latency histogram, token throughput, VRAM + occupancy +- **Events** — a timeline of lifecycle events (scheduled, pulled image, + started, health check, restart, stopped) + +### Deployments ▸ History + +Deployments that have been stopped or removed, with the reason and the +last-known logs. Useful for post-mortem on a failed deploy. + +### Models ▸ Catalog + +The current catalog resolved from `list.modelgrid.com`, with a "refresh" +action that calls `modelgrid model refresh`. Each entry shows canonical +ID, aliases, capabilities (chat / completions / embeddings), minimum +VRAM, default GPU count, and a `Deploy` button. Deploying opens a small +form that mirrors `modelgrid run`: target node (or auto), desired replica +count, optional env overrides (e.g. 
`HF_TOKEN`). + +A visible "source" badge marks whether the entry came from the public +catalog or a custom `registryUrl`, so operators can tell at a glance which +models the cluster will actually trust for auto-deploy. + +### Models ▸ Deployed + +Shows the union of what is running across the cluster, with replica +counts, keyed by canonical model ID. This is the view a developer asks +the operator for when they want to know "what models can I hit on this +endpoint?". It is effectively a pretty rendering of `/v1/models`. + +### Access ▸ API Keys + +Mirrors `modelgrid config apikey list`. Columns: label, prefix (first +8 chars), created, last used, status. Actions: `generate`, `revoke`. +Generating a key shows the secret once in a modal with a copy button, +then never shows it again — the same contract as dcrouter's API tokens. + +### Access ▸ Clients + +Placeholder for per-consumer rate limits, quotas, and request labels. +This view is explicitly future work; it renders as "not yet configured" +until the daemon exposes client records. Listing it now reserves the IA +slot so it doesn't have to be retrofitted later. + +### Logs + +A unified tail across daemon, scheduler, and deployments, with filters +by source (`daemon`, `scheduler`, `deployment:`), level, and +free-text. Streamed via SSE. A "pause" toggle freezes the view for +reading; a "download" action exports the current buffer as NDJSON. + +### Metrics + +The `/metrics` endpoint rendered as a small set of charts (request rate, +latency, error rate, VRAM occupancy, model throughput). This is +deliberately lightweight — serious monitoring is expected to come from +Prometheus scraping `/metrics` into Grafana, and the UI says so with a +link to the recommended dashboard snippet. 
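The Logs view's source / level / free-text filters compose naturally as a single per-line predicate over the streamed records. A minimal sketch, assuming a `LogLine` record shape like the one below (an illustration for this concept, not a committed wire format):

```typescript
// Illustrative sketch of the Logs view filter: every streamed line is
// matched against the selected source, minimum level, and free-text query.
// The LogLine and LogFilter shapes are assumptions for this concept.
interface LogLine {
  source: string; // e.g. "daemon", "scheduler", or a "deployment:"-prefixed id
  level: "debug" | "info" | "warn" | "error";
  message: string;
}

interface LogFilter {
  source?: string;             // exact source, or a prefix like "deployment:"
  minLevel?: LogLine["level"]; // hide anything below this severity
  query?: string;              // case-insensitive substring match
}

const LEVEL_ORDER: Record<LogLine["level"], number> = {
  debug: 0,
  info: 1,
  warn: 2,
  error: 3,
};

function matchesFilter(line: LogLine, f: LogFilter): boolean {
  if (f.source !== undefined && !line.source.startsWith(f.source)) return false;
  if (f.minLevel !== undefined && LEVEL_ORDER[line.level] < LEVEL_ORDER[f.minLevel]) return false;
  if (f.query !== undefined && !line.message.toLowerCase().includes(f.query.toLowerCase())) return false;
  return true;
}
```

Filtering client-side over the SSE buffer keeps "pause" and "download as NDJSON" trivial: both operate on the same in-memory array of `LogLine`s the predicate runs over.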
+ +### Settings + +Editable configuration, grouped to match the config file: + +- **API** — port, bind host, CORS, rate limit +- **Docker** — runtime, network name, socket path +- **GPUs** — auto-detect toggle, per-GPU assignments +- **Models** — registry URL, auto-deploy, default engine, auto-load list +- **Cluster** — role, advertise URL, control-plane URL, shared secret, + heartbeat interval, seeds + +Edits write through the daemon's config API (to be defined) and reload +without a restart wherever possible. Settings that require a restart are +marked with a `restart required` badge, and the UI surfaces a single +"restart daemon" action at the top of the view when any are pending. + +## 🛤️ Key User Journeys + +### Deploy a model from the catalog + +1. Operator opens **Models ▸ Catalog**, filters for chat-capable models + with VRAM ≤ 24 GB. +2. Clicks `Deploy` on `meta-llama/Llama-3.1-8B-Instruct`. +3. Dialog appears with target node (`auto` / specific worker), replica + count (default from catalog), optional env (`HF_TOKEN`). +4. On submit, the UI calls the control plane (`cluster ensure` + `scale` + under the hood). The dialog closes and the new row appears in + **Deployments ▸ Active** in `pending` state. +5. SSE updates walk the row through `pulling image → starting → running`. +6. A toast links to the deployment detail drawer for logs. + +### Add a worker node to an existing control plane + +1. Operator opens **Cluster ▸ Nodes** on the control plane. +2. Clicks `Add node`, which opens a helper that pre-fills the worker's + expected `cluster` config block — role, control-plane URL, shared + secret — and exposes a one-liner install command. +3. The operator runs the install command on the worker host. The UI does + **not** SSH into anything; it just hands out the exact snippet. +4. Once the worker's daemon starts and registers, the new node appears + in the Nodes table with its first heartbeat. The helper closes + automatically. + +### Rotate an API key + +1. 
**Access ▸ API Keys** → `Generate`. +2. Name the key, pick a scope (today: single scope; later: per-model). +3. The secret is shown once in a modal; copy-to-clipboard and a clear + "you will not see this again" note. +4. Old key row gets a `revoke` action. Revoke is a confirm-then-apply + flow because it will break live traffic. + +### Investigate a failing deployment + +1. **Overview ▸ Stats** shows a red tile: `1 deployment failed`. +2. Click drills into **Deployments ▸ Active**, filtered to `failed`. +3. Open the row drawer → **Events** tab to see the lifecycle timeline. +4. Jump to **Logs** tab for the live tail. If the deployment is down, + fall back to the last 500 lines from its event buffer. +5. From the drawer, `restart` retries the deployment; if it fails again, + the `Summary` tab shows the scheduling decision so the operator can + see whether VRAM, GPU pinning, or image pull is the root cause. + +## 📡 Realtime, Auth, and API Contract + +- **Realtime updates.** Metrics, logs, GPU utilization, heartbeats, and + deployment state changes stream over Server-Sent Events. A single + `/v1/_ui/events?topics=...` endpoint is preferred over per-feature + sockets so the browser holds exactly one connection. WebSocket is + reserved for bidirectional features (e.g. an interactive install + walkthrough) that we do not need in v1. +- **Auth model.** The UI runs behind the same daemon process as the + OpenAI-compatible API, on a dedicated `uiPort` (default `8081`) to + keep the data-plane clean. Login uses a session cookie; the first-boot + bootstrap seeds an `admin` user with a one-time password printed to + `journalctl -u modelgrid`, the same way dcrouter prints its initial + `admin`/`admin`. SSO/OIDC is a later add-on. +- **API contract.** Every UI action maps to an HTTP endpoint on the + daemon (`/v1/_ui/...`). 
The UI must not talk to any private internals + directly; this keeps `@modelgrid.com/modelgrid-apiclient` (a future + sibling to `@serve.zone/dcrouter-apiclient`) able to do everything the + UI can do, from scripts. +- **Origin badges.** Similar to dcrouter's `config` / `email` / `dns` / + `api` route-origin model, ModelGrid should tag each deployment with + its origin: `config` (seeded via `containers` in config.json), + `catalog` (auto-deployed from `models.autoLoad`), `api` (created via + UI/API). Origin determines what the UI allows: `config`-origin + deployments are toggle-only, `api`-origin deployments are full CRUD. + +## 🧱 Implementation Notes (non-binding) + +- **Web component stack.** Match the dcrouter OpsServer approach: + component-per-view under `ts_web/elements//`, a tiny + SmartRouter-style client router (`ts_web/router.ts`), and a single + `appstate.ts` as the store. +- **Packaging.** Follow dcrouter's module split: `@modelgrid.com/modelgrid` + ships the daemon and the UI bundle; a future + `@modelgrid.com/modelgrid-web` can carve out the web boundary if the + bundle grows large. +- **Dark theme default** (black background, high-contrast foreground) to + match dcrouter and the expected server-ops environment. Light theme + is a later toggle. +- **No server-side rendering.** The UI is a static SPA served by the + daemon; all data is fetched through the API. This keeps the runtime + surface small and makes the UI-less `curl` story identical to the UI + story. + +## ❓ Open Questions + +- **Edit config from the UI or keep it CLI/file-first?** Current lean: + UI is authoritative only for API keys, deployments, and cluster + actions. Config editing is exposed but optional, with CLI still the + canonical path for reproducible installs. +- **Do we expose a model prompt playground?** Nice to have for smoke + tests, but it blurs the operator/consumer line. Defer to v2. 
+- **Cluster-wide vs per-node view.** On a worker node, should the UI + show only local state, or proxy the control plane's cluster view? The + current lean: workers show local-only, and link to the control plane + for cluster views. This avoids split-brain confusion. +- **Access control granularity.** API keys today are coarse (all or + nothing). A future model might scope keys per deployment or per + model. Reserve the column in the Access ▸ API Keys table now. + +## 🛑 Out of Scope (for this concept) + +- End-user chat or prompt UIs for the OpenAI-compatible API. +- Billing, quotas, or usage-based pricing dashboards. +- Multi-tenant isolation beyond per-API-key separation. +- Anything specific to non-vLLM runtimes — the UI assumes the v1.1.0 + reorientation around vLLM as the only first-class runtime.