# 🖥️ ModelGrid — UI Concept
A browser-based operations console for ModelGrid, served by the same daemon that already exposes the OpenAI-compatible API.
This document sketches the user interface that will sit on top of the ModelGrid daemon: what it shows, how it is organized, how an operator moves through it, and how it stays in sync with a running node or a small cluster. It is a concept, not a final spec — the goal is to lock the shape of the product before any frontend code is written.
The structural idioms (tabbed top-level views, route-origin awareness,
embedded ops dashboard on a dedicated port, API-first with a thin UI on top)
are adapted from @serve.zone/dcrouter's Ops dashboard. ModelGrid's UI should
feel familiar to anyone who has operated dcrouter, while staying grounded in
ModelGrid's own domain: GPUs, vLLM deployments, a public model catalog, and a
cluster of gateway-capable nodes.
## 🎯 Purpose & Audience
- Primary user: the operator of one or a few ModelGrid nodes. Often the same person who provisioned the GPU host and ran `modelgrid service enable`.
- Secondary user: a platform engineer wiring ModelGrid into an internal AI platform who needs to manage API keys, audit deployments, and watch request traffic.
- Not an end-user chat UI. Consumers of the OpenAI-compatible API keep using their own SDKs and tools. The browser UI is for operating the fleet, not for prompting models.
The UI should collapse gracefully from a full cluster view down to a single-node, standalone deployment, because both shapes are first-class in ModelGrid's `cluster.role` model (standalone / control-plane / worker).
## 🧭 Top-Level Information Architecture
URLs follow `/{view}` for flat views and `/{view}/{subview}` for tabbed views, matching dcrouter's routing idiom.
```
/overview
  /stats
  /configuration
/cluster
  /nodes
  /placements
  /desired
/gpus
  /devices
  /drivers
/deployments
  /active
  /history
/models
  /catalog
  /deployed
/access
  /apikeys
  /clients
/logs      (flat)
/metrics   (flat)
/settings  (flat)
```
Rationale for the split:
- Overview is the landing page — one screen that answers "is the fleet healthy right now?"
- Cluster / GPUs / Deployments / Models are the four nouns an operator actually reasons about when running ModelGrid. Keeping them at the top level matches the CLI verbs (`modelgrid cluster`, `modelgrid gpu`, `modelgrid container`, `modelgrid model`) so muscle memory transfers.
- Access consolidates the authn/authz surface (API keys today, user/OIDC later) into one place, the way dcrouter groups `apitokens` and `users` under `access`.
- Logs and Metrics are flat because they are cross-cutting streams, not noun-scoped tabs.
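The flat-vs-tabbed split can be captured as a small data table the client router walks. A sketch in TypeScript, assuming a hypothetical `resolve` helper and view list (the real `ts_web/router.ts` would wrap the SmartRouter-style router and may look different):

```typescript
// Hypothetical view table mirroring the IA above; the router API itself
// is a sketch, not the actual SmartRouter surface.
type View = {
  path: string;     // top-level path, e.g. "/cluster"
  tabs?: string[];  // subviews rendered as a tab strip; absent = flat view
};

const views: View[] = [
  { path: "/overview",    tabs: ["stats", "configuration"] },
  { path: "/cluster",     tabs: ["nodes", "placements", "desired"] },
  { path: "/gpus",        tabs: ["devices", "drivers"] },
  { path: "/deployments", tabs: ["active", "history"] },
  { path: "/models",      tabs: ["catalog", "deployed"] },
  { path: "/access",      tabs: ["apikeys", "clients"] },
  { path: "/logs" },
  { path: "/metrics" },
  { path: "/settings" },
];

// Resolve /{view}/{subview}: bare tabbed paths fall back to their first
// tab, unknown paths land on the Overview ▸ Stats dashboard.
function resolve(url: string): { view: string; tab?: string } {
  const [, viewSeg, tabSeg] = url.split("/");
  const view = views.find((v) => v.path === `/${viewSeg}`);
  if (!view) return { view: "/overview", tab: "stats" };
  if (!view.tabs) return { view: view.path };
  return { view: view.path, tab: view.tabs.includes(tabSeg) ? tabSeg : view.tabs[0] };
}
```

Keeping the IA as data also gives the left rail and the tab strip one shared source of truth.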
The navigation chrome itself is a persistent left rail on desktop, collapsing into a top hamburger on narrow viewports. The selected view is indicated there; subviews surface as a tab strip at the top of the content area.
```
┌────────────┬──────────────────────────────────────────────────────────────┐
│ ModelGrid  │ Overview ▸ Stats   Configuration                             │
│            ├──────────────────────────────────────────────────────────────┤
│ Overview ● │                                                              │
│ Cluster    │ ┌─ Fleet Health ─────────────────────────────────────┐       │
│ GPUs       │ │ 2 nodes • 3 GPUs • 4 deployments • api OK          │       │
│ Deploys    │ └────────────────────────────────────────────────────┘       │
│ Models     │ ┌─ Live Traffic ──────────────┐  ┌─ GPU Utilization ─┐       │
│ Access     │ │ 42 req/s   p95 820 ms       │  │ ▁▂▄▅▇█▇▅▄▂▁       │       │
│            │ │ ▁▂▃▅▇▇▅▃▂▁▁▂▄▆              │  │ avg 64%           │       │
│ Logs       │ └─────────────────────────────┘  └───────────────────┘       │
│ Metrics    │ ┌─ Deployments ────────────────────────────────────┐         │
│ Settings   │ │ llama-3.1-8b   running   2/2   nvidia-0,1        │         │
│            │ │ qwen2.5-7b     running   1/1   nvidia-2          │         │
│ node: ctrl │ │ bge-m3         pending   0/1   (no capacity)     │         │
│ v1.1.0     │ └──────────────────────────────────────────────────┘         │
└────────────┴──────────────────────────────────────────────────────────────┘
```
The footer of the rail surfaces the local node's identity (`nodeName`, role), the daemon version, and a small link to the API base URL — equivalent to how dcrouter surfaces its runtime identity in the sidebar.
## 📄 Per-View Sketches
### Overview ▸ Stats (landing page)
A dashboard of the things that an on-call operator wants to see in under two seconds:
- Fleet health band: green/yellow/red status tiles for nodes, GPUs, deployments, API.
- Live traffic: requests/sec, p50/p95/p99 latency, error rate. Sparkline for the last 15 minutes, streaming from `/metrics` and a server-pushed channel.
- GPU utilization strip: one micro-sparkline per GPU, colored by VRAM pressure.
- Deployment summary: the `modelgrid ps` output, but clickable. Each row deep-links into Deployments ▸ Active.
- Catalog drift: a small callout when `list.modelgrid.com` has newer model entries than the node's cached catalog.
### Overview ▸ Configuration

A read-only rendering of the resolved `/etc/modelgrid/config.json` with section headers (`api`, `docker`, `gpus`, `models`, `cluster`). Operators can copy the JSON; editing config is intentionally kept to the Settings view (or the CLI) to avoid a "two sources of truth" problem.
### Cluster ▸ Nodes

Mirrors `modelgrid cluster nodes`. Each row: node name, role badge (standalone / control-plane / worker), advertised URL, last heartbeat, GPU inventory summary, status (active / cordoned / draining). Row actions: `cordon`, `drain`, `activate` — the same verbs as the CLI. Hitting an action fires the corresponding control-plane call and shows an in-row toast on success.
```
┌ Nodes ──────────────────────────────────────────────────────────────┐
│ Name        Role            Advertised URL             Heartbeat    │
│ ─────────────────────────────────────────────────────────────────── │
│ control-a   control-plane   http://ctrl.internal:8080  2s ago   ●   │
│ worker-a    worker          http://wa.internal:8080    3s ago   ●   │
│ worker-b    worker          http://wb.internal:8080    41s ago  ◐   │
│             [cordon] [drain]                                        │
└─────────────────────────────────────────────────────────────────────┘
```
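Wired up, each row action reduces to a single control-plane call. A sketch assuming the node verbs live under the `/v1/_ui/` namespace described in the API-contract section; the concrete routes are not defined yet:

```typescript
// The three row verbs from the Nodes table.
type NodeAction = "cordon" | "drain" | "activate";

// Build the request descriptor for a Nodes-row action. The handler would
// fetch() this with the session cookie and raise the in-row toast on 2xx.
// The path shape is an assumption, not the daemon's actual route.
function nodeActionRequest(nodeName: string, action: NodeAction) {
  return {
    method: "POST" as const,
    url: `/v1/_ui/cluster/nodes/${encodeURIComponent(nodeName)}/${action}`,
  };
}
```

Keeping the descriptor pure makes the action trivially scriptable from the future API client as well.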
### Cluster ▸ Placements
A live map of where every deployed model is currently running, read from
the control-plane's placement state. Grouped by model, with a column per
node. Cells show replica count and health. This is where the operator
answers "where did llama-3.1-8b actually end up?".
### Cluster ▸ Desired

The companion to Placements: the desired-state table. Each row is a model with a target replica count. Rows can be added (`cluster ensure`), edited (`cluster scale`), or removed (`cluster clear`). The reconciler's pending work is surfaced as a diff badge: e.g. +1 replica, moving from worker-b → worker-a.
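The diff badge can be a pure function of a desired row and the live placements. Illustrative TypeScript shapes only; the field names are not the daemon's schema:

```typescript
// Hypothetical shapes for the Desired view.
type Desired = { model: string; replicas: number };
type Placement = { model: string; node: string };

// Badge text for one desired row, given the cluster's live placements.
function diffBadge(desired: Desired, placements: Placement[]): string {
  const live = placements.filter((p) => p.model === desired.model).length;
  const delta = desired.replicas - live;
  if (delta === 0) return "in sync";
  const n = Math.abs(delta);
  return `${delta > 0 ? "+" : "-"}${n} ${n === 1 ? "replica" : "replicas"}`;
}
```

The "moving from worker-b → worker-a" half of the badge would come from the reconciler's plan, which this sketch does not model.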
### GPUs ▸ Devices

Mirrors `modelgrid gpu list` / `gpu status`, rendered as a card per GPU: vendor, model, VRAM free/total, driver version, temperature, current utilization, and which deployment is using it. Cards stream their utilization via the realtime channel; no full page reloads.
### GPUs ▸ Drivers

Status per vendor (NVIDIA / AMD / Intel): driver installed? version? any known issue? Includes a button to run `modelgrid gpu install` interactively — but since the install flow is privileged and interactive, the UI only kicks off the CLI walk-through in a terminal session rather than trying to reimplement it in the browser. A small "copy the command" affordance makes this explicit.
### Deployments ▸ Active

The core operational table. One row per active vLLM deployment:
- container ID, display name, model, GPU bindings, port, uptime, request rate, error rate
- status pill (`running`, `pending`, `restarting`, `failed`)
- row actions: `logs`, `stop`, `restart`, `remove`
Clicking a row opens a detail drawer with sub-tabs:
- Summary — the effective container config and the scheduling decision that landed it on this node
- Logs — a live tail (SSE)
- Metrics — request latency histogram, token throughput, VRAM occupancy
- Events — a timeline of lifecycle events (scheduled, pulled image, started, health check, restart, stopped)
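A possible row model for this table, with the status pill as a pure lookup. The field names mirror the columns above but are assumptions, not the daemon's schema:

```typescript
// Status values from the pill list above.
type DeploymentStatus = "running" | "pending" | "restarting" | "failed";

// Illustrative row shape for Deployments ▸ Active.
type DeploymentRow = {
  containerId: string;
  displayName: string;
  model: string;         // canonical model ID
  gpus: string[];        // e.g. ["nvidia-0", "nvidia-1"]
  port: number;
  uptimeSeconds: number;
  requestRate: number;   // req/s
  errorRate: number;     // fraction of requests, 0..1
  status: DeploymentStatus;
};

// Pill color as a pure, exhaustive map of status.
const pillColor: Record<DeploymentStatus, "green" | "yellow" | "red"> = {
  running: "green",
  pending: "yellow",
  restarting: "yellow",
  failed: "red",
};
```

An exhaustive `Record` keeps the compiler honest if a fifth status is ever added.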
### Deployments ▸ History
Deployments that have been stopped or removed, with the reason and the last-known logs. Useful for post-mortem on a failed deploy.
### Models ▸ Catalog

The current catalog resolved from `list.modelgrid.com`, with a "refresh" action that calls `modelgrid model refresh`. Each entry shows canonical ID, aliases, capabilities (chat / completions / embeddings), minimum VRAM, default GPU count, and a Deploy button. Deploying opens a small form that mirrors `modelgrid run`: target node (or auto), desired replica count, optional env overrides (e.g. `HF_TOKEN`).
A visible "source" badge marks whether the entry came from the public catalog or a custom `registryUrl`, so operators can tell at a glance which models the cluster will actually trust for auto-deploy.
### Models ▸ Deployed

Shows the union of what is running across the cluster, with replica counts, keyed by canonical model ID. This is the view a developer asks the operator for when they want to know "what models can I hit on this endpoint?". It is effectively a pretty rendering of `/v1/models`.
### Access ▸ API Keys

Mirrors `modelgrid config apikey list`. Columns: label, prefix (first 8 chars), created, last used, status. Actions: generate, revoke. Generating a key shows the secret once in a modal with a copy button, then never shows it again — the same contract as dcrouter's API tokens.
### Access ▸ Clients
Placeholder for per-consumer rate limits, quotas, and request labels. This view is explicitly future work; it renders as "not yet configured" until the daemon exposes client records. Listing it now reserves the IA slot so it doesn't have to be retrofitted later.
### Logs

A unified tail across daemon, scheduler, and deployments, with filters by source (`daemon`, `scheduler`, `deployment:<id>`), level, and free-text. Streamed via SSE. A "pause" toggle freezes the view for reading; a "download" action exports the current buffer as NDJSON.
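Filter-plus-download is a small pure transform over the client-side buffer. A sketch with assumed record fields (the daemon's real log schema may differ):

```typescript
// Assumed client-side log record; mirrors the filter facets above.
type LogRecord = {
  source: string;   // "daemon" | "scheduler" | "deployment:<id>"
  level: string;    // "debug" | "info" | "warn" | "error"
  message: string;
  ts: string;       // ISO timestamp
};

// Apply the active filters, then serialize the survivors as NDJSON —
// one JSON object per line, ready for the "download" action.
function exportNdjson(
  buffer: LogRecord[],
  filter: { source?: string; level?: string; text?: string },
): string {
  return buffer
    .filter((r) => !filter.source || r.source === filter.source)
    .filter((r) => !filter.level || r.level === filter.level)
    .filter((r) => !filter.text || r.message.includes(filter.text))
    .map((r) => JSON.stringify(r))
    .join("\n");
}
```

Because the export runs over the already-buffered records, pausing the tail and downloading yields exactly what is on screen.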
### Metrics

The `/metrics` endpoint rendered as a small set of charts (request rate, latency, error rate, VRAM occupancy, model throughput). This is deliberately lightweight — serious monitoring is expected to come from Prometheus scraping `/metrics` into Grafana, and the UI says so with a link to the recommended dashboard snippet.
### Settings
Editable configuration, grouped to match the config file:
- API — port, bind host, CORS, rate limit
- Docker — runtime, network name, socket path
- GPUs — auto-detect toggle, per-GPU assignments
- Models — registry URL, auto-deploy, default engine, auto-load list
- Cluster — role, advertise URL, control-plane URL, shared secret, heartbeat interval, seeds
Edits write through the daemon's config API (to be defined) and reload
without a restart wherever possible. Settings that require a restart are
marked with a restart required badge, and the UI surfaces a single
"restart daemon" action at the top of the view when any are pending.
## 🛤️ Key User Journeys
### Deploy a model from the catalog

- Operator opens Models ▸ Catalog, filters for chat-capable models with VRAM ≤ 24 GB.
- Clicks `Deploy` on `meta-llama/Llama-3.1-8B-Instruct`.
- Dialog appears with target node (`auto` / specific worker), replica count (default from catalog), optional env (`HF_TOKEN`).
- On submit, the UI calls the control plane (`cluster ensure` + `scale` under the hood). The dialog closes and the new row appears in Deployments ▸ Active in `pending` state.
- SSE updates walk the row through `pulling image → starting → running`.
- A toast links to the deployment detail drawer for logs.
### Add a worker node to an existing control plane

- Operator opens Cluster ▸ Nodes on the control plane.
- Clicks `Add node`, which opens a helper that pre-fills the worker's expected `cluster` config block — role, control-plane URL, shared secret — and exposes a one-liner install command.
- The operator runs the install command on the worker host. The UI does not SSH into anything; it just hands out the exact snippet.
- Once the worker's daemon starts and registers, the new node appears in the Nodes table with its first heartbeat. The helper closes automatically.
### Rotate an API key

- Access ▸ API Keys → `Generate`.
- Name the key, pick a scope (today: single scope; later: per-model).
- The secret is shown once in a modal; copy-to-clipboard and a clear "you will not see this again" note.
- Old key row gets a `revoke` action. Revoke is a confirm-then-apply flow because it will break live traffic.
### Investigate a failing deployment

- Overview ▸ Stats shows a red tile: `1 deployment failed`.
- Click drills into Deployments ▸ Active, filtered to `failed`.
- Open the row drawer → Events tab to see the lifecycle timeline.
- Jump to Logs tab for the live tail. If the deployment is down, fall back to the last 500 lines from its event buffer.
- From the drawer, `restart` retries the deployment; if it fails again, the `Summary` tab shows the scheduling decision so the operator can see whether VRAM, GPU pinning, or image pull is the root cause.
## 📡 Realtime, Auth, and API Contract

- Realtime updates. Metrics, logs, GPU utilization, heartbeats, and deployment state changes stream over Server-Sent Events. A single `/v1/_ui/events?topics=...` endpoint is preferred over per-feature sockets so the browser holds exactly one connection. WebSocket is reserved for bidirectional features (e.g. an interactive install walkthrough) that we do not need in v1.
- Auth model. The UI runs behind the same daemon process as the OpenAI-compatible API, on a dedicated `uiPort` (default `8081`) to keep the data-plane clean. Login uses a session cookie; the first-boot bootstrap seeds an `admin` user with a one-time password printed to `journalctl -u modelgrid`, the same way dcrouter prints its initial `admin/admin`. SSO/OIDC is a later add-on.
- API contract. Every UI action maps to an HTTP endpoint on the daemon (`/v1/_ui/...`). The UI must not talk to any private internals directly; this keeps `@modelgrid.com/modelgrid-apiclient` (a future sibling to `@serve.zone/dcrouter-apiclient`) able to do everything the UI can do, from scripts.
- Origin badges. Similar to dcrouter's `config` / `email` / `dns` / `api` route-origin model, ModelGrid should tag each deployment with its origin: `config` (seeded via `containers` in config.json), `catalog` (auto-deployed from `models.autoLoad`), `api` (created via UI/API). Origin determines what the UI allows: `config`-origin deployments are toggle-only, `api`-origin deployments are full CRUD.
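The single-connection SSE contract can be sketched as a URL builder plus one `EventSource` in the browser. The topic names and the `{ topic, data }` envelope are assumptions, not a settled wire format:

```typescript
// Illustrative topic names for the realtime channel.
type Topic = "metrics" | "logs" | "gpu" | "heartbeats" | "deployments";

// Build the one SSE URL the browser holds open for the whole session.
function eventsUrl(base: string, topics: Topic[]): string {
  return `${base}/v1/_ui/events?topics=${topics.join(",")}`;
}

// In the browser, the consumer would be roughly:
//   const es = new EventSource(eventsUrl("", ["metrics", "logs"]));
//   es.onmessage = (ev) => {
//     const { topic, data } = JSON.parse(ev.data); // assumed envelope
//     dispatchToView(topic, data);                 // hypothetical fan-out
//   };
```

One connection with server-side topic multiplexing keeps reconnect/backoff logic in a single place instead of one retry loop per feature.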
## 🧱 Implementation Notes (non-binding)

- Web component stack. Match the dcrouter OpsServer approach: component-per-view under `ts_web/elements/<area>/`, a tiny SmartRouter-style client router (`ts_web/router.ts`), and a single `appstate.ts` as the store.
- Bundled into the binary via `ts_bundled/bundle.ts`. ModelGrid is a Deno project that ships as a `deno compile` single binary, so the UI follows the `@stack.gallery/registry` pattern: a build step bundles the `ts_web/` sources (HTML, JS, CSS, fonts, icons) into a single generated `ts_bundled/bundle.ts` module that exports a `{ path → bytes | string }` map. The daemon dynamically imports that module at startup and hands the map to typedserver, which serves it on the UI port. Result: no external asset directory, no runtime filesystem dependency, one binary still ships the entire console.
- Dev vs prod asset source. In `deno task dev`, typedserver is pointed at `ts_web/` on disk so UI edits are hot-reloadable without re-running the bundler. In `deno task compile` / prod, the bundler regenerates `ts_bundled/bundle.ts` first and the compiled binary serves exclusively from the embedded map. A single flag (`UI_ASSET_SOURCE=disk|bundle`, default `bundle`) picks the strategy at runtime.
- Bundler placement. Mirrors `@stack.gallery/registry`: keep the bundler in `scripts/bundle-ui.ts`, invoke it from a `deno task bundle:ui` that the `compile:all` task depends on, and `.gitignore` the generated `ts_bundled/bundle.ts` so it is only produced during release builds (or regenerated on demand for local prod testing).
- Packaging. Follow dcrouter's module split: `@modelgrid.com/modelgrid` ships the daemon and the embedded UI bundle; a future `@modelgrid.com/modelgrid-web` can carve out the web sources as their own publishable boundary if the bundle grows large or the UI needs to be consumed independently.
- Dark theme default (black background, high-contrast foreground) to match dcrouter and the expected server-ops environment. Light theme is a later toggle.
- No server-side rendering. The UI is a static SPA; typedserver returns the asset map's `index.html` for the app shell and the rest of the state comes from the API. This keeps the runtime surface small and makes the UI-less `curl` story identical to the UI story.
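A minimal sketch of the embedded-asset map and the dev-vs-prod switch, assuming the generated module exports an `assets` map and that `UI_ASSET_SOURCE` is read from the environment (names are illustrative; the daemon would hand the resolved map to typedserver):

```typescript
// Shape of the generated ts_bundled/bundle.ts (illustrative):
//   export const assets: AssetMap = {
//     "/index.html": "<!doctype html>...",
//     "/app.js": new Uint8Array([/* bundled bytes */]),
//   };
type AssetMap = Record<string, Uint8Array | string>;

// Pick the asset strategy; default "bundle" so a compiled binary never
// touches the filesystem. `deno task dev` would set UI_ASSET_SOURCE=disk
// and point typedserver at ts_web/ on disk instead.
function assetSource(env: Record<string, string | undefined>): "disk" | "bundle" {
  return env["UI_ASSET_SOURCE"] === "disk" ? "disk" : "bundle";
}

// Embedded-map lookup with SPA fallback: unknown paths get index.html so
// the client router can take over once the app shell loads.
function lookup(assets: AssetMap, path: string): Uint8Array | string | undefined {
  return assets[path] ?? assets["/index.html"];
}
```

In the bundle strategy the daemon would `await import("../ts_bundled/bundle.ts")` at startup and serve exclusively from the returned map, which is what keeps the compiled binary free of any runtime asset directory.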
## ❓ Open Questions
- Edit config from the UI or keep it CLI/file-first? Current lean: UI is authoritative only for API keys, deployments, and cluster actions. Config editing is exposed but optional, with CLI still the canonical path for reproducible installs.
- Do we expose a model prompt playground? Nice to have for smoke tests, but it blurs the operator/consumer line. Defer to v2.
- Cluster-wide vs per-node view. On a worker node, should the UI show only local state, or proxy the control plane's cluster view? The current lean: workers show local-only, and link to the control plane for cluster views. This avoids split-brain confusion.
- Access control granularity. API keys today are coarse (all or nothing). A future model might scope keys per deployment or per model. Reserve the column in the Access ▸ API Keys table now.
## 🛑 Out of Scope (for this concept)
- End-user chat or prompt UIs for the OpenAI-compatible API.
- Billing, quotas, or usage-based pricing dashboards.
- Multi-tenant isolation beyond per-API-key separation.
- Anything specific to non-vLLM runtimes — the UI assumes the v1.1.0 reorientation around vLLM as the only first-class runtime.