🖥️ ModelGrid — UI Concept

A browser-based operations console for ModelGrid, served by the same daemon that already exposes the OpenAI-compatible API.

This document sketches the user interface that will sit on top of the ModelGrid daemon: what it shows, how it is organized, how an operator moves through it, and how it stays in sync with a running node or a small cluster. It is a concept, not a final spec — the goal is to lock the shape of the product before any frontend code is written.

The structural idioms (tabbed top-level views, route-origin awareness, embedded ops dashboard on a dedicated port, API-first with a thin UI on top) are adapted from @serve.zone/dcrouter's Ops dashboard. ModelGrid's UI should feel familiar to anyone who has operated dcrouter, while staying grounded in ModelGrid's own domain: GPUs, vLLM deployments, a public model catalog, and a cluster of gateway-capable nodes.

🎯 Purpose & Audience

  • Primary user: the operator of one or a few ModelGrid nodes. Often the same person who provisioned the GPU host and ran modelgrid service enable.
  • Secondary user: a platform engineer wiring ModelGrid into an internal AI platform who needs to manage API keys, audit deployments, and watch request traffic.
  • Not an end-user chat UI. Consumers of the OpenAI-compatible API keep using their own SDKs and tools. The browser UI is for operating the fleet, not for prompting models.

The UI should collapse gracefully from a full cluster view down to a single-node, standalone deployment, because both shapes are first-class in ModelGrid's cluster.role model (standalone / control-plane / worker).

🧭 Top-Level Information Architecture

URLs follow /{view} for flat views and /{view}/{subview} for tabbed views, matching dcrouter's routing idiom.

/overview
  /stats
  /configuration

/cluster
  /nodes
  /placements
  /desired

/gpus
  /devices
  /drivers

/deployments
  /active
  /history

/models
  /catalog
  /deployed

/access
  /apikeys
  /clients

/logs                  (flat)
/metrics               (flat)
/settings              (flat)
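The route shape above could be encoded as a small table for the client-side router. A minimal sketch — the function and constant names are illustrative, not a committed API:

```typescript
// Hypothetical route table for the /{view} and /{view}/{subview} scheme.
type Route = { view: string; subview?: string };

const TABBED_VIEWS: Record<string, string[]> = {
  overview: ["stats", "configuration"],
  cluster: ["nodes", "placements", "desired"],
  gpus: ["devices", "drivers"],
  deployments: ["active", "history"],
  models: ["catalog", "deployed"],
  access: ["apikeys", "clients"],
};
const FLAT_VIEWS = ["logs", "metrics", "settings"];

// Resolve a pathname to a route, defaulting tabbed views to their first tab.
function resolveRoute(pathname: string): Route | null {
  const [view, subview] = pathname.replace(/^\//, "").split("/");
  if (FLAT_VIEWS.includes(view)) return { view };
  const tabs = TABBED_VIEWS[view];
  if (!tabs) return null;
  if (subview === undefined) return { view, subview: tabs[0] };
  return tabs.includes(subview) ? { view, subview } : null;
}
```

Defaulting a bare `/overview` to its first tab keeps deep links short while preserving the tab strip's selected state.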

Rationale for the split:

  • Overview is the landing page — one screen that answers "is the fleet healthy right now?"
  • Cluster / GPUs / Deployments / Models are the four nouns an operator actually reasons about when running ModelGrid. Keeping them at the top level matches the CLI verbs (modelgrid cluster, modelgrid gpu, modelgrid container, modelgrid model) so muscle memory transfers.
  • Access consolidates the authn/authz surface (API keys today, user/OIDC later) into one place, the way dcrouter groups apitokens and users under access.
  • Logs and Metrics are flat because they are cross-cutting streams, not noun-scoped tabs.

The navigation chrome itself is a persistent left rail on desktop, collapsing into a top hamburger on narrow viewports. The selected view is indicated there; subviews surface as a tab strip at the top of the content area.

┌────────────┬──────────────────────────────────────────────────────────────┐
│  ModelGrid │  Overview ▸ Stats  Configuration                             │
│            ├──────────────────────────────────────────────────────────────┤
│  Overview ●│                                                              │
│  Cluster   │   ┌─ Fleet Health ─────────────────────────────────────┐     │
│  GPUs      │   │  2 nodes  •  3 GPUs  •  4 deployments  •  api OK   │     │
│  Deploys   │   └────────────────────────────────────────────────────┘     │
│  Models    │   ┌─ Live Traffic ──────────────┐ ┌─ GPU Utilization ─┐      │
│  Access    │   │  42 req/s   p95 820 ms      │ │  ▁▂▄▅▇█▇▅▄▂▁      │      │
│            │   │  ▁▂▃▅▇▇▅▃▂▁▁▂▄▆             │ │  avg 64%          │      │
│  Logs      │   └─────────────────────────────┘ └───────────────────┘      │
│  Metrics   │   ┌─ Deployments ────────────────────────────────────┐       │
│  Settings  │   │  llama-3.1-8b      running    2/2  nvidia-0,1    │       │
│            │   │  qwen2.5-7b        running    1/1  nvidia-2      │       │
│ node: ctrl │   │  bge-m3            pending    0/1  (no capacity) │       │
│ v1.1.0     │   └──────────────────────────────────────────────────┘       │
└────────────┴──────────────────────────────────────────────────────────────┘
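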

The footer of the rail surfaces the local node's identity (nodeName, role), the daemon version, and a small link to the API base URL — equivalent to how dcrouter surfaces its runtime identity in the sidebar.

📄 Per-View Sketches

Overview ▸ Stats (landing page)

A dashboard of the things that an on-call operator wants to see in under two seconds:

  • Fleet health band: green/yellow/red status tiles for nodes, GPUs, deployments, API.
  • Live traffic: requests/sec, p50/p95/p99 latency, error rate. Sparkline for the last 15 minutes, streaming from /metrics and a server-pushed channel.
  • GPU utilization strip: one micro-sparkline per GPU, colored by VRAM pressure.
  • Deployment summary: the modelgrid ps output, but clickable. Each row deep-links into Deployments ▸ Active.
  • Catalog drift: a small callout when list.modelgrid.com has newer model entries than the node's cached catalog.

Overview ▸ Configuration

A read-only rendering of the resolved /etc/modelgrid/config.json with section headers (api, docker, gpus, models, cluster). Operators can copy the JSON; editing config is intentionally kept to the Settings view (or the CLI) to avoid a "two sources of truth" problem.
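A sketch of the shape this view assumes, matching the section headers listed above. The individual field names are illustrative assumptions, not the daemon's actual schema:

```typescript
// Hypothetical shape of the resolved /etc/modelgrid/config.json. Top-level
// keys mirror the section headers the view renders; leaf fields are examples.
interface ModelGridConfig {
  api: { port: number; bindHost: string };
  docker: { network: string; socketPath: string };
  gpus: { autoDetect: boolean };
  models: { registryUrl: string; autoDeploy: boolean; autoLoad: string[] };
  cluster: { role: "standalone" | "control-plane" | "worker"; advertiseUrl?: string };
}

// The view renders one collapsible section per top-level key.
function sectionHeaders(config: ModelGridConfig): string[] {
  return Object.keys(config);
}
```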

Cluster ▸ Nodes

Mirrors modelgrid cluster nodes. Each row: node name, role badge (standalone / control-plane / worker), advertised URL, last heartbeat, GPU inventory summary, status (active / cordoned / draining).

Row actions: cordon, drain, activate — the same verbs as the CLI. Hitting an action fires the corresponding control-plane call and shows an in-row toast on success.

┌ Nodes ───────────────────────────────────────────────────────────────────┐
│  Name          Role            Advertised URL              Heartbeat     │
│  ──────────────────────────────────────────────────────────────────────  │
│  control-a     control-plane   http://ctrl.internal:8080   2s ago    ●   │
│  worker-a      worker          http://wa.internal:8080     3s ago    ●   │
│  worker-b      worker          http://wb.internal:8080     41s ago   ◐   │
│                                                          [cordon] [drain]│
└──────────────────────────────────────────────────────────────────────────┘

Cluster ▸ Placements

A live map of where every deployed model is currently running, read from the control-plane's placement state. Grouped by model, with a column per node. Cells show replica count and health. This is where the operator answers "where did llama-3.1-8b actually end up?".

Cluster ▸ Desired

The companion to Placements: the desired-state table. Each row is a model with a target replica count. Rows can be added (cluster ensure), edited (cluster scale), or removed (cluster clear). The reconciler's pending work is surfaced as a diff badge: e.g. +1 replica, moving from worker-b → worker-a.
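The diff badge falls out of comparing the two tables. A minimal sketch, assuming desired and placed state both reduce to per-model replica counts (the real reconciler also tracks node assignment):

```typescript
// Sketch: derive pending-work badges from desired vs placed replica counts.
type Desired = Record<string, number>; // model → target replicas
type Placed = Record<string, number>;  // model → currently running replicas

function pendingWork(desired: Desired, placed: Placed): string[] {
  const badges: string[] = [];
  for (const [model, want] of Object.entries(desired)) {
    const have = placed[model] ?? 0;
    if (want > have) badges.push(`${model}: +${want - have} replica(s)`);
    if (want < have) badges.push(`${model}: -${have - want} replica(s)`);
  }
  return badges;
}
```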

GPUs ▸ Devices

Mirrors modelgrid gpu list / gpu status, rendered as a card per GPU: vendor, model, VRAM free/total, driver version, temperature, current utilization, and which deployment is using it. Cards stream their utilization via the realtime channel; no full page reloads.

GPUs ▸ Drivers

Status per vendor (NVIDIA / AMD / Intel): driver installed? version? any known issue? Includes a button to run modelgrid gpu install interactively — but since the install flow is privileged and interactive, the UI only kicks off the CLI walk-through in a terminal session rather than trying to reimplement it in the browser. A small "copy the command" affordance makes this explicit.

Deployments ▸ Active

The core operational table. One row per active vLLM deployment:

  • container ID, display name, model, GPU bindings, port, uptime, request rate, error rate
  • status pill (running, pending, restarting, failed)
  • row actions: logs, stop, restart, remove

Clicking a row opens a detail drawer with sub-tabs:

  • Summary — the effective container config and the scheduling decision that landed it on this node
  • Logs — a live tail (SSE)
  • Metrics — request latency histogram, token throughput, VRAM occupancy
  • Events — a timeline of lifecycle events (scheduled, pulled image, started, health check, restart, stopped)

Deployments ▸ History

Deployments that have been stopped or removed, with the reason and the last-known logs. Useful for post-mortem on a failed deploy.

Models ▸ Catalog

The current catalog resolved from list.modelgrid.com, with a "refresh" action that calls modelgrid model refresh. Each entry shows canonical ID, aliases, capabilities (chat / completions / embeddings), minimum VRAM, default GPU count, and a Deploy button. Deploying opens a small form that mirrors modelgrid run: target node (or auto), desired replica count, optional env overrides (e.g. HF_TOKEN).

A visible "source" badge marks whether the entry came from the public catalog or a custom registryUrl, so operators can tell at a glance which models the cluster will actually trust for auto-deploy.

Models ▸ Deployed

Shows the union of what is running across the cluster, keyed by canonical model ID, with replica counts. This is the view an operator points developers at when they ask "what models can I hit on this endpoint?". It is effectively a pretty rendering of /v1/models.

Access ▸ API Keys

Mirrors modelgrid config apikey list. Columns: label, prefix (first 8 chars), created, last used, status. Actions: generate, revoke. Generating a key shows the secret once in a modal with a copy button, then never shows it again — the same contract as dcrouter's API tokens.

Access ▸ Clients

Placeholder for per-consumer rate limits, quotas, and request labels. This view is explicitly future work; it renders as "not yet configured" until the daemon exposes client records. Listing it now reserves the IA slot so it doesn't have to be retrofitted later.

Logs

A unified tail across daemon, scheduler, and deployments, with filters by source (daemon, scheduler, deployment:<id>), level, and free-text. Streamed via SSE. A "pause" toggle freezes the view for reading; a "download" action exports the current buffer as NDJSON.
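The filter and export behavior could look like this. The log record shape (source/level/msg) is an assumption about the daemon's log format, not a defined contract:

```typescript
// Sketch of the tail's filter predicate and the NDJSON "download" export.
interface LogRecord { source: string; level: string; msg: string }
interface LogFilter { source?: string; level?: string; text?: string }

function matches(rec: LogRecord, f: LogFilter): boolean {
  if (f.source && rec.source !== f.source) return false;
  if (f.level && rec.level !== f.level) return false;
  if (f.text && !rec.msg.toLowerCase().includes(f.text.toLowerCase())) return false;
  return true;
}

// "Download" exports the current buffer verbatim, one JSON object per line.
function toNdjson(buffer: LogRecord[]): string {
  return buffer.map((r) => JSON.stringify(r)).join("\n");
}
```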

Metrics

The /metrics endpoint rendered as a small set of charts (request rate, latency, error rate, VRAM occupancy, model throughput). This is deliberately lightweight — serious monitoring is expected to come from Prometheus scraping /metrics into Grafana, and the UI says so with a link to the recommended dashboard snippet.

Settings

Editable configuration, grouped to match the config file:

  • API — port, bind host, CORS, rate limit
  • Docker — runtime, network name, socket path
  • GPUs — auto-detect toggle, per-GPU assignments
  • Models — registry URL, auto-deploy, default engine, auto-load list
  • Cluster — role, advertise URL, control-plane URL, shared secret, heartbeat interval, seeds

Edits write through the daemon's config API (to be defined) and reload without a restart wherever possible. Settings that require a restart are marked with a restart required badge, and the UI surfaces a single "restart daemon" action at the top of the view when any are pending.

🛤️ Key User Journeys

Deploy a model from the catalog

  1. Operator opens Models ▸ Catalog, filters for chat-capable models with VRAM ≤ 24 GB.
  2. Clicks Deploy on meta-llama/Llama-3.1-8B-Instruct.
  3. Dialog appears with target node (auto / specific worker), replica count (default from catalog), optional env (HF_TOKEN).
  4. On submit, the UI calls the control plane (cluster ensure + scale under the hood). The dialog closes and the new row appears in Deployments ▸ Active in pending state.
  5. SSE updates walk the row through pulling image → starting → running.
  6. A toast links to the deployment detail drawer for logs.

Add a worker node to an existing control plane

  1. Operator opens Cluster ▸ Nodes on the control plane.
  2. Clicks Add node, which opens a helper that pre-fills the worker's expected cluster config block — role, control-plane URL, shared secret — and exposes a one-liner install command.
  3. The operator runs the install command on the worker host. The UI does not SSH into anything; it just hands out the exact snippet.
  4. Once the worker's daemon starts and registers, the new node appears in the Nodes table with its first heartbeat. The helper closes automatically.

Rotate an API key

  1. Access ▸ API Keys → Generate.
  2. Name the key, pick a scope (today: single scope; later: per-model).
  3. The secret is shown once in a modal; copy-to-clipboard and a clear "you will not see this again" note.
  4. Old key row gets a revoke action. Revoke is a confirm-then-apply flow because it will break live traffic.

Investigate a failing deployment

  1. Overview ▸ Stats shows a red tile: 1 deployment failed.
  2. Click drills into Deployments ▸ Active, filtered to failed.
  3. Open the row drawer → Events tab to see the lifecycle timeline.
  4. Jump to Logs tab for the live tail. If the deployment is down, fall back to the last 500 lines from its event buffer.
  5. From the drawer, restart retries the deployment; if it fails again, the Summary tab shows the scheduling decision so the operator can see whether VRAM, GPU pinning, or image pull is the root cause.

📡 Realtime, Auth, and API Contract

  • Realtime updates. Metrics, logs, GPU utilization, heartbeats, and deployment state changes stream over Server-Sent Events. A single /v1/_ui/events?topics=... endpoint is preferred over per-feature sockets so the browser holds exactly one connection. WebSocket is reserved for bidirectional features (e.g. an interactive install walkthrough) that we do not need in v1.
  • Auth model. The UI runs behind the same daemon process as the OpenAI-compatible API, on a dedicated uiPort (default 8081) to keep the data-plane clean. Login uses a session cookie; the first-boot bootstrap seeds an admin user with a one-time password printed to journalctl -u modelgrid, the same way dcrouter prints its initial admin/admin. SSO/OIDC is a later add-on.
  • API contract. Every UI action maps to an HTTP endpoint on the daemon (/v1/_ui/...). The UI must not talk to any private internals directly; this keeps @modelgrid.com/modelgrid-apiclient (a future sibling to @serve.zone/dcrouter-apiclient) able to do everything the UI can do, from scripts.
  • Origin badges. Similar to dcrouter's config / email / dns / api route-origin model, ModelGrid should tag each deployment with its origin: config (seeded via containers in config.json), catalog (auto-deployed from models.autoLoad), api (created via UI/API). Origin determines what the UI allows: config-origin deployments are toggle-only, api-origin deployments are full CRUD.
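The single-connection contract from the realtime bullet could be sketched as one URL carrying every subscribed topic, with a dispatcher fanning events out to per-feature handlers. Topic names and the event envelope are assumptions, not a fixed wire format:

```typescript
// One SSE connection, many topics: build the URL and fan out events.
type Handler = (payload: unknown) => void;

function eventsUrl(base: string, topics: string[]): string {
  return `${base}/v1/_ui/events?topics=${encodeURIComponent(topics.join(","))}`;
}

function makeDispatcher(handlers: Record<string, Handler>) {
  // In the browser this would be wired to EventSource.onmessage; here it is
  // a plain function over the raw event data for clarity.
  return (raw: string) => {
    const { topic, payload } = JSON.parse(raw) as { topic: string; payload: unknown };
    handlers[topic]?.(payload);
  };
}
```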

🧱 Implementation Notes (non-binding)

  • Web component stack. Match the dcrouter OpsServer approach: component-per-view under ts_web/elements/<area>/, a tiny SmartRouter-style client router (ts_web/router.ts), and a single appstate.ts as the store.
  • Bundled into the binary via ts_bundled/bundle.ts. ModelGrid is a Deno project that ships as a deno compile single binary, so the UI follows the @stack.gallery/registry pattern: a build step bundles the ts_web/ sources (HTML, JS, CSS, fonts, icons) into a single generated ts_bundled/bundle.ts module that exports a { path → bytes | string } map. The daemon dynamically imports that module at startup and hands the map to typedserver, which serves it on the UI port. Result: no external asset directory, no runtime filesystem dependency, one binary still ships the entire console.
  • Dev vs prod asset source. In deno task dev, typedserver is pointed at ts_web/ on disk so UI edits are hot-reloadable without re-running the bundler. In deno task compile / prod, the bundler regenerates ts_bundled/bundle.ts first and the compiled binary serves exclusively from the embedded map. A single flag (UI_ASSET_SOURCE=disk|bundle, default bundle) picks the strategy at runtime.
  • Bundler placement. Mirrors @stack.gallery/registry: keep the bundler in scripts/bundle-ui.ts, invoke it from a deno task bundle:ui that the compile:all task depends on, and .gitignore the generated ts_bundled/bundle.ts so it is only produced during release builds (or regenerated on demand for local prod testing).
  • Packaging. Follow dcrouter's module split: @modelgrid.com/modelgrid ships the daemon and the embedded UI bundle; a future @modelgrid.com/modelgrid-web can carve out the web sources as their own publishable boundary if the bundle grows large or the UI needs to be consumed independently.
  • Dark theme default (black background, high-contrast foreground) to match dcrouter and the expected server-ops environment. Light theme is a later toggle.
  • No server-side rendering. The UI is a static SPA; typedserver returns the asset map's index.html for the app shell and the rest of the state comes from the API. This keeps the runtime surface small and makes the UI-less curl story identical to the UI story.
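The dev-vs-prod asset switch and the SPA fallback described above can be sketched like this. The `UI_ASSET_SOURCE` variable comes from this document; the map shape mimics the described ts_bundled/bundle.ts export, and everything else is illustrative:

```typescript
// Path → bytes/string map, as exported by the generated ts_bundled/bundle.ts.
type AssetMap = Record<string, Uint8Array | string>;

// Pick the asset strategy at runtime; default is the embedded bundle.
function pickAssetSource(env: Record<string, string | undefined>): "disk" | "bundle" {
  const v = env["UI_ASSET_SOURCE"];
  if (v === "disk" || v === "bundle") return v;
  return "bundle";
}

// Lookup with the SPA fallback: unknown paths get index.html so the
// client-side router can take over.
function resolveAsset(map: AssetMap, path: string): Uint8Array | string | undefined {
  return map[path] ?? map["/index.html"];
}
```

In dev the same lookup would run against `ts_web/` on disk; only the backing store changes, not the serving logic.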

Open Questions

  • Edit config from the UI or keep it CLI/file-first? Current lean: UI is authoritative only for API keys, deployments, and cluster actions. Config editing is exposed but optional, with CLI still the canonical path for reproducible installs.
  • Do we expose a model prompt playground? Nice to have for smoke tests, but it blurs the operator/consumer line. Defer to v2.
  • Cluster-wide vs per-node view. On a worker node, should the UI show only local state, or proxy the control plane's cluster view? The current lean: workers show local-only, and link to the control plane for cluster views. This avoids split-brain confusion.
  • Access control granularity. API keys today are coarse (all or nothing). A future model might scope keys per deployment or per model. Reserve the column in the Access ▸ API Keys table now.

🛑 Out of Scope (for this concept)

  • End-user chat or prompt UIs for the OpenAI-compatible API.
  • Billing, quotas, or usage-based pricing dashboards.
  • Multi-tenant isolation beyond per-API-key separation.
  • Anything specific to non-vLLM runtimes — the UI assumes the v1.1.0 reorientation around vLLM as the only first-class runtime.