2026-05-16 13:41:55 +00:00
# @push.rocks/smartkvm
🎛️ Programmable visual KVM automation for browser-based KVM devices.
`@push.rocks/smartkvm` turns a remote machine that only exposes a visual KVM UI into a clean TypeScript control surface: capture frames in, send keyboard events out, and optionally run terminal commands through OCR-readable command wrappers.
Its core interfaces are transport-focused and model-agnostic, while the package also ships a ready-to-use Mistral OCR adapter via `@push.rocks/smartai/ocr` for teams that want a working OCR path out of the box.
## Issue Reporting and Security
For reporting bugs, issues, or security vulnerabilities, please visit [community.foss.global/](https://community.foss.global/). This is the central community hub for all issue reporting. Developers who sign and comply with our contribution agreement and go through identification can also get a [code.foss.global/](https://code.foss.global/) account to submit Pull Requests directly.
## What It Does
Many machines behind visual KVMs do not expose SSH, RDP, WinRM, a PTY, or any native management plane. The only reliable automation channel is often this:
```txt
video feed in
keyboard events out
```
`smartkvm` makes that channel programmable.
The current driver opens the KVM web UI with Puppeteer, focuses the viewer, captures frames from a `video`, `canvas`, or viewer wrapper, and sends keyboard input through Chromium. The public interfaces are generic, so future drivers can target JetKVM native WebRTC/data channels, PiKVM-style APIs, TinyPilot, GL.iNet Comet, or HDMI capture plus USB HID without changing consumer code.
## Highlights
- 🚀 `SmartBrowserKvm` for Puppeteer-powered browser KVM automation.
- 📸 Frame capture as base64 PNG from `video`, `canvas`, or element screenshots.
- ⌨️ Keyboard transport with text typing, individual keys, and shortcuts.
- 🧠 Pluggable OCR interface plus a Mistral OCR adapter powered by `@push.rocks/smartai/ocr`.
- 🧪 Terminal command wrappers with start/end markers and exit code parsing.
- 🧰 Minimal SmartAgent-compatible tools without importing `@push.rocks/smartagent`.
- 🔌 Generic `IKvmDriver` abstraction ready for non-Puppeteer drivers later.
## Install
```sh
pnpm add @push.rocks/smartkvm
```
## Mental Model
`smartkvm` has four building blocks:
```txt
SmartBrowserKvm
  opens the KVM web UI, captures frames, sends keyboard events
SmartKvmTerminal
  uses any IKvmDriver plus any IOcrEngine to type wrapped commands and parse OCR text
createMistralKvmOcrEngine()
  provides a ready IOcrEngine backed by Mistral Document AI OCR
createSmartKvmTools()
  exposes terminal actions as small tool objects for agent frameworks
```
The package deliberately does not implement mouse automation APIs, native JetKVM/WebRTC, native PiKVM APIs, terminal region auto-detection, or keyboard layout detection. The core still accepts any `IOcrEngine`; the Mistral adapter is just the first bundled OCR implementation.
## Quick Start: Browser KVM Transport
```typescript
import { SmartBrowserKvm } from '@push.rocks/smartkvm';

const kvm = new SmartBrowserKvm({
  url: 'https://jetkvm.local',
  kind: 'jetkvm',
  username: 'admin',
  password: 'admin',
  headless: false,
  ignoreHttpsErrors: true,
});

await kvm.connect();
await kvm.typeText('hello from smartkvm');
await kvm.pressKey('Enter');

const frame = await kvm.captureFrame();
console.log(frame.mimeType, frame.width, frame.height, frame.dataBase64.slice(0, 32));

await kvm.disconnect();
```
## Quick Start: Terminal With Mistral OCR
```typescript
import {
  SmartBrowserKvm,
  SmartKvmTerminal,
  createMistralKvmOcrEngine,
} from '@push.rocks/smartkvm';

const kvm = new SmartBrowserKvm({
  url: 'https://jetkvm.local',
  kind: 'jetkvm',
  headless: false,
  ignoreHttpsErrors: true,
});
await kvm.connect();

const terminal = new SmartKvmTerminal({
  kvm,
  ocrEngine: createMistralKvmOcrEngine({
    apiKey: process.env.MISTRAL_API_KEY,
  }),
  osHint: 'linux',
  shellHint: 'bash',
});

const result = await terminal.runCommand('uname -a');
console.log(result.combinedText);

await kvm.disconnect();
```
## Quick Start: Terminal Commands Through OCR
```typescript
import {
  SmartBrowserKvm,
  SmartKvmTerminal,
  type IOcrEngine,
} from '@push.rocks/smartkvm';

const ocrEngine: IOcrEngine = {
  async recognize(frame, options) {
    // Plug in your OCR engine here, for example Tesseract, a local OCR service,
    // a screenshot OCR pipeline, or any implementation that returns text.
    return {
      text: '',
      confidence: 0,
    };
  },
};

const kvm = new SmartBrowserKvm({
  url: 'https://some-kvm.local',
  kind: 'generic',
  headless: false,
  ignoreHttpsErrors: true,
});
await kvm.connect();

const terminal = new SmartKvmTerminal({
  kvm,
  ocrEngine,
  osHint: 'linux',
  shellHint: 'bash',
  commandTimeoutMs: 30_000,
  ocrPollIntervalMs: 500,
});
await terminal.bootstrap();

const result = await terminal.runCommand('pwd');
if (result.completed) {
  console.log('exit:', result.exitCode);
  console.log(result.combinedText);
} else {
  console.log('command timed out');
  console.log(result.rawOcrText);
}

await kvm.disconnect();
```
## SmartBrowserKvm
`SmartBrowserKvm` implements `IKvmDriver` with Puppeteer.
```typescript
const kvm = new SmartBrowserKvm({
  url: 'https://kvm.local',
  kind: 'jetkvm',
  username: 'admin',
  password: 'admin',
  headless: false,
  viewerSelector: 'video, canvas',
  captureSelector: 'video, canvas',
  ignoreHttpsErrors: true,
  userDataDir: '.nogit/kvm-profile',
  executablePath: '/usr/bin/chromium',
  timeoutMs: 30_000,
});
```
### Browser Options
- `url`: the KVM web UI URL.
- `kind`: optional device hint, one of `jetkvm`, `glinet`, `pikvm`, `tinypilot`, or `generic`.
- `username` and `password`: optional credentials for generic login forms.
- `headless`: Puppeteer headless mode, defaults to `true`.
- `viewerSelector`: element that receives keyboard focus, defaults to `video, canvas`.
- `captureSelector`: element used for capture; falls back to `viewerSelector`, then to `video, canvas`.
- `ignoreHttpsErrors`: accepts self-signed KVM certificates through Puppeteer `acceptInsecureCerts`.
- `userDataDir`: persists cookies and login state.
- `executablePath`: uses a specific Chromium or Chrome binary.
- `timeoutMs`: initial page load and viewer detection timeout, defaults to `30000`.
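The defaults listed above can be pictured as a small resolution step. The following is a minimal sketch that mirrors the documented defaults only; `IBrowserOptionsSketch` and `resolveBrowserOptions` are hypothetical names for illustration and are not part of the package API.

```typescript
// Hypothetical subset of the browser options described above.
interface IBrowserOptionsSketch {
  url: string;
  headless?: boolean;
  viewerSelector?: string;
  captureSelector?: string;
  timeoutMs?: number;
}

type TResolvedOptionsSketch = Required<IBrowserOptionsSketch>;

const DEFAULT_SELECTOR = 'video, canvas';

// Applies the documented defaults. This follows the README text, not the package source.
function resolveBrowserOptions(options: IBrowserOptionsSketch): TResolvedOptionsSketch {
  const viewerSelector = options.viewerSelector ?? DEFAULT_SELECTOR;
  return {
    url: options.url,
    headless: options.headless ?? true,
    viewerSelector,
    // captureSelector falls back to viewerSelector, which itself falls back to 'video, canvas'.
    captureSelector: options.captureSelector ?? viewerSelector,
    timeoutMs: options.timeoutMs ?? 30_000,
  };
}

const resolvedDemo = resolveBrowserOptions({ url: 'https://kvm.local', viewerSelector: '#viewer' });
console.log(resolvedDemo);
```

In this sketch, setting only `viewerSelector` also changes where frames are captured from, which matches the fallback order described for `captureSelector`.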
### Connection Flow
`connect()` launches Chromium, opens `url`, attempts a generic login when credentials are supplied, waits for the viewer to become ready, and clicks the viewer once to focus it.
Generic login looks for common username and password fields, including `input[name="username"]`, `input[autocomplete="username"]`, `input[type="email"]`, `input[type="text"]`, `input[name="password"]`, and `input[type="password"]`. If matching fields are not found, login is skipped silently.
Viewer readiness accepts direct `video`, direct `canvas`, a wrapper containing either, or a generic visible wrapper with a non-zero bounding box. Video readiness requires `videoWidth` and `videoHeight`; canvas readiness requires `width` and `height`.
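As a rough illustration of the readiness rules above, the check can be modeled as a pure function over a plain description of the selected element. `TViewerProbe` and `isViewerReady` are hypothetical names, and treating a wrapper with a media child as ready only once that child is ready is an assumption; the real driver inspects DOM nodes through Puppeteer.

```typescript
// Hypothetical plain-data description of the selected element.
type TViewerProbe =
  | { kind: 'video'; videoWidth: number; videoHeight: number }
  | { kind: 'canvas'; width: number; height: number }
  | { kind: 'wrapper'; child?: TViewerProbe; visible: boolean; boxWidth: number; boxHeight: number };

function isViewerReady(probe: TViewerProbe): boolean {
  switch (probe.kind) {
    case 'video':
      // A video counts as ready once it reports intrinsic dimensions.
      return probe.videoWidth > 0 && probe.videoHeight > 0;
    case 'canvas':
      return probe.width > 0 && probe.height > 0;
    case 'wrapper':
      // Assumption: a wrapper containing media defers to that media element.
      if (probe.child) return isViewerReady(probe.child);
      // Otherwise a visible wrapper with a non-zero bounding box is accepted.
      return probe.visible && probe.boxWidth > 0 && probe.boxHeight > 0;
  }
}

console.log(isViewerReady({ kind: 'video', videoWidth: 1920, videoHeight: 1080 }));
console.log(isViewerReady({ kind: 'wrapper', visible: true, boxWidth: 0, boxHeight: 0 }));
```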
### Frame Capture
`captureFrame()` returns an `IKvmFrame`:
```typescript
interface IKvmFrame {
  timestamp: number;
  width: number;
  height: number;
  mimeType: 'image/png' | 'image/jpeg';
  dataBase64: string;
}
```
Capture strategy:
- Draw a selected `video` to an internal canvas and return PNG base64.
- Draw a selected `canvas` to another canvas and return PNG base64.
- If the selected element is a wrapper, capture its inner `video` or `canvas`.
- If no media element is present, fall back to a Puppeteer element screenshot.
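When consuming frames, it can help to sanity-check the payload before handing it to OCR. The sketch below decodes `dataBase64` and reads the dimensions from the PNG header so they can be compared with `frame.width` and `frame.height`. `IKvmFrameSketch`, `base64ToBytes`, and `readPngSize` are hypothetical helpers written for this example, and the global `atob` is assumed to be available (browsers, Node.js 16 or newer).

```typescript
// Local copy of the documented frame shape, so the sketch does not depend on the package.
interface IKvmFrameSketch {
  timestamp: number;
  width: number;
  height: number;
  mimeType: 'image/png' | 'image/jpeg';
  dataBase64: string;
}

const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Decodes base64 into raw bytes using the global atob.
function base64ToBytes(dataBase64: string): Uint8Array {
  const binary = atob(dataBase64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Reads width and height from the PNG IHDR chunk (big-endian, bytes 16 to 23).
function readPngSize(frame: IKvmFrameSketch): { width: number; height: number } {
  if (frame.mimeType !== 'image/png') throw new Error('not a PNG frame');
  const bytes = base64ToBytes(frame.dataBase64);
  if (bytes.length < 24 || !PNG_SIGNATURE.every((b, i) => bytes[i] === b)) {
    throw new Error('invalid PNG signature');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { width: view.getUint32(16), height: view.getUint32(20) };
}

// Sample frame built by hand around a tiny 1x1 PNG.
const sampleFrame: IKvmFrameSketch = {
  timestamp: Date.now(),
  width: 1,
  height: 1,
  mimeType: 'image/png',
  dataBase64:
    'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg==',
};

const pngSize = readPngSize(sampleFrame);
console.log(pngSize.width === sampleFrame.width, pngSize.height === sampleFrame.height);
```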
### Keyboard Control
```typescript
await kvm.typeText('whoami', { delayMs: 10 });
await kvm.pressKey('Enter');
await kvm.pressShortcut(['Control', 'Alt', 'T']);
```
Every keyboard method focuses the viewer first. `pressShortcut()` presses keys in order and releases them in reverse order.
There is intentionally no public mouse automation API in v1. The only mouse action is the internal click used to focus the viewer.
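The press and release ordering of `pressShortcut()` can be illustrated with a stand-in keyboard. This is only a sketch: `IKeySinkSketch` and `pressShortcutSketch` are hypothetical, and the real driver sends the events through Chromium after focusing the viewer.

```typescript
// Hypothetical low-level keyboard with separate down and up calls.
interface IKeySinkSketch {
  down(key: string): void;
  up(key: string): void;
}

// Presses keys in the given order, then releases them in reverse order.
function pressShortcutSketch(sink: IKeySinkSketch, keys: string[]): void {
  for (const key of keys) sink.down(key);
  for (const key of [...keys].reverse()) sink.up(key);
}

const shortcutEvents: string[] = [];
pressShortcutSketch(
  {
    down: (key) => { shortcutEvents.push(`down:${key}`); },
    up: (key) => { shortcutEvents.push(`up:${key}`); },
  },
  ['Control', 'Alt', 'T'],
);
console.log(shortcutEvents.join(' '));
```

Releasing in reverse order keeps modifier keys held for as long as the final key is down, which is how a human would typically perform the same chord.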
## SmartKvmTerminal
`SmartKvmTerminal` uses any `IKvmDriver` and any `IOcrEngine` to type wrapped terminal commands and parse OCR text into command results.
```typescript
const terminal = new SmartKvmTerminal({
  kvm,
  ocrEngine,
  osHint: 'windows',
  shellHint: 'powershell',
  commandTimeoutMs: 45_000,
  ocrPollIntervalMs: 500,
  ocrMaxAttempts: 120,
  ocrCrop: {
    x: 0,
    y: 120,
    width: 1280,
    height: 600,
  },
});
```
### Bootstrap Shortcuts
`bootstrap()` can open a terminal using keyboard-only defaults:
- `windows`: `Meta + R`, types `powershell -NoLogo`, then presses `Enter`.
- `macos`: `Meta + Space`, types `Terminal`, then presses `Enter`.
- `linux`: `Control + Alt + T`.
- `unknown`: does nothing.
Browser-based KVMs can intercept or remap shortcuts. `bootstrap()` intentionally implements only the generic path.
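The per-OS defaults above can be written down as data. The sketch below models them as a list of keyboard steps; `TBootstrapStep` and `bootstrapPlan` are hypothetical names used only to restate the documented behavior, not the package's internal representation.

```typescript
type TOsHintSketch = 'windows' | 'macos' | 'linux' | 'unknown';

type TBootstrapStep =
  | { action: 'shortcut'; keys: string[] }
  | { action: 'type'; text: string }
  | { action: 'key'; key: string };

// Returns the keyboard-only steps documented for each OS hint.
function bootstrapPlan(osHint: TOsHintSketch): TBootstrapStep[] {
  switch (osHint) {
    case 'windows':
      return [
        { action: 'shortcut', keys: ['Meta', 'R'] },
        { action: 'type', text: 'powershell -NoLogo' },
        { action: 'key', key: 'Enter' },
      ];
    case 'macos':
      return [
        { action: 'shortcut', keys: ['Meta', 'Space'] },
        { action: 'type', text: 'Terminal' },
        { action: 'key', key: 'Enter' },
      ];
    case 'linux':
      return [{ action: 'shortcut', keys: ['Control', 'Alt', 'T'] }];
    case 'unknown':
      return [];
  }
}

console.log(bootstrapPlan('linux'));
```

If a KVM remaps one of these shortcuts, a plan like this can be replayed by hand with `pressShortcut()`, `typeText()`, and `pressKey()` using adjusted keys.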
### Running Commands
`runCommand(command)` creates a wrapped command, types it, presses `Enter`, then polls OCR until the end marker appears, the timeout is reached, or `ocrMaxAttempts` is reached.
```typescript
const result = await terminal.runCommand('uname -a');
console.log(result.commandId);
console.log(result.completed);
console.log(result.timedOut);
console.log(result.exitCode);
console.log(result.combinedText);
console.log(result.rawOcrText);
```
Command timeouts do not throw. They return:
```typescript
{
  completed: false,
  timedOut: true,
  combinedText: rawOcrText,
  rawOcrText,
}
```
Infrastructure failures still throw, for example an unconnected KVM, a missing viewer selector, a missing capture selector, or a media element without a frame.
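The polling behavior described in this section can be sketched as a loop with three exit conditions: end marker seen, timeout reached, or attempt limit reached. The code below is a simplified illustration, not the package implementation. `pollForEndMarker` and its option names are hypothetical, and reporting an exhausted attempt limit as `timedOut: true` is an assumption.

```typescript
interface IPollOptionsSketch {
  observe: () => Promise<string>; // returns the current OCR text
  endMarkerPrefix: string; // for example 'SMARTKVM_END_<commandId>_'
  timeoutMs: number;
  pollIntervalMs: number;
  maxAttempts: number;
  now?: () => number; // injectable clock for tests
  sleep?: (ms: number) => Promise<void>; // injectable delay for tests
}

interface IPollResultSketch {
  completed: boolean;
  timedOut: boolean;
  attempts: number;
  rawOcrText: string;
}

const escapeRegExp = (value: string): string => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

async function pollForEndMarker(options: IPollOptionsSketch): Promise<IPollResultSketch> {
  const now = options.now ?? (() => Date.now());
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  // Require digits after the prefix so a typed, unexpanded marker does not count as completion.
  const endPattern = new RegExp(`${escapeRegExp(options.endMarkerPrefix)}\\d+`);
  const deadline = now() + options.timeoutMs;
  let rawOcrText = '';
  let attempts = 0;
  while (attempts < options.maxAttempts && now() < deadline) {
    attempts++;
    rawOcrText = await options.observe();
    if (endPattern.test(rawOcrText)) {
      return { completed: true, timedOut: false, attempts, rawOcrText };
    }
    await sleep(options.pollIntervalMs);
  }
  // Neither case throws; the caller receives the last OCR text that was seen.
  return { completed: false, timedOut: true, attempts, rawOcrText };
}

// Simulated screens: prompt, running command, finished command.
const pollScreens = ['$ ', 'SMARTKVM_START_abc\nworking', 'SMARTKVM_START_abc\ndone\nSMARTKVM_END_abc_0'];
let pollScreenIndex = 0;
const pollDemo = pollForEndMarker({
  observe: async () => pollScreens[Math.min(pollScreenIndex++, pollScreens.length - 1)] ?? '',
  endMarkerPrefix: 'SMARTKVM_END_abc_',
  timeoutMs: 1_000,
  pollIntervalMs: 0,
  maxAttempts: 10,
  sleep: async () => {},
});
pollDemo.then((result) => console.log(result.completed, result.attempts));
```

Injecting `now` and `sleep` keeps the loop testable without real delays, which is useful when the OCR engine is replaced by a stub.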
### Observing Text
`observeText()` captures one frame and sends it to the configured OCR engine:
```typescript
const visibleText = await terminal.observeText();
```
OCR is called with `{ language: 'eng', crop: options.ocrCrop }`.
## Mistral OCR Adapter
`createMistralKvmOcrEngine()` returns an `IOcrEngine` backed by `@push.rocks/smartai/ocr`, which uses Mistral Document AI OCR with `mistral-ocr-latest` by default.
```typescript
import { createMistralKvmOcrEngine } from '@push.rocks/smartkvm';

const ocrEngine = createMistralKvmOcrEngine({
  apiKey: process.env.MISTRAL_API_KEY,
  confidenceScoresGranularity: 'page',
});
```
Options:
- `apiKey`: Mistral API key, required unless `transport` is supplied.
- `model`: OCR model, defaults to `mistral-ocr-latest`.
- `endpointUrl`: override the Mistral OCR endpoint.
- `confidenceScoresGranularity`: `'page'` | `'word'`, defaults to `'page'` in the KVM adapter.
- `tableFormat`: `'markdown'` | `'html'`.
- `extractHeader` and `extractFooter`: pass-through Mistral OCR flags.
- `transport`: test/custom transport hook inherited from `@push.rocks/smartai/ocr`.
Current limitation: `IOcrRecognizeOptions.crop` is rejected by this adapter because the Mistral endpoint receives the full KVM frame. Use a separate image-cropping OCR engine if you need terminal-region cropping today.
## Command Wrappers
The wrapper utilities make terminal output parseable through OCR by adding simple markers:
```txt
SMARTKVM_START_<commandId>
SMARTKVM_END_<commandId>_<exitCode>
```
```typescript
import {
  createWrappedKvmCommand,
  parseWrappedKvmCommandOutput,
} from '@push.rocks/smartkvm';

const wrapped = createWrappedKvmCommand('echo hello', 'bash');
console.log(wrapped.textToType);

const parsed = parseWrappedKvmCommandOutput({
  commandId: wrapped.commandId,
  startMarker: wrapped.startMarker,
  endMarkerPrefix: wrapped.endMarkerPrefix,
  rawText: `
    prompt
    ${wrapped.startMarker}
    hello
    ${wrapped.endMarkerPrefix}0
    prompt
  `,
});

console.log(parsed.completed); // true
console.log(parsed.exitCode); // 0
console.log(parsed.combinedText); // hello
```
Supported shell hints are `bash`, `zsh`, `sh`, `powershell`, `cmd`, and `unknown`. `unknown` uses the POSIX-style wrapper.
The parser is intentionally simple and deterministic. It tolerates whitespace and line endings, but it does not do fuzzy OCR correction in v1.
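To make the marker idea concrete, here is a standalone sketch of a POSIX-style wrapper and a deterministic parser built on the marker format shown above. It is not the package's implementation: `wrapPosixCommandSketch` and `parseMarkedOutputSketch` are hypothetical, the exact text typed by `createWrappedKvmCommand()` may differ, and command IDs are assumed to be alphanumeric.

```typescript
interface IParsedOutputSketch {
  completed: boolean;
  exitCode?: number;
  combinedText: string;
}

// Builds a POSIX-style command line that prints the markers around the command output.
function wrapPosixCommandSketch(command: string, commandId: string): string {
  return `echo SMARTKVM_START_${commandId}; ${command}; echo SMARTKVM_END_${commandId}_$?`;
}

// Extracts the text between the markers and the exit code from OCR text.
function parseMarkedOutputSketch(rawText: string, commandId: string): IParsedOutputSketch {
  const startMarker = `SMARTKVM_START_${commandId}`;
  // Digits are required after the prefix, so the typed line (ending in "$?") is ignored.
  const endPattern = new RegExp(`SMARTKVM_END_${commandId}_(\\d+)`);
  const endMatch = endPattern.exec(rawText);
  if (!endMatch) return { completed: false, combinedText: rawText };
  // Use the last start marker before the end marker, skipping the echoed command line.
  const startIndex = rawText.lastIndexOf(startMarker, endMatch.index);
  if (startIndex === -1) return { completed: false, combinedText: rawText };
  const body = rawText.slice(startIndex + startMarker.length, endMatch.index);
  return { completed: true, exitCode: Number(endMatch[1]), combinedText: body.trim() };
}

// Simulated OCR text: the typed command line, then the command output.
const rawScreen = [
  '$ ' + wrapPosixCommandSketch('echo hello', 'abc123'),
  'SMARTKVM_START_abc123',
  'hello',
  'SMARTKVM_END_abc123_0',
  '$ ',
].join('\n');

const parsedSketch = parseMarkedOutputSketch(rawScreen, 'abc123');
console.log(parsedSketch);
```

Because the terminal also shows the typed command, the sketch anchors on the end marker first and then searches backwards for the start marker. Like the documented parser, it does no fuzzy matching, so a single misread character in a marker leaves the result incomplete.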
## SmartAgent-Compatible Tools
`createSmartKvmTools()` returns small tool objects without importing `@push.rocks/smartagent`.
```typescript
import { createSmartKvmTools } from '@push.rocks/smartkvm';
const tools = createSmartKvmTools({ terminal });
const runCommandTool = tools.find((tool) => tool.name === 'kvm_terminal_run_command');
const observeTool = tools.find((tool) => tool.name === 'kvm_terminal_observe');
const commandResult = await runCommandTool?.execute({ command: 'hostname' });
const currentText = await observeTool?.execute({});
```
Included tools:
- `kvm_terminal_run_command`: accepts `{ command: string }` and returns `IKvmTerminalCommandResult`.
- `kvm_terminal_observe`: accepts `{}` and returns OCR text from the current frame.
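An agent loop usually dispatches tool calls by name. The following sketch shows one way to do that against a minimal tool shape. `IToolSketch` covers only the `name` and `execute` fields used in the example above and is an assumption about `ISmartKvmTool`; the stand-in tools return canned values instead of touching a KVM.

```typescript
// Hypothetical minimal tool shape with only the fields used in this README.
interface IToolSketch {
  name: string;
  execute(args: Record<string, unknown>): Promise<unknown>;
}

// Looks up a tool by name and runs it, failing loudly for unknown names.
async function callTool(
  tools: IToolSketch[],
  name: string,
  args: Record<string, unknown>,
): Promise<unknown> {
  const tool = tools.find((candidate) => candidate.name === name);
  if (!tool) throw new Error(`unknown tool: ${name}`);
  return tool.execute(args);
}

// Stand-in tools that reuse the two documented tool names.
const fakeTools: IToolSketch[] = [
  {
    name: 'kvm_terminal_run_command',
    execute: async (args) => ({
      completed: true,
      exitCode: 0,
      combinedText: `ran ${String(args['command'])}`,
    }),
  },
  {
    name: 'kvm_terminal_observe',
    execute: async () => 'user@host:~$',
  },
];

const toolDemo = callTool(fakeTools, 'kvm_terminal_observe', {});
toolDemo.then((text) => console.log(text));
```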
## Public API
Root exports:
```typescript
export * from './smartkvm.interfaces.js';
export * from './smartkvm.classes.browserkvm.js';
export * from './smartkvm.classes.kvmterminal.js';
export * from './smartkvm.commandwrappers.js';
export * from './smartkvm.tools.smartagent.js';
export * from './smartkvm.ocr.smartai.js';
```
Important types:
- `IKvmDriver`: generic transport interface for connect, disconnect, focus, capture, typing, key presses, shortcuts, and wait.
- `IKvmFrame`: timestamped base64 frame payload.
- `IOcrEngine`: pluggable OCR contract.
- `IKvmTerminalOptions`: terminal transport, OCR, OS hint, shell hint, timeout, polling, attempts, and crop options.
- `IKvmTerminalCommandResult`: parsed command result with markers, completion state, timeout state, exit code, combined output, and raw OCR text.
- `IWrappedKvmCommand`: generated command wrapper metadata and typed command string.
- `ISmartKvmTool`: minimal tool shape for agent integrations.
- `ISmartKvmMistralOcrEngineOptions`: options for the bundled Mistral OCR adapter.
## Driver Scope
The public abstraction is intentionally broader than Puppeteer. `SmartBrowserKvm` is the first concrete driver, but consumers should depend on `IKvmDriver` when possible.
Future drivers can implement the same interface for:
- JetKVM native WebRTC or data-channel control.
- GL.iNet Comet and PiKVM-compatible APIs.
- TinyPilot.
- Custom HDMI capture plus USB HID control.
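Depending on the interface rather than on `SmartBrowserKvm` also makes consumer code testable without hardware. The sketch below uses a recording stand-in driver. `IKvmDriverSketch` is a hypothetical subset that includes only method names appearing in this README (the real `IKvmDriver` also covers focus and wait), and `RecordingKvmDriver` and `typeLine` are illustrative names.

```typescript
interface IFrameSketch {
  timestamp: number;
  width: number;
  height: number;
  mimeType: 'image/png' | 'image/jpeg';
  dataBase64: string;
}

// Hypothetical subset of the driver contract.
interface IKvmDriverSketch {
  connect(): Promise<void>;
  disconnect(): Promise<void>;
  captureFrame(): Promise<IFrameSketch>;
  typeText(text: string): Promise<void>;
  pressKey(key: string): Promise<void>;
  pressShortcut(keys: string[]): Promise<void>;
}

// In-memory driver that records calls instead of talking to a device.
class RecordingKvmDriver implements IKvmDriverSketch {
  public readonly log: string[] = [];
  private connected = false;

  async connect(): Promise<void> {
    this.connected = true;
    this.log.push('connect');
  }
  async disconnect(): Promise<void> {
    this.connected = false;
    this.log.push('disconnect');
  }
  async captureFrame(): Promise<IFrameSketch> {
    this.assertConnected();
    // Empty placeholder frame; a real driver would return image data.
    return { timestamp: Date.now(), width: 0, height: 0, mimeType: 'image/png', dataBase64: '' };
  }
  async typeText(text: string): Promise<void> {
    this.assertConnected();
    this.log.push(`type:${text}`);
  }
  async pressKey(key: string): Promise<void> {
    this.assertConnected();
    this.log.push(`key:${key}`);
  }
  async pressShortcut(keys: string[]): Promise<void> {
    this.assertConnected();
    this.log.push(`shortcut:${keys.join('+')}`);
  }
  private assertConnected(): void {
    if (!this.connected) throw new Error('KVM is not connected');
  }
}

// Consumer code written against the interface only.
async function typeLine(driver: IKvmDriverSketch, line: string): Promise<void> {
  await driver.typeText(line);
  await driver.pressKey('Enter');
}

const recordingDriver = new RecordingKvmDriver();
const driverDemo = (async () => {
  await recordingDriver.connect();
  await typeLine(recordingDriver, 'whoami');
  await recordingDriver.disconnect();
  return recordingDriver.log;
})();
driverDemo.then((log) => console.log(log.join(' | ')));
```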
## Manual Browser Test
Automated tests do not require real KVM hardware. The browser smoke test only runs when `SMARTKVM_TEST_URL` is set.
```sh
SMARTKVM_TEST_URL=https://your-kvm.local pnpm test
```
Optional variables:
- `SMARTKVM_TEST_USERNAME`
- `SMARTKVM_TEST_PASSWORD`
- `SMARTKVM_TEST_HEADLESS=false`
## Development
```sh
pnpm install
pnpm test
pnpm run build
tsbuild check "test/**/*"
```
The package uses ESM, TypeScript, Puppeteer, and the push.rocks test/build stack.
## License and Legal Information
This repository contains open-source code licensed under the MIT License. A copy of the license can be found in the [license](./license) file.
**Please note:** The MIT License does not grant permission to use the trade names, trademarks, service marks, or product names of the project, except as required for reasonable and customary use in describing the origin of the work and reproducing the content of the NOTICE file.
### Trademarks
This project is owned and maintained by Task Venture Capital GmbH. The names and logos associated with Task Venture Capital GmbH and any related products or services are trademarks of Task Venture Capital GmbH or third parties, and are not included within the scope of the MIT license granted herein.
Use of these trademarks must comply with Task Venture Capital GmbH's Trademark Guidelines or the guidelines of the respective third-party owners, and any usage must be approved in writing. Third-party trademarks used herein are the property of their respective owners and used only in a descriptive manner, e.g. for an implementation of an API or similar.
### Company Information
Task Venture Capital GmbH
Registered at District Court Bremen HRB 35230 HB, Germany
For any legal inquiries or further information, please contact us via email at hello@task.vc.
By using this repository, you acknowledge that you have read this section, agree to comply with its terms, and understand that the licensing of the code does not imply endorsement by Task Venture Capital GmbH of any derivative works.