🎛️ Programmable visual KVM automation for browser-based KVM devices.
`@push.rocks/smartkvm` turns a remote machine that only exposes a visual KVM UI into a clean TypeScript control surface: capture frames in, send keyboard events out, and optionally run terminal commands through OCR-readable command wrappers.
Its core interfaces are transport-focused and model-agnostic, while the package also ships a ready-to-use Mistral OCR adapter via `@push.rocks/smartai/ocr` for teams that want a working OCR path out of the box.
## Issue Reporting and Security
For reporting bugs, issues, or security vulnerabilities, please visit [community.foss.global/](https://community.foss.global/). This is the central community hub for all issue reporting. Developers who sign and comply with our contribution agreement and go through identification can also get a [code.foss.global/](https://code.foss.global/) account to submit Pull Requests directly.
## What It Does
Many machines behind visual KVMs do not expose SSH, RDP, WinRM, a PTY, or any native management plane. The only reliable automation channel is often this:
```txt
video feed in
keyboard events out
```
`smartkvm` makes that channel programmable.
The current driver opens the KVM web UI with Puppeteer, focuses the viewer, captures frames from a `video`, `canvas`, or viewer wrapper, and sends keyboard input through Chromium. The public interfaces are generic, so future drivers can target JetKVM native WebRTC/data channels, PiKVM-style APIs, TinyPilot, GL.iNet Comet, or HDMI capture plus USB HID without changing consumer code.
## Highlights
- 🚀 `SmartBrowserKvm` for Puppeteer-powered browser KVM automation.
- 📸 Frame capture as base64 PNG from `video`, `canvas`, or element screenshots.
- ⌨️ Keyboard transport with text typing, individual keys, and shortcuts.
- 🧠 Pluggable OCR interface plus a Mistral OCR adapter powered by `@push.rocks/smartai/ocr`.
- 🧪 Terminal command wrappers with start/end markers and exit code parsing.
- 🧰 Minimal SmartAgent-compatible tools without importing `@push.rocks/smartagent`.
- 🔌 Generic `IKvmDriver` abstraction ready for non-Puppeteer drivers later.
```txt
SmartBrowserKvm
  opens the KVM web UI, captures frames, sends keyboard events
SmartKvmTerminal
  uses any IKvmDriver plus any IOcrEngine to type wrapped commands and parse OCR text
createMistralKvmOcrEngine()
  provides a ready IOcrEngine backed by Mistral Document AI OCR
createSmartKvmTools()
  exposes terminal actions as small tool objects for agent frameworks
```
The package deliberately does not implement mouse automation APIs, native JetKVM/WebRTC, native PiKVM APIs, terminal region auto-detection, or keyboard layout detection. The core still accepts any `IOcrEngine`; the Mistral adapter is just the first bundled OCR implementation.
## SmartBrowserKvm
`SmartBrowserKvm` implements `IKvmDriver` with Puppeteer.
```typescript
import { SmartBrowserKvm } from '@push.rocks/smartkvm';

// Configure the Puppeteer-backed driver for a browser-based KVM.
const kvm = new SmartBrowserKvm({
  url: 'https://kvm.local',
  kind: 'jetkvm',
  username: 'admin',
  password: 'admin',
  headless: false,
  viewerSelector: 'video, canvas',
  captureSelector: 'video, canvas',
  ignoreHttpsErrors: true,
  userDataDir: '.nogit/kvm-profile',
  executablePath: '/usr/bin/chromium',
  timeoutMs: 30_000,
});
```
### Browser Options
- `url`: the KVM web UI URL.
- `kind`: optional device hint, one of `jetkvm`, `glinet`, `pikvm`, `tinypilot`, or `generic`.
- `username` and `password`: optional credentials for generic login forms.
- `headless`: Puppeteer headless mode, defaults to `true`.
- `viewerSelector`: element that receives keyboard focus, defaults to `video, canvas`.
- `captureSelector`: element used for capture, defaults to `viewerSelector` and then `video, canvas`.
- `ignoreHttpsErrors`: accepts self-signed KVM certificates through Puppeteer `acceptInsecureCerts`.
- `userDataDir`: persists cookies and login state.
- `executablePath`: uses a specific Chromium or Chrome binary.
- `timeoutMs`: initial page load and viewer detection timeout, defaults to `30000`.
### Connection Flow
`connect()` launches Chromium, opens `url`, attempts a generic login when credentials are supplied, waits for the viewer to become ready, and clicks the viewer once to focus it.
Generic login looks for common username and password fields, including `input[name="username"]`, `input[autocomplete="username"]`, `input[type="email"]`, `input[type="text"]`, `input[name="password"]`, and `input[type="password"]`. If matching fields are not found, login is skipped silently.
Viewer readiness accepts direct `video`, direct `canvas`, a wrapper containing either, or a generic visible wrapper with a non-zero bounding box. Video readiness requires `videoWidth` and `videoHeight`; canvas readiness requires `width` and `height`.
### Frame Capture
`captureFrame()` returns an `IKvmFrame`:
```typescript
interface IKvmFrame {
  timestamp: number;
  width: number;
  height: number;
  mimeType: 'image/png' | 'image/jpeg';
  dataBase64: string;
}
```
Capture strategy:
- Draw a selected `video` to an internal canvas and return PNG base64.
- Draw a selected `canvas` to another canvas and return PNG base64.
- If the selected element is a wrapper, capture its inner `video` or `canvas`.
- If no media element is present, fall back to a Puppeteer element screenshot.
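A captured frame can be turned back into raw image bytes, for example to save a debugging snapshot. The following is a minimal sketch assuming Node.js; `frameToBuffer` and `frameFileName` are hypothetical helper names that are not part of the package, and the local `IKvmFrame` copy only mirrors the interface shown above.

```typescript
// Local copy of the documented frame shape so the sketch is self-contained.
interface IKvmFrame {
  timestamp: number;
  width: number;
  height: number;
  mimeType: 'image/png' | 'image/jpeg';
  dataBase64: string;
}

// Hypothetical helper: decode the base64 payload into raw image bytes.
function frameToBuffer(frame: IKvmFrame): Buffer {
  return Buffer.from(frame.dataBase64, 'base64');
}

// Hypothetical helper: build a file name whose extension matches the MIME type.
function frameFileName(frame: IKvmFrame): string {
  const extension = frame.mimeType === 'image/png' ? 'png' : 'jpg';
  return `kvm-frame-${frame.timestamp}.${extension}`;
}
```

The resulting buffer can be written with `fs.writeFile` or handed to any image library that accepts raw PNG or JPEG bytes.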
### Keyboard Control
```typescript
await kvm.typeText('whoami', { delayMs: 10 });
await kvm.pressKey('Enter');
await kvm.pressShortcut(['Control', 'Alt', 'T']);
```
Every keyboard method focuses the viewer first. `pressShortcut()` presses keys in order and releases them in reverse order.
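The press and release ordering of a shortcut can be illustrated with a small recorder. This is only a sketch of the ordering rule: `IKeySink` and `pressShortcutSketch` are hypothetical names, and the real driver sends the events through Chromium rather than into an array.

```typescript
// Hypothetical low-level keyboard shape used only for this illustration.
interface IKeySink {
  down(key: string): void;
  up(key: string): void;
}

// Press every key in the given order, then release them in reverse order.
function pressShortcutSketch(sink: IKeySink, keys: string[]): void {
  for (const key of keys) {
    sink.down(key);
  }
  for (const key of [...keys].reverse()) {
    sink.up(key);
  }
}

// Record the event sequence produced for Control + Alt + T.
const recordedEvents: string[] = [];
pressShortcutSketch(
  {
    down: (key) => recordedEvents.push(`down:${key}`),
    up: (key) => recordedEvents.push(`up:${key}`),
  },
  ['Control', 'Alt', 'T'],
);
// recordedEvents: down:Control, down:Alt, down:T, up:T, up:Alt, up:Control
```

Releasing in reverse order keeps the modifiers held until the final key has been released, which is how a person typically performs the same shortcut.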
There is intentionally no public mouse automation API in v1. The only mouse action is the internal click used to focus the viewer.
## SmartKvmTerminal
`SmartKvmTerminal` uses any `IKvmDriver` and any `IOcrEngine` to type wrapped terminal commands and parse OCR text into command results.
```typescript
const terminal = new SmartKvmTerminal({
  kvm,
  ocrEngine,
  osHint: 'windows',
  shellHint: 'powershell',
  commandTimeoutMs: 45_000,
  ocrPollIntervalMs: 500,
  ocrMaxAttempts: 120,
  ocrCrop: {
    x: 0,
    y: 120,
    width: 1280,
    height: 600,
  },
});
```
### Bootstrap Shortcuts
`bootstrap()` can open a terminal using keyboard-only defaults:
- `windows`: `Meta + R`, types `powershell -NoLogo`, then presses `Enter`.
- `macos`: `Meta + Space`, types `Terminal`, then presses `Enter`.
- `linux`: `Control + Alt + T`.
- `unknown`: does nothing.
Browser-based KVMs can intercept or remap shortcuts. `bootstrap()` intentionally implements only the generic path.
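The defaults above can also be read as plain data. The sketch below models them as a list of steps per OS hint; the type names `TOsHintSketch` and `TBootstrapStep` and the function `bootstrapSteps` are hypothetical, and only the key sequences themselves come from the list above.

```typescript
// Hypothetical type names; only the key sequences are taken from the documentation.
type TOsHintSketch = 'windows' | 'macos' | 'linux' | 'unknown';

type TBootstrapStep =
  | { kind: 'shortcut'; keys: string[] }
  | { kind: 'type'; text: string }
  | { kind: 'key'; key: string };

// Map an OS hint to the keyboard-only steps that open a terminal.
function bootstrapSteps(osHint: TOsHintSketch): TBootstrapStep[] {
  switch (osHint) {
    case 'windows':
      return [
        { kind: 'shortcut', keys: ['Meta', 'R'] },
        { kind: 'type', text: 'powershell -NoLogo' },
        { kind: 'key', key: 'Enter' },
      ];
    case 'macos':
      return [
        { kind: 'shortcut', keys: ['Meta', 'Space'] },
        { kind: 'type', text: 'Terminal' },
        { kind: 'key', key: 'Enter' },
      ];
    case 'linux':
      return [{ kind: 'shortcut', keys: ['Control', 'Alt', 'T'] }];
    case 'unknown':
      return [];
  }
}
```

A table like this makes it easy to substitute a device-specific sequence when a particular KVM intercepts one of the shortcuts.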
### Running Commands
`runCommand(command)` creates a wrapped command, types it, presses `Enter`, then polls OCR until the end marker appears, the timeout is reached, or `ocrMaxAttempts` is reached.
```typescript
const result = await terminal.runCommand('uname -a');
console.log(result.commandId);
console.log(result.completed);
console.log(result.timedOut);
console.log(result.exitCode);
console.log(result.combinedText);
console.log(result.rawOcrText);
```
Command timeouts do not throw. They return:
```typescript
{
  completed: false,
  timedOut: true,
  combinedText: rawOcrText,
  rawOcrText,
}
```
Infrastructure failures still throw, for example an unconnected KVM, a missing viewer selector, a missing capture selector, or a media element without a frame.
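The polling behaviour of `runCommand()` can be pictured roughly as the loop below. This is a sketch under stated assumptions, not the package's implementation: `pollForEndMarker`, `IPollResultSketch`, and the `readText` callback are hypothetical, and the sketch reports both an elapsed deadline and an exhausted attempt budget as `timedOut`, which the real code may distinguish.

```typescript
// Hypothetical result subset mirroring the fields shown above.
interface IPollResultSketch {
  completed: boolean;
  timedOut: boolean;
  attempts: number;
  rawOcrText: string;
}

// Poll a text source until `endMarker` is visible, the deadline passes,
// or the attempt budget is used up. Timeouts resolve instead of throwing.
async function pollForEndMarker(
  readText: () => Promise<string>,
  endMarker: string,
  options: { timeoutMs: number; pollIntervalMs: number; maxAttempts: number },
): Promise<IPollResultSketch> {
  const deadline = Date.now() + options.timeoutMs;
  let rawOcrText = '';
  let attempts = 0;
  while (attempts < options.maxAttempts && Date.now() < deadline) {
    attempts += 1;
    // Errors from the text source (infrastructure failures) propagate to the caller.
    rawOcrText = await readText();
    if (rawOcrText.includes(endMarker)) {
      return { completed: true, timedOut: false, attempts, rawOcrText };
    }
    await new Promise<void>((resolve) => setTimeout(resolve, options.pollIntervalMs));
  }
  return { completed: false, timedOut: true, attempts, rawOcrText };
}
```

Returning a result object on timeout lets callers inspect the last OCR text and decide whether to retry, while genuine infrastructure errors still surface as exceptions.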
### Observing Text
`observeText()` captures one frame and sends it to the configured OCR engine:
```typescript
const visibleText = await terminal.observeText();
```
OCR is called with `{ language: 'eng', crop: options.ocrCrop }`.
## Mistral OCR Adapter
`createMistralKvmOcrEngine()` returns an `IOcrEngine` backed by `@push.rocks/smartai/ocr`, which uses Mistral Document AI OCR with `mistral-ocr-latest` by default.
- `apiKey`: Mistral API key, required unless `transport` is supplied.
- `model`: OCR model, defaults to `mistral-ocr-latest`.
- `endpointUrl`: override the Mistral OCR endpoint.
- `confidenceScoresGranularity`: `'page'` | `'word'`, defaults to `'page'` in the KVM adapter.
- `tableFormat`: `'markdown'` | `'html'`.
- `extractHeader` and `extractFooter`: pass-through Mistral OCR flags.
- `transport`: test/custom transport hook inherited from `@push.rocks/smartai/ocr`.
Current limitation: `IOcrRecognizeOptions.crop` is rejected by this adapter because the Mistral endpoint receives the full KVM frame. Use a separate image-cropping OCR engine if you need terminal-region cropping today.
## Command Wrappers
The wrapper utilities make terminal output parseable through OCR by bracketing each command with start and end markers and reporting its exit code.
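The marker idea can be sketched as follows. This is a minimal illustration assuming a POSIX shell and an alphanumeric command ID; the marker strings and the helpers `wrapPosixCommand` and `parseWrappedOutput` are hypothetical and do not reflect the package's actual wrapper format.

```typescript
// Hypothetical marker format; the package's real wrapper format may differ.
interface IWrappedCommandSketch {
  commandId: string;
  startMarker: string;
  endMarkerPrefix: string;
  typedCommand: string;
}

// Wrap a POSIX shell command so its output is bracketed by markers.
function wrapPosixCommand(command: string, commandId: string): IWrappedCommandSketch {
  const startMarker = `KVMSTART-${commandId}`;
  const endMarkerPrefix = `KVMEND-${commandId}-EXIT-`;
  // `$?` expands to the wrapped command's exit code before the final echo runs.
  const typedCommand = `echo ${startMarker}; ${command}; echo ${endMarkerPrefix}$?`;
  return { commandId, startMarker, endMarkerPrefix, typedCommand };
}

interface IParsedOutputSketch {
  completed: boolean;
  exitCode?: number;
  combinedText: string;
}

// Parse OCR text: the command counts as completed only once the end marker
// followed by a numeric exit code is visible.
function parseWrappedOutput(
  ocrText: string,
  wrapped: IWrappedCommandSketch,
): IParsedOutputSketch {
  const endMatch = new RegExp(`${wrapped.endMarkerPrefix}(\\d+)`).exec(ocrText);
  if (!endMatch) {
    return { completed: false, combinedText: ocrText };
  }
  // The typed command line also contains the start marker, so use the last
  // occurrence before the end marker: that one was printed by `echo`.
  const beforeEnd = ocrText.slice(0, endMatch.index);
  const startIndex = beforeEnd.lastIndexOf(wrapped.startMarker);
  const combinedText =
    startIndex === -1
      ? beforeEnd.trim()
      : beforeEnd.slice(startIndex + wrapped.startMarker.length).trim();
  return { completed: true, exitCode: Number(endMatch[1]), combinedText };
}
```

Requiring digits after the end marker matters because the typed command itself is visible on screen and contains the marker prefix followed by the literal `$?`.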
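## SmartAgent-Compatible Tools
`createSmartKvmTools()` exposes terminal actions as small tool objects for agent frameworks:
- `kvm_terminal_run_command`: accepts `{ command: string }` and returns `IKvmTerminalCommandResult`.
- `kvm_terminal_observe`: accepts `{}` and returns OCR text from the current frame.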
## Public API
Root exports:
```typescript
export * from './smartkvm.interfaces.js';
export * from './smartkvm.classes.browserkvm.js';
export * from './smartkvm.classes.kvmterminal.js';
export * from './smartkvm.commandwrappers.js';
export * from './smartkvm.tools.smartagent.js';
export * from './smartkvm.ocr.smartai.js';
```
Important types:
- `IKvmDriver`: generic transport interface for connect, disconnect, focus, capture, typing, key presses, shortcuts, and wait.
- `IKvmFrame`: timestamped base64 frame payload.
- `IOcrEngine`: pluggable OCR contract.
- `IKvmTerminalOptions`: terminal transport, OCR, OS hint, shell hint, timeout, polling, attempts, and crop options.
- `IKvmTerminalCommandResult`: parsed command result with markers, completion state, timeout state, exit code, combined output, and raw OCR text.
- `IWrappedKvmCommand`: generated command wrapper metadata and typed command string.
- `ISmartKvmTool`: minimal tool shape for agent integrations.
- `ISmartKvmMistralOcrEngineOptions`: options for the bundled Mistral OCR adapter.
## Driver Scope
The public abstraction is intentionally broader than Puppeteer. `SmartBrowserKvm` is the first concrete driver, but consumers should depend on `IKvmDriver` when possible.
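Depending on the interface rather than the concrete class also makes consumer code testable without a KVM. The sketch below uses a deliberately reduced, hypothetical interface (`IKeyboardDriverSketch`) with the method names from the keyboard examples above; the promise-returning signatures are an assumption, and the real `IKvmDriver` has more members.

```typescript
// Hypothetical subset of the driver contract; the real IKvmDriver is larger.
interface IKeyboardDriverSketch {
  typeText(text: string, options?: { delayMs?: number }): Promise<void>;
  pressKey(key: string): Promise<void>;
}

// Consumer code depends only on the interface, not on a concrete driver.
async function submitLine(driver: IKeyboardDriverSketch, line: string): Promise<void> {
  await driver.typeText(line);
  await driver.pressKey('Enter');
}

// In-memory fake driver that records calls instead of talking to a KVM.
class RecordingDriver implements IKeyboardDriverSketch {
  public calls: string[] = [];

  async typeText(text: string): Promise<void> {
    this.calls.push(`type:${text}`);
  }

  async pressKey(key: string): Promise<void> {
    this.calls.push(`key:${key}`);
  }
}
```

Any object with the same methods can be passed to `submitLine`, so a different driver could be swapped in without touching the consumer.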
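Future drivers can implement the same interface for JetKVM native WebRTC/data channels, PiKVM-style APIs, TinyPilot, GL.iNet Comet, or HDMI capture plus USB HID.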
The package uses ESM, TypeScript, Puppeteer, and the push.rocks test/build stack.
## License and Legal Information
This repository contains open-source code licensed under the MIT License. A copy of the license can be found in the [license](./license) file.
**Please note:** The MIT License does not grant permission to use the trade names, trademarks, service marks, or product names of the project, except as required for reasonable and customary use in describing the origin of the work and reproducing the content of the NOTICE file.
### Trademarks
This project is owned and maintained by Task Venture Capital GmbH. The names and logos associated with Task Venture Capital GmbH and any related products or services are trademarks of Task Venture Capital GmbH or third parties, and are not included within the scope of the MIT license granted herein.
Use of these trademarks must comply with Task Venture Capital GmbH's Trademark Guidelines or the guidelines of the respective third-party owners, and any usage must be approved in writing. Third-party trademarks used herein are the property of their respective owners and used only in a descriptive manner, e.g. for an implementation of an API or similar.
### Company Information
Task Venture Capital GmbH
Registered at District Court Bremen HRB 35230 HB, Germany
For any legal inquiries or further information, please contact us via email at hello@task.vc.
By using this repository, you acknowledge that you have read this section, agree to comply with its terms, and understand that the licensing of the code does not imply endorsement by Task Venture Capital GmbH of any derivative works.