diff --git a/rfcs/computer-use.md b/rfcs/computer-use.md new file mode 100644 index 000000000000..8b6d925d693f --- /dev/null +++ b/rfcs/computer-use.md @@ -0,0 +1,666 @@ +# RFC: Computer Use for opencode + +- **Status**: Draft +- **Author**: +- **Related issues**: #20490 (closed), #20917, #26772, #30755, #32945 +- **Target release**: opencode 1.19 (Phase 1–3a), 1.20+ (Phase 3b–5) + +## Summary + +Add first-class desktop computer-use tools to opencode: an agent can +take screenshots, locate UI elements, and inject mouse/keyboard input +on the host OS. The agent loop is **see screen → decide → act → +repeat** and works against any GUI application, not just browsers. + +This is the open-source equivalent of Claude Code's computer-use and +OpenAI's CUA, designed to be **provider-agnostic** while also +**passing through native provider tools when available**, and to +**run fully on-device** via UI-TARS-1.5-7B for users who want a +local, private, zero-per-action-cost path. + +## Goals + +- Screenshot + mouse + keyboard control of the host desktop. +- Works against any GUI application (not limited to browsers). +- Provider-agnostic core: any vision-capable model can drive it. +- **Native pass-through** for three targets: + - Anthropic `computer_use` (Claude) + - OpenAI `computer-use-preview` + - UI-TARS-1.5-7B (local, Apache-2.0, beats both closed competitors + on OSWorld: 42.5 vs 36.4 / 28.0) +- Structure-first when available (#32945 direction): accessibility + trees preferred over pixel coordinates, with OCR fallback. +- Auditable: every action logged through the session event stream. +- Permissioned: integrates with the existing `permissions` system. +- Multi-OS: Linux (X11, Wayland, XWayland), macOS, Windows, headless/CI. +- **One-line local setup** for the UI-TARS path (`opencode + computer-use --local`). + +## Non-Goals (v1) + +- **Visual element "click-to-edit" annotation in opencode Desktop** + (Codex's actual UX). That is a separate, browser-scoped feature + (see #30755). This RFC targets the Claude Code / CUA model: + agent-driven, not user-annotated. +- Recording / replay of reusable GUI flows (see #32945) — designed + compatible with that direction but not in v1. +- Mobile / embedded targets. + +## User-facing behavior + +1. Opt-in via `computer_use.enabled: true` in `opencode.json` + (default off — the feature ships **disabled by default** behind + an experimental flag; see §Architecture). +2. Top-level config block, **adjacent to `tools`** in + `opencode.json`: + + ```jsonc + { + "tools": { /* ... */ }, + "computer_use": { /* ... */ } + } + ``` + +3. **Permission model: ask per batch / session, not per action.** + On the first `computer_*` call in a session, the user is prompted + once with the task description and the proposed + `max_steps_per_task` budget (default 50). Approving grants + permission for all subsequent `computer_*` calls in that session + up to the step budget. Per-action confirmation is rejected as + impractical UX. Three granularities: + + - `ask` (default) — prompt once at session start + - `allow` — no prompt, every action allowed up to the step budget + - `deny` — refuse all input tools; screenshot still allowed + +4. New permission rules in `permissions`: + - `computer_screenshot` — default `allow` + - `computer_click`, `computer_type`, `computer_key`, + `computer_scroll`, `computer_cursor_position` — default + gated by the batch grant above +5. Tools exposed in the agent's tool surface (see §Tool surface). +6. **Screenshot transport: base64 PNG image part.** Native format for + UI-TARS, Anthropic, and OpenAI vision APIs. Default + `max_width: 1024` with aspect-preserving downscale — aggressive + enough to fit a typical screen into ~150 KB and stay well under + one context window for many turns. UI-TARS does not need a + coordinate grid overlay (its grounding training handles + coordinates natively); generic VLMs get an optional grid overlay + (`grid_overlay: "auto"` = on for `generic`, off for `ui_tars`). +7. Each action emits a typed `computer_use.action` event on the + session event stream containing the action, parameters, target + window (if detectable), and the resulting screenshot hash. +8. Sub-agents in the read-only `plan` agent scope never see these + tools (per #20483 referenced in #20490). +9. The Desktop app surfaces a single "Allow computer-use for this + session (≤ N steps)?" prompt at session start; no per-action + prompts. + +## Architecture + +``` ++---------------------------------------------------------+ +| Agent / LLM loop | ++----------------------------+----------------------------+ + | tool calls + v ++---------------------------------------------------------+ +| Tool Registry (Core, experimental feature) | +| - computer_screenshot, computer_click, | +| computer_type, computer_key, computer_scroll, ... | +| - gated by computer_use.enabled feature flag | ++----------------------------+----------------------------+ + | + v ++---------------------------------------------------------+ +| ComputerUse service (Core, src/computer-use/) | +| - adapter dispatch (see §Provider targeting) | +| - batch-permission gate (one prompt / session) | +| - max_steps_per_task counter & enforcement | +| - event stream emit (computer_use.action) | +| - screenshot policy (default max_width: 1024, | +| PNG base64, optional grid overlay) | ++----------------------------+----------------------------+ + | + v ++---------------------------------------------------------+ +| Backend abstraction (src/computer-use/backend/) | +| - DisplayBackend (screenshot) | +| - InputBackend (mouse/keyboard) | +| - StructureBackend (a11y tree, OCR — Phase 5) | +| - ClipboardBackend | ++----------------------------+----------------------------+ + | | | | + v v v v + Linux/ Linux/ macOS Windows + X11 Wayland +``` + +### Core-native (experimental) integration + +This feature ships **directly in Core**, not as a plugin. Rationale: + +- Permission gate, event stream, and step-budget enforcement must + hook into Core's existing primitives (`permissions`, session event + stream, tool registry). Plugin-first would force a parallel + permission model. +- The agent loop benefits from Core's provider-aware tool registration + so the four adapter paths (Anthropic / OpenAI / UI-TARS / generic) + can dispatch per-turn without plugin IPC. +- Distribution: opencode's value prop is "works out of the box." A + plugin would force every user to install something extra. + +It ships behind a single **experimental feature flag**, +`computer_use.enabled` (default `false`), and is labelled +`experimental` in `opencode.json` schema validation and in the +release notes. OS-specific backends live in their own modules under +`packages/opencode/src/computer-use/backend/` and are imported lazily +per-OS so the bundle stays small. Third-party backends (e.g., +proprietary VNC, Android-over-ADB) can register through a small +backend plugin interface without modifying Core. + +```jsonc +// opencode.json (top level) +{ + "tools": { /* existing tools config */ }, + "computer_use": { + "enabled": false, // experimental — must be explicitly opted in + // ... + } +} +``` + +## Provider targeting + +The `ComputerUse` service has four adapter implementations. Selection +is **per-turn**, driven by the active provider+model in the Catalog. + +### 1. Anthropic native adapter + +Used when the active model is a Claude variant with the `computer_use` +typed block. Adapter translates between Anthropic's tool schema and +the Core backend. Provider tool results are recorded as model-native +facts per `CONTEXT.md` "provider-executed tool results" rule. + +### 2. OpenAI native adapter + +Used when the active model is `computer-use-preview` (CUA). Same +pass-through pattern as Anthropic. + +### 3. UI-TARS native adapter — recommended local path + +Used when the active model is `ByteDance-Seed/UI-TARS-1.5-7B` (or a +finetune/quantization) served via an OpenAI-compatible local server +(vLLM, SGLang, Ollama, LM Studio). + +Why this is more than a generic local VLM: + +- **UI-TARS is trained for grounded GUI action** (ScreenSpot-Pro 61.6, + vs Claude 3.7's 27.7 — see references). It emits coordinates + directly without prompting tricks. +- **OSWorld 42.5** beats Claude 3.7 (28.0) and OpenAI CUA (36.4) + with 100 steps. +- **Apache-2.0**, runs on a single consumer GPU at Q4 (8–16GB VRAM). +- **Reasoning-then-action**: the model produces a thought before each + action. opencode surfaces this in the same visible-thinking block + used for other models. + +The adapter is a thin translation layer: + +- Translate UI-TARS's native action vocabulary (click(x,y), type, + scroll, key, finished, etc.) into Core tool calls. +- Constrain decoding with JSON mode for deterministic parsing. +- Forward the model's thought text into the assistant message's + thinking field. + +### 4. Generic vision adapter — fallback + +Used for all other vision-capable models (Claude without the native +adapter enabled, GPT-4o, Gemini, Qwen2.5-VL, InternVL, etc.). The +tool surface is the same; the difference is that screenshots are +sent as image parts in the message and the model emits `computer_*` +tool calls via standard function calling. + +### Adapter selection rules + +```ts +function selectAdapter(provider: Provider, model: Model): Adapter { + if (model.id.startsWith("ByteDance-Seed/UI-TARS")) return "ui_tars" + if (provider === "anthropic" && model.features.includes("computer_use")) + return "anthropic_native" + if (provider === "openai" && model.id === "computer-use-preview") + return "openai_native" + if (model.modalities.includes("vision")) return "generic" + throw new ConfigurationError( + `Model ${model.id} has no vision capability; computer use requires it.` + ) +} +``` + +Users can override with `computer_use.adapter: "ui_tars" | "native" +| "generic"` in config. + +## Tool surface + +```ts +// Pseudo-schema; final shapes live in the Tool Registry's zod schemas. + +computer_screenshot({ + region?: { x: number; y: number; w: number; h: number }, + display?: number, + downscale?: { + max_width?: number // default 1024 (aggressive — see §Screenshot transport) + max_height?: number // default unset (aspect-preserving) + grid?: boolean // default: true for generic, false for ui_tars + } +}) -> { image: base64png, width: number, height: number, scale: number } + +computer_click({ x: number, y: number, + button?: "left"|"right"|"middle", + modifiers?: Key[] }) +computer_double_click({ x: number, y: number, ... }) +computer_type({ text: string, delay_ms?: number }) +computer_key({ key: string, modifiers?: Key[] }) +computer_scroll({ x: number, y: number, dx: number, dy: number }) +computer_cursor_position() -> { x: number, y: number } + +// Optional, v1.1 — gated by capability detection: +computer_window_list() -> Window[] +computer_window_focus({ title_contains?: string }) +computer_clipboard_read() -> string +computer_clipboard_write({ text: string }) +``` + +UI-TARS's native vocabulary maps directly to these, so the UI-TARS +adapter is a 1:1 translation. Generic VLMs emit these as standard +function calls. + +## OS backend implementations (v1 scope) + +| OS / env | Screenshot | Mouse/Keyboard | Clipboard | Notes | +|----------------|-----------------------------------------|-------------------------------|----------------------------|-------| +| Linux X11 | `scrot` / `maim` | `xdotool` | `xclip` | Headless: `Xvfb` | +| Linux Wayland | `grim` (wlroots) / XDG Portal (GNOME) / `spectacle` (KDE) | `ydotool`+`ydotoold` | `wl-copy`/`wl-paste` | Requires `/dev/uinput` write access + `input` group membership (see §Failure handling) | +| Linux XWayland | both stacks coexisting | route per-window | both | Detect `DISPLAY`+`WAYLAND_DISPLAY` | +| macOS | `screencapture -x` | `cliclick` | `pbcopy`/`pbpaste` | Requires Accessibility permission | +| Windows | PowerShell `System.Windows.Forms` | `SendKeys`/Win32 `mouse_event`/`keybd_event` | `Get-Clipboard`/`Set-Clipboard` | **UIPI limitation**: input injection is silently blocked into windows running at a higher elevation than `opencode` (User Interface Privilege Isolation). See §Failure handling. | +| Headless / CI | virtual display | virtual display | n/a | `Xvfb`, `cage`, `weston --headless` | + +### Per-platform notes + +#### Linux Wayland — `ydotool` and `/dev/uinput` + +`ydotool` does **not** speak Wayland directly; it writes synthetic +input events to the kernel through `/dev/uinput`. This requires: + +- The `uinput` kernel module loaded: `sudo modprobe uinput` +- A `udev` rule granting the user's group write access to + `/dev/uinput` (most distros ship one for the `input` group). +- The user to be a member of the `input` group: + `sudo usermod -aG input $USER` (re-login required). +- The `ydotoold` daemon running (foreground or as a systemd user + service): `ydotoold &` or + `systemctl --user enable --now ydotoold`. + +If any of these are missing, `ydotool` either fails with `Permission +denied` (caught and surfaced) or hangs silently (a watchdog timer +must interrupt). Both cases are mapped to actionable errors in +§Failure handling. + +#### Windows — UIPI (User Interface Privilege Isolation) + +Windows enforces UIPI: a process running at a lower integrity level +**cannot send input to** a window running at a higher integrity +level. Practically, if `opencode` runs **without** admin elevation, +every `computer_click` / `computer_type` targeting an "Run as +administrator" window will fail silently with no error reported to +the injecting process. + +This is a hard OS-level constraint with no workaround from userspace +short of: + +- Running `opencode` itself as Administrator (lowers the security + posture of the whole session — **not recommended**, and the + Desktop app should refuse this without an explicit override), or +- Closing the elevated target window (often impractical). + +The backend **detects** the UIPI mismatch where possible (compare +the target window's integrity level against the injecting process) +and surfaces a clear error rather than failing silently. See +§Failure handling for the exact message. + +#### macOS — Accessibility permission + +The running process (`opencode` itself, or the Terminal that +launched it, or the Desktop app) must be granted Accessibility +access in **System Settings → Privacy & Security → Accessibility**. +Without it, `cliclick` and `CoreGraphics` calls return no error but +no input is delivered. Detected and surfaced clearly. + +#### Linux X11 — DISPLAY-based, mostly unconstrained + +`xdotool` requires only `DISPLAY` to be set and the user to have +access to the X server (usually automatic when running in the user's +own session). No additional group or kernel module requirements +beyond a normal desktop install. + +## Structure backend (long-term direction, partial v1) + +Per #32945: when the focused app exposes an accessibility tree +(UI Automation on Windows, NSAccessibility on macOS, AT-SPI on +Linux), prefer it over pixel coordinates. `computer_locate(role, +name?, ...)` returns structural selectors that survive +theme/resize changes. Falls back to OCR (Tesseract) or pixel +diffing when no a11y tree is available. + +Out of scope for v1's first cut, but the backend interface is +designed to admit it without API changes. + +## Permission model + +Built on the existing `permissions` system. Tool-name matching: + +- `computer_screenshot` — low risk, default `allow` +- `computer_click`, `computer_type`, `computer_key`, + `computer_scroll`, `computer_cursor_position` — `DangerFullAccess`, + default `ask` + +Users can pin `allow` / `deny` per-session, per-workspace, or +globally. The Desktop app surfaces an "Allow once / Allow for +session / Deny" prompt identical to `bash`. + +## Audit / event stream + +Every action emits one `computer_use.action` event with: + +- timestamp, session id, agent id +- tool name + parameters +- target window title + app name (when detectable) +- resulting screenshot hash (not the image itself; image stays in + managed tool-output storage per `CONTEXT.md` "Managed Tool Output + File") +- success / failure + error reason + +The session event stream is the same one already consumed by the +desktop UI for tool-call rendering; the existing transcript UI can +surface these events without changes. + +## Configuration + +Top-level block, **adjacent to `tools`** in `opencode.json`: + +```jsonc +// opencode.json (excerpt) +{ + "tools": { /* existing tools config */ }, + "computer_use": { + "enabled": false, // EXPERIMENTAL — must be explicitly opted in + "backend": "auto", // auto | x11 | wayland | macos | windows | headless + "adapter": "auto", // auto | native | ui_tars | generic + "screenshot": { + "max_width": 1024, // aggressive default — saves context & bandwidth + "max_height": null, // null = aspect-preserving (no fixed cap) + "grid_overlay": "auto" // true | false | auto (true for generic, false for ui_tars) + }, + "input": { + "pre_action_delay_ms": 50, + "humanize": false // add jitter; v1.1 + }, + "permissions": "ask", // ask | allow | deny (default for non-screenshot tools) + "max_steps_per_task": 50, // safety cap per session grant + "ui_tars": { + "endpoint": "http://localhost:8000/v1", + "model": "ByteDance-Seed/UI-TARS-1.5-7B", + "thinking": true, // surface model's chain-of-thought + "api_key": "${UI_TARS_API_KEY}" // optional; many local servers don't require one + } + } +} +``` + +## One-line local setup (`opencode computer-use --local`) + +A new CLI subcommand that bootstraps the entire local UI-TARS path. +The flow is: **hardware probe → confirm → install vLLM → download +model → start server → write config snippet**. + +```bash +$ opencode computer-use --local + +🔍 Pre-flight hardware check... + ✅ GPU: NVIDIA RTX 4070 (compute capability 8.9) + ✅ VRAM: 12 GB free (need ≥ 8 GB) + ✅ Disk: 142 GB free at ~/.cache/huggingface (need ≥ 30 GB) + ✅ RAM: 24 GB system, 18 GB free + ✅ pip: Python 3.12.4 available + ✅ Network: huggingface.co reachable + +🔍 Checking prerequisites... + ✅ opencode 1.19.0 + ❌ vLLM not found + ❌ UI-TARS-1.5-7B not cached + +📦 Install vLLM? [Y/n] Y + $ pip install vllm + ✅ vLLM 0.7.2 installed + +📥 Download UI-TARS-1.5-7B (~15 GB at Q4)? [Y/n] Y + $ huggingface-cli download ByteDance-Seed/UI-TARS-1.5-7B ... + ✅ Model cached at ~/.cache/huggingface/... + +🚀 Start local server? [Y/n] Y + $ vllm serve ByteDance-Seed/UI-TARS-1.5-7B --port 8000 & + ✅ Server up at http://localhost:8000/v1 + +✏️ Wrote opencode.json snippet (top-level, adjacent to "tools"): + { + "tools": { /* existing */ }, + "provider": { + "ui_tars_local": { + "type": "openai-compatible", + "base_url": "http://localhost:8000/v1", + "model": "ByteDance-Seed/UI-TARS-1.5-7B" + } + }, + "computer_use": { + "enabled": true, + "adapter": "ui_tars", + "permissions": "ask", + "max_steps_per_task": 50, + "screenshot": { "max_width": 1024 } + } + } + +✅ Ready. Try: opencode "open Firefox and screenshot the homepage" + +⚠️ Computer-use is an EXPERIMENTAL feature. You'll be asked once + per session to grant up to 50 computer-use steps. + +🛑 To stop the server later: opencode computer-use --stop +``` + +If the pre-flight check fails, the command aborts before any +download or install and prints the specific failed check with its +remediation (see §Pre-flight Hardware Check). + +The command is implemented as a thin wrapper over existing tools +(`pip`, `huggingface-cli`, `vllm`) plus a config snippet writer. +It is **idempotent** (skips already-completed steps) and **safe** +(refuses to overwrite an existing config without `--force`). +`--skip-hardware-check` bypasses the pre-flight check (logged with +a warning) for users who know what they're doing. + +This is the killer onboarding story: + +> `brew install opencode && opencode computer-use --local` +> → fully offline GUI agent in ~5 minutes, no API keys. + +## Failure handling & diagnostics + +Every backend must surface *actionable* errors when the environment +isn't usable. No silent hangs. + +| Failure | Surfaced message (with remediation) | +|------------------------------------------|--------------------------------------| +| Wayland — `/dev/uinput` not writable | "Cannot write to `/dev/uinput`. Run: `sudo usermod -aG input $USER` (then re-login), `sudo modprobe uinput`, and ensure a `udev` rule grants the `input` group write access. Then `sudo systemctl restart ydotoold`." | +| Wayland — user not in `input` group | "User not in `input` group (required for `/dev/uinput`). Run: `sudo usermod -aG input $USER` and re-login." | +| Wayland — `ydotoold` daemon not running | "ydotoold is not running. Install with your package manager, then start it: `sudo ydotoold &` (or `systemctl --user enable --now ydotoold`)." | +| Wayland — `uinput` kernel module missing | "`uinput` kernel module not loaded. Run: `sudo modprobe uinput && echo uinput \| sudo tee /etc/modules-load.d/uinput.conf`." | +| macOS — Accessibility permission missing | "opencode needs Accessibility permission. System Settings → Privacy & Security → Accessibility → enable opencode (or the launching Terminal / Desktop app)." | +| macOS — Screen Recording permission missing | "opencode needs Screen Recording permission to capture screenshots. System Settings → Privacy & Security → Screen Recording → enable opencode." | +| Headless — no virtual display detected | "No display server detected. Run under `xvfb-run opencode ...` for X11 CI, or `cage opencode` / `weston --headless` for Wayland CI." | +| Required CLI tool missing (`grim`, `scrot`, `xdotool`, `cliclick`) | Specific install command for the user's detected package manager (apt / dnf / pacman / brew). | +| Windows — UIPI elevation mismatch | "Target window is running as Administrator; opencode is not. Windows UIPI blocks input injection into elevated windows. Either close the elevated target, or restart opencode as Administrator (not recommended — lowers your session's security posture). Diagnostic: `whoami /groups \| findstr Mandatory`." | +| UI-TARS server unreachable | "Cannot reach http://localhost:8000/v1. Run `opencode computer-use --local` to start it, or `opencode computer-use --status` to check." | +| UI-TARS model not loaded | "Server is up but model isn't loaded. Wait ~30s after `vllm serve` completes startup." | + +### Pre-flight Hardware Check (`--local`) + +Before installing vLLM or downloading UI-TARS-1.5-7B, the +`opencode computer-use --local` command runs a non-destructive +hardware probe and **aborts with a clear error if minimums aren't +met**. The user can override with `--skip-hardware-check` (logged +with a warning). + +| Check | Minimum | Failure message | +|------------------------------------------------|-------------------------------|------------------| +| **GPU presence & compute capability** | NVIDIA CUDA (sm_70+, Volta+) **or** Apple Silicon (M1+) | "No compatible GPU detected. UI-TARS-7B requires either an NVIDIA GPU with CUDA compute capability ≥ 7.0 or Apple Silicon (M1 or newer). CPU-only inference is not supported in v1." | +| **VRAM available** (after subtracting other workloads) | ≥ 8 GB free (Q4) — recommend ≥ 12 GB (Q8) or ≥ 16 GB (fp16) | "Only X GB of VRAM available; UI-TARS-7B Q4 needs ≥ 8 GB. Free GPU memory or use a smaller quantization (if available)." | +| **Free disk space** | ≥ 30 GB (vLLM ~3 GB + model ~15 GB Q4 / ~30 GB fp16 + safetensors overhead + room for context cache) | "Only X GB free at ; UI-TARS-1.5-7B needs ≥ 30 GB free. Free disk space or set `HF_HOME` to a larger volume." | +| **Free RAM** (Apple Silicon unified memory) | ≥ 16 GB system RAM with ≥ 12 GB free | "Only X GB of system RAM free; Apple Silicon inference needs ≥ 16 GB system RAM with ≥ 12 GB free. Close other apps or use a smaller model." | +| **`pip` available** (Linux/Windows) | Python ≥ 3.10 + pip | "Python 3.10+ with pip is required to install vLLM. Install Python from python.org or via your package manager." | +| **Network reachability** | `huggingface.co` reachable | "Cannot reach huggingface.co to download the model. Check your network connection, proxy settings, or set `HF_ENDPOINT` to a mirror." | + +The probe runs in `--status` too, so users can sanity-check their +machine without starting the server: + +```bash +$ opencode computer-use --status +GPU: NVIDIA RTX 4070, 12 GB VRAM available (compute capability 8.9) ✅ +Disk: 142 GB free at /home/user ✅ +RAM: 24 GB system, 18 GB free ✅ +Server: not running +Model: ByteDance-Seed/UI-TARS-1.5-7B not cached +``` + +## Phased delivery + +- **Phase 0 — RFC + issue triage** (this document). + - Comment on #32945, link #20490 / #20917 / #26772 / #30755. + - Get sign-off on tool surface and permission model. + +- **Phase 1 — Core tool skeleton + Linux X11 backend** (opencode 1.19). + - Tool registry entries (`computer_*`). + - Permission integration. + - Event stream emit. + - `xdotool` / `scrot` / `xclip` backend. + - Generic vision adapter end-to-end working. + - User-visible: opt-in via config; works with any vision-capable + cloud model. + +- **Phase 2 — macOS + Windows + Wayland backends** (opencode 1.19). + - Each with the specific permission/error UX from the table above. + - Backend auto-detection hardened across distros. + +- **Phase 3a — UI-TARS native adapter + `--local` setup** (opencode 1.19). + - Highest-leverage integration: Apache-2.0, runs locally, beats + closed competitors on OSWorld. + - References: [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop) + for Electron UX patterns, + [UI-TARS SDK](https://github.com/bytedance/UI-TARS) for action + vocabulary mapping. + - Ship `opencode computer-use --local` and `--status` subcommands. + +- **Phase 3b — Closed provider native adapters** (opencode 1.20). + - Anthropic `computer_use` block autodetect + pass-through. + - OpenAI `computer-use-preview` autodetect + pass-through. + +- **Phase 3c — Generic vision adapter polish** (opencode 1.20). + - CoT prompting helpers for non-UI-TARS VLMs. + - Coordinate grid overlay defaults. + +- **Phase 4 — Headless / CI path** (opencode 1.20). + - `xvfb-run opencode ...` integration documented. + - `cage` / `weston --headless` for Wayland CI. + - GitHub Actions recipe in `packages/opencode/.github/`. + +- **Phase 5 — Structure backend** (opencode 1.21+). + - AT-SPI / NSAccessibility / UI Automation. + - `computer_locate(role, name?, ...)` tool. + - Composes with #32945's "reusable flows" direction. + +## Open questions for maintainers + +### Resolved in this revision + +| # | Question | Decision | +|---|----------|----------| +| 1 | Core vs plugin-first | **Core-native, experimental.** Ships behind `computer_use.enabled: false` feature flag, labelled `experimental` in schema validation and release notes. | +| 2 | Permission granularity | **Ask per batch / session** with `max_steps_per_task` budget (default 50). One prompt at session start, no per-action prompts. | +| 3 | Screenshot transport | **Base64 PNG image part**, aggressive default `max_width: 1024` (aspect-preserving). Native format for UI-TARS / Anthropic / OpenAI. | +| 4 | Native closed-provider adapters | **In Core** for v1 (one install, simpler UX). Open to extracting later if it bloats the Core provider surface. | +| 5 | Config block location | **Top-level `computer_use` adjacent to `tools`** in `opencode.json`. | +| 6 | `--local` setup command | **Ship as first-class subcommand** (`--local`, `--status`, `--stop`). The onboarding story is the differentiator vs Claude Code / Cursor. | + +### Still open + +7. **Quantized UI-TARS defaults.** Default to Q4 (~8GB VRAM) or + fp16 (~16GB)? Default to a smaller variant like the Qwen2-VL-2B + derivative if available? +8. **When to graduate out of experimental.** Ship as experimental for + 1.19 and 1.20, then promote to stable in 1.21 once we have: + - Headless CI green for ≥ 2 weeks + - At least 3 external users reporting successful real-world runs + - All OS backend failure paths (incl. UIPI / `/dev/uinput`) covered + by tests +9. **`grid_overlay` default for generic adapter.** Coordinate grid + helps non-UI-TARS VLMs localize. Trade-off: extra annotation + tokens. Default `true` for `generic` adapter seems right but + needs benchmark confirmation. +10. **Step-budget UX after a denial.** When the user denies a + session grant, can the agent retry with a narrower request + (e.g., "only `computer_screenshot`, no input"), or is that a + per-session state? Proposal: a denied session can be re-prompted + with a narrower scope. + +## References + +### opencode issues + +- #20490 — original detailed spec (closed without merged impl) +- #20917 — claude-code plugin compat request +- #26772 — integrated browser (separate scope, browser only) +- #30755 — Codex click-to-edit (separate scope, browser only) +- #32945 — accessibility-tree + reusable flows (compatible direction) + +### UI-TARS + +- UI-TARS-1.5-7B model card — + https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B +- UI-TARS paper (Qin et al., 2025) — https://arxiv.org/abs/2501.12326 +- UI-TARS-1.5 blog — https://seed-tars.com/1.5/ +- UI-TARS-desktop (Electron reference impl) — + https://github.com/bytedance/UI-TARS-desktop +- UI-TARS SDK — https://github.com/bytedance/UI-TARS +- ScreenSpot-Pro grounding benchmark — https://arxiv.org/abs/2504.07981 +- OSWorld benchmark — https://arxiv.org/abs/2404.07972 +- Windows Agent Arena — https://arxiv.org/abs/2409.08264 +- WebVoyager — https://arxiv.org/abs/2401.13919 +- Online-Mind2Web — https://arxiv.org/abs/2504.01382 + +### Closed-provider equivalents + +- Claude Code computer use docs (Anthropic) +- OpenAI `computer-use-preview` schema +- AT-SPI / NSAccessibility / UI Automation references for Phase 5 + +### opencode internals + +- `CONTEXT.md` — session runtime semantics; in particular the + "Managed Tool Output File" and "provider-executed tool results" + rules that govern how screenshots and provider-native tool calls + land in session history. +- `AGENTS.md` — contributor style guide; Phase 1 PRs must follow + it. +- `specs/v2/` — V2 session architecture; computer-use event emission + hooks into the V2 event stream. \ No newline at end of file