diff --git a/.githooks/pre-commit b/.githooks/pre-commit
old mode 100755
new mode 100644
diff --git a/DEV-LOG.md b/DEV-LOG.md
index 45b869cca..349e499a2 100644
--- a/DEV-LOG.md
+++ b/DEV-LOG.md
@@ -1,5 +1,133 @@
 # DEV-LOG

+## Computer Use Windows Enhancements: Window-Bound Screenshots + UI Automation + OCR (2026-04-03)
+
+Builds on the three-platform baseline, using native Windows APIs to add Windows-only Computer Use capabilities.
+
+**New files:**
+
+| File | Lines | Description |
+|------|------|------|
+| `src/utils/computerUse/win32/windowCapture.ts` | n/a | Window-bound screenshots via `PrintWindow`; works for occluded/background windows |
+| `src/utils/computerUse/win32/windowEnum.ts` | n/a | Precise window enumeration via `EnumWindows` (HWND + PID + title) |
+| `src/utils/computerUse/win32/uiAutomation.ts` | n/a | `IUIAutomation`: read the UI element tree, click buttons, write text, identify elements by coordinates |
+| `src/utils/computerUse/win32/ocr.ts` | n/a | `Windows.Media.Ocr`: screenshot + text recognition (English + Chinese) |
+
+**Modified files:**
+
+| File | Change |
+|------|------|
+| `packages/@ant/computer-use-swift/src/backends/win32.ts` | `listRunning` now uses EnumWindows; adds `captureWindowTarget` for window-level screenshots |
+
+**Verification results (Windows x64):**
+- Window enumeration: 38 visible windows ✅
+- Window screenshot: VS Code 2575x1415, 444KB ✅ (PrintWindow, even when occluded)
+- UI Automation: element identification by coordinates ✅
+- OCR: recognized VS Code UI text, 34 lines ✅
+
+---
+
+## Enable Computer Use: macOS + Windows + Linux (2026-04-03)
+
+Restores the Computer Use screen-control feature. The reference project supports macOS only; this change extends it to three platforms.
+
+**Phase 1: replace the MCP server stub:**
+Copied the full `@ant/computer-use-mcp` implementation from the reference project (12 files, 6517 lines).
+
+**Phase 2: remove 8 macOS hard-coded checks in src/:**
+
+| File | Change |
+|------|------|
+| `src/main.tsx:1605` | Removed `getPlatform() === 'macos'` |
+| `src/utils/computerUse/swiftLoader.ts` | Removed the darwin-only throw |
+| `src/utils/computerUse/executor.ts` | Platform guard widened to darwin+win32+linux; clipboard dispatched per platform (pbcopy→PowerShell→xclip); paste shortcut command→ctrl |
+| `src/utils/computerUse/drainRunLoop.ts` | Non-darwin runs fn() directly |
+| `src/utils/computerUse/escHotkey.ts` | Non-darwin returns false (Ctrl+C fallback) |
+| `src/utils/computerUse/hostAdapter.ts` | Non-darwin permission check returns granted |
+| `src/utils/computerUse/common.ts` | platform + screenshotFiltering made dynamic |
+| `src/utils/computerUse/gates.ts` | enabled:true +
hasRequiredSubscription→true |
+
+**Phase 3: dispatcher + backends three-platform architecture for the input/swift packages:**
+
+```
+packages/@ant/computer-use-{input,swift}/src/
+├── index.ts ← dispatcher
+├── types.ts ← shared interfaces
+└── backends/
+    ├── darwin.ts ← macOS AppleScript (split out as-is, logic unchanged)
+    ├── win32.ts ← Windows PowerShell
+    └── linux.ts ← Linux xdotool/scrot/xrandr/wmctrl
+```
+
+**Build flag:** `CHICAGO_MCP` added to DEFAULT_FEATURES + DEFAULT_BUILD_FEATURES
+
+**Verification results (Windows x64):**
+- `isSupported: true` ✅
+- Mouse positioning + foreground window info ✅
+- Dual-monitor detection, 2560x1440 × 2 ✅
+- Full-screen screenshot, 3MB base64 ✅
+- `bun run build`, 463 files ✅
+
+---
+
+## Enable Voice Mode / VOICE_MODE (2026-04-03)
+
+Restores `/voice` voice input. All voice-related source under `src/` already matches the official build (0 lines of difference). The problems were: (1) the `VOICE_MODE` build flag was off, so the command did not show up; (2) `audio-capture-napi` was a SoX subprocess stub (unsupported on Windows) and lacked the official native `.node` binary.
+
+**New files:**
+
+| File | Description |
+|------|------|
+| `vendor/audio-capture/{platform}/audio-capture.node` | Native audio binaries for 6 platforms (cpal, from the reference project) |
+| `vendor/audio-capture-src/index.ts` | Native module loader (dynamically requires the `.node` by `${arch}-${platform}`) |
+
+**Modified files:**
+
+| File | Change |
+|------|------|
+| `packages/audio-capture-napi/src/index.ts` | SoX subprocess stub → native `.node` loader (with a `process.cwd()` workspace-path fallback) |
+| `scripts/dev.ts` | Added `"VOICE_MODE"` to `DEFAULT_FEATURES` |
+| `build.ts` | Added `"VOICE_MODE"` to `DEFAULT_BUILD_FEATURES` |
+| `docs/features/voice-mode.md` | Appended a restoration-plan chapter (section 8) |
+
+**Verification results:**
+
+- `isNativeAudioAvailable()` → `true` (native `.node` loaded successfully on Windows x64)
+- `feature('VOICE_MODE')` → `ENABLED`
+- `bun run build` → voice code is included in the build output
+
+**Runtime prerequisites:** claude.ai OAuth login + microphone permission
+
+---
+
+## Enable Claude in Chrome MCP (2026-04-03)
+
+Restores Chrome browser control. All claudeInChrome-related source under `src/` already matches the official build (0 lines of difference); the problem was that the `@ant/claude-for-chrome-mcp` package was a 6-line stub (returning an empty tool list and a null server).
+
+**Replaced files:**
+
+| File | Change |
+|------|------|
+| `packages/@ant/claude-for-chrome-mcp/src/index.ts` | 6-line stub → 15-line full export |
+
+**New files:**
+
+| File | Lines | Description |
+|------|------|------|
+| `packages/@ant/claude-for-chrome-mcp/src/types.ts` | 134 | Type definitions |
+|
`packages/@ant/claude-for-chrome-mcp/src/browserTools.ts` | 546 | 17 browser tool definitions |
+| `packages/@ant/claude-for-chrome-mcp/src/mcpServer.ts` | 96 | MCP Server |
+| `packages/@ant/claude-for-chrome-mcp/src/mcpSocketClient.ts` | 493 | Unix Socket client |
+| `packages/@ant/claude-for-chrome-mcp/src/mcpSocketPool.ts` | 327 | Multi-profile connection pool |
+| `packages/@ant/claude-for-chrome-mcp/src/bridgeClient.ts` | 1126 | Bridge WebSocket client |
+| `packages/@ant/claude-for-chrome-mcp/src/toolCalls.ts` | 301 | Tool-call routing |
+
+**No feature flag is needed, dev.ts/build.ts stay as they are, and no file under src/ is touched.**
+
+**Runtime dependencies:** Chrome browser + the Claude in Chrome extension (https://claude.ai/chrome)
+
+---
+
 ## OpenAI API Compatibility (2026-04-03)

 **Branch**: `feature/openai`
diff --git a/README.md b/README.md
index c9f30b4a3..dc92cdb5f 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,10 @@
 - [x] Add custom GrowthBook support (GrowthBook is open source too, so you can now configure your own remote-config platform) [docs](https://ccb.agent-aura.top/docs/internals/growthbook-adapter)
 - [x] Custom login mode; you can use it to configure Claude's models!
 - [x] Fix the missing rg binary for the search tool (requires re-running bun i)
- - [ ] OpenAI API compatibility! Run /login, then configure the OpenAI platform!
+ - [x] OpenAI API compatibility! Run /login, then configure the OpenAI platform!
+ - [x] Chrome use support (for now the browser extension requires a subscription, thanks to Anthropic's restrictions). Thanks @amDosion
+ - [x] Computer use support. Thanks @amDosion
+ - [x] /voice support @amDosion
 - [ ] V6: large-scale refactor of the legacy code pile, fully split into modular packages
 - [ ] V6 will live on a brand-new branch; main will then be archived as a historical version
diff --git a/Run.ps1 b/Run.ps1
new file mode 100644
index 000000000..6741892cb
--- /dev/null
+++ b/Run.ps1
@@ -0,0 +1,2 @@
+bun install
+bun run dev --dangerously-skip-permissions
\ No newline at end of file
diff --git a/build.ts b/build.ts
index 11c4a2481..80d9e9a62 100644
--- a/build.ts
+++ b/build.ts
@@ -10,7 +10,7 @@ rmSync(outdir, { recursive: true, force: true });

 // Default features that match the official CLI build.
 // Additional features can be enabled via FEATURE_<NAME>=1 env vars.
-const DEFAULT_BUILD_FEATURES = ["AGENT_TRIGGERS_REMOTE"];
+const DEFAULT_BUILD_FEATURES = ["AGENT_TRIGGERS_REMOTE", "CHICAGO_MCP", "VOICE_MODE"];

 // Collect FEATURE_* env vars → Bun.build features
 const envFeatures = Object.keys(process.env)
diff --git a/docs/features/claude-in-chrome-mcp.md b/docs/features/claude-in-chrome-mcp.md
new file mode 100644
index 000000000..3b668ff55
--- /dev/null
+++ b/docs/features/claude-in-chrome-mcp.md
@@ -0,0 +1,243 @@
+# Claude in Chrome MCP: Restoration Plan
+
+Last updated: 2026-04-03
+Reference project: `E:\源码\claude-code-source-main\claude-code-source-main`
+
+## 1. Feature Overview
+
+Claude in Chrome lets the Claude Code CLI control the user's Chrome browser over the MCP protocol: navigating pages, filling in forms, taking screenshots, recording GIFs, reading the DOM, executing JS, and monitoring network requests and console logs.
+
+There are two transports:
+- **Local socket**: the Chrome extension connects to the CLI over a Unix socket via the Native Messaging Host
+- **Bridge WebSocket**: relayed through Anthropic's bridge service, which supports remote browsers
+
+## 2. Full Load Chain
+
+```
+CLI startup
+  │
+  ▼
+src/main.tsx:1003
+  .option('--chrome', 'Enable Claude in Chrome integration')
+  │
+  ▼
+src/main.tsx:1522-1527
+  setChromeFlagOverride(chromeOpts.chrome)
+  │
+  ▼
+src/utils/claudeInChrome/setup.ts
+  shouldEnableClaudeInChrome()
+  ├── --chrome flag → true
+  ├── --no-chrome flag → false
+  ├── non-interactive mode → false
+  ├── env var CLAUDE_CODE_DISABLE_CHROME → false
+  ├── config claudeInChromeDefaultEnabled → true/false
+  └── Chrome extension installed + GrowthBook tengu_chrome_auto_enable → auto
+  │
+  ▼
+src/utils/claudeInChrome/setup.ts
+  setupClaudeInChrome()
+  ├── generates the MCP server config
+  └── returns mcpConfig + allowedTools
+  │
+  ▼
+src/utils/claudeInChrome/mcpServer.ts
+  import { createClaudeForChromeMcpServer } from '@ant/claude-for-chrome-mcp'
+  │
+  ▼
+packages/@ant/claude-for-chrome-mcp/src/index.ts ← currently a STUB
+  export function createClaudeForChromeMcpServer() { return null }
+  export const BROWSER_TOOLS = []
+```
+
+## 3.
Blocker List
+
+| # | Blocker | Location | Status |
+|---|--------|------|------|
+| ① | `@ant/claude-for-chrome-mcp` is a stub | `packages/@ant/claude-for-chrome-mcp/src/index.ts` | **6-line empty shell that returns null** |
+| ② | Full implementation missing (7 new files plus the rewritten `index.ts`, 3038 lines in total) | `packages/@ant/claude-for-chrome-mcp/src/` | Only 1 stub file present |
+
+**No feature flag is needed**: the `/chrome` command is registered unconditionally at `src/commands.ts:264`.
+
+**No file under `src/` needs to change**: all of the following have 0 lines of difference from the reference project:
+- `src/utils/claudeInChrome/setup.ts`
+- `src/utils/claudeInChrome/mcpServer.ts`
+- `src/utils/claudeInChrome/common.ts`
+- `src/utils/claudeInChrome/chromeNativeHost.ts`
+- `src/utils/claudeInChrome/prompt.ts`
+- `src/utils/claudeInChrome/setupPortable.ts`
+- `src/utils/claudeInChrome/toolRendering.tsx`
+- `src/commands/chrome/index.ts`
+- `src/commands/chrome/chrome.tsx` (sourcemap differences only)
+- `src/skills/bundled/claudeInChrome.ts`
+
+## 4. Full Implementation Inventory of the Reference Project
+
+Reference project path: `deps/@ant/claude-for-chrome-mcp/src/`
+
+| File | Lines | Responsibility |
+|------|------|------|
+| `index.ts` | 15 | Export entry point: `createBridgeClient`, `BROWSER_TOOLS`, `createChromeSocketClient`, `createClaudeForChromeMcpServer`, `localPlatformLabel` + type exports |
+| `types.ts` | 134 | Type definitions: `Logger`, `PermissionMode`, `BridgeConfig`, `ChromeExtensionInfo`, `ClaudeForChromeContext`, `SocketClient`, `BridgePermissionRequest/Response`, `PermissionOverrides` |
+| `browserTools.ts` | 546 | 17 browser tool definitions (MCP tool schema) |
+| `mcpServer.ts` | 96 | MCP Server creation: registers the `ListTools`/`CallTool` handlers and picks the socket/bridge transport |
+| `mcpSocketClient.ts` | 493 | Unix Socket client: connects to the Chrome Native Messaging Host, JSON-RPC communication |
+| `mcpSocketPool.ts` | 327 | Socket connection pool: multi-Chrome-profile support, routing by tabId |
+| `bridgeClient.ts` | 1126 | Bridge WebSocket client: connects to the Anthropic bridge service; extension discovery, device pairing, permission management |
+| `toolCalls.ts` | 301 | Tool-call routing: connection-state handling, result conversion, permission-mode switching, browser switching |
+
+### 17 Browser Tools
+
+| Tool | Function |
+|--------|------|
+| `javascript_tool` | Execute JavaScript in the page context |
+| `read_page` | Get the page accessibility tree (DOM) |
+| `find` | Natural-language search for page elements |
+| `form_input` | Fill in form fields |
+| `computer` | Mouse/keyboard actions + screenshots (13 actions) |
+| `navigate` | URL navigation
/ back and forward |
+| `resize_window` | Resize the browser window |
+| `gif_creator` | GIF recording and export |
+| `upload_image` | Upload an image to a file input or drag-and-drop area |
+| `get_page_text` | Extract the page's plain text |
+| `tabs_context_mcp` | Get info about the current tab group |
+| `tabs_create_mcp` | Create a new tab |
+| `update_plan` | Submit an action plan to the user for approval |
+| `read_console_messages` | Read browser console logs |
+| `read_network_requests` | Read network requests |
+| `shortcuts_list` | List available shortcuts |
+| `shortcuts_execute` | Execute a shortcut |
+| `switch_browser` | Switch to another Chrome browser (bridge mode only) |
+
+### External Dependencies
+
+| Dependency | Purpose | Already in our project? |
+|------|------|----------------|
+| `ws` | WebSocket client (bridge mode) | ✅ Yes |
+| `@modelcontextprotocol/sdk` | MCP Server + types | ✅ Yes |
+| `fs`/`net`/`os`/`path` | Node.js built-ins | ✅ |
+
+## 5. Fix Steps
+
+### Step 1: Copy the full implementation into the stub package directory
+
+```bash
+# Copy all 8 .ts files from the reference project (7 new, plus index.ts overwriting the existing stub)
+cp "E:/源码/claude-code-source-main/claude-code-source-main/deps/@ant/claude-for-chrome-mcp/src/"*.ts \
+  "E:/源码/Claude-code-bast/packages/@ant/claude-for-chrome-mcp/src/"
+```
+
+After copying, `packages/@ant/claude-for-chrome-mcp/src/` should contain 8 files:
+
+```
+packages/@ant/claude-for-chrome-mcp/src/
+├── index.ts ← overwrites the stub (15 lines, export entry point)
+├── types.ts ← new (134 lines)
+├── browserTools.ts ← new (546 lines)
+├── mcpServer.ts ← new (96 lines)
+├── mcpSocketClient.ts ← new (493 lines)
+├── mcpSocketPool.ts ← new (327 lines)
+├── bridgeClient.ts ← new (1126 lines)
+└── toolCalls.ts ← new (301 lines)
+```
+
+### Step 2: Verify the build
+
+```bash
+bun run build
+```
+
+No changes to `scripts/dev.ts` or `build.ts` are needed (there is no feature flag).
+
+### Step 3: Functional verification
+
+```bash
+# Start (with chrome enabled manually)
+bun run dev -- --chrome
+
+# In the REPL:
+# 1. The /chrome command should show the Chrome settings menu
+# 2. If the Chrome extension is installed → status shows "Enabled"
+# 3. If it is not installed → a link to install the extension is shown
+```
+
+## 6.
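Verification Checklist
+
+### 6.1 Build Verification
+
+| Test | Expected result | Command |
+|--------|---------|---------|
+| Build succeeds | No errors | `bun run build` |
+| BROWSER_TOOLS is non-empty | Build output contains the 17 tool definitions | `grep "javascript_tool" dist/*.js` |
+| createClaudeForChromeMcpServer is non-null | Build output contains the MCP Server creation logic | `grep "ListToolsRequestSchema" dist/*.js` |
+| Bridge WebSocket logic is in the output | Contains the bridge connection code | `grep "bridge.claudeusercontent.com" dist/*.js` |
+
+### 6.2 Command Registration
+
+| Test | Expected result |
+|--------|---------|
+| `/chrome` command is visible | Typing `/chrome` in the REPL shows the settings menu |
+| `--chrome` flag works | `bun run dev -- --chrome` runs without errors |
+| `--no-chrome` flag works | `bun run dev -- --no-chrome` runs without errors |
+
+### 6.3 MCP Server Verification (requires the Chrome extension)
+
+| Test | Expected result |
+|--------|---------|
+| Chrome extension detection | With the extension installed, `/chrome` shows "Extension: Installed" |
+| Socket connection | MCP tools become available once the extension connects |
+| BROWSER_TOOLS registration | `tabs_context_mcp` and the rest of the 17 tools are visible in the MCP tool list |
+
+### 6.4 Tool Function Verification (requires the Chrome extension + a connection)
+
+| Test | Expected result |
+|--------|---------|
+| `tabs_context_mcp` | Returns info about the current tab group |
+| `navigate` | Navigates to the given URL |
+| `computer` + `screenshot` | Returns a page screenshot |
+| `read_page` | Returns the DOM accessibility tree |
+| `javascript_tool` | Executes JS and returns the result |
+
+### 6.5 No Impact on Existing Features
+
+| Test | Expected result |
+|--------|---------|
+| Start without `--chrome` | Runs normally, no chrome-related errors |
+| `/voice` command | Unaffected |
+| `/schedule` command | Unaffected |
+| `bun test` | All existing tests pass |
+
+## 7.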
Change Summary
+
+| Action | File | Notes |
+|------|------|------|
+| Overwrite stub | `packages/@ant/claude-for-chrome-mcp/src/index.ts` | 6-line stub → 15-line full export |
+| New | `packages/@ant/claude-for-chrome-mcp/src/types.ts` | 134 lines of type definitions |
+| New | `packages/@ant/claude-for-chrome-mcp/src/browserTools.ts` | 546 lines, 17 tool definitions |
+| New | `packages/@ant/claude-for-chrome-mcp/src/mcpServer.ts` | 96-line MCP Server |
+| New | `packages/@ant/claude-for-chrome-mcp/src/mcpSocketClient.ts` | 493-line Socket client |
+| New | `packages/@ant/claude-for-chrome-mcp/src/mcpSocketPool.ts` | 327-line connection pool |
+| New | `packages/@ant/claude-for-chrome-mcp/src/bridgeClient.ts` | 1126-line Bridge client |
+| New | `packages/@ant/claude-for-chrome-mcp/src/toolCalls.ts` | 301-line tool-call routing |
+
+**Unchanged**: all files under `src/` (already identical to the reference project), `scripts/dev.ts`, `build.ts`.
+
+## 8. Runtime Dependencies
+
+| Dependency | Required? | Notes |
+|------|--------|------|
+| Chrome browser | Yes | Chrome must be installed |
+| Claude in Chrome extension | Yes | Install from https://claude.ai/chrome |
+| claude.ai OAuth login | Required for Bridge mode | Not needed for local Socket mode |
+| Native Messaging Host | Required for local Socket mode | Registered automatically when the extension is installed |
+
+## 9. Comparison with the /voice and /schedule Restorations
+
+| Item | `/schedule` | `/voice` | Claude in Chrome |
+|---|---|---|---|
+| Build flag | `AGENT_TRIGGERS_REMOTE` | `VOICE_MODE` | **Not needed** |
+| Change dev.ts/build.ts | ✅ | ✅ | **Not needed** |
+| Missing vendor binary | None | `.node` file | None |
+| Stub to replace | None | `audio-capture-napi` | `@ant/claude-for-chrome-mcp` (7 files) |
+| Changes to src/ source | None | None | None |
+| Platform restriction | None | Needs the native `.node` | Needs the Chrome browser |
diff --git a/docs/features/computer-use-windows-enhancement.md b/docs/features/computer-use-windows-enhancement.md
new file mode 100644
index 000000000..288da5daf
--- /dev/null
+++ b/docs/features/computer-use-windows-enhancement.md
@@ -0,0 +1,315 @@
+# Computer Use Windows Enhancement: Implementation Plan
+
+Last updated: 2026-04-03
+Depends on: `docs/features/windows-ai-desktop-control.md`, `docs/features/computer-use.md`
+
+## 1. Goals
+
+Building on the existing PowerShell-subprocess approach, use native Windows APIs to enhance the Windows implementation of Computer Use and solve 3 core problems:
+
+1.
**Window-bound screenshots**: the current `CopyFromScreen` can only capture the full screen, not a specific window (especially an occluded/minimized one)
+2. **UI structure awareness**: currently only coordinate clicks are possible; unlike macOS Accessibility, there is no understanding of the UI element tree
+3. **Performance**: each PowerShell launch costs about 273ms; high-frequency operations such as clipboard access and window enumeration need a faster path
+
+## 2. Verified Windows API Capabilities
+
+All of the following APIs were tested successfully through PowerShell P/Invoke:
+
+| Capability | API | Result |
+|------|-----|---------|
+| Window-bound screenshot | `PrintWindow(hwnd, hdc, PW_RENDERFULLCONTENT)` | ✅ VS Code 342KB, Chrome 273KB |
+| Enumerate windows + HWND | `EnumWindows` + `GetWindowText` + `GetWindowThreadProcessId` | ✅ 38 windows, with HWND/PID/title |
+| UI element tree | `System.Windows.Automation.AutomationElement` | ✅ Notepad, 39 elements |
+| UI value write | `ValuePattern.SetValue()` | ✅ Text written to Notepad successfully |
+| UI click | `InvokePattern.Invoke()` | ✅ Buttons can be clicked programmatically |
+| Element at coordinates | `AutomationElement.FromPoint(x, y)` | ✅ Returns element type + name |
+| OCR | `Windows.Media.Ocr.OcrEngine` | ✅ English + Chinese engines available |
+| Global hotkey | `RegisterHotKey` | ✅ API callable |
+| Direct clipboard access | `System.Windows.Forms.Clipboard` | ✅ Read/write/image detection |
+| Shell launch | `ShellExecute` | ✅ Opens files/URLs/apps |
+
+## 3. Architecture
+
+### 3.1 File Structure
+
+New Windows-only modules are added on top of the existing `backends/win32.ts`:
+
+```
+packages/@ant/computer-use-input/src/
+├── backends/
+│   ├── darwin.ts ← untouched
+│   ├── win32.ts ← enhanced: direct Win32 APIs replace part of the PowerShell usage
+│   └── linux.ts ← untouched
+
+packages/@ant/computer-use-swift/src/
+├── backends/
+│   ├── darwin.ts ← untouched
+│   ├── win32.ts ← enhanced: PrintWindow window screenshots + EnumWindows
+│   └── linux.ts ← untouched
+
+packages/@ant/computer-use-mcp/src/
+│   └── tools.ts ← adds Windows-only tool definitions (UI Automation, OCR)
+
+src/utils/computerUse/
+│   └── win32/ ← new directory: Windows-only capabilities
+│       ├── uiAutomation.ts ← UI element tree, click, value write
+│       ├── ocr.ts ← screenshot + OCR text recognition
+│       ├── windowCapture.ts ← PrintWindow window-bound screenshots
+│       └── windowEnum.ts ← EnumWindows window enumeration
+```
+
+### 3.2 Layers
+
+```
+┌──────────────────────────────────────────────┐
+│  Computer Use MCP Tools                      │
+│  screenshot / click / type / request_access  │
+│  + Windows-only: ui_tree / ocr / window_cap  │
+├──────────────────────────────────────────────┤
+│  src/utils/computerUse/                      │
+│    executor.ts → per-platform dispatch       │
+│    win32/ → Windows-only capability modules  │
+├──────────────────────────────────────────────┤
+│  packages/@ant/computer-use-{input,swift}    │
+│    backends/win32.ts → PowerShell + Win32 API│
+├──────────────────────────────────────────────┤
+│  Windows Native API                          │
+│    PrintWindow / EnumWindows / UI Automation │
+│    SendInput / Clipboard / OCR / ShellExecute│
+└──────────────────────────────────────────────┘
+```
+
+## 4. Implementation Plan
+
+### Phase A: Window-Bound Screenshots (solves the core problem)
+
+**Problem**: the current `CopyFromScreen` can only capture the full screen, not a specific window.
+**Approach**: implement window-level screenshots with `PrintWindow` + `FindWindow`.
+
+| Step | File | Change |
+|------|------|------|
+| A.1 | `src/utils/computerUse/win32/windowCapture.ts` | New: `captureWindow(title)` captures the given window with PrintWindow |
+| A.2 | `src/utils/computerUse/win32/windowEnum.ts` | New: `listWindows()` uses EnumWindows to return {hwnd, pid, title}[] |
+| A.3 | `packages/@ant/computer-use-swift/src/backends/win32.ts` | `screenshot.captureExcluding` gains per-window capture |
+| A.4 | `packages/@ant/computer-use-swift/src/backends/win32.ts` | `apps.listRunning` uses EnumWindows instead of Get-Process (returns HWND) |
+
+**PowerShell script core**:
+
+```powershell
+# Capture a specific window with PrintWindow
+Add-Type -AssemblyName System.Drawing
+Add-Type -ReferencedAssemblies System.Drawing @'
+using System; using System.Runtime.InteropServices; using System.Drawing; using System.Drawing.Imaging;
+public class WinCap {
+    [DllImport("user32.dll", CharSet=CharSet.Unicode)]
+    public static extern IntPtr FindWindow(string c, string t);
+    [DllImport("user32.dll")]
+    public static extern bool GetWindowRect(IntPtr h, out RECT r);
+    [DllImport("user32.dll")]
+    public static extern bool PrintWindow(IntPtr h, IntPtr hdc, uint f);
+    [StructLayout(LayoutKind.Sequential)]
+    public struct RECT { public int L, T, R, B; }
+    // ...
CaptureByTitle(string title) → base64
+}
+'@
+```
+
+**Acceptance criteria**:
+- Can capture a window by its title
+- Occluded windows can also be captured
+- Returns base64 + width + height
+
+### Phase B: UI Automation (new Windows-only capability)
+
+**Problem**: macOS has the Accessibility API for reading/operating UI elements; Windows can currently only click by coordinates.
+**Approach**: implement UI tree reading and element operations with `System.Windows.Automation`.
+
+| Step | File | Change |
+|------|------|------|
+| B.1 | `src/utils/computerUse/win32/uiAutomation.ts` | New: wrappers for the core UIA operations |
+| B.2 | `packages/@ant/computer-use-mcp/src/tools.ts` | Add Windows-only tool definitions |
+
+**Functions exported by uiAutomation.ts**:
+
+```typescript
+// Get a window's UI element tree
+declare function getUITree(windowTitle: string, depth: number): UIElement[]
+
+// Find an element by name/type/AutomationId
+declare function findElement(windowTitle: string, query: { name?: string, controlType?: string, automationId?: string }): UIElement | null
+
+// Click an element (InvokePattern)
+declare function clickElement(windowTitle: string, automationId: string): boolean
+
+// Set an element's value (ValuePattern)
+declare function setValue(windowTitle: string, automationId: string, value: string): boolean
+
+// Get the element at a coordinate
+declare function elementAtPoint(x: number, y: number): UIElement | null
+```
+
+**UIElement type**:
+```typescript
+interface UIElement {
+  name: string
+  controlType: string // Button, Edit, Text, List, etc.
+  automationId: string
+  boundingRect: { x: number, y: number, w: number, h: number }
+  isEnabled: boolean
+  value?: string // present when ValuePattern is available
+  children?: UIElement[]
+}
+```
+
+**PowerShell script core**:
+```powershell
+Add-Type -AssemblyName UIAutomationClient
+Add-Type -AssemblyName UIAutomationTypes
+
+# Read the UI tree
+$root = [AutomationElement]::RootElement
+$window = $root.FindFirst([TreeScope]::Children,
+    [PropertyCondition]::new([AutomationElement]::NameProperty, $title))
+$elements = $window.FindAll([TreeScope]::Descendants, [Condition]::TrueCondition)
+
+# Write text
+$element.GetCurrentPattern([ValuePattern]::Pattern).SetValue($text)
+
+# Click a button
+$element.GetCurrentPattern([InvokePattern]::Pattern).Invoke()
+```
+
+**Acceptance criteria**:
+- Can read Notepad's UI tree (buttons, text boxes, menus)
+- Can write content into a text box
+- Can click buttons
+- Can identify the element at a coordinate
+
+### Phase C: OCR Screen Text Recognition
+
+**Problem**: after a screenshot, the AI only sees an image and cannot read the text directly.
+**Approach**: run text recognition on screenshots with `Windows.Media.Ocr`.
+
+| Step | File | Change |
+|------|------|------|
+| C.1 | `src/utils/computerUse/win32/ocr.ts` | New: screenshot + OCR |
+| C.2 | `packages/@ant/computer-use-mcp/src/tools.ts` | Add the `screen_ocr` tool definition |
+
+**Functions exported by ocr.ts**:
+```typescript
+// OCR a screen region
+declare function ocrRegion(x: number, y: number, w: number, h: number, lang?: string): OcrResult
+
+// OCR a specific window
+declare function ocrWindow(windowTitle: string, lang?: string): OcrResult
+
+interface OcrResult {
+  text: string
+  lines: { text: string, bounds: { x: number, y: number, w: number, h: number } }[]
+  language: string
+}
+```
+
+**Confirmed available languages**: English (en-US) + Chinese (zh-Hans-CN)
+
+**Acceptance criteria**:
+- Can recognize English and Chinese text in a screen region
+- Returns the text plus per-line position info
+
+### Phase D: Performance Optimization for High-Frequency Operations
+
+**Problem**: each PowerShell launch takes 273ms, which is too slow for high-frequency operations such as mouse movement.
+**Approach**: replace PowerShell subprocesses with direct APIs such as .NET `System.Windows.Forms.Clipboard`.
+
+| Step | File | Change |
+|------|------|------|
+| D.1 | `src/utils/computerUse/executor.ts` | Clipboard operations use the direct API instead of PowerShell |
+| D.2 | Consider a resident PowerShell process | Interact over stdin/stdout to amortize the startup cost |
+
+**Direct clipboard API** (no PowerShell subprocess needed):
+```powershell
+# Read: 50ms → <1ms
+[System.Windows.Forms.Clipboard]::GetText()
+
+# Write: 50ms → <1ms
+[System.Windows.Forms.Clipboard]::SetText($text)
+
+# Image detection
+[System.Windows.Forms.Clipboard]::ContainsImage()
+```
+
+### Phase E: Adapting `request_access` for Windows
+
+**Problem**: `request_access` relies on the macOS bundleId to identify apps; Windows has no such concept.
+**Approach**: on Windows, use exe path + window title in place of the bundleId.
+
+| Step | File | Change |
+|------|------|------|
+| E.1 | `packages/@ant/computer-use-mcp/src/toolCalls.ts` | `resolveRequestedApps` matches by exe path on Windows |
+| E.2 | `packages/@ant/computer-use-mcp/src/sentinelApps.ts` | Add a Windows dangerous-app list (cmd.exe, powershell.exe, etc.) |
+| E.3 | `packages/@ant/computer-use-mcp/src/deniedApps.ts` | Add Windows browser/terminal identification rules |
+| E.4 | `src/utils/computerUse/hostAdapter.ts` | `ensureOsPermissions` checks UAC status on Windows |
+
+**Windows app identifier mapping**:
+```
+macOS bundleId → Windows equivalent
+com.apple.Safari → C:\Program Files\...\msedge.exe (or window-title match)
+com.google.Chrome → chrome.exe
+com.apple.Terminal → WindowsTerminal.exe / cmd.exe
+```
+
+### Phase F: Global Hotkey (ESC Interception)
+
+**Problem**: non-darwin platforms currently skip the ESC hotkey and fall back to Ctrl+C.
+**Approach**: implement it with `RegisterHotKey` or `SetWindowsHookEx(WH_KEYBOARD_LL)`.
+
+| Step | File | Change |
+|------|------|------|
+| F.1 | `src/utils/computerUse/escHotkey.ts` | Windows branch: register ESC with RegisterHotKey |
+
+**Low priority**: the current Ctrl+C fallback works; the ESC hotkey is a UX improvement.
+
+## 5. Execution Priority
+
+```
+Phase A: Window-bound screenshots ← P0 core need, solves "operating other windows"
+Phase B: UI Automation ← P0 core capability, lets the AI understand UI structure
+Phase C: OCR ← P1 value-add, lets the AI read on-screen text
+Phase D: Performance optimization ← P1 UX improvement, speeds up high-frequency operations
+Phase E: request_access adaptation ← P1 feature completeness, adapts the permission model
+Phase F: ESC hotkey ← P2 UX improvement, can be done later
+```
+
+## 6. Estimated Change Size per Phase
+
+| Phase | New files | Modified files | New lines of code | Risk |
+|-------|---------|---------|-----------|------|
+| A Window screenshots | 2 | 1 | ~200 | Low |
+| B UI Automation | 1 | 1 | ~300 | Medium |
+| C OCR | 1 | 1 | ~150 | Low |
+| D Performance optimization | 0 | 2 | ~50 | Low |
+| E request_access | 0 | 3 | ~100 | Medium |
+| F ESC hotkey | 0 | 1 | ~50 | Low |
+| **Total** | **4** | **9** | **~850** | n/a |
+
+## 7.
Files Left Untouched
+
+- `backends/darwin.ts` (untouched in both packages)
+- `backends/linux.ts` (untouched in both packages)
+- macOS-related code paths in `src/utils/computerUse/` are untouched
+- Reference-project code already copied into `packages/@ant/computer-use-mcp/src/` is untouched (Windows tools are only appended)
+
+## 8. Comparison with the macOS/Linux Approaches
+
+| Capability | macOS | Windows (after enhancement) | Linux |
+|------|-------|-----------------|-------|
+| Screenshot method | SCContentFilter (per-app) | **PrintWindow (per-window)** | scrot (full screen/region) |
+| UI structure | Accessibility API | **UI Automation** | None |
+| OCR | None built in | **Windows.Media.Ocr** | None built in |
+| Keyboard/mouse | CGEvent + enigo | SendInput + keybd_event | xdotool |
+| Window management | NSWorkspace | **EnumWindows + Win32** | wmctrl |
+| Clipboard | pbcopy/pbpaste | **Direct Clipboard API** | xclip |
+| ESC hotkey | CGEventTap | RegisterHotKey | None |
+| App identifier | bundleId | exe path + window title | /proc + wmctrl |
+
+**After the enhancement, Windows will surpass the macOS approach in UI Automation and OCR**: the original macOS implementation has neither (Anthropic relies on screenshots + Claude's visual understanding, with no structured UI data).
diff --git a/docs/features/computer-use.md b/docs/features/computer-use.md
new file mode 100644
index 000000000..2ae3cc77f
--- /dev/null
+++ b/docs/features/computer-use.md
@@ -0,0 +1,197 @@
+# Computer Use: macOS / Windows / Linux Cross-Platform Implementation Plan
+
+Last updated: 2026-04-03
+Reference project: `E:\源码\claude-code-source-main\claude-code-source-main`
+
+## 1. Current State
+
+The reference project's Computer Use **supports macOS only**: everything from the entry point down to the lowest layer is hard-coded to darwin. In Phases 1-3 our project has already completed:
+
+- ✅ `@ant/computer-use-mcp` stub replaced with the full implementation (12 files)
+- ✅ `@ant/computer-use-input` split into dispatcher + backends (darwin + win32)
+- ✅ `@ant/computer-use-swift` split into dispatcher + backends (darwin + win32)
+- ✅ `CHICAGO_MCP` build flag enabled
+- ❌ The `src/` layer has 6 macOS hard-coded blockers
+
+## 2.
Blocker Overview
+
+### 2.1 Entry Layer
+
+| # | File:line | Blocking code | Impact |
+|---|----------|---------|------|
+| 1 | `src/main.tsx:1605` | `getPlatform() === 'macos'` | The entire CU initialization is skipped |
+
+### 2.2 Loading Layer
+
+| # | File:line | Blocking code | Impact |
+|---|----------|---------|------|
+| 2 | `src/utils/computerUse/swiftLoader.ts:16` | `process.platform !== 'darwin'` → throw | Screenshots and app management are entirely unavailable |
+| 3 | `src/utils/computerUse/executor.ts:263` | `process.platform !== 'darwin'` → throw | The entire executor factory function is unavailable |
+
+### 2.3 macOS-Specific Dependencies
+
+| # | File:line | Dependency | macOS implementation | Replacement needed |
+|---|----------|------|-----------|------------|
+| 4 | `executor.ts:70-88` | Clipboard | `pbcopy`/`pbpaste` | Win: PowerShell `Get/Set-Clipboard`; Linux: `xclip`/`wl-copy` |
+| 5 | `drainRunLoop.ts:21` | CFRunLoop pump | `cu._drainMainRunLoop()` | Non-darwin: run fn() directly, no pump needed |
+| 6 | `escHotkey.ts:28` | ESC hotkey | CGEventTap | Non-darwin: return false (a Ctrl+C fallback already exists) |
+| 7 | `hostAdapter.ts:48-54` | System permissions | TCC accessibility + screenRecording | Win: granted directly; Linux: check for xdotool |
+| 8 | `common.ts:56` | Platform identifier | `platform: 'darwin'` hard-coded | Resolve dynamically |
+| 9 | `executor.ts:180` | Paste shortcut | `command+v` | Win/Linux: `ctrl+v` |
+
+### 2.4 Missing Linux Backends
+
+| Package | macOS | Windows | Linux |
+|---|-------|---------|-------|
+| `computer-use-input/backends/` | ✅ darwin.ts | ✅ win32.ts | ❌ linux.ts must be created |
+| `computer-use-swift/backends/` | ✅ darwin.ts | ✅ win32.ts | ❌ linux.ts must be created |
+
+## 3.
Capability Dependencies per Platform
+
+### 3.1 computer-use-input (keyboard and mouse)
+
+| Function | macOS | Windows | Linux |
+|------|-------|---------|-------|
+| Mouse move | CGEvent JXA | SetCursorPos P/Invoke | xdotool mousemove |
+| Mouse click | CGEvent JXA | SendInput P/Invoke | xdotool click |
+| Mouse wheel | CGEvent JXA | SendInput MOUSEEVENTF_WHEEL | xdotool click 4/5 (wheel buttons) |
+| Key press | System Events osascript | keybd_event P/Invoke | xdotool key |
+| Key combination | System Events osascript | keybd_event combination | xdotool key combo |
+| Text input | System Events keystroke | SendKeys.SendWait | xdotool type |
+| Foreground app | System Events osascript | GetForegroundWindow P/Invoke | xdotool getactivewindow + /proc |
+| Tool dependency | osascript (built in) | powershell (built in) | xdotool (must be installed) |
+
+### 3.2 computer-use-swift (screenshots + app management)
+
+| Function | macOS | Windows | Linux |
+|------|-------|---------|-------|
+| Full-screen screenshot | screencapture | CopyFromScreen | gnome-screenshot / scrot / grim |
+| Region screenshot | screencapture -R | CopyFromScreen(rect) | gnome-screenshot -a / scrot -a / grim -g |
+| Display list | CGGetActiveDisplayList JXA | Screen.AllScreens | xrandr --query |
+| Running apps | System Events JXA | Get-Process | wmctrl -l / ps |
+| Open app | osascript activate | Start-Process | xdg-open / gtk-launch |
+| Hide/show | System Events visibility | ShowWindow/SetForegroundWindow | wmctrl -c / xdotool |
+| Tool dependency | screencapture + osascript | powershell | xdotool + scrot/grim + wmctrl |
+
+### 3.3 executor Layer
+
+| Function | macOS | Windows | Linux |
+|------|-------|---------|-------|
+| drainRunLoop | CFRunLoop pump | Not needed | Not needed |
+| ESC hotkey | CGEventTap | Skipped (Ctrl+C fallback) | Skipped (Ctrl+C fallback) |
+| Clipboard read | pbpaste | `powershell Get-Clipboard` | xclip -o / wl-paste |
+| Clipboard write | pbcopy | `powershell Set-Clipboard` | xclip / wl-copy |
+| Paste shortcut | command+v | ctrl+v | ctrl+v |
+| Terminal detection | __CFBundleIdentifier | WT_SESSION / TERM_PROGRAM | TERM_PROGRAM |
+| System permissions | TCC check | Granted directly | Check that xdotool is installed |
+
+## 4.
Execution Steps
+
+### Phase 1: Done ✅
+
+- [x] `@ant/computer-use-mcp` stub → full implementation
+- [x] `@ant/computer-use-input` dispatcher + darwin/win32 backends
+- [x] `@ant/computer-use-swift` dispatcher + darwin/win32 backends
+- [x] `CHICAGO_MCP` build flag
+
+### Phase 2: Remove the 6 macOS hard-coded checks (unlocks macOS + Windows)
+
+**Principle: macOS code paths stay unchanged; only add win32/linux branches after each darwin guard.**
+
+| Step | File | Change |
+|------|------|------|
+| 2.1 | `src/main.tsx:1605` | `getPlatform() === 'macos'` → drop the platform restriction, or change it to `!== 'unknown'` |
+| 2.2 | `src/utils/computerUse/swiftLoader.ts:16-18` | Remove the `process.platform !== 'darwin'` throw. `@ant/computer-use-swift/index.ts` already dispatches across platforms |
+| 2.3 | `src/utils/computerUse/executor.ts:263-267` | Remove the `process.platform !== 'darwin'` throw. Check input/swift isSupported instead |
+| 2.4 | `src/utils/computerUse/executor.ts:70-88` | Clipboard functions dispatch per platform: darwin→pbcopy/pbpaste, win32→PowerShell Get/Set-Clipboard, linux→xclip |
+| 2.5 | `src/utils/computerUse/executor.ts:180` | In `typeViaClipboard`, `command+v` → use `ctrl+v` on non-darwin |
+| 2.6 | `src/utils/computerUse/executor.ts:273` | `const cu = requireComputerUseSwift()` → change to `new ComputerUseAPI()` (instantiate directly from the package, bypassing the swiftLoader throw) |
+| 2.7 | `src/utils/computerUse/drainRunLoop.ts` | Add `if (process.platform !== 'darwin') return fn()` at the top |
+| 2.8 | `src/utils/computerUse/escHotkey.ts` | `registerEscHotkey` returns false on non-darwin (a Ctrl+C fallback already exists) |
+| 2.9 | `src/utils/computerUse/hostAdapter.ts:48-54` | `ensureOsPermissions` returns `{ granted: true }` on non-darwin |
+| 2.10 | `src/utils/computerUse/common.ts:56` | `platform: 'darwin'` → `platform: process.platform === 'win32' ? 'windows' : process.platform === 'linux' ?
'linux' : 'darwin'` |
+| 2.11 | `src/utils/computerUse/common.ts:55` | `screenshotFiltering: 'native'` → `'none'` on non-darwin (Windows/Linux screenshots do not support per-app filtering) |
+| 2.12 | `src/utils/computerUse/gates.ts:13` | `enabled: false` → `enabled: true` (available by default when GrowthBook is absent) |
+| 2.13 | `src/utils/computerUse/gates.ts:39-43` | `hasRequiredSubscription()` → return `true` directly |
+
+### Phase 3: Add the Linux Backends
+
+| Step | File | Content |
+|------|------|------|
+| 3.1 | `packages/@ant/computer-use-input/src/backends/linux.ts` | xdotool keyboard/mouse (mousemove/click/key/type/getactivewindow) |
+| 3.2 | `packages/@ant/computer-use-swift/src/backends/linux.ts` | scrot/grim screenshots + xrandr displays + wmctrl window management |
+| 3.3 | `packages/@ant/computer-use-input/src/index.ts` | Add `case 'linux'` to the dispatcher |
+| 3.4 | `packages/@ant/computer-use-swift/src/index.ts` | Add `case 'linux'` to the dispatcher |
+
+### Phase 4: Verification
+
+| Test | macOS | Windows | Linux |
+|--------|-------|---------|-------|
+| Build succeeds | ✅ | To verify | To verify |
+| MCP tool list is non-empty | To verify | To verify | To verify |
+| Mouse move | To verify | ✅ Passed | To verify |
+| Screenshot | To verify | ✅ Passed | To verify |
+| Keyboard input | To verify | To verify | To verify |
+| Foreground window | To verify | ✅ Passed | To verify |
+| Clipboard | To verify | To verify | To verify |
+
+## 5. File Change Overview
+
+### Untouched files (14)
+
+`cleanup.ts`, `computerUseLock.ts`, `wrapper.tsx`, `toolRendering.tsx`, `mcpServer.ts`, `setup.ts`, `appNames.ts`, `inputLoader.ts`, `src/services/mcp/client.ts`, `@ant/computer-use-mcp/src/*` (done in Phase 1), `backends/darwin.ts` (untouched in both packages)
+
+### Files changed in src/ (8)
+
+| File | Change size | Risk |
+|------|--------|------|
+| `main.tsx` | 1 line | Low |
+| `swiftLoader.ts` | 2 lines | Low |
+| `executor.ts` | ~40 lines (clipboard dispatch + platform guard + paste shortcut) | **Medium** |
+| `drainRunLoop.ts` | 1 line | Low |
+| `escHotkey.ts` | 3 lines | Low |
+| `hostAdapter.ts` | 5 lines | Low |
+| `common.ts` | 3 lines | Low |
+| `gates.ts` | 3 lines | Low |
+
+### New files (2)
+
+| File | Estimated lines |
+|------|---------|
+| `packages/@ant/computer-use-input/src/backends/linux.ts` | ~150 lines |
+| `packages/@ant/computer-use-swift/src/backends/linux.ts` | ~200 lines |
+
+## 6.
Linux Tool Dependencies
+
+| Tool | Purpose | Install command (Ubuntu) |
+|------|------|-------------------|
+| `xdotool` | Keyboard/mouse simulation + window management | `sudo apt install xdotool` |
+| `scrot` or `gnome-screenshot` | Screenshots | `sudo apt install scrot` |
+| `xrandr` | Display info | Usually preinstalled |
+| `xclip` | Clipboard | `sudo apt install xclip` |
+| `wmctrl` | Window list/switching | `sudo apt install wmctrl` |
+
+Wayland environments need alternative tools: `ydotool` (replaces xdotool), `grim` (replaces scrot), `wl-clipboard` (replaces xclip). Initially only X11 needs to be supported, with Wayland marked as a todo.
+
+## 7. Suggested Execution Order
+
+```
+Phase 2 (unlock macOS + Windows)
+  ├── 2.1-2.3 Remove the 3 hard-coded throw/skip checks
+  ├── 2.4-2.5 Per-platform dispatch for clipboard + paste shortcut
+  ├── 2.6 swiftLoader → direct instantiation
+  ├── 2.7-2.9 Platform branches for drainRunLoop / escHotkey / permissions
+  ├── 2.10-2.11 Dynamic platform identifier in common.ts
+  ├── 2.12-2.13 gates.ts defaults
+  └── Verify on Windows
+
+Phase 3 (Linux backends)
+  ├── 3.1 input/backends/linux.ts
+  ├── 3.2 swift/backends/linux.ts
+  ├── 3.3-3.4 Add the linux case to the dispatchers
+  └── Verify on Linux
+
+Phase 4 (integration verification + PR)
+```
+
+Each phase can be verified and committed independently. Once Phase 2 is complete, macOS + Windows are usable; once Phase 3 is complete, all three platforms are usable.
diff --git a/packages/@ant/claude-for-chrome-mcp/src/bridgeClient.ts b/packages/@ant/claude-for-chrome-mcp/src/bridgeClient.ts
new file mode 100644
index 000000000..846284778
--- /dev/null
+++ b/packages/@ant/claude-for-chrome-mcp/src/bridgeClient.ts
@@ -0,0 +1,1126 @@
+/**
+ * WebSocket bridge client for the Chrome extension MCP server.
+ * Communicates with the Chrome extension via the office bridge server's /chrome path.
+ */
+
+import WebSocket from "ws";
+
+import { SocketConnectionError } from "./mcpSocketClient.js";
+import {
+  localPlatformLabel,
+  type BridgePermissionRequest,
+  type ChromeExtensionInfo,
+  type ClaudeForChromeContext,
+  type PermissionMode,
+  type PermissionOverrides,
+  type SocketClient,
+} from "./types.js";
+
+/** Timeout for list_extensions response from the bridge. */
+const DISCOVERY_TIMEOUT_MS = 5000;
+
+/** How long to wait for a peer_connected event when 0 extensions are found.
*/
+const PEER_WAIT_TIMEOUT_MS = 10_000;
+
+interface PendingToolCall {
+  resolve: (value: unknown) => void;
+  reject: (reason: Error) => void;
+  timer: NodeJS.Timeout;
+  results: unknown[];
+  isTabsContext: boolean;
+  onPermissionRequest?: (request: BridgePermissionRequest) => Promise<boolean>;
+  startTime: number;
+  toolName: string;
+}
+
+export class BridgeClient implements SocketClient {
+  private ws: WebSocket | null = null;
+  private connected = false;
+  private authenticated = false;
+  private connecting = false;
+  private reconnectTimer: NodeJS.Timeout | null = null;
+  private reconnectAttempts = 0;
+  private pendingCalls = new Map<string, PendingToolCall>();
+  private notificationHandler:
+    | ((notification: {
+        method: string;
+        params?: Record<string, unknown>;
+      }) => void)
+    | null = null;
+  private context: ClaudeForChromeContext;
+  private permissionMode: PermissionMode = "ask";
+  private allowedDomains: string[] | undefined;
+  private tabsContextCollectionTimeoutMs = 2000;
+  private toolCallTimeoutMs = 120_000;
+  private connectionStartTime: number | null = null;
+  private connectionEstablishedTime: number | null = null;
+  /** The device_id of the selected Chrome extension for targeted routing. */
+  private selectedDeviceId: string | undefined;
+  /** True after first discovery attempt completes (success or timeout). */
+  private discoveryComplete = false;
+  /** Shared promise so concurrent callTool invocations join the same discovery. */
+  private discoveryPromise: Promise<void> | null = null;
+  /** Pending discovery response from bridge. */
+  private pendingDiscovery: {
+    resolve: (extensions: ChromeExtensionInfo[]) => void;
+    timeout: NodeJS.Timeout;
+  } | null = null;
+  /** The device_id we had selected before a peer_disconnected — for auto-reselect. */
+  private previousSelectedDeviceId: string | undefined;
+  /** Callbacks waiting for the next peer_connected event. Receives `true` on peer arrival, `false` on abort. */
+  private peerConnectedWaiters: Array<(arrived: boolean) => void> = [];
+  /** The request_id of the current pending pairing broadcast. */
+  private pendingPairingRequestId: string | undefined;
+  /** True while a pairing broadcast is in flight and no response yet. */
+  private pairingInProgress = false;
+  /** The deviceId from a previous persisted pairing. */
+  private persistedDeviceId: string | undefined;
+  /** Resolve callback for a blocking switchBrowser() call. */
+  private pendingSwitchResolve:
+    | ((result: { deviceId: string; name: string } | null) => void)
+    | null = null;
+
+  constructor(context: ClaudeForChromeContext) {
+    this.context = context;
+    if (context.initialPermissionMode) {
+      this.permissionMode = context.initialPermissionMode;
+    }
+  }
+
+  public async ensureConnected(): Promise<boolean> {
+    const { logger, serverName } = this.context;
+    logger.info(
+      `[${serverName}] ensureConnected called, connected=${this.connected}, authenticated=${this.authenticated}, wsState=${this.ws?.readyState}`,
+    );
+
+    if (
+      this.connected &&
+      this.authenticated &&
+      this.ws?.readyState === WebSocket.OPEN
+    ) {
+      logger.info(`[${serverName}] Already connected and authenticated`);
+      return true;
+    }
+
+    if (!this.connecting) {
+      logger.info(`[${serverName}] Not connecting, starting connection...`);
+      await this.connect();
+    } else {
+      logger.info(`[${serverName}] Already connecting, waiting...`);
+    }
+
+    // Wait for authentication with timeout
+    return new Promise((resolve) => {
+      const timeout = setTimeout(() => {
+        logger.info(
+          `[${serverName}] Connection timeout, connected=${this.connected}, authenticated=${this.authenticated}`,
+        );
+        resolve(false);
+      }, 10_000);
+      const check = () => {
+        if (this.connected && this.authenticated) {
+          logger.info(`[${serverName}] Connection successful`);
+          clearTimeout(timeout);
+          resolve(true);
+        } else if (!this.connecting) {
+          logger.info(`[${serverName}] No longer connecting, giving up`);
+          clearTimeout(timeout);
+          resolve(false);
+        } else {
+          setTimeout(check, 200);
+        }
+      };
+      check();
+    });
+  }
+
+  public async callTool(
+    name: string,
+    args: Record<string, unknown>,
+    permissionOverrides?: PermissionOverrides,
+  ): Promise<unknown> {
+    const { logger, serverName, trackEvent } = this.context;
+
+    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) {
+      throw new SocketConnectionError(`[${serverName}] Bridge not connected`);
+    }
+
+    // Lazy discovery: run on first tool call if no extension selected yet.
+    // Use a shared promise so concurrent callers join the same discovery.
+    if (!this.selectedDeviceId && !this.discoveryComplete) {
+      this.discoveryPromise ??= this.discoverAndSelectExtension().finally(
+        () => {
+          this.discoveryPromise = null;
+        },
+      );
+      await this.discoveryPromise;
+    }
+
+    // TODO: Once all extensions support pairing, throw here for multi-extension
+    // cases where pairingInProgress is true. For now, let the bridge handle
+    // routing — it auto-routes to a single extension or returns an error for
+    // multiple extensions without a target_device_id.
+
+    const toolUseId = crypto.randomUUID();
+    const isTabsContext = name === "tabs_context_mcp";
+    const startTime = Date.now();
+    const timeoutMs = isTabsContext
+      ? this.tabsContextCollectionTimeoutMs
+      : this.toolCallTimeoutMs;
+
+    // Track tool call start
+    trackEvent?.("chrome_bridge_tool_call_started", {
+      tool_name: name,
+      tool_use_id: toolUseId,
+    });
+
+    // Per-call overrides (from session context) take priority over
+    // instance values (from set_permission_mode on the singleton).
+    const effectivePermissionMode =
+      permissionOverrides?.permissionMode ?? this.permissionMode;
+    const effectiveAllowedDomains =
+      permissionOverrides?.allowedDomains ?? this.allowedDomains;
+
+    return new Promise((resolve, reject) => {
+      const timer = setTimeout(() => {
+        const pending = this.pendingCalls.get(toolUseId);
+        if (pending) {
+          this.pendingCalls.delete(toolUseId);
+          const durationMs = Date.now() - pending.startTime;
+
+          if (isTabsContext && pending.results.length > 0) {
+            // For tabs_context, resolve with collected results even on timeout
+            trackEvent?.("chrome_bridge_tool_call_completed", {
+              tool_name: name,
+              tool_use_id: toolUseId,
+              duration_ms: durationMs,
+            });
+            resolve(this.mergeTabsResults(pending.results));
+          } else {
+            logger.warn(
+              `[${serverName}] Tool call timeout: ${name} (${toolUseId.slice(0, 8)}) after ${durationMs}ms, pending calls: ${this.pendingCalls.size}`,
+            );
+            trackEvent?.("chrome_bridge_tool_call_timeout", {
+              tool_name: name,
+              tool_use_id: toolUseId,
+              duration_ms: durationMs,
+              timeout_ms: timeoutMs,
+            });
+            reject(
+              new SocketConnectionError(
+                `[${serverName}] Tool call timed out: ${name}`,
+              ),
+            );
+          }
+        }
+      }, timeoutMs);
+
+      this.pendingCalls.set(toolUseId, {
+        resolve,
+        reject,
+        timer,
+        results: [],
+        isTabsContext,
+        onPermissionRequest: permissionOverrides?.onPermissionRequest,
+        startTime,
+        toolName: name,
+      });
+
+      const message: Record<string, unknown> = {
+        type: "tool_call",
+        tool_use_id: toolUseId,
+        client_type: this.context.clientTypeId,
+        tool: name,
+        args,
+      };
+
+      // Target the selected extension for routing
+      if (this.selectedDeviceId) {
+        message.target_device_id = this.selectedDeviceId;
+      }
+
+      // Only include permission fields when a value exists.
+      // Priority: per-call override (from session context) > instance value (from set_permission_mode).
+      if (effectivePermissionMode) {
+        message.permission_mode = effectivePermissionMode;
+      }
+      if (effectiveAllowedDomains?.length) {
+        message.allowed_domains = effectiveAllowedDomains;
+      }
+      if (permissionOverrides?.onPermissionRequest) {
+        message.handle_permission_prompts = true;
+      }
+
+      logger.debug(
+        `[${serverName}] Sending tool_call: ${name} (${toolUseId.slice(0, 8)})`,
+      );
+      this.ws!.send(JSON.stringify(message));
+    });
+  }
+
+  public isConnected(): boolean {
+    return (
+      this.connected &&
+      this.authenticated &&
+      this.ws?.readyState === WebSocket.OPEN
+    );
+  }
+
+  public disconnect(): void {
+    this.cleanup();
+  }
+
+  public setNotificationHandler(
+    handler: (notification: {
+      method: string;
+      params?: Record<string, unknown>;
+    }) => void,
+  ): void {
+    this.notificationHandler = handler;
+  }
+
+  public async setPermissionMode(
+    mode: PermissionMode,
+    allowedDomains?: string[],
+  ): Promise<void> {
+    this.permissionMode = mode;
+    this.allowedDomains = allowedDomains;
+  }
+
+  // ===========================================================================
+  // Extension discovery and selection
+  // ===========================================================================
+
+  /**
+   * Discover connected extensions and auto-select one, or broadcast a pairing request.
+   * Called lazily on the first tool call.
+   */
+  private async discoverAndSelectExtension(): Promise<void> {
+    const { logger, serverName } = this.context;
+
+    this.persistedDeviceId ??= this.context.getPersistedDeviceId?.();
+
+    let extensions = await this.queryBridgeExtensions();
+
+    if (extensions.length === 0) {
+      logger.info(
+        `[${serverName}] No extensions connected, waiting up to ${PEER_WAIT_TIMEOUT_MS}ms for peer_connected`,
+      );
+      const peerArrived = await this.waitForPeerConnected(PEER_WAIT_TIMEOUT_MS);
+      if (peerArrived) {
+        extensions = await this.queryBridgeExtensions();
+      }
+    }
+
+    this.discoveryComplete = true;
+
+    if (extensions.length === 0) {
+      // Still nothing — callTool will throw a clear error
+      logger.info(`[${serverName}] No extensions found after waiting`);
+      return;
+    }
+
+    // Single extension: auto-select silently
+    if (extensions.length === 1) {
+      const ext = extensions[0]!;
+      if (!this.isLocalExtension(ext)) {
+        this.context.onRemoteExtensionWarning?.(ext);
+      }
+      this.selectExtension(ext.deviceId);
+      return;
+    }
+
+    // Multiple extensions: check for persisted selection
+    if (this.persistedDeviceId) {
+      const persisted = extensions.find(
+        (e) => e.deviceId === this.persistedDeviceId,
+      );
+      if (persisted) {
+        logger.info(
+          `[${serverName}] Auto-connecting to persisted extension: ${persisted.name || persisted.deviceId.slice(0, 8)}`,
+        );
+        this.selectExtension(persisted.deviceId);
+        return;
+      }
+    }
+
+    // Multiple extensions, no valid persisted selection: broadcast and fail fast
+    this.broadcastPairingRequest();
+    this.pairingInProgress = true;
+  }
+
+  /**
+   * Query the bridge for connected extensions. Returns empty array on timeout.
+   * Deduplicates by deviceId, keeping the most recent connection — the bridge
+   * may report stale duplicates (e.g. after a service worker restart).
+   */
+  private async queryBridgeExtensions(): Promise<ChromeExtensionInfo[]> {
+    const raw: ChromeExtensionInfo[] = await new Promise((resolve) => {
+      const timeout = setTimeout(() => {
+        this.pendingDiscovery = null;
+        resolve([]);
+      }, DISCOVERY_TIMEOUT_MS);
+
+      this.pendingDiscovery = { resolve, timeout };
+      this.ws?.send(JSON.stringify({ type: "list_extensions" }));
+    });
+
+    const byDeviceId = new Map<string, ChromeExtensionInfo>();
+    for (const ext of raw) {
+      const existing = byDeviceId.get(ext.deviceId);
+      if (!existing || ext.connectedAt > existing.connectedAt) {
+        byDeviceId.set(ext.deviceId, ext);
+      }
+    }
+    return [...byDeviceId.values()];
+  }
+
+  /**
+   * Select an extension by device ID for per-message targeted routing.
+   */
+  private selectExtension(deviceId: string): void {
+    const { logger, serverName } = this.context;
+    this.selectedDeviceId = deviceId;
+    this.previousSelectedDeviceId = undefined;
+    logger.info(
+      `[${serverName}] Selected Chrome extension: ${deviceId.slice(0, 8)}...`,
+    );
+  }
+
+  /**
+   * Check if an extension might be on the same machine as this MCP client
+   * by comparing OS platform. Extensions can't provide a real hostname from
+   * the service worker sandbox, so platform is a weak heuristic. The profile
+   * email is the primary differentiator shown in the selection dialog.
+   */
+  private isLocalExtension(ext: ChromeExtensionInfo): boolean {
+    if (!ext.osPlatform) return false;
+    return ext.osPlatform === localPlatformLabel();
+  }
+
+  /**
+   * Returns a promise that resolves to `true` when a peer_connected event
+   * fires, or `false` if the timeout elapses first.
+   */
+  private waitForPeerConnected(timeoutMs: number): Promise<boolean> {
+    return new Promise((resolve) => {
+      const timer = setTimeout(() => {
+        this.peerConnectedWaiters = this.peerConnectedWaiters.filter(
+          (w) => w !== onPeer,
+        );
+        resolve(false);
+      }, timeoutMs);
+
+      const onPeer = (arrived: boolean) => {
+        clearTimeout(timer);
+        resolve(arrived);
+      };
+
+      this.peerConnectedWaiters.push(onPeer);
+    });
+  }
+
+  /**
+   * Broadcast a pairing request to all connected extensions.
+   * Non-blocking — the pairing_response handler will select the extension.
+   */
+  private broadcastPairingRequest(): void {
+    const requestId = crypto.randomUUID();
+    this.pendingPairingRequestId = requestId;
+    this.ws?.send(
+      JSON.stringify({
+        type: "pairing_request",
+        request_id: requestId,
+        client_type: this.context.clientTypeId,
+      }),
+    );
+  }
+
+  /**
+   * Switch to a different browser. Broadcasts a pairing request and blocks
+   * until a response arrives or timeout (120s). Returns the paired extension
+   * info, or null on timeout.
+   */
+  public async switchBrowser(): Promise<
+    | {
+        deviceId: string;
+        name: string;
+      }
+    | "no_other_browsers"
+    | null
+  > {
+    const extensions = await this.queryBridgeExtensions();
+    const currentDeviceId =
+      this.selectedDeviceId ?? this.previousSelectedDeviceId;
+    if (
+      extensions.length === 0 ||
+      (extensions.length === 1 &&
+        (!currentDeviceId || extensions[0]!.deviceId === currentDeviceId))
+    ) {
+      return "no_other_browsers";
+    }
+
+    this.previousSelectedDeviceId = this.selectedDeviceId;
+    this.selectedDeviceId = undefined;
+    this.discoveryComplete = false;
+    this.pairingInProgress = false;
+
+    const requestId = crypto.randomUUID();
+    this.pendingPairingRequestId = requestId;
+    if (this.ws?.readyState !== WebSocket.OPEN) {
+      return null;
+    }
+    this.ws.send(
+      JSON.stringify({
+        type: "pairing_request",
+        request_id: requestId,
+        client_type: this.context.clientTypeId,
+      }),
+    );
+
+    // Resolve any previous pending switch so the caller doesn't hang forever
+    if (this.pendingSwitchResolve) {
+      this.pendingSwitchResolve(null);
+    }
+
+    // Block for switch_browser since user is actively engaged
+    return new Promise((resolve) => {
+      const timer = setTimeout(() => {
+        if (this.pendingPairingRequestId === requestId) {
+          this.pendingPairingRequestId = undefined;
+        }
+        this.pendingSwitchResolve = null;
+        resolve(null);
+      }, 120_000);
+
+      this.pendingSwitchResolve = (result) => {
+        clearTimeout(timer);
+        this.pendingSwitchResolve = null;
+        resolve(result);
+      };
+    });
+  }
+
+  private async connect(): Promise<void> {
+    const { logger, serverName, bridgeConfig, trackEvent } = this.context;
+
+    if (!bridgeConfig) {
+      logger.error(`[${serverName}] No bridge config provided`);
+      return;
+    }
+
+    if (this.connecting) {
+      return;
+    }
+
+    this.connecting = true;
+    this.authenticated = false;
+    this.connectionStartTime = Date.now();
+    this.closeSocket();
+
+    // Get user ID for the connection path
+    let userId: string;
+    let token: string | undefined;
+
+    if (bridgeConfig.devUserId) {
+      userId = bridgeConfig.devUserId;
+      logger.debug(`[${serverName}] Using dev user ID for bridge connection`);
+    } else {
+      logger.debug(`[${serverName}] Fetching user ID for bridge connection`);
+      const fetchedUserId = await bridgeConfig.getUserId();
+      if (!fetchedUserId) {
+        const durationMs = Date.now() - this.connectionStartTime;
+        logger.error(
+          `[${serverName}] No user ID available after ${durationMs}ms`,
+        );
+        trackEvent?.("chrome_bridge_connection_failed", {
+          duration_ms: durationMs,
+          error_type: "no_user_id",
+          reconnect_attempt: this.reconnectAttempts,
+        });
+        this.connecting = false;
+        this.context.onAuthenticationError?.();
+        return;
+      }
+      userId = fetchedUserId;
+
+      logger.debug(
+        `[${serverName}] Fetching OAuth token for bridge connection`,
+      );
+      token = await bridgeConfig.getOAuthToken();
+      if (!token) {
+        const durationMs = Date.now() - this.connectionStartTime;
+        logger.error(
+          `[${serverName}] No OAuth token available after ${durationMs}ms`,
+        );
+        trackEvent?.("chrome_bridge_connection_failed", {
+          duration_ms: durationMs,
+          error_type: "no_oauth_token",
+          reconnect_attempt: this.reconnectAttempts,
+        });
+        this.connecting = false;
+        this.context.onAuthenticationError?.();
+        return;
+      }
+    }
+
+    // Connect to user-specific endpoint: /chrome/
+    const wsUrl = `${bridgeConfig.url}/chrome/${userId}`;
+    logger.info(`[${serverName}] Connecting to bridge: ${wsUrl}`);
+
+    // Track connection started
+    trackEvent?.("chrome_bridge_connection_started", {
+      bridge_url: wsUrl,
+    });
+
+    try {
+      this.ws = new WebSocket(wsUrl);
+    } catch (error) {
+      const durationMs = Date.now() - this.connectionStartTime;
+      logger.error(
+        `[${serverName}] Failed to create WebSocket after ${durationMs}ms:`,
+        error,
+      );
+      trackEvent?.("chrome_bridge_connection_failed", {
+        duration_ms: durationMs,
+        error_type: "websocket_error",
+        reconnect_attempt: this.reconnectAttempts,
+      });
+      this.connecting = false;
+      this.scheduleReconnect();
+      return;
+    }
+
+    this.ws.on("open", () => {
+      logger.info(
+        `[${serverName}] WebSocket connected, sending connect message`,
+      );
+
+      // First message must be connect (same format as office path)
+      const connectMessage: Record<string, unknown> = {
+        type: "connect",
+        client_type: this.context.clientTypeId,
+      };
+
+      if (bridgeConfig.devUserId) {
+        connectMessage.dev_user_id = bridgeConfig.devUserId;
+      } else {
+        connectMessage.oauth_token = token;
+      }
+
+      this.ws?.send(JSON.stringify(connectMessage));
+    });
+
+    this.ws.on("message", (data: WebSocket.Data) => {
+      try {
+        const message = JSON.parse(data.toString()) as Record<string, unknown>;
+        logger.debug(
+          `[${serverName}] Bridge received: ${JSON.stringify(message)}`,
+        );
+        this.handleMessage(message);
+      } catch (error) {
+        logger.error(`[${serverName}] Failed to parse bridge message:`, error);
+      }
+    });
+
+    this.ws.on("close", (code: number) => {
+      const durationSinceConnect = this.connectionEstablishedTime
+        ? Date.now() - this.connectionEstablishedTime
+        : 0;
+      logger.info(
+        `[${serverName}] Bridge connection closed (code: ${code}, duration: ${durationSinceConnect}ms)`,
+      );
+      trackEvent?.("chrome_bridge_disconnected", {
+        close_code: code,
+        duration_since_connect_ms: durationSinceConnect,
+        reconnect_attempt: this.reconnectAttempts + 1,
+      });
+      this.connected = false;
+      this.authenticated = false;
+      this.connecting = false;
+      this.connectionEstablishedTime = null;
+      this.scheduleReconnect();
+    });
+
+    this.ws.on("error", (error: Error) => {
+      const durationMs = this.connectionStartTime
+        ? Date.now() - this.connectionStartTime
+        : 0;
+      logger.error(
+        `[${serverName}] Bridge WebSocket error after ${durationMs}ms: ${error.message}`,
+      );
+      trackEvent?.("chrome_bridge_connection_failed", {
+        duration_ms: durationMs,
+        error_type: "websocket_error",
+        reconnect_attempt: this.reconnectAttempts,
+      });
+      this.connected = false;
+      this.authenticated = false;
+      this.connecting = false;
+    });
+  }
+
+  private handleMessage(message: Record<string, unknown>): void {
+    const { logger, serverName, trackEvent } = this.context;
+
+    switch (message.type) {
+      case "paired": {
+        const durationMs = this.connectionStartTime
+          ? Date.now() - this.connectionStartTime
+          : 0;
+        logger.info(
+          `[${serverName}] Paired with Chrome extension (duration: ${durationMs}ms)`,
+        );
+        this.connected = true;
+        this.authenticated = true;
+        this.connecting = false;
+        this.reconnectAttempts = 0;
+        this.connectionEstablishedTime = Date.now();
+        trackEvent?.("chrome_bridge_connection_succeeded", {
+          duration_ms: durationMs,
+          status: "paired",
+        });
+        break;
+      }
+
+      case "waiting": {
+        const durationMs = this.connectionStartTime
+          ? Date.now() - this.connectionStartTime
+          : 0;
+        logger.info(
+          `[${serverName}] Waiting for Chrome extension to connect (duration: ${durationMs}ms)`,
+        );
+        this.connected = true;
+        this.authenticated = true;
+        this.connecting = false;
+        this.reconnectAttempts = 0;
+        this.connectionEstablishedTime = Date.now();
+        trackEvent?.("chrome_bridge_connection_succeeded", {
+          duration_ms: durationMs,
+          status: "waiting",
+        });
+        break;
+      }
+
+      case "peer_connected":
+        logger.info(`[${serverName}] Chrome extension connected to bridge`);
+        trackEvent?.("chrome_bridge_peer_connected", null);
+        // If no extension selected, mark discovery as needed (next tool call will discover)
+        if (!this.selectedDeviceId) {
+          this.discoveryComplete = false;
+        }
+        // Auto-reselect if the previously selected extension reconnected (e.g., service worker restart)
+        if (
+          this.previousSelectedDeviceId &&
+          message.deviceId === this.previousSelectedDeviceId &&
+          !this.pendingSwitchResolve
+        ) {
+          logger.info(
+            `[${serverName}] Previously selected extension reconnected, auto-reselecting`,
+          );
+          this.selectExtension(this.previousSelectedDeviceId);
+          this.previousSelectedDeviceId = undefined;
+        }
+        if (this.peerConnectedWaiters.length > 0) {
+          const waiters = this.peerConnectedWaiters;
+          this.peerConnectedWaiters = [];
+          for (const waiter of waiters) {
+            waiter(true);
+          }
+        }
+        break;
+
+      case "peer_disconnected":
+        logger.info(
+          `[${serverName}] Chrome extension disconnected from bridge`,
+        );
+        trackEvent?.("chrome_bridge_peer_disconnected", null);
+        // If the selected extension disconnected, clear selection for re-discovery
+        if (message.deviceId && message.deviceId === this.selectedDeviceId) {
+          logger.info(
+            `[${serverName}] Selected extension disconnected, clearing selection`,
+          );
+          this.previousSelectedDeviceId = this.selectedDeviceId;
+          this.selectedDeviceId = undefined;
+          this.discoveryComplete = false;
+        }
+        break;
+
+      case "extensions_list":
+        // Response to list_extensions — resolve pending discovery
+        if (this.pendingDiscovery) {
+          clearTimeout(this.pendingDiscovery.timeout);
+          this.pendingDiscovery.resolve(
+            (message.extensions as ChromeExtensionInfo[]) ?? [],
+          );
+          this.pendingDiscovery = null;
+        }
+        break;
+
+      case "pairing_response": {
+        const requestId = message.request_id as string;
+        const responseDeviceId = message.device_id as string;
+        const responseName = message.name as string;
+        if (
+          this.pendingPairingRequestId === requestId &&
+          responseDeviceId &&
+          responseName
+        ) {
+          this.pendingPairingRequestId = undefined;
+          this.pairingInProgress = false;
+          this.selectExtension(responseDeviceId);
+          this.context.onExtensionPaired?.(responseDeviceId, responseName);
+          logger.info(
+            `[${serverName}] Paired with "${responseName}" (${responseDeviceId.slice(0, 8)})`,
+          );
+          if (this.pendingSwitchResolve) {
+            this.pendingSwitchResolve({
+              deviceId: responseDeviceId,
+              name: responseName,
+            });
+            this.pendingSwitchResolve = null;
+          }
+        }
+        break;
+      }
+
+      case "ping":
+        this.ws?.send(JSON.stringify({ type: "pong" }));
+        break;
+
+      case "pong":
+        // Response to our keepalive, nothing to do
+        break;
+
+      case "tool_result":
+        this.handleToolResult(message);
+        break;
+
+      case "permission_request":
+        void this.handlePermissionRequest(message);
+        break;
+
+      case "notification":
+        if (this.notificationHandler) {
+          this.notificationHandler({
+            method: message.method as string,
+            params: message.params as Record<string, unknown> | undefined,
+          });
+        }
+        break;
+
+      case "error":
+        logger.warn(`[${serverName}] Bridge error: ${message.error}`);
+        // If we had a selected extension, the error may indicate it's gone
+        // (e.g., extension disconnected between list and select). Clear state
+        // so the next tool call re-discovers.
+        if (this.selectedDeviceId) {
+          this.selectedDeviceId = undefined;
+          this.discoveryComplete = false;
+        }
+        break;
+
+      default:
+        logger.warn(
+          `[${serverName}] Unrecognized bridge message type: ${message.type}`,
+        );
+    }
+  }
+
+  private async handlePermissionRequest(
+    message: Record<string, unknown>,
+  ): Promise<void> {
+    const { logger, serverName } = this.context;
+    const toolUseId = message.tool_use_id as string;
+    const requestId = message.request_id as string;
+
+    if (!toolUseId || !requestId) {
+      logger.warn(
+        `[${serverName}] permission_request missing tool_use_id or request_id`,
+      );
+      return;
+    }
+
+    const pending = this.pendingCalls.get(toolUseId);
+    if (!pending?.onPermissionRequest) {
+      // Don't auto-deny — the bridge broadcasts permission_request to all
+      // connected MCP clients, and only the client that made the tool call
+      // has the pending entry. Auto-denying here would race with the correct
+      // client's handler when multiple Desktop instances are connected.
+      logger.debug(
+        `[${serverName}] Ignoring permission_request for unknown tool_use_id ${toolUseId.slice(0, 8)} (not our call)`,
+      );
+      return;
+    }
+
+    const request: BridgePermissionRequest = {
+      toolUseId,
+      requestId,
+      toolType: (message.tool_type as string) ?? "unknown",
+      url: (message.url as string) ?? "",
+      actionData: message.action_data as Record<string, unknown> | undefined,
+    };
+
+    try {
+      const allowed = await pending.onPermissionRequest(request);
+      this.sendPermissionResponse(requestId, allowed);
+    } catch (error) {
+      logger.error(`[${serverName}] Error handling permission request:`, error);
+      this.sendPermissionResponse(requestId, false);
+    }
+  }
+
+  private sendPermissionResponse(requestId: string, allowed: boolean): void {
+    if (this.ws?.readyState === WebSocket.OPEN) {
+      const message: Record<string, unknown> = {
+        type: "permission_response",
+        request_id: requestId,
+        allowed,
+      };
+      if (this.selectedDeviceId) {
+        message.target_device_id = this.selectedDeviceId;
+      }
+      this.ws.send(JSON.stringify(message));
+    }
+  }
+
+  private handleToolResult(message: Record<string, unknown>): void {
+    const { logger, serverName, trackEvent } = this.context;
+    const toolUseId = message.tool_use_id as string;
+    if (!toolUseId) {
+      logger.warn(`[${serverName}] Received tool_result without tool_use_id`);
+      return;
+    }
+
+    const pending = this.pendingCalls.get(toolUseId);
+    if (!pending) {
+      logger.debug(
+        `[${serverName}] Received tool_result for unknown call: ${toolUseId.slice(0, 8)}`,
+      );
+      return;
+    }
+
+    const durationMs = Date.now() - pending.startTime;
+
+    // Normalize bridge response format to match socket client format.
+    // Bridge sends: { type, tool_use_id, content: [...], is_error?: boolean }
+    // Socket sends: { result: { content: [...] } } or { error: { content: [...] } }
+    const normalized = this.normalizeBridgeResponse(message);
+    const isError = Boolean(message.is_error) || "error" in normalized;
+
+    if (pending.isTabsContext && !this.selectedDeviceId) {
+      // No extension selected: collect results from all extensions (pre-selection / backward compat)
+      pending.results.push(normalized);
+      // Don't resolve yet — let the timer handle collection
+    } else {
+      // For other tools, resolve on first result
+      clearTimeout(pending.timer);
+      this.pendingCalls.delete(toolUseId);
+
+      if (isError) {
+        // Extract error message for telemetry
+        const errorContent = (normalized as { error?: { content?: unknown[] } })
+          .error?.content;
+        let errorMessage = "Unknown error";
+        if (Array.isArray(errorContent)) {
+          const textItem = errorContent.find(
+            (item) =>
+              typeof item === "object" && item !== null && "text" in item,
+          ) as { text?: string } | undefined;
+          if (textItem?.text) {
+            errorMessage = textItem.text.slice(0, 200);
+          }
+        }
+
+        logger.warn(
+          `[${serverName}] Tool call error: ${pending.toolName} (${toolUseId.slice(0, 8)}) after ${durationMs}ms`,
+        );
+        trackEvent?.("chrome_bridge_tool_call_error", {
+          tool_name: pending.toolName,
+          tool_use_id: toolUseId,
+          duration_ms: durationMs,
+          error_message: errorMessage,
+        });
+      } else {
+        logger.debug(
+          `[${serverName}] Tool call completed: ${pending.toolName} (${toolUseId.slice(0, 8)}) in ${durationMs}ms`,
+        );
+        trackEvent?.("chrome_bridge_tool_call_completed", {
+          tool_name: pending.toolName,
+          tool_use_id: toolUseId,
+          duration_ms: durationMs,
+        });
+      }
+
+      pending.resolve(normalized);
+    }
+  }
+
+  private normalizeBridgeResponse(
+    message: Record<string, unknown>,
+  ): Record<string, unknown> {
+    // Already has result/error wrapper (socket format) — pass through
+    if (message.result || message.error) {
+      return message;
+    }
+
+    // Bridge format has content at top level — wrap it
+    if (message.content) {
+      if (message.is_error) {
+        return { error: { content: message.content } };
+      }
+      return { result: { content: message.content } };
+    }
+
+    return message;
+  }
+
+  private mergeTabsResults(results: unknown[]): unknown {
+    const mergedTabs: unknown[] = [];
+
+    for (const result of results) {
+      const msg = result as Record<string, unknown>;
+      const resultData = msg.result as
+        | { content?: Array<{ type: string; text?: string }> }
+        | undefined;
+      const content = resultData?.content;
+
+      if (!content || !Array.isArray(content)) continue;
+
+      for (const item of content) {
+        if (item.type === "text" && item.text) {
+          try {
+            const parsed = JSON.parse(item.text);
+            if (Array.isArray(parsed)) {
+              mergedTabs.push(...parsed);
+            } else if (
+              parsed?.availableTabs &&
+              Array.isArray(parsed.availableTabs)
+            ) {
+              mergedTabs.push(...parsed.availableTabs);
+            }
+          } catch {
+            // Not JSON, skip
+          }
+        }
+      }
+    }
+
+    if (mergedTabs.length > 0) {
+      const tabListText = mergedTabs
+        .map((t) => {
+          const tab = t as { tabId: number; title: string; url: string };
+          return ` \u2022 tabId ${tab.tabId}: "${tab.title}" (${tab.url})`;
+        })
+        .join("\n");
+
+      return {
+        result: {
+          content: [
+            {
+              type: "text",
+              text: JSON.stringify({ availableTabs: mergedTabs }),
+            },
+            {
+              type: "text",
+              text: `\n\nTab Context:\n- Available tabs:\n${tabListText}`,
+            },
+          ],
+        },
+      };
+    }
+
+    // Return first result as fallback
+    return results[0];
+  }
+
+  private scheduleReconnect(): void {
+    const { logger, serverName, trackEvent } = this.context;
+
+    if (this.reconnectTimer) return;
+
+    this.reconnectAttempts++;
+
+    if (this.reconnectAttempts > 100) {
+      logger.warn(
+        `[${serverName}] Giving up bridge reconnection after 100 attempts`,
+      );
+      trackEvent?.("chrome_bridge_reconnect_exhausted", {
+        total_attempts: 100,
+      });
+      this.reconnectAttempts = 0;
+      return;
+    }
+
+    const delay = Math.min(
+      2000 * Math.pow(1.5, this.reconnectAttempts - 1),
+      30_000,
+    );
+
+    if (this.reconnectAttempts <= 10 || this.reconnectAttempts % 10 === 0) {
+      logger.info(
+        `[${serverName}] Bridge reconnecting in ${Math.round(delay)}ms (attempt ${this.reconnectAttempts})`,
+      );
+    }
+
+    this.reconnectTimer = setTimeout(() => {
+      this.reconnectTimer = null;
+      void this.connect();
+    }, delay);
+  }
+
+  private closeSocket(): void {
+    if (this.ws) {
+      this.ws.removeAllListeners();
+      this.ws.close();
+      this.ws = null;
+    }
+    this.connected = false;
+    this.authenticated = false;
+    // Clear extension selection state so reconnections start fresh
+    this.selectedDeviceId = undefined;
+    this.discoveryComplete = false;
+    this.pendingPairingRequestId = undefined;
+    this.pairingInProgress = false;
+    if (this.pendingSwitchResolve) {
+      this.pendingSwitchResolve(null);
+      this.pendingSwitchResolve = null;
+    }
+    if (this.pendingDiscovery) {
+      clearTimeout(this.pendingDiscovery.timeout);
+      this.pendingDiscovery.resolve([]);
+      this.pendingDiscovery = null;
+    }
+    // Unblock any in-progress waitForPeerConnected so it doesn't hang until its timeout
+    if (this.peerConnectedWaiters.length > 0) {
+      const waiters = this.peerConnectedWaiters;
+      this.peerConnectedWaiters = [];
+      for (const waiter of waiters) {
+        waiter(false);
+      }
+    }
+  }
+
+  private cleanup(): void {
+    if (this.reconnectTimer) {
+      clearTimeout(this.reconnectTimer);
+      this.reconnectTimer = null;
+    }
+
+    // Reject all pending calls
+    for (const [id, pending] of this.pendingCalls) {
+      clearTimeout(pending.timer);
+      pending.reject(new SocketConnectionError("Bridge client disconnected"));
+      this.pendingCalls.delete(id);
+    }
+
+    this.closeSocket();
+    this.reconnectAttempts = 0;
+  }
+}
+
+export function createBridgeClient(
+  context: ClaudeForChromeContext,
+): BridgeClient {
+  return new BridgeClient(context);
+}
diff --git a/packages/@ant/claude-for-chrome-mcp/src/browserTools.ts b/packages/@ant/claude-for-chrome-mcp/src/browserTools.ts
new file mode 100644
index 000000000..a1aba77ab
--- /dev/null
+++ b/packages/@ant/claude-for-chrome-mcp/src/browserTools.ts
@@ -0,0 +1,546 @@
+export const BROWSER_TOOLS = [
+  {
+    name: "javascript_tool",
+    description:
+      "Execute JavaScript code in the context of the current page. The code runs in the page's context and can interact with the DOM, window object, and page variables. Returns the result of the last expression or any thrown errors. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.",
+    inputSchema: {
+      type: "object",
+      properties: {
+        action: {
+          type: "string",
+          description: "Must be set to 'javascript_exec'",
+        },
+        text: {
+          type: "string",
+          description:
+            "The JavaScript code to execute. The code will be evaluated in the page context. The result of the last expression will be returned automatically. Do NOT use 'return' statements - just write the expression you want to evaluate (e.g., 'window.myData.value' not 'return window.myData.value'). You can access and modify the DOM, call page functions, and interact with page variables.",
+        },
+        tabId: {
+          type: "number",
+          description:
+            "Tab ID to execute the code in. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.",
+        },
+      },
+      required: ["action", "text", "tabId"],
+    },
+  },
+  {
+    name: "read_page",
+    description:
+      "Get an accessibility tree representation of elements on the page. By default returns all elements including non-visible ones. Output is limited to 50000 characters by default. If the output exceeds this limit, you will receive an error asking you to specify a smaller depth or focus on a specific element using ref_id. Optionally filter for only interactive elements. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.",
+    inputSchema: {
+      type: "object",
+      properties: {
+        filter: {
+          type: "string",
+          enum: ["interactive", "all"],
+          description:
+            'Filter elements: "interactive" for buttons/links/inputs only, "all" for all elements including non-visible ones (default: all elements)',
+        },
+        tabId: {
+          type: "number",
+          description:
+            "Tab ID to read from. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.",
+        },
+        depth: {
+          type: "number",
+          description:
+            "Maximum depth of the tree to traverse (default: 15). Use a smaller depth if output is too large.",
+        },
+        ref_id: {
+          type: "string",
+          description:
+            "Reference ID of a parent element to read. Will return the specified element and all its children. Use this to focus on a specific part of the page when output is too large.",
+        },
+        max_chars: {
+          type: "number",
+          description:
+            "Maximum characters for output (default: 50000). Set to a higher value if your client can handle large outputs.",
+        },
+      },
+      required: ["tabId"],
+    },
+  },
+  {
+    name: "find",
+    description:
+      'Find elements on the page using natural language. Can search for elements by their purpose (e.g., "search bar", "login button") or by text content (e.g., "organic mango product"). Returns up to 20 matching elements with references that can be used with other tools. If more than 20 matches exist, you\'ll be notified to use a more specific query. If you don\'t have a valid tab ID, use tabs_context_mcp first to get available tabs.',
+    inputSchema: {
+      type: "object",
+      properties: {
+        query: {
+          type: "string",
+          description:
+            'Natural language description of what to find (e.g., "search bar", "add to cart button", "product title containing organic")',
+        },
+        tabId: {
+          type: "number",
+          description:
+            "Tab ID to search in. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.",
+        },
+      },
+      required: ["query", "tabId"],
+    },
+  },
+  {
+    name: "form_input",
+    description:
+      "Set values in form elements using element reference ID from the read_page tool.
If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.", + inputSchema: { + type: "object", + properties: { + ref: { + type: "string", + description: + 'Element reference ID from the read_page tool (e.g., "ref_1", "ref_2")', + }, + value: { + type: ["string", "boolean", "number"], + description: + "The value to set. For checkboxes use boolean, for selects use option value or text, for other inputs use appropriate string/number", + }, + tabId: { + type: "number", + description: + "Tab ID to set form value in. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["ref", "value", "tabId"], + }, + }, + { + name: "computer", + description: `Use a mouse and keyboard to interact with a web browser, and take screenshots. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.\n* Whenever you intend to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your click location so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. 
Don't click boxes on their edges unless asked.`, + inputSchema: { + type: "object", + properties: { + action: { + type: "string", + enum: [ + "left_click", + "right_click", + "type", + "screenshot", + "wait", + "scroll", + "key", + "left_click_drag", + "double_click", + "triple_click", + "zoom", + "scroll_to", + "hover", + ], + description: + "The action to perform:\n* `left_click`: Click the left mouse button at the specified coordinates.\n* `right_click`: Click the right mouse button at the specified coordinates to open context menus.\n* `double_click`: Double-click the left mouse button at the specified coordinates.\n* `triple_click`: Triple-click the left mouse button at the specified coordinates.\n* `type`: Type a string of text.\n* `screenshot`: Take a screenshot of the screen.\n* `wait`: Wait for a specified number of seconds.\n* `scroll`: Scroll up, down, left, or right at the specified coordinates.\n* `key`: Press a specific keyboard key.\n* `left_click_drag`: Drag from start_coordinate to coordinate.\n* `zoom`: Take a screenshot of a specific region for closer inspection.\n* `scroll_to`: Scroll an element into view using its element reference ID from read_page or find tools.\n* `hover`: Move the mouse cursor to the specified coordinates or element without clicking. Useful for revealing tooltips, dropdown menus, or triggering hover states.", + }, + coordinate: { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: + "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates. Required for `left_click`, `right_click`, `double_click`, `triple_click`, and `scroll`. For `left_click_drag`, this is the end position.", + }, + text: { + type: "string", + description: + 'The text to type (for `type` action) or the key(s) to press (for `key` action). For `key` action: Provide space-separated keys (e.g., "Backspace Backspace Delete"). 
Supports keyboard shortcuts using the platform\'s modifier key (use "cmd" on Mac, "ctrl" on Windows/Linux, e.g., "cmd+a" or "ctrl+a" for select all).', + }, + duration: { + type: "number", + minimum: 0, + maximum: 30, + description: + "The number of seconds to wait. Required for `wait`. Maximum 30 seconds.", + }, + scroll_direction: { + type: "string", + enum: ["up", "down", "left", "right"], + description: "The direction to scroll. Required for `scroll`.", + }, + scroll_amount: { + type: "number", + minimum: 1, + maximum: 10, + description: + "The number of scroll wheel ticks. Optional for `scroll`, defaults to 3.", + }, + start_coordinate: { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: + "(x, y): The starting coordinates for `left_click_drag`.", + }, + region: { + type: "array", + items: { type: "number" }, + minItems: 4, + maxItems: 4, + description: + "(x0, y0, x1, y1): The rectangular region to capture for `zoom`. Coordinates define a rectangle from top-left (x0, y0) to bottom-right (x1, y1) in pixels from the viewport origin. Required for `zoom` action. Useful for inspecting small UI elements like icons, buttons, or text.", + }, + repeat: { + type: "number", + minimum: 1, + maximum: 100, + description: + "Number of times to repeat the key sequence. Only applicable for `key` action. Must be a positive integer between 1 and 100. Default is 1. Useful for navigation tasks like pressing arrow keys multiple times.", + }, + ref: { + type: "string", + description: + 'Element reference ID from read_page or find tools (e.g., "ref_1", "ref_2"). Required for `scroll_to` action. Can be used as alternative to `coordinate` for click actions.', + }, + modifiers: { + type: "string", + description: + 'Modifier keys for click actions. Supports: "ctrl", "shift", "alt", "cmd" (or "meta"), "win" (or "windows"). Can be combined with "+" (e.g., "ctrl+shift", "cmd+alt"). 
Optional.', + }, + tabId: { + type: "number", + description: + "Tab ID to execute the action on. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["action", "tabId"], + }, + }, + { + name: "navigate", + description: + "Navigate to a URL, or go forward/back in browser history. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.", + inputSchema: { + type: "object", + properties: { + url: { + type: "string", + description: + 'The URL to navigate to. Can be provided with or without protocol (defaults to https://). Use "forward" to go forward in history or "back" to go back in history.', + }, + tabId: { + type: "number", + description: + "Tab ID to navigate. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["url", "tabId"], + }, + }, + { + name: "resize_window", + description: + "Resize the current browser window to specified dimensions. Useful for testing responsive designs or setting up specific screen sizes. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.", + inputSchema: { + type: "object", + properties: { + width: { + type: "number", + description: "Target window width in pixels", + }, + height: { + type: "number", + description: "Target window height in pixels", + }, + tabId: { + type: "number", + description: + "Tab ID to get the window for. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["width", "height", "tabId"], + }, + }, + { + name: "gif_creator", + description: + "Manage GIF recording and export for browser automation sessions. Control when to start/stop recording browser actions (clicks, scrolls, navigation), then export as an animated GIF with visual overlays (click indicators, action labels, progress bar, watermark). All operations are scoped to the tab's group. 
When starting recording, take a screenshot immediately after to capture the initial state as the first frame. When stopping recording, take a screenshot immediately before to capture the final state as the last frame. For export, either provide 'coordinate' to drag/drop upload to a page element, or set 'download: true' to download the GIF.", + inputSchema: { + type: "object", + properties: { + action: { + type: "string", + enum: ["start_recording", "stop_recording", "export", "clear"], + description: + "Action to perform: 'start_recording' (begin capturing), 'stop_recording' (stop capturing but keep frames), 'export' (generate and export GIF), 'clear' (discard frames)", + }, + tabId: { + type: "number", + description: + "Tab ID to identify which tab group this operation applies to", + }, + download: { + type: "boolean", + description: + "Always set this to true for the 'export' action only. This causes the gif to be downloaded in the browser.", + }, + filename: { + type: "string", + description: + "Optional filename for exported GIF (default: 'recording-[timestamp].gif'). For 'export' action only.", + }, + options: { + type: "object", + description: + "Optional GIF enhancement options for 'export' action. Properties: showClickIndicators (bool), showDragPaths (bool), showActionLabels (bool), showProgressBar (bool), showWatermark (bool), quality (number 1-30). 
All default to true except quality (default: 10).", + properties: { + showClickIndicators: { + type: "boolean", + description: + "Show orange circles at click locations (default: true)", + }, + showDragPaths: { + type: "boolean", + description: "Show red arrows for drag actions (default: true)", + }, + showActionLabels: { + type: "boolean", + description: + "Show black labels describing actions (default: true)", + }, + showProgressBar: { + type: "boolean", + description: "Show orange progress bar at bottom (default: true)", + }, + showWatermark: { + type: "boolean", + description: "Show Claude logo watermark (default: true)", + }, + quality: { + type: "number", + description: + "GIF compression quality, 1-30 (lower = better quality, slower encoding). Default: 10", + }, + }, + }, + }, + required: ["action", "tabId"], + }, + }, + { + name: "upload_image", + description: + "Upload a previously captured screenshot or user-uploaded image to a file input or drag & drop target. Supports two approaches: (1) ref - for targeting specific elements, especially hidden file inputs, (2) coordinate - for drag & drop to visible locations like Google Docs. Provide either ref or coordinate, not both.", + inputSchema: { + type: "object", + properties: { + imageId: { + type: "string", + description: + "ID of a previously captured screenshot (from the computer tool's screenshot action) or a user-uploaded image", + }, + ref: { + type: "string", + description: + 'Element reference ID from read_page or find tools (e.g., "ref_1", "ref_2"). Use this for file inputs (especially hidden ones) or specific elements. Provide either ref or coordinate, not both.', + }, + coordinate: { + type: "array", + items: { + type: "number", + }, + description: + "Viewport coordinates [x, y] for drag & drop to a visible location. Use this for drag & drop targets like Google Docs. 
Provide either ref or coordinate, not both.", + }, + tabId: { + type: "number", + description: + "Tab ID where the target element is located. This is where the image will be uploaded to.", + }, + filename: { + type: "string", + description: + 'Optional filename for the uploaded file (default: "image.png")', + }, + }, + required: ["imageId", "tabId"], + }, + }, + { + name: "get_page_text", + description: + "Extract raw text content from the page, prioritizing article content. Ideal for reading articles, blog posts, or other text-heavy pages. Returns plain text without HTML formatting. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.", + inputSchema: { + type: "object", + properties: { + tabId: { + type: "number", + description: + "Tab ID to extract text from. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["tabId"], + }, + }, + { + name: "tabs_context_mcp", + title: "Tabs Context", + description: + "Get context information about the current MCP tab group. Returns all tab IDs inside the group if it exists. CRITICAL: You must get the context at least once before using other browser automation tools so you know what tabs exist. Each new conversation should create its own new tab (using tabs_create_mcp) rather than reusing existing tabs, unless the user explicitly asks to use an existing tab.", + inputSchema: { + type: "object", + properties: { + createIfEmpty: { + type: "boolean", + description: + "Creates a new MCP tab group if none exists, creates a new Window with a new tab group containing an empty tab (which can be used for this conversation). If a MCP tab group already exists, this parameter has no effect.", + }, + }, + required: [], + }, + }, + { + name: "tabs_create_mcp", + title: "Tabs Create", + description: + "Creates a new empty tab in the MCP tab group. 
CRITICAL: You must get the context using tabs_context_mcp at least once before using other browser automation tools so you know what tabs exist.", + inputSchema: { + type: "object", + properties: {}, + required: [], + }, + }, + { + name: "update_plan", + description: + "Present a plan to the user for approval before taking actions. The user will see the domains you intend to visit and your approach. Once approved, you can proceed with actions on the approved domains without additional permission prompts.", + inputSchema: { + type: "object" as const, + properties: { + domains: { + type: "array" as const, + items: { type: "string" as const }, + description: + "List of domains you will visit (e.g., ['github.com', 'stackoverflow.com']). These domains will be approved for the session when the user accepts the plan.", + }, + approach: { + type: "array" as const, + items: { type: "string" as const }, + description: + "High-level description of what you will do. Focus on outcomes and key actions, not implementation details. Be concise - aim for 3-7 items.", + }, + }, + required: ["domains", "approach"], + }, + }, + { + name: "read_console_messages", + description: + "Read browser console messages (console.log, console.error, console.warn, etc.) from a specific tab. Useful for debugging JavaScript errors, viewing application logs, or understanding what's happening in the browser console. Returns console messages from the current domain only. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs. IMPORTANT: Always provide a pattern to filter messages - without a pattern, you may get too many irrelevant messages.", + inputSchema: { + type: "object", + properties: { + tabId: { + type: "number", + description: + "Tab ID to read console messages from. Must be a tab in the current group. 
Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + onlyErrors: { + type: "boolean", + description: + "If true, only return error and exception messages. Default is false (return all message types).", + }, + clear: { + type: "boolean", + description: + "If true, clear the console messages after reading to avoid duplicates on subsequent calls. Default is false.", + }, + pattern: { + type: "string", + description: + "Regex pattern to filter console messages. Only messages matching this pattern will be returned (e.g., 'error|warning' to find errors and warnings, 'MyApp' to filter app-specific logs). You should always provide a pattern to avoid getting too many irrelevant messages.", + }, + limit: { + type: "number", + description: + "Maximum number of messages to return. Defaults to 100. Increase only if you need more results.", + }, + }, + required: ["tabId"], + }, + }, + { + name: "read_network_requests", + description: + "Read HTTP network requests (XHR, Fetch, documents, images, etc.) from a specific tab. Useful for debugging API calls, monitoring network activity, or understanding what requests a page is making. Returns all network requests made by the current page, including cross-origin requests. Requests are automatically cleared when the page navigates to a different domain. If you don't have a valid tab ID, use tabs_context_mcp first to get available tabs.", + inputSchema: { + type: "object", + properties: { + tabId: { + type: "number", + description: + "Tab ID to read network requests from. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + urlPattern: { + type: "string", + description: + "Optional URL pattern to filter requests. 
Only requests whose URL contains this string will be returned (e.g., '/api/' to filter API calls, 'example.com' to filter by domain).", + }, + clear: { + type: "boolean", + description: + "If true, clear the network requests after reading to avoid duplicates on subsequent calls. Default is false.", + }, + limit: { + type: "number", + description: + "Maximum number of requests to return. Defaults to 100. Increase only if you need more results.", + }, + }, + required: ["tabId"], + }, + }, + { + name: "shortcuts_list", + description: + "List all available shortcuts and workflows (shortcuts and workflows are interchangeable). Returns shortcuts with their commands, descriptions, and whether they are workflows. Use shortcuts_execute to run a shortcut or workflow.", + inputSchema: { + type: "object", + properties: { + tabId: { + type: "number", + description: + "Tab ID to list shortcuts from. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + }, + required: ["tabId"], + }, + }, + { + name: "shortcuts_execute", + description: + "Execute a shortcut or workflow by running it in a new sidepanel window using the current tab (shortcuts and workflows are interchangeable). Use shortcuts_list first to see available shortcuts. This starts the execution and returns immediately - it does not wait for completion.", + inputSchema: { + type: "object", + properties: { + tabId: { + type: "number", + description: + "Tab ID to execute the shortcut on. Must be a tab in the current group. Use tabs_context_mcp first if you don't have a valid tab ID.", + }, + shortcutId: { + type: "string", + description: "The ID of the shortcut to execute", + }, + command: { + type: "string", + description: + "The command name of the shortcut to execute (e.g., 'debug', 'summarize'). 
Do not include the leading slash.", + }, + }, + required: ["tabId"], + }, + }, + { + name: "switch_browser", + description: + "Switch which Chrome browser is used for browser automation. Call this when the user wants to connect to a different Chrome browser. Broadcasts a connection request to all Chrome browsers with the extension installed — the user clicks 'Connect' in the desired browser.", + inputSchema: { + type: "object", + properties: {}, + required: [], + }, + }, +]; diff --git a/packages/@ant/claude-for-chrome-mcp/src/index.ts b/packages/@ant/claude-for-chrome-mcp/src/index.ts index d7d76b189..68b694069 100644 --- a/packages/@ant/claude-for-chrome-mcp/src/index.ts +++ b/packages/@ant/claude-for-chrome-mcp/src/index.ts @@ -1,11 +1,15 @@ -export const BROWSER_TOOLS: any[] = [] - -export class ClaudeForChromeContext {} - -export class Logger {} - -export type PermissionMode = any - -export function createClaudeForChromeMcpServer(..._args: any[]): any { - return null -} +export { BridgeClient, createBridgeClient } from "./bridgeClient.js"; +export { BROWSER_TOOLS } from "./browserTools.js"; +export { + createChromeSocketClient, + createClaudeForChromeMcpServer, +} from "./mcpServer.js"; +export { localPlatformLabel } from "./types.js"; +export type { + BridgeConfig, + ChromeExtensionInfo, + ClaudeForChromeContext, + Logger, + PermissionMode, + SocketClient, +} from "./types.js"; diff --git a/packages/@ant/claude-for-chrome-mcp/src/mcpServer.ts b/packages/@ant/claude-for-chrome-mcp/src/mcpServer.ts new file mode 100644 index 000000000..40b1812a5 --- /dev/null +++ b/packages/@ant/claude-for-chrome-mcp/src/mcpServer.ts @@ -0,0 +1,96 @@ +import { Server } from "@modelcontextprotocol/sdk/server/index.js"; +import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js"; +import { + CallToolRequestSchema, + ListToolsRequestSchema, +} from "@modelcontextprotocol/sdk/types.js"; + +import { createBridgeClient } from "./bridgeClient.js"; +import { BROWSER_TOOLS 
} from "./browserTools.js"; +import { createMcpSocketClient } from "./mcpSocketClient.js"; +import { createMcpSocketPool } from "./mcpSocketPool.js"; +import { handleToolCall } from "./toolCalls.js"; +import type { ClaudeForChromeContext, SocketClient } from "./types.js"; + +/** + * Create the socket/bridge client for the Chrome extension MCP server. + * Exported so Desktop can share a single instance between the registered + * MCP server and the InternalMcpServerManager (CCD sessions). + */ +export function createChromeSocketClient( + context: ClaudeForChromeContext, +): SocketClient { + return context.bridgeConfig + ? createBridgeClient(context) + : context.getSocketPaths + ? createMcpSocketPool(context) + : createMcpSocketClient(context); +} + +export function createClaudeForChromeMcpServer( + context: ClaudeForChromeContext, + existingSocketClient?: SocketClient, +): Server { + const { serverName, logger } = context; + + // Choose transport: bridge (WebSocket) > socket pool (multi-profile) > single socket. + const socketClient = + existingSocketClient ?? createChromeSocketClient(context); + + const server = new Server( + { + name: serverName, + version: "1.0.0", + }, + { + capabilities: { + tools: {}, + logging: {}, + }, + }, + ); + + server.setRequestHandler(ListToolsRequestSchema, async () => { + if (context.isDisabled?.()) { + return { tools: [] }; + } + return { + tools: context.bridgeConfig + ? 
BROWSER_TOOLS + : BROWSER_TOOLS.filter((t) => t.name !== "switch_browser"), + }; + }); + + server.setRequestHandler( + CallToolRequestSchema, + async (request): Promise<CallToolResult> => { + logger.info(`[${serverName}] Executing tool: ${request.params.name}`); + + return handleToolCall( + context, + socketClient, + request.params.name, + request.params.arguments || {}, + ); + }, + ); + + socketClient.setNotificationHandler((notification) => { + logger.info( + `[${serverName}] Forwarding MCP notification: ${notification.method}`, + ); + server + .notification({ + method: notification.method, + params: notification.params, + }) + .catch((error) => { + // Server may not be connected yet (e.g., during startup or after disconnect) + logger.info( + `[${serverName}] Failed to forward MCP notification: ${error.message}`, + ); + }); + }); + + return server; +} diff --git a/packages/@ant/claude-for-chrome-mcp/src/mcpSocketClient.ts b/packages/@ant/claude-for-chrome-mcp/src/mcpSocketClient.ts new file mode 100644 index 000000000..1443d7f5c --- /dev/null +++ b/packages/@ant/claude-for-chrome-mcp/src/mcpSocketClient.ts @@ -0,0 +1,493 @@ +import { promises as fsPromises } from "fs"; +import { createConnection } from "net"; +import type { Socket } from "net"; +import { platform } from "os"; +import { dirname } from "path"; + +import type { + ClaudeForChromeContext, + PermissionMode, + PermissionOverrides, +} from "./types.js"; + +export class SocketConnectionError extends Error { + constructor(message: string) { + super(message); + this.name = "SocketConnectionError"; + } +} + +interface ToolRequest { + method: string; // "execute_tool" + params?: { + client_id?: string; // "desktop" | "claude-code" + tool?: string; + args?: Record<string, unknown>; + }; +} + +interface ToolResponse { + result?: unknown; + error?: string; +} + +interface Notification { + method: string; + params?: Record<string, unknown>; +} + +type SocketMessage = ToolResponse | Notification; + +function isToolResponse(message: SocketMessage): message is 
ToolResponse { + return "result" in message || "error" in message; +} + +function isNotification(message: SocketMessage): message is Notification { + return "method" in message && typeof message.method === "string"; +} + +class McpSocketClient { + private socket: Socket | null = null; + private connected = false; + private connecting = false; + private responseCallback: ((response: ToolResponse) => void) | null = null; + private notificationHandler: ((notification: Notification) => void) | null = + null; + private responseBuffer = Buffer.alloc(0); + private reconnectAttempts = 0; + private maxReconnectAttempts = 10; + private reconnectDelay = 1000; + private reconnectTimer: NodeJS.Timeout | null = null; + private context: ClaudeForChromeContext; + // When true, disables automatic reconnection. Used by McpSocketPool which + // manages reconnection externally by rescanning available sockets. + public disableAutoReconnect = false; + + constructor(context: ClaudeForChromeContext) { + this.context = context; + } + + private async connect(): Promise<void> { + const { serverName, logger } = this.context; + + if (this.connecting) { + logger.info( + `[${serverName}] Already connecting, skipping duplicate attempt`, + ); + return; + } + + this.closeSocket(); + this.connecting = true; + + const socketPath = + this.context.getSocketPath?.() ?? this.context.socketPath; + logger.info(`[${serverName}] Attempting to connect to: ${socketPath}`); + + try { + await this.validateSocketSecurity(socketPath); + } catch (error) { + this.connecting = false; + logger.info(`[${serverName}] Security validation failed:`, error); + // Don't retry on security failures (wrong perms/owner) - those won't + // self-resolve. Only the error handler retries on transient errors. 
+ return; + } + + this.socket = createConnection(socketPath); + + // Timeout the initial connection attempt - if socket file exists but native + // host is dead, the connect can hang indefinitely + const connectTimeout = setTimeout(() => { + if (!this.connected) { + logger.info( + `[${serverName}] Connection attempt timed out after 5000ms`, + ); + this.closeSocket(); + this.scheduleReconnect(); + } + }, 5000); + + this.socket.on("connect", () => { + clearTimeout(connectTimeout); + this.connected = true; + this.connecting = false; + this.reconnectAttempts = 0; + logger.info(`[${serverName}] Successfully connected to bridge server`); + }); + + this.socket.on("data", (data: Buffer) => { + this.responseBuffer = Buffer.concat([this.responseBuffer, data]); + + while (this.responseBuffer.length >= 4) { + const length = this.responseBuffer.readUInt32LE(0); + + if (this.responseBuffer.length < 4 + length) { + break; + } + + const messageBytes = this.responseBuffer.slice(4, 4 + length); + this.responseBuffer = this.responseBuffer.slice(4 + length); + + try { + const message = JSON.parse( + messageBytes.toString("utf-8"), + ) as SocketMessage; + + if (isNotification(message)) { + logger.info( + `[${serverName}] Received notification: ${message.method}`, + ); + if (this.notificationHandler) { + this.notificationHandler(message); + } + } else if (isToolResponse(message)) { + logger.info(`[${serverName}] Received tool response: ${message}`); + this.handleResponse(message); + } else { + logger.info(`[${serverName}] Received unknown message: ${message}`); + } + } catch (error) { + logger.info(`[${serverName}] Failed to parse message:`, error); + } + } + }); + + this.socket.on("error", (error: Error & { code?: string }) => { + clearTimeout(connectTimeout); + logger.info(`[${serverName}] Socket error (code: ${error.code}):`, error); + this.connected = false; + this.connecting = false; + + if ( + error.code && + [ + "ECONNREFUSED", // Native host not listening (stale socket) + 
"ECONNRESET", // Connection reset by peer + "EPIPE", // Broken pipe (native host died mid-write) + "ENOENT", // Socket file was deleted + "EOPNOTSUPP", // Socket file exists but is not a valid socket + "ECONNABORTED", // Connection aborted + ].includes(error.code) + ) { + this.scheduleReconnect(); + } + }); + + this.socket.on("close", () => { + clearTimeout(connectTimeout); + this.connected = false; + this.connecting = false; + this.scheduleReconnect(); + }); + } + + private scheduleReconnect(): void { + const { serverName, logger } = this.context; + + if (this.disableAutoReconnect) { + return; + } + + if (this.reconnectTimer) { + logger.info(`[${serverName}] Reconnect already scheduled, skipping`); + return; + } + + this.reconnectAttempts++; + + // Give up after extended polling (~50 min). A new ensureConnected() call + // from a tool request will restart the cycle if needed. + const maxTotalAttempts = 100; + if (this.reconnectAttempts > maxTotalAttempts) { + logger.info( + `[${serverName}] Giving up after ${maxTotalAttempts} attempts. Will retry on next tool call.`, + ); + this.reconnectAttempts = 0; + return; + } + + // Use aggressive backoff for first 10 attempts, then slow poll every 30s. 
+ const delay = Math.min( + this.reconnectDelay * Math.pow(1.5, this.reconnectAttempts - 1), + 30000, + ); + + if (this.reconnectAttempts <= this.maxReconnectAttempts) { + logger.info( + `[${serverName}] Reconnecting in ${Math.round(delay)}ms (attempt ${ + this.reconnectAttempts + })`, + ); + } else if (this.reconnectAttempts % 10 === 0) { + // Log every 10th slow-poll attempt to avoid log spam + logger.info( + `[${serverName}] Still polling for native host (attempt ${this.reconnectAttempts})`, + ); + } + + this.reconnectTimer = setTimeout(() => { + this.reconnectTimer = null; + void this.connect(); + }, delay); + } + + private handleResponse(response: ToolResponse): void { + if (this.responseCallback) { + const callback = this.responseCallback; + this.responseCallback = null; + callback(response); + } + } + + public setNotificationHandler( + handler: (notification: Notification) => void, + ): void { + this.notificationHandler = handler; + } + + public async ensureConnected(): Promise<boolean> { + const { serverName } = this.context; + + if (this.connected && this.socket) { + return true; + } + + if (!this.socket && !this.connecting) { + await this.connect(); + } + + // Wait for connection with timeout + return new Promise((resolve, reject) => { + let checkTimeoutId: NodeJS.Timeout | null = null; + + const timeout = setTimeout(() => { + if (checkTimeoutId) { + clearTimeout(checkTimeoutId); + } + reject( + new SocketConnectionError( + `[${serverName}] Connection attempt timed out after 5000ms`, + ), + ); + }, 5000); + + const checkConnection = () => { + if (this.connected) { + clearTimeout(timeout); + resolve(true); + } else { + checkTimeoutId = setTimeout(checkConnection, 500); + } + }; + checkConnection(); + }); + } + + private async sendRequest( + request: ToolRequest, + timeoutMs = 30000, + ): Promise<unknown> { + const { serverName } = this.context; + + if (!this.socket) { + throw new SocketConnectionError( + `[${serverName}] Cannot send request: not connected`, + ); + } + + 
const socket = this.socket; + + return new Promise((resolve, reject) => { + const timeout = setTimeout(() => { + this.responseCallback = null; + reject( + new SocketConnectionError( + `[${serverName}] Tool request timed out after ${timeoutMs}ms`, + ), + ); + }, timeoutMs); + + this.responseCallback = (response) => { + clearTimeout(timeout); + resolve(response); + }; + + const requestJson = JSON.stringify(request); + const requestBytes = Buffer.from(requestJson, "utf-8"); + + const lengthPrefix = Buffer.allocUnsafe(4); + lengthPrefix.writeUInt32LE(requestBytes.length, 0); + + const message = Buffer.concat([lengthPrefix, requestBytes]); + socket.write(message); + }); + } + + public async callTool( + name: string, + args: Record, + _permissionOverrides?: PermissionOverrides, + ): Promise { + const request: ToolRequest = { + method: "execute_tool", + params: { + client_id: this.context.clientTypeId, + tool: name, + args, + }, + }; + + return this.sendRequestWithRetry(request); + } + + /** + * Send a request with automatic retry on connection errors. + * + * On connection error or timeout, the native host may be a zombie (connected + * to dead Chrome). Force reconnect to pick up a fresh native host process + * and retry once. 
+ */ + private async sendRequestWithRetry(request: ToolRequest): Promise { + const { serverName, logger } = this.context; + + try { + return await this.sendRequest(request); + } catch (error) { + if (!(error instanceof SocketConnectionError)) { + throw error; + } + + logger.info( + `[${serverName}] Connection error, forcing reconnect and retrying: ${error.message}`, + ); + + this.closeSocket(); + await this.ensureConnected(); + + return await this.sendRequest(request); + } + } + + public async setPermissionMode( + _mode: PermissionMode, + _allowedDomains?: string[], + ): Promise { + // No-op: permission mode is only supported over the bridge (WebSocket) transport + } + + public isConnected(): boolean { + return this.connected; + } + + private closeSocket(): void { + if (this.socket) { + this.socket.removeAllListeners(); + this.socket.end(); + this.socket.destroy(); + this.socket = null; + } + this.connected = false; + this.connecting = false; + } + + private cleanup(): void { + if (this.reconnectTimer) { + clearTimeout(this.reconnectTimer); + this.reconnectTimer = null; + } + + this.closeSocket(); + this.reconnectAttempts = 0; + this.responseBuffer = Buffer.alloc(0); + this.responseCallback = null; + } + + public disconnect(): void { + this.cleanup(); + } + + private async validateSocketSecurity(socketPath: string): Promise { + const { serverName, logger } = this.context; + if (platform() === "win32") { + return; + } + try { + // Validate the parent directory permissions if it's the socket directory + // (not /tmp itself, which has mode 1777 for legacy single-socket paths) + const dirPath = dirname(socketPath); + const dirBasename = dirPath.split("/").pop() || ""; + const isSocketDir = dirBasename.startsWith("claude-mcp-browser-bridge-"); + if (isSocketDir) { + try { + const dirStats = await fsPromises.stat(dirPath); + if (dirStats.isDirectory()) { + const dirMode = dirStats.mode & 0o777; + if (dirMode !== 0o700) { + throw new Error( + `[${serverName}] Insecure 
socket directory permissions: ${dirMode.toString( + 8, + )} (expected 0700). Directory may have been tampered with.`, + ); + } + const currentUid = process.getuid?.(); + if (currentUid !== undefined && dirStats.uid !== currentUid) { + throw new Error( + `Socket directory not owned by current user (uid: ${currentUid}, dir uid: ${dirStats.uid}). ` + + `Potential security risk.`, + ); + } + } + } catch (dirError) { + if ((dirError as NodeJS.ErrnoException).code !== "ENOENT") { + throw dirError; + } + // Directory doesn't exist yet - native host will create it + } + } + + const stats = await fsPromises.stat(socketPath); + + if (!stats.isSocket()) { + throw new Error( + `[${serverName}] Path exists but it's not a socket: ${socketPath}`, + ); + } + + const mode = stats.mode & 0o777; + if (mode !== 0o600) { + throw new Error( + `[${serverName}] Insecure socket permissions: ${mode.toString( + 8, + )} (expected 0600). Socket may have been tampered with.`, + ); + } + + const currentUid = process.getuid?.(); + if (currentUid !== undefined && stats.uid !== currentUid) { + throw new Error( + `Socket not owned by current user (uid: ${currentUid}, socket uid: ${stats.uid}). 
` + + `Potential security risk.`, + ); + } + + logger.info(`[${serverName}] Socket security validation passed`); + } catch (error) { + if ((error as NodeJS.ErrnoException).code === "ENOENT") { + logger.info( + `[${serverName}] Socket not found, will be created by server`, + ); + return; + } + throw error; + } + } +} + +export function createMcpSocketClient( + context: ClaudeForChromeContext, +): McpSocketClient { + return new McpSocketClient(context); +} + +export type { McpSocketClient }; diff --git a/packages/@ant/claude-for-chrome-mcp/src/mcpSocketPool.ts b/packages/@ant/claude-for-chrome-mcp/src/mcpSocketPool.ts new file mode 100644 index 000000000..9329fb6fa --- /dev/null +++ b/packages/@ant/claude-for-chrome-mcp/src/mcpSocketPool.ts @@ -0,0 +1,327 @@ +import { + createMcpSocketClient, + SocketConnectionError, +} from "./mcpSocketClient.js"; +import type { McpSocketClient } from "./mcpSocketClient.js"; +import type { + ClaudeForChromeContext, + PermissionMode, + PermissionOverrides, +} from "./types.js"; + +/** + * Manages connections to multiple Chrome native host sockets (one per Chrome profile). + * Routes tool calls to the correct socket based on tab ID. + * + * For `tabs_context_mcp`: queries all connected sockets and merges results. + * For other tools: routes based on the `tabId` argument using a routing table + * built from tabs_context_mcp responses. 
+ */ +export class McpSocketPool { + private clients: Map<string, McpSocketClient> = new Map(); + private tabRoutes: Map<number, string> = new Map(); + private context: ClaudeForChromeContext; + private notificationHandler: + | ((notification: { method: string; params?: Record }) => void) + | null = null; + + constructor(context: ClaudeForChromeContext) { + this.context = context; + } + + public setNotificationHandler( + handler: (notification: { + method: string; + params?: Record; + }) => void, + ): void { + this.notificationHandler = handler; + for (const client of this.clients.values()) { + client.setNotificationHandler(handler); + } + } + + /** + * Discover available sockets and ensure at least one is connected. + */ + public async ensureConnected(): Promise<boolean> { + const { logger, serverName } = this.context; + + this.refreshClients(); + + // Try to connect any disconnected clients + const connectPromises: Promise<boolean>[] = []; + for (const client of this.clients.values()) { + if (!client.isConnected()) { + connectPromises.push( + client.ensureConnected().catch(() => false), + ); + } + } + + if (connectPromises.length > 0) { + await Promise.all(connectPromises); + } + + const connectedCount = this.getConnectedClients().length; + if (connectedCount === 0) { + logger.info(`[${serverName}] No connected sockets in pool`); + return false; + } + + logger.info(`[${serverName}] Socket pool: ${connectedCount} connected`); + return true; + } + + /** + * Call a tool, routing to the correct socket based on tab ID. + * For tabs_context_mcp, queries all sockets and merges results. 
+ */ + public async callTool( + name: string, + args: Record, + _permissionOverrides?: PermissionOverrides, + ): Promise { + if (name === "tabs_context_mcp") { + return this.callTabsContext(args); + } + + // Route by tabId if present + const tabId = args.tabId as number | undefined; + if (tabId !== undefined) { + const socketPath = this.tabRoutes.get(tabId); + if (socketPath) { + const client = this.clients.get(socketPath); + if (client?.isConnected()) { + return client.callTool(name, args); + } + } + // Tab route not found or client disconnected — fall through to any connected + } + + // Fallback: use first connected client + const connected = this.getConnectedClients(); + if (connected.length === 0) { + throw new SocketConnectionError( + `[${this.context.serverName}] No connected sockets available`, + ); + } + return connected[0]!.callTool(name, args); + } + + public async setPermissionMode( + mode: PermissionMode, + allowedDomains?: string[], + ): Promise { + const connected = this.getConnectedClients(); + await Promise.all( + connected.map((client) => client.setPermissionMode(mode, allowedDomains)), + ); + } + + public isConnected(): boolean { + return this.getConnectedClients().length > 0; + } + + public disconnect(): void { + for (const client of this.clients.values()) { + client.disconnect(); + } + this.clients.clear(); + this.tabRoutes.clear(); + } + + private getConnectedClients(): McpSocketClient[] { + return [...this.clients.values()].filter((c) => c.isConnected()); + } + + /** + * Query all connected sockets for tabs and merge results. + * Updates the tab routing table. 
+ */ + private async callTabsContext( + args: Record, + ): Promise { + const { logger, serverName } = this.context; + const connected = this.getConnectedClients(); + + if (connected.length === 0) { + throw new SocketConnectionError( + `[${serverName}] No connected sockets available`, + ); + } + + // If only one client, skip merging overhead + if (connected.length === 1) { + const result = await connected[0]!.callTool("tabs_context_mcp", args); + this.updateTabRoutes(result, this.getSocketPathForClient(connected[0]!)); + return result; + } + + // Query all connected clients in parallel + const results = await Promise.allSettled( + connected.map(async (client) => { + const result = await client.callTool("tabs_context_mcp", args); + const socketPath = this.getSocketPathForClient(client); + return { result, socketPath }; + }), + ); + + // Merge tab results + const mergedTabs: unknown[] = []; + this.tabRoutes.clear(); + + for (const settledResult of results) { + if (settledResult.status !== "fulfilled") { + logger.info( + `[${serverName}] tabs_context_mcp failed on one socket: ${settledResult.reason}`, + ); + continue; + } + + const { result, socketPath } = settledResult.value; + this.updateTabRoutes(result, socketPath); + + const tabs = this.extractTabs(result); + if (tabs) { + mergedTabs.push(...tabs); + } + } + + // Return merged result in the same format as the extension response + if (mergedTabs.length > 0) { + const tabListText = mergedTabs + .map((t) => { + const tab = t as { tabId: number; title: string; url: string }; + return ` • tabId ${tab.tabId}: "${tab.title}" (${tab.url})`; + }) + .join("\n"); + + return { + result: { + content: [ + { + type: "text", + text: JSON.stringify({ availableTabs: mergedTabs }), + }, + { + type: "text", + text: `\n\nTab Context:\n- Available tabs:\n${tabListText}`, + }, + ], + }, + }; + } + + // Fallback: return first successful result as-is + for (const settledResult of results) { + if (settledResult.status === "fulfilled") { + 
return settledResult.value.result; + } + } + + throw new SocketConnectionError( + `[${serverName}] All sockets failed for tabs_context_mcp`, + ); + } + + /** + * Extract tab objects from a tool response to update routing table. + */ + private updateTabRoutes(result: unknown, socketPath: string): void { + const tabs = this.extractTabs(result); + if (!tabs) return; + + for (const tab of tabs) { + if (typeof tab === "object" && tab !== null && "tabId" in tab) { + const tabId = (tab as { tabId: number }).tabId; + this.tabRoutes.set(tabId, socketPath); + } + } + } + + private extractTabs(result: unknown): unknown[] | null { + if (!result || typeof result !== "object") return null; + + // Response format: { result: { content: [{ type: "text", text: "{\"availableTabs\":[...],\"tabGroupId\":...}" }] } } + const asResponse = result as { + result?: { content?: Array<{ type: string; text?: string }> }; + }; + const content = asResponse.result?.content; + if (!content || !Array.isArray(content)) return null; + + for (const item of content) { + if (item.type === "text" && item.text) { + try { + const parsed = JSON.parse(item.text); + if (Array.isArray(parsed)) return parsed; + // Handle { availableTabs: [...] } format + if (parsed && Array.isArray(parsed.availableTabs)) { + return parsed.availableTabs; + } + } catch { + // Not JSON, skip + } + } + } + return null; + } + + private getSocketPathForClient(client: McpSocketClient): string { + for (const [path, c] of this.clients.entries()) { + if (c === client) return path; + } + return ""; + } + + /** + * Scan for available sockets and create/remove clients as needed. 
+ */ + private refreshClients(): void { + const socketPaths = this.getAvailableSocketPaths(); + const { logger, serverName } = this.context; + + // Add new clients for newly discovered sockets + for (const path of socketPaths) { + if (!this.clients.has(path)) { + logger.info(`[${serverName}] Adding socket to pool: ${path}`); + const clientContext: ClaudeForChromeContext = { + ...this.context, + socketPath: path, + getSocketPath: undefined, + getSocketPaths: undefined, + }; + const client = createMcpSocketClient(clientContext); + client.disableAutoReconnect = true; + if (this.notificationHandler) { + client.setNotificationHandler(this.notificationHandler); + } + this.clients.set(path, client); + } + } + + // Remove clients for sockets that no longer exist + for (const [path, client] of this.clients.entries()) { + if (!socketPaths.includes(path)) { + logger.info(`[${serverName}] Removing stale socket from pool: ${path}`); + client.disconnect(); + this.clients.delete(path); + for (const [tabId, socketPath] of this.tabRoutes.entries()) { + if (socketPath === path) { + this.tabRoutes.delete(tabId); + } + } + } + } + } + + private getAvailableSocketPaths(): string[] { + return this.context.getSocketPaths?.() ?? 
[]; + } +} + +export function createMcpSocketPool( + context: ClaudeForChromeContext, +): McpSocketPool { + return new McpSocketPool(context); +} diff --git a/packages/@ant/claude-for-chrome-mcp/src/toolCalls.ts b/packages/@ant/claude-for-chrome-mcp/src/toolCalls.ts new file mode 100644 index 000000000..65bd8d178 --- /dev/null +++ b/packages/@ant/claude-for-chrome-mcp/src/toolCalls.ts @@ -0,0 +1,301 @@ +import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js"; + +import { SocketConnectionError } from "./mcpSocketClient.js"; +import type { + ClaudeForChromeContext, + PermissionMode, + PermissionOverrides, + SocketClient, +} from "./types.js"; + +export const handleToolCall = async ( + context: ClaudeForChromeContext, + socketClient: SocketClient, + name: string, + args: Record, + permissionOverrides?: PermissionOverrides, +): Promise => { + // Handle permission mode changes locally (not forwarded to extension) + if (name === "set_permission_mode") { + return handleSetPermissionMode(socketClient, args); + } + + // Handle switch_browser outside the normal tool call flow (manages its own connection) + if (name === "switch_browser") { + return handleSwitchBrowser(context, socketClient); + } + + try { + const isConnected = await socketClient.ensureConnected(); + + context.logger.silly( + `[${context.serverName}] Server is connected: ${isConnected}. Received tool call: ${name} with args: ${JSON.stringify(args)}.`, + ); + + if (isConnected) { + return await handleToolCallConnected( + context, + socketClient, + name, + args, + permissionOverrides, + ); + } + + return handleToolCallDisconnected(context); + } catch (error) { + context.logger.info(`[${context.serverName}] Error calling tool:`, error); + + if (error instanceof SocketConnectionError) { + return handleToolCallDisconnected(context); + } + + return { + content: [ + { + type: "text", + text: `Error calling tool, please try again. : ${error instanceof Error ? 
error.message : String(error)}`, + }, + ], + isError: true, + }; + } +}; + +async function handleToolCallConnected( + context: ClaudeForChromeContext, + socketClient: SocketClient, + name: string, + args: Record, + permissionOverrides?: PermissionOverrides, +): Promise { + const response = await socketClient.callTool(name, args, permissionOverrides); + + context.logger.silly( + `[${context.serverName}] Received result from socket bridge: ${JSON.stringify(response)}`, + ); + + if (response === null || response === undefined) { + return { + content: [{ type: "text", text: "Tool execution completed" }], + }; + } + + // Response will have either result or error field + const { result, error } = response as { + result?: { content: unknown[] | string }; + error?: { content: unknown[] | string }; + }; + + // Determine which field has the content and whether it's an error + const contentData = error || result; + const isError = !!error; + + if (!contentData) { + return { + content: [{ type: "text", text: "Tool execution completed" }], + }; + } + + if (isError && isAuthenticationError(contentData.content)) { + context.onAuthenticationError(); + } + + const { content } = contentData; + + if (content && Array.isArray(content)) { + if (isError) { + return { + content: content.map((item: unknown) => { + if (typeof item === "object" && item !== null && "type" in item) { + return item; + } + + return { type: "text", text: String(item) }; + }), + isError: true, + } as CallToolResult; + } + + const convertedContent = content.map((item: unknown) => { + if ( + typeof item === "object" && + item !== null && + "type" in item && + "source" in item + ) { + const typedItem = item; + if ( + typedItem.type === "image" && + typeof typedItem.source === "object" && + typedItem.source !== null && + "data" in typedItem.source + ) { + return { + type: "image", + data: typedItem.source.data, + mimeType: + "media_type" in typedItem.source + ? 
typedItem.source.media_type || "image/png" + : "image/png", + }; + } + } + + if (typeof item === "object" && item !== null && "type" in item) { + return item; + } + + return { type: "text", text: String(item) }; + }); + + return { + content: convertedContent, + isError, + } as CallToolResult; + } + + // Handle string content + if (typeof content === "string") { + return { + content: [{ type: "text", text: content }], + isError, + } as CallToolResult; + } + + // Fallback for unexpected result format + context.logger.warn( + `[${context.serverName}] Unexpected result format from socket bridge`, + response, + ); + + return { + content: [{ type: "text", text: JSON.stringify(response) }], + isError, + }; +} + +function handleToolCallDisconnected( + context: ClaudeForChromeContext, +): CallToolResult { + const text = context.onToolCallDisconnected(); + return { + content: [{ type: "text", text }], + }; +} + +/** + * Handle set_permission_mode tool call locally. + * This is security-sensitive as it controls whether permission prompts are shown. + */ +async function handleSetPermissionMode( + socketClient: SocketClient, + args: Record, +): Promise { + // Validate permission mode at runtime + const validModes = [ + "ask", + "skip_all_permission_checks", + "follow_a_plan", + ] as const; + const mode = args.mode as string | undefined; + const permissionMode: PermissionMode = + mode && validModes.includes(mode as PermissionMode) + ? (mode as PermissionMode) + : "ask"; + + if (socketClient.setPermissionMode) { + await socketClient.setPermissionMode( + permissionMode, + args.allowed_domains as string[] | undefined, + ); + } + + return { + content: [ + { type: "text", text: `Permission mode set to: ${permissionMode}` }, + ], + }; +} + +/** + * Handle switch_browser tool call. Broadcasts a pairing request and blocks + * until a browser responds or timeout. 
+ */ +async function handleSwitchBrowser( + context: ClaudeForChromeContext, + socketClient: SocketClient, +): Promise { + if (!context.bridgeConfig) { + return { + content: [ + { + type: "text", + text: "Browser switching is only available with bridge connections.", + }, + ], + isError: true, + }; + } + + const isConnected = await socketClient.ensureConnected(); + if (!isConnected) { + return handleToolCallDisconnected(context); + } + + const result = (await socketClient.switchBrowser?.()) ?? null; + + if (result === "no_other_browsers") { + return { + content: [ + { + type: "text", + text: "No other browsers available to switch to. Open Chrome with the Claude extension in another browser to switch.", + }, + ], + isError: true, + }; + } + + if (result) { + return { + content: [ + { type: "text", text: `Connected to browser "${result.name}".` }, + ], + }; + } + + return { + content: [ + { + type: "text", + text: "No browser responded within the timeout. Make sure Chrome is open with the Claude extension installed, then try again.", + }, + ], + isError: true, + }; +} + +/** + * Check if the error content indicates an authentication issue + */ +function isAuthenticationError(content: unknown[] | string): boolean { + const errorText = Array.isArray(content) + ? 
content + .map((item) => { + if (typeof item === "string") return item; + if ( + typeof item === "object" && + item !== null && + "text" in item && + typeof item.text === "string" + ) { + return item.text; + } + return ""; + }) + .join(" ") + : String(content); + + return errorText.toLowerCase().includes("re-authenticated"); +} diff --git a/packages/@ant/claude-for-chrome-mcp/src/types.ts b/packages/@ant/claude-for-chrome-mcp/src/types.ts new file mode 100644 index 000000000..0334882b9 --- /dev/null +++ b/packages/@ant/claude-for-chrome-mcp/src/types.ts @@ -0,0 +1,134 @@ +export interface Logger { + info: (message: string, ...args: unknown[]) => void; + error: (message: string, ...args: unknown[]) => void; + warn: (message: string, ...args: unknown[]) => void; + debug: (message: string, ...args: unknown[]) => void; + silly: (message: string, ...args: unknown[]) => void; +} + +export type PermissionMode = + | "ask" + | "skip_all_permission_checks" + | "follow_a_plan"; + +export interface BridgeConfig { + /** Bridge WebSocket base URL (e.g., wss://bridge.claudeusercontent.com) */ + url: string; + /** Returns the user's account UUID for the connection path */ + getUserId: () => Promise; + /** Returns a valid OAuth token for bridge authentication */ + getOAuthToken: () => Promise; + /** Optional dev user ID for local development (bypasses OAuth) */ + devUserId?: string; +} + +/** Metadata about a connected Chrome extension instance. */ +export interface ChromeExtensionInfo { + deviceId: string; + osPlatform?: string; + connectedAt: number; + name?: string; +} + +export interface ClaudeForChromeContext { + serverName: string; + logger: Logger; + socketPath: string; + // Optional dynamic resolver for socket path. When provided, called on each + // connection attempt to handle runtime conditions (e.g., TMPDIR mismatch). + getSocketPath?: () => string; + // Optional resolver returning all available socket paths (for multi-profile support). 
+ // When provided, a socket pool connects to all sockets and routes by tab ID. + getSocketPaths?: () => string[]; + clientTypeId: string; // "desktop" | "claude-code" + onToolCallDisconnected: () => string; + onAuthenticationError: () => void; + isDisabled?: () => boolean; + /** Bridge WebSocket configuration. When provided, uses bridge instead of socket. */ + bridgeConfig?: BridgeConfig; + /** If set, permission mode is sent to the extension immediately on bridge connection. */ + initialPermissionMode?: PermissionMode; + /** Optional callback to track telemetry events for bridge connections */ + trackEvent?: ( + eventName: K, + metadata: Record | null, + ) => void; + /** Called when user pairs with an extension via the browser pairing flow. */ + onExtensionPaired?: (deviceId: string, name: string) => void; + /** Returns the previously paired deviceId, if any. */ + getPersistedDeviceId?: () => string | undefined; + /** Called when a remote extension is auto-selected (only option available). */ + onRemoteExtensionWarning?: (ext: ChromeExtensionInfo) => void; +} + +/** + * Map Node's process.platform to the platform string reported by Chrome extensions + * via navigator.userAgentData.platform. + */ +export function localPlatformLabel(): string { + return process.platform === "darwin" + ? "macOS" + : process.platform === "win32" + ? "Windows" + : "Linux"; +} + +/** Permission request forwarded from the extension to the desktop for user approval. */ +export interface BridgePermissionRequest { + /** Links to the pending tool_call */ + toolUseId: string; + /** Unique ID for this permission request */ + requestId: string; + /** Tool type, e.g. "navigate", "click", "execute_javascript" */ + toolType: string; + /** The URL/domain context */ + url: string; + /** Additional action data (click coordinates, text, etc.) */ + actionData?: Record; +} + +/** Desktop response to a bridge permission request. 
*/ +export interface BridgePermissionResponse { + requestId: string; + allowed: boolean; +} + +/** Per-call permission overrides, allowing each session to use its own permission state. */ +export interface PermissionOverrides { + permissionMode: PermissionMode; + allowedDomains?: string[]; + /** Callback invoked when the extension requests user permission via the bridge. */ + onPermissionRequest?: (request: BridgePermissionRequest) => Promise; +} + +/** Shared interface for McpSocketClient and McpSocketPool */ +export interface SocketClient { + ensureConnected(): Promise; + callTool( + name: string, + args: Record, + permissionOverrides?: PermissionOverrides, + ): Promise; + isConnected(): boolean; + disconnect(): void; + setNotificationHandler( + handler: (notification: { + method: string; + params?: Record; + }) => void, + ): void; + /** Set permission mode for the current session. Only effective on BridgeClient. */ + setPermissionMode?( + mode: PermissionMode, + allowedDomains?: string[], + ): Promise; + /** Switch to a different browser. Only available on BridgeClient. */ + switchBrowser?(): Promise< + | { + deviceId: string; + name: string; + } + | "no_other_browsers" + | null + >; +} diff --git a/packages/@ant/computer-use-input/src/backends/darwin.ts b/packages/@ant/computer-use-input/src/backends/darwin.ts new file mode 100644 index 000000000..4f9569d2d --- /dev/null +++ b/packages/@ant/computer-use-input/src/backends/darwin.ts @@ -0,0 +1,137 @@ +/** + * macOS backend for computer-use-input + * + * Uses AppleScript (osascript) and JXA (JavaScript for Automation) to control + * mouse and keyboard via CoreGraphics events and System Events. 
+ */ + +import { $ } from 'bun' +import type { FrontmostAppInfo, InputBackend } from '../types.js' + +const KEY_MAP: Record<string, number> = { + return: 36, enter: 36, tab: 48, space: 49, delete: 51, backspace: 51, + escape: 53, esc: 53, + left: 123, right: 124, down: 125, up: 126, + f1: 122, f2: 120, f3: 99, f4: 118, f5: 96, f6: 97, + f7: 98, f8: 100, f9: 101, f10: 109, f11: 103, f12: 111, + home: 115, end: 119, pageup: 116, pagedown: 121, +} + +const MODIFIER_MAP: Record<string, string> = { + command: 'command down', cmd: 'command down', meta: 'command down', super: 'command down', + shift: 'shift down', + option: 'option down', alt: 'option down', + control: 'control down', ctrl: 'control down', +} + +async function osascript(script: string): Promise<string> { + const result = await $`osascript -e ${script}`.quiet().nothrow().text() + return result.trim() +} + +async function jxa(script: string): Promise<string> { + const result = await $`osascript -l JavaScript -e ${script}`.quiet().nothrow().text() + return result.trim() +} + +function buildMouseJxa(eventType: string, x: number, y: number, btn: number, clickState?: number): string { + let script = `ObjC.import("CoreGraphics"); var p = $.CGPointMake(${x},${y}); var e = $.CGEventCreateMouseEvent(null, $.${eventType}, p, ${btn});` + if (clickState !== undefined) { + script += ` $.CGEventSetIntegerValueField(e, $.kCGMouseEventClickState, ${clickState});` + } + script += ` $.CGEventPost($.kCGHIDEventTap, e);` + return script +} + +export const moveMouse: InputBackend['moveMouse'] = async (x, y, _animated) => { + await jxa(buildMouseJxa('kCGEventMouseMoved', x, y, 0)) +} + +export const key: InputBackend['key'] = async (keyName, action) => { + if (action === 'release') return + const lower = keyName.toLowerCase() + const keyCode = KEY_MAP[lower] + if (keyCode !== undefined) { + await osascript(`tell application "System Events" to key code ${keyCode}`) + } else { + await osascript(`tell application "System Events" to keystroke "${keyName.length === 1 ? 
keyName : lower}"`) + } +} + +export const keys: InputBackend['keys'] = async (parts) => { + const modifiers: string[] = [] + let finalKey: string | null = null + for (const part of parts) { + const mod = MODIFIER_MAP[part.toLowerCase()] + if (mod) modifiers.push(mod) + else finalKey = part + } + if (!finalKey) return + const lower = finalKey.toLowerCase() + const keyCode = KEY_MAP[lower] + const modStr = modifiers.length > 0 ? ` using {${modifiers.join(', ')}}` : '' + if (keyCode !== undefined) { + await osascript(`tell application "System Events" to key code ${keyCode}${modStr}`) + } else { + await osascript(`tell application "System Events" to keystroke "${finalKey.length === 1 ? finalKey : lower}"${modStr}`) + } +} + +export const mouseLocation: InputBackend['mouseLocation'] = async () => { + const result = await jxa('ObjC.import("CoreGraphics"); var e = $.CGEventCreate(null); var p = $.CGEventGetLocation(e); p.x + "," + p.y') + const [xStr, yStr] = result.split(',') + return { x: Math.round(Number(xStr)), y: Math.round(Number(yStr)) } +} + +export const mouseButton: InputBackend['mouseButton'] = async (button, action, count) => { + const pos = await mouseLocation() + const btn = button === 'left' ? 0 : button === 'right' ? 1 : 2 + const downType = btn === 0 ? 'kCGEventLeftMouseDown' : btn === 1 ? 'kCGEventRightMouseDown' : 'kCGEventOtherMouseDown' + const upType = btn === 0 ? 'kCGEventLeftMouseUp' : btn === 1 ? 'kCGEventRightMouseUp' : 'kCGEventOtherMouseUp' + + if (action === 'click') { + for (let i = 0; i < (count ?? 1); i++) { + await jxa(buildMouseJxa(downType, pos.x, pos.y, btn, i + 1)) + await jxa(buildMouseJxa(upType, pos.x, pos.y, btn, i + 1)) + } + } else if (action === 'press') { + await jxa(buildMouseJxa(downType, pos.x, pos.y, btn)) + } else { + await jxa(buildMouseJxa(upType, pos.x, pos.y, btn)) + } +} + +export const mouseScroll: InputBackend['mouseScroll'] = async (amount, direction) => { + const script = direction === 'vertical' + ? 
`ObjC.import("CoreGraphics"); var e = $.CGEventCreateScrollWheelEvent(null, 0, 1, ${amount}); $.CGEventPost($.kCGHIDEventTap, e);` + : `ObjC.import("CoreGraphics"); var e = $.CGEventCreateScrollWheelEvent(null, 0, 2, 0, ${amount}); $.CGEventPost($.kCGHIDEventTap, e);` + await jxa(script) +} + +export const typeText: InputBackend['typeText'] = async (text) => { + const escaped = text.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + await osascript(`tell application "System Events" to keystroke "${escaped}"`) +} + +export const getFrontmostAppInfo: InputBackend['getFrontmostAppInfo'] = () => { + try { + const result = Bun.spawnSync({ + cmd: ['osascript', '-e', ` + tell application "System Events" + set frontApp to first application process whose frontmost is true + set appName to name of frontApp + set bundleId to bundle identifier of frontApp + return bundleId & "|" & appName + end tell + `], + stdout: 'pipe', + stderr: 'pipe', + }) + const output = new TextDecoder().decode(result.stdout).trim() + if (!output || !output.includes('|')) return null + const [bundleId, appName] = output.split('|', 2) + return { bundleId: bundleId!, appName: appName! } + } catch { + return null + } +} diff --git a/packages/@ant/computer-use-input/src/backends/linux.ts b/packages/@ant/computer-use-input/src/backends/linux.ts new file mode 100644 index 000000000..60d27347a --- /dev/null +++ b/packages/@ant/computer-use-input/src/backends/linux.ts @@ -0,0 +1,173 @@ +/** + * Linux backend for computer-use-input + * + * Uses xdotool for mouse and keyboard simulation. 
+ * Requires: xdotool (apt install xdotool) + */ + +import type { FrontmostAppInfo, InputBackend } from '../types.js' + +// --------------------------------------------------------------------------- +// Shell helper: run a command and return trimmed stdout +// --------------------------------------------------------------------------- + +function run(cmd: string[]): string { + const result = Bun.spawnSync({ + cmd, + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +async function runAsync(cmd: string[]): Promise<string> { + const proc = Bun.spawn(cmd, { stdout: 'pipe', stderr: 'pipe' }) + const out = await new Response(proc.stdout).text() + await proc.exited + return out.trim() +} + +// --------------------------------------------------------------------------- +// xdotool key name mapping +// --------------------------------------------------------------------------- + +const KEY_MAP: Record<string, string> = { + return: 'Return', enter: 'Return', tab: 'Tab', space: 'space', + backspace: 'BackSpace', delete: 'Delete', escape: 'Escape', esc: 'Escape', + left: 'Left', up: 'Up', right: 'Right', down: 'Down', + home: 'Home', end: 'End', pageup: 'Prior', pagedown: 'Next', + f1: 'F1', f2: 'F2', f3: 'F3', f4: 'F4', f5: 'F5', f6: 'F6', + f7: 'F7', f8: 'F8', f9: 'F9', f10: 'F10', f11: 'F11', f12: 'F12', + shift: 'shift', lshift: 'shift', rshift: 'shift', + control: 'ctrl', ctrl: 'ctrl', lcontrol: 'ctrl', rcontrol: 'ctrl', + alt: 'alt', option: 'alt', lalt: 'alt', ralt: 'alt', + win: 'super', meta: 'super', command: 'super', cmd: 'super', super: 'super', + insert: 'Insert', printscreen: 'Print', pause: 'Pause', + numlock: 'Num_Lock', capslock: 'Caps_Lock', scrolllock: 'Scroll_Lock', +} + +const MODIFIER_KEYS = new Set([ + 'shift', 'lshift', 'rshift', 'control', 'ctrl', 'lcontrol', 'rcontrol', + 'alt', 'option', 'lalt', 'ralt', 'win', 'meta', 'command', 'cmd', 'super', +]) + +function mapKey(name: string): string { + return 
KEY_MAP[name.toLowerCase()] ?? name +} + +// --------------------------------------------------------------------------- +// xdotool mouse button mapping +// --------------------------------------------------------------------------- + +function mouseButtonNum(button: 'left' | 'right' | 'middle'): string { + return button === 'left' ? '1' : button === 'right' ? '3' : '2' +} + +// --------------------------------------------------------------------------- +// Implementation +// --------------------------------------------------------------------------- + +export const moveMouse: InputBackend['moveMouse'] = async (x, y, _animated) => { + run(['xdotool', 'mousemove', '--sync', String(Math.round(x)), String(Math.round(y))]) +} + +export const mouseLocation: InputBackend['mouseLocation'] = async () => { + const out = run(['xdotool', 'getmouselocation']) + // Output format: "x:123 y:456 screen:0 window:12345678" + const xMatch = out.match(/x:(\d+)/) + const yMatch = out.match(/y:(\d+)/) + return { + x: xMatch ? Number(xMatch[1]) : 0, + y: yMatch ? Number(yMatch[1]) : 0, + } +} + +export const mouseButton: InputBackend['mouseButton'] = async (button, action, count) => { + const btn = mouseButtonNum(button) + if (action === 'click') { + const n = count ?? 1 + run(['xdotool', 'click', '--repeat', String(n), btn]) + } else if (action === 'press') { + run(['xdotool', 'mousedown', btn]) + } else { + run(['xdotool', 'mouseup', btn]) + } +} + +export const mouseScroll: InputBackend['mouseScroll'] = async (amount, direction) => { + // xdotool click 4=scroll up, 5=scroll down, 6=scroll left, 7=scroll right + // Positive amount = down/right, negative = up/left + if (direction === 'vertical') { + const btn = amount >= 0 ? '5' : '4' + const repeats = Math.abs(Math.round(amount)) + if (repeats > 0) { + run(['xdotool', 'click', '--repeat', String(repeats), btn]) + } + } else { + const btn = amount >= 0 ? 
'7' : '6' + const repeats = Math.abs(Math.round(amount)) + if (repeats > 0) { + run(['xdotool', 'click', '--repeat', String(repeats), btn]) + } + } +} + +export const key: InputBackend['key'] = async (keyName, action) => { + const mapped = mapKey(keyName) + if (action === 'press') { + run(['xdotool', 'keydown', mapped]) + } else { + run(['xdotool', 'keyup', mapped]) + } +} + +export const keys: InputBackend['keys'] = async (parts) => { + // xdotool key accepts "modifier+modifier+key" format + const modifiers: string[] = [] + let finalKey: string | null = null + + for (const part of parts) { + if (MODIFIER_KEYS.has(part.toLowerCase())) { + modifiers.push(mapKey(part)) + } else { + finalKey = part + } + } + if (!finalKey) return + + const combo = [...modifiers, mapKey(finalKey)].join('+') + run(['xdotool', 'key', combo]) +} + +export const typeText: InputBackend['typeText'] = async (text) => { + run(['xdotool', 'type', '--delay', '12', text]) +} + +export const getFrontmostAppInfo: InputBackend['getFrontmostAppInfo'] = () => { + try { + const windowId = run(['xdotool', 'getactivewindow']) + if (!windowId) return null + + const pidStr = run(['xdotool', 'getwindowpid', windowId]) + if (!pidStr) return null + + const pid = pidStr.trim() + + // Read the executable path from /proc + let exePath = '' + try { + exePath = run(['readlink', '-f', `/proc/${pid}/exe`]) + } catch { /* ignore */ } + + // Read the process name from /proc/comm + let appName = '' + try { + appName = run(['cat', `/proc/${pid}/comm`]) + } catch { /* ignore */ } + + if (!exePath && !appName) return null + return { bundleId: exePath || `/proc/${pid}/exe`, appName: appName || 'unknown' } + } catch { + return null + } +} diff --git a/packages/@ant/computer-use-input/src/backends/win32.ts b/packages/@ant/computer-use-input/src/backends/win32.ts new file mode 100644 index 000000000..08900be56 --- /dev/null +++ b/packages/@ant/computer-use-input/src/backends/win32.ts @@ -0,0 +1,218 @@ +/** + * Windows backend 
for computer-use-input + * + * Uses PowerShell with Win32 P/Invoke (SetCursorPos, SendInput, keybd_event, + * GetForegroundWindow) to control mouse and keyboard. + * + * P/Invoke types are declared once in WIN32_TYPES and recompiled by Add-Type in each spawned PowerShell process. + */ + +import type { FrontmostAppInfo, InputBackend } from '../types.js' + +// --------------------------------------------------------------------------- +// PowerShell helper — run a script and return trimmed stdout +// --------------------------------------------------------------------------- + +function ps(script: string): string { + const result = Bun.spawnSync({ + cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +async function psAsync(script: string): Promise<string> { + const proc = Bun.spawn( + ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + { stdout: 'pipe', stderr: 'pipe' }, + ) + const out = await new Response(proc.stdout).text() + await proc.exited + return out.trim() +} + +// --------------------------------------------------------------------------- +// P/Invoke type definitions (shared source; compiled per PowerShell process) +// --------------------------------------------------------------------------- + +const WIN32_TYPES = ` +Add-Type -Language CSharp @' +using System; +using System.Runtime.InteropServices; +using System.Text; +using System.Diagnostics; + +public class CuWin32 { + // --- Cursor --- + [DllImport("user32.dll")] public static extern bool SetCursorPos(int X, int Y); + [DllImport("user32.dll")] public static extern bool GetCursorPos(out POINT p); + [StructLayout(LayoutKind.Sequential)] public struct POINT { public int X; public int Y; } + + // --- SendInput --- + [StructLayout(LayoutKind.Sequential)] public struct MOUSEINPUT { + public int dx; public int dy; public int mouseData; public uint dwFlags; public uint time; public IntPtr dwExtraInfo; + } +
[StructLayout(LayoutKind.Explicit)] public struct INPUT { + [FieldOffset(0)] public uint type; + [FieldOffset(4)] public MOUSEINPUT mi; + } + [StructLayout(LayoutKind.Sequential)] public struct KEYBDINPUT { + public ushort wVk; public ushort wScan; public uint dwFlags; public uint time; public IntPtr dwExtraInfo; + } + [StructLayout(LayoutKind.Explicit)] public struct KINPUT { + [FieldOffset(0)] public uint type; + [FieldOffset(4)] public KEYBDINPUT ki; + } + [DllImport("user32.dll", SetLastError=true)] public static extern uint SendInput(uint n, INPUT[] i, int cb); + [DllImport("user32.dll", SetLastError=true)] public static extern uint SendInput(uint n, KINPUT[] i, int cb); + + // --- Keyboard --- + [DllImport("user32.dll")] public static extern void keybd_event(byte bVk, byte bScan, uint dwFlags, UIntPtr dwExtraInfo); + [DllImport("user32.dll")] public static extern short VkKeyScan(char ch); + + // --- Window --- + [DllImport("user32.dll")] public static extern IntPtr GetForegroundWindow(); + [DllImport("user32.dll")] public static extern uint GetWindowThreadProcessId(IntPtr hWnd, out uint pid); + [DllImport("user32.dll", CharSet=CharSet.Unicode)] public static extern int GetWindowText(IntPtr hWnd, StringBuilder sb, int max); + + // Constants + public const uint INPUT_MOUSE = 0, INPUT_KEYBOARD = 1; + public const uint MOUSEEVENTF_LEFTDOWN = 0x0002, MOUSEEVENTF_LEFTUP = 0x0004; + public const uint MOUSEEVENTF_RIGHTDOWN = 0x0008, MOUSEEVENTF_RIGHTUP = 0x0010; + public const uint MOUSEEVENTF_MIDDLEDOWN = 0x0020, MOUSEEVENTF_MIDDLEUP = 0x0040; + public const uint MOUSEEVENTF_WHEEL = 0x0800, MOUSEEVENTF_HWHEEL = 0x1000; + public const uint KEYEVENTF_KEYUP = 0x0002; +} +'@ +` + +// --------------------------------------------------------------------------- +// Virtual key code mapping +// --------------------------------------------------------------------------- + +const VK_MAP: Record<string, number> = { + return: 0x0D, enter: 0x0D, tab: 0x09, space: 0x20, + backspace: 0x08,
delete: 0x2E, escape: 0x1B, esc: 0x1B, + left: 0x25, up: 0x26, right: 0x27, down: 0x28, + home: 0x24, end: 0x23, pageup: 0x21, pagedown: 0x22, + f1: 0x70, f2: 0x71, f3: 0x72, f4: 0x73, f5: 0x74, f6: 0x75, + f7: 0x76, f8: 0x77, f9: 0x78, f10: 0x79, f11: 0x7A, f12: 0x7B, + shift: 0xA0, lshift: 0xA0, rshift: 0xA1, + control: 0xA2, ctrl: 0xA2, lcontrol: 0xA2, rcontrol: 0xA3, + alt: 0xA4, option: 0xA4, lalt: 0xA4, ralt: 0xA5, + win: 0x5B, meta: 0x5B, command: 0x5B, cmd: 0x5B, super: 0x5B, + insert: 0x2D, printscreen: 0x2C, pause: 0x13, + numlock: 0x90, capslock: 0x14, scrolllock: 0x91, +} + +const MODIFIER_KEYS = new Set(['shift', 'lshift', 'rshift', 'control', 'ctrl', 'lcontrol', 'rcontrol', 'alt', 'option', 'lalt', 'ralt', 'win', 'meta', 'command', 'cmd', 'super']) + +// --------------------------------------------------------------------------- +// Implementation +// --------------------------------------------------------------------------- + +export const moveMouse: InputBackend['moveMouse'] = async (x, y, _animated) => { + ps(`${WIN32_TYPES}; [CuWin32]::SetCursorPos(${Math.round(x)}, ${Math.round(y)}) | Out-Null`) +} + +export const mouseLocation: InputBackend['mouseLocation'] = async () => { + const out = ps(`${WIN32_TYPES}; $p = New-Object CuWin32+POINT; [CuWin32]::GetCursorPos([ref]$p) | Out-Null; "$($p.X),$($p.Y)"`) + const [xStr, yStr] = out.split(',') + return { x: Number(xStr), y: Number(yStr) } +} + +export const mouseButton: InputBackend['mouseButton'] = async (button, action, count) => { + const downFlag = button === 'left' ? 'MOUSEEVENTF_LEFTDOWN' + : button === 'right' ? 'MOUSEEVENTF_RIGHTDOWN' + : 'MOUSEEVENTF_MIDDLEDOWN' + const upFlag = button === 'left' ? 'MOUSEEVENTF_LEFTUP' + : button === 'right' ? 'MOUSEEVENTF_RIGHTUP' + : 'MOUSEEVENTF_MIDDLEUP' + + if (action === 'click') { + const n = count ?? 
1 + let clicks = '' + for (let i = 0; i < n; i++) { + clicks += `$i.mi.dwFlags=[CuWin32]::${downFlag}; [CuWin32]::SendInput(1, @($i), [Runtime.InteropServices.Marshal]::SizeOf($i)) | Out-Null; $i.mi.dwFlags=[CuWin32]::${upFlag}; [CuWin32]::SendInput(1, @($i), [Runtime.InteropServices.Marshal]::SizeOf($i)) | Out-Null; ` + } + ps(`${WIN32_TYPES}; $i = New-Object CuWin32+INPUT; $i.type=[CuWin32]::INPUT_MOUSE; ${clicks}`) + } else if (action === 'press') { + ps(`${WIN32_TYPES}; $i = New-Object CuWin32+INPUT; $i.type=[CuWin32]::INPUT_MOUSE; $i.mi.dwFlags=[CuWin32]::${downFlag}; [CuWin32]::SendInput(1, @($i), [Runtime.InteropServices.Marshal]::SizeOf($i)) | Out-Null`) + } else { + ps(`${WIN32_TYPES}; $i = New-Object CuWin32+INPUT; $i.type=[CuWin32]::INPUT_MOUSE; $i.mi.dwFlags=[CuWin32]::${upFlag}; [CuWin32]::SendInput(1, @($i), [Runtime.InteropServices.Marshal]::SizeOf($i)) | Out-Null`) + } +} + +export const mouseScroll: InputBackend['mouseScroll'] = async (amount, direction) => { + const flag = direction === 'vertical' ? 'MOUSEEVENTF_WHEEL' : 'MOUSEEVENTF_HWHEEL' + ps(`${WIN32_TYPES}; $i = New-Object CuWin32+INPUT; $i.type=[CuWin32]::INPUT_MOUSE; $i.mi.dwFlags=[CuWin32]::${flag}; $i.mi.mouseData=${amount * 120}; [CuWin32]::SendInput(1, @($i), [Runtime.InteropServices.Marshal]::SizeOf($i)) | Out-Null`) +} + +export const key: InputBackend['key'] = async (keyName, action) => { + const lower = keyName.toLowerCase() + const vk = VK_MAP[lower] + const flags = action === 'release' ? 
'2' : '0' + if (vk !== undefined) { + ps(`${WIN32_TYPES}; [CuWin32]::keybd_event(${vk}, 0, ${flags}, [UIntPtr]::Zero)`) + } else if (keyName.length === 1) { + // Single character — use VkKeyScan to resolve + const charCode = keyName.charCodeAt(0) + ps(`${WIN32_TYPES}; $vk = [CuWin32]::VkKeyScan([char]${charCode}) -band 0xFF; [CuWin32]::keybd_event([byte]$vk, 0, ${flags}, [UIntPtr]::Zero)`) + } +} + +export const keys: InputBackend['keys'] = async (parts) => { + const modifiers: number[] = [] + let finalKey: string | null = null + + for (const part of parts) { + const lower = part.toLowerCase() + if (MODIFIER_KEYS.has(lower)) { + const vk = VK_MAP[lower] + if (vk !== undefined) modifiers.push(vk) + } else { + finalKey = part + } + } + if (!finalKey) return + + // Build script: press modifiers → press key → release key → release modifiers + let script = WIN32_TYPES + '; ' + for (const vk of modifiers) { + script += `[CuWin32]::keybd_event(${vk}, 0, 0, [UIntPtr]::Zero); ` + } + const lower = finalKey.toLowerCase() + const vk = VK_MAP[lower] + if (vk !== undefined) { + script += `[CuWin32]::keybd_event(${vk}, 0, 0, [UIntPtr]::Zero); [CuWin32]::keybd_event(${vk}, 0, 2, [UIntPtr]::Zero); ` + } else if (finalKey.length === 1) { + const charCode = finalKey.charCodeAt(0) + script += `$vk = [CuWin32]::VkKeyScan([char]${charCode}) -band 0xFF; [CuWin32]::keybd_event([byte]$vk, 0, 0, [UIntPtr]::Zero); [CuWin32]::keybd_event([byte]$vk, 0, 2, [UIntPtr]::Zero); ` + } + for (const mk of modifiers.reverse()) { + script += `[CuWin32]::keybd_event(${mk}, 0, 2, [UIntPtr]::Zero); ` + } + ps(script) +} + +export const typeText: InputBackend['typeText'] = async (text) => { + const escaped = text.replace(/[+^%~(){}[\]]/g, '{$&}').replace(/'/g, "''") // brace-escape SendKeys metacharacters, then double single quotes for PowerShell + ps(`Add-Type -AssemblyName System.Windows.Forms; [System.Windows.Forms.SendKeys]::SendWait('${escaped}')`) +} + +export const getFrontmostAppInfo: InputBackend['getFrontmostAppInfo'] = () => { + try { + const out = ps(`${WIN32_TYPES} +$hwnd = [CuWin32]::GetForegroundWindow()
+$procId = [uint32]0 +[CuWin32]::GetWindowThreadProcessId($hwnd, [ref]$procId) | Out-Null +$proc = Get-Process -Id $procId -ErrorAction SilentlyContinue +"$($proc.MainModule.FileName)|$($proc.ProcessName)"`) + if (!out || !out.includes('|')) return null + const [exePath, appName] = out.split('|', 2) + return { bundleId: exePath!, appName: appName! } + } catch { + return null + } +} diff --git a/packages/@ant/computer-use-input/src/index.ts b/packages/@ant/computer-use-input/src/index.ts index afb5a52ee..dc3f1063c 100644 --- a/packages/@ant/computer-use-input/src/index.ts +++ b/packages/@ant/computer-use-input/src/index.ts @@ -1,174 +1,73 @@ /** - * @ant/computer-use-input — macOS keyboard and mouse simulation implementation + * @ant/computer-use-input — cross-platform keyboard & mouse simulation * - * Implemented with native macOS tools: - * - AppleScript (osascript) — app info, keyboard input - * - CGEvent via AppleScript-ObjC bridge — mouse actions, position queries + * Platform backends: + * - darwin: AppleScript/JXA via CoreGraphics events + * - win32: PowerShell via Win32 P/Invoke (SetCursorPos, SendInput, keybd_event) * - * macOS only. Other platforms return { isSupported: false } + * Add new platforms by creating backends/.ts implementing InputBackend.
*/ -import { $ } from 'bun' +import type { FrontmostAppInfo, InputBackend } from './types.js' -interface FrontmostAppInfo { - bundleId: string - appName: string -} - -// AppleScript key code mapping -const KEY_MAP: Record<string, number> = { - return: 36, enter: 36, tab: 48, space: 49, delete: 51, backspace: 51, - escape: 53, esc: 53, - left: 123, right: 124, down: 125, up: 126, - f1: 122, f2: 120, f3: 99, f4: 118, f5: 96, f6: 97, - f7: 98, f8: 100, f9: 101, f10: 109, f11: 103, f12: 111, - home: 115, end: 119, pageup: 116, pagedown: 121, -} - -const MODIFIER_MAP: Record<string, string> = { - command: 'command down', cmd: 'command down', meta: 'command down', super: 'command down', - shift: 'shift down', - option: 'option down', alt: 'option down', - control: 'control down', ctrl: 'control down', -} - -async function osascript(script: string): Promise<string> { - const result = await $`osascript -e ${script}`.quiet().nothrow().text() - return result.trim() -} - -async function jxa(script: string): Promise<string> { - const result = await $`osascript -l JavaScript -e ${script}`.quiet().nothrow().text() - return result.trim() -} +export type { FrontmostAppInfo, InputBackend } from './types.js' -function jxaSync(script: string): string { - const result = Bun.spawnSync({ - cmd: ['osascript', '-l', 'JavaScript', '-e', script], - stdout: 'pipe', stderr: 'pipe', - }) - return new TextDecoder().decode(result.stdout).trim() -} - -function buildMouseJxa(eventType: string, x: number, y: number, btn: number, clickState?: number): string { - let script = `ObjC.import("CoreGraphics"); var p = $.CGPointMake(${x},${y}); var e = $.CGEventCreateMouseEvent(null, $.${eventType}, p, ${btn});` - if (clickState !== undefined) { - script += ` $.CGEventSetIntegerValueField(e, $.kCGMouseEventClickState, ${clickState});` - } - script += ` $.CGEventPost($.kCGHIDEventTap, e);` - return script -} - -// ---- Implementation functions ---- - -async function moveMouse(x: number, y: number, _animated: boolean): Promise<void> { - await
jxa(buildMouseJxa('kCGEventMouseMoved', x, y, 0)) -} - -async function key(keyName: string, action: 'press' | 'release'): Promise<void> { - if (action === 'release') return - const lower = keyName.toLowerCase() - const keyCode = KEY_MAP[lower] - if (keyCode !== undefined) { - await osascript(`tell application "System Events" to key code ${keyCode}`) - } else { - await osascript(`tell application "System Events" to keystroke "${keyName.length === 1 ? keyName : lower}"`) - } -} +// --------------------------------------------------------------------------- +// Platform dispatch +// --------------------------------------------------------------------------- -async function keys(parts: string[]): Promise<void> { - const modifiers: string[] = [] - let finalKey: string | null = null - for (const part of parts) { - const mod = MODIFIER_MAP[part.toLowerCase()] - if (mod) modifiers.push(mod) - else finalKey = part - } - if (!finalKey) return - const lower = finalKey.toLowerCase() - const keyCode = KEY_MAP[lower] - const modStr = modifiers.length > 0 ? ` using {${modifiers.join(', ')}}` : '' - if (keyCode !== undefined) { - await osascript(`tell application "System Events" to key code ${keyCode}${modStr}`) - } else { - await osascript(`tell application "System Events" to keystroke "${finalKey.length === 1 ?
finalKey : lower}"${modStr}`) +function loadBackend(): InputBackend | null { + try { + switch (process.platform) { + case 'darwin': + return require('./backends/darwin.js') as InputBackend + case 'win32': + return require('./backends/win32.js') as InputBackend + case 'linux': + return require('./backends/linux.js') as InputBackend + default: + return null + } + } catch { + return null } } -async function mouseLocation(): Promise<{ x: number; y: number }> { - const result = await jxa('ObjC.import("CoreGraphics"); var e = $.CGEventCreate(null); var p = $.CGEventGetLocation(e); p.x + "," + p.y') - const [xStr, yStr] = result.split(',') - return { x: Math.round(Number(xStr)), y: Math.round(Number(yStr)) } -} - -async function mouseButton( - button: 'left' | 'right' | 'middle', - action: 'click' | 'press' | 'release', - count?: number, -): Promise<void> { - const pos = await mouseLocation() - const btn = button === 'left' ? 0 : button === 'right' ? 1 : 2 - const downType = btn === 0 ? 'kCGEventLeftMouseDown' : btn === 1 ? 'kCGEventRightMouseDown' : 'kCGEventOtherMouseDown' - const upType = btn === 0 ? 'kCGEventLeftMouseUp' : btn === 1 ? 'kCGEventRightMouseUp' : 'kCGEventOtherMouseUp' +const backend = loadBackend() - if (action === 'click') { - for (let i = 0; i < (count ?? 1); i++) { - await jxa(buildMouseJxa(downType, pos.x, pos.y, btn, i + 1)) - await jxa(buildMouseJxa(upType, pos.x, pos.y, btn, i + 1)) - } - } else if (action === 'press') { - await jxa(buildMouseJxa(downType, pos.x, pos.y, btn)) - } else { - await jxa(buildMouseJxa(upType, pos.x, pos.y, btn)) - } -} +// --------------------------------------------------------------------------- +// Unsupported stub (throws on call — guards via isSupported check) +// --------------------------------------------------------------------------- -async function mouseScroll(amount: number, direction: 'vertical' | 'horizontal'): Promise<void> { - const script = direction === 'vertical' - ?
`ObjC.import("CoreGraphics"); var e = $.CGEventCreateScrollWheelEvent(null, 0, 1, ${amount}); $.CGEventPost($.kCGHIDEventTap, e);` - : `ObjC.import("CoreGraphics"); var e = $.CGEventCreateScrollWheelEvent(null, 0, 2, 0, ${amount}); $.CGEventPost($.kCGHIDEventTap, e);` - await jxa(script) +function unsupported(): never { + throw new Error(`computer-use-input is not supported on ${process.platform}`) } -async function typeText(text: string): Promise<void> { - const escaped = text.replace(/\\/g, '\\\\').replace(/"/g, '\\"') - await osascript(`tell application "System Events" to keystroke "${escaped}"`) -} +// --------------------------------------------------------------------------- +// Public API — matches the original export surface +// --------------------------------------------------------------------------- -function getFrontmostAppInfo(): FrontmostAppInfo | null { - try { - const result = Bun.spawnSync({ - cmd: ['osascript', '-e', ` - tell application "System Events" - set frontApp to first application process whose frontmost is true - set appName to name of frontApp - set bundleId to bundle identifier of frontApp - return bundleId & "|" & appName - end tell - `], - stdout: 'pipe', - stderr: 'pipe', - }) - const output = new TextDecoder().decode(result.stdout).trim() - if (!output || !output.includes('|')) return null - const [bundleId, appName] = output.split('|', 2) - return { bundleId: bundleId!, appName: appName! } - } catch { - return null - } -} +export const isSupported = backend !== null -// ---- Exports ---- +export const moveMouse = backend?.moveMouse ?? unsupported +export const key = backend?.key ?? unsupported +export const keys = backend?.keys ?? unsupported +export const mouseLocation = backend?.mouseLocation ?? unsupported +export const mouseButton = backend?.mouseButton ?? unsupported +export const mouseScroll = backend?.mouseScroll ?? unsupported +export const typeText = backend?.typeText ??
unsupported +export const getFrontmostAppInfo = backend?.getFrontmostAppInfo ?? (() => null) +// Legacy class type — used by inputLoader.ts for type narrowing export class ComputerUseInputAPI { - declare moveMouse: (x: number, y: number, animated: boolean) => Promise<void> - declare key: (key: string, action: 'press' | 'release') => Promise<void> - declare keys: (parts: string[]) => Promise<void> - declare mouseLocation: () => Promise<{ x: number; y: number }> - declare mouseButton: (button: 'left' | 'right' | 'middle', action: 'click' | 'press' | 'release', count?: number) => Promise<void> - declare mouseScroll: (amount: number, direction: 'vertical' | 'horizontal') => Promise<void> - declare typeText: (text: string) => Promise<void> - declare getFrontmostAppInfo: () => FrontmostAppInfo | null + declare moveMouse: InputBackend['moveMouse'] + declare key: InputBackend['key'] + declare keys: InputBackend['keys'] + declare mouseLocation: InputBackend['mouseLocation'] + declare mouseButton: InputBackend['mouseButton'] + declare mouseScroll: InputBackend['mouseScroll'] + declare typeText: InputBackend['typeText'] + declare getFrontmostAppInfo: InputBackend['getFrontmostAppInfo'] declare isSupported: true } @@ -177,7 +76,3 @@ interface ComputerUseInputUnsupported { } export type ComputerUseInput = ComputerUseInputAPI | ComputerUseInputUnsupported - -// Plain object with all methods as own properties — compatible with require() -export const isSupported = process.platform === 'darwin' -export { moveMouse, key, keys, mouseLocation, mouseButton, mouseScroll, typeText, getFrontmostAppInfo } diff --git a/packages/@ant/computer-use-input/src/types.ts b/packages/@ant/computer-use-input/src/types.ts new file mode 100644 index 000000000..ec80708b9 --- /dev/null +++ b/packages/@ant/computer-use-input/src/types.ts @@ -0,0 +1,19 @@ +export interface FrontmostAppInfo { + bundleId: string // macOS: bundle ID, Windows: exe path + appName: string +} + +export interface InputBackend { + moveMouse(x: number, y: number,
animated: boolean): Promise<void> + key(key: string, action: 'press' | 'release'): Promise<void> + keys(parts: string[]): Promise<void> + mouseLocation(): Promise<{ x: number; y: number }> + mouseButton( + button: 'left' | 'right' | 'middle', + action: 'click' | 'press' | 'release', + count?: number, + ): Promise<void> + mouseScroll(amount: number, direction: 'vertical' | 'horizontal'): Promise<void> + typeText(text: string): Promise<void> + getFrontmostAppInfo(): FrontmostAppInfo | null +} diff --git a/packages/@ant/computer-use-mcp/src/deniedApps.ts b/packages/@ant/computer-use-mcp/src/deniedApps.ts new file mode 100644 index 000000000..92f14e0b1 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/deniedApps.ts @@ -0,0 +1,553 @@ +/** + * App category lookup for tiered CU permissions. Three categories land at a + * restricted tier instead of `"full"`: + * + * - **browser** → `"read"` tier — visible in screenshots, NO interaction. + * The model can read an already-open page but must use the Claude-in-Chrome + * MCP for navigation/clicking/typing. + * - **terminal** → `"click"` tier — visible + clickable, NO typing. The + * model can click a Run button or scroll test output in an IDE, but can't + * type into the integrated terminal. Use the Bash tool for shell work. + * - **trading** → `"read"` tier — same restrictions as browsers, but no + * CiC-MCP alternative exists. For platforms where a stray click can + * execute a trade or send a message to a counterparty. + * + * Uncategorized apps default to `"full"`. See `getDefaultTierForApp`. + * + * Identification is two-layered: + * 1. Bundle ID match (macOS-only; `InstalledApp.bundleId` is a + * CFBundleIdentifier and meaningless on Windows). Fast, exact, the + * primary mechanism while CU is darwin-gated. + * 2. Display-name substring match (cross-platform fallback). Catches + * unresolved requests ("Chrome" when Chrome isn't installed) AND will + * be the primary mechanism on Windows/Linux where there's no bundle ID.
+ * Windows-relevant names (PowerShell, cmd, Windows Terminal) are + * included now so they activate the moment the darwin gate lifts. + * + * Keep this file **import-free** (like sentinelApps.ts) — the renderer may + * import it via a package.json subpath export, and pulling in + * `@modelcontextprotocol/sdk` (a devDep) through the index → mcpServer chain + * would fail module resolution in Next.js. The `CuAppPermTier` type is + * duplicated as a string literal below rather than imported. + */ + +export type DeniedCategory = "browser" | "terminal" | "trading"; + +/** + * Map a category to its hardcoded tier. Return-type is the string-literal + * union inline (this file is import-free; see header comment). The + * authoritative type is `CuAppPermTier` in types.ts — keep in sync. + * + * Not bijective — both `"browser"` and `"trading"` map to `"read"`. Copy + * that differs by category (the "use CiC" hint is browser-only) must check + * the category, not just the tier. + */ +export function categoryToTier( + category: DeniedCategory | null, +): "read" | "click" | "full" { + if (category === "browser" || category === "trading") return "read"; + if (category === "terminal") return "click"; + return "full"; +} + +// ─── Bundle-ID deny sets (macOS) ───────────────────────────────────────── + +const BROWSER_BUNDLE_IDS: ReadonlySet<string> = new Set([ + // Apple + "com.apple.Safari", + "com.apple.SafariTechnologyPreview", + // Google + "com.google.Chrome", + "com.google.Chrome.beta", + "com.google.Chrome.dev", + "com.google.Chrome.canary", + // Microsoft + "com.microsoft.edgemac", + "com.microsoft.edgemac.Beta", + "com.microsoft.edgemac.Dev", + "com.microsoft.edgemac.Canary", + // Mozilla + "org.mozilla.firefox", + "org.mozilla.firefoxdeveloperedition", + "org.mozilla.nightly", + // Chromium-based + "org.chromium.Chromium", + "com.brave.Browser", + "com.brave.Browser.beta", + "com.brave.Browser.nightly", + "com.operasoftware.Opera", + "com.operasoftware.OperaGX", +
"com.operasoftware.OperaDeveloper", + "com.vivaldi.Vivaldi", + // The Browser Company + "company.thebrowser.Browser", // Arc + "company.thebrowser.dia", // Dia (agentic) + // Privacy-focused + "org.torproject.torbrowser", + "com.duckduckgo.macos.browser", + "ru.yandex.desktop.yandex-browser", + // Agentic / AI browsers — newer entrants with LLM integrations + "ai.perplexity.comet", + "com.sigmaos.sigmaos.macos", // SigmaOS + // Webkit-based misc + "com.kagi.kagimacOS", // Orion +]); + +/** + * Terminals + IDEs with integrated terminals. Supersets + * `SHELL_ACCESS_BUNDLE_IDS` from sentinelApps.ts — terminals proceed to the + * approval dialog at tier "click", and the sentinel warning renders + * alongside the tier badge. + */ +const TERMINAL_BUNDLE_IDS: ReadonlySet = new Set([ + // Dedicated terminals + "com.apple.Terminal", + "com.googlecode.iterm2", + "dev.warp.Warp-Stable", + "dev.warp.Warp-Beta", + "com.github.wez.wezterm", + "org.alacritty", + "io.alacritty", // pre-v0.11.0 (renamed 2022-07) — kept for legacy installs + "net.kovidgoyal.kitty", + "co.zeit.hyper", + "com.mitchellh.ghostty", + "org.tabby", + "com.termius-dmg.mac", // Termius + // IDEs with integrated terminals — we can't distinguish "type in the + // editor" from "type in the integrated terminal" via screenshot+click. 
+ // VS Code family + "com.microsoft.VSCode", + "com.microsoft.VSCodeInsiders", + "com.vscodium", // VSCodium + "com.todesktop.230313mzl4w4u92", // Cursor + "com.exafunction.windsurf", // Windsurf / Codeium + "dev.zed.Zed", + "dev.zed.Zed-Preview", + // JetBrains family (all have integrated terminals) + "com.jetbrains.intellij", + "com.jetbrains.intellij.ce", + "com.jetbrains.pycharm", + "com.jetbrains.pycharm.ce", + "com.jetbrains.WebStorm", + "com.jetbrains.CLion", + "com.jetbrains.goland", + "com.jetbrains.rubymine", + "com.jetbrains.PhpStorm", + "com.jetbrains.datagrip", + "com.jetbrains.rider", + "com.jetbrains.AppCode", + "com.jetbrains.rustrover", + "com.jetbrains.fleet", + "com.google.android.studio", // Android Studio (JetBrains-based) + // Other IDEs + "com.axosoft.gitkraken", // GitKraken has an integrated terminal panel. Also keeps the "kraken" trading-substring from miscategorizing it — bundle-ID wins. + "com.sublimetext.4", + "com.sublimetext.3", + "org.vim.MacVim", + "com.neovim.neovim", + "org.gnu.Emacs", + // Xcode's previous carve-out (full tier for Interface Builder / simulator) + // was reversed — at tier "click" IB and simulator taps still work (both are + // plain clicks) while the integrated terminal is blocked from keyboard input. + "com.apple.dt.Xcode", + "org.eclipse.platform.ide", + "org.netbeans.ide", + "com.microsoft.visual-studio", // Visual Studio for Mac + // AppleScript/automation execution surfaces — same threat as terminals: + // type(script) → key("cmd+r") runs arbitrary code. Added after #28011 + // removed the osascript MCP server, making CU the only tool-call route + // to AppleScript. + "com.apple.ScriptEditor2", + "com.apple.Automator", + "com.apple.shortcuts", +]); + +/** + * Trading / crypto platforms — granted at tier `"read"` so the agent can see + * balances and prices but can't click into an order, transfer, or IB chat. 
+ * Bundle IDs populated from Homebrew cask `uninstall.quit` stanzas as they're + * verified; the name-substring fallback below is the primary check. Bloomberg + * Terminal has no native macOS build per their FAQ (web/Citrix only). + * + * Budgeting/accounting apps (Quicken, YNAB, QuickBooks, etc.) are NOT listed + * here — they default to tier `"full"`. The risk model for brokerage/crypto + * (a stray click can execute a trade) doesn't apply to budgeting apps; the + * Cowork system prompt carries the soft instruction to never execute trades + * or transfer money on the user's behalf. + */ +const TRADING_BUNDLE_IDS: ReadonlySet<string> = new Set([ + // Verified via Homebrew quit/zap stanzas + mdls + electron-builder source. + // Trading + "com.webull.desktop.v1", // Webull (direct download, Qt) + "com.webull.trade.mac.v1", // Webull (Mac App Store) + "com.tastytrade.desktop", + "com.tradingview.tradingviewapp.desktop", + "com.fidelity.activetrader", // Fidelity Trader+ (new) + "com.fmr.activetrader", // Fidelity Active Trader Pro (legacy) + // Interactive Brokers TWS — install4j wrapper; Homebrew quit stanza is + // authoritative for this exact value but install4j IDs can drift across + // major versions — name-substring "trader workstation" is the fallback. + "com.install4j.5889-6375-8446-2021", + // Crypto + "com.binance.BinanceDesktop", + "com.electron.exodus", + // Electrum uses PyInstaller with bundle_identifier=None → defaults to + // org.pythonmac.unspecified.. Confirmed in spesmilo/electrum + // source + Homebrew zap. IntuneBrew's "org.electrum.electrum" is a fork. + "org.pythonmac.unspecified.Electrum", + "com.ledger.live", + "io.trezor.TrezorSuite", + // No native macOS app (name-substring only): Schwab, E*TRADE, TradeStation, + // Robinhood, NinjaTrader, Coinbase, Kraken, Bloomberg. thinkorswim + // install4j ID drifts per-install — substring safer.
+]); + +// ─── Policy-deny (not a tier — cannot be granted at all) ───────────────── +// +// Streaming / ebook / music apps and a handful of publisher apps. These +// are auto-denied before the approval dialog — no tier can be granted. +// Rationale is copyright / content-control (the agent has no legitimate +// need to screenshot Netflix or click Play on Spotify). +// +// Sourced from the ACP CU-apps blocklist xlsx ("Full block" tab). See +// /tmp/extract_cu_blocklist.py for the extraction script. + +const POLICY_DENIED_BUNDLE_IDS: ReadonlySet = new Set([ + // Verified via Homebrew quit/zap + mdls /System/Applications + IntuneBrew. + // Apple built-ins + "com.apple.TV", + "com.apple.Music", + "com.apple.iBooksX", + "com.apple.podcasts", + // Music + "com.spotify.client", + "com.amazon.music", + "com.tidal.desktop", + "com.deezer.deezer-desktop", + "com.pandora.desktop", + "com.electron.pocket-casts", // direct-download Electron wrapper + "au.com.shiftyjelly.PocketCasts", // Mac App Store + // Video + "tv.plex.desktop", + "tv.plex.htpc", + "tv.plex.plexamp", + "com.amazon.aiv.AIVApp", // Prime Video (iOS-on-Apple-Silicon) + // Ebooks + "net.kovidgoyal.calibre", + "com.amazon.Kindle", // legacy desktop, discontinued + "com.amazon.Lassen", // current Mac App Store (iOS-on-Mac) + "com.kobo.desktop.Kobo", + // No native macOS app (name-substring only): Netflix, Disney+, Hulu, + // HBO Max, Peacock, Paramount+, YouTube, Crunchyroll, Tubi, Vudu, + // Audible, Reddit, NYTimes. Their iOS apps don't opt into iPad-on-Mac. +]); + +const POLICY_DENIED_NAME_SUBSTRINGS: readonly string[] = [ + // Video streaming + "netflix", + "disney+", + "hulu", + "prime video", + "apple tv", + "peacock", + "paramount+", + // "plex" is too generic — would match "Perplexity". Covered by + // tv.plex.* bundle IDs on macOS. 
+ "tubi", + "crunchyroll", + "vudu", + // E-readers / audiobooks + "kindle", + "apple books", + "kobo", + "play books", + "calibre", + "libby", + "readium", + "audible", + "libro.fm", + "speechify", + // Music + "spotify", + "apple music", + "amazon music", + "youtube music", + "tidal", + "deezer", + "pandora", + "pocket casts", + // Publisher / social apps (from the same blocklist tab) + "naver", + "reddit", + "sony music", + "vegas pro", + "pitchfork", + "economist", + "nytimes", + // Skipped (too generic for substring matching — need bundle ID): + // HBO Max / Max, YouTube (non-Music), Nook, Sony Catalyst, Wired +]; + +/** + * Policy-level auto-deny. Unlike `userDeniedBundleIds` (per-user Settings + * page), this is baked into the build. `buildAccessRequest` strips these + * before the approval dialog with "blocked by policy" guidance; the agent + * is told to not retry. + */ +export function isPolicyDenied( + bundleId: string | undefined, + displayName: string, +): boolean { + if (bundleId && POLICY_DENIED_BUNDLE_IDS.has(bundleId)) return true; + const lower = displayName.toLowerCase(); + for (const sub of POLICY_DENIED_NAME_SUBSTRINGS) { + if (lower.includes(sub)) return true; + } + return false; +} + +export function getDeniedCategory(bundleId: string): DeniedCategory | null { + if (BROWSER_BUNDLE_IDS.has(bundleId)) return "browser"; + if (TERMINAL_BUNDLE_IDS.has(bundleId)) return "terminal"; + if (TRADING_BUNDLE_IDS.has(bundleId)) return "trading"; + return null; +} + +// ─── Display-name fallback (cross-platform) ────────────────────────────── + +/** + * Lowercase substrings checked against the requested display name. Catches: + * - Unresolved requests (app not installed, Spotlight miss) + * - Future Windows/Linux support where bundleId is meaningless + * + * Matched via `.includes()` on `name.toLowerCase()`. 
Entry order within + * a list does not affect the result (every entry in a list maps to the same + * category, and we return on first match); entries are grouped by category + * for readability. + */ +const BROWSER_NAME_SUBSTRINGS: readonly string[] = [ + "safari", + "chrome", + "firefox", + "microsoft edge", + "brave", + "opera", + "vivaldi", + "chromium", + // Arc/Dia: the canonical display name is just "Arc"/"Dia" — too short for + // substring matching (false-positives: "Arcade", "Diagram"). Covered by + // bundle ID on macOS. The "... browser" entries below catch natural-language + // phrasings ("the arc browser") but NOT the canonical short name. + "arc browser", + "tor browser", + "duckduckgo", + "yandex", + "orion browser", + // Agentic / AI browsers + "comet", // Perplexity's browser — "Comet" substring risks false positives + // but leaving for now; "comet" in an app name is rare + "sigmaos", + "dia browser", +]; + +const TERMINAL_NAME_SUBSTRINGS: readonly string[] = [ + // macOS / cross-platform terminals + "terminal", // catches Terminal, Windows Terminal (NOT iTerm — separate entry) + "iterm", + "wezterm", + "alacritty", + "kitty", + "ghostty", + "tabby", + "termius", + // AppleScript runners — see bundle-ID comment above. "shortcuts" is too + // generic for substring matching (many apps have "shortcuts" in the name); + // covered by bundle ID only, like warp/hyper. + "script editor", + "automator", + // NOTE: "warp" and "hyper" are too generic for substring matching — + // they'd false-positive on "Warpaint" or "Hyperion". Covered by bundle ID + // (dev.warp.Warp-Stable, co.zeit.hyper) for macOS; Windows exe-name + // matching can be added when Windows CU ships.
+ // Windows shells (activate when the darwin gate lifts) + "powershell", + "cmd.exe", + "command prompt", + "git bash", + "conemu", + "cmder", + // IDEs (VS Code family) + "visual studio code", + "visual studio", // catches VS for Mac + Windows + "vscode", + "vs code", + "vscodium", + "cursor", // Cursor IDE — "cursor" is generic but IDE is the only common app + "windsurf", + // Zed: display name is just "Zed" — too short for substring matching + // (false-positives). Covered by bundle ID (dev.zed.Zed) on macOS. + // IDEs (JetBrains family) + "intellij", + "pycharm", + "webstorm", + "clion", + "goland", + "rubymine", + "phpstorm", + "datagrip", + "rider", + "appcode", + "rustrover", + "fleet", + "android studio", + // Other IDEs + "sublime text", + "macvim", + "neovim", + "emacs", + "xcode", + "eclipse", + "netbeans", +]; + +const TRADING_NAME_SUBSTRINGS: readonly string[] = [ + // Trading — brokerage apps. Sourced from the ACP CU-apps blocklist xlsx + // ("Read Only" tab). Name-substring safe for proper nouns below; generic + // names (IG, Delta, HTX) are skipped and need bundle-ID matching once + // verified. + "bloomberg", + "ameritrade", + "thinkorswim", + "schwab", + "fidelity", + "e*trade", + "interactive brokers", + "trader workstation", // Interactive Brokers TWS + "tradestation", + "webull", + "robinhood", + "tastytrade", + "ninjatrader", + "tradingview", + "moomoo", + "tradezero", + "prorealtime", + "plus500", + "saxotrader", + "oanda", + "metatrader", + "forex.com", + "avaoptions", + "ctrader", + "jforex", + "iq option", + "olymp trade", + "binomo", + "pocket option", + "raceoption", + "expertoption", + "quotex", + "naga", + "morgan stanley", + "ubs neo", + "eikon", // Thomson Reuters / LSEG Workspace + // Crypto — exchanges, wallets, portfolio trackers + "coinbase", + "kraken", + "binance", + "okx", + "bybit", + // "gate.io" is too generic — the ".io" TLD suffix is common in app names + // (e.g., "Draw.io"). Needs bundle-ID matching once verified. 
+ "phemex", + "stormgain", + "crypto.com", + // "exodus" is too generic — it's a common noun and would match unrelated + // apps/games. Needs bundle-ID matching once verified. + "electrum", + "ledger live", + "trezor", + "guarda", + "atomic wallet", + "bitpay", + "bisq", + "koinly", + "cointracker", + "blockfi", + "stripe cli", + // Crypto games / metaverse (same trade-execution risk model) + "decentraland", + "axie infinity", + "gods unchained", +]; + +/** + * Display-name substring match. Called when bundle-ID resolution returned + * nothing (`resolved === undefined`) or when no bundle-ID deny-list entry + * matched. Returns the category for the first matching substring, or null. + * + * Case-insensitive, substring — so `"Google Chrome"`, `"chrome"`, and + * `"Chrome Canary"` all match the `"chrome"` entry. + */ +export function getDeniedCategoryByDisplayName( + name: string, +): DeniedCategory | null { + const lower = name.toLowerCase(); + // Trading first — proper-noun-only set, most specific. "Bloomberg Terminal" + // contains "terminal" and would miscategorize if TERMINAL_NAME_SUBSTRINGS + // ran first. + for (const sub of TRADING_NAME_SUBSTRINGS) { + if (lower.includes(sub)) return "trading"; + } + for (const sub of BROWSER_NAME_SUBSTRINGS) { + if (lower.includes(sub)) return "browser"; + } + for (const sub of TERMINAL_NAME_SUBSTRINGS) { + if (lower.includes(sub)) return "terminal"; + } + return null; +} + +/** + * Combined check — bundle ID first (exact, fast), then display-name + * fallback. This is the function tool-call handlers should use. + * + * `bundleId` may be undefined (unresolved request — model asked for an app + * that isn't installed or Spotlight didn't find). In that case only the + * display-name check runs. 
+ */ +export function getDeniedCategoryForApp( + bundleId: string | undefined, + displayName: string, +): DeniedCategory | null { + if (bundleId) { + const byId = getDeniedCategory(bundleId); + if (byId) return byId; + } + return getDeniedCategoryByDisplayName(displayName); +} + +/** + * Default tier for an app at grant time. Wraps `getDeniedCategoryForApp` + + * `categoryToTier`. Browsers and trading platforms → `"read"`, + * terminals/IDEs → `"click"`, everything else → `"full"`. + * + * Called by `buildAccessRequest` to populate `ResolvedAppRequest.proposedTier` + * before the approval dialog shows. + */ +export function getDefaultTierForApp( + bundleId: string | undefined, + displayName: string, +): "read" | "click" | "full" { + return categoryToTier(getDeniedCategoryForApp(bundleId, displayName)); +} + +export const _test = { + BROWSER_BUNDLE_IDS, + TERMINAL_BUNDLE_IDS, + TRADING_BUNDLE_IDS, + POLICY_DENIED_BUNDLE_IDS, + BROWSER_NAME_SUBSTRINGS, + TERMINAL_NAME_SUBSTRINGS, + TRADING_NAME_SUBSTRINGS, + POLICY_DENIED_NAME_SUBSTRINGS, +}; diff --git a/packages/@ant/computer-use-mcp/src/executor.ts b/packages/@ant/computer-use-mcp/src/executor.ts new file mode 100644 index 000000000..8092c68e9 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/executor.ts @@ -0,0 +1,111 @@ +export interface DisplayGeometry { + displayId: number + width: number + height: number + scaleFactor: number + originX: number + originY: number +} + +export interface ScreenshotResult { + base64: string + width: number + height: number + displayWidth: number + displayHeight: number + originX: number + originY: number + displayId?: number +} + +export interface FrontmostApp { + bundleId: string + displayName: string +} + +export interface InstalledApp { + bundleId: string + displayName: string + path: string + iconDataUrl?: string +} + +export interface RunningApp { + bundleId: string + displayName: string + pid?: number +} + +export interface ResolvePrepareCaptureResult extends ScreenshotResult { + hidden:
string[] + activated?: string + displayId: number +} + +export interface ComputerExecutorCapabilities { + screenshotFiltering: 'native' | 'none' + platform: 'darwin' | 'win32' + hostBundleId: string +} + +export interface ComputerExecutor { + capabilities: ComputerExecutorCapabilities + prepareForAction( + allowlistBundleIds: string[], + displayId?: number, + ): Promise + previewHideSet( + allowlistBundleIds: string[], + displayId?: number, + ): Promise> + getDisplaySize(displayId?: number): Promise + listDisplays(): Promise + findWindowDisplays( + bundleIds: string[], + ): Promise> + resolvePrepareCapture(opts: { + allowedBundleIds: string[] + preferredDisplayId?: number + autoResolve: boolean + doHide?: boolean + }): Promise + screenshot(opts: { + allowedBundleIds: string[] + displayId?: number + }): Promise + zoom( + regionLogical: { x: number; y: number; w: number; h: number }, + allowedBundleIds: string[], + displayId?: number, + ): Promise<{ base64: string; width: number; height: number }> + key(keySequence: string, repeat?: number): Promise + holdKey(keyNames: string[], durationMs: number): Promise + type(text: string, opts: { viaClipboard: boolean }): Promise + readClipboard(): Promise + writeClipboard(text: string): Promise + moveMouse(x: number, y: number): Promise + click( + x: number, + y: number, + button: 'left' | 'right' | 'middle', + count: 1 | 2 | 3, + modifiers?: string[], + ): Promise + mouseDown(): Promise + mouseUp(): Promise + getCursorPosition(): Promise<{ x: number; y: number }> + drag( + from: { x: number; y: number } | undefined, + to: { x: number; y: number }, + ): Promise + scroll(x: number, y: number, dx: number, dy: number): Promise + getFrontmostApp(): Promise + appUnderPoint( + x: number, + y: number, + ): Promise<{ bundleId: string; displayName: string } | null> + listInstalledApps(): Promise + getAppIcon(path: string): Promise + listRunningApps(): Promise + openApp(bundleId: string): Promise +} diff --git 
a/packages/@ant/computer-use-mcp/src/imageResize.ts b/packages/@ant/computer-use-mcp/src/imageResize.ts new file mode 100644 index 000000000..fc529714c --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/imageResize.ts @@ -0,0 +1,108 @@ +/** + * Port of the API's image transcoder target-size algorithm. Pre-sizing + * screenshots to this function's output means the API's early-return fires + * (tokens ≤ max) and the image is NOT resized server-side — so the model + * sees exactly the dimensions in `ScreenshotResult.width/height` and + * `scaleCoord` stays coherent. + * + * Rust reference: api/api/image_transcoder/rust_transcoder/src/utils/resize.rs + * Sibling TS port: apps/claude-browser-use/src/utils/imageResize.ts (identical + * algorithm, lives in the Chrome extension tree — not a shared package). + * + * See COORDINATES.md for why this matters for click accuracy. + */ + +export interface ResizeParams { + pxPerToken: number; + maxTargetPx: number; + maxTargetTokens: number; +} + +/** + * Production defaults — match `resize.rs:160-164` and Chrome's + * `CDPService.ts:638-642`. Vision encoder uses 28px tiles; 1568 is both + * the long-edge cap (56 tiles) AND the token budget. + */ +export const API_RESIZE_PARAMS: ResizeParams = { + pxPerToken: 28, + maxTargetPx: 1568, + maxTargetTokens: 1568, +}; + +/** ceil(px / pxPerToken). Matches resize.rs:74-76 (which uses integer ceil-div). */ +export function nTokensForPx(px: number, pxPerToken: number): number { + return Math.floor((px - 1) / pxPerToken) + 1; +} + +function nTokensForImg( + width: number, + height: number, + pxPerToken: number, +): number { + return nTokensForPx(width, pxPerToken) * nTokensForPx(height, pxPerToken); +} + +/** + * Binary-search along the width dimension for the largest image that: + * - preserves the input aspect ratio + * - has long edge ≤ maxTargetPx + * - has ceil(w/pxPerToken) × ceil(h/pxPerToken) ≤ maxTargetTokens + * + * Returns [width, height]. 
No-op if input already satisfies all three. + * + * The long-edge constraint alone (what we used to use) is insufficient on + * squarer-than-16:9 displays: 1568×1014 (MBP 16" AR) is 56×37 = 2072 tokens, + * over budget, and gets server-resized to 1372×887 — model then clicks in + * 1372-space but scaleCoord assumed 1568-space → ~14% coord error. + * + * Matches resize.rs:91-155 exactly (verified against its test vectors). + */ +export function targetImageSize( + width: number, + height: number, + params: ResizeParams, +): [number, number] { + const { pxPerToken, maxTargetPx, maxTargetTokens } = params; + + if ( + width <= maxTargetPx && + height <= maxTargetPx && + nTokensForImg(width, height, pxPerToken) <= maxTargetTokens + ) { + return [width, height]; + } + + // Normalize to landscape for the search; transpose result back. + if (height > width) { + const [w, h] = targetImageSize(height, width, params); + return [h, w]; + } + + const aspectRatio = width / height; + + // Loop invariant: lowerBoundWidth is always valid, upperBoundWidth is + // always invalid. ~12 iterations for a 4000px image. 
+ let upperBoundWidth = width; + let lowerBoundWidth = 1; + + for (;;) { + if (lowerBoundWidth + 1 === upperBoundWidth) { + return [ + lowerBoundWidth, + Math.max(Math.round(lowerBoundWidth / aspectRatio), 1), + ]; + } + + const middleWidth = Math.floor((lowerBoundWidth + upperBoundWidth) / 2); + const middleHeight = Math.max(Math.round(middleWidth / aspectRatio), 1); + + if ( + middleWidth <= maxTargetPx && + nTokensForImg(middleWidth, middleHeight, pxPerToken) <= maxTargetTokens + ) { + lowerBoundWidth = middleWidth; + } else { + upperBoundWidth = middleWidth; + } + } +} diff --git a/packages/@ant/computer-use-mcp/src/index.ts b/packages/@ant/computer-use-mcp/src/index.ts index b35f1ef74..1e012cb2d 100644 --- a/packages/@ant/computer-use-mcp/src/index.ts +++ b/packages/@ant/computer-use-mcp/src/index.ts @@ -1,163 +1,69 @@ -/** - * @ant/computer-use-mcp — Stub implementation - * - * Provides a type-safe stub; all functions return reasonable default values. - * Not actually called when feature('CHICAGO_MCP') = false, - * but ensures imports do not fail and types are correct. - */ - -import type { - ComputerUseHostAdapter, - CoordinateMode, - GrantFlags, - Logger, -} from './types' +export type { + ComputerExecutor, + DisplayGeometry, + FrontmostApp, + InstalledApp, + ResolvePrepareCaptureResult, + RunningApp, + ScreenshotResult, +} from "./executor.js"; -// Re-export types from types.ts -export type { CoordinateMode, Logger } from './types' export type { - ComputerUseConfig, + AppGrant, + CuAppPermTier, ComputerUseHostAdapter, + ComputerUseOverrides, + ComputerUseSessionContext, + CoordinateMode, + CuGrantFlags, CuPermissionRequest, CuPermissionResponse, CuSubGates, -} from './types' -export { DEFAULT_GRANT_FLAGS } from './types' - -// --------------------------------------------------------------------------- -// Types (defined here for callers that import from the main entry) -// --------------------------------------------------------------------------- - -export interface DisplayGeometry { - width: number - height: number - displayId?: number - originX?: number -
originY?: number -} - -export interface FrontmostApp { - bundleId: string - displayName: string -} - -export interface InstalledApp { - bundleId: string - displayName: string - path: string -} - -export interface RunningApp { - bundleId: string - displayName: string -} - -export interface ScreenshotResult { - base64: string - width: number - height: number -} - -export type ResolvePrepareCaptureResult = ScreenshotResult - -export interface ScreenshotDims { - width: number - height: number - displayWidth: number - displayHeight: number - displayId: number - originX: number - originY: number -} - -export interface CuCallToolResultContent { - type: 'image' | 'text' - data?: string - mimeType?: string - text?: string -} - -export interface CuCallToolResult { - content: CuCallToolResultContent[] - telemetry: { - error_kind?: string - [key: string]: unknown - } -} - -export type ComputerUseSessionContext = Record - -// --------------------------------------------------------------------------- -// API_RESIZE_PARAMS — default screenshot resize parameters -// --------------------------------------------------------------------------- - -export const API_RESIZE_PARAMS = { - maxWidth: 1280, - maxHeight: 800, - maxPixels: 1280 * 800, -} - -// --------------------------------------------------------------------------- -// ComputerExecutor — stub class -// --------------------------------------------------------------------------- - -export class ComputerExecutor { - capabilities: Record = {} -} - -// --------------------------------------------------------------------------- -// Functions — stubs that return reasonable default values -// --------------------------------------------------------------------------- - -/** - * Computes the target screenshot size. - * Picks the best size between the physical width/height and the API limits. - */ -export function targetImageSize( - physW: number, - physH: number, - _params?: typeof API_RESIZE_PARAMS, -): [number, number] { - const maxW = _params?.maxWidth ?? 1280 - const maxH = _params?.maxHeight ??
800 - const scale = Math.min(1, maxW / physW, maxH / physH) - return [Math.round(physW * scale), Math.round(physH * scale)] -} - -/** - * Binds the session context and returns the tool dispatch function. - * The stub returns a dispatcher that always returns an empty result. - */ -export function bindSessionContext( - _adapter: ComputerUseHostAdapter, - _coordinateMode: CoordinateMode, - _ctx: ComputerUseSessionContext, -): (name: string, args: unknown) => Promise { - return async (_name: string, _args: unknown) => ({ - content: [], - telemetry: {}, - }) -} - -/** - * Builds the list of Computer Use tool definitions. - * The stub returns an empty array (no tools). - */ -export function buildComputerUseTools( - _capabilities?: Record, - _coordinateMode?: CoordinateMode, - _installedAppNames?: string[], -): Array<{ name: string; description: string; inputSchema: Record }> { - return [] -} - -/** - * Creates the Computer Use MCP server. - * The stub returns null (service not enabled). - */ -export function createComputerUseMcpServer( - _adapter?: ComputerUseHostAdapter, - _coordinateMode?: CoordinateMode, -): null { - return null -} + CuTeachPermissionRequest, + Logger, + ResolvedAppRequest, + ScreenshotDims, + TeachStepRequest, + TeachStepResult, +} from "./types.js"; + +export { DEFAULT_GRANT_FLAGS } from "./types.js"; + +export { + SENTINEL_BUNDLE_IDS, + getSentinelCategory, +} from "./sentinelApps.js"; +export type { SentinelCategory } from "./sentinelApps.js"; + +export { + categoryToTier, + getDefaultTierForApp, + getDeniedCategory, + getDeniedCategoryByDisplayName, + getDeniedCategoryForApp, + isPolicyDenied, +} from "./deniedApps.js"; +export type { DeniedCategory } from "./deniedApps.js"; + +export { isSystemKeyCombo, normalizeKeySequence } from "./keyBlocklist.js"; + +export { ALL_SUB_GATES_OFF, ALL_SUB_GATES_ON } from "./subGates.js"; + +export { API_RESIZE_PARAMS, targetImageSize } from "./imageResize.js"; +export type { ResizeParams } from "./imageResize.js"; + +export { defersLockAcquire, handleToolCall } from "./toolCalls.js"; +export type { + CuCallTelemetry, + CuCallToolResult, + CuErrorKind, +} from "./toolCalls.js"; + +export
{ bindSessionContext, createComputerUseMcpServer } from "./mcpServer.js"; +export { buildComputerUseTools } from "./tools.js"; + +export { + comparePixelAtLocation, + validateClickTarget, +} from "./pixelCompare.js"; +export type { CropRawPatchFn, PixelCompareResult } from "./pixelCompare.js"; diff --git a/packages/@ant/computer-use-mcp/src/keyBlocklist.ts b/packages/@ant/computer-use-mcp/src/keyBlocklist.ts new file mode 100644 index 000000000..1373e1506 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/keyBlocklist.ts @@ -0,0 +1,153 @@ +/** + * Key combos that cross app boundaries or terminate processes. Gated behind + * the `systemKeyCombos` grant flag. When that flag is off, the `key` tool + * rejects these and returns a tool error telling the model to request the + * flag; all other combos work normally. + * + * Matching is canonicalized: every modifier alias the Rust executor accepts + * collapses to one canonical name. Without this, `command+q` / `meta+q` / + * `cmd+alt+escape` bypass the gate — see keyBlocklist.test.ts for the three + * bypass forms and the Rust parity check that catches future alias drift. + */ + +/** + * Every modifier alias enigo_wrap.rs accepts (two copies: :351-359, :564-572), + * mapped to one canonical per Key:: variant. Left/right variants collapse — + * the blocklist doesn't distinguish which Ctrl. + * + * Canonical names are Rust's own variant names lowercased. Blocklist entries + * below use ONLY these. "meta" reads odd for Cmd+Q but it's honest: Rust + * sends Key::Meta, which is Cmd on darwin and Win on win32. 
+ */ +const CANONICAL_MODIFIER: Readonly> = { + // Key::Meta — "meta"|"super"|"command"|"cmd"|"windows"|"win" + meta: "meta", + super: "meta", + command: "meta", + cmd: "meta", + windows: "meta", + win: "meta", + // Key::Control + LControl + RControl + ctrl: "ctrl", + control: "ctrl", + lctrl: "ctrl", + lcontrol: "ctrl", + rctrl: "ctrl", + rcontrol: "ctrl", + // Key::Shift + LShift + RShift + shift: "shift", + lshift: "shift", + rshift: "shift", + // Key::Alt and Key::Option — distinct Rust variants but same keycode on + // darwin (kVK_Option). Collapse: cmd+alt+escape and cmd+option+escape + // both Force Quit. + alt: "alt", + option: "alt", +}; + +/** Sort order for canonicals. ctrl < alt < shift < meta. */ +const MODIFIER_ORDER = ["ctrl", "alt", "shift", "meta"]; + +/** + * Canonical-form entries only. Every modifier must be a CANONICAL_MODIFIER + * *value* (not key), modifiers must be in MODIFIER_ORDER, non-modifier last. + * The self-consistency test enforces this. + */ +const BLOCKED_DARWIN = new Set([ + "meta+q", // Cmd+Q — quit frontmost app + "shift+meta+q", // Cmd+Shift+Q — log out + "alt+meta+escape", // Cmd+Option+Esc — Force Quit dialog + "meta+tab", // Cmd+Tab — app switcher + "meta+space", // Cmd+Space — Spotlight + "ctrl+meta+q", // Ctrl+Cmd+Q — lock screen +]); + +const BLOCKED_WIN32 = new Set([ + "ctrl+alt+delete", // Secure Attention Sequence + "alt+f4", // close window + "alt+tab", // window switcher + "meta+l", // Win+L — lock + "meta+d", // Win+D — show desktop +]); + +/** + * Partition into sorted-canonical modifiers and non-modifier keys. + * Shared by normalizeKeySequence (join for display) and isSystemKeyCombo + * (check mods+each-key to catch the cmd+q+a suffix bypass). 
+ */ +function partitionKeys(seq: string): { mods: string[]; keys: string[] } { + const parts = seq + .toLowerCase() + .split("+") + .map((p) => p.trim()) + .filter(Boolean); + const mods: string[] = []; + const keys: string[] = []; + for (const p of parts) { + const canonical = CANONICAL_MODIFIER[p]; + if (canonical !== undefined) { + mods.push(canonical); + } else { + keys.push(p); + } + } + // Dedupe: "cmd+command+q" → "meta+q", not "meta+meta+q". + const uniqueMods = [...new Set(mods)]; + uniqueMods.sort( + (a, b) => MODIFIER_ORDER.indexOf(a) - MODIFIER_ORDER.indexOf(b), + ); + return { mods: uniqueMods, keys }; +} + +/** + * Normalize "Cmd + Shift + Q" → "shift+meta+q": lowercase, trim, alias → + * canonical, dedupe, sort modifiers, non-modifiers last. + */ +export function normalizeKeySequence(seq: string): string { + const { mods, keys } = partitionKeys(seq); + return [...mods, ...keys].join("+"); +} + +/** + * True if the sequence would fire a blocked OS shortcut. + * + * Checks mods + EACH non-modifier key individually, not just the full + * joined string. `cmd+q+a` → Rust presses Cmd, then Q (Cmd+Q fires here), + * then A. Exact-match against "meta+q+a" misses; checking "meta+q" and + * "meta+a" separately catches the Q. + * + * Modifiers-only sequences ("cmd+shift") are checked as-is — no key to + * pair with, and no blocklist entry is modifier-only, so this is a no-op + * that falls through to false. Covers the click-modifier case where + * `left_click(text="cmd")` is legitimate. + */ +export function isSystemKeyCombo( + seq: string, + platform: "darwin" | "win32", +): boolean { + const blocklist = platform === "darwin" ? BLOCKED_DARWIN : BLOCKED_WIN32; + const { mods, keys } = partitionKeys(seq); + const prefix = mods.length > 0 ? mods.join("+") + "+" : ""; + + // No non-modifier keys (e.g. "cmd+shift" as click-modifiers) — check the + // whole thing. 
Never matches (no blocklist entry is modifier-only) but + // keeps the contract simple: every call reaches a .has(). + if (keys.length === 0) { + return blocklist.has(mods.join("+")); + } + + // mods + each key. Any hit blocks the whole sequence. + for (const key of keys) { + if (blocklist.has(prefix + key)) { + return true; + } + } + return false; +} + +export const _test = { + CANONICAL_MODIFIER, + BLOCKED_DARWIN, + BLOCKED_WIN32, + MODIFIER_ORDER, +}; diff --git a/packages/@ant/computer-use-mcp/src/mcpServer.ts b/packages/@ant/computer-use-mcp/src/mcpServer.ts new file mode 100644 index 000000000..4b1f0ca24 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/mcpServer.ts @@ -0,0 +1,313 @@ +/** + * MCP server factory + session-context binder. + * + * Two entry points: + * + * `bindSessionContext` — the wrapper closure. Takes a `ComputerUseSessionContext` + * (getters + callbacks backed by host session state), returns a dispatcher. + * Reusable by both the MCP CallTool handler here AND Cowork's + * `InternalServerDefinition.handleToolCall` (which doesn't go through MCP). + * This replaces the duplicated wrapper closures in apps/desktop/…/serverDef.ts + * and the Claude Code CLI's CU host wrapper — both did the same thing: build `ComputerUseOverrides` + * fresh from getters, call `handleToolCall`, stash screenshot, merge permissions. + * + * `createComputerUseMcpServer` — the Server object. When `context` is provided, + * the CallTool handler is real (uses `bindSessionContext`). When not, it's the + * legacy stub that returns a not-wired error. The tool-schema ListTools handler + * is the same either way. 
+ */ + +import { Server } from "@modelcontextprotocol/sdk/server/index.js"; +import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js"; +import { + CallToolRequestSchema, + ListToolsRequestSchema, +} from "@modelcontextprotocol/sdk/types.js"; + +import type { ScreenshotResult } from "./executor.js"; +import type { CuCallToolResult } from "./toolCalls.js"; +import { + defersLockAcquire, + handleToolCall, + resetMouseButtonHeld, +} from "./toolCalls.js"; +import { buildComputerUseTools } from "./tools.js"; +import type { + AppGrant, + ComputerUseHostAdapter, + ComputerUseOverrides, + ComputerUseSessionContext, + CoordinateMode, + CuGrantFlags, + CuPermissionResponse, +} from "./types.js"; +import { DEFAULT_GRANT_FLAGS } from "./types.js"; + +const DEFAULT_LOCK_HELD_MESSAGE = + "Another Claude session is currently using the computer. Wait for that " + + "session to finish, or find a non-computer-use approach."; + +/** + * Dedupe `granted` into `existing` on bundleId, spread truthy-only flags over + * defaults+existing. Truthy-only: a subsequent `request_access` that doesn't + * request clipboard can't revoke an earlier clipboard grant — revocation lives + * in a Settings page, not here. + * + * Same merge both hosts implemented independently today. + */ +function mergePermissionResponse( + existing: readonly AppGrant[], + existingFlags: CuGrantFlags, + response: CuPermissionResponse, +): { apps: AppGrant[]; flags: CuGrantFlags } { + const seen = new Set(existing.map((a) => a.bundleId)); + const apps = [ + ...existing, + ...response.granted.filter((g) => !seen.has(g.bundleId)), + ]; + const truthyFlags = Object.fromEntries( + Object.entries(response.flags).filter(([, v]) => v === true), + ); + const flags: CuGrantFlags = { + ...DEFAULT_GRANT_FLAGS, + ...existingFlags, + ...truthyFlags, + }; + return { apps, flags }; +} + +/** + * Bind session state to a reusable dispatcher. 
The returned function is the + * wrapper closure: async lock gate → build overrides fresh → `handleToolCall` + * → stash screenshot → strip piggybacked fields. + * + * The last-screenshot blob is held in a closure cell here (not on `ctx`), so + * hosts don't need to guarantee `ctx` object identity across calls — they just + * need to hold onto the returned dispatcher. Cowork caches per + * `InternalServerContext` in a WeakMap; the CLI host constructs once at server creation. + */ +export function bindSessionContext( + adapter: ComputerUseHostAdapter, + coordinateMode: CoordinateMode, + ctx: ComputerUseSessionContext, +): (name: string, args: unknown) => Promise { + const { logger, serverName } = adapter; + + // Screenshot blob persists here across calls — NOT on `ctx`. Hosts hold + // onto the returned dispatcher; that's the identity that matters. + let lastScreenshot: ScreenshotResult | undefined; + + const wrapPermission = ctx.onPermissionRequest + ? async ( + req: Parameters>[0], + signal: AbortSignal, + ): Promise => { + const response = await ctx.onPermissionRequest!(req, signal); + const { apps, flags } = mergePermissionResponse( + ctx.getAllowedApps(), + ctx.getGrantFlags(), + response, + ); + logger.debug( + `[${serverName}] permission result: granted=${response.granted.length} denied=${response.denied.length}`, + ); + ctx.onAllowedAppsChanged?.(apps, flags); + return response; + } + : undefined; + + const wrapTeachPermission = ctx.onTeachPermissionRequest + ? async ( + req: Parameters>[0], + signal: AbortSignal, + ): Promise => { + const response = await ctx.onTeachPermissionRequest!(req, signal); + logger.debug( + `[${serverName}] teach permission result: granted=${response.granted.length} denied=${response.denied.length}`, + ); + // Teach doesn't request grant flags — preserve existing. 
+ const { apps } = mergePermissionResponse( + ctx.getAllowedApps(), + ctx.getGrantFlags(), + response, + ); + ctx.onAllowedAppsChanged?.(apps, { + ...DEFAULT_GRANT_FLAGS, + ...ctx.getGrantFlags(), + }); + return response; + } + : undefined; + + return async (name, args) => { + // ─── Async lock gate ───────────────────────────────────────────────── + // Replaces the sync Gate-3 in `handleToolCall` — we pass + // `checkCuLock: undefined` below so it no-ops. Hosts with + // cross-process locks (O_EXCL file) await the real primitive here + // instead of pre-computing + feeding a fake sync result. + if (ctx.checkCuLock) { + const lock = await ctx.checkCuLock(); + if (lock.holder !== undefined && !lock.isSelf) { + const text = + ctx.formatLockHeldMessage?.(lock.holder) ?? DEFAULT_LOCK_HELD_MESSAGE; + return { + content: [{ type: "text", text }], + isError: true, + telemetry: { error_kind: "cu_lock_held" }, + }; + } + if (lock.holder === undefined && !defersLockAcquire(name)) { + await ctx.acquireCuLock?.(); + // Re-check: the awaits above yield the microtask queue, so another + // session's check+acquire can interleave with ours. Hosts where + // acquire is a no-op when already held (Cowork's CuLockManager) give + // no signal that we lost — verify we're now the holder before + // proceeding. The CLI's O_EXCL file lock would surface this as a throw from + // acquire instead; this re-check is a belt-and-suspenders for that + // path too. + const recheck = await ctx.checkCuLock(); + if (recheck.holder !== undefined && !recheck.isSelf) { + const text = + ctx.formatLockHeldMessage?.(recheck.holder) ?? + DEFAULT_LOCK_HELD_MESSAGE; + return { + content: [{ type: "text", text }], + isError: true, + telemetry: { error_kind: "cu_lock_held" }, + }; + } + // Fresh holder → any prior session's mouseButtonHeld is stale. + // Mirrors what Gate-3 does on the acquire branch. After the + // re-check so we only clear module state when we actually won. 
+ resetMouseButtonHeld(); + } + } + + // ─── Build overrides fresh ─────────────────────────────────────────── + // Blob-first; dims-fallback with base64:"" when the closure cell is + // unset (cross-respawn). scaleCoord reads dims; pixelCompare sees "" → + // isEmpty → skip. + const dimsFallback = lastScreenshot + ? undefined + : ctx.getLastScreenshotDims?.(); + + // Per-call AbortController for dialog dismissal. Aborted in `finally` — + // if handleToolCall finishes (MCP timeout, throw) before the user + // answers, the host's dialog handler sees the abort and tears down. + const dialogAbort = new AbortController(); + + const overrides: ComputerUseOverrides = { + allowedApps: [...ctx.getAllowedApps()], + grantFlags: ctx.getGrantFlags(), + userDeniedBundleIds: ctx.getUserDeniedBundleIds(), + coordinateMode, + selectedDisplayId: ctx.getSelectedDisplayId(), + displayPinnedByModel: ctx.getDisplayPinnedByModel?.(), + displayResolvedForApps: ctx.getDisplayResolvedForApps?.(), + lastScreenshot: + lastScreenshot ?? + (dimsFallback ? { ...dimsFallback, base64: "" } : undefined), + onPermissionRequest: wrapPermission + ? (req) => wrapPermission(req, dialogAbort.signal) + : undefined, + onTeachPermissionRequest: wrapTeachPermission + ? (req) => wrapTeachPermission(req, dialogAbort.signal) + : undefined, + onAppsHidden: ctx.onAppsHidden, + getClipboardStash: ctx.getClipboardStash, + onClipboardStashChanged: ctx.onClipboardStashChanged, + onResolvedDisplayUpdated: ctx.onResolvedDisplayUpdated, + onDisplayPinned: ctx.onDisplayPinned, + onDisplayResolvedForApps: ctx.onDisplayResolvedForApps, + onTeachModeActivated: ctx.onTeachModeActivated, + onTeachStep: ctx.onTeachStep, + onTeachWorking: ctx.onTeachWorking, + getTeachModeActive: ctx.getTeachModeActive, + // Undefined → handleToolCall's sync Gate-3 no-ops. The async gate + // above already ran. 
+ checkCuLock: undefined, + acquireCuLock: undefined, + isAborted: ctx.isAborted, + }; + + logger.debug( + `[${serverName}] tool=${name} allowedApps=${overrides.allowedApps.length} coordMode=${coordinateMode}`, + ); + + // ─── Dispatch ──────────────────────────────────────────────────────── + try { + const result = await handleToolCall(adapter, name, args, overrides); + + if (result.screenshot) { + lastScreenshot = result.screenshot; + const { base64: _blob, ...dims } = result.screenshot; + logger.debug(`[${serverName}] screenshot dims: ${JSON.stringify(dims)}`); + ctx.onScreenshotCaptured?.(dims); + } + + return result; + } finally { + dialogAbort.abort(); + } + }; +} + +export function createComputerUseMcpServer( + adapter: ComputerUseHostAdapter, + coordinateMode: CoordinateMode, + context?: ComputerUseSessionContext, +): Server { + const { serverName, logger } = adapter; + + const server = new Server( + { name: serverName, version: "0.1.3" }, + { capabilities: { tools: {}, logging: {} } }, + ); + + const tools = buildComputerUseTools( + adapter.executor.capabilities, + coordinateMode, + ); + + server.setRequestHandler(ListToolsRequestSchema, async () => + adapter.isDisabled() ? { tools: [] } : { tools }, + ); + + if (context) { + const dispatch = bindSessionContext(adapter, coordinateMode, context); + server.setRequestHandler( + CallToolRequestSchema, + async (request): Promise<CallToolResult> => { + const { screenshot: _s, telemetry: _t, ...result } = await dispatch( + request.params.name, + request.params.arguments ?? {}, + ); + return result; + }, + ); + return server; + } + + // Legacy: no context → stub handler. Reached only if something calls the + // server over MCP transport WITHOUT going through a binder (a wiring + // regression). Clear error instead of silent failure.
+ server.setRequestHandler( + CallToolRequestSchema, + async (request): Promise<CallToolResult> => { + logger.warn( + `[${serverName}] tool call "${request.params.name}" reached the stub handler — no session context bound. Per-session state unavailable.`, + ); + return { + content: [ + { + type: "text", + text: "This computer-use server instance is not wired to a session. Per-session app permissions are not available on this code path.", + }, + ], + isError: true, + }; + }, + ); + + return server; +} diff --git a/packages/@ant/computer-use-mcp/src/pixelCompare.ts b/packages/@ant/computer-use-mcp/src/pixelCompare.ts new file mode 100644 index 000000000..05153f602 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/pixelCompare.ts @@ -0,0 +1,171 @@ +/** + * Staleness guard ported from the Vercept acquisition. + * + * Compares the model's last-seen screenshot against a fresh-right-now + * screenshot at the click target, so the model never clicks pixels it hasn't + * seen. If the 9×9 patch around the target differs, the click is aborted and + * the model is told to re-screenshot. This is NOT a popup detector. + * + * Semantics preserved exactly: + * - Skip on no `lastScreenshot` (cold start) — click proceeds. + * - Skip on any internal error (crop throws, screenshot fails, etc.) — + * click proceeds. Validation failure must never block the action. + * - 9×9 exact byte equality on raw pixel bytes. No fuzzing, no tolerance. + * - Compare in percentage coords so Retina scale doesn't matter. + * + * JPEG decode + crop is INJECTED via `ComputerUseHostAdapter.cropRawPatch`. + * The original used `sharp` (LGPL, native `.node` addon); we inject Electron's + * `nativeImage` (Chromium decoders, BSD, nothing to bundle) from the host, so + * this package never imports it — the crop is a function parameter. + */ + +import type { ScreenshotResult } from "./executor.js"; +import type { Logger } from "./types.js"; + +/** Injected by the host. See `ComputerUseHostAdapter.cropRawPatch`.
*/ +export type CropRawPatchFn = ( + jpegBase64: string, + rect: { x: number; y: number; width: number; height: number }, +) => Buffer | null; + +/** 9×9 is empirically the sweet spot — large enough to catch a tooltip + * appearing, small enough to not false-positive on surrounding animation. + **/ +const DEFAULT_GRID_SIZE = 9; + +export interface PixelCompareResult { + /** true → click may proceed. false → patch changed, abort the click. */ + valid: boolean; + /** true → validation did not run (cold start, sub-gate off, or internal + * error). The caller MUST treat this identically to `valid: true`. */ + skipped: boolean; + /** Populated when valid === false. Returned to the model verbatim. */ + warning?: string; +} + +/** + * Compute the crop rect for a patch centered on (xPercent, yPercent). + * + * Dimensions come from ScreenshotResult.width/height (physical pixels). Both + * screenshots have the same dimensions (same display, consecutive captures), + * so the rect is the same for both. + */ +function computeCropRect( + imgW: number, + imgH: number, + xPercent: number, + yPercent: number, + gridSize: number, +): { x: number; y: number; width: number; height: number } | null { + if (!imgW || !imgH) return null; + + const clampedX = Math.max(0, Math.min(100, xPercent)); + const clampedY = Math.max(0, Math.min(100, yPercent)); + + const centerX = Math.round((clampedX / 100.0) * imgW); + const centerY = Math.round((clampedY / 100.0) * imgH); + + const halfGrid = Math.floor(gridSize / 2); + const cropX = Math.max(0, centerX - halfGrid); + const cropY = Math.max(0, centerY - halfGrid); + const cropW = Math.min(gridSize, imgW - cropX); + const cropH = Math.min(gridSize, imgH - cropY); + if (cropW <= 0 || cropH <= 0) return null; + + return { x: cropX, y: cropY, width: cropW, height: cropH }; +} + +/** + * Compare the same patch location between two screenshots. + * + * @returns true when the raw pixel bytes are identical. 
false on any + * difference, or on any internal error (the caller treats an error here as + * `skipped`, so the false is harmless). + */ +export function comparePixelAtLocation( + crop: CropRawPatchFn, + lastScreenshot: ScreenshotResult, + freshScreenshot: ScreenshotResult, + xPercent: number, + yPercent: number, + gridSize: number = DEFAULT_GRID_SIZE, +): boolean { + // Both screenshots are of the same display — use the fresh one's + // dimensions (less likely to be stale than last's). + const rect = computeCropRect( + freshScreenshot.width, + freshScreenshot.height, + xPercent, + yPercent, + gridSize, + ); + if (!rect) return false; + + const patch1 = crop(lastScreenshot.base64, rect); + const patch2 = crop(freshScreenshot.base64, rect); + if (!patch1 || !patch2) return false; + + // Direct buffer equality. Note: nativeImage.toBitmap() gives BGRA, sharp's + // .raw() gave RGB. + // Doesn't matter — we're comparing two same-format buffers for equality. + return patch1.equals(patch2); +} + +/** + * Battle-tested click-target validation ported from the Vercept acquisition, + * with the fresh-screenshot capture delegated to the caller (we don't have + * a global `SystemActions.takeScreenshot()` — the executor is injected). + * + * Skip conditions (any of these → `{ valid: true, skipped: true }`): + * - `lastScreenshot` is undefined (cold start). + * - `takeFreshScreenshot()` throws or returns null. + * - Injected crop function returns null (decode failure). + * - Any other exception. + * + * The caller decides whether to invoke this at all (sub-gate check lives + * in toolCalls.ts, not here). 
+ */ +export async function validateClickTarget( + crop: CropRawPatchFn, + lastScreenshot: ScreenshotResult | undefined, + xPercent: number, + yPercent: number, + takeFreshScreenshot: () => Promise<ScreenshotResult | null>, + logger: Logger, + gridSize: number = DEFAULT_GRID_SIZE, +): Promise<PixelCompareResult> { + if (!lastScreenshot) { + return { valid: true, skipped: true }; + } + + try { + const fresh = await takeFreshScreenshot(); + if (!fresh) { + return { valid: true, skipped: true }; + } + + const pixelsMatch = comparePixelAtLocation( + crop, + lastScreenshot, + fresh, + xPercent, + yPercent, + gridSize, + ); + + if (pixelsMatch) { + return { valid: true, skipped: false }; + } + return { + valid: false, + skipped: false, + warning: + "Screen content at the target location changed since the last screenshot. Take a new screenshot before clicking.", + }; + } catch (err) { + // Skip validation on technical errors, execute action anyway. + // Battle-tested: validation failure must never block the click. + logger.debug("[pixelCompare] validation error, skipping", err); + return { valid: true, skipped: true }; + } +} diff --git a/packages/@ant/computer-use-mcp/src/sentinelApps.ts b/packages/@ant/computer-use-mcp/src/sentinelApps.ts index 27a67a199..0d26de600 100644 --- a/packages/@ant/computer-use-mcp/src/sentinelApps.ts +++ b/packages/@ant/computer-use-mcp/src/sentinelApps.ts @@ -1,32 +1,43 @@ /** - * Sentinel apps — list of apps that require a special permission warning + * Bundle IDs that are escalations-in-disguise. The approval UI shows a warning + * badge for these; they are NOT blocked. Power users may legitimately want the + * model controlling a terminal. * - * Includes sensitive apps such as terminals, file managers, and system settings. - * Computer Use shows an extra warning when operating these apps. + * Imported by the renderer via the `./sentinelApps` subpath (package.json + * `exports`), which keeps Next.js from reaching index.ts → mcpServer.ts → + * @modelcontextprotocol/sdk (devDep, would fail module resolution). Keep + * this file import-free so the subpath stays clean.
*/ -type SentinelCategory = 'shell' | 'filesystem' | 'system_settings' +/** These apps can execute arbitrary shell commands. */ +const SHELL_ACCESS_BUNDLE_IDS = new Set([ + "com.apple.Terminal", + "com.googlecode.iterm2", + "com.microsoft.VSCode", + "dev.warp.Warp-Stable", + "com.github.wez.wezterm", + "io.alacritty", + "net.kovidgoyal.kitty", + "com.jetbrains.intellij", + "com.jetbrains.pycharm", +]); -const SENTINEL_MAP: Record<string, SentinelCategory> = { - // Shell / Terminal - 'com.apple.Terminal': 'shell', - 'com.googlecode.iterm2': 'shell', - 'dev.warp.Warp-Stable': 'shell', - 'io.alacritty': 'shell', - 'com.github.wez.wezterm': 'shell', - 'net.kovidgoyal.kitty': 'shell', - 'co.zeit.hyper': 'shell', +/** Finder in the allowlist ≈ browse + open-any-file. */ +const FILESYSTEM_ACCESS_BUNDLE_IDS = new Set(["com.apple.finder"]); - // Filesystem - 'com.apple.finder': 'filesystem', +const SYSTEM_SETTINGS_BUNDLE_IDS = new Set(["com.apple.systempreferences"]); - // System Settings - 'com.apple.systempreferences': 'system_settings', - 'com.apple.SystemPreferences': 'system_settings', -} +export const SENTINEL_BUNDLE_IDS: ReadonlySet<string> = new Set([ + ...SHELL_ACCESS_BUNDLE_IDS, + ...FILESYSTEM_ACCESS_BUNDLE_IDS, + ...SYSTEM_SETTINGS_BUNDLE_IDS, +]); -export const sentinelApps: string[] = Object.keys(SENTINEL_MAP) +export type SentinelCategory = "shell" | "filesystem" | "system_settings"; export function getSentinelCategory(bundleId: string): SentinelCategory | null { - return SENTINEL_MAP[bundleId] ??
null + if (SHELL_ACCESS_BUNDLE_IDS.has(bundleId)) return "shell"; + if (FILESYSTEM_ACCESS_BUNDLE_IDS.has(bundleId)) return "filesystem"; + if (SYSTEM_SETTINGS_BUNDLE_IDS.has(bundleId)) return "system_settings"; + return null; } diff --git a/packages/@ant/computer-use-mcp/src/subGates.ts b/packages/@ant/computer-use-mcp/src/subGates.ts new file mode 100644 index 000000000..7a8867844 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/subGates.ts @@ -0,0 +1,19 @@ +import type { CuSubGates } from './types.js' + +export const ALL_SUB_GATES_ON: CuSubGates = { + pixelValidation: true, + clipboardPasteMultiline: true, + mouseAnimation: true, + hideBeforeAction: true, + autoTargetDisplay: true, + clipboardGuard: true, +} + +export const ALL_SUB_GATES_OFF: CuSubGates = { + pixelValidation: false, + clipboardPasteMultiline: false, + mouseAnimation: false, + hideBeforeAction: false, + autoTargetDisplay: false, + clipboardGuard: false, +} diff --git a/packages/@ant/computer-use-mcp/src/toolCalls.ts b/packages/@ant/computer-use-mcp/src/toolCalls.ts new file mode 100644 index 000000000..557eab9f6 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/toolCalls.ts @@ -0,0 +1,3649 @@ +/** + * Tool dispatch. Every security decision from plan §2 is enforced HERE, + * before any executor method is called. + * + * Enforcement order, every call: + * 1. Kill switch (`adapter.isDisabled()`). + * 2. TCC gate (`adapter.ensureOsPermissions()`). `request_access` is + * exempted — it threads the ungranted state to the renderer so the + * user can grant TCC perms from inside the approval dialog. + * 3. Tool-specific gates (see dispatch table) — ANY exception in a gate + * returns a tool error, executor never called. + * 4. Executor call. + * + * For input actions (click/type/key/scroll/drag/move_mouse) the tool-specific + * gates are, in order: + * a. `prepareForAction` — hide every non-allowlisted app, then defocus us + * (battle-tested pre-action sequence from the Vercept acquisition). 
+ * Sub-gated via `hideBeforeAction`. After this runs the screenshot is + * TRUE (what the + * model sees IS what's at each pixel) and we are not keyboard-focused. + * b. Frontmost gate — branched by actionKind: + * mouse: frontmost ∈ allowlist ∪ {hostBundleId, Finder} → pass. + * hostBundleId passes because the executor's + * `withClickThrough` bracket makes us click-through. + * keyboard: frontmost ∈ allowlist ∪ {Finder} → pass. + * hostBundleId → ERROR (safety net — defocus should have + * moved us off; if it didn't, typing would go into our + * own chat box). + * After step (a) this gate fires RARELY — only when something popped + * up between prepare and action, or the 5-try hide loop gave up. + * Checked FRESH on every call, not cached across calls. + * + * For click variants only, AFTER the above gates but BEFORE the executor call: + * c. Pixel-validation staleness check (sub-gated). + */ + +import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js"; +import { randomUUID } from "node:crypto"; + +import { getDefaultTierForApp, getDeniedCategoryForApp, isPolicyDenied } from "./deniedApps.js"; +import type { + ComputerExecutor, + DisplayGeometry, + InstalledApp, + ScreenshotResult, +} from "./executor.js"; +import { isSystemKeyCombo } from "./keyBlocklist.js"; +import { validateClickTarget } from "./pixelCompare.js"; +import { SENTINEL_BUNDLE_IDS } from "./sentinelApps.js"; +import type { + AppGrant, + ComputerUseHostAdapter, + ComputerUseOverrides, + CoordinateMode, + CuAppPermTier, + CuGrantFlags, + CuPermissionRequest, + CuSubGates, + CuTeachPermissionRequest, + Logger, + ResolvedAppRequest, + TeachStepRequest, +} from "./types.js"; + +/** + * Finder is never hidden by the hide loop (hiding Finder kills the Desktop), + * so it's always a valid frontmost. + */ +const FINDER_BUNDLE_ID = "com.apple.finder"; + +/** + * Categorical error classes for the cu_tool_call telemetry event. 
Never + * free text — error messages may contain file paths / app content (PII). + */ +export type CuErrorKind = + | "allowlist_empty" + | "tcc_not_granted" + | "cu_lock_held" + | "teach_mode_conflict" + | "teach_mode_not_active" + | "executor_threw" + | "capture_failed" + | "app_denied" // no longer emitted (tiered model replaced hard-deny); kept for schema compat + | "bad_args" // malformed tool args (type/shape/range/unknown value) + | "app_not_granted" // target app not in session allowlist (distinct from allowlist_empty) + | "tier_insufficient" // app in allowlist but at a tier too low for the action + | "feature_unavailable" // tool callable but session not wired for it + | "state_conflict" // wrong state for action (call sequence, mouse already held) + | "grant_flag_required" // action needs a grant flag (systemKeyCombos, clipboard*) from request_access + | "display_error" // display enumeration failed (platform) + | "other"; + +/** + * Telemetry payload piggybacked on the result — populated by handlers, + * consumed and stripped by the host wrapper (serverDef.ts) before the + * result goes to the SDK. Same pattern as `screenshot`. + */ +export interface CuCallTelemetry { + /** request_access / request_teach_access: apps NEWLY granted in THIS call + * (does NOT include idempotent re-grants of already-allowed apps). */ + granted_count?: number; + /** request_access / request_teach_access: apps denied in THIS call */ + denied_count?: number; + /** request_access / request_teach_access: apps safety-denied (browser) this call */ + denied_browser_count?: number; + /** request_access / request_teach_access: apps safety-denied (terminal) this call */ + denied_terminal_count?: number; + /** Categorical error class (only set when isError) */ + error_kind?: CuErrorKind; +} + +/** + * `CallToolResult` augmented with the screenshot payload. 
`bindSessionContext` + * reads `result.screenshot` after a `screenshot` tool call and stashes it in a + * closure cell for the next pixel-validation. MCP clients never see this + * field — the host wrapper strips it before returning to the SDK. + */ +export type CuCallToolResult = CallToolResult & { + screenshot?: ScreenshotResult; + /** Piggybacked telemetry — stripped by the host wrapper before SDK return. */ + telemetry?: CuCallTelemetry; +}; + +// --------------------------------------------------------------------------- +// Small result helpers (mirror of chrome-mcp's inline `{content, isError}`) +// --------------------------------------------------------------------------- + +function errorResult(text: string, errorKind?: CuErrorKind): CuCallToolResult { + return { + content: [{ type: "text", text }], + isError: true, + telemetry: errorKind ? { error_kind: errorKind } : undefined, + }; +} + +function okText(text: string): CuCallToolResult { + return { content: [{ type: "text", text }] }; +} + +function okJson(obj: unknown, telemetry?: CuCallTelemetry): CuCallToolResult { + return { + content: [{ type: "text", text: JSON.stringify(obj) }], + telemetry, + }; +} + +// --------------------------------------------------------------------------- +// Arg validation — lightweight, no zod (mirrors chrome-mcp's cast-and-check) +// --------------------------------------------------------------------------- + +function asRecord(args: unknown): Record<string, unknown> { + if (typeof args === "object" && args !== null) { + return args as Record<string, unknown>; + } + return {}; +} + +function requireNumber( + args: Record<string, unknown>, + key: string, +): number | Error { + const v = args[key]; + if (typeof v !== "number" || !Number.isFinite(v)) { + return new Error(`"${key}" must be a finite number.`); + } + return v; +} + +function requireString( + args: Record<string, unknown>, + key: string, +): string | Error { + const v = args[key]; + if (typeof v !== "string") { + return new Error(`"${key}" must be a string.`); + } + return v;
+} + +/** + * Extract (x, y) from a `coordinate: [x, y]` tuple. + * The value must be an array of length 2, both non-negative numbers. + */ +function extractCoordinate( + args: Record<string, unknown>, + paramName: string = "coordinate", +): [number, number] | Error { + const coord = args[paramName]; + if (coord === undefined) { + return new Error(`${paramName} is required`); + } + if (!Array.isArray(coord) || coord.length !== 2) { + return new Error(`${paramName} must be an array of length 2`); + } + const [x, y] = coord; + if (typeof x !== "number" || typeof y !== "number" || x < 0 || y < 0) { + return new Error(`${paramName} must be a tuple of non-negative numbers`); + } + return [x, y]; +} + +// --------------------------------------------------------------------------- +// Coordinate scaling +// --------------------------------------------------------------------------- + +/** + * Convert model-space coordinates to the logical points that enigo expects. + * + * - `normalized_0_100`: (x / 100) * display.width. `display` is fetched + * fresh per tool call — never cached across calls — + * so a mid-session display-settings change doesn't leave us stale. + * - `pixels`: the model sent image-space pixel coords (it read them off the + * last screenshot). With the 1568-px long-edge downsample, the + * screenshot-px → logical-pt ratio is `displayWidth / screenshotWidth`, + * NOT `1/scaleFactor`. Uses the display geometry stashed at CAPTURE time + * (`lastScreenshot.displayWidth`), not fresh — so the transform matches + * what the model actually saw even if the user changed display settings + * since. (Chrome's ScreenshotContext pattern — CDPService.ts:1486-1493.) + */ +function scaleCoord( + rawX: number, + rawY: number, + mode: CoordinateMode, + display: DisplayGeometry, + lastScreenshot: ScreenshotResult | undefined, + logger: Logger, +): { x: number; y: number } { + if (mode === "normalized_0_100") { + // Origin offset targets the selected display in virtual-screen space.
+ return { + x: Math.round((rawX / 100) * display.width) + display.originX, + y: Math.round((rawY / 100) * display.height) + display.originY, + }; + } + + // mode === "pixels": model sent image-space pixel coords. + if (lastScreenshot) { + // The transform. Chrome coordinateScaling.ts:22-34 + claude-in-a-box + // ComputerTool.swift:70-80 — two independent convergent impls. + // Uses the display geometry stashed AT CAPTURE TIME, not fresh. + // Origin from the same snapshot keeps clicks coherent with the captured display. + return { + x: + Math.round( + rawX * (lastScreenshot.displayWidth / lastScreenshot.width), + ) + lastScreenshot.originX, + y: + Math.round( + rawY * (lastScreenshot.displayHeight / lastScreenshot.height), + ) + lastScreenshot.originY, + }; + } + + // Cold start: model sent pixel coords without having taken a screenshot. + // Degenerate — fall back to the old /sf behavior and warn. + logger.warn( + "[computer-use] pixels-mode coordinate received with no prior screenshot; " + + "falling back to /scaleFactor. Click may be off if downsample is active.", + ); + return { + x: Math.round(rawX / display.scaleFactor) + display.originX, + y: Math.round(rawY / display.scaleFactor) + display.originY, + }; +} + +/** + * Convert model-space coordinates to the 0–100 percentage that + * pixelCompare.ts works in. The staleness check operates in screenshot-image + * space; comparing by percentage lets us crop both last and fresh screenshots + * at the same relative location without caring about their absolute dims. + * + * With the 1568-px downsample, `screenshot.width != display.width * sf`, so + * the old `rawX / (display.width * sf)` formula is wrong. The correct + * denominator is just `lastScreenshot.width` — the model's raw pixel coord is + * already in that image's coordinate space. `DisplayGeometry` is no longer + * consumed at all. 
+ */ +function coordToPercentageForPixelCompare( + rawX: number, + rawY: number, + mode: CoordinateMode, + lastScreenshot: ScreenshotResult | undefined, +): { xPct: number; yPct: number } { + if (mode === "normalized_0_100") { + // Unchanged — already a percentage. + return { xPct: rawX, yPct: rawY }; + } + + // mode === "pixels" + if (!lastScreenshot) { + // validateClickTarget at pixelCompare.ts:141-143 already skips when + // lastScreenshot is undefined, so this return value never reaches a crop. + return { xPct: 0, yPct: 0 }; + } + return { + xPct: (rawX / lastScreenshot.width) * 100, + yPct: (rawY / lastScreenshot.height) * 100, + }; +} + +// --------------------------------------------------------------------------- +// Shared input-action gates +// --------------------------------------------------------------------------- + +/** + * Tier needed to perform a given action class. `undefined` → `"full"`. + * + * - `"mouse_position"` — mouse_move only. Passes at any tier including + * `"read"`. Pure cursor positioning, no app interaction. Still runs + * prepareForAction (hide non-allowed apps). + * - `"mouse"` — plain left click, double/triple, scroll, drag-from. + * Requires tier `"click"` or `"full"`. + * - `"mouse_full"` — right/middle click, any click with modifiers, + * drag-drop (the `to` endpoint of left_click_drag). Requires tier + * `"full"`. Right-click → context menu Paste, modifier chords → + * keystrokes before click, drag-drop → text insertion at the drop + * point. All escalate a click-tier grant to keyboard-equivalent input. + * Blunt: also rejects same-app drags (scrollbar, panel resize) onto + * click-tier apps; `scroll` is the tier-"click" way to scroll. + * - `"keyboard"` — type, key, hold_key. Requires tier `"full"`. + */ +type CuActionKind = "mouse_position" | "mouse" | "mouse_full" | "keyboard"; + +function tierSatisfies( + grantTier: CuAppPermTier | undefined, + actionKind: CuActionKind, +): boolean { + const tier = grantTier ?? 
"full"; + if (actionKind === "mouse_position") return true; + if (actionKind === "keyboard" || actionKind === "mouse_full") { + return tier === "full"; + } + // mouse + return tier === "click" || tier === "full"; +} + +// Appended to every tier_insufficient error. The model may try to route +// around the gate (osascript, System Events, cliclick via Bash) — this +// closes that door explicitly. Leading space so it concatenates cleanly. +const TIER_ANTI_SUBVERSION = + " Do not attempt to work around this restriction — never use AppleScript, " + + "System Events, shell commands, or any other method to send clicks or " + + "keystrokes to this app."; + +// --------------------------------------------------------------------------- +// Clipboard guard — stash+clear while a click-tier app is frontmost +// --------------------------------------------------------------------------- +// +// Threat: tier "click" blocks type/key/right-click-Paste, but a click-tier +// terminal/IDE may have a UI Paste button that's plain-left-clickable. If the +// clipboard holds `rm -rf /` — from the user, from a prior full-tier paste, +// OR from the agent's own write_clipboard call (which doesn't route through +// runInputActionGates) — a left_click on that button injects it. +// +// Mitigation: stash the user's clipboard on first entry to click-tier, then +// RE-CLEAR before every input action while click-tier stays frontmost. The +// re-clear is the load-bearing part — a stash-on-transition-only design +// leaves a gap between an agent write_clipboard and the next left_click. +// When frontmost becomes anything else, restore. Turn-end restore is inlined +// in the host's result-handler + leavingRunning (same dual-location as +// cuHiddenDuringTurn unhide) — reads `session.cuClipboardStash` directly and +// writes via Electron's `clipboard.writeText`, so no nest-only import. 
+// +// State lives on the session (via `overrides.getClipboardStash` / +// `onClipboardStashChanged`), not module-level. The CU lock still guarantees +// one session at a time, but session-scoped state means the host's turn-end +// restore doesn't need to reach back into this package. + +async function syncClipboardStash( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + frontmostIsClickTier: boolean, +): Promise<void> { + const current = overrides.getClipboardStash?.(); + if (!frontmostIsClickTier) { + // Restore + clear. Idempotent — if nothing is stashed, no-op. + if (current === undefined) return; + try { + await adapter.executor.writeClipboard(current); + // Clear only after a successful write — a transient pasteboard + // failure must not irrecoverably drop the stash. + overrides.onClipboardStashChanged?.(undefined); + } catch { + // Best effort — stash held, next non-click action retries. + } + return; + } + // Stash the user's clipboard on FIRST entry to click-tier only. + if (current === undefined) { + try { + const read = await adapter.executor.readClipboard(); + overrides.onClipboardStashChanged?.(read); + } catch { + // readClipboard failed — use empty sentinel so we don't retry the stash + // on the next action; restore becomes a harmless writeClipboard(""). + overrides.onClipboardStashChanged?.(""); + } + } + // Re-clear on EVERY click-tier action, not just the first. Defeats the + // bypass where the agent calls write_clipboard (which doesn't route + // through runInputActionGates) between stash and a left_click on a UI + // Paste button — the next action's clear clobbers the agent's write + // before the click lands. + try { + await adapter.executor.writeClipboard(""); + } catch { + // Transient pasteboard failure. The tier-"click" right-click/modifier + // block still holds; this is a net, not a promise. + } +} + +/** Every click/type/key/scroll/drag/move_mouse runs through this before + * touching the executor.
Returns null on pass, error-result on block. + * Any throw inside → caught by handleToolCall's outer try → tool error. */ +async function runInputActionGates( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, + actionKind: CuActionKind, +): Promise { + // Step A+B — hide non-allowlisted apps + defocus us. Sub-gated. After this + // runs, the frontmost gate below becomes a rare edge-case detector (something + // popped up between prepare and action) rather than a normal-path blocker. + // ALL grant tiers stay visible — visibility is the baseline (tier "read"). + if (subGates.hideBeforeAction) { + const hidden = await adapter.executor.prepareForAction( + overrides.allowedApps.map((a) => a.bundleId), + overrides.selectedDisplayId, + ); + // Empty-check so we don't spam the callback on every action when nothing + // was hidden (the common case after the first action of a turn). + if (hidden.length > 0) { + overrides.onAppsHidden?.(hidden); + } + } + + // Frontmost gate. Check FRESH on every call. + const frontmost = await adapter.executor.getFrontmostApp(); + + const tierByBundleId = new Map( + overrides.allowedApps.map((a) => [a.bundleId, a.tier] as const), + ); + + // After handleToolCall's tier backfill, every grant has a concrete tier — + // .get() returning undefined means the app is not in the allowlist at all. + const frontmostTier = frontmost + ? tierByBundleId.get(frontmost.bundleId) + : undefined; + + // Clipboard guard. Per-action, not per-tool-call — runs for every sub-action + // inside computer_batch and teach_step/teach_batch, so clicking into a + // click-tier app mid-batch stashes+clears before the next click lands. + // Lives here (not in handleToolCall) so deferAcquire tools (request_access, + // list_granted_applications), `wait`, and the teach_step blocking-dialog + // phase don't trigger a sync — only input actions do. 
+ if (subGates.clipboardGuard) { + await syncClipboardStash(adapter, overrides, frontmostTier === "click"); + } + + if (!frontmost) { + // No frontmost app (rare — login window?). Let it through; the click + // will land somewhere and PixelCompare catches staleness. + return null; + } + + const { hostBundleId } = adapter.executor.capabilities; + + if (frontmostTier !== undefined) { + if (tierSatisfies(frontmostTier, actionKind)) return null; + // In the allowlist but tier doesn't cover this action. Tailor the + // guidance to the actual tier — at "read", suggesting left_click or Bash + // is wrong (nothing is allowed; use Chrome MCP). At "click", the + // mouse_full/keyboard-specific messages apply. + if (frontmostTier === "read") { + // tier "read" is not category-unique (browser AND trading map to it) — + // re-look-up so the CiC hint only shows for actual browsers. + const isBrowser = + getDeniedCategoryForApp(frontmost.bundleId, frontmost.displayName) === + "browser"; + return errorResult( + `"${frontmost.displayName}" is granted at tier "read" — ` + + `visible in screenshots only, no clicks or typing.` + + (isBrowser + ? " Use the Claude-in-Chrome MCP for browser interaction (tools " + + "named `mcp__Claude_in_Chrome__*`; load via ToolSearch if " + + "deferred)." + : " No interaction is permitted; ask the user to take any " + + "actions in this app themselves.") + + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); + } + // frontmostTier === "click" (tier === "full" would have passed tierSatisfies) + if (actionKind === "keyboard") { + return errorResult( + `"${frontmost.displayName}" is granted at tier "click" — ` + + `typing, key presses, and paste require tier "full". The keys ` + + `would go to this app's text fields or integrated terminal. To ` + + `type into a different app, click it first to bring it forward. 
` + + `For shell commands, use the Bash tool.` + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); + } + // actionKind === "mouse_full" ("mouse" and "mouse_position" pass at "click") + return errorResult( + `"${frontmost.displayName}" is granted at tier "click" — ` + + `right-click, middle-click, and clicks with modifier keys require ` + + `tier "full". Right-click opens a context menu with Paste/Cut, and ` + + `modifier chords fire as keystrokes before the click. Plain ` + + `left_click is allowed here.` + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); + } + // Finder is never-hide, always allowed. + if (frontmost.bundleId === FINDER_BUNDLE_ID) return null; + + if (frontmost.bundleId === hostBundleId) { + if (actionKind !== "keyboard") { + // mouse and mouse_full are both click events — click-through works. + // We're click-through (executor's withClickThrough). Pass. + return null; + } + // Keyboard safety net — defocus (prepareForAction step B) should have + // moved us off. If we're still here, typing would go to our chat box. + return errorResult( + "Claude's own window still has keyboard focus. This should not happen " + + "after the pre-action defocus. Click on the target application first.", + "state_conflict", + ); + } + + // Non-allowlisted, non-us, non-Finder. RARE after the hide loop — means + // something popped up between prepare and action, or the 5-try loop gave up. + return errorResult( + `"${frontmost.displayName}" is not in the allowed applications and is ` + + `currently in front. Take a new screenshot — it may have appeared ` + + `since your last one.`, + "app_not_granted", + ); +} + +/** + * Hit-test gate: reject a mouse action if the window under (x, y) belongs + * to an app whose tier doesn't cover mouse input. Closes the gap where a + * tier-"full" app is frontmost but the click lands on a tier-"read" window + * overlapping it — `runInputActionGates` passes (frontmost is fine), but the + * click actually goes to the read-tier app. 
+ * + * Runs AFTER `scaleCoord` (needs global coords) and BEFORE the executor call. + * Returns null on pass (target is tier-"click"/"full", or desktop/Finder/us), + * error-result on block. + * + * When `appUnderPoint` returns null (desktop, or platform without hit-test), + * falls through — the frontmost check in `runInputActionGates` already ran. + */ +async function runHitTestGate( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, + x: number, + y: number, + actionKind: CuActionKind, +): Promise { + const target = await adapter.executor.appUnderPoint(x, y); + if (!target) return null; // desktop / nothing under point / platform no-op + + // Finder (desktop, file dialogs) is always clickable — same exemption as + // runInputActionGates. Our own overlay is filtered by Swift (pid != self). + if (target.bundleId === FINDER_BUNDLE_ID) return null; + + const tierByBundleId = new Map( + overrides.allowedApps.map((a) => [a.bundleId, a.tier] as const), + ); + + if (!tierByBundleId.has(target.bundleId)) { + // Not in the allowlist at all. The frontmost check would catch this if + // the target were frontmost, but here a different app is in front. This + // is the "something popped up" edge case — a new window appeared between + // screenshot and click, or a background app's window overlaps the target. + return errorResult( + `Click at these coordinates would land on "${target.displayName}", ` + + `which is not in the allowed applications. Take a fresh screenshot ` + + `to see the current window layout.`, + "app_not_granted", + ); + } + + const targetTier = tierByBundleId.get(target.bundleId); + + // Frontmost-based sync (runInputActionGates) misses the case where + // the click lands on a NON-FRONTMOST click-tier window. Re-sync by + // the hit-test target's tier — if target is click-tier, stash+clear + // before the click lands, regardless of what's frontmost. 
+ if (subGates.clipboardGuard && targetTier === "click") { + await syncClipboardStash(adapter, overrides, true); + } + + if (tierSatisfies(targetTier, actionKind)) return null; + + // Target is in the allowlist but tier doesn't cover this action. + // runHitTestGate is only called with mouse/mouse_full (keyboard routes to + // frontmost, not window-under-cursor). The branch above catches + // mouse_full ∧ click; the only remaining fall-through is tier "read". + if (actionKind === "mouse_full" && targetTier === "click") { + return errorResult( + `Click at these coordinates would land on "${target.displayName}", ` + + `which is granted at tier "click" — right-click, middle-click, and ` + + `clicks with modifier keys require tier "full" (they can Paste via ` + + `the context menu or fire modifier-chord keystrokes). Plain ` + + `left_click is allowed here.` + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); + } + const isBrowser = + getDeniedCategoryForApp(target.bundleId, target.displayName) === "browser"; + return errorResult( + `Click at these coordinates would land on "${target.displayName}", ` + + `which is granted at tier "read" (screenshots only, no interaction). ` + + (isBrowser + ? "Use the Claude-in-Chrome MCP for browser interaction." + : "Ask the user to take any actions in this app themselves.") + + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); +} + +// --------------------------------------------------------------------------- +// Screenshot helpers +// --------------------------------------------------------------------------- + +/** + * §6 item 9 — screenshot retry on implausibly-small buffer. Battle-tested + * threshold (1024 bytes). We retry exactly once. + */ +const MIN_SCREENSHOT_BYTES = 1024; + +function decodedByteLength(base64: string): number { + // 3 bytes per 4 chars, minus padding. Good enough for a threshold check. + const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 
1 : 0; + return Math.floor((base64.length * 3) / 4) - padding; +} + +async function takeScreenshotWithRetry( + executor: ComputerExecutor, + allowedBundleIds: string[], + logger: ComputerUseHostAdapter["logger"], + displayId?: number, +): Promise { + let shot = await executor.screenshot({ allowedBundleIds, displayId }); + if (decodedByteLength(shot.base64) < MIN_SCREENSHOT_BYTES) { + logger.warn( + `[computer-use] screenshot implausibly small (${decodedByteLength(shot.base64)} bytes decoded), retrying once`, + ); + shot = await executor.screenshot({ allowedBundleIds, displayId }); + } + return shot; +} + +// --------------------------------------------------------------------------- +// Grapheme iteration — §6 item 7, ported from the Vercept acquisition +// --------------------------------------------------------------------------- + +const INTER_GRAPHEME_SLEEP_MS = 8; // §6 item 4 — 125 Hz USB polling + +function segmentGraphemes(text: string): string[] { + try { + // Node 18+ has Intl.Segmenter; the try is defence against a + // stripped-down runtime (falls back to code points). + const Segmenter = ( + Intl as typeof Intl & { + Segmenter?: new ( + locale?: string, + options?: { granularity: "grapheme" | "word" | "sentence" }, + ) => { segment: (s: string) => Iterable<{ segment: string }> }; + } + ).Segmenter; + if (typeof Segmenter === "function") { + const seg = new Segmenter(undefined, { granularity: "grapheme" }); + return Array.from(seg.segment(text), (s) => s.segment); + } + } catch { + // fall through + } + // Code-point iteration. Keeps surrogate pairs together but splits ZWJ. + return Array.from(text); +} + +function sleep(ms: number): Promise<void> { + return new Promise((r) => setTimeout(r, ms)); +} + +/** + * Split a chord string like "ctrl+shift" into individual key names. + * Same parsing as `key` tool / executor.key / keyBlocklist.normalizeKeySequence. 
+ */ +function parseKeyChord(text: string): string[] { + return text + .split("+") + .map((s) => s.trim()) + .filter(Boolean); +} + +// --------------------------------------------------------------------------- +// left_mouse_down / left_mouse_up held-state tracking +// --------------------------------------------------------------------------- + +/** + * Errors on double-down but not on up-without-down. Module-level, but + * reset on every lock acquire (handleToolCall → acquireCuLock branch) so + * a session interrupted mid-drag (overlay stop during left_mouse_down) + * doesn't leave the flag true for the next lock holder. + * + * Still scoped wrong within a single lock cycle if sessions could interleave + * tool calls, but the lock enforces at-most-one-session-uses-CU so they + * can't. The per-turn reset is the correctness boundary. + */ +let mouseButtonHeld = false; +/** Whether mouse_move occurred between left_mouse_down and left_mouse_up. + * When false at mouseUp, the decomposed sequence is a click-release (not a + * drop) — hit-test at "mouse", not "mouse_full". */ +let mouseMoved = false; + +/** Clears the cross-call drag flags. Called from Gate-3 on lock-acquire and + * from `bindSessionContext` in mcpServer.ts — a fresh lock holder must not + * inherit a prior session's mid-drag state. */ +export function resetMouseButtonHeld(): void { + mouseButtonHeld = false; + mouseMoved = false; +} + +/** If a left_mouse_down set the OS button without a matching left_mouse_up + * ever getting its turn, release it now. Same release-before-return as + * handleClick. No-op when not held — callers don't need to check. */ +async function releaseHeldMouse( + adapter: ComputerUseHostAdapter, +): Promise<void> { + if (!mouseButtonHeld) return; + await adapter.executor.mouseUp(); + mouseButtonHeld = false; + mouseMoved = false; +} + +/** + * Tools that check the lock but don't acquire it. 
`request_access` and + * `list_granted_applications` hit the CHECK (so a blocked session doesn't + * show an approval dialog for access it can't use) but defer ACQUIRE — the + * enter-CU notification/overlay only fires on the first action tool. + * + * `request_teach_access` is NOT here: approving teach mode hides the main + * window, and the lock must be held before that. See Gate-3 block in + * `handleToolCall` for the full explanation. + * + * Exported for `bindSessionContext` in mcpServer.ts so the async lock gate + * uses the same set as the sync one. + */ +export function defersLockAcquire(toolName: string): boolean { + return ( + toolName === "request_access" || + toolName === "list_granted_applications" + ); +} + +// --------------------------------------------------------------------------- +// request_access helpers +// --------------------------------------------------------------------------- + +/** Reverse-DNS-ish: contains at least one dot, no spaces, no slashes. Lets + * raw bundle IDs pass through resolution. */ +const REVERSE_DNS_RE = /^[A-Za-z0-9][\w.-]*\.[A-Za-z0-9][\w.-]*$/; + +function looksLikeBundleId(s: string): boolean { + return REVERSE_DNS_RE.test(s) && !s.includes(" "); +} + +function resolveRequestedApps( + requestedNames: string[], + installed: InstalledApp[], + alreadyGrantedBundleIds: ReadonlySet<string>, +): ResolvedAppRequest[] { + const byLowerDisplayName = new Map<string, InstalledApp>(); + const byBundleId = new Map<string, InstalledApp>(); + for (const app of installed) { + byBundleId.set(app.bundleId, app); + // Last write wins on collisions. Ambiguous-name handling (multiple + // candidates in the dialog) is plan-documented but deferred — the + // InstalledApps enumerator dedupes by bundle ID, so true display-name + // collisions are rare. TODO(chicago, post-P1): surface all candidates. 
+ byLowerDisplayName.set(app.displayName.toLowerCase(), app); + } + + return requestedNames.map((requested): ResolvedAppRequest => { + let resolved: InstalledApp | undefined; + if (looksLikeBundleId(requested)) { + resolved = byBundleId.get(requested); + } + if (!resolved) { + resolved = byLowerDisplayName.get(requested.toLowerCase()); + } + const bundleId = resolved?.bundleId; + // When unresolved AND the requested string looks like a bundle ID, use it + // directly for tier lookup (e.g. "company.thebrowser.Browser" with Arc not + // installed — the reverse-DNS string won't match any display-name substring). + const bundleIdCandidate = + bundleId ?? (looksLikeBundleId(requested) ? requested : undefined); + return { + requestedName: requested, + resolved, + isSentinel: bundleId ? SENTINEL_BUNDLE_IDS.has(bundleId) : false, + alreadyGranted: bundleId ? alreadyGrantedBundleIds.has(bundleId) : false, + proposedTier: getDefaultTierForApp( + bundleIdCandidate, + resolved?.displayName ?? requested, + ), + }; + }); +} + +// --------------------------------------------------------------------------- +// Individual tool handlers +// --------------------------------------------------------------------------- + +async function handleRequestAccess( + adapter: ComputerUseHostAdapter, + args: Record, + overrides: ComputerUseOverrides, + tccState: { accessibility: boolean; screenRecording: boolean } | undefined, +): Promise { + if (!overrides.onPermissionRequest) { + return errorResult( + "This session was not wired with a permission handler. Computer control is not available here.", + "feature_unavailable", + ); + } + + // Teach mode hides the main window; permission dialogs render in that + // window. Without this, handleToolPermission blocks on an invisible + // prompt and the overlay spins forever. Tell the model to exit teach + // mode, request access, then re-enter. 
+ if (overrides.getTeachModeActive?.()) { + return errorResult( + "Cannot request additional permissions during teach mode — the permission dialog would be hidden. End teach mode (finish the tour or let the turn complete), then call request_access, then start a new tour.", + "teach_mode_conflict", + ); + } + + const reason = requireString(args, "reason"); + if (reason instanceof Error) return errorResult(reason.message, "bad_args"); + + // TCC-ungranted branch. The renderer shows a toggle panel INSTEAD OF the + // app list when `tccState` is present on the request, so we skip app + // resolution entirely (listInstalledApps() may fail without Screen + // Recording anyway). The user grants the OS perms from inside the dialog, + // then clicks "Ask again" — both buttons resolve with deny by design + // (ComputerUseApproval.tsx) so the model re-calls request_access and + // gets the app list on the next call. + if (tccState) { + const req: CuPermissionRequest = { + requestId: randomUUID(), + reason, + apps: [], + requestedFlags: {}, + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + tccState, + }; + await overrides.onPermissionRequest(req); + + // Re-check: the user may have granted in System Settings while the + // dialog was up. The `tccState` arg is a pre-dialog snapshot — reading + // it here would tell the model "not yet granted" even after the user + // granted, and the model waits for confirmation instead of retrying. + // The renderer's TCC panel already live-polls (computerUseTccStore); + // this is the same re-check on the tool-result side. + const recheck = await adapter.ensureOsPermissions(); + if (recheck.granted) { + return errorResult( + "macOS Accessibility and Screen Recording are now both granted. 
" + + "Call request_access again immediately — the next call will show " + + "the app selection list.", + ); + } + + const missing: string[] = []; + if (!recheck.accessibility) missing.push("Accessibility"); + if (!recheck.screenRecording) missing.push("Screen Recording"); + return errorResult( + `macOS ${missing.join(" and ")} permission(s) not yet granted. ` + + `The permission panel has been shown. Once the user grants the ` + + `missing permission(s), call request_access again.`, + "tcc_not_granted", + ); + } + + const rawApps = args.apps; + if (!Array.isArray(rawApps) || !rawApps.every((a) => typeof a === "string")) { + return errorResult('"apps" must be an array of strings.', "bad_args"); + } + const apps = rawApps as string[]; + + const requestedFlags: Partial = {}; + if (typeof args.clipboardRead === "boolean") { + requestedFlags.clipboardRead = args.clipboardRead; + } + if (typeof args.clipboardWrite === "boolean") { + requestedFlags.clipboardWrite = args.clipboardWrite; + } + if (typeof args.systemKeyCombos === "boolean") { + requestedFlags.systemKeyCombos = args.systemKeyCombos; + } + + const { + needDialog, + skipDialogGrants, + willHide, + tieredApps, + userDenied, + policyDenied, + } = await buildAccessRequest( + adapter, + apps, + overrides.allowedApps, + new Set(overrides.userDeniedBundleIds), + overrides.selectedDisplayId, + ); + + let dialogGranted: AppGrant[] = []; + let dialogDenied: Array<{ + bundleId: string; + reason: "user_denied" | "not_installed"; + }> = []; + let dialogFlags: CuGrantFlags = overrides.grantFlags; + + if (needDialog.length > 0 || Object.keys(requestedFlags).length > 0) { + const req: CuPermissionRequest = { + requestId: randomUUID(), + reason, + apps: needDialog, + requestedFlags, + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + // Undefined when empty so the renderer skips the section cleanly. 
+ ...(willHide.length > 0 && { + willHide, + autoUnhideEnabled: adapter.getAutoUnhideEnabled(), + }), + }; + const response = await overrides.onPermissionRequest(req); + dialogGranted = response.granted; + dialogDenied = response.denied; + dialogFlags = response.flags; + } + + // Do NOT return display geometry or coordinateMode. See COORDINATES.md + // ("Never give the model a number that invites rescaling"). scaleCoord + // already transforms server-side; the coordinate convention is baked into + // the tool param descriptions at server-construction time. + const allGranted = [...skipDialogGrants, ...dialogGranted]; + // Filter tieredApps to what was actually granted — if the user unchecked + // Chrome in the dialog, don't explain Chrome's tier. + const grantedBundleIds = new Set(allGranted.map((g) => g.bundleId)); + const grantedTieredApps = tieredApps.filter((t) => + grantedBundleIds.has(t.bundleId), + ); + // Best-effort — grants are already persisted by wrappedPermissionHandler; + // a listDisplays/findWindowDisplays failure (monitor hot-unplug, NAPI + // error) must not tank the grant response. Same discipline as + // buildMonitorNote's listDisplays try/catch. + let windowLocations: Awaited<ReturnType<typeof buildWindowLocations>> = []; + try { + windowLocations = await buildWindowLocations(adapter, allGranted); + } catch (e) { + adapter.logger.warn( + `[computer-use] buildWindowLocations failed: ${String(e)}`, + ); + } + return okJson( + { + granted: allGranted, + denied: dialogDenied, + // Policy blocklist — precedes userDenied in precedence and response + // order. No escape hatch; the agent is told to find another approach. + ...(policyDenied.length > 0 && { + policyDenied: { + apps: policyDenied, + guidance: buildPolicyDeniedGuidance(policyDenied), + }, + }), + // User-configured auto-deny — stripped before the dialog; this is the + // agent's only signal that these apps exist but are user-blocked. 
+ ...(userDenied.length > 0 && { + userDenied: { + apps: userDenied, + guidance: buildUserDeniedGuidance(userDenied), + }, + }), + // Upfront guidance so the model knows what each tier allows BEFORE + // hitting the gate. Only included when something was tier-restricted. + ...(grantedTieredApps.length > 0 && { + tierGuidance: buildTierGuidanceMessage(grantedTieredApps), + }), + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + // Where each granted app currently has open windows, across monitors. + // Omitted when the app isn't running or has no normal windows. + ...(windowLocations.length > 0 ? { windowLocations } : {}), + }, + { + // dialogGranted only — skipDialogGrants are idempotent re-grants of + // apps already in the allowlist (no user action, dialog skips them). + // Matching denied_count's this-call-only semantics. + granted_count: dialogGranted.length, + denied_count: dialogDenied.length, + ...tierAssignmentTelemetry(grantedTieredApps), + }, + ); +} + +/** + * For each granted app with open windows, which displays those windows are + * on. Single-monitor setups return an empty array (no multi-monitor signal + * to give). Apps not running, or running with no normal windows, are omitted. 
+ */ +async function buildWindowLocations( + adapter: ComputerUseHostAdapter, + granted: AppGrant[], +): Promise< + Array<{ + bundleId: string; + displayName: string; + displays: Array<{ id: number; label?: string; isPrimary?: boolean }>; + }> +> { + if (granted.length === 0) return []; + + const displays = await adapter.executor.listDisplays(); + if (displays.length <= 1) return []; + + const grantedBundleIds = granted.map((g) => g.bundleId); + const windowLocs = await adapter.executor.findWindowDisplays(grantedBundleIds); + const displayById = new Map(displays.map((d) => [d.displayId, d])); + const idsByBundle = new Map(windowLocs.map((w) => [w.bundleId, w.displayIds])); + + const out = []; + for (const g of granted) { + const displayIds = idsByBundle.get(g.bundleId); + if (!displayIds || displayIds.length === 0) continue; + out.push({ + bundleId: g.bundleId, + displayName: g.displayName, + displays: displayIds.map((id) => { + const d = displayById.get(id); + return { id, label: d?.label, isPrimary: d?.isPrimary }; + }), + }); + } + return out; +} + +/** + * Shared app-resolution + partition + hide-preview pipeline. Extracted from + * `handleRequestAccess` so `handleRequestTeachAccess` can call the same path. + * + * Does the full app-name→InstalledApp resolution, assigns each a tier + * (browser→"read", terminal/IDE→"click", else "full" — see deniedApps.ts), + * splits into already-granted (skip the dialog, preserve grantedAt+tier) vs + * need-dialog, and computes the willHide preview. Unlike the previous + * hard-deny model, ALL apps proceed to the dialog; the tier just constrains + * what actions are allowed once granted. + */ +/** An app assigned a restricted tier (not `"full"`). Used to build the + * guidance message telling the model what it can/can't do. */ +interface TieredApp { + bundleId: string; + displayName: string; + /** Never `"full"` — only restricted tiers are collected. 
*/ + tier: "read" | "click"; +} + +interface AccessRequestParts { + needDialog: ResolvedAppRequest[]; + skipDialogGrants: AppGrant[]; + willHide: Array<{ bundleId: string; displayName: string }>; + /** Resolved apps with `proposedTier !== "full"` — for the guidance text. + * Unresolved apps are omitted (they go to `denied` with `not_installed`). */ + tieredApps: TieredApp[]; + /** Apps stripped by the user's Settings auto-deny list. Surfaced in the + * response with guidance; never reach the dialog. */ + userDenied: Array<{ requestedName: string; displayName: string }>; + /** Apps stripped by the baked-in policy blocklist (streaming/music/ebooks, + * etc. — `deniedApps.isPolicyDenied`). Precedence over userDenied. */ + policyDenied: Array<{ requestedName: string; displayName: string }>; +} + +async function buildAccessRequest( + adapter: ComputerUseHostAdapter, + apps: string[], + allowedApps: AppGrant[], + userDeniedBundleIds: ReadonlySet<string>, + selectedDisplayId?: number, +): Promise<AccessRequestParts> { + const alreadyGranted = new Set(allowedApps.map((g) => g.bundleId)); + const installed = await adapter.executor.listInstalledApps(); + const resolved = resolveRequestedApps(apps, installed, alreadyGranted); + + // Policy-level auto-deny (baked-in, not user-configurable). Stripped + // before userDenied — checks bundle ID AND display name (covers + // unresolved requests). Precedence: policy > user setting > tier. + const policyDenied: Array<{ requestedName: string; displayName: string }> = + []; + const afterPolicy: typeof resolved = []; + for (const r of resolved) { + const displayName = r.resolved?.displayName ?? r.requestedName; + if (isPolicyDenied(r.resolved?.bundleId, displayName)) { + policyDenied.push({ requestedName: r.requestedName, displayName }); + } else { + afterPolicy.push(r); + } + } + + // User-configured auto-deny (Settings → Desktop app → Computer Use). + // Stripped BEFORE + // tier assignment — these never reach the dialog regardless of category. 
+ // Bundle-ID match only (the Settings UI picks from installed apps, which + // always have a bundle ID). Unresolved requests pass through to the tier + // system; the user can't preemptively deny an app that isn't installed. + const userDenied: Array<{ requestedName: string; displayName: string }> = []; + const surviving: typeof afterPolicy = []; + for (const r of afterPolicy) { + if (r.resolved && userDeniedBundleIds.has(r.resolved.bundleId)) { + userDenied.push({ + requestedName: r.requestedName, + displayName: r.resolved.displayName, + }); + } else { + surviving.push(r); + } + } + + // Collect resolved apps with a restricted tier for the guidance message. + // Unresolved apps with a restricted tier (e.g. model asks for "Chrome" but + // it's not installed) are omitted — they'll end up in the `denied` list + // with reason "not_installed" and the model will see that instead. + const tieredApps: TieredApp[] = []; + for (const r of surviving) { + if (r.proposedTier === "full" || !r.resolved) continue; + tieredApps.push({ + bundleId: r.resolved.bundleId, + displayName: r.resolved.displayName, + tier: r.proposedTier, + }); + } + + // Idempotence: apps that are already granted skip the dialog and are + // merged into the `granted` response. Existing grants keep their tier + // (which may differ from the current proposedTier if policy changed). + const skipDialog = surviving.filter((r) => r.alreadyGranted); + const needDialog = surviving.filter((r) => !r.alreadyGranted); + + // Populate icons only for what the dialog will actually show. Sequential + // awaits are fine — the Swift module is cached (listInstalledApps above + // loaded it), each N-API call is synchronous, and the darwin executor + // memoizes by path. Failures leave iconDataUrl undefined; renderer falls + // back to a grey box. 
+ for (const r of needDialog) { + if (!r.resolved) continue; + try { + r.resolved.iconDataUrl = await adapter.executor.getAppIcon( + r.resolved.path, + ); + } catch { + // leave undefined + } + } + + const now = Date.now(); + const skipDialogGrants: AppGrant[] = skipDialog + .filter((r) => r.resolved) + .map((r) => { + // Reuse the existing grant (preserving grantedAt + tier) rather than + // synthesizing a new one — keeps Settings-page "Granted 3m ago" honest. + const existing = allowedApps.find( + (g) => g.bundleId === r.resolved!.bundleId, + ); + return ( + existing ?? { + bundleId: r.resolved!.bundleId, + displayName: r.resolved!.displayName, + grantedAt: now, + tier: r.proposedTier, + } + ); + }); + + // Preview what will be hidden if the user approves exactly the requested + // set plus what they already have. All tiers are visible, so everything + // resolved goes in the exempt set. + const exemptForPreview = [ + ...allowedApps.map((a) => a.bundleId), + ...surviving.filter((r) => r.resolved).map((r) => r.resolved!.bundleId), + ]; + const willHide = await adapter.executor.previewHideSet( + exemptForPreview, + selectedDisplayId, + ); + + return { + needDialog, + skipDialogGrants, + willHide, + tieredApps, + userDenied, + policyDenied, + }; +} + +/** + * Build guidance text for apps granted at a restricted tier. Returned + * inline in the okJson response so the model knows upfront what it can + * do with each app, instead of learning by hitting the tier gate. + */ +function buildTierGuidanceMessage(tiered: TieredApp[]): string { + // tier "read" is not category-unique — split so browsers get the CiC hint + // and trading platforms get "ask the user" instead. 
+ const readBrowsers = tiered.filter( + (t) => + t.tier === "read" && + getDeniedCategoryForApp(t.bundleId, t.displayName) === "browser", + ); + const readOther = tiered.filter( + (t) => + t.tier === "read" && + getDeniedCategoryForApp(t.bundleId, t.displayName) !== "browser", + ); + const clickTier = tiered.filter((t) => t.tier === "click"); + + const parts: string[] = []; + + if (readBrowsers.length > 0) { + const names = readBrowsers.map((b) => `"${b.displayName}"`).join(", "); + parts.push( + `${names} ${readBrowsers.length === 1 ? "is a browser" : "are browsers"} — ` + + `granted at tier "read" (visible in screenshots only; no clicks or ` + + `typing). You can read what's on screen but cannot navigate, click, ` + + `or type into ${readBrowsers.length === 1 ? "it" : "them"}. For browser ` + + `interaction, use the Claude-in-Chrome MCP (tools named ` + + `\`mcp__Claude_in_Chrome__*\`; load via ToolSearch if deferred).`, + ); + } + + if (readOther.length > 0) { + const names = readOther.map((t) => `"${t.displayName}"`).join(", "); + parts.push( + `${names} ${readOther.length === 1 ? "is" : "are"} granted at tier ` + + `"read" (visible in screenshots only; no clicks or typing). You can ` + + `read what's on screen but cannot interact. Ask the user to take any ` + + `actions in ${readOther.length === 1 ? "this app" : "these apps"} ` + + `themselves.`, + ); + } + + if (clickTier.length > 0) { + const names = clickTier.map((t) => `"${t.displayName}"`).join(", "); + parts.push( + `${names} ${clickTier.length === 1 ? "has" : "have"} terminal or IDE ` + + `capabilities — granted at tier "click" (visible + plain left-click ` + + `only; NO typing, key presses, right-click, modifier-clicks, or ` + + `drag-drop). You can click buttons and scroll output, but ` + + `${clickTier.length === 1 ? "its" : "their"} integrated terminal and ` + + `editor are off-limits to keyboard input. Right-click (context-menu ` + + `Paste) and dragging text onto ${clickTier.length === 1 ? 
"it" : "them"} ` + + `require tier "full". For shell commands, use the Bash tool.`, + ); + } + + if (parts.length === 0) return ""; + // Same anti-subversion clause the gate errors carry — said upfront so the + // model doesn't reach for osascript/cliclick after seeing "no clicks/typing". + return parts.join("\n\n") + TIER_ANTI_SUBVERSION; +} + +/** + * Build guidance text for apps stripped by the user's Settings auto-deny + * list. Returned inline in the okJson response so the agent knows (a) the + * app is auto-denied by request_access and (b) the escape hatch + * is to ask the human to edit Settings, not to retry or reword the request. + */ +function buildUserDeniedGuidance( + userDenied: Array<{ requestedName: string; displayName: string }>, +): string { + const names = userDenied.map((d) => `"${d.displayName}"`).join(", "); + const one = userDenied.length === 1; + return ( + `${names} ${one ? "is" : "are"} in the user's auto-deny list ` + + `(Settings → Desktop app (General) → Computer Use → Denied apps). ` + + `Requests for ` + + `${one ? "this app" : "these apps"} are automatically denied. If you need access for ` + + `this task, ask the user to remove ${one ? "it" : "them"} from their ` + + `deny list in Settings — you cannot request this through the tool.` + ); +} + +/** + * Guidance for policy-denied apps (baked-in blocklist, not user-editable). + * Unlike userDenied, there is no escape hatch — the agent is told to find + * another approach. + */ +function buildPolicyDeniedGuidance( + policyDenied: Array<{ requestedName: string; displayName: string }>, +): string { + const names = policyDenied.map((d) => `"${d.displayName}"`).join(", "); + const one = policyDenied.length === 1; + return ( + `${names} ${one ? "is" : "are"} blocked by policy for computer use. ` + + `Requests for ${one ? "this app" : "these apps"} are automatically ` + + `denied regardless of what the user has approved. There is no Settings ` + + `override. 
Inform the user that you cannot access ` + + `${one ? "this app" : "these apps"} and suggest an alternative ` + + `approach if one exists. Do not try to directly subvert this block ` + + `regardless of the user's request.` + ); +} + +/** + * Telemetry helper — counts by category. Field names (`denied_*`) are kept + * for schema compat; interpret as "assigned non-full tier" in dashboards. + */ +function tierAssignmentTelemetry( + tiered: TieredApp[], +): Pick { + // `denied_browser_count` now counts ALL tier-"read" grants (browsers + + // trading). The field name was already legacy-only before trading existed + // (dashboards read it as "non-full tier"), so no new column. + const browserCount = tiered.filter((t) => t.tier === "read").length; + const terminalCount = tiered.filter((t) => t.tier === "click").length; + return { + ...(browserCount > 0 && { denied_browser_count: browserCount }), + ...(terminalCount > 0 && { denied_terminal_count: terminalCount }), + }; +} + +/** + * Sibling of `handleRequestAccess`. Same app-resolution + TCC-threading, but + * routes to the teach approval dialog and fires `onTeachModeActivated` on + * success. No grant-flag checkboxes (clipboard/systemKeys) in teach mode — + * the tool schema omits those fields. + * + * Unlike `request_access`, this ALWAYS shows the dialog even when every + * requested app is already granted. Teach mode is a distinct UX the user + * must explicitly consent to (main window hides) — idempotent app grants + * don't imply consent to being guided. + */ +async function handleRequestTeachAccess( + adapter: ComputerUseHostAdapter, + args: Record, + overrides: ComputerUseOverrides, + tccState: { accessibility: boolean; screenRecording: boolean } | undefined, +): Promise { + if (!overrides.onTeachPermissionRequest) { + return errorResult( + "Teach mode is not available in this session.", + "feature_unavailable", + ); + } + + // Same as handleRequestAccess above — the dialog renders in the hidden + // main window. 
Model re-calling request_teach_access mid-tour (to add + // another app) is plausible since request_access docs say "call again + // mid-session to add more apps" and this uses the same grant model. + if (overrides.getTeachModeActive?.()) { + return errorResult( + "Teach mode is already active. To add more apps, end the current tour first, then call request_teach_access again with the full app list.", + "teach_mode_conflict", + ); + } + + const reason = requireString(args, "reason"); + if (reason instanceof Error) return errorResult(reason.message, "bad_args"); + + // TCC-ungranted branch — identical to handleRequestAccess's. The renderer + // shows the same TCC toggle panel regardless of which request tool got here. + if (tccState) { + const req: CuTeachPermissionRequest = { + requestId: randomUUID(), + reason, + apps: [], + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + tccState, + }; + await overrides.onTeachPermissionRequest(req); + + // Same re-check as handleRequestAccess — user may have granted while the + // dialog was up, and the pre-dialog snapshot would mislead the model. + const recheck = await adapter.ensureOsPermissions(); + if (recheck.granted) { + return errorResult( + "macOS Accessibility and Screen Recording are now both granted. " + + "Call request_teach_access again immediately — the next call will " + + "show the app selection list.", + ); + } + + const missing: string[] = []; + if (!recheck.accessibility) missing.push("Accessibility"); + if (!recheck.screenRecording) missing.push("Screen Recording"); + return errorResult( + `macOS ${missing.join(" and ")} permission(s) not yet granted. ` + + `The permission panel has been shown. 
Once the user grants the ` + + `missing permission(s), call request_teach_access again.`, + "tcc_not_granted", + ); + } + + const rawApps = args.apps; + if (!Array.isArray(rawApps) || !rawApps.every((a) => typeof a === "string")) { + return errorResult('"apps" must be an array of strings.', "bad_args"); + } + const apps = rawApps as string[]; + + const { + needDialog, + skipDialogGrants, + willHide, + tieredApps, + userDenied, + policyDenied, + } = await buildAccessRequest( + adapter, + apps, + overrides.allowedApps, + new Set(overrides.userDeniedBundleIds), + overrides.selectedDisplayId, + ); + + // All requested apps were user-denied (or unresolvable) and none pre-granted + // — skip the dialog entirely. Without this, onTeachPermissionRequest fires + // with apps:[] and the user sees an empty approval dialog where Allow and + // Deny produce the same result (granted=[] → teachModeActive stays false). + // handleRequestAccess has the equivalent guard at the needDialog.length + // check; teach didn't need one before user-deny because needDialog=[] + // previously implied skipDialogGrants.length > 0 (all-already-granted). 
+ if (needDialog.length === 0 && skipDialogGrants.length === 0) { + return okJson( + { + granted: [], + denied: [], + ...(policyDenied.length > 0 && { + policyDenied: { + apps: policyDenied, + guidance: buildPolicyDeniedGuidance(policyDenied), + }, + }), + ...(userDenied.length > 0 && { + userDenied: { + apps: userDenied, + guidance: buildUserDeniedGuidance(userDenied), + }, + }), + teachModeActive: false, + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + }, + { granted_count: 0, denied_count: 0 }, + ); + } + + const req: CuTeachPermissionRequest = { + requestId: randomUUID(), + reason, + apps: needDialog, + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + ...(willHide.length > 0 && { + willHide, + autoUnhideEnabled: adapter.getAutoUnhideEnabled(), + }), + }; + const response = await overrides.onTeachPermissionRequest(req); + + const granted = [...skipDialogGrants, ...response.granted]; + // Gate on explicit dialog consent, NOT on merged grant length. + // skipDialogGrants are pre-existing idempotent app grants — they don't + // imply the user said yes to THIS dialog. Without the userConsented + // check, Deny would still activate teach mode whenever any requested + // app was previously granted (worst case: needDialog=[] → Allow and + // Deny payloads are structurally identical). 
+ const teachModeActive = response.userConsented === true && granted.length > 0; + if (teachModeActive) { + overrides.onTeachModeActivated?.(); + } + + const grantedBundleIds = new Set(granted.map((g) => g.bundleId)); + const grantedTieredApps = tieredApps.filter((t) => + grantedBundleIds.has(t.bundleId), + ); + + return okJson( + { + granted, + denied: response.denied, + ...(policyDenied.length > 0 && { + policyDenied: { + apps: policyDenied, + guidance: buildPolicyDeniedGuidance(policyDenied), + }, + }), + ...(userDenied.length > 0 && { + userDenied: { + apps: userDenied, + guidance: buildUserDeniedGuidance(userDenied), + }, + }), + ...(grantedTieredApps.length > 0 && { + tierGuidance: buildTierGuidanceMessage(grantedTieredApps), + }), + teachModeActive, + screenshotFiltering: adapter.executor.capabilities.screenshotFiltering, + }, + { + // response.granted only — skipDialogGrants are idempotent re-grants. + // See handleRequestAccess's parallel comment. + granted_count: response.granted.length, + denied_count: response.denied.length, + ...tierAssignmentTelemetry(grantedTieredApps), + }, + ); +} + +// --------------------------------------------------------------------------- +// teach_step + teach_batch — shared step primitives +// --------------------------------------------------------------------------- + +/** A fully-validated teach step, anchor already scaled to logical points. */ +interface ValidatedTeachStep { + explanation: string; + nextPreview: string; + anchorLogical: TeachStepRequest["anchorLogical"]; + actions: Array<Record<string, unknown>>; +} + +/** + * Validate one raw step record and scale its anchor. `label` is prefixed to + * error messages so teach_batch can say `steps[2].actions[0]` instead of + * just `actions[0]`. + * + * The anchor transform is the whole coordinate story: model sends image-pixel + * coords (same space as click coords, per COORDINATES.md), `scaleCoord` turns + * them into logical points against `overrides.lastScreenshot`. 
For + * teach_batch, lastScreenshot stays at its pre-call value for the entire + * batch — same invariant as computer_batch's "coordinates refer to the + * PRE-BATCH screenshot". Anchors for step 2+ must therefore target elements + * the model can predict will be at those coordinates after step 1's actions. + */ +async function validateTeachStepArgs( + raw: Record<string, unknown>, + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + label: string, +): Promise<ValidatedTeachStep | Error> { + const explanation = requireString(raw, "explanation"); + if (explanation instanceof Error) { + return new Error(`${label}: ${explanation.message}`); + } + const nextPreview = requireString(raw, "next_preview"); + if (nextPreview instanceof Error) { + return new Error(`${label}: ${nextPreview.message}`); + } + + const actions = raw.actions; + if (!Array.isArray(actions)) { + return new Error( + `${label}: "actions" must be an array (empty is allowed).`, + ); + } + for (const [i, act] of actions.entries()) { + if (typeof act !== "object" || act === null) { + return new Error(`${label}: actions[${i}] must be an object`); + } + const action = (act as Record<string, unknown>).action; + if (typeof action !== "string") { + return new Error(`${label}: actions[${i}].action must be a string`); + } + if (!BATCHABLE_ACTIONS.has(action)) { + return new Error( + `${label}: actions[${i}].action="${action}" is not allowed. 
` + + `Allowed: ${[...BATCHABLE_ACTIONS].join(", ")}.`, + ); + } + } + + let anchorLogical: TeachStepRequest["anchorLogical"]; + if (raw.anchor !== undefined) { + const anchor = raw.anchor; + if ( + !Array.isArray(anchor) || + anchor.length !== 2 || + typeof anchor[0] !== "number" || + typeof anchor[1] !== "number" || + !Number.isFinite(anchor[0]) || + !Number.isFinite(anchor[1]) + ) { + return new Error( + `${label}: "anchor" must be a [x, y] number tuple or omitted.`, + ); + } + const display = await adapter.executor.getDisplaySize( + overrides.selectedDisplayId, + ); + anchorLogical = scaleCoord( + anchor[0], + anchor[1], + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + } + + return { + explanation, + nextPreview, + anchorLogical, + actions: actions as Array<Record<string, unknown>>, + }; +} + +/** Outcome of showing one tooltip + running its actions. */ +type TeachStepOutcome = + | { kind: "exit" } + | { kind: "ok"; results: BatchActionResult[] } + | { + kind: "action_error"; + executed: number; + failed: BatchActionResult; + remaining: number; + /** The inner action's telemetry (error_kind), forwarded so the + * caller can pass it to okJson and keep cu_tool_call accurate + * when the failure happened inside a batch. */ + telemetry: CuCallTelemetry | undefined; + }; + +/** + * Show the tooltip, block for Next/Exit, run actions on Next. + * + * Action execution is a straight lift from `handleComputerBatch`: + * prepareForAction ONCE per step (the user clicked Next — they consented to + * that step's sequence), pixelValidation OFF (committed sequence), frontmost + * gate still per-action, stop-on-first-error with partial results. + * + * Empty `actions` is valid — "read this, click Next to continue" steps. + * Assumes `overrides.onTeachStep` is set (caller guards). 
+ */ +async function executeTeachStep( + step: ValidatedTeachStep, + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<TeachStepOutcome> { + // Block until Next or Exit. Same pending-promise pattern as + // onPermissionRequest — host stores the resolver, overlay IPC fires it. + // `!` is safe: both callers guard on overrides.onTeachStep before reaching here. + const stepResult = await overrides.onTeachStep!({ + explanation: step.explanation, + nextPreview: step.nextPreview, + anchorLogical: step.anchorLogical, + }); + + if (stepResult.action === "exit") { + // The host's Exit handler also calls stopSession, so the turn is + // already unwinding. Caller decides what to return for the transcript. + // A PREVIOUS step's left_mouse_down may have left the OS button held. + await releaseHeldMouse(adapter); + return { kind: "exit" }; + } + + // Next clicked. Flip overlay to spinner before we start driving. + overrides.onTeachWorking?.(); + + if (step.actions.length === 0) { + return { kind: "ok", results: [] }; + } + + if (subGates.hideBeforeAction) { + const hidden = await adapter.executor.prepareForAction( + overrides.allowedApps.map((a) => a.bundleId), + overrides.selectedDisplayId, + ); + if (hidden.length > 0) { + overrides.onAppsHidden?.(hidden); + } + } + + const stepSubGates: CuSubGates = { + ...subGates, + hideBeforeAction: false, + pixelValidation: false, + // Anchors are pre-computed against the display at batch start. + // A mid-batch resolver switch would break tooltip positioning. + autoTargetDisplay: false, + }; + + const results: BatchActionResult[] = []; + for (const [i, act] of step.actions.entries()) { + // Same abort check as handleComputerBatch — Exit calls stopSession so + // this IS the exit path, just caught mid-dispatch instead of at the + // onTeachStep await above. Callers already handle { kind: "exit" }. 
+ if (overrides.isAborted?.()) { + await releaseHeldMouse(adapter); + return { kind: "exit" }; + } + // Same inter-step settle as handleComputerBatch. + if (i > 0) await sleep(10); + const action = act.action as string; + + // Drop mid-step screenshot piggyback — same invariant as computer_batch. + // Click coords stay anchored to the screenshot the model took BEFORE + // calling teach_step/teach_batch. + const { screenshot: _dropped, ...inner } = await dispatchAction( + action, + act, + adapter, + overrides, + stepSubGates, + ); + + const text = firstTextContent(inner); + const result = { action, ok: !inner.isError, output: text }; + results.push(result); + + if (inner.isError) { + await releaseHeldMouse(adapter); + return { + kind: "action_error", + executed: results.length - 1, + failed: result, + remaining: step.actions.length - results.length, + telemetry: inner.telemetry, + }; + } + } + + return { kind: "ok", results }; +} + +/** + * Fold a fresh screenshot into the result. Eliminates the separate + * screenshot tool call the model would otherwise make before the next + * teach_step (one fewer API round trip per step). handleScreenshot + * runs its own prepareForAction — that's correct: actions may have + * opened something outside the allowlist. The .screenshot piggyback + * flows through to serverDef.ts's stash → lastScreenshot updates → + * the next teach_step.anchor scales against THIS image, which is what + * the model is now looking at. + */ +async function appendTeachScreenshot( + resultJson: unknown, + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise { + const shotResult = await handleScreenshot(adapter, overrides, subGates); + if (shotResult.isError) { + // Hide+screenshot failed (rare — e.g. SCContentFilter error). Don't + // tank the step; just omit the image. Model will call screenshot + // itself and see the real error. 
+ return okJson(resultJson); + } + return { + content: [ + { type: "text", text: JSON.stringify(resultJson) }, + // handleScreenshot's content is [maybeMonitorNote, maybeHiddenNote, + // image]. Spread all — both notes are useful context and the model + // expects them alongside screenshots. + ...shotResult.content, + ], + // For serverDef.ts to stash. Next teach_step.anchor scales against this. + screenshot: shotResult.screenshot, + }; +} + +/** + * Show one guided-tour tooltip and block until the user clicks Next or Exit. + * On Next, execute `actions[]` with `computer_batch` semantics. + */ +async function handleTeachStep( + adapter: ComputerUseHostAdapter, + args: Record, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise { + if (!overrides.onTeachStep) { + return errorResult( + "Teach mode is not active. Call request_teach_access first.", + "teach_mode_not_active", + ); + } + + const step = await validateTeachStepArgs( + args, + adapter, + overrides, + "teach_step", + ); + if (step instanceof Error) return errorResult(step.message, "bad_args"); + + const outcome = await executeTeachStep(step, adapter, overrides, subGates); + + if (outcome.kind === "exit") { + return okJson({ exited: true }); + } + if (outcome.kind === "action_error") { + return okJson( + { + executed: outcome.executed, + failed: outcome.failed, + remaining: outcome.remaining, + }, + outcome.telemetry, + ); + } + + // ok. No screenshot for empty actions — screen didn't change, model's + // existing screenshot is still accurate. + if (step.actions.length === 0) { + return okJson({ executed: 0, results: [] }); + } + return appendTeachScreenshot( + { executed: outcome.results.length, results: outcome.results }, + adapter, + overrides, + subGates, + ); +} + +/** + * Queue a whole guided tour in one tool call. Parallels `computer_batch`: N + * steps → one model→API round trip instead of N. 
Each step still blocks for + * its own Next click (the user paces the tour), but the model doesn't wait + * for a round trip between steps. + * + * Validates ALL steps upfront so a typo in step 5 doesn't surface after the + * user has already clicked through steps 1–4. + * + * Anchors for every step scale against the pre-call `lastScreenshot` — same + * PRE-BATCH invariant as computer_batch. Steps 2+ should either omit anchor + * (centered tooltip) or target elements the model predicts won't have moved. + * + * Result shape: + * {exited: true, stepsCompleted: N} — user clicked Exit + * {stepsCompleted, stepFailed, executed, failed, …} — action error at step N + * {stepsCompleted, results: [...]} + screenshot — all steps ran + */ +async function handleTeachBatch( + adapter: ComputerUseHostAdapter, + args: Record, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise { + if (!overrides.onTeachStep) { + return errorResult( + "Teach mode is not active. Call request_teach_access first.", + "teach_mode_not_active", + ); + } + + const rawSteps = args.steps; + if (!Array.isArray(rawSteps) || rawSteps.length < 1) { + return errorResult('"steps" must be a non-empty array.', "bad_args"); + } + + // Validate upfront — fail fast before showing any tooltip. 
+ const steps: ValidatedTeachStep[] = []; + for (const [i, raw] of rawSteps.entries()) { + if (typeof raw !== "object" || raw === null) { + return errorResult(`steps[${i}] must be an object`, "bad_args"); + } + const v = await validateTeachStepArgs( + raw as Record<string, unknown>, + adapter, + overrides, + `steps[${i}]`, + ); + if (v instanceof Error) return errorResult(v.message, "bad_args"); + steps.push(v); + } + + const allResults: BatchActionResult[][] = []; + for (const [i, step] of steps.entries()) { + const outcome = await executeTeachStep(step, adapter, overrides, subGates); + + if (outcome.kind === "exit") { + return okJson({ exited: true, stepsCompleted: i }); + } + if (outcome.kind === "action_error") { + return okJson( + { + stepsCompleted: i, + stepFailed: i, + executed: outcome.executed, + failed: outcome.failed, + remaining: outcome.remaining, + results: allResults, + }, + outcome.telemetry, + ); + } + allResults.push(outcome.results); + } + + // Final screenshot only if any step ran actions (screen changed). + const screenChanged = steps.some((s) => s.actions.length > 0); + const resultJson = { stepsCompleted: steps.length, results: allResults }; + if (!screenChanged) { + return okJson(resultJson); + } + return appendTeachScreenshot(resultJson, adapter, overrides, subGates); +} + +/** + * Build the hidden-apps note that accompanies a screenshot. Tells the model + * which apps got hidden (not in allowlist) and how to add them. Returns + * undefined when nothing was hidden since the last screenshot. + */ +async function buildHiddenNote( + adapter: ComputerUseHostAdapter, + hiddenSinceLastSeen: string[], +): Promise<string | undefined> { + if (hiddenSinceLastSeen.length === 0) return undefined; + const running = await adapter.executor.listRunningApps(); + const nameOf = new Map(running.map((a) => [a.bundleId, a.displayName])); + const names = hiddenSinceLastSeen.map((id) => nameOf.get(id) ?? 
id); + const list = names.map((n) => `"${n}"`).join(", "); + const one = names.length === 1; + return ( + `${list} ${one ? "was" : "were"} open and got hidden before this screenshot ` + + `(not in the session allowlist). If a previous action was meant to open ` + + `${one ? "it" : "one of them"}, that's why you don't see it — call ` + + `request_access to add ${one ? "it" : "them"} to the allowlist.` + ); +} + +/** + * Assign a human-readable label to each display. Falls back to `display N` + * when NSScreen.localizedName is undefined; disambiguates identical labels + * (matched-pair external monitors) with a `(2)` suffix. Used by both + * buildMonitorNote and handleSwitchDisplay so the name the model sees in a + * screenshot note is the same name it can pass back to switch_display. + */ +function uniqueDisplayLabels( + displays: readonly DisplayGeometry[], +): Map<number, string> { + // Sort by displayId so the (N) suffix is stable regardless of + // NSScreen.screens iteration order — same label always maps to same + // physical display across buildMonitorNote → switch_display round-trip, + // even if display configuration reorders between the two calls. + const sorted = [...displays].sort((a, b) => a.displayId - b.displayId); + const counts = new Map<string, number>(); + const out = new Map<number, string>(); + for (const d of sorted) { + const base = d.label ?? `display ${d.displayId}`; + const n = (counts.get(base) ?? 0) + 1; + counts.set(base, n); + out.set(d.displayId, n === 1 ? base : `${base} (${n})`); + } + return out; +} + +/** + * Build the monitor-context text that accompanies a screenshot. Tells the + * model which monitor it's looking at (by human name), lists other attached + * monitors, and flags when the monitor changed vs. the previous screenshot. + * + * Only emitted when there are 2+ displays AND (first screenshot OR the + * display changed). Single-monitor setups and steady-state same-monitor + * screenshots get no text — avoids noise. 
+ */ +async function buildMonitorNote( + adapter: ComputerUseHostAdapter, + shotDisplayId: number, + lastDisplayId: number | undefined, + canSwitchDisplay: boolean, +): Promise { + // listDisplays failure (e.g. Swift returns zero screens during monitor + // hot-unplug) must not tank the screenshot — this note is optional context. + let displays; + try { + displays = await adapter.executor.listDisplays(); + } catch (e) { + adapter.logger.warn(`[computer-use] listDisplays failed: ${String(e)}`); + return undefined; + } + if (displays.length < 2) return undefined; + + const labels = uniqueDisplayLabels(displays); + const nameOf = (id: number): string => labels.get(id) ?? `display ${id}`; + + const current = nameOf(shotDisplayId); + const others = displays + .filter((d) => d.displayId !== shotDisplayId) + .map((d) => nameOf(d.displayId)); + const switchHint = canSwitchDisplay + ? " Use switch_display to capture a different monitor." + : ""; + const othersList = + others.length > 0 + ? ` Other attached monitors: ${others.map((n) => `"${n}"`).join(", ")}.` + + switchHint + : ""; + + // 0 is kCGNullDirectDisplay (sentinel from old sessions persisted + // pre-multimon) — treat same as undefined. + if (lastDisplayId === undefined || lastDisplayId === 0) { + return `This screenshot was taken on monitor "${current}".` + othersList; + } + if (lastDisplayId !== shotDisplayId) { + const prev = nameOf(lastDisplayId); + return ( + `This screenshot was taken on monitor "${current}", which is different ` + + `from your previous screenshot (taken on "${prev}").` + + othersList + ); + } + return undefined; +} + +async function handleScreenshot( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise { + // §2 — empty allowlist → tool error, no screenshot. + if (overrides.allowedApps.length === 0) { + return errorResult( + "No applications are granted for this session. 
Call request_access first.", + "allowlist_empty", + ); + } + + // Atomic resolve→prepare→capture (one Swift call, no scheduler gap). + // Off → fall through to separate-calls path below. + if (subGates.autoTargetDisplay) { + // Model's explicit switch_display pin overrides everything — Swift's + // straight cuDisplayInfo(forDisplayID:) passthrough, no chase chain. + // Otherwise sticky display: only auto-resolve when the allowed-app + // set has changed since the display was last resolved. Prevents the + // resolver yanking the display on every screenshot. + const allowedBundleIds = overrides.allowedApps.map((a) => a.bundleId); + const currentAppSetKey = allowedBundleIds.slice().sort().join(","); + const appSetChanged = currentAppSetKey !== overrides.displayResolvedForApps; + const autoResolve = !overrides.displayPinnedByModel && appSetChanged; + + const result = await adapter.executor.resolvePrepareCapture({ + allowedBundleIds, + preferredDisplayId: overrides.selectedDisplayId, + autoResolve, + // Keep the hideBeforeAction sub-gate independently rollable — + // atomic path honors the same toggle the non-atomic path checks + // at the prepareForAction call site. + doHide: subGates.hideBeforeAction, + }); + + // Non-atomic path's takeScreenshotWithRetry has a MIN_SCREENSHOT_BYTES + // check + retry. The atomic call is expensive (resolve+prepare+capture), + // so no retry here — just a warning when the result is implausibly + // small (transient display state like sleep wake). Skip when + // captureError is set (base64 is intentionally empty then). 
+ if ( + result.captureError === undefined && + decodedByteLength(result.base64) < MIN_SCREENSHOT_BYTES + ) { + adapter.logger.warn( + `[computer-use] resolvePrepareCapture result implausibly small (${decodedByteLength(result.base64)} bytes decoded) — possible transient display state`, + ); + } + + // Resolver picked a different display than the session had selected + // (host window moved, or allowed app on a different display). Write + // the pick back to session so teach overlay positioning and subsequent + // non-resolver calls track the same display. Fire-and-forget. + if (result.displayId !== overrides.selectedDisplayId) { + adapter.logger.debug( + `[computer-use] resolver: preferred=${overrides.selectedDisplayId} resolved=${result.displayId}`, + ); + overrides.onResolvedDisplayUpdated?.(result.displayId); + } + // Record the app set this display was resolved for, so the next + // screenshot skips auto-resolve until the set changes again. Gated on + // autoResolve (not just appSetChanged) — when pinned, we didn't + // actually resolve, so don't update the key. + if (autoResolve) { + overrides.onDisplayResolvedForApps?.(currentAppSetKey); + } + + // Report hidden apps only when the model has already seen the screen. + let hiddenSinceLastSeen: string[] = []; + if (overrides.lastScreenshot !== undefined) { + hiddenSinceLastSeen = result.hidden; + } + if (result.hidden.length > 0) { + overrides.onAppsHidden?.(result.hidden); + } + + // Partial-success case: hide succeeded, capture failed (SCK perm + // revoked mid-session). onAppsHidden fired above so auto-unhide will + // restore hidden apps at turn end. Now surface the error to the model. + if (result.captureError !== undefined) { + return errorResult(result.captureError, "capture_failed"); + } + + const hiddenNote = await buildHiddenNote(adapter, hiddenSinceLastSeen); + + // Cherry-pick — don't spread `result` (would leak resolver fields into lastScreenshot). 
+ const shot: ScreenshotResult = { + base64: result.base64, + width: result.width, + height: result.height, + displayWidth: result.displayWidth, + displayHeight: result.displayHeight, + displayId: result.displayId, + originX: result.originX, + originY: result.originY, + }; + + const monitorNote = await buildMonitorNote( + adapter, + shot.displayId, + overrides.lastScreenshot?.displayId, + overrides.onDisplayPinned !== undefined, + ); + + return { + content: [ + ...(monitorNote ? [{ type: "text" as const, text: monitorNote }] : []), + ...(hiddenNote ? [{ type: "text" as const, text: hiddenNote }] : []), + { + type: "image", + data: shot.base64, + mimeType: "image/jpeg", + }, + ], + screenshot: shot, + }; + } + + // Same hide+defocus sequence as input actions. Screenshot needs hide too + // — if a non-allowlisted app is on top, SCContentFilter would composite it + // out, but the pixels BELOW it are what the model would see, and those are + // NOT what's actually there. Hiding first makes the screenshot TRUE. + let hiddenSinceLastSeen: string[] = []; + if (subGates.hideBeforeAction) { + const hidden = await adapter.executor.prepareForAction( + overrides.allowedApps.map((a) => a.bundleId), + overrides.selectedDisplayId, + ); + // "Something appeared since the model last looked." Report whenever: + // (a) prepare hid something AND + // (b) the model has ALREADY SEEN the screen (lastScreenshot is set). + // + // (b) is the discriminator that silences the first screenshot's + // expected-noise hide. NOT a delta against a cumulative set — that was + // the earlier bug: cuHiddenDuringTurn only grows, so once Preview is in + // it (from the first screenshot's hide), subsequent re-hides of Preview + // delta to zero. The double-click → Preview opens → re-hide → silent + // loop never breaks. + // + // With this check: every re-hide fires. If the model loops "click → file + // opens in Preview → screenshot → Preview hidden", it gets told EVERY + // time. 
Eventually it'll request_access for Preview (or give up). + // + // False positive: user alt-tabs mid-turn → Safari re-hidden → reported. + // Rare, and "Safari appeared" is at worst mild noise — far better than + // the false-negative of never explaining why the file vanished. + if (overrides.lastScreenshot !== undefined) { + hiddenSinceLastSeen = hidden; + } + if (hidden.length > 0) { + overrides.onAppsHidden?.(hidden); + } + } + + const allowedBundleIds = overrides.allowedApps.map((g) => g.bundleId); + const shot = await takeScreenshotWithRetry( + adapter.executor, + allowedBundleIds, + adapter.logger, + overrides.selectedDisplayId, + ); + + const hiddenNote = await buildHiddenNote(adapter, hiddenSinceLastSeen); + + const monitorNote = await buildMonitorNote( + adapter, + shot.displayId, + overrides.lastScreenshot?.displayId, + overrides.onDisplayPinned !== undefined, + ); + + return { + content: [ + ...(monitorNote ? [{ type: "text" as const, text: monitorNote }] : []), + ...(hiddenNote ? [{ type: "text" as const, text: hiddenNote }] : []), + { + type: "image", + data: shot.base64, + mimeType: "image/jpeg", + }, + ], + // Piggybacked for serverDef.ts to stash on InternalServerContext. + screenshot: shot, + }; +} + +/** + * Region-crop upscaled screenshot. Coord invariant (computer_use_v2.py:1092): + * click coords ALWAYS refer to the full-screen screenshot, never the zoom. + * Enforced structurally: this handler's return has NO `.screenshot` field, + * so serverDef.ts's `if (result.screenshot)` branch cannot fire and + * `cuLastScreenshot` is never touched. `executor.zoom()`'s return type also + * lacks displayWidth/displayHeight, so it's not assignable to + * `ScreenshotResult` even by accident. + */ +async function handleZoom( + adapter: ComputerUseHostAdapter, + args: Record, + overrides: ComputerUseOverrides, +): Promise { + // region: [x0, y0, x1, y1] in IMAGE-PX of lastScreenshot — same space the + // model reads click coords from. 
+ const region = args.region; + if (!Array.isArray(region) || region.length !== 4) { + return errorResult( + "region must be an array of length 4: [x0, y0, x1, y1]", + "bad_args", + ); + } + const [x0, y0, x1, y1] = region; + if (![x0, y0, x1, y1].every((v) => typeof v === "number" && v >= 0)) { + return errorResult( + "region values must be non-negative numbers", + "bad_args", + ); + } + if (x1 <= x0) + return errorResult("region x1 must be greater than x0", "bad_args"); + if (y1 <= y0) + return errorResult("region y1 must be greater than y0", "bad_args"); + + const last = overrides.lastScreenshot; + if (!last) { + return errorResult( + "take a screenshot before zooming (region coords are relative to it)", + "state_conflict", + ); + } + if (x1 > last.width || y1 > last.height) { + return errorResult( + `region exceeds screenshot bounds (${last.width}×${last.height})`, + "bad_args", + ); + } + + // image-px → logical-pt. Same ratio as scaleCoord (:198-199) — + // displayWidth / width, not 1/scaleFactor. The ratio is folded. + const ratioX = last.displayWidth / last.width; + const ratioY = last.displayHeight / last.height; + const regionLogical = { + x: x0 * ratioX, + y: y0 * ratioY, + w: (x1 - x0) * ratioX, + h: (y1 - y0) * ratioY, + }; + + const allowedIds = overrides.allowedApps.map((g) => g.bundleId); + // Crop from the same display as lastScreenshot so the zoom region + // matches the image the model is reading coords from. + const zoomed = await adapter.executor.zoom( + regionLogical, + allowedIds, + last.displayId, + ); + + // Return the image. NO `.screenshot` piggyback — this is the invariant. + return { + content: [{ type: "image", data: zoomed.base64, mimeType: "image/jpeg" }], + }; +} + +/** Shared handler for all five click variants. 
*/ +async function handleClickVariant( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, + button: "left" | "right" | "middle", + count: 1 | 2 | 3, +): Promise<CuCallToolResult> { + // A prior left_mouse_down may have set mouseButtonHeld without a matching + // left_mouse_up (e.g. drag rejected by a tier gate, model falls back to + // left_click). executor.click() does its own mouseDown+mouseUp, releasing + // the OS button — but without this, the JS flag stays true and all + // subsequent mouse_move calls take the held-button path ("mouse"/ + // "mouse_full" actionKind + hit-test), causing spurious rejections on + // click-tier and read-tier windows. Release first so click() gets a clean + // slate. + if (mouseButtonHeld) { + await adapter.executor.mouseUp(); + mouseButtonHeld = false; + mouseMoved = false; + } + + const coord = extractCoordinate(args); + if (coord instanceof Error) return errorResult(coord.message, "bad_args"); + const [rawX, rawY] = coord; + + // left_click(coordinate=[x,y], text="shift") — hold modifiers + // during the click. Same chord parsing as the key tool. + let modifiers: string[] | undefined; + if (args.text !== undefined) { + if (typeof args.text !== "string") { + return errorResult("text must be a string", "bad_args"); + } + // Same gate as handleKey/handleHoldKey. withModifiers presses each name + // via native.key(m, "press") — a non-modifier like "q" in text="cmd+q" + // gets pressed while Cmd is held → Cmd+Q fires before the click. + if ( + isSystemKeyCombo(args.text, adapter.executor.capabilities.platform) && + !overrides.grantFlags.systemKeyCombos + ) { + return errorResult( + `The modifier chord "${args.text}" would fire a system shortcut.
` + + "Request the systemKeyCombos grant flag via request_access, or use " + + "only modifier keys (shift, ctrl, alt, cmd) in the text parameter.", + "grant_flag_required", + ); + } + modifiers = parseKeyChord(args.text); + } + + // Right/middle-click and any click with a modifier chord escalate to + // keyboard-equivalent input at tier "click" (context-menu Paste, chord + // keystrokes). Compute once, pass to both gates. + const clickActionKind: CuActionKind = + button !== "left" || (modifiers !== undefined && modifiers.length > 0) + ? "mouse_full" + : "mouse"; + + const gate = await runInputActionGates( + adapter, + overrides, + subGates, + clickActionKind, + ); + if (gate) return gate; + + const display = await adapter.executor.getDisplaySize( + overrides.selectedDisplayId, + ); + + // §6 item P — pixel-validation staleness check. Sub-gated. + // Runs AFTER the gates (no point validating if we're about to refuse + // anyway) but BEFORE the executor call. + if (subGates.pixelValidation) { + const { xPct, yPct } = coordToPercentageForPixelCompare( + rawX, + rawY, + overrides.coordinateMode, + overrides.lastScreenshot, + ); + const validation = await validateClickTarget( + adapter.cropRawPatch, + overrides.lastScreenshot, + xPct, + yPct, + async () => { + // The fresh screenshot for validation uses the SAME allow-set as + // the model's last screenshot did, so we compare like with like. + const allowedIds = overrides.allowedApps.map((g) => g.bundleId); + try { + // Fresh shot must match lastScreenshot's display, not the current + // selection — pixel-compare is against the model's last image. + return await adapter.executor.screenshot({ + allowedBundleIds: allowedIds, + displayId: overrides.lastScreenshot?.displayId, + }); + } catch { + return null; + } + }, + adapter.logger, + ); + if (!validation.valid && validation.warning) { + // Warning result — model told to re-screenshot. 
+ return okText(validation.warning); + } + } + + const { x, y } = scaleCoord( + rawX, + rawY, + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + + const hitGate = await runHitTestGate( + adapter, + overrides, + subGates, + x, + y, + clickActionKind, + ); + if (hitGate) return hitGate; + + await adapter.executor.click(x, y, button, count, modifiers); + return okText("Clicked."); +} + +async function handleType( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const text = requireString(args, "text"); + if (text instanceof Error) return errorResult(text.message, "bad_args"); + + const gate = await runInputActionGates( + adapter, + overrides, + subGates, + "keyboard", + ); + if (gate) return gate; + + // §6 item 3 — clipboard-paste fast path for multi-line. Sub-gated AND + // requires clipboardWrite grant. The save/restore + read-back-verify + // lives in the EXECUTOR (task #5), not here. Here we just route. + const viaClipboard = + text.includes("\n") && + overrides.grantFlags.clipboardWrite && + subGates.clipboardPasteMultiline; + + if (viaClipboard) { + await adapter.executor.type(text, { viaClipboard: true }); + return okText("Typed (via clipboard)."); + } + + // §6 item 7 — grapheme-cluster iteration. Prevents ZWJ emoji → �. + // §6 item 4 — 8ms between graphemes (125 Hz USB polling). Battle-tested: + // sleep BEFORE each keystroke, not after. + // + // \n, \r, \t MUST route through executor.key(), not type(). Two reasons: + // 1. enigo.text("\n") on macOS posts a stale CGEvent with virtualKey=0 + // after stripping the newline — virtualKey 0 is the 'a' key, so a + // ghost 'a' gets typed. Upstream bug in enigo 0.6.1 fast_text(). + // 2. Unicode text-insertion of '\n' is not a Return key press. URL bars + // and terminals ignore it; the model's intent (submit/execute) is lost.
+ // CRLF (\r\n) is one grapheme cluster (UAX #29 GB3), so check for it too. + const graphemes = segmentGraphemes(text); + for (const [i, g] of graphemes.entries()) { + // Same abort check as handleComputerBatch. At 8ms/grapheme a 50-char + // type() runs ~400ms; this is where an in-flight batch actually + // spends its time. + if (overrides.isAborted?.()) { + return errorResult( + `Typing aborted after ${i} of ${graphemes.length} graphemes (user interrupt).`, + ); + } + await sleep(INTER_GRAPHEME_SLEEP_MS); + if (g === "\n" || g === "\r" || g === "\r\n") { + await adapter.executor.key("return"); + } else if (g === "\t") { + await adapter.executor.key("tab"); + } else { + await adapter.executor.type(g, { viaClipboard: false }); + } + } + return okText(`Typed ${graphemes.length} grapheme(s).`); +} + +async function handleKey( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const keySequence = requireString(args, "text"); + if (keySequence instanceof Error) + return errorResult("text is required", "bad_args"); + + // Cap 100, error strings match. + let repeat: number | undefined; + if (args.repeat !== undefined) { + if ( + typeof args.repeat !== "number" || + !Number.isInteger(args.repeat) || + args.repeat < 1 + ) { + return errorResult("repeat must be a positive integer", "bad_args"); + } + if (args.repeat > 100) { + return errorResult("repeat exceeds maximum of 100", "bad_args"); + } + repeat = args.repeat; + } + + // §2 — blocklist check BEFORE gates. A blocked combo with an ungranted + // app frontmost should return the blocklist error, not the frontmost + // error — the model's fix is to request the flag, not change focus. + if ( + isSystemKeyCombo(keySequence, adapter.executor.capabilities.platform) && + !overrides.grantFlags.systemKeyCombos + ) { + return errorResult( + `"${keySequence}" is a system-level shortcut.
Request the \`systemKeyCombos\` grant via request_access to use it.`, + "grant_flag_required", + ); + } + + const gate = await runInputActionGates( + adapter, + overrides, + subGates, + "keyboard", + ); + if (gate) return gate; + + await adapter.executor.key(keySequence, repeat); + return okText("Key pressed."); +} + +async function handleScroll( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const coord = extractCoordinate(args); + if (coord instanceof Error) return errorResult(coord.message, "bad_args"); + const [rawX, rawY] = coord; + + // Uses scroll_direction + scroll_amount. + // Map to our dx/dy executor interface. + const dir = args.scroll_direction; + if (dir !== "up" && dir !== "down" && dir !== "left" && dir !== "right") { + return errorResult( + "scroll_direction must be 'up', 'down', 'left', or 'right'", + "bad_args", + ); + } + const amount = args.scroll_amount; + if (typeof amount !== "number" || !Number.isInteger(amount) || amount < 0) { + return errorResult("scroll_amount must be a non-negative int", "bad_args"); + } + if (amount > 100) { + return errorResult("scroll_amount exceeds maximum of 100", "bad_args"); + } + // up → dy = -amount; down → dy = +amount; left → dx = -amount; right → dx = +amount. + const dx = dir === "left" ? -amount : dir === "right" ? amount : 0; + const dy = dir === "up" ? -amount : dir === "down" ?
amount : 0; + + const gate = await runInputActionGates(adapter, overrides, subGates, "mouse"); + if (gate) return gate; + + const display = await adapter.executor.getDisplaySize( + overrides.selectedDisplayId, + ); + const { x, y } = scaleCoord( + rawX, + rawY, + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + + // When the button is held, executor.scroll's internal moveMouse generates + // a leftMouseDragged event (enigo reads NSEvent.pressedMouseButtons) — + // same mechanism as handleMoveMouse's held-button path. Upgrade the + // hit-test to "mouse_full" so scroll can't be used to drag-drop text onto + // a click-tier terminal, and mark mouseMoved so the subsequent + // left_mouse_up hit-tests as a drop not a click-release. + const hitGate = await runHitTestGate( + adapter, + overrides, + subGates, + x, + y, + mouseButtonHeld ? "mouse_full" : "mouse", + ); + if (hitGate) return hitGate; + if (mouseButtonHeld) mouseMoved = true; + + await adapter.executor.scroll(x, y, dx, dy); + return okText("Scrolled."); +} + +async function handleDrag( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + // executor.drag() does its own press+release internally. Without this + // defensive clear, a prior left_mouse_down leaves mouseButtonHeld=true + // across the drag and desyncs the flag from OS state — same mechanism as + // the handleClickVariant clear above. Release first so drag() gets a + // clean slate. + if (mouseButtonHeld) { + await adapter.executor.mouseUp(); + mouseButtonHeld = false; + mouseMoved = false; + } + + // `coordinate` is the END point + // (required). `start_coordinate` is OPTIONAL — when omitted, drag from + // current cursor position.
+ const endCoord = extractCoordinate(args, "coordinate"); + if (endCoord instanceof Error) + return errorResult(endCoord.message, "bad_args"); + const rawTo = endCoord; + + let rawFrom: [number, number] | undefined; + if (args.start_coordinate !== undefined) { + const startCoord = extractCoordinate(args, "start_coordinate"); + if (startCoord instanceof Error) + return errorResult(startCoord.message, "bad_args"); + rawFrom = startCoord; + } + // else: rawFrom stays undefined → executor drags from current cursor. + + const gate = await runInputActionGates(adapter, overrides, subGates, "mouse"); + if (gate) return gate; + + const display = await adapter.executor.getDisplaySize( + overrides.selectedDisplayId, + ); + const from = + rawFrom === undefined + ? undefined + : scaleCoord( + rawFrom[0], + rawFrom[1], + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + const to = scaleCoord( + rawTo[0], + rawTo[1], + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + + // Check both drag endpoints. `from` is where the mouseDown happens (picks + // up), `to` is where mouseUp happens (drops). When start_coordinate is + // omitted the drag begins at the cursor — same bypass as mouse_move → + // left_mouse_down, so read the cursor and hit-test it (mirrors + // handleLeftMouseDown). + // + // The `to` endpoint uses "mouse_full" (not "mouse"): dropping text onto a + // terminal inserts it as if typed (macOS text drag-drop). Same threat as + // right-click→Paste. `from` stays "mouse" — picking up is a read. + const fromPoint = from ?? 
(await adapter.executor.getCursorPosition()); + const fromGate = await runHitTestGate( + adapter, + overrides, + subGates, + fromPoint.x, + fromPoint.y, + "mouse", + ); + if (fromGate) return fromGate; + const toGate = await runHitTestGate( + adapter, + overrides, + subGates, + to.x, + to.y, + "mouse_full", + ); + if (toGate) return toGate; + + await adapter.executor.drag(from, to); + return okText("Dragged."); +} + +async function handleMoveMouse( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const coord = extractCoordinate(args); + if (coord instanceof Error) return errorResult(coord.message, "bad_args"); + const [rawX, rawY] = coord; + + // When the button is held, moveMouse generates leftMouseDragged events on + // the window under the cursor — that's interaction, not positioning. + // Upgrade to "mouse" and hit-test the destination. When the button is NOT + // held: pure positioning, passes at any tier, no hit-test (mouseDown/Up + // hit-test the cursor to close the mouse_move→left_mouse_down decomposition). + const actionKind: CuActionKind = mouseButtonHeld ? "mouse" : "mouse_position"; + const gate = await runInputActionGates( + adapter, + overrides, + subGates, + actionKind, + ); + if (gate) return gate; + + const display = await adapter.executor.getDisplaySize( + overrides.selectedDisplayId, + ); + const { x, y } = scaleCoord( + rawX, + rawY, + overrides.coordinateMode, + display, + overrides.lastScreenshot, + adapter.logger, + ); + + if (mouseButtonHeld) { + // "mouse_full" — same as left_click_drag's to-endpoint. Dragging onto a + // click-tier terminal is text injection regardless of which primitive + // (atomic drag vs. decomposed down/move/up) delivers the events.
+ const hitGate = await runHitTestGate( + adapter, + overrides, + subGates, + x, + y, + "mouse_full", + ); + if (hitGate) return hitGate; + } + + await adapter.executor.moveMouse(x, y); + if (mouseButtonHeld) mouseMoved = true; + return okText("Moved."); +} + +async function handleOpenApplication( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, +): Promise<CuCallToolResult> { + const app = requireString(args, "app"); + if (app instanceof Error) return errorResult(app.message, "bad_args"); + + // Resolve display-name → bundle ID. Same logic as request_access. + const allowed = new Set(overrides.allowedApps.map((g) => g.bundleId)); + let targetBundleId: string | undefined; + + if (looksLikeBundleId(app) && allowed.has(app)) { + targetBundleId = app; + } else { + // Try display name → bundle ID, but ONLY against the allowlist itself. + // Avoids paying the listInstalledApps() cost on the hot path and is + // arguably more correct: if the user granted "Slack", the model asking + // to open "Slack" should match THAT grant. + const match = overrides.allowedApps.find( + (g) => g.displayName.toLowerCase() === app.toLowerCase(), + ); + targetBundleId = match?.bundleId; + } + + if (!targetBundleId || !allowed.has(targetBundleId)) { + return errorResult( + `"${app}" is not granted for this session. Call request_access first.`, + "app_not_granted", + ); + } + + // open_application works at any tier — bringing an app forward is exactly + // what tier "read" enables (you need it on screen to screenshot it). The + // tier gates on click/type catch any follow-up interaction. + + await adapter.executor.openApp(targetBundleId); + + // On multi-monitor setups, macOS may place the opened window on a monitor + // the resolver won't pick (e.g. Claude + another allowed app are co-located + // elsewhere). Nudge the model toward switch_display BEFORE it wastes steps + // clicking on dock icons. Single-monitor → no hint.
listDisplays failure is + // non-fatal — the hint is advisory. + if (overrides.onDisplayPinned !== undefined) { + let displayCount = 1; + try { + displayCount = (await adapter.executor.listDisplays()).length; + } catch { + // hint skipped + } + if (displayCount >= 2) { + return okText( + `Opened "${app}". If it isn't visible in the next screenshot, it may ` + + `have opened on a different monitor — use switch_display to check.`, + ); + } + } + + return okText(`Opened "${app}".`); +} + +async function handleSwitchDisplay( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, +): Promise<CuCallToolResult> { + const display = requireString(args, "display"); + if (display instanceof Error) return errorResult(display.message, "bad_args"); + + if (!overrides.onDisplayPinned) { + return errorResult( + "Display switching is not available in this session.", + "feature_unavailable", + ); + } + + if (display.toLowerCase() === "auto") { + overrides.onDisplayPinned(undefined); + return okText( + "Returned to automatic monitor selection. Call screenshot to continue.", + ); + } + + // Resolve label → displayId fresh. Same source buildMonitorNote reads, + // so whatever name the model saw in a screenshot note resolves here. + let displays; + try { + displays = await adapter.executor.listDisplays(); + } catch (e) { + return errorResult( + `Failed to enumerate displays: ${String(e)}`, + "display_error", + ); + } + + if (displays.length < 2) { + return errorResult( + "Only one monitor is connected. There is nothing to switch to.", + "bad_args", + ); + } + + const labels = uniqueDisplayLabels(displays); + const wanted = display.toLowerCase(); + const target = displays.find( + (d) => labels.get(d.displayId)?.toLowerCase() === wanted, + ); + if (!target) { + const available = displays + .map((d) => `"${labels.get(d.displayId)}"`) + .join(", "); + return errorResult( + `No monitor named "${display}" is connected.
Available monitors: ${available}.`, + "bad_args", + ); + } + + overrides.onDisplayPinned(target.displayId); + return okText( + `Switched to monitor "${labels.get(target.displayId)}". Call screenshot to see it.`, + ); +} + +function handleListGrantedApplications( + overrides: ComputerUseOverrides, +): CuCallToolResult { + return okJson({ + allowedApps: overrides.allowedApps, + grantFlags: overrides.grantFlags, + }); +} + +async function handleReadClipboard( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + if (!overrides.grantFlags.clipboardRead) { + return errorResult( + "Clipboard read is not granted. Request `clipboardRead` via request_access.", + "grant_flag_required", + ); + } + + // read_clipboard doesn't route through runInputActionGates — sync here so + // reading after clicking into a click-tier app sees the cleared clipboard + // (same as what the app's own Paste would see). + if (subGates.clipboardGuard) { + const frontmost = await adapter.executor.getFrontmostApp(); + const tierByBundleId = new Map( + overrides.allowedApps.map((a) => [a.bundleId, a.tier] as const), + ); + const frontmostTier = frontmost + ? tierByBundleId.get(frontmost.bundleId) + : undefined; + await syncClipboardStash(adapter, overrides, frontmostTier === "click"); + } + + // clipboardGuard may have stashed+cleared — read the actual (possibly + // empty) clipboard. The agent sees what the app would see. + const text = await adapter.executor.readClipboard(); + return okJson({ text }); +} + +async function handleWriteClipboard( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + if (!overrides.grantFlags.clipboardWrite) { + return errorResult( + "Clipboard write is not granted.
Request `clipboardWrite` via request_access.", + "grant_flag_required", + ); + } + const text = requireString(args, "text"); + if (text instanceof Error) return errorResult(text.message, "bad_args"); + + if (subGates.clipboardGuard) { + const frontmost = await adapter.executor.getFrontmostApp(); + const tierByBundleId = new Map( + overrides.allowedApps.map((a) => [a.bundleId, a.tier] as const), + ); + const frontmostTier = frontmost + ? tierByBundleId.get(frontmost.bundleId) + : undefined; + + // Defense-in-depth for the clipboardGuard bypass: write_clipboard + + // left_click on a click-tier app's UI Paste button. The re-clear in + // syncClipboardStash already defeats it (the next action clobbers the + // write), but rejecting here gives the agent a clear signal instead of + // silently voiding its write. + if (frontmost && frontmostTier === "click") { + return errorResult( + `"${frontmost.displayName}" is a tier-"click" app and currently ` + + `frontmost. write_clipboard is blocked because the next action ` + + `would clear the clipboard anyway — a UI Paste button in this ` + + `app cannot be used to inject text. Bring a tier-"full" app ` + + `forward before writing to the clipboard.` + + TIER_ANTI_SUBVERSION, + "tier_insufficient", + ); + } + + // write_clipboard doesn't route through runInputActionGates — sync here + // so clicking away from a click-tier app then writing restores the user's + // stash before the agent's text lands. + await syncClipboardStash(adapter, overrides, frontmostTier === "click"); + } + + await adapter.executor.writeClipboard(text); + return okText("Clipboard written."); +} + +/** + * wait(duration=N). Sleeps N seconds, capped at 100. + * No frontmost gate — no input, nothing to protect. Kill-switch + TCC + * are checked in handleToolCall before dispatch reaches here. 
+ */ +async function handleWait( + args: Record<string, unknown>, +): Promise<CuCallToolResult> { + const duration = args.duration; + if (typeof duration !== "number" || !Number.isFinite(duration)) { + return errorResult("duration must be a number", "bad_args"); + } + if (duration < 0) { + return errorResult("duration must be non-negative", "bad_args"); + } + if (duration > 100) { + return errorResult( + "duration is too long. Duration is in seconds.", + "bad_args", + ); + } + await sleep(duration * 1000); + return okText(`Waited ${duration}s.`); +} + +/** + * Returns "X=...,Y=..." plain text. We return richer JSON with + * coordinateSpace annotation — the model handles both shapes. + * + * When lastScreenshot is present: inverse of scaleCoord — logical points → + * image-pixels via `imageX = logicalX × (screenshotWidth / displayWidth)`. + * Uses capture-time dims so the returned coords match what the model would + * read off that screenshot. + * + * No frontmost gate — read-only, no input. + */ +async function handleCursorPosition( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, +): Promise<CuCallToolResult> { + const logical = await adapter.executor.getCursorPosition(); + const shot = overrides.lastScreenshot; + if (shot) { + // Inverse of scaleCoord: subtract capture-time origin to go from + // virtual-screen to display-relative before the image-px transform. + const localX = logical.x - shot.originX; + const localY = logical.y - shot.originY; + // Cursor off the captured display (multi-monitor): local coords go + // negative or exceed display dims. Return logical_points + hint rather + // than garbage image-px.
+ if ( + localX < 0 || + localX > shot.displayWidth || + localY < 0 || + localY > shot.displayHeight + ) { + return okJson({ + x: logical.x, + y: logical.y, + coordinateSpace: "logical_points", + note: "cursor is on a different monitor than your last screenshot; take a fresh screenshot", + }); + } + const x = Math.round(localX * (shot.width / shot.displayWidth)); + const y = Math.round(localY * (shot.height / shot.displayHeight)); + return okJson({ x, y, coordinateSpace: "image_pixels" }); + } + return okJson({ + x: logical.x, + y: logical.y, + coordinateSpace: "logical_points", + note: "take a screenshot first for image-pixel coordinates", + }); +} + +/** + * Presses each key in the + * chord, sleeps duration seconds, releases in reverse. Same duration bounds + * as wait. Keyboard action → frontmost gate applies; same systemKeyCombos + * blocklist check as key. + */ +async function handleHoldKey( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const text = requireString(args, "text"); + if (text instanceof Error) return errorResult(text.message, "bad_args"); + + const duration = args.duration; + if (typeof duration !== "number" || !Number.isFinite(duration)) { + return errorResult("duration must be a number", "bad_args"); + } + if (duration < 0) { + return errorResult("duration must be non-negative", "bad_args"); + } + if (duration > 100) { + return errorResult( + "duration is too long. Duration is in seconds.", + "bad_args", + ); + } + + // Blocklist check BEFORE gates — same reasoning as handleKey. Holding + // cmd+q is just as dangerous as tapping it. + if ( + isSystemKeyCombo(text, adapter.executor.capabilities.platform) && + !overrides.grantFlags.systemKeyCombos + ) { + return errorResult( + `"${text}" is a system-level shortcut.
Request the \`systemKeyCombos\` grant via request_access to use it.`, + "grant_flag_required", + ); + } + + const gate = await runInputActionGates( + adapter, + overrides, + subGates, + "keyboard", + ); + if (gate) return gate; + + const keyNames = parseKeyChord(text); + await adapter.executor.holdKey(keyNames, duration * 1000); + return okText("Key held."); +} + +/** + * Raw press at current cursor, no coordinate. + * Move first with mouse_move. Errors if already held. + */ +async function handleLeftMouseDown( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + if (mouseButtonHeld) { + return errorResult( + "mouse button already held, call left_mouse_up first", + "state_conflict", + ); + } + + const gate = await runInputActionGates(adapter, overrides, subGates, "mouse"); + if (gate) return gate; + + // macOS routes mouseDown to the window under the cursor, not the frontmost + // app. Without this hit-test, mouse_move (positioning, passes at any tier) + // + left_mouse_down decomposes a click that lands on a tier-"read" window + // overlapping a tier-"full" frontmost app — bypassing runHitTestGate's + // whole purpose. All three are batchable, so the bypass is atomic. + const cursor = await adapter.executor.getCursorPosition(); + const hitGate = await runHitTestGate( + adapter, + overrides, + subGates, + cursor.x, + cursor.y, + "mouse", + ); + if (hitGate) return hitGate; + + await adapter.executor.mouseDown(); + mouseButtonHeld = true; + mouseMoved = false; + return okText("Mouse button pressed."); +} + +/** + * Raw release at current cursor. Does NOT error + * if not held (idempotent release). + */ +async function handleLeftMouseUp( + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + // Any gate rejection here must release the button FIRST — otherwise the + // OS button stays pressed and mouseButtonHeld stays true.
Recovery + // attempts (mouse_move back to a safe app) would generate leftMouseDragged + // events into whatever window is under the cursor, including the very + // read-tier window the gate was protecting. A single mouseUp on a + // restricted window is one event; a stuck button is cascading damage. + // + // This includes the frontmost gate: focus can change between mouseDown and + // mouseUp (something else grabbed focus), in which case runInputActionGates + // rejects here even though it passed at mouseDown. + const releaseFirst = async ( + err: CuCallToolResult, + ): Promise<CuCallToolResult> => { + await adapter.executor.mouseUp(); + mouseButtonHeld = false; + mouseMoved = false; + return err; + }; + + const gate = await runInputActionGates(adapter, overrides, subGates, "mouse"); + if (gate) return releaseFirst(gate); + + // When the cursor moved since mouseDown, this is a drop (text-injection + // vector) — hit-test at "mouse_full" same as left_click_drag's `to`. When + // NO move happened, this is a click-release — same semantics as the atomic + // left_click, hit-test at "mouse". Without this distinction, a decomposed + // click on a click-tier app fails here while the atomic left_click works, + // and releaseFirst fires mouseUp anyway so the OS sees a complete click + // while the model gets a misleading error. + const cursor = await adapter.executor.getCursorPosition(); + const hitGate = await runHitTestGate( + adapter, + overrides, + subGates, + cursor.x, + cursor.y, + mouseMoved ? "mouse_full" : "mouse", + ); + if (hitGate) return releaseFirst(hitGate); + + await adapter.executor.mouseUp(); + mouseButtonHeld = false; + mouseMoved = false; + return okText("Mouse button released."); +} + +// --------------------------------------------------------------------------- +// Batch dispatch +// --------------------------------------------------------------------------- + +/** + * Actions allowed inside a computer_batch call.
Excludes request_access, + * open_application, clipboard, list_granted (no latency benefit, complicates + * security model). + */ +const BATCHABLE_ACTIONS: ReadonlySet<string> = new Set([ + "key", + "type", + "mouse_move", + "left_click", + "left_click_drag", + "right_click", + "middle_click", + "double_click", + "triple_click", + "scroll", + "hold_key", + "screenshot", + "cursor_position", + "left_mouse_down", + "left_mouse_up", + "wait", +]); + +interface BatchActionResult { + action: string; + ok: boolean; + output: string; +} + +/** + * Executes `actions: [{action, …}, …]` + * sequentially in ONE model→API round trip — the dominant latency cost + * (seconds, vs. ~50ms local overhead per action). + * + * Gate semantics (the security model): + * - Kill-switch + TCC: checked ONCE by handleToolCall before reaching here. + * - prepareForAction: run ONCE at the top. The user approved "do this + * sequence"; hiding apps per-action is wasted work and fast-pathed anyway. + * - Frontmost gate: checked PER ACTION. State can change mid-batch — a + * click might open a non-allowed app. This is the safety net: if action + * 3 of 5 opened Safari (not allowed), action 4's frontmost check fires + * and stops the batch there. + * - PixelCompare: SKIPPED inside batch. The model committed to the full + * sequence without intermediate screenshots; validating mid-batch clicks + * against a pre-batch screenshot would false-positive constantly. + * + * Both skips are implemented by passing `{...subGates, hideBeforeAction: + * false, pixelValidation: false}` to each inner dispatch — the handlers' + * existing gate logic does the right thing, no new code paths. + * + * Stop-on-first-error: accumulate results, on + * first `isError` stop executing, return everything so far + the error. The + * model sees exactly where the batch broke and what succeeded before it. + * + * Mid-batch screenshots are allowed (for inspection) but NEVER piggyback — + * their `.screenshot` field is dropped.
Same invariant as zoom: click coords + * always refer to the PRE-BATCH `lastScreenshot`. If the model wants to click + * based on a new screenshot, it ends the batch and screenshots separately. + */ +async function handleComputerBatch( + adapter: ComputerUseHostAdapter, + args: Record<string, unknown>, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + const actions = args.actions; + if (!Array.isArray(actions) || actions.length === 0) { + return errorResult("actions must be a non-empty array", "bad_args"); + } + + for (const [i, act] of actions.entries()) { + if (typeof act !== "object" || act === null) { + return errorResult(`actions[${i}] must be an object`, "bad_args"); + } + const action = (act as Record<string, unknown>).action; + if (typeof action !== "string") { + return errorResult(`actions[${i}].action must be a string`, "bad_args"); + } + if (!BATCHABLE_ACTIONS.has(action)) { + return errorResult( + `actions[${i}].action="${action}" is not allowed in a batch. ` + + `Allowed: ${[...BATCHABLE_ACTIONS].join(", ")}.`, + "bad_args", + ); + } + } + + // prepareForAction ONCE. After this, inner dispatches skip it via + // hideBeforeAction:false. + if (subGates.hideBeforeAction) { + const hidden = await adapter.executor.prepareForAction( + overrides.allowedApps.map((a) => a.bundleId), + overrides.selectedDisplayId, + ); + if (hidden.length > 0) { + overrides.onAppsHidden?.(hidden); + } + } + + // Inner actions: skip prepare (already ran), skip pixelCompare (stale by + // design). Frontmost still checked — runInputActionGates does it + // unconditionally. + const batchSubGates: CuSubGates = { + ...subGates, + hideBeforeAction: false, + pixelValidation: false, + // Batch already took its screenshot (appended at end); a mid-batch + // resolver switch would make that screenshot inconsistent with + // earlier clicks' lastScreenshot-based scaleCoord targeting. 
+ autoTargetDisplay: false, + }; + + const results: BatchActionResult[] = []; + for (const [i, act] of actions.entries()) { + // Overlay Stop → host's stopSession → lifecycleState leaves "running" + // synchronously before query.interrupt(). The SDK abort tears down the + // host's await but not this loop — without this check the remaining + // actions fire into a dead session. + if (overrides.isAborted?.()) { + await releaseHeldMouse(adapter); + return errorResult( + `Batch aborted after ${results.length} of ${actions.length} actions (user interrupt).`, + ); + } + + // Small inter-step settle. Synthetic CGEvents post instantly; some apps + // need a tick to process step N's input before step N+1 lands (e.g. a + // click opening a menu before the next click targets a menu item). + if (i > 0) await sleep(10); + + const actionArgs = act as Record<string, unknown>; + const action = actionArgs.action as string; + + // Drop mid-batch screenshot piggyback (strip .screenshot). Click coords + // stay anchored to the pre-batch lastScreenshot. + const { screenshot: _dropped, ...inner } = await dispatchAction( + action, + actionArgs, + adapter, + overrides, + batchSubGates, + ); + + const text = firstTextContent(inner); + const result = { action, ok: !inner.isError, output: text }; + results.push(result); + + if (inner.isError) { + // Stop-on-first-error. Return everything so far + the error. + // Forward the inner action's telemetry (error_kind) so cu_tool_call + // reflects the actual failure — without this, batch-internal errors + // emit error_kind: undefined despite the inner handler tagging it. + // Release held mouse: the error may be a mid-grapheme abort in + // handleType, or a frontmost gate, landing between mouse_down and + // mouse_up. 
+ await releaseHeldMouse(adapter); + return okJson( + { + completed: results.slice(0, -1), + failed: result, + remaining: actions.length - results.length, + }, + inner.telemetry, + ); + } + } + + return okJson({ completed: results }); +} + +function firstTextContent(r: CuCallToolResult): string { + const first = r.content[0]; + return first && first.type === "text" ? first.text : ""; +} + +/** + * Action dispatch shared by handleToolCall and handleComputerBatch. Called + * AFTER kill-switch + TCC gates have passed. Never sees request_access — it's + * special-cased in handleToolCall for the tccState thread-through. + */ +async function dispatchAction( + name: string, + a: Record<string, unknown>, + adapter: ComputerUseHostAdapter, + overrides: ComputerUseOverrides, + subGates: CuSubGates, +): Promise<CuCallToolResult> { + switch (name) { + case "screenshot": + return handleScreenshot(adapter, overrides, subGates); + + case "zoom": + return handleZoom(adapter, a, overrides); + + case "left_click": + return handleClickVariant(adapter, a, overrides, subGates, "left", 1); + case "double_click": + return handleClickVariant(adapter, a, overrides, subGates, "left", 2); + case "triple_click": + return handleClickVariant(adapter, a, overrides, subGates, "left", 3); + case "right_click": + return handleClickVariant(adapter, a, overrides, subGates, "right", 1); + case "middle_click": + return handleClickVariant(adapter, a, overrides, subGates, "middle", 1); + + case "type": + return handleType(adapter, a, overrides, subGates); + + case "key": + return handleKey(adapter, a, overrides, subGates); + + case "scroll": + return handleScroll(adapter, a, overrides, subGates); + + case "left_click_drag": + return handleDrag(adapter, a, overrides, subGates); + + case "mouse_move": + return handleMoveMouse(adapter, a, overrides, subGates); + + case "wait": + return handleWait(a); + + case "cursor_position": + return handleCursorPosition(adapter, overrides); + + case "hold_key": + return handleHoldKey(adapter, a, 
overrides, subGates); + + case "left_mouse_down": + return handleLeftMouseDown(adapter, overrides, subGates); + + case "left_mouse_up": + return handleLeftMouseUp(adapter, overrides, subGates); + + case "open_application": + return handleOpenApplication(adapter, a, overrides); + + case "switch_display": + return handleSwitchDisplay(adapter, a, overrides); + + case "list_granted_applications": + return handleListGrantedApplications(overrides); + + case "read_clipboard": + return handleReadClipboard(adapter, overrides, subGates); + + case "write_clipboard": + return handleWriteClipboard(adapter, a, overrides, subGates); + + case "computer_batch": + return handleComputerBatch(adapter, a, overrides, subGates); + + default: + return errorResult(`Unknown tool "${name}".`, "bad_args"); + } +} + +// --------------------------------------------------------------------------- +// Main dispatch +// --------------------------------------------------------------------------- + +export async function handleToolCall( + adapter: ComputerUseHostAdapter, + name: string, + args: unknown, + rawOverrides: ComputerUseOverrides, +): Promise<CuCallToolResult> { + const { logger, serverName } = adapter; + + // Normalize the allowlist before any gate runs: + // + // (a) Strip user-denied. A grant from a previous session (before the user + // added the app to Settings → Desktop app → Computer Use → Denied apps) + // must not survive. Without + // this, a stale grant bypasses the auto-deny. Stripped silently — the + // agent already saw the userDenied guidance at request_access time, and + // a live frontmost-gate rejection cites "not in allowed applications". + // + // (b) Strip policy-denied. Same story as (a) for a grant that predates a + // blocklist addition. buildAccessRequest denies these up front for new + // requests; this catches stale persisted grants. + // + // (c) Backfill tier. 
A grant persisted before the tier field existed has + // `tier: undefined`, which `tierSatisfies` treats as `"full"` — wrong + // for a legacy Chrome grant. Assign the hardcoded tier based on + // bundle-ID category. Modern grants already have a tier. + // + // `.some()` guard keeps the hot path (empty deny list, no legacy grants) + // zero-alloc. + const userDeniedSet = new Set(rawOverrides.userDeniedBundleIds); + const overrides: ComputerUseOverrides = rawOverrides.allowedApps.some( + (a) => + a.tier === undefined || + userDeniedSet.has(a.bundleId) || + isPolicyDenied(a.bundleId, a.displayName), + ) + ? { + ...rawOverrides, + allowedApps: rawOverrides.allowedApps + .filter((a) => !userDeniedSet.has(a.bundleId)) + .filter((a) => !isPolicyDenied(a.bundleId, a.displayName)) + .map((a) => + a.tier !== undefined + ? a + : { ...a, tier: getDefaultTierForApp(a.bundleId, a.displayName) }, + ), + } + : rawOverrides; + + // ─── Gate 1: kill switch ───────────────────────────────────────────── + if (adapter.isDisabled()) { + return errorResult( + "Computer control is disabled in Settings. Enable it and try again.", + "other", + ); + } + + // ─── Gate 2: TCC ───────────────────────────────────────────────────── + // Accessibility + Screen Recording on macOS. Pure check — no dialog, + // no relaunch. `request_access` is exempted: it threads the ungranted + // state through to the renderer, which shows a TCC toggle panel instead + // of the app list. Every other tool short-circuits here. + const osPerms = await adapter.ensureOsPermissions(); + let tccState: + | { accessibility: boolean; screenRecording: boolean } + | undefined; + if (!osPerms.granted) { + // Both request_* tools thread tccState through to the renderer's + // TCC toggle panel. Every other tool short-circuits. + if (name !== "request_access" && name !== "request_teach_access") { + return errorResult( + "Accessibility and Screen Recording permissions are required. 
" + + "Call request_access to show the permission panel.", + "tcc_not_granted", + ); + } + tccState = { + accessibility: osPerms.accessibility, + screenRecording: osPerms.screenRecording, + }; + } + + // ─── Gate 3: global CU lock ────────────────────────────────────────── + // At most one session uses CU at a time. Every tool including + // request_access hits the CHECK — even showing the approval dialog while + // another session holds the lock would be confusing ("why approve access + // that can't be used?"). + // + // But ACQUIRE is split: request_access and list_granted_applications + // check-without-acquire (the overlay + notifications are driven by + // cuLockChanged, and showing "Claude is using your computer" while the + // agent is only ASKING for access is premature). First action tool + // acquires and the overlay appears. If the user denies and no action + // follows, the overlay never shows. + // + // request_teach_access is NOT in this set — approving teach mode HIDES + // the main window (via onTeachModeActivated), and the lock must be held + // before that happens. Otherwise a concurrent session's request_access + // would render its dialog in an invisible main window during the gap + // between hide and the first teach_step (seconds of model inference). + // The old acquire-always-at-Gate-3 behavior was correct for teach; only + // the non-teach permission tools benefit from deferral. + // + // Host releases on idle/stop/archive; this package never releases. Both + // Cowork (LAM) and CCD (LSM) wire checkCuLock via the shared cuLock + // singleton. When undefined (tests/future hosts), no gate — absence of + // the mechanism ≠ locked out. + const deferAcquire = defersLockAcquire(name); + const lock = overrides.checkCuLock?.(); + if (lock) { + if (lock.holder !== undefined && !lock.isSelf) { + return errorResult( + "Another Claude session is currently using the computer. 
Wait for " + + "the user to acknowledge it is finished (stop button in the Claude " + + "window), or find a non-computer-use approach if one is readily " + + "apparent.", + "cu_lock_held", + ); + } + if (lock.holder === undefined && !deferAcquire) { + // Acquire. Emits cuLockChanged → overlay shows. Idempotent — if + // someone else acquired between check and here (won't happen on a + // single-threaded event loop, but defensive), this is a no-op. + overrides.acquireCuLock?.(); + // Fresh lock holder → any prior session's mouseButtonHeld is stale + // (e.g. overlay stop mid-drag). Clear it so this session doesn't get + // a spurious "already held" error. resetMouseButtonHeld is file-local; + // this is the one non-test callsite. + resetMouseButtonHeld(); + } + // lock.isSelf → already held by us, proceed. + // lock.holder === undefined && deferAcquire → + // checked but not acquired — proceed, first action will acquire. + } + + // Sub-gates read FRESH every call so a GrowthBook flip takes effect + // mid-session (plan §3). + const subGates = adapter.getSubGates(); + + // Clipboard guard runs per-action inside runInputActionGates + inline in + // handleReadClipboard/handleWriteClipboard. NOT here — per-tool-call sync + // would run once for computer_batch and miss sub-actions 2..N, and would + // fire during deferAcquire tools / `wait` / teach_step's blocking-dialog + // phase where no input is happening. + + const a = asRecord(args); + + logger.silly( + `[${serverName}] tool=${name} args=${JSON.stringify(a).slice(0, 200)}`, + ); + + // ─── Fail-closed dispatch ──────────────────────────────────────────── + // ANY exception below → tool error, executor never left in a half-called + // state. Explicit inversion of the prior `catch → return true` fail-open. + try { + // request_access / request_teach_access: need tccState thread-through; + // dispatchAction never sees them (not batchable). 
+ // teach_step: blocking UI tool, also not batchable; needs subGates for + // its action-execution phase. + if (name === "request_access") { + return await handleRequestAccess(adapter, a, overrides, tccState); + } + if (name === "request_teach_access") { + return await handleRequestTeachAccess(adapter, a, overrides, tccState); + } + if (name === "teach_step") { + return await handleTeachStep(adapter, a, overrides, subGates); + } + if (name === "teach_batch") { + return await handleTeachBatch(adapter, a, overrides, subGates); + } + return await dispatchAction(name, a, adapter, overrides, subGates); + } catch (err) { + // Fail-closed. If the gate machinery itself throws (e.g. + // getFrontmostApp() rejects), the executor has NOT been called yet for + // the gated tools — the gates run before the executor in every handler. + // For ungated tools, the executor may have been mid-call; that's fine — + // the result is still a tool error, never an implicit success. + const msg = err instanceof Error ? err.message : String(err); + logger.error(`[${serverName}] tool=${name} threw: ${msg}`, err); + return errorResult(`Tool "${name}" failed: ${msg}`, "executor_threw"); + } +} + +export const _test = { + scaleCoord, + coordToPercentageForPixelCompare, + segmentGraphemes, + decodedByteLength, + resolveRequestedApps, + buildAccessRequest, + buildTierGuidanceMessage, + buildUserDeniedGuidance, + tierSatisfies, + looksLikeBundleId, + extractCoordinate, + parseKeyChord, + buildMonitorNote, + handleSwitchDisplay, + uniqueDisplayLabels, +}; diff --git a/packages/@ant/computer-use-mcp/src/tools.ts b/packages/@ant/computer-use-mcp/src/tools.ts new file mode 100644 index 000000000..c744a2329 --- /dev/null +++ b/packages/@ant/computer-use-mcp/src/tools.ts @@ -0,0 +1,706 @@ +/** + * MCP tool schemas for the computer-use server. Mirrors + * claude-for-chrome-mcp/src/browserTools.ts in shape (plain `Tool`-shaped + * object literals, no zod). 
+ * + * Coordinate descriptions are baked in at tool-list build time from the + * `chicago_coordinate_mode` gate. The model sees exactly ONE coordinate + * convention in the param descriptions and never learns the other exists. + * The host (`serverDef.ts`) reads the same frozen gate value for + * `scaleCoord` — both must agree or clicks land in the wrong space. + */ + +import type { Tool } from "@modelcontextprotocol/sdk/types.js"; + +import type { CoordinateMode } from "./types.js"; + +// See packages/desktop/computer-use-mcp/COORDINATES.md before touching any +// model-facing coordinate text. Chrome's browserTools.ts:143 is the reference +// phrasing — "pixels from the left edge", no geometry, no number to do math with. +const COORD_DESC: Record<CoordinateMode, { x: string; y: string }> = { + pixels: { + x: "Horizontal pixel position read directly from the most recent screenshot image, measured from the left edge. The server handles all scaling.", + y: "Vertical pixel position read directly from the most recent screenshot image, measured from the top edge. The server handles all scaling.", + }, + normalized_0_100: { + x: "Horizontal position as a percentage of screen width, 0.0–100.0 (0 = left edge, 100 = right edge).", + y: "Vertical position as a percentage of screen height, 0.0–100.0 (0 = top edge, 100 = bottom edge).", + }, +}; + +const FRONTMOST_GATE_DESC = + "The frontmost application must be in the session allowlist at the time of this call, or this tool returns an error and does nothing."; + +/** + * Item schema for the `actions` array in `computer_batch`, `teach_step`, and + * `teach_batch`. All three dispatch through the same `dispatchAction` path + * with the same validation — keep this enum in sync with `BATCHABLE_ACTIONS` + * in toolCalls.ts. 
+ */ +const BATCH_ACTION_ITEM_SCHEMA = { + type: "object", + properties: { + action: { + type: "string", + enum: [ + "key", + "type", + "mouse_move", + "left_click", + "left_click_drag", + "right_click", + "middle_click", + "double_click", + "triple_click", + "scroll", + "hold_key", + "screenshot", + "cursor_position", + "left_mouse_down", + "left_mouse_up", + "wait", + ], + description: "The action to perform.", + }, + coordinate: { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: + "(x, y) for click/mouse_move/scroll/left_click_drag end point.", + }, + start_coordinate: { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: + "(x, y) drag start — left_click_drag only. Omit to drag from current cursor.", + }, + text: { + type: "string", + description: + "For type: the text. For key/hold_key: the chord string. For click/scroll: modifier keys to hold.", + }, + scroll_direction: { + type: "string", + enum: ["up", "down", "left", "right"], + }, + scroll_amount: { type: "integer", minimum: 0, maximum: 100 }, + duration: { + type: "number", + description: "Seconds (0–100). For hold_key/wait.", + }, + repeat: { + type: "integer", + minimum: 1, + maximum: 100, + description: "For key: repeat count.", + }, + }, + required: ["action"], +}; + +/** + * Build the tool list. Parameterized by capabilities and coordinate mode so + * descriptions are honest and unambiguous (plan §1 — "Unfiltered + honest"). + * + * `coordinateMode` MUST match what the host passes to `scaleCoord` at + * tool-call time. Both should read the same frozen-at-load gate constant. + * + * `installedAppNames` — optional pre-sanitized list of app display names to + * enumerate in the `request_access` description. The caller is responsible + * for sanitization (length cap, character allowlist, sort, count cap) — + * this function just splices the list into the description verbatim. 
Omit + * to fall back to the generic "display names or bundle IDs" wording. + */ +export function buildComputerUseTools( + caps: { + screenshotFiltering: "native" | "none"; + platform: "darwin" | "win32"; + /** Include request_teach_access + teach_step. Read once at server construction. */ + teachMode?: boolean; + }, + coordinateMode: CoordinateMode, + installedAppNames?: string[], +): Tool[] { + const coord = COORD_DESC[coordinateMode]; + + // Shared hint suffix for BOTH request_access and request_teach_access — + // they use the same resolveRequestedApps path, so the model should get + // the same enumeration for both. + const installedAppsHint = + installedAppNames && installedAppNames.length > 0 + ? ` Available applications on this machine: ${installedAppNames.join(", ")}.` + : ""; + + // `[x, y]` tuple — param shape for all + // click/move/scroll tools. + const coordinateTuple = { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: `(x, y): ${coord.x}`, + }; + // Modifier hold during click. Shared across all 5 click variants. + const clickModifierText = { + type: "string", + description: + 'Modifier keys to hold during the click (e.g. "shift", "ctrl+shift"). Supports the same syntax as the key tool.', + }; + + const screenshotDesc = + caps.screenshotFiltering === "native" + ? "Take a screenshot of the primary display. Applications not in the session allowlist are excluded at the compositor level — only granted apps and the desktop are visible." + : "Take a screenshot of the primary display. On this platform, screenshots are NOT filtered — all open windows are visible. Input actions targeting apps not in the session allowlist are rejected."; + + return [ + { + name: "request_access", + description: + "Request user permission to control a set of applications for this session. Must be called before any other tool in this server. 
" + + "The user sees a single dialog listing all requested apps and either allows the whole set or denies it. " + + "Call this again mid-session to add more apps; previously granted apps remain granted. " + + "Returns the granted apps, denied apps, and screenshot filtering capability.", + inputSchema: { + type: "object" as const, + properties: { + apps: { + type: "array", + items: { type: "string" }, + description: + "Application display names (e.g. \"Slack\", \"Calendar\") or bundle identifiers (e.g. \"com.tinyspeck.slackmacgap\"). Display names are resolved case-insensitively against installed apps." + + installedAppsHint, + }, + reason: { + type: "string", + description: + "One-sentence explanation shown to the user in the approval dialog. Explain the task, not the mechanism.", + }, + clipboardRead: { + type: "boolean", + description: + "Also request permission to read the user's clipboard (separate checkbox in the dialog).", + }, + clipboardWrite: { + type: "boolean", + description: + "Also request permission to write the user's clipboard. When granted, multi-line `type` calls use the clipboard fast path.", + }, + systemKeyCombos: { + type: "boolean", + description: + "Also request permission to send system-level key combos (quit app, switch app, lock screen). Without this, those specific combos are blocked.", + }, + }, + required: ["apps", "reason"], + }, + }, + + { + name: "screenshot", + description: + screenshotDesc + + " Returns an error if the allowlist is empty. The returned image is what subsequent click coordinates are relative to.", + inputSchema: { + type: "object" as const, + properties: { + save_to_disk: { + type: "boolean", + description: + "Save the image to disk so it can be attached to a message for the user. Returns the saved path in the tool result. 
Only set this when you intend to share the image — screenshots you're just looking at don't need saving.", + }, + }, + required: [], + }, + }, + + { + name: "zoom", + description: + "Take a higher-resolution screenshot of a specific region of the last full-screen screenshot. Use this liberally to inspect small text, button labels, or fine UI details that are hard to read in the downsampled full-screen image. " + + "IMPORTANT: Coordinates in subsequent click calls always refer to the full-screen screenshot, never the zoomed image. This tool is read-only for inspecting detail.", + inputSchema: { + type: "object" as const, + properties: { + region: { + type: "array", + items: { type: "integer" }, + minItems: 4, + maxItems: 4, + description: + "(x0, y0, x1, y1): Rectangle to zoom into, in the coordinate space of the most recent full-screen screenshot. x0,y0 = top-left, x1,y1 = bottom-right.", + }, + save_to_disk: { + type: "boolean", + description: + "Save the image to disk so it can be attached to a message for the user. Returns the saved path in the tool result. Only set this when you intend to share the image.", + }, + }, + required: ["region"], + }, + }, + + { + name: "left_click", + description: `Left-click at the given coordinates. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + text: clickModifierText, + }, + required: ["coordinate"], + }, + }, + + { + name: "double_click", + description: `Double-click at the given coordinates. Selects a word in most text editors. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + text: clickModifierText, + }, + required: ["coordinate"], + }, + }, + + { + name: "triple_click", + description: `Triple-click at the given coordinates. Selects a line in most text editors. 
${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + text: clickModifierText, + }, + required: ["coordinate"], + }, + }, + + { + name: "right_click", + description: `Right-click at the given coordinates. Opens a context menu in most applications. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + text: clickModifierText, + }, + required: ["coordinate"], + }, + }, + + { + name: "middle_click", + description: `Middle-click (scroll-wheel click) at the given coordinates. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + text: clickModifierText, + }, + required: ["coordinate"], + }, + }, + + { + name: "type", + description: `Type text into whatever currently has keyboard focus. ${FRONTMOST_GATE_DESC} Newlines are supported. For keyboard shortcuts use \`key\` instead.`, + inputSchema: { + type: "object" as const, + properties: { + text: { type: "string", description: "Text to type." }, + }, + required: ["text"], + }, + }, + + { + name: "key", + description: + `Press a key or key combination (e.g. "return", "escape", "cmd+a", "ctrl+shift+tab"). ${FRONTMOST_GATE_DESC} ` + + "System-level combos (quit app, switch app, lock screen) require the `systemKeyCombos` grant — without it they return an error. All other combos work.", + inputSchema: { + type: "object" as const, + properties: { + text: { + type: "string", + description: 'Modifiers joined with "+", e.g. "cmd+shift+a".', + }, + repeat: { + type: "integer", + minimum: 1, + maximum: 100, + description: "Number of times to repeat the key press. Default is 1.", + }, + }, + required: ["text"], + }, + }, + + { + name: "scroll", + description: `Scroll at the given coordinates. 
${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + scroll_direction: { + type: "string", + enum: ["up", "down", "left", "right"], + description: "Direction to scroll.", + }, + scroll_amount: { + type: "integer", + minimum: 0, + maximum: 100, + description: "Number of scroll ticks.", + }, + }, + required: ["coordinate", "scroll_direction", "scroll_amount"], + }, + }, + + { + name: "left_click_drag", + description: `Press, move to target, and release. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: { + ...coordinateTuple, + description: `(x, y) end point: ${coord.x}`, + }, + start_coordinate: { + ...coordinateTuple, + description: `(x, y) start point. If omitted, drags from the current cursor position. ${coord.x}`, + }, + }, + required: ["coordinate"], + }, + }, + + { + name: "mouse_move", + description: `Move the mouse cursor without clicking. Useful for triggering hover states. ${FRONTMOST_GATE_DESC}`, + inputSchema: { + type: "object" as const, + properties: { + coordinate: coordinateTuple, + }, + required: ["coordinate"], + }, + }, + + { + name: "open_application", + description: + "Bring an application to the front, launching it if necessary. The target application must already be in the session allowlist — call request_access first.", + inputSchema: { + type: "object" as const, + properties: { + app: { + type: "string", + description: + "Display name (e.g. \"Slack\") or bundle identifier (e.g. \"com.tinyspeck.slackmacgap\").", + }, + }, + required: ["app"], + }, + }, + + { + name: "switch_display", + description: + "Switch which monitor subsequent screenshots capture. Use this when the " + + "application you need is on a different monitor than the one shown. " + + "The screenshot tool tells you which monitor it captured and lists " + + "other attached monitors by name — pass one of those names here. 
" + + "After switching, call screenshot to see the new monitor. " + + 'Pass "auto" to return to automatic monitor selection.', + inputSchema: { + type: "object" as const, + properties: { + display: { + type: "string", + description: + 'Monitor name from the screenshot note (e.g. "Built-in Retina Display", ' + + '"LG UltraFine"), or "auto" to re-enable automatic selection.', + }, + }, + required: ["display"], + }, + }, + + { + name: "list_granted_applications", + description: + "List the applications currently in the session allowlist, plus the active grant flags and coordinate mode. No side effects.", + inputSchema: { + type: "object" as const, + properties: {}, + required: [], + }, + }, + + { + name: "read_clipboard", + description: + "Read the current clipboard contents as text. Requires the `clipboardRead` grant.", + inputSchema: { + type: "object" as const, + properties: {}, + required: [], + }, + }, + + { + name: "write_clipboard", + description: + "Write text to the clipboard. Requires the `clipboardWrite` grant.", + inputSchema: { + type: "object" as const, + properties: { + text: { type: "string" }, + }, + required: ["text"], + }, + }, + + { + name: "wait", + description: "Wait for a specified duration.", + inputSchema: { + type: "object" as const, + properties: { + duration: { + type: "number", + description: "Duration in seconds (0–100).", + }, + }, + required: ["duration"], + }, + }, + + { + name: "cursor_position", + description: + "Get the current mouse cursor position. Returns image-pixel coordinates relative to the most recent screenshot, or logical points if no screenshot has been taken.", + inputSchema: { + type: "object" as const, + properties: {}, + required: [], + }, + }, + + { + name: "hold_key", + description: + `Press and hold a key or key combination for the specified duration, then release. 
${FRONTMOST_GATE_DESC} ` + + "System-level combos require the `systemKeyCombos` grant.", + inputSchema: { + type: "object" as const, + properties: { + text: { + type: "string", + description: 'Key or chord to hold, e.g. "space", "shift+down".', + }, + duration: { + type: "number", + description: "Duration in seconds (0–100).", + }, + }, + required: ["text", "duration"], + }, + }, + + { + name: "left_mouse_down", + description: + `Press the left mouse button at the current cursor position and leave it held. ${FRONTMOST_GATE_DESC} ` + + "Use mouse_move first to position the cursor. Call left_mouse_up to release. Errors if the button is already held.", + inputSchema: { + type: "object" as const, + properties: {}, + required: [], + }, + }, + + { + name: "left_mouse_up", + description: + `Release the left mouse button at the current cursor position. ${FRONTMOST_GATE_DESC} ` + + "Pairs with left_mouse_down. Safe to call even if the button is not currently held.", + inputSchema: { + type: "object" as const, + properties: {}, + required: [], + }, + }, + + { + name: "computer_batch", + description: + "Execute a sequence of actions in ONE tool call. Each individual tool call requires a model→API round trip (seconds); " + + "batching a predictable sequence eliminates all but one. Use this whenever you can predict the outcome of several actions ahead — " + + "e.g. click a field, type into it, press Return. Actions execute sequentially and stop on the first error. " + + `${FRONTMOST_GATE_DESC} The frontmost check runs before EACH action inside the batch — if an action opens a non-allowed app, the next action's gate fires and the batch stops there. ` + + "Mid-batch screenshot actions are allowed for inspection but coordinates in subsequent clicks always refer to the PRE-BATCH full-screen screenshot.", + inputSchema: { + type: "object" as const, + properties: { + actions: { + type: "array", + minItems: 1, + items: BATCH_ACTION_ITEM_SCHEMA, + description: + 'List of actions. 
Example: [{"action":"left_click","coordinate":[100,200]},{"action":"type","text":"hello"},{"action":"key","text":"Return"}]', + }, + }, + required: ["actions"], + }, + }, + + ...(caps.teachMode ? buildTeachTools(coord, installedAppsHint) : []), + ]; +} + +/** + * Teach-mode tools. Split out so the spread above stays a single expression; + * takes `coord` so `teach_step.anchor`'s description uses the same + * frozen coordinate-mode phrasing as click coords, and `installedAppsHint` + * so `request_teach_access.apps` gets the same enumeration as + * `request_access.apps` (same resolution path → same hint). + */ +function buildTeachTools( + coord: { x: string; y: string }, + installedAppsHint: string, +): Tool[] { + // Shared between teach_step (top-level) and teach_batch (inside steps[] + // items). Depends on coord, so it lives inside this factory. + const teachStepProperties = { + explanation: { + type: "string", + description: + "Tooltip body text. Explain what the user is looking at and why it matters. " + + "This is the ONLY place the user sees your words — be complete but concise.", + }, + next_preview: { + type: "string", + description: + "One line describing exactly what will happen when the user clicks Next. " + + 'Example: "Next: I\'ll click Create Bucket and type the name." ' + + "Shown below the explanation in a smaller font.", + }, + anchor: { + type: "array", + items: { type: "number" }, + minItems: 2, + maxItems: 2, + description: + `(x, y) — where the tooltip arrow points. ${coord.x} ` + + "Omit to center the tooltip with no arrow (for general-context steps).", + }, + actions: { + type: "array", + // Empty allowed — "read this, click Next" steps. + items: BATCH_ACTION_ITEM_SCHEMA, + description: + "Actions to execute when the user clicks Next. Same item schema as computer_batch.actions. " + + "Empty array is valid for purely explanatory steps. 
Actions run sequentially and stop on first error.", + }, + } as const; + + return [ + { + name: "request_teach_access", + description: + "Request permission to guide the user through a task step-by-step with on-screen tooltips. " + + "Use this INSTEAD OF request_access when the user wants to LEARN how to do something " + + '(phrases like "teach me", "walk me through", "show me how", "help me learn"). ' + + "On approval the main Claude window hides and a fullscreen tooltip overlay appears. " + + "You then call teach_step repeatedly; each call shows one tooltip and waits for the user to click Next. " + + "Same app-allowlist semantics as request_access, but no clipboard/system-key flags. " + + "Teach mode ends automatically when your turn ends.", + inputSchema: { + type: "object" as const, + properties: { + apps: { + type: "array", + items: { type: "string" }, + description: + 'Application display names (e.g. "Slack", "Calendar") or bundle identifiers. Resolved case-insensitively against installed apps.' + + installedAppsHint, + }, + reason: { + type: "string", + description: + 'What you will be teaching. Shown in the approval dialog as "Claude wants to guide you through {reason}". Keep it short and task-focused.', + }, + }, + required: ["apps", "reason"], + }, + }, + + { + name: "teach_step", + description: + "Show one guided-tour tooltip and wait for the user to click Next. On Next, execute the actions, " + + "take a fresh screenshot, and return both — you do NOT need a separate screenshot call between steps. " + + "The returned image shows the state after your actions ran; anchor the next teach_step against it. " + + "IMPORTANT — the user only sees the tooltip during teach mode. Put ALL narration in `explanation`. " + + "Text you emit outside teach_step calls is NOT visible until teach mode ends. 
" + + "Pack as many actions as possible into each step's `actions` array — the user waits through " + + "the whole round trip between clicks, so one step that fills a form beats five steps that fill one field each. " + + "Returns {exited:true} if the user clicks Exit — do not call teach_step again after that. " + + "Take an initial screenshot before your FIRST teach_step to anchor it.", + inputSchema: { + type: "object" as const, + properties: teachStepProperties, + required: ["explanation", "next_preview", "actions"], + }, + }, + + { + name: "teach_batch", + description: + "Queue multiple teach steps in one tool call. Parallels computer_batch: " + + "N steps → one model↔API round trip instead of N. Each step still shows a tooltip " + + "and waits for the user's Next click, but YOU aren't waiting for a round trip between steps. " + + "You can call teach_batch multiple times in one tour — treat each batch as one predictable " + + "SEGMENT (typically: all the steps on one page). The returned screenshot shows the state " + + "after the batch's final actions; anchor the NEXT teach_batch against it. " + + "WITHIN a batch, all anchors and click coordinates refer to the PRE-BATCH screenshot " + + "(same invariant as computer_batch) — for steps 2+ in a batch, either omit anchor " + + "(centered tooltip) or target elements you know won't have moved. " + + "Good pattern: batch 5 tooltips on page A (last step navigates) → read returned screenshot → " + + "batch 3 tooltips on page B → done. " + + "Returns {exited:true, stepsCompleted:N} if the user clicks Exit — do NOT call again after that; " + + "{stepsCompleted, stepFailed, ...} if an action errors mid-batch; " + + "otherwise {stepsCompleted, results:[...]} plus a final screenshot. 
" + + "Fall back to individual teach_step calls when you need to react to each intermediate screenshot.", + inputSchema: { + type: "object" as const, + properties: { + steps: { + type: "array", + minItems: 1, + items: { + type: "object", + properties: teachStepProperties, + required: ["explanation", "next_preview", "actions"], + }, + description: + "Ordered steps. Validated upfront — a typo in step 5 errors before any tooltip shows.", + }, + }, + required: ["steps"], + }, + }, + ]; +} diff --git a/packages/@ant/computer-use-mcp/src/types.ts b/packages/@ant/computer-use-mcp/src/types.ts index 2247360d5..656f795dc 100644 --- a/packages/@ant/computer-use-mcp/src/types.ts +++ b/packages/@ant/computer-use-mcp/src/types.ts @@ -1,70 +1,622 @@ +import type { + ComputerExecutor, + InstalledApp, + ScreenshotResult, +} from "./executor.js"; + +/** `ScreenshotResult` without the base64 blob. The shape hosts persist for + * cross-respawn `scaleCoord` survival. */ +export type ScreenshotDims = Omit; + +/** Shape mirrors claude-for-chrome-mcp/src/types.ts:1-7 */ +export interface Logger { + info: (message: string, ...args: unknown[]) => void; + error: (message: string, ...args: unknown[]) => void; + warn: (message: string, ...args: unknown[]) => void; + debug: (message: string, ...args: unknown[]) => void; + silly: (message: string, ...args: unknown[]) => void; +} + /** - * @ant/computer-use-mcp — Types + * Per-app permission tier. Hardcoded by category at grant time — the + * approval dialog displays the tier but the user cannot change it (for now). + * + * - `"read"` — visible in screenshots, NO interaction (no clicks, no typing). + * Browsers land here: the model can read a page that's already open, but + * must use the Claude-in-Chrome MCP for any navigation/clicking. Trading + * platforms land here too (no CiC alternative — the model asks the user). + * - `"click"` — visible + plain left-click, scroll. 
NO typing/keys, + * NO right/middle-click, NO modifier-clicks, NO drag-drop (all text- + * injection vectors). Terminals/IDEs land here: the model can click a + * Run button or scroll test output, but `type("rm -rf /")` is blocked + * and so is right-click→Paste and dragging text onto the terminal. + * - `"full"` — visible + click + type/key/paste. Everything else. * - * 从调用侧反推的真实类型定义,替代 any stub。 + * Enforced in `runInputActionGates` via the frontmost-app check: keyboard + * actions require `"full"`, mouse actions require `"click"` or higher. */ +export type CuAppPermTier = "read" | "click" | "full"; -export type CoordinateMode = 'pixels' | 'normalized' +/** + * A single app the user has approved for the current session. Session-scoped + * only — there is no "once" or "forever" scope (unlike Chrome's per-domain + * three-way). CU has no natural "once" unit; one task = hundreds of clicks. + * Mirrors how `chromeAllowedDomains` is a plain `string[]` with no per-item + * scope. + */ +export interface AppGrant { + bundleId: string; + displayName: string; + /** Epoch ms. For Settings-page display ("Granted 3m ago"). */ + grantedAt: number; + /** Undefined → `"full"` (back-compat for pre-tier grants persisted in + * session state). */ + tier?: CuAppPermTier; +} + +/** Orthogonal to the app allowlist. */ +export interface CuGrantFlags { + clipboardRead: boolean; + clipboardWrite: boolean; + /** + * When false, the `key` tool rejects combos in `keyBlocklist.ts` + * (cmd+q, cmd+tab, cmd+space, cmd+shift+q, ctrl+alt+delete). All other + * key sequences work regardless. + */ + systemKeyCombos: boolean; +} +export const DEFAULT_GRANT_FLAGS: CuGrantFlags = { + clipboardRead: false, + clipboardWrite: false, + systemKeyCombos: false, +}; + +/** + * Host picks via GrowthBook JSON feature `chicago_coordinate_mode`, baked + * into tool param descriptions at server-construction time. The model sees + * ONE convention and never learns the other exists. 
`normalized_0_100` + * sidesteps the Retina scaleFactor bug class entirely. + */ +export type CoordinateMode = "pixels" | "normalized_0_100"; + +/** + * Independent kill switches for subtle/risky ported behaviors. Read from + * GrowthBook by the host adapter, consulted in `toolCalls.ts`. + */ export interface CuSubGates { - pixelValidation: boolean - clipboardPasteMultiline: boolean - mouseAnimation: boolean - hideBeforeAction: boolean - autoTargetDisplay: boolean - clipboardGuard: boolean + /** 9×9 exact-byte staleness guard before click. */ + pixelValidation: boolean; + /** Route `type("foo\nbar")` through clipboard instead of keystroke-by-keystroke. */ + clipboardPasteMultiline: boolean; + /** + * Ease-out-cubic mouse glide at 60fps, distance-proportional duration + * (2000 px/sec, capped at 0.5s). Adds up to ~0.5s latency + * per click. When off, cursor teleports instantly. + */ + mouseAnimation: boolean; + /** + * Pre-action sequence: hide non-allowlisted apps, then defocus us (from the + * Vercept acquisition). When off, the + * frontmost gate fires in the normal case and the model gets stuck — this + * is the A/B-test-the-old-broken-behavior switch. + */ + hideBeforeAction: boolean; + /** + * Auto-resolve the target display before each screenshot when the + * selected display has no allowed-app windows. When on, `handleScreenshot` + * uses the atomic Swift path; off → sticks with `selectedDisplayId`. + */ + autoTargetDisplay: boolean; + /** + * Stash+clear the clipboard while a tier-"click" app is frontmost. + * Closes the gap where a click-tier terminal/IDE has a UI Paste button + * that's plain-left-clickable — without this, the tier "click" + * keyboard block can be routed around by clicking Paste. Restored when + * a non-"click" app becomes frontmost, or at turn end. 
+ */ + clipboardGuard: boolean; } -export interface Logger { - silly(message: string, ...args: unknown[]): void - debug(message: string, ...args: unknown[]): void - info(message: string, ...args: unknown[]): void - warn(message: string, ...args: unknown[]): void - error(message: string, ...args: unknown[]): void +// ---------------------------------------------------------------------------- +// Permission request/response (mirror of BridgePermissionRequest, types.ts:77-94) +// ---------------------------------------------------------------------------- + +/** One entry per app the model asked for, after name → bundle ID resolution. */ +export interface ResolvedAppRequest { + /** What the model asked for (e.g. "Slack", "com.tinyspeck.slackmacgap"). */ + requestedName: string; + /** The resolved InstalledApp if found, else undefined (shown greyed in the UI). */ + resolved?: InstalledApp; + /** Shell-access-equivalent bundle IDs get a UI warning. See sentinelApps.ts. */ + isSentinel: boolean; + /** Already in the allowlist → skip the checkbox, return in `granted` immediately. */ + alreadyGranted: boolean; + /** Hardcoded tier for this app (browser→"read", terminal→"click", else "full"). + * The dialog displays this read-only; the renderer passes it through + * verbatim in the AppGrant. */ + proposedTier: CuAppPermTier; } +/** + * Payload for the renderer approval dialog. Rides through the existing + * `ToolPermissionRequest.input: unknown` field + * (packages/utils/desktop/bridge/common/claude.web.ts:1262) — no IPC schema + * change needed. + */ export interface CuPermissionRequest { - apps: Array<{ bundleId: string; displayName: string }> - requestedFlags: GrantFlags - reason: string - tccState: { accessibility: boolean; screenRecording: boolean } - willHide: string[] + requestId: string; + /** Model-provided reason string. Shown prominently in the approval UI. */ + reason: string; + apps: ResolvedAppRequest[]; + /** What the model asked for. 
User can toggle independently of apps. */ + requestedFlags: Partial<CuGrantFlags>; + /** + * For the "On Windows, Claude can see all apps..." footnote. Taken from + * `executor.capabilities.screenshotFiltering` so the renderer doesn't + * need to know about platforms. + */ + screenshotFiltering: "native" | "none"; + /** + * Present only when TCC permissions are NOT yet granted. When present, + * the renderer shows a TCC toggle panel (two rows: Accessibility, Screen + * Recording) INSTEAD OF the app list. Clicking a row's "Request" button + * triggers the OS prompt; the store polls on window-focus and flips the + * toggle when the grant is detected. macOS itself prompts the user to + * restart after granting Screen Recording — we don't. + */ + tccState?: { + accessibility: boolean; + screenRecording: boolean; + }; + /** + * Apps with windows on the CU display that aren't in the requested + * allowlist. These will be hidden the first time Claude takes an action. + * Computed at request_access time — may be slightly stale by the time the + * user clicks Allow, but it's a preview, not a contract. Absent when + * empty so the renderer can skip the section cleanly. + */ + willHide?: Array<{ bundleId: string; displayName: string }>; + /** + * `chicagoAutoUnhide` app preference at request time. The renderer picks + * between "...then restored when Claude is done" and "...will be hidden" + * copy. Absent when `willHide` is absent (same condition). + */ + autoUnhideEnabled?: boolean; } -export interface GrantFlags { - clipboardRead: boolean - clipboardWrite: boolean - systemKeyCombos: boolean +/** + * What the renderer stuffs into `updatedInput._cuGrants` when the user clicks + * "Allow for this session" (mirror of the `_allowAllSites` sentinel at + * LocalAgentModeSessionManager.ts:2794). + */ +export interface CuPermissionResponse { + granted: AppGrant[]; + /** Bundle IDs the user unchecked, or apps that weren't installed.
*/ + denied: Array<{ bundleId: string; reason: "user_denied" | "not_installed" }>; + flags: CuGrantFlags; + /** + * Whether the user clicked Allow in THIS dialog. Only set by the + * teach-mode handler — regular request_access doesn't need it (the + * session manager's `result.behavior` gates the merge there). Needed + * because when all requested apps are already granted (skipDialogGrants + * non-empty, needDialog empty), Allow and Deny produce identical + * `{granted:[], denied:[]}` payloads and the tool handler can't tell + * them apart without this. Undefined → legacy/regular path, do not + * gate on it. + */ + userConsented?: boolean; } -export interface CuPermissionResponse { - granted: string[] - denied: string[] - flags: GrantFlags +// ---------------------------------------------------------------------------- +// Host adapter (mirror of ClaudeForChromeContext, types.ts:33-62) +// ---------------------------------------------------------------------------- + +/** + * Process-lifetime singleton dependencies. Everything that does NOT vary per + * tool call. Built once by `apps/desktop/src/main/nest-only/chicago/hostAdapter.ts`. + * No Electron imports in this package — the host injects everything. + */ +export interface ComputerUseHostAdapter { + serverName: string; + logger: Logger; + executor: ComputerExecutor; + + /** + * TCC state check — Accessibility + Screen Recording on macOS. Pure check, + * no dialog, no relaunch. When either is missing, `request_access` threads + * the state through to the renderer which shows a toggle panel; all other + * tools return a tool error. + */ + ensureOsPermissions(): Promise< + | { granted: true } + | { granted: false; accessibility: boolean; screenRecording: boolean } + >; + + /** The Settings-page kill switch (`chicagoEnabled` app preference). */ + isDisabled(): boolean; + + /** + * The `chicagoAutoUnhide` app preference. 
Consumed by `buildAccessRequest` + * to populate `CuPermissionRequest.autoUnhideEnabled` so the renderer's + * "will be hidden" copy can say "then restored" only when true. + */ + getAutoUnhideEnabled(): boolean; + + /** + * Sub-gates re-read on every tool call so GrowthBook flips take effect + * mid-session without restart. + */ + getSubGates(): CuSubGates; + + /** + * JPEG decode + crop + raw pixel bytes, for the PixelCompare staleness guard. + * Injected so this package stays Electron-free. The host implements it via + * `nativeImage.createFromBuffer(jpeg).crop(rect).toBitmap()` — Chromium's + * decoders, BSD-licensed, no `.node` binary. + * + * Returns null on decode/crop failure — caller treats null as `skipped`, + * click proceeds (validation failure must never block the action). + */ + cropRawPatch( + jpegBase64: string, + rect: { x: number; y: number; width: number; height: number }, + ): Buffer | null; } -export const DEFAULT_GRANT_FLAGS: GrantFlags = { - clipboardRead: false, - clipboardWrite: false, - systemKeyCombos: false, +// ---------------------------------------------------------------------------- +// Session context (getter/callback bag for bindSessionContext) +// ---------------------------------------------------------------------------- + +/** + * Per-session state binding for `bindSessionContext`. Hosts build this once + * per session with getters that read fresh from their session store and + * callbacks that write back. The returned dispatcher builds + * `ComputerUseOverrides` from these getters on every call. + * + * Callbacks must be set at construction time — `bindSessionContext` reads + * them once at bind, not per call. + * + * The lock hooks are **async** — `bindSessionContext` awaits them before + * `handleToolCall`, then passes `checkCuLock: undefined` in overrides so the + * sync Gate-3 in `handleToolCall` no-ops. 
Hosts with in-memory sync locks + * (Cowork) wrap them trivially; hosts with cross-process locks (the CLI's + * O_EXCL file) call the real async primitive directly. + */ +export interface ComputerUseSessionContext { + // ── Read state fresh per call ────────────────────────────────────── + + getAllowedApps(): readonly AppGrant[]; + getGrantFlags(): CuGrantFlags; + /** Per-user auto-deny list (Settings page). Empty array = none. */ + getUserDeniedBundleIds(): readonly string[]; + getSelectedDisplayId(): number | undefined; + getDisplayPinnedByModel?(): boolean; + getDisplayResolvedForApps?(): string | undefined; + getTeachModeActive?(): boolean; + /** Dims-only fallback when `lastScreenshot` is unset (cross-respawn). + * `bindSessionContext` reconstructs `{...dims, base64: ""}` so scaleCoord + * works and pixelCompare correctly skips. */ + getLastScreenshotDims?(): ScreenshotDims | undefined; + + // ── Write-back callbacks ─────────────────────────────────────────── + + /** Shows the approval dialog. Host routes to its UI, awaits user. The + * signal is aborted if the tool call finishes before the user answers + * (MCP timeout, etc.) — hosts dismiss the dialog on abort. */ + onPermissionRequest?( + req: CuPermissionRequest, + signal: AbortSignal, + ): Promise<CuPermissionResponse>; + /** Teach-mode sibling of `onPermissionRequest`. */ + onTeachPermissionRequest?( + req: CuTeachPermissionRequest, + signal: AbortSignal, + ): Promise<CuPermissionResponse>; + /** Called by `bindSessionContext` after merging a permission response into + * the allowlist (dedupe on bundleId, truthy-only flag spread). Host + * persists for resume survival. */ + onAllowedAppsChanged?(apps: readonly AppGrant[], flags: CuGrantFlags): void; + onAppsHidden?(bundleIds: string[]): void; + /** Reads the session's clipboardGuard stash. undefined → no stash held. */ + getClipboardStash?(): string | undefined; + /** Writes the clipboardGuard stash. undefined clears it.
*/ + onClipboardStashChanged?(stash: string | undefined): void; + onResolvedDisplayUpdated?(displayId: number): void; + onDisplayPinned?(displayId: number | undefined): void; + onDisplayResolvedForApps?(sortedBundleIdsKey: string): void; + /** Called after each screenshot. Host persists for respawn survival. */ + onScreenshotCaptured?(dims: ScreenshotDims): void; + onTeachModeActivated?(): void; + onTeachStep?(req: TeachStepRequest): Promise<TeachStepResult>; + onTeachWorking?(): void; + + // ── Lock (async) ─────────────────────────────────────────────────── + + /** At most one session uses CU at a time. Awaited by `bindSessionContext` + * before dispatch. Undefined → no lock gating (proceed). */ + checkCuLock?(): Promise<{ holder: string | undefined; isSelf: boolean }>; + /** Take the lock. Called when `checkCuLock` returned `holder: undefined` + * on a non-deferring tool. Host emits enter-CU signals here. */ + acquireCuLock?(): Promise<void>; + /** Host-specific lock-held error text. Default is the package's generic + * message. The CLI host includes the holder session-ID prefix. */ + formatLockHeldMessage?(holder: string): string; + + /** User-abort signal. Passed through to `ComputerUseOverrides.isAborted` + * for the mid-loop checks in handleComputerBatch / handleType. See that + * field for semantics. */ + isAborted?(): boolean; } -export interface ComputerUseConfig { - coordinateMode: CoordinateMode - enabledTools: string[] +// ---------------------------------------------------------------------------- +// Per-call overrides (mirror of PermissionOverrides, types.ts:97-102) +// ---------------------------------------------------------------------------- + +/** + * Built FRESH on every tool call by `bindSessionContext` from + * `ComputerUseSessionContext` getters. This is what lets a singleton MCP + * server carry per-session state — the state lives on the host's session + * store, not the server.
+ */ +export interface ComputerUseOverrides { + allowedApps: AppGrant[]; + grantFlags: CuGrantFlags; + coordinateMode: CoordinateMode; + + /** + * User-configured auto-deny list (Settings → Desktop app → Computer Use). + * Bundle IDs + * here are stripped from request_access BEFORE the approval dialog — they + * never reach the user for approval regardless of tier. The response tells + * the agent to ask the user to remove the app from their deny list in + * Settings if access is genuinely needed. + * + * Per-USER, persists across restarts (read from appPreferences per call, + * not session state). Contrast with `allowedApps` which is per-session. + * Empty array = no user-configured denies (the default). + */ + userDeniedBundleIds: readonly string[]; + + /** + * Display CU operates on; read fresh per call. `scaleCoord` uses the + * `originX/Y` snapshotted in `lastScreenshot`, so mid-session switches + * only affect the NEXT screenshot/prepare call. + */ + selectedDisplayId?: number; + + /** + * The `request_access` tool handler calls this and awaits. The wrapper + * closure in serverDef.ts (mirroring InternalMcpServerManager.ts:131-177) + * routes through `handleToolPermission` → IPC → renderer ChicagoApproval. + * When it resolves, the wrapper side-effectfully mutates + * `InternalServerContext.cuAllowedApps` BEFORE returning here. + * + * Undefined when the session wasn't wired with a permission handler (e.g. + * a future headless mode). `request_access` returns a tool error in that case. + */ + onPermissionRequest?: (req: CuPermissionRequest) => Promise<CuPermissionResponse>; + + /** + * For the pixel-validation staleness guard. The model's-last-screenshot, + * stashed by serverDef.ts after each `screenshot` tool call. Undefined on + * cold start → pixel validation skipped (click proceeds). + */ + lastScreenshot?: ScreenshotResult; + + /** + * Fired after every `prepareForAction` with the bundle IDs it just hid.
+ * The wrapper closure in serverDef.ts accumulates these into + * `Session.cuHiddenDuringTurn` via a write-through callback (same pattern + * as `onCuPermissionUpdated`). At turn end (`sdkMessage.type === "result"`), + * if the `chicagoAutoUnhide` setting is on, everything in the set is + * unhidden. Set is cleared regardless of the setting so it doesn't leak + * across turns. + * + * Undefined when the session wasn't wired with a tracker — unhide just + * doesn't happen. + */ + onAppsHidden?: (bundleIds: string[]) => void; + + /** + * Reads the clipboardGuard stash from session state. `undefined` means no + * stash is held — `syncClipboardStash` stashes on first entry to click-tier + * and clears on restore. Sibling of the `cuHiddenDuringTurn` getter pattern + * — state lives on the host's session, not module-level here. + */ + getClipboardStash?: () => string | undefined; + + /** + * Writes the clipboardGuard stash to session state. `undefined` clears. + * Sibling of `onAppsHidden` — the wrapper closure writes through to + * `Session.cuClipboardStash`. At turn end the host reads + clears it + * directly and restores via Electron's `clipboard.writeText` (no nest-only + * import surface). + */ + onClipboardStashChanged?: (stash: string | undefined) => void; + + /** + * Write the resolver's picked display back to session so teach overlay + * positioning and subsequent non-resolver calls use the same display. + * Fired by `handleScreenshot` in the atomic `autoTargetDisplay` path when + * `resolvePrepareCapture`'s pick differs from `selectedDisplayId`. + * Fire-and-forget. + */ + onResolvedDisplayUpdated?: (displayId: number) => void; + + /** + * Set when the model explicitly picked a display via `switch_display`. + * When true, `handleScreenshot` passes `autoResolve: false` so the Swift + * resolver honors `selectedDisplayId` directly (straight cuDisplayInfo + * passthrough) instead of running the co-location/chase chain. 
The + * resolver's Step 2 ("host + allowed co-located → host") otherwise + * overrides any `selectedDisplayId` whenever an allowed app shares the + * host's monitor. + */ + displayPinnedByModel?: boolean; + + /** + * Write the model's explicit display pick to session. `displayId: + * undefined` clears both `selectedDisplayId` and the pin (back to auto). + * Sibling of `onResolvedDisplayUpdated` but also sets the pin flag — + * the two are semantically distinct (resolver-picked vs model-picked). + */ + onDisplayPinned?: (displayId: number | undefined) => void; + + /** + * Sorted comma-joined bundle-ID set the display was last auto-resolved + * for. `handleScreenshot` compares this to the current allowed set and + * only passes `autoResolve: true` when they differ — so the resolver + * doesn't yank the display on every screenshot, only when the app set + * has changed since the last resolve (or manual switch). + */ + displayResolvedForApps?: string; + + /** + * Records which app set the current display selection was made for. Fired + * alongside `onResolvedDisplayUpdated` when the resolver picks, so the next + * screenshot sees a matching set and skips auto-resolve. + */ + onDisplayResolvedForApps?: (sortedBundleIdsKey: string) => void; + + /** + * Global CU lock — at most one session actively uses CU at a time. Checked + * in `handleToolCall` after kill-switch/TCC, before dispatch. Every CU tool + * including `request_access` goes through it. + * + * - `holder === undefined` → lock is free, safe to acquire + * - `isSelf === true` → this session already holds it (no-op, proceed) + * - `holder !== undefined && !isSelf` → blocked, return tool error + * + * `undefined` callback → lock system not wired (e.g. CCD). Proceed without + * gating — absence of the mechanism ≠ locked out. + * + * The host manages release (on session idle/stop/archive) — this package + * never releases. 
+ */ + checkCuLock?: () => { holder: string | undefined; isSelf: boolean }; + + /** + * Take the lock for this session. `handleToolCall` calls this exactly once + * per turn, on the FIRST CU tool call when `checkCuLock().holder` is + * undefined. No-op if already held (defensive — the check should have + * short-circuited). Host emits an event the overlay listens to. + */ + acquireCuLock?: () => void; + + /** + * User-abort signal. Checked mid-iteration inside `handleComputerBatch` + * and `handleType`'s grapheme loop so an in-flight batch/type stops + * promptly on overlay Stop instead of running to completion after the + * host has already abandoned the tool result. + * + * Undefined → never aborts (e.g. unwired host). Live per-check read — + * same lazy-getter pattern as `checkCuLock`. + */ + isAborted?: () => boolean; + + // ── Teach mode ─────────────────────────────────────────────────────── + // Wired only when the host's teachModeEnabled gate is on. All five + // undefined → `request_teach_access` / `teach_step` return tool errors + // and teach mode is effectively off. + + /** + * Sibling of `onPermissionRequest`. Same blocking-await-on-renderer-dialog + * semantics, but routes to ComputerUseTeachApproval.tsx (which explains + * the window-hides-during-guide behavior) instead of ComputerUseApproval. + * The wrapper closure in serverDef.ts writes grants through to session state + * via `onCuPermissionUpdated` exactly as `onPermissionRequest` does. + */ + onTeachPermissionRequest?: ( + req: CuTeachPermissionRequest, + ) => Promise<CuPermissionResponse>; + + /** + * Called by `handleRequestTeachAccess` after the user approves and at least + * one app was granted. Host sets `session.teachModeActive = true`, emits + * `teachModeChanged` → teach controller hides the main window and shows the + * fullscreen overlay. Cleared by the host on turn end (`transitionTo("idle")`) + * alongside the CU lock release.
+ */ + onTeachModeActivated?: () => void; + + /** + * Read by `handleRequestAccess` and `handleRequestTeachAccess` to + * short-circuit with a clear tool error when teach mode is active. The + * main window is hidden during teach mode, so permission dialogs render + * invisibly and handleToolPermission blocks forever on an invisible + * prompt. Better to tell the model to exit teach mode first. Getter + * (not a boolean field) because teach mode state lives on the session, + * not on this per-call overrides object. + */ + getTeachModeActive?: () => boolean; + + /** + * Called by `handleTeachStep` with the scaled anchor + text. Host stores + * the resolver, emits `teachStepRequested` → teach controller pushes the + * payload to the overlay → user reads, clicks Next → IPC → host calls the + * stored resolver → this promise resolves. `{action: "exit"}` when the user + * clicks Exit (or the turn is interrupted) — `handleTeachStep` short-circuits + * without executing actions. + * + * Same blocking-promise pattern as `onPermissionRequest`, but resolved by + * the teach overlay's own preload (not the main renderer's tool-approval UI). + */ + onTeachStep?: (req: TeachStepRequest) => Promise<TeachStepResult>; + + /** + * Called immediately after `onTeachStep` resolves with "next", before + * action dispatch begins. Host emits `teachStepWorking` → overlay flips to + * the spinner state (Next button gone, Exit stays, "Working…" + rotating + * notch). The next `onTeachStep` call replaces the spinner with the new + * tooltip content.
+ */ + onTeachWorking?: () => void; } -export interface ComputerUseHostAdapter { - serverName: string - logger: Logger - executor: ComputerExecutor - ensureOsPermissions(): Promise<{ granted: true } | { granted: false; accessibility: boolean; screenRecording: boolean }> - isDisabled(): boolean - getSubGates(): CuSubGates - getAutoUnhideEnabled(): boolean - cropRawPatch?(base64: string, x: number, y: number, w: number, h: number): Promise +// ---------------------------------------------------------------------------- +// Teach mode (guided-tour tooltips with Next-button action execution) +// ---------------------------------------------------------------------------- + +/** + * Payload the host pushes to the teach overlay BrowserWindow. Built by + * `handleTeachStep` in toolCalls.ts from the model's `teach_step` args. + * + * `anchorLogical` here is POST-`scaleCoord` — **full-display** logical + * macOS points (origin = monitor top-left, menu bar included, since + * cuDisplayInfo returns CGDisplayBounds). The overlay window is positioned + * at `workArea.{x,y}` (excludes menu bar/Dock), so `updateTeachStep` in + * teach/window.ts subtracts the workArea offset before IPC so the HTML's + * CSS coords match. + */ +export interface TeachStepRequest { + explanation: string; + nextPreview: string; + /** Full-display logical points. Undefined → overlay centers the tooltip, hides the arrow. */ + anchorLogical?: { x: number; y: number }; } -export interface ComputerExecutor { - capabilities: Record +export type TeachStepResult = { action: "next" } | { action: "exit" }; + +/** + * Payload for the renderer's ComputerUseTeachApproval dialog. Rides through + * `ToolPermissionRequest.input: unknown` same as `CuPermissionRequest`. + * Separate type (not a flag on `CuPermissionRequest`) so the two approval + * components can narrow independently and the teach dialog is free to drop + * fields it doesn't render (no grant-flag checkboxes in teach mode). 
+ */ +export interface CuTeachPermissionRequest { + requestId: string; + /** Model-provided reason. Shown in the dialog headline ("guide you through {reason}"). */ + reason: string; + apps: ResolvedAppRequest[]; + screenshotFiltering: "native" | "none"; + /** Present only when TCC is ungranted — same semantics as `CuPermissionRequest.tccState`. */ + tccState?: { + accessibility: boolean; + screenRecording: boolean; + }; + willHide?: Array<{ bundleId: string; displayName: string }>; + /** Same semantics as `CuPermissionRequest.autoUnhideEnabled`. */ + autoUnhideEnabled?: boolean; } diff --git a/packages/@ant/computer-use-swift/src/backends/darwin.ts b/packages/@ant/computer-use-swift/src/backends/darwin.ts new file mode 100644 index 000000000..620f162a9 --- /dev/null +++ b/packages/@ant/computer-use-swift/src/backends/darwin.ts @@ -0,0 +1,258 @@ +/** + * macOS backend for computer-use-swift + * + * Uses AppleScript/JXA/screencapture for display info, app management, + * and screenshots. + */ + +import { readFileSync, unlinkSync } from 'fs' +import { tmpdir } from 'os' +import { join } from 'path' +import type { + AppInfo, AppsAPI, DisplayAPI, DisplayGeometry, InstalledApp, + PrepareDisplayResult, RunningApp, ScreenshotAPI, ScreenshotResult, + SwiftBackend, WindowDisplayInfo, +} from '../types.js' + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +function jxaSync(script: string): string { + const result = Bun.spawnSync({ + cmd: ['osascript', '-l', 'JavaScript', '-e', script], + stdout: 'pipe', stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +function osascriptSync(script: string): string { + const result = Bun.spawnSync({ + cmd: ['osascript', '-e', script], + stdout: 'pipe', stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +async function osascript(script: string): Promise<string> { + 
const proc = Bun.spawn(['osascript', '-e', script], { + stdout: 'pipe', stderr: 'pipe', + }) + const text = await new Response(proc.stdout).text() + await proc.exited + return text.trim() +} + +async function jxa(script: string): Promise<string> { + const proc = Bun.spawn(['osascript', '-l', 'JavaScript', '-e', script], { + stdout: 'pipe', stderr: 'pipe', + }) + const text = await new Response(proc.stdout).text() + await proc.exited + return text.trim() +} + +// --------------------------------------------------------------------------- +// DisplayAPI +// --------------------------------------------------------------------------- + +export const display: DisplayAPI = { + getSize(displayId?: number): DisplayGeometry { + const all = this.listAll() + if (displayId !== undefined) { + const found = all.find(d => d.displayId === displayId) + if (found) return found + } + return all[0] ?? { width: 1920, height: 1080, scaleFactor: 2, displayId: 1 } + }, + + listAll(): DisplayGeometry[] { + try { + const raw = jxaSync(` + ObjC.import("CoreGraphics"); + var displays = $.CGDisplayCopyAllDisplayModes ? [] : []; + var active = $.CGGetActiveDisplayList(10, null, Ref()); + var countRef = Ref(); + $.CGGetActiveDisplayList(0, null, countRef); + var count = countRef[0]; + var idBuf = Ref(); + $.CGGetActiveDisplayList(count, idBuf, countRef); + var result = []; + for (var i = 0; i < count; i++) { + var did = idBuf[i]; + var w = $.CGDisplayPixelsWide(did); + var h = $.CGDisplayPixelsHigh(did); + var mode = $.CGDisplayCopyDisplayMode(did); + var pw = $.CGDisplayModeGetPixelWidth(mode); + var sf = pw > 0 && w > 0 ? 
pw / w : 2; + result.push({width: w, height: h, scaleFactor: sf, displayId: did}); + } + JSON.stringify(result); + `) + return (JSON.parse(raw) as DisplayGeometry[]).map(d => ({ + width: Number(d.width), height: Number(d.height), + scaleFactor: Number(d.scaleFactor), displayId: Number(d.displayId), + })) + } catch { + try { + const raw = jxaSync(` + ObjC.import("AppKit"); + var screens = $.NSScreen.screens; + var result = []; + for (var i = 0; i < screens.count; i++) { + var s = screens.objectAtIndex(i); + var frame = s.frame; + var desc = s.deviceDescription; + var screenNumber = desc.objectForKey($("NSScreenNumber")).intValue; + var backingFactor = s.backingScaleFactor; + result.push({ + width: Math.round(frame.size.width), + height: Math.round(frame.size.height), + scaleFactor: backingFactor, + displayId: screenNumber + }); + } + JSON.stringify(result); + `) + return (JSON.parse(raw) as DisplayGeometry[]).map(d => ({ + width: Number(d.width), height: Number(d.height), + scaleFactor: Number(d.scaleFactor), displayId: Number(d.displayId), + })) + } catch { + return [{ width: 1920, height: 1080, scaleFactor: 2, displayId: 1 }] + } + } + }, +} + +// --------------------------------------------------------------------------- +// AppsAPI +// --------------------------------------------------------------------------- + +export const apps: AppsAPI = { + async prepareDisplay(_allowlistBundleIds, _surrogateHost, _displayId) { + return { activated: '', hidden: [] } + }, + + async previewHideSet(_bundleIds, _displayId) { + return [] + }, + + async findWindowDisplays(bundleIds) { + return bundleIds.map(bundleId => ({ bundleId, displayIds: [1] })) + }, + + async appUnderPoint(_x, _y) { + try { + const result = await jxa(` + ObjC.import("CoreGraphics"); + ObjC.import("AppKit"); + var pt = $.CGPointMake(${_x}, ${_y}); + var app = $.NSWorkspace.sharedWorkspace.frontmostApplication; + JSON.stringify({bundleId: app.bundleIdentifier.js, displayName: app.localizedName.js}); + `) + 
return JSON.parse(result) + } catch { + return null + } + }, + + async listInstalled() { + try { + const result = await osascript(` + tell application "System Events" + set appList to "" + repeat with appFile in (every file of folder "Applications" of startup disk whose name ends with ".app") + set appPath to POSIX path of (appFile as alias) + set appName to name of appFile + set appList to appList & appPath & "|" & appName & "\\n" + end repeat + return appList + end tell + `) + return result.split('\n').filter(Boolean).map(line => { + const [path, name] = line.split('|', 2) + const displayName = (name ?? '').replace(/\.app$/, '') + return { + bundleId: `com.app.${displayName.toLowerCase().replace(/\s+/g, '-')}`, + displayName, + path: path ?? '', + } + }) + } catch { + return [] + } + }, + + iconDataUrl(_path) { + return null + }, + + listRunning() { + try { + const raw = jxaSync(` + var apps = Application("System Events").applicationProcesses.whose({backgroundOnly: false}); + var result = []; + for (var i = 0; i < apps.length; i++) { + try { + var a = apps[i]; + result.push({bundleId: a.bundleIdentifier(), displayName: a.name()}); + } catch(e) {} + } + JSON.stringify(result); + `) + return JSON.parse(raw) + } catch { + return [] + } + }, + + async open(bundleId) { + await osascript(`tell application id "${bundleId}" to activate`) + }, + + async unhide(bundleIds) { + for (const bundleId of bundleIds) { + await osascript(` + tell application "System Events" + set visible of application process (name of application process whose bundle identifier is "${bundleId}") to true + end tell + `) + } + }, +} + +// --------------------------------------------------------------------------- +// ScreenshotAPI +// --------------------------------------------------------------------------- + +async function captureScreenToBase64(args: string[]): Promise<{ base64: string; width: number; height: number }> { + const tmpFile = join(tmpdir(), `cu-screenshot-${Date.now()}.png`) + const 
proc = Bun.spawn(['screencapture', ...args, tmpFile], { + stdout: 'pipe', stderr: 'pipe', + }) + await proc.exited + try { + const buf = readFileSync(tmpFile) + const base64 = buf.toString('base64') + const width = buf.readUInt32BE(16) + const height = buf.readUInt32BE(20) + return { base64, width, height } + } finally { + try { unlinkSync(tmpFile) } catch {} + } +} + +export const screenshot: ScreenshotAPI = { + async captureExcluding(_allowedBundleIds, _quality, _targetW, _targetH, displayId) { + const args = ['-x'] + if (displayId !== undefined) args.push('-D', String(displayId)) + return captureScreenToBase64(args) + }, + + async captureRegion(_allowedBundleIds, x, y, w, h, _outW, _outH, _quality, displayId) { + const args = ['-x', '-R', `${x},${y},${w},${h}`] + if (displayId !== undefined) args.push('-D', String(displayId)) + return captureScreenToBase64(args) + }, +} diff --git a/packages/@ant/computer-use-swift/src/backends/linux.ts b/packages/@ant/computer-use-swift/src/backends/linux.ts new file mode 100644 index 000000000..692575433 --- /dev/null +++ b/packages/@ant/computer-use-swift/src/backends/linux.ts @@ -0,0 +1,278 @@ +/** + * Linux backend for computer-use-swift + * + * Uses xrandr for display info, scrot for screenshots, + * wmctrl/xdotool for window management, and xdg-open for launching apps. 
+ * + * Requires: xrandr, scrot, xdotool, wmctrl (optional) + */ + +import type { + AppInfo, AppsAPI, DisplayAPI, DisplayGeometry, InstalledApp, + PrepareDisplayResult, RunningApp, ScreenshotAPI, ScreenshotResult, + SwiftBackend, WindowDisplayInfo, +} from '../types.js' + +// --------------------------------------------------------------------------- +// Shell helpers +// --------------------------------------------------------------------------- + +function run(cmd: string[]): string { + const result = Bun.spawnSync({ + cmd, + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +async function runAsync(cmd: string[]): Promise<string> { + const proc = Bun.spawn(cmd, { stdout: 'pipe', stderr: 'pipe' }) + const out = await new Response(proc.stdout).text() + await proc.exited + return out.trim() +} + +function commandExists(name: string): boolean { + const result = Bun.spawnSync({ cmd: ['which', name], stdout: 'pipe', stderr: 'pipe' }) + return result.exitCode === 0 +} + +// --------------------------------------------------------------------------- +// DisplayAPI +// --------------------------------------------------------------------------- + +export const display: DisplayAPI = { + getSize(displayId?: number): DisplayGeometry { + const all = this.listAll() + if (displayId !== undefined) { + const found = all.find(d => d.displayId === displayId) + if (found) return found + } + return all[0] ?? 
{ width: 1920, height: 1080, scaleFactor: 1, displayId: 0 } + }, + + listAll(): DisplayGeometry[] { + try { + const raw = run(['xrandr', '--query']) + const displays: DisplayGeometry[] = [] + let idx = 0 + + // Match lines like: "HDMI-1 connected 1920x1080+0+0" or "eDP-1 connected primary 2560x1440+0+0" + const regex = /^\S+\s+connected\s+(?:primary\s+)?(\d+)x(\d+)\+\d+\+\d+/gm + let match: RegExpExecArray | null + while ((match = regex.exec(raw)) !== null) { + displays.push({ + width: Number(match[1]), + height: Number(match[2]), + scaleFactor: 1, + displayId: idx++, + }) + } + + if (displays.length === 0) { + return [{ width: 1920, height: 1080, scaleFactor: 1, displayId: 0 }] + } + return displays + } catch { + return [{ width: 1920, height: 1080, scaleFactor: 1, displayId: 0 }] + } + }, +} + +// --------------------------------------------------------------------------- +// AppsAPI +// --------------------------------------------------------------------------- + +export const apps: AppsAPI = { + async prepareDisplay(_allowlistBundleIds, _surrogateHost, _displayId): Promise { + return { activated: '', hidden: [] } + }, + + async previewHideSet(_bundleIds, _displayId): Promise { + return [] + }, + + async findWindowDisplays(bundleIds): Promise { + return bundleIds.map(bundleId => ({ bundleId, displayIds: [0] })) + }, + + async appUnderPoint(x, y): Promise { + try { + // Move mouse to point, get window under cursor + const out = run(['xdotool', 'mousemove', '--sync', String(x), String(y), 'getmouselocation', '--shell']) + const windowMatch = out.match(/WINDOW=(\d+)/) + if (!windowMatch) return null + + const windowId = windowMatch[1] + const pidStr = run(['xdotool', 'getwindowpid', windowId!]) + if (!pidStr) return null + + let exePath = '' + try { exePath = run(['readlink', '-f', `/proc/${pidStr}/exe`]) } catch { /* ignore */ } + + let appName = '' + try { appName = run(['cat', `/proc/${pidStr}/comm`]) } catch { /* ignore */ } + + if (!exePath && !appName) return 
null + return { bundleId: exePath || pidStr!, displayName: appName || 'unknown' } + } catch { + return null + } + }, + + async listInstalled(): Promise { + try { + // Read .desktop files from standard locations + const dirs = ['/usr/share/applications', '/usr/local/share/applications', `${process.env.HOME}/.local/share/applications`] + const apps: InstalledApp[] = [] + + for (const dir of dirs) { + let files: string + try { + files = run(['find', dir, '-name', '*.desktop', '-maxdepth', '1']) + } catch { continue } + + for (const filepath of files.split('\n').filter(Boolean)) { + try { + const content = run(['cat', filepath]) + const nameMatch = content.match(/^Name=(.+)$/m) + const execMatch = content.match(/^Exec=(.+)$/m) + const noDisplay = content.match(/^NoDisplay=true$/m) + if (noDisplay) continue + + const name = nameMatch?.[1] ?? '' + const exec = execMatch?.[1] ?? '' + if (!name) continue + + apps.push({ + bundleId: filepath.split('/').pop()?.replace('.desktop', '') ?? '', + displayName: name, + path: exec.split(/\s+/)[0] ?? 
'', + }) + } catch { /* skip unreadable files */ } + } + } + + return apps.slice(0, 200) + } catch { + return [] + } + }, + + iconDataUrl(_path): string | null { + return null + }, + + listRunning(): RunningApp[] { + try { + // Try wmctrl first + if (commandExists('wmctrl')) { + const raw = run(['wmctrl', '-l', '-p']) + const apps: RunningApp[] = [] + for (const line of raw.split('\n').filter(Boolean)) { + // wmctrl format: "0x04000003 0 12345 hostname Window Title" + const parts = line.split(/\s+/) + const pid = parts[2] + if (!pid || pid === '0') continue + + let exePath = '' + try { exePath = run(['readlink', '-f', `/proc/${pid}/exe`]) } catch { /* ignore */ } + let appName = '' + try { appName = run(['cat', `/proc/${pid}/comm`]) } catch { /* ignore */ } + + if (appName) { + apps.push({ bundleId: exePath || pid, displayName: appName }) + } + } + // Deduplicate by bundleId + const seen = new Set() + return apps.filter(a => { + if (seen.has(a.bundleId)) return false + seen.add(a.bundleId) + return true + }).slice(0, 50) + } + + // Fallback: ps with visible processes + const raw = run(['ps', '-eo', 'pid,comm', '--no-headers']) + const apps: RunningApp[] = [] + for (const line of raw.split('\n').filter(Boolean).slice(0, 50)) { + const match = line.trim().match(/^(\d+)\s+(.+)$/) + if (match) { + apps.push({ bundleId: match[1]!, displayName: match[2]! }) + } + } + return apps + } catch { + return [] + } + }, + + async open(name): Promise { + // Try gtk-launch first (for .desktop file names), fall back to xdg-open + try { + const desktopName = name.endsWith('.desktop') ? 
name : `${name}.desktop` + if (commandExists('gtk-launch')) { + await runAsync(['gtk-launch', desktopName]) + return + } + } catch { /* fall through */ } + + await runAsync(['xdg-open', name]) + }, + + async unhide(bundleIds): Promise { + for (const id of bundleIds) { + try { + if (commandExists('wmctrl') && id.startsWith('0x')) { + // Window ID — use wmctrl + await runAsync(['wmctrl', '-i', '-R', id]) + } else { + // Try xdotool windowactivate with search by name + await runAsync(['xdotool', 'search', '--name', id, 'windowactivate']) + } + } catch { /* ignore failures for individual windows */ } + } + }, +} + +// --------------------------------------------------------------------------- +// ScreenshotAPI +// --------------------------------------------------------------------------- + +const SCREENSHOT_PATH = '/tmp/cu-screenshot.png' + +export const screenshot: ScreenshotAPI = { + async captureExcluding(_allowedBundleIds, _quality, _targetW, _targetH, _displayId): Promise { + try { + await runAsync(['scrot', '-o', SCREENSHOT_PATH]) + + // Read the file as base64 + const file = Bun.file(SCREENSHOT_PATH) + const buffer = await file.arrayBuffer() + const base64 = Buffer.from(buffer).toString('base64') + + // Get dimensions from display info + const size = display.getSize(_displayId) + return { base64, width: size.width, height: size.height } + } catch { + return { base64: '', width: 0, height: 0 } + } + }, + + async captureRegion(_allowedBundleIds, x, y, w, h, _outW, _outH, _quality, _displayId): Promise { + try { + // scrot -a x,y,w,h captures a specific region + await runAsync(['scrot', '-a', `${x},${y},${w},${h}`, '-o', SCREENSHOT_PATH]) + + const file = Bun.file(SCREENSHOT_PATH) + const buffer = await file.arrayBuffer() + const base64 = Buffer.from(buffer).toString('base64') + + return { base64, width: w, height: h } + } catch { + return { base64: '', width: 0, height: 0 } + } + }, +} diff --git a/packages/@ant/computer-use-swift/src/backends/win32.ts 
b/packages/@ant/computer-use-swift/src/backends/win32.ts new file mode 100644 index 000000000..70f006bf1 --- /dev/null +++ b/packages/@ant/computer-use-swift/src/backends/win32.ts @@ -0,0 +1,263 @@ +/** + * Windows backend for computer-use-swift + * + * Uses PowerShell with .NET System.Drawing / System.Windows.Forms for + * screenshots and Win32 P/Invoke for window/process management. + */ + +import type { + AppInfo, AppsAPI, DisplayAPI, DisplayGeometry, InstalledApp, + PrepareDisplayResult, RunningApp, ScreenshotAPI, ScreenshotResult, + SwiftBackend, WindowDisplayInfo, +} from '../types.js' + +import { listWindows } from 'src/utils/computerUse/win32/windowEnum.js' +import { captureWindow, captureWindowByHwnd } from 'src/utils/computerUse/win32/windowCapture.js' + +// --------------------------------------------------------------------------- +// PowerShell helper +// --------------------------------------------------------------------------- + +function ps(script: string): string { + const result = Bun.spawnSync({ + cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +async function psAsync(script: string): Promise<string> { + const proc = Bun.spawn( + ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + { stdout: 'pipe', stderr: 'pipe' }, + ) + const out = await new Response(proc.stdout).text() + await proc.exited + return out.trim() +} + +// --------------------------------------------------------------------------- +// DisplayAPI +// --------------------------------------------------------------------------- + +export const display: DisplayAPI = { + getSize(displayId?: number): DisplayGeometry { + const all = this.listAll() + if (displayId !== undefined) { + const found = all.find(d => d.displayId === displayId) + if (found) return found + } + return all[0] ?? 
{ width: 1920, height: 1080, scaleFactor: 1, displayId: 0 } + }, + + listAll(): DisplayGeometry[] { + try { + const raw = ps(` +Add-Type -AssemblyName System.Windows.Forms +$result = @() +$idx = 0 +foreach ($s in [System.Windows.Forms.Screen]::AllScreens) { + $result += "$($s.Bounds.Width),$($s.Bounds.Height),$idx,$($s.Primary)" + $idx++ +} +$result -join "|" +`) + return raw.split('|').filter(Boolean).map(entry => { + const [w, h, id, primary] = entry.split(',') + return { + width: Number(w), + height: Number(h), + scaleFactor: 1, // Windows DPI scaling handled at system level + displayId: Number(id), + } + }) + } catch { + return [{ width: 1920, height: 1080, scaleFactor: 1, displayId: 0 }] + } + }, +} + +// --------------------------------------------------------------------------- +// AppsAPI +// --------------------------------------------------------------------------- + +export const apps: AppsAPI = { + async prepareDisplay(_allowlistBundleIds, _surrogateHost, _displayId) { + return { activated: '', hidden: [] } + }, + + async previewHideSet(_bundleIds, _displayId) { + return [] + }, + + async findWindowDisplays(bundleIds) { + return bundleIds.map(bundleId => ({ bundleId, displayIds: [0] })) + }, + + async appUnderPoint(_x, _y) { + try { + const out = ps(` +Add-Type @' +using System; +using System.Runtime.InteropServices; +public class WinPt { + [StructLayout(LayoutKind.Sequential)] public struct POINT { public int X; public int Y; } + [DllImport("user32.dll")] public static extern IntPtr WindowFromPoint(POINT p); + [DllImport("user32.dll")] public static extern uint GetWindowThreadProcessId(IntPtr hWnd, out uint pid); +} +'@ +$pt = New-Object WinPt+POINT +$pt.X = ${_x}; $pt.Y = ${_y} +$hwnd = [WinPt]::WindowFromPoint($pt) +$procId = [uint32]0 +[WinPt]::GetWindowThreadProcessId($hwnd, [ref]$procId) | Out-Null +$proc = Get-Process -Id $procId -ErrorAction SilentlyContinue +"$($proc.MainModule.FileName)|$($proc.ProcessName)" +`) + if (!out || !out.includes('|')) 
return 
null + const [exePath, name] = out.split('|', 2) + return { bundleId: exePath!, displayName: name! } + } catch { + return null + } + }, + + async listInstalled() { + try { + const raw = await psAsync(` +$apps = @() +$paths = @( + 'HKLM:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\*', + 'HKLM:\\SOFTWARE\\WOW6432Node\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\*', + 'HKCU:\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\*' +) +foreach ($p in $paths) { + Get-ItemProperty $p -ErrorAction SilentlyContinue | Where-Object { $_.DisplayName } | ForEach-Object { + $apps += "$($_.DisplayName)|$($_.InstallLocation)|$($_.PSChildName)" + } +} +$apps | Select-Object -Unique | Select-Object -First 200 +`) + return raw.split('\n').filter(Boolean).map(line => { + const [name, path, id] = line.split('|', 3) + return { + bundleId: id ?? name ?? '', + displayName: name ?? '', + path: path ?? '', + } + }) + } catch { + return [] + } + }, + + iconDataUrl(_path) { + return null + }, + + listRunning() { + try { + const windows = listWindows() + return windows.map(w => ({ + bundleId: String(w.hwnd), + displayName: w.title, + })) + } catch { + return [] + } + }, + + async open(name) { + // On Windows, name is the exe path (bundleId) or process name. + // Try exe path first, fall back to process name lookup. 
+ const escaped = name.replace(/'/g, "''") + await psAsync(` +if (Test-Path '${escaped}') { + Start-Process '${escaped}' +} else { + Start-Process -FilePath '${escaped}' -ErrorAction SilentlyContinue +}`) + }, + + async unhide(bundleIds) { + // Windows: bring window to foreground + for (const name of bundleIds) { + await psAsync(` +Add-Type @' +using System; +using System.Runtime.InteropServices; +public class WinShow { + [DllImport("user32.dll")] public static extern bool ShowWindow(IntPtr hWnd, int nCmd); + [DllImport("user32.dll")] public static extern bool SetForegroundWindow(IntPtr hWnd); +} +'@ +$proc = Get-Process -Name "${name}" -ErrorAction SilentlyContinue | Select-Object -First 1 +if ($proc) { [WinShow]::ShowWindow($proc.MainWindowHandle, 9) | Out-Null; [WinShow]::SetForegroundWindow($proc.MainWindowHandle) | Out-Null } +`) + } + }, +} + +// --------------------------------------------------------------------------- +// ScreenshotAPI +// --------------------------------------------------------------------------- + +export const screenshot: ScreenshotAPI = { + async captureExcluding(_allowedBundleIds, _quality, _targetW, _targetH, displayId) { + const raw = await psAsync(` +Add-Type -AssemblyName System.Windows.Forms +Add-Type -AssemblyName System.Drawing +$screen = if (${displayId ?? -1} -ge 0) { [System.Windows.Forms.Screen]::AllScreens[${displayId ?? 
0}] } else { [System.Windows.Forms.Screen]::PrimaryScreen } +$bounds = $screen.Bounds +$bmp = New-Object System.Drawing.Bitmap($bounds.Width, $bounds.Height) +$g = [System.Drawing.Graphics]::FromImage($bmp) +$g.CopyFromScreen($bounds.Location, [System.Drawing.Point]::Empty, $bounds.Size) +$g.Dispose() +$ms = New-Object System.IO.MemoryStream +$bmp.Save($ms, [System.Drawing.Imaging.ImageFormat]::Png) +$bmp.Dispose() +$bytes = $ms.ToArray() +$ms.Dispose() +"$($bounds.Width),$($bounds.Height)," + [Convert]::ToBase64String($bytes) +`) + const firstComma = raw.indexOf(',') + const secondComma = raw.indexOf(',', firstComma + 1) + const width = Number(raw.slice(0, firstComma)) + const height = Number(raw.slice(firstComma + 1, secondComma)) + const base64 = raw.slice(secondComma + 1) + return { base64, width, height } + }, + + async captureRegion(_allowedBundleIds, x, y, w, h, _outW, _outH, _quality, _displayId) { + const raw = await psAsync(` +Add-Type -AssemblyName System.Windows.Forms +Add-Type -AssemblyName System.Drawing +$bmp = New-Object System.Drawing.Bitmap(${w}, ${h}) +$g = [System.Drawing.Graphics]::FromImage($bmp) +$g.CopyFromScreen(${x}, ${y}, 0, 0, (New-Object System.Drawing.Size(${w}, ${h}))) +$g.Dispose() +$ms = New-Object System.IO.MemoryStream +$bmp.Save($ms, [System.Drawing.Imaging.ImageFormat]::Png) +$bmp.Dispose() +$bytes = $ms.ToArray() +$ms.Dispose() +"${w},${h}," + [Convert]::ToBase64String($bytes) +`) + const firstComma = raw.indexOf(',') + const secondComma = raw.indexOf(',', firstComma + 1) + const base64 = raw.slice(secondComma + 1) + return { base64, width: w, height: h } + }, + + /** + * Capture a specific window by title or HWND using PrintWindow. + * Works even for occluded or background windows. 
+ */ + captureWindowTarget(titleOrHwnd: string | number): ScreenshotResult | null { + if (typeof titleOrHwnd === 'number') { + return captureWindowByHwnd(titleOrHwnd) + } + return captureWindow(titleOrHwnd) + }, +} diff --git a/packages/@ant/computer-use-swift/src/index.ts b/packages/@ant/computer-use-swift/src/index.ts index 87a0ade16..cae206425 100644 --- a/packages/@ant/computer-use-swift/src/index.ts +++ b/packages/@ant/computer-use-swift/src/index.ts @@ -1,377 +1,84 @@ /** - * @ant/computer-use-swift — macOS 实现 + * @ant/computer-use-swift — cross-platform display, apps, and screenshot API * - * 用 AppleScript/JXA/screencapture 替代原始 Swift 原生模块。 - * 提供显示器信息、应用管理、截图等功能。 + * Platform backends: + * - darwin: AppleScript/JXA + screencapture + * - win32: PowerShell + System.Drawing + Win32 P/Invoke * - * 仅 macOS 支持。 + * Add new platforms by creating backends/.ts implementing SwiftBackend. */ -import { readFileSync, unlinkSync } from 'fs' -import { tmpdir } from 'os' -import { join } from 'path' +// Re-export all types +export type { + DisplayGeometry, + PrepareDisplayResult, + AppInfo, + InstalledApp, + RunningApp, + ScreenshotResult, + ResolvePrepareCaptureResult, + WindowDisplayInfo, + DisplayAPI, + AppsAPI, + ScreenshotAPI, + SwiftBackend, +} from './types.js' + +import type { ResolvePrepareCaptureResult, SwiftBackend } from './types.js' // --------------------------------------------------------------------------- -// Types (exported for callers) +// Platform dispatch // --------------------------------------------------------------------------- -export interface DisplayGeometry { - width: number - height: number - scaleFactor: number - displayId: number -} - -export interface PrepareDisplayResult { - activated: string - hidden: string[] -} - -export interface AppInfo { - bundleId: string - displayName: string -} - -export interface InstalledApp { - bundleId: string - displayName: string - path: string - iconDataUrl?: string -} - -export interface RunningApp { - 
bundleId: string - displayName: string -} - -export interface ScreenshotResult { - base64: string - width: number - height: number -} - -export interface ResolvePrepareCaptureResult { - base64: string - width: number - height: number -} - -export interface WindowDisplayInfo { - bundleId: string - displayIds: number[] -} - -// --------------------------------------------------------------------------- -// Helpers -// --------------------------------------------------------------------------- - -function jxaSync(script: string): string { - const result = Bun.spawnSync({ - cmd: ['osascript', '-l', 'JavaScript', '-e', script], - stdout: 'pipe', stderr: 'pipe', - }) - return new TextDecoder().decode(result.stdout).trim() -} - -function osascriptSync(script: string): string { - const result = Bun.spawnSync({ - cmd: ['osascript', '-e', script], - stdout: 'pipe', stderr: 'pipe', - }) - return new TextDecoder().decode(result.stdout).trim() -} - -async function osascript(script: string): Promise { - const proc = Bun.spawn(['osascript', '-e', script], { - stdout: 'pipe', stderr: 'pipe', - }) - const text = await new Response(proc.stdout).text() - await proc.exited - return text.trim() -} - -async function jxa(script: string): Promise { - const proc = Bun.spawn(['osascript', '-l', 'JavaScript', '-e', script], { - stdout: 'pipe', stderr: 'pipe', - }) - const text = await new Response(proc.stdout).text() - await proc.exited - return text.trim() -} - -// --------------------------------------------------------------------------- -// DisplayAPI -// --------------------------------------------------------------------------- - -interface DisplayAPI { - getSize(displayId?: number): DisplayGeometry - listAll(): DisplayGeometry[] -} - -const displayAPI: DisplayAPI = { - getSize(displayId?: number): DisplayGeometry { - const all = this.listAll() - if (displayId !== undefined) { - const found = all.find(d => d.displayId === displayId) - if (found) return found - } - return all[0] ?? 
{ width: 1920, height: 1080, scaleFactor: 2, displayId: 1 } - }, - - listAll(): DisplayGeometry[] { - try { - const raw = jxaSync(` - ObjC.import("CoreGraphics"); - var displays = $.CGDisplayCopyAllDisplayModes ? [] : []; - var active = $.CGGetActiveDisplayList(10, null, Ref()); - var countRef = Ref(); - $.CGGetActiveDisplayList(0, null, countRef); - var count = countRef[0]; - var idBuf = Ref(); - $.CGGetActiveDisplayList(count, idBuf, countRef); - var result = []; - for (var i = 0; i < count; i++) { - var did = idBuf[i]; - var w = $.CGDisplayPixelsWide(did); - var h = $.CGDisplayPixelsHigh(did); - var mode = $.CGDisplayCopyDisplayMode(did); - var pw = $.CGDisplayModeGetPixelWidth(mode); - var sf = pw > 0 && w > 0 ? pw / w : 2; - result.push({width: w, height: h, scaleFactor: sf, displayId: did}); - } - JSON.stringify(result); - `) - return (JSON.parse(raw) as DisplayGeometry[]).map(d => ({ - width: Number(d.width), height: Number(d.height), - scaleFactor: Number(d.scaleFactor), displayId: Number(d.displayId), - })) - } catch { - // Fallback: use NSScreen via JXA - try { - const raw = jxaSync(` - ObjC.import("AppKit"); - var screens = $.NSScreen.screens; - var result = []; - for (var i = 0; i < screens.count; i++) { - var s = screens.objectAtIndex(i); - var frame = s.frame; - var desc = s.deviceDescription; - var screenNumber = desc.objectForKey($("NSScreenNumber")).intValue; - var backingFactor = s.backingScaleFactor; - result.push({ - width: Math.round(frame.size.width), - height: Math.round(frame.size.height), - scaleFactor: backingFactor, - displayId: screenNumber - }); - } - JSON.stringify(result); - `) - return (JSON.parse(raw) as DisplayGeometry[]).map(d => ({ - width: Number(d.width), - height: Number(d.height), - scaleFactor: Number(d.scaleFactor), - displayId: Number(d.displayId), - })) - } catch { - return [{ width: 1920, height: 1080, scaleFactor: 2, displayId: 1 }] - } - } - }, -} - -// 
--------------------------------------------------------------------------- -// AppsAPI -// --------------------------------------------------------------------------- - -interface AppsAPI { - prepareDisplay(allowlistBundleIds: string[], surrogateHost: string, displayId?: number): Promise - previewHideSet(bundleIds: string[], displayId?: number): Promise - findWindowDisplays(bundleIds: string[]): Promise - appUnderPoint(x: number, y: number): Promise - listInstalled(): Promise - iconDataUrl(path: string): string | null - listRunning(): RunningApp[] - open(bundleId: string): Promise - unhide(bundleIds: string[]): Promise -} - -const appsAPI: AppsAPI = { - async prepareDisplay( - _allowlistBundleIds: string[], - _surrogateHost: string, - _displayId?: number, - ): Promise { - return { activated: '', hidden: [] } - }, - - async previewHideSet( - _bundleIds: string[], - _displayId?: number, - ): Promise { - return [] - }, - - async findWindowDisplays(bundleIds: string[]): Promise { - // Each running app is assumed to be on display 1 - return bundleIds.map(bundleId => ({ bundleId, displayIds: [1] })) - }, - - async appUnderPoint(_x: number, _y: number): Promise { - // Use JXA to find app at mouse position via accessibility - try { - const result = await jxa(` - ObjC.import("CoreGraphics"); - ObjC.import("AppKit"); - var pt = $.CGPointMake(${_x}, ${_y}); - // Get frontmost app as a fallback - var app = $.NSWorkspace.sharedWorkspace.frontmostApplication; - JSON.stringify({bundleId: app.bundleIdentifier.js, displayName: app.localizedName.js}); - `) - return JSON.parse(result) - } catch { - return null - } - }, - - async listInstalled(): Promise { - try { - const result = await osascript(` - tell application "System Events" - set appList to "" - repeat with appFile in (every file of folder "Applications" of startup disk whose name ends with ".app") - set appPath to POSIX path of (appFile as alias) - set appName to name of appFile - set appList to appList & appPath & "|" & 
appName & "\\n" - end repeat - return appList - end tell - `) - return result.split('\n').filter(Boolean).map(line => { - const [path, name] = line.split('|', 2) - // Derive bundleId from Info.plist would be ideal, but use path-based fallback - const displayName = (name ?? '').replace(/\.app$/, '') - return { - bundleId: `com.app.${displayName.toLowerCase().replace(/\s+/g, '-')}`, - displayName, - path: path ?? '', - } - }) - } catch { - return [] +function loadBackend(): SwiftBackend | null { + try { + switch (process.platform) { + case 'darwin': + return require('./backends/darwin.js') as SwiftBackend + case 'win32': + return require('./backends/win32.js') as SwiftBackend + case 'linux': + return require('./backends/linux.js') as SwiftBackend + default: + return null } - }, - - iconDataUrl(_path: string): string | null { + } catch { return null - }, - - listRunning(): RunningApp[] { - try { - const raw = jxaSync(` - var apps = Application("System Events").applicationProcesses.whose({backgroundOnly: false}); - var result = []; - for (var i = 0; i < apps.length; i++) { - try { - var a = apps[i]; - result.push({bundleId: a.bundleIdentifier(), displayName: a.name()}); - } catch(e) {} - } - JSON.stringify(result); - `) - return JSON.parse(raw) - } catch { - return [] - } - }, - - async open(bundleId: string): Promise { - await osascript(`tell application id "${bundleId}" to activate`) - }, - - async unhide(bundleIds: string[]): Promise { - for (const bundleId of bundleIds) { - await osascript(` - tell application "System Events" - set visible of application process (name of application process whose bundle identifier is "${bundleId}") to true - end tell - `) - } - }, -} - -// --------------------------------------------------------------------------- -// ScreenshotAPI -// --------------------------------------------------------------------------- - -interface ScreenshotAPI { - captureExcluding( - allowedBundleIds: string[], quality: number, - targetW: number, targetH: 
number, displayId?: number, - ): Promise - captureRegion( - allowedBundleIds: string[], - x: number, y: number, w: number, h: number, - outW: number, outH: number, quality: number, displayId?: number, - ): Promise -} - -async function captureScreenToBase64(args: string[]): Promise<{ base64: string; width: number; height: number }> { - const tmpFile = join(tmpdir(), `cu-screenshot-${Date.now()}.png`) - const proc = Bun.spawn(['screencapture', ...args, tmpFile], { - stdout: 'pipe', stderr: 'pipe', - }) - await proc.exited - - try { - const buf = readFileSync(tmpFile) - const base64 = buf.toString('base64') - // Parse PNG header for dimensions (bytes 16-23) - const width = buf.readUInt32BE(16) - const height = buf.readUInt32BE(20) - return { base64, width, height } - } finally { - try { unlinkSync(tmpFile) } catch {} } } -const screenshotAPI: ScreenshotAPI = { - async captureExcluding( - _allowedBundleIds: string[], - _quality: number, - _targetW: number, - _targetH: number, - displayId?: number, - ): Promise { - const args = ['-x'] // silent - if (displayId !== undefined) { - args.push('-D', String(displayId)) - } - return captureScreenToBase64(args) - }, - - async captureRegion( - _allowedBundleIds: string[], - x: number, y: number, w: number, h: number, - _outW: number, _outH: number, _quality: number, - displayId?: number, - ): Promise { - const args = ['-x', '-R', `${x},${y},${w},${h}`] - if (displayId !== undefined) { - args.push('-D', String(displayId)) - } - return captureScreenToBase64(args) - }, -} +const backend = loadBackend() // --------------------------------------------------------------------------- -// ComputerUseAPI — Main export +// ComputerUseAPI — Main export (preserves original class interface) // --------------------------------------------------------------------------- export class ComputerUseAPI { - apps: AppsAPI = appsAPI - display: DisplayAPI = displayAPI - screenshot: ScreenshotAPI = screenshotAPI + // When no backend is loaded 
(unsupported platform), all APIs are no-op stubs. + // These stubs should never be reached in practice — callers check isSupported + // or the feature gate before invoking. + + apps = backend?.apps ?? { + async prepareDisplay() { return { activated: '', hidden: [] } }, + async previewHideSet() { return [] }, + async findWindowDisplays(ids: string[]) { return ids.map(b => ({ bundleId: b, displayIds: [] as number[] })) }, + async appUnderPoint() { return null }, + async listInstalled() { return [] }, + iconDataUrl() { return null }, + listRunning() { return [] }, + async open() { throw new Error('computer-use-swift: no backend for this platform') }, + async unhide() {}, + } + + display = backend?.display ?? { + getSize() { throw new Error('computer-use-swift: no backend for this platform') }, + listAll() { throw new Error('computer-use-swift: no backend for this platform') }, + } + + screenshot = backend?.screenshot ?? { + async captureExcluding() { throw new Error('computer-use-swift: no backend for this platform') }, + async captureRegion() { throw new Error('computer-use-swift: no backend for this platform') }, + } async resolvePrepareCapture( allowedBundleIds: string[], diff --git a/packages/@ant/computer-use-swift/src/types.ts b/packages/@ant/computer-use-swift/src/types.ts new file mode 100644 index 000000000..5dc199ecd --- /dev/null +++ b/packages/@ant/computer-use-swift/src/types.ts @@ -0,0 +1,80 @@ +export interface DisplayGeometry { + width: number + height: number + scaleFactor: number + displayId: number +} + +export interface PrepareDisplayResult { + activated: string + hidden: string[] +} + +export interface AppInfo { + bundleId: string + displayName: string +} + +export interface InstalledApp { + bundleId: string + displayName: string + path: string + iconDataUrl?: string +} + +export interface RunningApp { + bundleId: string + displayName: string +} + +export interface ScreenshotResult { + base64: string + width: number + height: number +} + +export 
interface ResolvePrepareCaptureResult { + base64: string + width: number + height: number +} + +export interface WindowDisplayInfo { + bundleId: string + displayIds: number[] +} + +export interface DisplayAPI { + getSize(displayId?: number): DisplayGeometry + listAll(): DisplayGeometry[] +} + +export interface AppsAPI { + prepareDisplay(allowlistBundleIds: string[], surrogateHost: string, displayId?: number): Promise + previewHideSet(bundleIds: string[], displayId?: number): Promise + findWindowDisplays(bundleIds: string[]): Promise + appUnderPoint(x: number, y: number): Promise + listInstalled(): Promise + iconDataUrl(path: string): string | null + listRunning(): RunningApp[] + open(bundleId: string): Promise + unhide(bundleIds: string[]): Promise +} + +export interface ScreenshotAPI { + captureExcluding( + allowedBundleIds: string[], quality: number, + targetW: number, targetH: number, displayId?: number, + ): Promise + captureRegion( + allowedBundleIds: string[], + x: number, y: number, w: number, h: number, + outW: number, outH: number, quality: number, displayId?: number, + ): Promise +} + +export interface SwiftBackend { + display: DisplayAPI + apps: AppsAPI + screenshot: ScreenshotAPI +} diff --git a/packages/audio-capture-napi/src/index.ts b/packages/audio-capture-napi/src/index.ts index a06c36b79..d511f924d 100644 --- a/packages/audio-capture-napi/src/index.ts +++ b/packages/audio-capture-napi/src/index.ts @@ -1,151 +1,152 @@ -// audio-capture-napi: cross-platform audio capture using SoX (rec) on macOS -// and arecord (ALSA) on Linux. Replaces the original cpal-based native module. 
-import { type ChildProcess, spawn, spawnSync } from 'child_process' - -// ─── State ─────────────────────────────────────────────────────────── - -let recordingProcess: ChildProcess | null = null -let availabilityCache: boolean | null = null - -// ─── Helpers ───────────────────────────────────────────────────────── - -function commandExists(cmd: string): boolean { - const result = spawnSync(cmd, ['--version'], { - stdio: 'ignore', - timeout: 3000, - }) - return result.error === undefined +type AudioCaptureNapi = { + startRecording( + onData: (data: Buffer) => void, + onEnd: () => void, + ): boolean + stopRecording(): void + isRecording(): boolean + startPlayback(sampleRate: number, channels: number): boolean + writePlaybackData(data: Buffer): void + stopPlayback(): void + isPlaying(): boolean + // TCC microphone authorization status (macOS only): + // 0 = notDetermined, 1 = restricted, 2 = denied, 3 = authorized. + // Linux: always returns 3 (authorized) — no system-level microphone permission API. + // Windows: returns 3 (authorized) if registry key absent or allowed, + // 2 (denied) if microphone access is explicitly denied. + microphoneAuthorizationStatus?(): number } -// ─── Public API ────────────────────────────────────────────────────── +let cachedModule: AudioCaptureNapi | null = null +let loadAttempted = false -/** - * Check whether a supported audio recording command is available. - * Returns true if `rec` (SoX) is found on macOS, or `arecord` (ALSA) on Linux. - * Windows is not supported and always returns false. 
- */ -export function isNativeAudioAvailable(): boolean { - if (availabilityCache !== null) { - return availabilityCache +function loadModule(): AudioCaptureNapi | null { + if (loadAttempted) { + return cachedModule } + loadAttempted = true - if (process.platform === 'win32') { - availabilityCache = false - return false + // Supported platforms: macOS (darwin), Linux, Windows (win32) + const platform = process.platform + if (platform !== 'darwin' && platform !== 'linux' && platform !== 'win32') { + return null } - if (process.platform === 'darwin') { - // macOS: use SoX rec - availabilityCache = commandExists('rec') - return availabilityCache + // Candidate 1: native-embed path (bun compile). AUDIO_CAPTURE_NODE_PATH is + // defined at build time in build-with-plugins.ts for native builds only — the + // define resolves it to the static literal "../../audio-capture.node" so bun + // compile can rewrite it to /$bunfs/root/audio-capture.node. MUST stay a + // direct require(env var) — bun cannot analyze require(variable) from a loop. + if (process.env.AUDIO_CAPTURE_NODE_PATH) { + try { + // eslint-disable-next-line @typescript-eslint/no-require-imports + cachedModule = require( + process.env.AUDIO_CAPTURE_NODE_PATH, + ) as AudioCaptureNapi + return cachedModule + } catch { + // fall through to runtime fallbacks below + } } - if (process.platform === 'linux') { - // Linux: prefer arecord, fall back to rec - availabilityCache = commandExists('arecord') || commandExists('rec') - return availabilityCache + // Candidates 2-4: npm-install, dev/source, and workspace layouts. + // In bundled output, require() resolves relative to cli.js at the package root. + // In dev, it resolves relative to this file. When loaded from a workspace + // package (packages/audio-capture-napi/src/), we need an absolute path fallback. 
+ const platformDir = `${process.arch}-${platform}` + const fallbacks = [ + `./vendor/audio-capture/${platformDir}/audio-capture.node`, + `../audio-capture/${platformDir}/audio-capture.node`, + `${process.cwd()}/vendor/audio-capture/${platformDir}/audio-capture.node`, + ] + for (const p of fallbacks) { + try { + // eslint-disable-next-line @typescript-eslint/no-require-imports + cachedModule = require(p) as AudioCaptureNapi + return cachedModule + } catch { + // try next + } } - - availabilityCache = false - return false + return null } -/** - * Check whether a recording is currently in progress. - */ -export function isNativeRecordingActive(): boolean { - return recordingProcess !== null && !recordingProcess.killed -} - -/** - * Stop the active recording process, if any. - */ -export function stopNativeRecording(): void { - if (recordingProcess) { - const proc = recordingProcess - recordingProcess = null - if (!proc.killed) { - proc.kill('SIGTERM') - } - } +export function isNativeAudioAvailable(): boolean { + return loadModule() !== null } -/** - * Start recording audio. Raw PCM data (16kHz, 16-bit signed, mono) is - * streamed via the onData callback. onEnd is called when recording stops - * (either from silence detection or process termination). - * - * Returns true if recording started successfully, false otherwise. 
- */ export function startNativeRecording( onData: (data: Buffer) => void, onEnd: () => void, ): boolean { - // Don't start if already recording - if (isNativeRecordingActive()) { - stopNativeRecording() - } - - if (!isNativeAudioAvailable()) { + const mod = loadModule() + if (!mod) { return false } + return mod.startRecording(onData, onEnd) +} - let child: ChildProcess +export function stopNativeRecording(): void { + const mod = loadModule() + if (!mod) { + return + } + mod.stopRecording() +} - if (process.platform === 'darwin' || (process.platform === 'linux' && commandExists('rec'))) { - // Use SoX rec: output raw PCM 16kHz 16-bit signed mono to stdout - child = spawn( - 'rec', - [ - '-q', // quiet - '--buffer', - '1024', // small buffer for low latency - '-t', 'raw', // raw PCM output - '-r', '16000', // 16kHz sample rate - '-e', 'signed', // signed integer encoding - '-b', '16', // 16-bit - '-c', '1', // mono - '-', // output to stdout - ], - { stdio: ['pipe', 'pipe', 'pipe'] }, - ) - } else if (process.platform === 'linux' && commandExists('arecord')) { - // Use arecord: output raw PCM 16kHz 16-bit signed LE mono to stdout - child = spawn( - 'arecord', - [ - '-f', 'S16_LE', // signed 16-bit little-endian - '-r', '16000', // 16kHz sample rate - '-c', '1', // mono - '-t', 'raw', // raw PCM, no header - '-q', // quiet - '-', // output to stdout - ], - { stdio: ['pipe', 'pipe', 'pipe'] }, - ) - } else { +export function isNativeRecordingActive(): boolean { + const mod = loadModule() + if (!mod) { return false } + return mod.isRecording() +} - recordingProcess = child - - child.stdout?.on('data', (chunk: Buffer) => { - onData(chunk) - }) +export function startNativePlayback( + sampleRate: number, + channels: number, +): boolean { + const mod = loadModule() + if (!mod) { + return false + } + return mod.startPlayback(sampleRate, channels) +} - // Consume stderr to prevent backpressure - child.stderr?.on('data', () => {}) +export function 
writeNativePlaybackData(data: Buffer): void { + const mod = loadModule() + if (!mod) { + return + } + mod.writePlaybackData(data) +} - child.on('close', () => { - recordingProcess = null - onEnd() - }) +export function stopNativePlayback(): void { + const mod = loadModule() + if (!mod) { + return + } + mod.stopPlayback() +} - child.on('error', () => { - recordingProcess = null - onEnd() - }) +export function isNativePlaying(): boolean { + const mod = loadModule() + if (!mod) { + return false + } + return mod.isPlaying() +} - return true +// Returns the microphone authorization status. +// On macOS, returns the TCC status: 0=notDetermined, 1=restricted, 2=denied, 3=authorized. +// On Linux, always returns 3 (authorized) — no system-level mic permission API. +// On Windows, returns 3 (authorized) if registry key absent or allowed, 2 (denied) if explicitly denied. +// Returns 0 (notDetermined) if the native module is unavailable. +export function microphoneAuthorizationStatus(): number { + const mod = loadModule() + if (!mod || !mod.microphoneAuthorizationStatus) { + return 0 + } + return mod.microphoneAuthorizationStatus() } diff --git a/scripts/dev.ts b/scripts/dev.ts index 437508988..68b4734ba 100644 --- a/scripts/dev.ts +++ b/scripts/dev.ts @@ -15,7 +15,7 @@ const defineArgs = Object.entries(defines).flatMap(([k, v]) => [ // Bun --feature flags: enable feature() gates at runtime. // Default features enabled in dev mode. -const DEFAULT_FEATURES = ["BUDDY", "TRANSCRIPT_CLASSIFIER", "BRIDGE_MODE", "AGENT_TRIGGERS_REMOTE"]; +const DEFAULT_FEATURES = ["BUDDY", "TRANSCRIPT_CLASSIFIER", "BRIDGE_MODE", "AGENT_TRIGGERS_REMOTE", "CHICAGO_MCP", "VOICE_MODE"]; // Any env var matching FEATURE_=1 will also enable that feature. // e.g. 
FEATURE_PROACTIVE=1 bun run dev diff --git a/src/commands.ts b/src/commands.ts index 10f03b222..8cae75452 100644 --- a/src/commands.ts +++ b/src/commands.ts @@ -415,7 +415,7 @@ const getWorkflowCommands = feature('WORKFLOW_SCRIPTS') * so this must be re-evaluated on every getCommands() call. */ export function meetsAvailabilityRequirement(cmd: Command): boolean { - if (!cmd.availability) return true + if (!cmd.availability || cmd.availability.length === 0) return true for (const a of cmd.availability) { switch (a) { case 'claude-ai': diff --git a/src/commands/chrome/chrome.tsx b/src/commands/chrome/chrome.tsx index 775005568..b659c2e80 100644 --- a/src/commands/chrome/chrome.tsx +++ b/src/commands/chrome/chrome.tsx @@ -186,7 +186,7 @@ function ClaudeInChromeMenu(t0) { } else { options = $[8]; } - const isDisabled = isWSL || true && !isClaudeAISubscriber; + const isDisabled = isWSL; let t5; if ($[18] !== onDone) { t5 = () => onDone(); @@ -212,7 +212,7 @@ function ClaudeInChromeMenu(t0) { } let t8; if ($[23] !== isClaudeAISubscriber) { - t8 = true && !isClaudeAISubscriber && Claude in Chrome requires a claude.ai subscription.; + t8 = false; $[23] = isClaudeAISubscriber; $[24] = t8; } else { diff --git a/src/commands/chrome/index.ts b/src/commands/chrome/index.ts index 0e9af5ff3..6e5b4f82f 100644 --- a/src/commands/chrome/index.ts +++ b/src/commands/chrome/index.ts @@ -4,7 +4,7 @@ import type { Command } from '../../commands.js' const command: Command = { name: 'chrome', description: 'Claude in Chrome (Beta) settings', - availability: ['claude-ai'], + availability: [], isEnabled: () => !getIsNonInteractiveSession(), type: 'local-jsx', load: () => import('./chrome.js'), diff --git a/src/components/BuiltinStatusLine.tsx b/src/components/BuiltinStatusLine.tsx new file mode 100644 index 000000000..14b7ec594 --- /dev/null +++ b/src/components/BuiltinStatusLine.tsx @@ -0,0 +1,152 @@ +import React, { useEffect, useState } from 'react'; +import { formatCost } from 
'../cost-tracker.js'; +import { Box, Text } from '../ink.js'; +import { formatTokens } from '../utils/format.js'; +import { ProgressBar } from './design-system/ProgressBar.js'; +import { useTerminalSize } from '../hooks/useTerminalSize.js'; + +type RateLimitBucket = { + utilization: number; + resets_at: number; +}; + +type BuiltinStatusLineProps = { + modelName: string; + contextUsedPct: number; + usedTokens: number; + contextWindowSize: number; + totalCostUsd: number; + rateLimits: { + five_hour?: RateLimitBucket; + seven_day?: RateLimitBucket; + }; +}; + +/** + * Format a countdown from now until the given epoch time (in seconds). + * Returns a compact human-readable string like "3h12m", "5d20h", "45m", or "now". + */ +export function formatCountdown(epochSeconds: number): string { + const diff = epochSeconds - Date.now() / 1000; + if (diff <= 0) return 'now'; + + const days = Math.floor(diff / 86400); + const hours = Math.floor((diff % 86400) / 3600); + const minutes = Math.floor((diff % 3600) / 60); + + if (days >= 1) return `${days}d${hours}h`; + if (hours >= 1) return `${hours}h${minutes}m`; + return `${minutes}m`; +} + +function Separator() { + return {' \u2502 '}; +} + +function BuiltinStatusLineInner({ + modelName, + contextUsedPct, + usedTokens, + contextWindowSize, + totalCostUsd, + rateLimits, +}: BuiltinStatusLineProps) { + const { columns } = useTerminalSize(); + + // Force re-render every 60s so countdowns stay current + const [tick, setTick] = useState(0); + useEffect(() => { + const hasResetTime = rateLimits.five_hour?.resets_at || rateLimits.seven_day?.resets_at; + if (!hasResetTime) return; + const id = setInterval(() => setTick(t => t + 1), 60_000); + return () => clearInterval(id); + }, [rateLimits.five_hour?.resets_at, rateLimits.seven_day?.resets_at]); + + // Suppress unused-variable lint for tick (it exists only to trigger re-renders) + void tick; + + // Model display: use first two words (e.g. 
"Opus 4.6") instead of just first word + const modelParts = modelName.split(' '); + const shortModel = modelParts.length >= 2 ? `${modelParts[0]} ${modelParts[1]}` : modelName; + + const wide = columns >= 100; + const narrow = columns < 60; + + const hasFiveHour = rateLimits.five_hour != null; + const hasSevenDay = rateLimits.seven_day != null; + + const fiveHourPct = hasFiveHour ? Math.round(rateLimits.five_hour!.utilization * 100) : 0; + const sevenDayPct = hasSevenDay ? Math.round(rateLimits.seven_day!.utilization * 100) : 0; + + // Token display: "50k/1M" + const tokenDisplay = `${formatTokens(usedTokens)}/${formatTokens(contextWindowSize)}`; + + return ( + + {/* Model name */} + {shortModel} + + {/* Context usage with token counts */} + + Context + {contextUsedPct}% + {!narrow && ({tokenDisplay})} + + {/* 5-hour session rate limit */} + {hasFiveHour && ( + <> + + Session + {wide && ( + <> + + + + )} + {fiveHourPct}% + {!narrow && rateLimits.five_hour!.resets_at > 0 && ( + {formatCountdown(rateLimits.five_hour!.resets_at)} + )} + + )} + + {/* 7-day weekly rate limit */} + {hasSevenDay && ( + <> + + Weekly + {wide && ( + <> + + + + )} + {sevenDayPct}% + {!narrow && rateLimits.seven_day!.resets_at > 0 && ( + {formatCountdown(rateLimits.seven_day!.resets_at)} + )} + + )} + + {/* Cost */} + {totalCostUsd > 0 && ( + <> + + {formatCost(totalCostUsd)} + + )} + + ); +} + +export const BuiltinStatusLine = React.memo(BuiltinStatusLineInner); diff --git a/src/components/StatusLine.tsx b/src/components/StatusLine.tsx index 509ccda89..95f3cebf5 100644 --- a/src/components/StatusLine.tsx +++ b/src/components/StatusLine.tsx @@ -1,323 +1,61 @@ import { feature } from 'bun:bundle'; import * as React from 'react'; -import { memo, useCallback, useEffect, useRef } from 'react'; -import { logEvent } from 'src/services/analytics/index.js'; -import { useAppState, useSetAppState } from 'src/state/AppState.js'; -import type { PermissionMode } from 
'src/utils/permissions/PermissionMode.js'; -import { getIsRemoteMode, getKairosActive, getMainThreadAgentType, getOriginalCwd, getSdkBetas, getSessionId } from '../bootstrap/state.js'; -import { DEFAULT_OUTPUT_STYLE_NAME } from '../constants/outputStyles.js'; -import { useNotifications } from '../context/notifications.js'; -import { getTotalAPIDuration, getTotalCost, getTotalDuration, getTotalInputTokens, getTotalLinesAdded, getTotalLinesRemoved, getTotalOutputTokens } from '../cost-tracker.js'; +import { memo } from 'react'; +import { useAppState } from 'src/state/AppState.js'; +import { getSdkBetas, getKairosActive } from '../bootstrap/state.js'; +import { getTotalCost, getTotalInputTokens, getTotalOutputTokens } from '../cost-tracker.js'; import { useMainLoopModel } from '../hooks/useMainLoopModel.js'; -import { type ReadonlySettings, useSettings } from '../hooks/useSettings.js'; -import { Ansi, Box, Text } from '../ink.js'; +import { type ReadonlySettings } from '../hooks/useSettings.js'; import { getRawUtilization } from '../services/claudeAiLimits.js'; import type { Message } from '../types/message.js'; -import type { StatusLineCommandInput } from '../types/statusLine.js'; -import type { VimMode } from '../types/textInputTypes.js'; -import { checkHasTrustDialogAccepted } from '../utils/config.js'; import { calculateContextPercentages, getContextWindowForModel } from '../utils/context.js'; -import { getCwd } from '../utils/cwd.js'; -import { logForDebugging } from '../utils/debug.js'; -import { isFullscreenEnvEnabled } from '../utils/fullscreen.js'; -import { createBaseHookInput, executeStatusLineCommand } from '../utils/hooks.js'; import { getLastAssistantMessage } from '../utils/messages.js'; -import { getRuntimeMainLoopModel, type ModelName, renderModelName } from '../utils/model/model.js'; -import { getCurrentSessionTitle } from '../utils/sessionStorage.js'; +import { getRuntimeMainLoopModel, renderModelName } from '../utils/model/model.js'; import { 
doesMostRecentAssistantMessageExceed200k, getCurrentUsage } from '../utils/tokens.js'; -import { getCurrentWorktreeSession } from '../utils/worktree.js'; -import { isVimModeEnabled } from './PromptInput/utils.js'; +import { BuiltinStatusLine } from './BuiltinStatusLine.js'; + export function statusLineShouldDisplay(settings: ReadonlySettings): boolean { - // Assistant mode: statusline fields (model, permission mode, cwd) reflect the - // REPL/daemon process, not what the agent child is actually running. Hide it. if (feature('KAIROS') && getKairosActive()) return false; - return settings?.statusLine !== undefined; -} -function buildStatusLineCommandInput(permissionMode: PermissionMode, exceeds200kTokens: boolean, settings: ReadonlySettings, messages: Message[], addedDirs: string[], mainLoopModel: ModelName, vimMode?: VimMode): StatusLineCommandInput { - const agentType = getMainThreadAgentType(); - const worktreeSession = getCurrentWorktreeSession(); - const runtimeModel = getRuntimeMainLoopModel({ - permissionMode, - mainLoopModel, - exceeds200kTokens - }); - const outputStyleName = settings?.outputStyle || DEFAULT_OUTPUT_STYLE_NAME; - const currentUsage = getCurrentUsage(messages); - const contextWindowSize = getContextWindowForModel(runtimeModel, getSdkBetas()); - const contextPercentages = calculateContextPercentages(currentUsage, contextWindowSize); - const sessionId = getSessionId(); - const sessionName = getCurrentSessionTitle(sessionId); - const rawUtil = getRawUtilization(); - const rateLimits: StatusLineCommandInput['rate_limits'] = { - ...(rawUtil.five_hour && { - five_hour: { - used_percentage: rawUtil.five_hour.utilization * 100, - resets_at: rawUtil.five_hour.resets_at - } - }), - ...(rawUtil.seven_day && { - seven_day: { - used_percentage: rawUtil.seven_day.utilization * 100, - resets_at: rawUtil.seven_day.resets_at - } - }) - }; - return { - ...createBaseHookInput(), - ...(sessionName && { - session_name: sessionName - }), - model: { - id: 
runtimeModel, - display_name: renderModelName(runtimeModel) - }, - workspace: { - current_dir: getCwd(), - project_dir: getOriginalCwd(), - added_dirs: addedDirs - }, - version: MACRO.VERSION, - output_style: { - name: outputStyleName - }, - cost: { - total_cost_usd: getTotalCost(), - total_duration_ms: getTotalDuration(), - total_api_duration_ms: getTotalAPIDuration(), - total_lines_added: getTotalLinesAdded(), - total_lines_removed: getTotalLinesRemoved() - }, - context_window: { - total_input_tokens: getTotalInputTokens(), - total_output_tokens: getTotalOutputTokens(), - context_window_size: contextWindowSize, - current_usage: currentUsage, - used_percentage: contextPercentages.used, - remaining_percentage: contextPercentages.remaining - }, - exceeds_200k_tokens: exceeds200kTokens, - ...((rateLimits.five_hour || rateLimits.seven_day) && { - rate_limits: rateLimits - }), - ...(isVimModeEnabled() && { - vim: { - mode: vimMode ?? 'INSERT' - } - }), - ...(agentType && { - agent: { - name: agentType - } - }), - ...(getIsRemoteMode() && { - remote: { - session_id: getSessionId() - } - }), - ...(worktreeSession && { - worktree: { - name: worktreeSession.worktreeName, - path: worktreeSession.worktreePath, - branch: worktreeSession.worktreeBranch, - original_cwd: worktreeSession.originalCwd, - original_branch: worktreeSession.originalBranch - } - }) - }; + return true; } + type Props = { - // messages stays behind a ref (read only in the debounced callback); - // lastAssistantMessageId is the actual re-render trigger. messagesRef: React.RefObject; lastAssistantMessageId: string | null; - vimMode?: VimMode; + vimMode?: unknown; }; + export function getLastAssistantMessageId(messages: Message[]): string | null { return getLastAssistantMessage(messages)?.uuid ?? 
null; } -function StatusLineInner({ - messagesRef, - lastAssistantMessageId, - vimMode -}: Props): React.ReactNode { - const abortControllerRef = useRef(undefined); - const permissionMode = useAppState(s => s.toolPermissionContext.mode); - const additionalWorkingDirectories = useAppState(s => s.toolPermissionContext.additionalWorkingDirectories); - const statusLineText = useAppState(s => s.statusLineText); - const setAppState = useSetAppState(); - const settings = useSettings(); - const { - addNotification - } = useNotifications(); - // AppState-sourced model — same source as API requests. getMainLoopModel() - // re-reads settings.json on every call, so another session's /model write - // would leak into this session's statusline (anthropics/claude-code#37596). - const mainLoopModel = useMainLoopModel(); - - // Keep latest values in refs for stable callback access - const settingsRef = useRef(settings); - settingsRef.current = settings; - const vimModeRef = useRef(vimMode); - vimModeRef.current = vimMode; - const permissionModeRef = useRef(permissionMode); - permissionModeRef.current = permissionMode; - const addedDirsRef = useRef(additionalWorkingDirectories); - addedDirsRef.current = additionalWorkingDirectories; - const mainLoopModelRef = useRef(mainLoopModel); - mainLoopModelRef.current = mainLoopModel; - - // Track previous state to detect changes and cache expensive calculations - const previousStateRef = useRef<{ - messageId: string | null; - exceeds200kTokens: boolean; - permissionMode: PermissionMode; - vimMode: VimMode | undefined; - mainLoopModel: ModelName; - }>({ - messageId: null, - exceeds200kTokens: false, - permissionMode, - vimMode, - mainLoopModel - }); - - // Debounce timer ref - const debounceTimerRef = useRef | undefined>(undefined); - // True when the next invocation should log its result (first run or after settings reload) - const logNextResultRef = useRef(true); - - // Stable update function — reads latest values from refs - const doUpdate 
= useCallback(async () => { - // Cancel any in-flight requests - abortControllerRef.current?.abort(); - const controller = new AbortController(); - abortControllerRef.current = controller; - const msgs = messagesRef.current; - const logResult = logNextResultRef.current; - logNextResultRef.current = false; - try { - let exceeds200kTokens = previousStateRef.current.exceeds200kTokens; - - // Only recalculate 200k check if messages changed - const currentMessageId = getLastAssistantMessageId(msgs); - if (currentMessageId !== previousStateRef.current.messageId) { - exceeds200kTokens = doesMostRecentAssistantMessageExceed200k(msgs); - previousStateRef.current.messageId = currentMessageId; - previousStateRef.current.exceeds200kTokens = exceeds200kTokens; - } - const statusInput = buildStatusLineCommandInput(permissionModeRef.current, exceeds200kTokens, settingsRef.current, msgs, Array.from(addedDirsRef.current.keys()), mainLoopModelRef.current, vimModeRef.current); - const text = await executeStatusLineCommand(statusInput, controller.signal, undefined, logResult); - if (!controller.signal.aborted) { - setAppState(prev => { - if (prev.statusLineText === text) return prev; - return { - ...prev, - statusLineText: text - }; - }); - } - } catch { - // Silently ignore errors in status line updates - } - }, [messagesRef, setAppState]); - - // Stable debounced schedule function — no deps, uses refs - const scheduleUpdate = useCallback(() => { - if (debounceTimerRef.current !== undefined) { - clearTimeout(debounceTimerRef.current); - } - debounceTimerRef.current = setTimeout((ref, doUpdate) => { - ref.current = undefined; - void doUpdate(); - }, 300, debounceTimerRef, doUpdate); - }, [doUpdate]); - - // Only trigger update when assistant message, permission mode, vim mode, or model actually changes - useEffect(() => { - if (lastAssistantMessageId !== previousStateRef.current.messageId || permissionMode !== previousStateRef.current.permissionMode || vimMode !== 
previousStateRef.current.vimMode || mainLoopModel !== previousStateRef.current.mainLoopModel) { - // Don't update messageId here — let doUpdate handle it so - // exceeds200kTokens is recalculated with the latest messages - previousStateRef.current.permissionMode = permissionMode; - previousStateRef.current.vimMode = vimMode; - previousStateRef.current.mainLoopModel = mainLoopModel; - scheduleUpdate(); - } - }, [lastAssistantMessageId, permissionMode, vimMode, mainLoopModel, scheduleUpdate]); - - // When the statusLine command changes (hot reload), log the next result - const statusLineCommand = settings?.statusLine?.command; - const isFirstSettingsRender = useRef(true); - useEffect(() => { - if (isFirstSettingsRender.current) { - isFirstSettingsRender.current = false; - return; - } - logNextResultRef.current = true; - void doUpdate(); - }, [statusLineCommand, doUpdate]); - - // Separate effect for logging on mount - useEffect(() => { - const statusLine = settings?.statusLine; - if (statusLine) { - logEvent('tengu_status_line_mount', { - command_length: statusLine.command.length, - padding: statusLine.padding - }); - // Log if status line is configured but disabled by disableAllHooks - if (settings.disableAllHooks === true) { - logForDebugging('Status line is configured but disableAllHooks is true', { - level: 'warn' - }); - } - // executeStatusLineCommand (hooks.ts) returns undefined when trust is - // blocked — statusLineText stays undefined forever, user sees nothing, - // and tengu_status_line_mount above fires anyway so telemetry looks fine. 
- if (!checkHasTrustDialogAccepted()) { - addNotification({ - key: 'statusline-trust-blocked', - text: 'statusline skipped · restart to fix', - color: 'warning', - priority: 'low' - }); - logForDebugging('Status line command skipped: workspace trust not accepted', { - level: 'warn' - }); - } - } - // eslint-disable-next-line react-hooks/exhaustive-deps - // biome-ignore lint/correctness/useExhaustiveDependencies: intentional - }, []); // Only run once on mount - settings stable for initial logging +function StatusLineInner({ messagesRef, lastAssistantMessageId }: Props): React.ReactNode { + const mainLoopModel = useMainLoopModel(); + const permissionMode = useAppState(s => s.toolPermissionContext.mode); - // Initial update on mount + cleanup on unmount - useEffect(() => { - void doUpdate(); - return () => { - abortControllerRef.current?.abort(); - if (debounceTimerRef.current !== undefined) { - clearTimeout(debounceTimerRef.current); - } - }; - // eslint-disable-next-line react-hooks/exhaustive-deps - // biome-ignore lint/correctness/useExhaustiveDependencies: intentional - }, []); // Only run once on mount, not when doUpdate changes + const messages = messagesRef.current ?? []; - // Get padding from settings or default to 0 - const paddingX = settings?.statusLine?.padding ?? 0; + const exceeds200kTokens = lastAssistantMessageId ? doesMostRecentAssistantMessageExceed200k(messages) : false; - // StatusLine must have stable height in fullscreen — the footer is - // flexShrink:0 so a 0→1 row change when the command finishes steals - // a row from ScrollBox and shifts content. Reserve the row while loading - // (same trick as PromptInputFooterLeftSide). - return - {statusLineText ? - {statusLineText} - : isFullscreenEnvEnabled() ? 
: null} - ; + const runtimeModel = getRuntimeMainLoopModel({ permissionMode, mainLoopModel, exceeds200kTokens }); + const modelDisplay = renderModelName(runtimeModel); + const currentUsage = getCurrentUsage(messages); + const contextWindowSize = getContextWindowForModel(runtimeModel, getSdkBetas()); + const contextPercentages = calculateContextPercentages(currentUsage, contextWindowSize); + const rawUtil = getRawUtilization(); + const totalCost = getTotalCost(); + const usedTokens = getTotalInputTokens() + getTotalOutputTokens(); + + return ( + + ); } -// Parent (PromptInputFooter) re-renders on every setMessages, but StatusLine's -// own props now only change when lastAssistantMessageId flips — memo keeps it -// from being dragged along (previously ~18 no-prop-change renders per session). export const StatusLine = memo(StatusLineInner); diff --git a/src/hooks/useChromeExtensionNotification.tsx b/src/hooks/useChromeExtensionNotification.tsx index beef384f3..a32aac968 100644 --- a/src/hooks/useChromeExtensionNotification.tsx +++ b/src/hooks/useChromeExtensionNotification.tsx @@ -21,14 +21,7 @@ async function _temp() { if (!shouldEnableClaudeInChrome(chromeFlag)) { return null; } - if (true && !isClaudeAISubscriber()) { - return { - key: "chrome-requires-subscription", - jsx: Claude in Chrome requires a claude.ai subscription, - priority: "immediate", - timeoutMs: 5000 - }; - } + // Subscription check bypassed const installed = await isChromeExtensionInstalled(); if (!installed && !isRunningOnHomespace()) { return { diff --git a/src/main.tsx b/src/main.tsx index ccb6097a0..56eb77b26 100644 --- a/src/main.tsx +++ b/src/main.tsx @@ -1525,7 +1525,7 @@ async function run(): Promise { }; // Store the explicit CLI flag so teammates can inherit it setChromeFlagOverride(chromeOpts.chrome); - const enableClaudeInChrome = shouldEnableClaudeInChrome(chromeOpts.chrome) && ((process.env.USER_TYPE) === 'ant' || isClaudeAISubscriber()); + const enableClaudeInChrome = 
shouldEnableClaudeInChrome(chromeOpts.chrome); const autoEnableClaudeInChrome = !enableClaudeInChrome && shouldAutoEnableClaudeInChrome(); if (enableClaudeInChrome) { const platform = getPlatform(); @@ -1602,7 +1602,7 @@ async function run(): Promise { // `type: 'stdio'`. An enterprise-config ant with the GB gate on would // otherwise process.exit(1). Chrome has the same latent issue but has // shipped without incident; chicago places itself correctly. - if (feature('CHICAGO_MCP') && getPlatform() === 'macos' && !getIsNonInteractiveSession()) { + if (feature('CHICAGO_MCP') && !getIsNonInteractiveSession()) { try { const { getChicagoEnabled diff --git a/src/utils/auth.ts b/src/utils/auth.ts index 64a618082..d278c3a4a 100644 --- a/src/utils/auth.ts +++ b/src/utils/auth.ts @@ -112,14 +112,13 @@ export function isAnthropicAuthEnabled(): boolean { return !!process.env.CLAUDE_CODE_OAUTH_TOKEN } + const settings = getSettings_DEPRECATED() || {} const is3P = isEnvTruthy(process.env.CLAUDE_CODE_USE_BEDROCK) || isEnvTruthy(process.env.CLAUDE_CODE_USE_VERTEX) || - isEnvTruthy(process.env.CLAUDE_CODE_USE_FOUNDRY) - - // Check if user has configured an external API key source - // This allows externally-provided API keys to work (without requiring proxy configuration) - const settings = getSettings_DEPRECATED() || {} + isEnvTruthy(process.env.CLAUDE_CODE_USE_FOUNDRY) || + (settings as any).modelType === 'openai' || + !!process.env.OPENAI_BASE_URL const apiKeyHelper = settings.apiKeyHelper const hasExternalAuthToken = process.env.ANTHROPIC_AUTH_TOKEN || diff --git a/src/utils/computerUse/common.ts b/src/utils/computerUse/common.ts index 4b4410744..37e6701e1 100644 --- a/src/utils/computerUse/common.ts +++ b/src/utils/computerUse/common.ts @@ -52,8 +52,8 @@ export function getTerminalBundleId(): string | null { * takes this shape (no `hostBundleId`, no `teachMode`). 
*/ export const CLI_CU_CAPABILITIES = { - screenshotFiltering: 'native' as const, - platform: 'darwin' as const, + screenshotFiltering: (process.platform === 'darwin' ? 'native' : 'none') as any, + platform: (process.platform === 'win32' ? 'windows' : process.platform === 'linux' ? 'linux' : 'darwin') as any, } export function isComputerUseMCPServer(name: string): boolean { diff --git a/src/utils/computerUse/drainRunLoop.ts b/src/utils/computerUse/drainRunLoop.ts index 1aec3f059..28dac7c57 100644 --- a/src/utils/computerUse/drainRunLoop.ts +++ b/src/utils/computerUse/drainRunLoop.ts @@ -59,6 +59,7 @@ export const releasePump = release * concurrent drainRunLoop() calls share one setInterval. */ export async function drainRunLoop(fn: () => Promise): Promise { + if (process.platform !== 'darwin') return fn() retain() let timer: ReturnType | undefined try { diff --git a/src/utils/computerUse/escHotkey.ts b/src/utils/computerUse/escHotkey.ts index b91f80f34..24ba17cc4 100644 --- a/src/utils/computerUse/escHotkey.ts +++ b/src/utils/computerUse/escHotkey.ts @@ -23,6 +23,7 @@ import { requireComputerUseSwift } from './swiftLoader.js' let registered = false export function registerEscHotkey(onEscape: () => void): boolean { + if (process.platform !== 'darwin') return false if (registered) return true const cu = requireComputerUseSwift() if (!(cu as any).hotkey.registerEscape(onEscape)) { diff --git a/src/utils/computerUse/executor.ts b/src/utils/computerUse/executor.ts index 6e221941e..2b6c9ade1 100644 --- a/src/utils/computerUse/executor.ts +++ b/src/utils/computerUse/executor.ts @@ -68,6 +68,24 @@ function computeTargetDims( } async function readClipboardViaPbpaste(): Promise { + if (process.platform === 'win32') { + const { stdout, code } = await execFileNoThrow('powershell', ['-NoProfile', '-Command', 'Get-Clipboard'], { + useCwd: false, + }) + if (code !== 0) { + throw new Error(`PowerShell Get-Clipboard exited with code ${code}`) + } + return stdout + } + if 
(process.platform === 'linux') { + const { stdout, code } = await execFileNoThrow('xclip', ['-selection', 'clipboard', '-o'], { + useCwd: false, + }) + if (code !== 0) { + throw new Error(`xclip exited with code ${code}`) + } + return stdout + } const { stdout, code } = await execFileNoThrow('pbpaste', [], { useCwd: false, }) @@ -78,6 +96,25 @@ async function readClipboardViaPbpaste(): Promise { } async function writeClipboardViaPbcopy(text: string): Promise { + if (process.platform === 'win32') { + const { code } = await execFileNoThrow('powershell', ['-NoProfile', '-Command', `Set-Clipboard -Value '${text.replace(/'/g, "''")}'`], { + useCwd: false, + }) + if (code !== 0) { + throw new Error(`PowerShell Set-Clipboard exited with code ${code}`) + } + return + } + if (process.platform === 'linux') { + const { code } = await execFileNoThrow('xclip', ['-selection', 'clipboard'], { + input: text, + useCwd: false, + }) + if (code !== 0) { + throw new Error(`xclip exited with code ${code}`) + } + return + } const { code } = await execFileNoThrow('pbcopy', [], { input: text, useCwd: false, @@ -192,7 +229,7 @@ async function typeViaClipboard(input: Input, text: string): Promise { if ((await readClipboardViaPbpaste()) !== text) { throw new Error('Clipboard write did not round-trip.') } - await input.keys(['command', 'v']) + await input.keys([process.platform === 'darwin' ? 'command' : 'ctrl', 'v']) await sleep(100) } finally { if (typeof saved === 'string') { @@ -260,9 +297,9 @@ export function createCliExecutor(opts: { getMouseAnimationEnabled: () => boolean getHideBeforeActionEnabled: () => boolean }): ComputerExecutor { - if (process.platform !== 'darwin') { + if (process.platform !== 'darwin' && process.platform !== 'win32' && process.platform !== 'linux') { throw new Error( - `createCliExecutor called on ${process.platform}. Computer control is macOS-only.`, + `createCliExecutor called on ${process.platform}. 
Computer control requires macOS, Windows, or Linux.`, ) } @@ -377,7 +414,7 @@ export function createCliExecutor(opts: { d.height, d.scaleFactor, ) - return drainRunLoop(() => + const raw = await drainRunLoop(() => cu.resolvePrepareCapture( withoutTerminal(opts.allowedBundleIds), surrogateHost, @@ -389,6 +426,14 @@ export function createCliExecutor(opts: { opts.doHide, ), ) + // Ensure the result has fields expected by toolCalls.ts (hidden, displayId). + // macOS native returns these from Swift; our cross-platform ComputerUseAPI + // returns {base64, width, height} — fill in the missing fields. + return { + ...raw, + hidden: (raw as any).hidden ?? [], + displayId: (raw as any).displayId ?? opts.preferredDisplayId ?? d.displayId, + } }, /** diff --git a/src/utils/computerUse/gates.ts b/src/utils/computerUse/gates.ts index 6563a4807..b18af2a4a 100644 --- a/src/utils/computerUse/gates.ts +++ b/src/utils/computerUse/gates.ts @@ -10,7 +10,7 @@ type ChicagoConfig = CuSubGates & { } const DEFAULTS: ChicagoConfig = { - enabled: false, + enabled: true, pixelValidation: false, clipboardPasteMultiline: true, mouseAnimation: true, @@ -37,9 +37,7 @@ function readConfig(): ChicagoConfig { // regardless of subscription tier — not all ants are max/pro, and per // CLAUDE.md:281, USER_TYPE !== 'ant' branches get zero antfooding. 
function hasRequiredSubscription(): boolean { - if (process.env.USER_TYPE === 'ant') return true - const tier = getSubscriptionType() - return tier === 'max' || tier === 'pro' + return true } export function getChicagoEnabled(): boolean { diff --git a/src/utils/computerUse/hostAdapter.ts b/src/utils/computerUse/hostAdapter.ts index 85bbc882b..acefbaa3d 100644 --- a/src/utils/computerUse/hostAdapter.ts +++ b/src/utils/computerUse/hostAdapter.ts @@ -45,6 +45,7 @@ export function getComputerUseHostAdapter(): ComputerUseHostAdapter { getHideBeforeActionEnabled: () => getChicagoSubGates().hideBeforeAction, }), ensureOsPermissions: async () => { + if (process.platform !== 'darwin') return { granted: true } const cu = requireComputerUseSwift() const accessibility = (cu as any).tcc.checkAccessibility() const screenRecording = (cu as any).tcc.checkScreenRecording() diff --git a/src/utils/computerUse/swiftLoader.ts b/src/utils/computerUse/swiftLoader.ts index 1a8a9b257..0fcc23fea 100644 --- a/src/utils/computerUse/swiftLoader.ts +++ b/src/utils/computerUse/swiftLoader.ts @@ -13,11 +13,17 @@ let cached: ComputerUseAPI | undefined * these in drainRunLoop(). */ export function requireComputerUseSwift(): ComputerUseAPI { - if (process.platform !== 'darwin') { - throw new Error('@ant/computer-use-swift is macOS-only') - } + if (cached) return cached // eslint-disable-next-line @typescript-eslint/no-require-imports - return (cached ??= require('@ant/computer-use-swift') as ComputerUseAPI) + const mod = require('@ant/computer-use-swift') + // macOS native .node exports a plain object with apps/display/screenshot directly. + // Our cross-platform package exports { ComputerUseAPI } class — needs instantiation. 
+ if (mod.ComputerUseAPI && typeof mod.ComputerUseAPI === 'function') { + cached = new mod.ComputerUseAPI() as ComputerUseAPI + } else { + cached = mod as ComputerUseAPI + } + return cached } export type { ComputerUseAPI } diff --git a/src/utils/computerUse/win32/ocr.ts b/src/utils/computerUse/win32/ocr.ts new file mode 100644 index 000000000..69ca3a6e1 --- /dev/null +++ b/src/utils/computerUse/win32/ocr.ts @@ -0,0 +1,257 @@ +/** + * OCR module using Windows.Media.Ocr.OcrEngine via PowerShell. + * Captures a screen region or window, then runs WinRT OCR to extract text. + */ + +export interface OcrLine { + text: string + bounds: { x: number; y: number; w: number; h: number } +} + +export interface OcrResult { + text: string + lines: OcrLine[] + language: string +} + +function emptyResult(language: string): OcrResult { + return { text: '', lines: [], language } +} + +function runPs(script: string): string { + const result = Bun.spawnSync({ + cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +/** + * PowerShell script that: + * 1. Screenshots a screen region using CopyFromScreen + * 2. Saves to temp PNG + * 3. Loads via WinRT BitmapDecoder -> SoftwareBitmap + * 4. Runs OcrEngine.RecognizeAsync + * 5. 
Outputs JSON with text, lines, and bounding rects + */ +function buildOcrRegionScript( + x: number, + y: number, + w: number, + h: number, + lang: string, +): string { + return ` +Add-Type -AssemblyName System.Drawing +Add-Type -AssemblyName System.Runtime.WindowsRuntime + +# Load WinRT types +$null = [Windows.Media.Ocr.OcrEngine, Windows.Foundation, ContentType = WindowsRuntime] +$null = [Windows.Graphics.Imaging.SoftwareBitmap, Windows.Foundation, ContentType = WindowsRuntime] +$null = [Windows.Graphics.Imaging.BitmapDecoder, Windows.Foundation, ContentType = WindowsRuntime] +$null = [Windows.Storage.StorageFile, Windows.Foundation, ContentType = WindowsRuntime] +$null = [Windows.Storage.Streams.RandomAccessStream, Windows.Foundation, ContentType = WindowsRuntime] +$null = [Windows.Globalization.Language, Windows.Foundation, ContentType = WindowsRuntime] + +# Await helper for WinRT async operations +$asTaskGeneric = ([System.WindowsRuntimeSystemExtensions].GetMethods() | Where-Object { + $_.Name -eq 'AsTask' -and $_.GetParameters().Count -eq 1 -and + $_.GetParameters()[0].ParameterType.Name -eq 'IAsyncOperation\`1' +})[0] +Function Await($WinRtTask, $ResultType) { + $asTask = $asTaskGeneric.MakeGenericMethod($ResultType) + $netTask = $asTask.Invoke($null, @($WinRtTask)) + $netTask.Wait(-1) | Out-Null + $netTask.Result +} + +try { + # Step 1: Screenshot region + $bmp = New-Object System.Drawing.Bitmap(${w}, ${h}) + $g = [System.Drawing.Graphics]::FromImage($bmp) + $g.CopyFromScreen(${x}, ${y}, 0, 0, (New-Object System.Drawing.Size(${w}, ${h}))) + $g.Dispose() + + # Step 2: Save to temp file + $tmpFile = [System.IO.Path]::Combine([System.IO.Path]::GetTempPath(), "ocrtemp_$([guid]::NewGuid().ToString('N')).png") + $bmp.Save($tmpFile, [System.Drawing.Imaging.ImageFormat]::Png) + $bmp.Dispose() + + # Step 3: Open as StorageFile -> BitmapDecoder -> SoftwareBitmap + $storageFile = Await ([Windows.Storage.StorageFile]::GetFileFromPathAsync($tmpFile)) 
([Windows.Storage.StorageFile]) + $stream = Await ($storageFile.OpenAsync([Windows.Storage.FileAccessMode]::Read)) ([Windows.Storage.Streams.IRandomAccessStream]) + $decoder = Await ([Windows.Graphics.Imaging.BitmapDecoder]::CreateAsync($stream)) ([Windows.Graphics.Imaging.BitmapDecoder]) + $softwareBmp = Await ($decoder.GetSoftwareBitmapAsync()) ([Windows.Graphics.Imaging.SoftwareBitmap]) + + # Step 4: Create OCR engine + $ocrLang = New-Object Windows.Globalization.Language('${lang}') + $engine = [Windows.Media.Ocr.OcrEngine]::TryCreateFromLanguage($ocrLang) + if ($engine -eq $null) { + # Fallback to en-US + $ocrLang = New-Object Windows.Globalization.Language('en-US') + $engine = [Windows.Media.Ocr.OcrEngine]::TryCreateFromLanguage($ocrLang) + } + if ($engine -eq $null) { + Write-Output '{"text":"","lines":[],"language":"${lang}"}' + return + } + + # Step 5: Run OCR + $ocrResult = Await ($engine.RecognizeAsync($softwareBmp)) ([Windows.Media.Ocr.OcrResult]) + + # Step 6: Extract lines with bounding rects + $lines = @() + foreach ($line in $ocrResult.Lines) { + $minX = [double]::MaxValue; $minY = [double]::MaxValue + $maxX = 0.0; $maxY = 0.0 + foreach ($word in $line.Words) { + $r = $word.BoundingRect + if ($r.X -lt $minX) { $minX = $r.X } + if ($r.Y -lt $minY) { $minY = $r.Y } + if (($r.X + $r.Width) -gt $maxX) { $maxX = $r.X + $r.Width } + if (($r.Y + $r.Height) -gt $maxY) { $maxY = $r.Y + $r.Height } + } + $lines += @{ + text = $line.Text + bounds = @{ + x = [int]$minX + y = [int]$minY + w = [int]($maxX - $minX) + h = [int]($maxY - $minY) + } + } + } + + $output = @{ + text = $ocrResult.Text + lines = $lines + language = $ocrLang.LanguageTag + } + Write-Output (ConvertTo-Json $output -Depth 4 -Compress) + + # Cleanup + $stream.Dispose() + Remove-Item $tmpFile -ErrorAction SilentlyContinue +} catch { + Write-Output '{"text":"","lines":[],"language":"${lang}"}' +} +` +} + +/** + * PowerShell script to get a window's bounding rect by title. 
+ */ +function buildGetWindowRectScript(windowTitle: string): string { + const escaped = windowTitle.replace(/'/g, "''") + return ` +Add-Type @' +using System; +using System.Runtime.InteropServices; +public class WinRect { + [DllImport("user32.dll", CharSet=CharSet.Unicode)] + public static extern IntPtr FindWindow(string c, string t); + [DllImport("user32.dll")] + public static extern bool GetWindowRect(IntPtr h, out RECT r); + [StructLayout(LayoutKind.Sequential)] + public struct RECT { public int L, T, R, B; } + public static string Get(string title) { + IntPtr hwnd = FindWindow(null, title); + if (hwnd == IntPtr.Zero) return "NOT_FOUND"; + RECT r; GetWindowRect(hwnd, out r); + int w = r.R - r.L; int h = r.B - r.T; + if (w <= 0 || h <= 0) return "INVALID_SIZE"; + return r.L + "," + r.T + "," + w + "," + h; + } +} +'@ +[WinRect]::Get('${escaped}') +` +} + +function parseOcrOutput(raw: string, lang: string): OcrResult { + if (!raw) return emptyResult(lang) + try { + const parsed = JSON.parse(raw) + return { + text: parsed.text ?? '', + lines: Array.isArray(parsed.lines) + ? parsed.lines.map((l: any) => ({ + text: l.text ?? '', + bounds: { + x: l.bounds?.x ?? 0, + y: l.bounds?.y ?? 0, + w: l.bounds?.w ?? 0, + h: l.bounds?.h ?? 0, + }, + })) + : [], + language: parsed.language ?? lang, + } + } catch { + return emptyResult(lang) + } +} + +/** + * Perform OCR on a screen region. + * Screenshots the specified rectangle, then runs WinRT OcrEngine. + * + * @param x - Left coordinate + * @param y - Top coordinate + * @param w - Width in pixels + * @param h - Height in pixels + * @param lang - BCP-47 language tag (default 'en-US'). Confirmed: 'en-US', 'zh-Hans-CN' + */ +export async function ocrRegion( + x: number, + y: number, + w: number, + h: number, + lang?: string, +): Promise<OcrResult> { + const language = lang ?? 
'en-US' + if (w <= 0 || h <= 0) return emptyResult(language) + + try { + const script = buildOcrRegionScript(x, y, w, h, language) + const raw = runPs(script) + return parseOcrOutput(raw, language) + } catch { + return emptyResult(language) + } +} + +/** + * Perform OCR on a specific window by its title. + * Gets the window rect, then delegates to ocrRegion. + * + * @param windowTitle - Exact window title to find via FindWindow + * @param lang - BCP-47 language tag (default 'en-US') + */ +export async function ocrWindow( + windowTitle: string, + lang?: string, +): Promise<OcrResult> { + const language = lang ?? 'en-US' + + try { + const rectScript = buildGetWindowRectScript(windowTitle) + const raw = runPs(rectScript) + const trimmed = raw.trim() + + if (!trimmed || trimmed === 'NOT_FOUND' || trimmed === 'INVALID_SIZE') { + return emptyResult(language) + } + + const parts = trimmed.split(',') + if (parts.length !== 4) return emptyResult(language) + + const [x, y, w, h] = parts.map(Number) + if (!w || !h) return emptyResult(language) + + return ocrRegion(x, y, w, h, lang) + } catch { + return emptyResult(language) + } +} diff --git a/src/utils/computerUse/win32/uiAutomation.ts b/src/utils/computerUse/win32/uiAutomation.ts new file mode 100644 index 000000000..292d7e646 --- /dev/null +++ b/src/utils/computerUse/win32/uiAutomation.ts @@ -0,0 +1,308 @@ +/** + * Windows UI Automation module + * + * Provides UI element tree inspection, element lookup, programmatic click, + * value setting, and hit-testing via PowerShell + System.Windows.Automation. + */ + +export interface UIElement { + name: string + controlType: string // Button, Edit, Text, List, Window, etc. 
+ automationId: string + boundingRect: { x: number; y: number; w: number; h: number } + isEnabled: boolean + value?: string + children?: UIElement[] +} + +// --------------------------------------------------------------------------- +// Helper +// --------------------------------------------------------------------------- + +const UIA_ASSEMBLIES = ` +Add-Type -AssemblyName UIAutomationClient +Add-Type -AssemblyName UIAutomationTypes +Add-Type -AssemblyName WindowsBase +` + +function ps(script: string): string { + const result = Bun.spawnSync({ + cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +function parseJsonSafe<T>(raw: string, fallback: T): T { + try { + if (!raw) return fallback + return JSON.parse(raw) as T + } catch { + return fallback + } +} + +// PowerShell snippet that finds a window by exact or partial title match. +// Assumes $title is already set in the calling script. +const PS_FIND_WINDOW = ` +$root = [System.Windows.Automation.AutomationElement]::RootElement +$window = $root.FindFirst( + [System.Windows.Automation.TreeScope]::Children, + [System.Windows.Automation.PropertyCondition]::new( + [System.Windows.Automation.AutomationElement]::NameProperty, $title)) +if ($window -eq $null) { + $all = $root.FindAll( + [System.Windows.Automation.TreeScope]::Children, + [System.Windows.Automation.Condition]::TrueCondition) + foreach ($el in $all) { + if ($el.Current.Name -and $el.Current.Name.Contains($title)) { + $window = $el + break + } + } +} +` + +// --------------------------------------------------------------------------- +// Public API +// --------------------------------------------------------------------------- + +/** + * Get the UI element tree of a window, up to `depth` levels deep (default 3). 
+ */ +export function getUITree(windowTitle: string, depth: number = 3): UIElement[] { + const escapedTitle = windowTitle.replace(/'/g, "''") + const script = ` +${UIA_ASSEMBLIES} +$title = '${escapedTitle}' +${PS_FIND_WINDOW} +if ($window -eq $null) { + Write-Output '[]' + exit +} + +function Get-UIChildren($parent, $currentDepth, $maxDepth) { + if ($currentDepth -ge $maxDepth) { return @() } + $children = $parent.FindAll( + [System.Windows.Automation.TreeScope]::Children, + [System.Windows.Automation.Condition]::TrueCondition) + $result = @() + foreach ($el in $children) { + $rect = $el.Current.BoundingRectangle + $obj = @{ + name = [string]$el.Current.Name + controlType = $el.Current.ControlType.ProgrammaticName -replace 'ControlType\\.', '' + automationId = [string]$el.Current.AutomationId + boundingRect = @{ + x = [int]$rect.X + y = [int]$rect.Y + w = [int]$rect.Width + h = [int]$rect.Height + } + isEnabled = $el.Current.IsEnabled + } + try { + $vp = $el.GetCurrentPattern([System.Windows.Automation.ValuePattern]::Pattern) + if ($vp -ne $null) { $obj['value'] = $vp.Current.Value } + } catch {} + $sub = Get-UIChildren $el ($currentDepth + 1) $maxDepth + if ($sub.Count -gt 0) { $obj['children'] = $sub } + $result += $obj + } + return $result +} + +$tree = Get-UIChildren $window 0 ${depth} +if ($tree -eq $null -or $tree.Count -eq 0) { + Write-Output '[]' +} else { + $tree | ConvertTo-Json -Depth 20 -Compress +} +` + const raw = ps(script) + const parsed = parseJsonSafe(raw, []) + return Array.isArray(parsed) ? parsed : [parsed] +} + +/** + * Find a single element inside a window matching the given query fields. 
+ */ +export function findElement( + windowTitle: string, + query: { name?: string; controlType?: string; automationId?: string }, +): UIElement | null { + const escapedTitle = windowTitle.replace(/'/g, "''") + + // Build conditions array + const conditions: string[] = [] + if (query.name) { + const v = query.name.replace(/'/g, "''") + conditions.push( + `[System.Windows.Automation.PropertyCondition]::new([System.Windows.Automation.AutomationElement]::NameProperty, '${v}')`, + ) + } + if (query.controlType) { + const v = query.controlType.replace(/'/g, "''") + conditions.push( + `[System.Windows.Automation.PropertyCondition]::new([System.Windows.Automation.AutomationElement]::ControlTypeProperty, [System.Windows.Automation.ControlType]::${v})`, + ) + } + if (query.automationId) { + const v = query.automationId.replace(/'/g, "''") + conditions.push( + `[System.Windows.Automation.PropertyCondition]::new([System.Windows.Automation.AutomationElement]::AutomationIdProperty, '${v}')`, + ) + } + + if (conditions.length === 0) return null + + let conditionExpr: string + if (conditions.length === 1) { + conditionExpr = conditions[0] + } else { + conditionExpr = `[System.Windows.Automation.AndCondition]::new(@(${conditions.join(', ')}))` + } + + const script = ` +${UIA_ASSEMBLIES} +$title = '${escapedTitle}' +${PS_FIND_WINDOW} +if ($window -eq $null) { + Write-Output 'null' + exit +} +$cond = ${conditionExpr} +$el = $window.FindFirst([System.Windows.Automation.TreeScope]::Descendants, $cond) +if ($el -eq $null) { + Write-Output 'null' + exit +} +$rect = $el.Current.BoundingRectangle +$obj = @{ + name = [string]$el.Current.Name + controlType = $el.Current.ControlType.ProgrammaticName -replace 'ControlType\\.', '' + automationId = [string]$el.Current.AutomationId + boundingRect = @{ + x = [int]$rect.X + y = [int]$rect.Y + w = [int]$rect.Width + h = [int]$rect.Height + } + isEnabled = $el.Current.IsEnabled +} +try { + $vp = 
$el.GetCurrentPattern([System.Windows.Automation.ValuePattern]::Pattern) + if ($vp -ne $null) { $obj['value'] = $vp.Current.Value } +} catch {} +$obj | ConvertTo-Json -Compress +` + const raw = ps(script) + return parseJsonSafe(raw, null) +} + +/** + * Click an element by its automationId using InvokePattern. + */ +export function clickElement(windowTitle: string, automationId: string): boolean { + const escapedTitle = windowTitle.replace(/'/g, "''") + const escapedId = automationId.replace(/'/g, "''") + + const script = ` +${UIA_ASSEMBLIES} +$title = '${escapedTitle}' +${PS_FIND_WINDOW} +if ($window -eq $null) { + Write-Output 'false' + exit +} +$cond = [System.Windows.Automation.PropertyCondition]::new( + [System.Windows.Automation.AutomationElement]::AutomationIdProperty, '${escapedId}') +$el = $window.FindFirst([System.Windows.Automation.TreeScope]::Descendants, $cond) +if ($el -eq $null) { + Write-Output 'false' + exit +} +try { + $ip = $el.GetCurrentPattern([System.Windows.Automation.InvokePattern]::Pattern) + $ip.Invoke() + Write-Output 'true' +} catch { + Write-Output 'false' +} +` + return ps(script) === 'true' +} + +/** + * Set the value of an element by its automationId using ValuePattern. 
+ */ +export function setValue(windowTitle: string, automationId: string, value: string): boolean { + const escapedTitle = windowTitle.replace(/'/g, "''") + const escapedId = automationId.replace(/'/g, "''") + const escapedValue = value.replace(/'/g, "''") + + const script = ` +${UIA_ASSEMBLIES} +$title = '${escapedTitle}' +${PS_FIND_WINDOW} +if ($window -eq $null) { + Write-Output 'false' + exit +} +$cond = [System.Windows.Automation.PropertyCondition]::new( + [System.Windows.Automation.AutomationElement]::AutomationIdProperty, '${escapedId}') +$el = $window.FindFirst([System.Windows.Automation.TreeScope]::Descendants, $cond) +if ($el -eq $null) { + Write-Output 'false' + exit +} +try { + $vp = $el.GetCurrentPattern([System.Windows.Automation.ValuePattern]::Pattern) + $vp.SetValue('${escapedValue}') + Write-Output 'true' +} catch { + Write-Output 'false' +} +` + return ps(script) === 'true' +} + +/** + * Get the UI element at a specific screen coordinate. + */ +export function elementAtPoint(x: number, y: number): UIElement | null { + const script = ` +${UIA_ASSEMBLIES} +try { + $point = [System.Windows.Point]::new(${x}, ${y}) + $el = [System.Windows.Automation.AutomationElement]::FromPoint($point) + if ($el -eq $null) { + Write-Output 'null' + exit + } + $rect = $el.Current.BoundingRectangle + $obj = @{ + name = [string]$el.Current.Name + controlType = $el.Current.ControlType.ProgrammaticName -replace 'ControlType\\.', '' + automationId = [string]$el.Current.AutomationId + boundingRect = @{ + x = [int]$rect.X + y = [int]$rect.Y + w = [int]$rect.Width + h = [int]$rect.Height + } + isEnabled = $el.Current.IsEnabled + } + try { + $vp = $el.GetCurrentPattern([System.Windows.Automation.ValuePattern]::Pattern) + if ($vp -ne $null) { $obj['value'] = $vp.Current.Value } + } catch {} + $obj | ConvertTo-Json -Compress +} catch { + Write-Output 'null' +} +` + const raw = ps(script) + return parseJsonSafe(raw, null) +} diff --git 
a/src/utils/computerUse/win32/windowCapture.ts b/src/utils/computerUse/win32/windowCapture.ts new file mode 100644 index 000000000..fe090f21f --- /dev/null +++ b/src/utils/computerUse/win32/windowCapture.ts @@ -0,0 +1,129 @@ +/** + * Window-level screenshot capture using Win32 PrintWindow API. + * Captures windows even when occluded or in the background. + */ + +interface CaptureResult { + base64: string + width: number + height: number +} + +const CAPTURE_BY_TITLE_PS = ` +Add-Type -AssemblyName System.Drawing +Add-Type -ReferencedAssemblies System.Drawing @' +using System; +using System.Runtime.InteropServices; +using System.Drawing; +using System.Drawing.Imaging; +public class WinCap { + [DllImport("user32.dll", CharSet=CharSet.Unicode)] + public static extern IntPtr FindWindow(string c, string t); + [DllImport("user32.dll")] + public static extern bool GetWindowRect(IntPtr h, out RECT r); + [DllImport("user32.dll")] + public static extern bool PrintWindow(IntPtr h, IntPtr hdc, uint f); + [StructLayout(LayoutKind.Sequential)] + public struct RECT { public int L, T, R, B; } + + public static string Capture(string title) { + IntPtr hwnd = FindWindow(null, title); + if (hwnd == IntPtr.Zero) return "NOT_FOUND"; + RECT r; GetWindowRect(hwnd, out r); + int w = r.R - r.L; int h = r.B - r.T; + if (w <= 0 || h <= 0) return "INVALID_SIZE"; + Bitmap bmp = new Bitmap(w, h); + Graphics g = Graphics.FromImage(bmp); + IntPtr hdc = g.GetHdc(); + PrintWindow(hwnd, hdc, 2); + g.ReleaseHdc(hdc); g.Dispose(); + var ms = new System.IO.MemoryStream(); + bmp.Save(ms, ImageFormat.Png); + bmp.Dispose(); + return w + "," + h + "," + Convert.ToBase64String(ms.ToArray()); + } +} +'@ +` + +const CAPTURE_BY_HWND_PS = ` +Add-Type -AssemblyName System.Drawing +Add-Type -ReferencedAssemblies System.Drawing @' +using System; +using System.Runtime.InteropServices; +using System.Drawing; +using System.Drawing.Imaging; +public class WinCapH { + [DllImport("user32.dll")] + public static extern 
bool 
GetWindowRect(IntPtr h, out RECT r); + [DllImport("user32.dll")] + public static extern bool PrintWindow(IntPtr h, IntPtr hdc, uint f); + [DllImport("user32.dll")] + public static extern bool IsWindow(IntPtr hWnd); + [StructLayout(LayoutKind.Sequential)] + public struct RECT { public int L, T, R, B; } + + public static string Capture(IntPtr hwnd) { + if (!IsWindow(hwnd)) return "NOT_FOUND"; + RECT r; GetWindowRect(hwnd, out r); + int w = r.R - r.L; int h = r.B - r.T; + if (w <= 0 || h <= 0) return "INVALID_SIZE"; + Bitmap bmp = new Bitmap(w, h); + Graphics g = Graphics.FromImage(bmp); + IntPtr hdc = g.GetHdc(); + PrintWindow(hwnd, hdc, 2); + g.ReleaseHdc(hdc); g.Dispose(); + var ms = new System.IO.MemoryStream(); + bmp.Save(ms, ImageFormat.Png); + bmp.Dispose(); + return w + "," + h + "," + Convert.ToBase64String(ms.ToArray()); + } +} +'@ +` + +function parseCaptureOutput(raw: string): CaptureResult | null { + const trimmed = raw.trim() + if (!trimmed || trimmed === 'NOT_FOUND' || trimmed === 'INVALID_SIZE') { + return null + } + const firstComma = trimmed.indexOf(',') + const secondComma = trimmed.indexOf(',', firstComma + 1) + if (firstComma === -1 || secondComma === -1) return null + + const width = Number(trimmed.slice(0, firstComma)) + const height = Number(trimmed.slice(firstComma + 1, secondComma)) + const base64 = trimmed.slice(secondComma + 1) + + if (!width || !height || !base64) return null + return { base64, width, height } +} + +function runPs(script: string): string { + const result = Bun.spawnSync({ + cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', script], + stdout: 'pipe', + stderr: 'pipe', + }) + return new TextDecoder().decode(result.stdout).trim() +} + +/** + * Capture a window screenshot by its exact title. + * Uses PrintWindow which works even for occluded/background windows. 
+ */
+export function captureWindow(title: string): CaptureResult | null {
+  const escaped = title.replace(/'/g, "''")
+  const script = `${CAPTURE_BY_TITLE_PS}\n[WinCap]::Capture('${escaped}')`
+  const raw = runPs(script)
+  return parseCaptureOutput(raw)
+}
+
+/**
+ * Capture a window screenshot by its HWND handle.
+ */
+export function captureWindowByHwnd(hwnd: number): CaptureResult | null {
+  const script = `${CAPTURE_BY_HWND_PS}\n[WinCapH]::Capture([IntPtr]::new(${hwnd}))`
+  const raw = runPs(script)
+  return parseCaptureOutput(raw)
+}
diff --git a/src/utils/computerUse/win32/windowEnum.ts b/src/utils/computerUse/win32/windowEnum.ts
new file mode 100644
index 000000000..03bdbbebb
--- /dev/null
+++ b/src/utils/computerUse/win32/windowEnum.ts
@@ -0,0 +1,86 @@
+/**
+ * Window enumeration using Win32 EnumWindows API.
+ * Returns visible windows with their HWND, PID, and title.
+ */
+
+export interface WindowInfo {
+  hwnd: number
+  pid: number
+  title: string
+}
+
+const ENUM_WINDOWS_PS = `
+Add-Type @'
+using System;
+using System.Collections.Generic;
+using System.Runtime.InteropServices;
+using System.Text;
+public class WinEnum {
+    public delegate bool EnumWindowsProc(IntPtr hWnd, IntPtr lParam);
+
+    [DllImport("user32.dll")]
+    public static extern bool EnumWindows(EnumWindowsProc lpEnumFunc, IntPtr lParam);
+    [DllImport("user32.dll")]
+    public static extern bool IsWindowVisible(IntPtr hWnd);
+    [DllImport("user32.dll", CharSet=CharSet.Unicode)]
+    public static extern int GetWindowText(IntPtr hWnd, StringBuilder lpString, int nMaxCount);
+    [DllImport("user32.dll")]
+    public static extern int GetWindowTextLength(IntPtr hWnd);
+    [DllImport("user32.dll")]
+    public static extern uint GetWindowThreadProcessId(IntPtr hWnd, out uint processId);
+
+    public static List<string> results = new List<string>();
+
+    public static void Run() {
+        results.Clear();
+        EnumWindows(delegate(IntPtr hWnd, IntPtr lParam) {
+            if (!IsWindowVisible(hWnd)) return true;
+            int len = GetWindowTextLength(hWnd);
+            if (len == 0) return true;
+            StringBuilder sb = new StringBuilder(len + 1);
+            GetWindowText(hWnd, sb, sb.Capacity);
+            string title = sb.ToString();
+            if (string.IsNullOrWhiteSpace(title)) return true;
+            uint pid = 0;
+            GetWindowThreadProcessId(hWnd, out pid);
+            results.Add(hWnd.ToInt64() + "|" + pid + "|" + title);
+            return true;
+        }, IntPtr.Zero);
+    }
+}
+'@
+[WinEnum]::Run()
+[WinEnum]::results | ForEach-Object { $_ }
+`
+
+/**
+ * List all visible windows with non-empty titles.
+ * Returns HWND, PID, and window title for each.
+ */
+export function listWindows(): WindowInfo[] {
+  const result = Bun.spawnSync({
+    cmd: ['powershell', '-NoProfile', '-NonInteractive', '-Command', ENUM_WINDOWS_PS],
+    stdout: 'pipe',
+    stderr: 'pipe',
+  })
+  const raw = new TextDecoder().decode(result.stdout).trim()
+  if (!raw) return []
+
+  return raw
+    .split('\n')
+    .filter(Boolean)
+    .map(line => {
+      const trimmed = line.trim()
+      const firstPipe = trimmed.indexOf('|')
+      const secondPipe = trimmed.indexOf('|', firstPipe + 1)
+      if (firstPipe === -1 || secondPipe === -1) return null
+
+      const hwnd = Number(trimmed.slice(0, firstPipe))
+      const pid = Number(trimmed.slice(firstPipe + 1, secondPipe))
+      const title = trimmed.slice(secondPipe + 1)
+
+      if (isNaN(hwnd) || isNaN(pid) || !title) return null
+      return { hwnd, pid, title }
+    })
+    .filter((item): item is WindowInfo => item !== null)
+}
diff --git a/vendor/audio-capture-src/index.ts b/vendor/audio-capture-src/index.ts
new file mode 100644
index 000000000..4cac2037c
--- /dev/null
+++ b/vendor/audio-capture-src/index.ts
@@ -0,0 +1,151 @@
+
+type AudioCaptureNapi = {
+  startRecording(
+    onData: (data: Buffer) => void,
+    onEnd: () => void,
+  ): boolean
+  stopRecording(): void
+  isRecording(): boolean
+  startPlayback(sampleRate: number, channels: number): boolean
+  writePlaybackData(data: Buffer): void
+  stopPlayback(): void
+  isPlaying(): boolean
+  // TCC microphone authorization status (macOS only):
+  // 0 = notDetermined, 1 = restricted, 2 = denied, 3 = authorized.
+  // Linux: always returns 3 (authorized) — no system-level microphone permission API.
+  // Windows: returns 3 (authorized) if registry key absent or allowed,
+  // 2 (denied) if microphone access is explicitly denied.
+  microphoneAuthorizationStatus?(): number
+}
+
+let cachedModule: AudioCaptureNapi | null = null
+let loadAttempted = false
+
+function loadModule(): AudioCaptureNapi | null {
+  if (loadAttempted) {
+    return cachedModule
+  }
+  loadAttempted = true
+
+  // Supported platforms: macOS (darwin), Linux, Windows (win32)
+  const platform = process.platform
+  if (platform !== 'darwin' && platform !== 'linux' && platform !== 'win32') {
+    return null
+  }
+
+  // Candidate 1: native-embed path (bun compile). AUDIO_CAPTURE_NODE_PATH is
+  // defined at build time in build-with-plugins.ts for native builds only — the
+  // define resolves it to the static literal "../../audio-capture.node" so bun
+  // compile can rewrite it to /$bunfs/root/audio-capture.node. MUST stay a
+  // direct require(env var) — bun cannot analyze require(variable) from a loop.
+  if (process.env.AUDIO_CAPTURE_NODE_PATH) {
+    try {
+      // eslint-disable-next-line @typescript-eslint/no-require-imports
+      cachedModule = require(
+        process.env.AUDIO_CAPTURE_NODE_PATH,
+      ) as AudioCaptureNapi
+      return cachedModule
+    } catch {
+      // fall through to runtime fallbacks below
+    }
+  }
+
+  // Candidates 2/3: npm-install and dev/source layouts. Dynamic require is
+  // fine here — in bundled output (node --target build) require() resolves at
+  // runtime relative to cli.js at the package root; in dev it resolves
+  // relative to this file (vendor/audio-capture-src/index.ts).
+  const platformDir = `${process.arch}-${platform}`
+  const fallbacks = [
+    `./vendor/audio-capture/${platformDir}/audio-capture.node`,
+    `../audio-capture/${platformDir}/audio-capture.node`,
+  ]
+  for (const p of fallbacks) {
+    try {
+      // eslint-disable-next-line @typescript-eslint/no-require-imports
+      cachedModule = require(p) as AudioCaptureNapi
+      return cachedModule
+    } catch {
+      // try next
+    }
+  }
+  return null
+}
+
+export function isNativeAudioAvailable(): boolean {
+  return loadModule() !== null
+}
+
+export function startNativeRecording(
+  onData: (data: Buffer) => void,
+  onEnd: () => void,
+): boolean {
+  const mod = loadModule()
+  if (!mod) {
+    return false
+  }
+  return mod.startRecording(onData, onEnd)
+}
+
+export function stopNativeRecording(): void {
+  const mod = loadModule()
+  if (!mod) {
+    return
+  }
+  mod.stopRecording()
+}
+
+export function isNativeRecordingActive(): boolean {
+  const mod = loadModule()
+  if (!mod) {
+    return false
+  }
+  return mod.isRecording()
+}
+
+export function startNativePlayback(
+  sampleRate: number,
+  channels: number,
+): boolean {
+  const mod = loadModule()
+  if (!mod) {
+    return false
+  }
+  return mod.startPlayback(sampleRate, channels)
+}
+
+export function writeNativePlaybackData(data: Buffer): void {
+  const mod = loadModule()
+  if (!mod) {
+    return
+  }
+  mod.writePlaybackData(data)
+}
+
+export function stopNativePlayback(): void {
+  const mod = loadModule()
+  if (!mod) {
+    return
+  }
+  mod.stopPlayback()
+}
+
+export function isNativePlaying(): boolean {
+  const mod = loadModule()
+  if (!mod) {
+    return false
+  }
+  return mod.isPlaying()
+}
+
+// Returns the microphone authorization status.
+// On macOS, returns the TCC status: 0=notDetermined, 1=restricted, 2=denied, 3=authorized.
+// On Linux, always returns 3 (authorized) — no system-level mic permission API.
+// On Windows, returns 3 (authorized) if registry key absent or allowed, 2 (denied) if explicitly denied.
+// Returns 0 (notDetermined) if the native module is unavailable.
+export function microphoneAuthorizationStatus(): number {
+  const mod = loadModule()
+  if (!mod || !mod.microphoneAuthorizationStatus) {
+    return 0
+  }
+  return mod.microphoneAuthorizationStatus()
+}
diff --git a/vendor/audio-capture/arm64-darwin/audio-capture.node b/vendor/audio-capture/arm64-darwin/audio-capture.node
new file mode 100644
index 000000000..a60bd3c5c
Binary files /dev/null and b/vendor/audio-capture/arm64-darwin/audio-capture.node differ
diff --git a/vendor/audio-capture/arm64-linux/audio-capture.node b/vendor/audio-capture/arm64-linux/audio-capture.node
new file mode 100644
index 000000000..34f862ac3
Binary files /dev/null and b/vendor/audio-capture/arm64-linux/audio-capture.node differ
diff --git a/vendor/audio-capture/arm64-win32/audio-capture.node b/vendor/audio-capture/arm64-win32/audio-capture.node
new file mode 100644
index 000000000..43b0fbbc7
Binary files /dev/null and b/vendor/audio-capture/arm64-win32/audio-capture.node differ
diff --git a/vendor/audio-capture/x64-darwin/audio-capture.node b/vendor/audio-capture/x64-darwin/audio-capture.node
new file mode 100644
index 000000000..2b4f4c5f3
Binary files /dev/null and b/vendor/audio-capture/x64-darwin/audio-capture.node differ
diff --git a/vendor/audio-capture/x64-linux/audio-capture.node b/vendor/audio-capture/x64-linux/audio-capture.node
new file mode 100644
index 000000000..3f57d8550
Binary files /dev/null and b/vendor/audio-capture/x64-linux/audio-capture.node differ
diff --git a/vendor/audio-capture/x64-win32/audio-capture.node b/vendor/audio-capture/x64-win32/audio-capture.node
new file mode 100644
index 000000000..559527664
Binary files /dev/null and b/vendor/audio-capture/x64-win32/audio-capture.node differ
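
The six vendored binaries sit in `${arch}-${platform}` directories, which is the layout the fallback loop in `loadModule()` probes. A minimal sketch of that path derivation follows, pulled out as a pure function so it can be checked without loading a native module. `candidatePaths`, `hasVendoredBinary`, and `SHIPPED_PLATFORM_DIRS` are hypothetical names used for illustration only; they are not part of this patch.

```typescript
// Hypothetical helpers (not in the patch). They mirror the two runtime
// fallbacks that loadModule() tries after the AUDIO_CAPTURE_NODE_PATH candidate.

// Directory names taken from the binary files added in this diff.
const SHIPPED_PLATFORM_DIRS: readonly string[] = [
  'arm64-darwin',
  'arm64-linux',
  'arm64-win32',
  'x64-darwin',
  'x64-linux',
  'x64-win32',
]

// Build the candidate require() paths for a given arch/platform pair,
// in the same order the loader tries them.
function candidatePaths(arch: string, platform: string): string[] {
  const platformDir = `${arch}-${platform}`
  return [
    `./vendor/audio-capture/${platformDir}/audio-capture.node`, // bundled / npm-install layout
    `../audio-capture/${platformDir}/audio-capture.node`, // dev / source layout
  ]
}

// True when this diff ships a binary for the given arch/platform pair.
function hasVendoredBinary(arch: string, platform: string): boolean {
  return SHIPPED_PLATFORM_DIRS.includes(`${arch}-${platform}`)
}
```

Under these assumptions, `hasVendoredBinary('ia32', 'win32')` is false: on such a host, with `AUDIO_CAPTURE_NODE_PATH` unset, both fallback `require` calls would be expected to throw and `loadModule()` would return `null`.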