| 1 | +--- |
| 2 | +title: Computer Use |
| 3 | +description: Enable AI agents to control your computer like a human - clicking, typing, running commands, and editing files. |
| 4 | +--- |
| 5 | + |
| 6 | +Transform your AI agent into an autonomous computer operator. This built-in extension enables agents to see your screen, control the mouse and keyboard, execute shell commands, and edit files - just like a human user sitting at the computer. |
| 7 | + |
| 8 | +Based on [Anthropic's computer use tools](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo), it brings full desktop automation capabilities to your AI workflows. |
| 9 | + |
| 10 | +## Using Computer Use Tools |
| 11 | + |
| 12 | +The built-in Computer Use tools are listed under the **computer_use** category and can be executed directly from the **Tools** page:
| 13 | + |
| 14 | +<Screenshot src="/img/tools/computer_use_exec.webp" /> |
| 15 | + |
| 16 | +The same page also shows the definitions of all other available Computer Use tools:
| 17 | + |
| 18 | +<Screenshot src="/img/tools/computer_use_list.webp" /> |
| 19 | + |
| 20 | +## Selecting Tools |
| 21 | + |
| 22 | +Enable or disable individual tools based on your workflow needs: |
| 23 | + |
| 24 | +<Screenshot src="/img/tools/computer_use_select.webp" /> |
| 25 | + |
| 26 | +## Why Computer Use? |
| 27 | + |
| 28 | +Traditional AI tools operate through APIs and structured data. Computer Use breaks this barrier by letting agents interact with **any application** - web browsers, desktop apps, terminals, IDEs - exactly as you would. This unlocks automation for: |
| 29 | + |
| 30 | +- **Legacy Applications**: Automate software that lacks APIs |
| 31 | +- **Visual Verification**: Confirm that code actually renders correctly in a browser |
| 32 | +- **End-to-End Workflows**: Chain together multiple applications in a single task |
| 33 | +- **Interactive Testing**: Navigate UIs, fill forms, and verify results visually |
| 34 | + |
| 35 | +## Key Capabilities |
| 36 | + |
| 37 | +- **Visual Perception**: Capture screenshots to see what's on screen, and zoom into specific regions for detailed inspection
| 38 | +- **Mouse Control**: Move, click (single/double/triple), drag, and scroll anywhere on screen
| 39 | +- **Keyboard Input**: Type text, press key combinations, and execute shortcuts
| 40 | +- **Shell Execution**: Run any command in a persistent bash session with preserved state |
| 41 | +- **File Operations**: View, create, edit, and undo changes to files with precision |
| 42 | + |
| 43 | +## Use Cases |
| 44 | + |
| 45 | +### Web Development with Visual Verification |
| 46 | + |
| 47 | +Ask an agent to build a web application, and it can write the code, launch a server, open the browser, and take a screenshot to prove it works: |
| 48 | + |
| 49 | +<Screenshot src="/img/tools/computer_use_tetris.webp" /> |
| 50 | + |
| 51 | +In this example, the agent combines three of the tools:
| 52 | +1. **bash**: Create project directories and start a local server |
| 53 | +2. **edit**: Write HTML, CSS, and JavaScript files |
| 54 | +3. **computer**: Open the browser and capture the final result |
| 55 | + |
| 56 | +### Desktop Application Automation |
| 57 | + |
| 58 | +Agents can operate any GUI application: |
| 59 | +- Open applications, navigate menus, click buttons |
| 60 | +- Fill out forms and dialog boxes |
| 61 | +- Extract information from visual interfaces |
| 62 | +- Automate repetitive desktop workflows |
| 63 | + |
| 64 | +### System Administration |
| 65 | + |
| 66 | +Execute and verify system operations: |
| 67 | +- Run diagnostic commands and interpret output |
| 68 | +- Edit configuration files with undo capability |
| 69 | +- Verify changes by taking screenshots of system state |
| 70 | +- Chain complex multi-step operations in a persistent shell |
| 71 | + |
| 72 | +### Testing and QA |
| 73 | + |
| 74 | +Visual validation that code works as expected: |
| 75 | +- Navigate to specific URLs and verify page content |
| 76 | +- Interact with UI elements and confirm behavior |
| 77 | +- Capture screenshots for documentation or bug reports |
| 78 | +- Test across different screen regions with zoom |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## Tools |
| 83 | + |
| 84 | +### Computer |
| 85 | + |
| 86 | +Interact with the screen, mouse, and keyboard to control your desktop environment. |
| 87 | + |
| 88 | +| Action | Description | |
| 89 | +|--------|-------------| |
| 90 | +| `screenshot` | Capture the current screen state | |
| 91 | +| `mouse_move` | Move cursor to specific coordinates | |
| 92 | +| `left_click`, `right_click`, `middle_click` | Click at current or specified position | |
| 93 | +| `double_click`, `triple_click` | Multi-click actions for text selection | |
| 94 | +| `left_click_drag` | Click and drag to a target position | |
| 95 | +| `left_mouse_down`, `left_mouse_up` | Press/release mouse button for complex interactions | |
| 96 | +| `scroll` | Scroll in any direction by a specified amount | |
| 97 | +| `type` | Type text at the current cursor position | |
| 98 | +| `key` | Press key combinations (e.g., `Control+c`, `Return`) | |
| 99 | +| `hold_key` | Hold a key down for a specified duration | |
| 100 | +| `wait` | Pause execution for a specified duration | |
| 101 | +| `cursor_position` | Get current cursor coordinates | |
| 102 | +| `zoom` | Zoom into a specific screen region for detail | |
| 103 | + |
| 104 | +```python |
| 105 | +async def computer( |
| 106 | + action: Literal["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", |
| 107 | + "middle_click", "double_click", "left_mouse_down", "left_mouse_up", "scroll", |
| 108 | + "hold_key", "wait", "triple_click", "screenshot", "cursor_position", "zoom"], |
| 109 | + text: Annotated[str | None, "The text to type or the key to press"] = None, |
| 110 | + coordinate: Annotated[ |
| 111 | + tuple[int, int] | None, |
| 112 | + "(x, y): The x and y coordinates to move the mouse to" |
| 113 | + ] = None, |
| 114 | + scroll_direction: Literal["up", "down", "left", "right"] | None = None, |
| 115 | + scroll_amount: Annotated[int | None, "The number of lines to scroll"] = None, |
| 116 | + duration: Annotated[float | None, "Duration in seconds"] = None, |
| 117 | + key: Annotated[str | None, "The key sequence to press"] = None, |
| 118 | + region: Annotated[str | None, "(x0, y0, x1, y1): The region to zoom into"] = None, |
| 119 | +) -> list[dict[str, Any]] |
| 120 | +``` |
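
To illustrate how the parameters pair with actions, here is a minimal sketch of a helper that assembles keyword arguments for a `computer` call and rejects incomplete ones. The `REQUIRED_ARGS` mapping and the `build_computer_args` helper are hypothetical; the mapping is inferred from the action table and parameter descriptions above, and the extension's own validation may differ.

```python
from typing import Any

# Assumed mapping, inferred from the documented actions and parameters.
# It is not taken from the extension's source code.
REQUIRED_ARGS: dict[str, tuple[str, ...]] = {
    "type": ("text",),
    "mouse_move": ("coordinate",),
    "left_click_drag": ("coordinate",),
    "scroll": ("scroll_direction", "scroll_amount"),
    "hold_key": ("duration",),
    "wait": ("duration",),
    "zoom": ("region",),
}


def build_computer_args(action: str, **params: Any) -> dict[str, Any]:
    """Return keyword arguments for a `computer` call, rejecting incomplete ones."""
    missing = [
        name for name in REQUIRED_ARGS.get(action, ()) if params.get(name) is None
    ]
    if missing:
        raise ValueError(f"{action!r} is missing: {', '.join(missing)}")
    return {"action": action, **params}


# Hypothetical sequence: click a text field, type into it, capture the result.
calls = [
    build_computer_args("left_click", coordinate=(640, 360)),
    build_computer_args("type", text="hello world"),
    build_computer_args("screenshot"),
]
```

Each dictionary in `calls` could then be unpacked into the tool, for example `await computer(**calls[0])`, assuming the signature shown above.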
| 121 | + |
| 122 | +### Edit |
| 123 | + |
| 124 | +A precision file editor designed for AI agents with string-based operations that avoid line-number ambiguity. |
| 125 | + |
| 126 | +| Command | Description | |
| 127 | +|---------|-------------| |
| 128 | +| `view` | Read file contents or list directory contents | |
| 129 | +| `create` | Create a new file with specified content | |
| 130 | +| `str_replace` | Replace a unique string with new content | |
| 131 | +| `insert` | Insert text after a specific line number | |
| 132 | +| `undo_edit` | Revert the last edit to a file | |
| 133 | + |
| 134 | +View, create, and modify files with undo support. |
| 135 | + |
| 136 | +```python |
| 137 | +async def edit( |
| 138 | + command: Literal["view", "create", "str_replace", "insert", "undo_edit"], |
| 139 | + path: Annotated[str, "The absolute path to the file or directory"], |
| 140 | + file_text: Annotated[ |
| 141 | + str | None, |
| 142 | + "The content to write to the file (required for create)" |
| 143 | + ] = None, |
| 144 | + view_range: Annotated[ |
| 145 | + list[int] | None, |
| 146 | + "The range of lines to view (e.g. [1, 10])" |
| 147 | + ] = None, |
| 148 | + old_str: Annotated[ |
| 149 | + str | None, |
| 150 | + "The string to replace (required for str_replace)" |
| 151 | + ] = None, |
| 152 | + new_str: Annotated[ |
| 153 | + str | None, |
| 154 | + "The replacement string (required for str_replace and insert)" |
| 155 | + ] = None, |
| 156 | + insert_line: Annotated[ |
| 157 | + int | None, |
| 158 | + "The line number after which to insert (required for insert)" |
| 159 | + ] = None, |
| 160 | +) -> list[dict[str, Any]] |
| 161 | +``` |
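
The string-based approach can be sketched in a few lines. The functions below operate on in-memory text rather than files, and they are an illustration of the idea, not the extension's implementation. The uniqueness check follows from the `str_replace` description above; treating `insert_line` as 1-based, with `0` meaning "top of file", is an assumption.

```python
def str_replace(content: str, old_str: str, new_str: str) -> str:
    """Replace old_str with new_str only when old_str occurs exactly once."""
    count = content.count(old_str)
    if count == 0:
        raise ValueError("old_str not found")
    if count > 1:
        raise ValueError(f"old_str is ambiguous: {count} occurrences")
    return content.replace(old_str, new_str, 1)


def insert_after_line(content: str, insert_line: int, new_str: str) -> str:
    """Insert new_str after the given line (assumed 1-based; 0 inserts at the top)."""
    lines = content.split("\n")
    if not 0 <= insert_line <= len(lines):
        raise ValueError("insert_line out of range")
    return "\n".join(lines[:insert_line] + new_str.split("\n") + lines[insert_line:])


config = "host = localhost\nport = 8000\n"
updated = str_replace(config, "port = 8000", "port = 9000")
extended = insert_after_line(config, 1, "debug = true")
```

Requiring a unique match means an edit either lands on exactly the intended text or fails loudly, which is what lets an agent avoid off-by-one mistakes that line-number edits invite.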
| 162 | + |
| 163 | +### Bash |
| 164 | + |
| 165 | +Execute commands in a persistent shell session where environment variables, working directory, and state are preserved between calls. |
| 166 | + |
| 167 | +| Feature | Description | |
| 168 | +|---------|-------------| |
| 169 | +| Command execution | Run any bash command | |
| 170 | +| Persistent state | Working directory and variables persist | |
| 171 | +| Session restart | Reset the shell environment when needed | |
| 172 | +| Cross-platform open | Launch files/URLs with the system default handler (see Open File or URL) |
| 173 | + |
| 174 | +```python |
| 175 | +async def run_bash( |
| 176 | + command: Annotated[str | None, "Command to run"], |
| 177 | + restart: Annotated[bool, "Restart the bash session"] = False, |
| 178 | +) -> list[dict[str, Any]] |
| 179 | +``` |
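
One way to get this kind of persistence is to keep a single `bash` process alive and feed it commands over a pipe. The sketch below is a simplified, synchronous illustration under that assumption; the `PersistentShell` class is hypothetical, it requires `bash` on the `PATH`, and it does not handle timeouts or commands whose output lacks a trailing newline.

```python
import subprocess
import uuid


class PersistentShell:
    """Keeps one bash process alive so state carries over between commands."""

    def __init__(self) -> None:
        self._start()

    def _start(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
            bufsize=1,
        )

    def run(self, command: str) -> str:
        # A unique marker line tells us where this command's output ends.
        marker = f"__done_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        lines = []
        for line in iter(self.proc.stdout.readline, ""):
            if line.strip() == marker:
                break
            lines.append(line)
        return "".join(lines).rstrip("\n")

    def restart(self) -> None:
        # Discard the old process (and its state) and start a fresh one.
        self.proc.kill()
        self.proc.wait()
        self._start()


shell = PersistentShell()
shell.run("cd / && export DEMO_VAR=hello")
working_dir = shell.run("pwd")           # directory set by the previous call
demo_value = shell.run("echo $DEMO_VAR")  # variable set by the previous call
shell.restart()
after_restart = shell.run("echo ${DEMO_VAR:-unset}")
shell.proc.kill()
shell.proc.wait()
```

Because every command goes to the same process, `cd` and `export` in one call remain visible to the next, and `restart` is the only way to clear that state.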
| 180 | + |
| 181 | +### Open File or URL |
| 182 | + |
| 183 | +Open a URL or file using the system's default handler (`xdg-open` on Linux, `open` on macOS, `start` on Windows). |
| 184 | + |
| 185 | +```python |
| 186 | +async def open( |
| 187 | + target: Annotated[str, "URL or file path to open"] |
| 188 | +) -> list[dict[str, Any]] |
| 189 | +``` |