| 1 | +# Computer Use Tools |
| 2 | + |
| 3 | +This extension provides a set of tools that let an Agent interact with a computer environment much as a human user would. It covers screen interaction (mouse and keyboard), shell execution, and file editing. It is based on [Anthropic's computer use tools](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo). |
| 4 | + |
| 5 | +## Available Tools |
| 6 | + |
| 7 | +### 1. Computer Tool (`computer`) |
| 8 | +Allows interaction with the screen, keyboard, and mouse. |
| 9 | + |
| 10 | +**Capabilities:** |
| 11 | +- **Mouse Interaction**: Move cursor, click (left, right, middle, double, triple), click & drag. |
| 12 | +- **Keyboard Interaction**: Type text, press specific keys or key combinations. |
| 13 | +- **Screen**: Take screenshots, get cursor position. |
| 14 | +- **Zooming**: Zoom into specific regions of the screen (Action: `zoom`). |
| 15 | + |
| 16 | +**Key Parameters:** |
| 17 | +- `action`: The action to perform (e.g., `mouse_move`, `left_click`, `type`, `screenshot`, `zoom`). |
| 18 | +- `coordinate`: `(x, y)` coordinates for mouse actions. |
| 19 | +- `text`: Text to type. |
| 20 | +- `key`: Key sequence to press (e.g., `Return`, `Control+c`). |
| 21 | +- `region`: `(x0, y0, x1, y1)` region for zooming. |
| 22 | + |
| 23 | +### 2. Bash Tool (`bash`) |
| 24 | +Provides a persistent shell session to execute command-line instructions. |
| 25 | + |
| 26 | +**Capabilities:** |
| 27 | +- **Execute Commands**: Run any bash command. |
| 28 | +- **Persistent Session**: State (like environment variables, working directory) is preserved between calls within the same session. |
| 29 | +- **Process Management**: Can restart the session if needed. |
| 30 | +- **Open Files/URLs**: Helper function `open` allows opening files or URLs using the system's default handler (`xdg-open`, `open`, or `start`). |
| 31 | + |
| 32 | +**Key Parameters:** |
| 33 | +- `command`: The bash command to execute. |
| 34 | +- `restart`: Boolean to restart the session. |
| 35 | + |
| 36 | +### 3. Edit Tool (`str_replace_editor`) |
| 37 | +A filesystem editor for viewing and modifying files. |
| 38 | + |
| 39 | +**Capabilities:** |
| 40 | +- **View**: Read file contents or list directories. |
| 41 | +- **Create**: Create new files with content. |
| 42 | +- **String Replace**: Replace a string that matches uniquely in a file (an unambiguous target keeps LLM-driven edits reliable). |
| 43 | +- **Insert**: Insert text at specific line numbers. |
| 44 | +- **Undo**: Undo the last edit to a file. |
| 45 | + |
| 46 | +**Key Parameters:** |
| 47 | +- `command`: The edit command (`view`, `create`, `str_replace`, `insert`, `undo_edit`). |
| 48 | +- `path`: Absolute path to the file or directory. |
| 49 | +- `file_text`: Content for file creation. |
| 50 | +- `old_str` / `new_str`: Strings for replacement. |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## Capabilities & Workflows |
| 55 | + |
| 56 | +These tools are designed to work together to enable complex end-to-end tasks. An Agent can act as a developer, tester, or general user. |
| 57 | + |
| 58 | +### Example: "Build a Tetris web app in a tetris folder, open it then take a screenshot" |
| 59 | + |
| 60 | +To achieve this high-level task, the Agent would sequence the tools as follows: |
| 61 | + |
| 62 | +1. **Create the Project Structure** |
| 63 | + * **Tool**: `bash` |
| 64 | + * **Command**: `mkdir -p tetris` |
| 65 | + * *Result*: Creates the folder. |
| 66 | + |
| 67 | +2. **Create the Application Files** |
| 68 | + * **Tool**: `str_replace_editor` (command: `create`) |
| 69 | + * **Path**: `/path/to/tetris/index.html` |
| 70 | + * **Content**: (HTML code for Tetris game) |
| 71 | + * *Result*: Writes the HTML file. |
| 72 | + |
| 73 | +3. **Open the Application** |
| 74 | + * **Tool**: `bash`, either through the `open` helper or by invoking the system opener directly. |
| 75 | + * **Command**: `xdg-open /path/to/tetris/index.html` (Linux), or serve the folder with `python -m http.server` and open the localhost URL. |
| 76 | + * *Result*: Opens the file in the default web browser. |
| 77 | + |
| 78 | +4. **Wait & Verify** |
| 79 | + * **Tool**: `computer` |
| 80 | + * **Action**: `wait` or `screenshot` to see if it loaded. |
| 81 | + |
| 82 | +5. **Take a Screenshot** |
| 83 | + * **Tool**: `computer` |
| 84 | + * **Action**: `screenshot` |
| 85 | + * *Result*: Captures the visual state of the running Tetris app for the user to see. |
| 86 | + |
| 87 | +### How the Agent handles the "Build a Tetris..." request |
| 88 | +When a user gives the command: |
| 89 | +> "Build a Tetris web app in a tetris folder, open it then take a screenshot" |
| 90 | + |
| 91 | +The Agent decomposes this into: |
| 92 | +1. **"Build... in a tetris folder"** -> Uses `bash` to make the directory and `str_replace_editor` to write the `index.html` / `style.css` / `script.js` files. |
| 93 | +2. **"Open it"** -> Uses `bash` to run a server or open the file in a browser. |
| 94 | +3. **"Take a screenshot"** -> Uses `computer` to verify the visual output. |
| 95 | + |
| 96 | +This combination allows the Agent to not just generate code, but **verify** it visually and interactively, closing the loop on development tasks. |
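
The Tetris workflow described above can be sketched as an ordered plan of tool calls. This is only an illustration: the `ToolCall` container and the `plan_tetris_workflow` function are hypothetical names invented for this sketch and are not part of the extension. The tool names (`bash`, `str_replace_editor`, `computer`) and parameter names (`command`, `path`, `file_text`, `action`) are taken from the sections above, but the exact payload shape the Agent sends is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One planned tool invocation (hypothetical container for illustration)."""
    tool: str
    params: dict[str, Any] = field(default_factory=dict)


def plan_tetris_workflow(project_dir: str) -> list[ToolCall]:
    """Return the tool calls for the Tetris example, in execution order."""
    index_path = f"{project_dir}/index.html"
    return [
        # 1. Create the project folder with the bash tool.
        ToolCall("bash", {"command": f"mkdir -p {project_dir}"}),
        # 2. Write the application file with the edit tool's `create` command.
        ToolCall(
            "str_replace_editor",
            {
                "command": "create",
                "path": index_path,
                "file_text": "<!-- Tetris HTML goes here -->",
            },
        ),
        # 3. Open the file in the default browser (Linux opener assumed).
        ToolCall("bash", {"command": f"xdg-open {index_path}"}),
        # 4. Capture the rendered page with the computer tool.
        ToolCall("computer", {"action": "screenshot"}),
    ]


if __name__ == "__main__":
    for step, call in enumerate(plan_tetris_workflow("/tmp/tetris"), start=1):
        print(step, call.tool, call.params)
```

Keeping the plan as plain data separates deciding what to do from executing it, so the sequence can be inspected or logged before any tool actually runs.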