Commit 21e5583

Create README.md
1 parent b8243ff commit 21e5583

1 file changed

Lines changed: 96 additions & 0 deletions

@@ -0,0 +1,96 @@
# Computer Use Tools

This extension provides a set of tools that let an Agent interact with a computer environment much as a human user would. It covers screen interaction (mouse and keyboard), shell execution, and file editing. It is based on [Anthropic's computer use tools](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo).

## Available Tools

### 1. Computer Tool (`computer`)
Allows interaction with the screen, keyboard, and mouse.

**Capabilities:**
- **Mouse Interaction**: Move cursor, click (left, right, middle, double, triple), click & drag.
- **Keyboard Interaction**: Type text, press specific keys or key combinations.
- **Screen**: Take screenshots, get cursor position.
- **Zooming**: Zoom into specific regions of the screen (Action: `zoom`).

**Key Parameters:**
- `action`: The action to perform (e.g., `mouse_move`, `left_click`, `type`, `screenshot`, `zoom`).
- `coordinate`: `(x, y)` coordinates for mouse actions.
- `text`: Text to type.
- `key`: Key sequence to press (e.g., `Return`, `Control+c`).
- `region`: `(x0, y0, x1, y1)` region for zooming.

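To illustrate how the parameters above relate to each action, here is a minimal Python sketch that builds and validates a `computer` call. The function name `build_computer_call`, the dict layout, and the rule that `left_click` requires a coordinate are assumptions made for this example; the README does not specify the wire format the extension uses.

```python
from typing import Any, Optional, Sequence

# Action names are taken from the README; grouping them by the parameter
# they need is an assumption made for this sketch.
COORDINATE_ACTIONS = {"mouse_move", "left_click"}
NO_ARGUMENT_ACTIONS = {"screenshot"}


def build_computer_call(
    action: str,
    *,
    coordinate: Optional[Sequence[int]] = None,
    text: Optional[str] = None,
    region: Optional[Sequence[int]] = None,
) -> dict[str, Any]:
    """Return a dict describing one `computer` call, or raise ValueError."""
    call: dict[str, Any] = {"action": action}
    if action in COORDINATE_ACTIONS:
        if coordinate is None or len(coordinate) != 2:
            raise ValueError(f"{action} needs coordinate=(x, y)")
        call["coordinate"] = tuple(coordinate)
    elif action == "type":
        if not text:
            raise ValueError("type needs non-empty text")
        call["text"] = text
    elif action == "zoom":
        if region is None or len(region) != 4:
            raise ValueError("zoom needs region=(x0, y0, x1, y1)")
        x0, y0, x1, y1 = region
        # Reject empty or inverted rectangles before they reach the tool.
        if x1 <= x0 or y1 <= y0:
            raise ValueError("zoom region must satisfy x0 < x1 and y0 < y1")
        call["region"] = tuple(region)
    elif action not in NO_ARGUMENT_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return call
```

For example, `build_computer_call("zoom", region=(0, 0, 640, 360))` yields a call carrying only `action` and `region`, while an inverted region raises `ValueError` before anything is sent.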
### 2. Bash Tool (`bash`)
Provides a persistent shell session for executing command-line instructions.

**Capabilities:**
- **Execute Commands**: Run any bash command.
- **Persistent Session**: State (such as environment variables and the working directory) is preserved between calls within the same session.
- **Process Management**: The session can be restarted if needed.
- **Open Files/URLs**: The helper function `open` opens files or URLs with the system's default handler (`xdg-open`, `open`, or `start`).

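The README does not show how the `open` helper picks a handler, so the following is only a sketch of one plausible approach: choosing the command from `sys.platform`. The function name `opener_command` is hypothetical, and the function only builds the argument list without running it.

```python
import sys


def opener_command(target: str, platform: str = sys.platform) -> list[str]:
    """Return the argv that would open `target` with the default handler."""
    if platform.startswith("linux"):
        return ["xdg-open", target]
    if platform == "darwin":
        return ["open", target]
    if platform in ("win32", "cygwin"):
        # `start` is a cmd.exe builtin; the empty string fills its title slot
        # so that a quoted target is not mistaken for the window title.
        return ["cmd", "/c", "start", "", target]
    raise ValueError(f"no known opener for platform {platform!r}")
```

The returned list could then be passed to `subprocess.run`. Keeping command construction separate from execution makes the platform logic easy to check without opening anything.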
**Key Parameters:**
- `command`: The bash command to execute.
- `restart`: Boolean to restart the session.

36+
### 3. Edit Tool (`str_replace_editor`)
37+
A filesystem editor for viewing and modifying files.
38+
39+
**Capabilities:**
40+
- **View**: Read file contents or list directories.
41+
- **Create**: Create new files with content.
42+
- **String Replace**: Replace unique strings in a file (robust for LLM editing).
43+
- **Insert**: Insert text at specific line numbers.
44+
- **Undo**: Undo the last edit to a file.
45+
46+
**Key Parameters:**
47+
- `command`: The edit command (`view`, `create`, `str_replace`, `insert`, `undo_edit`).
48+
- `path`: Absolute path to the file or directory.
49+
- `file_text`: Content for file creation.
50+
- `old_str` / `new_str`: Strings for replacement.
51+
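The reason unique-string replacement suits LLM editing is that an ambiguous match is rejected instead of silently changing the wrong place. The sketch below shows that check together with a simple undo stack. The function names mirror the `str_replace` and `undo_edit` commands, but the in-memory `_history` dict and the error messages are assumptions of this example, not the extension's code.

```python
from pathlib import Path

# Hypothetical per-file undo stack: previous contents, most recent last.
_history: dict[Path, list[str]] = {}


def str_replace(path: Path, old_str: str, new_str: str) -> None:
    """Replace `old_str` with `new_str` only if `old_str` occurs exactly once."""
    if not old_str:
        raise ValueError("old_str must not be empty")
    text = path.read_text()
    count = text.count(old_str)
    if count != 1:
        # Zero matches or several matches are both refused, so the caller
        # has to supply enough context to pin down a single location.
        raise ValueError(f"old_str must match exactly once, found {count} matches")
    _history.setdefault(path, []).append(text)
    path.write_text(text.replace(old_str, new_str, 1))


def undo_edit(path: Path) -> None:
    """Restore the contents saved before the most recent edit to `path`."""
    stack = _history.get(path)
    if not stack:
        raise ValueError(f"no edit to undo for {path}")
    path.write_text(stack.pop())
```

A failed replacement leaves the file untouched, so the Agent can retry with a longer `old_str` that includes surrounding lines.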
---

## Capabilities & Workflows

These tools are designed to work together to enable complex end-to-end tasks. An Agent can act as a developer, tester, or general user.

### Example: "Build a Tetris web app in a tetris folder, open it then take a screenshot"

To achieve this high-level task, the Agent would sequence the tools as follows:

1. **Create the Project Structure**
    * **Tool**: `bash`
    * **Command**: `mkdir -p tetris`
    * *Result*: Creates the folder.

2. **Create the Application Files**
    * **Tool**: `str_replace_editor` (command: `create`)
    * **Path**: `/path/to/tetris/index.html`
    * **Content**: (HTML code for Tetris game)
    * *Result*: Writes the HTML file.

3. **Open the Application**
    * **Tool**: `bash`, either through the `open` helper or by calling the platform command directly.
    * **Command**: `xdg-open /path/to/tetris/index.html` (Linux), or serve the folder with `python -m http.server` and open `http://localhost:8000` in the browser.
    * *Result*: Opens the file in the default web browser.

4. **Wait & Verify**
    * **Tool**: `computer`
    * **Action**: `wait`, or `screenshot` to check whether the page has loaded.

5. **Take a Screenshot**
    * **Tool**: `computer`
    * **Action**: `screenshot`
    * *Result*: Captures the visual state of the running Tetris app for the user to see.

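The five steps above can be sketched as an ordered list of tool calls plus a small dispatcher. Tool names follow the headings under "Available Tools"; the `"tool"` key, the `tetris_plan` and `run_plan` functions, the placeholder HTML, and the handler callables are all hypothetical and only illustrate the sequencing.

```python
import shlex
from typing import Any, Callable

ToolCall = dict[str, Any]


def tetris_plan(project_dir: str) -> list[ToolCall]:
    """Return the tool calls for the Tetris example, in execution order."""
    index_html = f"{project_dir}/index.html"
    return [
        # 1. Create the project structure.
        {"tool": "bash", "command": f"mkdir -p {shlex.quote(project_dir)}"},
        # 2. Create the application file (placeholder content).
        {
            "tool": "str_replace_editor",
            "command": "create",
            "path": index_html,
            "file_text": "<!-- Tetris HTML goes here -->",
        },
        # 3. Open the application (Linux variant).
        {"tool": "bash", "command": f"xdg-open {shlex.quote(index_html)}"},
        # 4. Give the page time to load.
        {"tool": "computer", "action": "wait"},
        # 5. Capture the result.
        {"tool": "computer", "action": "screenshot"},
    ]


def run_plan(
    plan: list[ToolCall], handlers: dict[str, Callable[..., Any]]
) -> list[Any]:
    """Dispatch each call to the handler registered for its tool."""
    results = []
    for call in plan:
        args = {k: v for k, v in call.items() if k != "tool"}
        results.append(handlers[call["tool"]](**args))
    return results
```

Passing stub handlers that only record their arguments is a convenient way to check the ordering before wiring the plan to real tools.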
### How it handles the "Build a Tetris..." request:
When a user gives the command:
> "Build a Tetris web app in a tetris folder, open it then take a screenshot"

The Agent decomposes this into:
1. **"Build... in a tetris folder"** -> Uses `bash` to make the directory and `str_replace_editor` to write the `index.html` / `style.css` / `script.js` files.
2. **"Open it"** -> Uses `bash` to run a server or open the file in a browser.
3. **"Take a screenshot"** -> Uses `computer` to capture and verify the visual output.

This combination allows the Agent not just to generate code but to **verify** it visually and interactively, closing the loop on development tasks.
