Commit c8cbe70

Create computer_use.mdx
1 parent 482ab08 commit c8cbe70

1 file changed: 189 additions, 0 deletions
@@ -0,0 +1,189 @@
---
title: Computer Use
description: Enable AI agents to control your computer like a human - clicking, typing, running commands, and editing files.
---

Transform your AI agent into an autonomous computer operator. This built-in extension enables agents to see your screen, control the mouse and keyboard, execute shell commands, and edit files - just like a human user sitting at the computer.

Based on [Anthropic's computer use tools](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo), it brings full desktop automation capabilities to your AI workflows.

## Using Computer Use Tools

The built-in Computer Use tools are listed under the **computer_use** category on the **Tools** page, where you can execute them directly:

<Screenshot src="/img/tools/computer_use_exec.webp" />

The same page also shows the definitions of all other available Computer Use tools:

<Screenshot src="/img/tools/computer_use_list.webp" />

## Selecting Tools

Enable or disable individual tools based on your workflow needs:

<Screenshot src="/img/tools/computer_use_select.webp" />

## Why Computer Use?

Traditional AI tools operate through APIs and structured data. Computer Use breaks this barrier by letting agents interact with **any application** - web browsers, desktop apps, terminals, IDEs - exactly as you would. This unlocks automation for:

- **Legacy Applications**: Automate software that lacks APIs
- **Visual Verification**: Confirm that code actually renders correctly in a browser
- **End-to-End Workflows**: Chain together multiple applications in a single task
- **Interactive Testing**: Navigate UIs, fill forms, and verify results visually

## Key Capabilities

- **Visual Perception**: Capture screenshots to see what's on screen, and zoom into specific regions for detailed inspection
- **Mouse Control**: Move, click (single/double/triple), drag, and scroll anywhere on screen
- **Keyboard Input**: Type text, press key combinations, and execute shortcuts
- **Shell Execution**: Run any command in a persistent bash session with preserved state
- **File Operations**: View, create, and edit files, and undo changes

## Use Cases

### Web Development with Visual Verification

Ask an agent to build a web application, and it can write the code, launch a server, open the browser, and take a screenshot to show that it works:

<Screenshot src="/img/tools/computer_use_tetris.webp" />

The agent combines three tools in a single task:
1. **bash**: Create project directories and start a local server
2. **edit**: Write HTML, CSS, and JavaScript files
3. **computer**: Open the browser and capture the final result
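
The three steps above can be sketched as an ordered list of tool calls. This is a minimal illustration, not the extension's actual dispatch format: the `build_plan` helper, the `/tmp/demo` path, the port, and the file content are all assumptions, and only the parameter names (`command`, `path`, `file_text`, `action`, `duration`) come from the tool signatures documented under **Tools**.

```python
from typing import Any

# Hypothetical helper: returns (tool name, keyword arguments) pairs.
# Nothing here calls the real tools; it only shows the argument shapes.
def build_plan(project_dir: str, port: int = 8000) -> list[tuple[str, dict[str, Any]]]:
    return [
        # bash: create the project directory and move into it
        ("bash", {"command": f"mkdir -p {project_dir} && cd {project_dir}"}),
        # edit: write a page (the edit tool expects an absolute path)
        ("edit", {
            "command": "create",
            "path": f"{project_dir}/index.html",
            "file_text": "<!doctype html><title>Demo</title><h1>Hello</h1>",
        }),
        # bash: start a local server in the background (assumed command)
        ("bash", {"command": f"nohup python3 -m http.server {port} >/dev/null 2>&1 &"}),
        # computer: give the page time to load, then capture the screen.
        # Opening the browser itself is omitted; it depends on the desktop.
        ("computer", {"action": "wait", "duration": 2.0}),
        ("computer", {"action": "screenshot"}),
    ]


plan = build_plan("/tmp/demo")
for tool, kwargs in plan:
    print(tool, kwargs)
```

Keeping the plan as plain data makes it easy to inspect or log before anything runs on the machine.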

### Desktop Application Automation

Agents can operate any GUI application:
- Open applications, navigate menus, click buttons
- Fill out forms and dialog boxes
- Extract information from visual interfaces
- Automate repetitive desktop workflows

### System Administration

Execute and verify system operations:
- Run diagnostic commands and interpret output
- Edit configuration files with undo capability
- Verify changes by taking screenshots of system state
- Chain complex multi-step operations in a persistent shell

### Testing and QA

Visually validate that code works as expected:
- Navigate to specific URLs and verify page content
- Interact with UI elements and confirm behavior
- Capture screenshots for documentation or bug reports
- Inspect specific screen regions in detail with zoom

---

## Tools

### Computer

Interact with the screen, mouse, and keyboard to control your desktop environment.

| Action | Description |
|--------|-------------|
| `screenshot` | Capture the current screen state |
| `mouse_move` | Move cursor to specific coordinates |
| `left_click`, `right_click`, `middle_click` | Click at current or specified position |
| `double_click`, `triple_click` | Multi-click actions for text selection |
| `left_click_drag` | Click and drag to a target position |
| `left_mouse_down`, `left_mouse_up` | Press/release mouse button for complex interactions |
| `scroll` | Scroll in any direction by a specified amount |
| `type` | Type text at the current cursor position |
| `key` | Press key combinations (e.g., `Control+c`, `Return`) |
| `hold_key` | Hold a key down for a specified duration |
| `wait` | Pause execution for a specified duration |
| `cursor_position` | Get current cursor coordinates |
| `zoom` | Zoom into a specific screen region for detail |

```python
async def computer(
    action: Literal["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click",
                    "middle_click", "double_click", "left_mouse_down", "left_mouse_up", "scroll",
                    "hold_key", "wait", "triple_click", "screenshot", "cursor_position", "zoom"],
    text: Annotated[str | None, "The text to type or the key to press"] = None,
    coordinate: Annotated[
        tuple[int, int] | None,
        "(x, y): The x and y coordinates to move the mouse to"
    ] = None,
    scroll_direction: Literal["up", "down", "left", "right"] | None = None,
    scroll_amount: Annotated[int | None, "The number of lines to scroll"] = None,
    duration: Annotated[float | None, "Duration in seconds"] = None,
    key: Annotated[str | None, "The key sequence to press"] = None,
    region: Annotated[str | None, "(x0, y0, x1, y1): The region to zoom into"] = None,
) -> list[dict[str, Any]]
```
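
As a rough illustration of how a caller might assemble arguments for this signature, the sketch below checks an action name and parameter names against the lists shown above before building a keyword dictionary. The `computer_args` helper is hypothetical and not part of the extension; which parameters each action actually requires is not stated here, so the helper deliberately does not enforce that.

```python
from typing import Any

# Action and parameter names copied from the documented signature.
ACTIONS = {
    "key", "type", "mouse_move", "left_click", "left_click_drag", "right_click",
    "middle_click", "double_click", "left_mouse_down", "left_mouse_up", "scroll",
    "hold_key", "wait", "triple_click", "screenshot", "cursor_position", "zoom",
}
PARAMS = {
    "text", "coordinate", "scroll_direction", "scroll_amount",
    "duration", "key", "region",
}


def computer_args(action: str, **params: Any) -> dict[str, Any]:
    """Hypothetical helper: reject misspelled actions or parameters early."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    unknown = set(params) - PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"action": action, **params}


# Example argument sets; the coordinates are arbitrary.
click = computer_args("left_click", coordinate=(640, 360))
scroll = computer_args(
    "scroll", coordinate=(640, 360), scroll_direction="down", scroll_amount=3
)
print(click)
print(scroll)
```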

### Edit

A precision file editor designed for AI agents. Its main editing command, `str_replace`, matches on text rather than line numbers, which avoids line-number ambiguity.

| Command | Description |
|---------|-------------|
| `view` | Read file contents or list directory contents |
| `create` | Create a new file with specified content |
| `str_replace` | Replace a unique string with new content |
| `insert` | Insert text after a specific line number |
| `undo_edit` | Revert the last edit to a file |

View, create, and modify files with undo support.

```python
async def edit(
    command: Literal["view", "create", "str_replace", "insert", "undo_edit"],
    path: Annotated[str, "The absolute path to the file or directory"],
    file_text: Annotated[
        str | None,
        "The content to write to the file (required for create)"
    ] = None,
    view_range: Annotated[
        list[int] | None,
        "The range of lines to view (e.g. [1, 10])"
    ] = None,
    old_str: Annotated[
        str | None,
        "The string to replace (required for str_replace)"
    ] = None,
    new_str: Annotated[
        str | None,
        "The replacement string (required for str_replace and insert)"
    ] = None,
    insert_line: Annotated[
        int | None,
        "The line number after which to insert (required for insert)"
    ] = None,
) -> list[dict[str, Any]]
```
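
To make the `str_replace`, `insert`, and `undo_edit` semantics concrete, here is a toy in-memory model. It is a sketch under assumptions, not the extension's implementation: the `StringEditor` class is hypothetical, it works on a string rather than a file, and it keeps an unlimited undo stack, whereas the real tool may only revert the most recent edit.

```python
class StringEditor:
    """Toy model of string-based editing with undo (illustrative only)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._history: list[str] = []  # previous versions, newest last

    def str_replace(self, old_str: str, new_str: str) -> None:
        # The match must be unique so the edit target is unambiguous.
        if not old_str:
            raise ValueError("old_str must not be empty")
        count = self.text.count(old_str)
        if count != 1:
            raise ValueError(f"old_str must match exactly once, found {count}")
        self._history.append(self.text)
        self.text = self.text.replace(old_str, new_str, 1)

    def insert(self, insert_line: int, new_str: str) -> None:
        # Insert after the given 1-based line; 0 inserts at the top.
        lines = self.text.split("\n")
        if not 0 <= insert_line <= len(lines):
            raise ValueError(f"insert_line out of range: {insert_line}")
        self._history.append(self.text)
        lines[insert_line:insert_line] = new_str.split("\n")
        self.text = "\n".join(lines)

    def undo_edit(self) -> None:
        if not self._history:
            raise ValueError("nothing to undo")
        self.text = self._history.pop()


editor = StringEditor("a = 1\nb = 2\n")
editor.str_replace("b = 2", "b = 3")
editor.insert(1, "c = 0")
print(editor.text)
editor.undo_edit()
print(editor.text)
```

Requiring exactly one match is what lets an agent edit by quoting text instead of counting lines: an ambiguous or stale quote fails loudly instead of changing the wrong place.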

### Bash

Execute commands in a persistent shell session where environment variables, working directory, and state are preserved between calls.

| Feature | Description |
|---------|-------------|
| Command execution | Run any bash command |
| Persistent state | Working directory and variables persist |
| Session restart | Reset the shell environment when needed |
| Cross-platform open | Launch files/URLs with system default handler |

```python
async def run_bash(
    command: Annotated[str | None, "Command to run"],
    restart: Annotated[bool, "Restart the bash session"] = False,
) -> list[dict[str, Any]]
```
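
One common way to get a persistent session like this is to keep a single `bash` process alive and delimit each command's output with a unique marker. The sketch below shows that general technique using Python's standard `subprocess` module. The `PersistentShell` class is hypothetical, assumes `bash` is on the `PATH`, has no timeout handling, and is not necessarily how this extension implements its session.

```python
import subprocess
import uuid


class PersistentShell:
    """Illustrative long-lived bash process; state survives between run() calls."""

    def __init__(self) -> None:
        self._start()

    def _start(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            bufsize=1,
        )

    def run(self, command: str) -> str:
        # Echo a unique marker after the command so we know where output ends.
        marker = f"__done_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        chunks: list[str] = []
        while True:
            line = self.proc.stdout.readline()
            if not line:  # shell exited (for example after `exit`)
                break
            stripped = line.rstrip("\n")
            if stripped.endswith(marker):
                # Keep any output that shared a line with the marker.
                chunks.append(stripped[: -len(marker)])
                break
            chunks.append(line)
        return "".join(chunks)

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()

    def restart(self) -> None:
        # A fresh process discards variables and the working directory.
        self.close()
        self._start()


shell = PersistentShell()
shell.run("export DEMO_STATE_VAR=hello")
print(shell.run("echo $DEMO_STATE_VAR"), end="")
shell.restart()
print(shell.run('echo "${DEMO_STATE_VAR:-unset}"'), end="")
shell.close()
```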

### Open File or URL

Open a URL or file using the system's default handler (`xdg-open` on Linux, `open` on macOS, `start` on Windows).

```python
async def open(
    target: Annotated[str, "URL or file path to open"]
) -> list[dict[str, Any]]
```
