Commit d7a9979

Merge pull request #36 from askui/add-locators
Add locators
2 parents 4de1bb1 + a19e477 commit d7a9979

70 files changed

Lines changed: 6109 additions & 821 deletions

README.md

Lines changed: 179 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -88,7 +88,7 @@ pip install askui
8888
| | AskUI [INFO](https://hub.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
8989
|----------|----------|----------|
9090
| ENV Variables | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN` | `ANTHROPIC_API_KEY` |
91-
| Supported Commands | `click()` | `click()`, `get()`, `act()` |
91+
| Supported Commands | `click()`, `get()`, `locate()`, `mouse_move()` | `act()`, `click()`, `get()`, `locate()`, `mouse_move()` |
9292
| Description | Faster Inference, European Server, Enterprise Ready | Supports complex actions |
9393

9494
To get started, set the environment variables required to authenticate with your chosen model provider.
@@ -130,7 +130,7 @@ You can test the Vision Agent with Huggingface models via their Spaces API. Plea
130130

131131
**Example Code:**
132132
```python
133-
agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
133+
agent.click("search field", model="OS-Copilot/OS-Atlas-Base-7B")
134134
```
135135

136136
### 3c. Host your own **AI Models**
@@ -143,7 +143,7 @@ You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoi
143143

144144
2. Step: Provide the `TARS_URL` and `TARS_API_KEY` environment variables to Vision Agent.
145145

146-
3. Step: Use the `model_name="tars"` parameter in your `click()`, `get()` and `act()` commands.
146+
3. Step: Use the `model="tars"` parameter in your `click()`, `get()`, `act()` etc. commands, or when initializing the `VisionAgent`.
147147

148148

149149
## ▶️ Start Building
@@ -171,46 +171,68 @@ with VisionAgent() as agent:
171171

172172
### 🎛️ Model Selection
173173

174-
Instead of relying on the default model for the entire automation script, you can specify a model for each `click` command using the `model_name` parameter.
174+
Instead of relying on the default model for the entire automation script, you can specify a model for each command (`click()`, `act()`, `get()` etc.) using the `model` parameter, or set one for all commands when initializing the `VisionAgent`. A `model` passed to an individual command overrides the one set on the agent.
175175

176176
| | AskUI | Anthropic |
177177
|----------|----------|----------|
178-
| `click()` | `askui-combo`, `askui-pta`, `askui-ocr` | `anthropic-claude-3-5-sonnet-20241022` |
178+
| `act()` | | `anthropic-claude-3-5-sonnet-20241022` |
179+
| `click()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
180+
| `get()` | `askui` | `anthropic-claude-3-5-sonnet-20241022` |
181+
| `locate()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
182+
| `mouse_move()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
179183

180-
**Example:** `agent.click("Preview", model_name="askui-combo")`
181184

182-
<details>
183-
<summary>Antrophic AI Models</summary>
184-
185-
Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
186-
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
187-
|-------------|--------------------|--------------|--------------|--------------|--------------|
188-
| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Antrophic is a Large Action Model (LAM), which can autonomously achieve goals. e.g. `"Book me a flight from Berlin to Rom"` | slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
189-
> **Note:** Configure your Antrophic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
185+
**Example:**
190186

187+
```python
188+
from askui import VisionAgent
191189

192-
</details>
190+
with VisionAgent() as agent:
191+
# Uses the default model (depending on the environment variables set, see above)
192+
agent.click("Next")
193+
194+
with VisionAgent(model="askui-combo") as agent:
195+
# Uses the "askui-combo" model because it was specified when initializing the agent
196+
agent.click("Next")
197+
# Uses the "anthropic-claude-3-5-sonnet-20241022" model
198+
agent.click("Previous", model="anthropic-claude-3-5-sonnet-20241022")
199+
# Uses the "askui-combo" model again as no model was specified
200+
agent.click("Next")
201+
```
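
The precedence described above (a per-command `model` wins over the agent-level `model`, which wins over the default) can be sketched in plain Python. This is only an illustration of the rule, not the library's implementation: `resolve_model` and `DEFAULT_MODEL` are hypothetical names, and the real default depends on which environment variables are set.

```python
from typing import Optional

# Hypothetical fallback for illustration; the actual default depends on the
# environment variables that are set.
DEFAULT_MODEL = "askui"


def resolve_model(
    command_model: Optional[str],
    agent_model: Optional[str],
    default_model: str = DEFAULT_MODEL,
) -> str:
    """Return the model a single command would use under the documented precedence."""
    if command_model is not None:
        # A model passed to the individual command always wins.
        return command_model
    if agent_model is not None:
        # Otherwise fall back to the model given when initializing the agent.
        return agent_model
    # Otherwise use the default model.
    return default_model
```

Under this sketch, `resolve_model(None, "askui-combo")` yields `"askui-combo"`, matching the second `agent.click("Next")` call in the example.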
193202

194203
<details>
195204
<summary>AskUI AI Models</summary>
196205

197-
Supported commands are: `click()`, `type()`, `mouse_move()`
206+
Supported commands are: `click()`, `locate()`, `mouse_move()`
198207
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
199208
|-------------|--------------------|--------------|--------------|--------------|--------------|
209+
| `askui` | `AskUI` is a combination of all the following models: `askui-pta`, `askui-ocr`, `askui-combo`, `askui-ai-element` where AskUI chooses the best model for the task depending on the input. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be (at least partially) retrained |
200210
| `askui-pta` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) which can address all kinds of UI elements by a textual description e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
201211
| `askui-ocr` | `AskUI OCR` is an OCR model trained to address texts on UI Screens e.g. "`Login`", "`Search`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | low, <0,05$ per step | Recommended for production usage, can be retrained |
202212
| `askui-combo` | AskUI Combo is a combination of the `askui-pta` and `askui-ocr` models to improve accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
203-
| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you looking for. Therefore, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, determinitic behaviour |
213+
| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. To do so, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, deterministic behaviour |
204214

205215
> **Note:** Configure your AskUI Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
206216
217+
</details>
218+
219+
<details>
220+
<summary>Anthropic AI Models</summary>
221+
222+
Supported commands are: `act()`, `get()`, `click()`, `locate()`, `mouse_move()`
223+
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
224+
|-------------|--------------------|--------------|--------------|--------------|--------------|
225+
| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"` | Slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
226+
> **Note:** Configure your Anthropic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
227+
228+
207229
</details>
208230

209231

210232
<details>
211233
<summary>Huggingface AI Models (Spaces API)</summary>
212234

213-
Supported commands are: `click()`, `type()`, `mouse_move()`
235+
Supported commands are: `click()`, `locate()`, `mouse_move()`
214236
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
215237
|-------------|--------------------|--------------|--------------|--------------|--------------|
216238
| `AskUI/PTA-1` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) which can address all kinds of UI elements by a textual description e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Huggingface hosted | Prices for Huggingface hosting | Not recommended for production applications |
@@ -226,7 +248,7 @@ Supported commands are: `click()`, `type()`, `mouse_move()`
226248
<details>
227249
<summary>Self Hosted UI Models</summary>
228250

229-
Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
251+
Supported commands are: `click()`, `locate()`, `mouse_move()`, `get()`, `act()`
230252
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
231253
|-------------|--------------------|--------------|--------------|--------------|--------------|
232254
| `tars` | [`UI-Tars`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data e.g. "`Book me a flight to Rome`" | Slow, >1s per step | Self-hosted | Depending on infrastructure | Out-of-the-box not recommended for production usage |
@@ -269,26 +291,160 @@ agent.tools.clipboard.copy("...")
269291
result = agent.tools.clipboard.paste()
270292
```
271293

272-
### 📜 Logging & Reporting
294+
### 📜 Logging
273295

274-
You want a better understanding of what you agent is doing? Set the `log_level` to DEBUG. You can also generate a report of the automation run by setting `enable_report` to `True`.
296+
You want a better understanding of what your agent is doing? Set the `log_level` to `logging.DEBUG`.
275297

276298
```python
277299
import logging
278300

279-
with VisionAgent(log_level=logging.DEBUG, enable_report=True) as agent:
301+
with VisionAgent(log_level=logging.DEBUG) as agent:
302+
agent...
303+
```
304+
305+
### 📜 Reporting
306+
307+
You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.
308+
309+
```python
310+
from typing import Optional, Union
311+
from typing_extensions import override
312+
from askui.reporting import SimpleHtmlReporter
313+
from PIL import Image
314+
315+
with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
316+
agent...
317+
```
318+
319+
You can also create your own reporter by implementing the `Reporter` interface.
320+
321+
```python
322+
from askui.reporting import Reporter
323+
324+
class CustomReporter(Reporter):
325+
@override
326+
def add_message(
327+
self,
328+
role: str,
329+
content: Union[str, dict, list],
330+
image: Optional[Image.Image] = None,
331+
) -> None:
332+
# adding message to the report (see implementation of `SimpleHtmlReporter` as an example)
333+
pass
334+
335+
@override
336+
def generate(self) -> None:
337+
# generate the report if not generated live (see implementation of `SimpleHtmlReporter` as an example)
338+
pass
339+
340+
341+
with VisionAgent(reporters=[CustomReporter()]) as agent:
342+
agent...
343+
```
344+
345+
You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.
346+
347+
```python
348+
with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
280349
agent...
281350
```
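
The ordering guarantee can be illustrated without the library. The following is a minimal sketch, assuming reporters are simply iterated in list order for each call; `RecordingReporter`, `broadcast_message` and `broadcast_generate` are hypothetical names for illustration and are not part of `askui.reporting`.

```python
from typing import List


class RecordingReporter:
    """Hypothetical stand-in for a reporter; it only records calls in a shared log."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def add_message(self, role: str, content: str) -> None:
        self.log.append(f"{self.name}.add_message({role})")

    def generate(self) -> None:
        self.log.append(f"{self.name}.generate")


def broadcast_message(reporters: List[RecordingReporter], role: str, content: str) -> None:
    # Forward the message to every reporter, in the order of the list.
    for reporter in reporters:
        reporter.add_message(role, content)


def broadcast_generate(reporters: List[RecordingReporter]) -> None:
    # Ask every reporter to generate its report, in the order of the list.
    for reporter in reporters:
        reporter.generate()


calls: List[str] = []
reporters = [RecordingReporter("html", calls), RecordingReporter("custom", calls)]
broadcast_message(reporters, "User", "click Next")
broadcast_generate(reporters)
print(calls)
```

In this sketch, the first reporter in the list always sees each message before the second one does, which matters if a later reporter depends on side effects of an earlier one.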
282351

283352
### 🖥️ Multi-Monitor Support
284353

285-
You have multiple monitors? Choose which one to automate by setting `display` to 1 or 2.
354+
You have multiple monitors? Choose which one to automate by setting `display` to `1`, `2` etc. To find the correct display, you currently have to try different values; we are going to improve this soon. By default, the agent uses display 1.
286355

287356
```python
288357
with VisionAgent(display=1) as agent:
289358
agent...
290359
```
291360

361+
### 🎯 Locating elements
362+
363+
If you have a hard time locating (clicking, moving mouse to etc.) elements by simply using text, e.g.,
364+
365+
```python
366+
agent.click("Password textfield")
367+
agent.type("********")
368+
```
369+
370+
you can build more sophisticated locators.
371+
372+
**⚠️ Warning:** Support can vary depending on the model you are using. Currently, the `askui` model provides the best support for locators. This model is chosen by default if the `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` environment variables are set and it is not overridden using the `model` parameter.
373+
374+
Example:
375+
376+
```python
377+
from askui import locators as loc
378+
379+
password_textfield_label = loc.Text("Password")
380+
password_textfield = loc.Element("textfield").right_of(password_textfield_label)
381+
382+
agent.click(password_textfield)
383+
agent.type("********")
384+
```
385+
386+
### 📊 Extracting information
387+
388+
The `get()` method allows you to extract information from the screen. You can use it to:
389+
390+
- Get text or data from the screen
391+
- Check the state of UI elements
392+
- Make decisions based on screen content
393+
- Analyze static images
394+
395+
#### Basic usage
396+
397+
```python
398+
# Get text from screen
399+
url = agent.get("What is the current url shown in the url bar?")
400+
print(url) # e.g., "github.com/login"
401+
402+
# Check UI state
403+
# Just as an example, may be flaky if used as is, better use a response schema to check for a boolean value (see below)
404+
is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes"
405+
if is_logged_in:
406+
agent.click("Logout")
407+
else:
408+
agent.click("Login")
409+
```
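
As the comment in the example notes, comparing the raw answer with `"yes"` can be flaky, for instance when the model replies `"Yes."`. If a response schema is not an option, one possible mitigation is to normalize the answer before comparing. This is a minimal sketch; `parse_yes_no` is a hypothetical helper and not part of the library.

```python
def parse_yes_no(answer: str) -> bool:
    """Map a free-text yes/no answer to a boolean, tolerating case, quotes and punctuation."""
    # Drop surrounding whitespace, then surrounding quotes and sentence punctuation.
    normalized = answer.strip().strip(".!'\"").lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    # Fail loudly instead of silently treating an unexpected answer as "no".
    raise ValueError(f"Expected 'yes' or 'no', got: {answer!r}")
```

It could then be used as `is_logged_in = parse_yes_no(agent.get("Is the user logged in? Answer with 'yes' or 'no'."))`, assuming `get()` returns a string here, as in the example above.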
410+
411+
#### Using custom images
412+
413+
Instead of taking a screenshot, you can analyze specific images:
414+
415+
```python
416+
from PIL import Image
417+
418+
# From PIL Image
419+
image = Image.open("screenshot.png")
420+
result = agent.get("What's in this image?", image)
421+
422+
# From file path
423+
result = agent.get("What's in this image?", "screenshot.png")
424+
```
425+
426+
#### Using response schemas
427+
428+
For structured data extraction, use Pydantic models extending `ResponseSchemaBase`:
429+
430+
```python
431+
from askui import ResponseSchemaBase
432+
433+
class UserInfo(ResponseSchemaBase):
434+
username: str
435+
is_online: bool
436+
437+
# Get structured data
438+
user_info = agent.get(
439+
"What is the username and online status?",
440+
response_schema=UserInfo
441+
)
442+
print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}")
443+
```
444+
445+
**⚠️ Limitations:**
446+
- Nested Pydantic schemas are not currently supported
447+
- Response schemas are currently only supported by the `askui` model (the default model if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set)
292448

293449
## What is AskUI Vision Agent?
294450

pdm.lock

Lines changed: 27 additions & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

pyproject.toml

Lines changed: 5 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -39,9 +39,10 @@ path = "src/askui/__init__.py"
3939
distribution = true
4040

4141
[tool.pdm.scripts]
42-
test = "pytest"
43-
"test:unit" = "pytest tests/unit"
44-
"test:integration" = "pytest tests/integration"
42+
test = "pytest -n auto"
43+
"test:e2e" = "pytest -n auto tests/e2e"
44+
"test:integration" = "pytest -n auto tests/integration"
45+
"test:unit" = "pytest -n auto tests/unit"
4546
sort = "isort ."
4647
format = "black ."
4748
lint = "ruff check ."
@@ -56,6 +57,7 @@ test = [
5657
"black>=25.1.0",
5758
"ruff>=0.9.5",
5859
"pytest-mock>=3.14.0",
60+
"pytest-xdist>=3.6.1",
5961
]
6062
chat = [
6163
"streamlit>=1.42.0",

src/askui/__init__.py

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -3,7 +3,19 @@
33
__version__ = "0.2.5"
44

55
from .agent import VisionAgent
6+
from .models.router import ModelRouter
7+
from .models.types.response_schemas import ResponseSchema, ResponseSchemaBase
8+
from .tools.toolbox import AgentToolbox
9+
from .tools.agent_os import AgentOs, ModifierKey, PcKey
10+
611

712
__all__ = [
13+
"AgentOs",
14+
"AgentToolbox",
15+
"ModelRouter",
16+
"ModifierKey",
17+
"PcKey",
18+
"ResponseSchema",
19+
"ResponseSchemaBase",
820
"VisionAgent",
921
]
