@@ -143,7 +143,7 @@ You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoi
143
143
144
144
2. Step: Provide the `TARS_URL` and `TARS_API_KEY` environment variables to Vision Agent.
145
145
146
-
3. Step: Use the `model_name="tars"` parameter in your `click()`, `get()` and `act()` commands.
146
+
3. Step: Use the `model="tars"` parameter in your `click()`, `get()`, `act()` etc. commands or when initializing the `VisionAgent`.
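
Before selecting the `tars` model, it can help to confirm that both environment variables from step 2 are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_tars_vars` is hypothetical and not part of Vision Agent.

```python
import os

# The two variables named in step 2 above.
REQUIRED_TARS_VARS = ("TARS_URL", "TARS_API_KEY")


def missing_tars_vars(env=None):
    """Return the required UI-TARS variables that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TARS_VARS if not env.get(name)]


# Hypothetical usage before creating the agent:
# missing = missing_tars_vars()
# if missing:
#     raise SystemExit("Please set: " + ", ".join(missing))
```

An empty list means both variables are present, so `model="tars"` can be passed as described in step 3.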
147
147
148
148
149
149
## ▶️ Start Building
@@ -171,46 +171,68 @@ with VisionAgent() as agent:
171
171
172
172
### 🎛️ Model Selection
173
173
174
-
Instead of relying on the default model for the entire automation script, you can specify a model for each `click` command using the `model_name` parameter.
174
+
Instead of relying on the default model for the entire automation script, you can specify a model for each `click()` (or `act()`, `get()` etc.) command using the `model` parameter, or set one when initializing the `VisionAgent`. A `model` passed to an individual command overrides the one set on the agent.
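
The precedence described here (command over agent over default) can be sketched as a small function. This is an illustration of the rule only, not how Vision Agent implements it; `resolve_model` and its parameter names are hypothetical.

```python
from typing import Optional


def resolve_model(
    command_model: Optional[str],
    agent_model: Optional[str],
    default_model: str,
) -> str:
    """Pick the model for one command.

    Order of precedence: the `model` given to the command, then the
    `model` given when initializing the agent, then the default.
    """
    if command_model is not None:
        return command_model
    if agent_model is not None:
        return agent_model
    return default_model
```

For example, with an agent initialized with `model="askui-combo"`, a command that passes `model="tars"` would resolve to `"tars"`, while a command without a `model` would resolve to `"askui-combo"`.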
|`anthropic-claude-3-5-sonnet-20241022`| The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"`| Slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
189
-
> **Note:** Configure your Anthropic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
185
+
**Example:**
190
186
187
+
```python
188
+
from askui import VisionAgent
191
189
192
-
</details>
190
+
with VisionAgent() as agent:
191
+
    # Uses the default model (depending on the environment variables set, see above)
192
+
    agent.click("Next")
193
+
194
+
with VisionAgent(model="askui-combo") as agent:
195
+
    # Uses the "askui-combo" model because it was specified when initializing the agent
196
+
    agent.click("Next")
197
+
    # Uses the "anthropic-claude-3-5-sonnet-20241022" model
|`askui`|`AskUI` is a combination of all the following models: `askui-pta`, `askui-ocr`, `askui-combo`, `askui-ai-element` where AskUI chooses the best model for the task depending on the input. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be (at least partially) retrained |
200
210
|`askui-pta`|[`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
201
211
|`askui-ocr`|`AskUI OCR` is an OCR model trained to address text on UI screens, e.g. "`Login`", "`Search`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
202
212
|`askui-combo`| AskUI Combo is a combination of the `askui-pta` and `askui-ocr` models to improve accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
203
-
|`askui-ai-element`|[AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you looking for. Therefore, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, determinitic behaviour |
213
+
|`askui-ai-element`|[AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. To do so, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, deterministic behaviour |
204
214
205
215
> **Note:** Configure your AskUI Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
|`anthropic-claude-3-5-sonnet-20241022`| The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"`| Slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
226
+
> **Note:** Configure your Anthropic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
227
+
228
+
207
229
</details>
208
230
209
231
210
232
<details>
211
233
<summary>Huggingface AI Models (Spaces API)</summary>
|`AskUI/PTA-1`|[`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Huggingface hosted | Prices for Huggingface hosting | Not recommended for production applications |
|`tars`|[`UI-TARS`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data, e.g. "`Book me a flight to Rome`" | Slow, >1s per step | Self-hosted | Depending on infrastructure | Not recommended for production usage out of the box |
You want a better understanding of what your agent is doing? Set the `log_level` to DEBUG. You can also generate a report of the automation run by setting `enable_report` to `True`.
296
+
You want a better understanding of what your agent is doing? Set the `log_level` to DEBUG.
275
297
276
298
```python
277
299
import logging
278
300
279
-
with VisionAgent(log_level=logging.DEBUG, enable_report=True) as agent:
301
+
with VisionAgent(log_level=logging.DEBUG) as agent:
302
+
    agent...
303
+
```
304
+
305
+
### 📜 Reporting
306
+
307
+
You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.
308
+
309
+
```python
310
+
from typing import Optional, Union
311
+
from typing_extensions import override
312
+
from askui.reporting import SimpleHtmlReporter
313
+
from PIL import Image
314
+
315
+
with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
316
+
    agent...
317
+
```
318
+
319
+
You can also create your own reporter by implementing the `Reporter` interface.
320
+
321
+
```python
322
+
from askui.reporting import Reporter
323
+
324
+
class CustomReporter(Reporter):
325
+
    @override
326
+
    def add_message(
327
+
        self,
328
+
        role: str,
329
+
        content: Union[str, dict, list],
330
+
        image: Optional[Image.Image] = None,
331
+
    ) -> None:
332
+
        # adding message to the report (see implementation of `SimpleHtmlReporter` as an example)
333
+
        pass
334
+
335
+
    @override
336
+
    def generate(self) -> None:
337
+
        # generate the report if not generated live (see implementation of `SimpleHtmlReporter` as an example)
338
+
        pass
339
+
340
+
341
+
with VisionAgent(reporters=[CustomReporter()]) as agent:
342
+
    agent...
343
+
```
344
+
345
+
You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.
346
+
347
+
```python
348
+
with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
280
349
    agent...
281
350
```
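
To illustrate the calling order described above, here is a minimal, self-contained sketch that does not import `askui`. `RecordingReporter` and `FanOutReporter` are hypothetical stand-ins written for this example; they only mirror the `add_message()` and `generate()` method names of the `Reporter` interface and are not the library's implementation.

```python
class RecordingReporter:
    """Stand-in reporter that records each call in a shared list."""

    def __init__(self, name, calls):
        self.name = name
        self.calls = calls

    def add_message(self, role, content, image=None):
        self.calls.append((self.name, "add_message", role))

    def generate(self):
        self.calls.append((self.name, "generate"))


class FanOutReporter:
    """Forwards every call to each reporter, in list order."""

    def __init__(self, reporters):
        self._reporters = list(reporters)

    def add_message(self, role, content, image=None):
        for reporter in self._reporters:
            reporter.add_message(role, content, image)

    def generate(self):
        for reporter in self._reporters:
            reporter.generate()


calls = []
fan_out = FanOutReporter(
    [RecordingReporter("first", calls), RecordingReporter("second", calls)]
)
fan_out.add_message("User", "click Next")
fan_out.generate()
# `calls` now lists "first" before "second" for each method.
```

The order of the list therefore matters whenever one reporter depends on output written by another.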
282
351
283
352
### 🖥️ Multi-Monitor Support
284
353
285
-
You have multiple monitors? Choose which one to automate by setting `display` to 1 or 2.
354
+
You have multiple monitors? Choose which one to automate by setting `display` to `1`, `2` etc. To find the correct display or monitor, you may have to experiment with different values. We are going to improve this soon. By default, the agent will use display 1.
286
355
287
356
```python
288
357
with VisionAgent(display=1) as agent:
289
358
    agent...
290
359
```
291
360
361
+
### 🎯 Locating elements
362
+
363
+
If you have a hard time locating (clicking, moving mouse to etc.) elements by simply using text, e.g.,
364
+
365
+
```python
366
+
agent.click("Password textfield")
367
+
agent.type("********")
368
+
```
369
+
370
+
you can build more sophisticated locators.
371
+
372
+
**⚠️ Warning:** Support can vary depending on the model you are using. Currently, the `askui` model provides the best support for locators. This model is chosen by default if the `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` environment variables are set and it is not overridden using the `model` parameter.