askui
diff --git a/README.md b/README.md
Lines changed: 179 additions & 23 deletions
diff --git a/pdm.lock b/pdm.lock
Lines changed: 27 additions & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 5 additions & 3 deletions
diff --git a/src/askui/__init__.py b/src/askui/__init__.py
Lines changed: 12 additions & 0 deletions
@@ -88,7 +88,7 @@ pip install askui
 |  | AskUI [INFO](https://hub.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
 |----------|----------|----------|
 | ENV Variables    | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN`   | `ANTHROPIC_API_KEY`   |
-| Supported Commands    | `click()`   | `click()`, `get()`, `act()`   |
+| Supported Commands    | `click()`, `get()`, `locate()`, `mouse_move()`   | `act()`, `click()`, `get()`, `locate()`, `mouse_move()`  |
 | Description    | Faster Inference, European Server, Enterprise Ready   | Supports complex actions   |
 
 To get started, set the environment variables required to authenticate with your chosen model provider.
@@ -130,7 +130,7 @@ You can test the Vision Agent with Huggingface models via their Spaces API. Plea
 
 **Example Code:**
 ```python
-agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
+agent.click("search field", model="OS-Copilot/OS-Atlas-Base-7B")
 ```
 
 ### 3c. Host your own **AI Models**
@@ -143,7 +143,7 @@ You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoi
 
 2. Step: Provide the `TARS_URL` and `TARS_API_KEY` environment variables to Vision Agent.
 
-3. Step: Use the `model_name="tars"` parameter in your `click()`, `get()` and `act()` commands.
+3. Step: Use the `model="tars"` parameter in your `click()`, `get()`, `act()`, etc. commands or when initializing the `VisionAgent`.
 
 
 ## ▶️ Start Building
@@ -171,46 +171,68 @@ with VisionAgent() as agent:
 
 ### 🎛️ Model Selection
 
-Instead of relying on the default model for the entire automation script, you can specify a model for each `click` command using the `model_name` parameter.
+Instead of relying on the default model for the entire automation script, you can specify a model for each `click()` (or `act()`, `get()`, etc.) command using the `model` parameter. You can also set a model when initializing the `VisionAgent`; the `model` parameter of an individual command overrides it.
 
 |  | AskUI | Anthropic |
 |----------|----------|----------|
-| `click()`    | `askui-combo`, `askui-pta`, `askui-ocr`   | `anthropic-claude-3-5-sonnet-20241022`   |
+| `act()`    | | `anthropic-claude-3-5-sonnet-20241022`   |
+| `click()`    | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022`   |
+| `get()`    | `askui` | `anthropic-claude-3-5-sonnet-20241022`   |
+| `locate()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element`   | `anthropic-claude-3-5-sonnet-20241022` |
+| `mouse_move()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element`   | `anthropic-claude-3-5-sonnet-20241022` |
 
-**Example:** `agent.click("Preview", model_name="askui-combo")`
 
-<details>
-  <summary>Antrophic AI Models</summary>
-
-Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
-| Model Name  | Info | Execution Speed | Security | Cost | Reliability | 
-|-------------|--------------------|--------------|--------------|--------------|--------------|
-| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Antrophic is a Large Action Model (LAM), which can autonomously achieve goals. e.g. `"Book me a flight from Berlin to Rom"` | slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
-> **Note:** Configure your Antrophic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
+**Example:** 
 
+```python
+from askui import VisionAgent
 
-</details>
+with VisionAgent() as agent:
+    # Uses the default model (depending on the environment variables set, see above)
+    agent.click("Next")
+
+with VisionAgent(model="askui-combo") as agent:
+    # Uses the "askui-combo" model because it was specified when initializing the agent
+    agent.click("Next")
+    # Uses the "anthropic-claude-3-5-sonnet-20241022" model
+    agent.click("Previous", model="anthropic-claude-3-5-sonnet-20241022")
+    # Uses the "askui-combo" model again as no model was specified
+    agent.click("Next")
+```
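
The precedence described above (command-level `model` over agent-level `model` over the default) can be sketched without any AskUI dependency. `SketchAgent` and `DEFAULT_MODEL` are hypothetical stand-ins introduced only for illustration; they are not part of the `askui` package, and the real default depends on the environment variables that are set.

```python
from typing import Optional

# Assumption: placeholder for whichever default the environment variables select.
DEFAULT_MODEL = "askui"


class SketchAgent:
    """Hypothetical stand-in for VisionAgent that only models `model` precedence."""

    def __init__(self, model: Optional[str] = None) -> None:
        self._model = model

    def __enter__(self) -> "SketchAgent":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        return None

    def click(self, locator: str, model: Optional[str] = None) -> str:
        # The command-level model wins over the agent-level model,
        # which in turn wins over the default. Nothing is clicked here;
        # the method only reports which model would be used.
        return model or self._model or DEFAULT_MODEL


with SketchAgent(model="askui-combo") as agent:
    first = agent.click("Next")
    second = agent.click("Previous", model="anthropic-claude-3-5-sonnet-20241022")
    third = agent.click("Next")
```

Here `first` and `third` resolve to the agent-level model, while `second` resolves to the model passed to that single command.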
 
 <details>
   <summary>AskUI AI Models</summary>
 
-Supported commands are: `click()`, `type()`, `mouse_move()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`
 | Model Name  | Info | Execution Speed | Security | Cost | Reliability | 
 |-------------|--------------------|--------------|--------------|--------------|--------------|
+| `askui` | `AskUI` is a combination of all the following models: `askui-pta`, `askui-ocr`, `askui-combo`, `askui-ai-element` where AskUI chooses the best model for the task depending on the input. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be (at least partially) retrained |
 | `askui-pta` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
 | `askui-ocr` | `AskUI OCR` is an OCR model trained to address texts on UI Screens e.g. "`Login`", "`Search`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | low, <0,05$ per step | Recommended for production usage, can be retrained |
 | `askui-combo` | AskUI Combo is a combination of the `askui-pta` and `askui-ocr` models to improve accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
-| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you looking for. Therefore, you have to crop out the element and give it a name.  | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, determinitic behaviour |
+| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. To do so, you have to crop out the element and give it a name.  | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, deterministic behaviour |
 
 > **Note:** Configure your AskUI Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
 
+</details>
+
+<details>
+  <summary>Anthropic AI Models</summary>
+
+Supported commands are: `act()`, `get()`, `click()`, `locate()`, `mouse_move()`
+| Model Name  | Info | Execution Speed | Security | Cost | Reliability | 
+|-------------|--------------------|--------------|--------------|--------------|--------------|
+| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"` | Slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
+> **Note:** Configure your Anthropic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
+
+
 </details>
 
 
 <details>
   <summary>Huggingface AI Models (Spaces API)</summary>
 
-Supported commands are: `click()`, `type()`, `mouse_move()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`
 | Model Name  | Info | Execution Speed | Security | Cost | Reliability | 
 |-------------|--------------------|--------------|--------------|--------------|--------------|
 | `AskUI/PTA-1` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | Fast, <500ms per step | Huggingface hosted | Prices for Huggingface hosting | Not recommended for production applications |
@@ -226,7 +248,7 @@ Supported commands are: `click()`, `type()`, `mouse_move()`
 <details>
   <summary>Self Hosted UI Models</summary>
 
-Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`, `get()`, `act()`
 | Model Name  | Info | Execution Speed |  Security | Cost | Reliability | 
 |-------------|--------------------|--------------|--------------|--------------|--------------|
 | `tars` | [`UI-Tars`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data, e.g. "`Book me a flight to Rome`" | Slow, >1s per step | Self-hosted | Depending on infrastructure | Out-of-the-box not recommended for production usage |
@@ -269,26 +291,160 @@ agent.tools.clipboard.copy("...")
 result = agent.tools.clipboard.paste()
 ```
 
-### 📜 Logging & Reporting
+### 📜 Logging
 
-You want a better understanding of what you agent is doing? Set the `log_level` to DEBUG. You can also generate a report of the automation run by setting `enable_report` to `True`.
+You want a better understanding of what your agent is doing? Set the `log_level` to `logging.DEBUG`.
 
 ```python
 import logging
 
-with VisionAgent(log_level=logging.DEBUG, enable_report=True) as agent:
+with VisionAgent(log_level=logging.DEBUG) as agent:
+    agent...
+```
+
+### 📜 Reporting
+
+You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.
+
+```python
+from askui import VisionAgent
+from askui.reporting import SimpleHtmlReporter
+
+with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
+    agent...
+```
+
+You can also create your own reporter by implementing the `Reporter` interface.
+
+```python
+from typing import Optional, Union
+
+from PIL import Image
+from typing_extensions import override
+
+from askui import VisionAgent
+from askui.reporting import Reporter
+
+class CustomReporter(Reporter):
+    @override
+    def add_message(
+        self,
+        role: str,
+        content: Union[str, dict, list],
+        image: Optional[Image.Image] = None,
+    ) -> None:
+        # adding message to the report (see implementation of `SimpleHtmlReporter` as an example)
+        pass
+
+    @override
+    def generate(self) -> None:
+        # generate the report if not generated live (see implementation of `SimpleHtmlReporter` as an example)
+        pass
+
+
+with VisionAgent(reporters=[CustomReporter()]) as agent:
+    agent...
+```
+
+You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.
+
+```python
+with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
     agent...
 ```
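
The call order described above can be illustrated with a small, dependency-free sketch. `RecordingReporter` and `FanOutReporter` are hypothetical names introduced here; they only mirror the `add_message()` / `generate()` shape shown in the README and are not how `VisionAgent` is necessarily implemented internally.

```python
from typing import Any, List, Optional


class RecordingReporter:
    """Hypothetical reporter that records each call into a shared log."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def add_message(self, role: str, content: Any, image: Optional[Any] = None) -> None:
        self.log.append(f"{self.name}:add_message:{role}")

    def generate(self) -> None:
        self.log.append(f"{self.name}:generate")


class FanOutReporter:
    """Hypothetical dispatcher that forwards each call to reporters in list order."""

    def __init__(self, reporters: List[RecordingReporter]) -> None:
        self._reporters = list(reporters)

    def add_message(self, role: str, content: Any, image: Optional[Any] = None) -> None:
        for reporter in self._reporters:
            reporter.add_message(role, content, image)

    def generate(self) -> None:
        for reporter in self._reporters:
            reporter.generate()


calls: List[str] = []
fan_out = FanOutReporter(
    [RecordingReporter("html", calls), RecordingReporter("custom", calls)]
)
fan_out.add_message("User", "click Next")
fan_out.generate()
```

After this runs, `calls` lists the `html` reporter before the `custom` reporter for both `add_message()` and `generate()`, matching the order of the list passed in.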
 
 ### 🖥️ Multi-Monitor Support
 
-You have multiple monitors? Choose which one to automate by setting `display` to 1 or 2.
+You have multiple monitors? Choose which one to automate by setting `display` to `1`, `2`, etc. To find the correct display or monitor, you may have to experiment with different values. We are going to improve this soon. By default, the agent will use display 1.
 
 ```python
 with VisionAgent(display=1) as agent:
     agent...
 ```
 
+### 🎯 Locating elements
+
+If you have a hard time locating (clicking, moving mouse to etc.) elements by simply using text, e.g.,
+
+```python
+agent.click("Password textfield")
+agent.type("********")
+```
+
+you can build more sophisticated locators.
+
+**⚠️ Warning:** Support can vary depending on the model you are using. Currently, the `askui` model provides the best support for locators. This model is chosen by default if the `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` environment variables are set and it is not overridden using the `model` parameter.
+
+Example:
+
+```python
+from askui import locators as loc
+
+password_textfield_label = loc.Text("Password")
+password_textfield = loc.Element("textfield").right_of(password_textfield_label)
+
+agent.click(password_textfield)
+agent.type("********")
+```
+
+### 📊 Extracting information
+
+The `get()` method allows you to extract information from the screen. You can use it to:
+
+- Get text or data from the screen
+- Check the state of UI elements
+- Make decisions based on screen content
+- Analyze static images
+
+#### Basic usage
+
+```python
+# Get text from screen
+url = agent.get("What is the current url shown in the url bar?")
+print(url)  # e.g., "github.com/login"
+
+# Check UI state
+# Just as an example, may be flaky if used as is, better use a response schema to check for a boolean value (see below)
+is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes"
+if is_logged_in:
+    agent.click("Logout")
+else:
+    agent.click("Login")
+```
+
+#### Using custom images
+
+Instead of taking a screenshot, you can analyze specific images:
+
+```python
+from PIL import Image
+
+# From PIL Image
+image = Image.open("screenshot.png")
+result = agent.get("What's in this image?", image)
+
+# From file path
+result = agent.get("What's in this image?", "screenshot.png")
+```
+
+#### Using response schemas
+
+For structured data extraction, use Pydantic models extending `ResponseSchemaBase` (the name exported from `askui/__init__.py` in this change):
+
+```python
+from askui import ResponseSchemaBase
+
+class UserInfo(ResponseSchemaBase):
+    username: str
+    is_online: bool
+
+# Get structured data
+user_info = agent.get(
+    "What is the username and online status?",
+    response_schema=UserInfo
+)
+print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}")
+```
+
+**⚠️ Limitations:**
+- Nested Pydantic schemas are not currently supported
+- Response schema is currently only supported by "askui" model (default model if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set)
 
 ## What is AskUI Vision Agent?
 
 
@@ -39,9 +39,10 @@ path = "src/askui/__init__.py"
 distribution = true
 
 [tool.pdm.scripts]
-test = "pytest"
-"test:unit" = "pytest tests/unit"
-"test:integration" = "pytest tests/integration"
+test = "pytest -n auto"
+"test:e2e" = "pytest -n auto tests/e2e"
+"test:integration" = "pytest -n auto tests/integration"
+"test:unit" = "pytest -n auto tests/unit"
 sort = "isort ."
 format = "black ."
 lint = "ruff check ."
@@ -56,6 +57,7 @@ test = [
     "black>=25.1.0",
     "ruff>=0.9.5",
     "pytest-mock>=3.14.0",
+    "pytest-xdist>=3.6.1",
 ]
 chat = [
     "streamlit>=1.42.0",
 
@@ -3,7 +3,19 @@
 __version__ = "0.2.5"
 
 from .agent import VisionAgent
+from .models.router import ModelRouter
+from .models.types.response_schemas import ResponseSchema, ResponseSchemaBase
+from .tools.toolbox import AgentToolbox
+from .tools.agent_os import AgentOs, ModifierKey, PcKey
+
 
 __all__ = [
+    "AgentOs",
+    "AgentToolbox",
+    "ModelRouter",
+    "ModifierKey",
+    "PcKey",
+    "ResponseSchema",
+    "ResponseSchemaBase",
     "VisionAgent",
 ]