CUA exposes a single browser_dom tool with 10 actions. The agent chooses which action to call based on the task and page state.
| Action | Description | Returns |
|---|---|---|
| goto(url) | Navigate to a URL | Page map (metadata + landmarks + elements) |
| click(selector) | Click an element (CSS, text=, role= selectors) | Mutation delta + page map |
| screenshot | Capture the viewport | Screenshot + DOM snapshot |
| key_press(text, credential_ref, key) | Type plain text or a runtime credential ref, and/or press a key (Enter, Tab, etc.) | Confirmation |
| scroll(direction, amount) | Scroll the page | Page map |
| extract(selector, mode) | Extract content as markdown (default), text, HTML, or form values | Content string + page map |
| get_dom(selector?) | Get a compact DOM snapshot (optionally scoped) | DOM string |
| select(selector, value) | Select a dropdown option | Confirmation |
| evaluate(script) | Execute arbitrary JavaScript on the page | Page map (if URL changed) |
| execute_sequence(steps) | Batch multiple actions in a single tool call | Combined results + page map |
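As a quick reference, the table can be mirrored in a small lookup that tells a caller what to expect back from each action. This is only a sketch: `RETURNS` and `expected_return` are hypothetical names, not part of CUA, and the return descriptions are abbreviated from the table above.

```python
# Documented return value per browser_dom action (abbreviated from the table).
RETURNS = {
    "goto": "page map",
    "click": "mutation delta + page map",
    "screenshot": "screenshot + DOM snapshot",
    "key_press": "confirmation",
    "scroll": "page map",
    "extract": "content string + page map",
    "get_dom": "DOM string",
    "select": "confirmation",
    "evaluate": "page map (if URL changed)",
    "execute_sequence": "combined results + page map",
}


def expected_return(call: dict) -> str:
    """Look up what a browser_dom call is documented to return."""
    action = call.get("action")
    if action not in RETURNS:
        raise ValueError(f"unknown browser_dom action: {action!r}")
    return RETURNS[action]


# Example call using a text= selector, one of the selector forms click accepts.
call = {"action": "click", "selector": "text=Sign in"}
print(expected_return(call))
```

A lookup like this can also reject misspelled action names before they reach the browser.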
Each tool call has ~3-5s of overhead (API round-trip + thinking). Without batching, filling a 5-field form takes 5 separate calls = ~20s of pure overhead. With execute_sequence, it's a single call:
```json
{
  "action": "execute_sequence",
  "steps": [
    {"action": "key_press", "selector": "#email", "credential_ref": "email"},
    {"action": "key_press", "selector": "#password", "credential_ref": "password"},
    {"action": "click", "selector": "button[type=submit]"}
  ]
}
```

Intermediate steps skip screenshots for speed. Only the final step captures the DOM, so the agent sees the result of the entire sequence in one response.
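The overhead arithmetic can be worked through in a few lines. The 3-5 s per-call range comes from the text above; `overhead_seconds`, `batch`, and the form selectors are hypothetical and used only for illustration.

```python
import json

PER_CALL_OVERHEAD_S = (3.0, 5.0)  # the "~3-5s" per tool call quoted above


def overhead_seconds(tool_calls: int) -> tuple[float, float]:
    """Low and high estimate of pure overhead for a number of tool calls."""
    low, high = PER_CALL_OVERHEAD_S
    return tool_calls * low, tool_calls * high


def batch(steps: list[dict]) -> dict:
    """Wrap individual steps in a single execute_sequence call."""
    return {"action": "execute_sequence", "steps": steps}


# A five-field form; the selectors and values are made up for this example.
fields = {
    "#first-name": "Ada",
    "#last-name": "Lovelace",
    "#city": "London",
    "#zip": "N1",
    "#phone": "555-0100",
}
steps = [
    {"action": "key_press", "selector": selector, "text": value}
    for selector, value in fields.items()
]

print("separate calls:", overhead_seconds(len(steps)))  # (15.0, 25.0)
print("batched call:  ", overhead_seconds(1))           # (3.0, 5.0)
print(json.dumps(batch(steps), indent=2))
```

The midpoint of the unbatched range, 5 × 4 s, gives the ~20 s figure quoted above.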
When credential_ref is used, the LLM sees only the reference name. The runtime resolves the real secret immediately before filling the field, and logs keep only the ref name rather than the secret value.
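A minimal sketch of that resolution step, assuming a plain dictionary as the secret store. `run_key_press`, the `type_into` callback, and the log line format are hypothetical; the actual runtime may resolve and log differently.

```python
import logging
from typing import Callable

logger = logging.getLogger("cua.sketch")


def run_key_press(
    step: dict,
    secrets: dict[str, str],
    type_into: Callable[[str, str], None],
) -> str:
    """Fill a field, resolving credential_ref only at fill time.

    Returns the line that was logged, which never contains a resolved secret.
    """
    ref = step.get("credential_ref")
    if ref is not None:
        value = secrets[ref]              # the real secret exists only here
        shown = f"credential_ref={ref}"   # logs keep the ref name
    else:
        value = step.get("text", "")
        shown = f"text={value!r}"
    type_into(step["selector"], value)
    line = f"key_press selector={step['selector']} {shown}"
    logger.info(line)
    return line


# Usage with a stand-in for the browser: record what would be typed.
typed: list[tuple[str, str]] = []
secrets = {"password": "not-a-real-secret"}
step = {"action": "key_press", "selector": "#password", "credential_ref": "password"}
logged = run_key_press(step, secrets, lambda sel, val: typed.append((sel, val)))
print(logged)
```

The model only ever produces and reads `step`, so the secret value never enters its context.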
| Design | How it works |
|---|---|
| Semantic page understanding | goto/click return a structured page map with three layers: (1) page metadata from Schema.org JSON-LD and Open Graph for page type classification, (2) semantic landmarks summarizing regions (form#login: 3 inputs, table#results: 5 cols, 47 rows), and (3) all interactive elements with parent-context disambiguation (Edit [row: "john@example.com"]). Falls back to Playwright's accessibility tree (ARIA roles/states) when JS extraction fails. |
| Action-outcome verification | Click actions use a DOM Mutation Observer to report exactly what changed: [URL → /dashboard; +modal.dialog; 3 attr changes]. The agent knows immediately whether its action worked. |
| Readability-based extraction | extract defaults to markdown mode: a Readability-style content extractor followed by markdown conversion that preserves headings, links, lists, tables, and code blocks. |
| Streaming execution | Tool calls execute as they arrive from the Claude API stream, not after the full response. |
| Context pruning | Old screenshots, DOM snapshots, and thinking blocks are pruned via Pydantic AI HistoryProcessor before each model request. Input tokens stay flat regardless of run length. |
| Speculative prefetch | Page map fetch starts as a background task during click actions, overlapping with mutation observer collection. Result is consumed by the next page context request. |
| Page-change detection | After goto/click/execute_sequence, remaining tool calls in the same response are skipped because they were planned against stale page state. |
| CAPTCHA auto-resolution | Patchright stealth patches + auto-wait up to 30s for Cloudflare, 5s for reCAPTCHA/hCaptcha. |
| Stuck detection | Repetition and cycle analysis with 3-tier escalation (hint → warning → stop). Hard stops block all subsequent actions. See Guardrails. |
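To make the stuck-detection row concrete, here is one way repetition and cycle analysis with a three-tier escalation could look. Everything in it is an assumption made for illustration: the `StuckDetector` class, the thresholds (3, 5, 7), the action signature strings, and the cycle periods checked are not taken from CUA.

```python
from dataclasses import dataclass, field
from typing import Optional

HINT, WARNING, STOP = "hint", "warning", "stop"


@dataclass
class StuckDetector:
    hint_at: int = 3   # assumed thresholds, not CUA's
    warn_at: int = 5
    stop_at: int = 7
    history: list[str] = field(default_factory=list)
    stopped: bool = False

    def _repeat_run(self) -> int:
        """Length of the trailing run of identical actions."""
        run = 0
        for sig in reversed(self.history):
            if sig != self.history[-1]:
                break
            run += 1
        return run

    def _cycle_repeats(self, max_period: int = 4) -> int:
        """Most back-to-back repeats of a trailing cycle of period 2..max_period."""
        h, best = self.history, 0
        for period in range(2, max_period + 1):
            if len(h) < 2 * period:
                continue
            pattern, reps = h[-period:], 1
            while (
                len(h) >= (reps + 1) * period
                and h[-(reps + 1) * period : -reps * period] == pattern
            ):
                reps += 1
            if reps >= 2:
                best = max(best, reps)
        return best

    def observe(self, signature: str) -> Optional[str]:
        """Record an action and return the escalation level, if any."""
        if self.stopped:
            return STOP  # a hard stop blocks every subsequent action
        self.history.append(signature)
        score = max(self._repeat_run(), self._cycle_repeats())
        if score >= self.stop_at:
            self.stopped = True
            return STOP
        if score >= self.warn_at:
            return WARNING
        if score >= self.hint_at:
            return HINT
        return None


detector = StuckDetector()
for _ in range(7):
    level = detector.observe("click:#next")
print(level, detector.observe("goto:https://example.com"))
```

Once the detector reaches the stop tier it stays there, matching the hard-stop behavior described in the table.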