# Browser Tools

CUA exposes a single `browser_dom` tool with 10 actions. The agent chooses which action to call based on the task and the current page state.

## Actions

| Action | Description | Returns |
| --- | --- | --- |
| `goto(url)` | Navigate to a URL | Page map (metadata + landmarks + elements) |
| `click(selector)` | Click an element (CSS, `text=`, `role=` selectors) | Mutation delta + page map |
| `screenshot` | Capture the viewport | Screenshot + DOM snapshot |
| `key_press(text, credential_ref, key)` | Type plain text or a runtime credential ref, and/or press a key (Enter, Tab, etc.) | Confirmation |
| `scroll(direction, amount)` | Scroll the page | Page map |
| `extract(selector, mode)` | Extract content as markdown (default), text, HTML, or form values | Content string + page map |
| `get_dom(selector?)` | Get a compact DOM snapshot (optionally scoped) | DOM string |
| `select(selector, value)` | Select a dropdown option | Confirmation |
| `evaluate(script)` | Execute arbitrary JavaScript on the page | Page map (if URL changed) |
| `execute_sequence(steps)` | Batch multiple actions in a single tool call | Combined results + page map |
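A single-action call is a flat JSON object. For example (shape inferred from the `execute_sequence` steps shown later on this page; the selector value is illustrative):

```json
{
  "action": "click",
  "selector": "text=Sign in"
}
```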

## Why `execute_sequence` matters

Each tool call carries ~3-5s of overhead (API round-trip + model thinking). Without batching, filling a 5-field form takes 5 separate calls, or roughly 20s of pure overhead. With `execute_sequence`, it's a single call:

```json
{
  "action": "execute_sequence",
  "steps": [
    {"action": "key_press", "selector": "#email", "credential_ref": "email"},
    {"action": "key_press", "selector": "#password", "credential_ref": "password"},
    {"action": "click", "selector": "button[type=submit]"}
  ]
}
```

Intermediate steps skip screenshots for speed. Only the final step captures the DOM, so the agent sees the result of the entire sequence in one response.

When `credential_ref` is used, the LLM sees only the reference name. The runtime resolves the real secret immediately before filling the field, and logs keep only the ref name rather than the secret value.
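The resolution flow can be sketched as follows. This is a minimal illustration, not the actual implementation: the `CredentialVault` class, its method names, and the redaction format are all assumptions.

```python
class CredentialVault:
    """Maps reference names to secrets. Only ref names ever reach the LLM or the logs."""

    def __init__(self, secrets: dict[str, str]):
        self._secrets = secrets

    def resolve(self, ref: str) -> str:
        # Called by the runtime immediately before filling the field;
        # the resolved value never enters the model's context.
        return self._secrets[ref]

    def redact(self, line: str) -> str:
        # Log lines keep the ref name rather than the secret value.
        for ref, secret in self._secrets.items():
            line = line.replace(secret, f"<credential_ref:{ref}>")
        return line


vault = CredentialVault({"password": "s3cr3t!"})
log_line = vault.redact('key_press #password value="s3cr3t!"')
# log_line == 'key_press #password value="<credential_ref:password>"'
```

The key property is that resolution and redaction live entirely in the runtime: the model only ever sees and emits `"credential_ref": "password"`.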

## Design Choices

| Design | How it works |
| --- | --- |
| Semantic page understanding | `goto`/`click` return a structured page map with three layers: (1) page metadata from Schema.org JSON-LD and Open Graph for page-type classification, (2) semantic landmarks summarizing regions (`form#login: 3 inputs`, `table#results: 5 cols, 47 rows`), and (3) all interactive elements with parent-context disambiguation (`Edit [row: "john@example.com"]`). Falls back to Playwright's accessibility tree (ARIA roles/states) when JS extraction fails. |
| Action-outcome verification | Click actions use a DOM MutationObserver to report exactly what changed: `[URL → /dashboard; +modal.dialog; 3 attr changes]`. The agent knows immediately whether its action worked. |
| Readability-based extraction | `extract` defaults to markdown mode: a Readability-style content extractor plus markdown conversion preserving headings, links, lists, tables, and code blocks. |
| Streaming execution | Tool calls execute as they arrive from the Claude API stream, not after the full response. |
| Context pruning | Old screenshots, DOM snapshots, and thinking blocks are pruned via a Pydantic AI `HistoryProcessor` before each model request. Input tokens stay flat regardless of run length. |
| Speculative prefetch | The page-map fetch starts as a background task during click actions, overlapping with mutation-observer collection. The result is consumed by the next page-context request. |
| Page-change detection | After `goto`/`click`/`execute_sequence`, remaining tool calls in the same response are skipped: they were planned against stale state. |
| CAPTCHA auto-resolution | Patchright stealth patches plus auto-wait: up to 30s for Cloudflare, 5s for reCAPTCHA/hCaptcha. |
| Stuck detection | Repetition and cycle analysis with 3-tier escalation (hint → warning → stop). Hard stops block all subsequent actions. See Guardrails. |