Evaluation

Local evaluation suites for benchmarking CUA against known tasks. Define cases in YAML, run them with repeated trials, and get scored results with pass/fail expectations.

Quick Start

# Run a suite
python -c "
import asyncio
from evaluation import load_suite, run_suite, write_suite_report

async def main():
    suite = await load_suite('evaluation/suites/example.yaml')
    report = await run_suite(suite)
    await write_suite_report(report, 'output/evals/report.json')
    print(f'{report.passed}/{report.total} passed ({report.pass_rate:.0%})')

asyncio.run(main())
"

Suite Format

Suites are YAML files with a list of evaluation cases:

name: my-suite
cases:
  - id: agent-example
    directive: Open example.com and return the page title
    start_url: https://example.com
    execution_mode: agent_only    # agent_only | playbook_only | hybrid_auto
    trials: 3                     # repeat count for statistical confidence
    benchmark_tags: [smoke, public]
    max_steps: 5
    output_schema:                # optional JSON schema for structured output
      type: object
      properties:
        summary: { type: string }
    expect:
      must_succeed: true
      summary_contains: ["example"]
      max_actions: 4
      min_trial_pass_rate: 0.8    # at least 80% of trials must pass

  - id: playbook-cancel-order
    directive: Cancel order 12345
    playbook: cancel_order
    execution_mode: playbook_only
    benchmark_tags: [deterministic]
    playbook_params:
      order_id: "12345"
    expect:
      must_succeed: true
      handoff: forbid             # no LLM fallback allowed

Execution Modes

| Mode | Behavior |
|------|----------|
| hybrid_auto (default) | Uses the playbook if one is specified, otherwise uses the LLM agent. Playbook failures can hand off to the agent. |
| playbook_only | Runs the specified playbook with no LLM fallback. Requires the playbook field. |
| agent_only | Runs the LLM agent directly, ignoring any playbook. |
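For comparison runs, the same directive can be pinned to different modes in separate cases. The snippet below is a sketch using only fields documented above; the case IDs are illustrative and cancel_order is the playbook from the earlier example:

- id: cancel-order-deterministic
  directive: Cancel order 12345
  playbook: cancel_order
  execution_mode: playbook_only    # fails instead of falling back to the agent
  playbook_params:
    order_id: "12345"

- id: cancel-order-agent-baseline
  directive: Cancel order 12345
  execution_mode: agent_only       # measures the raw agent with no playbook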

Expectations

Each case has an expect block that defines pass/fail criteria. All expectations are optional; omitted checks are skipped.

Content Checks

| Field | Type | Description |
|-------|------|-------------|
| must_succeed | bool | The case must complete successfully (default: true) |
| summary_contains | list[str] | Summary must contain all listed substrings (case-insensitive) |
| error_contains | list[str] | Error message must contain all listed substrings |
| extracted_text_contains | list[str] | Extracted text from steps must contain all substrings |
| required_data_keys | list[str] | Dotted paths (e.g., details.title) that must exist in result data |
| data_values_contain | dict[str, str] | Maps a dotted path to a substring that must appear in the value at that path |
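A sketch of the dotted-path checks, reusing the details.title path mentioned above (the expected substring is illustrative):

expect:
  required_data_keys: ["details.title"]     # the key must exist in result data
  data_values_contain:
    details.title: "Example Domain"         # and its value must contain this substring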

Performance Bounds

| Field | Type | Description |
|-------|------|-------------|
| max_duration_ms | int | Maximum wall-clock time per trial |
| max_actions | int | Maximum number of actions per trial |
| min_actions | int | Minimum number of actions per trial |
| max_input_tokens | int | Maximum input tokens per trial |
| max_output_tokens | int | Maximum output tokens per trial |
| max_estimated_cost_usd | float | Maximum estimated cost per trial |

Trial Aggregation

| Field | Type | Description |
|-------|------|-------------|
| min_trial_pass_rate | float | Minimum fraction of trials that must pass (e.g., 0.8 for 80%) |
| max_p95_duration_ms | int | Maximum p95 duration across trials |

Handoff Behavior

| Field | Type | Description |
|-------|------|-------------|
| handoff | allow \| require \| forbid | Whether LLM handoff is allowed, required, or forbidden (default: allow) |
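Checks from all of these groups can be combined in a single expect block. The thresholds below are illustrative, not recommended defaults:

expect:
  must_succeed: true
  max_actions: 6
  max_duration_ms: 60000
  max_estimated_cost_usd: 0.10
  min_trial_pass_rate: 0.8
  max_p95_duration_ms: 90000
  handoff: forbid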

Trials

Set trials: N on a case to run it multiple times. This is useful for measuring reliability and performance variance of non-deterministic LLM agent cases.

When trials > 1:

  • Each trial is scored independently against the case expectations
  • Results are aggregated into a single EvalCaseResult with averages and p95 metrics
  • The case passes only if all trials pass (unless min_trial_pass_rate is set)
  • The report includes trial_pass_rate, avg_duration_ms, p95_duration_ms, and avg_estimated_cost_usd
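For example, a case that runs five times and tolerates one flaky trial (the id and thresholds are illustrative):

- id: flaky-title-check
  directive: Open example.com and return the page title
  trials: 5
  expect:
    must_succeed: true
    min_trial_pass_rate: 0.8    # 4 of 5 trials must pass for the case to pass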

Cost Estimation

To track estimated LLM costs, set per-million token pricing on the case:

- id: cost-tracked-case
  directive: ...
  input_token_cost_per_million_usd: 3.0
  output_token_cost_per_million_usd: 15.0
  expect:
    max_estimated_cost_usd: 0.05

Cost is computed as (input_tokens / 1M) * input_rate + (output_tokens / 1M) * output_rate. Playbook-only cases always report $0.00 since they make no LLM calls.
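As a quick sanity check of the formula, a minimal Python sketch with made-up token counts (the rates match the YAML example above, in USD per million tokens):

# Illustrative token counts for one trial
input_tokens, output_tokens = 10_000, 2_000
input_rate, output_rate = 3.0, 15.0

cost = (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate
print(f"${cost:.4f}")   # $0.0600; this trial would exceed the max_estimated_cost_usd: 0.05 bound above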

Case Configuration

| Field | Default | Description |
|-------|---------|-------------|
| id | (required) | Unique case identifier |
| directive | (required) | Natural language task description |
| profile | default | Agent profile to use |
| playbook | None | Playbook ID for deterministic execution |
| execution_mode | hybrid_auto | How to run the case |
| trials | 1 | Number of repeated runs |
| benchmark_tags | [] | Tags for filtering and grouping |
| credentials | None | Credentials dict for authenticated flows |
| allow_private_networks | false | Allow localhost and private IPs |
| start_url | None | URL to open on browser launch |
| max_steps | 50 | Maximum agent loop iterations |
| thinking | high | LLM thinking effort level |
| output_schema | None | JSON schema for structured output |
| metadata | {} | Arbitrary metadata (not used by the runner) |
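A sketch of a case using the less common fields. The credentials keys, the low thinking value, and the localhost URL are assumptions for illustration, not documented values:

- id: internal-dashboard-check
  directive: Log in and read the dashboard headline
  profile: default
  start_url: http://localhost:3000
  allow_private_networks: true      # needed because the target is a private address
  credentials:
    username: test-user             # assumed shape; supply whatever dict your flow expects
    password: test-pass
  thinking: low
  max_steps: 20
  metadata:
    owner: qa-team                  # ignored by the runner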

Report Structure

write_suite_report outputs a JSON file with the full result hierarchy:

EvalSuiteResult
  name, passed, failed, total, pass_rate
  avg_duration_ms, p95_duration_ms, avg_estimated_cost_usd
  deterministic_hit_rate, handoff_rate, handoff_rescue_rate
  case_results: [EvalCaseResult]
    id, passed, success, mode, summary, error
    duration_ms, actions, input_tokens, output_tokens
    trials_run, trial_pass_rate, avg_duration_ms, p95_duration_ms
    avg_estimated_cost_usd, failed_checks
    trial_results: [EvalTrialResult]   # included when trials > 1
      trial_index, passed, success, mode, duration_ms, ...
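A minimal sketch of consuming that report with the standard library, assuming it was written to output/evals/report.json as in the Quick Start and that the JSON keys mirror the field names above:

import json

with open("output/evals/report.json") as f:
    report = json.load(f)

print(f"{report['passed']}/{report['total']} passed ({report['pass_rate']:.0%})")

# Show failing cases and the expectation checks they missed
# (assumes failed_checks is a list of check names).
for case in report["case_results"]:
    if not case["passed"]:
        print(f"  {case['id']}: {', '.join(case['failed_checks'])}")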

Architecture

evaluation/
├── models.py          Pydantic models (EvalCase, EvalTrialResult, EvalCaseResult, etc.)
├── scoring.py         Pure scoring and aggregation (no I/O)
├── runner.py          Browser orchestration, trial loop, suite runner
├── __init__.py        Public API
└── suites/
    └── example.yaml   Example benchmark suite

Scoring is fully separated from execution: scoring.py contains pure functions with no I/O dependencies, making expectations and aggregation independently testable.