Benchmark evaluation infrastructure for GUI automation agents.
Repository: OpenAdaptAI/openadapt-evals
```bash
pip install openadapt[evals]
# or
pip install openadapt-evals
```

The evals package provides:
- Benchmark adapters for standardized evaluation
- API agent implementations (Claude, GPT-4V)
- Evaluation runners and metrics
- Mock environments for testing
```bash
# Evaluate a trained policy
openadapt eval run --checkpoint training_output/model.pt --benchmark waa

# Evaluate an API agent
openadapt eval run --agent api-claude --benchmark waa
```

Options:

- `--checkpoint` - Path to a trained policy checkpoint
- `--agent` - Agent type (`api-claude`, `api-gpt4v`, `custom`)
- `--benchmark` - Benchmark name (`waa`, `osworld`, etc.)
- `--tasks` - Number of tasks to evaluate (default: all)
- `--output` - Output directory for results
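These flags combine freely. For example (the task count and output path here are illustrative, not defaults):

```bash
# Evaluate the Claude API agent on 25 WAA tasks,
# writing results to a custom directory
openadapt eval run --agent api-claude --benchmark waa \
    --tasks 25 --output results/waa_claude
```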
Test your setup without running actual benchmarks:
```bash
openadapt eval mock --tasks 10
```

List the available benchmarks:

```bash
openadapt eval benchmarks
```

| Benchmark | Description | Tasks |
|---|---|---|
| `waa` | Windows Agent Arena | 154 |
| `osworld` | OSWorld | 369 |
| `webarena` | WebArena | 812 |
| `mock` | Mock benchmark for testing | Configurable |
```bash
export ANTHROPIC_API_KEY=your-key-here
openadapt eval run --agent api-claude --benchmark waa
```

```bash
export OPENAI_API_KEY=your-key-here
openadapt eval run --agent api-gpt4v --benchmark waa
```

```python
from openadapt_evals import ApiAgent, BenchmarkAdapter, evaluate_agent_on_benchmark
# Create an API agent
agent = ApiAgent.claude()
# Or load a trained policy
from openadapt_ml import AgentPolicy
agent = AgentPolicy.from_checkpoint("model.pt")
# Run evaluation
results = evaluate_agent_on_benchmark(
agent=agent,
benchmark="waa",
num_tasks=10
)
print(f"Success rate: {results.success_rate:.2%}")
print(f"Average steps: {results.avg_steps:.1f}")flowchart TB
subgraph Agent["Agent Under Test"]
POLICY[Agent Policy]
API[API Agent]
end
subgraph Benchmark["Benchmark System"]
ADAPTER[Benchmark Adapter]
MOCK[Mock Adapter]
LIVE[Live Adapter]
end
subgraph Tasks["Task Execution"]
TASK[Get Task]
OBS[Observe State]
ACT[Execute Action]
CHECK[Check Success]
end
subgraph Metrics["Metrics"]
SUCCESS[Success Rate]
STEPS[Avg Steps]
TIME[Execution Time]
end
POLICY --> ADAPTER
API --> ADAPTER
ADAPTER --> MOCK
ADAPTER --> LIVE
MOCK --> TASK
LIVE --> TASK
TASK --> OBS
OBS --> POLICY
OBS --> API
POLICY --> ACT
API --> ACT
ACT --> CHECK
CHECK -->|next| TASK
CHECK -->|done| SUCCESS
CHECK --> STEPS
CHECK --> TIME
```
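The adapter layer is what lets the same agent run against either the mock or a live environment. As an illustration of that seam, a custom adapter might subclass `BenchmarkAdapter` along the following lines; the method names (`get_task`, `observe`, `execute`, `is_success`) are assumptions read off the task-execution loop in the diagram, not the confirmed interface:

```python
from openadapt_evals import BenchmarkAdapter

class MyBenchmarkAdapter(BenchmarkAdapter):
    """Sketch only; the real BenchmarkAdapter interface
    may use different method names."""

    def get_task(self, task_id):
        # Return the task definition (instructions, initial state).
        ...

    def observe(self):
        # Capture the current GUI state, e.g. a screenshot.
        ...

    def execute(self, action):
        # Apply the agent's action (click, type, ...) to the environment.
        ...

    def is_success(self):
        # Evaluate the task's success criteria.
        ...
```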
| Export | Description |
|---|---|
| `ApiAgent` | API-based agent (Claude, GPT-4V) |
| `BenchmarkAdapter` | Benchmark interface |
| `MockAdapter` | Mock benchmark for testing |
| `evaluate_agent_on_benchmark` | Agent evaluation function |
| `EvalResults` | Evaluation results container |
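Since `mock` is itself a registered benchmark (see the table above), these exports compose into a quick smoke test that exercises the full evaluation loop without a live environment, using only the calls already shown:

```python
from openadapt_evals import ApiAgent, evaluate_agent_on_benchmark

# Smoke-test the pipeline end to end against the mock benchmark
agent = ApiAgent.claude()
results = evaluate_agent_on_benchmark(agent=agent, benchmark="mock", num_tasks=5)
print(f"Mock success rate: {results.success_rate:.2%}")
```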
| Metric | Description |
|---|---|
| Success Rate | Percentage of tasks completed successfully |
| Average Steps | Mean number of steps per task |
| Execution Time | Total and per-task timing |
| Error Rate | Percentage of tasks that errored |
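These metrics reduce to simple arithmetic over per-task outcomes. The sketch below assumes a plain record per task rather than the package's internal representation:

```python
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    """Illustrative per-task record, not the package's internal type."""
    success: bool
    errored: bool
    steps: int
    seconds: float

def summarize(outcomes: list[TaskOutcome]) -> dict:
    n = len(outcomes)
    return {
        "success_rate": sum(o.success for o in outcomes) / n,  # fraction completed successfully
        "avg_steps": sum(o.steps for o in outcomes) / n,       # mean steps per task
        "total_time": sum(o.seconds for o in outcomes),        # total execution time (s)
        "error_rate": sum(o.errored for o in outcomes) / n,    # fraction that errored
    }
```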
- openadapt-ml - Train policies to evaluate
- openadapt-capture - Collect demonstrations