Commit d081f71

Revise Agent-Diff Bench and Benchmark sections
Updated sections related to Agent-Diff Bench and Benchmark Suites in README.
1 parent 4919ee0 commit d081f71

1 file changed

Lines changed: 6 additions & 17 deletions

README.md

@@ -129,17 +129,11 @@ See the [Python SDK](https://agentdiff.mintlify.app/sdks/python/installation) an
 - **Creation**: `client.init_env(templateService="slack", templateName="slack_default", impersonateUserId="U01AGENBOT9")`
 - **Cleanup**: `client.delete_env(envId)` or auto-expires after TTL

-## Run Evaluations
-
-- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
-- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
-- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split)
-
-## Benchmark
+## Agent-Diff Bench

 The Agent-Diff benchmark comprises **224 tasks** across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks span single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.

-### Benchmark Results
+### Agent-Diff Bench Results

 | Model | Box | Calendar | Linear | Slack | **Overall** | Pass % | Cost/test | Score/$ |
 |---|---|---|---|---|---|---|---|---|
@@ -155,16 +149,11 @@ The Agent-Diff benchmark comprises **224 tasks** across four enterprise services

 Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the [paper](https://arxiv.org/abs/2602.11224).

-## Benchmark Suites
+## Run Agent-Diff Bench

-| Service | Test Suite | Tests | Coverage |
-|---------|-----------|-------|----------|
-| Box | [box_bench.json](examples/box/testsuites/box_bench.json) | 48 | File/folder ops, search, tags, comments, hubs, versioning |
-| Calendar | [calendar_bench.json](examples/calendar/testsuites/calendar_bench.json) | 60 | Event CRUD, recurring events, free/busy, ACL, lifecycle |
-| Linear | [linear_bench.json](examples/linear/testsuites/linear_bench.json) | 57 | Issues, labels, comments, workflow states, teams |
-| Slack | [slack_bench.json](examples/slack/testsuites/slack_bench.json) | 59 | Messages, channels, reactions, threading |
-
-Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.
+- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
+- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
+- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split). Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.


 ## Documentation
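
Note on the removed Benchmark Suites table: its per-service test counts appear to add up to the 224 tasks still quoted in the retained text, so dropping the table does not seem to lose the headline figure. A minimal Python check, using only the numbers from the deleted table rows (the variable names are illustrative and not part of the repository):

```python
# Per-service test counts, copied from the table rows removed in this commit.
suite_sizes = {"Box": 48, "Calendar": 60, "Linear": 57, "Slack": 59}

# Sum the per-service counts to compare against the README's stated total.
total_tasks = sum(suite_sizes.values())

# The README text kept by this commit states 224 tasks across 4 services.
assert len(suite_sizes) == 4
assert total_tasks == 224
print(total_tasks)
```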
