 - **Cleanup**: `client.delete_env(envId)` or auto-expires after TTL
 
-## Run Evaluations
-
-- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
-- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
-- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split)
-
-## Benchmark
+## Agent-Diff Bench
 
 The Agent-Diff benchmark comprises **224 tasks** across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks span single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.
 
-### Benchmark Results
+### Agent-Diff Bench Results
 
 | Model | Box | Calendar | Linear | Slack | **Overall** | Pass % | Cost/test | Score/$ |
 |---|---|---|---|---|---|---|---|---|
@@ -155,16 +149,11 @@ The Agent-Diff benchmark comprises **224 tasks** across four enterprise services
Per-service assertion-weighted scores (95% Bayesian CrI). No-docs baseline: agents receive no API documentation and must discover endpoints through exploration. 3 trials per task. Full methodology and documentation ablation results in the [paper](https://arxiv.org/abs/2602.11224).
Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.
+- **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — Run evals or RL training with no setup required
+- **[Colab Notebooks](#try-it-now)** — Run locally with the example notebooks above
+- **[Dataset](https://huggingface.co/datasets/hubertmarek/agent-diff-bench)** — 224 tasks across all 4 services (80/20 train/test split). Each test defines expected state changes via declarative assertions. See the [assertions docs](https://agentdiff.mintlify.app/core-concepts/assertions) for how they work.
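
As a rough illustration of the state-diff idea the README text refers to, the sketch below diffs two state snapshots and checks declarative assertions against the result. Everything in it is hypothetical: the snapshot layout, the `diff_state` and `check` helpers, the assertion fields (`kind`, `where`), and the simple pass-fraction score are assumptions made for illustration, not Agent-Diff's actual schema, API, or scoring. The linked assertions docs describe the real format.

```python
from typing import Any

# Hypothetical snapshot layout: entity id -> field values.
Snapshot = dict[str, dict[str, Any]]


def diff_state(before: Snapshot, after: Snapshot) -> dict[str, dict]:
    """Return inserted, deleted, and updated entities between two snapshots."""
    inserted = {k: after[k] for k in after.keys() - before.keys()}
    deleted = {k: before[k] for k in before.keys() - after.keys()}
    # For entities present in both, keep only fields whose value changed or was added.
    updated = {
        k: {f: v for f, v in after[k].items() if before[k].get(f) != v}
        for k in after.keys() & before.keys()
        if after[k] != before[k]
    }
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


def check(diff: dict[str, dict], assertions: list[dict]) -> float:
    """Return the fraction of assertions satisfied by the diff (illustrative scoring)."""
    if not assertions:
        return 1.0
    passed = 0
    for a in assertions:
        rows = diff[a["kind"]].values()
        # An assertion passes if some changed entity matches every expected field.
        if any(all(row.get(f) == v for f, v in a["where"].items()) for row in rows):
            passed += 1
    return passed / len(assertions)


# Example data (made up): one message edited, one message created.
before = {"msg1": {"text": "hi", "channel": "general"}}
after = {
    "msg1": {"text": "hello", "channel": "general"},
    "msg2": {"text": "new", "channel": "random"},
}
assertions = [
    {"kind": "inserted", "where": {"channel": "random"}},
    {"kind": "deleted", "where": {"text": "hi"}},
]
result = diff_state(before, after)
score = check(result, assertions)
```

Because the check runs on the computed diff rather than on the agent's transcript, the same inputs always yield the same verdict, which is one plausible reading of "deterministic state-diff contracts".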