Commit 44d086f

Merge pull request #148 from agent-diff-bench/artzhuravel-patch-2
Revise test results for new models
2 parents 1bdd9ea + 11c76e8 commit 44d086f

1 file changed

Lines changed: 1 addition & 1 deletion

experiments/kdd 2026/new_tests_results.md

@@ -3,7 +3,7 @@
 Assertion-weighted mean scores (0-100) on the **36 newly added tests** only: 17 Box, 5 Google Calendar, 7 Linear, and 6 Slack. All runs included API documentation.
 
 | Model | Weighted Mean
-|---|---|---|
+|---|---|
 | openai/gpt-5 | 88.10
 | openai/gpt-5-mini | 87.61
 | deepseek/deepseek-v3.2 | 84.26
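
The one-line change reduces the delimiter row from three columns to two, so it matches the two-cell header. In GitHub Flavored Markdown a table is only recognized when the header row and the delimiter row have the same number of cells. A minimal sketch of that check follows; the helper names `count_cells` and `delimiter_matches_header` are hypothetical and not part of this repository, and the cell counting is deliberately naive (it does not handle escaped pipes or pipes inside code spans).

```python
def count_cells(row: str) -> int:
    """Count cells in a pipe-table row, ignoring optional outer pipes.

    Naive by design: escaped pipes (\\|) and pipes inside code spans
    are not treated specially.
    """
    stripped = row.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return len(stripped.split("|"))


def delimiter_matches_header(header: str, delimiter: str) -> bool:
    """Return True when header and delimiter rows have equal cell counts."""
    return count_cells(header) == count_cells(delimiter)


if __name__ == "__main__":
    header = "| Model | Weighted Mean"
    # Delimiter row before the change (three columns).
    print(delimiter_matches_header(header, "|---|---|---|"))
    # Delimiter row after the change (two columns).
    print(delimiter_matches_header(header, "|---|---|"))
```

A check like this could run over results files before merging, though a real Markdown parser would be the more robust choice.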
