Commit 712c7bb

Add SWE-bench Pro public task analysis
1 parent c391cb5 commit 712c7bb

1 file changed

Lines changed: 171 additions & 0 deletions

@@ -0,0 +1,171 @@
# SWE-bench Pro Public Task Analysis

Date: 2026-06-08

This note summarizes the public SWE-bench Pro task set in this repository and
documents how the repository, language, and task-type breakdowns were produced.

## Source of Truth

The public benchmark task count should be taken from:

- `helper_code/sweap_eval_full_v2.jsonl`

That JSONL contains 731 rows, matching the public leaderboard report. Counting
the directories under `run_scripts/` overstates the total in this checkout,
because `run_scripts/` contains 1,000 local instance directories.

Comparison between the JSONL and local run scripts:

| Measure | Count |
|---|---:|
| Public dataset rows in `helper_code/sweap_eval_full_v2.jsonl` | 731 |
| Local `run_scripts/instance_*` directories | 1,000 |
| Public dataset IDs with matching run scripts | 731 |
| Extra local run scripts not referenced by the public JSONL | 269 |
| Public dataset IDs missing run scripts | 0 |

The 269 extra local run scripts are not part of the current public task set
because their `instance_id`s do not appear in `helper_code/sweap_eval_full_v2.jsonl`.

Extra local scripts by repository:

| Repository | Extra Scripts |
|---|---:|
| element-hq/element-web | 80 |
| navidrome/navidrome | 57 |
| tutao/tutanota | 54 |
| gravitational/teleport | 17 |
| internetarchive/openlibrary | 13 |
| ansible/ansible | 12 |
| protonmail/webclients | 11 |
| qutebrowser/qutebrowser | 10 |
| NodeBB/NodeBB | 7 |
| flipt-io/flipt | 6 |
| future-architect/vuls | 2 |

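A per-repository tally like the one above could be produced with a sketch along these lines. It assumes instance directory names follow the layout `instance_<org>__<repo>-<40-hex commit SHA>` with an optional suffix; that layout, the `repo_of` and `count_by_repo` helpers, and the sample IDs are illustrative assumptions rather than part of the repository's tooling.

```python
import re
from collections import Counter

# Assumed ID layout: instance_<org>__<repo>-<40-hex commit sha>[-<suffix>]
INSTANCE_RE = re.compile(r"^instance_(?P<org>.+?)__(?P<repo>.+?)-[0-9a-f]{40}")


def repo_of(instance_id: str) -> str:
    """Return 'org/repo' parsed from an instance ID, or 'unknown' if it does not fit."""
    match = INSTANCE_RE.match(instance_id)
    if match is None:
        return "unknown"
    return f"{match['org']}/{match['repo']}"


def count_by_repo(instance_ids) -> Counter:
    """Tally instance IDs per repository."""
    return Counter(repo_of(i) for i in instance_ids)


# Hypothetical IDs for illustration only; real input would be the set of
# run-script directory names that are absent from the public JSONL.
sample = [
    "instance_element-hq__element-web-" + "a" * 40 + "-vnan",
    "instance_element-hq__element-web-" + "b" * 40,
    "instance_future-architect__vuls-" + "c" * 40,
]

for repo, n in count_by_repo(sample).most_common():
    print(f"| {repo} | {n} |")
```

IDs that do not match the assumed layout are grouped under `unknown`, so a non-zero `unknown` count signals that the pattern needs adjusting.
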
## Public Task Breakdown

Total public tasks: 731

### By Repository

| Repository | Language | Tasks | Percent |
|---|---|---:|---:|
| ansible/ansible | Python | 96 | 13.1% |
| internetarchive/openlibrary | Python | 91 | 12.4% |
| flipt-io/flipt | Go | 85 | 11.6% |
| qutebrowser/qutebrowser | Python | 79 | 10.8% |
| gravitational/teleport | Go | 76 | 10.4% |
| protonmail/webclients | TypeScript | 65 | 8.9% |
| future-architect/vuls | Go | 62 | 8.5% |
| navidrome/navidrome | Go | 57 | 7.8% |
| element-hq/element-web | TypeScript | 56 | 7.7% |
| NodeBB/NodeBB | JavaScript | 44 | 6.0% |
| tutao/tutanota | TypeScript | 20 | 2.7% |

### By Language Type

Language was mapped from each repository's primary implementation language, not
inferred per changed file.

| Language | Tasks | Percent |
|---|---:|---:|
| Go | 280 | 38.3% |
| Python | 266 | 36.4% |
| TypeScript | 141 | 19.3% |
| JavaScript | 44 | 6.0% |

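The language totals follow arithmetically from the By Repository table. A minimal sketch of that roll-up, with the repository rows copied from the table (the `REPO_TASKS` and `language_totals` names are illustrative, not from the repository):

```python
from collections import Counter

# (repository, language, public task count), copied from the By Repository table.
REPO_TASKS = [
    ("ansible/ansible", "Python", 96),
    ("internetarchive/openlibrary", "Python", 91),
    ("flipt-io/flipt", "Go", 85),
    ("qutebrowser/qutebrowser", "Python", 79),
    ("gravitational/teleport", "Go", 76),
    ("protonmail/webclients", "TypeScript", 65),
    ("future-architect/vuls", "Go", 62),
    ("navidrome/navidrome", "Go", 57),
    ("element-hq/element-web", "TypeScript", 56),
    ("NodeBB/NodeBB", "JavaScript", 44),
    ("tutao/tutanota", "TypeScript", 20),
]


def language_totals(rows) -> Counter:
    """Sum task counts per language, using the repository-level mapping."""
    totals = Counter()
    for _repo, language, tasks in rows:
        totals[language] += tasks
    return totals


totals = language_totals(REPO_TASKS)
grand_total = sum(totals.values())  # 731 for the rows above

for language, tasks in totals.most_common():
    print(f"| {language} | {tasks} | {tasks / grand_total:.1%} |")
```
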
### By Task Type

The dataset does not include an official `task_type` column. These categories
were derived with a deterministic keyword classifier over `problem_statement`.

| Task Type | Tasks | Percent |
|---|---:|---:|
| Security/Auth | 194 | 26.5% |
| UI/UX | 174 | 23.8% |
| Bug Fix | 95 | 13.0% |
| Feature/Enhancement | 81 | 11.1% |
| Performance/Optimization | 77 | 10.5% |
| Refactor/Maintenance | 63 | 8.6% |
| Tests/CI/Tooling | 45 | 6.2% |
| Unclassified/General Fix | 2 | 0.3% |

## Deterministic Task-Type Classification

Classification used only each row's `problem_statement`, not `patch` or
`test_patch`. An earlier classifier pass included patches and over-counted
Tests/CI/Tooling because every benchmark task includes test patches.

For each row:

1. Extract a title from common Markdown patterns:
   - `# Title: ...`
   - `# ...`
   - `**Title: ...**`
   - `Title: ...`
2. Lowercase the extracted title plus the first 2,500 characters of the problem
   statement.
3. Apply ordered regex rules. First match wins.
4. Fall back to `Unclassified/General Fix`.

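Steps 1 and 2 could be sketched as follows. The exact regular expressions, the order in which the title patterns are tried, and the `extract_title` and `classifier_text` names are assumptions for illustration; the note only lists the title shapes and the 2,500-character limit.

```python
import re

# Title patterns, tried in the order listed above (the order is an assumption).
# Each pattern captures the title text in group 1.
TITLE_PATTERNS = [
    re.compile(r"^#+[ \t]*Title:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^#+[ \t]*(.+)$", re.MULTILINE),
    re.compile(r"^\*\*[ \t]*Title:[ \t]*(.+?)[ \t]*\*\*[ \t]*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^Title:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE),
]


def extract_title(problem_statement: str) -> str:
    """Return the first title found by the patterns above, or '' if none match."""
    for pattern in TITLE_PATTERNS:
        match = pattern.search(problem_statement)
        if match:
            return match.group(1).strip()
    return ""


def classifier_text(problem_statement: str, limit: int = 2500) -> str:
    """Lowercase the title plus the first `limit` characters of the statement."""
    title = extract_title(problem_statement)
    return f"{title}\n{problem_statement[:limit]}".lower()
```
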
Classifier order:

1. Security/Auth
2. Performance/Optimization
3. UI/UX
4. Tests/CI/Tooling
5. Refactor/Maintenance
6. Feature/Enhancement
7. Bug Fix
8. Unclassified/General Fix

Representative keyword groups:

| Task Type | Representative Keywords |
|---|---|
| Security/Auth | security, vulnerable, cve, authentication, authorization, permission, login, logout, password, token, oauth, saml, oidc, 2fa, mfa, encrypt, decrypt, crypto, certificate, tls, ssl, xss, csrf, secret, credential |
| Performance/Optimization | performance, optimize, slow, speed, latency, timeout, hang, deadlock, memory leak, cache, efficient, expensive, scale, rate limit, throughput |
| UI/UX | ui, ux, user interface, frontend, button, modal, dialog, screen, page, view, display, render, layout, css, style, theme, tooltip, form, toast, accessibility, keyboard, focus, mobile, responsive, visual |
| Tests/CI/Tooling | testing, pytest, jest, unit test, integration test, ci, workflow, lint, eslint, ruff, mypy, build system, docker, script, cli, tooling, harness |
| Refactor/Maintenance | refactor, cleanup, rename, reorganize, move file, deprecated, migration, compatibility, upgrade, dependency, maintainability, technical debt |
| Feature/Enhancement | feature request, enhancement, add support, implement, new endpoint, new api, allow users, ability to, introduce, provide, enable, create |
| Bug Fix | bug, fix, incorrect, wrong, failure, error, exception, crash, broken, regression, cannot, unable, expected behavior, actual behavior |

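A first-match-wins classifier over these groups might look like the sketch below. It uses only a subset of the keywords from the table, and the word-boundary matching is an assumption: the note does not say whether the original rules match whole words or substrings. The `RULES`, `compile_rule`, and `classify` names are illustrative.

```python
import re

# Ordered (category, keywords); order follows the classifier order listed above.
# Keywords are a subset of the representative keyword table.
RULES = [
    ("Security/Auth", ["security", "authentication", "password", "token", "oauth", "csrf"]),
    ("Performance/Optimization", ["performance", "slow", "latency", "timeout", "memory leak", "cache"]),
    ("UI/UX", ["ui", "button", "modal", "dialog", "render", "tooltip"]),
    ("Tests/CI/Tooling", ["pytest", "jest", "unit test", "ci", "lint", "docker"]),
    ("Refactor/Maintenance", ["refactor", "cleanup", "rename", "deprecated", "migration", "upgrade"]),
    ("Feature/Enhancement", ["feature request", "add support", "implement", "new endpoint", "ability to", "introduce"]),
    ("Bug Fix", ["bug", "fix", "incorrect", "error", "crash", "regression"]),
]
FALLBACK = "Unclassified/General Fix"


def compile_rule(keywords):
    """Build one regex per category. Word boundaries (an assumption) keep short
    keywords such as 'ui' from matching inside longer words such as 'build'."""
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternation})\b")


COMPILED = [(category, compile_rule(keywords)) for category, keywords in RULES]


def classify(text: str) -> str:
    """Return the first category whose pattern matches, else the fallback."""
    lowered = text.lower()
    for category, pattern in COMPILED:
        if pattern.search(lowered):
            return category
    return FALLBACK


# A statement with both auth and UI wording lands in the earlier category.
print(classify("Login button does not render after password reset"))
```

Because Security/Auth is checked before UI/UX, the example statement is labelled Security/Auth even though it also mentions a button and rendering.
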
## Reproduction Snippets

Count public rows:

```bash
# Counts newline characters, so this assumes one JSON object per line and a trailing newline.
wc -l helper_code/sweap_eval_full_v2.jsonl
```

Compare public IDs to local run scripts:

```python
import json
from pathlib import Path

# Collect the public instance IDs, skipping blank lines.
with open("helper_code/sweap_eval_full_v2.jsonl", encoding="utf-8") as f:
    ids = {json.loads(line)["instance_id"] for line in f if line.strip()}

# Directory names under run_scripts/ are compared directly against instance IDs.
scripts = {p.name for p in Path("run_scripts").iterdir() if p.is_dir()}

print("dataset ids", len(ids))
print("run_scripts dirs", len(scripts))
print("overlap", len(ids & scripts))
print("extra scripts not in dataset", len(scripts - ids))
print("dataset ids missing scripts", len(ids - scripts))
```

## Caveats

- Repository counts are exact; they come directly from
  `helper_code/sweap_eval_full_v2.jsonl`.
- Language counts are mapped by repository, not by individual changed files.
- Task-type counts are heuristic. They are deterministic and reproducible, but
  they are not official benchmark annotations.
- Some tasks span multiple categories. Because first match wins, a task with
  both authentication and UI language is classified as Security/Auth.
