# SWE-bench Pro Public Task Analysis

Date: 2026-06-08

This note summarizes the public SWE-bench Pro task set in this repository and
documents how the repository, language, and task-type breakdowns were produced.

## Source of Truth

The public benchmark task count should be taken from:

- `helper_code/sweap_eval_full_v2.jsonl`

That JSONL contains 731 rows, matching the public leaderboard report. Counting
`run_scripts/` directly overcounts in this checkout, because `run_scripts/`
contains 1,000 local instance directories.

Comparison between the JSONL and local run scripts:

| Measure | Count |
|---|---:|
| Public dataset rows in `helper_code/sweap_eval_full_v2.jsonl` | 731 |
| Local `run_scripts/instance_*` directories | 1,000 |
| Public dataset IDs with matching run scripts | 731 |
| Extra local run scripts not referenced by the public JSONL | 269 |
| Public dataset IDs missing run scripts | 0 |

The 269 extra local run scripts are not part of the current public task set
because their `instance_id`s do not appear in `helper_code/sweap_eval_full_v2.jsonl`.

Extra local scripts by repository:

| Repository | Extra Scripts |
|---|---:|
| element-hq/element-web | 80 |
| navidrome/navidrome | 57 |
| tutao/tutanota | 54 |
| gravitational/teleport | 17 |
| internetarchive/openlibrary | 13 |
| ansible/ansible | 12 |
| protonmail/webclients | 11 |
| qutebrowser/qutebrowser | 10 |
| NodeBB/NodeBB | 7 |
| flipt-io/flipt | 6 |
| future-architect/vuls | 2 |

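One way to reproduce the per-repository split of the extra scripts is to group the unmatched directory names by repository. The sketch below assumes directory names follow the pattern `instance_<owner>__<repo>-<40-hex commit>` with an optional trailing suffix. That naming pattern, the helpers `repo_from_instance_id` and `count_by_repo`, and the sample IDs are assumptions for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Assumed naming pattern: instance_<owner>__<repo>-<40-hex commit>[-<suffix>].
# Check it against the real directory names before relying on it.
_ID_RE = re.compile(
    r"^instance_(?P<owner>.+?)__(?P<repo>.+?)-(?P<sha>[0-9a-f]{40})(?:-.*)?$"
)


def repo_from_instance_id(instance_id):
    """Return 'owner/repo', or None if the ID does not fit the assumed pattern."""
    match = _ID_RE.match(instance_id)
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}"


def count_by_repo(instance_ids):
    """Count instance IDs per repository; unparseable IDs land under None."""
    return Counter(repo_from_instance_id(i) for i in instance_ids)


# Hypothetical IDs, invented only to exercise the parser.
sample_extra = [
    "instance_element-hq__element-web-" + "a" * 40,
    "instance_element-hq__element-web-" + "b" * 40 + "-vnan",
    "instance_NodeBB__NodeBB-" + "c" * 40,
]
extra_by_repo = count_by_repo(sample_extra)
for repo, n in extra_by_repo.most_common():
    print(repo, n)
```

On the real checkout, the input would be the set of `run_scripts/` directory names that do not appear in the JSONL. Any IDs counted under `None` indicate that the assumed pattern needs adjusting.
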
## Public Task Breakdown

Total public tasks: 731

### By Repository

| Repository | Language | Tasks | Percent |
|---|---|---:|---:|
| ansible/ansible | Python | 96 | 13.1% |
| internetarchive/openlibrary | Python | 91 | 12.4% |
| flipt-io/flipt | Go | 85 | 11.6% |
| qutebrowser/qutebrowser | Python | 79 | 10.8% |
| gravitational/teleport | Go | 76 | 10.4% |
| protonmail/webclients | TypeScript | 65 | 8.9% |
| future-architect/vuls | Go | 62 | 8.5% |
| navidrome/navidrome | Go | 57 | 7.8% |
| element-hq/element-web | TypeScript | 56 | 7.7% |
| NodeBB/NodeBB | JavaScript | 44 | 6.0% |
| tutao/tutanota | TypeScript | 20 | 2.7% |

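The counts and percentages above can be recomputed by tallying one field per row. This is a minimal sketch, assuming each JSONL row carries a `repo` field holding the `owner/name` string. That field name and the helpers `breakdown` and `load_rows` are assumptions, so adjust them to the actual schema.

```python
import json
from collections import Counter


def breakdown(rows, key):
    """Return (value, count, percent) tuples sorted by descending count."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return [
        (value, n, round(100 * n / total, 1))
        for value, n in counts.most_common()
    ]


def load_rows(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Tiny in-memory stand-in for the JSONL rows (hypothetical data).
sample_rows = [
    {"repo": "ansible/ansible"},
    {"repo": "ansible/ansible"},
    {"repo": "flipt-io/flipt"},
    {"repo": "NodeBB/NodeBB"},
]
result = breakdown(sample_rows, "repo")
for value, n, pct in result:
    print(f"{value}\t{n}\t{pct}%")
```

If the field name is right, `breakdown(load_rows("helper_code/sweap_eval_full_v2.jsonl"), "repo")` should reproduce the table.
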
### By Language Type

Language was mapped from each repository's primary implementation language, not
inferred per changed file.

| Language | Tasks | Percent |
|---|---:|---:|
| Go | 280 | 38.3% |
| Python | 266 | 36.4% |
| TypeScript | 141 | 19.3% |
| JavaScript | 44 | 6.0% |

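Because language is assigned per repository, the language totals follow directly from the repository table. The sketch below hard-codes the languages and task counts from that table and re-derives the totals. The name `REPO_INFO` is illustrative.

```python
from collections import Counter

# Repository -> (primary language, public task count),
# copied from the By Repository table.
REPO_INFO = {
    "ansible/ansible": ("Python", 96),
    "internetarchive/openlibrary": ("Python", 91),
    "flipt-io/flipt": ("Go", 85),
    "qutebrowser/qutebrowser": ("Python", 79),
    "gravitational/teleport": ("Go", 76),
    "protonmail/webclients": ("TypeScript", 65),
    "future-architect/vuls": ("Go", 62),
    "navidrome/navidrome": ("Go", 57),
    "element-hq/element-web": ("TypeScript", 56),
    "NodeBB/NodeBB": ("JavaScript", 44),
    "tutao/tutanota": ("TypeScript", 20),
}

# Sum task counts per language.
language_tasks = Counter()
for language, tasks in REPO_INFO.values():
    language_tasks[language] += tasks

total = sum(language_tasks.values())
for language, tasks in language_tasks.most_common():
    print(f"{language}: {tasks} ({100 * tasks / total:.1f}%)")
```
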
### By Task Type

The dataset does not include an official `task_type` column. These categories
were derived with a deterministic keyword classifier over `problem_statement`.

| Task Type | Tasks | Percent |
|---|---:|---:|
| Security/Auth | 194 | 26.5% |
| UI/UX | 174 | 23.8% |
| Bug Fix | 95 | 13.0% |
| Feature/Enhancement | 81 | 11.1% |
| Performance/Optimization | 77 | 10.5% |
| Refactor/Maintenance | 63 | 8.6% |
| Tests/CI/Tooling | 45 | 6.2% |
| Unclassified/General Fix | 2 | 0.3% |

## Deterministic Task-Type Classification

Classification used only each row's `problem_statement`, not `patch` or
`test_patch`. An earlier classifier pass included patches and over-counted
Tests/CI/Tooling because every benchmark task includes test patches.

For each row:

1. Extract a title from common Markdown patterns:
   - `# Title: ...`
   - `# ...`
   - `**Title: ...**`
   - `Title: ...`
2. Lowercase the extracted title plus the first 2,500 characters of the problem
   statement.
3. Apply ordered regex rules. First match wins.
4. Fall back to `Unclassified/General Fix`.

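Steps 1 and 2 can be sketched as follows. The exact regexes used by the original classifier are not recorded in this note, so the patterns, their precedence, and the helper names `extract_title` and `classifier_text` are assumptions.

```python
import re

# Title patterns in the order listed in step 1; the first pattern that
# matches anywhere in the statement wins. These regexes are illustrative
# guesses, not the original classifier's.
_FLAGS = re.IGNORECASE | re.MULTILINE
TITLE_PATTERNS = [
    re.compile(r"^[ \t]*#+[ \t]*Title:[ \t]*(.+?)[ \t]*$", _FLAGS),
    re.compile(r"^[ \t]*#+[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    re.compile(r"^[ \t]*\*\*[ \t]*Title:[ \t]*(.+?)[ \t]*\*\*[ \t]*$", _FLAGS),
    re.compile(r"^[ \t]*Title:[ \t]*(.+?)[ \t]*$", _FLAGS),
]

MAX_BODY_CHARS = 2500


def extract_title(problem_statement):
    """Return the first title found, or an empty string if nothing matches."""
    for pattern in TITLE_PATTERNS:
        match = pattern.search(problem_statement)
        if match:
            return match.group(1)
    return ""


def classifier_text(problem_statement):
    """Lowercased title plus the first 2,500 characters of the statement."""
    title = extract_title(problem_statement)
    return f"{title}\n{problem_statement[:MAX_BODY_CHARS]}".lower()


print(extract_title("# Title: Login fails\n\nbody"))
```

Joining the title and body with a newline is a design choice made here so that a keyword cannot accidentally form across the boundary between the two.
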
Classifier order:

1. Security/Auth
2. Performance/Optimization
3. UI/UX
4. Tests/CI/Tooling
5. Refactor/Maintenance
6. Feature/Enhancement
7. Bug Fix
8. Unclassified/General Fix

Representative keyword groups:

| Task Type | Representative Keywords |
|---|---|
| Security/Auth | security, vulnerable, cve, authentication, authorization, permission, login, logout, password, token, oauth, saml, oidc, 2fa, mfa, encrypt, decrypt, crypto, certificate, tls, ssl, xss, csrf, secret, credential |
| Performance/Optimization | performance, optimize, slow, speed, latency, timeout, hang, deadlock, memory leak, cache, efficient, expensive, scale, rate limit, throughput |
| UI/UX | ui, ux, user interface, frontend, button, modal, dialog, screen, page, view, display, render, layout, css, style, theme, tooltip, form, toast, accessibility, keyboard, focus, mobile, responsive, visual |
| Tests/CI/Tooling | testing, pytest, jest, unit test, integration test, ci, workflow, lint, eslint, ruff, mypy, build system, docker, script, cli, tooling, harness |
| Refactor/Maintenance | refactor, cleanup, rename, reorganize, move file, deprecated, migration, compatibility, upgrade, dependency, maintainability, technical debt |
| Feature/Enhancement | feature request, enhancement, add support, implement, new endpoint, new api, allow users, ability to, introduce, provide, enable, create |
| Bug Fix | bug, fix, incorrect, wrong, failure, error, exception, crash, broken, regression, cannot, unable, expected behavior, actual behavior |

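The ordered, first-match-wins rule can be sketched as below. Each rule uses only a small subset of the representative keywords, and the whole-word matching is an assumption (without it, short keywords such as `ui` or `ci` would match inside longer words). The full keyword lists and exact regexes of the original classifier are not reproduced here, and the names `keyword_rule`, `RULES`, and `classify` are illustrative.

```python
import re


def keyword_rule(*keywords):
    """Compile keywords into one regex that matches any of them as whole words."""
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternation})\b")


# Rules in classifier order. Keyword subsets are illustrative, not complete.
RULES = [
    ("Security/Auth",
     keyword_rule("security", "authentication", "login", "password", "token")),
    ("Performance/Optimization",
     keyword_rule("performance", "slow", "latency", "timeout", "cache")),
    ("UI/UX",
     keyword_rule("ui", "button", "modal", "dialog", "render", "layout")),
    ("Tests/CI/Tooling",
     keyword_rule("pytest", "jest", "unit test", "ci", "lint", "docker")),
    ("Refactor/Maintenance",
     keyword_rule("refactor", "cleanup", "rename", "deprecated", "migration")),
    ("Feature/Enhancement",
     keyword_rule("feature request", "enhancement", "add support", "implement")),
    ("Bug Fix",
     keyword_rule("bug", "fix", "incorrect", "error", "crash", "regression")),
]

FALLBACK = "Unclassified/General Fix"


def classify(text):
    """Return the first category whose rule matches the lowercased text."""
    lowered = text.lower()
    for category, pattern in RULES:
        if pattern.search(lowered):
            return category
    return FALLBACK


print(classify("Login button is rendered twice"))  # Security/Auth
print(classify("Search page is slow"))             # Performance/Optimization
print(classify("Circuit diagram is outdated"))     # Unclassified/General Fix
```

The first example shows the tie-break: the text mentions both a login and a button, and it lands in Security/Auth because that rule is checked first. The third example shows why whole-word matching matters, since "circuit" contains both `ci` and `ui`.
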
## Reproduction Snippets

Count public rows (`wc -l` counts newline characters, so this assumes one JSON
object per line, no blank lines, and a trailing newline):

```bash
wc -l helper_code/sweap_eval_full_v2.jsonl
```

Compare public IDs to local run scripts:

```python
import json
from pathlib import Path

# Collect the public instance IDs, skipping blank lines.
with open("helper_code/sweap_eval_full_v2.jsonl", encoding="utf-8") as f:
    ids = {json.loads(line)["instance_id"] for line in f if line.strip()}

# Each local task has its own directory under run_scripts/.
scripts = {p.name for p in Path("run_scripts").iterdir() if p.is_dir()}

print("dataset ids", len(ids))
print("run_scripts dirs", len(scripts))
print("overlap", len(ids & scripts))
print("extra scripts not in dataset", len(scripts - ids))
print("dataset ids missing scripts", len(ids - scripts))
```

## Caveats

- Repository counts are taken directly from
  `helper_code/sweap_eval_full_v2.jsonl` and are exact.
- Language counts are mapped by repository, not by individual changed files.
- Task-type counts are heuristic, deterministic, and reproducible, but they are
  not official benchmark annotations.
- Some tasks span multiple categories. Because first match wins, a task with
  both authentication and UI language is classified as Security/Auth.