docs: replace SDLC/Org distinction with unified 9 work-type taxonomy
All 275 tasks represent enterprise-scale developer work. The artificial
SDLC vs Org split (an artifact of build order) is replaced by 9 work
types: crossrepo, understand, refactor, security, feature, debug, fix,
test, document. Structural complexity (single/dual/multi-repo) is now a
secondary analysis dimension within each work type.
On-disk dirs retain csb_sdlc_*/csb_org_* prefixes for backward compat.
Updated: README, benchmarks/README, REPORT_CONTEXT, LEADERBOARD,
EVALUATION_PIPELINE, CONFIGS. New: taxonomy_rationale.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md: 49 additions & 73 deletions
@@ -1,9 +1,9 @@
1
1
# CodeScaleBench
2
2
3
-
Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.
3
+
Benchmark suite for evaluating how AI coding agents leverage external context retrieval tools on realistic developer tasks in large, enterprise-scale codebases.
4
4
5
5
This repository contains:
6
-
- **Benchmark task definitions** (SDLC and Org suites with task specs, tests, and metadata)
**Combined canonical benchmark: 275 tasks** (131 SDLC across 9 suites + 144 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform sizing. Non-canonical tasks are archived in `benchmarks/backups/`.
72
+
## Task Taxonomy
73
+
74
+
All tasks represent realistic developer work in large, often multi-repo, enterprise codebases. Tasks are organized by **developer work type** — what the developer is doing — not by an artificial SDLC/Org distinction. See [docs/explanations/taxonomy_rationale.md](docs/explanations/taxonomy_rationale.md) for the design rationale.
75
+
76
+
| Work Type | Tasks | Description | Repo Scope |
|-----------|------:|-------------|------------|
| **crossrepo** | 47 | Cross-repo navigation, dependency tracing, org-wide discovery | 18 single, 9 dual, 20 multi |
| **understand** | 44 | Codebase comprehension, architecture, onboarding, domain knowledge | 36 single, 4 dual, 4 multi |
| **refactor** | 43 | Code transformation, migration, dependency updates | 26 single, 2 dual, 15 multi |
| **security** | 39 | Security review, vulnerability remediation, compliance audit | 26 single, 2 dual, 11 multi |
| **feature** | 34 | Feature implementation, org-wide feature work | 24 single, 2 dual, 8 multi |
| **debug** | 26 | Debugging, root cause analysis, incident triage | 15 single, 8 dual, 3 multi |
| **fix** | 19 | Bug repair from issue reports | 19 single |
| **test** | 12 | Test generation, code review, QA | 12 single |
| **document** | 11 | API docs, architecture docs, migration guides | 10 single, 1 dual |
| **Total** | **275** | | 186 single, 28 dual, 61 multi |
88
+
89
+
**Structural complexity** varies within each work type. Tasks range from single-repo (186) through dual-repo (28) to multi-repo (61), enabling analysis of whether context retrieval tools help more as repo scope widens.
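As a quick consistency check on the table above, the per-work-type repo-scope counts can be tallied in a few lines. This is only an illustrative sketch: the counts are copied from the Task Taxonomy table, while the names `REPO_SCOPE`, `scope_totals`, and `multi_repo_share` are hypothetical and not part of the repository.

```python
# Repo-scope counts per work type, copied from the Task Taxonomy table.
# Each value is (single, dual, multi). All names here are hypothetical.
REPO_SCOPE = {
    "crossrepo": (18, 9, 20),
    "understand": (36, 4, 4),
    "refactor": (26, 2, 15),
    "security": (26, 2, 11),
    "feature": (24, 2, 8),
    "debug": (15, 8, 3),
    "fix": (19, 0, 0),
    "test": (12, 0, 0),
    "document": (10, 1, 0),
}


def scope_totals(table):
    """Sum the single/dual/multi columns across all work types."""
    return {
        "single": sum(row[0] for row in table.values()),
        "dual": sum(row[1] for row in table.values()),
        "multi": sum(row[2] for row in table.values()),
    }


def multi_repo_share(table):
    """Fraction of each work type's tasks that span two or more repos."""
    return {
        work_type: (dual + multi) / (single + dual + multi)
        for work_type, (single, dual, multi) in table.items()
    }


totals = scope_totals(REPO_SCOPE)
shares = multi_repo_share(REPO_SCOPE)
```

Grouping results by the share of dual- and multi-repo tasks is one way the secondary "structural complexity" dimension could be analyzed within each work type.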
109
90
110
91
Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
111
92
112
-
See [docs/ORG_TASKS.md](docs/ORG_TASKS.md) for the full task system, authoring guide, and oracle evaluation framework. See [docs/ORG_CALIBRATION.md](docs/ORG_CALIBRATION.md) for oracle coverage analysis.
93
+
Non-canonical tasks are archived in `benchmarks/backups/`. See [docs/ORG_TASKS.md](docs/ORG_TASKS.md) for the oracle evaluation framework.
113
94
114
95
---
115
96
116
97
## 2-Config Evaluation Matrix
117
98
118
-
All benchmarks are evaluated across two primary configurations (Baseline vs MCP). The concrete run config names differ by task type:
This directory contains SDLC-aligned suites plus Org org-scale retrieval suites. The canonical task set is defined by [`unified_benchmark_manifest.json`](../configs/unified_benchmark_manifest.json) (275 tasks across 20 suites: 131 SDLC + 144 Org). Suite sizes use DOE-driven Neyman-optimal allocation to maximize statistical power per suite.
3
+
275 tasks representing realistic developer work in large, enterprise-scale codebases. Tasks are organized by **developer work type** across 20 source directories. Suite sizes use DOE-driven Neyman-optimal allocation.
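For readers unfamiliar with Neyman allocation, the textbook rule assigns each stratum a sample size proportional to its population size times its standard deviation. The sketch below illustrates that general rule only; it is not the project's allocation code, and the strata names, sizes, and standard deviations are made-up values for illustration.

```python
from math import floor


def neyman_allocation(total, sizes, stddevs):
    """Split `total` samples across strata in proportion to N_h * S_h."""
    weights = {h: sizes[h] * stddevs[h] for h in sizes}
    weight_sum = sum(weights.values())
    raw = {h: total * w / weight_sum for h, w in weights.items()}
    alloc = {h: floor(x) for h, x in raw.items()}
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = total - sum(alloc.values())
    by_fraction = sorted(raw, key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in by_fraction[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical strata: sizes and standard deviations are illustrative only.
sizes = {"suite_a": 100, "suite_b": 100, "suite_c": 200}
stddevs = {"suite_a": 1.0, "suite_b": 3.0, "suite_c": 2.0}
allocation = neyman_allocation(30, sizes, stddevs)
```

Under this rule, strata with higher variance receive more samples than uniform sizing would give them, which is why suite sizes differ.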
4
4
5
-
Non-canonical tasks are archived in `backups/`.
6
-
7
-
See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
5
+
Non-canonical tasks are archived in `backups/`. See [`docs/explanations/taxonomy_rationale.md`](../docs/explanations/taxonomy_rationale.md) for the design rationale. See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
8
6
9
7
---
10
8
11
-
## SDLC Suite Overview
12
-
13
-
| Suite | SDLC Phase | Tasks | Description |
|-------|-----------|------:|-------------|
| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
| `csb_sdlc_fix` | Bug Repair | 19 | Diagnosing and fixing real bugs across production codebases |
+
| **crossrepo** | 47 | `csb_org_crossrepo` (11), `csb_org_crossrepo_tracing` (11), `csb_org_crossorg` (12), `csb_org_platform` (13) | 18 single, 9 dual, 20 multi |
| **understand** | 44 | `csb_sdlc_understand` (11), `csb_sdlc_design` (11), `csb_org_domain` (11), `csb_org_onboarding` (11) | 36 single, 4 dual, 4 multi |
| **refactor** | 43 | `csb_sdlc_refactor` (18), `csb_org_migration` (25) | 26 single, 2 dual, 15 multi |
| **security** | 39 | `csb_sdlc_secure` (13), `csb_org_security` (13), `csb_org_compliance` (13) | 26 single, 2 dual, 11 multi |
| **feature** | 34 | `csb_sdlc_feature` (23), `csb_org_org` (11) | 24 single, 2 dual, 8 multi |
| **debug** | 26 | `csb_sdlc_debug` (13), `csb_org_incident` (13) | 15 single, 8 dual, 3 multi |
| **fix** | 19 | `csb_sdlc_fix` (19) | 19 single |
| **test** | 12 | `csb_sdlc_test` (12) | 12 single |
| **document** | 11 | `csb_sdlc_document` (11) | 10 single, 1 dual |
46
22
47
-
For suite taxonomy, authoring, and oracle evaluation details, see [`docs/ORG_TASKS.md`](../docs/ORG_TASKS.md).
23
+
The on-disk `csb_sdlc_*` and `csb_org_*` prefixes are legacy naming from the original build phases. All tasks target enterprise-scale codebases; the prefix does not imply a meaningful SDLC/Org distinction. The reporting layer maps directories to work types for analysis.
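A directory-to-work-type mapping of this kind might look like the sketch below. The directory names and task counts are taken from the table above; the structure and the names `DIRECTORIES`, `work_type_for`, and `tasks_per_work_type` are assumptions for illustration and may not match the actual reporting layer.

```python
from collections import Counter

# Source directory -> (work type, task count), copied from the table above.
# The data structure and function names are hypothetical.
DIRECTORIES = {
    "csb_org_crossrepo": ("crossrepo", 11),
    "csb_org_crossrepo_tracing": ("crossrepo", 11),
    "csb_org_crossorg": ("crossrepo", 12),
    "csb_org_platform": ("crossrepo", 13),
    "csb_sdlc_understand": ("understand", 11),
    "csb_sdlc_design": ("understand", 11),
    "csb_org_domain": ("understand", 11),
    "csb_org_onboarding": ("understand", 11),
    "csb_sdlc_refactor": ("refactor", 18),
    "csb_org_migration": ("refactor", 25),
    "csb_sdlc_secure": ("security", 13),
    "csb_org_security": ("security", 13),
    "csb_org_compliance": ("security", 13),
    "csb_sdlc_feature": ("feature", 23),
    "csb_org_org": ("feature", 11),
    "csb_sdlc_debug": ("debug", 13),
    "csb_org_incident": ("debug", 13),
    "csb_sdlc_fix": ("fix", 19),
    "csb_sdlc_test": ("test", 12),
    "csb_sdlc_document": ("document", 11),
}


def work_type_for(directory):
    """Look up the work type for an on-disk suite directory."""
    return DIRECTORIES[directory][0]


def tasks_per_work_type(directories):
    """Aggregate per-directory task counts into per-work-type totals."""
    totals = Counter()
    for work_type, count in directories.values():
        totals[work_type] += count
    return dict(totals)


work_type_totals = tasks_per_work_type(DIRECTORIES)
```

Keeping the legacy prefixes in the keys means on-disk paths stay unchanged while reports group by work type.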