Commit a2c9bd3

sjarmak and claude committed
docs: replace SDLC/Org distinction with unified 9 work-type taxonomy
All 275 tasks represent enterprise-scale developer work. The artificial SDLC vs Org split (an artifact of build order) is replaced by 9 work types: crossrepo, understand, refactor, security, feature, debug, fix, test, document.

Structural complexity (single/dual/multi-repo) is now a secondary analysis dimension within each work type. On-disk dirs retain csb_sdlc_*/csb_org_* prefixes for backward compat.

Updated: README, benchmarks/README, REPORT_CONTEXT, LEADERBOARD, EVALUATION_PIPELINE, CONFIGS.
New: taxonomy_rationale.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent cda9a5e commit a2c9bd3

7 files changed: +201 -191 lines changed

README.md

Lines changed: 49 additions & 73 deletions
@@ -1,9 +1,9 @@
 # CodeScaleBench

-Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.
+Benchmark suite for evaluating how AI coding agents leverage external context retrieval tools on realistic developer tasks in large, enterprise-scale codebases.

 This repository contains:
-- **Benchmark task definitions** (SDLC and Org suites with task specs, tests, and metadata)
+- **275 benchmark tasks** across 9 developer work types (debug, fix, feature, refactor, security, understand, crossrepo, test, document)
 - **Evaluation and run configs** (paired baseline vs MCP-enabled execution modes)
 - **Metrics extraction and reporting pipelines** for score/cost/retrieval analysis
 - **Run artifacts and agent traces** (in `runs/` and published summaries under `docs/official_results/`)
@@ -69,58 +69,34 @@ bash configs/run_selected_tasks.sh --dry-run

 ---

-## CodeScaleBench-SDLC
-
-Nine suites organized by software development lifecycle phase:
-
-| Suite | SDLC Phase | Tasks | Description |
-|-------|-----------|------:|-------------|
-| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
-| `csb_sdlc_fix` | Bug Repair | 19 | Diagnosing and fixing real bugs across production codebases |
-| `csb_sdlc_refactor` | Cross-File Refactoring | 18 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
-| `csb_sdlc_debug` | Debugging & Investigation | 13 | Root cause tracing, fault localization, provenance |
-| `csb_sdlc_secure` | Security & Compliance | 13 | CVE analysis, reachability, governance, access control |
-| `csb_sdlc_test` | Testing & QA | 12 | Code review, performance testing, code search validation, test generation |
-| `csb_sdlc_design` | Architecture & Design | 11 | Architecture analysis, dependency graphs, change impact |
-| `csb_sdlc_document` | Documentation | 11 | API references, architecture docs, migration guides, runbooks |
-| `csb_sdlc_understand` | Requirements & Discovery | 11 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
-| **Total** | | **131** | |
-
-## CodeScaleBench-Org
-
-Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
-
-| Suite | Category | Tasks | Description |
-|-------|----------|------:|-------------|
-| `csb_org_migration` | Framework Migration | 25 | API migrations, breaking changes across repos |
-| `csb_org_compliance` | Compliance | 13 | Standards adherence, audit, and provenance workflows |
-| `csb_org_incident` | Incident Debugging | 13 | Error-to-code-path tracing across microservices |
-| `csb_org_platform` | Platform Knowledge | 13 | Service template discovery and tribal knowledge |
-| `csb_org_security` | Vulnerability Remediation | 13 | CVE mapping, missing auth middleware across repos |
-| `csb_org_crossorg` | Cross-Org Discovery | 12 | Interface implementations and authoritative repo identification across orgs |
-| `csb_org_crossrepo` | Cross-Repo Discovery | 11 | Cross-repo search, dependency discovery, impact analysis |
-| `csb_org_crossrepo_tracing` | Dependency Tracing | 11 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `csb_org_domain` | Domain Lineage | 11 | Domain-specific lineage and analysis workflows |
-| `csb_org_onboarding` | Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
-| `csb_org_org` | Organizational Context | 11 | Agentic discovery, org-wide coding correctness |
-| **Total** | | **144** | |
-
-**Combined canonical benchmark: 275 tasks** (131 SDLC across 9 suites + 144 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform sizing. Non-canonical tasks are archived in `benchmarks/backups/`.
+## Task Taxonomy
+
+All tasks represent realistic developer work in large, often multi-repo, enterprise codebases. Tasks are organized by **developer work type** (what the developer is doing), not by an artificial SDLC/Org distinction. See [docs/explanations/taxonomy_rationale.md](docs/explanations/taxonomy_rationale.md) for the design rationale.
+
+| Work Type | Tasks | Description | Repo Scope |
+|-----------|------:|-------------|------------|
+| **crossrepo** | 47 | Cross-repo navigation, dependency tracing, org-wide discovery | 18 single, 9 dual, 20 multi |
+| **understand** | 44 | Codebase comprehension, architecture, onboarding, domain knowledge | 36 single, 4 dual, 4 multi |
+| **refactor** | 43 | Code transformation, migration, dependency updates | 26 single, 2 dual, 15 multi |
+| **security** | 39 | Security review, vulnerability remediation, compliance audit | 26 single, 2 dual, 11 multi |
+| **feature** | 34 | Feature implementation, org-wide feature work | 24 single, 2 dual, 8 multi |
+| **debug** | 26 | Debugging, root cause analysis, incident triage | 15 single, 8 dual, 3 multi |
+| **fix** | 19 | Bug repair from issue reports | 19 single |
+| **test** | 12 | Test generation, code review, QA | 12 single |
+| **document** | 11 | API docs, architecture docs, migration guides | 10 single, 1 dual |
+| **Total** | **275** | | 186 single, 28 dual, 61 multi |
+
+**Structural complexity** varies within each work type. Tasks range from single-repo (186) through dual-repo (28) to multi-repo (61), enabling analysis of whether context retrieval tools help more as repo scope widens.
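The repo-scope column in the taxonomy table can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table, while the names `REPO_SCOPE`, `scope_totals`, and `multi_repo_share` are hypothetical and not part of the repository.

```python
# Repo-scope counts per work type, copied from the taxonomy table.
# Tuple order: (single-repo, dual-repo, multi-repo).
# REPO_SCOPE and the helpers below are hypothetical names for illustration.
REPO_SCOPE = {
    "crossrepo": (18, 9, 20),
    "understand": (36, 4, 4),
    "refactor": (26, 2, 15),
    "security": (26, 2, 11),
    "feature": (24, 2, 8),
    "debug": (15, 8, 3),
    "fix": (19, 0, 0),
    "test": (12, 0, 0),
    "document": (10, 1, 0),
}


def scope_totals(table):
    """Sum each scope column across all work types."""
    single = sum(row[0] for row in table.values())
    dual = sum(row[1] for row in table.values())
    multi = sum(row[2] for row in table.values())
    return single, dual, multi


def multi_repo_share(table):
    """Fraction of each work type's tasks that span more than one repo."""
    return {
        name: (dual + multi) / (single + dual + multi)
        for name, (single, dual, multi) in table.items()
    }


if __name__ == "__main__":
    print(scope_totals(REPO_SCOPE))
    for name, share in multi_repo_share(REPO_SCOPE).items():
        print(f"{name}: {share:.2f}")
```

A per-work-type share like this is one possible starting point for asking whether retrieval tools help more as repo scope widens; the actual analysis scripts may slice the data differently.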

 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better, not whether MCP can access repos the baseline can't.

-See [docs/ORG_TASKS.md](docs/ORG_TASKS.md) for the full task system, authoring guide, and oracle evaluation framework. See [docs/ORG_CALIBRATION.md](docs/ORG_CALIBRATION.md) for oracle coverage analysis.
+Non-canonical tasks are archived in `benchmarks/backups/`. See [docs/ORG_TASKS.md](docs/ORG_TASKS.md) for the oracle evaluation framework.

 ---

 ## 2-Config Evaluation Matrix

-All benchmarks are evaluated across two primary configurations (Baseline vs MCP). The concrete run config names differ by task type:
-
-- **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
-- **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct`
-
-At a high level, the distinction is:
+All 275 tasks are evaluated across two primary configurations (Baseline vs MCP):

 | Config Name | Internal MCP mode | MCP Tools Available |
 |-------------------|---------------------|---------------------|
@@ -134,27 +110,27 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 ## Repository Structure

 ```
-benchmarks/ # Task definitions organized by SDLC phase + Org
-csb_sdlc_feature/ # Feature Implementation (23 tasks)
-csb_sdlc_fix/ # Bug Repair (19 tasks)
-csb_sdlc_refactor/ # Cross-File Refactoring (18 tasks)
-csb_sdlc_debug/ # Debugging & Investigation (13 tasks)
-csb_sdlc_secure/ # Security & Compliance (13 tasks)
-csb_sdlc_test/ # Testing & QA (12 tasks)
-csb_sdlc_design/ # Architecture & Design (11 tasks)
-csb_sdlc_document/ # Documentation (11 tasks)
-csb_sdlc_understand/ # Requirements & Discovery (11 tasks)
-csb_org_migration/ # Org: framework migration (25 tasks)
-csb_org_compliance/ # Org: compliance & audit (13 tasks)
-csb_org_incident/ # Org: incident debugging (13 tasks)
-csb_org_platform/ # Org: platform knowledge (13 tasks)
-csb_org_security/ # Org: vulnerability remediation (13 tasks)
-csb_org_crossorg/ # Org: cross-org discovery (12 tasks)
-csb_org_crossrepo/ # Org: cross-repo discovery (11 tasks)
-csb_org_crossrepo_tracing/ # Org: dependency tracing (11 tasks)
-csb_org_domain/ # Org: domain lineage (11 tasks)
-csb_org_onboarding/ # Org: onboarding (11 tasks)
-csb_org_org/ # Org: org context (11 tasks)
+benchmarks/ # 275 tasks across 20 source directories (9 work types)
+csb_sdlc_feature/ # feature: Feature Implementation (23 tasks)
+csb_sdlc_fix/ # fix: Bug Repair (19 tasks)
+csb_sdlc_refactor/ # refactor: Cross-File Refactoring (18 tasks)
+csb_sdlc_debug/ # debug: Debugging & Investigation (13 tasks)
+csb_sdlc_secure/ # security: CVE analysis, governance (13 tasks)
+csb_sdlc_test/ # test: Testing & QA (12 tasks)
+csb_sdlc_design/ # understand: Architecture analysis (11 tasks)
+csb_sdlc_document/ # document: API references, guides (11 tasks)
+csb_sdlc_understand/ # understand: Comprehension, onboarding (11 tasks)
+csb_org_migration/ # refactor: Framework migration (25 tasks)
+csb_org_compliance/ # security: Compliance & audit (13 tasks)
+csb_org_incident/ # debug: Incident debugging (13 tasks)
+csb_org_platform/ # crossrepo: Platform knowledge (13 tasks)
+csb_org_security/ # security: Vulnerability remediation (13 tasks)
+csb_org_crossorg/ # crossrepo: Cross-org discovery (12 tasks)
+csb_org_crossrepo/ # crossrepo: Cross-repo discovery (11 tasks)
+csb_org_crossrepo_tracing/ # crossrepo: Dependency tracing (11 tasks)
+csb_org_domain/ # understand: Domain lineage (11 tasks)
+csb_org_onboarding/ # understand: Onboarding (11 tasks)
+csb_org_org/ # feature: Org-wide feature work (11 tasks)
 backups/ # Archived non-canonical tasks
 configs/ # Run configs and task selection
 _common.sh # Shared infra: token refresh, parallel execution, multi-account
@@ -169,7 +145,7 @@ configs/ # Run configs and task selection
 test_2config.sh # Phase wrapper: Test (20 tasks)
 run_selected_tasks.sh # Unified runner for all tasks
 validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
-selected_benchmark_tasks.json # Canonical task selection: 275 tasks (131 SDLC + 144 Org)
+selected_benchmark_tasks.json # Canonical task selection: 275 tasks across 9 work types
 use_case_registry.json # 100 GTM use cases (Org task source)
 archive/ # Pre-SDLC migration scripts (preserved for history)
 scripts/ # Metrics extraction, evaluation, and operational tooling
@@ -274,7 +250,7 @@ This writes:
 Suite summaries are deduplicated to the latest result per
 `suite + config + task_name`; full historical rows remain in
 `official_results.json` under `all_tasks`.
-For SDLC suites, export normalizes legacy config labels:
+Export normalizes legacy config labels:
 `baseline` -> `baseline-local-direct`, `mcp` -> `mcp-remote-direct`.
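As a rough illustration of the two export rules above (label normalization, then keeping the latest result per `suite + config + task_name`), here is a minimal Python sketch. It is not the repository's export code: the row shape, the `finished_at` timestamp key, and the function names are assumptions made for the example.

```python
# Legacy labels and their canonical names, as stated in the export note.
LEGACY_CONFIG_LABELS = {
    "baseline": "baseline-local-direct",
    "mcp": "mcp-remote-direct",
}


def normalize_config(label):
    """Map a legacy config label to its canonical name; pass others through."""
    return LEGACY_CONFIG_LABELS.get(label, label)


def latest_per_task(rows):
    """Keep the newest row per (suite, config, task_name).

    Assumption: each row is a dict with the hypothetical keys "suite",
    "config", "task_name", and a sortable "finished_at" timestamp string.
    """
    latest = {}
    for row in rows:
        config = normalize_config(row["config"])
        key = (row["suite"], config, row["task_name"])
        current = latest.get(key)
        if current is None or row["finished_at"] > current["finished_at"]:
            # Store a copy carrying the canonical label, leaving input intact.
            latest[key] = {**row, "config": config}
    return list(latest.values())
```

Normalizing before building the key matters: otherwise a `baseline` row and a `baseline-local-direct` row for the same task would be treated as two different configs and both would survive deduplication.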

 Serve locally:
@@ -291,7 +267,7 @@ For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical a

 This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and `python3 scripts/check_infra.py`.

-### SDLC Tasks
+### All Tasks

 The unified runner executes all 275 canonical tasks across the 2-config matrix:

@@ -325,13 +301,13 @@ bash configs/understand_2config.sh # 11 Requirements & Discovery tasks

 ### Filtering by Suite

-All tasks (SDLC and Org) are in the unified `selected_benchmark_tasks.json`. Filter by suite with the `--benchmark` flag:
+All 275 tasks are in `selected_benchmark_tasks.json`. Filter by source directory with the `--benchmark` flag:

 ```bash
-# Run only Org security tasks
+# Run only security-related tasks from a specific source
 bash configs/run_selected_tasks.sh --benchmark csb_org_security

-# Run only SDLC fix tasks
+# Run only fix tasks
 bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix
 ```

benchmarks/README.md

Lines changed: 17 additions & 41 deletions
@@ -1,50 +1,26 @@
 # CodeScaleBench Benchmarks

-This directory contains SDLC-aligned suites plus Org org-scale retrieval suites. The canonical task set is defined by [`unified_benchmark_manifest.json`](../configs/unified_benchmark_manifest.json) (275 tasks across 20 suites: 131 SDLC + 144 Org). Suite sizes use DOE-driven Neyman-optimal allocation to maximize statistical power per suite.
+275 tasks representing realistic developer work in large, enterprise-scale codebases. Tasks are organized by **developer work type** across 20 source directories. Suite sizes use DOE-driven Neyman-optimal allocation.

-Non-canonical tasks are archived in `backups/`.
-
-See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
+Non-canonical tasks are archived in `backups/`. See [`docs/explanations/taxonomy_rationale.md`](../docs/explanations/taxonomy_rationale.md) for the design rationale. See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.

 ---

-## SDLC Suite Overview
-
-| Suite | SDLC Phase | Tasks | Description |
-|-------|-----------|------:|-------------|
-| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
-| `csb_sdlc_fix` | Bug Repair | 19 | Diagnosing and fixing real bugs across production codebases |
-| `csb_sdlc_refactor` | Cross-File Refactoring | 18 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
-| `csb_sdlc_debug` | Debugging & Investigation | 13 | Root cause tracing, fault localization, provenance |
-| `csb_sdlc_secure` | Security & Compliance | 13 | CVE analysis, reachability, governance, access control |
-| `csb_sdlc_test` | Testing & QA | 12 | Code review, performance testing, code search validation, test generation |
-| `csb_sdlc_design` | Architecture & Design | 11 | Architecture analysis, dependency graphs, change impact |
-| `csb_sdlc_document` | Documentation | 11 | API references, architecture docs, migration guides, runbooks |
-| `csb_sdlc_understand` | Requirements & Discovery | 11 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
-| **Total** | | **131** | |
-
----
-
-## CodeScaleBench-Org Suite Overview
-
-These suites measure cross-repo discovery, tracing, and org-scale code intelligence use cases.
+## Work Types

-| Suite | Tasks | Description |
-|-------|------:|-------------|
-| `csb_org_migration` | 25 | Framework and platform migrations across repos |
-| `csb_org_compliance` | 13 | Compliance, audit, and provenance workflows |
-| `csb_org_incident` | 13 | Incident debugging across services and repos |
-| `csb_org_platform` | 13 | Platform/devtools and tribal-knowledge discovery |
-| `csb_org_security` | 13 | Vulnerability remediation and security analysis at org scale |
-| `csb_org_crossorg` | 12 | Cross-org discovery and authoritative repo identification |
-| `csb_org_crossrepo` | 11 | Cross-repo search, dependency discovery, impact analysis |
-| `csb_org_crossrepo_tracing` | 11 | Cross-repo dependency tracing and symbol resolution |
-| `csb_org_domain` | 11 | Domain-specific lineage and analysis workflows |
-| `csb_org_onboarding` | 11 | Onboarding, architecture comprehension, API discovery |
-| `csb_org_org` | 11 | Org-wide coding correctness tasks requiring broad context |
-| **Total** | **144** | |
+| Work Type | Tasks | Source Directories | Repo Scope |
+|-----------|------:|-------------------|------------|
+| **crossrepo** | 47 | `csb_org_crossrepo` (11), `csb_org_crossrepo_tracing` (11), `csb_org_crossorg` (12), `csb_org_platform` (13) | 18 single, 9 dual, 20 multi |
+| **understand** | 44 | `csb_sdlc_understand` (11), `csb_sdlc_design` (11), `csb_org_domain` (11), `csb_org_onboarding` (11) | 36 single, 4 dual, 4 multi |
+| **refactor** | 43 | `csb_sdlc_refactor` (18), `csb_org_migration` (25) | 26 single, 2 dual, 15 multi |
+| **security** | 39 | `csb_sdlc_secure` (13), `csb_org_security` (13), `csb_org_compliance` (13) | 26 single, 2 dual, 11 multi |
+| **feature** | 34 | `csb_sdlc_feature` (23), `csb_org_org` (11) | 24 single, 2 dual, 8 multi |
+| **debug** | 26 | `csb_sdlc_debug` (13), `csb_org_incident` (13) | 15 single, 8 dual, 3 multi |
+| **fix** | 19 | `csb_sdlc_fix` (19) | 19 single |
+| **test** | 12 | `csb_sdlc_test` (12) | 12 single |
+| **document** | 11 | `csb_sdlc_document` (11) | 10 single, 1 dual |

-For suite taxonomy, authoring, and oracle evaluation details, see [`docs/ORG_TASKS.md`](../docs/ORG_TASKS.md).
+The on-disk `csb_sdlc_*` and `csb_org_*` prefixes are legacy naming from the original build phases. All tasks target enterprise-scale codebases; the prefix does not imply a meaningful SDLC/Org distinction. The reporting layer maps directories to work types for analysis.
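The directory-to-work-type mapping described above could look roughly like the following Python sketch. The directory names and task counts are taken from the Work Types table; the `SOURCE_DIRS` structure and the function names are hypothetical and do not describe how the actual reporting layer is implemented.

```python
from collections import Counter

# Source directory -> (work type, task count), copied from the Work Types table.
# The dict name and helper functions are hypothetical, for illustration only.
SOURCE_DIRS = {
    "csb_org_crossrepo": ("crossrepo", 11),
    "csb_org_crossrepo_tracing": ("crossrepo", 11),
    "csb_org_crossorg": ("crossrepo", 12),
    "csb_org_platform": ("crossrepo", 13),
    "csb_sdlc_understand": ("understand", 11),
    "csb_sdlc_design": ("understand", 11),
    "csb_org_domain": ("understand", 11),
    "csb_org_onboarding": ("understand", 11),
    "csb_sdlc_refactor": ("refactor", 18),
    "csb_org_migration": ("refactor", 25),
    "csb_sdlc_secure": ("security", 13),
    "csb_org_security": ("security", 13),
    "csb_org_compliance": ("security", 13),
    "csb_sdlc_feature": ("feature", 23),
    "csb_org_org": ("feature", 11),
    "csb_sdlc_debug": ("debug", 13),
    "csb_org_incident": ("debug", 13),
    "csb_sdlc_fix": ("fix", 19),
    "csb_sdlc_test": ("test", 12),
    "csb_sdlc_document": ("document", 11),
}


def work_type_for(directory):
    """Return the work type for an on-disk source directory name."""
    return SOURCE_DIRS[directory][0]


def tasks_per_work_type():
    """Aggregate per-directory task counts into per-work-type totals."""
    totals = Counter()
    for work_type, count in SOURCE_DIRS.values():
        totals[work_type] += count
    return dict(totals)


if __name__ == "__main__":
    for work_type, total in sorted(tasks_per_work_type().items()):
        print(f"{work_type}: {total}")
```

Keeping the mapping in one table like this means the legacy `csb_sdlc_*`/`csb_org_*` prefixes never need to be parsed; a directory's work type is looked up rather than inferred from its name.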

 ---

@@ -69,7 +45,7 @@ Each task follows this layout:
 # Run all 275 canonical tasks across 2 configs
 bash configs/run_selected_tasks.sh

-# Run a single SDLC phase
+# Run a specific source directory
 bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix

 # Single task
@@ -79,4 +55,4 @@ harbor run --path benchmarks/csb_sdlc_feature/servo-scrollend-event-feat-001 \
 -n 1
 ```

-See [`docs/CONFIGS.md`](../docs/CONFIGS.md) for the full tool-by-tool breakdown of each config.
+See [`docs/reference/CONFIGS.md`](../docs/reference/CONFIGS.md) for the full tool-by-tool breakdown of each config.

docs/EVALUATION_PIPELINE.md

Lines changed: 4 additions & 4 deletions
@@ -52,8 +52,8 @@ Harbor run output (result.json, transcript)

 ## Layer 1: Deterministic Verifiers

-Every task ships a `tests/test.sh` (SDLC tasks) or `tests/eval.sh` (Org
-tasks) that runs inside the Docker container after the agent finishes. The
+Every task ships a `tests/test.sh` or `tests/eval.sh`
+that runs inside the Docker container after the agent finishes. The
 verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt`. Canonical
 tasks should also emit `/logs/verifier/validation_result.json` using the schema
 in [docs/reference/VALIDATION_RESULT_SCHEMA.md](reference/VALIDATION_RESULT_SCHEMA.md)
@@ -165,9 +165,9 @@ python3 scripts/run_judge.py --run runs/official/my_run/ --force

 Output: `judge_result.json` written alongside each task's `result.json`.

-### Hybrid Scoring (CodeScaleBench-Org Tasks)
+### Hybrid Scoring (Tasks with Criteria)

-Org tasks with `tests/criteria.json` support hybrid evaluation:
+Tasks with `tests/criteria.json` support hybrid evaluation:
 `composite = 0.6 * verifier_reward + 0.4 * rubric_score`. Enable with
 `--hybrid` flag on `run_judge.py`.
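The composite formula above is simple enough to sketch directly. The snippet below is an illustration only, not the code in `run_judge.py`: the function name and the range check are assumptions, and treating `rubric_score` as lying in 0.0 to 1.0 (like the verifier reward) is also an assumption.

```python
VERIFIER_WEIGHT = 0.6  # weight on the deterministic verifier reward
RUBRIC_WEIGHT = 0.4    # weight on the LLM-judge rubric score


def hybrid_score(verifier_reward, rubric_score):
    """Compute composite = 0.6 * verifier_reward + 0.4 * rubric_score.

    Assumption: both inputs lie in [0.0, 1.0]; values outside that range
    are rejected rather than silently clamped.
    """
    for name, value in (
        ("verifier_reward", verifier_reward),
        ("rubric_score", rubric_score),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
    return VERIFIER_WEIGHT * verifier_reward + RUBRIC_WEIGHT * rubric_score
```

Under these assumptions, a task that passes its verifier fully but earns half of the rubric points would score about 0.8, so the deterministic verifier remains the dominant term.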