
Commit 228a64e

sjarmak and claude committed
docs: update READMEs and configs for ccb_feature/ccb_refactor split
Update README.md and benchmarks/README.md with new suite tables (9 SDLC suites, 199 tasks, 294 total). Update task counts and suite references in 6 config wrappers, _common.sh, and ground_truth_files.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent b8f25b6 commit 228a64e

10 files changed (+88, -59 lines)

README.md

Lines changed: 16 additions & 12 deletions
@@ -65,19 +65,20 @@ bash configs/run_selected_tasks.sh --dry-run

 ## Benchmark Suites (SDLC-Aligned)

-Eight suites organized by software development lifecycle phase:
+Nine suites organized by software development lifecycle phase:

 | Suite | SDLC Phase | Tasks | Description |
 |-------|-----------|------:|-------------|
 | `ccb_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
-| `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
+| `ccb_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
+| `ccb_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
 | `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
 | `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **170** | |
+| **Total** | | **199** | |

 ## MCP-Unique Suites (Org-Scale Context Retrieval)

@@ -89,16 +90,16 @@ Eleven additional suites measure cross-repo discovery, symbol resolution, depend
 | `ccb_mcp_security` | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
 | `ccb_mcp_migration` | C: Framework Migration | 7 | API migrations, breaking changes across repos |
 | `ccb_mcp_incident` | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
-| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
+| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 25 | API consumption mapping, end-to-end flow, architecture maps |
 | `ccb_mcp_compliance` | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
 | `ccb_mcp_crossorg` | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
 | `ccb_mcp_domain` | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
 | `ccb_mcp_org` | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
 | `ccb_mcp_platform` | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
 | `ccb_mcp_crossrepo` | Legacy | 1 | Cross-repo discovery (compatibility) |
-| **Total** | | **81** | |
+| **Total** | | **95** | |

-**Combined catalog total: 251 tasks** (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.
+**Combined catalog total: 294 tasks** (199 SDLC across 9 suites + 95 MCP-unique across 11 suites).

 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.

@@ -110,7 +111,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste

 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:

-- **SDLC suites** (`ccb_build`, `ccb_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
+- **SDLC suites** (`ccb_feature`, `ccb_refactor`, `ccb_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
 - **MCP-unique suites** (`ccb_mcp_*`): `baseline-local-artifact` + `mcp-remote-artifact`

 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
@@ -130,7 +131,8 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con

 ```
 benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
-ccb_build/ # Feature & Refactoring (25 tasks)
+ccb_feature/ # Feature Implementation (20 tasks)
+ccb_refactor/ # Cross-File Refactoring (20 tasks)
 ccb_debug/ # Debugging & Investigation (20 tasks)
 ccb_design/ # Architecture & Design (20 tasks)
 ccb_document/ # Documentation (20 tasks)
@@ -152,7 +154,8 @@ benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
 configs/ # Run configs and task selection
 _common.sh # Shared infra: token refresh, parallel execution, multi-account
 sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
-build_2config.sh # Phase wrapper: Build (25 tasks)
+feature_2config.sh # Phase wrapper: Feature (20 tasks)
+refactor_2config.sh # Phase wrapper: Refactor (20 tasks)
 debug_2config.sh # Phase wrapper: Debug (20 tasks)
 design_2config.sh # Phase wrapper: Design (20 tasks)
 document_2config.sh # Phase wrapper: Document (20 tasks)
@@ -280,10 +283,10 @@ This section assumes Harbor is already installed and configured. If not, start w

 ### SDLC Tasks

-The unified runner executes all 170 SDLC tasks across the 2-config matrix:
+The unified runner executes all 199 SDLC tasks across the 2-config matrix:

 ```bash
-# Run all 170 SDLC tasks across 2 configs
+# Run all 199 SDLC tasks across 2 configs
 bash configs/run_selected_tasks.sh

 # Run only the baseline config
@@ -300,7 +303,8 @@ Per-phase runners are also available:

 ```bash
 bash configs/fix_2config.sh # 25 Bug Repair tasks
-bash configs/build_2config.sh # 25 Feature & Refactoring tasks
+bash configs/feature_2config.sh # 20 Feature Implementation tasks
+bash configs/refactor_2config.sh # 20 Cross-File Refactoring tasks
 bash configs/understand_2config.sh # 20 Requirements & Discovery tasks
 bash configs/design_2config.sh # 20 Architecture & Design tasks
 bash configs/debug_2config.sh # 20 Debugging & Investigation tasks
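
The SDLC suite table touched in README.md ends in a Total row, and a small script can cross-check that row against the per-suite counts. The sketch below is one way to do that, assuming the table keeps the four-column layout shown in the diff; `sdlc_table` and `sum_suite_tasks` are hypothetical helper names, not part of the repository. On the rows as they appear in this diff it prints 185, while the Total row reads 199, so the two are worth reconciling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emits the post-commit SDLC table exactly as shown in the diff.
sdlc_table() {
  cat <<'EOF'
| Suite | SDLC Phase | Tasks | Description |
|-------|-----------|------:|-------------|
| `ccb_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
| `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
| `ccb_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
| `ccb_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
| `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
| `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
| **Total** | | **199** | |
EOF
}

# Hypothetical helper: sums the Tasks column for rows whose first cell is a `ccb_*` name.
# The header, separator, and Total rows do not match the pattern and are skipped.
sum_suite_tasks() {
  awk -F'|' '
    /^\| *`ccb_/ {
      gsub(/[ \t]/, "", $4)   # $4 is the Tasks cell; $1 is the empty text before the first pipe
      total += $4
    }
    END { print total + 0 }
  '
}

printf 'per-suite rows sum to %s\n' "$(sdlc_table | sum_suite_tasks)"
```

The same function can be pointed at a file, for example `sum_suite_tasks < README.md`, though that would also count any other table whose rows start with a `ccb_*` cell.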

benchmarks/README.md

Lines changed: 47 additions & 22 deletions
@@ -1,6 +1,6 @@
 # CodeContextBench Benchmarks

-This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 251 selected tasks across 19 suites).
+This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).

 See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.

@@ -13,12 +13,13 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 | `ccb_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
-| `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
+| `ccb_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
+| `ccb_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
 | `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
 | `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **170** | |
+| **Total** | | **199** | |

 ---

@@ -137,40 +138,64 @@ Diagnosing and fixing real bugs across production codebases (SWE-bench Pro, PyTo

 ---

-## ccb_build (25 tasks) — Feature & Refactoring
+## ccb_feature (20 tasks) — Feature Implementation

-New feature implementation, code refactoring, and dependency management tasks.
+New feature implementation, interface implementation, and big-code feature tasks.

 | Task | Focus |
 |------|-------|
 | `bustub-hyperloglog-impl-001` | Implement HyperLogLog cardinality estimator |
 | `camel-fix-protocol-feat-001` | Implement camel-fix component FIX protocol |
-| `cgen-deps-install-001` | Set required package configuration |
-| `codecoverage-deps-install-001` | Configure project dependency versions |
-| `flipt-flagexists-refactor-001` | Add FlagExists to ReadOnlyFlagStore (Flipt) |
-| `dotenv-expand-deps-install-001` | Fix build system dependencies |
-| `dotnetkoans-deps-install-001` | Edit build dependencies, tests pass |
+| `cilium-policy-audit-logger-feat-001` | Implement Cilium policy audit logger |
+| `cilium-policy-quota-feat-001` | Implement Cilium policy quota enforcement |
+| `curl-http3-priority-feat-001` | Implement curl HTTP/3 priority support |
+| `django-rate-limit-middleware-feat-001` | Implement Django rate limit middleware |
+| `envoy-custom-header-filter-feat-001` | Implement Envoy custom header filter |
 | `envoy-grpc-server-impl-001` | Identify gRPC server implementations |
-| `eslint-markdown-deps-install-001` | Add missing package dependencies |
 | `flink-pricing-window-feat-001` | Implement PricingSessionWindow for trading |
-| `flipt-dep-refactor-001` | Dependency refactoring (Flipt) |
-| `python-http-class-naming-refac-001` | Standardize HTTP class naming |
-| `iamactionhunter-deps-install-001` | Resolve missing dependencies build |
 | `k8s-noschedule-taint-feat-001` | Implement NoScheduleNoTraffic taint effect |
 | `k8s-runtime-object-impl-001` | Find runtime.Object interface implementors |
-| `k8s-score-normalizer-refac-001` | Rename ScoreExtensions to ScoreNormalizer |
-| `kafka-batch-accumulator-refac-001` | Rename RecordAccumulator to BatchAccumulator |
-| `pcap-parser-deps-install-001` | Setup library dependencies correctly |
-| `rust-subtype-relation-refac-001` | Rename SubtypePredicate to SubtypeRelation |
+| `numpy-rolling-median-feat-001` | Implement NumPy rolling median |
+| `pandas-merge-asof-indicator-feat-001` | Implement pandas merge_asof indicator |
+| `prometheus-silence-bulk-api-feat-001` | Implement Prometheus silence bulk API |
+| `pytorch-gradient-noise-feat-001` | Implement PyTorch gradient noise |
 | `servo-scrollend-event-feat-001` | Add scrollend DOM event support |
-| `similar-asserts-deps-install-001` | Configure Cargo dependency resolution |
 | `strata-cds-tranche-feat-001` | Implement CDS tranche CDO product |
-| `strata-fx-european-refac-001` | Rename FxVanillaOption to FxEuropeanOption |
 | `tensorrt-mxfp4-quant-feat-001` | Add W4A8_MXFP4_INT8 quantization mode |
+| `terraform-compact-diff-fmt-feat-001` | Implement Terraform compact diff format |
 | `vscode-stale-diagnostics-feat-001` | Fix stale diagnostics after git branch |

 ---

+## ccb_refactor (20 tasks) — Cross-File Refactoring
+
+Cross-file refactoring, enterprise dependency refactoring, and rename refactoring tasks.
+
+| Task | Focus |
+|------|-------|
+| `cilium-endpoint-manager-refac-001` | Refactor Cilium endpoint manager |
+| `curl-multi-process-refac-001` | Refactor curl multi-process handling |
+| `django-request-factory-refac-001` | Refactor Django request factory |
+| `envoy-listener-manager-refac-001` | Refactor Envoy listener manager |
+| `etcd-raft-storage-refac-001` | Refactor etcd raft storage layer |
+| `flipt-dep-refactor-001` | Dependency refactoring (Flipt) |
+| `flipt-flagexists-refactor-001` | Add FlagExists to ReadOnlyFlagStore (Flipt) |
+| `istio-discovery-server-refac-001` | Refactor Istio discovery server |
+| `k8s-score-normalizer-refac-001` | Rename ScoreExtensions to ScoreNormalizer |
+| `kafka-batch-accumulator-refac-001` | Rename RecordAccumulator to BatchAccumulator |
+| `kubernetes-scheduler-profile-refac-001` | Refactor Kubernetes scheduler profile |
+| `numpy-array-dispatch-refac-001` | Refactor NumPy array dispatch |
+| `pandas-index-engine-refac-001` | Refactor pandas index engine |
+| `prometheus-query-engine-refac-001` | Refactor Prometheus query engine |
+| `python-http-class-naming-refac-001` | Standardize HTTP class naming |
+| `pytorch-optimizer-foreach-refac-001` | Refactor PyTorch optimizer foreach |
+| `rust-subtype-relation-refac-001` | Rename SubtypePredicate to SubtypeRelation |
+| `scikit-learn-estimator-tags-refac-001` | Refactor scikit-learn estimator tags |
+| `strata-fx-european-refac-001` | Rename FxVanillaOption to FxEuropeanOption |
+| `terraform-eval-context-refac-001` | Refactor Terraform eval context |
+
+---
+
 ## ccb_test (20 tasks) — Testing & QA

 Code review with injected defects, performance testing, and code search validation.
@@ -305,14 +330,14 @@ Each task follows this layout:
 ## Running Benchmarks

 ```bash
-# Run all selected tasks across 2 configs (currently 251 entries in selected_benchmark_tasks.json)
+# Run all selected tasks across 2 configs (currently 294 entries in selected_benchmark_tasks.json)
 bash configs/run_selected_tasks.sh

 # Run a single SDLC phase
 bash configs/run_selected_tasks.sh --benchmark ccb_fix

 # Single task
-harbor run --path benchmarks/ccb_build/servo-scrollend-event-feat-001 \
+harbor run --path benchmarks/ccb_feature/servo-scrollend-event-feat-001 \
 --agent-import-path agents.claude_baseline_agent:BaselineClaudeCodeAgent \
 --model anthropic/claude-haiku-4-5-20251001 \
 -n 1
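
With `ccb_build` split in two, any script that iterates over SDLC suites needs the new names. The sketch below prints one `--benchmark` invocation per suite rather than executing anything. It is a minimal illustration: the suite names and the `--benchmark` flag are taken from the diff, while `SDLC_SUITES` and `print_suite_commands` are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# The nine SDLC suite names listed in the table after this commit.
SDLC_SUITES="ccb_understand ccb_design ccb_fix ccb_feature ccb_refactor ccb_test ccb_document ccb_secure ccb_debug"

# Hypothetical helper: print (do not run) one runner invocation per suite.
print_suite_commands() {
  local suite
  # Intentionally unquoted so the space-separated list splits into words.
  for suite in $SDLC_SUITES; do
    printf 'bash configs/run_selected_tasks.sh --benchmark %s\n' "$suite"
  done
}

print_suite_commands
```

Printing the commands first makes it easy to confirm that no stale `ccb_build` reference remains before starting a real run.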

configs/_common.sh

Lines changed: 2 additions & 2 deletions
@@ -229,8 +229,8 @@ ensure_base_images() {

 # Pre-build all Docker images for a suite to warm the layer cache.
 # Call before run_paired_configs so Harbor's docker compose build is instant.
-# Args: $1 = suite name (e.g., ccb_build), remaining args passed through
-# Example: prebuild_images "ccb_build" --tasks "task1,task2"
+# Args: $1 = suite name (e.g., ccb_feature), remaining args passed through
+# Example: prebuild_images "ccb_feature" --tasks "task1,task2"
 prebuild_images() {
 local suite="${1:-}"
 shift || true
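
The first two lines of `prebuild_images` use a common argument-handling idiom: take the first argument with an empty default, then shift it away so the rest can be passed through. The sketch below isolates that idiom in a hypothetical stand-in, `demo_prebuild`. Only the `local suite="${1:-}"` and `shift || true` lines come from the diff; the usage message and the echo are assumptions made for illustration, and the real function body is not shown in this hunk.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for prebuild_images; only the argument handling is mirrored.
demo_prebuild() {
  local suite="${1:-}"   # empty string, not an unbound-variable error, when no args are given
  shift || true          # drop $1; "|| true" keeps a failing shift from aborting under `set -e` in bash
  if [ -z "$suite" ]; then
    echo "usage: demo_prebuild <suite> [args...]" >&2
    return 1
  fi
  # Everything left in "$@" is what would be passed through to the underlying command.
  echo "suite=$suite passthrough=$*"
}

demo_prebuild "ccb_feature" --tasks "task1,task2"
# prints: suite=ccb_feature passthrough=--tasks task1,task2
```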

configs/codex_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 # --agent-path PATH Override Harbor agent import path
 # --parallel N Max parallel task subshells (default: 1)
 # --category CATEGORY Run category label for jobs dir (default: staging)
-# --benchmark BENCH Optional benchmark filter (e.g. ccb_build, ccb_fix)
+# --benchmark BENCH Optional benchmark filter (e.g. ccb_feature, ccb_fix)

 set -e

configs/copilot_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 # --agent-path PATH Override Harbor agent import path
 # --parallel N Max parallel task subshells (default: 1)
 # --category CATEGORY Run category label for jobs dir (default: staging)
-# --benchmark BENCH Optional benchmark filter (e.g. ccb_build, ccb_fix)
+# --benchmark BENCH Optional benchmark filter (e.g. ccb_feature, ccb_fix)

 set -e

configs/cursor_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 # --agent-path PATH Override Harbor agent import path
 # --parallel N Max parallel task subshells (default: 1)
 # --category CATEGORY Run category label for jobs dir (default: staging)
-# --benchmark BENCH Optional benchmark filter (e.g. ccb_build, ccb_fix)
+# --benchmark BENCH Optional benchmark filter (e.g. ccb_feature, ccb_fix)

 set -e

configs/gemini_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 # --agent-path PATH Override Harbor agent import path
 # --parallel N Max parallel task subshells (default: 1)
 # --category CATEGORY Run category label for jobs dir (default: staging)
-# --benchmark BENCH Optional benchmark filter (e.g. ccb_build, ccb_fix)
+# --benchmark BENCH Optional benchmark filter (e.g. ccb_feature, ccb_fix)

 set -e
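
The four agent wrappers document the same options in their header comments. As a rough illustration of how such flags are commonly parsed in shell, the sketch below handles only the four options visible in these hunks, with the defaults stated in the comments. The function name, variable names, and error messages are hypothetical; the wrappers' actual parsing code is not part of this diff and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical parser for the four options visible in the wrapper header comments.
parse_wrapper_args() {
  AGENT_PATH=""        # --agent-path PATH
  PARALLEL=1           # --parallel N (default: 1)
  CATEGORY="staging"   # --category CATEGORY (default: staging)
  BENCHMARK=""         # --benchmark BENCH (optional filter, e.g. ccb_feature)
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --agent-path|--parallel|--category|--benchmark)
        # Every option shown takes exactly one value.
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 1
        fi
        case "$1" in
          --agent-path) AGENT_PATH="$2" ;;
          --parallel)   PARALLEL="$2" ;;
          --category)   CATEGORY="$2" ;;
          --benchmark)  BENCHMARK="$2" ;;
        esac
        shift 2
        ;;
      *)
        echo "unknown option: $1" >&2
        return 1
        ;;
    esac
  done
}

parse_wrapper_args --benchmark ccb_feature --parallel 4
echo "benchmark=$BENCHMARK parallel=$PARALLEL category=$CATEGORY"
# prints: benchmark=ccb_feature parallel=4 category=staging
```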
