docs: update READMEs and configs for ccb_feature/ccb_refactor split

sjarmak · claude · sjarmak · commit 228a64ec03ac · 2026-02-28T21:28:46.000Z
Update README.md and benchmarks/README.md with the new suite tables
(9 SDLC suites with 199 tasks; 294 tasks total once the 95 MCP-unique
tasks are included). Update task counts and suite references in
6 config wrappers, _common.sh, and ground_truth_files.json.
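
A minimal sketch, not part of this commit, of how the per-suite counts in a README suite table could be cross-checked against its Total row after a suite is split or resized. The function name `sum_task_column` and the sample table are hypothetical. The sketch assumes the four-column layout used by the suite tables (Suite | SDLC Phase | Tasks | Description) and a bolded Total row.

```python
from typing import Optional, Tuple


def sum_task_column(table: str) -> Tuple[int, Optional[int]]:
    """Return (sum of per-suite task counts, stated total or None).

    Assumes the task count is the third column of a Markdown table and
    that the summary row is labelled "Total" (optionally bolded).
    """
    row_sum = 0
    stated_total = None
    for line in table.splitlines():
        # Drop the outer pipes, then split the row into cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 3:
            continue
        name = cells[0].replace("*", "")
        count = cells[2].replace("*", "")
        if not count.isdigit():
            continue  # header row or |---| separator row
        if name.lower() == "total":
            stated_total = int(count)
        else:
            row_sum += int(count)
    return row_sum, stated_total


# Hypothetical two-suite table, used only to illustrate the check.
demo_table = """
| Suite | SDLC Phase | Tasks | Description |
|-------|-----------|------:|-------------|
| `suite_a` | Phase A | 20 | Example suite |
| `suite_b` | Phase B | 25 | Example suite |
| **Total** | | **45** | |
"""

row_sum, stated_total = sum_task_column(demo_table)
if stated_total is not None and row_sum != stated_total:
    print(f"mismatch: rows sum to {row_sum}, table states {stated_total}")
else:
    print(f"ok: {row_sum} tasks")
```

Running the same function over each suite table in README.md and benchmarks/README.md would flag a Total row that no longer matches its per-suite rows.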

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -65,19 +65,20 @@ bash configs/run_selected_tasks.sh --dry-run
 
 ## Benchmark Suites (SDLC-Aligned)
 
-Eight suites organized by software development lifecycle phase:
+Nine suites organized by software development lifecycle phase:
 
 | Suite | SDLC Phase | Tasks | Description |
 |-------|-----------|------:|-------------|
 | `ccb_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
-| `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
+| `ccb_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
+| `ccb_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
 | `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
 | `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **170** | |
+| **Total** | | **199** | |
 
 ## MCP-Unique Suites (Org-Scale Context Retrieval)
 
@@ -89,16 +90,16 @@ Eleven additional suites measure cross-repo discovery, symbol resolution, depend
 | `ccb_mcp_security` | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
 | `ccb_mcp_migration` | C: Framework Migration | 7 | API migrations, breaking changes across repos |
 | `ccb_mcp_incident` | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
-| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
+| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 25 | API consumption mapping, end-to-end flow, architecture maps |
 | `ccb_mcp_compliance` | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
 | `ccb_mcp_crossorg` | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
 | `ccb_mcp_domain` | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
 | `ccb_mcp_org` | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
 | `ccb_mcp_platform` | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
 | `ccb_mcp_crossrepo` | Legacy | 1 | Cross-repo discovery (compatibility) |
-| **Total** | | **81** | |
+| **Total** | | **95** | |
 
-**Combined catalog total: 251 tasks** (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.
+**Combined catalog total: 294 tasks** (199 SDLC across 9 suites + 95 MCP-unique across 11 suites).
 
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
@@ -110,7 +111,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 
-- **SDLC suites** (`ccb_build`, `ccb_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
+- **SDLC suites** (`ccb_feature`, `ccb_refactor`, `ccb_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
 - **MCP-unique suites** (`ccb_mcp_*`): `baseline-local-artifact` + `mcp-remote-artifact`
 
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
@@ -130,7 +131,8 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 
 ```
 benchmarks/              # Task definitions organized by SDLC phase + MCP-unique
-  ccb_build/             #   Feature & Refactoring (25 tasks)
+  ccb_feature/           #   Feature Implementation (20 tasks)
+  ccb_refactor/          #   Cross-File Refactoring (20 tasks)
   ccb_debug/             #   Debugging & Investigation (20 tasks)
   ccb_design/            #   Architecture & Design (20 tasks)
   ccb_document/          #   Documentation (20 tasks)
@@ -152,7 +154,8 @@ benchmarks/              # Task definitions organized by SDLC phase + MCP-unique
 configs/                 # Run configs and task selection
   _common.sh             #   Shared infra: token refresh, parallel execution, multi-account
   sdlc_suite_2config.sh  #   Generic SDLC runner (used by phase wrappers below)
-  build_2config.sh       #   Phase wrapper: Build (25 tasks)
+  feature_2config.sh     #   Phase wrapper: Feature (20 tasks)
+  refactor_2config.sh    #   Phase wrapper: Refactor (20 tasks)
   debug_2config.sh       #   Phase wrapper: Debug (20 tasks)
   design_2config.sh      #   Phase wrapper: Design (20 tasks)
   document_2config.sh    #   Phase wrapper: Document (20 tasks)
@@ -280,10 +283,10 @@ This section assumes Harbor is already installed and configured. If not, start w
 
 ### SDLC Tasks
 
-The unified runner executes all 170 SDLC tasks across the 2-config matrix:
+The unified runner executes all 199 SDLC tasks across the 2-config matrix:
 
 ```bash
-# Run all 170 SDLC tasks across 2 configs
+# Run all 199 SDLC tasks across 2 configs
 bash configs/run_selected_tasks.sh
 
 # Run only the baseline config
@@ -300,7 +303,8 @@ Per-phase runners are also available:
 
 ```bash
 bash configs/fix_2config.sh              # 25 Bug Repair tasks
-bash configs/build_2config.sh            # 25 Feature & Refactoring tasks
+bash configs/feature_2config.sh          # 20 Feature Implementation tasks
+bash configs/refactor_2config.sh         # 20 Cross-File Refactoring tasks
 bash configs/understand_2config.sh       # 20 Requirements & Discovery tasks
 bash configs/design_2config.sh           # 20 Architecture & Design tasks
 bash configs/debug_2config.sh            # 20 Debugging & Investigation tasks
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -1,6 +1,6 @@
 # CodeContextBench Benchmarks
 
-This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 251 selected tasks across 19 suites).
+This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
 
 See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
@@ -13,12 +13,13 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 | `ccb_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
-| `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
+| `ccb_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
+| `ccb_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
 | `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
 | `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **170** | |
+| **Total** | | **199** | |
 
 ---
 
@@ -137,40 +138,64 @@ Diagnosing and fixing real bugs across production codebases (SWE-bench Pro, PyTo
 
 ---
 
-## ccb_build (25 tasks) — Feature & Refactoring
+## ccb_feature (20 tasks) — Feature Implementation
 
-New feature implementation, code refactoring, and dependency management tasks.
+New feature implementation, interface implementation, and big-code feature tasks.
 
 | Task | Focus |
 |------|-------|
 | `bustub-hyperloglog-impl-001` | Implement HyperLogLog cardinality estimator |
 | `camel-fix-protocol-feat-001` | Implement camel-fix component FIX protocol |
-| `cgen-deps-install-001` | Set required package configuration |
-| `codecoverage-deps-install-001` | Configure project dependency versions |
-| `flipt-flagexists-refactor-001` | Add FlagExists to ReadOnlyFlagStore (Flipt) |
-| `dotenv-expand-deps-install-001` | Fix build system dependencies |
-| `dotnetkoans-deps-install-001` | Edit build dependencies, tests pass |
+| `cilium-policy-audit-logger-feat-001` | Implement Cilium policy audit logger |
+| `cilium-policy-quota-feat-001` | Implement Cilium policy quota enforcement |
+| `curl-http3-priority-feat-001` | Implement curl HTTP/3 priority support |
+| `django-rate-limit-middleware-feat-001` | Implement Django rate limit middleware |
+| `envoy-custom-header-filter-feat-001` | Implement Envoy custom header filter |
 | `envoy-grpc-server-impl-001` | Identify gRPC server implementations |
-| `eslint-markdown-deps-install-001` | Add missing package dependencies |
 | `flink-pricing-window-feat-001` | Implement PricingSessionWindow for trading |
-| `flipt-dep-refactor-001` | Dependency refactoring (Flipt) |
-| `python-http-class-naming-refac-001` | Standardize HTTP class naming |
-| `iamactionhunter-deps-install-001` | Resolve missing dependencies build |
 | `k8s-noschedule-taint-feat-001` | Implement NoScheduleNoTraffic taint effect |
 | `k8s-runtime-object-impl-001` | Find runtime.Object interface implementors |
-| `k8s-score-normalizer-refac-001` | Rename ScoreExtensions to ScoreNormalizer |
-| `kafka-batch-accumulator-refac-001` | Rename RecordAccumulator to BatchAccumulator |
-| `pcap-parser-deps-install-001` | Setup library dependencies correctly |
-| `rust-subtype-relation-refac-001` | Rename SubtypePredicate to SubtypeRelation |
+| `numpy-rolling-median-feat-001` | Implement NumPy rolling median |
+| `pandas-merge-asof-indicator-feat-001` | Implement pandas merge_asof indicator |
+| `prometheus-silence-bulk-api-feat-001` | Implement Prometheus silence bulk API |
+| `pytorch-gradient-noise-feat-001` | Implement PyTorch gradient noise |
 | `servo-scrollend-event-feat-001` | Add scrollend DOM event support |
-| `similar-asserts-deps-install-001` | Configure Cargo dependency resolution |
 | `strata-cds-tranche-feat-001` | Implement CDS tranche CDO product |
-| `strata-fx-european-refac-001` | Rename FxVanillaOption to FxEuropeanOption |
 | `tensorrt-mxfp4-quant-feat-001` | Add W4A8_MXFP4_INT8 quantization mode |
+| `terraform-compact-diff-fmt-feat-001` | Implement Terraform compact diff format |
 | `vscode-stale-diagnostics-feat-001` | Fix stale diagnostics after git branch |
 
 ---
 
+## ccb_refactor (20 tasks) — Cross-File Refactoring
+
+Cross-file refactoring, enterprise dependency refactoring, and rename refactoring tasks.
+
+| Task | Focus |
+|------|-------|
+| `cilium-endpoint-manager-refac-001` | Refactor Cilium endpoint manager |
+| `curl-multi-process-refac-001` | Refactor curl multi-process handling |
+| `django-request-factory-refac-001` | Refactor Django request factory |
+| `envoy-listener-manager-refac-001` | Refactor Envoy listener manager |
+| `etcd-raft-storage-refac-001` | Refactor etcd raft storage layer |
+| `flipt-dep-refactor-001` | Dependency refactoring (Flipt) |
+| `flipt-flagexists-refactor-001` | Add FlagExists to ReadOnlyFlagStore (Flipt) |
+| `istio-discovery-server-refac-001` | Refactor Istio discovery server |
+| `k8s-score-normalizer-refac-001` | Rename ScoreExtensions to ScoreNormalizer |
+| `kafka-batch-accumulator-refac-001` | Rename RecordAccumulator to BatchAccumulator |
+| `kubernetes-scheduler-profile-refac-001` | Refactor Kubernetes scheduler profile |
+| `numpy-array-dispatch-refac-001` | Refactor NumPy array dispatch |
+| `pandas-index-engine-refac-001` | Refactor pandas index engine |
+| `prometheus-query-engine-refac-001` | Refactor Prometheus query engine |
+| `python-http-class-naming-refac-001` | Standardize HTTP class naming |
+| `pytorch-optimizer-foreach-refac-001` | Refactor PyTorch optimizer foreach |
+| `rust-subtype-relation-refac-001` | Rename SubtypePredicate to SubtypeRelation |
+| `scikit-learn-estimator-tags-refac-001` | Refactor scikit-learn estimator tags |
+| `strata-fx-european-refac-001` | Rename FxVanillaOption to FxEuropeanOption |
+| `terraform-eval-context-refac-001` | Refactor Terraform eval context |
+
+---
+
 ## ccb_test (20 tasks) — Testing & QA
 
 Code review with injected defects, performance testing, and code search validation.
@@ -305,14 +330,14 @@ Each task follows this layout:
 ## Running Benchmarks
 
 ```bash
-# Run all selected tasks across 2 configs (currently 251 entries in selected_benchmark_tasks.json)
+# Run all selected tasks across 2 configs (currently 294 entries in selected_benchmark_tasks.json)
 bash configs/run_selected_tasks.sh
 
 # Run a single SDLC phase
 bash configs/run_selected_tasks.sh --benchmark ccb_fix
 
 # Single task
-harbor run --path benchmarks/ccb_build/servo-scrollend-event-feat-001 \
+harbor run --path benchmarks/ccb_feature/servo-scrollend-event-feat-001 \
   --agent-import-path agents.claude_baseline_agent:BaselineClaudeCodeAgent \
   --model anthropic/claude-haiku-4-5-20251001 \
   -n 1
diff --git a/configs/_common.sh b/configs/_common.sh
@@ -229,8 +229,8 @@ ensure_base_images() {
 
 # Pre-build all Docker images for a suite to warm the layer cache.
 # Call before run_paired_configs so Harbor's docker compose build is instant.
-# Args: $1 = suite name (e.g., ccb_build), remaining args passed through
-#   Example: prebuild_images "ccb_build" --tasks "task1,task2"
+# Args: $1 = suite name (e.g., ccb_feature), remaining args passed through
+#   Example: prebuild_images "ccb_feature" --tasks "task1,task2"
 prebuild_images() {
     local suite="${1:-}"
     shift || true
diff --git a/configs/codex_2config.sh b/configs/codex_2config.sh
@@ -15,7 +15,7 @@
 #   --agent-path PATH      Override Harbor agent import path
 #   --parallel N           Max parallel task subshells (default: 1)
 #   --category CATEGORY    Run category label for jobs dir (default: staging)
-#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_build, ccb_fix)
+#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_feature, ccb_fix)
 
 set -e
 
diff --git a/configs/copilot_2config.sh b/configs/copilot_2config.sh
@@ -15,7 +15,7 @@
 #   --agent-path PATH      Override Harbor agent import path
 #   --parallel N           Max parallel task subshells (default: 1)
 #   --category CATEGORY    Run category label for jobs dir (default: staging)
-#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_build, ccb_fix)
+#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_feature, ccb_fix)
 
 set -e
 
diff --git a/configs/cursor_2config.sh b/configs/cursor_2config.sh
@@ -15,7 +15,7 @@
 #   --agent-path PATH      Override Harbor agent import path
 #   --parallel N           Max parallel task subshells (default: 1)
 #   --category CATEGORY    Run category label for jobs dir (default: staging)
-#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_build, ccb_fix)
+#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_feature, ccb_fix)
 
 set -e
 
diff --git a/configs/gemini_2config.sh b/configs/gemini_2config.sh
@@ -15,7 +15,7 @@
 #   --agent-path PATH      Override Harbor agent import path
 #   --parallel N           Max parallel task subshells (default: 1)
 #   --category CATEGORY    Run category label for jobs dir (default: staging)
-#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_build, ccb_fix)
+#   --benchmark BENCH      Optional benchmark filter (e.g. ccb_feature, ccb_fix)
 
 set -e
 
diff --git a/configs/ground_truth_files.json b/configs/ground_truth_files.json
diff --git a/configs/openhands_2config.sh b/configs/openhands_2config.sh
diff --git a/configs/run_selected_tasks.sh b/configs/run_selected_tasks.sh