
Commit 57f397a

sjarmak and claude committed
feat: add DOE variance analysis, power curves, and rebalance handoff
Empirical variance decomposition from 175 paired pilot runs shows uniform n=20/suite is suboptimal. Neyman-optimal allocation at 150 tasks redistributes from low-variance suites (understand, secure, document) to high-variance ones (fix, test, feature) for equivalent or better statistical power with 17% fewer tasks. Includes power curves for 3-arm SCIP design and interaction effects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 35d5400 commit 57f397a

File tree

5 files changed: +1534 −1 lines changed


docs/ops/HANDOFF_DOE_REBALANCE.md

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
# Handoff: DOE-Driven SDLC Task Rebalance

## Goal

Rebalance the 9 SDLC benchmark suites from uniform 20 tasks/suite (180 total) to Neyman-optimal allocation at 150 tasks total, based on empirical variance decomposition from 175 paired pilot runs across 3-8 replicates per task. This maximizes statistical power for detecting the MCP treatment effect and its interaction with SDLC phase, codebase size, and task complexity — while reducing total task count by 17%.

## Context: Why This Matters

A Design of Experiments (DOE) variance decomposition showed that uniform n=20/suite is simultaneously over-sampling low-variance suites and under-sampling high-variance ones:

- **ccb_fix** (sigma2_task=0.1518, ICC=0.964): task heterogeneity dominates — the suite mixes trivially-solvable patches with multi-file fixes. Needs MORE tasks.
- **ccb_understand** (sigma2_task=0.0123, ICC=0.078): agent stochasticity dominates — the same task gives different results each run. Needs more REPS, not tasks.
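The decomposition splits each suite's reward variance into a between-task component (sigma2_task) and a replicate-noise component (sigma2_rep), with ICC = sigma2_task / (sigma2_task + sigma2_rep). A minimal method-of-moments sketch of that split — a hypothetical helper, not the actual `doe_variance_analysis.py` implementation, and it assumes an equal replicate count per task:

```python
import statistics

def variance_components(rewards):
    """One-way variance decomposition by method of moments.

    rewards: dict of task_id -> list of replicate rewards (equal reps assumed).
    Returns (sigma2_task, sigma2_rep, ICC).
    """
    k = len(next(iter(rewards.values())))  # replicates per task
    task_means = [statistics.mean(r) for r in rewards.values()]
    # within-task (replicate) variance, averaged across tasks
    sigma2_rep = statistics.mean(statistics.variance(r) for r in rewards.values())
    # variance of task means overstates sigma2_task by sigma2_rep/k of sampling noise
    sigma2_task = max(statistics.variance(task_means) - sigma2_rep / k, 0.0)
    total = sigma2_task + sigma2_rep
    icc = sigma2_task / total if total else 0.0
    return sigma2_task, sigma2_rep, icc
```

A suite like ccb_fix (every task passes or fails consistently, but tasks differ) yields ICC near 1; a suite like ccb_understand (same task flips between runs) yields ICC near 0.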
The Neyman-optimal allocation for a 150-task budget (proportional to within-suite SD) is:

| Suite       | Current n | Target n | Delta   | Action                 |
|-------------|-----------|----------|---------|------------------------|
| fix         | 20        | 25       | +5      | Promote 5 from backups |
| test        | 18        | 23       | +5      | Create 5 new tasks     |
| feature     | 20        | 22       | +2      | Create 2 new tasks     |
| debug       | 20        | 18       | -2      | Move 2 to backups      |
| refactor    | 20        | 15       | -5      | Move 5 to backups      |
| design      | 20        | 14       | -6      | Move 6 to backups      |
| document    | 20        | 12       | -8      | Move 8 to backups      |
| secure      | 20        | 11       | -9      | Move 9 to backups      |
| understand  | 20        | 10       | -10     | Move 10 to backups     |
| **TOTAL**   | **178**   | **150**  | **-28** |                        |
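The target column follows the Neyman rule n_i = N · SD_i / Σ_j SD_j, rounded so the totals hit the budget. A small sketch of that rule — the SD values passed in below are illustrative, not the exact empirical ones:

```python
def neyman_allocation(suite_sd, budget):
    """Allocate `budget` tasks across suites proportional to within-suite SD
    (Neyman allocation), repairing rounding drift via largest remainders."""
    total_sd = sum(suite_sd.values())
    raw = {s: budget * sd / total_sd for s, sd in suite_sd.items()}
    alloc = {s: int(v) for s, v in raw.items()}  # floor each share
    leftover = budget - sum(alloc.values())
    # hand the remaining tasks to the suites with the largest fractional parts
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:leftover]:
        alloc[s] += 1
    return alloc

# Illustrative call (SDs are rough square roots of the sigma2 values above):
alloc = neyman_allocation({"fix": 0.39, "test": 0.36, "understand": 0.16}, 60)
```

The minimax allocation mentioned for `doe_variance_analysis.py` would instead equalize per-suite power; the Neyman rule minimizes the variance of the overall estimate for a fixed budget.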
## Current Status

- DOE analysis scripts written and validated:
  - `scripts/doe_variance_analysis.py` — variance decomposition, per-suite power, Neyman/minimax allocation
  - `scripts/doe_power_curves.py` — power curves for main effect, SDLC interaction, continuous moderators, 3-arm SCIP design
- Both scripts read from `runs/official/MANIFEST.json` (run_history section)
- Analysis outputs verified against 175 paired tasks, 3-8 reps each
- No task directories or selection files modified yet
## Files Changed (this session)

- `scripts/doe_variance_analysis.py` — NEW (variance decomposition + allocation)
- `scripts/doe_power_curves.py` — NEW (power curves for interaction effects)
## Key Findings / Decisions

1. **150 tasks is the practical sweet spot** — gives >87% power for the SDLC×Config interaction at d=0.15, >95% for the main effect at d=0.10, and >83% for the complexity interaction
2. **342 tasks would be needed for a d=0.10 SDLC interaction** (38/suite) — not worth the cost unless that granularity is required
3. **The three-arm SCIP design is cheap to add** — the SCIP vs fuzzy contrast needs only 16-64 tasks because both arms use the same MCP tools (lower delta variance)
4. **The observed overall MCP delta is near zero (+0.001)** — the interesting signal is in the interaction (fix/understand benefit, debug/refactor hurt)
5. **High-variance suites to GROW**: fix (sigma2=0.154), test (0.127), feature (0.116)
6. **Low-variance suites to SHRINK**: understand (0.027), secure (0.031), document (0.038)
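The power figures in finding 1 come from the power-curve script; their general shape can be reproduced with a normal approximation for the mean paired delta. A sketch — the `sd_delta` of 0.35 used in the usage line is an assumed stand-in, not a measured value:

```python
from statistics import NormalDist

def paired_power(n_tasks, effect, sd_delta, alpha=0.05):
    """Approximate power of a two-sided z-test on the mean paired delta:
    power ~= Phi(effect * sqrt(n) / sd_delta - z_{1 - alpha/2})."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = effect * n_tasks ** 0.5 / sd_delta
    return NormalDist().cdf(z_effect - z_crit)

# e.g. paired_power(150, 0.10, 0.35) sits above 0.9, consistent with the
# ">95% for main effect at d=0.10" shape (exact numbers depend on the real SD)
```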
## Task Inventory for Rebalance

### Suites that GROW (need new/promoted tasks)

**ccb_fix (+5, target 25):**
- 5 backup tasks available in `benchmarks/backups/ccb_fix_extra/`
- Check quality before promoting — they were removed for an "over-represented repo" reason
- If repo diversity is a concern, create new tasks from under-represented repos instead

**ccb_test (+5, target 23):**
- 2 backup tasks in `benchmarks/backups/ccb_test_tac/` — but these need an external RocketChat server (incompatible)
- Must scaffold 5 new tasks using the `/scaffold-task` skill
- Prioritize high-variance task types (unit test generation, integration testing, code review)

**ccb_feature (+2, target 22):**
- No backup tasks available
- Scaffold 2 new tasks — prioritize languages/repos under-represented in the current 20
### Suites that SHRINK (move to backups)

Selection criteria for which tasks to move OUT:

1. **Keep high-variance tasks** (sigma2_rep > suite median) — they contribute the most information
2. **Keep tasks with extreme deltas** (large |MCP - baseline|) — most informative for interaction estimation
3. **Remove low-information tasks** (consistent pass or consistent fail across both configs) — they add no signal
4. **Maintain language/repo diversity** in the remaining set

**ccb_debug (-2, target 18):** Move the 2 lowest-information tasks
**ccb_refactor (-5, target 15):** Move the 5 lowest-information tasks
**ccb_design (-6, target 14):** Move the 6 lowest-information tasks
**ccb_document (-8, target 12):** Move the 8 lowest-information tasks
**ccb_secure (-9, target 11):** Move the 9 lowest-information tasks
**ccb_understand (-10, target 10):** Move the 10 lowest-information tasks
## Implementation Plan

### Phase 1: Identify tasks to move (analysis only)

```bash
# Run the variance analysis to get per-task stats
python3 scripts/doe_variance_analysis.py --json > /tmp/doe_analysis.json

# The JSON output includes task_means and task_stds per suite per config
# Use these to rank tasks by information value
```
Write a selection script (e.g. `doe_select_tasks.py`) that:

1. Reads `MANIFEST.json` run_history for per-task reward vectors
2. For each suite, ranks tasks by "information value":
   - High value = large |delta| OR high replicate variance (the task discriminates between configs or shows agent sensitivity)
   - Low value = delta ≈ 0 AND low replicate variance (both configs solve it the same way every time)
3. Selects the top-N tasks to KEEP per suite (N = Neyman target)
4. Outputs the keep/move lists
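The ranking core of that script might look like the following sketch. The additive score is a placeholder weighting of the two criteria, not a settled design choice, and `doe_select_tasks.py` itself does not exist yet:

```python
def rank_by_information(task_stats):
    """task_stats: dict of task_id -> (mean_delta, replicate_variance).

    High |delta| or high replicate variance means the task discriminates
    between configs or exposes agent sensitivity; delta near 0 with near-0
    variance means consistent pass/fail in both configs, i.e. no signal."""
    def score(t):
        mean_delta, rep_var = task_stats[t]
        return abs(mean_delta) + rep_var  # placeholder combination
    return sorted(task_stats, key=score, reverse=True)

def keep_move_lists(task_stats, neyman_target):
    """Split a suite's tasks into (keep, move) at the Neyman target size."""
    ranked = rank_by_information(task_stats)
    return ranked[:neyman_target], ranked[neyman_target:]
```

A real version would also enforce criterion 4 (language/repo diversity), which a pure score sort ignores.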
### Phase 2: Move task directories

```bash
# For each suite that shrinks, move excess tasks to backups
# Example for ccb_understand (20 → 10):
mkdir -p benchmarks/backups/ccb_understand_doe_trim/
mv benchmarks/ccb_understand/<task_to_remove>/ benchmarks/backups/ccb_understand_doe_trim/

# For ccb_fix (20 → 25), promote from backups:
mv benchmarks/backups/ccb_fix_extra/<task>/ benchmarks/ccb_fix/
```
### Phase 3: Scaffold new tasks (for suites that grow beyond backup supply)

```bash
# ccb_test needs 5 new tasks, ccb_feature needs 2 new tasks
# Use the /scaffold-task skill for each
```
### Phase 4: Regenerate selection file and manifest

```bash
python3 scripts/select_benchmark_tasks.py   # regenerate selection JSON
python3 scripts/generate_manifest.py        # update MANIFEST
python3 scripts/repo_health.py              # health gate
```
### Phase 5: Validate and run pilot

```bash
# Validate all tasks still pass preflight
python3 scripts/validate_tasks_preflight.py

# Run one pass of all 150 tasks to confirm nothing broke
# (use existing variance run configs as template)
```
## Open Risks / Unknowns

1. **Backup task quality**: The 5 ccb_fix_extra tasks were removed for repo over-representation — need to check whether promoting them creates unacceptable repo bias
2. **New task scaffolding**: 7 new tasks needed (5 ccb_test, 2 ccb_feature) — each requires instruction, verifier, Dockerfile, and oracle curation. Budget ~2-3 hours per task.
3. **Historical comparability**: Changing suite sizes means old runs (n=20) aren't directly comparable to new runs (variable n). Document this in the white paper methods section.
4. **MCP-unique suites unaddressed**: This rebalance covers only the SDLC suites. The 11 MCP-unique suites (220 tasks) don't yet have enough variance-run data for the same analysis. Run `doe_variance_analysis.py` with `--include-mcp-unique` after collecting MCP-unique variance data.
5. **ccb_test currently has 18 tasks** (not 20) — growing to 23 means adding 5, not 3
## Next Best Command

```bash
# Start by writing the information-value ranking script
# This is the prerequisite for deciding WHICH tasks to keep/move
python3 scripts/doe_variance_analysis.py --json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
print(json.dumps(data, indent=2)[:2000])
"
```
## Reference: Key DOE Parameters

- **delta=0.15**: minimum detectable effect size (15 percentage points) for the SDLC interaction
- **reps=3**: planned replicates per task per arm (minimum; 5 for understand/secure)
- **arms=3**: baseline + MCP/fuzzy + MCP/SCIP (three-arm design)
- **alpha=0.05**: significance level (two-sided)
- **Power target: 0.80** (80% probability of detecting a true effect)
- **Neyman allocation**: tasks ∝ within-suite SD (minimizes overall variance for a fixed budget)
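Under these parameters, a task budget for a given contrast inverts the usual normal-approximation power formula: n = ((z_{1-alpha/2} + z_power) · sigma / delta)². A quick sketch — `sd_delta` is an assumed SD of the per-task paired delta, not a measured quantity:

```python
import math
from statistics import NormalDist

def required_tasks(effect, sd_delta, alpha=0.05, power=0.80):
    """Smallest n reaching the target power for a two-sided paired z-test
    (normal approximation): n = ((z_alpha + z_power) * sd / effect)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_delta / effect) ** 2)
```

Halving the detectable effect roughly quadruples the required n, which is the shape behind the 150-vs-342 trade-off in the findings.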

docs/ops/SCRIPT_INDEX.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -172,6 +172,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/docgen_quality_sweep.py` - Utility script for docgen quality sweep.
+- `scripts/doe_power_curves.py` - Utility script for doe power curves.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/find_mcp_distracted.py` - Utility script for find mcp distracted.
@@ -203,6 +204,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
 - `scripts/scan_swebench_errors.py` - Utility script for scan swebench errors.
+- `scripts/sdlc_anomaly_scan.py` - Utility script for sdlc anomaly scan.
 - `scripts/smoke_artifact_verifier.py` - Utility script for smoke artifact verifier.
 - `scripts/verify_retrieval_eval_smoke.py` - Utility script for verify retrieval eval smoke.
```