Commit 3fbd443

sjarmak and claude committed
feat: add variance run data, utility scripts, oracle tests, and handoff docs
Promote 21 Daytona variance run batches (8 SDLC suites, passes 2-3) to runs/official/. These were executed successfully but not committed in prior sessions.

Redact OAuth tokens from llamacpp test run agent logs.

Add new scripts:
- migrate_dockerfiles_clone_as_claude.py (84-Dockerfile migration to avoid chown layer doubling)
- plan_variance_runs.py (variance run planning)
- handoff_monitor_scrollend.sh (session handoff helper)

Add oracle evaluator tests (test_oracle_checks_tiered.py, test_oracle_curation.py) and a reference doc (RESULT_DIRECTORY_SPEC.md).

Update check_infra.py, curate_oracle.py, daytona_poc_runner.py, and daytona_runner.py with fixes from prior sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 39d37c3 commit 3fbd443

File tree

14,428 files changed

+3967648
-3
lines changed

docs/ops/HANDOFF_baseline_39.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Handoff: Run 39 Missing MCP-Unique Baselines

## Goal
Run baseline-local-artifact evaluation for 39 MCP-unique tasks that currently only have MCP results. This brings total paired coverage from 212 to 251 (100%).

## Current Status
- 81 MCP-unique tasks across 11 ccb_mcp_* suites
- 42 already have paired results (baseline + MCP) in runs/official/
- **39 need baseline runs** (listed below)
- A sub-selection file is ready: `configs/mcp_baseline_rerun.json`

## Task List (39 tasks across 11 suites)

| Suite | Count | Task IDs |
|-------|-------|----------|
| ccb_mcp_compliance | 5 | CCX-compliance-052, 053, 057-ds, 115, 118 |
| ccb_mcp_crossorg | 1 | CCX-crossorg-062 |
| ccb_mcp_crossrepo | 1 | CCX-dep-trace-106 |
| ccb_mcp_crossrepo_tracing | 4 | CCX-config-trace-003, CCX-dep-trace-002, 102, 116 |
| ccb_mcp_domain | 6 | CCX-domain-071, 072, 073, 074, 101, 112 |
| ccb_mcp_incident | 6 | CCX-incident-032, 033, 034, 037, 108, 110 |
| ccb_mcp_migration | 5 | CCX-migration-022, 026, 107, 114, 117 |
| ccb_mcp_onboarding | 3 | CCX-onboard-043, 103, 109 |
| ccb_mcp_org | 3 | CCX-agentic-081, 082, 083 |
| ccb_mcp_platform | 2 | CCX-platform-104, 119 |
| ccb_mcp_security | 3 | CCX-vuln-remed-013, 105, 111 |

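As a quick consistency check, the per-suite counts can be summed to confirm they match the 39 missing baselines and the 81-task total. The dictionary below is transcribed by hand from the table; it is an illustration only and is not read from `configs/mcp_baseline_rerun.json`.

```python
# Per-suite counts transcribed by hand from the task table (illustrative only).
SUITE_COUNTS = {
    "ccb_mcp_compliance": 5,
    "ccb_mcp_crossorg": 1,
    "ccb_mcp_crossrepo": 1,
    "ccb_mcp_crossrepo_tracing": 4,
    "ccb_mcp_domain": 6,
    "ccb_mcp_incident": 6,
    "ccb_mcp_migration": 5,
    "ccb_mcp_onboarding": 3,
    "ccb_mcp_org": 3,
    "ccb_mcp_platform": 2,
    "ccb_mcp_security": 3,
}

ALREADY_PAIRED = 42  # from the "Current Status" section


def total_missing_baselines(counts: dict[str, int]) -> int:
    """Sum the per-suite counts of tasks that still need a baseline run."""
    return sum(counts.values())


missing = total_missing_baselines(SUITE_COUNTS)
print(f"suites={len(SUITE_COUNTS)} missing={missing} total={ALREADY_PAIRED + missing}")
# prints: suites=11 missing=39 total=81
```
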
## Config Details
- **Config**: `baseline-local-artifact` (auto-detected by runner for MCP-unique artifact-only tasks)
- **Model**: `anthropic/claude-haiku-4-5-20251001` (default)
- **Dockerfile**: Each task has `environment/Dockerfile.artifact_only` (full local code, agent produces answer.json)
- **Category**: staging (promotes to official after validation)
- **Selection file**: `configs/mcp_baseline_rerun.json` (pre-built, 39 tasks)

## Execution Steps

### 1. Pre-flight checks
```bash
python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --selection-file configs/mcp_baseline_rerun.json
```

### 2. Dry run (verify task detection)
```bash
./configs/run_selected_tasks.sh \
  --selection-file configs/mcp_baseline_rerun.json \
  --baseline-only \
  --parallel 12 \
  --dry-run
```

You should see:
- 39 tasks detected
- Auto-detection message: "Auto-detected MCP-unique tasks (artifact-only) → using artifact configs (mcp-remote-artifact)"
- Config: `baseline-local-artifact`
- `--baseline-only` means only baselines run (no MCP)

### 3. Launch baselines
```bash
./configs/run_selected_tasks.sh \
  --selection-file configs/mcp_baseline_rerun.json \
  --baseline-only \
  --parallel 12
```

An interactive confirmation prompt will appear; approve it. Runs land in `runs/staging/`.

Estimated time: ~1-2 hours at 12 parallel slots (3 accounts x 4 sessions, 39 tasks, ~5 min each with artifact mode).

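The estimate can be sanity-checked with simple wave arithmetic. This sketch assumes tasks run in fixed waves of `parallel` slots and that every task takes the quoted ~5 minutes; it deliberately ignores image builds, token refresh, and retries, so it gives only a lower bound, and the handoff does not break down the rest of the 1-2 hour budget.

```python
import math


def waves_needed(tasks: int, parallel: int) -> int:
    """Number of sequential waves when at most `parallel` tasks run at once."""
    return math.ceil(tasks / parallel)


def lower_bound_minutes(tasks: int, parallel: int, minutes_per_task: float) -> float:
    """Wall-clock lower bound, assuming every task takes the same time."""
    return waves_needed(tasks, parallel) * minutes_per_task


slots = 3 * 4  # 3 accounts x 4 sessions, as stated in the handoff
print(waves_needed(39, slots), lower_bound_minutes(39, slots, 5.0))
# prints: 4 20.0
```
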
### 4. Monitor progress
```bash
# Quick status
ls runs/staging/*/baseline-local-artifact/ | wc -l

# Or use /watch-benchmarks skill for detailed status
```

### 5. Post-run: validate and promote
```bash
# Check results
python3 -c "
import json, glob
results = glob.glob('runs/staging/*/baseline-local-artifact/*/result.json')
print(f'Results: {len(results)}/39')
rewards = []
for r in results:
    with open(r) as fh:
        data = json.load(fh)
    # Fall back to 'score', then -1, when 'reward' is missing
    reward = data.get('reward', data.get('score', -1))
    rewards.append(reward)
    if reward == 0:
        task = r.split('/')[-2]
        print(f'  ZERO: {task}')
# Guard against division by zero when no results exist yet
print(f'Mean reward: {sum(rewards)/len(rewards):.3f}' if rewards else 'No results found')
"

# Promote to official
python3 scripts/promote_run.py --execute <staging_run_name>

# Regenerate MANIFEST
python3 scripts/generate_manifest.py
```

### 6. Verify full coverage
```bash
python3 -c "
import json
with open('runs/official/MANIFEST.json') as fh:
    m = json.load(fh)
# Manifest keys look like '<suite>/<config>'; collect the MCP suites
mcp_suites = {k.split('/')[0] for k in m['runs'] if 'ccb_mcp_' in k}
for suite in sorted(mcp_suites):
    bl_key = f'{suite}/baseline-local-artifact'
    mcp_key = f'{suite}/mcp-remote-artifact'
    bl_n = len(m['runs'].get(bl_key, {}).get('tasks', {}))
    mcp_n = len(m['runs'].get(mcp_key, {}).get('tasks', {}))
    print(f'{suite}: baseline={bl_n}, mcp={mcp_n}, paired={min(bl_n, mcp_n)}')
"
```

Target: all 81 MCP-unique tasks fully paired.

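The coverage script approximates pairing with `min(bl_n, mcp_n)`, which overcounts if the two configs cover different task IDs. A stricter check intersects the task IDs themselves. The sketch below assumes each `tasks` entry is a mapping keyed by task ID (consistent with the `.get('tasks', {})` default in the script, but not confirmed by this handoff); the sample manifest is invented for illustration.

```python
def paired_task_ids(manifest: dict, suite: str) -> set[str]:
    """Task IDs present under both the baseline and MCP configs of a suite."""
    runs = manifest.get("runs", {})
    baseline = runs.get(f"{suite}/baseline-local-artifact", {}).get("tasks", {})
    mcp = runs.get(f"{suite}/mcp-remote-artifact", {}).get("tasks", {})
    return set(baseline) & set(mcp)


# Invented sample: one task is paired, one still lacks a baseline.
sample_manifest = {
    "runs": {
        "ccb_mcp_platform/baseline-local-artifact": {
            "tasks": {"CCX-platform-104": {}},
        },
        "ccb_mcp_platform/mcp-remote-artifact": {
            "tasks": {"CCX-platform-104": {}, "CCX-platform-119": {}},
        },
    }
}

print(sorted(paired_task_ids(sample_manifest, "ccb_mcp_platform")))
# prints: ['CCX-platform-104']
```

Summing `len(paired_task_ids(...))` over all suites would give the paired total to compare against the 81-task target.
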
## Open Risks / Unknowns
- Some artifact-mode tasks produce complex answer.json that oracle_checks.py evaluates; check for 0-score tasks
- 5 mirrors are still pending creation (jdk, chromium, aosp, libreoffice, arangodb), but these only affect MCP runs, not baselines
- If any tasks fail to build, check that Dockerfile.artifact_only exists and has the correct base image

## Key Files
- `configs/mcp_baseline_rerun.json`: sub-selection file (39 tasks)
- `configs/run_selected_tasks.sh`: unified runner
- `configs/_common.sh`: shared infra (token refresh, account rotation, `baseline_config_for()`)
- `agents/claude_baseline_agent.py`: agent code (V5 preamble)
- `scripts/promote_run.py`: staging → official promotion
- `scripts/generate_manifest.py`: MANIFEST regeneration
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Handoff: Monitor ccb_build rerun + scrollend=0 triage

## Goal
- Keep watching the active 5-task `ccb_build` baseline rerun.
- Confirm final rewards/artifacts for all 5 tasks.
- Triage why `servo-scrollend-event-feat-001` got verifier reward `0`.

## Current Status (2026-02-27)
- Active run root: `runs/staging/ccb_build_haiku_20260227_025524/baseline-local-direct`
- 4 tasks done, 1 still running (`flink-pricing-window-feat-001`).
- `servo-scrollend-event-feat-001` completed with verifier reward `0`.
- `servo-scrollend` artifacts exist (`trajectory.json` and `claude-code.txt` are present).
- Verifier output for `servo-scrollend` says:
  - `Change detection: unstaged=0 staged=0 untracked=0 commits=0`
  - `No code changes detected — agent did not execute successfully`

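The verifier's change-detection line can be checked mechanically. The sketch below parses a line in the format quoted above and flags a no-op run when all four counters are zero; the function names and the regex are illustrative and are not taken from the verifier's own code.

```python
import re

# Matches the four counters in the quoted "Change detection:" line.
COUNTER_RE = re.compile(r"\b(unstaged|staged|untracked|commits)=(\d+)")


def parse_change_detection(line: str) -> dict[str, int]:
    """Extract the change counters from a verifier change-detection line."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def is_noop_run(line: str) -> bool:
    """True when all four counters are present and every one is zero."""
    counters = parse_change_detection(line)
    return len(counters) == 4 and all(v == 0 for v in counters.values())


line = "Change detection: unstaged=0 staged=0 untracked=0 commits=0"
print(parse_change_detection(line))
print(is_noop_run(line))
```

A line with any non-zero counter, or one missing a counter, is not treated as a no-op, which keeps the check conservative.
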
## Files Changed
- `scripts/handoff_monitor_scrollend.sh` (new monitor/investigation helper script)

## Commands Run
```bash
chmod +x scripts/handoff_monitor_scrollend.sh
scripts/handoff_monitor_scrollend.sh status
scripts/handoff_monitor_scrollend.sh investigate
```

## Monitoring Workflow
1. Live watch the active rerun:
```bash
scripts/handoff_monitor_scrollend.sh watch 30
```

2. One-shot status refresh:
```bash
scripts/handoff_monitor_scrollend.sh status
```

3. Check only running harbor processes:
```bash
ps -ef | rg "harbor run --path .*/benchmarks/ccb_build|run_selected_tasks.sh" | rg -v rg
```

## `scrollend` Zero-Score Investigation Workflow
1. Run the built-in triage command:
```bash
scripts/handoff_monitor_scrollend.sh investigate
```

2. Re-check verifier rationale directly:
```bash
trial=$(find runs/staging/ccb_build_haiku_20260227_025524/baseline-local-direct -type d -name 'servo-scrollend-event-feat-001__*' | head -n1)
tail -n 80 "$trial/verifier/test-stdout.txt"
jq '{status,reward,exception_info,verifier_reward:(.verifier_result.rewards.reward // null)}' "$trial/result.json"
```

3. Confirm agent command completion signals:
```bash
find "$trial/agent" -maxdepth 2 -name return_code.txt -print -exec cat {} \;
```

4. If verifier still reports "no changes", inspect whether agent reasoning got stuck without writing files:
```bash
tail -n 400 "$trial/agent/claude-code.txt"
```

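For scripted triage, the `jq` filter from step 2 can be mirrored in Python. The field names come from that filter; the sample payload is invented for illustration, and the real `result.json` may carry more fields.

```python
import json


def summarize_result(result: dict) -> dict:
    """Pick the fields the jq filter selects, using None where one is absent."""
    verifier = result.get("verifier_result") or {}
    rewards = verifier.get("rewards") or {}
    return {
        "status": result.get("status"),
        "reward": result.get("reward"),
        "exception_info": result.get("exception_info"),
        "verifier_reward": rewards.get("reward"),
    }


# Invented sample payload; a real run would load "$trial/result.json" instead.
sample = json.loads(
    '{"status": "completed", "reward": 0, '
    '"verifier_result": {"rewards": {"reward": 0}}}'
)
print(json.dumps(summarize_result(sample), indent=2))
```

As with the `jq` version, a verifier reward of `0` is reported as `0` rather than collapsed into a missing value, which matters for distinguishing a scored zero from an absent verifier result.
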
## Findings / Decisions
- This is not a missing-artifact failure mode.
- It currently looks like a no-op run outcome (no repo diff produced), so verifier correctly scored `0`.
- Keep this rerun as authoritative evidence for `servo-scrollend` unless a follow-up rerun is explicitly requested.

## Open Risks / Unknowns
- `flink-pricing-window-feat-001` is still in-flight and may still timeout or fail.
- If `flink` fails, coverage for the 5-task recovery batch remains incomplete.

## Next Best Command
```bash
scripts/handoff_monitor_scrollend.sh watch 30
```
