diff --git a/docs/ops/HANDOFF_baseline_39.md b/docs/ops/HANDOFF_baseline_39.md
diff --git a/docs/ops/HANDOFF_scrollend_monitor.md b/docs/ops/HANDOFF_scrollend_monitor.md
@@ -0,0 +1,131 @@
+# Handoff: Run 39 Missing MCP-Unique Baselines
+
+## Goal
+Run the `baseline-local-artifact` evaluation for the 39 MCP-unique tasks that currently have only MCP results. This raises total paired coverage from 212 to 251 tasks (100%).
+
+## Current Status
+- 81 MCP-unique tasks across 11 ccb_mcp_* suites
+- 42 already have paired results (baseline + MCP) in runs/official/
+- **39 need baseline runs** (listed below)
+- A sub-selection file is ready: `configs/mcp_baseline_rerun.json`
+
+## Task List (39 tasks across 11 suites)
+
+| Suite | Count | Task IDs |
+|-------|-------|----------|
+| ccb_mcp_compliance | 5 | CCX-compliance-052, 053, 057-ds, 115, 118 |
+| ccb_mcp_crossorg | 1 | CCX-crossorg-062 |
+| ccb_mcp_crossrepo | 1 | CCX-dep-trace-106 |
+| ccb_mcp_crossrepo_tracing | 4 | CCX-config-trace-003, CCX-dep-trace-002, 102, 116 |
+| ccb_mcp_domain | 6 | CCX-domain-071, 072, 073, 074, 101, 112 |
+| ccb_mcp_incident | 6 | CCX-incident-032, 033, 034, 037, 108, 110 |
+| ccb_mcp_migration | 5 | CCX-migration-022, 026, 107, 114, 117 |
+| ccb_mcp_onboarding | 3 | CCX-onboard-043, 103, 109 |
+| ccb_mcp_org | 3 | CCX-agentic-081, 082, 083 |
+| ccb_mcp_platform | 2 | CCX-platform-104, 119 |
+| ccb_mcp_security | 3 | CCX-vuln-remed-013, 105, 111 |
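As a quick sanity check on the table above, the per-suite ID lists can be tallied against the stated counts. This is only a sketch: `TABLE` is transcribed by hand from the table, bare numbers are assumed to inherit the prefix of the ID that precedes them in the same row, and the helper names `expand` and `check_table` are made up for illustration. It does not read `configs/mcp_baseline_rerun.json`, whose schema is not shown in this handoff.

```python
# Suite -> (stated count, [(ID prefix, [suffixes])]), transcribed from the table.
# Assumption: a bare number in the table reuses the prefix of the preceding ID.
TABLE = {
    "ccb_mcp_compliance": (5, [("CCX-compliance", ["052", "053", "057-ds", "115", "118"])]),
    "ccb_mcp_crossorg": (1, [("CCX-crossorg", ["062"])]),
    "ccb_mcp_crossrepo": (1, [("CCX-dep-trace", ["106"])]),
    "ccb_mcp_crossrepo_tracing": (4, [("CCX-config-trace", ["003"]),
                                      ("CCX-dep-trace", ["002", "102", "116"])]),
    "ccb_mcp_domain": (6, [("CCX-domain", ["071", "072", "073", "074", "101", "112"])]),
    "ccb_mcp_incident": (6, [("CCX-incident", ["032", "033", "034", "037", "108", "110"])]),
    "ccb_mcp_migration": (5, [("CCX-migration", ["022", "026", "107", "114", "117"])]),
    "ccb_mcp_onboarding": (3, [("CCX-onboard", ["043", "103", "109"])]),
    "ccb_mcp_org": (3, [("CCX-agentic", ["081", "082", "083"])]),
    "ccb_mcp_platform": (2, [("CCX-platform", ["104", "119"])]),
    "ccb_mcp_security": (3, [("CCX-vuln-remed", ["013", "105", "111"])]),
}


def expand(groups):
    """Turn (prefix, suffixes) groups into full task IDs."""
    return [f"{prefix}-{suffix}" for prefix, suffixes in groups for suffix in suffixes]


def check_table(table):
    """Verify per-suite counts and ID uniqueness; return (suites, tasks)."""
    all_ids = []
    for suite, (count, groups) in table.items():
        ids = expand(groups)
        if len(ids) != count:
            raise ValueError(f"{suite}: table says {count}, found {len(ids)} IDs")
        all_ids.extend(ids)
    if len(set(all_ids)) != len(all_ids):
        raise ValueError("duplicate task ID across suites")
    return len(table), len(all_ids)


print(check_table(TABLE))
```

If the transcription matches the table, the check reports 11 suites and 39 tasks, which is what the section heading claims.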
+
+## Config Details
+- **Config**: `baseline-local-artifact` (auto-detected by runner for MCP-unique artifact-only tasks)
+- **Model**: `anthropic/claude-haiku-4-5-20251001` (default)
+- **Dockerfile**: Each task has `environment/Dockerfile.artifact_only` (full local code, agent produces answer.json)
+- **Category**: staging (promotes to official after validation)
+- **Selection file**: `configs/mcp_baseline_rerun.json` (pre-built, 39 tasks)
+
+## Execution Steps
+
+### 1. Pre-flight checks
+```bash
+python3 scripts/check_infra.py
+python3 scripts/validate_tasks_preflight.py --selection-file configs/mcp_baseline_rerun.json
+```
+
+### 2. Dry run (verify task detection)
+```bash
+./configs/run_selected_tasks.sh \
+  --selection-file configs/mcp_baseline_rerun.json \
+  --baseline-only \
+  --parallel 12 \
+  --dry-run
+```
+
+You should see:
+- 39 tasks detected
+- Auto-detection message: "Auto-detected MCP-unique tasks (artifact-only) → using artifact configs (mcp-remote-artifact)"
+- Config: `baseline-local-artifact`
+- Only baseline runs scheduled (`--baseline-only` skips the MCP configs)
+
+### 3. Launch baselines
+```bash
+./configs/run_selected_tasks.sh \
+  --selection-file configs/mcp_baseline_rerun.json \
+  --baseline-only \
+  --parallel 12
+```
+
+Interactive confirmation will appear — approve it. Runs land in `runs/staging/`.
+
+Estimated time: ~1-2 hours at 12 parallel slots (3 accounts x 4 sessions, 39 tasks, ~5 min each with artifact mode).
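For context on the estimate above, the idealized floor can be worked out from the numbers the handoff gives (39 tasks, 3 accounts x 4 sessions, about 5 minutes per task). The sketch below assumes tasks run in full waves with no startup cost; the function name `estimate_wall_minutes` is hypothetical. The gap between this floor and the 1-2 hour figure presumably covers image builds, token refresh, and slow tasks, but that is an assumption, not something measured here.

```python
import math


def estimate_wall_minutes(n_tasks, parallel_slots, minutes_per_task):
    """Idealized lower bound: tasks run in waves of `parallel_slots`."""
    waves = math.ceil(n_tasks / parallel_slots)
    return waves, waves * minutes_per_task


slots = 3 * 4  # 3 accounts x 4 sessions, as stated in the handoff
waves, minutes = estimate_wall_minutes(n_tasks=39, parallel_slots=slots, minutes_per_task=5)
print(f"{waves} waves, at least {minutes} minutes of agent time")
```

Treat the result as a floor only; plan around the 1-2 hour window stated in the handoff.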
+
+### 4. Monitor progress
+```bash
+# Quick status: count finished trials (one result.json per trial)
+ls runs/staging/*/baseline-local-artifact/*/result.json 2>/dev/null | wc -l
+
+# Or use /watch-benchmarks skill for detailed status
+```
+
+### 5. Post-run: validate and promote
+```bash
+# Check results
+python3 -c "
+import json, glob
+results = glob.glob('runs/staging/*/baseline-local-artifact/*/result.json')
+print(f'Results: {len(results)}/39')
+rewards = []
+for r in results:
+    data = json.load(open(r))
+    reward = data.get('reward', data.get('score', -1))
+    rewards.append(reward)
+    if reward == 0:
+        task = r.split('/')[-2]
+        print(f'  ZERO: {task}')
+print(f'Mean reward: {sum(rewards)/max(len(rewards), 1):.3f}')
+"
+
+# Promote to official
+python3 scripts/promote_run.py --execute <staging_run_name>
+
+# Regenerate MANIFEST
+python3 scripts/generate_manifest.py
+```
+
+### 6. Verify full coverage
+```bash
+python3 -c "
+import json
+m = json.load(open('runs/official/MANIFEST.json'))
+mcp_suites = {k.split('/')[0] for k in m['runs'] if 'ccb_mcp_' in k}
+for suite in sorted(mcp_suites):
+    bl_key = f'{suite}/baseline-local-artifact'
+    mcp_key = f'{suite}/mcp-remote-artifact'
+    bl = set(m['runs'].get(bl_key, {}).get('tasks', {}))
+    mcp = set(m['runs'].get(mcp_key, {}).get('tasks', {}))
+    print(f'{suite}: baseline={len(bl)}, mcp={len(mcp)}, paired={len(bl & mcp)}')
+"
+```
+
+Target: all 81 MCP-unique tasks fully paired.
+
+## Open Risks / Unknowns
+- Some artifact-mode tasks produce a complex `answer.json` that `oracle_checks.py` evaluates, so review any 0-score tasks after the run.
+- 5 mirrors are still pending creation (jdk, chromium, aosp, libreoffice, arangodb), but these only affect MCP runs, not baselines.
+- If any task fails to build, check that its `environment/Dockerfile.artifact_only` exists and uses the correct base image.
+
+## Key Files
+- `configs/mcp_baseline_rerun.json` — sub-selection file (39 tasks)
+- `configs/run_selected_tasks.sh` — unified runner
+- `configs/_common.sh` — shared infra (token refresh, account rotation, `baseline_config_for()`)
+- `agents/claude_baseline_agent.py` — agent code (V5 preamble)
+- `scripts/promote_run.py` — staging → official promotion
+- `scripts/generate_manifest.py` — MANIFEST regeneration
@@ -0,0 +1,78 @@
+# Handoff: Monitor ccb_build rerun + scrollend=0 triage
+
+## Goal
+- Keep watching the active 5-task `ccb_build` baseline rerun.
+- Confirm final rewards/artifacts for all 5 tasks.
+- Triage why `servo-scrollend-event-feat-001` got verifier reward `0`.
+
+## Current Status (2026-02-27)
+- Active run root: `runs/staging/ccb_build_haiku_20260227_025524/baseline-local-direct`
+- 4 tasks done, 1 still running (`flink-pricing-window-feat-001`).
+- `servo-scrollend-event-feat-001` completed with verifier reward `0`.
+- `servo-scrollend` artifacts exist (`trajectory.json` and `claude-code.txt` are present).
+- Verifier output for `servo-scrollend` says:
+  - `Change detection: unstaged=0 staged=0 untracked=0 commits=0`
+  - `No code changes detected — agent did not execute successfully`
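The "no-op" reading of the verifier output above can be checked mechanically by parsing the `Change detection:` line into counters. This is a minimal sketch that assumes the line always has the `key=value` shape quoted in this handoff; `parse_change_detection` and `is_noop` are hypothetical helper names, not part of the verifier.

```python
import re


def parse_change_detection(line):
    """Extract integer counters from a 'Change detection: k=v ...' line."""
    if not line.strip().startswith("Change detection:"):
        raise ValueError("not a change-detection line")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def is_noop(counters):
    """True when at least one counter was parsed and every counter is zero."""
    return bool(counters) and all(value == 0 for value in counters.values())


line = "Change detection: unstaged=0 staged=0 untracked=0 commits=0"
counters = parse_change_detection(line)
print(counters, is_noop(counters))
```

A line with any non-zero counter would make `is_noop` return `False`, which separates "agent wrote nothing" from "agent wrote something the tests rejected".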
+
+## Files Changed
+- `scripts/handoff_monitor_scrollend.sh` (new monitor/investigation helper script)
+
+## Commands Run
+```bash
+chmod +x scripts/handoff_monitor_scrollend.sh
+scripts/handoff_monitor_scrollend.sh status
+scripts/handoff_monitor_scrollend.sh investigate
+```
+
+## Monitoring Workflow
+1. Live watch the active rerun:
+```bash
+scripts/handoff_monitor_scrollend.sh watch 30
+```
+
+2. One-shot status refresh:
+```bash
+scripts/handoff_monitor_scrollend.sh status
+```
+
+3. Check only running harbor processes:
+```bash
+ps -ef | rg "harbor run --path .*/benchmarks/ccb_build|run_selected_tasks.sh" | rg -v rg
+```
+
+## `scrollend` Zero-Score Investigation Workflow
+1. Run the built-in triage command:
+```bash
+scripts/handoff_monitor_scrollend.sh investigate
+```
+
+2. Re-check verifier rationale directly:
+```bash
+trial=$(find runs/staging/ccb_build_haiku_20260227_025524/baseline-local-direct -type d -name 'servo-scrollend-event-feat-001__*' | head -n1)
+tail -n 80 "$trial/verifier/test-stdout.txt"
+jq '{status,reward,exception_info,verifier_reward:(.verifier_result.rewards.reward // null)}' "$trial/result.json"
+```
+
+3. Confirm agent command completion signals:
+```bash
+find "$trial/agent" -maxdepth 2 -name return_code.txt -print -exec cat {} \;
+```
+
+4. If verifier still reports "no changes", inspect whether agent reasoning got stuck without writing files:
+```bash
+tail -n 400 "$trial/agent/claude-code.txt"
+```
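Where `jq` is not available, the filter from step 2 can be approximated in Python. The field names below are taken from that `jq` expression; the `sample` dict and its values are hypothetical stand-ins for a real `result.json`, and `summarize_result` is an illustrative name. One small difference: `jq`'s `//` also replaces `false` with `null`, while this sketch only maps missing or null values to `None`.

```python
import json


def summarize_result(result):
    """Pick the same fields as the jq filter, tolerating missing or null keys."""
    rewards = (result.get("verifier_result") or {}).get("rewards") or {}
    return {
        "status": result.get("status"),
        "reward": result.get("reward"),
        "exception_info": result.get("exception_info"),
        "verifier_reward": rewards.get("reward"),
    }


# Hypothetical payload shaped after the jq filter, not copied from a real run.
sample = {
    "status": "completed",
    "reward": 0,
    "exception_info": None,
    "verifier_result": {"rewards": {"reward": 0}},
}
print(json.dumps(summarize_result(sample), indent=2))
```

For a real trial, load the file first, for example `summarize_result(json.load(open(path)))` with `path` pointing at the trial's `result.json`.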
+
+## Findings / Decisions
+- This is not a missing-artifact failure mode.
+- It currently looks like a no-op run outcome (no repo diff produced), so verifier correctly scored `0`.
+- Keep this rerun as authoritative evidence for `servo-scrollend` unless a follow-up rerun is explicitly requested.
+
+## Open Risks / Unknowns
+- `flink-pricing-window-feat-001` is still in-flight and may still timeout or fail.
+- If `flink` fails, coverage for the 5-task recovery batch remains incomplete.
+
+## Next Best Command
+```bash
+scripts/handoff_monitor_scrollend.sh watch 30
+```