|
1 | | -# CodeContextBench Agent Execution Guide |
2 | | - |
3 | | -## Overview |
4 | | - |
5 | | -Benchmark tasks are executed via **Harbor** (Docker container-based runner) with **Claude Code** as the agent. Each task runs in an isolated container. Three configurations are tested per benchmark: |
6 | | - |
7 | | -| Config | Description | |
8 | | -|--------|-------------| |
9 | | -| `baseline` | Claude Code with no MCP tools | |
10 | | -| `sourcegraph_base` | Claude Code + Sourcegraph MCP (basic search) | |
11 | | -| `sourcegraph_full` | Claude Code + Sourcegraph MCP (deep search + batch changes) | |
12 | | - |
13 | | -## Running Benchmarks |
14 | | - |
15 | | -### Single Benchmark |
16 | | - |
17 | | -```bash |
18 | | -# Sequential (default) |
19 | | -./configs/swebenchpro_2config.sh |
20 | | - |
21 | | -# Parallel with auto-detected concurrency |
22 | | -./configs/swebenchpro_2config.sh --parallel |
23 | | - |
24 | | -# Parallel with explicit job count |
25 | | -./configs/swebenchpro_2config.sh --parallel 4 |
26 | | -``` |
27 | | - |
28 | | -All 11 benchmark config scripts accept the `--parallel` flag: |
29 | | -- `swebenchpro_2config.sh` — SWE-bench Pro (36 tasks) |
30 | | -- `pytorch_2config.sh` — PyTorch (12 tasks) |
31 | | -- `locobench_2config.sh` — LoCoBench (25 tasks) |
32 | | -- `repoqa_2config.sh` — RepoQA (10 tasks) |
33 | | -- `k8s_docs_2config.sh` — Kubernetes Docs (5 tasks) |
34 | | -- `crossrepo_2config.sh` — Cross-Repo (4-5 tasks) |
35 | | -- `largerepo_2config.sh` — Large Repo (4 tasks) |
36 | | -- `tac_2config.sh` — TAC (8 tasks) |
37 | | -- `dibench_2config.sh` — DIBench (8 tasks) |
38 | | -- `sweperf_2config.sh` — SWE-Perf (3 tasks) |
39 | | -- `linuxflbench_2config.sh` — LinuxFLBench (5 tasks) |
40 | | - |
41 | | -### Config Scripts Structure |
42 | | - |
43 | | -Each config script: |
44 | | -1. Sources `configs/_common.sh` for shared infrastructure |
45 | | -2. Defines task IDs and agent configurations |
46 | | -3. Runs all three configs (baseline, sourcegraph_base, sourcegraph_full) |
47 | | -4. Validates results and produces `flagged_tasks.json` |
48 | | - |
49 | | -## Parallel Execution |
50 | | - |
51 | | -### Architecture |
52 | | - |
53 | | -Parallel execution uses background subshells with semaphore-style job limiting: |
54 | | - |
55 | | -``` |
56 | | -Main shell |
57 | | - ├── Subshell 1 (HOME=account1) → harbor run task_A |
58 | | - ├── Subshell 2 (HOME=account3) → harbor run task_B |
59 | | - ├── Subshell 3 (HOME=account1) → harbor run task_C |
60 | | - └── ... up to PARALLEL_JOBS concurrent |
61 | | -``` |
62 | | - |
63 | | -Each subshell gets `HOME` overridden to a specific account directory. This causes `harbor run` (and the Claude Code agent inside Docker) to read credentials from that account's `~/.claude/.credentials.json`. |
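The `HOME` override amounts to launching each task's process with a modified environment. A minimal Python sketch of the idea (illustrative only; the real logic lives in `configs/_common.sh`, and `build_task_env` is a hypothetical helper):

```python
import os

def build_task_env(account_home):
    """Copy the parent environment, overriding HOME so that anything
    resolving ~/.claude/.credentials.json reads the given account dir."""
    env = dict(os.environ)
    env["HOME"] = account_home
    return env

# Hypothetical launch of one task under account1's credentials:
#   subprocess.Popen(["harbor", "run", "task_A"],
#                    env=build_task_env("/home/user/.claude-homes/account1"))
```

Because `HOME` is changed only in the child's environment, the main shell keeps its own credentials untouched.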
64 | | - |
65 | | -### Round-Robin Account Distribution |
66 | | - |
67 | | -Tasks are assigned to accounts in round-robin order: |
68 | | - |
69 | | -``` |
70 | | -Task 1 → account1 |
71 | | -Task 2 → account3 |
72 | | -Task 3 → account1 |
73 | | -Task 4 → account3 |
74 | | -... |
75 | | -``` |
76 | | - |
77 | | -This spreads the API rate-limit burden across accounts. Only Max-plan accounts are used; regular accounts are too rate-limited for parallel runs.
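Round-robin assignment is just cycling through the active account list. A sketch (hypothetical helper, not the shell implementation):

```python
from itertools import cycle

def assign_round_robin(tasks, accounts):
    """Pair each task with the next active account, wrapping around."""
    rotation = cycle(accounts)
    return [(task, next(rotation)) for task in tasks]

pairs = assign_round_robin(["task1", "task2", "task3", "task4"],
                           ["account1", "account3"])
# [('task1', 'account1'), ('task2', 'account3'),
#  ('task3', 'account1'), ('task4', 'account3')]
```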
78 | | - |
79 | | -### Configuration Variables |
80 | | - |
81 | | -Set in `configs/_common.sh` or via environment: |
82 | | - |
83 | | -| Variable | Default | Description | |
84 | | -|----------|---------|-------------| |
85 | | -| `PARALLEL_JOBS` | auto | Max concurrent tasks. Auto = `SESSIONS_PER_ACCOUNT * num_accounts` | |
86 | | -| `SESSIONS_PER_ACCOUNT` | `4` | Empirical max concurrent sessions per Max-plan account | |
87 | | -| `SKIP_ACCOUNTS` | `account2` | Space-separated account names to exclude | |
88 | | - |
89 | | -### How It Works |
90 | | - |
91 | | -1. **`setup_multi_accounts()`** scans `~/.claude-homes/account1/`, `account2/`, etc. |
92 | | -2. Accounts in `SKIP_ACCOUNTS` are excluded (e.g., non-Max accounts) |
93 | | -3. `PARALLEL_JOBS` is auto-set to `SESSIONS_PER_ACCOUNT * active_accounts` |
94 | | -4. **`run_tasks_parallel()`** launches background subshells up to `PARALLEL_JOBS` |
95 | | -5. Each subshell has `HOME` set to the next account in round-robin order |
96 | | -6. PID tracking + `kill -0` polling enforces the concurrency limit |
97 | | -7. After all tasks finish, `HOME` is restored to `REAL_HOME` |
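The PID-tracking loop in steps 4-6 behaves roughly like the following Python sketch, using `Popen.poll()` where the shell uses `kill -0` (the function name and polling interval are assumptions):

```python
import subprocess
import time

def run_with_limit(commands, max_jobs):
    """Launch each command in the background, keeping at most
    max_jobs processes alive at once; return all exit codes."""
    procs, running = [], []
    for cmd in commands:
        while len(running) >= max_jobs:
            # Reap finished jobs (the shell probes each PID with `kill -0`).
            running = [p for p in running if p.poll() is None]
            time.sleep(0.05)
        p = subprocess.Popen(cmd)
        procs.append(p)
        running.append(p)
    return [p.wait() for p in procs]
```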
98 | | - |
99 | | -## Multi-Account Setup |
100 | | - |
101 | | -### Prerequisites |
102 | | - |
103 | | -- 2+ Claude accounts with Max plan subscriptions |
104 | | -- Each account logged in and credential files placed correctly |
105 | | - |
106 | | -### Directory Structure |
107 | | - |
108 | | -``` |
109 | | -~/.claude-homes/ |
110 | | - account1/ |
111 | | - .claude/ |
112 | | - .credentials.json # Max plan account |
113 | | - account2/ |
114 | | - .claude/ |
115 | | - .credentials.json # Regular plan (skipped by default) |
116 | | - account3/ |
117 | | - .claude/ |
118 | | - .credentials.json # Max plan account |
119 | | -``` |
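`setup_multi_accounts()` essentially scans this tree for usable credential files. A Python equivalent of the scan (illustrative sketch; the real logic is in `configs/_common.sh`):

```python
from pathlib import Path

def discover_accounts(base, skip=("account2",)):
    """Return account dirs under `base` holding a credentials file,
    excluding names listed in SKIP_ACCOUNTS."""
    found = []
    for d in sorted(Path(base).expanduser().glob("account*")):
        if d.name not in skip and (d / ".claude" / ".credentials.json").is_file():
            found.append(d)
    return found
```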
120 | | - |
121 | | -### Setting Up Accounts |
122 | | - |
123 | | -For each account: |
124 | | - |
125 | | -```bash |
126 | | -# 1. Create the directory |
127 | | -mkdir -p ~/.claude-homes/accountN/.claude |
128 | | - |
129 | | -# 2. Log out of any current session |
130 | | -claude logout |
131 | | - |
132 | | -# 3. Log in with the target account |
133 | | -claude login |
134 | | - |
135 | | -# 4. Copy credentials to the account directory |
136 | | -cp ~/.claude/.credentials.json ~/.claude-homes/accountN/.claude/.credentials.json |
137 | | -``` |
138 | | - |
139 | | -You must `claude logout` before `claude login` with a different account — otherwise the existing valid token is reused. |
140 | | - |
141 | | -### Verifying Accounts |
142 | | - |
143 | | -```bash |
144 | | -# Check all accounts are detected |
145 | | -SKIP_ACCOUNTS="" bash -c 'source configs/_common.sh; setup_multi_accounts' |
146 | | -``` |
147 | | - |
148 | | -Expected output: |
149 | | -``` |
150 | | -Multi-account mode: 3 accounts active |
151 | | - slot 1: /home/user/.claude-homes/account1 |
152 | | - slot 2: /home/user/.claude-homes/account2 |
153 | | - slot 3: /home/user/.claude-homes/account3 |
154 | | -``` |
155 | | - |
156 | | -### Rate Limits |
157 | | - |
158 | | -- **Max plan**: ~4 concurrent Claude Code sessions before throttling |
159 | | -- **Regular plan**: Significantly lower limits, not suitable for parallel runs |
160 | | -- With 2 Max accounts: up to 8 concurrent tasks |
161 | | -- Default `SKIP_ACCOUNTS=account2` excludes the regular-plan account |
162 | | - |
163 | | -## Token Refresh |
164 | | - |
165 | | -OAuth tokens expire after ~8 hours. The `ensure_fresh_token_all()` function refreshes all account tokens before each batch: |
166 | | - |
167 | | -- Checks `expiresAt` timestamp in each credential file |
168 | | -- Refreshes if less than 30 minutes remaining (`REFRESH_MARGIN=1800`) |
169 | | -- Uses the Claude OAuth endpoint with the `refreshToken` |
170 | | -- Writes updated tokens back to the credential file |
171 | | - |
172 | | -Token refresh runs automatically at the start of each config batch (baseline, sourcegraph_base, sourcegraph_full). |
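The expiry check reduces to comparing `expiresAt` against the clock plus the refresh margin. A sketch, assuming `expiresAt` is a millisecond epoch value in a flat JSON object (the real credential layout may nest it differently):

```python
import json
import time

REFRESH_MARGIN = 1800  # seconds, mirroring the value above

def needs_refresh(cred_path, now=None):
    """True if the token expires within REFRESH_MARGIN seconds.
    Assumes {"expiresAt": <epoch ms>}; adjust to the real schema."""
    now = time.time() if now is None else now
    with open(cred_path) as f:
        expires_at_ms = json.load(f)["expiresAt"]
    return expires_at_ms / 1000 - now < REFRESH_MARGIN
```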
173 | | - |
174 | | -## Output Structure |
175 | | - |
176 | | -``` |
177 | | -runs/official/{benchmark}_{model}_{timestamp}/ |
178 | | - baseline/ |
179 | | - {task_id}__{hash}/ |
180 | | - result.json # Pass/fail, reward score |
181 | | - trajectory.jsonl # Full agent interaction log |
182 | | - cost.json # Token usage and cost |
183 | | - sourcegraph_base/ |
184 | | - ... |
185 | | - sourcegraph_full/ |
186 | | - ... |
187 | | - flagged_tasks.json # Validation warnings |
188 | | -``` |
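Given this layout, a per-batch pass rate can be computed by walking the task directories of one config. A sketch, assuming `result.json` exposes a numeric `reward` field (the exact key name is an assumption; check the real schema in a run's output):

```python
import json
from pathlib import Path

def batch_pass_rate(config_dir):
    """Fraction of tasks in one config batch with a positive reward."""
    rewards = [json.loads(p.read_text()).get("reward", 0.0)
               for p in Path(config_dir).glob("*/result.json")]
    return sum(r > 0 for r in rewards) / len(rewards) if rewards else 0.0
```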
189 | | - |
190 | | -## Trajectory Generation |
191 | | - |
192 | | -### What Produces `trajectory.json`
193 | | - |
194 | | -`trajectory.json` is generated by Harbor's `ClaudeCode._convert_events_to_trajectory()` method. It records per-step timestamps, tool calls, and token metrics in ATIF-v1.2 schema format. It is written to `agent/trajectory.json` in each task's output directory. |
195 | | - |
196 | | -### Why It Might Be Missing |
197 | | - |
198 | | -The **H3 bug** causes `trajectory.json` generation to fail when Claude Code spawns subagents via the `Task` tool. Harbor's `_get_session_dir()` cannot disambiguate the multiple session directories, so the trajectory conversion silently fails. The bug was fixed in `claude_baseline_agent.py`, but older runs (~15%) are affected.
199 | | - |
200 | | -When `trajectory.json` is missing, `claude-code.txt` (the JSONL transcript) is still present and contains the same tool-call data, though without per-step timestamps.
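When only the JSONL transcript is available, timestamps can be approximated at a fixed per-step rate, which is roughly what the fallback described below under Troubleshooting does. An illustrative sketch (the `est_timestamp` field name and the 2.0 s/step default are assumptions, not the calibrated rate):

```python
import json

def synthesize_timestamps(jsonl_path, seconds_per_step=2.0, start=0.0):
    """Attach an estimated timestamp to each transcript event."""
    events = []
    with open(jsonl_path) as f:
        for i, line in enumerate(l for l in f if l.strip()):
            event = json.loads(line)
            event["est_timestamp"] = start + i * seconds_per_step
            events.append(event)
    return events
```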
201 | | - |
202 | | -### Runtime Detection |
203 | | - |
204 | | -Two levels of trajectory.json monitoring are built into the run pipeline: |
205 | | - |
206 | | -1. **Per-task**: `_check_task_trajectory()` runs automatically after each task completes in `run_tasks_parallel` (via the `_reap_one` PID reaper). Logs a WARNING immediately if trajectory.json is missing for that task. |
207 | | -2. **Per-batch**: `check_trajectory_coverage()` runs after each batch in `validate_and_report()`. Summarizes missing/total counts across all tasks in the batch. |
208 | | - |
209 | | -Both checks are non-blocking — warnings are logged but the pipeline continues. |
210 | | - |
211 | | -### Troubleshooting |
212 | | - |
213 | | -1. **Check coverage**: Review per-task WARNING lines in run output, or the batch-level TRAJECTORY CHECK summary |
214 | | -2. **Fallback**: `extract_time_to_context()` in `ir_metrics.py` automatically falls back to `synthesize_trajectory()`, which estimates timestamps from the JSONL transcript using a calibrated seconds-per-step rate
215 | | -3. **Root cause**: If new runs show missing trajectories, check for heavy `Task` tool usage (subagent spawning). Ensure `claude_baseline_agent.py` has the H3 fix applied |
216 | | - |
217 | | -## Post-Run Validation |
218 | | - |
219 | | -Each config script calls `validate_and_report()` after completing a batch. This runs `scripts/validate_task_run.py` to check for: |
220 | | - |
221 | | -- Missing result files |
222 | | -- Unexpected error states |
223 | | -- Zero-reward tasks (potential infrastructure issues vs genuine failures) |
224 | | - |
225 | | -Results are aggregated into `flagged_tasks.json` at the run level. |
226 | | - |
227 | | -## Generating the Evaluation Report |
228 | | - |
229 | | -After all runs complete: |
230 | | - |
| 1 | +# CodeContextBench Operations Guide |
| 2 | + |
| 3 | +This file is the operational quick-reference for benchmark maintenance. |
| 4 | +`CLAUDE.md` intentionally mirrors this file. |
| 5 | + |
| 6 | +## Canonical References |
| 7 | +- `README.md` - repo overview and quick start |
| 8 | +- `docs/CONFIGS.md` - config matrix and MCP behavior |
| 9 | +- `docs/QA_PROCESS.md` - pre-run, run-time, post-run validation |
| 10 | +- `docs/ERROR_CATALOG.md` - known failures and remediation |
| 11 | +- `docs/TASK_SELECTION.md` - curation/difficulty policy |
| 12 | +- `docs/TASK_CATALOG.md` - current task inventory |
| 13 | +- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation |
| 14 | +- `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions |
| 15 | +- `docs/AGENT_INTERFACE.md` - runtime I/O contract |
| 16 | +- `docs/EXTENSIBILITY.md` - safe suite/task/config extension |
| 17 | +- `docs/LEADERBOARD.md` - ranking policy |
| 18 | +- `docs/SUBMISSION.md` - submission format |
| 19 | + |
| 20 | +## Typical Skill Routing |
| 21 | +Use these defaults unless there is a task-specific reason not to. |
| 22 | + |
| 23 | +- Pre-run readiness: `check-infra`, `validate-tasks` |
| 24 | +- Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks` |
| 25 | +- Failure investigation: `triage-failure`, `quick-rerun` |
| 26 | +- Cross-config analysis: `compare-configs`, `mcp-audit`, `ir-analysis` |
| 27 | +- Cost/reporting: `cost-report`, `generate-report` |
| 28 | +- Data hygiene: `sync-metadata`, `reextract-metrics`, `archive-run` |
| 29 | +- Planning/prioritization: `whats-next` |
| 30 | + |
| 31 | +## Standard Workflow |
| 32 | +1. Run infrastructure checks before any batch. |
| 33 | +2. Validate task integrity before launch (include runtime smoke for new/changed tasks). |
| 34 | +3. Run the benchmark config (`configs/*_2config.sh` or equivalent). |
| 35 | +4. Monitor progress and classify errors while tasks are running. |
| 36 | +5. Validate outputs after each batch (`result.json`, `flagged_tasks.json`, trajectory coverage). |
| 37 | +6. Triage failures before rerunning; avoid blind reruns. |
| 38 | +7. Regenerate `MANIFEST.json` and evaluation report after run completion. |
| 39 | +8. Sync metadata if task definitions changed. |
| 40 | + |
| 41 | +## Quality Gates |
| 42 | +A run is considered healthy only if all are true: |
| 43 | + |
| 44 | +- No infra blockers (tokens, Docker, disk, credentials) |
| 45 | +- No unexpected missing `result.json` |
| 46 | +- Errored tasks are classified and actionable |
| 47 | +- Zero-reward clusters are explained (task difficulty vs infra/tooling) |
| 48 | +- Trajectory gaps are accounted for (or JSONL fallback noted) |
| 49 | +- Config comparisons are based on matched task sets |
| 50 | + |
| 51 | +## Run Hygiene |
| 52 | +- Prefer isolated, well-scoped reruns (don’t mix unrelated fixes in one batch). |
| 53 | +- Use parallel mode only when multi-account token state is confirmed fresh. |
| 54 | +- Keep run naming and suite/config metadata consistent. |
| 55 | +- Do not treat archived or draft analyses as canonical docs. |
| 56 | +- Keep `docs/` focused on maintained operational guidance. |
| 57 | + |
| 58 | +## Escalation Rules |
| 59 | +- Repeated infra failures: stop batch reruns and fix root cause first. |
| 60 | +- Suspected verifier bug: quarantine task, document evidence, and open follow-up. |
| 61 | +- Missing trajectories: use transcript fallback and record the limitation. |
| 62 | +- Widespread MCP regressions: run MCP usage audit before changing prompts/configs. |
| 63 | + |
| 64 | +## High-Use Commands |
231 | 65 | ```bash |
232 | | -python3 scripts/generate_manifest.py # Regenerate MANIFEST.json |
233 | | -python3 scripts/generate_eval_report.py # Aggregate results into report |
| 66 | +python3 scripts/check_infra.py |
| 67 | +python3 scripts/validate_tasks_preflight.py --all |
| 68 | +python3 scripts/validate_tasks_preflight.py --task <task_dir> --smoke-runtime |
| 69 | +python3 scripts/validate_task_run.py --run <run_dir> |
| 70 | +python3 scripts/aggregate_status.py --staging |
| 71 | +python3 scripts/compare_configs.py --run <run_dir> |
| 72 | +python3 scripts/mcp_audit.py --run <run_dir> |
| 73 | +python3 scripts/cost_report.py --run <run_dir> |
| 74 | +python3 scripts/generate_manifest.py |
| 75 | +python3 scripts/generate_eval_report.py |
234 | 76 | ``` |
235 | 77 |
236 | | -The MANIFEST tracks all runs, task counts, pass/fail rates, and mean rewards. |
237 | | - |
238 | | -## Landing the Plane (Session Completion) |
239 | | - |
240 | | -**When ending a work session**, you MUST complete ALL steps below. Work is NOT complete until `git push` succeeds. |
241 | | - |
242 | | -**MANDATORY WORKFLOW:** |
243 | | - |
244 | | -1. **File issues for remaining work** - Create issues for anything that needs follow-up |
245 | | -2. **Run quality gates** (if code changed) - Tests, linters, builds |
246 | | -3. **Update issue status** - Close finished work, update in-progress items |
247 | | -4. **PUSH TO REMOTE** - This is MANDATORY: |
248 | | - ```bash |
249 | | - git pull --rebase |
250 | | - bd sync |
251 | | - git push |
252 | | - git status # MUST show "up to date with origin" |
253 | | - ``` |
254 | | -5. **Clean up** - Clear stashes, prune remote branches |
255 | | -6. **Verify** - All changes committed AND pushed |
256 | | -7. **Hand off** - Provide context for next session |
257 | | - |
258 | | -**CRITICAL RULES:** |
259 | | -- Work is NOT complete until `git push` succeeds |
260 | | -- NEVER stop before pushing - that leaves work stranded locally |
261 | | -- NEVER say "ready to push when you are" - YOU must push |
262 | | -- If push fails, resolve and retry until it succeeds |
| 78 | +## Script Entrypoints |
| 79 | +- `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks) |
| 80 | +- `configs/*_2config.sh` - per-suite run launchers |
| 81 | +- `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per benchmark) |
| 82 | +- `scripts/promote_run.py` - staging to official promotion flow |