
Commit 35dd557

chore: tighten preflight validation and operator guidance
1 parent 0fdab9d commit 35dd557

File tree

4 files changed, +403 −608 lines


AGENTS.md

Lines changed: 79 additions & 259 deletions
@@ -1,262 +1,82 @@
# CodeContextBench Agent Execution Guide

## Overview

Benchmark tasks are executed via **Harbor** (Docker container-based runner) with **Claude Code** as the agent. Each task runs in an isolated container. Three configurations are tested per benchmark:

| Config | Description |
|--------|-------------|
| `baseline` | Claude Code with no MCP tools |
| `sourcegraph_base` | Claude Code + Sourcegraph MCP (basic search) |
| `sourcegraph_full` | Claude Code + Sourcegraph MCP (deep search + batch changes) |

## Running Benchmarks

### Single Benchmark

```bash
# Sequential (default)
./configs/swebenchpro_2config.sh

# Parallel with auto-detected concurrency
./configs/swebenchpro_2config.sh --parallel

# Parallel with explicit job count
./configs/swebenchpro_2config.sh --parallel 4
```

All 11 benchmark config scripts accept the `--parallel` flag:

- `swebenchpro_2config.sh` — SWE-bench Pro (36 tasks)
- `pytorch_2config.sh` — PyTorch (12 tasks)
- `locobench_2config.sh` — LoCoBench (25 tasks)
- `repoqa_2config.sh` — RepoQA (10 tasks)
- `k8s_docs_2config.sh` — Kubernetes Docs (5 tasks)
- `crossrepo_2config.sh` — Cross-Repo (4-5 tasks)
- `largerepo_2config.sh` — Large Repo (4 tasks)
- `tac_2config.sh` — TAC (8 tasks)
- `dibench_2config.sh` — DIBench (8 tasks)
- `sweperf_2config.sh` — SWE-Perf (3 tasks)
- `linuxflbench_2config.sh` — LinuxFLBench (5 tasks)

### Config Scripts Structure

Each config script:

1. Sources `configs/_common.sh` for shared infrastructure
2. Defines task IDs and agent configurations
3. Runs all three configs (baseline, sourcegraph_base, sourcegraph_full)
4. Validates results and produces `flagged_tasks.json`

## Parallel Execution

### Architecture

Parallel execution uses background subshells with semaphore-style job limiting:

```
Main shell
├── Subshell 1 (HOME=account1) → harbor run task_A
├── Subshell 2 (HOME=account3) → harbor run task_B
├── Subshell 3 (HOME=account1) → harbor run task_C
└── ... up to PARALLEL_JOBS concurrent
```

Each subshell gets `HOME` overridden to a specific account directory. This causes `harbor run` (and the Claude Code agent inside Docker) to read credentials from that account's `~/.claude/.credentials.json`.
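The pattern can be sketched as follows. This is an illustrative stand-in, not the actual `configs/_common.sh` code: the variable names and log file are invented, and a logging `echo` replaces the real `harbor run` call.

```shell
# Sketch: semaphore-style job limiting with per-subshell HOME override.
# All names here are illustrative, not the real _common.sh identifiers.
PARALLEL_JOBS=2
ACCOUNT_HOMES=(/tmp/acct1 /tmp/acct2)
LOG=/tmp/sem_demo.log
rm -f "$LOG"; pids=(); i=0

for task in task_A task_B task_C task_D; do
  # Block while the "semaphore" is full: keep only still-live PIDs.
  while :; do
    live=()
    for p in "${pids[@]}"; do kill -0 "$p" 2>/dev/null && live+=("$p"); done
    pids=("${live[@]}")
    [ "${#pids[@]}" -lt "$PARALLEL_JOBS" ] && break
    sleep 0.1
  done
  home="${ACCOUNT_HOMES[$(( i % ${#ACCOUNT_HOMES[@]} ))]}"
  # Background subshell with HOME overridden; the echo stands in
  # for the real `harbor run "$task"` invocation.
  ( HOME="$home"; echo "$task $HOME" >> "$LOG" ) &
  pids+=("$!")
  i=$(( i + 1 ))
done
wait
```

The `kill -0` probe sends no signal; it only tests whether the PID is still alive, which is what makes the polling loop a cheap concurrency gate.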
### Round-Robin Account Distribution

Tasks are assigned to accounts in round-robin order:

```
Task 1 → account1
Task 2 → account3
Task 3 → account1
Task 4 → account3
...
```

This spreads the API rate-limit burden across accounts. Only Max-plan accounts are used (regular accounts are too rate-limited).
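The assignment is plain modulo arithmetic. A minimal sketch, using the account names from the example above:

```shell
# Round-robin index math (illustrative; the real assignment lives in
# the run launcher, this just demonstrates the modulo mapping).
accounts=(account1 account3)
mapping=""
for n in 1 2 3 4; do
  slot=$(( (n - 1) % ${#accounts[@]} ))
  mapping+="Task $n -> ${accounts[$slot]}"$'\n'
done
printf '%s' "$mapping"
```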
78-
79-
### Configuration Variables
80-
81-
Set in `configs/_common.sh` or via environment:
82-
83-
| Variable | Default | Description |
84-
|----------|---------|-------------|
85-
| `PARALLEL_JOBS` | auto | Max concurrent tasks. Auto = `SESSIONS_PER_ACCOUNT * num_accounts` |
86-
| `SESSIONS_PER_ACCOUNT` | `4` | Empirical max concurrent sessions per Max-plan account |
87-
| `SKIP_ACCOUNTS` | `account2` | Space-separated account names to exclude |
88-
89-
### How It Works
90-
91-
1. **`setup_multi_accounts()`** scans `~/.claude-homes/account1/`, `account2/`, etc.
92-
2. Accounts in `SKIP_ACCOUNTS` are excluded (e.g., non-Max accounts)
93-
3. `PARALLEL_JOBS` is auto-set to `SESSIONS_PER_ACCOUNT * active_accounts`
94-
4. **`run_tasks_parallel()`** launches background subshells up to `PARALLEL_JOBS`
95-
5. Each subshell has `HOME` set to the next account in round-robin order
96-
6. PID tracking + `kill -0` polling enforces the concurrency limit
97-
7. After all tasks finish, `HOME` is restored to `REAL_HOME`
## Multi-Account Setup

### Prerequisites

- 2+ Claude accounts with Max plan subscriptions
- Each account logged in and credential files placed correctly

### Directory Structure

```
~/.claude-homes/
  account1/
    .claude/
      .credentials.json   # Max plan account
  account2/
    .claude/
      .credentials.json   # Regular plan (skipped by default)
  account3/
    .claude/
      .credentials.json   # Max plan account
```

### Setting Up Accounts

For each account:

```bash
# 1. Create the directory
mkdir -p ~/.claude-homes/accountN/.claude

# 2. Log out of any current session
claude logout

# 3. Log in with the target account
claude login

# 4. Copy credentials to the account directory
cp ~/.claude/.credentials.json ~/.claude-homes/accountN/.claude/.credentials.json
```

You must `claude logout` before `claude login` with a different account — otherwise the existing valid token is reused.

### Verifying Accounts

```bash
# Check that all accounts are detected
SKIP_ACCOUNTS="" bash -c 'source configs/_common.sh; setup_multi_accounts'
```

Expected output:

```
Multi-account mode: 3 accounts active
  slot 1: /home/user/.claude-homes/account1
  slot 2: /home/user/.claude-homes/account2
  slot 3: /home/user/.claude-homes/account3
```

### Rate Limits

- **Max plan**: ~4 concurrent Claude Code sessions before throttling
- **Regular plan**: significantly lower limits; not suitable for parallel runs
- With 2 Max accounts: up to 8 concurrent tasks
- Default `SKIP_ACCOUNTS=account2` excludes the regular-plan account

## Token Refresh

OAuth tokens expire after ~8 hours. The `ensure_fresh_token_all()` function refreshes all account tokens before each batch:

- Checks the `expiresAt` timestamp in each credential file
- Refreshes if less than 30 minutes remaining (`REFRESH_MARGIN=1800`)
- Uses the Claude OAuth endpoint with the `refreshToken`
- Writes updated tokens back to the credential file

Token refresh runs automatically at the start of each config batch (baseline, sourcegraph_base, sourcegraph_full).
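The expiry test at the heart of this flow can be sketched as below. This is an assumption-laden stand-in: the `expiresAt` field name comes from the description above, but its units (epoch seconds here; the real file may use milliseconds) and the exact file shape are assumed, and the actual OAuth refresh call is omitted.

```shell
# Sketch: decide whether a credential file needs a refresh.
# Demo file and field units are assumptions, not the real schema.
REFRESH_MARGIN=1800
cred=/tmp/demo-credentials.json
now=$(date +%s)
printf '{ "expiresAt": %s, "refreshToken": "redacted" }\n' "$((now + 600))" > "$cred"

# Extract the expiry (a real tool would use a JSON parser, not sed).
expires=$(sed -n 's/.*"expiresAt": *\([0-9]*\).*/\1/p' "$cred")
remaining=$(( expires - now ))
if [ "$remaining" -lt "$REFRESH_MARGIN" ]; then
  needs_refresh=1
  echo "token expires in ${remaining}s -> refresh"
else
  needs_refresh=0
fi
```

With 600 s remaining against an 1800 s margin, the sketch flags the token for refresh.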
## Output Structure

```
runs/official/{benchmark}_{model}_{timestamp}/
  baseline/
    {task_id}__{hash}/
      result.json        # Pass/fail, reward score
      trajectory.jsonl   # Full agent interaction log
      cost.json          # Token usage and cost
  sourcegraph_base/
    ...
  sourcegraph_full/
    ...
  flagged_tasks.json     # Validation warnings
```
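A minimal sketch of walking this layout to tally pass/fail from `result.json` files (the run directory is fabricated for the demo, the real files contain more fields, and a proper tool would use a JSON parser rather than `sed`):

```shell
# Sketch: per-config pass tally over the runs/ layout (demo paths).
run=/tmp/demo-run
rm -rf "$run"
mkdir -p "$run"/baseline/task1__aaa "$run"/baseline/task2__bbb
echo '{"reward": 1.0}' > "$run"/baseline/task1__aaa/result.json
echo '{"reward": 0.0}' > "$run"/baseline/task2__bbb/result.json

passed=0; total=0
for f in "$run"/baseline/*/result.json; do
  total=$((total + 1))
  # Crude reward extraction; assumes a flat {"reward": N} shape.
  reward=$(sed -n 's/.*"reward": *\([0-9.]*\).*/\1/p' "$f")
  [ "$reward" = "1.0" ] && passed=$((passed + 1))
done
echo "baseline: $passed/$total passed"
```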
## Trajectory Generation

### What Produces trajectory.json

`trajectory.json` is generated by Harbor's `ClaudeCode._convert_events_to_trajectory()` method. It records per-step timestamps, tool calls, and token metrics in the ATIF-v1.2 schema format, and is written to `agent/trajectory.json` in each task's output directory.

### Why It Might Be Missing

The **H3 bug** causes `trajectory.json` generation to fail when Claude Code spawns subagents via the `Task` tool: Harbor's `_get_session_dir()` gets confused by the multiple session directories, and the trajectory conversion fails silently. The bug was fixed in `claude_baseline_agent.py`, but older runs (~15%) are affected.

When `trajectory.json` is missing, `claude-code.txt` (a JSONL transcript) is always present and contains the same tool-call data, just without per-step timestamps.

### Runtime Detection

Two levels of trajectory.json monitoring are built into the run pipeline:

1. **Per-task**: `_check_task_trajectory()` runs automatically after each task completes in `run_tasks_parallel` (via the `_reap_one` PID reaper) and logs a WARNING immediately if trajectory.json is missing for that task.
2. **Per-batch**: `check_trajectory_coverage()` runs after each batch in `validate_and_report()` and summarizes missing/total counts across all tasks in the batch.

Both checks are non-blocking — warnings are logged but the pipeline continues.
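A simplified stand-in for the batch-level check (the real `check_trajectory_coverage()` may differ, and the sketch uses a flat layout, whereas the real file lives under each task's `agent/` directory):

```shell
# Sketch: count tasks missing trajectory.json in a batch (demo paths;
# not the real check_trajectory_coverage implementation).
run=/tmp/demo-traj-run
rm -rf "$run"
mkdir -p "$run"/t1 "$run"/t2 "$run"/t3
echo '[]' > "$run"/t1/trajectory.json
echo '[]' > "$run"/t3/trajectory.json   # t2 deliberately has none

missing=()
for d in "$run"/*/; do
  [ -f "$d/trajectory.json" ] || missing+=("$(basename "$d")")
done
# Non-blocking: report and continue, matching the pipeline's behavior.
echo "TRAJECTORY CHECK: ${#missing[@]} missing (${missing[*]})"
```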
### Troubleshooting

1. **Check coverage**: review per-task WARNING lines in the run output, or the batch-level TRAJECTORY CHECK summary
2. **Fallback**: `extract_time_to_context()` in `ir_metrics.py` automatically falls back to `synthesize_trajectory()`, which estimates timestamps from the JSONL transcript using a calibrated seconds-per-step rate
3. **Root cause**: if new runs show missing trajectories, check for heavy `Task` tool usage (subagent spawning) and ensure `claude_baseline_agent.py` has the H3 fix applied

## Post-Run Validation

Each config script calls `validate_and_report()` after completing a batch. This runs `scripts/validate_task_run.py` to check for:

- Missing result files
- Unexpected error states
- Zero-reward tasks (potential infrastructure issues vs genuine failures)

Results are aggregated into `flagged_tasks.json` at the run level.
## Generating the Evaluation Report

After all runs complete:
# CodeContextBench Operations Guide

This file is the operational quick-reference for benchmark maintenance.
`CLAUDE.md` intentionally mirrors this file.

## Canonical References

- `README.md` - repo overview and quick start
- `docs/CONFIGS.md` - config matrix and MCP behavior
- `docs/QA_PROCESS.md` - pre-run, run-time, and post-run validation
- `docs/ERROR_CATALOG.md` - known failures and remediation
- `docs/TASK_SELECTION.md` - curation/difficulty policy
- `docs/TASK_CATALOG.md` - current task inventory
- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation
- `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
- `docs/AGENT_INTERFACE.md` - runtime I/O contract
- `docs/EXTENSIBILITY.md` - safe suite/task/config extension
- `docs/LEADERBOARD.md` - ranking policy
- `docs/SUBMISSION.md` - submission format

## Typical Skill Routing

Use these defaults unless there is a task-specific reason not to.

- Pre-run readiness: `check-infra`, `validate-tasks`
- Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks`
- Failure investigation: `triage-failure`, `quick-rerun`
- Cross-config analysis: `compare-configs`, `mcp-audit`, `ir-analysis`
- Cost/reporting: `cost-report`, `generate-report`
- Data hygiene: `sync-metadata`, `reextract-metrics`, `archive-run`
- Planning/prioritization: `whats-next`

## Standard Workflow

1. Run infrastructure checks before any batch.
2. Validate task integrity before launch (include a runtime smoke test for new/changed tasks).
3. Run the benchmark config (`configs/*_2config.sh` or equivalent).
4. Monitor progress and classify errors while tasks are running.
5. Validate outputs after each batch (`result.json`, `flagged_tasks.json`, trajectory coverage).
6. Triage failures before rerunning; avoid blind reruns.
7. Regenerate `MANIFEST.json` and the evaluation report after run completion.
8. Sync metadata if task definitions changed.
## Quality Gates

A run is considered healthy only if all of the following are true:

- No infra blockers (tokens, Docker, disk, credentials)
- No unexpectedly missing `result.json`
- Errored tasks are classified and actionable
- Zero-reward clusters are explained (task difficulty vs infra/tooling)
- Trajectory gaps are accounted for (or the JSONL fallback is noted)
- Config comparisons are based on matched task sets

## Run Hygiene

- Prefer isolated, well-scoped reruns (don't mix unrelated fixes in one batch).
- Use parallel mode only when multi-account token state is confirmed fresh.
- Keep run naming and suite/config metadata consistent.
- Do not treat archived or draft analyses as canonical docs.
- Keep `docs/` focused on maintained operational guidance.

## Escalation Rules

- Repeated infra failures: stop batch reruns and fix the root cause first.
- Suspected verifier bug: quarantine the task, document evidence, and open a follow-up.
- Missing trajectories: use the transcript fallback and record the limitation.
- Widespread MCP regressions: run an MCP usage audit before changing prompts/configs.

## High-Use Commands
```diff
-python3 scripts/generate_manifest.py     # Regenerate MANIFEST.json
-python3 scripts/generate_eval_report.py  # Aggregate results into report
+python3 scripts/check_infra.py
+python3 scripts/validate_tasks_preflight.py --all
+python3 scripts/validate_tasks_preflight.py --task <task_dir> --smoke-runtime
+python3 scripts/validate_task_run.py --run <run_dir>
+python3 scripts/aggregate_status.py --staging
+python3 scripts/compare_configs.py --run <run_dir>
+python3 scripts/mcp_audit.py --run <run_dir>
+python3 scripts/cost_report.py --run <run_dir>
+python3 scripts/generate_manifest.py
+python3 scripts/generate_eval_report.py
```

The MANIFEST tracks all runs, task counts, pass/fail rates, and mean rewards.

## Landing the Plane (Session Completion)

**When ending a work session**, you MUST complete ALL of the steps below. Work is NOT complete until `git push` succeeds.

**MANDATORY WORKFLOW:**

1. **File issues for remaining work** - create issues for anything that needs follow-up
2. **Run quality gates** (if code changed) - tests, linters, builds
3. **Update issue status** - close finished work, update in-progress items
4. **PUSH TO REMOTE** - this is MANDATORY:
   ```bash
   git pull --rebase
   bd sync
   git push
   git status   # MUST show "up to date with origin"
   ```
5. **Clean up** - clear stashes, prune remote branches
6. **Verify** - all changes committed AND pushed
7. **Hand off** - provide context for the next session

**CRITICAL RULES:**

- Work is NOT complete until `git push` succeeds
- NEVER stop before pushing - that leaves work stranded locally
- NEVER say "ready to push when you are" - YOU must push
- If push fails, resolve and retry until it succeeds
## Script Entrypoints

- `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks)
- `configs/*_2config.sh` - per-suite run launchers
- `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke test (1 task per benchmark)
- `scripts/promote_run.py` - staging-to-official promotion flow
