Commit 32833bb

docs: simplify README wording and prune rerun config leftovers
1 parent 8875459 commit 32833bb

9 files changed: 12 additions, 1143 deletions

README.md

Lines changed: 12 additions & 13 deletions
@@ -115,21 +115,19 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste

 ## 2-Config Evaluation Matrix

-All benchmarks are evaluated across two primary configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
+All benchmarks are evaluated across two primary configurations (Baseline vs MCP). The concrete run config names differ by task type:

 - **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
-- **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct` (some legacy runs used `baseline-local-artifact` + `mcp-remote-artifact`)
-
-Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
+- **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct`

 At a high level, the distinction is:

 | Config Name | Internal MCP mode | MCP Tools Available |
 |-------------------|---------------------|---------------------|
 | Baseline | `none` | None (agent uses only built-in tools) |
-| MCP-Full | `sourcegraph_full` / `artifact_full` (task-dependent) | All 13 Sourcegraph MCP tools including `sg_deepsearch`, `sg_deepsearch_read` |
+| MCP | `sourcegraph` / `artifact` (task-dependent) | All 13 Sourcegraph MCP tools including `sg_deepsearch`, `sg_deepsearch_read` |

-See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical configuration matrix and tool-by-tool breakdown. (`docs/CONFIGS.md` is a compatibility stub.)
+See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical configuration matrix and tool-by-tool breakdown.

 ---

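Review note: this hunk renames the internal MCP modes in the table (`sourcegraph_full` to `sourcegraph`, `artifact_full` to `artifact`) and drops the sentence about legacy names. A minimal sketch of how a reader's own tooling might normalize the old names and look up the run config pair is shown below. The helper names `normalize_mcp_mode` and `run_configs_for_suite` are hypothetical and are not taken from the repository. Treating the table change as a one-to-one rename is also an assumption.

```python
# Hypothetical helpers: only the string values come from the README diff.
# Assumption: the old "*_full" mode names map one-to-one onto the new names.
LEGACY_MODE_RENAMES = {
    "sourcegraph_full": "sourcegraph",
    "artifact_full": "artifact",
}

# Both SDLC and Org suites use the same pair after this commit.
RUN_CONFIGS = ("baseline-local-direct", "mcp-remote-direct")


def normalize_mcp_mode(mode: str) -> str:
    """Return the current name for a pre-rename MCP mode; pass other values through."""
    return LEGACY_MODE_RENAMES.get(mode, mode)


def run_configs_for_suite(suite: str) -> tuple[str, str]:
    """Return the (baseline, MCP) run config names for an SDLC or Org suite."""
    if suite.startswith(("csb_sdlc_", "csb_org_")):
        return RUN_CONFIGS
    raise ValueError(f"unrecognized suite name: {suite!r}")
```

Unknown modes such as `none` pass through unchanged, so the lookup is safe to apply to both old and new run metadata.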
@@ -224,17 +222,18 @@ Each suite directory contains per-task subdirectories with `instruction.md`, `ta

 ## Metrics Extraction Pipeline

-The `scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:
+The `scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output.
+Use `runs/analysis` for active analysis runs (and `runs/official` when producing publishable exports):

 ```bash
-# Generate evaluation report from Harbor runs
+# Generate evaluation report from analysis runs
 python3 scripts/generate_eval_report.py \
---runs-dir /path/to/runs/official/ \
+--runs-dir /path/to/runs/analysis/ \
 --output-dir ./eval_reports/

 # Generate LLM judge context files
 python3 -m scripts.csb_metrics.judge_context \
---runs-dir /path/to/runs/official/ \
+--runs-dir /path/to/runs/analysis/ \
 --benchmarks-dir ./benchmarks/ \
 --output-dir ./judge_contexts/
 ```
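
Review note: the new wording distinguishes `runs/analysis` (active analysis) from `runs/official` (publishable exports). The sketch below shows one way a wrapper could choose between them when building the report command. The function `eval_report_command` and its `publishable` and `runs_root` parameters are hypothetical. Only the script path and the `--runs-dir` and `--output-dir` flags appear in the README.

```python
def eval_report_command(
    publishable: bool,
    runs_root: str = "/path/to/runs",
    output_dir: str = "./eval_reports/",
) -> list[str]:
    """Build the argv for generate_eval_report.py (hypothetical wrapper).

    Uses runs/official for publishable exports and runs/analysis otherwise,
    mirroring the guidance in the README text.
    """
    subdir = "official" if publishable else "analysis"
    runs_dir = f"{runs_root.rstrip('/')}/{subdir}/"
    return [
        "python3",
        "scripts/generate_eval_report.py",
        "--runs-dir", runs_dir,
        "--output-dir", output_dir,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root. Building the argv as a list avoids shell quoting issues with paths that contain spaces.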
@@ -247,9 +246,9 @@ The report generator produces:

 See `python3 scripts/generate_eval_report.py --help` for all options.

-### Publishable Official Results + Trace Browser
+### Official Results + Trace Browser

-To export GitHub-friendly official results (valid scored tasks only) with parsed
+To export official results (valid scored tasks only) with parsed
 trace summaries and local browsing UI:

 ```bash
@@ -264,7 +263,7 @@ This writes:
 - `docs/official_results/tasks/*.md` -- per-task metrics + parsed tool/trace view
 - `docs/official_results/data/official_results.json` -- machine-readable dataset
 - `docs/official_results/audits/*.json` -- per-task audit artifacts (checksums + parsed trace events)
-- `docs/official_results/traces/*/trajectory.json` -- bundled raw trajectory traces for GitHub audit
+- `docs/official_results/traces/*/trajectory.json` -- bundled raw trajectory traces
 - `docs/official_results/index.html` -- interactive local browser

 Suite summaries are deduplicated to the latest result per
