Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
| Baseline | `none` | None (agent uses only built-in tools) |
| MCP | `sourcegraph` / `artifact` (task-dependent) | All 13 Sourcegraph MCP tools including `sg_deepsearch`, `sg_deepsearch_read` |
See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical configuration matrix and tool-by-tool breakdown.
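Since the legacy run-directory names noted above may still appear in historical outputs, an analysis script has to fold them into the current names before grouping results. A minimal sketch of that normalization, assuming the mapping implied by the table (`sourcegraph_full` → `sourcegraph`, `artifact_full` → `artifact`; the helper name is hypothetical, not part of the repo):

```python
# Hypothetical helper: fold legacy run-directory names into the
# current config names. The mapping is assumed from the table above.
LEGACY_ALIASES = {
    "sourcegraph_full": "sourcegraph",
    "artifact_full": "artifact",
}

def normalize_config(name: str) -> str:
    """Return the current config name for a run directory,
    passing through names that are already canonical."""
    return LEGACY_ALIASES.get(name, name)
```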
---
## Metrics Extraction Pipeline
The `scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output.

Use `runs/analysis` for active analysis runs (and `runs/official` when producing publishable exports):
```bash
# Generate evaluation report from analysis runs
python3 scripts/generate_eval_report.py \
--runs-dir /path/to/runs/analysis/ \
--output-dir ./eval_reports/
# Generate LLM judge context files
python3 -m scripts.csb_metrics.judge_context \
--runs-dir /path/to/runs/analysis/ \
--benchmarks-dir ./benchmarks/ \
--output-dir ./judge_contexts/
```
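The extraction step these commands rely on can be pictured as a walk over the runs directory that loads each task's result file. A minimal stdlib-only sketch in the pipeline's spirit — the function name and the `result.json` filename are assumptions for illustration; Harbor's actual on-disk layout may differ:

```python
import json
from pathlib import Path

def collect_task_results(runs_dir: str) -> list[dict]:
    """Recursively load every result.json under runs_dir.

    Assumes one JSON object per task; `result.json` is an
    illustrative filename, not a documented Harbor contract.
    """
    results = []
    for path in sorted(Path(runs_dir).rglob("result.json")):
        with path.open() as f:
            record = json.load(f)
        # Track provenance so downstream reports can cite the run.
        record["_source"] = str(path)
        results.append(record)
    return results
```

Keeping the loader deterministic (sorted paths, no network, stdlib only) is what makes the extracted metrics reproducible across machines.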
See `python3 scripts/generate_eval_report.py --help` for all options.
### Official Results + Trace Browser
To export official results (valid scored tasks only) with parsed