Commit f87e015

Rework official layout to use _raw and document clean top-level
1 parent b152728 commit f87e015

10 files changed: +630 -14 lines changed

README.md

Lines changed: 5 additions & 0 deletions
@@ -225,6 +225,11 @@ Each suite directory contains per-task subdirectories with `instruction.md`, `ta
 The `scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output.
 Use `runs/analysis` for active analysis runs (and `runs/official` when producing publishable exports):

+Official runs layout note:
+- Raw source-of-truth run dirs now live under `runs/official/_raw/`.
+- Top-level `runs/official/` is kept clean for organized benchmark/model views (`csb_sdlc/`, `csb_org/`) plus `MANIFEST.json`.
+- Core scripts (manifest generation, promotion, organizer) resolve `_raw` automatically.
+
 ```bash
 # Generate evaluation report from analysis runs
 python3 scripts/generate_eval_report.py \

docs/OFFICIAL_RESULTS_BROWSER.md

Lines changed: 4 additions & 0 deletions
@@ -46,6 +46,10 @@ python3 scripts/export_official_results.py \
 --output-dir ./docs/official_results/
 ```

+Note: `runs/official/` uses a split layout (`_raw` for raw runs, organized
+views at top-level). Export tooling handles this automatically; pass
+`--runs-dir ./runs/official/` unless you intentionally want a custom root.
+
 If you promote runs with:

 ```bash

docs/ops/SCRIPT_INDEX.md

Lines changed: 1 addition & 0 deletions
@@ -212,6 +212,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
+- `scripts/promote_blocked.py` - Utility script for promote blocked.
 - `scripts/push_base_images_ghcr.sh` - Utility script for push base images ghcr.
 - `scripts/regenerate_artifact_dockerfiles.py` - Utility script for regenerate artifact dockerfiles.
 - `scripts/rehost_sweap_images.py` - Utility script for rehost sweap images.

docs/reference/RESULT_DIRECTORY_SPEC.md

Lines changed: 18 additions & 6 deletions
@@ -14,13 +14,22 @@

 ## Directory Layouts

-`runs/official/` contains batches with **three different directory structures**.
+`runs/official/` now has a split layout:
+
+- Raw run data (source of truth): `runs/official/_raw/`
+- Organized symlink views: `runs/official/csb_sdlc/`, `runs/official/csb_org/`, etc.
+- Canonical manifest: `runs/official/MANIFEST.json`
+
+Any scanner that reads run artifacts MUST scan `runs/official/_raw/` (or use
+`scripts/official_runs.py:raw_runs_dir(...)`), not top-level `runs/official/`.
+
+Inside the raw root, batches use **three different directory structures**.
 Any scanner MUST handle all three or it will under-count results.

 ### Layout 1: Old Promoted Format (pre-2026-02-24)

 ```
-runs/official/{suite}_{model}_{date}/
+runs/official/_raw/{suite}_{model}_{date}/
   baseline/
     {suite}_{task_id}_{config_name}/   ← wrapper dir
       {trial_dirname}/                 ← e.g. sgonly_task-name__AbCdEfG
@@ -40,7 +49,7 @@ Example (historical, from pre-split `csb_sdlc_build` runs): `csb_sdlc_build_haik
 ### Layout 2: Harbor Nested Format (2026-02-24+)

 ```
-runs/official/{suite}_{model}_{timestamp}/
+runs/official/_raw/{suite}_{model}_{timestamp}/
   baseline-local-direct/
     {harbor_timestamp}/                ← e.g. 2026-02-26__00-09-23
       {task_dirname}/                  ← e.g. task-name__AbCdEfG
@@ -57,7 +66,7 @@ runs/official/{suite}_{model}_{timestamp}/
 ### Layout 3: CodeScaleBench-Org / Artifact Format

 ```
-runs/official/{suite}_{model}_{timestamp}/
+runs/official/_raw/{suite}_{model}_{timestamp}/
   baseline-local-direct/               (or baseline-local-artifact)
     {harbor_timestamp}/
       bl_{TASK_ID}_{hash}__hash/       ← bl_ prefix, uppercase task ID
@@ -191,15 +200,17 @@ def extract_task_id_from_result(data: dict, parent_dir: str, suites: set[str]) -
 from pathlib import Path

 # Use rglob to find ALL result.json at any depth
-for rj in Path('runs/official').rglob('result.json'):
+from official_runs import raw_runs_dir
+raw_root = raw_runs_dir(Path('runs/official'))
+for rj in raw_root.rglob('result.json'):
     data = json.loads(rj.read_text())

     # 1. Skip batch-level results
     if 'task_name' not in data:
         continue

     # 2. Determine config from PATH COMPONENTS (not from result content)
-    parts = rj.relative_to(official).parts
+    parts = rj.relative_to(raw_root).parts
     is_baseline = any(p in BL_NAMES for p in parts)
     is_mcp = any(p in MCP_NAMES for p in parts)

@@ -217,6 +228,7 @@ for rj in Path('runs/official').rglob('result.json'):

 | Mistake | Consequence |
 |---|---|
+| Scanning top-level `runs/official/` instead of `_raw` | Mixes in organized symlink views and non-run artifacts |
 | Only checking 2-3 levels deep | Misses Layout 1 (old promoted, 4 levels deep) |
 | Using `task_id` field without checking if it's a dict | Crash or empty string |
 | Not stripping `sgonly_` prefix from `task_name` | No match against selection file |
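
The scanning rules in this spec can be tried out on a throwaway tree. The following is a minimal sketch, not the project's scanner: `scan_task_results`, `normalize_task_name`, and `demo` are hypothetical names, the `BL_NAMES` values are assumed from the directory names shown in the layouts, and placing a batch-level `result.json` directly in the batch directory is an assumption made only for the demo.

```python
import json
import tempfile
from pathlib import Path

# Assumed baseline config directory names, taken from the layouts above;
# the project's real BL_NAMES set may contain other entries.
BL_NAMES = {"baseline", "baseline-local-direct", "baseline-local-artifact"}


def normalize_task_name(task_name: str) -> str:
    """Strip the `sgonly_` prefix so the name can match a selection file."""
    return task_name.removeprefix("sgonly_")


def scan_task_results(raw_root: Path) -> list[dict]:
    """Collect task-level results from result.json files at any depth."""
    found = []
    for rj in sorted(raw_root.rglob("result.json")):
        data = json.loads(rj.read_text())
        if "task_name" not in data:  # batch-level result, skip it
            continue
        # Config is derived from path components, not from file content.
        parts = rj.relative_to(raw_root).parts
        found.append({
            "task": normalize_task_name(data["task_name"]),
            "is_baseline": any(p in BL_NAMES for p in parts),
        })
    return found


def demo() -> list[dict]:
    """Build one Layout 2 task plus one batch-level file, then scan."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_root = Path(tmp) / "runs" / "official" / "_raw"
        batch = raw_root / "suite_model_20260226"  # hypothetical batch name
        task_dir = (
            batch / "baseline-local-direct" / "2026-02-26__00-09-23" / "task-name__AbCdEfG"
        )
        task_dir.mkdir(parents=True)
        (task_dir / "result.json").write_text(json.dumps({"task_name": "sgonly_task-name"}))
        (batch / "result.json").write_text(json.dumps({"summary": "batch-level"}))
        return scan_task_results(raw_root)


if __name__ == "__main__":
    print(demo())
```

Because the scan starts at the raw root and walks every depth, the same function would pick up all three layouts without hard-coding a nesting level.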

scripts/generate_manifest.py

Lines changed: 4 additions & 2 deletions
@@ -17,8 +17,10 @@
 PROJECT_ROOT = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(PROJECT_ROOT / "scripts"))
 from config_utils import discover_configs
+from official_runs import raw_runs_dir

-RUNS_DIR = PROJECT_ROOT / "runs" / "official"
+RUNS_DIR = raw_runs_dir(PROJECT_ROOT / "runs" / "official")
+MANIFEST_OUTPUT_PATH = PROJECT_ROOT / "runs" / "official" / "MANIFEST.json"

 # Directories to skip entirely.
 # __v1_hinted: old run dirs from before enterprise task de-hinting (US-001..US-003).
@@ -938,7 +940,7 @@ def main():
         "run_history": run_history_section,
     }

-    output_path = RUNS_DIR / "MANIFEST.json"
+    output_path = MANIFEST_OUTPUT_PATH
     with open(output_path, "w") as f:
         json.dump(manifest, f, indent=2)

scripts/official_runs.py

Lines changed: 19 additions & 3 deletions
@@ -10,6 +10,7 @@
 DEFAULT_PREFIX_MAP_PATH = Path("configs/run_dir_prefix_map.json")
 TRIAGE_FILENAME = "triage.json"
 TRIAGE_DECISIONS = {"include", "exclude", "pending"}
+RAW_DIRNAME = "_raw"


 def should_skip(dirname: str) -> bool:
@@ -33,13 +34,29 @@ def detect_suite(run_dir_name: str, prefix_map: dict[str, str]) -> str | None:
     return None


+def raw_runs_dir(runs_dir: Path) -> Path:
+    """Return the directory that contains raw official run dirs.
+
+    Compatibility behavior:
+    - New layout: runs/official/_raw (preferred)
+    - Legacy layout: runs/official
+    """
+    if runs_dir.name == RAW_DIRNAME and runs_dir.is_dir():
+        return runs_dir
+    candidate = runs_dir / RAW_DIRNAME
+    if candidate.is_dir():
+        return candidate
+    return runs_dir
+
+
 def top_level_run_dirs(runs_dir: Path) -> list[Path]:
-    if not runs_dir.is_dir():
+    raw_dir = raw_runs_dir(runs_dir)
+    if not raw_dir.is_dir():
         return []
     return sorted(
         [
             p
-            for p in runs_dir.iterdir()
+            for p in raw_dir.iterdir()
             if p.is_dir() and not should_skip(p.name)
         ],
         key=lambda p: p.name,
@@ -90,4 +107,3 @@ def read_triage(run_dir: Path) -> tuple[dict | None, str | None]:
         if not triage.get(field):
             return triage, f"missing_{field}"
     return triage, None
-
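
The compatibility behavior of `raw_runs_dir` can be checked against temporary directories. In the sketch below the function body is copied from the diff above so the snippet runs standalone; `demo` and the directory names under the temp root are hypothetical.

```python
import tempfile
from pathlib import Path

RAW_DIRNAME = "_raw"


def raw_runs_dir(runs_dir: Path) -> Path:
    """Copy of the helper added in this commit, repeated to run standalone."""
    if runs_dir.name == RAW_DIRNAME and runs_dir.is_dir():
        return runs_dir
    candidate = runs_dir / RAW_DIRNAME
    if candidate.is_dir():
        return candidate
    return runs_dir


def demo() -> dict[str, str]:
    """Resolve the raw root for the new layout, a direct _raw path, and legacy."""
    with tempfile.TemporaryDirectory() as tmp:
        new_root = Path(tmp) / "new" / "official"
        (new_root / RAW_DIRNAME).mkdir(parents=True)
        legacy_root = Path(tmp) / "legacy" / "official"
        legacy_root.mkdir(parents=True)
        return {
            "new": raw_runs_dir(new_root).name,
            "already_raw": raw_runs_dir(new_root / RAW_DIRNAME).name,
            "legacy": raw_runs_dir(legacy_root).name,
        }


if __name__ == "__main__":
    print(demo())
```

Callers can therefore pass either `runs/official` or `runs/official/_raw` and get the same scan root, while a checkout without `_raw` keeps its old behavior.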
