|
| 1 | +# Nightly Research Report — 2026-03-22 (Report #16) |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +Ten consecutive days with zero code fixes (last code commit: Mar 12). This report surfaces |
| 6 | +**four new findings** that either correct or extend previous reports: the cost pipeline fix is |
| 7 | +**simpler than report #15 stated** (no schema change required — the model plumbing is 95% |
| 8 | +already in place), `abc_audit.py` has **four additional duplicate function pairs** beyond the |
| 9 | +T5/R2 broken checks already documented, the **primary metrics output (`task_metrics.json`) is |
| 10 | +written non-atomically**, and the **25 total `json.load(open(` sites** across scripts exceed |
| 11 | +the "17+" count previously catalogued. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## 1. Code & Architecture Review |
| 16 | + |
| 17 | +### 1.1 Cost Pipeline Fix is Simpler Than Previously Reported (Correction to Report #15) |
| 18 | + |
| 19 | +Report #15 stated: *"TaskMetrics has no model field — cost pipeline fix requires schema change."* |
| 20 | +This is incorrect. `RunMetrics.model: str` already exists at `csb_metrics/models.py:148`. The |
| 21 | +extraction infrastructure is also already in place: |
| 22 | + |
| 23 | +| Component | Location | Status | |
| 24 | +|-----------|----------|--------| |
| 25 | +| Model extraction from config | `discovery.py:171` `_extract_model_from_config(batch_dir)` | Exists | |
| 26 | +| Model stored per-run | `discovery.py:407–413` `run_metadata[key]["model"]` | Exists | |
| 27 | +| `RunMetrics.model` field | `models.py:148` | Exists | |
| 28 | +| Model passed to cost calculator | `discovery.py:310` — **missing** | **Gap** | |
| 29 | + |
| 30 | +The actual gap is in `_process_task_dir()` (`discovery.py:195`): |
| 31 | + |
| 32 | +```python |
| 33 | +def _process_task_dir( |
| 34 | + task_dir: Path, |
| 35 | + benchmark: str, |
| 36 | + config_name: str, |
| 37 | + is_swebench: bool, # no model parameter |
| 38 | +) -> Optional[TaskMetrics]: |
| 39 | + ... |
| 40 | + if tm.cost_usd is None: |
| 41 | + tm.cost_usd = calculate_cost_from_tokens( # model not passed |
| 42 | + tm.input_tokens, tm.output_tokens, |
| 43 | + tm.cache_creation_tokens, tm.cache_read_tokens, |
| 44 | + ) |
| 45 | +``` |
| 46 | + |
| 47 | +The model is available in the `discover_runs()` caller context at the point |
| 48 | +`_process_task_dir()` is invoked. The fix is three lines: |
| 49 | +1. Add `model: str = "unknown"` to `_process_task_dir`'s signature |
| 50 | +2. Pass `model` to `calculate_cost_from_tokens()` |
| 51 | +3. Pass `model=run_metadata[key]["model"]` when calling `_process_task_dir()` in |
| 52 | + `discover_runs()` |
| 53 | + |
| 54 | +`extract_task_metrics.py:266` needs the same one-line model pass. The `MODEL_PRICING` table |
| 55 | +still needs `claude-sonnet-4-6` and `claude-haiku-4-6` added (2 dict entries). **Total fix |
| 56 | +scope: ~6 lines.** No dataclass schema change required. |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +### 1.2 abc_audit.py Has Four Actual Duplicate Pairs Beyond the T5/R2 Broken Checks (NEW) |
| 61 | + |
| 62 | +Report #15 covered `check_t5_no_solution_leak` and `check_r2_no_contamination` as broken |
| 63 | +single-definition functions. This report identifies **four additional functions defined twice**: |
| 64 | + |
| 65 | +``` |
| 66 | +scripts/abc_audit.py: |
| 67 | + check_t10_shared_state lines 408 and 1548 ← Python uses 1548 (identical code) |
| 68 | + check_oa_equivalent_solutions 1102 and 1606 ← Python uses 1606 (identical code) |
| 69 | + check_ob_negated_solutions 1151 and 1655 ← Python uses 1655 (identical code) |
| 70 | + check_og_determinism 1210 and 1715 ← Python uses 1715 (identical code) |
| 71 | +``` |
| 72 | + |
| 73 | +Unlike T5/R2 (broken single definitions), these four pairs appear to be copy-paste duplicates |
| 74 | +where both copies contain identical logic. Python silently uses the last definition, making the |
| 75 | +first copy dead code. The dead copies span ~500 lines of `abc_audit.py` (lines 408–1544 and |
| 76 | +1102–1714) that serve no function and increase confusion about which implementation is active. |
| 77 | + |
| 78 | +Removing the four dead first-copies would shrink `abc_audit.py` by ~500 lines (~25% of the |
| 79 | +file). More importantly, adding a startup registry check (e.g., asserting all check functions |
| 80 | +appear exactly once in a `_CHECKS` list) would catch future duplicate definitions at import |
| 81 | +time rather than silently. |
| 82 | + |
| 83 | +The full duplicate inventory for abc_audit.py: |
| 84 | + |
| 85 | +| Function | Defined at | Python uses | Issue | |
| 86 | +|----------|-----------|-------------|-------| |
| 87 | +| `check_t5_no_solution_leak` | 293 only | 293 | Broken logic (report #15) | |
| 88 | +| `check_r2_no_contamination` | 712 only | 712 | Broken logic (report #15) | |
| 89 | +| `check_t10_shared_state` | 408, **1548** | **1548** | Dead first copy | |
| 90 | +| `check_oa_equivalent_solutions` | 1102, **1606** | **1606** | Dead first copy | |
| 91 | +| `check_ob_negated_solutions` | 1151, **1655** | **1655** | Dead first copy | |
| 92 | +| `check_og_determinism` | 1210, **1715** | **1715** | Dead first copy | |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +### 1.3 Primary Metrics Output Written Non-Atomically (NEW HIGH SEVERITY) |
| 97 | + |
| 98 | +`scripts/extract_task_metrics.py:394` writes `task_metrics.json` — the primary per-task |
| 99 | +metrics artifact — without atomic temp+rename: |
| 100 | + |
| 101 | +```python |
| 102 | +out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n") # line 394 |
| 103 | +``` |
| 104 | + |
| 105 | +If the process is killed or crashes mid-write (e.g., Ctrl+C during a long metrics extraction |
| 106 | +run, OOM, disk full), `task_metrics.json` is left partially written — valid JSON prefix, |
| 107 | +truncated content. This causes: |
| 108 | + |
| 109 | +1. **`--skip-completed` silently skips the task**: the flag checks for `task_metrics.json` |
| 110 | + existence, not validity. A corrupt file will cause the task to be skipped on the next run. |
| 111 | +2. **Downstream reports include garbage data**: `discover_runs()` loads task_metrics.json and |
| 112 | + merges its fields into `TaskMetrics`. A truncated JSON file raises `json.JSONDecodeError` |
| 113 | + that may be swallowed. |
| 114 | + |
| 115 | +`scripts/reextract_all_metrics.py:247` has the same pattern for bulk re-extraction: |
| 116 | + |
| 117 | +```python |
| 118 | +out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n") # line 247 |
| 119 | +``` |
| 120 | + |
| 121 | +The correct pattern (already implemented in `account_health.py:109–111`): |
| 122 | + |
| 123 | +```python |
| 124 | +tmp = out_path.with_suffix(".tmp") |
| 125 | +tmp.write_text(json.dumps(tm.to_dict(), indent=2) + "\n") |
| 126 | +tmp.replace(out_path) # atomic on POSIX |
| 127 | +``` |
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +### 1.4 FD Leak Count Revised Upward: 25 Confirmed Sites (not "17+") |
| 132 | + |
| 133 | +Running `grep -rn "json.load(open\|json.loads(open" scripts/` finds **25 occurrences** |
| 134 | +across Python scripts and shell-embedded Python one-liners. The previously catalogued "17+" |
| 135 | +missed the following: |
| 136 | + |
| 137 | +| New sites | Location | Note | |
| 138 | +|-----------|----------|------| |
| 139 | +| `rerun_fixed_tasks.sh:325,326` | Shell Python one-liners | Process exit closes FD; low risk | |
| 140 | +| `rerun_zero_mcp_tasks.sh:231,232` | Shell Python one-liners | Same | |
| 141 | +| `rerun_crossrepo_2tasks.sh:142,143` | Shell Python one-liners | Same | |
| 142 | +| `rerun_errored_tasks.sh:253,254` | Shell Python one-liners | Same | |
| 143 | +| `rerun_crossrepo_all4.sh:149` | Shell Python one-liner | Same | |
| 144 | +| `daytona_curator_runner.py:565,624` | Two separate module-level loads of `/tmp/curator_config.json` | Hardcoded path + no context manager | |
| 145 | + |
| 146 | +The shell one-liners are low actual risk (the Python subprocess exits immediately, closing all |
| 147 | +FDs), but they inflate the metric and count against any Ruff automation. The |
| 148 | +`daytona_curator_runner.py` sites are a higher concern because the same hardcoded |
| 149 | +`/tmp/curator_config.json` path is loaded **twice** at module level without a context manager, |
| 150 | +and the `/tmp` path is unparameterized. |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +### 1.5 CI Workflow Audit: Python Version Inconsistency Confirmed |
| 155 | + |
| 156 | +Full version matrix across the 4 workflows: |
| 157 | + |
| 158 | +| Workflow | Python | `permissions:` block | Trigger | |
| 159 | +|----------|--------|---------------------|---------| |
| 160 | +| `repo_health.yml` | 3.10 | Missing | push + PR | |
| 161 | +| `docs-consistency.yml` | **3.11** | Missing | push + PR | |
| 162 | +| `task_smoke_matrix.yml` | 3.10 | Missing | paths-filtered push + PR | |
| 163 | +| `roam.yml` | **3.12** | Present | PR only | |
| 164 | + |
| 165 | +Three findings: |
| 166 | +1. **`roam.yml` is the only workflow with a `permissions:` block** — and it only runs on PRs, |
| 167 | + not pushes. The push-to-main path (repo_health, docs-consistency) runs with default broad |
| 168 | + token scope on every commit. |
| 169 | +2. **`docs-consistency.yml` uses Python 3.11** vs the 3.10 baseline. Any 3.10→3.11 behavioral |
| 170 | + difference in `docs_consistency_check.py` would be invisible in local testing. |
| 171 | +3. **`refresh_agent_navigation.py --check` runs in both `repo_health.yml` and |
| 172 | + `docs-consistency.yml`** — identical step in two workflows, wasting CI minutes on every PR. |
| 173 | + One should be removed. |
| 174 | + |
| 175 | +--- |
| 176 | + |
| 177 | +### 1.6 validate_tasks_preflight.py Non-Atomic Writes in Smoke Fixture (NEW) |
| 178 | + |
| 179 | +`scripts/validate_tasks_preflight.py` writes test helper scripts into a sandbox: |
| 180 | + |
| 181 | +```python |
| 182 | +helper.write_text(helper_rewritten) # line 972 — sandbox/*.sh |
| 183 | +runner.write_text(runner_header + script_body) # line 993 — sandbox/_run_fixture.sh |
| 184 | +``` |
| 185 | + |
| 186 | +These writes happen inside a `try/except OSError` block (line 973), which means if the write |
| 187 | +partially completes, the exception is swallowed and the smoke test runs against a corrupt |
| 188 | +fixture. Since `_run_fixture.sh` is the entry point for the sandbox test, a partially written |
| 189 | +file causes a cryptic `bash: syntax error` rather than a clear "fixture write failed" error. |
| 190 | + |
| 191 | +--- |
| 192 | + |
| 193 | +## 2. Feature & UX Improvements |
| 194 | + |
| 195 | +### 2.1 Cost Reports Need Model Attribution Annotation |
| 196 | + |
| 197 | +Every cost report produced by `generate_eval_report.py`, `compare_configs.py`, and the HTML |
| 198 | +export currently uses Opus 4.5 pricing regardless of the actual run model. Before the full fix |
| 199 | +lands, adding a visible disclaimer would prevent misleading data from being acted on: |
| 200 | + |
| 201 | +``` |
| 202 | +⚠ Cost estimates use claude-opus-4-5 pricing (actual model not recorded) |
| 203 | + Sonnet runs may be 5× overstated; Haiku runs may be 19× overstated. |
| 204 | +``` |
| 205 | + |
| 206 | +This requires zero code changes in the cost pipeline — just a one-line annotation added to |
| 207 | +the HTML template and to the `compare_configs.py` summary header. |
| 208 | + |
| 209 | +### 2.2 abc_audit Dead-Copy Removal Should Be a CI Gate |
| 210 | + |
| 211 | +After removing the 4 dead duplicate function copies (~500 lines), a simple import-time check |
| 212 | +would prevent future regressions: |
| 213 | + |
| 214 | +```python |
| 215 | +# At bottom of abc_audit.py: |
| 216 | +_REGISTERED_CHECKS = { |
| 217 | + check_t5_no_solution_leak, check_t10_shared_state, check_r2_no_contamination, |
| 218 | + check_oa_equivalent_solutions, check_ob_negated_solutions, check_og_determinism, |
| 219 | + # ... |
| 220 | +} |
| 221 | +assert len(_REGISTERED_CHECKS) == len({f.__name__ for f in _REGISTERED_CHECKS}), \ |
| 222 | + "Duplicate check function detected — check abc_audit.py for duplicate def statements" |
| 223 | +``` |
| 224 | + |
| 225 | +Since function objects are compared by identity, this assertion fails at import time if any |
| 226 | +two registered entries have the same `__name__`, catching future copy-paste duplicates. |
| 227 | + |
| 228 | +### 2.3 Smoke Test Fixture Writes Should Be Atomic |
| 229 | + |
| 230 | +`validate_tasks_preflight.py` writes `_run_fixture.sh` and helper scripts directly. If the |
| 231 | +sandbox write fails mid-way (e.g., disk quota), the test proceeds with a corrupt script. |
| 232 | +Converting these two writes to atomic temp+rename ensures either a complete fixture or a clean |
| 233 | +failure. |
| 234 | + |
| 235 | +--- |
| 236 | + |
| 237 | +## 3. Research Recommendations |
| 238 | + |
| 239 | +### 3.1 `atomic_write` Utility Should Be a Shared Helper |
| 240 | + |
| 241 | +The temp+rename pattern (`write_text` to `.tmp` then `Path.replace()`) is the correct solution |
| 242 | +for at least 8 known write sites: |
| 243 | + |
| 244 | +| File | Line | Write Target | Risk | |
| 245 | +|------|------|-------------|------| |
| 246 | +| `daytona_runner.py` | 234 | OAuth credentials | Authentication corruption | |
| 247 | +| `sync_agent_guides.py` | 22 | AGENTS.md / CLAUDE.md | Operational guidance corruption | |
| 248 | +| `daytona_cost_guard.py` | 663 | Cost guard cache | Wrong cost estimates served | |
| 249 | +| `extract_task_metrics.py` | 394 | task_metrics.json | Skip-completed bypass + bad data | |
| 250 | +| `reextract_all_metrics.py` | 247 | task_metrics.json (bulk) | Same | |
| 251 | +| `aggregate_status.py` | 669 | Status aggregate | Stale status report | |
| 252 | +| `apply_verifier_fixes.py` | 103,117,134 | Verifier files | Corrupt verifiers | |
| 253 | +| `validate_tasks_preflight.py` | 972, 993 | Sandbox fixture | Cryptic bash errors | |
| 254 | + |
| 255 | +A 10-line `atomic_write(path: Path, content: str) -> None` utility in |
| 256 | +`csb_metrics/io_utils.py` (or `scripts/utils.py`) would let each site convert to a one-line |
| 257 | +call. This is preferable to a PRD that fixes each site individually, since a shared utility is |
| 258 | +testable and ensures uniform behavior. |
| 259 | + |
| 260 | +### 3.2 `_process_task_dir` Signature Audit Before Adding `model` |
| 261 | + |
| 262 | +Before adding `model: str` to `_process_task_dir()`, check all callers beyond `discover_runs()` |
| 263 | +to ensure they can supply the model. A quick `grep` finds: |
| 264 | + |
| 265 | +```bash |
| 266 | +grep -rn "_process_task_dir\|from csb_metrics.discovery import" scripts/ |
| 267 | +``` |
| 268 | + |
| 269 | +If any callers can't supply a model, `model: str = "unknown"` as a default is safe (it |
| 270 | +preserves existing behavior while enabling the fix for callers that do have the model). |
| 271 | + |
| 272 | +### 3.3 Function Registry Pattern for Audit Modules |
| 273 | + |
| 274 | +`abc_audit.py`'s silent-override problem is a general risk in any large audit module where |
| 275 | +functions are defined once and called collectively. A registration decorator (similar to pytest |
| 276 | +fixtures) prevents this class of bug: |
| 277 | + |
| 278 | +```python |
| 279 | +_checks: list[Callable] = [] |
| 280 | +def check(fn: Callable) -> Callable: |
| 281 | + if any(f.__name__ == fn.__name__ for f in _checks): |
| 282 | + raise RuntimeError(f"Duplicate check: {fn.__name__}") |
| 283 | + _checks.append(fn) |
| 284 | + return fn |
| 285 | + |
| 286 | +@check |
| 287 | +def check_t5_no_solution_leak(tasks: list[Path]) -> CriterionResult: |
| 288 | + ... |
| 289 | +``` |
| 290 | + |
| 291 | +This converts a runtime silent failure into an import-time `RuntimeError`, making duplicate |
| 292 | +definitions detectable before any audit run. |
| 293 | + |
| 294 | +--- |
| 295 | + |
| 296 | +## 4. Recommended Next Feature |
| 297 | + |
| 298 | +### Cost Pipeline Accuracy Fix (Simplified by This Report's Findings) |
| 299 | + |
| 300 | +**Why now:** Report #15 deferred this fix by stating it requires a schema change. This report |
| 301 | +confirms no schema change is needed — the model plumbing is 95% in place. The full fix is |
| 302 | +bounded and testable. |
| 303 | + |
| 304 | +**Scope (3 files, ~6 lines of production code):** |
| 305 | + |
| 306 | +**`scripts/csb_metrics/discovery.py`** — add `model` parameter to `_process_task_dir`: |
| 307 | +```python |
| 308 | +# Before (line 195): |
| 309 | +def _process_task_dir(task_dir, benchmark, config_name, is_swebench): |
| 310 | + |
| 311 | +# After: |
| 312 | +def _process_task_dir(task_dir, benchmark, config_name, is_swebench, model="unknown"): |
| 313 | + ... |
| 314 | + tm.cost_usd = calculate_cost_from_tokens( |
| 315 | + tm.input_tokens, tm.output_tokens, |
| 316 | + tm.cache_creation_tokens, tm.cache_read_tokens, |
| 317 | + model=model, # ← add this |
| 318 | + ) |
| 319 | +``` |
| 320 | + |
| 321 | +Pass `model=run_metadata[key]["model"]` at the call site in `discover_runs()`. |
| 322 | + |
| 323 | +**`scripts/extract_task_metrics.py`** — add model read + pass: |
| 324 | +```python |
| 325 | +# Read model from config.json (already done in discovery.py, replicate here) |
| 326 | +model = _read_model_from_result(result_path) # 3-line helper |
| 327 | +tm.cost_usd = calculate_cost_from_tokens(..., model=model) |
| 328 | +``` |
| 329 | + |
| 330 | +**`scripts/csb_metrics/extractors.py`** — add 2 pricing entries: |
| 331 | +```python |
| 332 | +"claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_write": 3.75, "cache_read": 0.30}, |
| 333 | +"claude-haiku-4-6": {"input": 0.80, "output": 4.0, "cache_write": 1.0, "cache_read": 0.08}, |
| 334 | +``` |
| 335 | + |
| 336 | +**Why these changes together:** |
| 337 | +- All Sonnet 4.5/4.6 runs have been reporting 5× inflated costs — all historical reports are |
| 338 | + wrong |
| 339 | +- All Haiku runs have been reporting 19× inflated costs |
| 340 | +- `configs/openhands_2config.sh:60` already defaults to `claude-sonnet-4-6`, making this a |
| 341 | + live bug for all OpenHands results |
| 342 | +- The fix requires no CI infrastructure changes, no new scripts, no new tests (existing |
| 343 | + `test_extract_task_metrics.py` already covers cost paths) |
| 344 | +- A one-commit PR fixing both callers + the pricing table constitutes a complete remediation |
| 345 | +- After this PR, cost reports are accurate for the first time since the project switched from |
| 346 | + Opus-only runs |
| 347 | + |
| 348 | +**Bonus (same PR):** Convert `extract_task_metrics.py:394` to atomic write (3 lines) while |
| 349 | +touching the file, preventing the corrupt-metrics-file bug noted in §1.3. |
| 350 | + |
| 351 | +--- |
| 352 | + |
| 353 | +## 5. Issues Summary (New This Report) |
| 354 | + |
| 355 | +| ID | File | Line(s) | Issue | Severity | |
| 356 | +|----|------|---------|-------|----------| |
| 357 | +| N-01 | `csb_metrics/discovery.py` | 195–200 | `_process_task_dir` missing `model` param — fix is 3 lines (not schema change) | HIGH | |
| 358 | +| N-02 | `scripts/abc_audit.py` | 408/1548 | `check_t10_shared_state` defined twice (T10 is 4th actual duplicate) | MEDIUM | |
| 359 | +| N-03 | `scripts/abc_audit.py` | 1102/1606, 1151/1655, 1210/1715 | OA/OB/OG: 3 more duplicate pairs with dead first copies (~500 lines of dead code) | MEDIUM | |
| 360 | +| N-04 | `scripts/extract_task_metrics.py` | 394 | Non-atomic write of `task_metrics.json` (primary metrics output) | HIGH | |
| 361 | +| N-05 | `scripts/reextract_all_metrics.py` | 247 | Non-atomic write in bulk re-extraction | MEDIUM | |
| 362 | +| N-06 | `scripts/validate_tasks_preflight.py` | 972, 993 | Non-atomic writes of smoke sandbox fixture | LOW | |
| 363 | +| N-07 | `scripts/csb_metrics/discovery.py` | 565, 624 | `daytona_curator_runner.py` loads `/tmp/curator_config.json` twice without context manager + hardcoded path | LOW | |
| 364 | +| N-08 | `.github/workflows/docs-consistency.yml` | 19 | Python 3.11 vs 3.10 baseline; `--check` nav step duplicated from repo_health.yml | LOW | |
| 365 | +| N-09 | All 4 CI workflows | — | Correction: `roam.yml` is the ONLY workflow with `permissions:` block (not "3/4 missing" — all push workflows are unguarded) | MEDIUM | |
| 366 | + |
| 367 | +--- |
| 368 | + |
| 369 | +*Remediation velocity: 10 consecutive days without a code fix (Mar 12 → Mar 22). ~130 open |
| 370 | +issues across 16 reports. Active PRD: `ralph/fix-common-patterns-2026-03-20` (15 stories, |
| 371 | +all unstarted). Recommended next feature: cost pipeline accuracy fix (~6 lines, no schema |
| 372 | +change, corrects 5–19× cost overstatement for all non-Opus runs).* |
0 commit comments