docs: nightly research report 2026-03-22 (report #16)

sjarmak · sjarmak · commit ce0e24990090 · 2026-03-22T23:09:37.000-04:00
New findings: cost pipeline fix is simpler than report #15 stated (RunMetrics.model
already exists, _process_task_dir just needs model param — no schema change required);
abc_audit.py has 4 more duplicate function pairs (T10/OA/OB/OG) beyond the T5/R2 broken
checks; extract_task_metrics.py:394 writes task_metrics.json non-atomically; FD leak
count revised to 25 confirmed sites. Recommended feature: cost pipeline accuracy fix.
diff --git a/reports/nightly/2026-03-22-review.md b/reports/nightly/2026-03-22-review.md
@@ -0,0 +1,372 @@
+# Nightly Research Report — 2026-03-22 (Report #16)
+
+## Executive Summary
+
+Ten consecutive days with zero code fixes (last code commit: Mar 12). This report surfaces
+**four new findings** that either correct or extend previous reports: the cost pipeline fix is
+**simpler than report #15 stated** (no schema change required — the model plumbing is 95%
+already in place), `abc_audit.py` has **four additional duplicate function pairs** beyond the
+T5/R2 broken checks already documented, the **primary metrics output (`task_metrics.json`) is
+written non-atomically**, and the **25 total `json.load(open(` sites** across scripts exceed
+the "17+" count previously catalogued.
+
+---
+
+## 1. Code & Architecture Review
+
+### 1.1 Cost Pipeline Fix is Simpler Than Previously Reported (Correction to Report #15)
+
+Report #15 stated: *"TaskMetrics has no model field — cost pipeline fix requires schema change."*
+This is incorrect. `RunMetrics.model: str` already exists at `csb_metrics/models.py:148`. The
+extraction infrastructure is also already in place:
+
+| Component | Location | Status |
+|-----------|----------|--------|
+| Model extraction from config | `discovery.py:171` `_extract_model_from_config(batch_dir)` | Exists |
+| Model stored per-run | `discovery.py:407–413` `run_metadata[key]["model"]` | Exists |
+| `RunMetrics.model` field | `models.py:148` | Exists |
+| Model passed to cost calculator | `discovery.py:310` — **missing** | **Gap** |
+
+The actual gap is in `_process_task_dir()` (`discovery.py:195`):
+
+```python
+def _process_task_dir(
+    task_dir: Path,
+    benchmark: str,
+    config_name: str,
+    is_swebench: bool,            # no model parameter
+) -> Optional[TaskMetrics]:
+    ...
+    if tm.cost_usd is None:
+        tm.cost_usd = calculate_cost_from_tokens(   # model not passed
+            tm.input_tokens, tm.output_tokens,
+            tm.cache_creation_tokens, tm.cache_read_tokens,
+        )
+```
+
+The model is available in the `discover_runs()` caller context at the point
+`_process_task_dir()` is invoked. The fix is three lines:
+1. Add `model: str = "unknown"` to `_process_task_dir`'s signature
+2. Pass `model` to `calculate_cost_from_tokens()`
+3. Pass `model=run_metadata[key]["model"]` when calling `_process_task_dir()` in
+   `discover_runs()`
+
+`extract_task_metrics.py:266` needs the same one-line model pass. The `MODEL_PRICING` table
+still needs `claude-sonnet-4-6` and `claude-haiku-4-6` added (2 dict entries). **Total fix
+scope: ~6 lines.** No dataclass schema change required.
+
+---
+
+### 1.2 abc_audit.py Has Four Actual Duplicate Pairs Beyond the T5/R2 Broken Checks (NEW)
+
+Report #15 covered `check_t5_no_solution_leak` and `check_r2_no_contamination` as broken
+single-definition functions. This report identifies **four additional functions defined twice**:
+
+```
+scripts/abc_audit.py:
+  check_t10_shared_state         lines  408 and 1548  ← Python uses 1548 (identical code)
+  check_oa_equivalent_solutions  lines 1102 and 1606  ← Python uses 1606 (identical code)
+  check_ob_negated_solutions     lines 1151 and 1655  ← Python uses 1655 (identical code)
+  check_og_determinism           lines 1210 and 1715  ← Python uses 1715 (identical code)
+```
+
+Unlike T5/R2 (broken single definitions), these four pairs appear to be copy-paste duplicates
+where both copies contain identical logic. Python silently binds each name to its last
+definition, so the first copies (starting at lines 408, 1102, 1151, and 1210) are dead code:
+an estimated ~500 lines that serve no function and obscure which implementation is active.
+
+Removing the four dead first copies would shrink `abc_audit.py` by ~500 lines (~25% of the
+file). More importantly, a guard that runs as each function is defined (such as the
+registration decorator in §3.3) would catch future duplicates at import time. A check that
+only inspects the final module namespace cannot, because the earlier binding is already gone.
+
+The full inventory of affected check functions in `abc_audit.py`:
+
+| Function | Defined at | Python uses | Issue |
+|----------|-----------|-------------|-------|
+| `check_t5_no_solution_leak` | 293 only | 293 | Broken logic (report #15) |
+| `check_r2_no_contamination` | 712 only | 712 | Broken logic (report #15) |
+| `check_t10_shared_state` | 408, **1548** | **1548** | Dead first copy |
+| `check_oa_equivalent_solutions` | 1102, **1606** | **1606** | Dead first copy |
+| `check_ob_negated_solutions` | 1151, **1655** | **1655** | Dead first copy |
+| `check_og_determinism` | 1210, **1715** | **1715** | Dead first copy |
+
+---
+
+### 1.3 Primary Metrics Output Written Non-Atomically (NEW, HIGH SEVERITY)
+
+`scripts/extract_task_metrics.py:394` writes `task_metrics.json` — the primary per-task
+metrics artifact — without atomic temp+rename:
+
+```python
+out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")  # line 394
+```
+
+If the process is killed or crashes mid-write (e.g., Ctrl+C during a long metrics extraction
+run, OOM, disk full), `task_metrics.json` can be left partially written: a prefix of the
+intended JSON that no longer parses. This causes:
+
+1. **`--skip-completed` silently skips the task**: the flag checks for `task_metrics.json`
+   existence, not validity. A corrupt file will cause the task to be skipped on the next run.
+2. **Downstream reports can silently lose the task's metrics**: `discover_runs()` loads
+   `task_metrics.json` and merges its fields into `TaskMetrics`. A truncated file raises
+   `json.JSONDecodeError`, which may be swallowed, so the failure goes unreported.
+
+`scripts/reextract_all_metrics.py:247` has the same pattern for bulk re-extraction:
+
+```python
+out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")  # line 247
+```
+
+The correct pattern (already implemented in `account_health.py:109–111`):
+
+```python
+tmp = out_path.with_suffix(".tmp")
+tmp.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")
+tmp.replace(out_path)  # atomic on POSIX
+```
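+
+A complementary safeguard for failure mode 1 is to have the completion check validate the
+file instead of testing only for existence. The sketch below is an assumption, not existing
+code: `is_complete_metrics_file` is a hypothetical helper name, and it assumes
+`tm.to_dict()` always serializes to a JSON object.
+
+```python
+import json
+from pathlib import Path
+
+
+def is_complete_metrics_file(path: Path) -> bool:
+    """Return True only if path exists and parses as a JSON object."""
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, ValueError):
+        # Missing, unreadable, or truncated/invalid JSON: treat as not completed.
+        # json.JSONDecodeError is a subclass of ValueError.
+        return False
+    return isinstance(data, dict)
+```
+
+With a check like this, a truncated `task_metrics.json` would be re-extracted on the next
+run rather than skipped.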
+
+---
+
+### 1.4 FD Leak Count Revised Upward: 25 Confirmed Sites (not "17+")
+
+Running `grep -rn "json.load(open\|json.loads(open" scripts/` finds **25 occurrences**
+across Python scripts and shell-embedded Python one-liners. The previously catalogued "17+"
+missed the following:
+
+| Site (file:lines) | Kind | Note |
+|-----------|----------|------|
+| `rerun_fixed_tasks.sh:325,326` | Shell Python one-liners | Process exit closes FD; low risk |
+| `rerun_zero_mcp_tasks.sh:231,232` | Shell Python one-liners | Same |
+| `rerun_crossrepo_2tasks.sh:142,143` | Shell Python one-liners | Same |
+| `rerun_errored_tasks.sh:253,254` | Shell Python one-liners | Same |
+| `rerun_crossrepo_all4.sh:149` | Shell Python one-liner | Same |
+| `daytona_curator_runner.py:565,624` | Two separate module-level loads of `/tmp/curator_config.json` | Hardcoded path + no context manager |
+
+The shell one-liners are low actual risk (the Python subprocess exits immediately, closing all
+FDs), but they inflate the count and sit outside Ruff-based automation, which lints Python
+files rather than Python embedded in shell scripts. The `daytona_curator_runner.py` sites are
+a higher concern because the same hardcoded `/tmp/curator_config.json` path is loaded
+**twice** at module level without a context manager, and the `/tmp` path is unparameterized.
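+
+For the Python sites, the usual remedy is to open the file in a `with` block so the handle
+is closed deterministically. A minimal sketch, assuming a small shared helper would be
+acceptable (`load_json` is a hypothetical name, not an existing function in the repo):
+
+```python
+import json
+from pathlib import Path
+from typing import Any, Union
+
+
+def load_json(path: Union[str, Path]) -> Any:
+    """Load a JSON file, closing the file handle as soon as parsing finishes."""
+    with open(path, encoding="utf-8") as f:
+        return json.load(f)
+```
+
+In `daytona_curator_runner.py`, a helper like this could be called once and the result
+reused, which would also remove the second load of the same file.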
+
+---
+
+### 1.5 CI Workflow Audit: Python Version Inconsistency Confirmed
+
+Full version matrix across the 4 workflows:
+
+| Workflow | Python | `permissions:` block | Trigger |
+|----------|--------|---------------------|---------|
+| `repo_health.yml` | 3.10 | Missing | push + PR |
+| `docs-consistency.yml` | **3.11** | Missing | push + PR |
+| `task_smoke_matrix.yml` | 3.10 | Missing | paths-filtered push + PR |
+| `roam.yml` | **3.12** | Present | PR only |
+
+Three findings:
+1. **`roam.yml` is the only workflow with a `permissions:` block** — and it only runs on PRs,
+   not pushes. The push-to-main path (repo_health, docs-consistency) runs with default broad
+   token scope on every commit.
+2. **`docs-consistency.yml` uses Python 3.11** vs the 3.10 baseline. Any 3.10→3.11 behavioral
+   difference in `docs_consistency_check.py` would be invisible in local testing.
+3. **`refresh_agent_navigation.py --check` runs in both `repo_health.yml` and
+   `docs-consistency.yml`** — identical step in two workflows, wasting CI minutes on every PR.
+   One should be removed.
+
+---
+
+### 1.6 validate_tasks_preflight.py Non-Atomic Writes in Smoke Fixture (NEW)
+
+`scripts/validate_tasks_preflight.py` writes test helper scripts into a sandbox:
+
+```python
+helper.write_text(helper_rewritten)          # line 972 — sandbox/*.sh
+runner.write_text(runner_header + script_body)  # line 993 — sandbox/_run_fixture.sh
+```
+
+These writes happen inside a `try/except OSError` block (line 973), which means if the write
+partially completes, the exception is swallowed and the smoke test runs against a corrupt
+fixture. Since `_run_fixture.sh` is the entry point for the sandbox test, a partially written
+file causes a cryptic `bash: syntax error` rather than a clear "fixture write failed" error.
+
+---
+
+## 2. Feature & UX Improvements
+
+### 2.1 Cost Reports Need Model Attribution Annotation
+
+Every cost report produced by `generate_eval_report.py`, `compare_configs.py`, and the HTML
+export currently uses Opus 4.5 pricing regardless of the actual run model. Before the full fix
+lands, adding a visible disclaimer would prevent misleading data from being acted on:
+
+```
+⚠  Cost estimates use claude-opus-4-5 pricing (actual model not recorded)
+   Sonnet runs may be 5× overstated; Haiku runs may be 19× overstated.
+```
+
+This requires zero code changes in the cost pipeline — just a one-line annotation added to
+the HTML template and to the `compare_configs.py` summary header.
+
+### 2.2 abc_audit Dead-Copy Removal Should Be a CI Gate
+
+After removing the 4 dead duplicate function copies (~500 lines), a static check run in CI
+would prevent future regressions. An import-time assertion over a set of the function objects
+cannot do this: by the time it runs, each name already refers only to its last definition.
+Scanning the source for repeated top-level `def` names does detect the duplicates:
+
+```python
+# CI check (for example in a test module, run from the repo root):
+# flag top-level function names that are defined more than once.
+import ast
+from collections import Counter
+from pathlib import Path
+
+source = Path("scripts/abc_audit.py").read_text()
+names = [n.name for n in ast.parse(source).body if isinstance(n, ast.FunctionDef)]
+dupes = sorted(name for name, count in Counter(names).items() if count > 1)
+assert not dupes, f"Duplicate function definitions in abc_audit.py: {dupes}"
+```
+
+Because the scan reads the source text rather than the imported module, it sees every `def`
+statement, including the ones Python would silently override.
+
+### 2.3 Smoke Test Fixture Writes Should Be Atomic
+
+`validate_tasks_preflight.py` writes `_run_fixture.sh` and helper scripts directly. If the
+sandbox write fails mid-way (e.g., disk quota), the test proceeds with a corrupt script.
+Converting these two writes to atomic temp+rename ensures either a complete fixture or a clean
+failure.
+
+---
+
+## 3. Research Recommendations
+
+### 3.1 `atomic_write` Utility Should Be a Shared Helper
+
+The temp+rename pattern (`write_text` to `.tmp` then `Path.replace()`) is the correct solution
+for known write sites in at least 8 files:
+
+| File | Line | Write Target | Risk |
+|------|------|-------------|------|
+| `daytona_runner.py` | 234 | OAuth credentials | Authentication corruption |
+| `sync_agent_guides.py` | 22 | AGENTS.md / CLAUDE.md | Operational guidance corruption |
+| `daytona_cost_guard.py` | 663 | Cost guard cache | Wrong cost estimates served |
+| `extract_task_metrics.py` | 394 | task_metrics.json | Skip-completed bypass + bad data |
+| `reextract_all_metrics.py` | 247 | task_metrics.json (bulk) | Same |
+| `aggregate_status.py` | 669 | Status aggregate | Stale status report |
+| `apply_verifier_fixes.py` | 103,117,134 | Verifier files | Corrupt verifiers |
+| `validate_tasks_preflight.py` | 972, 993 | Sandbox fixture | Cryptic bash errors |
+
+A 10-line `atomic_write(path: Path, content: str) -> None` utility in
+`csb_metrics/io_utils.py` (or `scripts/utils.py`) would let each site convert to a one-line
+call. This is preferable to a PRD that fixes each site individually, since a shared utility is
+testable and ensures uniform behavior.
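+
+One possible shape for that utility is sketched below. It is an illustration under stated
+assumptions, not the existing `account_health.py` code: it uses a uniquely named temp file
+from `tempfile.mkstemp` (rather than a fixed `.tmp` suffix) so concurrent writers do not
+collide, and it calls `os.fsync` before the rename.
+
+```python
+import os
+import tempfile
+from pathlib import Path
+
+
+def atomic_write(path: Path, content: str) -> None:
+    """Write content to path via a temp file in the same directory, then rename."""
+    path = Path(path)
+    # Creating the temp file next to the target keeps the rename on one
+    # filesystem, which os.replace needs in order to be atomic.
+    fd, tmp_name = tempfile.mkstemp(
+        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
+    )
+    try:
+        with os.fdopen(fd, "w", encoding="utf-8") as f:
+            f.write(content)
+            f.flush()
+            os.fsync(f.fileno())  # push the bytes to disk before the rename
+        os.replace(tmp_name, path)
+    except BaseException:
+        # Leave any previous file untouched and remove the partial temp file.
+        try:
+            os.unlink(tmp_name)
+        except FileNotFoundError:
+            pass
+        raise
+```
+
+A call site would then reduce to
+`atomic_write(out_path, json.dumps(tm.to_dict(), indent=2) + "\n")`. One design point to
+check before adopting it: `mkstemp` creates the file with owner-only permissions, and the
+renamed file keeps them, which suits the OAuth credentials but may differ from what
+`write_text` produces today for the other targets.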
+
+### 3.2 `_process_task_dir` Signature Audit Before Adding `model`
+
+Before adding `model: str` to `_process_task_dir()`, check all callers beyond `discover_runs()`
+to ensure they can supply the model. A quick `grep` lists the candidates:
+
+```bash
+grep -rn "_process_task_dir\|from csb_metrics.discovery import" scripts/
+```
+
+If any callers can't supply a model, `model: str = "unknown"` as a default is safe (it
+preserves existing behavior while enabling the fix for callers that do have the model).
+
+### 3.3 Function Registry Pattern for Audit Modules
+
+`abc_audit.py`'s silent-override problem is a general risk in any large audit module where
+functions are defined once and called collectively. A registration decorator (similar to pytest
+fixtures) prevents this class of bug:
+
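+```python
+from pathlib import Path
+from typing import Callable
+
+# CriterionResult is assumed to be the result type already used by the checks
+# in abc_audit.py; it is not defined in this snippet.
+_checks: list[Callable] = []
+
+def check(fn: Callable) -> Callable:
+    """Register a check; fail at import time if the name is already taken."""
+    if any(f.__name__ == fn.__name__ for f in _checks):
+        raise RuntimeError(f"Duplicate check: {fn.__name__}")
+    _checks.append(fn)
+    return fn
+
+@check
+def check_t5_no_solution_leak(tasks: list[Path]) -> CriterionResult:
+    ...
+```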
+
+This converts a silent runtime failure into an import-time `RuntimeError`, making duplicate
+definitions detectable before any audit run.
+
+---
+
+## 4. Recommended Next Feature
+
+### Cost Pipeline Accuracy Fix (Simplified by This Report's Findings)
+
+**Why now:** Report #15 deferred this fix by stating it requires a schema change. This report
+confirms no schema change is needed — the model plumbing is 95% in place. The full fix is
+bounded and testable.
+
+**Scope (3 files, ~6 lines of production code):**
+
+**`scripts/csb_metrics/discovery.py`** — add `model` parameter to `_process_task_dir`:
+```python
+# Before (line 195):
+def _process_task_dir(task_dir, benchmark, config_name, is_swebench):
+
+# After:
+def _process_task_dir(task_dir, benchmark, config_name, is_swebench, model="unknown"):
+    ...
+    tm.cost_usd = calculate_cost_from_tokens(
+        tm.input_tokens, tm.output_tokens,
+        tm.cache_creation_tokens, tm.cache_read_tokens,
+        model=model,  # ← add this
+    )
+```
+
+Pass `model=run_metadata[key]["model"]` at the call site in `discover_runs()`.
+
+**`scripts/extract_task_metrics.py`** — add model read + pass:
+```python
+# Read model from config.json (already done in discovery.py, replicate here)
+model = _read_model_from_result(result_path)  # proposed 3-line helper (does not exist yet)
+tm.cost_usd = calculate_cost_from_tokens(..., model=model)
+```
+
+**`scripts/csb_metrics/extractors.py`** — add 2 pricing entries:
+```python
+"claude-sonnet-4-6": {"input": 3.0,  "output": 15.0, "cache_write": 3.75,  "cache_read": 0.30},
+"claude-haiku-4-6":  {"input": 0.80, "output": 4.0,  "cache_write": 1.0,   "cache_read": 0.08},
+```
+
+**Why these changes together:**
+- All Sonnet 4.5/4.6 runs have been reporting 5× inflated costs — all historical reports are
+  wrong
+- All Haiku runs have been reporting 19× inflated costs
+- `configs/openhands_2config.sh:60` already defaults to `claude-sonnet-4-6`, making this a
+  live bug for all OpenHands results
+- The fix requires no CI infrastructure changes, no new scripts, no new tests (existing
+  `test_extract_task_metrics.py` already covers cost paths)
+- A one-commit PR fixing both callers + the pricing table constitutes a complete remediation
+- After this PR, cost reports are accurate for the first time since the project switched from
+  Opus-only runs
+
+**Bonus (same PR):** Convert `extract_task_metrics.py:394` to atomic write (3 lines) while
+touching the file, preventing the corrupt-metrics-file bug noted in §1.3.
+
+---
+
+## 5. Issues Summary (New This Report)
+
+| ID | File | Line(s) | Issue | Severity |
+|----|------|---------|-------|----------|
+| N-01 | `scripts/csb_metrics/discovery.py` | 195–200 | `_process_task_dir` missing `model` param; fix is 3 lines (no schema change) | HIGH |
+| N-02 | `scripts/abc_audit.py` | 408/1548 | `check_t10_shared_state` defined twice; first copy is dead code | MEDIUM |
+| N-03 | `scripts/abc_audit.py` | 1102/1606, 1151/1655, 1210/1715 | OA/OB/OG: 3 more duplicate pairs with dead first copies (~500 lines of dead code) | MEDIUM |
+| N-04 | `scripts/extract_task_metrics.py` | 394 | Non-atomic write of `task_metrics.json` (primary metrics output) | HIGH |
+| N-05 | `scripts/reextract_all_metrics.py` | 247 | Non-atomic write in bulk re-extraction | MEDIUM |
+| N-06 | `scripts/validate_tasks_preflight.py` | 972, 993 | Non-atomic writes of smoke sandbox fixture | LOW |
+| N-07 | `daytona_curator_runner.py` | 565, 624 | Loads `/tmp/curator_config.json` twice at module level without a context manager; path is hardcoded | LOW |
+| N-08 | `.github/workflows/docs-consistency.yml` | 19 | Python 3.11 vs 3.10 baseline; `--check` nav step duplicated from repo_health.yml | LOW |
+| N-09 | All 4 CI workflows | n/a | `roam.yml` is the only workflow with a `permissions:` block (3 of 4 lack one), so every push-triggered workflow runs unguarded | MEDIUM |
+
+---
+
+*Remediation velocity: 10 consecutive days without a code fix (Mar 12 → Mar 22). ~130 open
+issues across 16 reports. Active PRD: `ralph/fix-common-patterns-2026-03-20` (15 stories,
+all unstarted). Recommended next feature: cost pipeline accuracy fix (~6 lines, no schema
+change, corrects 5–19× cost overstatement for all non-Opus runs).*