Commit ce0e249

docs: nightly research report 2026-03-22 (report #16)
New findings: cost pipeline fix is simpler than report #15 stated (RunMetrics.model already exists, _process_task_dir just needs model param — no schema change required); abc_audit.py has 4 more duplicate function pairs (T10/OA/OB/OG) beyond the T5/R2 broken checks; extract_task_metrics.py:394 writes task_metrics.json non-atomically; FD leak count revised to 25 confirmed sites. Recommended feature: cost pipeline accuracy fix.

# Nightly Research Report — 2026-03-22 (Report #16)

## Executive Summary

Ten consecutive days with zero code fixes (last code commit: Mar 12). This report surfaces
**four new findings** that either correct or extend previous reports: the cost pipeline fix is
**simpler than report #15 stated** (no schema change required — the model plumbing is 95%
already in place), `abc_audit.py` has **four additional duplicate function pairs** beyond the
T5/R2 broken checks already documented, the **primary metrics output (`task_metrics.json`) is
written non-atomically**, and the **25 total `json.load(open(` sites** across scripts exceed
the "17+" count previously catalogued.

---

## 1. Code & Architecture Review

### 1.1 Cost Pipeline Fix Is Simpler Than Previously Reported (Correction to Report #15)

Report #15 stated: *"TaskMetrics has no model field — cost pipeline fix requires schema change."*
This is incorrect. `RunMetrics.model: str` already exists at `csb_metrics/models.py:148`. The
extraction infrastructure is also already in place:

| Component | Location | Status |
|-----------|----------|--------|
| Model extraction from config | `discovery.py:171` `_extract_model_from_config(batch_dir)` | Exists |
| Model stored per-run | `discovery.py:407–413` `run_metadata[key]["model"]` | Exists |
| `RunMetrics.model` field | `models.py:148` | Exists |
| Model passed to cost calculator | `discovery.py:310` — **missing** | **Gap** |

The actual gap is in `_process_task_dir()` (`discovery.py:195`):

```python
def _process_task_dir(
    task_dir: Path,
    benchmark: str,
    config_name: str,
    is_swebench: bool,  # no model parameter
) -> Optional[TaskMetrics]:
    ...
    if tm.cost_usd is None:
        tm.cost_usd = calculate_cost_from_tokens(  # model not passed
            tm.input_tokens, tm.output_tokens,
            tm.cache_creation_tokens, tm.cache_read_tokens,
        )
```

The model is available in the `discover_runs()` caller context at the point
`_process_task_dir()` is invoked. The fix is three lines:

1. Add `model: str = "unknown"` to `_process_task_dir`'s signature
2. Pass `model` to `calculate_cost_from_tokens()`
3. Pass `model=run_metadata[key]["model"]` when calling `_process_task_dir()` in `discover_runs()`

`extract_task_metrics.py:266` needs the same one-line model pass. The `MODEL_PRICING` table
still needs `claude-sonnet-4-6` and `claude-haiku-4-6` added (2 dict entries). **Total fix
scope: ~6 lines.** No dataclass schema change required.

---

### 1.2 abc_audit.py Has Four Actual Duplicate Pairs Beyond the T5/R2 Broken Checks (NEW)

Report #15 covered `check_t5_no_solution_leak` and `check_r2_no_contamination` as broken
single-definition functions. This report identifies **four additional functions defined twice**:

```
scripts/abc_audit.py:
  check_t10_shared_state          lines 408 and 1548   ← Python uses 1548 (identical code)
  check_oa_equivalent_solutions   lines 1102 and 1606  ← Python uses 1606 (identical code)
  check_ob_negated_solutions      lines 1151 and 1655  ← Python uses 1655 (identical code)
  check_og_determinism            lines 1210 and 1715  ← Python uses 1715 (identical code)
```

Unlike T5/R2 (broken single definitions), these four pairs appear to be copy-paste duplicates
where both copies contain identical logic. Python silently uses the last definition, making the
first copy dead code. The four dead first copies (starting at lines 408, 1102, 1151, and 1210)
total roughly 500 lines of `abc_audit.py` that serve no function and increase confusion about
which implementation is active.

Removing the four dead first copies would shrink `abc_audit.py` by ~500 lines (~25% of the
file). More importantly, a definition-time guard (e.g., the registration decorator sketched in
§3.3, which rejects a repeated function name) would catch future duplicate definitions at
import time rather than silently.

The full duplicate inventory for abc_audit.py:

| Function | Defined at | Python uses | Issue |
|----------|-----------|-------------|-------|
| `check_t5_no_solution_leak` | 293 only | 293 | Broken logic (report #15) |
| `check_r2_no_contamination` | 712 only | 712 | Broken logic (report #15) |
| `check_t10_shared_state` | 408, **1548** | **1548** | Dead first copy |
| `check_oa_equivalent_solutions` | 1102, **1606** | **1606** | Dead first copy |
| `check_ob_negated_solutions` | 1151, **1655** | **1655** | Dead first copy |
| `check_og_determinism` | 1210, **1715** | **1715** | Dead first copy |

---

### 1.3 Primary Metrics Output Written Non-Atomically (NEW HIGH SEVERITY)

`scripts/extract_task_metrics.py:394` writes `task_metrics.json` — the primary per-task
metrics artifact — without atomic temp+rename:

```python
out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")  # line 394
```

If the process is killed or crashes mid-write (e.g., Ctrl+C during a long metrics extraction
run, OOM, disk full), `task_metrics.json` is left partially written — valid JSON prefix,
truncated content. This causes:

1. **`--skip-completed` silently skips the task**: the flag checks for `task_metrics.json`
   existence, not validity. A corrupt file will cause the task to be skipped on the next run.
2. **Downstream reports include garbage data**: `discover_runs()` loads task_metrics.json and
   merges its fields into `TaskMetrics`. A truncated JSON file raises `json.JSONDecodeError`
   that may be swallowed.
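
The truncation failure mode in consequence 2 is easy to reproduce; a minimal illustration
(the JSON fragment below is invented for the demo, not taken from a real run):

```python
import json

# Simulates task_metrics.json truncated mid-write by a crash:
# a valid JSON prefix with the tail missing.
truncated = '{"task_id": "t-001", "input_tokens": 5231, "cost_'

try:
    json.loads(truncated)
except json.JSONDecodeError as exc:
    # Without atomic writes, this is what downstream loaders must handle.
    print(f"corrupt metrics file: {exc}")
```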

`scripts/reextract_all_metrics.py:247` has the same pattern for bulk re-extraction:

```python
out_path.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")  # line 247
```

The correct pattern (already implemented in `account_health.py:109–111`):

```python
tmp = out_path.with_suffix(".tmp")
tmp.write_text(json.dumps(tm.to_dict(), indent=2) + "\n")
tmp.replace(out_path)  # atomic on POSIX
```

---

### 1.4 FD Leak Count Revised Upward: 25 Confirmed Sites (not "17+")

Running `grep -rn "json.load(open\|json.loads(open" scripts/` finds **25 occurrences**
across Python scripts and shell-embedded Python one-liners. The previously catalogued "17+"
missed the following:

| New sites | Location | Note |
|-----------|----------|------|
| `rerun_fixed_tasks.sh:325,326` | Shell Python one-liners | Process exit closes FD; low risk |
| `rerun_zero_mcp_tasks.sh:231,232` | Shell Python one-liners | Same |
| `rerun_crossrepo_2tasks.sh:142,143` | Shell Python one-liners | Same |
| `rerun_errored_tasks.sh:253,254` | Shell Python one-liners | Same |
| `rerun_crossrepo_all4.sh:149` | Shell Python one-liner | Same |
| `daytona_curator_runner.py:565,624` | Two separate module-level loads of `/tmp/curator_config.json` | Hardcoded path + no context manager |

The shell one-liners are low actual risk (the Python subprocess exits immediately, closing all
FDs), but they inflate the metric and count against any Ruff automation. The
`daytona_curator_runner.py` sites are a higher concern because the same hardcoded
`/tmp/curator_config.json` path is loaded **twice** at module level without a context manager,
and the `/tmp` path is unparameterized.
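
For the Python-module sites, the standard context-manager form closes the descriptor
deterministically. A self-contained sketch (the temp-dir setup and the `"model"` key are
illustrative, not the real curator config schema):

```python
import json
import tempfile
from pathlib import Path

# Stand-in config file (the real site hardcodes /tmp/curator_config.json).
cfg_path = Path(tempfile.mkdtemp()) / "curator_config.json"
cfg_path.write_text('{"model": "claude-sonnet-4-6"}')

# Leaky form, as matched by the grep — the FD from open() is closed only
# when the garbage collector happens to finalize the anonymous file object:
#   data = json.load(open(cfg_path))

# Fixed form — the context manager closes the FD even if json.load raises:
with open(cfg_path) as f:
    data = json.load(f)
```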

---
### 1.5 CI Workflow Audit: Python Version Inconsistency Confirmed

Full version matrix across the 4 workflows:

| Workflow | Python | `permissions:` block | Trigger |
|----------|--------|----------------------|---------|
| `repo_health.yml` | 3.10 | Missing | push + PR |
| `docs-consistency.yml` | **3.11** | Missing | push + PR |
| `task_smoke_matrix.yml` | 3.10 | Missing | paths-filtered push + PR |
| `roam.yml` | **3.12** | Present | PR only |

Three findings:

1. **`roam.yml` is the only workflow with a `permissions:` block** — and it only runs on PRs,
   not pushes. The push-to-main path (repo_health, docs-consistency) runs with default broad
   token scope on every commit.
2. **`docs-consistency.yml` uses Python 3.11** vs the 3.10 baseline. Any 3.10→3.11 behavioral
   difference in `docs_consistency_check.py` would be invisible in local testing.
3. **`refresh_agent_navigation.py --check` runs in both `repo_health.yml` and
   `docs-consistency.yml`** — an identical step in two workflows, wasting CI minutes on every
   PR. One should be removed.
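
As a starting point for finding 1, a minimal least-privilege block could be added to the
push-triggered workflows (a sketch — the scopes each workflow actually needs must be verified
against its steps before merging):

```yaml
# Top level of .github/workflows/repo_health.yml, alongside `on:`.
# Restricts the default GITHUB_TOKEN to read-only checkout access.
permissions:
  contents: read
```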

---
### 1.6 validate_tasks_preflight.py Non-Atomic Writes in Smoke Fixture (NEW)

`scripts/validate_tasks_preflight.py` writes test helper scripts into a sandbox:

```python
helper.write_text(helper_rewritten)             # line 972 — sandbox/*.sh
runner.write_text(runner_header + script_body)  # line 993 — sandbox/_run_fixture.sh
```

These writes happen inside a `try/except OSError` block (line 973), so if a write fails
partway through, the `OSError` is swallowed and the smoke test runs against a corrupt
fixture. Since `_run_fixture.sh` is the entry point for the sandbox test, a partially written
file causes a cryptic `bash: syntax error` rather than a clear "fixture write failed" error.

---
## 2. Feature & UX Improvements

### 2.1 Cost Reports Need Model Attribution Annotation

Every cost report produced by `generate_eval_report.py`, `compare_configs.py`, and the HTML
export currently uses Opus 4.5 pricing regardless of the actual run model. Before the full fix
lands, adding a visible disclaimer would prevent misleading data from being acted on:

```
⚠ Cost estimates use claude-opus-4-5 pricing (actual model not recorded)
  Sonnet runs may be 5× overstated; Haiku runs may be 19× overstated.
```

This requires zero code changes in the cost pipeline — just a brief annotation added to
the HTML template and to the `compare_configs.py` summary header.
### 2.2 abc_audit Dead-Copy Removal Should Be a CI Gate

After removing the 4 dead duplicate function copies (~500 lines), a simple import-time check
would prevent future regressions. A plain set of function objects cannot do it: by the time
any registry literal is evaluated, the second `def` has already shadowed the first, so only
one object per name exists. Scanning the module source with `ast` does catch shadowed
definitions:

```python
# At bottom of abc_audit.py:
import ast
import collections
import pathlib

_tree = ast.parse(pathlib.Path(__file__).read_text())
_names = collections.Counter(
    node.name for node in _tree.body if isinstance(node, ast.FunctionDef)
)
_dupes = sorted(name for name, count in _names.items() if count > 1)
assert not _dupes, (
    f"Duplicate check function detected — duplicate def statements in abc_audit.py: {_dupes}"
)
```

Because the scan inspects source text rather than runtime objects, the assertion fails at
import time if any top-level function is defined twice, catching future copy-paste duplicates.
### 2.3 Smoke Test Fixture Writes Should Be Atomic

`validate_tasks_preflight.py` writes `_run_fixture.sh` and helper scripts directly. If the
sandbox write fails mid-way (e.g., disk quota), the test proceeds with a corrupt script.
Converting these two writes to atomic temp+rename ensures either a complete fixture or a clean
failure.

---
## 3. Research Recommendations

### 3.1 `atomic_write` Utility Should Be a Shared Helper

The temp+rename pattern (`write_text` to `.tmp` then `Path.replace()`) is the correct solution
for at least 8 known write sites:

| File | Line | Write Target | Risk |
|------|------|--------------|------|
| `daytona_runner.py` | 234 | OAuth credentials | Authentication corruption |
| `sync_agent_guides.py` | 22 | AGENTS.md / CLAUDE.md | Operational guidance corruption |
| `daytona_cost_guard.py` | 663 | Cost guard cache | Wrong cost estimates served |
| `extract_task_metrics.py` | 394 | task_metrics.json | Skip-completed bypass + bad data |
| `reextract_all_metrics.py` | 247 | task_metrics.json (bulk) | Same |
| `aggregate_status.py` | 669 | Status aggregate | Stale status report |
| `apply_verifier_fixes.py` | 103, 117, 134 | Verifier files | Corrupt verifiers |
| `validate_tasks_preflight.py` | 972, 993 | Sandbox fixture | Cryptic bash errors |

A 10-line `atomic_write(path: Path, content: str) -> None` utility in
`csb_metrics/io_utils.py` (or `scripts/utils.py`) would let each site convert to a one-line
call. This is preferable to a PRD that fixes each site individually, since a shared utility is
testable and ensures uniform behavior.
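
A minimal sketch of such a helper (the `fsync` durability step goes beyond the
`account_health.py` pattern, and the module placement is still to be decided):

```python
import os
from pathlib import Path

def atomic_write(path: Path, content: str) -> None:
    """Write content so readers never observe a partially written file."""
    # Appending ".tmp" to the existing suffix avoids collisions between
    # sibling targets like "a.json" and "a.txt".
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("w") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    tmp.replace(path)  # atomic on POSIX when tmp and path share a filesystem
```

Each call site then becomes `atomic_write(out_path, json.dumps(tm.to_dict(), indent=2) + "\n")`.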

### 3.2 `_process_task_dir` Signature Audit Before Adding `model`

Before adding `model: str` to `_process_task_dir()`, check all callers beyond `discover_runs()`
to ensure they can supply the model. A quick `grep` finds:

```bash
grep -rn "_process_task_dir\|from csb_metrics.discovery import" scripts/
```

If any callers can't supply a model, `model: str = "unknown"` as a default is safe (it
preserves existing behavior while enabling the fix for callers that do have the model).
### 3.3 Function Registry Pattern for Audit Modules

`abc_audit.py`'s silent-override problem is a general risk in any large audit module where
functions are defined once and called collectively. A registration decorator (similar to pytest
fixtures) prevents this class of bug:

```python
from collections.abc import Callable

_checks: list[Callable] = []

def check(fn: Callable) -> Callable:
    if any(f.__name__ == fn.__name__ for f in _checks):
        raise RuntimeError(f"Duplicate check: {fn.__name__}")
    _checks.append(fn)
    return fn

@check
def check_t5_no_solution_leak(tasks: list[Path]) -> CriterionResult:
    ...
```

Because the decorator runs at each `def`, a second definition of the same name is rejected
before it can silently shadow the first. This converts a runtime silent failure into an
import-time `RuntimeError`, making duplicate definitions detectable before any audit run.

---
## 4. Recommended Next Feature

### Cost Pipeline Accuracy Fix (Simplified by This Report's Findings)

**Why now:** Report #15 deferred this fix by stating it requires a schema change. This report
confirms no schema change is needed — the model plumbing is 95% in place. The full fix is
bounded and testable.

**Scope (3 files, ~6 lines of production code):**

**`scripts/csb_metrics/discovery.py`** — add `model` parameter to `_process_task_dir`:
```python
# Before (line 195):
def _process_task_dir(task_dir, benchmark, config_name, is_swebench):

# After:
def _process_task_dir(task_dir, benchmark, config_name, is_swebench, model="unknown"):
    ...
    tm.cost_usd = calculate_cost_from_tokens(
        tm.input_tokens, tm.output_tokens,
        tm.cache_creation_tokens, tm.cache_read_tokens,
        model=model,  # ← add this
    )
```

Pass `model=run_metadata[key]["model"]` at the call site in `discover_runs()`.

**`scripts/extract_task_metrics.py`** — add model read + pass:
```python
# Read model from config.json (already done in discovery.py, replicate here)
model = _read_model_from_result(result_path)  # 3-line helper
tm.cost_usd = calculate_cost_from_tokens(..., model=model)
```

**`scripts/csb_metrics/extractors.py`** — add 2 pricing entries:
```python
"claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_write": 3.75, "cache_read": 0.30},
"claude-haiku-4-6": {"input": 0.80, "output": 4.0, "cache_write": 1.0, "cache_read": 0.08},
```

**Why these changes together:**
- All Sonnet 4.5/4.6 runs have been reporting 5× inflated costs — all historical reports are
  wrong
- All Haiku runs have been reporting 19× inflated costs
- `configs/openhands_2config.sh:60` already defaults to `claude-sonnet-4-6`, making this a
  live bug for all OpenHands results
- The fix requires no CI infrastructure changes, no new scripts, no new tests (existing
  `test_extract_task_metrics.py` already covers cost paths)
- A one-commit PR fixing both callers + the pricing table constitutes a complete remediation
- After this PR, cost reports are accurate for the first time since the project switched from
  Opus-only runs

**Bonus (same PR):** Convert `extract_task_metrics.py:394` to atomic write (3 lines) while
touching the file, preventing the corrupt-metrics-file bug noted in §1.3.

---
## 5. Issues Summary (New This Report)

| ID | File | Line(s) | Issue | Severity |
|----|------|---------|-------|----------|
| N-01 | `csb_metrics/discovery.py` | 195–200 | `_process_task_dir` missing `model` param — fix is 3 lines (not a schema change) | HIGH |
| N-02 | `scripts/abc_audit.py` | 408/1548 | `check_t10_shared_state` defined twice (T10 is the 4th actual duplicate) | MEDIUM |
| N-03 | `scripts/abc_audit.py` | 1102/1606, 1151/1655, 1210/1715 | OA/OB/OG: 3 more duplicate pairs with dead first copies (~500 lines of dead code) | MEDIUM |
| N-04 | `scripts/extract_task_metrics.py` | 394 | Non-atomic write of `task_metrics.json` (primary metrics output) | HIGH |
| N-05 | `scripts/reextract_all_metrics.py` | 247 | Non-atomic write in bulk re-extraction | MEDIUM |
| N-06 | `scripts/validate_tasks_preflight.py` | 972, 993 | Non-atomic writes of smoke sandbox fixture | LOW |
| N-07 | `scripts/daytona_curator_runner.py` | 565, 624 | Loads `/tmp/curator_config.json` twice without context manager + hardcoded path | LOW |
| N-08 | `.github/workflows/docs-consistency.yml` | 19 | Python 3.11 vs 3.10 baseline; `--check` nav step duplicated from repo_health.yml | LOW |
| N-09 | All 4 CI workflows | — | Correction: `roam.yml` is the ONLY workflow with a `permissions:` block (not "3/4 missing" — all push workflows are unguarded) | MEDIUM |

---
*Remediation velocity: 10 consecutive days without a code fix (Mar 12 → Mar 22). ~130 open
issues across 16 reports. Active PRD: `ralph/fix-common-patterns-2026-03-20` (15 stories,
all unstarted). Recommended next feature: cost pipeline accuracy fix (~6 lines, no schema
change, corrects 5–19× cost overstatement for all non-Opus runs).*
