Skip to content

Commit 3000abd

Browse files
committed
perf(bench): hard 20s timeout + diagnostic counters in latency
Worker-side kill switch via threading.Timer + os._exit(137) bounds calibration wall-clock at timeout * ceil(n/workers) even when PPR or lazy-greedy blow up on pathological repos (vscode, mui). Differentiates kill-by-timeout from genuine BrokenProcessPool by elapsed time so pathological instances are checkpointed as timeouts instead of triggering infinite retry. Adds candidate_count, edge_count, scoring_ms, selection_ms, greedy_iters to LatencyBreakdown so the next calibration run shows which phase blew up, not just total scoring_selection_ms. fsync on checkpoint append prevents partial writes after worker kill.
1 parent f3275b6 commit 3000abd

12 files changed

Lines changed: 308 additions & 34 deletions

File tree

REVIEW_TESTS.md

Lines changed: 115 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -827,11 +827,122 @@ Must-fix this week.
827827

828828
Из ~70 findings реальных must-fix для submission paper — **5**. Остальные — управляемый debt; pattern-based primitives закроют ~30 за счёт 5 мелких рефакторов.
829829

830+
---
831+
832+
## Re-run 2026-04-30 — Round 1 (incremental)
833+
834+
Eight new R1 sonnet agents ran today against the same scope. Below are the **non-duplicate** findings — issues not already covered by the canonical R1/R2/R3 above. Severity uses the same rubric.
835+
836+
### New 🔴 Critical
837+
838+
**X1 — `GitError` is dead code; Rust raises `PyRuntimeError`, MCP catches `GitError`**
839+
- File:line: `src/treemapper/diffctx/__init__.py:6` (defines `GitError`); `src/treemapper/mcp/server.py:52` (`except GitError`); `diffctx/src/pybridge.rs:224,287` (raises `PyRuntimeError`).
840+
- Scenario: User passes a bad revision through MCP. The `try…except GitError` block never fires; `PyRuntimeError` propagates as a generic FastMCP `ToolError` with no `"Try 'HEAD~1..HEAD'…"` hint. `test_mcp.py::test_invalid_diff_range` only checks `pytest.raises(ToolError)` without message, so the dead branch passes CI.
841+
- Why no test catches it: existing test never asserts on the error message text.
842+
843+
**X2 — `_diffctx` ImportError path is untested; pip-install without wheel = raw traceback**
844+
- File:line: `src/treemapper/diffctx/pipeline.py` (deferred import), `src/treemapper/tokens.py:6`, `src/treemapper/diffctx/graph_analytics.py:5-17`, `src/treemapper/diffctx/graph_export.py:5-11` — module-level imports of `_diffctx`.
845+
- Scenario: User installs `treemapper` without the compiled extension (wrong Python ABI, source dist). First call to any diff feature crashes with raw `ImportError`. No graceful fallback message.
846+
847+
**X3 — PPR push-budget cap silently truncates BFS on large repos**
848+
- File:line: `diffctx/src/ppr.rs:75``max_pushes = (n * PPR.push_scale_factor).min(PPR.max_pushes_cap)`; line ~149 re-normalizes the score vector after early termination.
849+
- Scenario: Monorepo (≥20k fragments). Cap fires; propagation halts before mass reaches distant-but-relevant fragments; re-normalization disguises the truncation. User sees plausible rankings that omit the actually-relevant code.
850+
- Why it matters: paper integrity at scale. No yaml case has a repo big enough to hit the cap.
851+
852+
**X4 — Hybrid mode flips algorithm at exactly 50 candidate files**
853+
- File:line: `diffctx/src/mode.rs:93-119``n_candidate_files ≤ 50` ⇒ PPR + low_relevance_filter; `> 50` ⇒ Ego-graph.
854+
- Scenario: Mid-size project sits near the 50-file threshold. Adding/removing a single file flips scoring algorithms; user sees non-monotonic context changes between runs. Boundary (49/50/51) is exercised by no yaml case.
855+
856+
**X5 — `CochangeEdgeBuilder` is structurally dead in 100% of yaml tests**
857+
- File:line: `diffctx/src/edges/history/cochange.rs:112``if *count < COCHANGE.min_count { continue }`; `COCHANGE.min_count ≥ 2`. Test harness in `diffctx/tests/yaml_cases.rs` creates a repo with exactly two commits (initial + change), so every co-change pair count is 1 — always below threshold.
858+
- Scenario: A regression in pair counting, log-scale weighting, or `max_files_per_commit` skip would never be caught. The entire history-edge category has zero coverage.
859+
860+
**X6 — `IntervalIndex::overlaps` treats shared boundary line as overlap**
861+
- File:line: `diffctx/src/interval.rs``if end >= frag.start_line() { return true; }` triggers when `end == start_line` (back-to-back fragments sharing a single boundary line, valid in compact Rust/Go/Scala).
862+
- Scenario: A hunk touching the last line of function A causes function B (starting on that same line) to be permanently excluded from selection. Silently missing context.
863+
864+
**X7 — `DIFFCTX_OP_EGO_PER_HOP_DECAY` is a dead env knob**
865+
- File:line: `diffctx/src/config/scoring.rs:22` reads it into `EgoScoringConfig.per_hop_decay`. `diffctx/src/scoring.rs:112` calls `g.ego_graph(core_ids, self.max_depth)` — the decay parameter is **never passed** to `graph.rs:198::ego_graph(...)`.
866+
- Scenario: Operator tunes `DIFFCTX_OP_EGO_PER_HOP_DECAY=0.5` for a calibration sweep; observable behavior is unchanged; data is silently meaningless. Any paper figure using this knob has zero variance from changing it.
867+
868+
**X8 — Directory symlinks are silently dropped (Python tree mode)**
869+
- File:line: `src/treemapper/tree.py:172``if entry.is_symlink() or not entry.exists(): logger.debug(...); return None`. All symlinks (including dir-symlinks the user explicitly placed inside the repo) skipped without warning.
870+
- Scenario: User keeps `vendor/` or `shared/` as a symlinked dir. `treemapper .` silently omits all of it. No warning is emitted; existing test only verifies that a *file* symlink is absent.
871+
872+
### New 🟡 Warning (selected)
873+
874+
**X9 — UTF-16-LE/BE and UTF-8-with-BOM files**: `tree.py:231-247` only tests CP1251 fallback; UTF-16 files become `<unreadable content: not utf-8>` if `charset-normalizer` isn't installed. (`tests/test_basic.py::test_unicode_content_and_encoding_errors`).
875+
876+
**X10 — NFC vs NFD path round-trip on macOS**: HFS+/APFS returns NFD from `iterdir()`; PyYAML serializes as-is; downstream NFC lookups silently fail. Untested.
877+
878+
**X11 — `LexicalEdgeBuilder` zero-edge fallback when all changed identifiers are short**: `diffctx/src/edges/similarity/lexical.rs:96-104` — drops identifiers shorter than `query_min_identifier_length` (=3). A Go diff using `i`, `ok`, `err`, `db` produces zero lexical edges. Algorithm's recovery behavior is untested.
879+
880+
**X12 — `SiblingEdgeBuilder` breaks on backslash paths**: `diffctx/src/edges/structural/sibling.rs:20-27` uses `Path::new().parent()` without normalizing separators; `src\utils.rs` has no Unix parent → both files bucket under `""`, no sibling edges. No yaml case uses backslash paths.
881+
882+
**X13 — R extractor silently drops `.Rmd`, `.qmd`, `.rnw` files**: `diffctx/src/edges/semantic/r_lang.rs:13-16``is_r_file` only matches `.r` and `.rmd`. Quarto and Sweave notebooks produce no edges.
883+
884+
**X14 — `ScoringMode::Ppr`, `Ego`, `Bm25` have zero integration coverage**: every yaml test invocation in `diffctx/tests/yaml_cases.rs:240` and `diffctx/src/test_harness.rs:126` hardcodes `ScoringMode::Hybrid`. The three other modes differ in discovery and filtering paths and are never exercised end-to-end.
885+
886+
**X15 — Pure `git mv` (rename, no content change) → empty output**: `diffctx/src/git.rs::parse_diff` only collects hunks from `@@` lines. A bare `git mv` produces no `@@` lines → empty hunks → empty seeds → PPR emits nothing.
887+
888+
**X16 — `SAFE_RANGE_RE` accepts leading-dot ranges (`..origin/main`)** ✅: `diffctx/src/git.rs:31` regex `^[a-zA-Z0-9_.^~/@{}\-]+(\.\.\.?[a-zA-Z0-9_.^~/@{}\-]*)?$``.` is inside the character class, so `..origin/main` passes validation, then git rejects it at subprocess time with raw `fatal: ambiguous argument` rather than the designed `InvalidRange` error.
889+
890+
**X17 — Merge-commit combined-diff `@@@` headers silently ignored**: `diffctx/src/git.rs::HUNK_RE` matches `^@@ -` only; `@@@` (three-parent combined diff) is dropped. Files modified only inside a merge commit produce no hunk seeds.
891+
892+
**X18 — YAML literal block strips trailing whitespace**: `src/treemapper/writer.py:76-81` emits `|2`-indented blocks; PyYAML strips trailing spaces from each line on `safe_load`. Markdown line-break (`text \n`) and indented-docstring constructs silently mutate on round-trip. Distinct from O1 (which is about line-level newlines).
893+
894+
**X19 — `count_tokens` returns `u32`, casts via `as u32`**: `diffctx/src/tokenizer.rs:16` — silent overflow possible on very large inputs. `diffctx/src/pybridge.rs:335` forwards the same `u32`.
895+
896+
**X20 — `print_token_summary` bypasses `--quiet`**: `src/treemapper/tokens.py:47-48``print(..., file=sys.stderr)` ignores the quiet flag. Untested.
897+
898+
### New 🔵 Note
899+
900+
**X21 — `to_yaml`/`to_json` on `DiffContextResult` silently return empty string on serde failure** (`diffctx/src/pybridge.rs:103,108``unwrap_or_default()`). MCP clients receive `""` with no error.
901+
902+
**X22 — `forward_blend=0.0` env value silently inverts ranking direction** (`diffctx/src/config/limits.rs:142`; `ppr.rs:146`). Clamped to `[0,1]` but no degeneracy guard at the endpoints.
903+
904+
**X23 — `.scss` parsed with CSS grammar; `$var:…` produces ERROR-dominated tree** (`diffctx/src/parsers/tree_sitter_strategy.rs:407-416`). Only one SCSS yaml case exists and it passes via raw-anchor matching, not symbol extraction.
905+
906+
### False Positive from this re-run
907+
908+
- **F22** ("zero yaml test cases exist") — verified false: `find tests/cases/diff -name '*.yaml' | wc -l` = **2723**. The agent misread the harness path. Discard.
909+
910+
---
911+
912+
## Re-run 2026-04-30 — Updated Verdict
913+
914+
The 2026-04-28 verdict's top 5 (E1, R2-T1, R2-P1, E2, M2) re-verified against current source — all still real. The re-run surfaces eight additional 🔴 worth pulling forward:
915+
916+
### Updated must-fix-this-week (additions to prior list)
917+
918+
- **X3 (PPR push cap silent truncation)** — same paper-integrity class as E1; clamp + emit a `truncated_at_pushes` counter.
919+
- **X4 (Hybrid 50-file boundary)** — add yaml cases at 49/50/51 candidate files asserting algorithmic determinism near boundary.
920+
- **X5 (CochangeEdgeBuilder dead in 100% of yaml tests)** — add at least one yaml case with ≥2 commits per file pair so the entire history-edge category has any coverage at all.
921+
- **X6 (IntervalIndex shared-boundary overlap)** — fix to `end > start_line` (strict) and add adjacency yaml case.
922+
- **X7 (`EGO_PER_HOP_DECAY` dead knob)** — plumb through to `graph.ego_graph` or remove from config; add a determinism test that toggles the knob.
923+
- **X1 (GitError dead code)** — register `GitError` as a `create_exception!` in `pybridge.rs` so the MCP `except GitError` actually fires; add MCP test asserting the helpful message text.
924+
- **X8 (directory symlinks silently dropped)** — emit `tracing::warn` and add option `--follow-symlinks`; add test case with a directory symlink in the tree.
925+
- **X14 (`ScoringMode::Ppr/Ego/Bm25` no integration coverage)** — parameterize at least one yaml case across all four modes to confirm non-Hybrid paths produce non-empty output.
926+
927+
### Self-verification (re-read source, ✅ = confirmed)
928+
929+
- ✅ X1 verified at `src/treemapper/mcp/server.py:52` + `diffctx/src/pybridge.rs:224,287`.
930+
- ✅ X3 verified at `diffctx/src/ppr.rs:75`.
931+
- ✅ X5 verified at `diffctx/src/edges/history/cochange.rs:112` (`min_count` filter); harness at `diffctx/tests/yaml_cases.rs` creates exactly two commits.
932+
- ✅ X6 verified at `diffctx/src/interval.rs``end >= frag.start_line()` confirmed.
933+
- ✅ X7 verified at `diffctx/src/config/scoring.rs:22` (env read) + `diffctx/src/scoring.rs:112` (`g.ego_graph(core_ids, self.max_depth)` — no decay arg) + `diffctx/src/graph.rs:198` (`ego_graph` signature).
934+
- ✅ X8 verified at `src/treemapper/tree.py:172`.
935+
- ✅ X16 verified — `SAFE_RANGE_RE` at `diffctx/src/git.rs:31` includes `.` in character class.
936+
-**F22 false positive removed.**
937+
938+
---
939+
830940
## Audit Log
831941

832-
- Date: 2026-04-28
942+
- Date: 2026-04-28 (initial), 2026-04-30 (re-run)
833943
- Project: /Users/nikolay/treemapper
834944
- Skill: /review-tests
835-
- Total findings: 8 R1 sections × ~5 findings = ~40 R1 + ~6 R2.3 missed = ~46 raw; after R2 calibration: 5 must-fix, ~30 deferrable, ~10 false positives
836-
- Severity (final, post-R4): 🔴 5 (E1, R2-T1, R2-P1, E2, M2), 🟡 ~25, 🔵 ~15
837-
- Agents used: 8 + 4 + 2 + 1 = 15
945+
- Initial: 8 R1 sections × ~5 findings = ~40 R1 + ~6 R2.3 missed = ~46 raw; after R2 calibration: 5 must-fix, ~30 deferrable, ~10 false positives.
946+
- Re-run: 8 R1 sonnet agents, ~45 raw findings; after de-duplication against prior audit: 23 genuinely new (X1–X23), 1 false positive (F22).
947+
- Severity (post-R4 combined): 🔴 13 (E1, R2-T1, R2-P1, E2, M2, X1, X3, X4, X5, X6, X7, X8, X14), 🟡 ~30, 🔵 ~20.
948+
- Agents used: 8 + 4 + 2 + 1 = 15 (initial); 8 R1 + 1 R4 self-verification = 9 (re-run; R2/R3 reused since the 5 structural patterns from initial still hold).

benchmarks/adapters/calibrate.py

Lines changed: 63 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -65,7 +65,7 @@ def evaluate_grid(
6565
eval_fn: EvalFn,
6666
workers: int = 1,
6767
on_trial: TrialCallback | None = None,
68-
timeout_per_instance: float = 300.0,
68+
timeout_per_instance: float = 20.0,
6969
checkpoint_dir: Path | None = None,
7070
) -> list[TrialResult]:
7171
"""Run every grid point, return per-trial aggregates.
@@ -145,7 +145,7 @@ def evaluate_grid_cached( # noqa: C901 — pool teardown + per-cell demux + ret
145145
eval_all_cells_fn: EvalAllCellsFn,
146146
workers: int = 1,
147147
on_trial: TrialCallback | None = None,
148-
timeout_per_instance: float = 300.0,
148+
timeout_per_instance: float = 20.0,
149149
checkpoint_dir: Path | None = None,
150150
) -> list[TrialResult]:
151151
"""Inverted-loop calibration: outer = instance, inner = grid cells.
@@ -195,7 +195,7 @@ def evaluate_grid_cached( # noqa: C901 — pool teardown + per-cell demux + ret
195195

196196
def _make_pool() -> ProcessPoolExecutor:
197197
ctx = mp.get_context("spawn")
198-
p = ProcessPoolExecutor(max_workers=workers, mp_context=ctx, max_tasks_per_child=20)
198+
p = ProcessPoolExecutor(max_workers=workers, mp_context=ctx, max_tasks_per_child=40)
199199
list(p.map(int, range(workers)))
200200
return p
201201

@@ -204,38 +204,66 @@ def _record_per_cell(per_cell_results: list[tuple[RunParams, EvalResult]]) -> No
204204
lbl = params.label()
205205
ckpt = ckpts.get(lbl)
206206
if ckpt is not None:
207+
status = (result.extra or {}).get("status", "")
207208
err = str((result.extra or {}).get("error", ""))
208-
if "BrokenProcessPool" not in err:
209+
# Persist timeouts even though they surface via BPP — they
210+
# are deterministic per instance and must not retry forever.
211+
if status == "timeout" or "BrokenProcessPool" not in err:
209212
append_checkpoint(ckpt, result)
210213
results_by_cell[lbl].append(result)
211214

212215
def _drain(pool: ProcessPoolExecutor) -> None:
216+
import time as _time
217+
213218
futures: dict = {}
219+
submit_times: dict[str, float] = {}
214220
submit_failed: list[tuple[BenchmarkInstance, list[RunParams]]] = []
215221
pool_broken = False
216222
for inst, params_list in pending:
217223
try:
218-
futures[pool.submit(eval_all_cells_fn, inst, params_list)] = (inst, params_list)
224+
submit_times[inst.instance_id] = _time.monotonic()
225+
futures[
226+
pool.submit(
227+
_run_cells_with_kill_switch,
228+
eval_all_cells_fn,
229+
inst,
230+
params_list,
231+
timeout_per_instance,
232+
)
233+
] = (inst, params_list)
219234
except BrokenProcessPool:
220235
idx = pending.index((inst, params_list))
221236
submit_failed.extend(pending[idx:])
222237
pool_broken = True
223238
break
224-
outer_deadline = __import__("time").monotonic() + timeout_per_instance * len(points) * max(
225-
1, (len(pending) + workers - 1) // workers
226-
)
239+
outer_deadline = _time.monotonic() + timeout_per_instance * len(points) * max(1, (len(pending) + workers - 1) // workers)
227240
completed: set[str] = set()
228241
try:
229242
for future in as_completed(
230243
futures,
231-
timeout=max(0.0, outer_deadline - __import__("time").monotonic()),
244+
timeout=max(0.0, outer_deadline - _time.monotonic()),
232245
):
233246
inst, params_list = futures[future]
234247
try:
235248
per_cell = future.result(timeout=0)
236249
except BrokenProcessPool:
237-
pool_broken = True
238-
per_cell = [(p, _failure_eval(inst, p, "error", "BrokenProcessPool: worker died")) for p in params_list]
250+
elapsed = _time.monotonic() - submit_times.get(inst.instance_id, 0.0)
251+
if elapsed >= timeout_per_instance * 0.9:
252+
per_cell = [
253+
(
254+
p,
255+
_failure_eval(
256+
inst,
257+
p,
258+
"timeout",
259+
f"killed after {timeout_per_instance}s (elapsed {elapsed:.1f}s)",
260+
),
261+
)
262+
for p in params_list
263+
]
264+
else:
265+
pool_broken = True
266+
per_cell = [(p, _failure_eval(inst, p, "error", "BrokenProcessPool: worker died")) for p in params_list]
239267
except Exception as e:
240268
per_cell = [(p, _failure_eval(inst, p, "error", f"{type(e).__name__}: {e}")) for p in params_list]
241269
completed.add(inst.instance_id)
@@ -296,6 +324,30 @@ def _drain(pool: ProcessPoolExecutor) -> None:
296324
return out
297325

298326

327+
def _run_cells_with_kill_switch(
328+
eval_all_cells_fn: EvalAllCellsFn,
329+
instance: BenchmarkInstance,
330+
params_list: list[RunParams],
331+
timeout_s: float,
332+
) -> list[tuple[RunParams, EvalResult]]:
333+
"""Worker-side hard timeout for the cached calibration path. The whole
334+
`compute_scored_state + N selections` task must finish inside the
335+
deadline; otherwise `os._exit(137)` kills the worker and the pool
336+
spawns a replacement. Selection cost is sub-second per cell, so the
337+
deadline is effectively a budget on the heavy scoring phase.
338+
"""
339+
import os
340+
import threading
341+
342+
timer = threading.Timer(timeout_s, lambda: os._exit(137))
343+
timer.daemon = True
344+
timer.start()
345+
try:
346+
return eval_all_cells_fn(instance, params_list)
347+
finally:
348+
timer.cancel()
349+
350+
299351
def _failure_eval(
300352
instance: BenchmarkInstance,
301353
params: RunParams,

0 commit comments

Comments
 (0)