Skip to content

feat(m2): doom-loop pruning for BFTSLiteStrategy#4

Open
suzuke wants to merge 2 commits into
feat/m1b-search-strategyfrom
feat/m2-doom-loop
Open

feat(m2): doom-loop pruning for BFTSLiteStrategy#4
suzuke wants to merge 2 commits into
feat/m1b-search-strategyfrom
feat/m2-doom-loop

Conversation

@suzuke
Copy link
Copy Markdown
Owner

@suzuke suzuke commented Apr 25, 2026

Summary

Stacked on #3 (M1b SearchStrategy). First chunk of M2 — wires should_prune() into the search seam so BFTS stops re-expanding futile branches.

What's new

Doom-loop pruning

  • BFTSLiteStrategy.should_prune(ctx, candidate_id) — a kept node is pruned when its trailing prune_threshold direct children all fail to produce a strict metric improvement.
  • BFTSLiteStrategy.decide() filters pruned candidates BEFORE max-metric selection, so a high-scoring but doom-looped branch cannot starve a viable lower-scoring one.
  • All-pruned → Stop(reason="all kept nodes pruned (doom-loop)"), not Continue/Restart — those are separate recovery policies.

Failure semantics (matches reviewer round 1)

  • discard outcome → counts toward streak (always, regardless of metric)
  • crash / violation / skip → counts (no metric to compare; still a failed expansion attempt)
  • keep outcome with non-improving metric (incl. equal) → counts
  • Strict improvement at any tail position → resets the streak
  • Candidate without metric → should_prune returns False (defensive, bootstrap-safe)

Config

  • search.prune_threshold: int = 3 — validated as positive int in _build_search.
  • __post_init__ guard on BFTSLiteStrategy rejects prune_threshold < 1 for direct construction (tests/scripts that bypass config).
  • Other strategies (greedy/restart) ignore the kwarg.

Reviewer trail

Round Verdict Findings
1 NEEDS_TWEAK trailing-only counter / crash semantics / equal-metric / bootstrap-safe / all-pruned semantics / filter-before-select / config typing
2 VERIFIED All 7 addressed; one non-blocking note on prune_threshold=0 direct-construction → fixed in 833cb06

Stats

  • 2 commits, ~85 LOC strategy + ~12 LOC config + ~200 LOC tests
  • 2,397 passed / 4 skipped, 0 regressions on top of M1b baseline
  • 13 new unit tests in test_strategy.py (35 → 36 with __post_init__ test) covering reviewer's 10-case checklist + factory threading + bootstrap defence

Test plan

  • Trailing 3 failures → prune at threshold
  • Below threshold (2 failures) → not pruned
  • Strict improvement resets streak
  • Equal metric kept counts as non-improvement
  • Discard with worse metric counts (regardless of metric)
  • Crash / violation / skip count without metric comparison
  • Unrelated siblings don't affect candidate streak
  • Candidate without metric → return False (defence)
  • Minimize direction strict improvement
  • All pruned → Stop("doom-loop")
  • decide() filters pruned, picks next-best
  • BranchFrom to next-best when metric-best pruned
  • Factory threads prune_threshold correctly; greedy/restart accept-and-ignore
  • __post_init__ rejects 0 / negative
  • Real-agent demo gate — 30 iter on optimize-compress, see docs/M2-DEMO-GATE.md. Doom-loop pruning explicitly fired on n3, n21, n20, n19 (each accumulated 3 trailing failures and was removed from BFTS's candidate set).

Real-agent demo gate (M2 30-iter)

Iters Best compression_ratio Stop reason
Greedy 9 2.2528 5 consecutive failures, hard-stop
BFTS-lite 30 2.5013 clean max_iterations=30

Greedy hit the legacy max_retries=5 wall at iter 9 — exactly the failure mode this PR's doom-loop pruning is designed to escape. BFTS-lite reached iter 30 with 6 BranchFrom events and 4 nodes explicitly pruned by should_prune, finding a deeper kept path (n1→…→n21, 10 levels) that beat greedy's local optimum by +11%. Total cost $2.05, ~55 min wall (parallel runs).

Known limitations (M2 PR 11+)

  • Pruning is trailing-only; windowed (3-of-5) is a v2 follow-up if domains show oscillating bad branches.
  • Pruning policy lives on BFTSLiteStrategy. If multiple future strategies want to share it, extract a Pruner interface then (YAGNI now).

🤖 Generated with Claude Code

suzuke and others added 2 commits April 25, 2026 22:17
Wires `should_prune()` into the search seam so BFTS stops re-expanding
futile branches. A kept node is pruned when its trailing
`prune_threshold` direct children all fail to produce a strict metric
improvement. `decide()` filters pruned candidates BEFORE max-metric
selection so a high-scoring doom-looped branch cannot starve a viable
lower-scoring one. When every kept node is pruned, decide() returns
Stop("all kept nodes pruned (doom-loop)").

Reviewer round 1 verdict: NEEDS_TWEAK. Addressed:
- trailing-only counter (not windowed) — v1 default
- discard counts unconditionally; crash/violation/skip count without
  metric comparison (failed expansion attempts)
- equal metric is NOT a strict improvement (counts toward streak)
- candidate without metric → return False (defensive, bootstrap-safe)
- minimize direction handled (lower metric = improvement)
- all-pruned → Stop, not Continue or Restart
- decide() filters BEFORE selection (correctness ordering)

Config: `search.prune_threshold: int = 3` (validated as positive int).
Other strategies (greedy/restart) ignore the kwarg.

Tests: +13 (35 total in test_strategy.py); reviewer's 10-case checklist
covered. 0 regressions across full suite (2397 passed / 4 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer round 2 verdict was VERIFIED with one non-blocking note: direct
`BFTSLiteStrategy(prune_threshold=0)` would prune immediately because
`streak >= 0` is always true after the first non-improving child.

Config validation already catches this for the configured path; this
guard catches it for any direct construction (tests, scripts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suzuke added a commit that referenced this pull request Apr 26, 2026
Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth
base class) deferred to PR 16c when there's a second adapter to
motivate the abstraction.

#1 stderr auth-phrase scan (codex_cli.py parse_output):
  If codex bails before --json takes effect (auth check fails before
  JSONL stream opens), the auth phrase is in stderr only. parse_output
  now scans `raw.stderr_tail` as a fallback after stdout — without
  this, the trial would land in error_type=UNKNOWN instead of AUTH.
  Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl,
  test_codex_stderr_unrelated_text_does_NOT_trigger_auth.

#2 reproducibility flags (build_argv):
  Added --ephemeral, --ignore-user-config, --ignore-rules. Without
  --ignore-user-config, a user's ~/.codex/config.toml could silently
  override `model` and break the cli_version-as-reproducibility-
  snapshot guarantee (cli_version snapshots the binary version, not
  the model). Without --ignore-rules, user execpolicy .rules files
  could re-enable codex tool calls outside the workspace-write
  sandbox (§INV-3). Without --ephemeral, codex persists session files
  across runs.
  Test: test_codex_argv_includes_reproducibility_flags.

#3 _FORBIDDEN_FLAGS calibration:
  Verified actual codex 0.124.0 flag names via `codex exec --help`.
  Removed two non-existent flags (`--bypass-approvals` was a
  fabrication; `--dangerously-skip-permissions` is a Claude CLI flag,
  not codex). Added the real `--dangerously-bypass-approvals-and-sandbox`
  (skips both approvals AND sandbox — explicit §INV-3 violation if
  ever passed). Updated meta-test to match.

Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`).
Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant