feat(m2): doom-loop pruning for BFTSLiteStrategy#4
Open
suzuke wants to merge 2 commits into
Open
Conversation
Wires `should_prune()` into the search seam so BFTS stops re-expanding
futile branches. A kept node is pruned when its trailing
`prune_threshold` direct children all fail to produce a strict metric
improvement. `decide()` filters pruned candidates BEFORE max-metric
selection so a high-scoring doom-looped branch cannot starve a viable
lower-scoring one. When every kept node is pruned, decide() returns
Stop("all kept nodes pruned (doom-loop)").
Reviewer round 1 verdict: NEEDS_TWEAK. Addressed:
- trailing-only counter (not windowed) — v1 default
- discard counts unconditionally; crash/violation/skip count without
metric comparison (failed expansion attempts)
- equal metric is NOT a strict improvement (counts toward streak)
- candidate without metric → return False (defensive, bootstrap-safe)
- minimize direction handled (lower metric = improvement)
- all-pruned → Stop, not Continue or Restart
- decide() filters BEFORE selection (correctness ordering)
Config: `search.prune_threshold: int = 3` (validated as positive int).
Other strategies (greedy/restart) ignore the kwarg.
Tests: +13 (35 total in test_strategy.py); reviewer's 10-case checklist
covered. 0 regressions across full suite (2397 passed / 4 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer round 2 verdict was VERIFIED with one non-blocking note: direct `BFTSLiteStrategy(prune_threshold=0)` would prune immediately because `streak >= 0` is always true after the first non-improving child. Config validation already catches this for the configured path; this guard catches it for any direct construction (tests, scripts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 25, 2026
suzuke
added a commit
that referenced
this pull request
Apr 26, 2026
Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth base class) deferred to PR 16c when there's a second adapter to motivate the abstraction. #1 stderr auth-phrase scan (codex_cli.py parse_output): If codex bails before --json takes effect (auth check fails before JSONL stream opens), the auth phrase is in stderr only. parse_output now scans `raw.stderr_tail` as a fallback after stdout — without this, the trial would land in error_type=UNKNOWN instead of AUTH. Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl, test_codex_stderr_unrelated_text_does_NOT_trigger_auth. #2 reproducibility flags (build_argv): Added --ephemeral, --ignore-user-config, --ignore-rules. Without --ignore-user-config, a user's ~/.codex/config.toml could silently override `model` and break the cli_version-as-reproducibility- snapshot guarantee (cli_version snapshots the binary version, not the model). Without --ignore-rules, user execpolicy .rules files could re-enable codex tool calls outside the workspace-write sandbox (§INV-3). Without --ephemeral, codex persists session files across runs. Test: test_codex_argv_includes_reproducibility_flags. #3 _FORBIDDEN_FLAGS calibration: Verified actual codex 0.124.0 flag names via `codex exec --help`. Removed two non-existent flags (`--bypass-approvals` was a fabrication; `--dangerously-skip-permissions` is a Claude CLI flag, not codex). Added the real `--dangerously-bypass-approvals-and-sandbox` (skips both approvals AND sandbox — explicit §INV-3 violation if ever passed). Updated meta-test to match. Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`). Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on #3 (M1b SearchStrategy). First chunk of M2 — wires
should_prune()into the search seam so BFTS stops re-expanding futile branches.What's new
Doom-loop pruning
BFTSLiteStrategy.should_prune(ctx, candidate_id)— a kept node is pruned when its trailingprune_thresholddirect children all fail to produce a strict metric improvement.BFTSLiteStrategy.decide()filters pruned candidates BEFORE max-metric selection, so a high-scoring but doom-looped branch cannot starve a viable lower-scoring one.Stop(reason="all kept nodes pruned (doom-loop)"), not Continue/Restart — those are separate recovery policies.Failure semantics (matches reviewer round 1)
discardoutcome → counts toward streak (always, regardless of metric)crash/violation/skip→ counts (no metric to compare; still a failed expansion attempt)keepoutcome with non-improving metric (incl. equal) → countsshould_prunereturns False (defensive, bootstrap-safe)Config
search.prune_threshold: int = 3— validated as positive int in_build_search.__post_init__guard onBFTSLiteStrategyrejectsprune_threshold < 1for direct construction (tests/scripts that bypass config).Reviewer trail
prune_threshold=0direct-construction → fixed in833cb06Stats
test_strategy.py(35 → 36 with__post_init__test) covering reviewer's 10-case checklist + factory threading + bootstrap defenceTest plan
Stop("doom-loop")decide()filters pruned, picks next-bestBranchFromto next-best when metric-best prunedprune_thresholdcorrectly; greedy/restart accept-and-ignore__post_init__rejects 0 / negativeoptimize-compress, seedocs/M2-DEMO-GATE.md. Doom-loop pruning explicitly fired on n3, n21, n20, n19 (each accumulated 3 trailing failures and was removed from BFTS's candidate set).Real-agent demo gate (M2 30-iter)
compression_ratiomax_iterations=30Greedy hit the legacy
max_retries=5wall at iter 9 — exactly the failure mode this PR's doom-loop pruning is designed to escape. BFTS-lite reached iter 30 with 6 BranchFrom events and 4 nodes explicitly pruned byshould_prune, finding a deeper kept path (n1→…→n21, 10 levels) that beat greedy's local optimum by +11%. Total cost $2.05, ~55 min wall (parallel runs).Known limitations (M2 PR 11+)
BFTSLiteStrategy. If multiple future strategies want to share it, extract aPrunerinterface then (YAGNI now).🤖 Generated with Claude Code