feat(m2): doom-loop pruning for BFTSLiteStrategy by suzuke · Pull Request #4 · suzuke/autocrucible

suzuke · 2026-04-25T14:21:19Z

Summary

Stacked on #3 (M1b SearchStrategy). First chunk of M2 — wires should_prune() into the search seam so BFTS stops re-expanding futile branches.

What's new

Doom-loop pruning

BFTSLiteStrategy.should_prune(ctx, candidate_id) — a kept node is pruned when its trailing prune_threshold direct children all fail to produce a strict metric improvement.
BFTSLiteStrategy.decide() filters pruned candidates BEFORE max-metric selection, so a high-scoring but doom-looped branch cannot starve a viable lower-scoring one.
All-pruned → Stop(reason="all kept nodes pruned (doom-loop)"), not Continue/Restart — those are separate recovery policies.

Failure semantics (matches reviewer round 1)

discard outcome → counts toward streak (always, regardless of metric)
crash / violation / skip → counts (no metric to compare; still a failed expansion attempt)
keep outcome with non-improving metric (incl. equal) → counts
Strict improvement at any tail position → resets the streak
Candidate without metric → should_prune returns False (defensive, bootstrap-safe)

Config

search.prune_threshold: int = 3 — validated as positive int in _build_search.
__post_init__ guard on BFTSLiteStrategy rejects prune_threshold < 1 for direct construction (tests/scripts that bypass config).
Other strategies (greedy/restart) ignore the kwarg.

Reviewer trail

Round	Verdict	Findings
1	NEEDS_TWEAK	trailing-only counter / crash semantics / equal-metric / bootstrap-safe / all-pruned semantics / filter-before-select / config typing
2	VERIFIED	All 7 addressed; one non-blocking note on `prune_threshold=0` direct-construction → fixed in `833cb06`

Stats

2 commits, ~85 LOC strategy + ~12 LOC config + ~200 LOC tests
2,397 passed / 4 skipped, 0 regressions on top of M1b baseline
13 new unit tests in test_strategy.py (35 → 36 with __post_init__ test) covering reviewer's 10-case checklist + factory threading + bootstrap defence

Test plan

Real-agent demo gate (M2 30-iter)

	Iters	Best `compression_ratio`	Stop reason
Greedy	9	2.2528	5 consecutive failures, hard-stop
BFTS-lite	30	2.5013	clean `max_iterations=30`

Greedy hit the legacy max_retries=5 wall at iter 9 — exactly the failure mode this PR's doom-loop pruning is designed to escape. BFTS-lite reached iter 30 with 6 BranchFrom events and 4 nodes explicitly pruned by should_prune, finding a deeper kept path (n1→…→n21, 10 levels) that beat greedy's local optimum by +11%. Total cost $2.05, ~55 min wall (parallel runs).

Known limitations (M2 PR 11+)

Pruning is trailing-only; windowed (3-of-5) is a v2 follow-up if domains show oscillating bad branches.
Pruning policy lives on BFTSLiteStrategy. If multiple future strategies want to share it, extract a Pruner interface then (YAGNI now).

🤖 Generated with Claude Code

Wires `should_prune()` into the search seam so BFTS stops re-expanding futile branches. A kept node is pruned when its trailing `prune_threshold` direct children all fail to produce a strict metric improvement. `decide()` filters pruned candidates BEFORE max-metric selection so a high-scoring doom-looped branch cannot starve a viable lower-scoring one. When every kept node is pruned, decide() returns Stop("all kept nodes pruned (doom-loop)"). Reviewer round 1 verdict: NEEDS_TWEAK. Addressed: - trailing-only counter (not windowed) — v1 default - discard counts unconditionally; crash/violation/skip count without metric comparison (failed expansion attempts) - equal metric is NOT a strict improvement (counts toward streak) - candidate without metric → return False (defensive, bootstrap-safe) - minimize direction handled (lower metric = improvement) - all-pruned → Stop, not Continue or Restart - decide() filters BEFORE selection (correctness ordering) Config: `search.prune_threshold: int = 3` (validated as positive int). Other strategies (greedy/restart) ignore the kwarg. Tests: +13 (35 total in test_strategy.py); reviewer's 10-case checklist covered. 0 regressions across full suite (2397 passed / 4 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer round 2 verdict was VERIFIED with one non-blocking note: direct `BFTSLiteStrategy(prune_threshold=0)` would prune immediately because `streak >= 0` is always true after the first non-improving child. Config validation already catches this for the configured path; this guard catches it for any direct construction (tests, scripts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth base class) deferred to PR 16c when there's a second adapter to motivate the abstraction. #1 stderr auth-phrase scan (codex_cli.py parse_output): If codex bails before --json takes effect (auth check fails before JSONL stream opens), the auth phrase is in stderr only. parse_output now scans `raw.stderr_tail` as a fallback after stdout — without this, the trial would land in error_type=UNKNOWN instead of AUTH. Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl, test_codex_stderr_unrelated_text_does_NOT_trigger_auth. #2 reproducibility flags (build_argv): Added --ephemeral, --ignore-user-config, --ignore-rules. Without --ignore-user-config, a user's ~/.codex/config.toml could silently override `model` and break the cli_version-as-reproducibility- snapshot guarantee (cli_version snapshots the binary version, not the model). Without --ignore-rules, user execpolicy .rules files could re-enable codex tool calls outside the workspace-write sandbox (§INV-3). Without --ephemeral, codex persists session files across runs. Test: test_codex_argv_includes_reproducibility_flags. #3 _FORBIDDEN_FLAGS calibration: Verified actual codex 0.124.0 flag names via `codex exec --help`. Removed two non-existent flags (`--bypass-approvals` was a fabrication; `--dangerously-skip-permissions` is a Claude CLI flag, not codex). Added the real `--dangerously-bypass-approvals-and-sandbox` (skips both approvals AND sandbox — explicit §INV-3 violation if ever passed). Updated meta-test to match. Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`). Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

suzuke and others added 2 commits April 25, 2026 22:17

This was referenced Apr 25, 2026

feat(m2): reporter compare mode + crucible compare --html #5

Open

feat(m2): WorktreeMutex + crucible cleanup CLI #8

Open

feat(m3): interactive d3 tree-view reporter #9

Open

suzuke mentioned this pull request Apr 26, 2026

feat(m3): GeminiCLIAdapter + CLISubscriptionAuthError base (M3 PR 16b) #16

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(m2): doom-loop pruning for BFTSLiteStrategy#4

feat(m2): doom-loop pruning for BFTSLiteStrategy#4
suzuke wants to merge 2 commits into
feat/m1b-search-strategyfrom
feat/m2-doom-loop

suzuke commented Apr 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suzuke commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's new

Doom-loop pruning

Failure semantics (matches reviewer round 1)

Config

Reviewer trail

Stats

Test plan

Real-agent demo gate (M2 30-iter)

Known limitations (M2 PR 11+)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

suzuke commented Apr 25, 2026 •

edited

Loading