feat(m2): reporter compare mode + crucible compare --html by suzuke · Pull Request #5 · suzuke/autocrucible

suzuke · 2026-04-25T14:57:59Z

Summary

Stacked on #4 (M2 PR 10 doom-loop). Adds crucible compare a b --html — a side-by-side static HTML report for two ledgers. Useful for "greedy vs bfts-lite on the same example" demo-gate comparisons.

What's new

Renderer

crucible.reporter.compare.render_comparison_html(left, right, *, left_label, right_label, …) — outputs a single self-contained HTML doc with two columns, each rendered using the existing tree-view from M1b. Reuses _render_tree / _render_summary / _best_node_id / _color_for so the per-side cards match the single-view report.
Side-namespaced DOM ids — every id="..." and href="#..." in compare mode is namespaced with left- / right- prefixes so two trees with identical AttemptNode ids (n000001, n000002 …) coexist in one document without collisions. Single-view output is bit-identical to before this PR (default anchor_prefix="").
Δ best-metric line — only rendered when both sides agree on metric direction AND both bests exist. Otherwise omitted (no auto-winner verdict per M1b demo-gate disclaimer).
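To illustrate the side-namespacing idea, here is a minimal sketch, not the PR's actual helper code. The names `render_card` and `render_compare` and the card markup are hypothetical; only the `anchor_prefix` keyword, its empty default, and the `left-` / `right-` prefixes come from the description above.

```python
from html import escape
from typing import Iterable, Optional, Tuple


def render_card(node_id: str, parent_id: Optional[str], *, anchor_prefix: str = "") -> str:
    """Render one node card (hypothetical markup).

    anchor_prefix scopes the DOM id and the parent link; the default ""
    leaves single-view output unchanged.
    """
    dom_id = escape(f"{anchor_prefix}{node_id}", quote=True)
    parts = [f'<div class="card" id="{dom_id}">', f"<b>{escape(node_id)}</b>"]
    if parent_id is not None:
        href = escape(f"#{anchor_prefix}{parent_id}", quote=True)
        # Display text stays the bare node id; only the anchor is prefixed.
        parts.append(f'<a href="{href}">parent: {escape(parent_id)}</a>')
    parts.append("</div>")
    return "".join(parts)


def render_compare(nodes: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Render the same (node_id, parent_id) pairs twice, once per side."""
    nodes = list(nodes)
    left = "".join(render_card(n, p, anchor_prefix="left-") for n, p in nodes)
    right = "".join(render_card(n, p, anchor_prefix="right-") for n, p in nodes)
    return f'<div class="col">{left}</div><div class="col">{right}</div>'
```

Making the prefix keyword-only with an empty default is what lets the single-view path stay untouched while compare mode opts in per side.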

CLI

crucible compare a b --html [--html-out PATH] — writes <project>/reports/compare-a-vs-b.html by default.
--right-project DIR — opt-in cross-project compare (e.g. compress-greedy/ workspace vs compress-bfts/ workspace from M1b demo gate). Cross-project output defaults to cwd to avoid writing into the wrong project.
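The default-path rule can be sketched as follows. This is an illustration under assumptions, not the CLI's real code: the function name `default_compare_output` is hypothetical, and the file name used for the cross-project case is assumed to follow the same `compare-a-vs-b.html` pattern, which the description does not state.

```python
from pathlib import Path
from typing import Optional


def default_compare_output(
    project_dir: Path,
    left: str,
    right: str,
    *,
    right_project: Optional[Path] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Pick the report path used when --html-out is not given (hypothetical helper)."""
    name = f"compare-{left}-vs-{right}.html"  # file name pattern assumed for both cases
    if right_project is not None:
        # Cross-project compare: write to cwd rather than guess which project owns the report.
        return (cwd if cwd is not None else Path.cwd()) / name
    return project_dir / "reports" / name
```

For example, `default_compare_output(Path("proj"), "a", "b")` yields `proj/reports/compare-a-vs-b.html`, while passing `right_project` switches the parent directory to the current working directory.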

Strict read-only

No orchestrator changes. No ledger mutation. No config normalization. The only file written is the rendered HTML at --html-out (or its default).

Reviewer trail

| Round | Verdict | Findings |
|---|---|---|
| 1 (design) | ACCEPT | + 4 implementation constraints (missing-data → `n/a`; Δ only when directions agree; explicit output path; strict read-only) |
| 2 | REJECTED | Blocking F1: duplicate DOM ids. Both sides had the same node ids in one document, making `href="#n000001"` ambiguous. Required side-anchor namespacing. |
| 3 | VERIFIED | F1 fixed via `anchor_prefix` kwarg on shared helpers; single-view byte-identical; comprehensive uniqueness tests added. |
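The F1 finding is the kind of defect a small standard-library audit can catch. The sketch below is an assumption about how such a check could look, not the test suite's actual validator. `IdAuditor` and `audit` are hypothetical names, and the `tags_open` attribute is only modelled on the `not p.tags_open` assertion mentioned in the commit message.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class IdAuditor(HTMLParser):
    """Collect DOM ids, same-document hrefs, and still-open tags."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.hrefs = []
        self.tags_open = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tags_open.append(tag)
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1
            elif name == "href" and value and value.startswith("#"):
                self.hrefs.append(value[1:])

    def handle_endtag(self, tag):
        # Only pop on a matching close; mismatches stay open and get reported.
        if self.tags_open and self.tags_open[-1] == tag:
            self.tags_open.pop()


def audit(html: str):
    """Return (duplicate ids, hrefs with no target, tags left open)."""
    p = IdAuditor()
    p.feed(html)
    p.close()
    duplicates = sorted(i for i, n in p.ids.items() if n > 1)
    dangling = sorted({h for h in p.hrefs if h not in p.ids})
    return duplicates, dangling, p.tags_open
```

Run against an un-namespaced two-column document, `audit` reports the shared id as a duplicate; with `left-` / `right-` prefixes all three result lists come back empty.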

Stats

2 commits, 8 files changed (+822 / −24 LOC counting both commits)
2,415 passed / 4 skipped, 0 regressions on top of PR 10 (2,397) and M1b (2,296) baselines
13 new tests in test_reporter_compare.py (covering uniqueness, Δ rules, label escaping, custom title, per-side direction) + 4 new CLI tests in test_cli.py

Test plan

Unit (`test_reporter_compare.py`, 13 cases)

CLI (`test_cli.py`, 4 new cases)

crucible compare a b --html writes to default <project>/reports/compare-a-vs-b.html
--html-out PATH honoured
Missing ledger errors clearly with non-zero exit
--right-project requires --html (rejected otherwise)

Manual smoke

Rendered a 10 KB HTML report for a greedy-vs-bfts ledger pair: both columns present, Δ visible, 2× ★ best, parent links resolve to same-side cards.

End-to-end on real ledgers (M2 demo gate)

crucible compare m2-30 m2-30 --html --project-dir .../compress-greedy --right-project .../compress-bfts rendered the M2 30-iter demo gate's two ledgers into a single 126 KB HTML doc. Greedy's 9-node linear chain and BFTS's 30-node branching tree visibly contrast; Δ best metric line shows right − left = +0.2485 (raw arithmetic delta, no winner verdict). DOM ids correctly namespaced as left-nXXXXX / right-nXXXXX — no anchor collisions despite both ledgers using the same id range. See docs/M2-DEMO-GATE.md §4 for screenshots/details.
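The Δ rule described in this PR (show right − left only when both bests exist and both sides agree on metric direction) can be sketched as below. This is a hedged illustration: `best_metric_delta` and `format_delta` are hypothetical names, the direction strings are assumptions, and the exact label text of the real report is not known. The example values are the two bests quoted in the demo-gate commit message (2.2528 for greedy, 2.5013 for BFTS).

```python
from typing import Optional


def best_metric_delta(
    left_best: Optional[float],
    right_best: Optional[float],
    left_direction: str,
    right_direction: str,
) -> Optional[float]:
    """Return right - left, or None when the Δ line should be omitted."""
    if left_best is None or right_best is None:
        return None  # a side has no best node: nothing to subtract
    if left_direction != right_direction:
        return None  # directions disagree: a raw delta would be misleading
    return right_best - left_best


def format_delta(delta: Optional[float]) -> Optional[str]:
    """Format the raw arithmetic delta; no winner verdict is attached."""
    if delta is None:
        return None
    return f"Δ best metric (right − left) = {delta:+.4f}"


# Direction value "maximize" is an assumed spelling, used only for this example.
line = format_delta(best_metric_delta(2.2528, 2.5013, "maximize", "maximize"))
```

Returning `None` instead of a zero or a placeholder string keeps the "omit, never guess" behaviour explicit at the call site.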

Known limitations (non-blockers)

Header summary doesn't show strategy / wall-time (single-view summary reused). Reviewer note said this is acceptable for v1; would render n/a if added later.
No per-iter diff highlighting between sides (reviewer agreed: out of scope for v1).
No interactive drill-down (M3 d3.js territory).

🤖 Generated with Claude Code

Side-by-side static HTML for two ledgers, useful for "greedy vs bfts-lite on the same example" demo-gate comparisons. Strict read-only: no orchestrator changes, no ledger mutation, no config normalization.

Renderer: `crucible.reporter.compare.render_comparison_html(left, right, *, left_label, right_label, …)`. Reuses html_tree's `_render_tree` / `_render_summary` / `_best_node_id` / `_color_for` so the per-side cards look identical to the single-view report.

CLI: `crucible compare a b --html [--html-out PATH]` writes `<project>/reports/compare-a-vs-b.html` by default. `--right-project DIR` opts into cross-project comparison (e.g. compress-greedy workspace vs compress-bfts workspace from M1b demo gate). Cross-project default output is cwd to avoid writing into the wrong project.

Reviewer round 1 verdict: ACCEPT with constraints, all addressed:

- Missing data → "n/a" / empty panel, never silently zero
- Δ line shown ONLY when both sides agree on metric direction (and both bests exist); otherwise omitted (no auto-winner verdict)
- Output path: explicit `--html-out` or predictable default
- Strict read-only: no writes anywhere outside the report file
- Renderer extraction: kept html_tree.py as stable single-view facade, compare.py imports underscore helpers without changing their API

Tests: 11 new in test_reporter_compare.py + 4 new CLI tests in test_cli.py. Full suite: 2413 passed / 4 skipped, 0 regressions over M2 PR 10 baseline (2397).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer round 2 REJECTED the original PR 11 because both ledgers in a compare HTML normally share AttemptNode ids (n000001, n000002…), so rendering two trees produced duplicate `id="n000001"` elements and ambiguous `href="#n000001"` anchors. Fixed by namespacing every DOM id and intra-document anchor with a side-scoped prefix.

Changes:

- `_render_tree`, `_render_card`, `_render_summary` accept `anchor_prefix: str = ""` (kwarg-only). Default empty → single-view output unchanged.
- `compare.py` passes `"left-"` / `"right-"` so `id="left-n000001"` and `id="right-n000001"` coexist; parent links and best-summary links use the same prefixed anchors. Display text remains the bare node id: the prefix is an implementation detail, not user-facing.

Tests:

- Existing compare tests updated to assert side-scoped anchors AND that bare ids (which would collide) do NOT appear.
- 2 new dedicated tests: `test_compare_dom_ids_are_unique_per_side` (no collision across 3-node × 2-side ledger) and `test_compare_best_link_uses_side_anchor` (best-link clicks land on the same-side card).
- HTML validator tightened to assert `not p.tags_open` at EOF (reviewer non-blocker; catches stray unclosed tags).

Full suite: 2415 passed / 4 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First demo where BFTS-lite materially outperforms greedy because greedy hits max_retries=5 hard-stop while BFTS keeps exploring via BranchFrom + doom-loop pruning.

- Greedy: 9 iter, best 2.2528, stopped at 5-consecutive-failure wall
- BFTS: 30 iter, best 2.5013, clean max_iterations stop
- Total: $2.05, ~55 min wall (parallel runs)

BFTS ledger shows 6 BranchFrom events and 4 nodes explicitly pruned by the M2 PR 10 doom-loop seam (n3, n21, n20, n19 each had 3 trailing failures → pruned from candidate set). Best result (2.5013 at iter 21) came from a deep path n1→n2→n9→n12→n13→n14→n17→n19→n20→n21, 10 levels deep, well beyond what greedy reached before its hard-stop.

Compare HTML rendered via the new `crucible compare --html` (M2 PR 11); file at /tmp/m2-30-compare.html locally, not committed (126 KB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

suzuke and others added 3 commits April 25, 2026 22:34

This was referenced Apr 25, 2026

feat(m2): HMAC-SHA256 seal upgrade for eval-result.json #6

Open

feat(m2): WorktreeMutex + crucible cleanup CLI #8

Open

suzuke wants to merge 3 commits into feat/m2-doom-loop from feat/m2-reporter-compare
