
feat(m2): reporter compare mode + crucible compare --html#5

Open
suzuke wants to merge 3 commits into feat/m2-doom-loop from feat/m2-reporter-compare

@suzuke suzuke commented Apr 25, 2026

Summary

Stacked on #4 (M2 PR 10 doom-loop). Adds crucible compare a b --html — a side-by-side static HTML report for two ledgers. Useful for "greedy vs bfts-lite on the same example" demo-gate comparisons.

What's new

Renderer

  • crucible.reporter.compare.render_comparison_html(left, right, *, left_label, right_label, …) — outputs a single self-contained HTML doc with two columns, each rendered using the existing tree-view from M1b. Reuses _render_tree / _render_summary / _best_node_id / _color_for so the per-side cards match the single-view report.
  • Side-namespaced DOM ids — every id="..." and href="#..." in compare mode is namespaced with left- / right- prefixes so two trees with identical AttemptNode ids (n000001, n000002 …) coexist in one document without collisions. Single-view output is bit-identical to before this PR (default anchor_prefix="").
  • Δ best-metric line — only rendered when both sides agree on metric direction AND both bests exist. Otherwise omitted (no auto-winner verdict per M1b demo-gate disclaimer).
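A minimal sketch of the Δ rule described above, not the renderer's actual code. The function name `delta_line`, its parameters, and the direction strings `"minimize"` / `"maximize"` are assumptions made for illustration; only the rule itself (agreeing directions, both bests present, raw right − left arithmetic) comes from this PR.

```python
from typing import Optional


def delta_line(
    left_best: Optional[float],
    right_best: Optional[float],
    left_direction: Optional[str],
    right_direction: Optional[str],
) -> Optional[str]:
    """Return a 'right - left' delta string, or None when the line is omitted.

    Hypothetical helper: names and signature are illustrative only.
    """
    # Omit when either side has no best metric (e.g. an empty ledger).
    if left_best is None or right_best is None:
        return None
    # Omit when a direction is unknown or the two sides disagree.
    if left_direction is None or left_direction != right_direction:
        return None
    # Raw arithmetic difference only; no winner is declared.
    return f"Δ best metric (right − left) = {right_best - left_best:+.4f}"
```

Returning `None` rather than a zero or placeholder keeps the "never silently zero" constraint: the caller simply skips the line when the comparison is not meaningful.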

CLI

  • crucible compare a b --html [--html-out PATH] — writes <project>/reports/compare-a-vs-b.html by default.
  • --right-project DIR — opt-in cross-project compare (e.g. compress-greedy/ workspace vs compress-bfts/ workspace from M1b demo gate). Cross-project output defaults to cwd to avoid writing into the wrong project.
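The output-path rules above could be resolved roughly as follows. This is a sketch under stated assumptions, not the CLI's implementation: `resolve_html_out` and its parameters are hypothetical, and the cross-project filename is assumed to match the single-project one (the PR only says the default location becomes cwd).

```python
from pathlib import Path
from typing import Optional


def resolve_html_out(
    left_tag: str,
    right_tag: str,
    project_dir: Path,
    right_project: Optional[Path] = None,
    html_out: Optional[Path] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Pick the report path. Hypothetical helper, for illustration only."""
    # An explicit --html-out always wins.
    if html_out is not None:
        return html_out
    # Assumption: the same filename pattern is used in both modes.
    filename = f"compare-{left_tag}-vs-{right_tag}.html"
    # Cross-project compare: default to cwd so neither project is written into.
    if right_project is not None:
        return (cwd if cwd is not None else Path.cwd()) / filename
    # Single-project compare: <project>/reports/compare-a-vs-b.html.
    return project_dir / "reports" / filename
```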

Strict read-only

No orchestrator changes. No ledger mutation. No config normalization. The only file written is the rendered HTML at --html-out (or its default).

Reviewer trail

| Round | Verdict | Findings |
| --- | --- | --- |
| 1 (design) | ACCEPT | + 4 implementation constraints (missing-data → n/a; Δ only when directions agree; explicit output path; strict read-only) |
| 2 | REJECTED | Blocking F1: duplicate DOM ids. Both sides had the same node ids in one document, making href="#n000001" ambiguous. Required side-anchor namespacing. |
| 3 | VERIFIED | F1 fixed via anchor_prefix kwarg on shared helpers; single-view byte-identical; comprehensive uniqueness tests added. |

Stats

  • 2 commits, 8 files changed (+822 / −24 LOC counting both commits)
  • 2,415 passed / 4 skipped, 0 regressions on top of PR 10 (2,397) and M1b (2,296) baselines
  • 17 new tests in total: 13 in test_reporter_compare.py (covering uniqueness, Δ rules, label escaping, custom title, per-side direction) + 4 new CLI tests

Test plan

Unit (test_reporter_compare.py, 13 cases)

  • Both sides render with their labels + nodes
  • Empty ledger on one side → "(no attempts)" panel; other side renders normally
  • Best-of-run badge per side (count == 2 when both sides have metrics)
  • Δ rendered when both directions agree
  • Δ omitted when directions differ / either is None / no metrics
  • Parent relationships render with side-scoped href
  • Labels HTML-escaped (<script> → &lt;script&gt;)
  • Custom title in <title> and <h1>
  • Per-side metric direction (minimize+maximize coexist)
  • DOM ids unique per side (3-node × 2-side ledger; reviewer F1 regression)
  • Best-summary link uses side-local anchor (reviewer F1 regression)
  • HTML well-formed with no unclosed tags at EOF (validator hardening)
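For reviewers who want to spot-check the F1 regression outside the test suite, a duplicate-id and dangling-anchor check can be sketched with the standard library's `html.parser`. `IdCollector` and `check_anchors` are hypothetical names and are not the helpers used in test_reporter_compare.py.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id="..." and intra-document href="#..." value."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)
            elif name == "href" and value and value.startswith("#"):
                # Store the anchor target without the leading '#'.
                self.hrefs.append(value[1:])


def check_anchors(html_text):
    """Return (duplicate ids, anchors that point at no id)."""
    parser = IdCollector()
    parser.feed(html_text)
    parser.close()
    duplicates = sorted(i for i, n in Counter(parser.ids).items() if n > 1)
    dangling = sorted(set(parser.hrefs) - set(parser.ids))
    return duplicates, dangling
```

On a correctly namespaced compare report both returned lists should be empty; a bare `n000001` id appearing on both sides would show up in the first list.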

CLI (test_cli.py, 4 new cases)

  • crucible compare a b --html writes to default <project>/reports/compare-a-vs-b.html
  • --html-out PATH honoured
  • Missing ledger errors clearly with non-zero exit
  • --right-project requires --html (rejected otherwise)

Manual smoke

  • Rendered a 10 KB HTML report for a greedy-vs-bfts ledger pair: both columns present, Δ visible, 2× ★ best, parent links resolve to same-side cards.

End-to-end on real ledgers (M2 demo gate)

  • crucible compare m2-30 m2-30 --html --project-dir .../compress-greedy --right-project .../compress-bfts rendered the M2 30-iter demo gate's two ledgers into a single 126 KB HTML doc. Greedy's 9-node linear chain and BFTS's 30-node branching tree visibly contrast; Δ best metric line shows right − left = +0.2485 (raw arithmetic delta, no winner verdict). DOM ids correctly namespaced as left-nXXXXX / right-nXXXXX — no anchor collisions despite both ledgers using the same id range. See docs/M2-DEMO-GATE.md §4 for screenshots/details.

Known limitations (non-blockers)

  • Header summary doesn't show strategy / wall-time (single-view summary reused). Reviewer note said this is acceptable for v1; would render n/a if added later.
  • No per-iter diff highlighting between sides (reviewer agreed: out of scope for v1).
  • No interactive drill-down (M3 d3.js territory).

🤖 Generated with Claude Code

suzuke and others added 3 commits April 25, 2026 22:34
Side-by-side static HTML for two ledgers — useful for "greedy vs
bfts-lite on the same example" demo-gate comparisons. Strict read-only:
no orchestrator changes, no ledger mutation, no config normalization.

Renderer: `crucible.reporter.compare.render_comparison_html(left, right,
*, left_label, right_label, …)`. Reuses html_tree's `_render_tree` /
`_render_summary` / `_best_node_id` / `_color_for` so the per-side cards
look identical to the single-view report.

CLI: `crucible compare a b --html [--html-out PATH]` writes
`<project>/reports/compare-a-vs-b.html` by default. `--right-project
DIR` opts into cross-project comparison (e.g. compress-greedy workspace
vs compress-bfts workspace from M1b demo gate). Cross-project default
output is cwd to avoid writing into the wrong project.

Reviewer round 1 verdict: ACCEPT with constraints — all addressed:
- Missing data → "n/a" / empty panel, never silently zero
- Δ line shown ONLY when both sides agree on metric direction (and
  both bests exist); otherwise omitted (no auto-winner verdict)
- Output path: explicit `--html-out` or predictable default
- Strict read-only: no writes anywhere outside the report file
- Renderer extraction: kept html_tree.py as stable single-view facade,
  compare.py imports underscore helpers without changing their API

Tests: 11 new in test_reporter_compare.py + 4 new CLI tests in
test_cli.py. Full suite: 2413 passed / 4 skipped, 0 regressions over
M2 PR 10 baseline (2397).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer round 2 REJECTED the original PR 11 because both ledgers in a
compare HTML normally share AttemptNode ids (n000001, n000002…), so
rendering two trees produced duplicate `id="n000001"` elements and
ambiguous `href="#n000001"` anchors. Fixed by namespacing every DOM id
and intra-document anchor with a side-scoped prefix.

Changes:
- `_render_tree`, `_render_card`, `_render_summary` accept `anchor_prefix:
  str = ""` (kwarg-only). Default empty → single-view output unchanged.
- `compare.py` passes `"left-"` / `"right-"` so `id="left-n000001"` and
  `id="right-n000001"` coexist; parent links and best-summary links use
  the same prefixed anchors. Display text remains the bare node id —
  the prefix is implementation detail, not user-facing.
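
The shape of the change can be sketched as below. `render_card` here is a simplified stand-in written for illustration, not the project's `_render_card`; its markup and parameters are assumptions. It only shows how a kwarg-only `anchor_prefix` with an empty default leaves single-view output unchanged while namespacing ids and anchors in compare mode.

```python
from html import escape
from typing import Optional


def render_card(
    node_id: str,
    parent_id: Optional[str] = None,
    *,
    anchor_prefix: str = "",
) -> str:
    """Render one node card. Hypothetical stand-in, illustrative markup only."""
    # The DOM id carries the side prefix; the visible text stays the bare id.
    anchor = escape(f"{anchor_prefix}{node_id}", quote=True)
    parts = [f'<div class="card" id="{anchor}">', f"<h3>{escape(node_id)}</h3>"]
    if parent_id is not None:
        # Parent links use the same prefix so they resolve within one side.
        parent_anchor = escape(f"{anchor_prefix}{parent_id}", quote=True)
        parts.append(f'<a href="#{parent_anchor}">parent: {escape(parent_id)}</a>')
    parts.append("</div>")
    return "".join(parts)
```

Calling `render_card("n000002", "n000001", anchor_prefix="left-")` yields `id="left-n000002"` and `href="#left-n000001"`, while omitting the kwarg produces the bare ids of the single-view report.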

Tests:
- Existing compare tests updated to assert side-scoped anchors AND
  that bare ids (which would collide) do NOT appear.
- 2 new dedicated tests: `test_compare_dom_ids_are_unique_per_side`
  (no collision across 3-node × 2-side ledger) and
  `test_compare_best_link_uses_side_anchor` (best-link clicks land on
  the same-side card).
- HTML validator tightened to assert `not p.tags_open` at EOF
  (reviewer non-blocker — catches stray unclosed tags).

Full suite: 2415 passed / 4 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First demo where BFTS-lite materially outperforms greedy because
greedy hits max_retries=5 hard-stop while BFTS keeps exploring via
BranchFrom + doom-loop pruning.

Greedy:  9 iter,  best 2.2528, stopped at 5-consecutive-failure wall
BFTS:   30 iter,  best 2.5013, clean max_iterations stop
Total:  $2.05, ~55 min wall (parallel runs)

BFTS ledger shows 6 BranchFrom events and 4 nodes explicitly pruned
by the M2 PR 10 doom-loop seam (n3, n21, n20, n19 each had 3 trailing
failures → pruned from candidate set). Best result (2.5013 at iter 21)
came from a deep path n1→n2→n9→n12→n13→n14→n17→n19→n20→n21 — 10 levels
deep, well beyond what greedy reached before its hard-stop.

Compare HTML rendered via the new `crucible compare --html` (M2 PR 11);
file at /tmp/m2-30-compare.html locally, not committed (126 KB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>