Semantic-Org · jlukic · May 5, 2026 · May 5, 2026 · May 5, 2026 · May 5, 2026
diff --git a/.github/workflows/benchmarks-report.yml b/.github/workflows/benchmarks-report.yml
@@ -125,14 +125,20 @@ jobs:
           GH_TOKEN: ${{ steps.bot-token.outputs.token }}
           PR: ${{ steps.pr.outputs.number }}
           REPO: ${{ github.repository }}
+          BOT_LOGIN: semantic-performance-bot[bot]
         run: |
-          # --edit-last finds the most recent comment from the authenticated
-          # user; with the bot token, that targets the bot's previous bench
-          # comment (not github-actions[bot]'s earlier artifacts).
-          if ! gh pr comment "$PR" --repo "$REPO" --edit-last \
-                --body-file bench-report/comment.md 2>/dev/null; then
-            gh pr comment "$PR" --repo "$REPO" \
-              --body-file bench-report/comment.md
+          # Explicit author-login lookup (not `gh pr comment --edit-last`):
+          # --edit-last queries comments by viewer identity, which can drift
+          # across short-lived installation-token rotations and silently
+          # create a duplicate comment instead of editing the prior one.
+          # Filtering REST results by `user.login` is stable across tokens.
+          EXISTING_ID=$(gh api "repos/$REPO/issues/$PR/comments" \
+            --jq "[.[] | select(.user.login == \"$BOT_LOGIN\")] | last | .id // empty")
+          if [ -n "$EXISTING_ID" ]; then
+            jq -nR --rawfile body bench-report/comment.md '{body: $body}' \
+              | gh api "repos/$REPO/issues/comments/$EXISTING_ID" -X PATCH --input -
+          else
+            gh pr comment "$PR" --repo "$REPO" --body-file bench-report/comment.md
           fi
 
   # History path: append this commit's absolute CIs to bench-history.json

diff --git a/ai/plans/ROADMAP.md b/ai/plans/ROADMAP.md
@@ -83,8 +83,6 @@ Plans with an open PR or live pair work. Updated as ceremony when a PR opens; en
 
 - [Release 0.18.0](active/release-0-18-0.md) — [PR #122](https://github.com/Semantic-Org/Semantic-Next/pull/122) `docs/shippable` (menu trimming + audit pass pending). Ships the next tagged release; last was 0.17.0 in November.
 - [Signal Performance](active/signal-performance.md) — [PR #150](https://github.com/Semantic-Org/Semantic-Next/pull/150) freeze-by-default. Perf story unresolved (see plan's Bench Results); release inclusion is the open call.
-- [Bench Peak Attribution](active/bench-peak-attribution.md) — [PR #178](https://github.com/Semantic-Org/Semantic-Next/pull/178) methodology fix for cross-session absolute-ms comparisons. Eliminates phantom "Regressions from peak" on PRs.
-
 ---
 
 ## Phase 0 — Renderer Architecture
@@ -201,7 +199,6 @@ Slot in wherever there's a gap; not phase-gated.
 | P12 | [Template Spread Syntax](template-spread-syntax.md) | 4-8h | pair | scoped | `{>card ...friend}` — object spread in data passing. Ship when component templates demonstrate need. |
 | P13 | [Template Content Projection](template-wrapper-snippets.md) | 12-16h (1.5-2d) | pair | scoped | `{>content}` — content projection for snippets + subtemplates. Ship when component templates demonstrate need. |
 | P14 | [Template Let Bindings](template-let-bindings.md) | 10-14h (1-2d) | pair | scoped | `{#let}...{/let}` — snippet-for-vars. Ship when component templates demonstrate need. |
-| P15 | [Bench Peak Attribution](active/bench-peak-attribution.md) | 9-11h (1.5d) | pair | scoped | Fix the live peak-attribution bug. PR #174 (test-only, no perf changes) currently surfaces 25 phantom "Regressions from peak"; active perf PRs carry partial false-flagging too. Schema_v2 persists `percent_delta_ci` + `baseline_sha` per metric; reporter switches peak compare to same-session percent-delta; `--scope pr` drops main-history overlay on PR comments; drift flag with chain-of-percent-deltas when baselines differ. `bench-history.json` wiped to empty v2 (v1 entries fed the bug). Two PRs: methodology fix + suite cleanup (`toggle-{first,last}-10` + conditional `timeout` 3→2). |
 
 ---
 
@@ -215,6 +212,5 @@ Plans drafted but not on the active roadmap. See `ai/plans/icebox/` for files.
 - [Signals TC39 Integration](icebox/signals-tc39-integration.md) — adopt native `Signal.State`/`Signal.Computed` as backing primitives when TC39 ships. Blocked on TC39 Stage 3+.
 - [Add Icon Stroke Width](icebox/add-icon-stroke-width.md) — power-user feature, post-1.0.
 - [Audit Fix Continuation](icebox/audit-fix-continuation.md) — process work for follow-up audits.
-- [Bench Suite Expansion](icebox/bench-suite-expansion.md) — file-scoped hot-path micros (`micro-expression-evaluator`, `micro-signal`, etc.) + new end-to-end benches (`wake-count-single-key`, `nested-mutation`, `hydrate-1000-card`). Surgical adds; lands when underlying perf work needs them.
 - [Contributing Surface](icebox/contributing-surface.md) — pre-1.0 stance + 1.0 graduation pass + post-1.0 triage flow (size + scope, GH-shaped vs md-shaped). Most icebox graduates at 1.0; the rest stays internal.
 - [Registry](icebox/registry.md) — community registry for components and behaviors, runtime + compile-time consumption from one source, author-namespaced publishing under `@sui-hub` with editorial canonical aliases above. Post-Phase 4.
diff --git a/ai/plans/archive/bench-coverage-expansion.md b/ai/plans/archive/bench-coverage-expansion.md
@@ -0,0 +1,138 @@
+# Bench Coverage Expansion
+
+## Goal
+
+Two tracks of bench coverage shipped in one PR.
+
+**Track A — Fine-grained reactivity workloads.** Six workload-shaped metrics in a new `bench-template-reactivity` cell, anchored to production component patterns (tabs selection, card-search filtering, dropdown selected-item, async-search results) and to the doc-promised-but-unrealized invariants the renderer doesn't yet deliver. Pure client-mount path — no SSR, no hydration. Models the naive case of CDN-loaded or agentic-VM-loaded SUI components. A separate amplification of an existing `bench-hydrate.js` metric covers the SSR-hydration variant of the same invariants.
+
+**Track B — Per-file hot-path micros.** Per-package isolation benches that catch sub-noise-floor regressions on PRs touching a single hot-path file. The macro suite shifts by 2-3% on a 20% regression in `expression-evaluator.js` — below the resolution floor. Per-file micros surface those.
+
+Together they fill measurement gaps that the existing krausest / todo / hydrate / signal cells don't reach.
+
+## Status
+
+`complete` — six Track A metrics shipped (a seventh, `derived-cascade-100`, was dropped pre-merge after review found its workload measured regex throughput more than reactive dispatch and the FGR plan it was speculatively designed for doesn't ship expression-evaluator memoization), hydration metric amplified, Track B signal extensions + renderer micros + compiler micros all shipped, bench reporter gained a purpose-extraction + glossary system (extra-scope), per-suite purpose comments authored across all five bench files (extra-scope), snippet/subtemplate arg-source-propagation correctness tests landed in renderer (extra-scope), two previously-skipped granularity tests promoted (one to passing, one to `it.fails`). Plan + roadmap entries updated. PR #181 open and ready. Post-merge tail: bench-history.json populates after merge, the `--edit-last` workflow fix for duplicate-bot-comments takes effect post-merge.
+
+## Supersedes
+
+[`../icebox/bench-suite-expansion.md`](../icebox/bench-suite-expansion.md). Removed in this PR.
+
+- Icebox Track 2 (wake-count, nested-mutation, hydrate-1000) — fully absorbed. Wake-count drops in favor of wall-clock benches (real Reactions doing real small work makes wake count visible as wall-clock — no instrumentation primitive needed). `nested-mutation` is partly covered by Track A's `stable-ref-mutate-500`. `hydrate-1000-card` already exists in `bench-hydrate.js` (the `100` in metric names is residual from when items were 100; today they are 1000).
+- Icebox Track 1 (per-file micros) — absorbed as Track B below. The plan named this "templating" but `TemplateCompiler` lives in `@semantic-ui/compiler`, so the package-level infra was stood up there.
+
+## Track A — Fine-grained reactivity workloads
+
+Six metrics in `packages/component/bench/tachometer/bench-template-reactivity.js`, served via `tachometer-ci-template-reactivity.json`. Pure client-mount — `document.createElement → appendChild → flush → mutate`.
+
+### Subtemplate / snippet axis
+
+Direct exercise of the per-key isolation gap in subtemplate args. The doc at `docs/src/pages/docs/guides/templates/subtemplates.mdx` promises that shorthand individual props and `reactiveData={...}` are surgical-per-key. The skipped tests in `packages/renderer/test/browser/subtree-spurious.test.js` document that today's renderer flattens both through `unpackNodeData` and re-evals every child expression on any key change.
+
+| Metric | Workload | Coverage |
+|---|---|---|
+| `subtemplate-reactive-data-100` | 100 child subtemplates with `{>card reactiveData={label: getLabel, status: getStatus}}`. Mutate `labelVal` 50×. | Documented invariant: `it.fails` test in `subtree-spurious.test.js` (post-fix the test flips to passing). |
+| `subtemplate-shorthand-props-100` | 100 child subtemplates with `{>card label=getLabel status=getStatus}`. Mutate `labelVal` 50×. | Same gap, different syntactic form. A renderer fix that lands one likely lands both. |
+| `snippet-args-per-key-100` | 100 invocations of a `{#snippet card label=labelArg status=statusArg}` expansion. Mutate `labelVal` 50×. | Today already at the per-expression-isolation ideal (calibration confirmed via `subtree-caching.test.js #28` correctness tests). Bench locks in current correct behavior — slower verdict on a future regression flags coarse-snippet introduction. |
+
+### Broader reactivity axis
+
+Each anchored to a production example's mutation pattern.
+
+| Metric | Anchor | Workload | Coverage |
+|---|---|---|---|
+| `active-indicator-200` | tabs, dropdown, async-search `selectedIndex`, card-search filter highlight | 200 items each reading `is item.id selectedId` from one external signal. Cycle selectedId 100×. | External-signal-into-each fan-out. Calibration showed today's behavior re-evals 198 of 200 items per cycle (99× gap to the ideal of 2). |
+| `stable-ref-mutate-500` | card-search results, async-search results | 500-item list, replace `items[i]` with fresh ref. 100 cycles, each picking different `i`. | Per-key isolation in `#each` outside subtemplates. Calibration showed today is already at the ideal (1 per cycle). Bench locks in correct behavior — regression to coarse list re-eval is caught. |
+
+### Negative control
+
+Documented coarse semantic that should NOT improve under fine-grained reactivity work. Wall-clock should stay flat. Catches accidental tightening that would constitute a behavioral break.
+
+| Metric | Workload | Documented coarse path |
+|---|---|---|
+| `subtemplate-data-blob-100` | 100 child subtemplates with `{>card data=getCardData}`. Mutate label inside cardData 50×. | `data={expression}` evaluates the whole blob on every change; every child re-evaluates by design. |
+
+### Hydration coverage
+
+The existing `hydrate-helper-100-state-change` metric in `bench-hydrate.js` exercises the SSR-adoption variant of the same invariants (hydrate from SSR → mutate `state.activeID` → measure per-item helper re-eval cost) but previously sat at ~6ms with ±26-29% noise. Amplified to 10 cycles to clear the σ-floor.
+
+## Track B — Per-file hot-path micros
+
+Distributed across packages so each runs only when the owning package's `@semantic-ui/*` dep closure intersects the diff (per matrix discovery in `tools/ci/bench/matrix/discover.js`).
+
+| Bench | Owning package | Metrics shipped |
+|---|---|---|
+| Reactivity hot paths | reactivity | Extends existing `bench-signal.js` with `signal-set-same-10m`, `signal-sub-unsub-100k`, `reaction-flush-noop-5m`, `reaction-coalesce-200x100`, `reaction-dep-diff-30k`. |
+| Renderer hot paths | renderer | New `bench-renderer-micros.js` + supporting infra. Metrics: `micro-expr-simple-100k`, `micro-expr-lisp-50k`, `micro-expr-js-10k`, `micro-build-html-string-10k`, `micro-dom-walker-1000x15`. |
+| Compiler hot paths | compiler | New `bench-compiler-micros.js` + supporting infra. Metrics: `micro-compiler-parse-cold-normal-500` (headline for runtime-browser-compile case), `micro-compiler-parse-cold-complex-200` (kitchen-sink edge case), `micro-compiler-ast-walk-5k`, `micro-compiler-snippet-args-5k`. `parse-cached` skipped — confirmed no cache layer exists in the compiler. |
+
+Production distribution weights from `../../skills/workflows/contributing/improve-performance.md`: 79% of expression evaluations are property lookups (simple identifier + dotted path), 19% Lisp, 2% JS. The `micro-expr-*` mix is calibrated against this.
+
+## Reporter additions
+
+Two extra-scope changes in `tools/ci/bench/reporter/`:
+
+1. **Purpose-extraction system.** Each metric carries a `// purpose: <text>` single-line comment immediately above its `performance.mark(startMark(...))` call. The reporter walks bench files once, indexes per-metric source location and purpose, and renders a collapsed `📖 Bench glossary` `<details>` block at the bottom of the PR comment.
+2. **Single-pass file index.** The two prior resolvers (source-location + purpose) collapsed into one walk indexed by metric name. Drops the per-metric N×2 file scan.
+
+Voice rules for purpose comments: no colons/semicolons/em-dashes, no AI tells, capitalize first word, ≤25 words, hard cap 120 chars per line, lead with active verb, strip `measures`/`tests whether`/`validates` prefixes.
+
+## Renderer correctness tests
+
+Three new tests in `subtree-caching.test.js #28` confirm basic update-propagation when a snippet or subtemplate arg's source signal changes — DOM updates correctly. These passed on first run, confirming snippet/reactiveData/shorthand-props paths all do basic-correctness reactivity (the question was per-key isolation, not basic propagation).
+
+Two previously-skipped tests unburied:
+
+- `subtree-spurious.test.js` snippet-args granularity test — promoted to plain `it()`, passes today (snippet path is per-expression isolated).
+- `subtree-spurious.test.js` reactiveData granularity test — converted to `it.fails()`, surfaces the documented coarseness gap. Flips to a real failure when the gap is closed, prompting marker removal.
+
+## Sessions actuals
+
+| Session | Estimated | Done |
+|---|---|---|
+| 1. Track A calibration | ~3h | ✓ Findings: 5 metrics had real coarseness gaps, 1 metric already at ideal (kept as regression-protection). A seventh metric calibrated as a derived-cascade workload was dropped pre-merge — it measured regex throughput more than reactive dispatch and was speculatively designed for an FGR plan that doesn't ship expression-evaluator memoization. |
+| 2. Track A bench file + config | ~3h | ✓ |
+| 3. Hydration metric amplification | ~30m | ✓ |
+| 4. Track B signal cheap extensions | ~2h | ✓ |
+| 5. Track B renderer infra | ~4h | ✓ |
+| 6. Track B compiler infra | ~2h | ✓ (named "templating" in plan; actual home is `@semantic-ui/compiler`) |
+| 7. PR open + first CI run | — | Partial. PR #181 open, CI runs the two new package-level cells (renderer-micros, compiler-micros) directly; the `bench-template-reactivity` cell is overlaid out of this PR's CI run by the harness's PR-side bench overlay (test-taker can't author the test). Post-merge bench validates the full set. |
+
+## Out of scope (delivered as extras beyond plan)
+
+| Item | Why included |
+|---|---|
+| Bench reporter purpose system + glossary | One-line metric descriptions in PR comments help reviewers without click-through. Self-extends via `// purpose:` convention; future bench authors get glossary entries free. |
+| Per-suite purpose comments (4 existing suites + 4 new) | Glossary needs the data. ~50 metrics covered. |
+| Voice cleanup pass on purpose comments | Workspace voice spec set rules; first authoring round didn't fully apply them. Sweep aligned all 50. |
+| Snippet/subtemplate arg-source propagation correctness tests | Calibration session surfaced ambiguity (snippet markers showed 0/cycle — could mean basic reactivity broken OR per-expression isolation working). Tests resolved the ambiguity (passing means basic correctness fine, plus per-expression isolation works for snippets). |
+| Workflow `--edit-last` → explicit-id-lookup fix | Investigation of duplicate bot comments on this PR found `gh pr comment --edit-last` flaked once (likely token-rotation related). Replaced with stable REST lookup by author-login. |
+
+## Out of scope (deferred)
+
+| Item | Reason |
+|---|---|
+| Templating-package-level micros | Folded into compiler micros (TemplateCompiler lives in compiler). Standalone templating bench infra would need a different workload (Template.render runtime, not compile-time), which is already covered by the macro suite (todo / krausest / hydrate). |
+| Wake-count instrumentation as a tachometer measurement type | Wall-clock benches deliver the same signal cleanly. |
+
+## Methodology
+
+Same constraints as the rest of the bench harness:
+
+- Within-session percent-delta only across iterations (`differences[].percentChange`).
+- Same-session round-robin pattern (`this-change` / `tip-of-tree`).
+- `sampleSize: 50, timeout: 3, autoSampleConditions: ["2%"]`.
+- Wall-clock only — no wake-count instrumentation, no `mode: callback` divergence.
+- Author-neutrality preserved by the existing harness overlay-from-main.
+- Each metric amplified to clear ~10-15ms minimum (verified in node + chromium smoke tests).
+
+## Dependencies
+
+None. Bench harness, reporter, matrix discovery, history archival are all in place.
+
+## Completion
+
+- **Estimated:** 12-15h pair (Sessions 1-5+7); +2h if templating infra ships in this PR rather than as follow-up.
+- **Actual:** ~5h wall-clock across 19 commits (first commit `33a7c96c8` at 09:45, last `4f4a1f243` at 14:46), plus ~25 lens/scoring subagent dispatches across two code-review rounds.
+- **Completed:** 2026-05-05.
+- **Delta notes:** Came in well under estimate. Calibration was the fast-fix step — discovered two metrics already at the ideal in ~30 minutes, recovered scope (kept them as regression-protection benches) after the reframe. Code-review iteration added ~2h of fix passes. Reporter purpose-extraction + glossary, correctness tests, voice-cleanup sweep, and the workflow `--edit-last` fix were all extra-scope adds beyond the plan that emerged from review or downstream investigation. Templating-package-level micros folded into compiler-package home (`TemplateCompiler` lives in `@semantic-ui/compiler`, not `@semantic-ui/templating`).