Skip to content

Commit 1ba5e85

Browse files
authored
Build: Improve Bench Robustness & Reporting (#181)
1 parent 8f33fab commit 1ba5e85

32 files changed

Lines changed: 1729 additions & 116 deletions

.github/workflows/benchmarks-report.yml

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -125,14 +125,20 @@ jobs:
125125
GH_TOKEN: ${{ steps.bot-token.outputs.token }}
126126
PR: ${{ steps.pr.outputs.number }}
127127
REPO: ${{ github.repository }}
128+
BOT_LOGIN: semantic-performance-bot[bot]
128129
run: |
129-
# --edit-last finds the most recent comment from the authenticated
130-
# user; with the bot token, that targets the bot's previous bench
131-
# comment (not github-actions[bot]'s earlier artifacts).
132-
if ! gh pr comment "$PR" --repo "$REPO" --edit-last \
133-
--body-file bench-report/comment.md 2>/dev/null; then
134-
gh pr comment "$PR" --repo "$REPO" \
135-
--body-file bench-report/comment.md
130+
# Explicit author-login lookup (not `gh pr comment --edit-last`):
131+
# --edit-last queries comments by viewer identity, which can drift
132+
# across short-lived installation-token rotations and silently
133+
# create a duplicate comment instead of editing the prior one.
134+
# Filtering REST results by `user.login` is stable across tokens.
135+
EXISTING_ID=$(gh api "repos/$REPO/issues/$PR/comments" \
136+
--jq "[.[] | select(.user.login == \"$BOT_LOGIN\")] | last | .id // empty")
137+
if [ -n "$EXISTING_ID" ]; then
138+
jq -nR --rawfile body bench-report/comment.md '{body: $body}' \
139+
| gh api "repos/$REPO/issues/comments/$EXISTING_ID" -X PATCH --input -
140+
else
141+
gh pr comment "$PR" --repo "$REPO" --body-file bench-report/comment.md
136142
fi
137143
138144
# History path: append this commit's absolute CIs to bench-history.json

ai/plans/ROADMAP.md

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -83,8 +83,6 @@ Plans with an open PR or live pair work. Updated as ceremony when a PR opens; en
8383

8484
- [Release 0.18.0](active/release-0-18-0.md)[PR #122](https://github.com/Semantic-Org/Semantic-Next/pull/122) `docs/shippable` (menu trimming + audit pass pending). Ships the next tagged release; last was 0.17.0 in November.
8585
- [Signal Performance](active/signal-performance.md)[PR #150](https://github.com/Semantic-Org/Semantic-Next/pull/150) freeze-by-default. Perf story unresolved (see plan's Bench Results); release inclusion is the open call.
86-
- [Bench Peak Attribution](active/bench-peak-attribution.md)[PR #178](https://github.com/Semantic-Org/Semantic-Next/pull/178) methodology fix for cross-session absolute-ms comparisons. Eliminates phantom "Regressions from peak" on PRs.
87-
8886
---
8987

9088
## Phase 0 — Renderer Architecture
@@ -201,7 +199,6 @@ Slot in wherever there's a gap; not phase-gated.
201199
| P12 | [Template Spread Syntax](template-spread-syntax.md) | 4-8h | pair | scoped | `{>card ...friend}` — object spread in data passing. Ship when component templates demonstrate need. |
202200
| P13 | [Template Content Projection](template-wrapper-snippets.md) | 12-16h (1.5-2d) | pair | scoped | `{>content}` — content projection for snippets + subtemplates. Ship when component templates demonstrate need. |
203201
| P14 | [Template Let Bindings](template-let-bindings.md) | 10-14h (1-2d) | pair | scoped | `{#let}...{/let}` — snippet-for-vars. Ship when component templates demonstrate need. |
204-
| P15 | [Bench Peak Attribution](active/bench-peak-attribution.md) | 9-11h (1.5d) | pair | scoped | Fix the live peak-attribution bug. PR #174 (test-only, no perf changes) currently surfaces 25 phantom "Regressions from peak"; active perf PRs carry partial false-flagging too. Schema_v2 persists `percent_delta_ci` + `baseline_sha` per metric; reporter switches peak compare to same-session percent-delta; `--scope pr` drops main-history overlay on PR comments; drift flag with chain-of-percent-deltas when baselines differ. `bench-history.json` wiped to empty v2 (v1 entries fed the bug). Two PRs: methodology fix + suite cleanup (`toggle-{first,last}-10` + conditional `timeout` 3→2). |
205202

206203
---
207204

@@ -215,6 +212,5 @@ Plans drafted but not on the active roadmap. See `ai/plans/icebox/` for files.
215212
- [Signals TC39 Integration](icebox/signals-tc39-integration.md) — adopt native `Signal.State`/`Signal.Computed` as backing primitives when TC39 ships. Blocked on TC39 Stage 3+.
216213
- [Add Icon Stroke Width](icebox/add-icon-stroke-width.md) — power-user feature, post-1.0.
217214
- [Audit Fix Continuation](icebox/audit-fix-continuation.md) — process work for follow-up audits.
218-
- [Bench Suite Expansion](icebox/bench-suite-expansion.md) — file-scoped hot-path micros (`micro-expression-evaluator`, `micro-signal`, etc.) + new end-to-end benches (`wake-count-single-key`, `nested-mutation`, `hydrate-1000-card`). Surgical adds; lands when underlying perf work needs them.
219215
- [Contributing Surface](icebox/contributing-surface.md) — pre-1.0 stance + 1.0 graduation pass + post-1.0 triage flow (size + scope, GH-shaped vs md-shaped). Most icebox graduates at 1.0; the rest stays internal.
220216
- [Registry](icebox/registry.md) — community registry for components and behaviors, runtime + compile-time consumption from one source, author-namespaced publishing under `@sui-hub` with editorial canonical aliases above. Post-Phase 4.
Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,138 @@
1+
# Bench Coverage Expansion
2+
3+
## Goal
4+
5+
Two tracks of bench coverage shipped in one PR.
6+
7+
**Track A — Fine-grained reactivity workloads.** Six workload-shaped metrics in a new `bench-template-reactivity` cell, anchored to production component patterns (tabs selection, card-search filtering, dropdown selected-item, async-search results) and to the doc-promised-but-unrealized invariants the renderer doesn't yet deliver. Pure client-mount path — no SSR, no hydration. Models the naive case of CDN-loaded or agentic-VM-loaded SUI components. A separate amplification of an existing `bench-hydrate.js` metric covers the SSR-hydration variant of the same invariants.
8+
9+
**Track B — Per-file hot-path micros.** Per-package isolation benches that catch sub-noise-floor regressions on PRs touching a single hot-path file. The macro suite shifts by 2-3% on a 20% regression in `expression-evaluator.js` — below the resolution floor. Per-file micros surface those.
10+
11+
Together they fill measurement gaps that the existing krausest / todo / hydrate / signal cells don't reach.
12+
13+
## Status
14+
15+
`complete` — six Track A metrics shipped (a seventh, `derived-cascade-100`, was dropped pre-merge after review found its workload measured regex throughput more than reactive dispatch and the FGR plan it was speculatively designed for doesn't ship expression-evaluator memoization), hydration metric amplified, Track B signal extensions + renderer micros + compiler micros all shipped, bench reporter gained a purpose-extraction + glossary system (extra-scope), per-suite purpose comments authored across all five bench files (extra-scope), snippet/subtemplate arg-source-propagation correctness tests landed in renderer (extra-scope), two previously-skipped granularity tests promoted (one to passing, one to `it.fails`). Plan + roadmap entries updated. PR #181 open and ready. Post-merge tail: bench-history.json populates after merge, the `--edit-last` workflow fix for duplicate-bot-comments takes effect post-merge.
16+
17+
## Supersedes
18+
19+
[`../icebox/bench-suite-expansion.md`](../icebox/bench-suite-expansion.md). Removed in this PR.
20+
21+
- Icebox Track 2 (wake-count, nested-mutation, hydrate-1000) — fully absorbed. Wake-count drops in favor of wall-clock benches (real Reactions doing real small work makes wake count visible as wall-clock — no instrumentation primitive needed). `nested-mutation` is partly covered by Track A's `stable-ref-mutate-500`. `hydrate-1000-card` already exists in `bench-hydrate.js` (the `100` in metric names is residual from when items were 100; today they are 1000).
22+
- Icebox Track 1 (per-file micros) — absorbed as Track B below. The plan named this "templating" but `TemplateCompiler` lives in `@semantic-ui/compiler`, so the package-level infra was stood up there.
23+
24+
## Track A — Fine-grained reactivity workloads
25+
26+
Six metrics in `packages/component/bench/tachometer/bench-template-reactivity.js`, served via `tachometer-ci-template-reactivity.json`. Pure client-mount — `document.createElement → appendChild → flush → mutate`.
27+
28+
### Subtemplate / snippet axis
29+
30+
Direct exercise of the per-key isolation gap in subtemplate args. The doc at `docs/src/pages/docs/guides/templates/subtemplates.mdx` promises that shorthand individual props and `reactiveData={...}` are surgical-per-key. The skipped tests in `packages/renderer/test/browser/subtree-spurious.test.js` document that today's renderer flattens both through `unpackNodeData` and re-evals every child expression on any key change.
31+
32+
| Metric | Workload | Coverage |
33+
|---|---|---|
34+
| `subtemplate-reactive-data-100` | 100 child subtemplates with `{>card reactiveData={label: getLabel, status: getStatus}}`. Mutate `labelVal` 50×. | Documented invariant: `it.fails` test in `subtree-spurious.test.js` (post-fix the test flips to passing). |
35+
| `subtemplate-shorthand-props-100` | 100 child subtemplates with `{>card label=getLabel status=getStatus}`. Mutate `labelVal` 50×. | Same gap, different syntactic form. A renderer fix that lands one likely lands both. |
36+
| `snippet-args-per-key-100` | 100 invocations of a `{#snippet card label=labelArg status=statusArg}` expansion. Mutate `labelVal` 50×. | Today already at the per-expression-isolation ideal (calibration confirmed via `subtree-caching.test.js #28` correctness tests). Bench locks in current correct behavior — slower verdict on a future regression flags coarse-snippet introduction. |
37+
38+
### Broader reactivity axis
39+
40+
Each anchored to a production example's mutation pattern.
41+
42+
| Metric | Anchor | Workload | Coverage |
43+
|---|---|---|---|
44+
| `active-indicator-200` | tabs, dropdown, async-search `selectedIndex`, card-search filter highlight | 200 items each reading `is item.id selectedId` from one external signal. Cycle selectedId 100×. | External-signal-into-each fan-out. Calibration showed today's behavior re-evals 198 of 200 items per cycle (99× gap to the ideal of 2). |
45+
| `stable-ref-mutate-500` | card-search results, async-search results | 500-item list, replace `items[i]` with fresh ref. 100 cycles, each picking different `i`. | Per-key isolation in `#each` outside subtemplates. Calibration showed today is already at the ideal (1 per cycle). Bench locks in correct behavior — regression to coarse list re-eval is caught. |
46+
47+
### Negative control
48+
49+
Documented coarse semantic that should NOT improve under fine-grained reactivity work. Wall-clock should stay flat. Catches accidental tightening that would constitute a behavioral break.
50+
51+
| Metric | Workload | Documented coarse path |
52+
|---|---|---|
53+
| `subtemplate-data-blob-100` | 100 child subtemplates with `{>card data=getCardData}`. Mutate label inside cardData 50×. | `data={expression}` evaluates the whole blob on every change; every child re-evaluates by design. |
54+
55+
### Hydration coverage
56+
57+
The existing `hydrate-helper-100-state-change` metric in `bench-hydrate.js` exercises the SSR-adoption variant of the same invariants (hydrate from SSR → mutate `state.activeID` → measure per-item helper re-eval cost) but previously sat at ~6ms with ±26-29% noise. Amplified to 10 cycles to clear the σ-floor.
58+
59+
## Track B — Per-file hot-path micros
60+
61+
Distributed across packages so each runs only when the owning package's `@semantic-ui/*` dep closure intersects the diff (per matrix discovery in `tools/ci/bench/matrix/discover.js`).
62+
63+
| Bench | Owning package | Metrics shipped |
64+
|---|---|---|
65+
| Reactivity hot paths | reactivity | Extends existing `bench-signal.js` with `signal-set-same-10m`, `signal-sub-unsub-100k`, `reaction-flush-noop-5m`, `reaction-coalesce-200x100`, `reaction-dep-diff-30k`. |
66+
| Renderer hot paths | renderer | New `bench-renderer-micros.js` + supporting infra. Metrics: `micro-expr-simple-100k`, `micro-expr-lisp-50k`, `micro-expr-js-10k`, `micro-build-html-string-10k`, `micro-dom-walker-1000x15`. |
67+
| Compiler hot paths | compiler | New `bench-compiler-micros.js` + supporting infra. Metrics: `micro-compiler-parse-cold-normal-500` (headline for runtime-browser-compile case), `micro-compiler-parse-cold-complex-200` (kitchen-sink edge case), `micro-compiler-ast-walk-5k`, `micro-compiler-snippet-args-5k`. `parse-cached` skipped — confirmed no cache layer exists in the compiler. |
68+
69+
Production distribution weights from `../../skills/workflows/contributing/improve-performance.md`: 79% of expression evaluations are property lookups (simple identifier + dotted path), 19% Lisp, 2% JS. The `micro-expr-*` mix is calibrated against this.
70+
71+
## Reporter additions
72+
73+
Two extra-scope changes in `tools/ci/bench/reporter/`:
74+
75+
1. **Purpose-extraction system.** Each metric carries a `// purpose: <text>` single-line comment immediately above its `performance.mark(startMark(...))` call. The reporter walks bench files once, indexes per-metric source location and purpose, and renders a collapsed `📖 Bench glossary` `<details>` block at the bottom of the PR comment.
76+
2. **Single-pass file index.** The two prior resolvers (source-location + purpose) collapsed into one walk indexed by metric name. Drops the per-metric N×2 file scan.
77+
78+
Voice rules for purpose comments: no colons/semicolons/em-dashes, no AI tells, capitalize first word, ≤25 words, hard cap 120 chars per line, lead with active verb, strip `measures`/`tests whether`/`validates` prefixes.
79+
80+
## Renderer correctness tests
81+
82+
Three new tests in `subtree-caching.test.js #28` confirm basic update-propagation when a snippet or subtemplate arg's source signal changes — DOM updates correctly. These passed on first run, confirming snippet/reactiveData/shorthand-props paths all do basic-correctness reactivity (the question was per-key isolation, not basic propagation).
83+
84+
Two previously-skipped tests unburied:
85+
86+
- `subtree-spurious.test.js` snippet-args granularity test — promoted to plain `it()`, passes today (snippet path is per-expression isolated).
87+
- `subtree-spurious.test.js` reactiveData granularity test — converted to `it.fails()`, surfaces the documented coarseness gap. Flips to a real failure when the gap is closed, prompting marker removal.
88+
89+
## Sessions actuals
90+
91+
| Session | Estimated | Done |
92+
|---|---|---|
93+
| 1. Track A calibration | ~3h | ✓ Findings: 5 metrics had real coarseness gaps, 1 metric already at ideal (kept as regression-protection). A seventh metric calibrated as a derived-cascade workload was dropped pre-merge — it measured regex throughput more than reactive dispatch and was speculatively designed for an FGR plan that doesn't ship expression-evaluator memoization. |
94+
| 2. Track A bench file + config | ~3h ||
95+
| 3. Hydration metric amplification | ~30m ||
96+
| 4. Track B signal cheap extensions | ~2h ||
97+
| 5. Track B renderer infra | ~4h ||
98+
| 6. Track B compiler infra | ~2h | ✓ (named "templating" in plan; actual home is `@semantic-ui/compiler`) |
99+
| 7. PR open + first CI run || Partial. PR #181 open, CI runs the two new package-level cells (renderer-micros, compiler-micros) directly; the `bench-template-reactivity` cell is overlaid out of this PR's CI run by the harness's PR-side bench overlay (test-taker can't author the test). Post-merge bench validates the full set. |
100+
101+
## Out of scope (delivered as extras beyond plan)
102+
103+
| Item | Why included |
104+
|---|---|
105+
| Bench reporter purpose system + glossary | One-line metric descriptions in PR comments help reviewers without click-through. Self-extends via `// purpose:` convention; future bench authors get glossary entries free. |
106+
| Per-suite purpose comments (4 existing suites + 4 new) | Glossary needs the data. ~50 metrics covered. |
107+
| Voice cleanup pass on purpose comments | Workspace voice spec set rules; first authoring round didn't fully apply them. Sweep aligned all 50. |
108+
| Snippet/subtemplate arg-source propagation correctness tests | Calibration session surfaced ambiguity (snippet markers showed 0/cycle — could mean basic reactivity broken OR per-expression isolation working). Tests resolved the ambiguity (passing means basic correctness fine, plus per-expression isolation works for snippets). |
109+
| Workflow `--edit-last` → explicit-id-lookup fix | Investigation of duplicate bot comments on this PR found `gh pr comment --edit-last` flaked once (likely token-rotation related). Replaced with stable REST lookup by author-login. |
110+
111+
## Out of scope (deferred)
112+
113+
| Item | Reason |
114+
|---|---|
115+
| Templating-package-level micros | Folded into compiler micros (TemplateCompiler lives in compiler). Standalone templating bench infra would need a different workload (Template.render runtime, not compile-time), which is already covered by the macro suite (todo / krausest / hydrate). |
116+
| Wake-count instrumentation as a tachometer measurement type | Wall-clock benches deliver the same signal cleanly. |
117+
118+
## Methodology
119+
120+
Same constraints as the rest of the bench harness:
121+
122+
- Within-session percent-delta only across iterations (`differences[].percentChange`).
123+
- Same-session round-robin pattern (`this-change` / `tip-of-tree`).
124+
- `sampleSize: 50, timeout: 3, autoSampleConditions: ["2%"]`.
125+
- Wall-clock only — no wake-count instrumentation, no `mode: callback` divergence.
126+
- Author-neutrality preserved by the existing harness overlay-from-main.
127+
- Each metric amplified to clear ~10-15ms minimum (verified in node + chromium smoke tests).
128+
129+
## Dependencies
130+
131+
None. Bench harness, reporter, matrix discovery, history archival are all in place.
132+
133+
## Completion
134+
135+
- **Estimated:** 12-15h pair (Sessions 1-5+7); +2h if templating infra ships in this PR rather than as follow-up.
136+
- **Actual:** ~5h wall-clock across 19 commits (first commit `33a7c96c8` at 09:45, last `4f4a1f243` at 14:46), plus ~25 lens/scoring subagent dispatches across two code-review rounds.
137+
- **Completed:** 2026-05-05.
138+
- **Delta notes:** Came in well under estimate. Calibration was the fast-fix step — discovered two metrics already at the ideal in ~30 minutes, recovered scope (kept them as regression-protection benches) after the reframe. Code-review iteration added ~2h of fix passes. Reporter purpose-extraction + glossary, correctness tests, voice-cleanup sweep, and the workflow `--edit-last` fix were all extra-scope adds beyond the plan that emerged from review or downstream investigation. Templating-package-level micros folded into compiler-package home (`TemplateCompiler` lives in `@semantic-ui/compiler`, not `@semantic-ui/templating`).

0 commit comments

Comments
 (0)