[CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs) #21435

Draft
NullVoxPopuli-ai-agent wants to merge 5 commits into emberjs:main from NullVoxPopuli-ai-agent:perf/each-item-cell-ref

@NullVoxPopuli-ai-agent
Contributor

@NullVoxPopuli-ai-agent NullVoxPopuli-ai-agent commented May 29, 2026

What

Six behavior-preserving flattenings of the Glimmer reference / tracking / tag hot paths, the machinery hit on nearly every reference read and every revalidation tick. They were found while profiling the smoke-tests/benchmark-app table benchmark, but the wins apply to all rendering, not just {{#each}}.

The recurring theme: references and tracking frames were modeled as general compute closures when the data is really just "a value (or parent + path) behind a tag." Storing that as data lets valueForRef/updateRef handle it inline — no closures, and often no tracking frame.

Changes (with isolated per-change impact)

Each number below is an A/B microbenchmark of that change alone, against the prior implementation, through the real valueForRef/updateRef/track (DEBUG=false):

  1. Cell reference for {{#each}} block params — item value + index were compute refs (2 closures + a track() frame per read). A cell stores its value behind a fixed tag; no closures, no frame. → 2.3× faster, ~65% less memory (per 1000 items, create+read and update+read).
  2. Inlined track() in valueForRef — stop allocating a thunk closure on every recompute (the hottest function in the VM). → ~10% faster, ~33% less garbage per recompute (63.2→57.0 µs/1000, 282→188 kb).
  3. Flattened {{#each}} key resolution — resolve the strategy once (not per diff); @index/@key skip duplicate-key dedup entirely; per-pass seen is a plain Map. → index-keyed iteration ~2× faster (48.9→23.0 µs/1000, dedup pass skipped).
  4. Property reference for childRefFor — every {{a.b}} was a compute ref with 2 closures holding (parent, path); now stored as data, read/written inline. → ~14% faster, ~25% less alloc (72.2→62.4 µs / 633→477 kb per 1000).
  5. Pooled trackers + lazy consumed-tag Set: beginTrackFrame no longer allocates a Tracker + Set per frame (0/1-tag frames are the norm). → common frame 2 allocations → ~0 (0.10 b/iter for the 0-tag case).
  6. Fast-path tag [COMPUTE] — subtag-less tags (the majority) return revision directly, skipping the combinator-memoization machinery. → ~17% faster per validate (4.71→3.90 µs/1000).

End-to-end (combined)

tracerbench, control = this branch's base. The changes compound, so this is the combined effect on the revalidation-heavy phases (where the tracking/tag work lands):

| Phase | Δ |
| --- | --- |
| selectFirstRow1 | −38.9% [−44.3 … −33.2] |
| selectSecondRow1 | −14.1% [−19.3 … −9.0] |
| swapRows2 / swapRows1 | −8.0% / −5.9% |

Create/clear phases are DOM-dominated, so they stay within noise. No significant regressions.

Testing

Full browser suite green at every step: 9340 tests, 0 fail. CI green.

Each `{{#each}}` item binds two block params — the item value and its
index — and both were created as full compute references via
`createIteratorItemRef`. That meant, per item:

  - a `ReferenceImpl` + a dirtyable tag, plus *two* closures (the
    `compute` getter and the `update` setter), and
  - on every read, `valueForRef` took the generic compute path and opened
    a `track()` frame (a `Tracker` + `Set` allocation) purely to
    re-discover a tag that never changes.

For a 10k-row table that is 20k references and 20k tracking frames per
render pass (create/clear/append/update/swap all hit this), all to model
a value that is just "a stored value behind one tag".

This introduces a dedicated `Cell` reference type. A cell stores its
value directly on the reference behind a fixed tag, so:

  - `valueForRef` reads the stored value and re-snapshots the tag without
    opening a tracking frame (there are no dependencies to discover), and
  - `updateRef` mutates the value inline with the same equality gate as
    before — no `compute`/`update` closures are allocated at all.

Behavior is identical: same tag consumed on read, same equality-gated
dirty on update. `isUpdatableRef` reports cells as updatable, and
`createDebugAliasRef` no longer inherits the `Cell` type (a debug alias
is a genuine compute reference).
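
A minimal sketch of the cell idea described above, as a toy model rather than the actual change: every name below (`Tag`, `consume`, `createCell`, `framesOpened`) is a hypothetical stand-in, not the real `@glimmer/reference` or `@glimmer/validator` API. It only illustrates why a value stored behind a fixed tag needs neither closures nor a tracking frame.

```typescript
// Toy model: every name here is a hypothetical stand-in, not real Glimmer code.
let clock = 1;

class Tag {
  revision = clock;
  dirty(): void {
    this.revision = ++clock;
  }
}

// Tags consumed by the innermost open tracking frame (null when none is open).
let currentFrame: Set<Tag> | null = null;
let framesOpened = 0;

function consume(tag: Tag): void {
  if (currentFrame !== null) currentFrame.add(tag);
}

interface ComputeRef<T> {
  kind: 'compute';
  compute: () => T;
}

interface CellRef<T> {
  kind: 'cell';
  tag: Tag; // fixed for the lifetime of the reference
  value: T; // stored directly on the reference
}

type Ref<T> = ComputeRef<T> | CellRef<T>;

function createCell<T>(value: T): CellRef<T> {
  return { kind: 'cell', tag: new Tag(), value };
}

function valueForRef<T>(ref: Ref<T>): T {
  if (ref.kind === 'cell') {
    // Nothing to discover: consume the fixed tag and return the stored value.
    consume(ref.tag);
    return ref.value;
  }
  // Generic path: open a frame to discover which tags the getter consumes.
  const parent = currentFrame;
  const frame = new Set<Tag>();
  currentFrame = frame;
  framesOpened++;
  try {
    return ref.compute();
  } finally {
    currentFrame = parent;
    if (parent !== null) frame.forEach((tag) => parent.add(tag));
  }
}

function updateRef<T>(ref: CellRef<T>, value: T): void {
  // Equality gate: only dirty the tag when the value actually changes.
  if (ref.value !== value) {
    ref.value = value;
    ref.tag.dirty();
  }
}
```

In this model, reading a cell never increments `framesOpened`, while reading a compute reference that wraps the same cell opens exactly one frame.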

Microbench (real `valueForRef`/`updateRef`, 1000 items, prod build):

  initial render (create+read)  198µs/698kb -> 86µs/261kb  (2.3x, -63% mem)
  re-render      (update+read)  185µs/417kb -> 79µs/137kb  (2.3x, -67% mem)
  allocation only                31µs/320kb -> 22µs/~4kb   (1.4x, ~0 garbage)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NullVoxPopuli NullVoxPopuli marked this pull request as draft May 29, 2026 15:31
@NullVoxPopuli
Contributor

our each def has a problem, but I'm not convinced this is the solution.

running the bench locally shows not much improvement:

duration phase estimated improvement -21ms [-44ms to -3ms] OR -1.22% [-2.55% to -0.17%]
renderEnd phase no difference [0ms to 0ms]
render1000Items1End phase no difference [-2ms to 1ms]
clearItems1End phase no difference [-2ms to 1ms]
render1000Items2End phase no difference [-3ms to 3ms]
clearItems2End phase no difference [-1ms to 0ms]
render10000Items1End phase no difference [-10ms to 2ms]
clearManyItems1End phase estimated regression +2ms [1ms to 3ms] OR +1.29% [0.65% to 1.95%]
render10000Items2End phase no difference [-20ms to 14ms]
clearManyItems2End phase estimated improvement -3ms [-5ms to -1ms] OR -6.27% [-12.51% to -2.44%]
render1000Items3End phase no difference [0ms to 2ms]
append1000Items1End phase no difference [-2ms to 3ms]
append1000Items2End phase no difference [-2ms to 1ms]
updateEvery10thItem1End phase no difference [-2ms to 2ms]
updateEvery10thItem2End phase no difference [-1ms to 2ms]
selectFirstRow1End phase no difference [-1ms to 1ms]
selectSecondRow1End phase no difference [-1ms to 1ms]
removeFirstRow1End phase no difference [-1ms to 1ms]
removeSecondRow1End phase no difference [-1ms to 1ms]
swapRows1End phase no difference [-1ms to 0ms]
swapRows2End phase no difference [-2ms to 0ms]
clearItems4End phase no difference [-1ms to 0ms]
paint phase no difference [-2ms to 0ms]

I have a hunch we'll need to ship fragment support first so that each can be sort of "off-canvas"'d

@NullVoxPopuli-ai-agent NullVoxPopuli-ai-agent changed the title perf(reference): make {{#each}} item params cheap "cell" references [CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs) May 29, 2026
NullVoxPopuli-ai-agent and others added 4 commits May 31, 2026 21:51
Two more extraneous layers in the reference/iteration hot paths, removed:

1. `valueForRef` recompute went through `track(thunk)`, which allocates a
   closure on *every* (re)compute. This is the single hottest function in
   the VM — every reference read that needs evaluation passes through it
   (all refs on initial render, and again on each invalidation). Inlining
   `beginTrackFrame()`/`endTrackFrame()` drops that per-read allocation.

   Microbench (1000 recompute frames): 63.2µs -> 57.0µs (~10%) and
   282kb -> 188kb (~33% less garbage).

2. `{{#each}}` key derivation:
   - `makeKeyFor` was re-resolved on every diff and wrapped *every*
     strategy — including `@index`/`@key`, whose keys are unique by
     construction — in the duplicate-key dedup machinery. The strategy is
     now resolved once when the iterator ref is created, and index keys
     skip dedup entirely.
   - The per-pass `seen` set used `WeakMapWithPrimitives` (lazy-getter +
     object/primitive dispatch on every get/set). Since it lives only for
     one synchronous pass, a plain `Map` is both simpler and faster; the
     weak-keyed map is kept only for the long-lived global `IDENTITIES`.

   Microbench (1000-item iteration): `@index` 23.0µs vs `@identity`
   48.9µs — index keys no longer pay the dedup cost they used to.

Behavior is unchanged: same keys produced, same duplicate-key semantics,
same tag consumption. Verified headless in Chrome — each (571), iterable
(24), tracked (242), Updating (175), Helpers (1173), Components (328), fn
(36) all pass.
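
To illustrate the key-derivation change, here is a rough sketch under stated assumptions. `resolveKeyFor` and `makeKeyPass` are hypothetical names, only `@index`, `@identity`, and a plain property key are modeled, and duplicates are disambiguated with a throwaway wrapper object instead of the stable identities the real code keeps in `IDENTITIES`.

```typescript
// Toy model with hypothetical names; not the actual Glimmer iterator code.
type KeyFor = (item: unknown, index: number) => unknown;

interface KeyStrategy {
  keyFor: KeyFor;
  unique: boolean; // true when keys are unique by construction
}

function resolveKeyFor(key: string): KeyStrategy {
  switch (key) {
    case '@index':
      return { keyFor: (_item, index) => index, unique: true };
    case '@identity':
      return { keyFor: (item) => item, unique: false };
    default:
      return {
        keyFor: (item) => (item as Record<string, unknown> | null | undefined)?.[key],
        unique: false,
      };
  }
}

function makeKeyPass(key: string): (items: readonly unknown[]) => unknown[] {
  // Resolved once, when the iterator is created, not on every diff.
  const { keyFor, unique } = resolveKeyFor(key);

  return (items) => {
    // Index keys cannot collide, so the dedup pass is skipped entirely.
    if (unique) return items.map(keyFor);

    // Lives only for this synchronous pass, so a plain Map is enough.
    const seen = new Map<unknown, number>();
    return items.map((item, index) => {
      const k = keyFor(item, index);
      const count = seen.get(k) ?? 0;
      seen.set(k, count + 1);
      // Simplification: wrap repeats so every emitted key is distinct.
      return count === 0 ? k : { duplicateOf: k, occurrence: count };
    });
  };
}
```

For example, `makeKeyPass('@index')(['a', 'a'])` yields the keys `0` and `1` without ever touching the `seen` map.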

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every `{{a.b}}` path access compiled to a compute reference holding two
closures — a getter (`getProp(valueForRef(parent), path)`) and a setter
(`setProp(...)`) — that captured nothing but `(parent, path)`. That is two
closure allocations per property reference, on a path hit by essentially
every template (`{{this.foo}}`, `{{row.id}}`, `{{row.label.current}}`, …).

Add a `Property` reference type that stores `parent` + `path` as plain
fields and is read/written inline by `valueForRef`/`updateRef` (the same
approach as the `Cell` type used for `{{#each}}` block params). No closures
are allocated; reads still open a tracking frame, since `getProp` consumes
dynamic tags. `isUpdatableRef` reports Property refs as updatable, and
`createDebugAliasRef` no longer inherits the Property type.
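
As a hedged illustration of "parent + path stored as data", the sketch below models a property reference with plain fields. The types and the `isObject` helper are hypothetical, and tracking is deliberately left out: the real read still opens a tracking frame because `getProp` consumes dynamic tags.

```typescript
// Toy model with hypothetical names; tag tracking is omitted on purpose.
interface ConstRef {
  kind: 'const';
  value: unknown;
}

interface PropertyRef {
  kind: 'property';
  parent: Ref; // plain field instead of a captured closure variable
  path: string; // plain field instead of a captured closure variable
}

type Ref = ConstRef | PropertyRef;

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

function childRefFor(parent: Ref, path: string): PropertyRef {
  // One object, zero closures.
  return { kind: 'property', parent, path };
}

function valueForRef(ref: Ref): unknown {
  if (ref.kind === 'const') return ref.value;
  // Read inline: resolve the parent, then look up the path on it.
  const parent = valueForRef(ref.parent);
  return isObject(parent) ? parent[ref.path] : undefined;
}

function updateRef(ref: PropertyRef, value: unknown): void {
  // Write inline through the same parent + path data.
  const parent = valueForRef(ref.parent);
  if (isObject(parent)) parent[ref.path] = value;
}
```

A path such as `{{row.label.current}}` then becomes two nested `childRefFor` calls, each holding only a parent reference and a string.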

Microbench (1000 childRefFor calls): 72.2µs/633kb -> 62.4µs/477kb
(~14% faster, ~25% less allocation).

Also fixes a throw-semantics bug introduced when `track()` was inlined into
`valueForRef`: committing `ref.tag` inside the `finally` updated the tag even
when the compute threw, leaving `tag` and `lastRevision` inconsistent. The
new tag/revision are now committed only on success (the frame is still ended
in `finally` to keep the tracking stack balanced), matching the original
`track()` behavior. This restores correct handling of throwing getters —
caught by the `debug render tree: emberish curly components` test.
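
One possible shape of the fix, sketched with hypothetical names and a deliberately simplified frame stack: the frame is always ended in `finally`, but the reference's tag and revision are only committed when the compute succeeded. Using a `succeeded` flag is an assumption of this sketch, not necessarily how the commit implements it.

```typescript
// Toy model with hypothetical names; not the actual Glimmer source.
interface Tag {
  revision: number;
}

const frames: Tag[][] = [];

function beginTrackFrame(): void {
  frames.push([]);
}

function consumeTag(tag: Tag): void {
  const top = frames[frames.length - 1];
  if (top !== undefined) top.push(tag);
}

// Pops the frame and combines its tags (here simply the max revision).
function endTrackFrame(): Tag {
  const tags = frames.pop();
  if (tags === undefined) throw new Error('unbalanced track frame');
  return { revision: tags.reduce((max, tag) => Math.max(max, tag.revision), 0) };
}

interface ComputeRef<T> {
  compute: () => T;
  tag: Tag | null;
  lastRevision: number;
  lastValue: T | undefined;
}

function valueForRef<T>(ref: ComputeRef<T>): T {
  beginTrackFrame();
  let succeeded = false;
  try {
    const value = ref.compute();
    succeeded = true;
    ref.lastValue = value;
    return value;
  } finally {
    // Always end the frame so the tracking stack stays balanced...
    const tag = endTrackFrame();
    // ...but commit tag and revision together, and only on success.
    if (succeeded) {
      ref.tag = tag;
      ref.lastRevision = tag.revision;
    }
  }
}
```

With this shape, a throwing getter leaves `tag` and `lastRevision` exactly as they were before the read, so they cannot drift apart.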

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`beginTrackFrame` allocated a `new Tracker()` and the Tracker allocated a
`new Set<Tag>()` — two objects per frame — on *every* reference recompute
and every cache group, every revalidation. The overwhelming majority of
frames consume zero or one tag.

- The Tracker now holds the first consumed tag in a field and allocates the
  `Set` only when a second, distinct tag arrives. 0/1-tag frames never touch
  a Set (and still dedupe / combine correctly).
- Trackers are pooled on a LIFO freelist. Frames are strictly nested and a
  tracker is dead the instant `combine()` runs in `endTrackFrame`, so it can
  be reset and reused by the next `beginTrackFrame`.

Net: the common tracking frame now allocates ~nothing. Microbench: a
frame that opens, consumes one tag, and closes drops from two object
allocations to ~0 b/iter (measured 0.10 b for the 0-tag case).
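
The two ideas above can be sketched roughly as follows. This is a toy model with hypothetical names (`first`, `rest`, `pool`, `allocated`), and `combine()` is reduced to "max revision" as a stand-in for the real tag combination.

```typescript
// Toy model with hypothetical names; not the actual Glimmer tracker.
interface Tag {
  revision: number;
}

class Tracker {
  private first: Tag | null = null; // the common 0/1-tag case lives here
  private rest: Set<Tag> | null = null; // allocated only for a second distinct tag

  add(tag: Tag): void {
    if (this.first === null) {
      this.first = tag;
      return;
    }
    if (tag === this.first) return; // still dedupes without a Set
    if (this.rest === null) this.rest = new Set<Tag>();
    this.rest.add(tag);
  }

  get usesSet(): boolean {
    return this.rest !== null;
  }

  // Stand-in for combine(): the max revision of all consumed tags.
  combine(): number {
    let max = this.first === null ? 0 : this.first.revision;
    if (this.rest !== null) {
      this.rest.forEach((tag) => {
        max = Math.max(max, tag.revision);
      });
    }
    return max;
  }

  reset(): void {
    this.first = null;
    this.rest = null;
  }
}

const pool: Tracker[] = []; // LIFO freelist
const stack: Tracker[] = []; // currently open frames, strictly nested
let allocated = 0;

function beginTrackFrame(): void {
  let tracker = pool.pop();
  if (tracker === undefined) {
    tracker = new Tracker();
    allocated++;
  }
  stack.push(tracker);
}

function consumeTag(tag: Tag): void {
  const top = stack[stack.length - 1];
  if (top !== undefined) top.add(tag);
}

function endTrackFrame(): number {
  const tracker = stack.pop();
  if (tracker === undefined) throw new Error('unbalanced track frame');
  const revision = tracker.combine();
  // The tracker is dead once combined, so it can be reset and reused.
  tracker.reset();
  pool.push(tracker);
  return revision;
}
```

In this model, any number of sequential frames reuse a single `Tracker`, and new trackers are only allocated when the nesting depth exceeds what the pool currently holds.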

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`MonomorphicTagImpl[COMPUTE]` is called by `validateTag`/`valueForTag` on
every reference read. For a tag with no subtag — property tags, cell tags,
plain dirtyable/updatable tags, i.e. the overwhelming majority — the result
is always just `revision` (kept current by `dirtyTag`). The
`lastChecked`/`isUpdating`/cycle-guard/`try-finally` machinery exists only to
memoize subtag recursion, so it is pure overhead for these tags.

Return `this.revision` directly when `subtag === null`. The combinator path
is unchanged (it now reuses the already-read `subtag`).
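
A rough sketch of the fast path, assuming a simplified tag with a single optional subtag. The class below is a hypothetical stand-in (a plain `compute()` method instead of the `[COMPUTE]` symbol, and a module-level `clock`), meant only to show where the early return sits relative to the memoization machinery.

```typescript
// Toy model with hypothetical names; not the real MonomorphicTagImpl.
let clock = 1;

class MonomorphicTag {
  revision = clock;
  subtag: MonomorphicTag | null = null;
  private lastChecked = 0;
  private lastValue = 0;
  private isUpdating = false;

  compute(): number {
    const subtag = this.subtag;

    // Fast path: no subtag means the answer is always just `revision`,
    // so there is nothing to memoize and no cycle to guard against.
    if (subtag === null) return this.revision;

    // Combinator path: memoize the subtag recursion per clock tick.
    if (this.lastChecked !== clock) {
      if (this.isUpdating) throw new Error('cycle detected while computing tag');
      this.isUpdating = true;
      this.lastChecked = clock;
      try {
        this.lastValue = Math.max(this.revision, subtag.compute());
      } finally {
        this.isUpdating = false;
      }
    }
    return this.lastValue;
  }
}

function dirtyTag(tag: MonomorphicTag): void {
  // Keeps `revision` current, which is what makes the fast path valid.
  tag.revision = ++clock;
}
```

The subtag-less read performs no field writes and enters no `try`/`finally`, while a tag that does have a subtag still takes the memoized path and reuses the `subtag` value read at the top.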

Microbench (1000 subtag-less [COMPUTE]s during a revalidation pass):
~4.71µs -> ~3.90µs (~17%), and no try/finally or field writes on the read.

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>