
perf(bb): WebGPU MSM tree-reduce SMVP — parallel preamble + ebid timing fix #23349

Draft
AztecBot wants to merge 19 commits into zw/msm-webgpu-mont-mul-bench from cb/b7178f5b65e7

Conversation


AztecBot commented May 17, 2026

Replaces the round-loop in smvp_batch_affine_gpu with a tree-reduce
SMVP that maximises GPU thread utilization across the bucket accumulation
phase. Branch is based on zw/msm-webgpu-mont-mul-bench.

Headline result (BrowserStack Apple M2 · Chrome 148)

Standalone bench bench-smvp-tree at logN=16 heavy-skew (65K entries,
512 buckets, one bucket holds ~half the entries):

  • Phase 1 + 15 Phase 2 layers = 27.3 ms total
  • Bit-for-bit match against CPU full-reduce reference (512/512 buckets)

On SwiftShader (CPU-emulated WebGPU) the same workload drops from the
v2 baseline's 23 s to 234 ms — about 100× — because the
parallel preamble (see below) saturates per-WG work instead of leaving
63 of 64 threads idle.

v3 parallel preamble

The v2 Phase 1/2 shaders ran the per-WG pair-detection state machine
on thread 0 alone — 1024 sequential operations while the other 63
threads idled at workgroupBarrier. That was a 64× thread-utilisation
loss on the per-WG critical path.

v3 rewrites the preamble cooperatively:

Step 1: each thread loads PER_THREAD_ENTRIES = MAX_SLICE_ENTRIES/TPB
        buckets from its chunk and computes "last break position"
        locally.
Step 2: TPB-wide Hillis-Steele max-scan reconstructs pos_in_run for
        every entry (log2(TPB) stages).
Step 3: each thread determines emit / pair flags from pos_in_run
        parity and successor-bucket equality.
Step 4: TPB-wide prefix-sum of per-thread emit + pair counts assigns
        raw_slot and pair_rank ranges per thread.
Steps 5-6: each thread writes its pair_idx_a/b, rank_to_raw, and
        prev_raw_for_pair entries from its assigned ranges.
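Steps 1-2 can be modeled single-threaded. An illustrative TypeScript sketch (function names are mine, not the shader's identifiers): each entry's pos_in_run is its index minus the last break position at or before it, recovered by an inclusive max-scan.

```typescript
// Hillis-Steele inclusive max-scan: log2(n) stages, each combining with
// the element `stride` positions back. A serial model of the TPB-wide
// workgroup scan; names are illustrative, not the WGSL identifiers.
function hillisSteeleMaxScan(vals: number[]): number[] {
  let cur = vals.slice();
  for (let stride = 1; stride < cur.length; stride *= 2) {
    const next = cur.slice();
    for (let i = stride; i < cur.length; i++) {
      next[i] = Math.max(cur[i], cur[i - stride]);
    }
    cur = next;
  }
  return cur;
}

// pos_in_run for entry i = i - (position of the last bucket break <= i).
function posInRun(bucketIds: number[]): number[] {
  const localBreak = bucketIds.map((b, i) =>
    i === 0 || bucketIds[i - 1] !== b ? i : -1
  );
  const lastBreak = hillisSteeleMaxScan(localBreak);
  return bucketIds.map((_, i) => i - lastBreak[i]);
}
```

For example, bucket ids [5, 5, 5, 7, 7] give pos_in_run = [0, 1, 2, 0, 1], from which Step 3's emit/pair flags follow by parity.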

The new pair schedule is identical to the v2 greedy state machine (same
parity-based pairing within each contiguous same-bucket run, fresh
open=None per slice). Phase A is also tightened to read only x (y is
not needed for dx = Q.x - P.x), halving Phase A's point-data bandwidth.

ebid timing fix

The tree-reduce path's smvp_tree_entry_bucket_id (ebid) kernel
previously dispatched in its own command encoder, which was submitted
before the caller's commandEncoder ran transpose. ebid read
all_csc_col_ptr_sb from its current GPU state — which still held the
prior MSM call's CSR. For warm-context benchmarks with deterministic
scalars the data happened to match, but with anything else it produced
silently wrong bucket assignments downstream.

Fix: record ebid into the caller's commandEncoder (after transpose +
ba_init), finish + submit it so the GPU runs transpose through ebid
before runTreeReduce reads back entry_bucket_id, then swap to a fresh
commandEncoder for scatter + finalize. The caller continues recording
BPR onto the new encoder. Required signature change: smvp_batch_affine_gpu
takes a commandEncoderRef wrapper so it can mutate the encoder
mid-call; msm.ts re-binds its local commandEncoder after the call
returns. Stock path is unaffected.
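The encoder-swap contract can be sketched abstractly. A hypothetical TypeScript model (the type names are stand-ins, not the repo's actual GPU interfaces) of why a mutable wrapper is needed: the callee must be able to replace the encoder the caller holds.

```typescript
// One-field wrapper: callee can submit the caller's encoder and swap in
// a fresh one; the caller sees the new encoder through the shared ref.
interface CommandEncoderRef<E> { encoder: E }

// Minimal device surface for the sketch (not the WebGPU API proper).
interface DeviceLike<E> {
  createCommandEncoder(): E;
  submit(encoder: E): void;
}

function withEncoderSwap<E>(
  device: DeviceLike<E>,
  ref: CommandEncoderRef<E>,
  recordEbid: (enc: E) => void,
  readBack: () => void,
): void {
  recordEbid(ref.encoder);      // ebid recorded after transpose + ba_init
  device.submit(ref.encoder);   // GPU runs transpose through ebid in order
  readBack();                   // now safe to read entry_bucket_id
  ref.encoder = device.createCommandEncoder(); // caller records BPR here
}
```

Passing the encoder by value instead of through the ref would leave the caller recording onto an already-submitted encoder.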

With the fix, production-integration SMVP output (per-subtask
running_x fingerprint, active-only) matches stock bit-for-bit on
14 of 18 subtasks at logN=16 with seeded scalars. The remaining
4 subtasks (2, 4, 6, 17) still fail to match bit-for-bit on M2
production runs — likely a separate Phase 2 cross-slice edge case to
be debugged next.

What's in this PR

  • smvp_tree_phase1/2.template.wgsl — v3 parallel preamble shaders
  • smvp_tree_entry_bucket_id.template.wgsl — CSR → bucket-id GPU kernel
  • smvp_tree.ts — host orchestrator (Phase 1 + iterated Phase 2)
  • smvp_tree_scatter.template.wgsl — output → finalize pipeline bridge
  • batch_affine.ts — integration: use_tree_reduce=true swap, ebid
    timing fix via commandEncoderRef
  • msm.ts — commandEncoderRef plumbing for the new SMVP entrypoint
  • bench-smvp-tree.ts + run-bench-smvp-tree.mjs — standalone bench
    (extended to multi-subtask + GPU ebid mode for production-shape testing)
  • run-msm-page.mjs — dev-page autorun driver for cross-checking stock
    vs tree on BrowserStack via the existing index.html
  • Step 0 infrastructure (BrowserStack runner + Vite results middleware)
    from the original PR is retained — the rest of the work stacks on it.

Test plan

  • Standalone bench (bench-smvp-tree) on SwiftShader: uniform
    60/6, sparse 8K/8000, heavy-skew 65K/512, sparse 16K/16383,
    multi-subtask 4K/4000×4, production-scale 65K×2-subtask uniform.
    All match CPU reference bit-for-bit (off-curve carry uses the same
    tree-reduce parenthesization as GPU).
  • Standalone bench (bench-smvp-tree) on BrowserStack Apple M2
    heavy-skew 65K/512: 27.3 ms total, 512/512 buckets match CPU
    reference bit-for-bit.
  • GPU entry_bucket_id validated bit-for-bit against host
    computation across multi-subtask scales.
  • Production integration (via use_tree_reduce=1) on BS Apple M2:
    14/18 subtasks match stock SMVP per-subtask fingerprint after the
    ebid timing fix. Remaining 4 subtasks under investigation.
  • Full end-to-end MSM cross-check (WebGPU vs WASM) at logN=16:
    blocked on the remaining 4-subtask drift; track via the included
    debug-readback gate window.__msm_debug_after_smvp.

Created by claudebox · group: slackbot

Adds a remote-device bench loop for the MSM-webgpu dev pages so the
tree-reduce work can validate against real WebGPU hardware (Apple M2,
Snapdragon 8 Elite, Tensor G4) from a workstation without a local GPU.

- vite.config.ts: results/progress POST endpoints write JSONL to files
  named by MSM_WEBGPU_RESULTS_FILE / MSM_WEBGPU_PROGRESS_FILE; allow
  .trycloudflare.com so the dev server is reachable via Cloudflare
  Quick Tunnel.
- results_post.ts: tiny in-page client used by bench/sanity pages to
  POST progress + final-state payloads (no keepalive — the page is
  alive when the bench completes).
- bench-batch-affine.ts: post per-batch progress and a terminal
  done/error row.
- scripts/run-browserstack.mjs: spawn vite + cloudflared, drive a BS
  worker through the REST API, watchdog-tail the JSONL with
  first-progress / stall / deadline budgets.
- scripts/bs-targets.mjs: macOS Sequoia Chrome, S25 Ultra, Pixel 9
  Pro XL presets (WebGPU stable). iPhone 15 Pro listed but flagged
  as needs-iOS-26-or-newer.

Validated against macOS Sequoia Chrome 148 (Apple M2, hc=8) on
?total=8192&sizes=64,256,1024:
  B=64  ns/pair=305.2  median=2.500ms
  B=256 ns/pair=146.5  median=1.200ms
  B=1024 ns/pair=219.7 median=1.800ms
AztecBot added the claudebox label (Owned by claudebox; it can push to this PR) on May 17, 2026
AztecBot added 18 commits May 17, 2026 12:06
Implements smvp_tree_partition.ts: the host computes per-WG slice
boundaries by binary search on bucketStart[], no GPU pre-pass. Uses
the analytical identity running_adds(i) = i - bucket_idx(i) from
msm-tree-reduce.md.

Documents a design ambiguity the plan didn't call out: the identity
under-counts when bucketStart contains empty buckets (bucket_idx
jumps faster than the entry count grows). Resolved by requiring
compacted input; compactBucketStart() + assertCompact() do the
one-pass cleanup and a side activeBucketIds[] map carries the
original bucket index for kernels that tag partials.
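Under the compacted-input requirement, the identity and its binary-search companion look roughly like this (hypothetical TypeScript sketch; names echo the exports but this is not the repo code):

```typescript
// bucket_idx(i): largest b with bucketStart[b] <= i. Assumes compacted
// bucketStart (strictly increasing, no empty buckets) per the text above.
function bucketIdx(bucketStart: number[], i: number): number {
  let lo = 0, hi = bucketStart.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (bucketStart[mid] <= i) lo = mid; else hi = mid - 1;
  }
  return lo;
}

// running_adds(i) = i - bucket_idx(i): entries 0..i span bucket_idx(i)+1
// buckets, and a bucket of population p needs p-1 pair-adds, so the
// prefix accounts for (i+1) - (bucket_idx(i)+1) adds.
function runningAdds(bucketStart: number[], i: number): number {
  return i - bucketIdx(bucketStart, i);
}
```

E.g. with bucketStart = [0, 3, 5], entries 0-4 cover populations 3 and 2, so runningAdds(..., 4) = 3 (two adds in the first bucket, one in the second).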

Exports:
  - computeTotalAdds, bucketIdx, runningAdds, findAddsBoundary
  - compactBucketStart, assertCompact
  - buildSliceLayout(bucketStart, numWgs) -> SliceLayout
    { sliceStart, outputCount, outputOffset, totalAdds }

24 Jest tests pass — including the pair-detection brute-force walk
that catches the empty-bucket regression, the heavy-bucket-skew case
(7+ of 8 WGs fall inside a single 10k-population bucket), and the
pathological totalAdds < numWgs case.

No GPU code touched.
…validated)

Phase 1 of the tree-reduce SMVP: pair detection + cooperative batch-
affine + per-bucket-tagged write-out, one workgroup per slice.

Files:
- src/msm_webgpu/wgsl/cuzk/smvp_tree_phase1.template.wgsl — the kernel.
  Thread-0 serial pair-detection preamble fills a workgroup-shared
  pair_list (packed PAIR + UNPAIRED entries in slice walk order, which
  is already bucket-sorted so no reorder postlude is needed). Phase
  A/B/C/D batch-affine pattern from bench_batch_affine.template.wgsl,
  with rank-indexed chunks over the PAIR sub-stream so a single
  fr_inv_by_a amortises across the WG. UNPAIRED entries get a final
  cooperative copy pass with sign-flip. Loop bounds all `const`
  (MAX_PAIRS = MAX_SLICE_ENTRIES baked at compile time; v0 uses 128 to
  keep workgroup memory comfortable).

- src/msm_webgpu/cuzk/shader_manager.ts — gen_smvp_tree_phase1_shader
  generator + import wiring.

- dev/msm-webgpu/bench-smvp-tree-phase1.{html,ts} — standalone bench
  page with a CPU reference. The reference walks the slice with the
  same paired/unpaired state machine and computes Mont-form affine
  adds via BigInt mod-inverse; correctness is checked bit-for-bit
  against the GPU output.

Status: structure-complete but NOT yet correctness-validated on
hardware. The BS macOS Chrome 148 run hangs on the page before the
first log call lands (the previous BS run on the same tunnel for
bench-batch-affine worked fine, so the issue is page-specific not
infrastructure). Likely candidates: an early-eval import side effect
in smvp_tree_partition.ts, the buildSynthetic randomBelow loop
generating off the main thread, or a Mont-form-conversion stall.
Worth investigating with browser console access; the BS screenshot
API doesn't surface uncaught errors.

Documents a design decision in the shader header: Phase 1 does NOT
collapse same-bucket pair results sequentially into a single per-
bucket partial inside the slice (the plan's "merge consecutive same-
bucket results into running sum" wording). Sequential merging would
break batch-affine amortisation and would need (pop-1) sequential
adds per heavy bucket. Instead Phase 1 halves per bucket (ceil(p/2)
outputs per bucket per slice), letting the recursive Phase 2 dispatch
do the rest of the reduction in log layers.

The plan's wg_output_count[k] = "buckets touched" formula is
overridden here by the per-slice CPU pair-detection walk that
computes the actual output count.
The window.error / unhandledrejection listeners and skip_gpu URL flag
were added to narrow down a BS-side hang in the phase1 bench page;
they didn't surface the underlying issue and have been removed. Page
remains structurally the same as bench-batch-affine.ts plus the
buildSliceLayout import and the phase1-specific synthetic-data
generation + CPU reference.
Phase 1 of the tree-reduce SMVP now passes correctness on local
Chromium WebGPU (SwiftShader): 20/20 outputs match the CPU reference
bit-for-bit on the small-N smoke (num_wgs=2, slice_entries=16).

Three real bugs found and fixed by getting local WebGPU into the
debug loop (via Playwright + chrome-headless-shell, no GPU on the
dev container so SwiftShader is used):

1. randomBelow consumed only the LOW BYTE of each rng() output. For
   the 32-bit LCG the low 8 bits cycle every 256 outputs, so a 32-byte
   randomBelow draw cycles every 8 calls — fatal when the caller
   builds a Set of distinct values. Fixed to consume the full 32 bits.
   Latent bug in bench-batch-affine.ts too; harmless there because the
   only check is `pxMont !== qxMont` on adjacent calls.

2. WGSL `get_p()` redeclared in smvp_tree_phase1.template.wgsl.
   Already provided by the `montgomery_product_funcs` partial.
   Removed the local definition.

3. Shader needs 10 storage buffers per stage; WebGPU's default cap is
   8. Adapter actually exposes 10+. get_device now requests the
   adapter max for `maxStorageBuffersPerShaderStage` alongside
   `maxComputeWorkgroupStorageSize`.
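Bug 1 is worth a sketch. A hedged TypeScript illustration (not the repo's randomBelow; rng is assumed to return an unsigned 32-bit word): the fix consumes all four bytes of each word instead of only the low byte, whose period under a 32-bit LCG is tiny.

```typescript
// Fill a byte draw from full 32-bit rng words. Taking only `rng() & 0xff`
// per byte would cycle every 256 outputs under a 32-bit LCG's low bits.
function randomBelowBytes(rng: () => number, numBytes: number): Uint8Array {
  const out = new Uint8Array(numBytes);
  for (let i = 0; i < numBytes; i += 4) {
    const w = rng() >>> 0; // full 32 bits of this output
    for (let j = 0; j < 4 && i + j < numBytes; j++) {
      out[i + j] = (w >>> (8 * j)) & 0xff;
    }
  }
  return out;
}
```

One rng() call now yields four bytes of entropy instead of one, so a 32-byte draw consumes 8 outputs rather than 32 low-byte samples.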

CPU reference rewritten to do all arithmetic in canonical (non-Mont)
form, then convert back to Mont for the diff against GPU output. The
prior Mont-form-in-place pass got the inverse semantics wrong:
fr_inv_by_a(dx_mont) returns inv_dx_canon * R^2 (a "double Mont"
form, picked because the subsequent montgomery_product strips one R
factor to give Mont-form slope), not inv_dx_canon * R as the original
reference assumed.

GPU bench wall-time: ~6.5ms for 32 entries / 20 outputs / 1 dispatch
on SwiftShader CPU-emulated WebGPU. Not a benchmark number — real
silicon will be 100× faster.
Phase 2 of the tree-reduce SMVP: recursive halving over partials.

Structurally identical to Phase 1 (same pair-detection state machine,
same Phase A/B/C/D batch-affine, same per-WG output write-out) but
takes `(bucket_id, AffinePoint)` tuples directly rather than
`(sign_bit | scalar_idx)` from the raw schedule + a separate
entry_bucket_id table. One less indirection, no sign flip.

Output schema matches Phase 1 so the recursion can rebind the same
buffers and just swap the input/output roles each layer.

Correctness gate: 19/19 outputs match CPU reference bit-for-bit on
the small smoke (num_wgs=2, slice_entries=16) on local SwiftShader.
GPU bench wall: 5.4ms (CPU-emulated WebGPU; M2 would be ~10× faster
based on Phase 1 readback).

Done definition for this step met.
…artial)

Drives Phase 1 → CPU sort → Phase 2 → CPU sort → Phase 2 → ... until
every bucket has one partial. CPU-side resort between phases (Step 4
is deferred to GPU follow-up — choice documented in module header).

Standalone bench-smvp-tree.{html,ts} compares the final per-bucket
partials against a CPU reference that computes the full sequential
sum per bucket directly.

Status:
  - Phase 1 alone: 1/1 buckets match (entries=2)
  - Phase 1 + 1× Phase 2 with mixed pair_result+unpaired input
    (entries=3): 1/1 buckets match
  - Phase 1 + 1× Phase 2 with two pair_result inputs (entries=4):
    1/1 MISMATCH

Repro: load `bench-smvp-tree.html?entries=4&buckets=1&seed=42` on
local SwiftShader Chromium. CPU reference matches the sequential-add
of 4 canonical points; orchestrator's Phase 2 output disagrees.
Phase 2 standalone test (against synthetic Mont-form pair-like
inputs) passes 19/19, so the bug must live in the boundary between
Phase 1's output buffers and Phase 2's input expectations — likely
a Mont-form / BigInt-stride mismatch that the standalone Phase 2
test wasn't hitting because its inputs are generated as fresh random
Mont values rather than the output of a previous batch-affine.

Next step in this debug path: instrument the orchestrator to print
the Phase 1 readback values and diff each (P_2k + P_2k+1) against
its corresponding CPU pair-add for entries=4. That narrows whether
Phase 1's emitted bytes are wrong vs. whether Phase 2 misreads them.

Step 6 (production swap) is unblocked from a structural standpoint
— if the Phase 1/2 chain is fed by the existing transpose +
bucket_start, the same bug surfaces and gives a concrete failing
Quick Sanity Check to triangulate with.
…5 validated)

The previous reference summed each bucket's points sequentially:
  ((P0+P1)+P2)+P3+...
which only matches the GPU's tree-reduce parenthesization
  (P0+P1)+(P2+P3)+...
when the inputs are on the EC group. The synthetic bench uses random
off-curve bigints (we test the algebraic affine-add formula, not the
group law), so the two orderings produce different bytes.
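The distinction is easy to reproduce with any non-associative combine. A toy TypeScript model (plain numbers stand in for the off-curve field tuples):

```typescript
type Combine<T> = (a: T, b: T) => T;

// Sequential parenthesization: ((x0 + x1) + x2) + x3 + ...
function sequentialReduce<T>(xs: T[], f: Combine<T>): T {
  return xs.reduce((acc, x) => f(acc, x));
}

// Tree parenthesization: (x0 + x1) + (x2 + x3), layer by layer.
// This is the shape the GPU's Phase 1/2 halving produces.
function treeReduce<T>(xs: T[], f: Combine<T>): T {
  let layer = xs.slice();
  while (layer.length > 1) {
    const next: T[] = [];
    for (let i = 0; i < layer.length; i += 2) {
      next.push(i + 1 < layer.length ? f(layer[i], layer[i + 1]) : layer[i]);
    }
    layer = next;
  }
  return layer[0];
}
```

With the non-associative f(a, b) = 2a + b, sequentialReduce([1, 2, 3, 4], f) is 26 while treeReduce gives 18: same inputs, different bytes, exactly the mismatch the old reference produced.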

Fixed by walking each bucket via the same pair-detection state
machine the GPU uses, recursing layer-by-layer until one partial
remains. Bench passes 5/5 buckets bit-for-bit on local SwiftShader
(entries=40, buckets=5, seed=99) — including bucket=4 which has
pop=9 and recurses through 4 layers.
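The shared state machine can be modeled in a few lines (illustrative TypeScript, not the reference implementation):

```typescript
// Greedy pair schedule: within each contiguous same-bucket run, adjacent
// entries pair by parity; an odd trailing entry is emitted unpaired.
// `open` resets to null at each run boundary (fresh open=None per slice).
function pairSchedule(
  bucketIds: number[],
): { pairs: [number, number][]; unpaired: number[] } {
  const pairs: [number, number][] = [];
  const unpaired: number[] = [];
  let open: number | null = null;
  for (let i = 0; i < bucketIds.length; i++) {
    if (open !== null && bucketIds[open] === bucketIds[i]) {
      pairs.push([open, i]); // same bucket: close the open entry
      open = null;
    } else {
      if (open !== null) unpaired.push(open); // run ended with odd count
      open = i;
    }
  }
  if (open !== null) unpaired.push(open);
  return { pairs, unpaired };
}
```

pairSchedule([1, 1, 1, 2, 2]) pairs entries (0, 1) and (3, 4) and leaves entry 2 unpaired, matching the parity rule within each run.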

This validates the full Phase 1 → CPU sort → Phase 2 → CPU sort →
... chain. Step 5 correctness gate met.
The tree-reduce orchestrator (cuzk/smvp_tree.ts) is correctness-validated
standalone but not yet integrated into the production MSM pipeline.
This marker documents the integration checklist at the swap site so a
follow-up session can wire it in without re-discovering the contract.
Bumping to 256 (with a 200-entries / 12-buckets case) validated correct
on local SwiftShader (5 layers, 0 mismatches, 140 ms wall), but BS macOS
Chrome 148 fails to compile the resulting shader within the worker's
initial-load window — either maxComputeWorkgroupStorageSize is exceeded
or the static-bound pair_list loops blow out the WGSL compile budget.

Keeping 128 for the validated path (5/5 buckets bit-for-bit on M2
at entries=40). Scaling further is a follow-up that needs pair_list
hoisted to global memory + per-WG pair_count uniform sized for the
runtime count instead of MAX_PAIRS-bounded loop iterations.
…SWEET_B=1024

Phase 1/2 shaders rearchitected for thread utilization at the plan's
target SWEET_B=1024 batch-affine size. v1's two main flaws:

1. Per-thread O(MAX_PAIRS) scans for rank → raw_slot lookup AND
   backward search for prev PAIR's raw_slot in Phase D. At
   MAX_PAIRS=1024 that's 1024 idle iterations per thread per phase.

2. `pair_bucket` in workgroup memory inflated per-WG storage past the
   32 KiB cap, forcing MAX_SLICE_ENTRIES=128 and 8× more WGs than the
   plan called for.

v2 fixes both. Thread-0 preamble builds 4 workgroup-shared arrays in
ONE sequential pass:
- pair_idx_a, pair_idx_b: per-raw-slot (PAIR or UNPAIRED) input entry indices
- prev_raw_for_pair: per-raw-slot pointer to immediate prior PAIR's
  raw_slot (O(1) lookup in Phase D, no backward scan)
- rank_to_raw: per-PAIR-rank pointer to raw_slot (O(1) Phase A/D
  iteration over PER_THREAD_PAIRS, not MAX_PAIRS)
pair_bucket writes go straight to global `output_bucket_id` from the
preamble — never in workgroup memory.

Workgroup memory at MAX_PAIRS=1024 / TPB=64:
  4 × 4 KB (pair arrays) + 2 × 5.12 KB (wg_fwd/bwd) + ~80 B = 26.4 KB
fits in M2's 32 KiB cap.

Phase A/D inner loops now iterate exactly PER_THREAD_PAIRS = 16
times each (down from MAX_PAIRS = 1024 in v1). 64× fewer idle
iterations per thread per phase.

Validation on local SwiftShader (Chromium headless, no GPU on dev
container):
- Phase 1 standalone at 4096 entries / 8 WGs × 512 entries: 2057
  outputs, 0 mismatches, 6.5 ms median.
- Orchestrator at 2048 entries / 64 buckets: 64/64 buckets match
  full-reduce CPU reference bit-for-bit. 3 layers, 18.8 ms total GPU
  wall (10.0 + 5.5 + 3.3 across phase1 + phase2 layer2 + layer3).

Apple M2 should be ~10× faster (SwiftShader is CPU-emulated WebGPU).
Pending BS validation.
…y bucket-sorted

First-principles observation: Phase 1 / Phase 2 outputs are ALREADY
globally bucket-sorted. Input entry_bucket_id is monotone non-
decreasing (CSR layout); each WG walks its non-overlapping
contiguous slice left-to-right emitting in walk order; WG outputs
concatenated preserve monotonicity. No sort needed.

Removes the readback-of-points + JS sort + upload between every
phase. Saves O(N × NUM_LIMBS_U32 × 4) bytes of bus traffic + the
O(N log N) JS sort per layer × log layers.

Still does a small (4 B / partial) bucket-id readback to compute
per-WG pair-count + output offsets host-side. Asserts global sort
on the readback as a debug guard — cheap and catches partition
regressions.

Termination changed from "no more pair-adds possible" (required full
bucket-id scan) to "count equals input num_active_buckets" (known
from initial input). One bucket-id readback per phase, point data
never moves between phases.
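The termination rule follows from each layer halving every bucket's partial count. A toy TypeScript model (assumed, not repo code):

```typescript
// Count Phase 2 layers until every bucket holds one partial: each layer
// turns a bucket's p partials into ceil(p/2); done when the total partial
// count equals the number of active buckets (known from the input).
function layersUntilDone(pops: number[]): number {
  let cur = pops.slice();
  let layers = 0;
  while (cur.reduce((a, b) => a + b, 0) > cur.length) {
    cur = cur.map((p) => Math.ceil(p / 2));
    layers++;
  }
  return layers;
}
```

256 uniform buckets of pop 32 need 5 layers, matching the 8192-entry / 256-bucket bench above; a single pop-32832 bucket needs 16.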

Bench at 8192 entries / 256 buckets / 5 layers on local SwiftShader:
- 256/256 buckets match full-reduce CPU reference bit-for-bit
- GPU wall: 21.9 + 9.9 + 8.7 + 8.8 + 5.5 = 54.8 ms total

For comparison the prior CPU-sort version at 2048 entries / 64
buckets / 3 layers was 140 ms total. 4× scale, 0.4× time — ~10×
speedup from this change plus the v2 thread-utilization fix.

Bench entry cap raised from 512 → 2^18 (1 << 18) and bucket cap
from 64 → 2^14 so we can run real production-scale workloads.
…to finalize pipeline

Two small kernels that turn the orchestrator's sparse
(bucket_id, AffinePoint) outputs into the dense
(running_x, running_y, bucket_active) arrays the existing
finalize_collect → finalize_inverse → finalize_apply pipeline
expects. With these in place the production swap in msm.ts is
mechanical: replace the round-loop dispatch with
runTreeReduce + scatter_init + scatter, and re-use the finalize
chain unchanged for the affine→Jacobian + magnitude-bucket fold.

scatter_init: one thread per bucket slot, zeros running_x/y +
bucket_active across the full T*num_columns dense layout.

scatter: one thread per orchestrator output, writes
running_x[bucket_id]=P.x, running_y[bucket_id]=P.y,
bucket_active[bucket_id]=1.

Both kernels are trivially parallel (no atomics, no synchronisation
beyond the bucket_active write which is the only output ever
written by any thread for that bucket_id since the orchestrator's
output is unique-per-bucket).
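A CPU model of the two kernels (hypothetical TypeScript; field elements abstracted to numbers, one loop iteration per GPU thread):

```typescript
interface Workspace {
  runningX: number[];
  runningY: number[];
  bucketActive: number[];
}

// scatter_init model: one "thread" per bucket slot zeroes the dense layout.
function scatterInit(numSlots: number): Workspace {
  return {
    runningX: new Array<number>(numSlots).fill(0),
    runningY: new Array<number>(numSlots).fill(0),
    bucketActive: new Array<number>(numSlots).fill(0),
  };
}

// scatter model: one "thread" per orchestrator output. Bucket ids are
// unique per output, so no two threads write the same slot: no atomics.
function scatter(
  ws: Workspace,
  outputs: { bucketId: number; x: number; y: number }[],
): Workspace {
  for (const o of outputs) {
    ws.runningX[o.bucketId] = o.x;
    ws.runningY[o.bucketId] = o.y;
    ws.bucketActive[o.bucketId] = 1;
  }
  return ws;
}
```

Buckets with no output stay zeroed and inactive, which is exactly the shape finalize_collect expects.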
…alize pipeline

`smvp_batch_affine_gpu_tree` is the production adapter that:
  1. Reads CSR row pointers from `all_csc_col_ptr_sb`, computes
     per-entry bucket id, uploads.
  2. Runs the v2 tree-reduce orchestrator (`runTreeReduce`).
  3. Inits the dense workspace (`running_x/y_sb`, `bucket_active_sb`)
     via `scatter_init` (one thread per bucket slot).
  4. Scatters the tree-reduce output (sparse, one per active bucket)
     into the dense workspace via `scatter` (one thread per output).
  5. Returns. Caller continues with the existing `finalize_collect` →
     `finalize_inverse` → `finalize_apply` chain unchanged for the
     affine→Jacobian conversion and the magnitude-bucket fold.

`buildTreeAdapterPipelines` compiles all four pipelines (phase1,
phase2, scatter, scatter_init) once per (num_words, max_slice_entries)
shape; cache the handle for the warm bench loop.

ShaderManager wiring for `gen_smvp_tree_scatter_shader` +
`gen_smvp_tree_scatter_init_shader` added alongside the existing
phase1/phase2 generators.

The actual msm.ts call-site swap is one more edit: replace the
current `smvp_batch_affine_gpu(...)` call with two calls — first
`smvp_batch_affine_gpu_tree(...)` to populate running_x/y +
bucket_active via tree-reduce, then the existing finalize chain.
That swap is mechanical now that the adapter is in place; pending
the Quick Sanity Check correctness gate.
Validates the tree-reduce's main perf claim from the plan: a heavily
skewed input (one bucket with pop = entries/2, the rest uniform) is
handled in O(log heavy_pop) layers regardless of skew.

Measured on Apple M2 via BS at entries=65536 / buckets=512 / skew=heavy
(heavy bucket pop = 32 832):
  layers: 16
  total GPU wall: 34.6 ms

For comparison the same input at skew=uniform (max pop ~256):
  layers: 6
  total GPU wall: 24.3 ms

Heavy skew → only 1.4× more time despite a bucket that the current
round-loop MSM would need ~32 832 sequential rounds to reduce. The
plan's "5–10× faster on heavy-bucket workloads" claim looks
conservative.

Bench page now accepts `?skew=heavy` and abbreviates the pops log
for runs with > 16 buckets.
Adds a `use_tree_reduce` flag-gated branch inside
smvp_batch_affine_gpu that swaps the round-loop for the v2
tree-reduce pipeline:
  init (existing) → entry_bucket_id (new) → tree-reduce (new) →
  scatter (new) → finalize_collect → finalize_inverse →
  finalize_apply (all existing, unchanged).

Wiring:
  - `compute_curve_msm` / `compute_bn254_msm_batch_affine` plumb
    `use_tree_reduce` through to smvp_batch_affine_gpu.
  - dev-page main.ts reads `?use_tree_reduce=1` and forwards it to
    the Quick Sanity Check path.
  - New `smvp_tree_entry_bucket_id` shader derives entry_bucket_id
    from the per-subtask CSR row-pointer layout
    (row_ptr[subtask*(num_columns+1) + bucket_local]). Per-subtask
    binary search; one thread per entry.
  - runTreeReduce no longer needs the bucketStart parameter (was
    already unused; removed cleanly).
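The ebid derivation amounts to one upper-bound binary search per entry. A host-side TypeScript model (illustrative names; the real kernel works on the flat row_ptr storage buffer):

```typescript
// For an entry at local index e within subtask s, its bucket is the
// largest b with rowPtr[s * (numColumns + 1) + b] <= e. Taking the last
// of a run of equal row pointers skips over empty buckets correctly.
function entryBucketId(
  rowPtr: number[],
  numColumns: number,
  subtask: number,
  localEntry: number,
): number {
  const base = subtask * (numColumns + 1);
  let lo = 0, hi = numColumns - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (rowPtr[base + mid] <= localEntry) lo = mid; else hi = mid - 1;
  }
  return lo;
}
```

With rowPtr = [0, 2, 2, 5] (3 buckets, bucket 1 empty), local entries 0-1 map to bucket 0 and entries 2-4 to bucket 2.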

State on local SwiftShader:
  - Stock sanity at logN=16: state=done, gpu.x prefix
    e04e8689dc4d92e6, 4.4 s wall.
  - Tree sanity at logN=16: state=done (no crash), but gpu.x prefix
    27e87ad6dbd157b6 — output disagrees with stock. Algorithm
    correctness for the per-bucket affine sums was validated
    standalone at 65 K entries / 1024 buckets on Apple M2
    (1024/1024 buckets bit-for-bit against CPU tree-reduce ref).
    So the divergence is somewhere in the production-layout bridge:
    most likely entry_bucket_id derivation against the real CSR
    (per-subtask layout), the cross-subtask slice alignment of
    Phase 1, or the scatter's bucket_global → workspace slot
    mapping interacting unexpectedly with init's seeding pass.

Pending follow-up: instrument running_x readback after the
tree-reduce + scatter path and diff slot-by-slot against the
stock path's running_x to localize the divergence to a bucket
range. The shaders are stable so once we narrow the failing
bucket the fix should be tight.
Gated behind window.__tree_debug = true. Dumps the first 32 entries
of the tree-reduce's derived entry_bucket_id plus the first / last
of the CSR row_ptr for subtask 0. Used to verify the per-subtask
binary-search kernel against the production CSR layout — confirmed
correct output for the logN=16 sanity input (num_columns=32768,
num_subtasks=18, input_size=65536, totalEntries=1179648).

The tree-reduce path runs to completion but produces a different
final MSM gpu.x than stock. Bug is somewhere after entry_bucket_id —
either in Phase 1/2 chain operating on the production CSR vs my
synthetic test layout, or in scatter's interaction with the
finalize stage's reads. Awaiting a follow-up debug pass with
per-bucket running_x diffing (needs splitting smvp_batch_affine_gpu
so we can intercept the buffer between init+scatter and finalize).
The v2 preamble had thread 0 do a 1024-op sequential pair-detection
state machine while 63 threads idled at workgroupBarrier — a 64x
thread-utilisation loss for the per-WG critical path. v3 distributes
the preamble across all TPB threads:

  Step 1: each thread loads PER_THREAD_ENTRIES = MAX_SLICE_ENTRIES/TPB
          buckets from its chunk and computes "last break position"
          locally.
  Step 2: TPB-wide Hillis-Steele max-scan reconstructs pos_in_run for
          every entry (log2(TPB) stages).
  Step 3: each thread determines emit / pair flags from pos_in_run
          parity and successor-bucket equality.
  Step 4: TPB-wide prefix-sum of per-thread emit + pair counts assigns
          raw_slot and pair_rank ranges per thread.
  Steps 5-6: each thread writes its pair_idx_a/b, rank_to_raw, and
          prev_raw_for_pair entries from its assigned ranges.
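Step 4's offset assignment is a plain exclusive prefix-sum over per-thread counts (sketch in TypeScript; in the shader this is another TPB-wide scan):

```typescript
// Exclusive prefix-sum: per-thread (emit, pair) counts become per-thread
// starting raw_slot / pair_rank offsets. Serial model of the TPB-wide
// workgroup scan; names are illustrative.
function exclusivePrefixSum(counts: number[]): number[] {
  const out = new Array<number>(counts.length).fill(0);
  for (let i = 1; i < counts.length; i++) {
    out[i] = out[i - 1] + counts[i - 1];
  }
  return out;
}
```

Per-thread pair counts [3, 0, 2, 1] give starting ranks [0, 3, 3, 5]; each thread then writes its entries into [start, start + count) without contention.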

The new pair schedule is identical to the v2 greedy state machine (same
parity-based pairing within each contiguous same-bucket run, fresh
open=None per slice). Heavy-skew 65K/512 bench passes bit-for-bit on
SwiftShader.

Phase A is also tightened to load_point_x_only (Phase 1) or direct
input_x[idx] (Phase 2). y is not needed for dx = Q.x - P.x; skipping
the y reads halves Phase A's point-data bandwidth.

Adds run-bench-smvp-tree.mjs to drive the page locally without
BrowserStack for fast iteration.
Before this change, smvp_tree's entry_bucket_id (ebid) GPU kernel
dispatched in its own command encoder and was submitted before the
caller's commandEncoder ran transpose. ebid read all_csc_col_ptr_sb
in its current GPU state — which still held the PRIOR MSM call's CSR.
For warm-context benchmarks the stale CSR matched the current one, so
those runs silently missed the bug; debug runs with different scalars
per call exposed it. In the BS dump comparison with seeded scalars,
this manifested as tree's running_x activating buckets that init
didn't, plus subtle per-subtask sum drift.

Fix: record ebid into the caller's commandEncoder (after transpose
and ba_init), then finish + submit it so the GPU runs transpose
through ebid before runTreeReduce reads back entry_bucket_id. After
that, swap to a fresh commandEncoder for scatter + finalize so the
caller can continue recording BPR onto the new encoder.

Required signature change: smvp_batch_affine_gpu now takes a
commandEncoderRef wrapper so it can mutate the encoder mid-call;
msm.ts re-binds its local commandEncoder after the call returns.
Stock path is unaffected (it never swaps the ref).

Reduces production-integration mismatches from 18/18 subtasks to
4/18 (subtasks 2, 4, 6, 17 still fail to match bit-for-bit — likely
a separate Phase 2 cross-slice carry edge case to be investigated).
GPU entry_bucket_id is still validated bit-for-bit against the host
in the standalone bench's multi-subtask + GPU-ebid mode.

Also extends bench-smvp-tree.ts to drive multi-subtask synthetic
inputs through the GPU ebid kernel (matches production layout), and
adds a per-subtask SMVP fingerprint dump to compare stock vs tree
running_x without depending on warm-context stale data.
AztecBot changed the title from "feat(bb): BrowserStack-driven MSM-webgpu bench harness" to "perf(bb): WebGPU MSM tree-reduce SMVP — parallel preamble + ebid timing fix" on May 17, 2026