feat(bb/msm-webgpu): straus_msm port — all 9 phases (P1-P9) stacked #23475
Draft
AztecBot wants to merge 9 commits into
Summary
Full execution of the `straus_msm` WebGPU port plan: every phase is committed as a separate, reviewable commit on this branch. The phases were intended to run sequentially across PRs, but the operator directed me to bundle them into a single session, so they are stacked here. Review per commit; squash on merge will collapse them.

3f9ad9d9a2cce60b3947fa0248458f946 `straus_main` kernel — single-chunk correctness aeb637155bb6d03dfb087a9241f3d755111a4f7eb81d43c7f0e0efa01ac7379d01

Per-phase notes
P1 — Host-side reference + GLV/Booth primitives
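A minimal sketch of signed-window (Booth) recoding, the kind of primitive this phase covers. It assumes a 4-bit window, which is inferred from the 8-entry per-point table and the 32 windows mentioned in later phases; `boothDigits` and `fromBoothDigits` are hypothetical names for illustration and are not the functions in `straus/booth.ts`.

```typescript
// Hypothetical sketch: radix-16 Booth recoding of a scalar in [0, 2^127).
const WINDOW_BITS = 4;  // assumed window width; yields digits in [-8, 8]
const NUM_WINDOWS = 32; // 32 windows x 4 bits covers a 128-bit half-scalar

function boothDigits(k: bigint): number[] {
  if (k < 0n || k >= 1n << 127n) throw new RangeError("expected 0 <= k < 2^127");
  const digits: number[] = [];
  for (let w = 0; w < NUM_WINDOWS; w++) {
    const shift = BigInt(w * WINDOW_BITS);
    const u = Number((k >> shift) & 0xfn); // unsigned 4-bit window
    // Top bit of the previous window acts as a carry into this one.
    const carryIn = w === 0 ? 0 : Number((k >> (shift - 1n)) & 1n);
    const top = u >> (WINDOW_BITS - 1); // top bit of this window
    digits.push(u + carryIn - (top << WINDOW_BITS));
  }
  return digits;
}

// Inverse: evaluate sum(digits[w] * 16^w), most significant window first.
function fromBoothDigits(digits: number[]): bigint {
  let acc = 0n;
  for (let w = digits.length - 1; w >= 0; w--) {
    acc = (acc << BigInt(WINDOW_BITS)) + BigInt(digits[w]);
  }
  return acc;
}
```

For example, `boothDigits(15n)` begins with the digits `-1, 1`, since 15 = -1 + 1·16. The round-trip `fromBoothDigits(boothDigits(k)) === k` is the property a Booth round-trip test would check.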
- `STRAUS_REFERENCE.md` (committed copy of the P0 reference gist).
- `straus/glv.ts` + tests: `splitIntoEndomorphismScalars` from `field_declarations.hpp:501-530`; λ materialised from `bn254/fr.hpp` Mont limbs.
- `straus/booth.ts` + tests: `BoothSliceParams`, 32-row table, `boothPackedDigit`.
- `straus/reference.ts` + tests: pure-TS `straus_msm` cross-checked against `noble.G1.msm` for n ∈ [1, 256], plus the three edge cases from `element.test.cpp`.

P2 — Lookup-precompute kernel
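The table layout this phase builds, `lut[i*8 + k] = (k+1) · base[i]`, pairs naturally with signed digits in [-8, 8]: the magnitude selects a row and the sign selects negation. The sketch below shows that indexing on the host, using integers modulo a small number as a toy stand-in for curve points. `buildLut` and `lookupSigned` are hypothetical helpers, and building the rows by repeated addition is an assumption rather than a description of the WGSL kernel.

```typescript
// Toy stand-in group: integers mod P under addition (NOT BN254 points).
const P = 1000003;
const ENTRIES_PER_POINT = 8;

const addMod = (a: number, b: number): number => (a + b) % P;
const negMod = (a: number): number => (P - a) % P;

// lut[i*8 + k] = (k+1) * base[i], built here by repeated addition.
// Assumes every base is already reduced into [0, P).
function buildLut(bases: number[]): number[] {
  const lut = new Array<number>(bases.length * ENTRIES_PER_POINT);
  bases.forEach((base, i) => {
    let acc = base;
    for (let k = 0; k < ENTRIES_PER_POINT; k++) {
      lut[i * ENTRIES_PER_POINT + k] = acc;
      acc = addMod(acc, base);
    }
  });
  return lut;
}

// Resolve a signed digit d in [-8, 8] to d * base[i]; digit 0 is the identity.
function lookupSigned(lut: number[], i: number, d: number): number {
  if (d === 0) return 0;
  const entry = lut[i * ENTRIES_PER_POINT + Math.abs(d) - 1];
  return d > 0 ? entry : negMod(entry);
}
```

Storing only the positive multiples keeps the table at 8 entries per point; on a real curve the negation is cheap, which is presumably why the layout stops at `(k+1)` for k in 0..7.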
- `wgsl/cuzk/straus_lookup_precompute_bn254.template.wgsl`: one thread per active point builds `lut[i*8 + k] = (k+1) · base[i]`.
- `cuzk/straus_kernels.ts`: `StrausKernels.{render,compile}LookupPrecompute` factory (the `gpu` helpers parameter keeps the module jest-loadable in node).
- `cuzk/shader_manager.ts`: `gen_straus_lookup_precompute_shader(n, workgroup_size)`.
- `dev/msm-webgpu/wgsl_unit_tests.ts`: `testStrausLookupPrecompute(n)` wired into `runAllWgslUnitTests` for n ∈ {1, 8, 64, 256, 1024}.

P3 — `straus_main` kernel
- `wgsl/cuzk/straus_main_bn254.template.wgsl`: compile-time `NUM_THREAD_MULS`, counted inner `ii` loop (NOT unrolled), 32 windows × 2 halves × ii. Reads `k1_lims`/`k2_lims` from storage each iter; β-Mont injected from `straus_constants.ts`.
- `cuzk/straus_constants.ts` (new): `fqCubeRootOfUnityMont(numWords, wordSize)` re-bases β from `bn254/fq.hpp`'s 2^256-Mont into this tree's `R = 2^(num_words·word_size)`-Mont form. Sanity test verifies β³ ≡ 1 mod q.
- `dev/msm-webgpu/wgsl_unit_tests.ts`: `testStrausChunk(k)` for k ∈ {1, 2, 3, 4, 6, 8, 12, 16} (single-chunk dispatch, Jacobian readback compared to `referenceStrausMsm`).

P4 — Multi-thread dispatch test
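The dispatch shape exercised in this phase, `T = ceil(n/k)` threads each producing one partial that the host then sums, can be mimicked on the CPU. The sketch below is a toy analogue only: "points" are integers modulo a small number, so a scalar multiplication is a modular product, and `threadPartial`, `dispatchPartials`, and `sumPartials` are hypothetical names.

```typescript
// Toy stand-in: "points" are integers mod M under addition (NOT curve points).
const M = 1000003;

// Work of one simulated thread: sum of scalars[i] * points[i] over up to k pairs.
function threadPartial(points: number[], scalars: number[], start: number, k: number): number {
  let acc = 0;
  const end = Math.min(start + k, points.length);
  for (let i = start; i < end; i++) {
    acc = (acc + ((scalars[i] * points[i]) % M)) % M;
  }
  return acc;
}

// Simulate T = ceil(n / k) threads and return their partials in thread order.
function dispatchPartials(points: number[], scalars: number[], k: number): number[] {
  const numThreads = Math.ceil(points.length / k);
  return Array.from({ length: numThreads }, (_, t) => threadPartial(points, scalars, t * k, k));
}

// Host-side reduction of the partials (the role the projective adds play in the test).
const sumPartials = (partials: number[]): number =>
  partials.reduce((a, b) => (a + b) % M, 0);
```

Whatever `k` is chosen, `sumPartials(dispatchPartials(points, scalars, k))` should equal the single-threaded result, which is the invariant a multi-thread test asserts against its reference.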
`testStrausMultiThread(n, k)` dispatches `T = ceil(n/k)` threads, reads back T Jacobian partials, sums them via noble's projective add, and asserts affine equality with `referenceStrausMsm`. Wired for (n, k) ∈ {16, 64, 256, 1024} × {1, 2, 4, 8}.

P5 — Combine + to-affine
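The fold step in this phase halves the number of partials per pass, with the host swapping two buffers between passes. A minimal CPU sketch of that ping-pong pattern follows; it assumes the number of partials is a power of two, and `foldPass` / `foldAll` are hypothetical names rather than the driver's API.

```typescript
// One fold pass: output[t] = add(input[2t], input[2t+1]).
// Reads only `input` and writes only `output`, so no element is read after being overwritten.
function foldPass<T>(input: T[], count: number, output: T[], add: (a: T, b: T) => T): number {
  const half = count / 2;
  for (let t = 0; t < half; t++) {
    output[t] = add(input[2 * t], input[2 * t + 1]);
  }
  return half;
}

// Run fold passes until one element remains, swapping the two buffers between passes.
function foldAll<T>(partials: T[], add: (a: T, b: T) => T): { result: T; passes: number } {
  const n = partials.length;
  if (n === 0 || (n & (n - 1)) !== 0) throw new RangeError("length must be a power of two");
  let src = partials.slice();
  let dst = new Array<T>(n);
  let count = n;
  let passes = 0;
  while (count > 1) {
    count = foldPass(src, count, dst, add);
    [src, dst] = [dst, src]; // ping-pong: last pass's output is the next pass's input
    passes++;
  }
  return { result: src[0], passes };
}
```

With 8 partials this takes 3 passes, matching the log2(T) count. Keeping reads and writes in separate buffers is what removes the need for a barrier between threads within a pass.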
- `wgsl/cuzk/straus_combine_fold_bn254.template.wgsl`: in-pass `add_points(in[2t], in[2t+1])` → `out[t]`. Ping-pong-buffered by the host between dispatches; an in-place fold is avoided because cross-workgroup read/write hazards would otherwise need a global barrier.
- `wgsl/cuzk/straus_to_affine_bn254.template.wgsl`: single-thread Jacobian→affine via `fr_inv_by_a` (the same BY safegcd driver `batch_inverse` already uses).
- `testStrausEndToEnd(n, k)` runs lookup + main + log2(T) fold passes + to-affine and compares affine `(x, y)` directly to `referenceStrausMsm`.

P6 — TrivialMsm host driver
`dev/msm-webgpu/trivial_msm.ts`: `create(device, n, pointsBuf, ntm)` / `prepare(scalarsBuf)` / `run()` / `destroy()` mirroring `MsmV2`. `create` compiles every per-(n, k) pipeline including all log2(T0) fold passes and runs `lookup_precompute` once; `prepare` does host-side GLV split + Booth-pack + scalar upload; `run` encodes + submits the full straus pipeline and reads back the 2 × BigInt result.

P7 — bench-nt-sweep
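Each cell of the sweep is summarised by a median and a minimum over repeated runs. A minimal sketch of that reduction is shown here; `summarizeRuns` is a hypothetical helper, and averaging the two middle samples for an even count is an assumption about the median convention.

```typescript
// Reduce a list of per-run timings (milliseconds) to the two numbers a sweep cell reports.
function summarizeRuns(samplesMs: number[]): { medianMs: number; minMs: number } {
  if (samplesMs.length === 0) throw new RangeError("need at least one sample");
  const sorted = samplesMs.slice().sort((a, b) => a - b); // numeric sort, input left untouched
  const mid = sorted.length >> 1;
  const medianMs =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { medianMs, minMs: sorted[0] };
}
```

The median damps outliers from warm-up or scheduling noise, while the minimum approximates the best achievable time for the configuration.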
- `dev/msm-webgpu/bench-nt-sweep.{html,ts}`: clone of `bench-c-sweep` parameterised on `NUM_THREAD_MULS`. Each cell records median + min `TrivialMsm.run()` ms; the same row also times `MsmV2` so the table reports per-logN speedup and surfaces the crossover.
- `dev/msm-webgpu/scripts/bench-nt-sweep.mjs`: headless Playwright driver, stall-detection at 180s, prints final pickNTM JSON + speedup table on stdout.
- `dev/msm-webgpu/scripts/run-browserstack.mjs`: pageMap entry for `bench-nt-sweep`.

P8 — M2 BrowserStack confirmation
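The narrowing step in this phase takes the best local value per logN and keeps only its immediate neighbours for the remote run. The sketch below reads "`bestNtm ± 1`" as plus or minus one in value, filtered by membership in the default set; that reading, the `LocalBest` input shape, and the name `narrowNtmList` are assumptions for illustration.

```typescript
// Hypothetical input shape: best NUM_THREAD_MULS found locally for each logN.
type LocalBest = { logN: number; bestNtm: number };

// Union of {bestNtm - 1, bestNtm, bestNtm + 1} over all rows, restricted to defaultSet.
function narrowNtmList(local: LocalBest[], defaultSet: number[]): number[] {
  const allowed = new Set(defaultSet);
  const picked = new Set<number>();
  for (const row of local) {
    for (const candidate of [row.bestNtm - 1, row.bestNtm, row.bestNtm + 1]) {
      if (allowed.has(candidate)) picked.add(candidate);
    }
  }
  return Array.from(picked).sort((a, b) => a - b);
}
```

With a default set of `[1, 2, 3, 4, 6, 8, 12, 16]` and local bests of 3 and 8, this yields `[2, 3, 4, 8]`, because 7 and 9 are not in the default set.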
- `dev/msm-webgpu/scripts/narrow-from-local.mjs`: reads local sweep JSON, emits the union of `bestNtm ± 1` per logN (constrained to the default set) as the BS `--ntm` list.
- `dev/msm-webgpu/results_format.ts` + `scripts/format-m2-report.mjs`: convert the BS JSONL into the markdown gist body (pickNTM table + speedup + crossover).

P9 — Size dispatcher
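The routing rule in this phase is a threshold on n plus a table lookup. The sketch below shows only that decision logic, with no GPU calls. Every number in it is invented for illustration (the PR's own table and crossover are placeholders pending measurement), and `pickNtmSketch`, `routeBySize`, and the `Route` type are hypothetical names.

```typescript
// Invented values for illustration only; NOT the PR's pickNTM table or crossover.
const CROSSOVER_N_SKETCH = 1024;
const NTM_TABLE_SKETCH: Array<{ maxN: number; ntm: number }> = [
  { maxN: 64, ntm: 2 },
  { maxN: 256, ntm: 4 },
  { maxN: 1024, ntm: 8 },
];

// First row whose maxN covers n; the last row is the catch-all.
function pickNtmSketch(n: number): number {
  for (const row of NTM_TABLE_SKETCH) {
    if (n <= row.maxN) return row.ntm;
  }
  return NTM_TABLE_SKETCH[NTM_TABLE_SKETCH.length - 1].ntm;
}

type Route = { path: "trivial"; ntm: number } | { path: "fallback" };

// Small inputs take the straus path with a table-chosen k; larger ones fall through.
function routeBySize(n: number): Route {
  return n <= CROSSOVER_N_SKETCH
    ? { path: "trivial", ntm: pickNtmSketch(n) }
    : { path: "fallback" };
}
```

Separating the routing decision from the drivers keeps it testable in node without a GPU, and swapping in measured values only touches the two constants.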
- `src/msm_webgpu/cuzk/trivial_msm.ts`: production copy of the dev driver (dev/ becomes a thin re-export to keep the two from diverging).
- `src/msm_webgpu/cuzk/size_dispatcher.ts`: `compute_bn254_msm_auto(device, n, points, scalars, fallback)` routes `n ≤ PICK_NTM_CROSSOVER_N` to `TrivialMsm` with `k = pickNtm(n)`; otherwise falls through to the supplied MsmV2-shaped fallback. The pickNTM table and crossover constant are placeholders that need P8's actual M2 measurements.
- `src/msm_webgpu/index.ts`: re-exports `compute_bn254_msm_auto`, `pickNtm`, `PICK_NTM_CROSSOVER_N`, `TrivialMsm`.

Gate command output (host-side)
Tests cover:
- P1: GLV identity over 200 random scalars + negative-k2 sweep + 11 explicit edge cases.
- P1: Booth round-trip over 100 GLV outputs + 256 random halves + tiny boundaries.
- P1: `referenceStrausMsm` vs noble for n ∈ [1, 256] × 5 seeded inputs + the three `element.test.cpp` edge cases.
- P2: lookup-precompute renderer (bindings, partials, mustache wiring).
- P3: `straus_main` renderer (NUM_THREAD_MULS interpolation, no inner-loop unroll, bindings, 20-limb β-Mont initializer, β³ ≡ 1 mod q).
- P5: combine-fold renderer (T_IN interpolation, bindings) and to-affine renderer (single-thread, `fr_inv_by_a`).

Caveats / follow-ups
- `(k+1) · base[i]` entries.
- `straus_main` Booth digit extraction in WGSL matches the host port (the WGSL form uses 32-bit limbs directly; the host uses bigint operations; both should produce the same digits).
- `fr_inv_by_a` in to-affine produces a correct Mont-form inverse on the read-back Z.
- `TrivialMsm.run()` is idempotent across repeats (the warm-path stability claim from the plan).
- `pickNTM` table and `PICK_NTM_CROSSOVER_N` in `size_dispatcher.ts` are static-reasoning placeholders. Run the local sweep, then the BrowserStack narrowed sweep, to populate them with real M2 numbers.
- `TrivialMsm` vs `MsmV2` sanity cross-check skipped. P6 in the plan said "add a TrivialMsm row to the existing sweep UI" in main.ts. Touching main.ts (1500+ lines, deep integration with the MsmV2 lifecycle) without GPU validation felt too risky; left for a follow-up. The bench-nt-sweep page already covers the TrivialMsm vs MsmV2 timing comparison.
- `drive-unit-tests.mjs` not added. P2 asked for a Playwright driver that invokes one unit test by name. The new `testStrausLookupPrecompute` / `testStrausChunk` / `testStrausMultiThread` / `testStrausEndToEnd` functions are wired into `runAllWgslUnitTests()`, which is reachable from main.ts's "Run Unit Tests" button. A standalone driver + `?autorun=…` URL flag can be added in a follow-up.
- `yarn build:esm` gate fails on pre-existing branch errors unrelated to any phase here: missing `src/cbind/generated/*` (requires `yarn generate` + native C++ build) plus pre-existing TS errors in `cuzk/batch_affine.ts`, `cuzk/smvp_tree.ts`, `barretenberg/poseidon.{test,bench.test}.ts`. Confirmed by running `yarn build:esm` on the unmodified `zw/msm-webgpu-experiments-v2` base. None of the failing files are touched in this PR.
- `--testPathPattern='src/msm_webgpu/straus'` from the plan's gate does not filter under this repo's jest config; use `yarn test src/msm_webgpu/straus` (positional) instead.