feat(ecc): VectorField Fq Mont-mul + K=5 MSM batch_affine_add by notnotraju · Pull Request #23353 · AztecProtocol/aztec-packages

notnotraju · 2026-05-17T22:15:04Z

Stacked on top of #23210 (rk/wasm-simd-03-accumulator).

Two commits:

VectorField Fq Mont-mul specialization: extracts the Mont-mul body into vector_field_mont_mul_body.inl.hpp and adds an explicit specialization for Bn254FqParams alongside the existing Bn254FrParams one. Each specialization remains a separate TU function, which preserves register scope so that V8 reproduces the gist's hand-scheduled WAT. 9 new VectorFieldFqTest cases mirror the Fr coverage.
K=5 q1s1 path in batch_affine_add_interleaved: uses the new Fq specialization to run 5 independent batch-inversion chains in parallel through MSM's affine-add inner loop. Per group of 5 pairs (10 points), 30 scalar muls collapse to 6 width-5 vec muls (+ 12 amortized split-tree muls). Asymptotic ~5× kernel speedup on the mul work.
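
For reviewers unfamiliar with the multi-chain trick, here is a minimal scalar sketch of K=5 interleaved batch inversion. It is not the PR's kernel: the toy modulus `P`, the helpers `mul` / `inv`, and `batch_invert_k_lanes` are hypothetical names, the lanes are a plain inner loop rather than a width-5 vector Mont-mul, and the lane split is a linear walk rather than the split tree mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime field (assumption): P = 2^31 - 1 stands in for BN254 Fq.
constexpr std::uint64_t P = 2147483647ULL;
constexpr std::size_t K = 5; // number of independent inversion chains

// Inputs must be < P, so the product fits in 64 bits.
inline std::uint64_t mul(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
inline std::uint64_t inv(std::uint64_t a) {
    std::uint64_t result = 1;
    for (std::uint64_t e = P - 2; e != 0; e >>= 1) {
        if (e & 1) { result = mul(result, a); }
        a = mul(a, a);
    }
    return result;
}

// Inverts every element of xs in place using one field inversion in total.
// Requires nonzero elements and xs.size() divisible by K.
void batch_invert_k_lanes(std::vector<std::uint64_t>& xs) {
    assert(xs.size() % K == 0);
    const std::size_t groups = xs.size() / K;
    std::vector<std::uint64_t> prefix(xs.size());
    std::array<std::uint64_t, K> acc;
    acc.fill(1);

    // Forward pass: lane j accumulates only the elements with index % K == j,
    // so the K multiplications inside one group are mutually independent.
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t lane = 0; lane < K; ++lane) {
            const std::size_t i = g * K + lane;
            prefix[i] = acc[lane];
            acc[lane] = mul(acc[lane], xs[i]);
        }
    }

    // Collapse the K lane products into one value and invert it once.
    std::array<std::uint64_t, K> below; // product of acc[0..lane-1]
    std::uint64_t total = 1;
    for (std::size_t lane = 0; lane < K; ++lane) {
        below[lane] = total;
        total = mul(total, acc[lane]);
    }
    std::uint64_t running_inv = inv(total);

    // Split the single inverse back into one inverse per lane.
    std::array<std::uint64_t, K> acc_inv;
    for (std::size_t lane = K; lane-- > 0;) {
        acc_inv[lane] = mul(running_inv, below[lane]);
        running_inv = mul(running_inv, acc[lane]);
    }

    // Backward pass: peel one element off each lane's running inverse.
    for (std::size_t g = groups; g-- > 0;) {
        for (std::size_t lane = 0; lane < K; ++lane) {
            const std::size_t i = g * K + lane;
            const std::uint64_t x = xs[i];
            xs[i] = mul(acc_inv[lane], prefix[i]);
            acc_inv[lane] = mul(acc_inv[lane], x);
        }
    }
}
```

The point of the layout is that each inner `lane` loop has no cross-lane dependency, which is what lets a width-5 vector multiply replace five scalar ones.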

Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below threshold, on native, or on non-BN254 curves: falls through to the original K=1 path unchanged.
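
A rough sketch of how a dispatch predicate of this shape could be written, assuming C++17. `ToyFq`, `OtherField`, `kK5MinPoints`, and `select_lane_width` are hypothetical stand-ins, not names from the PR; only the `__wasm_simd128__` macro and the threshold of 20 come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for bb::fq and some other base field.
struct ToyFq {};
struct OtherField {};

// __wasm_simd128__ is defined by the compiler when targeting WASM with SIMD enabled.
#if defined(__wasm_simd128__)
constexpr bool kHasWasmSimd = true;
#else
constexpr bool kHasWasmSimd = false;
#endif

// Below this many points the K=5 setup cost is assumed not to pay off.
constexpr std::size_t kK5MinPoints = 20;

// Compile-time part of the predicate: right target and right field.
template <typename Fq>
constexpr bool can_use_k5 = kHasWasmSimd && std::is_same_v<Fq, ToyFq>;

// Returns the lane width to use: 5 for the wide path, 1 for the fallback.
template <typename Fq>
constexpr std::size_t select_lane_width(std::size_t num_points) {
    if constexpr (can_use_k5<Fq>) {
        if (num_points >= kK5MinPoints) {
            return 5;
        }
    }
    return 1; // native builds, other curves, or small batches
}
```

With the compile-time half folded into `if constexpr`, a native build never instantiates the wide branch, so the K=1 path is the only code that exists there.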

Includes snapshot-before-write logic: output slot for one lane can alias the input slot of a later lane in the same group (typical for large MSM bucket sums); buffering all 5 lanes' reads before any writes prevents y3 corruption.
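
The aliasing hazard can be reproduced in a few lines. This is an illustrative sketch only: `LaneIo`, `transform`, and both `process_group_*` functions are hypothetical, and `transform` stands in for the real per-lane point arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t K = 5;

// Each lane reads one slot of the shared buffer and writes one slot.
struct LaneIo {
    std::size_t in;
    std::size_t out;
};

// Hypothetical per-lane computation.
inline long transform(long v) { return v * 2 + 1; }

// Interleaved read/write: a later lane may read a slot an earlier lane
// has already overwritten.
void process_group_naive(std::vector<long>& buf, const std::array<LaneIo, K>& io) {
    for (std::size_t lane = 0; lane < K; ++lane) {
        buf[io[lane].out] = transform(buf[io[lane].in]);
    }
}

// Snapshot-before-write: buffer every lane's input first, then write.
void process_group_snapshot(std::vector<long>& buf, const std::array<LaneIo, K>& io) {
    std::array<long, K> snapshot;
    for (std::size_t lane = 0; lane < K; ++lane) {
        snapshot[lane] = buf[io[lane].in];
    }
    for (std::size_t lane = 0; lane < K; ++lane) {
        buf[io[lane].out] = transform(snapshot[lane]);
    }
}
```

When lane 0 writes slot 1 and lane 1 reads slot 1, the naive version feeds lane 1 the already-updated value, while the snapshot version feeds it the original one.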

Why this exists

The V8 chonk breakdown shows MSM evaluate_work_units is ~50% of WASM proving time. batch_affine_add_interleaved is its workhorse. Artem's PR #23004 hits the same surface at width-2 via paired-fp51 Mont-mul; per the Slack microbench discussion, the q1s1 (5-wide) kernel wins per-mul by ~50% over fp51 at width ≥ 4. This PR is the first consumer of that width advantage in MSM. Cross-engine deterministic (integer SIMD, not relaxed-SIMD) — no Edge 147 / Safari class of bugs.

End-to-end measurement to follow (microbench + chonk under V8/Node + BrowserStack matrix). Marking draft.

Tests

Native ecc_tests: 865/865 PASS (K=5 dormant; K=1 fallback intact)
WASM ecc_tests under wasmtime: 865/905 PASS, 40 SKIPPED (intentional), 0 FAILED — K=5 actively exercised

Stack

rk/wasm-simd-01-vector-field → rk/wasm-simd-02-vectorized-for → rk/wasm-simd-03-accumulator → this PR

Lifts the operator* WASM kernel body into vector_field_mont_mul_body.inl.hpp and stamps it for both Bn254FrParams and Bn254FqParams. The macros (BB_VF_LOAD_LIMBS, BB_VF_KARATSUBA_STAGES_1_4, BB_VF_RUN_STAGES_6_THROUGH_10) already reference unqualified R_INV_WASM / P_WASM / R_INV_MOD_2_29; those resolve in each specialization's enclosing class scope to the appropriate Params constants, so the same body produces a correctly bound kernel per Params. Each specialization remains explicit (rather than templating the body) so LLVM emits each as a standalone TU function, preserving the register scope that lets V8 reproduce the gist's hand-scheduled WAT layout.

New VectorFieldFqTest suite (9 tests) mirrors the Fr coverage for the operations exercised by curve arithmetic: ctor, add, sub, mul (150 random trials), eq, is_zero, distributivity, mul-by-one, type alias.

Verified native ecc_tests 35/35 and wasm ecc_tests under wasmtime 35/35 PASS. Prereq for MSM-side q1s1 integration in subsequent PRs.
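
The name-resolution point can be illustrated with a toy version, assuming C++17. Everything here is hypothetical (`ToyField`, `ToyFrParams`, `ToyFqParams`, `P_TOY`, `TOY_MUL_BODY`); a macro stands in for the included `.inl.hpp` body and a plain `%` stands in for the Mont-mul.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter packs with different moduli.
struct ToyFrParams { static constexpr std::uint64_t modulus = 97; };
struct ToyFqParams { static constexpr std::uint64_t modulus = 101; };

// Shared body. P_TOY is deliberately unqualified: it is looked up in
// whichever class the body is expanded into.
#define TOY_MUL_BODY(lhs, rhs) return ToyField{ ((lhs).v * (rhs).v) % P_TOY };

template <typename Params> struct ToyField; // primary template left undefined

// Explicit specialization #1: P_TOY binds to the Fr modulus.
template <> struct ToyField<ToyFrParams> {
    static constexpr std::uint64_t P_TOY = ToyFrParams::modulus;
    std::uint64_t v;
    ToyField operator*(const ToyField& other) const { TOY_MUL_BODY(*this, other) }
};

// Explicit specialization #2: the same body, but P_TOY binds to the Fq modulus.
template <> struct ToyField<ToyFqParams> {
    static constexpr std::uint64_t P_TOY = ToyFqParams::modulus;
    std::uint64_t v;
    ToyField operator*(const ToyField& other) const { TOY_MUL_BODY(*this, other) }
};

#undef TOY_MUL_BODY
```

Both specializations expand identical text, yet each `operator*` is its own non-template function with its own constants, which is the property the commit relies on.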

Width-5 fast path for batch_affine_add_interleaved, using the VectorField<Bn254FqParams> Mont-mul from the prior commit. Runs 5 independent batch-inversion chains in parallel, collapses each pass's N scalar muls into N/5 width-5 vec muls (asymptotic ~5×).

Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below threshold or on native, falls through to the original K=1 path unchanged.

Snapshot-before-write per group: output slot for one lane can alias the input slot of a later lane in the same group; buffering all 5 lanes' reads before any writes prevents y3 corruption at large N.

Tests: ecc_tests 37/37 PASS native + wasmtime (K=5 exercised under wasmtime).

…cross iterations

Hoists acc_lanes from std::array<Fq, 5> AoS into a VFq that stays in packed (9×29-bit u32x4) form across the K=5 forward loop. Previously each iteration did .to_array() at the end and reconstructed VFq from AoS at the start of the next: two AoS↔packed transposes per group that did no work. Per-group transpose count: 5 → 3 (the two Mont-muls are unchanged).

Initial VFq::broadcast(Fq::one()) replaces the AoS-of-ones ctor (broadcast packs once and splats across the 4 quad lanes; the array ctor packs 5 copies and transposes). Final .to_array() + 4-mul lane collapse runs once after the loop.

The AoS acc_lanes is still declared at function scope so the backward K=5 prefix-product tree can index it scalar-style; it's filled exactly once via the post-loop to_array(). batch_inversion_accumulator is initialized to Fq::one() upfront so the CAN_USE_K5=false / batch-too-small path drops through unchanged.

Tests: ecc_tests 886/886 PASS native, 865/865 PASS wasmtime (K=5 exercised).
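
To make the cost of an AoS↔packed round trip concrete, here is a scalar sketch of splitting a 256-bit value into 9 limbs of 29 bits and laying four such values out limb-major. It assumes little-endian 4×64-bit input words and that each of the 9 rows holds the same limb position for 4 elements; `pack_29`, `unpack_29`, and `pack4` are hypothetical names, and the real VFq layout may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr unsigned LIMB_BITS = 29;
constexpr std::size_t NUM_LIMBS = 9; // 9 * 29 = 261 bits, enough for 256
constexpr std::uint32_t LIMB_MASK = (1u << LIMB_BITS) - 1;

using Words = std::array<std::uint64_t, 4>;         // little-endian 256-bit value
using Limbs = std::array<std::uint32_t, NUM_LIMBS>; // 29-bit limbs, low limb first

// Split 4 x 64-bit words into 9 x 29-bit limbs.
Limbs pack_29(const Words& x) {
    Limbs out{};
    for (std::size_t i = 0; i < NUM_LIMBS; ++i) {
        const unsigned bit = static_cast<unsigned>(i) * LIMB_BITS;
        const std::size_t word = bit / 64;
        const unsigned off = bit % 64;
        std::uint64_t chunk = x[word] >> off;
        // The limb straddles two words: pull the high part from the next word.
        if (off + LIMB_BITS > 64 && word + 1 < x.size()) {
            chunk |= x[word + 1] << (64 - off);
        }
        out[i] = static_cast<std::uint32_t>(chunk & LIMB_MASK);
    }
    return out;
}

// Inverse of pack_29. The top limb may only use 24 bits (256 - 8 * 29).
Words unpack_29(const Limbs& limbs) {
    assert(limbs[NUM_LIMBS - 1] < (1u << 24));
    Words out{};
    for (std::size_t i = 0; i < NUM_LIMBS; ++i) {
        const unsigned bit = static_cast<unsigned>(i) * LIMB_BITS;
        const std::size_t word = bit / 64;
        const unsigned off = bit % 64;
        out[word] |= static_cast<std::uint64_t>(limbs[i]) << off;
        if (off + LIMB_BITS > 64 && word + 1 < out.size()) {
            out[word + 1] |= static_cast<std::uint64_t>(limbs[i]) >> (64 - off);
        }
    }
    return out;
}

// AoS -> limb-major: row i holds limb i of each of the 4 elements.
using Packed4 = std::array<std::array<std::uint32_t, 4>, NUM_LIMBS>;

Packed4 pack4(const std::array<Words, 4>& elems) {
    Packed4 rows{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        const Limbs l = pack_29(elems[lane]);
        for (std::size_t i = 0; i < NUM_LIMBS; ++i) {
            rows[i][lane] = l[i];
        }
    }
    return rows;
}
```

Each conversion touches every limb of every lane, which is why keeping the accumulator packed across iterations looked like a saving.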

…ove e2e

Reverts the body of 99eca83. Hoisting acc_lanes to a packed VFq across the K=5 forward loop looked like it should drop two AoS↔packed transposes per group, but benchmarking on V8/wasmtime showed no measurable change: LLVM appears to already SROA the round-trip when acc_lanes' address doesn't escape. Added a note in the K=5 forward header so the next person doesn't redo the experiment.

Tests: ecc_tests 886/886 PASS native.

notnotraju added 4 commits May 13, 2026 02:26

notnotraju wants to merge 4 commits into rk/wasm-simd-03-accumulator from rk/wasm-simd-04-fq-mont-mul