feat(ecc): VectorField Fq Mont-mul + K=5 MSM batch_affine_add#23353

Draft
notnotraju wants to merge 4 commits into rk/wasm-simd-03-accumulator from rk/wasm-simd-04-fq-mont-mul

@notnotraju
Contributor

Stacked on top of #23210 (rk/wasm-simd-03-accumulator).

Two commits:

  1. VectorField Fq Mont-mul specialization: extracts the Mont-mul body into vector_field_mont_mul_body.inl.hpp and adds an explicit specialization for Bn254FqParams alongside the existing Bn254FrParams one. Each specialization remains a separate TU function, which preserves the register scope that lets V8 reproduce the gist's hand-scheduled WAT. 9 new VectorFieldFqTest cases mirror the Fr coverage.

  2. K=5 q1s1 path in batch_affine_add_interleaved: uses the new Fq specialization to run 5 independent batch-inversion chains in parallel through MSM's affine-add inner loop. Per group of 5 pairs (10 points), the 30 scalar muls collapse to 6 width-5 vec muls (plus 12 amortized split-tree muls), for an asymptotic ~5× kernel speedup on the mul work.

    Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below threshold, on native, or on non-BN254 curves: falls through to the original K=1 path unchanged.

    Includes snapshot-before-write logic: output slot for one lane can alias the input slot of a later lane in the same group (typical for large MSM bucket sums); buffering all 5 lanes' reads before any writes prevents y3 corruption.
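The aliasing hazard described above can be illustrated with a small sketch. It is not the PR's code: `LaneOp`, `process_group`, and the use of plain integer addition in place of the affine add are all hypothetical, chosen only to show why every lane's inputs are read before any lane's output is written.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t K = 5; // lanes per group, as in the K=5 path

// Hypothetical description of one lane: two input slots and one output slot,
// all indexing into the same shared buffer.
struct LaneOp {
    std::size_t lhs;
    std::size_t rhs;
    std::size_t out;
};

// Toy stand-in for the affine add: the "result" is just lhs + rhs.
void process_group(std::vector<int64_t>& slots, const std::array<LaneOp, K>& ops)
{
    std::array<int64_t, K> lhs_snapshot{};
    std::array<int64_t, K> rhs_snapshot{};

    // Phase 1: snapshot every lane's inputs before anything is written.
    for (std::size_t lane = 0; lane < K; ++lane) {
        assert(ops[lane].lhs < slots.size());
        assert(ops[lane].rhs < slots.size());
        assert(ops[lane].out < slots.size());
        lhs_snapshot[lane] = slots[ops[lane].lhs];
        rhs_snapshot[lane] = slots[ops[lane].rhs];
    }

    // Phase 2: write outputs. A lane's output slot may be a later lane's
    // input slot, which is harmless now that all inputs are buffered.
    for (std::size_t lane = 0; lane < K; ++lane) {
        slots[ops[lane].out] = lhs_snapshot[lane] + rhs_snapshot[lane];
    }
}
```

With ops such as `{0,1,5}, {2,3,6}, {4,5,7}, ...`, lane 0 writes slot 5 while lane 2 still needs the old value of slot 5. A loop that reads and writes lane by lane would feed lane 2 the overwritten value; the two-phase version does not.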

Why this exists

The V8 chonk breakdown shows MSM evaluate_work_units is ~50% of WASM proving time, and batch_affine_add_interleaved is its workhorse. Artem's PR #23004 targets the same code path at width-2 via paired-fp51 Mont-mul; per the Slack microbench discussion, the q1s1 (5-wide) kernel wins per-mul by ~50% over fp51 at width ≥ 4. This PR is the first consumer of that width advantage in MSM. The kernel is cross-engine deterministic (integer SIMD, not relaxed-SIMD), so it avoids the Edge 147 / Safari class of bugs.

End-to-end measurement to follow (microbench + chonk under V8/Node + BrowserStack matrix). Marking draft.

Tests

  • Native ecc_tests: 865/865 PASS (K=5 dormant; K=1 fallback intact)
  • WASM ecc_tests under wasmtime: 865/905 PASS, 40 SKIPPED (intentional), 0 FAILED — K=5 actively exercised

Stack

  • rk/wasm-simd-01-vector-field → rk/wasm-simd-02-vectorized-for → rk/wasm-simd-03-accumulator → this PR

Lifts the operator* WASM kernel body into vector_field_mont_mul_body.inl.hpp
and stamps it for both Bn254FrParams and Bn254FqParams. The macros
(BB_VF_LOAD_LIMBS, BB_VF_KARATSUBA_STAGES_1_4, BB_VF_RUN_STAGES_6_THROUGH_10)
already reference unqualified R_INV_WASM / P_WASM / R_INV_MOD_2_29 — those
resolve in each specialization's enclosing class scope to the appropriate
Params constants, so the same body produces a correctly-bound kernel per
Params.
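The name-resolution trick can be sketched in miniature. Everything below is hypothetical (`ToyFrParams`, `ToyFqParams`, `ToyVectorField`, `TOY_MUL_BODY`, `P_TOY`, and the toy moduli); it only shows how one macro body that names an unqualified constant binds to a different value in each explicit specialization.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real Params types; the moduli are toy values.
struct ToyFrParams {
    static constexpr uint64_t modulus = 97;
};
struct ToyFqParams {
    static constexpr uint64_t modulus = 101;
};

// Shared body. P_TOY is unqualified, so it is looked up where the macro is
// expanded, i.e. inside whichever class the body is stamped into.
#define TOY_MUL_BODY(a, b) ((((a) % P_TOY) * ((b) % P_TOY)) % P_TOY)

template <typename Params> struct ToyVectorField; // primary template, left undefined

// Explicit specialization 1: P_TOY binds to the Fr-like modulus.
template <> struct ToyVectorField<ToyFrParams> {
    static constexpr uint64_t P_TOY = ToyFrParams::modulus;
    static uint64_t mul(uint64_t a, uint64_t b) { return TOY_MUL_BODY(a, b); }
};

// Explicit specialization 2: same body text, different constant.
template <> struct ToyVectorField<ToyFqParams> {
    static constexpr uint64_t P_TOY = ToyFqParams::modulus;
    static uint64_t mul(uint64_t a, uint64_t b) { return TOY_MUL_BODY(a, b); }
};
```

The real kernel is a Montgomery multiplication over packed limbs rather than a `%` reduction; the sketch is only about scoping, not about the arithmetic.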

Each specialization remains explicit (rather than templating the body) so
LLVM emits each as a standalone TU function, preserving the register-scope
that lets V8 reproduce the gist's hand-scheduled WAT layout.

New VectorFieldFqTest suite (9 tests) mirrors the Fr coverage for the
operations exercised by curve arithmetic: ctor, add, sub, mul (150 random
trials), eq, is_zero, distributivity, mul-by-one, type alias. Verified
native ecc_tests 35/35 and wasm ecc_tests under wasmtime 35/35 PASS.

Prereq for MSM-side q1s1 integration in subsequent PRs.

Width-5 fast path for batch_affine_add_interleaved, using the
VectorField<Bn254FqParams> Mont-mul from the prior commit. Runs 5
independent batch-inversion chains in parallel, collapses each pass's
N scalar muls into N/5 width-5 vec muls (asymptotic ~5×).
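A scalar model of the five interleaved chains, as a sketch only. It assumes a toy prime field (2^31 - 1, not the BN254 base field), nonzero inputs already reduced below the modulus, and a plain prefix-product collapse of the five accumulators; the PR's split-tree and SIMD packing are not reproduced. `batch_invert_k5`, `mul`, and `inv` are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t P = 2147483647ULL; // toy prime 2^31 - 1, not the BN254 modulus
constexpr std::size_t K = 5;          // number of independent chains

// Operands must be < P so the 64-bit product cannot overflow.
uint64_t mul(uint64_t a, uint64_t b) { return (a * b) % P; }

// Inverse via Fermat's little theorem: a^(P-2) mod P.
uint64_t inv(uint64_t a)
{
    uint64_t result = 1;
    uint64_t base = a % P;
    for (uint64_t e = P - 2; e != 0; e >>= 1) {
        if (e & 1) result = mul(result, base);
        base = mul(base, base);
    }
    return result;
}

// In-place batch inversion. Element i belongs to chain i % K, so each group
// of K consecutive elements corresponds to one lane-wise (width-K) multiply.
// Precondition: every entry is in [1, P-1].
void batch_invert_k5(std::vector<uint64_t>& v)
{
    assert(v.size() % K == 0);
    std::vector<uint64_t> prefix(v.size());
    std::array<uint64_t, K> acc;
    acc.fill(1);

    // Forward pass: K running products advance in lockstep.
    for (std::size_t g = 0; g < v.size(); g += K) {
        for (std::size_t lane = 0; lane < K; ++lane) {
            prefix[g + lane] = acc[lane];
            acc[lane] = mul(acc[lane], v[g + lane]);
        }
    }

    // Collapse: a single field inversion covers all K chains.
    std::array<uint64_t, K> left{}; // left[l] = acc[0] * ... * acc[l-1]
    uint64_t running = 1;
    for (std::size_t lane = 0; lane < K; ++lane) {
        left[lane] = running;
        running = mul(running, acc[lane]);
    }
    uint64_t inv_running = inv(running);
    std::array<uint64_t, K> acc_inv{};
    for (std::size_t lane = K; lane-- > 0;) {
        acc_inv[lane] = mul(inv_running, left[lane]); // inverse of acc[lane]
        inv_running = mul(inv_running, acc[lane]);    // inverse of acc[0..lane-1]
    }

    // Backward pass: again K chains in lockstep, one group at a time.
    for (std::size_t g = v.size(); g > 0; g -= K) {
        for (std::size_t lane = 0; lane < K; ++lane) {
            const std::size_t i = g - K + lane;
            const uint64_t x = v[i];
            v[i] = mul(acc_inv[lane], prefix[i]);
            acc_inv[lane] = mul(acc_inv[lane], x);
        }
    }
}
```

Because the inner `lane` loops have no cross-lane dependency, each one maps to a single width-5 vector multiply, which is where the N to N/5 reduction in multiply instructions comes from.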

Dispatch: __wasm_simd128__ && Fq == bb::fq && num_points >= 20. Below
threshold or on native, falls through to the original K=1 path unchanged.
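The dispatch condition could be expressed roughly as follows. This is a sketch under assumptions: `ToyBn254Fq` and `ToyOtherFq` stand in for `bb::fq` and some other base field, and `can_use_k5`, `select_path`, and `K5_MIN_POINTS` are hypothetical names. Only `__wasm_simd128__` (the macro Clang defines when WASM SIMD is enabled) and the threshold of 20 come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct ToyBn254Fq {}; // hypothetical stand-in for bb::fq
struct ToyOtherFq {}; // hypothetical stand-in for another curve's base field

constexpr std::size_t K5_MIN_POINTS = 20; // threshold quoted in the commit message

// Compile-time half of the predicate: target has WASM SIMD and the field is BN254 Fq.
template <typename Fq> constexpr bool can_use_k5()
{
#if defined(__wasm_simd128__)
    return std::is_same_v<Fq, ToyBn254Fq>;
#else
    return false; // native builds always take the K=1 path
#endif
}

enum class Path { K1, K5 };

// Run-time half: only batches of at least K5_MIN_POINTS points use K=5.
template <typename Fq> Path select_path(std::size_t num_points)
{
    if constexpr (can_use_k5<Fq>()) {
        if (num_points >= K5_MIN_POINTS) {
            return Path::K5;
        }
    }
    return Path::K1;
}
```

Splitting the predicate this way means the K=5 branch is not even instantiated for native builds or other curves, so those configurations compile exactly the code they had before.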

Snapshot-before-write per group: output slot for one lane can alias the
input slot of a later lane in the same group; buffering all 5 lanes'
reads before any writes prevents y3 corruption at large N.

Tests: ecc_tests 37/37 PASS native + wasmtime (K=5 exercised under wasmtime).
…cross iterations

Hoists acc_lanes from std::array<Fq, 5> AoS into a VFq that stays in packed
(9×29-bit u32x4) form across the K=5 forward loop. Previously each iteration
did .to_array() at the end and reconstructed VFq from AoS at the start of
the next — two AoS↔packed transposes per group that did no work.

Per-group transpose count: 5 → 3 (the two Mont-muls are unchanged). Initial
VFq::broadcast(Fq::one()) replaces the AoS-of-ones ctor (broadcast packs
once and splats across the 4 quad lanes; the array ctor packs 5 copies and
transposes). Final .to_array() + 4-mul lane collapse runs once after the loop.
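For readers unfamiliar with the packed form, here is a scalar sketch of what an AoS to packed transpose involves. It assumes a 256-bit element stored as four little-endian 64-bit words and shows only four lanes (one u32x4 per limb index); how VFq lays out its fifth lane is not described in this message and is not modeled. All function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr unsigned LIMB_BITS = 29;
constexpr std::size_t NUM_LIMBS = 9; // 9 * 29 = 261 bits, enough for 256
constexpr uint32_t LIMB_MASK = (1u << LIMB_BITS) - 1;

using Words = std::array<uint64_t, 4>;          // little-endian 64-bit words
using Limbs = std::array<uint32_t, NUM_LIMBS>;  // little-endian 29-bit limbs

// Split a 256-bit value into nine 29-bit limbs.
Limbs to_29bit_limbs(const Words& x)
{
    Limbs out{};
    for (std::size_t i = 0; i < NUM_LIMBS; ++i) {
        const std::size_t bit = i * LIMB_BITS;
        const std::size_t word = bit / 64;
        const unsigned shift = static_cast<unsigned>(bit % 64);
        uint64_t chunk = x[word] >> shift;
        if (shift + LIMB_BITS > 64 && word + 1 < x.size()) {
            chunk |= x[word + 1] << (64 - shift); // limb straddles two words
        }
        out[i] = static_cast<uint32_t>(chunk & LIMB_MASK);
    }
    return out;
}

// Inverse of to_29bit_limbs for values below 2^256.
Words from_29bit_limbs(const Limbs& limbs)
{
    assert((limbs[NUM_LIMBS - 1] >> 24) == 0); // top limb holds only 24 bits
    Words x{};
    for (std::size_t i = 0; i < NUM_LIMBS; ++i) {
        const std::size_t bit = i * LIMB_BITS;
        const std::size_t word = bit / 64;
        const unsigned shift = static_cast<unsigned>(bit % 64);
        const uint64_t limb = limbs[i];
        x[word] |= limb << shift;
        if (shift + LIMB_BITS > 64 && word + 1 < x.size()) {
            x[word + 1] |= limb >> (64 - shift);
        }
    }
    return x;
}

// AoS -> packed: row j holds limb j of each of the four elements, i.e. the
// contents of one u32x4 register.
std::array<std::array<uint32_t, 4>, NUM_LIMBS> aos_to_packed(const std::array<Words, 4>& elems)
{
    std::array<std::array<uint32_t, 4>, NUM_LIMBS> packed{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        const Limbs limbs = to_29bit_limbs(elems[lane]);
        for (std::size_t j = 0; j < NUM_LIMBS; ++j) {
            packed[j][lane] = limbs[j];
        }
    }
    return packed;
}
```

Each transpose touches every limb of every lane, which is why keeping the accumulator in packed form across iterations looked attractive.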

The AoS acc_lanes is still declared at function scope so the backward K=5
prefix-product tree can index it scalar-style; it's filled exactly once via
the post-loop to_array(). batch_inversion_accumulator is initialized to
Fq::one() upfront so the CAN_USE_K5=false / batch-too-small path drops
through unchanged.

Tests: ecc_tests 886/886 PASS native, 865/865 PASS wasmtime (K=5 exercised).
…ove e2e

Reverts the body of 99eca83. Hoisting acc_lanes to a packed VFq across
the K=5 forward loop looked like it should drop two AoS↔packed transposes
per group, but benchmarking on V8/wasmtime showed no measurable change —
LLVM appears to already SROA the round-trip when acc_lanes' address
doesn't escape. Added a note in the K=5 forward header so the next person
doesn't redo the experiment.

Tests: ecc_tests 886/886 PASS native.