feat(bb): WebGPU field-mul bench + Karatsuba/sos3uv3 Mont mults #23341

Open
zac-williamson wants to merge 4 commits into sb/msm-webgpu from zw/msm-webgpu-mont-mul-bench

Summary

Adds a standalone WebGPU micro-benchmark page (bench-field-mul.html + headless Playwright driver) that compares three BN254 Montgomery product implementations for chained-mul throughput:

| Variant | Path | Median (n=2²⁰, k=100) | Δ vs cios |
|---|---|---|---|
| cios | u32 (20×13-bit) | ~109 ms | baseline |
| karat | u32 (20×13-bit) | ~80 ms | −27% |
| sos3uv3 | f32 (12×22-bit) | ~79 ms | −28% |

karat (u32, the main win)

Recursive Karatsuba (20×20 → 10×10 → 5×5) over unsigned 13-bit limbs, with Yuval reduction using precomputed r_inv = W⁻¹ mod p. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels. Zero drains in the multiply phase: a single pp_cr_C slot overflows u32 by ~1.25×, and the wrap unwinds correctly through subsequent unsigned subtraction (algebraic identity P_mid[m] = Σ(x_lo·y_hi + x_hi·y_lo) is non-negative per limb at lazy values). Fully unrolled via mustache so all indices are compile-time constants — naga SROAs the temp slots into registers instead of thread-private memory.
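The wrap-then-unwind argument can be illustrated on a single scalar Karatsuba level. The following is a minimal sketch, not the shader code: the helper names (`u32`, `mulU32`, `karatsubaMidU32`) are hypothetical, and it uses one 32-bit word per half instead of 13-bit limb arrays. It only shows that when the true middle term fits in 32 bits, unsigned subtraction modulo 2³² recovers it even though the cross product wrapped.

```typescript
// Hypothetical helpers: emulate WGSL u32 wraparound arithmetic in TypeScript.
const u32 = (v: number): number => v >>> 0;
const mulU32 = (a: number, b: number): number => Math.imul(a, b) >>> 0;

// One Karatsuba level: P_mid = (xLo + xHi)(yLo + yHi) - P_lo - P_hi.
// The cross product may exceed 2^32 and wrap; the two subtractions are also
// taken mod 2^32, so the result is exact whenever the true P_mid < 2^32.
function karatsubaMidU32(xLo: number, xHi: number, yLo: number, yHi: number): number {
  const pLo = mulU32(xLo, yLo);
  const pHi = mulU32(xHi, yHi);
  const cross = mulU32(u32(xLo + xHi), u32(yLo + yHi)); // may have wrapped
  return u32(u32(cross - pLo) - pHi);
}

// Exact reference for the middle term, computed without any wraparound.
function midReference(xLo: number, xHi: number, yLo: number, yHi: number): bigint {
  return BigInt(xLo) * BigInt(yHi) + BigInt(xHi) * BigInt(yLo);
}

// 80000 * 80000 = 6.4e9 overflows u32, yet the middle term 3.2e9 still fits.
const mid = karatsubaMidU32(40000, 40000, 40000, 40000);
console.log(mid, midReference(40000, 40000, 40000, 40000)); // 3200000000 3200000000n
```

The sketch relies on the same non-negativity property stated above: the middle term is a sum of products, so it never needs a signed representation.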

sos3uv3 (f32, kept as reference)

22-bit f32 limbs with separate per-slot tlo[k]/thi[k] accumulators that break the inner-j carry chain. Each j writes unique tlo[j-1] and thi[j] so there's no overlap or RAW dependency across iterations. Single drain at end of each outer iter via bias_split_f32_le4w. The 22-bit width buys an exact 4-way sum (4·W = 2²⁴ fits in the f32 mantissa).
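The exactness claim for the 4-way sum can be checked on the host with `Math.fround`, which rounds a value to the nearest f32. This is a small sketch under the assumption that each accumulated term is an integer below W = 2²²; the function name `sumF32` is illustrative and does not appear in the PR.

```typescript
const W = 2 ** 22; // limb base for 22-bit limbs

// Accumulate integer terms with every intermediate result rounded to f32,
// mimicking what an f32 accumulator on the GPU would hold.
function sumF32(terms: number[]): number {
  let acc = Math.fround(0);
  for (const t of terms) {
    acc = Math.fround(acc + t);
  }
  return acc;
}

const maxLimb = W - 1;

// Four maximal limbs: 4 * (2^22 - 1) = 2^24 - 4, still exactly representable.
const four = sumF32([maxLimb, maxLimb, maxLimb, maxLimb]);
console.log(four === 4 * maxLimb); // true

// A fifth term pushes the sum past 2^24, where odd integers are not
// representable in f32, so the accumulated value is no longer exact.
const five = sumF32([maxLimb, maxLimb, maxLimb, maxLimb, maxLimb]);
console.log(five === 5 * maxLimb); // false
```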

Test plan

  • yarn install to pick up playwright-core devDependency.
  • yarn generate:wgsl && yarn build:esm.
  • Start dev server: cd barretenberg/ts && ./node_modules/.bin/vite --config dev/msm-webgpu/vite.config.ts --no-open.
  • CLI bench: node barretenberg/ts/dev/msm-webgpu/scripts/bench-field-mul.mjs --path u32 --n 1048576 --k 100 --variant karat --validate-n 1024 --reps 6 (and --variant cios, and --path f32 --variant sos3uv3). Each should print VALIDATION OK and a timing reps=…median=… line.
  • Browser bench: open http://localhost:5173/dev/msm-webgpu/bench-field-mul.html?path=u32&variant=karat&n=1048576&k=100&validate-n=1024&reps=6 and read window.__bench.
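For scripting several runs, the bench URL from the last step can be assembled from its parameters. This is a sketch only: the `BenchParams` interface and the `benchUrl` function are hypothetical names, while the query keys (`path`, `variant`, `n`, `k`, `validate-n`, `reps`) and the page path are the ones listed above.

```typescript
// Hypothetical parameter bundle mirroring the query keys the bench page reads.
interface BenchParams {
  path: "u32" | "f32";
  variant: string;
  n: number;
  k: number;
  validateN: number;
  reps: number;
}

// Build the bench page URL for a given dev-server origin.
function benchUrl(origin: string, p: BenchParams): string {
  const url = new URL("/dev/msm-webgpu/bench-field-mul.html", origin);
  url.searchParams.set("path", p.path);
  url.searchParams.set("variant", p.variant);
  url.searchParams.set("n", String(p.n));
  url.searchParams.set("k", String(p.k));
  url.searchParams.set("validate-n", String(p.validateN));
  url.searchParams.set("reps", String(p.reps));
  return url.toString();
}

const karatUrl = benchUrl("http://localhost:5173", {
  path: "u32",
  variant: "karat",
  n: 2 ** 20,
  k: 100,
  validateN: 1024,
  reps: 6,
});
console.log(karatUrl);
```

Swapping `variant` to `cios`, or `path` to `f32` with `variant` set to `sos3uv3`, gives the other two configurations from the table.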

…ults

Adds a standalone WebGPU micro-benchmark page comparing three BN254
Montgomery product implementations for chained-mul throughput:

- cios (u32): mitschabaude runtime-loop CIOS over 20×13-bit limbs.
  Baseline, ~109 ms at n=2^20, k=100.
- karat (u32): recursive Karatsuba + Yuval reduction. Nine 5×5
  schoolbook sub-sub-products are computed independently and combined
  via two Karatsuba levels; reduction uses precomputed
  r_inv = W^-1 mod p with zero drains in the multiply phase (unsigned
  wrap unwinds via subsequent subtraction). ~80 ms (~27% faster than
  cios).
- sos3uv3 (f32, reference): 22-bit f32 limbs with separate per-slot
  tlo/thi accumulators that break the inner-j carry chain. Single
  drain per outer iter via bias_split_f32_le4w. ~79 ms.

The bench harness:
- bench-field-mul.html is a standalone page; reads ?path=u32|f32
  &n=N&k=K&validate-n=N&reps=R&variant=V from the URL.
- bench-field-mul.ts runs k chained Mont mults per thread, validates
  the first `validate-n` outputs against a host BigInt reference, and
  writes timing into window.__bench.
- scripts/bench-field-mul.mjs is a Playwright driver for headless
  invocation from the CLI (added playwright-core as devDependency).
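A host-side BigInt reference for the chained Montgomery multiply could look like the sketch below. It is not the harness code: all function names are hypothetical, and the Montgomery radix R = 2²⁶⁰ is an assumption derived from the 20×13-bit limb layout. The modulus is the BN254 base-field prime.

```typescript
// BN254 base-field modulus.
const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;
// Assumed Montgomery radix: 20 limbs of 13 bits each.
const R = 1n << 260n;

// Modular inverse via the extended Euclidean algorithm.
function modInv(a: bigint, m: bigint): bigint {
  let r0 = m, r1 = ((a % m) + m) % m;
  let t0 = 0n, t1 = 1n;
  while (r1 !== 0n) {
    const q = r0 / r1;
    [r0, r1] = [r1, r0 - q * r1];
    [t0, t1] = [t1, t0 - q * t1];
  }
  if (r0 !== 1n) throw new Error("value is not invertible");
  return ((t0 % m) + m) % m;
}

const R_INV = modInv(R, P);

const toMont = (a: bigint): bigint => (a * R) % P;
const fromMont = (a: bigint): bigint => (a * R_INV) % P;
// Montgomery product: x * y * R^-1 mod p.
const montMul = (x: bigint, y: bigint): bigint => (x * y * R_INV) % P;

// k chained multiplications, matching the shape of the GPU workload.
function chainedMontMul(x: bigint, y: bigint, k: number): bigint {
  let acc = x;
  for (let i = 0; i < k; i++) acc = montMul(acc, y);
  return acc;
}

// Split a field element into 20 little-endian 13-bit limbs and back.
function toLimbs(v: bigint): number[] {
  const limbs: number[] = [];
  for (let i = 0; i < 20; i++) {
    limbs.push(Number(v & 0x1fffn));
    v >>= 13n;
  }
  return limbs;
}
function fromLimbs(limbs: number[]): bigint {
  return limbs.reduceRight<bigint>((acc, l) => (acc << 13n) | BigInt(l), 0n);
}

// 2 * 2^10 computed in Montgomery form, then converted back.
console.log(fromMont(chainedMontMul(toMont(2n), toMont(2n), 10))); // 2048n
```

Comparing `toLimbs` of the reference result against the first `validate-n` GPU outputs is one way to implement the validation step described above.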

socket-security Bot commented May 16, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | npm/playwright-core@1.59.1 | 74 | 100 | 79 | 99 | 100 |

Routes the `montgomery_product_funcs` mustache partial through a
pre-rendered Karatsuba+Yuval body in every MSM shader that does a
base-field multiply (15 callsites: convert_points, smvp, horner,
batch_affine_{apply,schedule,finalize_*,init,apply_scatter},
batch_inverse{,_parallel}, bpr, decompress_g1, montgomery_parity).

The Karatsuba body benches ~27% faster than the mitschabaude
runtime-loop CIOS at n=2^20, k=100 (80 ms vs 109 ms). It exposes the
same `fn montgomery_product(x, y) -> BigInt` symbol plus the same
`get_p` / `conditional_reduce` helpers and uses the same 20×13-bit
limb layout, so the swap is a drop-in change with no callsite churn.

The field-mul bench retains both options (`?variant=cios` renders the
original template inline, `?variant=karat` reuses the class-level
default) so the two bodies can be compared side-by-side.
Phase 1 LANDED — BY safegcd inversion (fr_inv_by_a, Option A: 20×13-bit, BATCH=26, carry-free apply_matrix):
- Production swap-in: wgsl/cuzk/batch_inverse{,_parallel}.template.wgsl call fr_inv_by_a
- 1.5× faster than legacy fr_inv (Pornin K=12) at chained-inverse bench
- ~8% MSM wall reduction at logN=16 sanity check
- TS port (cuzk/bernstein_yang.ts, bernstein_yang_a.ts) + Jest tests (24 passing)
- WGSL impls: wgsl/field/by_inverse{,_a}.template.wgsl + wgsl/bigint/bigint_by.template.wgsl
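For orientation, the Bernstein-Yang divstep recurrence behind a safegcd inversion can be sketched one step at a time in BigInt. This is a textbook-style illustration, not a port of `fr_inv_by_a`: it omits the batching into transition matrices (BATCH=26) and the limb-level `apply_matrix`, and the names `invDivsteps`, `half`, and `mod` are hypothetical.

```typescript
const mod = (v: bigint, p: bigint): bigint => ((v % p) + p) % p;
// Halve v modulo an odd p, for v in [0, p).
const half = (v: bigint, p: bigint): bigint =>
  (v & 1n) === 0n ? v >> 1n : (v + p) >> 1n;

// Inverse of x modulo an odd prime p via single divsteps.
// Invariants: f ≡ d·x (mod p) and g ≡ e·x (mod p).
function invDivsteps(x: bigint, p: bigint): bigint {
  let delta = 1;
  let f = p;
  let g = mod(x, p);
  let d = 0n;
  let e = 1n;
  if (g === 0n) throw new Error("zero has no inverse");

  while (g !== 0n) {
    const gOdd = (g & 1n) === 1n;
    if (delta > 0 && gOdd) {
      // Swap case: (delta, f, g) -> (1 - delta, g, (g - f) / 2).
      const nextG = (g - f) >> 1n;
      const nextE = half(mod(e - d, p), p);
      delta = 1 - delta;
      f = g;
      d = e;
      g = nextG;
      e = nextE;
    } else {
      // No swap: (delta, f, g) -> (1 + delta, f, (g + (g mod 2)·f) / 2).
      if (gOdd) {
        g = g + f;
        e = mod(e + d, p);
      }
      delta = 1 + delta;
      g = g >> 1n;
      e = half(e, p);
    }
  }

  // At termination f = ±gcd(x, p) = ±1, and f ≡ d·x (mod p).
  if (f !== 1n && f !== -1n) throw new Error("x and p are not coprime");
  return f === 1n ? d : mod(-d, p);
}

console.log(invDivsteps(3n, 7n)); // 5n, since 3 * 5 = 15 ≡ 1 (mod 7)
```

A production kernel would run a fixed number of divsteps per batch and apply the accumulated 2×2 matrix to the limb vectors, which is where the constant-time and carry-free properties mentioned above come from.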

Phase 2 EXPLORATORY — multi-window pooled batch_inverse + multi-window BPR:
- WPB plumbing in batch_inverse_parallel + dispatch_args + batch_affine.ts
- Default WPB=1 (= legacy behavior, no perf change)
- BPR_WINDOWS_PER_BATCH knob in bpr_bn254.template.wgsl
- Empirical: pooling without growing WG count gives 0% gain — design needs restructure

Standalone bench infrastructure:
- bench-divsteps, bench-apply-matrix, bench-fr-inv, bench-batch-affine
- Each with HTML page + TS dispatcher + Playwright runner under dev/msm-webgpu/scripts/
- profile-sanity.mjs for per-pass GPU time breakdown on the Quick Sanity Check

Tree-reduce design (Stage B) for autonomous remote execution:
- .claude/plans/msm-tree-reduce.md — full design (adaptive batch sizing, analytical slice partition, 2 distinct phase kernels)
- .claude/plans/remote-agent-brief.md — remote agent execution brief

Co-authored with Claude.