perf(bb): iter 7 — MAX_SLICE_ENTRIES 2048->4096, MAX_PAIRS 1024->2048

AztecBot · AztecBot · commit 5b8a3f7a8f81 · 2026-05-18T11:03:41.000Z
Halves the per-layer WG count again. MAX_SLICE_ENTRIES and MAX_PAIRS are
doubled together so pair_count_wg can't overflow rank_to_raw (the iter5
failure mode). PER_THREAD_PAIRS = 16 (back to the iter3a value, before
TPB=128 cut it to 8). PER_THREAD_ENTRIES = 32, which is the safe upper
bound for the local_emit_mask/local_pair_mask u32 bitmasks. WG memory at
TPB=128 is ~21.6 KB (only TPB-sized arrays; the pair pool lives in the
global meta_pool).
Expected 10-30 ms savings on M2 vs iter3b's 221 ms. Validation: SwiftShader
deterministic across 3 runs, gpu_runs[0] == noble_x at logN=16.
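
Sanity check for the per-thread numbers quoted above. This is only a sketch:
it assumes the shader derives PER_THREAD_ENTRIES and PER_THREAD_PAIRS by
dividing each cap by TREE_TPB, which matches the figures in this message but
is not shown in this diff. perThreadBudget, TreeParams and U32_MASK_BITS are
hypothetical names, not identifiers from batch_affine.ts.

```typescript
// Hypothetical helper, not part of batch_affine.ts.
// Assumption: per-thread counts are cap / TPB, and each thread tracks its
// entries and pairs in a single u32 bitmask (so at most 32 of either).
const U32_MASK_BITS = 32;

interface TreeParams {
  tpb: number;             // threads per workgroup (TREE_TPB)
  maxSliceEntries: number; // TREE_MAX_SLICE_ENTRIES
  maxPairs: number;        // TREE_MAX_PAIRS
}

interface PerThreadBudget {
  entries: number; // PER_THREAD_ENTRIES
  pairs: number;   // PER_THREAD_PAIRS
}

function perThreadBudget(p: TreeParams): PerThreadBudget {
  if (p.maxSliceEntries % p.tpb !== 0 || p.maxPairs % p.tpb !== 0) {
    throw new Error("caps must be a multiple of TPB");
  }
  const entries = p.maxSliceEntries / p.tpb;
  const pairs = p.maxPairs / p.tpb;
  // One bit per entry/pair in a u32 mask: more than 32 would not fit.
  if (entries > U32_MASK_BITS || pairs > U32_MASK_BITS) {
    throw new Error(
      `per-thread count exceeds u32 mask: entries=${entries}, pairs=${pairs}`,
    );
  }
  return { entries, pairs };
}

// Values before and after this commit, both at TPB = 128.
const before = perThreadBudget({ tpb: 128, maxSliceEntries: 2048, maxPairs: 1024 });
const after = perThreadBudget({ tpb: 128, maxSliceEntries: 4096, maxPairs: 2048 });
console.log(before, after);
```

Under that assumption, doubling TREE_MAX_SLICE_ENTRIES once more (to 8192 at
TPB=128) would need 64 bits per thread and be rejected by the check, which is
consistent with 32 being described as the safe upper bound.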
diff --git a/barretenberg/ts/src/msm_webgpu/cuzk/batch_affine.ts b/barretenberg/ts/src/msm_webgpu/cuzk/batch_affine.ts
@@ -731,8 +731,8 @@ export const smvp_batch_affine_gpu = async (
     // bucket_active stays as init wrote it. The existing finalize stage
     // below consumes the populated running_x/y + bucket_active.
     const TREE_TPB = 128;
-    const TREE_MAX_SLICE_ENTRIES = 2048;
-    const TREE_MAX_PAIRS = 1024;
+    const TREE_MAX_SLICE_ENTRIES = 4096;
+    const TREE_MAX_PAIRS = 2048;
     const TREE_MAX_LAYERS = 25;
     const TREE_PRELUDE_WG_SIZE = 64;
     const TREE_SCAN_WG_SIZE = 256;