
Commit 5b8a3f7

perf(bb): iter 7 — MAX_SLICE_ENTRIES 2048->4096, MAX_PAIRS 1024->2048
Halves the per-layer workgroup (WG) count again. MAX_SLICE_ENTRIES (MSE) and MAX_PAIRS are doubled together so that pair_count_wg cannot overflow rank_to_raw, which was the iter5 failure mode. PER_THREAD_PAIRS = 2048 / 128 = 16, back to the iter3a value before the TPB=128 change cut it to 8. PER_THREAD_ENTRIES = 4096 / 128 = 32, which is the safe upper bound for the u32 bitmasks local_emit_mask and local_pair_mask, since a u32 holds 32 bits. WG memory at TPB=128 is ~21.6 KB, because only TPB-sized arrays live in workgroup memory; the pair pool lives in the global meta_pool. Expected savings: 10 to 30 ms on M2 vs. iter3b's 221 ms. Validation: SwiftShader output is deterministic across 3 runs, and gpu_runs[0] == noble_x at logN=16.
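The per-thread arithmetic above can be checked mechanically. Below is a minimal sketch, assuming PER_THREAD_ENTRIES and PER_THREAD_PAIRS are simply the slice and pair capacities divided by TPB, that each entry maps to one bit of a u32 mask, and that in the worst case every two entries form one pair. `deriveTreeConstants` and `TreeConfig` are hypothetical names, not part of batch_affine.ts.

```typescript
// Hypothetical sanity check for the tree-reduction constants; the names and
// checks are illustrative and are not taken from batch_affine.ts.
interface TreeConfig {
  tpb: number;             // threads per workgroup (TREE_TPB)
  maxSliceEntries: number; // TREE_MAX_SLICE_ENTRIES
  maxPairs: number;        // TREE_MAX_PAIRS
}

interface PerThread {
  perThreadEntries: number;
  perThreadPairs: number;
}

function deriveTreeConstants(cfg: TreeConfig): PerThread {
  const { tpb, maxSliceEntries, maxPairs } = cfg;
  if (maxSliceEntries % tpb !== 0 || maxPairs % tpb !== 0) {
    throw new Error("capacities must be multiples of TPB");
  }
  const perThreadEntries = maxSliceEntries / tpb;
  const perThreadPairs = maxPairs / tpb;
  // local_emit_mask / local_pair_mask are u32, so (assuming one bit per
  // entry) a thread can own at most 32 entries.
  if (perThreadEntries > 32) {
    throw new Error(`PER_THREAD_ENTRIES = ${perThreadEntries} exceeds u32 mask width`);
  }
  // Assumption: in the worst case every two entries form one pair, so the
  // pair capacity must cover half the slice. Doubling MSE without also
  // doubling MAX_PAIRS fails this check.
  if (maxPairs * 2 < maxSliceEntries) {
    throw new Error("MAX_PAIRS too small for MAX_SLICE_ENTRIES");
  }
  return { perThreadEntries, perThreadPairs };
}

console.log(deriveTreeConstants({ tpb: 128, maxSliceEntries: 4096, maxPairs: 2048 }));
// { perThreadEntries: 32, perThreadPairs: 16 }
```

With these assumptions, MSE = 8192 at TPB=128 would be rejected (64 entries per thread), and so would MSE = 4096 with MAX_PAIRS left at 1024.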
1 parent: 60f843e

1 file changed

Lines changed: 2 additions & 2 deletions


barretenberg/ts/src/msm_webgpu/cuzk/batch_affine.ts

@@ -731,8 +731,8 @@ export const smvp_batch_affine_gpu = async (
 // bucket_active stays as init wrote it. The existing finalize stage
 // below consumes the populated running_x/y + bucket_active.
 const TREE_TPB = 128;
-const TREE_MAX_SLICE_ENTRIES = 2048;
-const TREE_MAX_PAIRS = 1024;
+const TREE_MAX_SLICE_ENTRIES = 4096;
+const TREE_MAX_PAIRS = 2048;
 const TREE_MAX_LAYERS = 25;
 const TREE_PRELUDE_WG_SIZE = 64;
 const TREE_SCAN_WG_SIZE = 256;
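To illustrate why doubling TREE_MAX_SLICE_ENTRIES halves the per-layer WG count, here is a minimal sketch assuming each tree layer is cut into slices of at most TREE_MAX_SLICE_ENTRIES entries, one workgroup per slice. `workgroupsForLayer` and the 2^20-entry layer size are hypothetical and chosen only for the arithmetic.

```typescript
// Hypothetical model: each tree layer is split into slices of at most
// maxSliceEntries entries, and each slice is processed by one workgroup.
// This mapping is an assumption, not read from batch_affine.ts.
function workgroupsForLayer(liveEntries: number, maxSliceEntries: number): number {
  if (liveEntries <= 0) return 0;
  return Math.ceil(liveEntries / maxSliceEntries);
}

const layerEntries = 1 << 20; // illustrative layer size, not a measured value
const before = workgroupsForLayer(layerEntries, 2048);
const after = workgroupsForLayer(layerEntries, 4096);
console.log(before, after); // 512 256
```

Under this model the halving only applies to layers larger than one slice; a layer that already fits in a single slice still needs one workgroup.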
