Commit 5b8a3f7
perf(bb): iter 7 — MAX_SLICE_ENTRIES 2048->4096, MAX_PAIRS 1024->2048
Halves the per-layer WG count again. MAX_SLICE_ENTRIES (MSE) and MAX_PAIRS are
doubled together so pair_count_wg can't overflow rank_to_raw (which was the iter5 failure
mode). PER_THREAD_PAIRS = 16 (back to iter3a value before TPB=128 cut it
to 8). PER_THREAD_ENTRIES = 32, which is the safe upper bound for the
local_emit_mask/local_pair_mask u32 bitmasks. WG memory at TPB=128 is
~21.6 KB (only TPB-sized arrays — pair pool lives in global meta_pool).
Expected 10-30 ms savings on M2 vs iter3b's 221 ms. Validation: SwiftShader
deterministic across 3 runs, gpu_runs[0] == noble_x at logN=16.
1 parent 60f843e commit 5b8a3f7
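The sizing arithmetic in the commit message can be checked with a small sketch. It assumes that each per-thread count is the corresponding maximum divided evenly by TPB. That relation is inferred from the numbers quoted above (1024/128 = 8, 2048/128 = 16, 4096/128 = 32), not taken from the shader source, and the helper `per_thread` is hypothetical.

```python
# Hypothetical sketch of the sizing arithmetic. The relation
# PER_THREAD_* = MAX_* // TPB is an assumption inferred from the numbers
# in the commit message, not taken from the shader source.
TPB = 128                  # threads per workgroup, from the commit message
MAX_SLICE_ENTRIES = 4096   # was 2048 before this commit
MAX_PAIRS = 2048           # was 1024 before this commit
MASK_BITS = 32             # width of a u32 bitmask


def per_thread(total: int, tpb: int) -> int:
    """Items each thread handles when `total` items are split evenly."""
    if total % tpb != 0:
        raise ValueError("total must be a multiple of tpb")
    return total // tpb


PER_THREAD_ENTRIES = per_thread(MAX_SLICE_ENTRIES, TPB)
PER_THREAD_PAIRS = per_thread(MAX_PAIRS, TPB)

# One mask bit per per-thread item, so each count must fit in a u32.
assert PER_THREAD_ENTRIES <= MASK_BITS
assert PER_THREAD_PAIRS <= MASK_BITS

print(PER_THREAD_ENTRIES, PER_THREAD_PAIRS)
```

Under this assumption, doubling MAX_SLICE_ENTRIES once more at TPB=128 would need 64 mask bits per thread, which is consistent with the message calling 32 the safe upper bound for the u32 masks.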
1 file changed
Lines changed: 2 additions & 2 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 731 | 731 | |
| 732 | 732 | |
| 733 | 733 | |
| 734 | | - |
| 735 | | - |
| | 734 | + |
| | 735 | + |
| 736 | 736 | |
| 737 | 737 | |
| 738 | 738 | |