perf(gpu): tail-path kernel for jagged constraint poly eval zerocheck by erabinov · Pull Request #2722 · succinctlabs/sp1

erabinov · 2026-04-17T19:53:58Z

Summary

New zerocheck_eval_tail.cu CUDA kernel that exploits parallelism over evaluation points in the later (tail) rounds of zerocheck, where the per-point workload shrinks and standard per-variable parallelism leaves the GPU underutilized.
Threshold-based dispatch in sp1-gpu-zerocheck that switches to the tail kernel once the remaining variable count drops past a threshold.
Wiring in sp1-gpu-sys (v2_kernels.rs, CMakeLists.txt) to build and expose the new kernel.
BlockAir implementation for GlobalChip (14 blocks: Poseidon2 permutation rounds mirroring Poseidon2WideChip, separate blocks for the curve-formula check, y6 sign check, sum_checker_x/sum_checker_y, and a dedicated block for all interactions). RiscvAir::Global is now dispatched through the per-block path. This increases the parallelism of constraint-polynomial evaluation on the GlobalChip and gives a further speedup on top of the tail-kernel work.
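
The threshold dispatch described above can be sketched roughly as follows. This is a minimal model, not the code in sp1-gpu-zerocheck: the names `KernelPath`, `TAIL_THRESHOLD_VARS`, `select_kernel`, and `work_items` are hypothetical, the cutoff value is made up, and whether the real comparison is strict is an assumption. It only illustrates why spreading work over evaluation points exposes more parallelism once few variables remain.

```rust
/// Hypothetical kernel choice; these names are illustrative and are not
/// taken from sp1-gpu-zerocheck.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelPath {
    /// Parallelism over rows only (standing in for the standard path).
    Standard,
    /// Parallelism over (row, evaluation point) pairs (the tail path).
    Tail,
}

/// Assumed cutoff, chosen only for this sketch.
pub const TAIL_THRESHOLD_VARS: u32 = 10;

/// Picks a path for a round with `remaining_vars` unbound variables.
/// Using `<=` rather than `<` is an assumption.
pub fn select_kernel(remaining_vars: u32) -> KernelPath {
    if remaining_vars <= TAIL_THRESHOLD_VARS {
        KernelPath::Tail
    } else {
        KernelPath::Standard
    }
}

/// Number of independent work items each path exposes in this simplified
/// model, where a round touches 2^remaining_vars rows.
pub fn work_items(path: KernelPath, remaining_vars: u32, num_eval_points: u64) -> u64 {
    let rows = 1u64 << remaining_vars;
    match path {
        KernelPath::Standard => rows,
        KernelPath::Tail => rows * num_eval_points,
    }
}
```

In this model, a round with 3 remaining variables and 4 evaluation points exposes 8 work items on the standard path and 32 on the tail path, which is the kind of gap that leaves a GPU underutilized in late rounds.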


erabinov added 5 commits April 17, 2026 10:55

working to use the parallelism for eval points in later rounds of zerocheck

0b55cd6

threshold for tail

9106ed5

__inline__ instead of __noinline__ for the program evalution device function

ae6d165

air blocks for global

6d72667

Merge remote-tracking branch 'sp1/main' into erabinov/jagged-constraint-poly-eval-tail

57aa981