perf(gpu): tail-path kernel for jagged constraint poly eval zerocheck #2722

Draft
erabinov wants to merge 5 commits into main from erabinov/jagged-constraint-poly-eval-tail


erabinov commented Apr 17, 2026

Summary

  • New zerocheck_eval_tail.cu CUDA kernel that exploits parallelism over evaluation points in the later (tail) rounds of zerocheck, where the per-point workload shrinks and standard per-variable parallelism leaves the GPU underutilized.
  • Threshold-based dispatch in sp1-gpu-zerocheck that switches to the tail kernel once the remaining variable count crosses the threshold.
  • Wiring in sp1-gpu-sys (v2_kernels.rs, CMakeLists.txt) to build and expose the new kernel.
  • BlockAir implementation for GlobalChip with 14 blocks: Poseidon2 permutation rounds mirroring Poseidon2WideChip; separate blocks for the curve-formula check, the y6 sign check, and sum_checker_x/sum_checker_y; and a dedicated block for all interactions. RiscvAir::Global is now dispatched through the per-block path, which increases the parallelism of constraint-polynomial evaluation on GlobalChip and gives a further speedup on top of the tail-kernel work.
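
For reviewers unfamiliar with the dispatch idea, the threshold switch described above can be sketched roughly as follows. This is only an illustration, not the code in sp1-gpu-zerocheck: the names `ZerocheckKernel`, `select_kernel`, `dispatch_schedule`, and `TAIL_THRESHOLD_VARS`, the threshold value, and the `<=` comparison are all assumptions made for the example.

```rust
/// Hypothetical kernel selector; these names are illustrative and are
/// not taken from sp1-gpu-zerocheck.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZerocheckKernel {
    /// The existing path, used while many variables remain.
    Standard,
    /// The tail path, which parallelizes over evaluation points.
    Tail,
}

/// Assumed threshold, picked arbitrarily for this sketch.
const TAIL_THRESHOLD_VARS: u32 = 8;

/// Choose a kernel for one round given how many variables remain.
/// Assumption: the switch happens once `remaining_vars` is at or
/// below the threshold; the real comparison may differ.
fn select_kernel(remaining_vars: u32, threshold: u32) -> ZerocheckKernel {
    if remaining_vars <= threshold {
        ZerocheckKernel::Tail
    } else {
        ZerocheckKernel::Standard
    }
}

/// Build the per-round kernel schedule for a polynomial in `num_vars`
/// variables, where each round reduces the remaining count by one.
fn dispatch_schedule(num_vars: u32, threshold: u32) -> Vec<ZerocheckKernel> {
    (1..=num_vars)
        .rev()
        .map(|remaining| select_kernel(remaining, threshold))
        .collect()
}
```

Under these assumptions, a 10-variable instance with a threshold of 8 would run the first two rounds on the standard path and the remaining eight rounds on the tail path.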

