Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) #22522
Draft
aendk wants to merge 31 commits into ggml-org:master
Conversation
Hi @aendk, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (compute capability >= 9.0, i.e. Hopper and newer; this does not include Ada, which is CC 8.9).

It enables overlapping execution of CUDA kernels of the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device. The benefits of both are additive (PDL + CG > CG > PDL).
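For context, stock CUDA (11.8+) exposes PDL through a launch attribute on the dependent kernel plus two device-side intrinsics. The sketch below is a minimal illustration of that mechanism, independent of this PR's helpers; the kernel names and shapes are illustrative, and it requires compiling for `sm_90`:

```cuda
#include <cuda_runtime.h>

__global__ void producer(float * data) {
    data[threadIdx.x] = 1.0f;
    // Signal that the dependent kernel may begin launching now;
    // the rest of this kernel overlaps with the next one's prologue.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer(const float * data, float * out) {
    // Block until the producer grid's writes are complete and visible.
    cudaGridDependencySynchronize();
    out[threadIdx.x] = data[threadIdx.x] + 1.0f;
}

static void launch_pair(float * data, float * out, cudaStream_t stream) {
    producer<<<1, 32, 0, stream>>>(data);

    // Opt the consumer in to Programmatic Dependent Launch.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(1);
    cfg.blockDim = dim3(32);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, consumer, data, out);
}
```

Without the attribute, the two kernels serialize as usual; with it, the consumer's launch latency hides behind the producer's tail.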
This can best be seen visually in this Nsight Systems screenshot of a single CUDA stream; kernels which should normally be strictly ordered are run concurrently:
PDL was already proposed last year in #15479.
This PR integrates better with the CUDA graph semantics and has vastly better performance. On an RTX PRO 6000, a token generation speedup of 10% is not unusual; on DGX Spark I've seen a 4-5% improvement (model dependent, see detailed stats below).
For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (`GGML_CUDA_PDL_SYNC`) and a launch signal (`GGML_CUDA_PDL_LC`). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal indicates the point at which the current kernel can tolerate the start of the next kernel alongside it. Additionally, kernels need to be launched via the new `ggml_cuda_kernel_launch()` function.

The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the first "real" data access (i.e. excluding pointer arithmetic) of the kernel input. The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in `gpt-oss 20b`, `qwen3.5` and `nemotron 120B Super`. Because these kernels are shared with other models, I've tested more models. I saw speed-ups in almost all models in the token generation phase, with prefill/context phases being mostly neutral.

Applied Heuristics:
- Without a correctly placed `GGML_CUDA_PDL_SYNC`, a data race could occur. Before marking this merge-ready, I will double check this again. When reviewing, this should be kept in mind.
- Placing `GGML_CUDA_PDL_LC` is a bit of trial and error. This is visible in some kernels where I've commented out some suboptimal placements in some commits. In some kernels, placing `GGML_CUDA_PDL_LC` is even perf-negative (most notably `mul_mat_vec_q`). Generally, the earlier the signal is placed in the kernel, the more latency-limited the kernel is, and the more shared-resource contention (due to the premature launch of the successive kernel) the kernel can tolerate.

Further Info on this Implementation
- Important kernels such as `quantize_q8` and `mul_mat_vec_q` are enrolled in PDL and are present in many models.
- Placing the `GGML_CUDA_PDL_LC` flag is a bit of trial & error, but a good placement for one model appears to be beneficial for other models, too. In internal testing, I did not run into settings which are, for example, beneficial for model A but worse for model B.

Known issues/TODOs
- Double-check all placements of `GGML_CUDA_PDL_SYNC`.
- `GGML_CUDA_CC_HOPPER` did not work.

How to test it
You need to have a newer NVIDIA GPU (e.g. Blackwell), and you need to compile with `-D GGML_CUDA_PDL=ON`.

How to enroll other kernels into PDL
- Switch the launch to `ggml_cuda_kernel_launch()` and set `GGML_CUDA_PDL_SYNC()`. Modifying the kernel launch without setting the sync barrier leads to a race condition.
- Then, experiment with placements of `GGML_CUDA_PDL_LC()`. My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel. I then picked the best-performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive.

Let me know if you are able to test it! @ggerganov @JohannesGaessler @am17an @ORippler
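To make the placement rules concrete, here is a hedged sketch of what an enrolled kernel might look like. The kernel itself (`scale_f32`) is a made-up example, and I'm assuming the PR's macros roughly wrap the stock CUDA intrinsics `cudaGridDependencySynchronize()` and `cudaTriggerProgrammaticLaunchCompletion()`; consult the actual diff for the exact macro definitions and the `ggml_cuda_kernel_launch()` signature:

```cuda
// Hypothetical example kernel, not taken from this PR.
static __global__ void scale_f32(const float * x, float * dst, float scale, int n) {
    // Index computation is pure pointer arithmetic: safe to run before the
    // previous kernel's writes have landed, so no barrier is needed yet.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Sync barrier goes immediately before the first "real" read of kernel
    // input, per the placement rule above.
    GGML_CUDA_PDL_SYNC();
    const float v = (i < n) ? x[i] : 0.0f;

    // Launch-signal placement is found by benchmarking: earlier placements
    // allow more overlap with the next kernel, but also more shared-resource
    // contention. For some kernels (e.g. mul_mat_vec_q) omitting it is faster.
    GGML_CUDA_PDL_LC();

    if (i < n) {
        dst[i] = v * scale;
    }
}
```

The host side would then invoke the kernel through `ggml_cuda_kernel_launch()` instead of a plain `<<<...>>>` launch, so the backend can attach the PDL launch attribute when the feature is enabled.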
Performance:
RTX PRO 6000
DGX Spark
Requirements