
Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+)#22522

Draft
aendk wants to merge 31 commits into ggml-org:master from aendk:akieslinger/pdl-cuda-lc-experiments

Conversation

Contributor

@aendk aendk commented Apr 29, 2026

Overview

Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (compute capability >= 9.0, i.e. Hopper and later; this does not include Ada).
It enables overlapping execution of CUDA kernels of the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device. The benefits of both are additive (PDL + CG > CG > PDL).
This can best be seen visually in this Nsight Systems screenshot of a single CUDA stream; kernels which should normally be strictly ordered are run concurrently:
[Screenshot: Nsight Systems timeline showing kernels on a single CUDA stream executing concurrently]

PDL was already proposed last year in #15479.
This PR integrates better into the CUDA graph semantics, and has vastly better performance. On an RTX PRO 6000, a token generation phase speedup of 10% is not unusual, on DGX Spark, I've seen 4-5% improvement (model dependent, see detailed stats below).

For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (GGML_CUDA_PDL_SYNC) and a launch signal (GGML_CUDA_PDL_LC). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal marks the point at which the current kernel can tolerate the next kernel starting alongside it. Additionally, kernels need to be launched via the new ggml_cuda_kernel_launch() function.
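For context, a minimal sketch of the CUDA 11.8+ device-side primitives (sm_90+) that such barrier/signal macros presumably map to. The kernel and its arguments are made up for illustration; the actual macros in this PR may differ:

```cuda
#include <cuda_runtime.h>

// Hypothetical elementwise kernel illustrating the two PDL device primitives
// that GGML_CUDA_PDL_SYNC / GGML_CUDA_PDL_LC presumably wrap (sm_90+).
__global__ void scale_kernel(const float * x, float * dst, float s, int n) {
    // Index math before the barrier is safe: no input data is touched yet.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Synchronization barrier: before the first real read of the input, wait
    // until the preceding kernel on the stream has finished all its writes.
    cudaGridDependencySynchronize();

    if (i >= n) {
        return;
    }

    const float v = x[i] * s;

    // Launch signal: from here on, the next kernel on the stream may begin
    // launching. This is safe even before our own write completes, because
    // the successor's own sync barrier waits for this grid to finish.
    cudaTriggerProgrammaticLaunchCompletion();

    dst[i] = v;
}
```

Note that the trigger only gates when the successor may *launch*; the successor's own cudaGridDependencySynchronize() still waits for the full predecessor grid, which is why an early signal does not by itself cause a data race.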

The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the first "real" data access of the kernel input (i.e. excluding pointer arithmetic). The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in gpt-oss 20b, qwen3.5 and nemotron 120B Super. Because these kernels are shared with other models, I tested more models as well: I saw speed-ups in almost all models during token generation, with prefill/context phases being mostly neutral.

Applied Heuristics:

  • In this draft, for the synchronization barrier placement, I assumed that the first "real" data access of each kernel is to an input tensor. If there are cases where a preceding kernel outputs a scalar and the current kernel reads this scalar before GGML_CUDA_PDL_SYNC, a data race could occur. Before marking this merge-ready, I will double-check this; it should also be kept in mind when reviewing.
  • Correct placement of GGML_CUDA_PDL_LC is a bit of trial and error. This is visible in some kernels where I've left suboptimal placements commented out in some commits. In some kernels, placing GGML_CUDA_PDL_LC is even perf-negative (most notably mul_mat_vec_q). Generally, the earlier the signal can be placed, the more latency-limited the kernel is and the more shared-resource contention (due to the premature launch of the successor kernel) it can tolerate.

Further Info on this Implementation

  • This approach can be used even if some kernels in the graph are not enrolled into PDL. Whenever two successive kernels are enrolled, they leverage PDL (e.g. quantize_q8 and mul_mat_vec_q are enrolled in PDL and occur back-to-back in many models).
  • Kernels can be enrolled one-by-one.
  • Optimizing the placement of the GGML_CUDA_PDL_LC flag is a bit of trial and error, but good placement for one model appears to be beneficial for other models, too. In internal testing, I did not run into placements that are, for example, beneficial for model A but detrimental for model B.

Known issues/TODOs

  • Currently, there is no tooling like memcheck to identify a race condition in the case of an incorrectly placed GGML_CUDA_PDL_SYNC.
  • Need to find a way to automatically disable PDL for unsupported (NVIDIA) GPUs. A simple check on GGML_CUDA_CC_HOPPER did not work.
  • More kernels can be moved to PDL (different launch + sync barrier).
  • Need to remove commented out launch signal experimentation.
  • Like for CUDA graphs themselves, it might make sense to roll this feature out for token generation only at first. Need to check if that is feasible.

How to test it

You need a newer NVIDIA GPU (Hopper or later, e.g. Blackwell), and you need to compile with -D GGML_CUDA_PDL=ON.
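A minimal build-and-benchmark sketch, assuming a standard CMake checkout of the repository (model path is a placeholder):

```shell
# Build with PDL enabled (requires a CUDA toolchain and an sm_90+ GPU).
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_PDL=ON
cmake --build build --config Release -j

# Compare against a master build with llama-bench, e.g.:
./build/bin/llama-bench -m model.gguf -p 512,1024,2048 -n 128,256,512
```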

How to enroll other kernels into PDL

  • Step 1: modify the kernel launch to use ggml_cuda_kernel_launch() and set GGML_CUDA_PDL_SYNC(). Modifying the launch without setting the sync barrier leads to a race condition.
  • Step 2: Iterate on the placement of GGML_CUDA_PDL_LC(). My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel. I then picked the best performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive.
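For reference, the host side of a PDL-aware launch wrapper such as ggml_cuda_kernel_launch() presumably boils down to something like the following sketch against the stock CUDA runtime API (the wrapper name and shape here are hypothetical, not the PR's actual code):

```cuda
#include <cuda_runtime.h>

// Sketch of a PDL-enabled host launch using the stock CUDA runtime API
// (CUDA 11.8+). The wrapper itself is hypothetical.
template <typename... Args>
cudaError_t launch_with_pdl(void (*kernel)(Args...), dim3 grid, dim3 block,
                            size_t smem, cudaStream_t stream, Args... args) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem;
    cfg.stream           = stream;

    // Opt this launch into Programmatic Dependent Launch: the next kernel on
    // the stream may begin launching once this one signals completion (or
    // exits), instead of waiting for the full grid to retire first.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    return cudaLaunchKernelEx(&cfg, kernel, args...);
}
```

The attribute only takes effect when the device-side kernels cooperate (sync barrier in the consumer, launch signal in the producer), which is why both steps above are needed.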

Let me know if you are able to test it! @ggerganov @JohannesGaessler @am17an @ORippler

Performance:

RTX PRO 6000
| Model                              | Test   |   t/s master |   t/s akieslinger/pdl-cuda-lc-experiments |   Speedup |
|:-----------------------------------|:-------|-------------:|------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE              | pp512  |     12490.92 |                                  12424.74 |      0.99 |
| gpt-oss 20B MXFP4 MoE              | pp1024 |     12705.95 |                                  12729.38 |      1.00 |
| gpt-oss 20B MXFP4 MoE              | pp2048 |     12792.62 |                                  12828.74 |      1.00 |
| gpt-oss 20B MXFP4 MoE              | tg128  |       332.05 |                                    376.31 |      1.13 |
| gpt-oss 20B MXFP4 MoE              | tg256  |       335.49 |                                    375.20 |      1.12 |
| gpt-oss 20B MXFP4 MoE              | tg512  |       352.94 |                                    370.68 |      1.05 |
| llama 3B Q4_K_M                    | pp512  |     21970.62 |                                  21753.85 |      0.99 |
| llama 3B Q4_K_M                    | pp1024 |     21711.02 |                                  21676.37 |      1.00 |
| llama 3B Q4_K_M                    | pp2048 |     20886.10 |                                  20911.59 |      1.00 |
| llama 3B Q4_K_M                    | tg128  |       405.95 |                                    437.33 |      1.08 |
| llama 3B Q4_K_M                    | tg256  |       421.68 |                                    436.90 |      1.04 |
| llama 3B Q4_K_M                    | tg512  |       403.06 |                                    433.63 |      1.08 |
| llama 70B Q4_K_M                   | pp512  |      1247.76 |                                   1262.12 |      1.01 |
| llama 70B Q4_K_M                   | pp1024 |      1255.38 |                                   1249.06 |      0.99 |
| llama 70B Q4_K_M                   | pp2048 |      1237.33 |                                   1232.74 |      1.00 |
| llama 70B Q4_K_M                   | tg128  |        29.85 |                                     29.98 |      1.00 |
| llama 70B Q4_K_M                   | tg256  |        29.58 |                                     29.68 |      1.00 |
| llama 70B Q4_K_M                   | tg512  |        29.21 |                                     29.34 |      1.00 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp512  |      2249.21 |                                   2206.37 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp1024 |      2240.98 |                                   2201.13 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp2048 |      2239.70 |                                   2194.38 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg128  |        90.10 |                                     95.97 |      1.07 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg256  |        91.42 |                                     96.03 |      1.05 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg512  |        91.65 |                                     95.55 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp512  |      7364.98 |                                   7316.76 |      0.99 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp1024 |      7229.96 |                                   7203.64 |      1.00 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp2048 |      7230.63 |                                   7186.50 |      0.99 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg128  |       274.07 |                                    325.74 |      1.19 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg256  |       286.29 |                                    327.51 |      1.14 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg512  |       286.71 |                                    326.74 |      1.14 |
| qwen3 4B Q4_K_M                    | pp512  |     17249.67 |                                  17036.57 |      0.99 |
| qwen3 4B Q4_K_M                    | pp1024 |     16219.11 |                                  16239.87 |      1.00 |
| qwen3 4B Q4_K_M                    | pp2048 |     15760.55 |                                  15732.80 |      1.00 |
| qwen3 4B Q4_K_M                    | tg128  |       295.78 |                                    335.27 |      1.13 |
| qwen3 4B Q4_K_M                    | tg256  |       296.67 |                                    335.11 |      1.13 |
| qwen3 4B Q4_K_M                    | tg512  |       314.09 |                                    332.05 |      1.06 |
| qwen35 27B Q4_K_M                  | pp512  |      2889.48 |                                   2874.02 |      0.99 |
| qwen35 27B Q4_K_M                  | pp1024 |      2858.55 |                                   2857.95 |      1.00 |
| qwen35 27B Q4_K_M                  | pp2048 |      2857.10 |                                   2845.58 |      1.00 |
| qwen35 27B Q4_K_M                  | tg128  |        65.45 |                                     67.58 |      1.03 |
| qwen35 27B Q4_K_M                  | tg256  |        65.92 |                                     67.40 |      1.02 |
| qwen35 27B Q4_K_M                  | tg512  |        65.44 |                                     66.92 |      1.02 |
| qwen35moe 35B.A3B Q4_K_M           | pp512  |      7267.56 |                                   7275.02 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | pp1024 |      7173.63 |                                   7221.01 |      1.01 |
| qwen35moe 35B.A3B Q4_K_M           | pp2048 |      7127.54 |                                   7154.39 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | tg128  |       191.59 |                                    233.26 |      1.22 |
| qwen35moe 35B.A3B Q4_K_M           | tg256  |       212.29 |                                    234.41 |      1.10 |
| qwen35moe 35B.A3B Q4_K_M           | tg512  |       211.76 |                                    233.37 |      1.10 |
DGX Spark
| Model                              | Test   |   t/s akmaster |   t/s akieslinger/pdl-cuda-lc-experiments |   Speedup |
|:-----------------------------------|:-------|---------------:|------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE              | pp512  |        4102.32 |                                   4242.29 |      1.03 |
| gpt-oss 20B MXFP4 MoE              | pp1024 |        4144.62 |                                   4339.26 |      1.05 |
| gpt-oss 20B MXFP4 MoE              | pp2048 |        4136.31 |                                   4347.89 |      1.05 |
| gpt-oss 20B MXFP4 MoE              | tg128  |          79.53 |                                     84.05 |      1.06 |
| gpt-oss 20B MXFP4 MoE              | tg256  |          79.55 |                                     84.11 |      1.06 |
| gpt-oss 20B MXFP4 MoE              | tg512  |          78.97 |                                     83.55 |      1.06 |
| llama 3B Q4_K_M                    | pp512  |        7441.01 |                                   7372.57 |      0.99 |
| llama 3B Q4_K_M                    | pp1024 |        7344.68 |                                   7405.66 |      1.01 |
| llama 3B Q4_K_M                    | pp2048 |        7226.86 |                                   7340.45 |      1.02 |
| llama 3B Q4_K_M                    | tg128  |          88.49 |                                     90.37 |      1.02 |
| llama 3B Q4_K_M                    | tg256  |          88.42 |                                     90.32 |      1.02 |
| llama 3B Q4_K_M                    | tg512  |          87.71 |                                     89.71 |      1.02 |
| llama 70B Q4_K_M                   | pp512  |         315.86 |                                    316.65 |      1.00 |
| llama 70B Q4_K_M                   | pp1024 |         314.19 |                                    315.18 |      1.00 |
| llama 70B Q4_K_M                   | pp2048 |         311.22 |                                    311.97 |      1.00 |
| llama 70B Q4_K_M                   | tg128  |           4.63 |                                      4.69 |      1.01 |
| llama 70B Q4_K_M                   | tg256  |           4.63 |                                      4.69 |      1.01 |
| llama 70B Q4_K_M                   | tg512  |           4.62 |                                      4.69 |      1.01 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp512  |         571.02 |                                    573.89 |      1.01 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp1024 |         548.65 |                                    574.55 |      1.05 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp2048 |         571.77 |                                    574.15 |      1.00 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg128  |          16.51 |                                     16.95 |      1.03 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg256  |          16.56 |                                     16.94 |      1.02 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg512  |          16.52 |                                     16.89 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp512  |        2188.27 |                                   2233.61 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp1024 |        2213.50 |                                   2255.60 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp2048 |        2221.78 |                                   2245.72 |      1.01 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg128  |          73.50 |                                     76.68 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg256  |          73.75 |                                     76.81 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg512  |          73.57 |                                     76.61 |      1.04 |
| qwen3 4B Q4_K_M                    | pp512  |        5470.71 |                                   5420.62 |      0.99 |
| qwen3 4B Q4_K_M                    | pp1024 |        5304.73 |                                   5413.33 |      1.02 |
| qwen3 4B Q4_K_M                    | pp2048 |        5234.26 |                                   5294.97 |      1.01 |
| qwen3 4B Q4_K_M                    | tg128  |          70.79 |                                     72.88 |      1.03 |
| qwen3 4B Q4_K_M                    | tg256  |          70.75 |                                     72.83 |      1.03 |
| qwen3 4B Q4_K_M                    | tg512  |          70.17 |                                     72.29 |      1.03 |
| qwen35 27B Q4_K_M                  | pp512  |         801.70 |                                    810.28 |      1.01 |
| qwen35 27B Q4_K_M                  | pp1024 |         807.04 |                                    815.69 |      1.01 |
| qwen35 27B Q4_K_M                  | pp2048 |         799.88 |                                    811.95 |      1.02 |
| qwen35 27B Q4_K_M                  | tg128  |          11.23 |                                     11.48 |      1.02 |
| qwen35 27B Q4_K_M                  | tg256  |          11.23 |                                     11.46 |      1.02 |
| qwen35 27B Q4_K_M                  | tg512  |          11.22 |                                     11.46 |      1.02 |
| qwen35moe 35B.A3B Q4_K_M           | pp512  |        2312.35 |                                   2310.93 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | pp1024 |        2323.34 |                                   2340.47 |      1.01 |
| qwen35moe 35B.A3B Q4_K_M           | pp2048 |        2346.21 |                                   2329.26 |      0.99 |
| qwen35moe 35B.A3B Q4_K_M           | tg128  |          60.31 |                                     62.98 |      1.04 |
| qwen35moe 35B.A3B Q4_K_M           | tg256  |          60.27 |                                     62.72 |      1.04 |
| qwen35moe 35B.A3B Q4_K_M           | tg512  |          60.04 |                                     62.50 |      1.04 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for small autocompletes and inquiries about the code base. Every diff was manually modified, checked and tested by me before adding it to a commit.

aendk added 30 commits February 4, 2026 15:39
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 29, 2026

Hi @aendk, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 29, 2026