[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy by gaugarg-nv · Pull Request #25057 · ggml-org/llama.cpp

gaugarg-nv · 2026-06-26T16:08:07Z

Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that reduce to 2D pitched block copies. When the tensors are not fully contiguous but each row is, the copy now goes through cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps. It has no impact on -np 1 performance, since that case already takes the optimized cudaMemcpyAsync path.
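The eligibility condition described above can be sketched on the host as follows. This is a minimal illustration, not the code from this PR: `View2D`, `is_pitched_2d_copy`, and `pitched_copy_host` are hypothetical names, the tensor is reduced to two dimensions, and the requirement that each pitch is at least the row width is an assumption of the sketch. The loop models what a single `cudaMemcpy2DAsync` call does on the device.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 2D tensor view; this is not the ggml_tensor layout.
struct View2D {
    std::size_t  elem_size;  // bytes per element (same type implies same size)
    std::int64_t ne0;        // elements per row
    std::int64_t ne1;        // number of rows
    std::size_t  nb0;        // byte stride between elements within a row
    std::size_t  nb1;        // byte stride between rows (the "pitch")
};

// True when both views have the same shape and every row is densely packed,
// so the whole copy is a pitched 2D block copy.
inline bool is_pitched_2d_copy(const View2D & src, const View2D & dst) {
    const bool same_shape = src.elem_size == dst.elem_size &&
                            src.ne0 == dst.ne0 && src.ne1 == dst.ne1;
    const bool rows_dense = src.nb0 == src.elem_size && dst.nb0 == dst.elem_size;
    const std::size_t width = src.elem_size * static_cast<std::size_t>(src.ne0);
    const bool pitch_ok = src.nb1 >= width && dst.nb1 >= width;  // assumption of this sketch
    return same_shape && rows_dense && pitch_ok;
}

// Host-side model of the copy: `height` rows of `width` bytes each.
// On the GPU the equivalent single call would be
//   cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height,
//                     cudaMemcpyDeviceToDevice, stream);
inline void pitched_copy_host(void * dst, std::size_t dpitch,
                              const void * src, std::size_t spitch,
                              std::size_t width, std::size_t height) {
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(static_cast<char *>(dst) + row * dpitch,
                    static_cast<const char *>(src) + row * spitch,
                    width);
    }
}
```

With `width = ne0 * elem_size`, `height = ne1`, and the pitches taken from `nb1`, the gaps between rows are skipped on the source side and left untouched on the destination side.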

With MTP=3 on an RTX 5090, Unsloth Qwen3.6-27B Q4_K_M with -np 4 shows a throughput gain of ~7.5%.

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=147.0
  code_cpp           pred= 192 draft= 179 acc= 130 rate=0.726 tok/s=132.6
  explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=126.5
  summarize          pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=141.8
  qa_factual         pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=134.1
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=129.5
  creative_short     pred= 192 draft= 201 acc= 122 rate=0.607 tok/s=117.3
  stepwise_math      pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=133.0
  long_code_review   pred= 192 draft= 206 acc= 122 rate=0.592 tok/s=115.7

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=157.9
  code_cpp           pred= 192 draft= 179 acc= 130 rate=0.726 tok/s=142.4
  explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=135.9
  summarize          pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=151.8
  qa_factual         pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=143.8
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=139.2
  creative_short     pred= 192 draft= 201 acc= 122 rate=0.607 tok/s=126.3
  stepwise_math      pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=143.5
  long_code_review   pred= 192 draft= 206 acc= 122 rate=0.592 tok/s=124.7
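The ~7.5% figure can be checked directly from the two listings. The pred, draft, acc, and rate columns are identical between the runs, so only tok/s changes. The sketch below, with hypothetical names `kMaster`, `kPr`, and `mean_speedup`, averages the per-prompt PR/master ratios; it assumes a plain arithmetic mean, which may not be how the author derived the number.

```cpp
#include <array>
#include <cstddef>

// tok/s values copied from the two benchmark runs above, in listing order.
constexpr std::array<double, 9> kMaster = {147.0, 132.6, 126.5, 141.8, 134.1,
                                           129.5, 117.3, 133.0, 115.7};
constexpr std::array<double, 9> kPr     = {157.9, 142.4, 135.9, 151.8, 143.8,
                                           139.2, 126.3, 143.5, 124.7};

// Arithmetic mean of the per-prompt PR/master throughput ratios.
inline double mean_speedup() {
    double sum = 0.0;
    for (std::size_t i = 0; i < kMaster.size(); ++i) {
        sum += kPr[i] / kMaster[i];
    }
    return sum / static_cast<double>(kMaster.size());
}
```

The per-prompt ratios fall between about 1.07 and 1.08, and their mean is about 1.075, consistent with the stated gain.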

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, paired with Codex.


am17an

Can we add a test-backend-ops for this shape in case it doesn't exist?

gaugarg-nv · 2026-06-27T05:27:02Z

> Can we add a test-backend-ops for this shape in case it doesn't exist?

Thanks. There indeed was no test executing this path. I have added one now.
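The test added in the PR is not shown on this page. As a rough illustration of what such a check verifies, the host-side sketch below compares an element-by-element strided copy against a row-wise block copy on a layout with gaps between rows. All names are hypothetical, strides are given in elements rather than bytes, and nothing here uses the test-backend-ops API.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Reference: element-by-element copy, mirroring what a scalar copy kernel computes.
inline void copy_elementwise(float * dst, std::size_t dst_stride,
                             const float * src, std::size_t src_stride,
                             std::size_t cols, std::size_t rows) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[r * dst_stride + c] = src[r * src_stride + c];
        }
    }
}

// Candidate: one block copy per contiguous row, as a pitched 2D copy would do.
inline void copy_rowwise(float * dst, std::size_t dst_stride,
                         const float * src, std::size_t src_stride,
                         std::size_t cols, std::size_t rows) {
    for (std::size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * dst_stride, src + r * src_stride, cols * sizeof(float));
    }
}

// Returns true when both strategies produce identical destination buffers,
// including the padding between rows, which neither may touch.
// Requires src_stride >= cols and dst_stride >= cols.
inline bool strided_copy_paths_agree(std::size_t cols, std::size_t rows,
                                     std::size_t src_stride, std::size_t dst_stride) {
    std::vector<float> src(rows * src_stride);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    std::vector<float> expected(rows * dst_stride, -1.0f);  // -1 marks untouched padding
    std::vector<float> actual(rows * dst_stride, -1.0f);
    copy_elementwise(expected.data(), dst_stride, src.data(), src_stride, cols, rows);
    copy_rowwise(actual.data(), dst_stride, src.data(), src_stride, cols, rows);
    return expected == actual;
}
```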


gaugarg-nv · 2026-06-27T08:34:31Z

The OpenVINO backend is failing these two tests, so I am adding checks in OpenVINO to return unsupported for strided copies.
OpenVINO doesn't hit this path for GDN, as the K > 1 (MTP) case is already unsupported there.
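A capability check of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the OpenVINO backend's code: `TensorDesc`, `is_dense`, and `backend_supports_cpy` are hypothetical names, and the tensor is reduced to two dimensions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal tensor description; a real backend inspects ggml_tensor.
struct TensorDesc {
    std::size_t  elem_size;  // bytes per element
    std::int64_t ne[2];      // elements per dimension
    std::size_t  nb[2];      // byte stride per dimension
};

// Dense means no gaps between elements and no gaps between rows.
inline bool is_dense(const TensorDesc & t) {
    return t.nb[0] == t.elem_size &&
           t.nb[1] == t.elem_size * static_cast<std::size_t>(t.ne[0]);
}

// Hypothetical capability check: decline any copy with a strided operand,
// so the case is reported as unsupported instead of being executed.
inline bool backend_supports_cpy(const TensorDesc & src, const TensorDesc & dst) {
    return is_dense(src) && is_dense(dst);
}
```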

gaugarg-nv requested a review from a team as a code owner June 26, 2026 16:08

github-actions Bot added the ggml and CUDA labels Jun 26, 2026

am17an approved these changes Jun 27, 2026


Add new tests that execute the new optimized strided copy path

86474fa

gaugarg-nv requested a review from ggerganov as a code owner June 27, 2026 05:25

github-actions Bot added the testing Everything test related label Jun 27, 2026

Return unsupported for strided copy in OpenVINO, as new tests are failing

e78d7d2

github-actions Bot added the OpenVINO label Jun 27, 2026

gaugarg-nv wants to merge 3 commits into ggml-org:master from gaugarg-nv:fast_ggml_cpy
