[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy#25057

Open
gaugarg-nv wants to merge 3 commits into ggml-org:master from gaugarg-nv:fast_ggml_cpy

@gaugarg-nv

Contributor

Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous, but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.
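The eligibility test described above can be sketched on the host as follows. This is a minimal illustration, not the PR's actual code: `TensorView`, `PitchedCopy`, `as_pitched_rows`, and `can_use_2d_copy` are hypothetical names, and the sketch assumes a non-quantized element type and only the simplest case, where dimension 0 is the contiguous row and all higher dimensions share one row pitch. If the check passes, the resulting values would map onto the `dpitch`, `spitch`, `width`, and `height` arguments of `cudaMemcpy2DAsync`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a tensor: ne[] holds element counts and nb[] holds
// byte strides, following ggml's naming, but this struct is not ggml's.
struct TensorView {
    int64_t ne[4];
    size_t  nb[4];
    size_t  elem_size;  // bytes per element (assumes a non-quantized type)
};

struct PitchedCopy {
    size_t width;   // contiguous bytes per row
    size_t height;  // number of rows
    size_t pitch;   // bytes between the starts of consecutive rows
};

// Returns true and fills `out` if `t` is `height` rows of `width` contiguous
// bytes, with every row start separated by the same pitch.
inline bool as_pitched_rows(const TensorView & t, PitchedCopy & out) {
    if (t.nb[0] != t.elem_size) return false;  // elements in a row must be packed
    const size_t width = static_cast<size_t>(t.ne[0]) * t.elem_size;
    const size_t pitch = t.nb[1];
    if (pitch < width) return false;           // rows would overlap
    // Higher dimensions must continue the same row spacing.
    if (t.nb[2] != pitch * static_cast<size_t>(t.ne[1])) return false;
    if (t.nb[3] != t.nb[2] * static_cast<size_t>(t.ne[2])) return false;
    out = { width, static_cast<size_t>(t.ne[1] * t.ne[2] * t.ne[3]), pitch };
    return true;
}

// Same element size stands in for "same type"; shapes must match exactly.
inline bool can_use_2d_copy(const TensorView & src, const TensorView & dst,
                            PitchedCopy & s, PitchedCopy & d) {
    if (src.elem_size != dst.elem_size) return false;
    for (int i = 0; i < 4; ++i) {
        if (src.ne[i] != dst.ne[i]) return false;
    }
    return as_pitched_rows(src, s) && as_pitched_rows(dst, d);
}
```

A real implementation would likely also try to merge contiguous leading dimensions into a wider row before giving up, which this sketch does not attempt.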

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps. It has no impact on -np 1 performance, because that case already uses the optimized cudaMemcpyAsync path.
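To make the "slots separated by stride gaps" picture concrete, here is a host-side reference of what a 2D pitched copy does. It mirrors the row/pitch semantics of `cudaMemcpy2D` using plain `memcpy`, so it runs without a GPU. The slot layout in `update_slots` (a payload of `slot_bytes` at the start of every `stride_bytes` region) is an assumption made for illustration and is not taken from the llama.cpp cache code; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host reference for a 2D pitched copy: copy `height` rows of `width` bytes,
// where row i starts at base + i * pitch on each side.
inline void pitched_copy_reference(void * dst, size_t dpitch,
                                   const void * src, size_t spitch,
                                   size_t width, size_t height) {
    assert(dpitch >= width && spitch >= width);
    auto * d = static_cast<unsigned char *>(dst);
    auto * s = static_cast<const unsigned char *>(src);
    for (size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

// Hypothetical layout: `packed` holds the slot payloads back to back, while
// `cache` stores one payload at the start of every `stride_bytes` region.
// Bytes in the gaps between slots are left untouched.
inline std::vector<unsigned char> update_slots(std::vector<unsigned char> cache,
                                               const std::vector<unsigned char> & packed,
                                               size_t slot_bytes, size_t stride_bytes) {
    const size_t n_slots = packed.size() / slot_bytes;
    if (n_slots == 0) return cache;
    assert(cache.size() >= (n_slots - 1) * stride_bytes + slot_bytes);
    pitched_copy_reference(cache.data(), stride_bytes,
                           packed.data(), slot_bytes,
                           slot_bytes, n_slots);
    return cache;
}
```

With a single slot the height is 1 and the copy degenerates to one contiguous block, which is consistent with the -np 1 case already being served by a plain contiguous copy.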

With MTP=3 on an RTX 5090, running Unsloth Qwen3.6-27B Q4_K_M with -np 4, throughput (tok/s) improves by ~7.5%.

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=147.0
  code_cpp           pred= 192 draft= 179 acc= 130 rate=0.726 tok/s=132.6
  explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=126.5
  summarize          pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=141.8
  qa_factual         pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=134.1
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=129.5
  creative_short     pred= 192 draft= 201 acc= 122 rate=0.607 tok/s=117.3
  stepwise_math      pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=133.0
  long_code_review   pred= 192 draft= 206 acc= 122 rate=0.592 tok/s=115.7

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 162 acc= 137 rate=0.846 tok/s=157.9
  code_cpp           pred= 192 draft= 179 acc= 130 rate=0.726 tok/s=142.4
  explain_concept    pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=135.9
  summarize          pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=151.8
  qa_factual         pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=143.8
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=139.2
  creative_short     pred= 192 draft= 201 acc= 122 rate=0.607 tok/s=126.3
  stepwise_math      pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=143.5
  long_code_review   pred= 192 draft= 206 acc= 122 rate=0.592 tok/s=124.7
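
The ~7.5% figure can be checked directly from the two tables. The sketch below hard-codes the tok/s columns from the Master and PR runs, in the same prompt order, and averages the per-prompt gains; `gain_percent` and `mean_gain_percent` are names introduced here for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// tok/s from the Master and PR runs above, in the same prompt order.
constexpr std::array<double, 9> master_tps = {
    147.0, 132.6, 126.5, 141.8, 134.1, 129.5, 117.3, 133.0, 115.7};
constexpr std::array<double, 9> pr_tps = {
    157.9, 142.4, 135.9, 151.8, 143.8, 139.2, 126.3, 143.5, 124.7};

// Relative throughput gain of `after` over `before`, in percent.
inline double gain_percent(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

// Arithmetic mean of the per-prompt gains.
inline double mean_gain_percent() {
    double sum = 0.0;
    for (std::size_t i = 0; i < master_tps.size(); ++i) {
        sum += gain_percent(master_tps[i], pr_tps[i]);
    }
    return sum / static_cast<double>(master_tps.size());
}
```

Per-prompt gains range from roughly 7.0% to 7.9%, and their mean comes out close to 7.5%. The draft and acceptance counts are identical between the two runs, so the difference is in execution speed rather than in speculation behavior.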

@gaugarg-nv gaugarg-nv requested a review from a team as a code owner June 26, 2026 16:08
@github-actions github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and CUDA (Related to the CUDA backend) labels Jun 26, 2026

@am17an am17an left a comment

Contributor

Can we add a test-backend-ops for this shape in case it doesn't exist?

@gaugarg-nv gaugarg-nv requested a review from ggerganov as a code owner June 27, 2026 05:25
@gaugarg-nv

Contributor Author

> Can we add a test-backend-ops for this shape in case it doesn't exist?

Thanks. There indeed was no test executing this path. I have added one now.

@github-actions github-actions Bot added the testing (Everything test related) label Jun 27, 2026
@gaugarg-nv

Contributor Author

The OpenVINO backend is failing these two tests. I am adding checks in OpenVINO so that it reports strided copies as unsupported.
OpenVINO doesn't hit this path for GDN because the K > 1 (MTP) case is already unsupported there.

Labels

CUDA (Related to the CUDA backend), ggml (changes relating to the ggml tensor library for machine learning), OpenVINO, testing (Everything test related)

2 participants