[TENT] replace cudaMemcpyAsync with cuMemcpy to avoid deadlock #2094

Open
alogfans wants to merge 6 commits into kvcache-ai:main from alogfans:fix/tent-cuda-cumemcpy-deadlock

Conversation

@alogfans
Collaborator

Replace CUDA Runtime API's cudaMemcpyAsync + cudaStreamSynchronize with CUDA Driver API's cuMemcpy in CudaPlatform::copy(). This avoids potential deadlocks caused by stream synchronization issues in downstream components like mooncake-pg.

The cuMemcpy API is synchronous but doesn't rely on the legacy default stream, providing better performance characteristics than cudaMemcpy while avoiding the cudaStreamSynchronize deadlocks that can occur with cudaMemcpyAsync.

Changes:

  • Add CUDA Driver API header (cuda.h)
  • Fix typo in stduint-uintn.h header include
  • Replace cudaMemcpyAsync + cudaStreamSynchronize with cuMemcpy
  • Add CUDA Driver API initialization check
  • Improve error handling with CUDA error string retrieval
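The change described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual diff: the `CudaPlatform::copy` signature and the `Status` helpers are assumptions based on the names mentioned in this PR, and building it requires the CUDA toolkit.

```cpp
#include <cuda.h>
#include <string>

// Sketch: copy() using the Driver API's cuMemcpy instead of
// cudaMemcpyAsync + cudaStreamSynchronize. cuMemcpy is synchronous
// and, under unified virtual addressing, infers the transfer
// direction from the pointer values, so no cudaMemcpyKind or
// stream synchronization is needed.
Status CudaPlatform::copy(void* dst, const void* src, size_t size) {
    CUresult rc = cuMemcpy(reinterpret_cast<CUdeviceptr>(dst),
                           reinterpret_cast<CUdeviceptr>(src), size);
    if (rc != CUDA_SUCCESS) {
        const char* error_str = nullptr;
        cuGetErrorString(rc, &error_str);
        return Status::InternalError(
            std::string("cuMemcpy failed: ") +
            (error_str ? error_str : "unknown error"));
    }
    return Status::OK();
}
```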

Description

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

alogfans added a commit with the above description.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request replaces the CUDA Runtime API with the Driver API in the copy method to resolve synchronization issues and updates system header includes. The review identifies a critical bug where cuMemcpy is used incorrectly without considering transfer directions, which will cause failures during host-device memory operations. Additionally, a typo was found in a header inclusion, and moving the CUDA Driver API initialization to the class constructor was suggested to improve efficiency and code structure.
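For context on the reviewer's point about transfer directions: `cuMemcpy` relies on unified virtual addressing to infer direction, and when that cannot be assumed, the Driver API provides direction-specific variants. A sketch (the pointer names `dev_ptr`, `host_ptr`, `dst_dev`, `src_dev` are illustrative, not from the diff):

```cpp
// Direction-specific Driver API copies, used when the transfer
// direction must be stated explicitly rather than inferred:
cuMemcpyHtoD(dev_ptr, host_ptr, size);   // host   -> device
cuMemcpyDtoH(host_ptr, dev_ptr, size);   // device -> host
cuMemcpyDtoD(dst_dev, src_dev, size);    // device -> device
```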

Comment thread mooncake-transfer-engine/tent/src/platform/cuda/cuda_allocator.cpp Outdated
Comment on lines +68 to +74
static CUresult init_result = cuInit(0);
if (init_result != CUDA_SUCCESS) {
    const char* error_str = nullptr;
    cuGetErrorString(init_result, &error_str);
    return Status::InternalError(
        std::string("CUDA Driver API init failed: ") + error_str);
}

Severity: medium

Initializing the CUDA Driver API with cuInit(0) inside the copy function is inefficient and unconventional. While the static guard ensures it only runs once, it is better practice to perform this initialization in the CudaPlatform constructor or a dedicated initialization method. This ensures the platform is correctly set up before any operations are attempted and keeps the copy function focused on its primary task.
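The suggested restructuring could look roughly like this. The `CudaPlatform` constructor and the `init_error_` member shown here are assumptions about the surrounding class, sketched only to illustrate the reviewer's point:

```cpp
// Sketch: one-time Driver API initialization in the constructor,
// so copy() can assume the driver is already set up.
CudaPlatform::CudaPlatform() {
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        const char* error_str = nullptr;
        cuGetErrorString(rc, &error_str);
        // Record the failure; copy() and other operations can
        // surface it as a Status instead of retrying cuInit.
        init_error_ = std::string("CUDA Driver API init failed: ") +
                      (error_str ? error_str : "unknown error");
    }
}
```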

alogfans and others added 2 commits May 14, 2026 06:54
….cpp

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


Comment thread mooncake-transfer-engine/tent/src/platform/cuda/cuda_allocator.cpp Outdated

3 participants