[TENT] replace cudaMemcpyAsync with cuMemcpy to avoid deadlock #2094

Open
alogfans wants to merge 6 commits into kvcache-ai:main from alogfans:fix/tent-cuda-cumemcpy-deadlock

Conversation

@alogfans
Collaborator

Replace CUDA Runtime API's cudaMemcpyAsync + cudaStreamSynchronize with CUDA Driver API's cuMemcpy in CudaPlatform::copy(). This avoids potential deadlocks caused by stream synchronization issues in downstream components like mooncake-pg.

The cuMemcpy API is synchronous but doesn't rely on the legacy default stream, providing better performance characteristics than cudaMemcpy while avoiding the cudaStreamSynchronize deadlocks that can occur with cudaMemcpyAsync.

Changes:

  • Add CUDA Driver API header (cuda.h)
  • Fix typo in stduint-uintn.h header include
  • Replace cudaMemcpyAsync + cudaStreamSynchronize with cuMemcpy
  • Add CUDA Driver API initialization check
  • Improve error handling with CUDA error string retrieval
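The change described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual diff: the `CudaPlatform::copy` signature and the `Status` helpers are assumptions based on the names mentioned in this PR, and building it requires the CUDA toolkit.

```cpp
#include <cuda.h>
#include <string>

// Sketch: copy() using the Driver API's cuMemcpy instead of
// cudaMemcpyAsync + cudaStreamSynchronize. cuMemcpy is synchronous
// and, under unified virtual addressing, infers the transfer
// direction from the pointer values, so no cudaMemcpyKind or
// stream synchronization is needed.
Status CudaPlatform::copy(void* dst, const void* src, size_t size) {
    CUresult rc = cuMemcpy(reinterpret_cast<CUdeviceptr>(dst),
                           reinterpret_cast<CUdeviceptr>(src), size);
    if (rc != CUDA_SUCCESS) {
        const char* error_str = nullptr;
        cuGetErrorString(rc, &error_str);
        return Status::InternalError(
            std::string("cuMemcpy failed: ") +
            (error_str ? error_str : "unknown error"));
    }
    return Status::OK();
}
```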

Description

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

alogfans added a commit with the above description.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request replaces the CUDA Runtime API with the Driver API in the copy method to resolve synchronization issues and updates system header includes. The review identifies a critical bug where cuMemcpy is used incorrectly without considering transfer directions, which will cause failures during host-device memory operations. Additionally, a typo was found in a header inclusion, and moving the CUDA Driver API initialization to the class constructor was suggested to improve efficiency and code structure.
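For context on the reviewer's point about transfer directions: `cuMemcpy` relies on unified virtual addressing to infer direction, and when that cannot be assumed, the Driver API provides direction-specific variants. A sketch (the pointer names `dev_ptr`, `host_ptr`, `dst_dev`, `src_dev` are illustrative, not from the diff):

```cpp
// Direction-specific Driver API copies, used when the transfer
// direction must be stated explicitly rather than inferred:
cuMemcpyHtoD(dev_ptr, host_ptr, size);   // host   -> device
cuMemcpyDtoH(host_ptr, dev_ptr, size);   // device -> host
cuMemcpyDtoD(dst_dev, src_dev, size);    // device -> device
```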

Comment thread mooncake-transfer-engine/tent/src/platform/cuda/cuda_allocator.cpp Outdated
Comment on lines +68 to +74
static CUresult init_result = cuInit(0);
if (init_result != CUDA_SUCCESS) {
    const char* error_str = nullptr;
    cuGetErrorString(init_result, &error_str);
    return Status::InternalError(
        std::string("CUDA Driver API init failed: ") + error_str);
}

Severity: medium

Initializing the CUDA Driver API with cuInit(0) inside the copy function is inefficient and unconventional. While the static guard ensures it only runs once, it is better practice to perform this initialization in the CudaPlatform constructor or a dedicated initialization method. This ensures the platform is correctly set up before any operations are attempted and keeps the copy function focused on its primary task.
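The suggested restructuring could look roughly like this. The `CudaPlatform` constructor and the `init_error_` member shown here are assumptions about the surrounding class, sketched only to illustrate the reviewer's point:

```cpp
// Sketch: one-time Driver API initialization in the constructor,
// so copy() can assume the driver is already set up.
CudaPlatform::CudaPlatform() {
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        const char* error_str = nullptr;
        cuGetErrorString(rc, &error_str);
        // Record the failure; copy() and other operations can
        // surface it as a Status instead of retrying cuInit.
        init_error_ = std::string("CUDA Driver API init failed: ") +
                      (error_str ? error_str : "unknown error");
    }
}
```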

alogfans and others added 2 commits May 14, 2026 06:54
….cpp

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


Comment thread mooncake-transfer-engine/tent/src/platform/cuda/cuda_allocator.cpp Outdated

3 participants