[TENT] replace cudaMemcpyAsync with cuMemcpy to avoid dead lock#2094
[TENT] replace cudaMemcpyAsync with cuMemcpy to avoid dead lock#2094alogfans wants to merge 6 commits into
Conversation
…lock Replace CUDA Runtime API's cudaMemcpyAsync + cudaStreamSynchronize with CUDA Driver API's cuMemcpy in CudaPlatform::copy(). This avoids potential deadlocks caused by stream synchronization issues in downstream components like mooncake-pg. The cuMemcpy API is synchronous but doesn't rely on the legacy default stream, providing better performance characteristics than cudaMemcpy while avoiding the cudaStreamSynchronize deadlocks that can occur with cudaMemcpyAsync. Changes: - Add CUDA Driver API header (cuda.h) - Fix typo in stduint-uintn.h header include - Replace cudaMemcpyAsync + cudaStreamSynchronize with cuMemcpy - Add CUDA Driver API initialization check - Improve error handling with CUDA error string retrieval Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request replaces the CUDA Runtime API with the Driver API in the copy method to resolve synchronization issues and updates system header includes. The review identifies a critical bug where cuMemcpy is used incorrectly without considering transfer directions, which will cause failures during host-device memory operations. Additionally, a typo was found in a header inclusion, and moving the CUDA Driver API initialization to the class constructor was suggested to improve efficiency and code structure.
| static CUresult init_result = cuInit(0); | ||
| if (init_result != CUDA_SUCCESS) { | ||
| const char* error_str = nullptr; | ||
| cuGetErrorString(init_result, &error_str); | ||
| return Status::InternalError( | ||
| std::string("CUDA Driver API init failed: ") + error_str); | ||
| } |
There was a problem hiding this comment.
Initializing the CUDA Driver API with cuInit(0) inside the copy function is inefficient and unconventional. While the static guard ensures it only runs once, it is better practice to perform this initialization in the CudaPlatform constructor or a dedicated initialization method. This ensures the platform is correctly set up before any operations are attempted and keeps the copy function focused on its primary task.
….cpp Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
….cpp Co-authored-by: Zhanhao Cao <cao2013zh@163.com>
Replace CUDA Runtime API's cudaMemcpyAsync + cudaStreamSynchronize with CUDA Driver API's cuMemcpy in CudaPlatform::copy(). This avoids potential deadlocks caused by stream synchronization issues in downstream components like mooncake-pg.
The cuMemcpy API is synchronous but doesn't rely on the legacy default stream, providing better performance characteristics than cudaMemcpy while avoiding the cudaStreamSynchronize deadlocks that can occur with cudaMemcpyAsync.
Changes:
Description
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-pg)mooncake-rl)Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.shbefore submitting.