[TENT] Add cross-platform dynamic library adapter with GPU vendor abstraction#2090
[TENT] Add cross-platform dynamic library adapter with GPU vendor abstraction#2090alogfans wants to merge 7 commits into
Conversation
- Add GPU vendor abstraction layer supporting NVIDIA CUDA, Moore Threads MUSA, AMD HIP, Iluvatar MACA, Huawei Ascend, and CPU fallback - Add plugin-based build mode with TENT_BUILD_PLUGIN_MODE option - Implement platform backend plugins for dynamic loading - Add transport loader for runtime plugin discovery and loading - Maintain full compatibility with existing static build mode - Add comprehensive documentation for cross-platform architecture This implementation focuses on core cross-platform functionality without including unrelated features like QoS scheduling or quota management.
Reference revise-paper-branch to fix platform abstraction and compilation issues: - Update platform.h to add IPlatformBackend interface and complete MemoryType enum - Replace platform.cpp with revised implementation supporting dynamic backend loading - Update topology.h to add bw_gbps field to NicEntry structure - Fix cpu.h constructor to properly initialize Platform base class - Update transfer_engine_impl.cpp to use MTYPE_HIP instead of MTYPE_ROCM - Add transport_selector implementation for cross-platform transport selection - Add gpu_vendor.h for vendor-specific GPU abstraction - Update CMakeLists.txt to remove transport_selector from runtime_impl The changes enable successful compilation by aligning the platform abstraction layer with the revised architecture that supports multiple GPU/NPU vendors through dynamic backend plugins. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a GPU Vendor Abstraction Layer and refactors the platform and transport systems to support dynamic plugin loading across multiple GPU vendors, including NVIDIA, AMD, Huawei, and others. Key additions include a configuration-driven transport selection policy and a thread-local location cache for performance optimization. Review feedback identifies a critical bug where a dynamic library is unloaded while a shared pointer still references its objects, potentially causing a crash. Additional recommendations focus on improving code quality through the use of static constexpr for constants, removing unnecessary const qualifiers on output parameters, replacing manual memory management with std::vector, and improving the robustness of JSON parsing.
| #include <acl/acl_rt.h> | ||
| #include <string> | ||
|
|
||
| const static std::string GPU_PREFIX = "ascend:"; |
There was a problem hiding this comment.
Defining a const static std::string in a header file included by multiple translation units can lead to multiple copies of the object and potential static initialization order issues. Since this is a constant string literal, it is more efficient and safer to use static constexpr const char*.
| const static std::string GPU_PREFIX = "ascend:"; | |
| static constexpr const char* GPU_PREFIX = "ascend:"; |
|
|
||
| // Ascend doesn't have cudaPointerGetAttributes, use a stub | ||
| inline int cudaPointerGetAttributes_ascend( | ||
| const struct cudaPointerAttributes* attr, const void* ptr) { |
There was a problem hiding this comment.
The first parameter of cudaPointerGetAttributes should not be const as it is an output parameter intended to be modified by the function. Removing the const qualifier eliminates the need for the const_cast inside the implementation.
| const struct cudaPointerAttributes* attr, const void* ptr) { | |
| struct cudaPointerAttributes* attr, const void* ptr) { |
| #include <string> | ||
|
|
||
| #define CPU_ONLY_MODE | ||
| const static std::string GPU_PREFIX = "cpu:"; |
There was a problem hiding this comment.
| // Note: CUDA API is cudaPointerGetAttributes(const cudaPointerAttributes* attr, | ||
| // const void* ptr) | ||
| inline int cudaPointerGetAttributes_impl( | ||
| const struct cudaPointerAttributes* attr, const void* ptr) { |
There was a problem hiding this comment.
| (uintptr_t(start) - aligned_start + len + kPageSize - 1) / kPageSize; | ||
| void** pages = (void**)malloc(sizeof(void*) * n); | ||
| int* status = (int*)malloc(sizeof(int) * n); |
There was a problem hiding this comment.
If len is 0, n might still be calculated as 1 depending on the alignment of start, leading to unnecessary processing. Additionally, using manual malloc and free is error-prone and not exception-safe. It is recommended to use std::vector for automatic memory management and add a guard for len == 0.
if (len == 0) return {};
int n =
(uintptr_t(start) - aligned_start + len + kPageSize - 1) / kPageSize;
std::vector<void*> pages(n);
std::vector<int> status(n);|
|
||
| // Capitalize first letter of base_name for symbol name | ||
| std::string base_cap = base_name; | ||
| if (!base_cap.empty()) base_cap[0] = toupper(base_cap[0]); |
|
|
||
| // Parse size filters (optional) | ||
| if (policy_json.contains("min_size")) { | ||
| policy.min_size = policy_json["min_size"].get<uint64_t>(); |
There was a problem hiding this comment.
Calling .get<uint64_t>() on a JSON object without verifying its type can throw a json::type_error exception if the configuration contains an unexpected type (e.g., a string or boolean). It is safer to use is_number_unsigned() or the .value() method with a default.
policy.min_size = policy_json["min_size"].is_number_unsigned() ?
std::optional<uint64_t>(policy_json["min_size"].get<uint64_t>()) : std::nullopt;|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Description
This PR introduces cross-platform support to TENT (Transfer Engine) through a comprehensive GPU vendor abstraction layer and dynamic library loading system. The changes enable TENT to work seamlessly across multiple GPU vendors and hardware platforms.
Key Changes
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-pg)mooncake-rl)Type of Change
How Has This Been Tested?
TBD
Checklist
./scripts/code_format.shbefore submitting.