
[TENT] Add cross-platform dynamic library adapter with GPU vendor abstraction #2090

Open

alogfans wants to merge 7 commits into kvcache-ai:main from alogfans:tent-cross-platform-clean

Conversation

@alogfans
Collaborator

Description

This PR introduces cross-platform support to TENT (Transfer Engine) through a GPU vendor abstraction layer and a dynamic library loading system, enabling TENT to run across multiple GPU vendors and hardware platforms.

Key Changes

  1. GPU Vendor Abstraction Layer (a minimal sketch follows this list)
  • Added unified GPU_* macros that map to vendor-specific APIs (CUDA, MUSA, HIP, MACA, Ascend)
  • Supports NVIDIA CUDA, Moore Threads MUSA, AMD HIP, Iluvatar MACA, Huawei Ascend, and CPU fallback
  • Provides compile-time vendor selection through CMake flags
  • Maintains API compatibility while enabling cross-platform compilation
  2. Transport Selector
  • Implemented a configuration-driven transport selection policy
  • Supports pattern-based rules for device and segment type matching
  • Enables priority-based transport selection with fallback support
  • Backward compatible with existing buffer-based transport ordering
  3. Dynamic Library Loading
  • Added transport_loader for runtime plugin loading
  • Enables dynamic loading of transport implementations
  • Supports a modular transport architecture
  • Reduces compilation dependencies and enables flexible deployment
  4. Platform Implementation
  • Refactored the platform backend for better multi-vendor support
  • Updated the CMake build system for conditional compilation
  • Added comprehensive documentation for the GPU abstraction layer
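
To make the vendor mapping in item 1 concrete, below is a minimal sketch of how compile-time selection through `GPU_*` macros can work. The guard and macro names (`USE_CUDA`, `USE_MUSA`, `USE_HIP`, `GPU_Malloc`, ...) are illustrative assumptions, not the identifiers this PR introduces; only `CPU_ONLY_MODE` appears in the diff itself.

```cpp
// Illustrative only: macro and guard names are hypothetical. Each vendor
// ships a CUDA-like runtime, so one GPU_* alias can fan out to whichever
// backend the CMake flags select at configure time.
#if defined(USE_CUDA)
#include <cuda_runtime.h>
#define GPU_Malloc      cudaMalloc
#define GPU_Free        cudaFree
#define GPU_MemcpyAsync cudaMemcpyAsync
#elif defined(USE_MUSA)
#include <musa_runtime.h>
#define GPU_Malloc      musaMalloc
#define GPU_Free        musaFree
#define GPU_MemcpyAsync musaMemcpyAsync
#elif defined(USE_HIP)
#include <hip/hip_runtime.h>
#define GPU_Malloc      hipMalloc
#define GPU_Free        hipFree
#define GPU_MemcpyAsync hipMemcpyAsync
#else
// CPU fallback: no device runtime; callers compile against stubs that
// report "no GPU" instead of failing at link time.
#define CPU_ONLY_MODE
#endif
```

Because the branch is chosen when CMake configures the build, call sites compile unchanged on every vendor.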

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

TBD

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

alogfans and others added 3 commits May 12, 2026 13:04
- Add GPU vendor abstraction layer supporting NVIDIA CUDA, Moore Threads MUSA,
  AMD HIP, Iluvatar MACA, Huawei Ascend, and CPU fallback
- Add plugin-based build mode with TENT_BUILD_PLUGIN_MODE option
- Implement platform backend plugins for dynamic loading
- Add transport loader for runtime plugin discovery and loading
- Maintain full compatibility with existing static build mode
- Add comprehensive documentation for cross-platform architecture

This implementation focuses on core cross-platform functionality without
including unrelated features like QoS scheduling or quota management.

Reference revise-paper-branch to fix platform abstraction and compilation issues:

- Update platform.h to add IPlatformBackend interface and complete MemoryType enum
- Replace platform.cpp with revised implementation supporting dynamic backend loading
- Update topology.h to add bw_gbps field to NicEntry structure
- Fix cpu.h constructor to properly initialize Platform base class
- Update transfer_engine_impl.cpp to use MTYPE_HIP instead of MTYPE_ROCM
- Add transport_selector implementation for cross-platform transport selection
- Add gpu_vendor.h for vendor-specific GPU abstraction
- Update CMakeLists.txt to remove transport_selector from runtime_impl

The changes enable successful compilation by aligning the platform
abstraction layer with the revised architecture that supports multiple
GPU/NPU vendors through dynamic backend plugins.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
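
The commit above adds an `IPlatformBackend` interface and completes the `MemoryType` enum in `platform.h`. As a rough sketch of what such a dynamically loadable backend interface can look like: apart from `IPlatformBackend` and `MTYPE_HIP`, which the commit message names, every identifier below is a hypothetical placeholder.

```cpp
#include <string>

// Sketch only: method names and most enum values are assumptions.
enum MemoryType {
    MTYPE_CPU,
    MTYPE_CUDA,
    MTYPE_HIP,     // replaces the former MTYPE_ROCM per the commit above
    MTYPE_ASCEND,
};

class IPlatformBackend {
   public:
    virtual ~IPlatformBackend() = default;
    // Classify a pointer so the engine can pick a suitable transport.
    virtual MemoryType queryMemoryType(const void* ptr) = 0;
    // Vendor tag used in device prefixes such as "ascend:" or "cpu:".
    virtual std::string vendorPrefix() const = 0;
};

// A plugin .so exports an unmangled factory the loader can dlsym().
extern "C" IPlatformBackend* CreatePlatformBackend();
```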

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a GPU Vendor Abstraction Layer and refactors the platform and transport systems to support dynamic plugin loading across multiple GPU vendors, including NVIDIA, AMD, Huawei, and others. Key additions include a configuration-driven transport selection policy and a thread-local location cache for performance optimization. Review feedback identifies a critical bug where a dynamic library is unloaded while a shared pointer still references its objects, potentially causing a crash. Additional recommendations focus on improving code quality through the use of static constexpr for constants, removing unnecessary const qualifiers on output parameters, replacing manual memory management with std::vector, and improving the robustness of JSON parsing.
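
The critical finding above concerns a plugin library being dlclose()d while a shared_ptr still points at objects it created. One common remedy, sketched below under assumed names (`Transport`, `CreateTransport`), is to capture the library handle in the shared_ptr's deleter so the unload is sequenced after the last reference dies; this is a generic pattern, not necessarily the fix the PR will adopt.

```cpp
#include <dlfcn.h>
#include <memory>

// "Transport" and "CreateTransport" are hypothetical names.
struct Transport {
    virtual ~Transport() = default;
};

std::shared_ptr<Transport> loadTransport(const char* path) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) return nullptr;

    using Factory = Transport* (*)();
    auto create = reinterpret_cast<Factory>(dlsym(handle, "CreateTransport"));
    if (!create) {
        dlclose(handle);
        return nullptr;
    }

    // The deleter destroys the plugin object first, then unloads the
    // library, so no code from the .so can run after dlclose().
    return std::shared_ptr<Transport>(create(), [handle](Transport* p) {
        delete p;
        dlclose(handle);
    });
}
```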

Comment thread: mooncake-transfer-engine/tent/src/runtime/platform.cpp
#include <acl/acl_rt.h>
#include <string>

const static std::string GPU_PREFIX = "ascend:";

Severity: medium

Defining a const static std::string in a header file included by multiple translation units can lead to multiple copies of the object and potential static initialization order issues. Since this is a constant string literal, it is more efficient and safer to use static constexpr const char*.

Suggested change
const static std::string GPU_PREFIX = "ascend:";
static constexpr const char* GPU_PREFIX = "ascend:";


// Ascend doesn't have cudaPointerGetAttributes, use a stub
inline int cudaPointerGetAttributes_ascend(
const struct cudaPointerAttributes* attr, const void* ptr) {

Severity: medium

The first parameter of cudaPointerGetAttributes should not be const as it is an output parameter intended to be modified by the function. Removing the const qualifier eliminates the need for the const_cast inside the implementation.

Suggested change
const struct cudaPointerAttributes* attr, const void* ptr) {
struct cudaPointerAttributes* attr, const void* ptr) {

#include <string>

#define CPU_ONLY_MODE
const static std::string GPU_PREFIX = "cpu:";

Severity: medium

Using const static std::string in a header can cause static initialization order issues. It is better to use static constexpr const char* for string constants.

Suggested change
const static std::string GPU_PREFIX = "cpu:";
static constexpr const char* GPU_PREFIX = "cpu:";

// Note: CUDA API is cudaPointerGetAttributes(const cudaPointerAttributes* attr,
// const void* ptr)
inline int cudaPointerGetAttributes_impl(
const struct cudaPointerAttributes* attr, const void* ptr) {

Severity: medium

The attr parameter should not be const since it is modified by the function. Removing const avoids the const_cast on line 101.

Suggested change
const struct cudaPointerAttributes* attr, const void* ptr) {
struct cudaPointerAttributes* attr, const void* ptr) {

Comment on lines +345 to +347
(uintptr_t(start) - aligned_start + len + kPageSize - 1) / kPageSize;
void** pages = (void**)malloc(sizeof(void*) * n);
int* status = (int*)malloc(sizeof(int) * n);

Severity: medium

If len is 0, n might still be calculated as 1 depending on the alignment of start, leading to unnecessary processing. Additionally, using manual malloc and free is error-prone and not exception-safe. It is recommended to use std::vector for automatic memory management and add a guard for len == 0.

    if (len == 0) return {};
    int n =
        (uintptr_t(start) - aligned_start + len + kPageSize - 1) / kPageSize;
    std::vector<void*> pages(n);
    std::vector<int> status(n);


// Capitalize first letter of base_name for symbol name
std::string base_cap = base_name;
if (!base_cap.empty()) base_cap[0] = toupper(base_cap[0]);

Severity: medium

The use of toupper requires the <cctype> header, which is not explicitly included in this file. While it might be included transitively, it is better to include it directly to ensure portability.
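
For reference, a self-contained version of that symbol-name derivation with the missing include and the usual unsigned-char cast; the "new<Base>Transport" naming scheme is an assumed example, not necessarily the convention transport_loader uses.

```cpp
#include <cctype>   // needed for std::toupper, per the comment above
#include <string>

std::string makeFactorySymbol(const std::string& base_name) {
    std::string base_cap = base_name;
    if (!base_cap.empty()) {
        // Cast to unsigned char first: passing a negative char value
        // to std::toupper is undefined behavior.
        base_cap[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(base_cap[0])));
    }
    return "new" + base_cap + "Transport";  // e.g. "rdma" -> "newRdmaTransport"
}
```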


// Parse size filters (optional)
if (policy_json.contains("min_size")) {
policy.min_size = policy_json["min_size"].get<uint64_t>();

Severity: medium

Calling .get<uint64_t>() on a JSON object without verifying its type can throw a json::type_error exception if the configuration contains an unexpected type (e.g., a string or boolean). It is safer to use is_number_unsigned() or the .value() method with a default.

            policy.min_size = policy_json["min_size"].is_number_unsigned() ? 
                std::optional<uint64_t>(policy_json["min_size"].get<uint64_t>()) : std::nullopt;
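
Expanding on that suggestion, a type-checked variant of the size-filter parsing could look like the sketch below (nlohmann::json, as implied by the json::type_error mention; the `Policy` struct and its field are hypothetical stand-ins for the real ones).

```cpp
#include <cstdint>
#include <optional>
#include <nlohmann/json.hpp>

// Hypothetical stand-in for the PR's policy structure.
struct Policy {
    std::optional<uint64_t> min_size;
};

void parseSizeFilters(const nlohmann::json& policy_json, Policy& policy) {
    if (policy_json.contains("min_size") &&
        policy_json["min_size"].is_number_unsigned()) {
        policy.min_size = policy_json["min_size"].get<uint64_t>();
    }
    // Otherwise leave std::nullopt instead of risking a json::type_error.
}
```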

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


