feat: add multi-GPU backend abstraction layer (gpu_backend) by feiqiangs · Pull Request #131 · taco-project/FlexKV

feiqiangs · 2026-03-31T09:01:51Z

- Introduce flexkv/gpu_backend/ Python abstraction layer with GpuBackend ABC (15 abstract methods covering memory, stream, IPC, and storage operations)
- Add NvidiaBackend (production-ready), GenericBackend (CPU fallback), and MusaBackend (experimental) implementations
- Auto-detect backend at import time via flexkv/gpu_backend/__init__.py
- Unify C extension entry point to single flexkv.c_ext for all backends:
  - csrc/gpu_backend/nvidia/gpu_transfer_bindings.h (CUDA-specific bindings)
  - csrc/gpu_backend/musa/gpu_transfer_bindings.h (MUSA-specific bindings)
  - csrc/bindings.cpp selects backend via #if FLEXKV_BACKEND_MUSA at compile time, injected by MUSABuilder; ext_name is always flexkv.c_ext (no c_ext_musa branch)
- Add build_backends/ package (CUDABuilder, MUSABuilder, GenericBuilder) with per-vendor compile flags and source lists; all comments in English
- Add docs/gpu_backends/README_zh.md and README_en.md integration guides covering 5 design principles, step-by-step onboarding, GDS/IPC fallback; NvidiaBackend code example updated to match actual implementation
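
As a rough illustration of the abstraction described above, the sketch below shows an ABC with a selection step that falls back to a CPU backend. It is not the PR's code: the two method names (`is_available`, `alloc_host_buffer`), the `detect_backend` helper, and the `UnavailableVendorBackend` stand-in are hypothetical, and the real `GpuBackend` declares 15 abstract methods that are not reproduced here.

```python
from abc import ABC, abstractmethod


class GpuBackend(ABC):
    """Hypothetical, cut-down interface (the real ABC has 15 abstract methods)."""

    name = "abstract"

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this backend can be used on the current machine."""

    @abstractmethod
    def alloc_host_buffer(self, nbytes: int) -> bytearray:
        """Allocate a host-side staging buffer of nbytes bytes."""


class GenericBackend(GpuBackend):
    """CPU fallback: always available, plain host memory."""

    name = "generic"

    def is_available(self) -> bool:
        return True

    def alloc_host_buffer(self, nbytes: int) -> bytearray:
        return bytearray(nbytes)


class UnavailableVendorBackend(GpuBackend):
    """Stand-in for a vendor backend whose driver is missing (assumption)."""

    name = "vendor"

    def is_available(self) -> bool:
        return False

    def alloc_host_buffer(self, nbytes: int) -> bytearray:
        raise RuntimeError("vendor runtime not present")


def detect_backend(candidates):
    """Return the first available backend, in priority order."""
    for backend in candidates:
        if backend.is_available():
            return backend
    raise RuntimeError("no usable backend found")


# Vendor backends are tried first; the CPU fallback goes last.
backend = detect_backend([UnavailableVendorBackend(), GenericBackend()])
```

Putting the fallback last in the candidate list means an import never fails just because no accelerator runtime is installed.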

…ments
- Add dp_rank to ModelConfig and propagate dp_size/dp_rank through config chain
- Generate per-PP-rank IPC port suffixes to avoid ZMQ endpoint collisions
- Generate per-PP/DP-rank eventfd socket paths for LayerwiseTransferWorker
- Use global device_id (dp_rank * tp_size + tp_rank) to avoid GPU registration conflicts
- Skip dp_rank in model config verification (DP ranks share same KVServer)
- Fix recvmsg_into flags parameter from anc_buf_size to 0
- Handle zmq.Again in KVTPClient registration with blocking fallback
- Add periodic GPU registration wait diagnostics in TransferManager
- Add comprehensive IPC port/socket logging throughout initialization
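
The global device_id formula in the list above can be checked with a few lines of Python. The function name `global_device_id` and the range check are illustrative assumptions; only the expression `dp_rank * tp_size + tp_rank` comes from the commit message.

```python
def global_device_id(dp_rank: int, tp_size: int, tp_rank: int) -> int:
    """Map a (dp_rank, tp_rank) pair to a unique global device index."""
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    return dp_rank * tp_size + tp_rank


# With dp_size=2 and tp_size=4, every rank pair gets a distinct id in 0..7,
# so two DP replicas never register under the same device_id.
ids = [global_device_id(dp, 4, tp) for dp in range(2) for tp in range(4)]
```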

Modifying LD_LIBRARY_PATH at runtime does NOT affect the current process's dynamic linker (ld.so reads it only at startup). Use ctypes.CDLL with RTLD_GLOBAL to pre-load libxxhash.so etc. so that c_ext (loaded via dlopen) can resolve them without requiring LD_LIBRARY_PATH to be set before process start.
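
A minimal sketch of the pre-loading approach described above, assuming the caller already knows the candidate library paths. The helper name `preload_shared_libs` and the example path are hypothetical; `ctypes.CDLL` and `ctypes.RTLD_GLOBAL` are standard-library names.

```python
import ctypes
import os


def preload_shared_libs(paths):
    """Load each existing library with RTLD_GLOBAL and return the paths that loaded.

    RTLD_GLOBAL exports the library's symbols to the whole process, so an
    extension module loaded later via dlopen can resolve them even though
    LD_LIBRARY_PATH was not set before the process started.
    """
    loaded = []
    for path in paths:
        if not os.path.exists(path):
            continue  # skip candidates that are not present on this machine
        try:
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue  # not a loadable shared object; try the next candidate
        loaded.append(path)
    return loaded


# Hypothetical location; a real caller would compute this from the package layout.
loaded = preload_shared_libs(["/nonexistent/lib/libxxhash.so"])
```

The pre-load has to run before the extension module is imported, since symbol resolution happens when the extension itself is loaded.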

- Replace fixed-count for loop with while loop that continues until all ranks have registered or deadline is reached
- Add per-connection try-except so a single SCM_RIGHTS failure does not abort the entire accept loop
- Increase listen backlog to tp_group_size*3 to accommodate client retries on failed connections
- Use per-connection timeout with overall deadline instead of a single global timeout
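
The accept-loop changes above follow a common pattern, sketched here over a plain TCP socket rather than the Unix-domain socket and SCM_RIGHTS exchange the commit refers to. The function name, its parameters, and the `handle_conn` callback are assumptions made for illustration.

```python
import socket
import time


def accept_all_ranks(server, expected_ranks, handle_conn,
                     deadline_s=5.0, conn_timeout_s=1.0):
    """Accept connections until every rank has registered or the deadline passes.

    handle_conn(conn) returns the rank that registered, or raises on failure.
    A failure on one connection is swallowed so the loop keeps accepting.
    """
    registered = set()
    deadline = time.monotonic() + deadline_s
    while len(registered) < expected_ranks:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall deadline reached; return what we have
        # Per-connection timeout, capped by the time left before the deadline.
        server.settimeout(min(conn_timeout_s, remaining))
        try:
            conn, _ = server.accept()
        except socket.timeout:
            continue
        try:
            with conn:
                conn.settimeout(conn_timeout_s)
                registered.add(handle_conn(conn))
        except Exception:
            continue  # one bad client must not abort the whole loop
    return registered
```

A caller would size the listen backlog above the number of expected ranks (the commit uses tp_group_size*3) so that clients retrying after a failed attempt are not refused.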

1. Fuse indexer into layerwise transfer pipeline:
   - LayerwiseTransferOp/WorkerLayerwiseTransferOp: add indexer_src/dst_block_ids fields
   - merge_to_batch_graph: populate indexer block_ids (1:1 with main KV) in LAYERWISE op
   - C++ LayerwiseTransferGroup: accept indexer_ssd_files, init indexer_ioctx_
   - C++ layerwise_transfer: add Step 0.5: indexer SSD→CPU after main KV SSD→CPU
   - Python LayerwiseTransferWorker: accept indexer SSD params, compute indexer SSD strides, pass indexer SSD block_ids and H2D block_ids to C++ layerwise_transfer
2. Skip redundant worker creation in layerwise mode:
   - Main KV: skip h2d_workers and cpussd_read_worker creation & registration
   - Indexer: skip _indexer_h2d_workers and _indexer_disk2h_worker creation & registration
   - D2H / H2DISK workers preserved (layerwise does not support PUT direction)
3. Add startup assertions in layerwise mode:
   - Assert _worker_map does not contain H2D or DISK2H
   - Assert _worker_map contains LAYERWISE

Co-authored-by: zittozhang <zittozhang@tencent.com>

…nk) and unify eventfd socket path

Replace device_count()-based probing and FLEXKV_MASTER_HOST/FLEXKV_NODE_ID/FLEXKV_LOCAL_GPU_COUNT env-var plumbing with explicit ModelConfig fields driven by the framework (sglang server_args / TRT-LLM launcher), so FlexKV and its callers cannot drift on the multi-node layout.

Key changes:
- ModelConfig: new fields nnodes, node_rank, master_host; FlexKVConfig.post_init_from_sglang_config accepts them.
- kvtask.py: derive gpus_per_node=(tp*pp)//nnodes and nnodes_per_tp_group=ceil(tp/gpus_per_node); rename self.tp_node_count->self.nnodes_per_tp_group and drop self.is_multinode_tp (use nnodes_per_tp_group>1). NSA CP division logic preserved.
- transfer_manager.py: rename get_master_host_and_ports_from_env()->resolve_master_host_and_ports(master_host=None); TransferManagerOnRemote accepts master_host kwarg (env remains as fallback).
- transfer/layerwise.py: new helper build_layerwise_eventfd_socket_path(model_config) as single source of truth for the UDS path (suffix uses pp_rank/dp_rank only; node_rank intentionally omitted since UDS is kernel-local). LayerwiseTransferWorker takes layerwise_eventfd_socket kwarg instead of reading env in the subprocess.
- transfer/transfer_engine.py: compute the socket path once and pass it down; replace getattr(model_config, 'is_nsa_cp'/'cp_size', ...) with direct attribute access now that they are ModelConfig fields.
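
The two derived quantities in the kvtask.py item can be worked through numerically. The function name `tp_topology` and the zero check are illustrative assumptions; the two formulas are taken from the commit message.

```python
import math


def tp_topology(tp_size: int, pp_size: int, nnodes: int):
    """Return (gpus_per_node, nnodes_per_tp_group) for a given layout."""
    gpus_per_node = (tp_size * pp_size) // nnodes
    if gpus_per_node == 0:
        raise ValueError("more nodes than GPUs in the layout")
    nnodes_per_tp_group = math.ceil(tp_size / gpus_per_node)
    return gpus_per_node, nnodes_per_tp_group


# TP=16 over 2 nodes: 8 GPUs per node, and each TP group spans 2 nodes.
spanning = tp_topology(tp_size=16, pp_size=1, nnodes=2)
# TP=8, PP=2 over 2 nodes: 8 GPUs per node, and each TP group fits on 1 node.
local = tp_topology(tp_size=8, pp_size=2, nnodes=2)
```

A result of `nnodes_per_tp_group > 1` corresponds to the multi-node TP case that the removed `is_multinode_tp` flag used to signal.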

refactor: align multi-node TP topology with framework (nnodes/node_rank) and unify eventfd socket path

- Add configuration entries for HugePage allocation
- Implement HugePageAllocator and allocate CPU KV cache on hugetlbfs
- Support HugePage for temporary buffer in PEER2CPUTransferWorker
- Add tests and documentation for HugePage feature

Signed-off-by: staryxchen <staryxchen@tencent.com>

…ction

Extend ModelConfig with PP-aware fields: pp_start_layer/pp_end_layer, enable_dp_attention, attn_cp_size/attn_cp_rank, and derived properties (attn_tp_size, num_layers_per_pp_stage, token_size_in_bytes_per_pp_stage). Add freeze() to prevent post-init mutation of parallel config.

Introduce WorkerKey(dp_rank, pp_rank) as the unique worker identifier, replacing the flat dp_id throughout the transfer stack. This allows TransferEngine to manage multiple PP stages within a single centralized data plane instance, while each PP stage retains independent control decisions (decentralized control plane).

Key changes:
- TransferOp: dp_id -> (dp_rank, pp_rank)
- TransferOpGraph: add clear_gpu_blocks()/set_gpu_blocks() for deferred GPU block binding (PP stages share a graph template but
- Integration adapters (vllm/sglang/trt-llm): compute pp_start_layer/pp_end_layer per stage, freeze ModelConfig after init
- C++ layerwise/tp_transfer_thread_group: use tp_size_per_node instead of global tp_size for correct node-local eventfd grouping
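
To make the WorkerKey and freeze() ideas above concrete, here is a small sketch. It is an assumption about one way to implement them, not the PR's code: `ParallelConfig` is a hypothetical stand-in for the parallel part of ModelConfig, and only the names `WorkerKey`, `dp_rank`, `pp_rank`, and `freeze` come from the commit message.

```python
from typing import NamedTuple


class WorkerKey(NamedTuple):
    """Hashable (dp_rank, pp_rank) pair used to look up a worker."""
    dp_rank: int
    pp_rank: int


class ParallelConfig:
    """Hypothetical stand-in for the parallel fields of ModelConfig."""

    def __init__(self, dp_rank: int = 0, pp_rank: int = 0):
        object.__setattr__(self, "_frozen", False)
        self.dp_rank = dp_rank
        self.pp_rank = pp_rank

    def freeze(self) -> None:
        """Disallow any further attribute assignment."""
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        if self._frozen:
            raise AttributeError(f"config is frozen; cannot set {name!r}")
        object.__setattr__(self, name, value)

    @property
    def worker_key(self) -> WorkerKey:
        return WorkerKey(self.dp_rank, self.pp_rank)


cfg = ParallelConfig(dp_rank=1, pp_rank=2)
cfg.freeze()
# One table can hold workers for several PP stages, keyed by (dp_rank, pp_rank).
workers = {cfg.worker_key: "worker for dp=1, pp=2"}
```

Because the key is a tuple, two stages with the same dp_rank but different pp_rank map to different workers, which a flat dp_id could not express.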

feat: add Pipeline Parallelism support

fix: improve transfer manager and engine robustness for PP

- Remove CMatchResult.block_node_ids tensor; add matched_node_id (int32)
- RefRadixTree::match_prefix now stops when encountering a different node_id
- Simplify pybind bindings: expose matched_node_id, remove block_node_ids
- Python MatchResultAccel: add matched_node_id field, broadcast to per-block arrays for backward compat in downstream worker/transfer paths
- Update hie_cache_engine.py and cache_engine.py to derive per-block arrays from single matched_node_id instead of reading C++ tensor
- Add unit tests for CMatchResult and MatchResultAccel matched_node_id

This simplifies the distributed matching and transfer paths:
- No need for shared_transfer_kv_blocks_remote_read multi-node grouping
- Lease management is simpler (no cross-node cascade invalidation)
- Better fault isolation (single node failure domain)
- Cleaner integration with CP/PP/TP cooperative GET flow
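
The single-node matching rule and the backward-compatible broadcast described above can be sketched on a flat list of per-block owners. This is a simplification under stated assumptions: the real logic walks a radix tree in C++, and the function names and the `-1` sentinel for "no match" are hypothetical.

```python
def match_single_node(block_node_ids):
    """Return (matched_len, matched_node_id) for the leading run owned by one node.

    Matching stops at the first block whose node_id differs from the first
    block's, so a result never spans more than one node.
    """
    if not block_node_ids:
        return 0, -1  # assumption: -1 marks "nothing matched"
    node_id = block_node_ids[0]
    matched_len = 0
    for nid in block_node_ids:
        if nid != node_id:
            break
        matched_len += 1
    return matched_len, node_id


def broadcast_node_id(matched_len, matched_node_id):
    """Expand the single id into a per-block list for older call sites."""
    return [matched_node_id] * matched_len


matched_len, matched_node_id = match_single_node([3, 3, 3, 5, 3])
per_block = broadcast_node_id(matched_len, matched_node_id)
```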

zhuofan1123 and others added 29 commits March 20, 2026 17:56

initialize layerwise worker

4d083b4

add layerwise transfer op

9085d52

clear op callback if layerwise

645a22c

layerwise worker naive impl

be7cc02

check layerwise condition

52a19f3

add default value

fb6ce12

fix bug and benchmark

4dc9fb4

add layerwise param

80c2077

fix bugs

f64b9c0

disable layerwise in benchmark

d86335b

pin memory of block ids

7a74665

make ssd optional

7641dcc

initial layerwise cpp impl

0835555

add callback && fix some bugs

6881413

fix

7c7ab27

some fix

ed4187d

add sglang support using eventfd

79dc6ce

print bandwidth for layerwise transfer

b7c86bf

add nvtx for layerwise

d635935

update kernel

af4c1d5

remove print

a159da9

fix cuda device set

417b59b

fix

ffbcecd

fix mempool

fd8ce04

refactor transfer config, set num of cta instead of sm

2463aa4

fix

586dbc6

fix unit test

bc1a18c

update

d3cc1d9

merge h2d and disk2h to layerwiseop

12be2cc

feiqiangs requested a review from axxx03 March 31, 2026 09:01

zittozhang and others added 16 commits April 9, 2026 17:27

fix ssd read when blockwise + tp + layerwise

a5eb20f

dont sync prefetch

5a706b1

support cpuonly match for prefetch

9e72cf1

add kv_cache_dtype to sglang

5a5f7da

add some log to info the malloc

ba6c000

support cp+layerwise

7806055

fix empty token mask

d56c404

fix: add ACK handshake for layerwise eventfd transfer to prevent race…

b427e7e

… condition

fix: correct variable name and config reference in MoonCakeTransferEn…

d91bed1

…gineWrapper
- Fix typo: change `engien_port` to `engine_port`
- Use `self.config` for attribute assignment instead of `config`

Signed-off-by: staryxchen <staryxchen@tencent.com>

fix(mooncakeEngineWrapper): change unregist_buffer return type from N…

dfe21d4

…one to int
- Update method signature to match underlying engine's return value
- Return engine's status code (0 for success, -1 for error)

Signed-off-by: staryxchen <staryxchen@tencent.com>

fix(mooncakeEngineWrapper): add return type annotation to

3a36508

transfer_sync_write_with_notify method
- Added `-> int` return type annotation to the method signature.

Signed-off-by: staryxchen <staryxchen@tencent.com>

fix d2h issue for glm5+cp8

7148c7e

fix sglang config issue

d54f930

feiqiangs force-pushed the dev_sfq branch 2 times, most recently from 6e5e94b to 4398462 Compare April 16, 2026 07:35

zhjc1124 and others added 11 commits April 17, 2026 15:02

fix server_args kv cache dtype (#151)

1bf5eca

Co-authored-by: zittozhang <zittozhang@tencent.com>

fix: support cross-node TP - add _node suffix to eventfd socket path …

b9e29ac

…via FLEXKV_NODE_ID env var and divide cp_size for multi-node TP when NSA CP is active

Merge pull request #156 from zhjc1124/feat/layerwise_rebase

ba8fccf

refactor: align multi-node TP topology with framework (nnodes/node_rank) and unify eventfd socket path

Merge pull request #159 from zhjc1124/support_pp

7a2a8af

feat: add Pipeline Parallelism support

fix: improve transfer manager and engine robustness for PP

2f572d8

Merge pull request #160 from zhjc1124/support_pp

7ef3715

fix: improve transfer manager and engine robustness for PP

feiqiangs force-pushed the dev_sfq branch from 4398462 to 16fa450 Compare April 30, 2026 09:46

feiqiangs wants to merge 62 commits into main from dev_sfq


7 participants
