Skip to content

feat: add multi-GPU backend abstraction layer (gpu_backend)#131

Open
feiqiangs wants to merge 62 commits into
mainfrom
dev_sfq
Open

feat: add multi-GPU backend abstraction layer (gpu_backend)#131
feiqiangs wants to merge 62 commits into
mainfrom
dev_sfq

Conversation

@feiqiangs
Copy link
Copy Markdown
Collaborator

  • Introduce flexkv/gpu_backend/ Python abstraction layer with GpuBackend ABC (15 abstract methods covering memory, stream, IPC, and storage operations)
  • Add NvidiaBackend (production-ready), GenericBackend (CPU fallback), and MusaBackend (experimental) implementations
  • Auto-detect backend at import time via flexkv/gpu_backend/init.py
  • Unify C extension entry point to single flexkv.c_ext for all backends:
    • csrc/gpu_backend/nvidia/gpu_transfer_bindings.h (CUDA-specific bindings)
    • csrc/gpu_backend/musa/gpu_transfer_bindings.h (MUSA-specific bindings)
    • csrc/bindings.cpp selects backend via #if FLEXKV_BACKEND_MUSA at compile time,
      injected by MUSABuilder; ext_name is always flexkv.c_ext (no c_ext_musa branch)
  • Add build_backends/ package (CUDABuilder, MUSABuilder, GenericBuilder)
    with per-vendor compile flags and source lists; all comments in English
  • Add docs/gpu_backends/README_zh.md and README_en.md integration guides
    covering 5 design principles, step-by-step onboarding, GDS/IPC fallback;
    NvidiaBackend code example updated to match actual implementation

@feiqiangs feiqiangs requested a review from axxx03 March 31, 2026 09:01
zittozhang and others added 16 commits April 9, 2026 17:27
…ments

- Add dp_rank to ModelConfig and propagate dp_size/dp_rank through config chain
- Generate per-PP-rank IPC port suffixes to avoid ZMQ endpoint collisions
- Generate per-PP/DP-rank eventfd socket paths for LayerwiseTransferWorker
- Use global device_id (dp_rank * tp_size + tp_rank) to avoid GPU registration conflicts
- Skip dp_rank in model config verification (DP ranks share same KVServer)
- Fix recvmsg_into flags parameter from anc_buf_size to 0
- Handle zmq.Again in KVTPClient registration with blocking fallback
- Add periodic GPU registration wait diagnostics in TransferManager
- Add comprehensive IPC port/socket logging throughout initialization
Modifying LD_LIBRARY_PATH at runtime does NOT affect the current process's
dynamic linker (ld.so reads it only at startup). Use ctypes.CDLL with
RTLD_GLOBAL to pre-load libxxhash.so etc. so that c_ext (loaded via dlopen)
can resolve them without requiring LD_LIBRARY_PATH to be set before process
start.
- Replace fixed-count for loop with while loop that continues until
  all ranks have registered or deadline is reached
- Add per-connection try-except so a single SCM_RIGHTS failure does
  not abort the entire accept loop
- Increase listen backlog to tp_group_size*3 to accommodate client
  retries on failed connections
- Use per-connection timeout with overall deadline instead of a
  single global timeout
…gineWrapper

- Fix typo: change`engien_port`to`engine_port`
- Use`self.config`for attribute assignment instead of`config

Signed-off-by: staryxchen <staryxchen@tencent.com>
…one to int

- Update method signature to match underlying engine's return value
- Return engine's status code (0 for success, -1 for error)

Signed-off-by: staryxchen <staryxchen@tencent.com>
transfer_sync_write_with_notify method

- Added`-> int`return type annotation to the method signature.

Signed-off-by: staryxchen <staryxchen@tencent.com>
@feiqiangs feiqiangs force-pushed the dev_sfq branch 2 times, most recently from 6e5e94b to 4398462 Compare April 16, 2026 07:35
zhjc1124 and others added 11 commits April 17, 2026 15:02
1. Fuse indexer into layerwise transfer pipeline:
   - LayerwiseTransferOp/WorkerLayerwiseTransferOp: add indexer_src/dst_block_ids fields
   - merge_to_batch_graph: populate indexer block_ids (1:1 with main KV) in LAYERWISE op
   - C++ LayerwiseTransferGroup: accept indexer_ssd_files, init indexer_ioctx_
   - C++ layerwise_transfer: add Step 0.5 — indexer SSD→CPU after main KV SSD→CPU
   - Python LayerwiseTransferWorker: accept indexer SSD params, compute indexer SSD
     strides, pass indexer SSD block_ids and H2D block_ids to C++ layerwise_transfer

2. Skip redundant worker creation in layerwise mode:
   - Main KV: skip h2d_workers and cpussd_read_worker creation & registration
   - Indexer: skip _indexer_h2d_workers and _indexer_disk2h_worker creation & registration
   - D2H / H2DISK workers preserved (layerwise does not support PUT direction)

3. Add startup assertions in layerwise mode:
   - Assert _worker_map does not contain H2D or DISK2H
   - Assert _worker_map contains LAYERWISE

Co-authored-by: zittozhang <zittozhang@tencent.com>
Co-authored-by: zittozhang <zittozhang@tencent.com>
…via FLEXKV_NODE_ID env var and divide cp_size for multi-node TP when NSA CP is active
…nk) and unify eventfd socket path

Replace device_count()-based probing and FLEXKV_MASTER_HOST/FLEXKV_NODE_ID/FLEXKV_LOCAL_GPU_COUNT env-var plumbing with explicit ModelConfig fields driven by the framework (sglang server_args / TRT-LLM launcher), so FlexKV and its callers cannot drift on the multi-node layout.

Key changes:

- ModelConfig: new fields nnodes, node_rank, master_host; FlexKVConfig.post_init_from_sglang_config accepts them.

- kvtask.py: derive gpus_per_node=(tp*pp)//nnodes and nnodes_per_tp_group=ceil(tp/gpus_per_node); rename self.tp_node_count->self.nnodes_per_tp_group and drop self.is_multinode_tp (use nnodes_per_tp_group>1). NSA CP division logic preserved.

- transfer_manager.py: rename get_master_host_and_ports_from_env()->resolve_master_host_and_ports(master_host=None); TransferManagerOnRemote accepts master_host kwarg (env remains as fallback).

- transfer/layerwise.py: new helper build_layerwise_eventfd_socket_path(model_config) as single source of truth for the UDS path (suffix uses pp_rank/dp_rank only; node_rank intentionally omitted since UDS is kernel-local). LayerwiseTransferWorker takes layerwise_eventfd_socket kwarg instead of reading env in the subprocess.

- transfer/transfer_engine.py: compute the socket path once and pass it down; replace getattr(model_config, 'is_nsa_cp'/'cp_size', ...) with direct attribute access now that they are ModelConfig fields.
refactor: align multi-node TP topology with framework (nnodes/node_rank) and unify eventfd socket path
- Add configuration entries for HugePage allocation
- Implement HugePageAllocator and allocate CPU KV cache on hugetlbfs
- Support HugePage for temporary buffer in PEER2CPUTransferWorker
- Add tests and documentation for HugePage feature

Signed-off-by: staryxchen <staryxchen@tencent.com>
…ction

Extend ModelConfig with PP-aware fields: pp_start_layer/pp_end_layer,
enable_dp_attention, attn_cp_size/attn_cp_rank, and derived properties
(attn_tp_size, num_layers_per_pp_stage, token_size_in_bytes_per_pp_stage).
Add freeze() to prevent post-init mutation of parallel config.

Introduce WorkerKey(dp_rank, pp_rank) as the unique worker identifier,
replacing the flat dp_id throughout the transfer stack. This allows
TransferEngine to manage multiple PP stages within a single centralized
data plane instance, while each PP stage retains independent control
decisions (decentralized control plane).

Key changes:
  - TransferOp: dp_id -> (dp_rank, pp_rank)
  - TransferOpGraph: add clear_gpu_blocks()/set_gpu_blocks() for
    deferred GPU block binding (PP stages share a graph template but
  - Integration adapters (vllm/sglang/trt-llm): compute
    pp_start_layer/pp_end_layer per stage, freeze ModelConfig after init
  - C++ layerwise/tp_transfer_thread_group: use tp_size_per_node
    instead of global tp_size for correct node-local eventfd grouping
feat: add Pipeline Parallelism support
fix: improve transfer manager and engine robustness for PP
- Remove CMatchResult.block_node_ids tensor; add matched_node_id (int32)
- RefRadixTree::match_prefix now stops when encountering a different node_id
- Simplify pybind bindings: expose matched_node_id, remove block_node_ids
- Python MatchResultAccel: add matched_node_id field, broadcast to per-block
  arrays for backward compat in downstream worker/transfer paths
- Update hie_cache_engine.py and cache_engine.py to derive per-block arrays
  from single matched_node_id instead of reading C++ tensor
- Add unit tests for CMatchResult and MatchResultAccel matched_node_id

This simplifies the distributed matching and transfer paths:
- No need for shared_transfer_kv_blocks_remote_read multi-node grouping
- Lease management is simpler (no cross-node cascade invalidation)
- Better fault isolation (single node failure domain)
- Cleaner integration with CP/PP/TP cooperative GET flow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants