Conversation
…ments - Add dp_rank to ModelConfig and propagate dp_size/dp_rank through config chain - Generate per-PP-rank IPC port suffixes to avoid ZMQ endpoint collisions - Generate per-PP/DP-rank eventfd socket paths for LayerwiseTransferWorker - Use global device_id (dp_rank * tp_size + tp_rank) to avoid GPU registration conflicts - Skip dp_rank in model config verification (DP ranks share same KVServer) - Fix recvmsg_into flags parameter from anc_buf_size to 0 - Handle zmq.Again in KVTPClient registration with blocking fallback - Add periodic GPU registration wait diagnostics in TransferManager - Add comprehensive IPC port/socket logging throughout initialization
Modifying LD_LIBRARY_PATH at runtime does NOT affect the current process's dynamic linker (ld.so reads it only at startup). Use ctypes.CDLL with RTLD_GLOBAL to pre-load libxxhash.so etc. so that c_ext (loaded via dlopen) can resolve them without requiring LD_LIBRARY_PATH to be set before process start.
- Replace fixed-count for loop with while loop that continues until all ranks have registered or deadline is reached - Add per-connection try-except so a single SCM_RIGHTS failure does not abort the entire accept loop - Increase listen backlog to tp_group_size*3 to accommodate client retries on failed connections - Use per-connection timeout with overall deadline instead of a single global timeout
…gineWrapper - Fix typo: change`engien_port`to`engine_port` - Use`self.config`for attribute assignment instead of`config Signed-off-by: staryxchen <staryxchen@tencent.com>
…one to int - Update method signature to match underlying engine's return value - Return engine's status code (0 for success, -1 for error) Signed-off-by: staryxchen <staryxchen@tencent.com>
transfer_sync_write_with_notify method - Added`-> int`return type annotation to the method signature. Signed-off-by: staryxchen <staryxchen@tencent.com>
6e5e94b to
4398462
Compare
1. Fuse indexer into layerwise transfer pipeline:
- LayerwiseTransferOp/WorkerLayerwiseTransferOp: add indexer_src/dst_block_ids fields
- merge_to_batch_graph: populate indexer block_ids (1:1 with main KV) in LAYERWISE op
- C++ LayerwiseTransferGroup: accept indexer_ssd_files, init indexer_ioctx_
- C++ layerwise_transfer: add Step 0.5 — indexer SSD→CPU after main KV SSD→CPU
- Python LayerwiseTransferWorker: accept indexer SSD params, compute indexer SSD
strides, pass indexer SSD block_ids and H2D block_ids to C++ layerwise_transfer
2. Skip redundant worker creation in layerwise mode:
- Main KV: skip h2d_workers and cpussd_read_worker creation & registration
- Indexer: skip _indexer_h2d_workers and _indexer_disk2h_worker creation & registration
- D2H / H2DISK workers preserved (layerwise does not support PUT direction)
3. Add startup assertions in layerwise mode:
- Assert _worker_map does not contain H2D or DISK2H
- Assert _worker_map contains LAYERWISE
Co-authored-by: zittozhang <zittozhang@tencent.com>
Co-authored-by: zittozhang <zittozhang@tencent.com>
…via FLEXKV_NODE_ID env var and divide cp_size for multi-node TP when NSA CP is active
…nk) and unify eventfd socket path Replace device_count()-based probing and FLEXKV_MASTER_HOST/FLEXKV_NODE_ID/FLEXKV_LOCAL_GPU_COUNT env-var plumbing with explicit ModelConfig fields driven by the framework (sglang server_args / TRT-LLM launcher), so FlexKV and its callers cannot drift on the multi-node layout. Key changes: - ModelConfig: new fields nnodes, node_rank, master_host; FlexKVConfig.post_init_from_sglang_config accepts them. - kvtask.py: derive gpus_per_node=(tp*pp)//nnodes and nnodes_per_tp_group=ceil(tp/gpus_per_node); rename self.tp_node_count->self.nnodes_per_tp_group and drop self.is_multinode_tp (use nnodes_per_tp_group>1). NSA CP division logic preserved. - transfer_manager.py: rename get_master_host_and_ports_from_env()->resolve_master_host_and_ports(master_host=None); TransferManagerOnRemote accepts master_host kwarg (env remains as fallback). - transfer/layerwise.py: new helper build_layerwise_eventfd_socket_path(model_config) as single source of truth for the UDS path (suffix uses pp_rank/dp_rank only; node_rank intentionally omitted since UDS is kernel-local). LayerwiseTransferWorker takes layerwise_eventfd_socket kwarg instead of reading env in the subprocess. - transfer/transfer_engine.py: compute the socket path once and pass it down; replace getattr(model_config, 'is_nsa_cp'/'cp_size', ...) with direct attribute access now that they are ModelConfig fields.
refactor: align multi-node TP topology with framework (nnodes/node_rank) and unify eventfd socket path
- Add configuration entries for HugePage allocation - Implement HugePageAllocator and allocate CPU KV cache on hugetlbfs - Support HugePage for temporary buffer in PEER2CPUTransferWorker - Add tests and documentation for HugePage feature Signed-off-by: staryxchen <staryxchen@tencent.com>
…ction
Extend ModelConfig with PP-aware fields: pp_start_layer/pp_end_layer,
enable_dp_attention, attn_cp_size/attn_cp_rank, and derived properties
(attn_tp_size, num_layers_per_pp_stage, token_size_in_bytes_per_pp_stage).
Add freeze() to prevent post-init mutation of parallel config.
Introduce WorkerKey(dp_rank, pp_rank) as the unique worker identifier,
replacing the flat dp_id throughout the transfer stack. This allows
TransferEngine to manage multiple PP stages within a single centralized
data plane instance, while each PP stage retains independent control
decisions (decentralized control plane).
Key changes:
- TransferOp: dp_id -> (dp_rank, pp_rank)
- TransferOpGraph: add clear_gpu_blocks()/set_gpu_blocks() for
deferred GPU block binding (PP stages share a graph template but
- Integration adapters (vllm/sglang/trt-llm): compute
pp_start_layer/pp_end_layer per stage, freeze ModelConfig after init
- C++ layerwise/tp_transfer_thread_group: use tp_size_per_node
instead of global tp_size for correct node-local eventfd grouping
feat: add Pipeline Parallelism support
fix: improve transfer manager and engine robustness for PP
- Remove CMatchResult.block_node_ids tensor; add matched_node_id (int32) - RefRadixTree::match_prefix now stops when encountering a different node_id - Simplify pybind bindings: expose matched_node_id, remove block_node_ids - Python MatchResultAccel: add matched_node_id field, broadcast to per-block arrays for backward compat in downstream worker/transfer paths - Update hie_cache_engine.py and cache_engine.py to derive per-block arrays from single matched_node_id instead of reading C++ tensor - Add unit tests for CMatchResult and MatchResultAccel matched_node_id This simplifies the distributed matching and transfer paths: - No need for shared_transfer_kv_blocks_remote_read multi-node grouping - Lease management is simpler (no cross-node cascade invalidation) - Better fault isolation (single node failure domain) - Cleaner integration with CP/PP/TP cooperative GET flow
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
injected by MUSABuilder; ext_name is always flexkv.c_ext (no c_ext_musa branch)
with per-vendor compile flags and source lists; all comments in English
covering 5 design principles, step-by-step onboarding, GDS/IPC fallback;
NvidiaBackend code example updated to match actual implementation