[TRTLLM-12499][feat] Add support for chunked KVCache transfer for disaggregated serving in Python Cache Transceiver#15519
Conversation
📝 WalkthroughWalkthroughThis PR adds sender-side chunked KV cache transfer for disaggregated serving. A new C++ ChangesChunked KV Transfer + Early Prefix Block Release
ExaoneMoeForCausalLM Optional Import
Sequence Diagram(s)sequenceDiagram
rect rgba(100, 149, 237, 0.5)
note over KvCacheTransceiverV2, KVCacheManager: Sender side (context server)
KvCacheTransceiverV2->>KvCacheTransceiverV2: _create_kv_slices(req) → [chunk₀, chunk₁, …, chunkₙ]
KvCacheTransceiverV2->>TransferWorker: create_tx_session(req, on_chunk_transferred=callback)
TransferWorker-->>KvCacheTransceiverV2: TxSession
loop each KVSlice chunk
KvCacheTransceiverV2->>TxSession: send(KVSlice with chunk_block_offset)
TxSession->>_deliver_kv_to_agent: transfer chunk to receiver agent
_deliver_kv_to_agent-->>TxSession: chunk delivered
TxSession->>_pending_prefix_releases: enqueue(request_id, chunk_end_offset)
end
end
rect rgba(144, 238, 144, 0.5)
note over KvCacheTransceiverV2, KVCacheManager: Executor main thread drain
KvCacheTransceiverV2->>KvCacheTransceiverV2: _drain_pending_releases()
KvCacheTransceiverV2->>KVCacheManager: release_prefix_blocks(request_id, num_blocks)
KVCacheManager->>BlockManager: releasePrefixBlocks(sequence, numBlocks)
BlockManager->>WindowBlockManager: releasePrefixBlocks(sequence, startIdx, numBlocks)
WindowBlockManager-->>BlockManager: slots replaced with placeholders, refs decremented
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/disaggregation/native/transfer.py (1)
508-516:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSend abort results to receiver slice
0for chunked sends.The success path remaps every sender chunk to the receiver’s monolithic task, but this abort branch still sends
write_meta.slice_id. If a later chunk aborts withslice_id > 0, the receiver rejects the result and can stay stuck until timeout.🐛 Proposed fix
[ MessageType.KV_AGENT_RESULT, str(self._instance_rank).encode("ascii"), str(write_meta.unique_rid).encode("ascii"), - str(write_meta.slice_id).encode("ascii"), + b"0", b"True", # is_last_slice — ensures receiver resolves its task event AgentResult.FAILED.value.encode("ascii"), ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 508 - 516, The abort result being sent for a failed KV agent operation uses write_meta.slice_id, but the receiver's success path consolidates all sender chunks to a monolithic task at the receiver's slice 0. When later chunks abort with slice_id > 0, the receiver rejects the result causing a timeout. Replace the write_meta.slice_id parameter (the fourth encoded string argument) in the _get_or_connect_dealer(...).send(...) call with 0 to ensure abort results are always routed to the receiver's monolithic task handler regardless of which chunk failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 570-592: The exception handler in the
session._on_chunk_transferred callback block uses a broad except Exception
clause that masks unexpected errors. Replace the generic except Exception with
specific exception types that represent expected non-fatal errors from the
callback (such as callback invocation errors or argument validation errors),
allowing unexpected exceptions to propagate and surface for debugging. Identify
the actual exceptions that may be raised by the callback implementation and
catch only those specific types instead of catching all exceptions.
- Around line 723-731: The condition for slicing destination blocks in the chunk
offset handling is too broad and triggers for monolithic transfers when extra
destination blocks exist, which silently truncates invalid block-list mismatches
instead of reaching validation logic. Modify the condition that checks whether
to slice dst_block_ids to gate the slicing operation on chunk metadata (check if
chunking is actually being used by the sender) in addition to the current
chunk_offset and src_block_ids length checks. This ensures that slicing only
happens for actual sender chunks and allows the non-chunked speculative path to
properly validate and trim exactly one draft block without silent truncation.
In `@tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`:
- Around line 73-90: The auto-selection logic for the Python transceiver in the
conditional block starting at line 73 does not respect an explicit
transceiver_runtime='CPP' setting. The condition currently only checks if
use_python is False and chunk_size_blocks is set, but it ignores explicit CPP
overrides. Add an additional check to the condition to verify that
transceiver_runtime is not explicitly set to 'CPP' before auto-selecting Python.
This way, users who explicitly set transceiver_runtime='CPP' will have their
preference honored despite chunk_size_blocks being configured.
In `@tests/integration/defs/accuracy/test_disaggregated_serving.py`:
- Around line 411-414: The except block in the tokenizer fallback logic catches
the overly broad Exception type, which can mask unrelated failures during
logging. Replace the generic Exception catch with specific exception types that
the tokenizer.encode method is known to raise (such as ValueError or
tokenizer-specific exceptions). This ensures only expected tokenizer errors are
handled in the fallback path, while allowing genuinely unexpected errors to
propagate and be caught by higher-level error handling.
---
Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 508-516: The abort result being sent for a failed KV agent
operation uses write_meta.slice_id, but the receiver's success path consolidates
all sender chunks to a monolithic task at the receiver's slice 0. When later
chunks abort with slice_id > 0, the receiver rejects the result causing a
timeout. Replace the write_meta.slice_id parameter (the fourth encoded string
argument) in the _get_or_connect_dealer(...).send(...) call with 0 to ensure
abort results are always routed to the receiver's monolithic task handler
regardless of which chunk failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 508184b3-cf7b-474b-91c8-d25ca6e7fd47
📒 Files selected for processing (15)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.hcpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cppcpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpptensorrt_llm/_torch/disaggregation/base/transfer.pytensorrt_llm/_torch/disaggregation/native/transfer.pytensorrt_llm/_torch/disaggregation/transceiver.pytensorrt_llm/_torch/models/__init__.pytensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytensorrt_llm/llmapi/llm_args.pytests/integration/defs/accuracy/test_disaggregated_serving.pytests/unittest/disaggregated/test_chunked_transfer.pytests/unittest/disaggregated/test_kv_transfer.pytests/unittest/llmapi/test_llm_args.py
| if session._on_chunk_transferred is not None: | ||
| try: | ||
| # Use the max across layer groups as the | ||
| # cumulative release count. For asymmetric | ||
| # layer groups (e.g., sliding window), shorter | ||
| # groups may have fewer blocks per chunk, but | ||
| # each WindowBlockManager independently clamps | ||
| # to its own allocated block count via | ||
| # min(numBlocks, allocatedBlocks.size()). | ||
| num_blocks = max( | ||
| (len(ids) for ids in task._slice.block_ids_per_layer_groups), | ||
| default=0, | ||
| ) | ||
| session._on_chunk_transferred( | ||
| request_id=session.request_id, | ||
| chunk_block_offset=task._slice.chunk_block_offset, | ||
| num_blocks=num_blocks, | ||
| ) | ||
| except Exception as e: | ||
| logger.warning( | ||
| f"on_chunk_transferred callback failed for " | ||
| f"request {session.request_id} slice {write_meta.slice_id}: {e}" | ||
| ) |
There was a problem hiding this comment.
Narrow the callback exception handler.
except Exception hides unexpected callback bugs and is flagged by BLE001. Catch only the specific non-fatal errors expected from this internal release hook, or let unexpected errors surface.
As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare except:”.
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 588-588: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 570 -
592, The exception handler in the session._on_chunk_transferred callback block
uses a broad except Exception clause that masks unexpected errors. Replace the
generic except Exception with specific exception types that represent expected
non-fatal errors from the callback (such as callback invocation errors or
argument validation errors), allowing unexpected exceptions to propagate and
surface for debugging. Identify the actual exceptions that may be raised by the
callback implementation and catch only those specific types instead of catching
all exceptions.
Sources: Coding guidelines, Linters/SAST tools
d7f8578 to
7746bbc
Compare
Tabrizian
left a comment
There was a problem hiding this comment.
Discussed offline with @athena-nv . early block release doesn't work with chunked KVCache transfer and needs to be removed from the PR.
759fcc9 to
4cd1aa0
Compare
|
/bot run |
4cd1aa0 to
d6a7f14
Compare
|
PR_Github #55599 [ run ] triggered by Bot. Commit: |
|
PR_Github #55599 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55613 [ run ] triggered by Bot. Commit: |
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor
… counter - Replace _collect_block_ids with _collect_base_slice to preserve the full KVSlice metadata (including mamba_state_index) through all new code paths: _create_kv_slices (sender) and request_and_receive_async (receiver). Without this, Mamba/hybrid-state model transfers would lose required state metadata. - Fix VSWA shared counter bug in WindowBlockManager::releasePrefixBlocks: snapshot mNumFrontBlocksRemoved before iterating window managers so each manager releases blocks from the same range. Previously the first manager advanced the shared counter, causing subsequent managers to skip their own blocks entirely. - Guard chunking integrity assertion with __debug__ to avoid O(N) CPU overhead on the hot path in optimized builds. - Add tests for mamba_state_index propagation through chunked slices. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
- Update copyright year to 2026 in nanobind kvCacheManager.cpp - Add OnChunkTransferredCallback type alias for precise callback typing - Add strict=True to zip() calls in chunked transfer tests Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor
- Fix chunking integrity check: use np.array_equal() instead of == for numpy array comparison, raise ValueError instead of assert (eopXD comment on transceiver.py) - Add explicit VSWA limitation comment in BlockManager::releasePrefixBlocks documenting the single-window-size assumption (eopXD comment on kvCacheManager.cpp) - Auto-select Python transceiver when chunk_size_blocks is set and backend is NIXL/DEFAULT. The C++ transceiver does not support chunked transfer; this makes chunking work without requiring users to manually set transceiver_runtime="PYTHON" (pcastonguay comment on transceiver.py) Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor
Per reviewer feedback (chuangz0, Shixiaowei02): chunk_block_offset belongs as a member of KVSlice rather than a function parameter on send(). The KVSlice dataclass was designed to carry all slice metadata. - Add chunk_block_offset: int = 0 to KVSlice dataclass - Remove chunk_block_offset from TxSessionBase.send() signature - Remove chunk_block_offset from TxSession.send() signature - Remove chunk_block_offset from KVSendTask.__init__ - Read chunk_block_offset from task._slice in _build_kv_write_meta and _deliver_kv_to_agent callback - Set chunk_block_offset on each KVSlice in _create_kv_slices - Update all tests accordingly Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Athena Cai <athenac@nvidia.com>
Signed-off-by: Athena Cai <athenac@nvidia.com>
d6a7f14 to
423f698
Compare
|
PR_Github #55613 [ run ] completed with state
|
Original PR: #12602 by @chienchunhung
Description
Summary
Implements chunked KV cache transfer for disaggregated serving. This targets the Python transceiver (NIXL GPUDirect RDMA).
When
chunk_size_blocksis set and the backend is NIXL (the default), the Python transceiver is auto-selected. The Python transceiver also avoids the contiguous staging buffer that the C++ transceiver allocates, eliminating an additional source of memory pressure for long-context requests.Configuration
Enable chunked KV cache transfer by setting
chunk_size_blocksin the YAML config:Or via the Python API:
chunk_size_blocksNone(default)6464chunk_size_blocksignored (C++ transceiver has no chunking support yet)Recommended values: 64-128 for long-context workloads (ISL >= 32K).
Phased Roadmap
This PR is Phase 1 of a 2-phase effort:
What This PR Contains
KVSlicelevel:_create_kv_slicespartitions block IDs per layer group into slices of at mostchunk_size_blocksblocks, withKVSlice.chunk_block_offsetfor destination alignment_build_kv_write_metaviachunk_block_offsetCacheTransceiverConfig.chunk_size_blocksconfiguration fieldchunk_size_blocksis set (NIXL/DEFAULT)chunk_size_blocksset with unsupported backendRecvReqInfo; sender slices dst blocks per chunkDesign Rationale
Why sender-only chunking: Chunking on both sides produces an N² dispatch bug.
_respond_with_kvfires for eachRecvReqInfoand dispatches allkv_tasks. SinceRecvReqInfohas noslice_idfield, multiple messages overwrite each other in_peer_requests. With sender-only chunking, oneRecvReqInfo→ one dispatch → N tasks useKVSlice.chunk_block_offsetto slice the correct destination subset.Why
KVSlice.chunk_block_offset(not function parameter): Per reviewer feedback (chuangz0, Shixiaowei02), the offset belongs as a member ofKVSlicesince the dataclass was designed to carry all slice metadata including token range, layer range, and block offsets.Intermediate chunk results: Each chunk sends
KV_AGENT_RESULTto the receiver (not just the last). Intermediate results withis_last_slice=Falseare no-ops on the receiver but enable immediate error propagation if a chunk's RDMA fails.Beam width guard: C++
releasePrefixBlocksassertsbeamWidth == 1. Python-side guard inrespond_and_send_asyncsets callback toNoneforbeam_width > 1.VSWA limitation:
mNumFrontBlocksRemovedis shared across window managers. Single window size assumed, enforced by existing disagg gate (not is_vswa). Documented at call site.Auto-selection of Python transceiver: When
chunk_size_blocksis set with NIXL/DEFAULT backend, the Python transceiver is auto-selected. Non-NIXL backends log a warning that chunking will be ignored. The Python transceiver avoids the C++ transceiver's staging buffer, eliminating that memory pressure too.Why Python transceiver first (not C++): The Python transceiver provides full control over the transfer lifecycle for chunking and callbacks. The C++
CacheTransceiver::respondAndSendAsyncis monolithic with no per-chunk hook points. Phase 1b will add chunking to the C++ transceiver (~500 lines inCacheFormatter::format), which will also enable smaller staging buffers.Changes
Python (
base/transfer.py):chunk_block_offset: int = 0toKVSlicedataclassPython (
native/transfer.py):KVSendTaskreads offset from_slice.chunk_block_offset_build_kv_write_metadst slicing,_deliver_kv_to_agentcallback +receiver_slice_id=0Python (
transceiver.py):_collect_base_slice,_create_kv_slices(preservesmamba_state_index, setschunk_block_offset)respond_and_send_asyncandrequest_and_receive_asyncnp.array_equal()+ValueErrorPython (
kv_cache_transceiver.py):chunk_size_blockssetPython (
llm_args.py):CacheTransceiverConfig.chunk_size_blocksfieldTest Coverage
test_create_kv_slices_basic_create_kv_slicestest_create_kv_slices_integrity_checktest_create_kv_slices_multiple_layer_groupstest_create_kv_slices_preserves_mamba_statemamba_state_indexpropagated through chunked slicestest_transfer_worker_chunked[v1_tp1_pp1_chunked]test_transfer_worker_chunked[v2_tp1_pp1_chunked]test_chunked_transfer.pyTxSession/RxSessiontest_cache_transceiver_config_chunk_size_blockstest_chunked_kv_transfer_nixl_python_accuracyPR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.