[TRTLLM-12499][feat] Add support for chunked KVCache transfer for disaggregated serving in Python Cache Transceiver by athena-nv · Pull Request #15519 · NVIDIA/TensorRT-LLM

athena-nv · 2026-06-22T16:51:37Z

Description

Summary

Implements chunked KV cache transfer for disaggregated serving. This targets the Python transceiver (NIXL GPUDirect RDMA).

When chunk_size_blocks is set and the backend is NIXL (the default), the Python transceiver is auto-selected. The Python transceiver also avoids the contiguous staging buffer that the C++ transceiver allocates, eliminating an additional source of memory pressure for long-context requests.

Configuration

Enable chunked KV cache transfer by setting chunk_size_blocks in the YAML config:

# Context server config
context_servers:
  cache_transceiver_config:
    backend: "DEFAULT"          # or "NIXL"
    chunk_size_blocks: 64       # max blocks per layer group per chunk

# Generation server config (same setting)
generation_servers:
  cache_transceiver_config:
    backend: "DEFAULT"
    chunk_size_blocks: 64

Or via the Python API:

from tensorrt_llm.llmapi.llm_args import CacheTransceiverConfig

config = CacheTransceiverConfig(
    backend="NIXL",
    chunk_size_blocks=64,
)

`chunk_size_blocks`	Backend	Effect
`None` (default)	Any	No chunking (unchanged from today)
`64`	NIXL / DEFAULT	Python transceiver auto-selected, chunked transfer + early block release
`64`	UCX / MPI / MOONCAKE	Warning logged, `chunk_size_blocks` ignored (C++ transceiver has no chunking support yet)

Recommended values: 64-128 for long-context workloads (ISL >= 32K).

Phased Roadmap

This PR is Phase 1 of a 2-phase effort:

Phase	What	Transceiver	PR	Status
1	Chunked transfer	Python	This PR	In review
2	Pipelined prefill-transfer	Python	N/A	Prototype

What This PR Contains

Sender-only chunking at KVSlice level: _create_kv_slices partitions block IDs per layer group into slices of at most chunk_size_blocks blocks, with KVSlice.chunk_block_offset for destination alignment
Destination block slicing in _build_kv_write_meta via chunk_block_offset
CacheTransceiverConfig.chunk_size_blocks configuration field
Auto-selection of Python transceiver when chunk_size_blocks is set (NIXL/DEFAULT)
Warning when chunk_size_blocks set with unsupported backend
Receiver unchanged: single monolithic RecvReqInfo; sender slices dst blocks per chunk
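
The partitioning described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `_create_kv_slices`: `KVSliceSketch` and `create_kv_slices_sketch` are hypothetical names, only the fields needed here are modelled, and the handling of layer groups of unequal length (shorter groups simply yield shorter or empty chunks) is a guess.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KVSliceSketch:
    """Hypothetical stand-in for KVSlice with only the fields used here."""
    block_ids_per_layer_groups: List[List[int]]
    chunk_block_offset: int = 0
    is_last_slice: bool = False


def create_kv_slices_sketch(
    block_ids_per_layer_groups: List[List[int]],
    chunk_size_blocks: Optional[int],
) -> List[KVSliceSketch]:
    """Split each layer group's block IDs into chunks of at most
    chunk_size_blocks blocks, recording the starting offset of each chunk."""
    if chunk_size_blocks is None:
        # No chunking: one monolithic slice covering every block.
        whole = [list(ids) for ids in block_ids_per_layer_groups]
        return [KVSliceSketch(whole, 0, True)]
    if chunk_size_blocks <= 0:
        raise ValueError("chunk_size_blocks must be positive")

    longest = max((len(ids) for ids in block_ids_per_layer_groups), default=0)
    # Always emit at least one slice, even when there are no blocks.
    offsets = list(range(0, longest, chunk_size_blocks)) or [0]

    slices = []
    for offset in offsets:
        per_group = [
            list(ids[offset:offset + chunk_size_blocks])
            for ids in block_ids_per_layer_groups
        ]
        slices.append(KVSliceSketch(per_group, offset, offset == offsets[-1]))
    return slices
```

Concatenating the per-group chunks in order should reproduce the original block ID lists, which is the property the integrity check guards.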

Design Rationale

Why sender-only chunking: Chunking on both sides produces an N² dispatch bug. _respond_with_kv fires once per RecvReqInfo and dispatches all kv_tasks each time, so N receiver messages trigger N × N task dispatches. In addition, since RecvReqInfo has no slice_id field, multiple messages overwrite each other in _peer_requests. With sender-only chunking, one RecvReqInfo → one dispatch → N tasks, each of which uses KVSlice.chunk_block_offset to select the correct destination subset.
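
As a rough illustration of how a single receiver-side block list can serve every sender chunk, the sketch below selects the destination subset for one chunk from its offset. `slice_dst_block_ids` is a hypothetical helper, not the logic in `_build_kv_write_meta`, and the bounds check is an assumption about how a mismatch might be reported.

```python
from typing import List


def slice_dst_block_ids(
    dst_block_ids: List[int],
    chunk_block_offset: int,
    num_src_blocks: int,
) -> List[int]:
    """Return the destination blocks that line up with one sender chunk.

    dst_block_ids is the full list from the single monolithic request;
    the chunk covers [chunk_block_offset, chunk_block_offset + num_src_blocks).
    """
    end = chunk_block_offset + num_src_blocks
    if chunk_block_offset < 0 or end > len(dst_block_ids):
        # Assumed behaviour: fail loudly rather than truncate silently.
        raise ValueError(
            f"chunk [{chunk_block_offset}, {end}) exceeds "
            f"{len(dst_block_ids)} destination blocks"
        )
    return dst_block_ids[chunk_block_offset:end]
```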

Why KVSlice.chunk_block_offset (not function parameter): Per reviewer feedback (chuangz0, Shixiaowei02), the offset belongs as a member of KVSlice since the dataclass was designed to carry all slice metadata including token range, layer range, and block offsets.

Intermediate chunk results: Each chunk sends KV_AGENT_RESULT to the receiver (not just the last). Intermediate results with is_last_slice=False are no-ops on the receiver but enable immediate error propagation if a chunk's RDMA fails.
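
One way to picture the receiver-side behaviour described above is the small state function below. It is a sketch only: `RecvState` and `on_kv_agent_result` are hypothetical names, and the real receiver tracks more state than this.

```python
from enum import Enum


class RecvState(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


def on_kv_agent_result(
    state: RecvState, is_last_slice: bool, succeeded: bool
) -> RecvState:
    """Return the next receive state after one per-chunk result arrives."""
    if state is not RecvState.PENDING:
        # Already resolved; later results change nothing.
        return state
    if not succeeded:
        # A failed chunk resolves the request immediately.
        return RecvState.FAILED
    if is_last_slice:
        return RecvState.DONE
    # Successful intermediate chunk: no-op.
    return state
```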

Beam width guard: C++ releasePrefixBlocks asserts beamWidth == 1. Python-side guard in respond_and_send_async sets callback to None for beam_width > 1.

VSWA limitation: mNumFrontBlocksRemoved is shared across window managers. Single window size assumed, enforced by existing disagg gate (not is_vswa). Documented at call site.

Auto-selection of Python transceiver: When chunk_size_blocks is set with NIXL/DEFAULT backend, the Python transceiver is auto-selected. Non-NIXL backends log a warning that chunking will be ignored. The Python transceiver avoids the C++ transceiver's staging buffer, eliminating that memory pressure too.
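
The selection rule can be sketched roughly as below. This is an illustration, not the code in kv_cache_transceiver.py: `resolve_chunking`, `CHUNKING_BACKENDS`, and the `(use_python, effective_chunk_size)` return shape are hypothetical.

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger("transceiver_selection_sketch")

# Backends for which chunking is supported, per the table in Configuration.
CHUNKING_BACKENDS = {"NIXL", "DEFAULT"}


def resolve_chunking(
    backend: str,
    chunk_size_blocks: Optional[int],
    use_python: bool = False,
) -> Tuple[bool, Optional[int]]:
    """Return (use_python_transceiver, effective_chunk_size_blocks)."""
    if chunk_size_blocks is None:
        # Chunking not requested: leave the existing choice untouched.
        return use_python, None
    if backend.upper() in CHUNKING_BACKENDS:
        # Chunking requested on a supported backend: pick the Python path.
        return True, chunk_size_blocks
    logger.warning(
        "chunk_size_blocks=%s ignored: backend %s has no chunking support",
        chunk_size_blocks,
        backend,
    )
    return use_python, None
```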

Why Python transceiver first (not C++): The Python transceiver provides full control over the transfer lifecycle for chunking and callbacks. The C++ CacheTransceiver::respondAndSendAsync is monolithic with no per-chunk hook points. Phase 1b will add chunking to the C++ transceiver (~500 lines in CacheFormatter::format), which will also enable smaller staging buffers.

Changes

Python (base/transfer.py):

Add chunk_block_offset: int = 0 to KVSlice dataclass

Python (native/transfer.py):

KVSendTask reads offset from _slice.chunk_block_offset
_build_kv_write_meta dst slicing, _deliver_kv_to_agent callback + receiver_slice_id=0

Python (transceiver.py):

_collect_base_slice, _create_kv_slices (preserves mamba_state_index, sets chunk_block_offset)
Updated respond_and_send_async and request_and_receive_async
Integrity check uses np.array_equal() + ValueError

Python (kv_cache_transceiver.py):

Auto-select Python transceiver when chunk_size_blocks set
Warning for unsupported backends

Python (llm_args.py):

CacheTransceiverConfig.chunk_size_blocks field

Test Coverage

Test	What it covers
`test_create_kv_slices_basic`	5 parametrized cases calling real `_create_kv_slices`
`test_create_kv_slices_integrity_check`	Reassembled blocks match original across layer groups
`test_create_kv_slices_multiple_layer_groups`	Asymmetric layer groups produce correct chunking
`test_create_kv_slices_preserves_mamba_state`	`mamba_state_index` propagated through chunked slices
`test_transfer_worker_chunked[v1_tp1_pp1_chunked]`	E2E GPU: V1 chunked transfer with actual NIXL RDMA
`test_transfer_worker_chunked[v2_tp1_pp1_chunked]`	E2E GPU: V2 chunked transfer
`test_chunked_transfer.py`	19 tests: session state machine using real `TxSession`/`RxSession`
`test_cache_transceiver_config_chunk_size_blocks`	Config validation: valid, None, default, zero, negative
`test_chunked_kv_transfer_nixl_python_accuracy`	Test chunked KV transfer accuracy with a real disaggregated model

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

coderabbitai · 2026-06-22T17:00:37Z

📝 Walkthrough

Walkthrough

This PR adds sender-side chunked KV cache transfer for disaggregated serving. A new C++ releasePrefixBlocks API frees KV cache blocks as each chunk completes transfer, exposed through Python bindings. KvCacheTransceiverV2 gains _create_kv_slices, _make_chunk_callback, and _drain_pending_releases to partition block IDs into chunks and progressively release sender-side memory. CacheTransceiverConfig.chunk_size_blocks configures the feature. Separately, ExaoneMoeForCausalLM import is made optional.

Changes

Chunked KV Transfer + Early Prefix Block Release

Layer / File(s)	Summary
C++ releasePrefixBlocks API: declarations, implementation, bindings, and unit test `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`, `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`, `cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`, `cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`	Adds `releasePrefixBlocks` to `GenerationRequest` (counter helper), `WindowBlockManager` (per-range placeholder-swap release), `BlockManager` (cross-window orchestration), and `KVCacheManager` (request-id entry point); exposes via nanobind; validates cumulative semantics and double-free prevention in a new C++ unit test.
Python transfer contracts: KVSlice offset, callback type, TxSession/RxSession plumbing `tensorrt_llm/_torch/disaggregation/base/transfer.py`, `tensorrt_llm/_torch/disaggregation/native/transfer.py`	Adds `chunk_block_offset` to `KVSlice`, defines `OnChunkTransferredCallback`, wires the callback through `TxSession` and `TransferWorker.create_tx_session`, invokes it after each chunk completes in `_deliver_kv_to_agent`, slices destination block IDs in `_build_kv_write_meta` for non-zero offsets, fixes `RxSession.process_aux_agent_result` to use the last KV task, and fixes receiver `slice_id` to monolithic `0`.
KvCacheTransceiverV2 chunking machinery and send/receive path updates `tensorrt_llm/_torch/disaggregation/transceiver.py`	Adds `_collect_base_slice`, `_create_kv_slices` (block-ID partitioning with integrity check), `_make_chunk_callback` (enqueues pending releases), and `_drain_pending_releases`; updates `respond_and_send_async` to iterate chunks, `_get_or_create_send_session` to conditionally supply the callback for `beam_width<=1`, `check_context_transfer_status` to drain releases and use a monolithic receive slice, and shutdown paths to drain before closing.
Python resource manager, config, and transceiver selection wiring `tensorrt_llm/_torch/pyexecutor/resource_manager.py`, `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`	Adds `release_prefix_blocks` to the Python `KVCacheManager` delegating to the C++ impl; adds `chunk_size_blocks` to `CacheTransceiverConfig` (Python-only, excluded from `_to_pybind`); auto-selects the Python transceiver when `chunk_size_blocks` is set on compatible backends with warnings for incompatible backends and sub-16 sizes.
Unit and integration tests `tests/unittest/disaggregated/test_chunked_transfer.py`, `tests/unittest/disaggregated/test_kv_transfer.py`, `tests/unittest/llmapi/test_llm_args.py`, `tests/integration/defs/accuracy/test_disaggregated_serving.py`	New `test_chunked_transfer.py` covers multi-chunk session state machines, callback enqueue/drain, and mid-transfer failure; `test_kv_transfer.py` gains `_create_kv_slices` correctness tests and `test_transfer_worker_chunked`; `test_llm_args.py` covers `chunk_size_blocks` validation; integration test `test_chunked_kv_transfer_nixl_python_accuracy` runs GSM8K accuracy with chunked NIXL Python transceivers.

ExaoneMoeForCausalLM Optional Import

Layer / File(s)	Summary
Conditional ExaoneMoeForCausalLM import and `__all__` guard `tensorrt_llm/_torch/models/__init__.py`	Wraps `ExaoneMoeForCausalLM` import in `try/except ImportError`, assigns `None` on failure, and appends to `__all__` only when the import succeeded.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(100, 149, 237, 0.5)
    note over KvCacheTransceiverV2, KVCacheManager: Sender side (context server)
    KvCacheTransceiverV2->>KvCacheTransceiverV2: _create_kv_slices(req) → [chunk₀, chunk₁, …, chunkₙ]
    KvCacheTransceiverV2->>TransferWorker: create_tx_session(req, on_chunk_transferred=callback)
    TransferWorker-->>KvCacheTransceiverV2: TxSession
    loop each KVSlice chunk
      KvCacheTransceiverV2->>TxSession: send(KVSlice with chunk_block_offset)
      TxSession->>_deliver_kv_to_agent: transfer chunk to receiver agent
      _deliver_kv_to_agent-->>TxSession: chunk delivered
      TxSession->>_pending_prefix_releases: enqueue(request_id, chunk_end_offset)
    end
  end
  rect rgba(144, 238, 144, 0.5)
    note over KvCacheTransceiverV2, KVCacheManager: Executor main thread drain
    KvCacheTransceiverV2->>KvCacheTransceiverV2: _drain_pending_releases()
    KvCacheTransceiverV2->>KVCacheManager: release_prefix_blocks(request_id, num_blocks)
    KVCacheManager->>BlockManager: releasePrefixBlocks(sequence, numBlocks)
    BlockManager->>WindowBlockManager: releasePrefixBlocks(sequence, startIdx, numBlocks)
    WindowBlockManager-->>BlockManager: slots replaced with placeholders, refs decremented
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

LLM API, SW Architecture, api-compatible

Suggested reviewers

Shixiaowei02
Tabrizian
bo-nv
dongxuy04
SimengLiu-nv
jieli-matrix
schetlur-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.75% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly summarizes the main change: chunked KV cache transfer for disaggregated serving in the Python cache transceiver.
Description check	✅ Passed	The PR description follows the template and includes clear summary, configuration, design rationale, test coverage, and checklist items.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/disaggregation/native/transfer.py (1)

508-516: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Send abort results to receiver slice 0 for chunked sends.

The success path remaps every sender chunk to the receiver’s monolithic task, but this abort branch still sends write_meta.slice_id. If a later chunk aborts with slice_id > 0, the receiver rejects the result and can stay stuck until timeout.

🐛 Proposed fix

                 [
                     MessageType.KV_AGENT_RESULT,
                     str(self._instance_rank).encode("ascii"),
                     str(write_meta.unique_rid).encode("ascii"),
-                    str(write_meta.slice_id).encode("ascii"),
+                    b"0",
                     b"True",  # is_last_slice — ensures receiver resolves its task event
                     AgentResult.FAILED.value.encode("ascii"),
                 ]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 508 -
516, The abort result being sent for a failed KV agent operation uses
write_meta.slice_id, but the receiver's success path consolidates all sender
chunks to a monolithic task at the receiver's slice 0. When later chunks abort
with slice_id > 0, the receiver rejects the result causing a timeout. Replace
the write_meta.slice_id parameter (the fourth encoded string argument) in the
_get_or_connect_dealer(...).send(...) call with 0 to ensure abort results are
always routed to the receiver's monolithic task handler regardless of which
chunk failed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 570-592: The exception handler in the
session._on_chunk_transferred callback block uses a broad except Exception
clause that masks unexpected errors. Replace the generic except Exception with
specific exception types that represent expected non-fatal errors from the
callback (such as callback invocation errors or argument validation errors),
allowing unexpected exceptions to propagate and surface for debugging. Identify
the actual exceptions that may be raised by the callback implementation and
catch only those specific types instead of catching all exceptions.
- Around line 723-731: The condition for slicing destination blocks in the chunk
offset handling is too broad and triggers for monolithic transfers when extra
destination blocks exist, which silently truncates invalid block-list mismatches
instead of reaching validation logic. Modify the condition that checks whether
to slice dst_block_ids to gate the slicing operation on chunk metadata (check if
chunking is actually being used by the sender) in addition to the current
chunk_offset and src_block_ids length checks. This ensures that slicing only
happens for actual sender chunks and allows the non-chunked speculative path to
properly validate and trim exactly one draft block without silent truncation.

In `@tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`:
- Around line 73-90: The auto-selection logic for the Python transceiver in the
conditional block starting at line 73 does not respect an explicit
transceiver_runtime='CPP' setting. The condition currently only checks if
use_python is False and chunk_size_blocks is set, but it ignores explicit CPP
overrides. Add an additional check to the condition to verify that
transceiver_runtime is not explicitly set to 'CPP' before auto-selecting Python.
This way, users who explicitly set transceiver_runtime='CPP' will have their
preference honored despite chunk_size_blocks being configured.

In `@tests/integration/defs/accuracy/test_disaggregated_serving.py`:
- Around line 411-414: The except block in the tokenizer fallback logic catches
the overly broad Exception type, which can mask unrelated failures during
logging. Replace the generic Exception catch with specific exception types that
the tokenizer.encode method is known to raise (such as ValueError or
tokenizer-specific exceptions). This ensures only expected tokenizer errors are
handled in the fallback path, while allowing genuinely unexpected errors to
propagate and be caught by higher-level error handling.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 508184b3-cf7b-474b-91c8-d25ca6e7fd47

📥 Commits

Reviewing files that changed from the base of the PR and between 2e6abd1 and f0ab2ad.

📒 Files selected for processing (15)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
tensorrt_llm/_torch/disaggregation/base/transfer.py
tensorrt_llm/_torch/disaggregation/native/transfer.py
tensorrt_llm/_torch/disaggregation/transceiver.py
tensorrt_llm/_torch/models/__init__.py
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/disaggregated/test_chunked_transfer.py
tests/unittest/disaggregated/test_kv_transfer.py
tests/unittest/llmapi/test_llm_args.py

coderabbitai · 2026-06-22T17:00:40Z

+                if session._on_chunk_transferred is not None:
+                    try:
+                        # Use the max across layer groups as the
+                        # cumulative release count.  For asymmetric
+                        # layer groups (e.g., sliding window), shorter
+                        # groups may have fewer blocks per chunk, but
+                        # each WindowBlockManager independently clamps
+                        # to its own allocated block count via
+                        # min(numBlocks, allocatedBlocks.size()).
+                        num_blocks = max(
+                            (len(ids) for ids in task._slice.block_ids_per_layer_groups),
+                            default=0,
+                        )
+                        session._on_chunk_transferred(
+                            request_id=session.request_id,
+                            chunk_block_offset=task._slice.chunk_block_offset,
+                            num_blocks=num_blocks,
+                        )
+                    except Exception as e:
+                        logger.warning(
+                            f"on_chunk_transferred callback failed for "
+                            f"request {session.request_id} slice {write_meta.slice_id}: {e}"
+                        )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the callback exception handler.

except Exception hides unexpected callback bugs and is flagged by BLE001. Catch only the specific non-fatal errors expected from this internal release hook, or let unexpected errors surface.

As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare except:”.

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 588-588: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 570 - 592, The exception handler in the session._on_chunk_transferred callback block uses a broad except Exception clause that masks unexpected errors. Replace the generic except Exception with specific exception types that represent expected non-fatal errors from the callback (such as callback invocation errors or argument validation errors), allowing unexpected exceptions to propagate and surface for debugging. Identify the actual exceptions that may be raised by the callback implementation and catch only those specific types instead of catching all exceptions.

Sources: Coding guidelines, Linters/SAST tools
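
A narrowed handler along the lines the review suggests might look like the sketch below. Which exceptions the callback can legitimately raise is not stated in the PR, so `EXPECTED_CALLBACK_ERRORS` is purely an assumption, and `notify_chunk_transferred` is a hypothetical wrapper rather than the code in `_deliver_kv_to_agent`.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("chunk_callback_sketch")

# Assumption: the only non-fatal errors expected from the callback.
EXPECTED_CALLBACK_ERRORS = (KeyError, ValueError)


def notify_chunk_transferred(
    callback: Optional[Callable[..., None]],
    request_id: int,
    chunk_block_offset: int,
    num_blocks: int,
) -> bool:
    """Invoke the per-chunk callback; return True if it ran cleanly."""
    if callback is None:
        return False
    try:
        callback(
            request_id=request_id,
            chunk_block_offset=chunk_block_offset,
            num_blocks=num_blocks,
        )
    except EXPECTED_CALLBACK_ERRORS as e:
        # Expected, non-fatal: log and continue with the transfer.
        logger.warning(
            "on_chunk_transferred failed for request %s: %s", request_id, e
        )
        return False
    # Any other exception type propagates to the caller for debugging.
    return True
```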

Tabrizian

Discussed offline with @athena-nv. Early block release doesn't work with chunked KVCache transfer and needs to be removed from the PR.

athena-nv · 2026-06-24T21:38:17Z

/bot run

tensorrt-cicd · 2026-06-24T21:45:08Z

PR_Github #55599 [ run ] triggered by Bot. Commit: d6a7f14 Link to invocation

tensorrt-cicd · 2026-06-24T22:25:51Z

PR_Github #55599 [ run ] completed with state FAILURE. Commit: d6a7f14
/LLM/main/L0_MergeRequest_PR pipeline #44516 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

athena-nv · 2026-06-24T23:18:58Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T23:24:37Z

PR_Github #55613 [ run ] triggered by Bot. Commit: d6a7f14 Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Made-with: Cursor

… counter

- Replace _collect_block_ids with _collect_base_slice to preserve the full KVSlice metadata (including mamba_state_index) through all new code paths: _create_kv_slices (sender) and request_and_receive_async (receiver). Without this, Mamba/hybrid-state model transfers would lose required state metadata.
- Fix VSWA shared counter bug in WindowBlockManager::releasePrefixBlocks: snapshot mNumFrontBlocksRemoved before iterating window managers so each manager releases blocks from the same range. Previously the first manager advanced the shared counter, causing subsequent managers to skip their own blocks entirely.
- Guard chunking integrity assertion with __debug__ to avoid O(N) CPU overhead on the hot path in optimized builds.
- Add tests for mamba_state_index propagation through chunked slices.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

- Update copyright year to 2026 in nanobind kvCacheManager.cpp
- Add OnChunkTransferredCallback type alias for precise callback typing
- Add strict=True to zip() calls in chunked transfer tests

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor

- Fix chunking integrity check: use np.array_equal() instead of == for numpy array comparison, raise ValueError instead of assert (eopXD comment on transceiver.py)
- Add explicit VSWA limitation comment in BlockManager::releasePrefixBlocks documenting the single-window-size assumption (eopXD comment on kvCacheManager.cpp)
- Auto-select Python transceiver when chunk_size_blocks is set and backend is NIXL/DEFAULT. The C++ transceiver does not support chunked transfer; this makes chunking work without requiring users to manually set transceiver_runtime="PYTHON" (pcastonguay comment on transceiver.py)

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor

Per reviewer feedback (chuangz0, Shixiaowei02): chunk_block_offset belongs as a member of KVSlice rather than a function parameter on send(). The KVSlice dataclass was designed to carry all slice metadata.

- Add chunk_block_offset: int = 0 to KVSlice dataclass
- Remove chunk_block_offset from TxSessionBase.send() signature
- Remove chunk_block_offset from TxSession.send() signature
- Remove chunk_block_offset from KVSendTask.__init__
- Read chunk_block_offset from task._slice in _build_kv_write_meta and _deliver_kv_to_agent callback
- Set chunk_block_offset on each KVSlice in _create_kv_slices
- Update all tests accordingly

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

Signed-off-by: Athena Cai <athenac@nvidia.com>

tensorrt-cicd · 2026-06-25T03:06:30Z

PR_Github #55613 [ run ] completed with state FAILURE. Commit: d6a7f14
/LLM/main/L0_MergeRequest_PR pipeline #44531 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

athena-nv requested review from a team as code owners June 22, 2026 16:51

athena-nv requested review from HuiGao-NV, Superjomn, bo-nv, chuangz0 and syuoni June 22, 2026 16:51

github-actions Bot assigned athena-nv Jun 22, 2026

coderabbitai Bot reviewed Jun 22, 2026

athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch 2 times, most recently from d7f8578 to 7746bbc Compare June 22, 2026 22:44

jieli-matrix approved these changes Jun 23, 2026

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml

Tabrizian requested changes Jun 23, 2026

Comment thread tensorrt_llm/_torch/models/__init__.py

athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch 2 times, most recently from 759fcc9 to 4cd1aa0 Compare June 24, 2026 21:36

athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch from 4cd1aa0 to d6a7f14 Compare June 24, 2026 21:41

athena-nv changed the title ~~[TRTLLM-11608][feat] Chunked KV cache transfer with early block release~~ [TRTLLM-12499][feat] Add support for chunked KVCache transfer for disaggregated serving in Python Cache Transceiver Jun 24, 2026

chienchunhung added 3 commits June 24, 2026 23:44

chienchunhung and others added 5 commits June 24, 2026 23:44

fix ups

97eb94a

Signed-off-by: Athena Cai <athenac@nvidia.com>

address coderabbit comments

2702a10

Signed-off-by: Athena Cai <athenac@nvidia.com>

Revert early block release implementation

423f698

Signed-off-by: Athena Cai <athenac@nvidia.com>

athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch from d6a7f14 to 423f698 Compare June 24, 2026 23:47
