
[TRTLLM-12499][feat] Add support for chunked KVCache transfer for disaggregated serving in Python Cache Transceiver #15519

Open
athena-nv wants to merge 8 commits into NVIDIA:main from athena-nv:trtllm-11608-chunked-kvcache-transfer-combined


@athena-nv athena-nv commented Jun 22, 2026

Collaborator

Original PR: #12602 by @chienchunhung

Description

Summary

Implements chunked KV cache transfer for disaggregated serving. This targets the Python transceiver (NIXL GPUDirect RDMA).

When chunk_size_blocks is set and the backend is NIXL (the default), the Python transceiver is auto-selected. The Python transceiver also avoids the contiguous staging buffer that the C++ transceiver allocates, eliminating an additional source of memory pressure for long-context requests.

Configuration

Enable chunked KV cache transfer by setting chunk_size_blocks in the YAML config:

# Context server config
context_servers:
  cache_transceiver_config:
    backend: "DEFAULT"          # or "NIXL"
    chunk_size_blocks: 64       # max blocks per layer group per chunk

# Generation server config (same setting)
generation_servers:
  cache_transceiver_config:
    backend: "DEFAULT"
    chunk_size_blocks: 64

Or via the Python API:

from tensorrt_llm.llmapi.llm_args import CacheTransceiverConfig

config = CacheTransceiverConfig(
    backend="NIXL",
    chunk_size_blocks=64,
)
| chunk_size_blocks | Backend | Effect |
|---|---|---|
| None (default) | Any | No chunking (unchanged from today) |
| 64 | NIXL / DEFAULT | Python transceiver auto-selected, chunked transfer + early block release |
| 64 | UCX / MPI / MOONCAKE | Warning logged, chunk_size_blocks ignored (C++ transceiver has no chunking support yet) |

Recommended values: 64-128 for long-context workloads (ISL >= 32K).
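The backend table above can be read as a small decision function. The following is only a sketch of that table, not the code in kv_cache_transceiver.py: the function name `resolve_chunking` and its return shape are hypothetical, and only the backend names and the three outcomes come from the description.

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger(__name__)

# Backends for which the description says chunking is honored.
CHUNKING_BACKENDS = {"NIXL", "DEFAULT"}


def resolve_chunking(backend: str,
                     chunk_size_blocks: Optional[int]) -> Tuple[bool, Optional[int]]:
    """Hypothetical helper mirroring the table.

    Returns (auto_select_python, effective_chunk_size_blocks).
    """
    if chunk_size_blocks is None:
        # Row 1: no chunking, transceiver selection is left unchanged.
        return False, None
    if backend.upper() in CHUNKING_BACKENDS:
        # Row 2: Python transceiver is auto-selected and chunking is active.
        return True, chunk_size_blocks
    # Row 3: unsupported backend, warn and ignore the setting.
    logger.warning(
        "chunk_size_blocks=%s ignored: backend %s has no chunking support",
        chunk_size_blocks, backend)
    return False, None
```

For example, `resolve_chunking("UCX", 64)` logs a warning and returns `(False, None)`, while `resolve_chunking("NIXL", 64)` returns `(True, 64)`.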

Phased Roadmap

This PR is Phase 1 of a 2-phase effort:

| Phase | What | Transceiver | PR | Status |
|---|---|---|---|---|
| 1 | Chunked transfer | Python | This PR | In review |
| 2 | Pipelined prefill-transfer | Python | N/A | Prototype |

What This PR Contains

  • Sender-only chunking at KVSlice level: _create_kv_slices partitions block IDs per layer group into slices of at most chunk_size_blocks blocks, with KVSlice.chunk_block_offset for destination alignment
  • Destination block slicing in _build_kv_write_meta via chunk_block_offset
  • CacheTransceiverConfig.chunk_size_blocks configuration field
  • Auto-selection of Python transceiver when chunk_size_blocks is set (NIXL/DEFAULT)
  • Warning when chunk_size_blocks set with unsupported backend
  • Receiver unchanged: single monolithic RecvReqInfo; sender slices dst blocks per chunk
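To make the sender-side partitioning concrete, here is a minimal sketch of how block IDs per layer group might be cut into chunks that each carry their own offset. It is not the PR's `_create_kv_slices`: `SliceSketch` is a hypothetical stand-in for `KVSlice` with only the fields needed here, and `create_slices` is an illustrative name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SliceSketch:
    """Hypothetical stand-in for KVSlice (only the fields used below)."""
    block_ids_per_layer_groups: List[List[int]]
    chunk_block_offset: int = 0
    is_last_slice: bool = False


def create_slices(block_ids_per_layer_groups: List[List[int]],
                  chunk_size_blocks: Optional[int]) -> List[SliceSketch]:
    """Partition each layer group into chunks of at most chunk_size_blocks."""
    if chunk_size_blocks is not None and chunk_size_blocks <= 0:
        raise ValueError("chunk_size_blocks must be positive")
    longest = max((len(ids) for ids in block_ids_per_layer_groups), default=0)
    if chunk_size_blocks is None or longest <= chunk_size_blocks:
        # No chunking needed: a single monolithic slice at offset 0.
        whole = [list(ids) for ids in block_ids_per_layer_groups]
        return [SliceSketch(whole, 0, True)]
    slices = []
    for offset in range(0, longest, chunk_size_blocks):
        # Shorter (asymmetric) groups simply yield shorter or empty chunks.
        chunk = [list(ids[offset:offset + chunk_size_blocks])
                 for ids in block_ids_per_layer_groups]
        is_last = offset + chunk_size_blocks >= longest
        slices.append(SliceSketch(chunk, offset, is_last))
    return slices
```

With two layer groups of 5 and 3 blocks and `chunk_size_blocks=2`, this produces three slices at offsets 0, 2, and 4, and only the third is marked as the last slice.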

Design Rationale

Why sender-only chunking: Chunking on both sides produces an N² dispatch bug. _respond_with_kv fires for each RecvReqInfo and dispatches all kv_tasks. Since RecvReqInfo has no slice_id field, multiple messages overwrite each other in _peer_requests. With sender-only chunking, one RecvReqInfo → one dispatch → N tasks use KVSlice.chunk_block_offset to slice the correct destination subset.
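The "one RecvReqInfo, N tasks" idea can be illustrated with a small sketch of the destination slicing step. This assumes the receiver advertises one full destination block list per layer group and each sender chunk picks its window using the offset; the helper name `slice_dst_blocks` is hypothetical and the real `_build_kv_write_meta` may differ.

```python
from typing import List


def slice_dst_blocks(dst_block_ids: List[int],
                     src_block_ids: List[int],
                     chunk_block_offset: int) -> List[int]:
    """Pick the destination blocks that correspond to one sender chunk.

    dst_block_ids is the receiver's full (monolithic) list for a layer group;
    src_block_ids is the sender's chunk for the same group.
    """
    end = chunk_block_offset + len(src_block_ids)
    if chunk_block_offset < 0 or end > len(dst_block_ids):
        # Fail loudly instead of silently truncating a mismatched block list.
        raise ValueError(
            f"chunk [{chunk_block_offset}, {end}) exceeds "
            f"{len(dst_block_ids)} destination blocks")
    return dst_block_ids[chunk_block_offset:end]


# Three sender chunks (sizes 2, 2, 1) together cover the receiver's 5 blocks.
dst = [100, 101, 102, 103, 104]
covered = (slice_dst_blocks(dst, [7, 8], 0)
           + slice_dst_blocks(dst, [9, 10], 2)
           + slice_dst_blocks(dst, [11], 4))
```

Because every chunk derives its destination window from the same receiver list, no per-chunk message from the receiver is needed.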

Why KVSlice.chunk_block_offset (not function parameter): Per reviewer feedback (chuangz0, Shixiaowei02), the offset belongs as a member of KVSlice since the dataclass was designed to carry all slice metadata including token range, layer range, and block offsets.

Intermediate chunk results: Each chunk sends KV_AGENT_RESULT to the receiver (not just the last). Intermediate results with is_last_slice=False are no-ops on the receiver but enable immediate error propagation if a chunk's RDMA fails.
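A rough sketch of how a receiver might treat these per-chunk results is shown below. It only models the behavior described above (intermediate successes are no-ops, a failure resolves the task immediately); `ResultSketch` and `ReceiverTaskSketch` are hypothetical names and not the PR's RxSession code.

```python
from enum import Enum


class ResultSketch(Enum):
    """Hypothetical stand-in for the agent result status."""
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


class ReceiverTaskSketch:
    """Tracks the receiver's single monolithic task for one request."""

    def __init__(self) -> None:
        self.done = False
        self.failed = False

    def on_agent_result(self, is_last_slice: bool, result: ResultSketch) -> None:
        if self.done:
            return  # Already resolved; ignore late results.
        if result is ResultSketch.FAILED:
            # Any failed chunk resolves the task right away.
            self.failed = True
            self.done = True
        elif is_last_slice:
            # Only the final successful chunk completes the transfer.
            self.done = True
        # Intermediate success: intentionally a no-op.
```

The benefit is on the failure path: a failed second chunk out of ten marks the task as failed without waiting for the remaining chunks or a timeout.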

Beam width guard: C++ releasePrefixBlocks asserts beamWidth == 1. Python-side guard in respond_and_send_async sets callback to None for beam_width > 1.

VSWA limitation: mNumFrontBlocksRemoved is shared across window managers. Single window size assumed, enforced by existing disagg gate (not is_vswa). Documented at call site.

Auto-selection of Python transceiver: When chunk_size_blocks is set with NIXL/DEFAULT backend, the Python transceiver is auto-selected. Non-NIXL backends log a warning that chunking will be ignored. The Python transceiver avoids the C++ transceiver's staging buffer, eliminating that memory pressure too.

Why Python transceiver first (not C++): The Python transceiver provides full control over the transfer lifecycle for chunking and callbacks. The C++ CacheTransceiver::respondAndSendAsync is monolithic with no per-chunk hook points. Phase 1b will add chunking to the C++ transceiver (~500 lines in CacheFormatter::format), which will also enable smaller staging buffers.

Changes

Python (base/transfer.py):

  • Add chunk_block_offset: int = 0 to KVSlice dataclass

Python (native/transfer.py):

  • KVSendTask reads offset from _slice.chunk_block_offset
  • _build_kv_write_meta dst slicing, _deliver_kv_to_agent callback + receiver_slice_id=0

Python (transceiver.py):

  • _collect_base_slice, _create_kv_slices (preserves mamba_state_index, sets chunk_block_offset)
  • Updated respond_and_send_async and request_and_receive_async
  • Integrity check uses np.array_equal() + ValueError
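The integrity check can be sketched as follows, assuming block IDs are held as NumPy arrays per layer group. The function name `check_chunks_cover_original` is hypothetical. The reason for `np.array_equal()` is that `==` on arrays compares elementwise and returns an array rather than a single bool, whereas `np.array_equal()` returns one bool and is `False` when shapes differ.

```python
import numpy as np


def check_chunks_cover_original(original, chunks) -> None:
    """Raise ValueError unless concatenated chunks reproduce the original.

    original: list of 1-D arrays, one per layer group.
    chunks:   list of chunks, each a list of 1-D arrays (one per layer group).
    """
    for group_idx, expected in enumerate(original):
        parts = [chunk[group_idx] for chunk in chunks]
        if parts:
            reassembled = np.concatenate(parts)
        else:
            reassembled = np.empty(0, dtype=expected.dtype)
        if not np.array_equal(reassembled, expected):
            raise ValueError(
                f"chunked block IDs for layer group {group_idx} "
                f"do not match the original")


# Example: two layer groups cut into chunks of 2 blocks.
orig = [np.arange(5), np.arange(10, 13)]
chunks = [[g[i:i + 2] for g in orig] for i in range(0, 5, 2)]
check_chunks_cover_original(orig, chunks)  # passes silently
```

Raising `ValueError` instead of using `assert` keeps the check active when Python runs with assertions disabled.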

Python (kv_cache_transceiver.py):

  • Auto-select Python transceiver when chunk_size_blocks set
  • Warning for unsupported backends

Python (llm_args.py):

  • CacheTransceiverConfig.chunk_size_blocks field

Test Coverage

| Test | What it covers |
|---|---|
| test_create_kv_slices_basic | 5 parametrized cases calling real _create_kv_slices |
| test_create_kv_slices_integrity_check | Reassembled blocks match original across layer groups |
| test_create_kv_slices_multiple_layer_groups | Asymmetric layer groups produce correct chunking |
| test_create_kv_slices_preserves_mamba_state | mamba_state_index propagated through chunked slices |
| test_transfer_worker_chunked[v1_tp1_pp1_chunked] | E2E GPU: V1 chunked transfer with actual NIXL RDMA |
| test_transfer_worker_chunked[v2_tp1_pp1_chunked] | E2E GPU: V2 chunked transfer |
| test_chunked_transfer.py | 19 tests: session state machine using real TxSession/RxSession |
| test_cache_transceiver_config_chunk_size_blocks | Config validation: valid, None, default, zero, negative |
| test_chunked_kv_transfer_nixl_python_accuracy | Test chunked KV transfer accuracy with a real disaggregated model |

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

📝 Walkthrough


This PR adds sender-side chunked KV cache transfer for disaggregated serving. A new C++ releasePrefixBlocks API frees KV cache blocks as each chunk completes transfer, exposed through Python bindings. KvCacheTransceiverV2 gains _create_kv_slices, _make_chunk_callback, and _drain_pending_releases to partition block IDs into chunks and progressively release sender-side memory. CacheTransceiverConfig.chunk_size_blocks configures the feature. Separately, ExaoneMoeForCausalLM import is made optional.

Changes

Chunked KV Transfer + Early Prefix Block Release

Layer / File(s) Summary
C++ releasePrefixBlocks API: declarations, implementation, bindings, and unit test
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
Adds releasePrefixBlocks to GenerationRequest (counter helper), WindowBlockManager (per-range placeholder-swap release), BlockManager (cross-window orchestration), and KVCacheManager (request-id entry point); exposes via nanobind; validates cumulative semantics and double-free prevention in a new C++ unit test.
Python transfer contracts: KVSlice offset, callback type, TxSession/RxSession plumbing
tensorrt_llm/_torch/disaggregation/base/transfer.py, tensorrt_llm/_torch/disaggregation/native/transfer.py
Adds chunk_block_offset to KVSlice, defines OnChunkTransferredCallback, wires the callback through TxSession and TransferWorker.create_tx_session, invokes it after each chunk completes in _deliver_kv_to_agent, slices destination block IDs in _build_kv_write_meta for non-zero offsets, fixes RxSession.process_aux_agent_result to use the last KV task, and fixes receiver slice_id to monolithic 0.
KvCacheTransceiverV2 chunking machinery and send/receive path updates
tensorrt_llm/_torch/disaggregation/transceiver.py
Adds _collect_base_slice, _create_kv_slices (block-ID partitioning with integrity check), _make_chunk_callback (enqueues pending releases), and _drain_pending_releases; updates respond_and_send_async to iterate chunks, _get_or_create_send_session to conditionally supply the callback for beam_width<=1, check_context_transfer_status to drain releases and use a monolithic receive slice, and shutdown paths to drain before closing.
Python resource manager, config, and transceiver selection wiring
tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
Adds release_prefix_blocks to the Python KVCacheManager delegating to the C++ impl; adds chunk_size_blocks to CacheTransceiverConfig (Python-only, excluded from _to_pybind); auto-selects the Python transceiver when chunk_size_blocks is set on compatible backends with warnings for incompatible backends and sub-16 sizes.
Unit and integration tests
tests/unittest/disaggregated/test_chunked_transfer.py, tests/unittest/disaggregated/test_kv_transfer.py, tests/unittest/llmapi/test_llm_args.py, tests/integration/defs/accuracy/test_disaggregated_serving.py
New test_chunked_transfer.py covers multi-chunk session state machines, callback enqueue/drain, and mid-transfer failure; test_kv_transfer.py gains _create_kv_slices correctness tests and test_transfer_worker_chunked; test_llm_args.py validates chunk_size_blocks validation; integration test test_chunked_kv_transfer_nixl_python_accuracy runs GSM8K accuracy with chunked NIXL Python transceivers.

ExaoneMoeForCausalLM Optional Import

Layer / File(s) Summary
Conditional ExaoneMoeForCausalLM import and __all__ guard
tensorrt_llm/_torch/models/__init__.py
Wraps ExaoneMoeForCausalLM import in try/except ImportError, assigns None on failure, and appends to __all__ only when the import succeeded.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(100, 149, 237, 0.5)
    note over KvCacheTransceiverV2, KVCacheManager: Sender side (context server)
    KvCacheTransceiverV2->>KvCacheTransceiverV2: _create_kv_slices(req) → [chunk₀, chunk₁, …, chunkₙ]
    KvCacheTransceiverV2->>TransferWorker: create_tx_session(req, on_chunk_transferred=callback)
    TransferWorker-->>KvCacheTransceiverV2: TxSession
    loop each KVSlice chunk
      KvCacheTransceiverV2->>TxSession: send(KVSlice with chunk_block_offset)
      TxSession->>_deliver_kv_to_agent: transfer chunk to receiver agent
      _deliver_kv_to_agent-->>TxSession: chunk delivered
      TxSession->>_pending_prefix_releases: enqueue(request_id, chunk_end_offset)
    end
  end
  rect rgba(144, 238, 144, 0.5)
    note over KvCacheTransceiverV2, KVCacheManager: Executor main thread drain
    KvCacheTransceiverV2->>KvCacheTransceiverV2: _drain_pending_releases()
    KvCacheTransceiverV2->>KVCacheManager: release_prefix_blocks(request_id, num_blocks)
    KVCacheManager->>BlockManager: releasePrefixBlocks(sequence, numBlocks)
    BlockManager->>WindowBlockManager: releasePrefixBlocks(sequence, startIdx, numBlocks)
    WindowBlockManager-->>BlockManager: slots replaced with placeholders, refs decremented
  end
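The two-phase flow in the diagram (the transfer thread enqueues, the executor main thread drains) can be sketched with a thread-safe queue. This is an illustration of the pattern only: `PrefixReleaseSketch` is a hypothetical class, and coalescing to the largest offset per request is an assumption based on the cumulative semantics mentioned in the walkthrough.

```python
import queue
from typing import Callable, Dict


class PrefixReleaseSketch:
    """Defers block releases from transfer threads to the main thread."""

    def __init__(self, release_prefix_blocks: Callable[[int, int], None]) -> None:
        self._release = release_prefix_blocks
        self._pending: "queue.SimpleQueue" = queue.SimpleQueue()

    def make_chunk_callback(self):
        def on_chunk_transferred(request_id: int, chunk_block_offset: int,
                                 num_blocks: int) -> None:
            # Runs on a transfer thread: only record the chunk's end offset.
            self._pending.put((request_id, chunk_block_offset + num_blocks))
        return on_chunk_transferred

    def drain_pending_releases(self) -> int:
        """Call from the main thread; returns the number of requests released."""
        latest: Dict[int, int] = {}
        while True:
            try:
                request_id, end_offset = self._pending.get_nowait()
            except queue.Empty:
                break
            # Assumed cumulative semantics: keep the furthest offset seen.
            latest[request_id] = max(latest.get(request_id, 0), end_offset)
        for request_id, end_offset in latest.items():
            self._release(request_id, end_offset)
        return len(latest)
```

Keeping the callback down to a single `put` means the cache manager is only ever touched from the thread that owns it.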

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

LLM API, SW Architecture, api-compatible

Suggested reviewers

  • Shixiaowei02
  • Tabrizian
  • bo-nv
  • dongxuy04
  • SimengLiu-nv
  • jieli-matrix
  • schetlur-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the main change: chunked KV cache transfer for disaggregated serving in the Python cache transceiver. |
| Description check | ✅ Passed | The PR description follows the template and includes clear summary, configuration, design rationale, test coverage, and checklist items. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/disaggregation/native/transfer.py (1)

508-516: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Send abort results to receiver slice 0 for chunked sends.

The success path remaps every sender chunk to the receiver’s monolithic task, but this abort branch still sends write_meta.slice_id. If a later chunk aborts with slice_id > 0, the receiver rejects the result and can stay stuck until timeout.

🐛 Proposed fix
                 [
                     MessageType.KV_AGENT_RESULT,
                     str(self._instance_rank).encode("ascii"),
                     str(write_meta.unique_rid).encode("ascii"),
-                    str(write_meta.slice_id).encode("ascii"),
+                    b"0",
                     b"True",  # is_last_slice — ensures receiver resolves its task event
                     AgentResult.FAILED.value.encode("ascii"),
                 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 508 -
516, The abort result being sent for a failed KV agent operation uses
write_meta.slice_id, but the receiver's success path consolidates all sender
chunks to a monolithic task at the receiver's slice 0. When later chunks abort
with slice_id > 0, the receiver rejects the result causing a timeout. Replace
the write_meta.slice_id parameter (the fourth encoded string argument) in the
_get_or_connect_dealer(...).send(...) call with 0 to ensure abort results are
always routed to the receiver's monolithic task handler regardless of which
chunk failed.
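As a sketch of the proposed fix, the abort message can be built by a helper that never takes the sender's chunk id and always addresses the receiver's monolithic task. The frame layout follows the diff above; the constant values and the helper name `build_abort_frames` are hypothetical.

```python
from typing import List

# Hypothetical wire values standing in for MessageType / AgentResult members.
KV_AGENT_RESULT = b"KV_AGENT_RESULT"
FAILED = "FAILED"

# The receiver tracks a single monolithic task, so results always target slice 0.
RECEIVER_SLICE_ID = b"0"


def build_abort_frames(instance_rank: int, unique_rid: int) -> List[bytes]:
    """Build the multipart abort result for any failed sender chunk."""
    return [
        KV_AGENT_RESULT,
        str(instance_rank).encode("ascii"),
        str(unique_rid).encode("ascii"),
        RECEIVER_SLICE_ID,  # not the sender's chunk slice_id
        b"True",            # is_last_slice, so the receiver resolves its task
        FAILED.encode("ascii"),
    ]
```

Because the sender's `slice_id` is not an input at all, a later chunk cannot accidentally route its failure to a slice the receiver does not know about.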
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 570-592: The exception handler in the
session._on_chunk_transferred callback block uses a broad except Exception
clause that masks unexpected errors. Replace the generic except Exception with
specific exception types that represent expected non-fatal errors from the
callback (such as callback invocation errors or argument validation errors),
allowing unexpected exceptions to propagate and surface for debugging. Identify
the actual exceptions that may be raised by the callback implementation and
catch only those specific types instead of catching all exceptions.
- Around line 723-731: The condition for slicing destination blocks in the chunk
offset handling is too broad and triggers for monolithic transfers when extra
destination blocks exist, which silently truncates invalid block-list mismatches
instead of reaching validation logic. Modify the condition that checks whether
to slice dst_block_ids to gate the slicing operation on chunk metadata (check if
chunking is actually being used by the sender) in addition to the current
chunk_offset and src_block_ids length checks. This ensures that slicing only
happens for actual sender chunks and allows the non-chunked speculative path to
properly validate and trim exactly one draft block without silent truncation.

In `@tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py`:
- Around line 73-90: The auto-selection logic for the Python transceiver in the
conditional block starting at line 73 does not respect an explicit
transceiver_runtime='CPP' setting. The condition currently only checks if
use_python is False and chunk_size_blocks is set, but it ignores explicit CPP
overrides. Add an additional check to the condition to verify that
transceiver_runtime is not explicitly set to 'CPP' before auto-selecting Python.
This way, users who explicitly set transceiver_runtime='CPP' will have their
preference honored despite chunk_size_blocks being configured.

In `@tests/integration/defs/accuracy/test_disaggregated_serving.py`:
- Around line 411-414: The except block in the tokenizer fallback logic catches
the overly broad Exception type, which can mask unrelated failures during
logging. Replace the generic Exception catch with specific exception types that
the tokenizer.encode method is known to raise (such as ValueError or
tokenizer-specific exceptions). This ensures only expected tokenizer errors are
handled in the fallback path, while allowing genuinely unexpected errors to
propagate and be caught by higher-level error handling.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 508184b3-cf7b-474b-91c8-d25ca6e7fd47

📥 Commits

Reviewing files that changed from the base of the PR and between 2e6abd1 and f0ab2ad.

📒 Files selected for processing (15)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • tensorrt_llm/_torch/disaggregation/base/transfer.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tensorrt_llm/_torch/disaggregation/transceiver.py
  • tensorrt_llm/_torch/models/__init__.py
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/disaggregated/test_chunked_transfer.py
  • tests/unittest/disaggregated/test_kv_transfer.py
  • tests/unittest/llmapi/test_llm_args.py

Comment on lines +570 to +592
if session._on_chunk_transferred is not None:
    try:
        # Use the max across layer groups as the
        # cumulative release count. For asymmetric
        # layer groups (e.g., sliding window), shorter
        # groups may have fewer blocks per chunk, but
        # each WindowBlockManager independently clamps
        # to its own allocated block count via
        # min(numBlocks, allocatedBlocks.size()).
        num_blocks = max(
            (len(ids) for ids in task._slice.block_ids_per_layer_groups),
            default=0,
        )
        session._on_chunk_transferred(
            request_id=session.request_id,
            chunk_block_offset=task._slice.chunk_block_offset,
            num_blocks=num_blocks,
        )
    except Exception as e:
        logger.warning(
            f"on_chunk_transferred callback failed for "
            f"request {session.request_id} slice {write_meta.slice_id}: {e}"
        )

Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the callback exception handler.

except Exception hides unexpected callback bugs and is flagged by BLE001. Catch only the specific non-fatal errors expected from this internal release hook, or let unexpected errors surface.

As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare except:”.
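One way the narrowed handler could look is sketched below. Which exception types count as expected and non-fatal is not stated in this thread, so the tuple `EXPECTED_CALLBACK_ERRORS` is purely an assumption, and `invoke_chunk_callback` is a hypothetical wrapper rather than code from transfer.py.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: the non-fatal errors this callback is expected to raise.
EXPECTED_CALLBACK_ERRORS = (ValueError, RuntimeError)


def invoke_chunk_callback(callback, request_id, chunk_block_offset, num_blocks) -> bool:
    """Invoke the per-chunk callback; return True if it ran without error."""
    if callback is None:
        return False
    try:
        callback(
            request_id=request_id,
            chunk_block_offset=chunk_block_offset,
            num_blocks=num_blocks,
        )
    except EXPECTED_CALLBACK_ERRORS as exc:
        # Known, recoverable failure: log it and let the transfer continue.
        logger.warning(
            "on_chunk_transferred callback failed for request %s: %s",
            request_id, exc)
        return False
    # Anything else (for example KeyError or TypeError) propagates to the caller.
    return True
```

With this shape, a programming error inside the callback surfaces as a traceback instead of a warning line.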

🧰 Tools
🪛 Ruff (0.15.18)

[warning] 588-588: Do not catch blind exception: Exception

(BLE001)


Sources: Coding guidelines, Linters/SAST tools

Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py
Comment thread tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
Comment thread tests/integration/defs/accuracy/test_disaggregated_serving.py Outdated
athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch 2 times, most recently from d7f8578 to 7746bbc on June 22, 2026 22:44
Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml

@Tabrizian Tabrizian left a comment

Member


Discussed offline with @athena-nv. Early block release doesn't work with chunked KVCache transfer and needs to be removed from the PR.

Comment thread tensorrt_llm/_torch/models/__init__.py
athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch 2 times, most recently from 759fcc9 to 4cd1aa0 on June 24, 2026 21:36
@athena-nv

Collaborator Author

/bot run

athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch from 4cd1aa0 to d6a7f14 on June 24, 2026 21:41
@tensorrt-cicd

Collaborator

PR_Github #55599 [ run ] triggered by Bot. Commit: d6a7f14 Link to invocation

athena-nv changed the title from "[TRTLLM-11608][feat] Chunked KV cache transfer with early block release" to "[TRTLLM-12499][feat] Add support for chunked KVCache transfer for disaggregated serving in Python Cache Transceiver" on Jun 24, 2026
@tensorrt-cicd

Collaborator

PR_Github #55599 [ run ] completed with state FAILURE. Commit: d6a7f14
/LLM/main/L0_MergeRequest_PR pipeline #44516 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@athena-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55613 [ run ] triggered by Bot. Commit: d6a7f14 Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
… counter

- Replace _collect_block_ids with _collect_base_slice to preserve the
  full KVSlice metadata (including mamba_state_index) through all new
  code paths: _create_kv_slices (sender) and request_and_receive_async
  (receiver).  Without this, Mamba/hybrid-state model transfers would
  lose required state metadata.

- Fix VSWA shared counter bug in WindowBlockManager::releasePrefixBlocks:
  snapshot mNumFrontBlocksRemoved before iterating window managers so
  each manager releases blocks from the same range.  Previously the
  first manager advanced the shared counter, causing subsequent managers
  to skip their own blocks entirely.

- Guard chunking integrity assertion with __debug__ to avoid O(N) CPU
  overhead on the hot path in optimized builds.

- Add tests for mamba_state_index propagation through chunked slices.
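The shared-counter fix in the second bullet can be modeled in a few lines of Python. The real code is C++; the classes below are hypothetical simplifications that only show why the counter is snapshotted once before the loop and advanced once after it.

```python
class WindowManagerSketch:
    """Simplified per-window manager; None marks a released slot."""

    def __init__(self, blocks) -> None:
        self.blocks = list(blocks)

    def release_range(self, start: int, count: int) -> None:
        # Each manager clamps to the number of blocks it actually holds.
        end = min(start + count, len(self.blocks))
        for idx in range(start, end):
            self.blocks[idx] = None


class SequenceSketch:
    """Holds the counter shared by all window managers."""

    def __init__(self) -> None:
        self.num_front_blocks_removed = 0


def release_prefix_blocks(sequence, managers, num_blocks: int) -> None:
    """Release blocks so that num_blocks in total are freed (cumulative)."""
    start = sequence.num_front_blocks_removed  # snapshot before iterating
    if num_blocks <= start:
        return  # Already released: prevents a double free.
    for manager in managers:
        # Every manager sees the same [start, num_blocks) range.
        manager.release_range(start, num_blocks - start)
    # Advance the shared counter only after all managers have run.
    sequence.num_front_blocks_removed = num_blocks
```

If the counter were advanced inside the loop, the second manager would start from the updated value and skip its own blocks, which is the bug described above.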

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
- Update copyright year to 2026 in nanobind kvCacheManager.cpp
- Add OnChunkTransferredCallback type alias for precise callback typing
- Add strict=True to zip() calls in chunked transfer tests

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
chienchunhung and others added 5 commits June 24, 2026 23:44
- Fix chunking integrity check: use np.array_equal() instead of ==
  for numpy array comparison, raise ValueError instead of assert
  (eopXD comment on transceiver.py)

- Add explicit VSWA limitation comment in BlockManager::releasePrefixBlocks
  documenting the single-window-size assumption
  (eopXD comment on kvCacheManager.cpp)

- Auto-select Python transceiver when chunk_size_blocks is set and
  backend is NIXL/DEFAULT. The C++ transceiver does not support
  chunked transfer; this makes chunking work without requiring
  users to manually set transceiver_runtime="PYTHON"
  (pcastonguay comment on transceiver.py)

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Per reviewer feedback (chuangz0, Shixiaowei02): chunk_block_offset
belongs as a member of KVSlice rather than a function parameter on
send(). The KVSlice dataclass was designed to carry all slice metadata.

- Add chunk_block_offset: int = 0 to KVSlice dataclass
- Remove chunk_block_offset from TxSessionBase.send() signature
- Remove chunk_block_offset from TxSession.send() signature
- Remove chunk_block_offset from KVSendTask.__init__
- Read chunk_block_offset from task._slice in _build_kv_write_meta
  and _deliver_kv_to_agent callback
- Set chunk_block_offset on each KVSlice in _create_kv_slices
- Update all tests accordingly

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Athena Cai <athenac@nvidia.com>
Signed-off-by: Athena Cai <athenac@nvidia.com>
Signed-off-by: Athena Cai <athenac@nvidia.com>
athena-nv force-pushed the trtllm-11608-chunked-kvcache-transfer-combined branch from d6a7f14 to 423f698 on June 24, 2026 23:47
@tensorrt-cicd

Collaborator

PR_Github #55613 [ run ] completed with state FAILURE. Commit: d6a7f14
/LLM/main/L0_MergeRequest_PR pipeline #44531 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation


5 participants