[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer by thorjohnsen · Pull Request #15546 · NVIDIA/TensorRT-LLM

thorjohnsen · 2026-06-23T13:10:34Z

Summary by CodeRabbit

Bug Fixes
- Fixed a data race condition in KV cache block offset handling during batch processing that could result in stale or incorrect data being used.
Tests
- Added regression test to prevent recurrence of the KV cache race condition.

Description

Fixes a host-side data race (nvbug 6293536) that corrupts KV-cache block
indexing under the overlap scheduler, surfacing as CUDA memory corruption.

KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset
page tables to the device with an asynchronous H2D whose source was a
single, persistently-reused pinned host buffer (host_kv_cache_block_offsets)
overwritten in place every call. An async copy reads its source at
execution time, not enqueue time, so under the overlap scheduler — where the
CPU runs an iteration ahead — the next iteration's in-place overwrite could
clobber the source of the previous iteration's still-pending H2D. The attention
kernel then indexed another batch's physical blocks → corruption.

The race window is widened by v1 host KV offloading: syncTransfers stalls the
execution stream right in front of the offset H2D, so heavy offloading (driven
by concurrent prefill + primary-memory pressure) delays the copy long enough
for the host to overwrite the buffer first.

Fix: stage the offsets through a freshly-allocated pinned host buffer on
every call instead of reusing one in place. PyTorch's caching host allocator
holds a freed pinned buffer until the consuming async copy completes, so the
source can no longer be overwritten mid-flight — the same protection the
already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager
attribute is reassigned to the current buffer so same-iteration CPU readers
(DSA sparse attention) and the speculative-decoding draft/target swap stay
coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the
copy.

This change is internal (no public API change).

Test Coverage

tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py —
test_copy_batch_block_offsets_survives_overlap_overwrite drives the overlap
window deterministically with a torch.cuda._sleep stream stall (standing in
for the production syncTransfers offload stall), enqueues batch A's async H2D
behind the stall, then issues batch B before A's copy drains, and asserts batch
A's device offsets are not clobbered. The test fails on the pre-fix code and
passes with the fix. Existing tests/unittest/_torch/executor/test_resource_manager.py
continues to pass.

PR Checklist

Please check this after reviewing the above items as appropriate for this PR.

…host buffer KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset page tables to the device with an asynchronous H2D whose source was a single, persistently-reused pinned host buffer overwritten in place every call. An async copy reads its source at execution time, not enqueue time, so under the overlap scheduler the CPU runs an iteration ahead and the next iteration's in-place overwrite could clobber the source of the previous iteration's still-pending H2D. The attention kernel then indexed another batch's physical blocks, surfacing as memory corruption. The window is widened by v1 host KV offloading, which stalls the execution stream (syncTransfers) right in front of the offset H2D, plus concurrent prefill and primary-memory pressure. Stage the offsets through a freshly-allocated pinned host buffer on every call instead of reusing one in place. PyTorch's caching host allocator holds a freed pinned buffer until the consuming async copy completes, so the source can no longer be overwritten mid-flight -- the same protection the already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is reassigned to the current buffer so same-iteration CPU readers (DSA sparse attention) and the speculative-decoding draft/target swap stay coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the copy. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen · 2026-06-23T13:11:57Z

/bot run

coderabbitai · 2026-06-23T13:14:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3ca4b98-355b-45b9-82c8-0cb7b48b3d6b

📥 Commits

Reviewing files that changed from the base of the PR and between 439ad22 and a862b7c.

📒 Files selected for processing (3)

tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py

📝 Walkthrough

Walkthrough

Fixes nvbug 6293536: KVCacheManager.copy_batch_block_offsets now allocates a fresh pinned host tensor per call instead of overwriting a persistent buffer, preventing a prior iteration's async H2D copy from reading clobbered data. TrtllmAttentionMetadata.prepare() refreshes its host buffer reference after the call. A new CUDA regression test validates the fix.

Changes

KV Cache Block Offset Overlap Race Fix

Layer / File(s)	Summary
Fresh pinned buffer allocation and caller reference refresh `tensorrt_llm/_torch/pyexecutor/resource_manager.py`, `tensorrt_llm/_torch/attention_backend/trtllm.py`	`copy_batch_block_offsets` allocates a new pinned CPU tensor on every invocation, fills it with context and generation offsets, copies non-blocking to `dst_tensor`, and rebinds `self.host_kv_cache_block_offsets` to the fresh buffer. `TrtllmAttentionMetadata.prepare()` then re-reads `kv_cache_manager.host_kv_cache_block_offsets` to update its own reference after the call.
CUDA regression test for overlap race `tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py`	Adds `_build_manager()`, `_empty_device_offsets()`, `_reference_offsets()`, and `test_copy_batch_block_offsets_survives_overlap_overwrite()`; the test stalls the CUDA stream via `torch.cuda._sleep`, enqueues two back-to-back async offset copies for disjoint batches, synchronizes, and asserts each batch's device tensor matches its independently computed reference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title correctly identifies the main change: staging KV block offsets through a fresh host buffer to fix a race condition (nvbug 6293536).
Description check	✅ Passed	The PR description is comprehensive and follows the template with all required sections: description explaining the issue and solution, test coverage details, and checklist completion.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

Resolve copy_batch_block_offsets conflict: upstream added a cross-attention (CacheType.CROSS, beam_width>1) early-return branch that also reused the single persistent host_kv_cache_block_offsets buffer in place. Route both the cross and the self-attention paths through the fresh per-iteration pinned buffer so the nvbug 6293536 fix covers the cross path too. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen · 2026-06-23T13:18:06Z

/bot run

tensorrt-cicd · 2026-06-23T13:26:36Z

PR_Github #55246 [ run ] triggered by Bot. Commit: ed1dfe6 Link to invocation

tensorrt-cicd · 2026-06-23T16:15:21Z

PR_Github #55246 [ run ] completed with state SUCCESS. Commit: ed1dfe6
/LLM/main/L0_MergeRequest_PR pipeline #44205 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

thorjohnsen requested review from a team as code owners June 23, 2026 13:10

thorjohnsen requested a review from yuxianq June 23, 2026 13:10

github-actions Bot assigned thorjohnsen Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546

[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546
thorjohnsen wants to merge 2 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-block-offset-race

thorjohnsen commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

thorjohnsen commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

thorjohnsen commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thorjohnsen commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

Uh oh!

thorjohnsen commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

thorjohnsen commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

thorjohnsen commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading