Skip to content

[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546

Open
thorjohnsen wants to merge 2 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-block-offset-race
Open

[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546
thorjohnsen wants to merge 2 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-block-offset-race

Conversation

@thorjohnsen

@thorjohnsen thorjohnsen commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a data race condition in KV cache block offset handling during batch processing that could result in stale or incorrect data being used.
  • Tests

    • Added regression test to prevent recurrence of the KV cache race condition.

Description

Fixes a host-side data race (nvbug 6293536) that corrupts KV-cache block
indexing under the overlap scheduler, surfacing as CUDA memory corruption.

KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset
page tables to the device with an asynchronous H2D whose source was a
single, persistently-reused pinned host buffer (host_kv_cache_block_offsets)
overwritten in place every call. An async copy reads its source at
execution time, not enqueue time, so under the overlap scheduler — where the
CPU runs an iteration ahead — the next iteration's in-place overwrite could
clobber the source of the previous iteration's still-pending H2D. The attention
kernel then indexed another batch's physical blocks → corruption.

The race window is widened by v1 host KV offloading: syncTransfers stalls the
execution stream right in front of the offset H2D, so heavy offloading (driven
by concurrent prefill + primary-memory pressure) delays the copy long enough
for the host to overwrite the buffer first.

Fix: stage the offsets through a freshly-allocated pinned host buffer on
every call instead of reusing one in place. PyTorch's caching host allocator
holds a freed pinned buffer until the consuming async copy completes, so the
source can no longer be overwritten mid-flight — the same protection the
already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager
attribute is reassigned to the current buffer so same-iteration CPU readers
(DSA sparse attention) and the speculative-decoding draft/target swap stay
coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the
copy.

This change is internal (no public API change).

Test Coverage

tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py
test_copy_batch_block_offsets_survives_overlap_overwrite drives the overlap
window deterministically with a torch.cuda._sleep stream stall (standing in
for the production syncTransfers offload stall), enqueues batch A's async H2D
behind the stall, then issues batch B before A's copy drains, and asserts batch
A's device offsets are not clobbered. The test fails on the pre-fix code and
passes with the fix. Existing tests/unittest/_torch/executor/test_resource_manager.py
continues to pass.

PR Checklist

  • Please check this after reviewing the above items as appropriate for this PR.

…host buffer

KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset
page tables to the device with an asynchronous H2D whose source was a single,
persistently-reused pinned host buffer overwritten in place every call. An
async copy reads its source at execution time, not enqueue time, so under the
overlap scheduler the CPU runs an iteration ahead and the next iteration's
in-place overwrite could clobber the source of the previous iteration's
still-pending H2D. The attention kernel then indexed another batch's physical
blocks, surfacing as memory corruption. The window is widened by v1 host KV
offloading, which stalls the execution stream (syncTransfers) right in front of
the offset H2D, plus concurrent prefill and primary-memory pressure.

Stage the offsets through a freshly-allocated pinned host buffer on every call
instead of reusing one in place. PyTorch's caching host allocator holds a freed
pinned buffer until the consuming async copy completes, so the source can no
longer be overwritten mid-flight -- the same protection the already-safe
kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is
reassigned to the current buffer so same-iteration CPU readers (DSA sparse
attention) and the speculative-decoding draft/target swap stay coherent;
TrtllmAttentionMetadata.prepare re-grabs the reference after the copy.

Add a regression test that drives the overlap window deterministically with a
stream stall and asserts batch A's device offsets are not clobbered by batch B.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@thorjohnsen thorjohnsen requested review from a team as code owners June 23, 2026 13:10
@thorjohnsen thorjohnsen requested a review from yuxianq June 23, 2026 13:10
@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3ca4b98-355b-45b9-82c8-0cb7b48b3d6b

📥 Commits

Reviewing files that changed from the base of the PR and between 439ad22 and a862b7c.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py

📝 Walkthrough

Walkthrough

Fixes nvbug 6293536: KVCacheManager.copy_batch_block_offsets now allocates a fresh pinned host tensor per call instead of overwriting a persistent buffer, preventing a prior iteration's async H2D copy from reading clobbered data. TrtllmAttentionMetadata.prepare() refreshes its host buffer reference after the call. A new CUDA regression test validates the fix.

Changes

KV Cache Block Offset Overlap Race Fix

Layer / File(s) Summary
Fresh pinned buffer allocation and caller reference refresh
tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/attention_backend/trtllm.py
copy_batch_block_offsets allocates a new pinned CPU tensor on every invocation, fills it with context and generation offsets, copies non-blocking to dst_tensor, and rebinds self.host_kv_cache_block_offsets to the fresh buffer. TrtllmAttentionMetadata.prepare() then re-reads kv_cache_manager.host_kv_cache_block_offsets to update its own reference after the call.
CUDA regression test for overlap race
tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py
Adds _build_manager(), _empty_device_offsets(), _reference_offsets(), and test_copy_batch_block_offsets_survives_overlap_overwrite(); the test stalls the CUDA stream via torch.cuda._sleep, enqueues two back-to-back async offset copies for disjoint batches, synchronizes, and asserts each batch's device tensor matches its independently computed reference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title correctly identifies the main change: staging KV block offsets through a fresh host buffer to fix a race condition (nvbug 6293536).
Description check ✅ Passed The PR description is comprehensive and follows the template with all required sections: description explaining the issue and solution, test coverage details, and checklist completion.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

Resolve copy_batch_block_offsets conflict: upstream added a cross-attention
(CacheType.CROSS, beam_width>1) early-return branch that also reused the single
persistent host_kv_cache_block_offsets buffer in place. Route both the cross
and the self-attention paths through the fresh per-iteration pinned buffer so
the nvbug 6293536 fix covers the cross path too.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55246 [ run ] triggered by Bot. Commit: ed1dfe6 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55246 [ run ] completed with state SUCCESS. Commit: ed1dfe6
/LLM/main/L0_MergeRequest_PR pipeline #44205 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants