[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546
[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer#15546thorjohnsen wants to merge 2 commits into
Conversation
…host buffer KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset page tables to the device with an asynchronous H2D whose source was a single, persistently-reused pinned host buffer overwritten in place every call. An async copy reads its source at execution time, not enqueue time, so under the overlap scheduler the CPU runs an iteration ahead and the next iteration's in-place overwrite could clobber the source of the previous iteration's still-pending H2D. The attention kernel then indexed another batch's physical blocks, surfacing as memory corruption. The window is widened by v1 host KV offloading, which stalls the execution stream (syncTransfers) right in front of the offset H2D, plus concurrent prefill and primary-memory pressure. Stage the offsets through a freshly-allocated pinned host buffer on every call instead of reusing one in place. PyTorch's caching host allocator holds a freed pinned buffer until the consuming async copy completes, so the source can no longer be overwritten mid-flight -- the same protection the already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is reassigned to the current buffer so same-iteration CPU readers (DSA sparse attention) and the speculative-decoding draft/target swap stay coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the copy. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
|
/bot run |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughFixes nvbug 6293536: ChangesKV Cache Block Offset Overlap Race Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Resolve copy_batch_block_offsets conflict: upstream added a cross-attention (CacheType.CROSS, beam_width>1) early-return branch that also reused the single persistent host_kv_cache_block_offsets buffer in place. Route both the cross and the self-attention paths through the fresh per-iteration pinned buffer so the nvbug 6293536 fix covers the cross path too. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
|
/bot run |
|
PR_Github #55246 [ run ] triggered by Bot. Commit: |
|
PR_Github #55246 [ run ] completed with state
|
Summary by CodeRabbit
Bug Fixes
Tests
Description
Fixes a host-side data race (nvbug 6293536) that corrupts KV-cache block
indexing under the overlap scheduler, surfacing as CUDA memory corruption.
KVCacheManager.copy_batch_block_offsetsshipped each iteration's block-offsetpage tables to the device with an asynchronous H2D whose source was a
single, persistently-reused pinned host buffer (
host_kv_cache_block_offsets)overwritten in place every call. An async copy reads its source at
execution time, not enqueue time, so under the overlap scheduler — where the
CPU runs an iteration ahead — the next iteration's in-place overwrite could
clobber the source of the previous iteration's still-pending H2D. The attention
kernel then indexed another batch's physical blocks → corruption.
The race window is widened by v1 host KV offloading:
syncTransfersstalls theexecution stream right in front of the offset H2D, so heavy offloading (driven
by concurrent prefill + primary-memory pressure) delays the copy long enough
for the host to overwrite the buffer first.
Fix: stage the offsets through a freshly-allocated pinned host buffer on
every call instead of reusing one in place. PyTorch's caching host allocator
holds a freed pinned buffer until the consuming async copy completes, so the
source can no longer be overwritten mid-flight — the same protection the
already-safe
kv_lens/block_ids_per_seqstaging paths rely on. The managerattribute is reassigned to the current buffer so same-iteration CPU readers
(DSA sparse attention) and the speculative-decoding draft/target swap stay
coherent;
TrtllmAttentionMetadata.preparere-grabs the reference after thecopy.
This change is internal (no public API change).
Test Coverage
tests/unittest/_torch/executor/test_kv_block_offset_overlap_race.py—test_copy_batch_block_offsets_survives_overlap_overwritedrives the overlapwindow deterministically with a
torch.cuda._sleepstream stall (standing infor the production
syncTransfersoffload stall), enqueues batch A's async H2Dbehind the stall, then issues batch B before A's copy drains, and asserts batch
A's device offsets are not clobbered. The test fails on the pre-fix code and
passes with the fix. Existing
tests/unittest/_torch/executor/test_resource_manager.pycontinues to pass.
PR Checklist