[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice#15461
Conversation
📝 WalkthroughWalkthroughThree correctness fixes for encoder-decoder beam search: ChangesEncoder-Decoder Beam Search Correctness
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp`:
- Around line 194-198: The current test at the copyGenerationLogits call site
uses a dummy cache and the direct-copy implementation path, which does not
exercise the fragmentPointerDevice slot isolation logic needed to detect race
conditions. Modify or add a test case that exercises the cached-fragment merge
path (using real cache data rather than a dummy cache) with at least two
concurrent requests/slots, and validate that fragmentPointerDevice pointers are
not clobbered across different slot allocations to catch regressions of the
maxBatchSize x kCACHE_LENGTH change.
- Around line 63-70: The test CopyBlockOffsetsAllBeamsShareBeam0Blocks is too
lenient because it relies on the allocator already sharing blocks, making
equality across beams an unreliable check that could pass even with the bug
present. Modify the test to explicitly set per-beam source rows to different
values before calling copyBlockOffsets, then assert that the output rows are all
normalized to beam 0. This forces a discriminating scenario where the old buggy
behavior would fail (keeping per-beam differences) while the fixed behavior
would pass (normalizing all to beam 0). Apply the same principle to the related
test mentioned around lines 115-140.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d6e2d2c4-bf3b-49ae-964b-984a43ebb5dc
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.hcpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/tensorrt_llm/batch_manager/runtimeBuffers.cppcpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cppcpp/tensorrt_llm/batch_manager/utils/inflightBatchingUtils.cppcpp/tests/unit_tests/batch_manager/CMakeLists.txtcpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp
|
/bot run |
|
PR_Github #54871 [ run ] triggered by Bot. Commit: |
|
PR_Github #54871 [ run ] completed with state
|
826b435 to
72c55d5
Compare
|
/bot run |
|
PR_Github #54877 [ run ] triggered by Bot. Commit: |
|
PR_Github #54877 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54895 [ run ] triggered by Bot. Commit: |
|
PR_Github #54895 [ run ] completed with state
|
72c55d5 to
b8b6670
Compare
|
/bot run |
|
PR_Github #54936 [ run ] triggered by Bot. Commit: |
|
PR_Github #54936 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54948 [ run ] triggered by Bot. Commit: |
|
PR_Github #54948 [ run ] completed with state |
Code review on PR NVIDIA#15461 identified that the cross-KV copyBlockOffsets beam-0 sharing invariant is already maintained upstream before copyBlockOffsets is called, making the guard redundant. Remove the fix from KVCacheManager::copyBlockOffsets and the corresponding CrossKvBeamSharingTest unit test. Validated locally: the remaining CopyGenerationLogitsTest (fix 3) passes independently after this removal. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
In KVCacheManager::copyBlockOffsets, when building the cross-attention KV cache offset table for encoder-decoder models (e.g. Whisper), each beam of a request was assigned separate physical blocks. However, the encoder context phase only writes encoder features to beam-0's blocks (contextBeamWidth=1 means only slot 0 per request is addressed). Beams 1..N-1 pointed to uninitialised GPU memory, causing degenerate repetitive output such as "happ happ happ" during beam-search decoding. Fix: when isCrossKv(), always use beam-0's source block IDs and block count for every beam. The encoder output is identical for all beams of a request, so sharing the same physical blocks is semantically correct. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
When concurrent requests with different beam widths (e.g. beam=1 and beam=5) triggered a verifyRequests() exception in TrtGptModelInflightBatching:: forwardAsync(), the exception handler freed the KV-cache sequence slots via terminateRequest() but did not remove the affected request IDs from mInflightReqIds. The subsequent changeBeamWidth() call checks TLLM_CHECK(mInflightReqIds.empty()) and would abort the process. Fix: in the exception catch block, erase each active request from mInflightReqIds before calling terminateRequest(), then call changeBeamWidth(mOperatingBeamWidth) to reset RuntimeBuffers and DecoderState so the next batch starts from a clean state. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Two tests in encDecBeamSearchTest.cpp: CrossKvBeamSharingTest/CopyBlockOffsetsAllBeamsShareBeam0Blocks Verifies that KVCacheManager::copyBlockOffsets for a cross-KV cache (CacheType::kCROSS, beam width > 1) writes the same physical block IDs into every beam slot. In the simple context-only setup the allocator shares blocks across beams, so this acts as a sanity check for the correct invariant. CopyGenerationLogitsTest/DirectCopyPlacesEachBeamStepAtCorrectHostOffset Verifies that copyGenerationLogits correctly places each (beam, step) logit from its GPU fragment tensor into the right slot of the host logits buffer. Making copyGenerationLogits a no-op produces 56 assertion failures. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…tPointerDevice
The mergeLogitsFragmentsKernel reads fragment GPU addresses from fragmentPointerDevice
([kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments
sequentially, each overwrites that shared device buffer before the previous kernel has
committed its result on the stream — a race that caused intermittent degenerate output
("happ happ happ") with gather_generation_logits=True.
Fix: resize fragmentPointerDevice from [kCACHE_LENGTH] to [maxBatchSize, kCACHE_LENGTH],
matching the layout of fragmentPointerHost. Add getFragmentPointerDevice() to return the
current-workIdx row without advancing it, and call it before getFragmentPointerHost() so
both use the same per-slot row. Each request in the batch now has its own device-side
pointer array, eliminating cross-request clobbering.
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…s PR Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…ccuracy - Replace getFragmentPointerDevice()/getFragmentPointerHost() with a single getFragmentPointerSlot() that returns both rows atomically and advances workIdx once, making it impossible to call them in the wrong order (gripe 2). - Add TLLM_CHECK that cacheBlockIds is non-empty on the cross-KV path before accessing cacheBlockIds[0] (gripe 5). - Fix the Fix-3 description in encDecBeamSearchTest.cpp: this branch uses a repaired kernel (per-slot fragmentPointerDevice), not the direct-copy path described by the stale comment (gripe 3). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Fix-1 test (CrossKvBeamSharingTest): previously the allocator happened to share blocks across beams, making equality across beam slots a tautology that passed both with and without the fix. Now the test explicitly writes distinct per-beam sentinel values into the source cacheBlockIndices tensor before calling copyBlockOffsets. The old code (srcBeamIdx = beamIdx) leaves beams 1..N-1 with their own different values; the fixed code (srcBeamIdx = 0 when isCrossKv()) normalises all slots to beam-0. Confirmed: test fails when the fix is reverted. Fix-3 test (CopyGenerationLogitsTest): the previous test used a null dummy cache and did not exercise the mergeLogitsFragmentsKernel path. Replaced with a test that allocates a real GenerationLogitsCache (transposedLogits, fragmentPointerDevice, fragmentPointerHost), creates fragments as pinned-memory slices of cache.logits with known per-(step, beam) values, and verifies both requests produce the correct host layout via the actual kernel merge path. Running two back-to-back flushes also validates that each uses a separate fragmentPointerDevice slot (slot isolation). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
… handler In pipeline-parallel multi-micro-batch mode, requests from other micro-batches may still be tracked in mInflightReqIds after the error handler erases only the current activeRequests. changeBeamWidth asserts mInflightReqIds.empty(), so calling it unconditionally would throw (caught by the inner try-catch) and skip the intended buffer reset. Add an explicit guard so the reset is skipped when other micro-batches are still in-flight. The next successful forwardAsync iteration will perform the reset via the normal changeBeamWidth call in verifyRequests once the set is clear. This makes the skip explicit rather than relying on TLLM_CHECK as control flow. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Code review on PR NVIDIA#15461 identified that the cross-KV copyBlockOffsets beam-0 sharing invariant is already maintained upstream before copyBlockOffsets is called, making the guard redundant. Remove the fix from KVCacheManager::copyBlockOffsets and the corresponding CrossKvBeamSharingTest unit test. Validated locally: the remaining CopyGenerationLogitsTest (fix 3) passes independently after this removal. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…Test
The top-of-file block described the bug fix in PR-specific terms ("Before
the fix"/"After the fix") that have no long-term value in the source tree.
The per-test docstring already captures the invariant being verified.
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
c6367e9 to
713c457
Compare
|
/bot skip --comment "ci already passed" |
|
PR_Github #55257 [ skip ] triggered by Bot. Commit: |
|
PR_Github #55257 [ skip ] completed with state |
@coderabbitai summary
Description
Fixes two bugs in the TRT backend (
ModelRunnerCpp) that caused corrupted output when concurrent requests with different beam widths (e.g. beam=1 and beam=5) were submitted to the same executor on encoder-decoder models (Whisper) with beam search.Bug 1 —
mInflightReqIdsleak on mixed-beam exceptionWhen concurrent requests with different beam widths triggered the
verifyRequests()exception inTrtGptModelInflightBatching::forwardAsync, the exception handler freed the KV-cache but did not remove request IDs frommInflightReqIds. The subsequentchangeBeamWidth()call asserts this set is empty and aborted.Fix: erase affected IDs from
mInflightReqIdsand callchangeBeamWidth(mOperatingBeamWidth)in the exception catch block.Bug 2 —
gather_generation_logitscorruption via sharedfragmentPointerDevicemergeLogitsFragmentsKernelreads fragment GPU addresses fromfragmentPointerDevice(shape[kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments sequentially, each overwrites that shared device buffer before the previous kernel has committed — a race causing intermittent degenerate output withgather_generation_logits=True.Fix: resize
fragmentPointerDevicefrom[kCACHE_LENGTH]to[maxBatchSize, kCACHE_LENGTH], matching the layout offragmentPointerHost. AddgetFragmentPointerSlot()which returns both host and device rows atomically and advancesworkIdxonce, making call-order errors structurally impossible. Each request in the batch now has its own device-side pointer array.Test Coverage
C++ unit test (
cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp):CopyGenerationLogitsTest/KernelMergePathProducesCorrectHostLayoutAndSlotsAreIsolated— verifies Bug 2 fix: each batch slot gets its own device-side pointer row, and slots are fully isolated from one another.Validation on Whisper large-v3 with concurrent mixed-beam execution:
gather_generation_logitsgather_generation_logits=TruePR Checklist
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.