[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice by achartier · Pull Request #15461 · NVIDIA/TensorRT-LLM

achartier · 2026-06-17T18:01:26Z

Description

Fixes two bugs in the TRT backend (ModelRunnerCpp) that caused corrupted output when concurrent requests with different beam widths (e.g. beam=1 and beam=5) were submitted to the same executor on encoder-decoder models (Whisper) with beam search.

Bug 1 — mInflightReqIds leak on mixed-beam exception

When concurrent requests with different beam widths triggered the verifyRequests() exception in TrtGptModelInflightBatching::forwardAsync, the exception handler freed the KV-cache but did not remove request IDs from mInflightReqIds. The subsequent changeBeamWidth() call asserts this set is empty and aborted.

Fix: erase affected IDs from mInflightReqIds and call changeBeamWidth(mOperatingBeamWidth) in the exception catch block.

Bug 2 — gather_generation_logits corruption via shared fragmentPointerDevice

mergeLogitsFragmentsKernel reads fragment GPU addresses from fragmentPointerDevice (shape [kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments sequentially, each overwrites that shared device buffer before the previous kernel has committed — a race causing intermittent degenerate output with gather_generation_logits=True.

Fix: resize fragmentPointerDevice from [kCACHE_LENGTH] to [maxBatchSize, kCACHE_LENGTH], matching the layout of fragmentPointerHost. Add getFragmentPointerSlot() which returns both host and device rows atomically and advances workIdx once, making call-order errors structurally impossible. Each request in the batch now has its own device-side pointer array.

Test Coverage

C++ unit test (cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp):

CopyGenerationLogitsTest/KernelMergePathProducesCorrectHostLayoutAndSlotsAreIsolated — verifies Bug 2 fix: each batch slot gets its own device-side pointer row, and slots are fully isolated from one another.

Validation on Whisper large-v3 with concurrent mixed-beam execution:

Configuration	Before fixes	After fixes
beam=5 only, no mixing	—	PASS 10/10
mix + no `gather_generation_logits`	crash/corrupt	PASS 20/20
mix + `gather_generation_logits=True`	crash/corrupt	PASS 50/50

PR Checklist

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment `/bot help`.

coderabbitai · 2026-06-17T18:06:43Z

📝 Walkthrough

Walkthrough

Three correctness fixes for encoder-decoder beam search: GenerationLogitsCache expands its device tensor to a 2D [maxBatchSize, kCACHE_LENGTH] layout and replaces the host-only accessor with getFragmentPointerSlot() returning paired host/device slices; copyBlockOffsets now forces all beams to share beam-0 block IDs in cross-KV mode; forwardAsync cleans up stale inflight tracking and resets runtime buffers on exception. New unit tests cover both the block-offset sharing and logits placement behaviors.

Changes

Encoder-Decoder Beam Search Correctness

Layer / File(s)	Summary
`GenerationLogitsCache`: 2D layout and unified slot accessor `cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h`, `cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp`, `cpp/tensorrt_llm/batch_manager/utils/inflightBatchingUtils.cpp`	`fragmentPointerDevice` allocation changed from 1D `[kCACHE_LENGTH]` to 2D `[maxBatchSize, kCACHE_LENGTH]`; `getFragmentPointerHost()` replaced by `[[nodiscard]] getFragmentPointerSlot()` returning `(host, device)` tensor pair for the current `workIdx`; `copyGenerationLogits` updated to unpack the pair via structured binding. Comments updated with the concurrency/clobbering guarantee.
`copyBlockOffsets`: cross-KV beam-0 sharing fix `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`	When `isCrossKv()` is true, all beams now use `cacheBlockIds[0]` with a non-empty assertion; the memcpy chunk size is derived from beam-0's effective block count. Non-cross-KV paths are unchanged.
`forwardAsync` abort path cleanup `cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp`	On the `std::exception` catch-all path, each request ID is erased from `mInflightReqIds` before `terminateRequest` is called; on the last pipeline-parallel rank, `changeBeamWidth(mOperatingBeamWidth)` resets runtime buffers and decoder state from the aborted batch.
Regression tests `cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp`, `cpp/tests/unit_tests/batch_manager/CMakeLists.txt`	Two gtests added: `CopyBlockOffsetsAllBeamsShareBeam0Blocks` validates cross-KV beam-0 block sharing; `DirectCopyPlacesEachBeamStepAtCorrectHostOffset` validates per-(step, beam) logit placement and fragment clearing after flush. Test target registered in CMake.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

Funatiq
taylor-yb-lee
stnie
VALLIS-NERIA

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main fix: encoder-decoder beam search corruption via per-slot fragmentPointerDevice.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description comprehensively explains the three bugs, their root causes, fixes, and includes detailed test coverage with validation results.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp`:
- Around line 194-198: The current test at the copyGenerationLogits call site
uses a dummy cache and the direct-copy implementation path, which does not
exercise the fragmentPointerDevice slot isolation logic needed to detect race
conditions. Modify or add a test case that exercises the cached-fragment merge
path (using real cache data rather than a dummy cache) with at least two
concurrent requests/slots, and validate that fragmentPointerDevice pointers are
not clobbered across different slot allocations to catch regressions of the
maxBatchSize x kCACHE_LENGTH change.
- Around line 63-70: The test CopyBlockOffsetsAllBeamsShareBeam0Blocks is too
lenient because it relies on the allocator already sharing blocks, making
equality across beams an unreliable check that could pass even with the bug
present. Modify the test to explicitly set per-beam source rows to different
values before calling copyBlockOffsets, then assert that the output rows are all
normalized to beam 0. This forces a discriminating scenario where the old buggy
behavior would fail (keeping per-beam differences) while the fixed behavior
would pass (normalizing all to beam 0). Apply the same principle to the related
test mentioned around lines 115-140.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d6e2d2c4-bf3b-49ae-964b-984a43ebb5dc

📥 Commits

Reviewing files that changed from the base of the PR and between 42a3e55 and 687af83.

📒 Files selected for processing (7)

cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/utils/inflightBatchingUtils.cpp
cpp/tests/unit_tests/batch_manager/CMakeLists.txt
cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp

achartier · 2026-06-17T18:53:32Z

/bot run

tensorrt-cicd · 2026-06-17T19:01:45Z

PR_Github #54871 [ run ] triggered by Bot. Commit: 92ff8bd Link to invocation

tensorrt-cicd · 2026-06-17T19:48:41Z

PR_Github #54871 [ run ] completed with state FAILURE. Commit: 92ff8bd
/LLM/main/L0_MergeRequest_PR pipeline #43876 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

achartier · 2026-06-17T23:18:33Z

/bot run

tensorrt-cicd · 2026-06-17T23:24:21Z

PR_Github #54877 [ run ] triggered by Bot. Commit: 72c55d5 Link to invocation

tensorrt-cicd · 2026-06-18T00:21:59Z

PR_Github #54877 [ run ] completed with state FAILURE. Commit: 72c55d5
/LLM/main/L0_MergeRequest_PR pipeline #43882 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

achartier · 2026-06-18T01:47:49Z

/bot run

tensorrt-cicd · 2026-06-18T01:53:20Z

PR_Github #54895 [ run ] triggered by Bot. Commit: 72c55d5 Link to invocation

tensorrt-cicd · 2026-06-18T04:05:29Z

PR_Github #54895 [ run ] completed with state FAILURE. Commit: 72c55d5
/LLM/main/L0_MergeRequest_PR pipeline #43899 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

achartier · 2026-06-18T15:35:34Z

/bot run

tensorrt-cicd · 2026-06-18T15:41:31Z

PR_Github #54936 [ run ] triggered by Bot. Commit: b8b6670 Link to invocation

tensorrt-cicd · 2026-06-18T19:47:25Z

PR_Github #54936 [ run ] completed with state SUCCESS. Commit: b8b6670
/LLM/main/L0_MergeRequest_PR pipeline #43936 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

achartier · 2026-06-18T21:42:13Z

/bot run

tensorrt-cicd · 2026-06-18T21:47:44Z

PR_Github #54948 [ run ] triggered by Bot. Commit: b8b6670 Link to invocation

tensorrt-cicd · 2026-06-19T01:05:38Z

PR_Github #54948 [ run ] completed with state SUCCESS. Commit: b8b6670
/LLM/main/L0_MergeRequest_PR pipeline #43948 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

Code review on PR NVIDIA#15461 identified that the cross-KV copyBlockOffsets beam-0 sharing invariant is already maintained upstream before copyBlockOffsets is called, making the guard redundant. Remove the fix from KVCacheManager::copyBlockOffsets and the corresponding CrossKvBeamSharingTest unit test. Validated locally: the remaining CopyGenerationLogitsTest (fix 3) passes independently after this removal. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

In KVCacheManager::copyBlockOffsets, when building the cross-attention KV cache offset table for encoder-decoder models (e.g. Whisper), each beam of a request was assigned separate physical blocks. However, the encoder context phase only writes encoder features to beam-0's blocks (contextBeamWidth=1 means only slot 0 per request is addressed). Beams 1..N-1 pointed to uninitialised GPU memory, causing degenerate repetitive output such as "happ happ happ" during beam-search decoding. Fix: when isCrossKv(), always use beam-0's source block IDs and block count for every beam. The encoder output is identical for all beams of a request, so sharing the same physical blocks is semantically correct. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

When concurrent requests with different beam widths (e.g. beam=1 and beam=5) triggered a verifyRequests() exception in TrtGptModelInflightBatching:: forwardAsync(), the exception handler freed the KV-cache sequence slots via terminateRequest() but did not remove the affected request IDs from mInflightReqIds. The subsequent changeBeamWidth() call checks TLLM_CHECK(mInflightReqIds.empty()) and would abort the process. Fix: in the exception catch block, erase each active request from mInflightReqIds before calling terminateRequest(), then call changeBeamWidth(mOperatingBeamWidth) to reset RuntimeBuffers and DecoderState so the next batch starts from a clean state. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

Two tests in encDecBeamSearchTest.cpp: CrossKvBeamSharingTest/CopyBlockOffsetsAllBeamsShareBeam0Blocks Verifies that KVCacheManager::copyBlockOffsets for a cross-KV cache (CacheType::kCROSS, beam width > 1) writes the same physical block IDs into every beam slot. In the simple context-only setup the allocator shares blocks across beams, so this acts as a sanity check for the correct invariant. CopyGenerationLogitsTest/DirectCopyPlacesEachBeamStepAtCorrectHostOffset Verifies that copyGenerationLogits correctly places each (beam, step) logit from its GPU fragment tensor into the right slot of the host logits buffer. Making copyGenerationLogits a no-op produces 56 assertion failures. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

…tPointerDevice The mergeLogitsFragmentsKernel reads fragment GPU addresses from fragmentPointerDevice ([kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments sequentially, each overwrites that shared device buffer before the previous kernel has committed its result on the stream — a race that caused intermittent degenerate output ("happ happ happ") with gather_generation_logits=True. Fix: resize fragmentPointerDevice from [kCACHE_LENGTH] to [maxBatchSize, kCACHE_LENGTH], matching the layout of fragmentPointerHost. Add getFragmentPointerDevice() to return the current-workIdx row without advancing it, and call it before getFragmentPointerHost() so both use the same per-slot row. Each request in the batch now has its own device-side pointer array, eliminating cross-request clobbering. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

…s PR Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

…ccuracy - Replace getFragmentPointerDevice()/getFragmentPointerHost() with a single getFragmentPointerSlot() that returns both rows atomically and advances workIdx once, making it impossible to call them in the wrong order (gripe 2). - Add TLLM_CHECK that cacheBlockIds is non-empty on the cross-KV path before accessing cacheBlockIds[0] (gripe 5). - Fix the Fix-3 description in encDecBeamSearchTest.cpp: this branch uses a repaired kernel (per-slot fragmentPointerDevice), not the direct-copy path described by the stale comment (gripe 3). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

Fix-1 test (CrossKvBeamSharingTest): previously the allocator happened to share blocks across beams, making equality across beam slots a tautology that passed both with and without the fix. Now the test explicitly writes distinct per-beam sentinel values into the source cacheBlockIndices tensor before calling copyBlockOffsets. The old code (srcBeamIdx = beamIdx) leaves beams 1..N-1 with their own different values; the fixed code (srcBeamIdx = 0 when isCrossKv()) normalises all slots to beam-0. Confirmed: test fails when the fix is reverted. Fix-3 test (CopyGenerationLogitsTest): the previous test used a null dummy cache and did not exercise the mergeLogitsFragmentsKernel path. Replaced with a test that allocates a real GenerationLogitsCache (transposedLogits, fragmentPointerDevice, fragmentPointerHost), creates fragments as pinned-memory slices of cache.logits with known per-(step, beam) values, and verifies both requests produce the correct host layout via the actual kernel merge path. Running two back-to-back flushes also validates that each uses a separate fragmentPointerDevice slot (slot isolation). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

… handler In pipeline-parallel multi-micro-batch mode, requests from other micro-batches may still be tracked in mInflightReqIds after the error handler erases only the current activeRequests. changeBeamWidth asserts mInflightReqIds.empty(), so calling it unconditionally would throw (caught by the inner try-catch) and skip the intended buffer reset. Add an explicit guard so the reset is skipped when other micro-batches are still in-flight. The next successful forwardAsync iteration will perform the reset via the normal changeBeamWidth call in verifyRequests once the set is clear. This makes the skip explicit rather than relying on TLLM_CHECK as control flow. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

Code review on PR NVIDIA#15461 identified that the cross-KV copyBlockOffsets beam-0 sharing invariant is already maintained upstream before copyBlockOffsets is called, making the guard redundant. Remove the fix from KVCacheManager::copyBlockOffsets and the corresponding CrossKvBeamSharingTest unit test. Validated locally: the remaining CopyGenerationLogitsTest (fix 3) passes independently after this removal. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

…Test The top-of-file block described the bug fix in PR-specific terms ("Before the fix"/"After the fix") that have no long-term value in the source tree. The per-test docstring already captures the invariant being verified. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

achartier · 2026-06-23T15:19:56Z

/bot skip --comment "ci already passed"

tensorrt-cicd · 2026-06-23T15:25:33Z

PR_Github #55257 [ skip ] triggered by Bot. Commit: 713c457 Link to invocation

tensorrt-cicd · 2026-06-23T15:31:24Z

PR_Github #55257 [ skip ] completed with state SUCCESS. Commit: 713c457
Skipping testing for commit 713c457

Link to invocation

achartier requested a review from a team as a code owner June 17, 2026 18:01

github-actions Bot assigned achartier Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp Outdated

Comment thread cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp Outdated

achartier force-pushed the fix-enc-dec-gather-logits-kernel branch from 826b435 to 72c55d5 Compare June 17, 2026 23:18

achartier force-pushed the fix-enc-dec-gather-logits-kernel branch from 72c55d5 to b8b6670 Compare June 18, 2026 15:25

achartier requested review from Funatiq and stnie June 19, 2026 02:22

Funatiq reviewed Jun 22, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp Outdated

Comment thread cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp Outdated

Comment thread cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp Outdated

achartier mentioned this pull request Jun 22, 2026

[None][fix] Fix encoder-decoder beam search corruption (cross-KV sharing, mixed-beam exception, gather_generation_logits) #15444

Closed

1 task

Funatiq reviewed Jun 23, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp Outdated

Funatiq approved these changes Jun 23, 2026

View reviewed changes

achartier added 2 commits June 23, 2026 15:11

achartier added 9 commits June 23, 2026 15:11

[None][chore] Update copyright years to 2026 in files modified by thi…

35a14d6

…s PR Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

Revert variable renaming

713c457

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>

achartier force-pushed the fix-enc-dec-gather-logits-kernel branch from c6367e9 to 713c457 Compare June 23, 2026 15:15

achartier merged commit beb922f into NVIDIA:main Jun 23, 2026
7 checks passed

achartier deleted the fix-enc-dec-gather-logits-kernel branch June 23, 2026 15:33

Conversation

achartier commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

achartier commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 17, 2026

Uh oh!

achartier commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

achartier commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

achartier commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

achartier commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 18, 2026

Uh oh!

tensorrt-cicd commented Jun 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

achartier commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

achartier commented Jun 17, 2026 •

edited

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading