[None][fix] DSv4 MLA overlap: record_stream cross-stream tensors by mingyangHao · Pull Request #15265 · NVIDIA/TensorRT-LLM

mingyangHao · 2026-06-11T12:47:14Z

The multi-stream attention prologue (TRTLLM_MLA_EXTRA_OVERLAP) hands tensors across CUDA streams without record_stream():

- precompute_aux allocates weights/k_fp8/k_scale on aux_stream, consumed on the indexer stream;
- q_b_proj output q (compressor_stream) and topk_indices (indexer_stream) are consumed on the caller stream.

The cuda.Event waits order kernel execution across streams, but they do not tell the caching allocator that a block is still in use on another stream. Under large-batch (>1024) workspace pressure, a handed-off block can therefore be recycled for a new allocation while the consumer stream is still reading it, which is a use-after-free that surfaces as CUDA_ERROR_ILLEGAL_ADDRESS. It reproduces deterministically on B300 during the generation CUDA-graph warmup at decode batch 1088 (e.g. tp8/ep8 dp-attn conc-2048), and the error surfaces asynchronously in q_b_proj / q_norm. Both CUDA_LAUNCH_BLOCKING=1 and TRTLLM_MLA_EXTRA_OVERLAP=0 mask it, confirming a stream-ordering / allocator race rather than an out-of-bounds (OOB) access.

Add record_stream(current_stream) on the handed-off tensors so the allocator cannot recycle them mid-use. Keeps the overlap (no perf change); verified the warmup + serve now complete cleanly with overlap enabled.
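For readers unfamiliar with the pattern, here is a minimal sketch of a cross-stream handoff in PyTorch. It is not the diff from this PR: the helper names `hand_off` and `demo`, the tensor shape, and the stream layout are hypothetical and chosen only for illustration. It shows the two separate steps the description refers to: an event wait for execution ordering, and `Tensor.record_stream()` for allocator lifetime.

```python
try:
    import torch  # optional at import time so the helper can be exercised without a GPU
except ImportError:
    torch = None


def hand_off(tensors, consumer_stream):
    """Mark each tensor as in use on consumer_stream (hypothetical helper).

    record_stream() tells the caching allocator that the tensor's block is
    used on consumer_stream, so the block is not reused until work queued
    on that stream at the time of deallocation has finished.
    """
    for t in tensors:
        if t is not None:
            t.record_stream(consumer_stream)
    return tensors


def demo():
    """Produce a tensor on a side stream and consume it on the caller stream."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without CUDA

    caller = torch.cuda.current_stream()
    side = torch.cuda.Stream()
    done = torch.cuda.Event()

    with torch.cuda.stream(side):
        # Allocated while `side` is current, so the block belongs to `side`.
        q = torch.ones(1024, 1024, device="cuda") * 2
        done.record(side)

    caller.wait_event(done)   # orders execution only
    hand_off([q], caller)     # extends allocator lifetime to the consumer stream
    total = q.sum()           # consumer work runs on `caller`
    del q                     # safe: the allocator knows `caller` still uses the block
    return total.item()
```

Without the `hand_off` call, the `del q` could return the block to the side stream's pool while the `sum` kernel on the caller stream is still pending, which is the same class of race described above.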

…ix IMA

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>

mingyangHao · 2026-06-11T12:48:21Z

/bot run

tensorrt-cicd · 2026-06-11T12:54:32Z

PR_Github #53570 [ run ] triggered by Bot. Commit: e005597 Link to invocation

tensorrt-cicd · 2026-06-11T16:19:33Z

PR_Github #53570 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42719 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Revert the dsv4-fp4-b300-trt 8k1k conc-end trim (back to 1024) and instead cap cuda_graph_config.max_batch_size at 1024 on both b300-trt and b300-trt-mtp. TRTLLM_MLA_EXTRA_OVERLAP hands MLA prologue tensors across CUDA streams without record_stream(), so CUDA-graph warmup at decode batch >1024 (repros at 1088, e.g. tp8/ep8 dp-attn conc-2048 on B300) use-after-frees into CUDA_ERROR_ILLEGAL_ADDRESS. Capping graph capture at 1024 avoids warming up the >1024 graph; runtime --max_batch_size stays = CONC, so batches >1024 run eager. Workaround until NVIDIA/TensorRT-LLM#15265 ships in the image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the conc=2048 point on the 1k1k tp8/ep8 DP-attn row for both dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp (now 512-1024). This is the batch regime that triggers the MLA-overlap warmup crash (NVIDIA/TensorRT-LLM#15265); the cudagraph cap at 1024 stays as a safety net. 8k1k unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
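The workaround in the two referenced commits amounts to capping `cuda_graph_config.max_batch_size` at 1024 while leaving the runtime batch size alone. A minimal sketch of that cap, assuming the options are held as a plain nested dict before being written to a config file; the helper name `cap_cuda_graph_batch` and the dict layout are assumptions for illustration, not code from either repository:

```python
import copy


def cap_cuda_graph_batch(llm_api_options, cap=1024):
    """Return a copy of the options with cuda_graph_config.max_batch_size capped.

    Hypothetical helper: only the key name cuda_graph_config.max_batch_size
    comes from the commit message; the surrounding dict layout is assumed.
    """
    opts = copy.deepcopy(llm_api_options)  # leave the caller's dict untouched
    graph_cfg = opts.setdefault("cuda_graph_config", {})
    current = graph_cfg.get("max_batch_size")
    # Never raise an existing, smaller limit; only lower it to the cap.
    graph_cfg["max_batch_size"] = cap if current is None else min(current, cap)
    return opts


if __name__ == "__main__":
    base = {"cuda_graph_config": {"max_batch_size": 2048}}
    print(cap_cuda_graph_batch(base))
```

With the cap in place, no CUDA graph is captured for decode batches above 1024, so the warmup never reaches the batch size that triggers the crash; larger batches run eagerly, as the commit message notes.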

longlee0622 · 2026-06-11T23:32:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-11T23:38:27Z

PR_Github #53712 [ run ] triggered by Bot. Commit: e005597 Link to invocation

tensorrt-cicd · 2026-06-12T01:29:07Z

PR_Github #53712 [ run ] completed with state FAILURE. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42841 completed with status: 'FAILURE'

mingyangHao · 2026-06-12T06:33:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-12T06:39:19Z

PR_Github #53821 [ run ] triggered by Bot. Commit: e005597 Link to invocation

tensorrt-cicd · 2026-06-12T08:36:30Z

PR_Github #53821 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42937 completed with status: 'FAILURE'

mingyangHao · 2026-06-13T12:49:08Z

/bot run --disable-fail-fast

github-actions · 2026-06-13T12:49:15Z

⚠️ Bot command ignored: The /bot command must appear at the very beginning of the comment (no leading blank lines or spaces). Please post a new comment with /bot as the first character.

mingyangHao · 2026-06-13T13:19:27Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-13T13:25:20Z

PR_Github #54049 [ run ] triggered by Bot. Commit: e005597 Link to invocation

tensorrt-cicd · 2026-06-13T15:06:23Z

PR_Github #54049 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #43133 completed with status: 'FAILURE'

mingyangHao · 2026-06-14T06:18:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-14T06:26:02Z

PR_Github #54092 [ run ] triggered by Bot. Commit: e005597 Link to invocation

tensorrt-cicd · 2026-06-14T07:59:19Z

PR_Github #54092 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #43176 completed with status: 'FAILURE'

lfr-0531 · 2026-06-17T04:42:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T04:48:02Z

PR_Github #54753 [ run ] triggered by Bot. Commit: aaa1ea8 Link to invocation

tensorrt-cicd · 2026-06-17T09:39:18Z

PR_Github #54753 [ run ] completed with state SUCCESS. Commit: aaa1ea8
/LLM/main/L0_MergeRequest_PR pipeline #43778 completed with status: 'FAILURE'

lfr-0531 · 2026-06-17T11:24:06Z

/bot run

tensorrt-cicd · 2026-06-17T11:30:26Z

PR_Github #54843 [ run ] triggered by Bot. Commit: aaa1ea8 Link to invocation

tensorrt-cicd · 2026-06-17T17:23:24Z

PR_Github #54843 [ run ] completed with state SUCCESS. Commit: aaa1ea8
/LLM/main/L0_MergeRequest_PR pipeline #43855 completed with status: 'SUCCESS'

mingyangHao requested a review from a team as a code owner June 11, 2026 12:47

mingyangHao requested review from pengbowang-nv and removed request for a team June 11, 2026 12:47

github-actions Bot assigned mingyangHao Jun 11, 2026

mingyangHao added the deepseek-v4 label Jun 11, 2026

Merge branch 'feat/deepseek_v4' into user/mingyangh/dsv4-mla-overlap-…

3037d60

…record-stream

pengbowang-nv approved these changes Jun 15, 2026

Merge branch 'feat/deepseek_v4' into user/mingyangh/dsv4-mla-overlap-…

aaa1ea8

…record-stream

longlee0622 merged commit a05ccbe into NVIDIA:feat/deepseek_v4 Jun 18, 2026
6 checks passed
