[None][fix] DSv4 MLA overlap: record_stream cross-stream tensors #15265

Merged
longlee0622 merged 3 commits into NVIDIA:feat/deepseek_v4 from mingyangHao:user/mingyangh/dsv4-mla-overlap-record-stream on Jun 18, 2026

mingyangHao (Collaborator) commented Jun 11, 2026

The multi-stream attention prologue (TRTLLM_MLA_EXTRA_OVERLAP) hands tensors across CUDA streams without record_stream():

  • precompute_aux allocates weights/k_fp8/k_scale on aux_stream, consumed on the indexer stream;
  • q_b_proj output q (compressor_stream) and topk_indices (indexer_stream) are consumed on the caller stream.

The cuda.Event waits order execution across streams but do not inform the caching allocator, so under large-batch (>1024) workspace pressure a handed-off block can be recycled for a new allocation while the consumer stream is still reading it. The result is a use-after-free that surfaces as CUDA_ERROR_ILLEGAL_ADDRESS. It reproduces deterministically on B300 during the generation CUDA-graph warmup at decode batch 1088 (e.g. tp8/ep8 dp-attn conc-2048), and the error is reported asynchronously in q_b_proj / q_norm. Both CUDA_LAUNCH_BLOCKING=1 and TRTLLM_MLA_EXTRA_OVERLAP=0 mask it, which points to a stream-ordering / allocator race rather than an out-of-bounds (OOB) access.
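
To make the failure mode concrete, here is a toy pure-Python model of a stream-aware caching allocator. It is an illustration only, not PyTorch's allocator and not code from this PR: `ToyCachingAllocator`, `Block`, `handoff_is_safe`, and the stream names used as plain strings are all hypothetical. The model assumes a freed block becomes reusable on its owner stream immediately unless another stream has been registered as still using it.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One cached memory block, owned by the stream that allocated it."""
    owner_stream: str
    # Streams registered through record_stream() that may still be reading the block.
    pending_streams: set = field(default_factory=set)
    free: bool = False


class ToyCachingAllocator:
    """Toy model: a freed block is recycled unless a recorded stream still uses it."""

    def __init__(self):
        self.blocks = []

    def allocate(self, stream):
        # Prefer a cached free block from the same stream with no recorded users.
        for block in self.blocks:
            if block.free and block.owner_stream == stream and not block.pending_streams:
                block.free = False
                return block
        block = Block(owner_stream=stream)
        self.blocks.append(block)
        return block

    def record_stream(self, block, stream):
        # Tell the allocator that `stream` also uses this block.
        block.pending_streams.add(stream)

    def release(self, block):
        # The last reference on the producer side goes away.
        block.free = True


def handoff_is_safe(use_record_stream):
    """Return True if the consumer's block is NOT recycled while still in use."""
    alloc = ToyCachingAllocator()
    q = alloc.allocate(stream="compressor")   # producer output
    if use_record_stream:
        alloc.record_stream(q, "caller")      # consumer registers its use
    alloc.release(q)                          # block returns to the cache
    # A new allocation on the producer stream arrives while "caller" still reads q.
    scratch = alloc.allocate(stream="compressor")
    return scratch is not q


print("without record_stream:", handoff_is_safe(False))
print("with record_stream:   ", handoff_is_safe(True))
```

In this model an event wait would change nothing, because the allocator never consults execution order; only the explicit registration keeps the block out of the reuse pool.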

Add record_stream(current_stream) on the handed-off tensors so the allocator cannot recycle them mid-use. Keeps the overlap (no perf change); verified the warmup + serve now complete cleanly with overlap enabled.
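
The general shape of such a hand-off can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the code changed in this PR: `produce_on_side_stream` and the `x * 2.0` stand-in computation are hypothetical, and only public `torch.cuda` stream and event calls plus `Tensor.record_stream` are used. The event wait provides execution ordering, while `record_stream` provides allocator ordering.

```python
import torch


def produce_on_side_stream(x, side_stream):
    """Compute on side_stream and hand the result to the caller's current stream."""
    caller_stream = torch.cuda.current_stream()
    done = torch.cuda.Event()

    # The side stream must not start before x is ready on the caller stream.
    side_stream.wait_stream(caller_stream)
    with torch.cuda.stream(side_stream):
        out = x * 2.0               # memory for `out` is allocated on side_stream
        done.record(side_stream)
    # x was allocated on the caller stream but is read on side_stream.
    x.record_stream(side_stream)

    # Execution ordering: the caller stream waits for the producer kernel.
    caller_stream.wait_event(done)
    # Allocator ordering: mark `out` as in use on the consumer stream so its
    # block is not recycled while consumer work is still pending.
    out.record_stream(caller_stream)
    return out


if torch.cuda.is_available():
    x = torch.ones(4, device="cuda")
    y = produce_on_side_stream(x, torch.cuda.Stream())
    torch.cuda.synchronize()
    print(y.tolist())
```

The demo at the bottom only runs when a CUDA device is present. Dropping either `record_stream` call would usually still appear to work in a small test like this, which is consistent with the description above that the race only showed up under large-batch memory pressure.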

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…ix IMA

Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
mingyangHao requested a review from a team as a code owner June 11, 2026 12:47
mingyangHao requested review from pengbowang-nv and removed request for a team June 11, 2026 12:47

@mingyangHao

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53570 [ run ] triggered by Bot. Commit: e005597 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53570 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42719 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 11, 2026
Revert the dsv4-fp4-b300-trt 8k1k conc-end trim (back to 1024) and instead
cap cuda_graph_config.max_batch_size at 1024 on both b300-trt and
b300-trt-mtp.

TRTLLM_MLA_EXTRA_OVERLAP hands MLA prologue tensors across CUDA streams
without record_stream(), so CUDA-graph warmup at decode batch >1024
(repros at 1088, e.g. tp8/ep8 dp-attn conc-2048 on B300) use-after-frees
into CUDA_ERROR_ILLEGAL_ADDRESS. Capping graph capture at 1024 avoids
warming up the >1024 graph; runtime --max_batch_size stays = CONC, so
batches >1024 run eager. Workaround until NVIDIA/TensorRT-LLM#15265 ships
in the image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Jun 11, 2026
Remove the conc=2048 point on the 1k1k tp8/ep8 DP-attn row for both
dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp (now 512-1024). This is the
batch regime that triggers the MLA-overlap warmup crash (NVIDIA/TensorRT-LLM#15265);
the cudagraph cap at 1024 stays as a safety net. 8k1k unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@longlee0622

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53712 [ run ] triggered by Bot. Commit: e005597 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53712 [ run ] completed with state FAILURE. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42841 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@mingyangHao

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53821 [ run ] triggered by Bot. Commit: e005597 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53821 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #42937 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@mingyangHao

Collaborator Author

/bot run --disable-fail-fast

@github-actions

⚠️ Bot command ignored: The /bot command must appear at the very beginning of the comment (no leading blank lines or spaces). Please post a new comment with /bot as the first character.

@mingyangHao

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54049 [ run ] triggered by Bot. Commit: e005597 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54049 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #43133 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@mingyangHao

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54092 [ run ] triggered by Bot. Commit: e005597 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54092 [ run ] completed with state SUCCESS. Commit: e005597
/LLM/main/L0_MergeRequest_PR pipeline #43176 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54753 [ run ] triggered by Bot. Commit: aaa1ea8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54753 [ run ] completed with state SUCCESS. Commit: aaa1ea8
/LLM/main/L0_MergeRequest_PR pipeline #43778 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lfr-0531

Collaborator

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54843 [ run ] triggered by Bot. Commit: aaa1ea8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54843 [ run ] completed with state SUCCESS. Commit: aaa1ea8
/LLM/main/L0_MergeRequest_PR pipeline #43855 completed with status: 'SUCCESS'

CI Report

Link to invocation

@longlee0622 longlee0622 merged commit a05ccbe into NVIDIA:feat/deepseek_v4 Jun 18, 2026
6 checks passed

5 participants