[None][fix] DSv4 MLA overlap: record_stream cross-stream tensors #15265
…ix IMA
The multi-stream attention prologue (TRTLLM_MLA_EXTRA_OVERLAP) hands tensors
across CUDA streams without record_stream():
- precompute_aux allocates weights/k_fp8/k_scale on aux_stream, consumed on
the indexer stream;
- q_b_proj output q (compressor_stream) and topk_indices (indexer_stream) are
consumed on the caller stream.
The cuda.Event waits order kernel execution across streams, but they do not
tell the caching allocator that a block is still in use on another stream.
Under large-batch (>1024) workspace pressure a handed-off block can therefore
be recycled for a new allocation while the consumer stream is still reading it
-> use-after-free -> CUDA_ERROR_ILLEGAL_ADDRESS. It reproduces deterministically
on B300 during the generation CUDA-graph warmup at decode batch 1088 (e.g.
tp8/ep8 dp-attn conc-2048), surfacing asynchronously in q_b_proj / q_norm. Both
CUDA_LAUNCH_BLOCKING=1 and TRTLLM_MLA_EXTRA_OVERLAP=0 mask it, confirming a
stream-ordering / allocator race rather than an out-of-bounds (OOB) access.
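
The race described above can be illustrated with a toy, pure-Python model of a stream-aware caching allocator. This is only a sketch under loud assumptions: `Stream`, `ToyCachingAllocator` and `run_handoff` are hypothetical names invented for illustration, streams are modeled as FIFO queues of callables, and the real allocator tracks cross-stream use with CUDA events and size-bucketed pools rather than the counters used here. The event wait is modeled by draining the producer queue before the consumer queue, so the consumer's read always runs after the producing write.

```python
from collections import deque


class Stream:
    """Toy stand-in for a CUDA stream: a FIFO of pending operations."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.enqueued = 0    # operations submitted by the host so far
        self.completed = 0   # operations that have actually run

    def enqueue(self, op):
        self.queue.append(op)
        self.enqueued += 1

    def drain(self):
        while self.queue:
            self.queue.popleft()()
            self.completed += 1


class ToyCachingAllocator:
    """Hypothetical model: freed blocks return to the pool of their owner stream."""

    def __init__(self):
        self.memory = []        # block id -> contents
        self.free_blocks = {}   # owner stream -> reusable block ids
        self.recorded = {}      # block id -> other streams that use the block
        self.deferred = []      # (block, owner, [(stream, ticket), ...])

    def alloc(self, stream):
        self._reclaim()
        pool = self.free_blocks.setdefault(stream, [])
        if pool:
            return pool.pop()
        self.memory.append(None)
        return len(self.memory) - 1

    def record_stream(self, block, stream):
        self.recorded.setdefault(block, set()).add(stream)

    def free(self, block, owner):
        # Reuse is deferred until every recorded stream has finished the
        # work that was queued on it at the time of the free.
        tickets = [(s, s.enqueued) for s in self.recorded.pop(block, set())]
        self.deferred.append((block, owner, tickets))
        self._reclaim()

    def _reclaim(self):
        pending = []
        for block, owner, tickets in self.deferred:
            if all(s.completed >= ticket for s, ticket in tickets):
                self.free_blocks.setdefault(owner, []).append(block)
            else:
                pending.append((block, owner, tickets))
        self.deferred = pending


def run_handoff(use_record_stream):
    """Return (value seen by the consumer, whether the block was recycled)."""
    allocator = ToyCachingAllocator()
    aux, indexer = Stream("aux"), Stream("indexer")
    seen = []

    def write(block, value):
        allocator.memory[block] = value

    blk = allocator.alloc(aux)                       # produced on aux
    aux.enqueue(lambda: write(blk, "k_fp8"))
    indexer.enqueue(lambda: seen.append(allocator.memory[blk]))  # consumed on indexer
    if use_record_stream:
        allocator.record_stream(blk, indexer)
    allocator.free(blk, aux)                         # host drops its reference

    scratch = allocator.alloc(aux)                   # next allocation on aux
    aux.enqueue(lambda: write(scratch, "scratch"))

    aux.drain()       # aux runs ahead; nothing orders its second write
    indexer.drain()   # the read still runs after the first write
    return seen[0], scratch == blk
```

In this model `run_handoff(False)` recycles the block and the consumer reads the overwritten contents, while `run_handoff(True)` forces the second allocation onto a fresh block and the consumer reads the value it was handed.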
Add record_stream(current_stream) on the handed-off tensors so the allocator
cannot recycle them mid-use. Keeps the overlap (no perf change); verified the
warmup + serve now complete cleanly with overlap enabled.
Signed-off-by: Mingyang Hao <mingyangh@nvidia.com>
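
The fix pattern can be sketched in plain PyTorch as follows. This is a minimal illustration, not the TensorRT-LLM code: `project_on_side_stream` and its arguments are hypothetical, and a single matmul stands in for the MLA prologue work. It assumes a CUDA-capable build of PyTorch and only uses `torch.cuda.Stream`, `torch.cuda.Event` and `Tensor.record_stream`.

```python
import torch


def project_on_side_stream(x, weight, side_stream):
    """Hypothetical helper: compute x @ weight on side_stream, consume it on the caller's stream."""
    caller_stream = torch.cuda.current_stream()
    done = torch.cuda.Event()

    # The side stream must not start before the caller's stream has produced x.
    side_stream.wait_stream(caller_stream)
    with torch.cuda.stream(side_stream):
        out = x @ weight            # block comes from side_stream's allocator pool
        done.record(side_stream)

    # Execution ordering: later work on the caller's stream runs after the matmul.
    caller_stream.wait_event(done)
    # Allocator ordering: mark `out` as in use on the caller's stream, so its
    # block is not handed out again until that stream's queued work completes.
    out.record_stream(caller_stream)
    return out


if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.randn(4, 8, device="cuda")
    w = torch.randn(8, 3, device="cuda")
    y = project_on_side_stream(x, w, torch.cuda.Stream())
    torch.cuda.synchronize()
    print(y.shape)
```

The asymmetry is worth noting: the inputs were allocated on the caller's stream, which waits on the event before issuing any further work, so reuse of their blocks is already ordered. The output was allocated on the side stream, which never waits for the caller's reads, and that is the direction in which `record_stream()` is needed.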
/bot run
PR_Github #53570 [ run ] triggered by Bot. Commit:
PR_Github #53570 [ run ] completed with state
Revert the dsv4-fp4-b300-trt 8k1k conc-end trim (back to 1024) and instead cap cuda_graph_config.max_batch_size at 1024 on both b300-trt and b300-trt-mtp. TRTLLM_MLA_EXTRA_OVERLAP hands MLA prologue tensors across CUDA streams without record_stream(), so CUDA-graph warmup at decode batch >1024 (reproduces at 1088, e.g. tp8/ep8 dp-attn conc-2048 on B300) hits a use-after-free and fails with CUDA_ERROR_ILLEGAL_ADDRESS. Capping graph capture at 1024 avoids warming up the >1024 graph; the runtime --max_batch_size stays equal to CONC, so batches >1024 run eagerly. This is a workaround until NVIDIA/TensorRT-LLM#15265 ships in the image.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the conc=2048 point on the 1k1k tp8/ep8 DP-attn row for both dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp (now 512-1024). This is the batch regime that triggers the MLA-overlap warmup crash (NVIDIA/TensorRT-LLM#15265); the CUDA-graph cap at 1024 stays as a safety net. 8k1k is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/bot run --disable-fail-fast
PR_Github #53712 [ run ] triggered by Bot. Commit:
PR_Github #53712 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #53821 [ run ] triggered by Bot. Commit:
PR_Github #53821 [ run ] completed with state
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #54049 [ run ] triggered by Bot. Commit:
PR_Github #54049 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #54092 [ run ] triggered by Bot. Commit:
PR_Github #54092 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #54753 [ run ] triggered by Bot. Commit:
PR_Github #54753 [ run ] completed with state
/bot run
PR_Github #54843 [ run ] triggered by Bot. Commit:
PR_Github #54843 [ run ] completed with state
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.