[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP by yuxianq · Pull Request #14993 · NVIDIA/TensorRT-LLM

yuxianq · 2026-06-05T06:17:28Z

Summary

Unwaive the DeepSeekV3Lite 4-GPU pipeline-parallel accuracy tests tracked by NVBug 6224637.
Automatically enable CuTe DSL BF16 BMM and GEMM for SM100/SM103 pipeline-parallel LLM API runs.
Thread use_cute_dsl_bf16_gemm into attention and MLP linear projections so the affected PP4 paths consistently use the intended CuTe DSL BF16 GEMM implementation.

Root Cause

Neither changing NCCL_NVLS_ENABLE nor changing the remote task environment reliably fixed the hanging GB200 cases. The reproducible hang was tied to BF16 linear path selection in SM100 pipeline-parallel DeepSeekV3Lite runs: the existing CuTe DSL BF16 knobs did not cover every GEMM/BMM path used by the affected PP4 tests.

Solution

TorchLlmArgs.validate_cute_dsl_bf16 now enables both CuTe DSL BF16 BMM and GEMM automatically when the run uses pipeline parallelism on SM100/SM103. This keeps the public API stable and avoids requiring test-specific environment overrides.
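The auto-enable rule described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: `SketchArgs`, the `pipeline_parallel_size` and `use_cute_dsl_bf16_bmm` field names, and passing the SM version in as an integer are all assumptions made for the example. Only `use_cute_dsl_bf16_gemm` and the validator name appear in the description.

```python
from dataclasses import dataclass

# SM versions named in the PR description; representing them as ints is an assumption.
_AUTO_ENABLE_SM_VERSIONS = frozenset({100, 103})


@dataclass
class SketchArgs:
    """Hypothetical stand-in for TorchLlmArgs; field names are illustrative."""
    pipeline_parallel_size: int = 1
    use_cute_dsl_bf16_bmm: bool = False   # assumed name for the BMM knob
    use_cute_dsl_bf16_gemm: bool = False  # name taken from the PR description


def validate_cute_dsl_bf16(args: SketchArgs, sm_version: int) -> SketchArgs:
    """Turn on both CuTe DSL BF16 knobs for pipeline-parallel runs on SM100/SM103.

    The rule only ever switches the knobs on, so a value the user already
    enabled is left untouched.
    """
    uses_pipeline_parallelism = args.pipeline_parallel_size > 1
    if uses_pipeline_parallelism and sm_version in _AUTO_ENABLE_SM_VERSIONS:
        args.use_cute_dsl_bf16_bmm = True
        args.use_cute_dsl_bf16_gemm = True
    return args
```

Because the decision lives in argument validation, callers do not set anything new: a PP4 run on SM100 would get both knobs enabled, while a non-PP run or a different SM version keeps whatever was configured.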

The attention and gated-MLP modules now pass use_cute_dsl_bf16_gemm into their Linear projections, including attention/MLA output projection and MLP gate-up/down projection paths.
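A minimal sketch of what "passing the flag into the Linear projections" might look like, using plain Python classes instead of the real modules. The class names and the `o_proj`, `gate_up_proj`, and `down_proj` attribute names are hypothetical; the point is only that every projection receives the same `use_cute_dsl_bf16_gemm` value as its parent module.

```python
class SketchLinear:
    """Hypothetical stand-in for a Linear layer that records its GEMM choice."""

    def __init__(self, in_features: int, out_features: int,
                 use_cute_dsl_bf16_gemm: bool = False):
        self.in_features = in_features
        self.out_features = out_features
        self.use_cute_dsl_bf16_gemm = use_cute_dsl_bf16_gemm


class SketchAttention:
    """Attention block whose output projection inherits the module-level flag."""

    def __init__(self, hidden_size: int, use_cute_dsl_bf16_gemm: bool = False):
        self.o_proj = SketchLinear(
            hidden_size, hidden_size,
            use_cute_dsl_bf16_gemm=use_cute_dsl_bf16_gemm)


class SketchGatedMLP:
    """Gated MLP whose gate-up and down projections inherit the same flag."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 use_cute_dsl_bf16_gemm: bool = False):
        # Gate and up projections are fused here, hence the factor of 2.
        self.gate_up_proj = SketchLinear(
            hidden_size, 2 * intermediate_size,
            use_cute_dsl_bf16_gemm=use_cute_dsl_bf16_gemm)
        self.down_proj = SketchLinear(
            intermediate_size, hidden_size,
            use_cute_dsl_bf16_gemm=use_cute_dsl_bf16_gemm)
```

If any one projection were constructed without the keyword, it would silently fall back to the default path, which is the kind of partial coverage the Root Cause section describes.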

This replaces the earlier NCCL/NVLS workaround. The PR no longer relies on setting NCCL_NVLS_ENABLE=0 or modifying worker environment propagation for this bug.

Validation

git diff --check
Pre-commit hooks passed for commit 75aea27943.
GB200 OCI stress validation for the reproduced DeepSeekV3Lite PP4 hang case: the pre-fix path reproduced the hang during repeated runs; the fixed path passed 100/100 iterations.
The branch still removes the NVBug 6224637 waiver entries from tests/integration/test_lists/waives.txt so CI can run these cases again.
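The description does not say how the 100-iteration stress run was driven. As one possible approach, a repeated-run loop with a per-iteration timeout can separate hangs from ordinary failures. The function name, the result keys, and the choice of timeout below are all assumptions for illustration.

```python
import subprocess
import sys


def run_repeated(cmd: list[str], iterations: int, timeout_s: float) -> dict[str, int]:
    """Run `cmd` repeatedly and count passes, failures, and suspected hangs.

    An iteration that exceeds `timeout_s` is counted as a hang; subprocess.run
    kills the child process before raising TimeoutExpired.
    """
    counts = {"passed": 0, "failed": 0, "hung": 0}
    for _ in range(iterations):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            counts["hung"] += 1
            continue
        if result.returncode == 0:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


if __name__ == "__main__":
    # Placeholder command; a real stress run would invoke the affected test instead.
    print(run_repeated([sys.executable, "-c", "pass"], iterations=3, timeout_s=30.0))
```

With a harness of this shape, "passed 100/100 iterations" corresponds to `passed == 100` with `hung == 0` and `failed == 0`.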

coderabbitai · 2026-06-05T06:18:43Z

📝 Walkthrough

The PR updates test waiver skip entries in tests/integration/test_lists/waives.txt for the TestDeepSeekV3Lite test class. The test_bfloat16_4gpus section removes older pipeline-parallel and torch_compile variant waivers and adds new tensor-parallel and scheduler-specific SKIPs. The test_nvfp4_4gpus section removes multiple CUTLASS backend waivers with mtp_nextn=0 across different configurations, retaining a single mtp_nextn=2 fp8kv variant.

Changes

DeepSeekV3Lite waiver list updates

Layer / File(s)	Summary
Test waiver configuration updates `tests/integration/test_lists/waives.txt`	Updated skip entries for `TestDeepSeekV3Lite::test_bfloat16_4gpus` by removing older `pp4`/`torch_compile`-variant waivers (and associated fp8 block scale entries) and adding new `tp4`-based SKIPs with updated `bfloat16_python_scheduler` and `cute_dsl` 4-gpu waivers. Adjusted `test_nvfp4_4gpus` CUTLASS waivers by removing multiple `pp4` `mtp_nextn=0` entries across different `torch_compile`/`fp8kv` combinations and keeping/adding the `mtp_nextn=2` `fp8kv=True` `torch_compile=False` SKIP entry.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14946: Modifies the same waiver file with updates to TestDeepSeekV3Lite skip entries and NVBug references.
NVIDIA/TensorRT-LLM#14523: Directly updates the same TestDeepSeekV3Lite fp8 block scale and test_nvfp4_4gpus CUTLASS mtp_nextn/fp8kv waiver entries.
NVIDIA/TensorRT-LLM#10835: Modifies the same waives.txt file by changing skipped test cases for the same test class.

Suggested reviewers

StanleySun639
xinhe-nv
LarryXFly

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is well-structured with Summary, Root Cause, Solution, and Validation sections that clearly explain the changes. However, it does not strictly follow the provided template structure (missing Test Coverage section with explicit test list, and the checklist items are not addressed).
Title check	✅ Passed	The title references the BF16 CuTe DSL SM100 PP area affected by the waiver changes, so it is related to the changeset.


yuxianq · 2026-06-05T06:20:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T06:25:38Z

PR_Github #52276 [ run ] triggered by Bot. Commit: 8a18308 Link to invocation

tensorrt-cicd · 2026-06-05T11:19:44Z

PR_Github #52276 [ run ] completed with state SUCCESS. Commit: 8a18308
/LLM/main/L0_MergeRequest_PR pipeline #41587 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-06T05:07:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-06T05:14:36Z

PR_Github #52478 [ run ] triggered by Bot. Commit: ea2788c Link to invocation

tensorrt-cicd · 2026-06-06T14:17:39Z

PR_Github #52478 [ run ] completed with state SUCCESS. Commit: ea2788c
/LLM/main/L0_MergeRequest_PR pipeline #41770 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-08T02:37:15Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-08T02:44:00Z

PR_Github #52640 [ run ] triggered by Bot. Commit: 990110a Link to invocation

tensorrt-cicd · 2026-06-08T07:45:12Z

PR_Github #52640 [ run ] completed with state FAILURE. Commit: 990110a
/LLM/main/L0_MergeRequest_PR pipeline #41919 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-09T04:05:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-09T04:11:42Z

PR_Github #52945 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

tensorrt-cicd · 2026-06-09T13:53:28Z

PR_Github #52945 [ run ] completed with state SUCCESS. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42188 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-10T03:41:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-10T03:48:44Z

PR_Github #53216 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

tensorrt-cicd · 2026-06-10T12:35:23Z

PR_Github #53216 [ run ] completed with state SUCCESS. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42411 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-11T01:46:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-11T01:55:56Z

PR_Github #53424 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

tensorrt-cicd · 2026-06-11T10:39:35Z

PR_Github #53424 [ run ] completed with state FAILURE. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42595 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-11T10:50:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-11T10:56:44Z

PR_Github #53559 [ run ] triggered by Bot. Commit: 056ac16 Link to invocation

tensorrt-cicd · 2026-06-11T19:36:09Z

PR_Github #53559 [ run ] completed with state FAILURE. Commit: 056ac16
/LLM/main/L0_MergeRequest_PR pipeline #42708 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-12T03:02:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-12T03:09:21Z

PR_Github #53765 [ run ] triggered by Bot. Commit: 056ac16 Link to invocation

tensorrt-cicd · 2026-06-12T11:39:05Z

PR_Github #53765 [ run ] completed with state SUCCESS. Commit: 056ac16
/LLM/main/L0_MergeRequest_PR pipeline #42885 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tensorrt-cicd · 2026-06-24T03:55:47Z

PR_Github #55396 [ run ] triggered by Bot. Commit: 75aea27 Link to invocation

tensorrt-cicd · 2026-06-24T07:42:44Z

PR_Github #55396 [ run ] completed with state SUCCESS. Commit: 75aea27
/LLM/main/L0_MergeRequest_PR pipeline #44344 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq · 2026-06-24T08:28:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T08:33:30Z

PR_Github #55460 [ run ] triggered by Bot. Commit: eb7647e Link to invocation

yuxianq · 2026-06-24T08:44:40Z

/bot run --disable-fail-fast

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

tensorrt-cicd · 2026-06-24T08:51:03Z

PR_Github #55464 [ run ] triggered by Bot. Commit: a1c811f Link to invocation

tensorrt-cicd · 2026-06-24T08:54:24Z

PR_Github #55460 [ run ] completed with state ABORTED. Commit: eb7647e

Link to invocation

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq · 2026-06-24T10:09:39Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T10:15:59Z

PR_Github #55482 [ run ] triggered by Bot. Commit: ce91671 Link to invocation

tensorrt-cicd · 2026-06-24T14:10:29Z

PR_Github #55482 [ run ] completed with state SUCCESS. Commit: ce91671
/LLM/main/L0_MergeRequest_PR pipeline #44407 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-06-25T03:02:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T03:09:31Z

PR_Github #55677 [ run ] triggered by Bot. Commit: ce91671 Link to invocation

tensorrt-cicd · 2026-06-25T05:18:30Z

PR_Github #55677 [ run ] completed with state SUCCESS. Commit: ce91671
/LLM/main/L0_MergeRequest_PR pipeline #44583 completed with status: 'SUCCESS'

CI Report

Link to invocation

peaceh-nv · 2026-06-25T05:41:54Z

LGTM

QiJune

LGTM

github-actions Bot assigned yuxianq Jun 5, 2026

yuxianq requested a review from xinhe-nv June 5, 2026 06:19

yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 8a18308 to ea2788c Compare June 6, 2026 05:07

yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 990110a to 4a19ebc Compare June 9, 2026 04:03

yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 4a19ebc to 056ac16 Compare June 11, 2026 10:49

yuxianq added 6 commits June 24, 2026 08:27

[https://nvbugs/6224637][test] unwaive associated tests

d09c338

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

Handle NCCL NVLS init hangs in unwaived tests

302e3db

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

[https://nvbugs/6224637][test] restore failing GB200 waive

e46ca07

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

[https://nvbugs/6224637][test] keep waive update deletion-only

0a24f02

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

[https://nvbugs/6224637][test] remove GB200 hang workarounds

c9ec5ec

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

[NVBUG-6224637][fix] Enable CuTe DSL BF16 kernels on SM100 PP

eb7647e

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 75aea27 to eb7647e Compare June 24, 2026 08:27

[NVBUG-6224637][test] Remove rebase-added waiver

a1c811f

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

[NVBUG-6224637][fix] Guard CuTe BF16 GEMM config for VisualGen

ce91671

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

jieli-matrix approved these changes Jun 25, 2026

yuxianq changed the title from "[NVBUG-6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP" to "[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP" Jun 25, 2026

yuxianq requested a review from peaceh-nv June 25, 2026 05:32

peaceh-nv approved these changes Jun 25, 2026

QiJune approved these changes Jun 25, 2026

yuxianq merged commit 70a7528 into NVIDIA:main Jun 25, 2026
10 checks passed

coderabbitai Bot mentioned this pull request Jun 26, 2026

[https://nvbugs/6248783][test] Unwaive Qwen3 skip softmax test #15652

Merged
