[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP #14993

Merged
yuxianq merged 8 commits into NVIDIA:main from yuxianq:test/unwaive-nvbug-6224637
Jun 25, 2026

Conversation

@yuxianq

@yuxianq yuxianq commented Jun 5, 2026

Collaborator

Summary

  • Unwaive the DeepSeekV3Lite 4-GPU pipeline-parallel accuracy tests tracked by NVBug 6224637.
  • Automatically enable CuTe DSL BF16 BMM and GEMM for SM100/SM103 pipeline-parallel LLM API runs.
  • Thread use_cute_dsl_bf16_gemm into attention and MLP linear projections so the affected PP4 paths consistently use the intended CuTe DSL BF16 GEMM implementation.

Root Cause

Neither changing NCCL_NVLS_ENABLE nor changing the remote task environment reliably fixed the hanging GB200 cases. The reproducible hang was tied to BF16 linear path selection for DeepSeekV3Lite under SM100 pipeline parallelism: the existing CuTe DSL BF16 knobs did not cover every GEMM/BMM path used by the affected PP4 tests.

Solution

TorchLlmArgs.validate_cute_dsl_bf16 now enables both CuTe DSL BF16 BMM and GEMM automatically when the run uses pipeline parallelism on SM100/SM103. This keeps the public API stable and avoids requiring test-specific environment overrides.
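A minimal sketch of the auto-enable rule described above, not the actual validator. `ArgsSketch`, `auto_enable_cute_dsl_bf16`, the `pipeline_parallel_size` and `use_cute_dsl_bf16_bmm` field names, and passing the SM version as an integer are all assumptions made for illustration; only `use_cute_dsl_bf16_gemm` is named in this PR.

```python
from dataclasses import dataclass


@dataclass
class ArgsSketch:
    """Hypothetical stand-in for TorchLlmArgs; only the fields needed here."""
    pipeline_parallel_size: int = 1
    use_cute_dsl_bf16_bmm: bool = False   # assumed field name
    use_cute_dsl_bf16_gemm: bool = False


def auto_enable_cute_dsl_bf16(args: ArgsSketch, sm_version: int) -> ArgsSketch:
    """Turn on both CuTe DSL BF16 paths for PP runs on SM100/SM103."""
    uses_pp = args.pipeline_parallel_size > 1
    if uses_pp and sm_version in (100, 103):
        args.use_cute_dsl_bf16_bmm = True
        args.use_cute_dsl_bf16_gemm = True
    return args
```

Because the rule only ever switches flags on, a caller who already set them explicitly is unaffected, which is one way to keep the public API stable.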

The attention and gated-MLP modules now pass use_cute_dsl_bf16_gemm into their Linear projections, including attention/MLA output projection and MLP gate-up/down projection paths.
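To illustrate what threading the flag means, here is a rough sketch using plain Python classes instead of the real modules. `LinearSketch`, `GatedMLPSketch`, and the `backend` attribute are hypothetical; the real Linear, attention, and MLP classes have different constructors that this PR page does not show.

```python
class LinearSketch:
    """Hypothetical Linear stand-in that records which GEMM path it would use."""

    def __init__(self, in_features, out_features, use_cute_dsl_bf16_gemm=False):
        self.in_features = in_features
        self.out_features = out_features
        self.backend = "cute_dsl_bf16" if use_cute_dsl_bf16_gemm else "default"


class GatedMLPSketch:
    """Hypothetical gated MLP that forwards one flag to every projection."""

    def __init__(self, hidden_size, intermediate_size, use_cute_dsl_bf16_gemm=False):
        # Pass the same flag to both projections so neither silently
        # falls back to a different GEMM implementation.
        self.gate_up_proj = LinearSketch(
            hidden_size, 2 * intermediate_size,
            use_cute_dsl_bf16_gemm=use_cute_dsl_bf16_gemm)
        self.down_proj = LinearSketch(
            intermediate_size, hidden_size,
            use_cute_dsl_bf16_gemm=use_cute_dsl_bf16_gemm)

    def backends(self):
        """Return the set of GEMM paths selected across all projections."""
        return {self.gate_up_proj.backend, self.down_proj.backend}
```

A single-element set from `backends()` indicates that all projections agree on the GEMM path, which is the consistency property this change is after.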

This replaces the earlier NCCL/NVLS workaround. The PR no longer relies on setting NCCL_NVLS_ENABLE=0 or modifying worker environment propagation for this bug.

Validation

  • git diff --check
  • Pre-commit hooks passed for commit 75aea27943.
  • GB200 OCI stress validation for the reproduced DeepSeekV3Lite PP4 hang case: the pre-fix path reproduced the hang during repeated runs; the fixed path passed 100/100 iterations.
  • The branch still removes the NVBug 6224637 waiver entries from tests/integration/test_lists/waives.txt so CI can run these cases again.
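For readers who want to reproduce a hang of this kind, a repeated-run harness along these lines may help. It is only a sketch: `stress_run`, the default timeout, and `NOOP_CMD` are assumptions, and the actual GB200 OCI stress setup is not described in this PR.

```python
import subprocess
import sys

# Placeholder command; the real test invocation is not shown in this PR.
NOOP_CMD = [sys.executable, "-c", "pass"]


def stress_run(cmd, iterations=100, timeout_s=600.0):
    """Run cmd repeatedly and return (passed, hung, failed) counts.

    A run counts as hung when it exceeds timeout_s; subprocess.run
    kills the child process before raising TimeoutExpired.
    """
    passed = hung = failed = 0
    for _ in range(iterations):
        try:
            result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
        except subprocess.TimeoutExpired:
            hung += 1
            continue
        if result.returncode == 0:
            passed += 1
        else:
            failed += 1
    return passed, hung, failed
```

Counting hangs separately from ordinary failures keeps an intermittent deadlock from being mistaken for an accuracy regression.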

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Contributor

Walkthrough

The PR updates test waiver skip entries in tests/integration/test_lists/waives.txt for the TestDeepSeekV3Lite test class. The test_bfloat16_4gpus section removes older pipeline-parallel and torch_compile variant waivers and adds new tensor-parallel and scheduler-specific SKIPs. The test_nvfp4_4gpus section removes multiple CUTLASS backend waivers with mtp_nextn=0 across different configurations, retaining a single mtp_nextn=2 fp8kv variant.

Changes

DeepSeekV3Lite waiver list updates

| Layer / File(s) | Summary |
| --- | --- |
| Test waiver configuration updates (tests/integration/test_lists/waives.txt) | Updated skip entries for TestDeepSeekV3Lite::test_bfloat16_4gpus by removing older pp4/torch_compile-variant waivers (and associated fp8 block scale entries) and adding new tp4-based SKIPs with updated bfloat16_python_scheduler and cute_dsl 4-gpu waivers. Adjusted test_nvfp4_4gpus CUTLASS waivers by removing multiple pp4 mtp_nextn=0 entries across different torch_compile/fp8kv combinations and keeping/adding the mtp_nextn=2 fp8kv=True torch_compile=False SKIP entry. |
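As an illustration of the unwaive step, the sketch below filters out waiver lines that reference one NVBug ID. It assumes each entry has the shape `<test id> SKIP (https://nvbugs/<id>)`, which this page does not show verbatim, and `drop_waivers_for_bug` is a hypothetical helper, not part of the repository.

```python
import re


def drop_waivers_for_bug(lines, bug_id):
    """Return the waiver lines that do not reference the given NVBug ID."""
    # The trailing \b stops ID 6224637 from also matching 62246370.
    pattern = re.compile(rf"nvbugs/{re.escape(str(bug_id))}\b")
    return [line for line in lines if not pattern.search(line)]
```

Matching on the bug URL rather than on test names keeps the helper independent of how individual test parameters are spelled.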

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14946: Modifies the same waiver file with updates to TestDeepSeekV3Lite skip entries and NVBug references.
  • NVIDIA/TensorRT-LLM#14523: Directly updates the same TestDeepSeekV3Lite fp8 block scale and test_nvfp4_4gpus CUTLASS mtp_nextn/fp8kv waiver entries.
  • NVIDIA/TensorRT-LLM#10835: Modifies the same waives.txt file by changing skipped test cases for the same test class.

Suggested reviewers

  • StanleySun639
  • xinhe-nv
  • LarryXFly
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is well-structured with Summary, Root Cause, Solution, and Validation sections that clearly explain the changes. However, it does not strictly follow the provided template structure (missing Test Coverage section with explicit test list, and the checklist items are not addressed). |
| Title check | ✅ Passed | The title references the BF16 CuTe DSL SM100 PP area affected by the waiver changes, so it is related to the changeset. |


@yuxianq yuxianq requested a review from xinhe-nv June 5, 2026 06:19
@yuxianq

yuxianq commented Jun 5, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52276 [ run ] triggered by Bot. Commit: 8a18308 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52276 [ run ] completed with state SUCCESS. Commit: 8a18308
/LLM/main/L0_MergeRequest_PR pipeline #41587 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 8a18308 to ea2788c Compare June 6, 2026 05:07

yuxianq commented Jun 6, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52478 [ run ] triggered by Bot. Commit: ea2788c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52478 [ run ] completed with state SUCCESS. Commit: ea2788c
/LLM/main/L0_MergeRequest_PR pipeline #41770 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq

yuxianq commented Jun 8, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52640 [ run ] triggered by Bot. Commit: 990110a Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52640 [ run ] completed with state FAILURE. Commit: 990110a
/LLM/main/L0_MergeRequest_PR pipeline #41919 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 990110a to 4a19ebc Compare June 9, 2026 04:03
@yuxianq

yuxianq commented Jun 9, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #52945 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52945 [ run ] completed with state SUCCESS. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42188 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq

yuxianq commented Jun 10, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53216 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53216 [ run ] completed with state SUCCESS. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42411 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq

yuxianq commented Jun 11, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53424 [ run ] triggered by Bot. Commit: 4a19ebc Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53424 [ run ] completed with state FAILURE. Commit: 4a19ebc
/LLM/main/L0_MergeRequest_PR pipeline #42595 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 4a19ebc to 056ac16 Compare June 11, 2026 10:49
@yuxianq

yuxianq commented Jun 11, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53559 [ run ] triggered by Bot. Commit: 056ac16 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53559 [ run ] completed with state FAILURE. Commit: 056ac16
/LLM/main/L0_MergeRequest_PR pipeline #42708 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq

yuxianq commented Jun 12, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53765 [ run ] triggered by Bot. Commit: 056ac16 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53765 [ run ] completed with state SUCCESS. Commit: 056ac16
/LLM/main/L0_MergeRequest_PR pipeline #42885 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55396 [ run ] triggered by Bot. Commit: 75aea27 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55396 [ run ] completed with state SUCCESS. Commit: 75aea27
/LLM/main/L0_MergeRequest_PR pipeline #44344 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq added 6 commits June 24, 2026 08:27
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq force-pushed the test/unwaive-nvbug-6224637 branch from 75aea27 to eb7647e Compare June 24, 2026 08:27

yuxianq commented Jun 24, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55460 [ run ] triggered by Bot. Commit: eb7647e Link to invocation

@yuxianq

yuxianq commented Jun 24, 2026

Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@tensorrt-cicd

Collaborator

PR_Github #55464 [ run ] triggered by Bot. Commit: a1c811f Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55460 [ run ] completed with state ABORTED. Commit: eb7647e

Link to invocation

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq commented Jun 24, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55482 [ run ] triggered by Bot. Commit: ce91671 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55482 [ run ] completed with state SUCCESS. Commit: ce91671
/LLM/main/L0_MergeRequest_PR pipeline #44407 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq

yuxianq commented Jun 25, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55677 [ run ] triggered by Bot. Commit: ce91671 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55677 [ run ] completed with state SUCCESS. Commit: ce91671
/LLM/main/L0_MergeRequest_PR pipeline #44583 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yuxianq yuxianq changed the title from "[NVBUG-6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP" to "[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP" Jun 25, 2026
@yuxianq yuxianq requested a review from peaceh-nv June 25, 2026 05:32
@peaceh-nv

Collaborator

LGTM

@QiJune QiJune left a comment

Collaborator

LGTM

@yuxianq yuxianq merged commit 70a7528 into NVIDIA:main Jun 25, 2026
10 checks passed
5 participants