[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP #14993
Walkthrough
The PR updates test waiver skip entries in …
Changes
DeepSeekV3Lite waiver list updates
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
LGTM
Summary
… `use_cute_dsl_bf16_gemm` into attention and MLP linear projections so the affected PP4 paths consistently use the intended CuTe DSL BF16 GEMM implementation.

Root Cause
The hanging GB200 cases were not fixed reliably by changing `NCCL_NVLS_ENABLE` or by changing the remote task environment. The reproducible hang was tied to the SM100 pipeline-parallel DeepSeekV3Lite BF16 linear path selection: the existing CuTe DSL BF16 knobs did not cover every GEMM/BMM path used by the affected PP4 tests.

Solution
`TorchLlmArgs.validate_cute_dsl_bf16` now enables both CuTe DSL BF16 BMM and GEMM automatically when the run uses pipeline parallelism on SM100/SM103. This keeps the public API stable and avoids requiring test-specific environment overrides.
The attention and gated-MLP modules now pass `use_cute_dsl_bf16_gemm` into their `Linear` projections, including the attention/MLA output projection and the MLP gate-up/down projection paths.
This replaces the earlier NCCL/NVLS workaround. The PR no longer relies on setting `NCCL_NVLS_ENABLE=0` or modifying worker environment propagation for this bug.

Validation
`git diff --check`
… `75aea27943`.
… `tests/integration/test_lists/waives.txt` so CI can run these cases again.
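
For readers who want a concrete picture of the Solution section, here is a minimal, self-contained sketch of the two ideas it describes: a validator that turns on both BF16 flags when pipeline parallelism is used on SM100/SM103, and projections that receive the GEMM flag. It is not the code from this PR. `SketchArgs`, `SketchLinear`, `build_projections`, the fields `pipeline_parallel_size` and `sm_version`, the flag name `use_cute_dsl_bf16_bmm`, and the projection names are all assumptions made for illustration; only `validate_cute_dsl_bf16` and `use_cute_dsl_bf16_gemm` are names taken from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchArgs:
    """Hypothetical stand-in for the real argument class (not TorchLlmArgs)."""

    pipeline_parallel_size: int = 1      # assumed field name
    sm_version: int = 90                 # assumed: 100 for SM100, 103 for SM103
    use_cute_dsl_bf16_bmm: bool = False  # assumed flag name for the BMM path
    use_cute_dsl_bf16_gemm: bool = False

    def validate_cute_dsl_bf16(self) -> "SketchArgs":
        # Enable both kernels only for pipeline-parallel runs on SM100/SM103.
        uses_pp = self.pipeline_parallel_size > 1
        if uses_pp and self.sm_version in (100, 103):
            self.use_cute_dsl_bf16_bmm = True
            self.use_cute_dsl_bf16_gemm = True
        return self


@dataclass
class SketchLinear:
    """Hypothetical projection that only records which GEMM path it would use."""

    name: str
    use_cute_dsl_bf16_gemm: bool = False


def build_projections(args: SketchArgs) -> List[SketchLinear]:
    # Pass the same flag to every projection so no path falls back silently.
    names = ("attn_o_proj", "mlp_gate_up_proj", "mlp_down_proj")  # assumed names
    return [SketchLinear(n, args.use_cute_dsl_bf16_gemm) for n in names]


if __name__ == "__main__":
    args = SketchArgs(pipeline_parallel_size=4, sm_version=100).validate_cute_dsl_bf16()
    for proj in build_projections(args):
        print(proj.name, proj.use_cute_dsl_bf16_gemm)
```

The point of the sketch is the design choice the description names: the decision is made once in argument validation and then passed down explicitly, rather than being set per test through environment overrides.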