Skip to content

[None][chore] Allow fp8 per-tensor base weights for MoE LoRA#15528

Open
brb-nv wants to merge 1 commit into
NVIDIA:mainfrom
brb-nv:user/brb/support-fp8-moe-lora
Open

[None][chore] Allow fp8 per-tensor base weights for MoE LoRA#15528
brb-nv wants to merge 1 commit into
NVIDIA:mainfrom
brb-nv:user/brb/support-fp8-moe-lora

Conversation

@brb-nv

@brb-nv brb-nv commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Description

This MR does the following in context of supporting MoE lora:

  1. Validates fp8 qdq support for MoE base weights and relaxes corresponding constraints.
  2. Move tests that were previously under unit_tests to integration tests as they involved loading full model weights.

Test Coverage

$ pytest -v tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Expanded MoE LoRA support to cover more quantized model setups, including per-tensor FP8.
    • Enabled LoRA support in Mixtral model execution paths.
    • Added broader validation and end-to-end coverage for routed-expert LoRA with varying adapter ranks.
  • Bug Fixes

    • Tightened checks to reject unsupported quantization formats for MoE LoRA, improving error handling and compatibility.

@brb-nv brb-nv force-pushed the user/brb/support-fp8-moe-lora branch 3 times, most recently from 0c686b2 to 9028723 Compare June 28, 2026 21:49
@brb-nv brb-nv changed the title [None][feat] Support FP8 base weights for MoE LoRA [None][chore] Allow fp8 per-tensor base weights for MoE LoRA Jun 28, 2026
@brb-nv brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from 9028723 to cde2b98 Compare June 28, 2026 22:04
@brb-nv brb-nv marked this pull request as ready for review June 28, 2026 22:10
@brb-nv brb-nv requested review from a team as code owners June 28, 2026 22:10
@brb-nv brb-nv requested review from byshiue and yechank-nvidia June 28, 2026 22:10
@brb-nv

brb-nv commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@brb-nv brb-nv requested review from LarryXFly and venkywonka and removed request for byshiue June 28, 2026 22:11
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56244 [ run ] triggered by Bot. Commit: cde2b98 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Extends MoE LoRA to support per-tensor FP8 (qdq) quantized base weights. Python validation gains a _is_supported_quant helper; the C++ runMoe and loraTypeFromActDtype are updated for FP8 dtype handling; Mixtral model forwards propagate lora_params; and new unit/integration tests validate the FP8 LoRA path end-to-end.

Changes

MoE LoRA FP8 Support

Layer / File(s) Summary
Python MoE LoRA quant validation
tensorrt_llm/_torch/peft/lora/validation.py
Adds _UNSUPPORTED_QUANT set and _is_supported_quant() helper; updates check_moe_lora_supported to allow per-tensor FP8 qdq while rejecting block-scale FP8, NVFP4, and integer quant.
C++ fused MoE op FP8 dtype validation and mapping
cpp/tensorrt_llm/thop/moeOp.cpp
Expands runMoe LoRA input checks to permit per-tensor FP8 activations/weights; converts loraTypeFromActDtype to a const member function with recursive FP8→output-dtype mapping via mOutputDtype.
Mixtral lora_params propagation
tensorrt_llm/_torch/models/modeling_mixtral.py
Adds lora_params: Optional[dict] = None to MixtralModel, MixtralDecoderLayer, and MixtralMoE forward signatures and forwards it to self_attn, block_sparse_moe, and experts.
Unit tests: validator and fused-op FP8 LoRA
tests/unittest/_torch/lora/test_moe_lora_validator.py, tests/unittest/_torch/lora/test_moe_lora_op.py
Replaces boolean quant stubs with real QuantMode constants in validator tests; adds FP8 qdq acceptance and block-scale rejection cases; extends _call_fused_moe for quant_scales and adds FP8 LoRA op parity/output-change tests.
Integration tests: Qwen MoE and Mixtral FP8 LoRA
tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py, tests/integration/test_lists/...
New integration test module fabricates per-expert LoRA adapter weights and runs mixed-batch multi-LoRA tests for Qwen1.5-MoE (bf16) and Mixtral-8x7B (FP8 qdq); CI test lists updated to route the tests to the new module.
test_llm_pytorch.py cleanup
tests/unittest/llmapi/test_llm_pytorch.py
Removes unused os and MoeConfig imports; reformats docstrings and whitespace; removes now-relocated Qwen MoE LoRA test entry from A100 pre-merge list.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14801: Extends the same FusedMoeRunner::runMoe and lora_params wiring in moeOp.cpp that this PR builds on for FP8 support.
  • NVIDIA/TensorRT-LLM#14881: Adds the CUDA-graph slot-indexed LoRA schema to moeOp.cpp, directly adjacent to the validation code this PR modifies.
  • NVIDIA/TensorRT-LLM#14923: Introduces the device-path LoRA execution wiring in moeOp.cpp that this PR's FP8 dtype validation is layered on top of.

Suggested reviewers

  • amitz-nv
  • byshiue
  • xxi-nv
  • QiJune
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.10% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is clear, specific, and matches the main change: allowing fp8 per-tensor base weights for MoE LoRA.
Description check ✅ Passed The description follows the template with Description, Test Coverage, and PR Checklist sections filled in appropriately.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unittest/_torch/lora/test_moe_lora_validator.py (1)

93-147: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a test for the new quant-wrapper normalization path.

These cases only exercise plain QuantMode inputs, but tensorrt_llm/_torch/peft/lora/validation.py now has a separate .objs normalization branch. Please add a tiny fake wrapper in tests/unittest/_torch/lora/test_moe_lora_validator.py and cover at least one accepted mix ([QuantMode(0), _FP8_QDQ]) and one rejected mix ([_FP8_QDQ, _FP8_BLOCK_SCALE]) so that branch cannot regress silently.

As per path instructions, “Act as a QA engineer reviewing test changes and coverage” and “Keep feedback actionable”; coverage here is insufficient for the new wrapper path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/lora/test_moe_lora_validator.py` around lines 93 - 147,
Add coverage for the new quant-wrapper normalization branch in
check_moe_lora_supported: the current tests only pass plain QuantMode values, so
create a small fake wrapper with an objs field in test_moe_lora_validator.py and
use it to exercise the .objs path in
tensorrt_llm/_torch/peft/lora/validation.py. Add one passing case with a mixed
list such as QuantMode(0) plus _FP8_QDQ, and one failing case with _FP8_QDQ plus
_FP8_BLOCK_SCALE, so the wrapper normalization logic is validated for both
accepted and rejected inputs.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py`:
- Around line 212-217: The mixed-batch no-LoRA row is only being checked for
non-empty output, which can miss cross-request adapter contamination. Update the
assertions in the relevant mixed-batch test cases that use out_tokens and
base_tokens so the rank-0/no-LoRA row is compared directly against the
standalone base_tokens baseline, while keeping the existing per-LoRA row checks
intact. Use the mixed-batch assertions around out_tokens[0] and the standalone
baseline variables to ensure any accidental adapter application fails loudly.

In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Line 259: The pre-merge H100 test list still only covers the Qwen bf16 path,
so add the new Mixtral routed-expert FP8 case to the pre-merge tier in the
test-db list, or include an equivalent pre-merge FP8 test alongside it. Update
the relevant entry in the H100 list used by the auto-run CI test-db so the FP8
enablement is exercised before merge, and ensure the chosen test name matches
the existing Mixtral FP8 case referenced in the integration suite.

In `@tests/unittest/llmapi/test_llm_pytorch.py`:
- Around line 1096-1100: The module defines TestLlmError twice, so the later
class overwrites the earlier one and pytest only collects one suite. Rename one
of the TestLlmError classes in tests/unittest/llmapi/test_llm_pytorch.py,
keeping the existing error-path tests like test_max_num_token_check under
distinct class names so both suites are discoverable. Verify collection includes
both classes and that coverage is sufficient after the rename.

---

Nitpick comments:
In `@tests/unittest/_torch/lora/test_moe_lora_validator.py`:
- Around line 93-147: Add coverage for the new quant-wrapper normalization
branch in check_moe_lora_supported: the current tests only pass plain QuantMode
values, so create a small fake wrapper with an objs field in
test_moe_lora_validator.py and use it to exercise the .objs path in
tensorrt_llm/_torch/peft/lora/validation.py. Add one passing case with a mixed
list such as QuantMode(0) plus _FP8_QDQ, and one failing case with _FP8_QDQ plus
_FP8_BLOCK_SCALE, so the wrapper normalization logic is validated for both
accepted and rejected inputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4522f74d-ca40-4f79-9f77-7659393a283a

📥 Commits

Reviewing files that changed from the base of the PR and between b6d186a and cde2b98.

📒 Files selected for processing (11)
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • tensorrt_llm/_torch/models/modeling_mixtral.py
  • tensorrt_llm/_torch/peft/lora/validation.py
  • tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/test-db/l0_a100.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/lora/test_moe_lora_op.py
  • tests/unittest/_torch/lora/test_moe_lora_validator.py
  • tests/unittest/llmapi/test_llm_pytorch.py
💤 Files with no reviewable changes (2)
  • tests/integration/test_lists/test-db/l0_a100.yml
  • tests/integration/test_lists/waives.txt

Comment on lines +212 to +217
assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens."
for i in range(len(lora_requests)):
assert out_tokens[i + 1] != base_tokens, (
f"Quantized routed-expert MoE LoRA adapter {i} produced output "
"identical to the base model; it was not applied."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Assert that the mixed-batch no-LoRA row matches the standalone baseline.

Both sites only check that out_tokens[0] is non-empty. That misses the isolation failure these tests are meant to catch: the rank-0 row can accidentally pick up another request's adapter and still pass. Compare out_tokens[0] to base_tokens in both places so cross-request contamination fails loudly.

Suggested fix
-        assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens."
+        assert out_tokens[0] == base_tokens, (
+            "No-LoRA row in the mixed batch diverged from the standalone base output."
+        )
-            assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens."
+            assert out_tokens[0] == base_tokens, (
+                "No-LoRA row in the mixed batch diverged from the standalone base output."
+            )

As per path instructions, “Act as a QA engineer reviewing test changes and coverage” and “Keep feedback actionable”; coverage is insufficient until the no-LoRA row is validated against the base output.

Also applies to: 345-353

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py` around lines
212 - 217, The mixed-batch no-LoRA row is only being checked for non-empty
output, which can miss cross-request adapter contamination. Update the
assertions in the relevant mixed-batch test cases that use out_tokens and
base_tokens so the rank-0/no-LoRA row is compared directly against the
standalone base_tokens baseline, while keeping the existing per-LoRA row checks
intact. Use the mixed-batch assertions around out_tokens[0] and the standalone
baseline variables to ensure any accidental adapter application fails loudly.

Source: Path instructions

- unittest/llmapi/test_llm_pytorch.py -m "part2"
- unittest/llmapi/test_llm_pytorch.py -m "part3"
- unittest/llmapi/test_llm_pytorch.py::test_qwen_moe_routed_expert_multi_lora_varying_ranks TIMEOUT (90)
- llmapi/test_llm_api_pytorch_moe_lora.py::test_qwen_moe_routed_expert_multi_lora_varying_ranks TIMEOUT (90)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Promote an FP8 Mixtral case into pre-merge H100 coverage.

The pre-merge H100 entry still exercises only the Qwen bf16 path, while the new Mixtral routed-expert FP8 case is post-merge only. That leaves the PR’s headline FP8 enablement unguarded before merge. Please add one FP8 Mixtral case to the pre-merge tier, or add an equivalent pre-merge FP8 test.

As per path instructions, files under tests/integration/test_lists/test-db/** “drive the auto-run CI test-db”; coverage here is insufficient for the new FP8 path.

Also applies to: 334-334

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_lists/test-db/l0_h100.yml` at line 259, The pre-merge
H100 test list still only covers the Qwen bf16 path, so add the new Mixtral
routed-expert FP8 case to the pre-merge tier in the test-db list, or include an
equivalent pre-merge FP8 test alongside it. Update the relevant entry in the
H100 list used by the auto-run CI test-db so the FP8 enablement is exercised
before merge, and ensure the chosen test name matches the existing Mixtral FP8
case referenced in the integration suite.

Source: Path instructions

Comment on lines 1096 to +1100
class TestLlmError:

@pytest.mark.part3
def test_max_num_token_check(self):
""" LLM should raise error when got prompt length exceed the valid range. """
"""LLM should raise error when got prompt length exceed the valid range."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Rename one of the TestLlmError classes so both suites are collected.

This module defines TestLlmError twice. The second definition at Line 1394 overwrites the earlier one at Line 1096 during import, so pytest only collects the later class and silently drops the earlier error-path coverage in tests/unittest/llmapi/test_llm_pytorch.py. Coverage is insufficient until both classes are discoverable.

Proposed fix
-class TestLlmError:
+class TestLlmErrorMaxNumTokens:

As per path instructions, "tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."

Also applies to: 1394-1398

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_llm_pytorch.py` around lines 1096 - 1100, The
module defines TestLlmError twice, so the later class overwrites the earlier one
and pytest only collects one suite. Rename one of the TestLlmError classes in
tests/unittest/llmapi/test_llm_pytorch.py, keeping the existing error-path tests
like test_max_num_token_check under distinct class names so both suites are
discoverable. Verify collection includes both classes and that coverage is
sufficient after the rename.

Source: Path instructions

@brb-nv brb-nv enabled auto-merge (squash) June 28, 2026 22:36
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56244 [ run ] completed with state FAILURE. Commit: cde2b98
/LLM/main/L0_MergeRequest_PR pipeline #45104 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@brb-nv brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from cde2b98 to a0d30c4 Compare June 29, 2026 16:48
@brb-nv

brb-nv commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56418 [ run ] triggered by Bot. Commit: a0d30c4 Link to invocation

@venkywonka venkywonka left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Comment thread tensorrt_llm/_torch/models/modeling_mixtral.py
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from a0d30c4 to ca2bb08 Compare June 29, 2026 18:05
@brb-nv

brb-nv commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56435 [ run ] triggered by Bot. Commit: ca2bb08 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56418 [ run ] completed with state ABORTED. Commit: a0d30c4

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56435 [ run ] completed with state FAILURE. Commit: ca2bb08
/LLM/main/L0_MergeRequest_PR pipeline #45278 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants