[None][chore] Allow fp8 per-tensor base weights for MoE LoRA by brb-nv · Pull Request #15528 · NVIDIA/TensorRT-LLM

brb-nv · 2026-06-22T21:46:56Z

Description

This MR does the following in context of supporting MoE lora:

Validates fp8 qdq support for MoE base weights and relaxes corresponding constraints.
Move tests that were previously under unit_tests to integration tests as they involved loading full model weights.

Test Coverage

$ pytest -v tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

New Features
- Expanded MoE LoRA support to cover more quantized model setups, including per-tensor FP8.
- Enabled LoRA support in Mixtral model execution paths.
- Added broader validation and end-to-end coverage for routed-expert LoRA with varying adapter ranks.
Bug Fixes
- Tightened checks to reject unsupported quantization formats for MoE LoRA, improving error handling and compatibility.

brb-nv · 2026-06-28T22:11:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-28T22:16:35Z

PR_Github #56244 [ run ] triggered by Bot. Commit: cde2b98 Link to invocation

coderabbitai · 2026-06-28T22:17:05Z

📝 Walkthrough

Walkthrough

Extends MoE LoRA to support per-tensor FP8 (qdq) quantized base weights. Python validation gains a _is_supported_quant helper; the C++ runMoe and loraTypeFromActDtype are updated for FP8 dtype handling; Mixtral model forwards propagate lora_params; and new unit/integration tests validate the FP8 LoRA path end-to-end.

Changes

MoE LoRA FP8 Support

Layer / File(s)	Summary
Python MoE LoRA quant validation `tensorrt_llm/_torch/peft/lora/validation.py`	Adds `_UNSUPPORTED_QUANT` set and `_is_supported_quant()` helper; updates `check_moe_lora_supported` to allow per-tensor FP8 qdq while rejecting block-scale FP8, NVFP4, and integer quant.
C++ fused MoE op FP8 dtype validation and mapping `cpp/tensorrt_llm/thop/moeOp.cpp`	Expands `runMoe` LoRA input checks to permit per-tensor FP8 activations/weights; converts `loraTypeFromActDtype` to a `const` member function with recursive FP8→output-dtype mapping via `mOutputDtype`.
Mixtral `lora_params` propagation `tensorrt_llm/_torch/models/modeling_mixtral.py`	Adds `lora_params: Optional[dict] = None` to `MixtralModel`, `MixtralDecoderLayer`, and `MixtralMoE` forward signatures and forwards it to `self_attn`, `block_sparse_moe`, and `experts`.
Unit tests: validator and fused-op FP8 LoRA `tests/unittest/_torch/lora/test_moe_lora_validator.py`, `tests/unittest/_torch/lora/test_moe_lora_op.py`	Replaces boolean quant stubs with real `QuantMode` constants in validator tests; adds FP8 qdq acceptance and block-scale rejection cases; extends `_call_fused_moe` for `quant_scales` and adds FP8 LoRA op parity/output-change tests.
Integration tests: Qwen MoE and Mixtral FP8 LoRA `tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py`, `tests/integration/test_lists/...`	New integration test module fabricates per-expert LoRA adapter weights and runs mixed-batch multi-LoRA tests for Qwen1.5-MoE (bf16) and Mixtral-8x7B (FP8 qdq); CI test lists updated to route the tests to the new module.
`test_llm_pytorch.py` cleanup `tests/unittest/llmapi/test_llm_pytorch.py`	Removes unused `os` and `MoeConfig` imports; reformats docstrings and whitespace; removes now-relocated Qwen MoE LoRA test entry from A100 pre-merge list.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14801: Extends the same FusedMoeRunner::runMoe and lora_params wiring in moeOp.cpp that this PR builds on for FP8 support.
NVIDIA/TensorRT-LLM#14881: Adds the CUDA-graph slot-indexed LoRA schema to moeOp.cpp, directly adjacent to the validation code this PR modifies.
NVIDIA/TensorRT-LLM#14923: Introduces the device-path LoRA execution wiring in moeOp.cpp that this PR's FP8 dtype validation is layered on top of.

Suggested reviewers

amitz-nv
byshiue
xxi-nv
QiJune

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 56.10% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is clear, specific, and matches the main change: allowing fp8 per-tensor base weights for MoE LoRA.
Description check	✅ Passed	The description follows the template with Description, Test Coverage, and PR Checklist sections filled in appropriately.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/unittest/_torch/lora/test_moe_lora_validator.py (1)
93-147: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a test for the new quant-wrapper normalization path.

These cases only exercise plain QuantMode inputs, but tensorrt_llm/_torch/peft/lora/validation.py now has a separate .objs normalization branch. Please add a tiny fake wrapper in tests/unittest/_torch/lora/test_moe_lora_validator.py and cover at least one accepted mix ([QuantMode(0), _FP8_QDQ]) and one rejected mix ([_FP8_QDQ, _FP8_BLOCK_SCALE]) so that branch cannot regress silently.

As per path instructions, “Act as a QA engineer reviewing test changes and coverage” and “Keep feedback actionable”; coverage here is insufficient for the new wrapper path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/lora/test_moe_lora_validator.py` around lines 93 - 147,
Add coverage for the new quant-wrapper normalization branch in
check_moe_lora_supported: the current tests only pass plain QuantMode values, so
create a small fake wrapper with an objs field in test_moe_lora_validator.py and
use it to exercise the .objs path in
tensorrt_llm/_torch/peft/lora/validation.py. Add one passing case with a mixed
list such as QuantMode(0) plus _FP8_QDQ, and one failing case with _FP8_QDQ plus
_FP8_BLOCK_SCALE, so the wrapper normalization logic is validated for both
accepted and rejected inputs.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py`:
- Around line 212-217: The mixed-batch no-LoRA row is only being checked for
non-empty output, which can miss cross-request adapter contamination. Update the
assertions in the relevant mixed-batch test cases that use out_tokens and
base_tokens so the rank-0/no-LoRA row is compared directly against the
standalone base_tokens baseline, while keeping the existing per-LoRA row checks
intact. Use the mixed-batch assertions around out_tokens[0] and the standalone
baseline variables to ensure any accidental adapter application fails loudly.

In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Line 259: The pre-merge H100 test list still only covers the Qwen bf16 path,
so add the new Mixtral routed-expert FP8 case to the pre-merge tier in the
test-db list, or include an equivalent pre-merge FP8 test alongside it. Update
the relevant entry in the H100 list used by the auto-run CI test-db so the FP8
enablement is exercised before merge, and ensure the chosen test name matches
the existing Mixtral FP8 case referenced in the integration suite.

In `@tests/unittest/llmapi/test_llm_pytorch.py`:
- Around line 1096-1100: The module defines TestLlmError twice, so the later
class overwrites the earlier one and pytest only collects one suite. Rename one
of the TestLlmError classes in tests/unittest/llmapi/test_llm_pytorch.py,
keeping the existing error-path tests like test_max_num_token_check under
distinct class names so both suites are discoverable. Verify collection includes
both classes and that coverage is sufficient after the rename.

---

Nitpick comments:
In `@tests/unittest/_torch/lora/test_moe_lora_validator.py`:
- Around line 93-147: Add coverage for the new quant-wrapper normalization
branch in check_moe_lora_supported: the current tests only pass plain QuantMode
values, so create a small fake wrapper with an objs field in
test_moe_lora_validator.py and use it to exercise the .objs path in
tensorrt_llm/_torch/peft/lora/validation.py. Add one passing case with a mixed
list such as QuantMode(0) plus _FP8_QDQ, and one failing case with _FP8_QDQ plus
_FP8_BLOCK_SCALE, so the wrapper normalization logic is validated for both
accepted and rejected inputs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4522f74d-ca40-4f79-9f77-7659393a283a

📥 Commits

Reviewing files that changed from the base of the PR and between b6d186a and cde2b98.

📒 Files selected for processing (11)

cpp/tensorrt_llm/thop/moeOp.cpp
tensorrt_llm/_torch/models/modeling_mixtral.py
tensorrt_llm/_torch/peft/lora/validation.py
tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py
tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/test-db/l0_a100.yml
tests/integration/test_lists/test-db/l0_h100.yml
tests/integration/test_lists/waives.txt
tests/unittest/_torch/lora/test_moe_lora_op.py
tests/unittest/_torch/lora/test_moe_lora_validator.py
tests/unittest/llmapi/test_llm_pytorch.py

💤 Files with no reviewable changes (2)

tests/integration/test_lists/test-db/l0_a100.yml
tests/integration/test_lists/waives.txt

coderabbitai · 2026-06-28T22:17:08Z

+        assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens."
+        for i in range(len(lora_requests)):
+            assert out_tokens[i + 1] != base_tokens, (
+                f"Quantized routed-expert MoE LoRA adapter {i} produced output "
+                "identical to the base model; it was not applied."
+            )


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Assert that the mixed-batch no-LoRA row matches the standalone baseline.

Both sites only check that out_tokens[0] is non-empty. That misses the isolation failure these tests are meant to catch: the rank-0 row can accidentally pick up another request's adapter and still pass. Compare out_tokens[0] to base_tokens in both places so cross-request contamination fails loudly.

Suggested fix

- assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens." + assert out_tokens[0] == base_tokens, ( + "No-LoRA row in the mixed batch diverged from the standalone base output." + )

- assert out_tokens[0], "No-LoRA row in the mixed batch produced no tokens." + assert out_tokens[0] == base_tokens, ( + "No-LoRA row in the mixed batch diverged from the standalone base output." + )

As per path instructions, “Act as a QA engineer reviewing test changes and coverage” and “Keep feedback actionable”; coverage is insufficient until the no-LoRA row is validated against the base output.

Also applies to: 345-353

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/llmapi/test_llm_api_pytorch_moe_lora.py` around lines 212 - 217, The mixed-batch no-LoRA row is only being checked for non-empty output, which can miss cross-request adapter contamination. Update the assertions in the relevant mixed-batch test cases that use out_tokens and base_tokens so the rank-0/no-LoRA row is compared directly against the standalone base_tokens baseline, while keeping the existing per-LoRA row checks intact. Use the mixed-batch assertions around out_tokens[0] and the standalone baseline variables to ensure any accidental adapter application fails loudly.

Source: Path instructions

coderabbitai · 2026-06-28T22:17:08Z

    - unittest/llmapi/test_llm_pytorch.py -m "part2"
    - unittest/llmapi/test_llm_pytorch.py -m "part3"
-    - unittest/llmapi/test_llm_pytorch.py::test_qwen_moe_routed_expert_multi_lora_varying_ranks TIMEOUT (90)
+    - llmapi/test_llm_api_pytorch_moe_lora.py::test_qwen_moe_routed_expert_multi_lora_varying_ranks TIMEOUT (90)


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Promote an FP8 Mixtral case into pre-merge H100 coverage.

The pre-merge H100 entry still exercises only the Qwen bf16 path, while the new Mixtral routed-expert FP8 case is post-merge only. That leaves the PR’s headline FP8 enablement unguarded before merge. Please add one FP8 Mixtral case to the pre-merge tier, or add an equivalent pre-merge FP8 test.

As per path instructions, files under tests/integration/test_lists/test-db/** “drive the auto-run CI test-db”; coverage here is insufficient for the new FP8 path.

Also applies to: 334-334

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/test_lists/test-db/l0_h100.yml` at line 259, The pre-merge H100 test list still only covers the Qwen bf16 path, so add the new Mixtral routed-expert FP8 case to the pre-merge tier in the test-db list, or include an equivalent pre-merge FP8 test alongside it. Update the relevant entry in the H100 list used by the auto-run CI test-db so the FP8 enablement is exercised before merge, and ensure the chosen test name matches the existing Mixtral FP8 case referenced in the integration suite.

Source: Path instructions

coderabbitai · 2026-06-28T22:17:08Z

 class TestLlmError:

    @pytest.mark.part3
    def test_max_num_token_check(self):
-        """ LLM should raise error when got prompt length exceed the valid range. """
+        """LLM should raise error when got prompt length exceed the valid range."""


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Rename one of the TestLlmError classes so both suites are collected.

This module defines TestLlmError twice. The second definition at Line 1394 overwrites the earlier one at Line 1096 during import, so pytest only collects the later class and silently drops the earlier error-path coverage in tests/unittest/llmapi/test_llm_pytorch.py. Coverage is insufficient until both classes are discoverable.

Proposed fix

-class TestLlmError: +class TestLlmErrorMaxNumTokens:

As per path instructions, "tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."

Also applies to: 1394-1398

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_llm_pytorch.py` around lines 1096 - 1100, The module defines TestLlmError twice, so the later class overwrites the earlier one and pytest only collects one suite. Rename one of the TestLlmError classes in tests/unittest/llmapi/test_llm_pytorch.py, keeping the existing error-path tests like test_max_num_token_check under distinct class names so both suites are discoverable. Verify collection includes both classes and that coverage is sufficient after the rename.

Source: Path instructions

tensorrt-cicd · 2026-06-29T00:20:09Z

PR_Github #56244 [ run ] completed with state FAILURE. Commit: cde2b98
/LLM/main/L0_MergeRequest_PR pipeline #45104 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

brb-nv · 2026-06-29T16:49:28Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T16:56:16Z

PR_Github #56418 [ run ] triggered by Bot. Commit: a0d30c4 Link to invocation

venkywonka

lgtm!

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv · 2026-06-29T18:05:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T18:11:12Z

PR_Github #56435 [ run ] triggered by Bot. Commit: ca2bb08 Link to invocation

tensorrt-cicd · 2026-06-29T18:15:07Z

PR_Github #56418 [ run ] completed with state ABORTED. Commit: a0d30c4

Link to invocation

tensorrt-cicd · 2026-06-29T21:25:39Z

PR_Github #56435 [ run ] completed with state FAILURE. Commit: ca2bb08
/LLM/main/L0_MergeRequest_PR pipeline #45278 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned brb-nv Jun 22, 2026

brb-nv force-pushed the user/brb/support-fp8-moe-lora branch 3 times, most recently from 0c686b2 to 9028723 Compare June 28, 2026 21:49

brb-nv changed the title ~~[None][feat] Support FP8 base weights for MoE LoRA~~ [None][chore] Allow fp8 per-tensor base weights for MoE LoRA Jun 28, 2026

brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from 9028723 to cde2b98 Compare June 28, 2026 22:04

brb-nv marked this pull request as ready for review June 28, 2026 22:10

brb-nv requested review from a team as code owners June 28, 2026 22:10

brb-nv requested review from byshiue and yechank-nvidia June 28, 2026 22:10

brb-nv requested review from LarryXFly and venkywonka and removed request for byshiue June 28, 2026 22:11

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

brb-nv enabled auto-merge (squash) June 28, 2026 22:36

yingguo-trt approved these changes Jun 29, 2026

View reviewed changes

LarryXFly approved these changes Jun 29, 2026

View reviewed changes

yechank-nvidia approved these changes Jun 29, 2026

View reviewed changes

brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from cde2b98 to a0d30c4 Compare June 29, 2026 16:48

venkywonka approved these changes Jun 29, 2026

View reviewed changes

venkywonka reviewed Jun 29, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/models/modeling_mixtral.py

[None][chore] Allow fp8 per-tensor base weights for MoE LoRA

ca2bb08

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv force-pushed the user/brb/support-fp8-moe-lora branch from a0d30c4 to ca2bb08 Compare June 29, 2026 18:05

Uh oh!

Conversation

brb-nv commented Jun 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Test Coverage

PR Checklist

GitHub Bot Help

Summary by CodeRabbit

Uh oh!

brb-nv commented Jun 28, 2026

Uh oh!

tensorrt-cicd commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

brb-nv commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

venkywonka left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

brb-nv commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

brb-nv commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading