[https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update#12750

Open
dhansen-nvidia wants to merge 3 commits into NVIDIA:main from dhansen-nvidia:nvbug5940460_fix

Conversation

@dhansen-nvidia
Collaborator

@dhansen-nvidia dhansen-nvidia commented Apr 3, 2026

The PyTorch 26.02 stack changed the traced graph shape around static_quantize_e4m3_per_tensor: the unused scale getitem may be dead-code-eliminated, but some graphs still retain a live scale consumer.

Update the FP8 AR+Residual+RMSNorm quant fusion patterns to match both live-scale and pruned-scale graphs, with an explicit guard so the 2-output rewrite only fires when the scale output is absent. Name the custom passes and track aggregate match_count_by_pass totals so tests can assert exact semantic pass totals instead of the fixed-point bookkeeping trace.

Also build the custom pass pipeline per Backend instance rather than sharing a process-global cache, and add regressions for the live-scale compile path and per-instance backend pass configuration. This keeps the user-buffer tests unwaived without reducing the intended fusion coverage.
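The scale-presence guard described above can be sketched with a small stand-alone model. The `Node` class and helper names below are simplified stand-ins for `torch.fx` graph structures and for the helpers added in this PR; this is an illustrative sketch of the matching logic, not the actual implementation.

```python
# Minimal model of the scale-output guard: the 2-output rewrite may only
# fire when the quant op's scale getitem (output index 1) has been
# dead-code-eliminated. Node is a simplified stand-in for a torch.fx node.

class Node:
    def __init__(self, op, args=()):
        self.op = op          # e.g. "quant", "getitem"
        self.args = args      # for getitem: (producer, output_index)
        self.users = []       # nodes that consume this node's output

def has_getitem_user(node, index):
    """True if some live consumer extracts output `index` of `node`."""
    return any(u.op == "getitem" and u.args[1] == index for u in node.users)

def allow_two_output_rewrite(quant_node):
    """Guard: only rewrite when the scale output (index 1) is unused."""
    return not has_getitem_user(quant_node, 1)

# Pruned-scale graph: only the quantized tensor (index 0) is consumed.
quant = Node("quant")
quant.users.append(Node("getitem", (quant, 0)))
assert allow_two_output_rewrite(quant)

# Live-scale graph: a consumer of the scale output blocks this rewrite,
# so the with-scale pattern branch must handle it instead.
quant.users.append(Node("getitem", (quant, 1)))
assert not allow_two_output_rewrite(quant)
```

With this split, the pruned-scale rewrite cannot mis-fire on a graph where the scale consumer survived dead-code elimination, which is the failure mode the PyTorch 26.02 update exposed.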

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Resolved failing user buffer compilation tests across FP16, BF16 data types, and multiple model configurations.
  • Tests

    • Added comprehensive test coverage for user buffer operations with all-reduce and FP8 quantization in distributed training scenarios.
    • Enhanced test infrastructure with detailed optimization pass execution tracking.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dhansen-nvidia dhansen-nvidia requested a review from a team as a code owner April 3, 2026 21:39
@dhansen-nvidia dhansen-nvidia requested a review from hyukn April 3, 2026 21:39
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

The changes refactor the Backend class to build per-instance custom passes instead of caching class-level pass instances, add per-pass match count aggregation during optimization, enhance FP8 quantization pattern matching with constraint helpers, remove test skip entries, and add new test cases for validation.

Changes

Cohort / File(s) Summary
Backend Architecture
tensorrt_llm/_torch/compilation/backend.py
Changed from class-level cached _custom_pass_instances to instance method build_custom_passes(...) that creates fresh passes per Backend instance; introduced match_count_by_pass OrderedDict to aggregate match counts per named pass during optimization; passes now explicitly named via PatternMatcherPass constructors with subsystem parameter.
Pattern Compilation Constants
tensorrt_llm/_torch/compilation/patterns/__init__.py
Added module-level constant MATCHER_SUBSYSTEM = "torch_compile" for use in pattern pass initialization.
FP8 Quantization Pattern Matching
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
Added helper functions _check_getitem_only_users(...), _has_getitem_user(...), and _make_fp8_quant_extra_check(...) for FP8-specific constraint validation; refactored register_ar_residual_norm_out_fp8_quant(...) and register_ar_residual_norm_fp8_quant(...) to split pattern registration into separate with-scale and without-scale branches; replaced _users=2 constraints with _users=MULTIPLE.
Test Skip Entries
tests/integration/test_lists/waives.txt
Removed 16 SKIP entries for test_user_buffers_pass and test_user_buffers_mm_add_prologue across combinations of dtype, token count, and hidden size.
User Buffers Tests
tests/unittest/_torch/multi_gpu/test_user_buffers.py
Added _assert_match_counts(...) helper to validate per-pass match counts via backend.match_count_by_pass; added run_single_rank_ar_rms_norm_fp8_live_scale_compile(...) to test AR norm fusion with UB passes and run_single_rank_backend_passes_are_per_instance(...) to verify per-instance pass independence; replaced positional match-count assertions with named pass count validation; added two new pytest test cases (test_user_buffers_ar_rms_norm_fp8_live_scale_compile, test_backend_passes_are_per_instance); changed exception handling in run_single_rank_ub_pass(...) to re-raise instead of returning False.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ❓ Inconclusive: The pull request description is provided but lacks the required structured format with distinct 'Description' and 'Test Coverage' sections as specified in the template. Resolution: add a proper 'Description' section explaining the issue and solution, and a 'Test Coverage' section listing relevant tests that safeguard these changes.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The PR title clearly identifies the NVBugs ticket and describes the main fix: hardening FP8 quantization fusion matching for PyTorch 26.02 compatibility.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/compilation/backend.py (1)

2-2: Consider using dict instead of OrderedDict.

Since Python 3.7+, regular dict maintains insertion order. Given the project requires Python 3.10+, OrderedDict is not necessary here unless you need specific OrderedDict methods like move_to_end().
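A quick illustration of the point: on Python 3.7+ the built-in dict preserves insertion order, so `OrderedDict` buys nothing here unless methods like `move_to_end()` are needed.

```python
# Built-in dict preserves insertion order on Python 3.7+, so an
# OrderedDict is unnecessary for plain ordered accumulation.
counts = {}
for name in ("pass_b", "pass_a", "pass_c"):
    counts[name] = 0
assert list(counts) == ["pass_b", "pass_a", "pass_c"]  # insertion order kept
```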

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/compilation/backend.py` at line 2, The code imports
OrderedDict but on Python 3.10+ insertion order is preserved by built-in dict;
remove the from collections import OrderedDict import and replace any uses of
OrderedDict with plain dict in this module (e.g., any constructors or type hints
referencing OrderedDict), unless you rely on OrderedDict-specific methods like
move_to_end(), in which case keep it; update variable initializations,
annotations, and tests to use dict where applicable (search for the symbol
OrderedDict in this file and substitute dict).
tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)

676-678: Inconsistent exception handling across test functions.

This function now re-raises exceptions after printing the traceback, but run_single_rank_ub_mm_add_pass (lines 969-972) and run_single_rank_ub_pass_fp4 (lines 1218-1221) still return False after printing the traceback. This inconsistency could lead to different test failure modes and make debugging harder.

Consider standardizing the exception handling pattern across all run_single_rank_* functions.
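The suggested standardized pattern can be sketched as follows; `run_single_rank_example` is a hypothetical stand-in for the `run_single_rank_*` helpers, not code from the PR.

```python
import traceback

def run_single_rank_example(work):
    """Sketch of the suggested pattern: print the traceback for log
    visibility, then re-raise so the failure propagates to the test
    runner instead of collapsing into a False return value."""
    try:
        return work()
    except Exception:
        traceback.print_exc()
        raise  # preserve the original exception and its traceback

# The failure surfaces as an exception, not a silent False:
handled = False
try:
    run_single_rank_example(lambda: 1 / 0)
except ZeroDivisionError:
    handled = True
assert handled
```

Re-raising keeps the original exception type and traceback intact, so a worker failure shows up as a real test error rather than a generic assertion on a boolean return.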

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/multi_gpu/test_user_buffers.py` around lines 676 - 678,
Standardize exception handling by having all run_single_rank_* helpers re-raise
exceptions instead of returning False after printing the traceback; update
run_single_rank_ub_mm_add_pass and run_single_rank_ub_pass_fp4 to mirror the
pattern used in the other function that prints the traceback and then does
"raise" (keeping the traceback.print_exc() call if desired) so failures
propagate as exceptions rather than silently returning False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/compilation/patterns/__init__.py`:
- Line 1: The file lacks the required NVIDIA copyright header at the top; add
the standard NVIDIA copyright header including the year of latest meaningful
modification to the top of tensorrt_llm/_torch/compilation/patterns/__init__.py
so that the file begins with the header before any code (e.g., before
MATCHER_SUBSYSTEM = "torch_compile"); ensure the header matches the project's
canonical header format and includes the appropriate year and ownership text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ae061a2d-2716-4aba-883c-037e4ae7ca31

📥 Commits

Reviewing files that changed from the base of the PR and between 7ee9e8b and daf5f25.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/compilation/patterns/__init__.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@dhansen-nvidia dhansen-nvidia changed the title [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after … [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update Apr 6, 2026
mojombo added 3 commits April 6, 2026 12:33
…PyTorch 26.02 update

The PyTorch 26.02 stack changed the traced graph shape around
static_quantize_e4m3_per_tensor: the unused scale getitem may be
dead-code-eliminated, but some graphs still retain a live scale
consumer.

Update the FP8 AR+Residual+RMSNorm quant fusion patterns to match both
live-scale and pruned-scale graphs, with an explicit guard so the
2-output rewrite only fires when the scale output is absent. Name the
custom passes and track aggregate match_count_by_pass totals so tests
can assert exact semantic pass totals instead of the fixed-point
bookkeeping trace.

Also build the custom pass pipeline per Backend instance rather than
sharing a process-global cache, and add regressions for the live-scale
compile path and per-instance backend pass configuration. This keeps the
user-buffer tests unwaived without reducing the intended fusion
coverage.

Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
@dhansen-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41982 [ run ] triggered by Bot. Commit: c6fad62 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41982 [ run ] completed with state SUCCESS. Commit: c6fad62
/LLM/main/L0_MergeRequest_PR pipeline #32835 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

