[https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update#12750

Open
dhansen-nvidia wants to merge 3 commits into NVIDIA:main from dhansen-nvidia:nvbug5940460_fix

Conversation

@dhansen-nvidia
Collaborator

@dhansen-nvidia dhansen-nvidia commented Apr 3, 2026

The PyTorch 26.02 stack changed the traced graph shape around static_quantize_e4m3_per_tensor: the unused scale getitem may be dead-code-eliminated, but some graphs still retain a live scale consumer.

Update the FP8 AR+Residual+RMSNorm quant fusion patterns to match both live-scale and pruned-scale graphs, with an explicit guard so the 2-output rewrite only fires when the scale output is absent. Name the custom passes and track aggregate match_count_by_pass totals so tests can assert exact semantic pass totals instead of the fixed-point bookkeeping trace.

Also build the custom pass pipeline per Backend instance rather than sharing a process-global cache, and add regressions for the live-scale compile path and per-instance backend pass configuration. This keeps the user-buffer tests unwaived without reducing the intended fusion coverage.
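The scale-presence guard described above can be sketched with a small stand-alone model. The `Node` class and helper names below are simplified stand-ins for `torch.fx` graph structures and for the helpers added in this PR; this is an illustrative sketch of the matching logic, not the actual implementation.

```python
# Minimal model of the scale-output guard: the 2-output rewrite may only
# fire when the quant op's scale getitem (output index 1) has been
# dead-code-eliminated. Node is a simplified stand-in for a torch.fx node.

class Node:
    def __init__(self, op, args=()):
        self.op = op          # e.g. "quant", "getitem"
        self.args = args      # for getitem: (producer, output_index)
        self.users = []       # nodes that consume this node's output

def has_getitem_user(node, index):
    """True if some live consumer extracts output `index` of `node`."""
    return any(u.op == "getitem" and u.args[1] == index for u in node.users)

def allow_two_output_rewrite(quant_node):
    """Guard: only rewrite when the scale output (index 1) is unused."""
    return not has_getitem_user(quant_node, 1)

# Pruned-scale graph: only the quantized tensor (index 0) is consumed.
quant = Node("quant")
quant.users.append(Node("getitem", (quant, 0)))
assert allow_two_output_rewrite(quant)

# Live-scale graph: a consumer of the scale output blocks this rewrite,
# so the with-scale pattern branch must handle it instead.
quant.users.append(Node("getitem", (quant, 1)))
assert not allow_two_output_rewrite(quant)
```

With this split, the pruned-scale rewrite cannot mis-fire on a graph where the scale consumer survived dead-code elimination, which is the failure mode the PyTorch 26.02 update exposed.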

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Resolved failing user buffer compilation tests across FP16, BF16 data types, and multiple model configurations.
  • Tests

    • Added comprehensive test coverage for user buffer operations with all-reduce and FP8 quantization in distributed training scenarios.
    • Enhanced test infrastructure with detailed optimization pass execution tracking.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dhansen-nvidia dhansen-nvidia requested a review from a team as a code owner April 3, 2026 21:39
@dhansen-nvidia dhansen-nvidia requested a review from hyukn April 3, 2026 21:39
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

The changes refactor the Backend class to build per-instance custom passes instead of caching class-level pass instances, add per-pass match count aggregation during optimization, enhance FP8 quantization pattern matching with constraint helpers, remove test skip entries, and add new test cases for validation.

Changes

Cohort / File(s) Summary
Backend Architecture
tensorrt_llm/_torch/compilation/backend.py
Changed from class-level cached _custom_pass_instances to instance method build_custom_passes(...) that creates fresh passes per Backend instance; introduced match_count_by_pass OrderedDict to aggregate match counts per named pass during optimization; passes now explicitly named via PatternMatcherPass constructors with subsystem parameter.
Pattern Compilation Constants
tensorrt_llm/_torch/compilation/patterns/__init__.py
Added module-level constant MATCHER_SUBSYSTEM = "torch_compile" for use in pattern pass initialization.
FP8 Quantization Pattern Matching
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
Added helper functions _check_getitem_only_users(...), _has_getitem_user(...), and _make_fp8_quant_extra_check(...) for FP8-specific constraint validation; refactored register_ar_residual_norm_out_fp8_quant(...) and register_ar_residual_norm_fp8_quant(...) to split pattern registration into separate with-scale and without-scale branches; replaced _users=2 constraints with _users=MULTIPLE.
Test Skip Entries
tests/integration/test_lists/waives.txt
Removed 16 SKIP entries for test_user_buffers_pass and test_user_buffers_mm_add_prologue across combinations of dtype, token count, and hidden size.
User Buffers Tests
tests/unittest/_torch/multi_gpu/test_user_buffers.py
Added _assert_match_counts(...) helper to validate per-pass match counts via backend.match_count_by_pass; added run_single_rank_ar_rms_norm_fp8_live_scale_compile(...) to test AR norm fusion with UB passes and run_single_rank_backend_passes_are_per_instance(...) to verify per-instance pass independence; replaced positional match-count assertions with named pass count validation; added two new pytest test cases (test_user_buffers_ar_rms_norm_fp8_live_scale_compile, test_backend_passes_are_per_instance); changed exception handling in run_single_rank_ub_pass(...) to re-raise instead of returning False.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ❓ Inconclusive: The pull request description is provided but lacks the required structured format with distinct 'Description' and 'Test Coverage' sections as specified in the template. Resolution: add a proper 'Description' section explaining the issue and solution, and a 'Test Coverage' section listing relevant tests that safeguard these changes.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The PR title clearly identifies the NVBugs ticket and describes the main fix: hardening FP8 quantization fusion matching for PyTorch 26.02 compatibility.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/compilation/backend.py (1)

2-2: Consider using dict instead of OrderedDict.

Since Python 3.7+, regular dict maintains insertion order. Given the project requires Python 3.10+, OrderedDict is not necessary here unless you need specific OrderedDict methods like move_to_end().
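A quick illustration of the point: on Python 3.7+ the built-in dict preserves insertion order, so `OrderedDict` buys nothing here unless methods like `move_to_end()` are needed.

```python
# Built-in dict preserves insertion order on Python 3.7+, so an
# OrderedDict is unnecessary for plain ordered accumulation.
counts = {}
for name in ("pass_b", "pass_a", "pass_c"):
    counts[name] = 0
assert list(counts) == ["pass_b", "pass_a", "pass_c"]  # insertion order kept
```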

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/compilation/backend.py` at line 2, The code imports
OrderedDict but on Python 3.10+ insertion order is preserved by built-in dict;
remove the from collections import OrderedDict import and replace any uses of
OrderedDict with plain dict in this module (e.g., any constructors or type hints
referencing OrderedDict), unless you rely on OrderedDict-specific methods like
move_to_end(), in which case keep it; update variable initializations,
annotations, and tests to use dict where applicable (search for the symbol
OrderedDict in this file and substitute dict).
tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)

676-678: Inconsistent exception handling across test functions.

This function now re-raises exceptions after printing the traceback, but run_single_rank_ub_mm_add_pass (lines 969-972) and run_single_rank_ub_pass_fp4 (lines 1218-1221) still return False after printing the traceback. This inconsistency could lead to different test failure modes and make debugging harder.

Consider standardizing the exception handling pattern across all run_single_rank_* functions.
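The suggested standardized pattern can be sketched as follows; `run_single_rank_example` is a hypothetical stand-in for the `run_single_rank_*` helpers, not code from the PR.

```python
import traceback

def run_single_rank_example(work):
    """Sketch of the suggested pattern: print the traceback for log
    visibility, then re-raise so the failure propagates to the test
    runner instead of collapsing into a False return value."""
    try:
        return work()
    except Exception:
        traceback.print_exc()
        raise  # preserve the original exception and its traceback

# The failure surfaces as an exception, not a silent False:
handled = False
try:
    run_single_rank_example(lambda: 1 / 0)
except ZeroDivisionError:
    handled = True
assert handled
```

Re-raising keeps the original exception type and traceback intact, so a worker failure shows up as a real test error rather than a generic assertion on a boolean return.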

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/multi_gpu/test_user_buffers.py` around lines 676 - 678,
Standardize exception handling by having all run_single_rank_* helpers re-raise
exceptions instead of returning False after printing the traceback; update
run_single_rank_ub_mm_add_pass and run_single_rank_ub_pass_fp4 to mirror the
pattern used in the other function that prints the traceback and then does
"raise" (keeping the traceback.print_exc() call if desired) so failures
propagate as exceptions rather than silently returning False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/compilation/patterns/__init__.py`:
- Line 1: The file lacks the required NVIDIA copyright header at the top; add
the standard NVIDIA copyright header including the year of latest meaningful
modification to the top of tensorrt_llm/_torch/compilation/patterns/__init__.py
so that the file begins with the header before any code (e.g., before
MATCHER_SUBSYSTEM = "torch_compile"); ensure the header matches the project's
canonical header format and includes the appropriate year and ownership text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ae061a2d-2716-4aba-883c-037e4ae7ca31

📥 Commits

Reviewing files that changed from the base of the PR and between 7ee9e8b and daf5f25.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/compilation/backend.py
  • tensorrt_llm/_torch/compilation/patterns/__init__.py
  • tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@dhansen-nvidia dhansen-nvidia changed the title [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after … [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update Apr 6, 2026
mojombo added 3 commits April 6, 2026 12:33
…PyTorch 26.02 update

The PyTorch 26.02 stack changed the traced graph shape around
static_quantize_e4m3_per_tensor: the unused scale getitem may be
dead-code-eliminated, but some graphs still retain a live scale
consumer.

Update the FP8 AR+Residual+RMSNorm quant fusion patterns to match both
live-scale and pruned-scale graphs, with an explicit guard so the
2-output rewrite only fires when the scale output is absent. Name the
custom passes and track aggregate match_count_by_pass totals so tests
can assert exact semantic pass totals instead of the fixed-point
bookkeeping trace.

Also build the custom pass pipeline per Backend instance rather than
sharing a process-global cache, and add regressions for the live-scale
compile path and per-instance backend pass configuration. This keeps the
user-buffer tests unwaived without reducing the intended fusion
coverage.

Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
@dhansen-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41982 [ run ] triggered by Bot. Commit: c6fad62 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41982 [ run ] completed with state SUCCESS. Commit: c6fad62
/LLM/main/L0_MergeRequest_PR pipeline #32835 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

