[https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update#12750
dhansen-nvidia wants to merge 3 commits into NVIDIA:main
Conversation
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/compilation/backend.py (1)
2-2: Consider using `dict` instead of `OrderedDict`. Since Python 3.7+, regular `dict` maintains insertion order. Given that the project requires Python 3.10+, `OrderedDict` is not necessary here unless you need `OrderedDict`-specific methods like `move_to_end()`.
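A quick illustration of the point: on the supported Python versions a plain `dict` iterates in insertion order exactly like `OrderedDict`, and `move_to_end()` is the main method a plain `dict` lacks (the pop-and-reinsert below is its equivalent). Key names are illustrative only.

```python
from collections import OrderedDict

# On Python 3.7+, plain dict preserves insertion order, so iteration
# order matches the order in which keys were added -- same as OrderedDict.
d = {"pass_a": 1, "pass_b": 2, "pass_c": 3}
od = OrderedDict(d)
assert list(d) == list(od) == ["pass_a", "pass_b", "pass_c"]

# move_to_end() is the main OrderedDict-only method; with a plain dict
# the equivalent is a pop-and-reinsert of the same key.
od.move_to_end("pass_a")
d["pass_a"] = d.pop("pass_a")
assert list(d) == list(od) == ["pass_b", "pass_c", "pass_a"]
```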
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/compilation/backend.py` at line 2: the code imports OrderedDict, but on Python 3.10+ insertion order is preserved by the built-in dict; remove the `from collections import OrderedDict` import and replace any uses of OrderedDict with plain dict in this module (constructors, type hints, variable initializations, and annotations), unless you rely on OrderedDict-specific methods like move_to_end(), in which case keep it; search for the symbol OrderedDict in this file, substitute dict, and update tests where applicable.

tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)
676-678: Inconsistent exception handling across test functions. This function now re-raises exceptions after printing the traceback, but run_single_rank_ub_mm_add_pass (lines 969-972) and run_single_rank_ub_pass_fp4 (lines 1218-1221) still return False after printing the traceback. This inconsistency could lead to different test failure modes and make debugging harder. Consider standardizing the exception handling pattern across all run_single_rank_* functions.
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/multi_gpu/test_user_buffers.py` around lines 676-678: standardize exception handling by having all run_single_rank_* helpers re-raise exceptions instead of returning False after printing the traceback; update run_single_rank_ub_mm_add_pass and run_single_rank_ub_pass_fp4 to mirror the pattern used in the other function, which calls traceback.print_exc() and then does a bare `raise`, so failures propagate as exceptions rather than being silently reduced to a False return.
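The recommended pattern can be sketched as follows; `run_single_rank` is a hypothetical stand-in for the `run_single_rank_*` helpers, not the actual test code.

```python
import traceback


def run_single_rank(job):
    """Hypothetical per-rank test helper following the suggested pattern:
    print the traceback for log visibility, then re-raise so the failure
    propagates to the test runner instead of being folded into a bool."""
    try:
        return job()
    except Exception:
        traceback.print_exc()  # keep the diagnostic in the rank's log
        raise                  # ...but let the exception propagate


# A helper that instead returned False here would leave the parent
# process with only a failed boolean, hiding the original exception
# and stack trace from the test report.
```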
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/compilation/patterns/__init__.py`:
- Line 1: The file lacks the required NVIDIA copyright header at the top; add
the standard NVIDIA copyright header including the year of latest meaningful
modification to the top of tensorrt_llm/_torch/compilation/patterns/__init__.py
so that the file begins with the header before any code (e.g., before
MATCHER_SUBSYSTEM = "torch_compile"); ensure the header matches the project's
canonical header format and includes the appropriate year and ownership text.
📒 Files selected for processing (5)
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/compilation/patterns/__init__.py
- tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py
- tests/integration/test_lists/waives.txt
- tests/unittest/_torch/multi_gpu/test_user_buffers.py
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
daf5f25 to c6fad62
/bot run

PR_Github #41982 [ run ] triggered by Bot.

PR_Github #41982 [ run ] completed.
The PyTorch 26.02 stack changed the traced graph shape around static_quantize_e4m3_per_tensor: the unused scale getitem may be dead-code-eliminated, but some graphs still retain a live scale consumer.
Update the FP8 AR+Residual+RMSNorm quant fusion patterns to match both live-scale and pruned-scale graphs, with an explicit guard so the 2-output rewrite only fires when the scale output is absent. Name the custom passes and track aggregate match_count_by_pass totals so tests can assert exact semantic pass totals instead of the fixed-point bookkeeping trace.
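The dead-scale guard can be sketched in plain Python; the `Node` model and function name below are hypothetical stand-ins for illustration, not the actual torch.fx pattern-matcher API used in the fusion passes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a traced-graph node (hypothetical model)."""
    op: str
    users: list = field(default_factory=list)


def scale_output_is_dead(quant_node: Node) -> bool:
    """True when the scale getitem was either dead-code-eliminated or has
    no remaining consumers; only then may the 2-output rewrite fire."""
    scale_getitems = [u for u in quant_node.users if u.op == "getitem_scale"]
    return all(len(g.users) == 0 for g in scale_getitems)


# Pruned-scale graph: only the FP8 tensor output is consumed.
quant = Node("static_quantize_e4m3_per_tensor",
             users=[Node("getitem_fp8", users=[Node("norm_consumer")])])
assert scale_output_is_dead(quant)

# Live-scale graph: the scale getitem still has a consumer, so the
# 2-output rewrite must not fire and the 3-output pattern matches instead.
quant.users.append(Node("getitem_scale", users=[Node("scale_consumer")]))
assert not scale_output_is_dead(quant)
```

The point of the guard is that matching is keyed on consumer liveness rather than on graph shape alone, so both the pre- and post-DCE traces stay covered.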
Also build the custom pass pipeline per Backend instance rather than sharing a process-global cache, and add regressions for the live-scale compile path and per-instance backend pass configuration. This keeps the user-buffer tests unwaived without reducing the intended fusion coverage.
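The per-instance change can be illustrated abstractly; the `Backend` class, constructor flag, and pass names below are simplified stand-ins, not the real TensorRT-LLM classes.

```python
class Backend:
    """Sketch: build the custom pass list per instance instead of reading
    it from a process-global cache, so two differently configured
    backends in one process no longer see each other's passes."""

    def __init__(self, enable_ub_passes: bool = False):
        # Per-instance pipeline, constructed fresh for each Backend.
        self.custom_passes = ["fp8_ar_residual_norm_quant"]
        if enable_ub_passes:
            self.custom_passes.append("ub_mm_add")
        # Per-instance match accounting keyed by pass name, so tests can
        # assert exact semantic totals per pass.
        self.match_count_by_pass = {name: 0 for name in self.custom_passes}


a = Backend()
b = Backend(enable_ub_passes=True)
assert a.custom_passes == ["fp8_ar_residual_norm_quant"]
assert b.custom_passes == ["fp8_ar_residual_norm_quant", "ub_mm_add"]

# Mutating one instance's counters leaves the other instance untouched,
# which a shared process-global cache would not guarantee.
b.match_count_by_pass["ub_mm_add"] += 1
assert a.match_count_by_pass == {"fp8_ar_residual_norm_quant": 0}
```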
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.