[tmp] Opt for MXFP4 (gpt-oss-120b) by taylor-yb-lee · Pull Request #251 · nv-auto-deploy/TensorRT-LLM

taylor-yb-lee · 2026-05-14T04:47:48Z

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…r test (NVIDIA#14335) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>

…13842) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Michal Guzek <mguzek@nvidia.com> Co-authored-by: Michal Guzek <mguzek@nvidia.com>

…A#14055) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

NVIDIA#14267) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

… format (NVIDIA#14088) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

…VIDIA#14170) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

Deletes unused make_mxfp4_trtllm_load_hook + _get_default_dist_info + _hook_dist_info_fn and stale V4-plan / make_mxfp4_ep_slice_load_hook docstring references. Relocates make_mxfp4_sharding_load_hook to transform/library/mxfp4_moe.py next to its only caller. Renames swizzle_moe_mxfp4_weights{.py,()} -> prepare_trtllm_gen_moe_mxfp4_weights{.py,()} and PreparedMXFP4Weights -> TRTLLMGenMXFP4MoEWeights. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ability Unifies the four near-identical _shuffle_per_expert_{w3_w1,w2,bias_*} helpers into one _shuffle_per_expert(permute_fn, ...) plus a single-expert helper. Extracts the long main function's six sections (de-interleave, TP slice, pad weights, pad scales, shuffle, prepare biases) into private helpers that absorb the scratch=None/else branching, so the top-level function is ~100 lines of sequential helper calls instead of ~400 lines of repeated if/else blocks. No public API or behavior change. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
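A minimal sketch of the "one helper parameterized by a permute function" shape described above, using plain Python lists instead of tensors. The names `shuffle_single_expert`, `interleave_halves`, and `reverse_rows` are hypothetical stand-ins, not the helpers in this PR, and the two permutations are illustrative rather than the real TMA layouts.

```python
from typing import Callable, List, Sequence

Row = List[float]
Expert = List[Row]  # one expert's 2-D tensor, modeled as a list of rows


def shuffle_single_expert(rows: Expert, permute_fn: Callable[[int], Sequence[int]]) -> Expert:
    """Reorder one expert's rows using the index list returned by permute_fn."""
    order = permute_fn(len(rows))
    return [rows[i] for i in order]


def shuffle_per_expert(stacked: List[Expert], permute_fn: Callable[[int], Sequence[int]]) -> List[Expert]:
    """Apply the same row permutation rule to every expert in the stack."""
    return [shuffle_single_expert(expert, permute_fn) for expert in stacked]


def interleave_halves(n: int) -> List[int]:
    """Hypothetical permutation: alternate rows from the first and second half."""
    half = n // 2
    order: List[int] = []
    for i in range(half):
        order.extend([i, half + i])
    return order


def reverse_rows(n: int) -> List[int]:
    """Hypothetical permutation: reverse the row order."""
    return list(range(n - 1, -1, -1))


stacked = [[[0.0], [1.0], [2.0], [3.0]], [[10.0], [11.0], [12.0], [13.0]]]
interleaved = shuffle_per_expert(stacked, interleave_halves)
reversed_rows = shuffle_per_expert(stacked, reverse_rows)
```

Passing the permutation in as an argument is what lets one loop body serve every weight and bias variant; only the index rule differs per call site.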

…ine TP-slice no-op Compresses module + dataclass + helper docstrings (per-section helpers already carry focused docstrings); removes redundant PT mirror text and duplicated layout tables, keeps WHY notes. Moves make_swiglu_param_tensors next to its only caller InsertMXFP4MLP. Folds the tp_size==1 early-return into _tp_slice_intermediate_axis so the main function drops the if/else. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Matches the fused_moe.py convention (one file per MoE backend, holding both pattern-matcher and post-load-fusion transforms together); the ``fused_moe_*`` prefix groups them naturally next to fused_moe.py. No code change beyond the rename + a docstring reference in modeling_gpt_oss.py. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…FuseMXFP4Moe docstring MatchMOEDenseMLP -> MatchMXFP4MoePattern (Match{Backend}MoePattern convention). InsertMXFP4MLP{,Config} -> QuantizeMXFP4MOE{,Config} (quantize_*_moe convention, matches QuantizeFP8MOE / QuantizeNVFP4MOE in quantize_moe.py). Drops the verbose 1-5 step list + skipping rules from FuseMXFP4Moe — those are visible in the code itself; keeps only the WHEN/WHERE-it-runs context. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…TE in default.yaml examples/auto_deploy/model_registry/configs/gpt_oss_120b{,_tp2}.yaml: drop the unused world_size key and switch on apply_sharding_hints for mha+moe. tensorrt_llm/_torch/auto_deploy/config/default.yaml: remove the 10-line NOTE above fuse_mxfp4_moe — the legacy transform reference is in git history and the active transform pair is documented in their respective class docstrings. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…_gen_* Aligns with the existing trtllm_quant_{fp8,nvfp4,finegrained_fp8}_moe_fused naming and adds the ``trtllm_gen`` family marker (mirrors trtllm_nvfp4_trtllm_gen_moe_fused). Affected ops:
* trtllm_mxfp4_w4a16_moe_fused -> trtllm_quant_mxfp4_trtllm_gen_w4a16_moe_fused
* trtllm_mxfp4_w4a8_moe_fused -> trtllm_quant_mxfp4_trtllm_gen_w4a8_moe_fused
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Replaces trtllm_quant_mxfp4_trtllm_gen_{w4a16,w4a8}_moe_fused with a single trtllm_quant_mxfp4_trtllm_gen_moe_fused op that branches on a required ``act_dtype: str`` arg ("bf16" → bf16_mxe2m1 runner, "mxfp8" → mxfp8_quantize + mxe4m3_mxe2m1 runner). Caller selects via config.trtllm_quant_act. Also drops model-specific phrasing in two comments (linear.py cublas_mm branch reframed as model-agnostic; fused_moe_mxfp4.py bf16-free comment trimmed of GPT-OSS-120B size quantification). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
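The dispatch contract described above can be sketched roughly as follows. The runner functions here are hypothetical stubs that only return a label; the real op calls GPU kernels, and its actual signature is not shown in this PR page. The `ValueError` on an unknown `act_dtype` follows the test coverage described in a later commit.

```python
from typing import Callable, Dict, List


def run_bf16_path(x: List[float]) -> str:
    # Hypothetical stand-in for the bf16-activation runner (bf16_mxe2m1).
    return "bf16_mxe2m1"


def run_mxfp8_path(x: List[float]) -> str:
    # Hypothetical stand-in: quantize activations to MXFP8, then run mxe4m3_mxe2m1.
    return "mxfp8_quantize+mxe4m3_mxe2m1"


_RUNNERS: Dict[str, Callable[[List[float]], str]] = {
    "bf16": run_bf16_path,
    "mxfp8": run_mxfp8_path,
}


def moe_fused(x: List[float], act_dtype: str) -> str:
    """Single entry point that picks the runner from a required act_dtype string."""
    try:
        runner = _RUNNERS[act_dtype]
    except KeyError:
        supported = ", ".join(sorted(_RUNNERS))
        raise ValueError(f"unsupported act_dtype {act_dtype!r}; expected one of: {supported}") from None
    return runner(x)
```

Making `act_dtype` a required argument keeps the choice explicit at the call site instead of hiding it in a default.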

…elpers Modeling-side __init__ code no longer reads the active DistConfig via the contextvars-backed get_active_dist_config (that path moved to the transform sharding load hook + FuseMXFP4Moe), so the helpers in dist_config.py have no callers. Removes _ACTIVE_DIST_CONFIG / get_active_dist_config / use_dist_config and their dead imports. build_model.py is unchanged. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…ments from base Brings back the architecture summary, AD-canonical-ops list, and inline forward annotations from the 3ae0b70 base that got dropped during the sharding-IR rewrite, while keeping the new sharding-hint sections of the docstring + the existing code. Also trims the now-redundant lm_head / registration comments (covered by the module docstring or stale). Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Covers the new bias-aware fusion added to FuseGemms vs the 3ae0b70 base:
* All-bias siblings fuse into one linear with stacked bias (concat dim=0).
* Mixed bias / no-bias siblings on the same parent get bucketed separately (one fused with-bias linear + one fused no-bias linear).
Existing FusableModel3's stale "no bias support yet" note is updated to reflect the new bucketing behavior.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
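A rough illustration of the bucketing rule in the two bullets above, with weights as nested lists. The `Linear` dataclass and `fuse_siblings` are hypothetical and do not mirror FuseGemms internals. Leaving a bucket with a single member unfused is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Linear:
    name: str
    weight: List[List[float]]    # [out_features][in_features]
    bias: Optional[List[float]]  # [out_features], or None when the layer has no bias


def fuse_siblings(siblings: List[Linear]) -> List[Linear]:
    """Fuse linears sharing one input, keeping with-bias and no-bias layers apart."""
    buckets: Dict[bool, List[Linear]] = {True: [], False: []}
    for lin in siblings:
        buckets[lin.bias is not None].append(lin)

    fused: List[Linear] = []
    for has_bias, group in buckets.items():
        if len(group) < 2:
            fused.extend(group)  # assumption: a lone layer is left as-is
            continue
        # Stack along the output dimension (the dim=0 concat for weight and bias).
        weight = [row for lin in group for row in lin.weight]
        bias = [b for lin in group for b in lin.bias] if has_bias else None
        fused.append(Linear("+".join(lin.name for lin in group), weight, bias))
    return fused


siblings = [
    Linear("q", [[1.0, 0.0]], [0.5]),
    Linear("g", [[2.0, 2.0]], None),
    Linear("k", [[0.0, 1.0]], [-0.5]),
    Linear("u", [[3.0, 3.0]], None),
]
result = fuse_siblings(siblings)
```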

…ly override Pipeline trace confirms the only module-level dtype walk is ``QuantConfigReader.post_process_model``'s ``model.to(new_dtype)``, which fires *before* PATTERN_MATCHER. At that point ``_dtype_protected_params`` is unset (FuseMXFP4Moe sets it later) so the override degenerates to a normal ``nn.Module._apply``. No subsequent ``gm.to(dtype)`` exists. Removes the override + both transform-side ``_dtype_protected_params`` setters + the matching docstrings. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

…-out lines
* Remove "match PT test_w4_1gpu" trailing comment on GSM8K_MAX_OUTPUT_LEN.
* Remove the MODEL_PARAMS entry-format docstring; the parametrize names and the if-elif moe_topology dispatch below are already self-describing.
* Remove the commented-out ``marks=pytest.mark.skip_less_device(4)``.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

… only for MoE topology Source of truth for ``world_size`` is the registry (``_get_registry_yaml_extra``'s 2nd return value — defaults to 1 when yaml doesn't carry an explicit ``world_size_N.yaml``); MODEL_PARAMS' 3rd column is a thin ``world_size_override`` used only for the ``120b-tp2`` / ``120b-ep2`` cases that exercise MoE-TP / MoE-EP on top of the same yaml. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
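The precedence described above (registry value, defaulting to 1, with an optional per-case override) can be sketched as below. `resolve_world_size` is a hypothetical helper written for illustration; it is not a function from the test file.

```python
from typing import Optional


def resolve_world_size(registry_world_size: Optional[int], override: Optional[int]) -> int:
    """Pick the effective world size: override wins, then registry, then 1."""
    base = registry_world_size if registry_world_size is not None else 1
    return override if override is not None else base
```

For example, `resolve_world_size(None, 2)` models a case such as ``120b-tp2``, where the yaml carries no explicit world size and the parametrize column supplies 2.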

Consolidates the prep-helper invariants (previously in test_prepare_trtllm_gen_moe_mxfp4_weights.py) and the unified op's act_dtype-dispatch contract into a single file at tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_trtllm_quant_mxfp4_trtllm_gen_moe.py. Coverage:
* fc1 / fc2 bias rows must follow the SAME TMA permute as the weights (gated-act-gemm + epilogue-tile reorder for w3/w1; epilogue-tile reorder only for w2). This is the guard for the gpt-oss-120b GSM8K 2% bug.
* Byte-identical match against PT's MXFP4 reference loader.
* ``act_dtype="bf16"`` and ``act_dtype="mxfp8"`` both run end-to-end on Blackwell+; invalid ``act_dtype`` raises ``ValueError``.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Pins the POST_LOAD_FUSION contract of ``FuseMXFP4Moe``:
* Raw HF MXFP4 buffers (``gate_up_proj_{blocks,scales,bias}`` / ``down_proj_{blocks,scales,bias}``) are deleted and replaced by the six prepared ``*_trtllm`` params on the experts module.
* The ``trtllm_quant_mxfp4_trtllm_gen_moe_fused`` op's weight/bias arg slots (4..9) are re-pointed at the new prepared get_attr nodes.
* ``moe_tp_size > 1`` divides ONLY ``fc2_bias_trtllm`` by ``moe_tp_size`` (so the post-AR sum reproduces the unsharded bias); all other prepared tensors match the TP=1 prep output byte-for-byte.
* Re-running on an already-prepped graph is a no-op (idempotent skip).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
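A small worked example of why the fc2 bias is divided by ``moe_tp_size``: each rank adds its share of the bias to a partial dot product, and the all-reduce sum then adds the bias back exactly once. The scalar model and the names `rank_partial` and `all_reduce_sum` are illustrative assumptions, not code from this PR.

```python
from typing import List


def rank_partial(x_slice: List[float], w_slice: List[float], bias: float,
                 tp_size: int, scale_bias: bool = True) -> float:
    """One rank's partial output: dot product over its slice plus its bias share."""
    acc = sum(x * w for x, w in zip(x_slice, w_slice))
    return acc + (bias / tp_size if scale_bias else bias)


def all_reduce_sum(partials: List[float]) -> float:
    """Stand-in for the all-reduce (AR) across TP ranks."""
    return sum(partials)


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 2.0, 1.0]
bias = 3.0
tp_size = 2

unsharded = sum(a * b for a, b in zip(x, w)) + bias

# Each rank owns half of the reduction axis.
scaled = all_reduce_sum([
    rank_partial(x[:2], w[:2], bias, tp_size),
    rank_partial(x[2:], w[2:], bias, tp_size),
])

# Without the division the bias is counted once per rank.
unscaled = all_reduce_sum([
    rank_partial(x[:2], w[:2], bias, tp_size, scale_bias=False),
    rank_partial(x[2:], w[2:], bias, tp_size, scale_bias=False),
])
```

Here `scaled` equals `unsharded`, while `unscaled` overshoots by `bias * (tp_size - 1)`.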

…g_transform_executor disables ``apply_sharding_hints`` is the only sharding pass actually used here; the explicit ``enabled: false`` lines for ``detect_sharding`` and ``sharding_transform_executor`` are no-ops (those passes are off by default in this pipeline) and just add noise. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- 20B now inherits the same MXFP4/sharding/fuse transforms as 120B (it was missing them despite being MXFP4 too).
- world_size moves out of the model yaml and is supplied by the registry's world_size_N.yaml overlay.
- models.yaml, cookbook, and supported-models.md all point at the unified config.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

get_sm_version() is already @lru_cache(maxsize=1), so the manual _SM_VERSION cache adds nothing. The try/except fallback to 0 was dead defensive code: this branch only triggers on CUDA bf16 tensors, where torch.cuda.get_device_properties(0) cannot fail. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
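To illustrate why a manual cache on top of ``@lru_cache(maxsize=1)`` is redundant, here is a stub with a call counter. `get_sm_version_stub` and its return value are placeholders; the real function queries the device.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the function body actually runs


@lru_cache(maxsize=1)
def get_sm_version_stub() -> int:
    CALLS["n"] += 1
    return 100  # arbitrary placeholder value, not a real device query


first = get_sm_version_stub()
second = get_sm_version_stub()  # served from the cache; the body does not run again
```

After the two calls the body has run once, which is the same guarantee a hand-rolled module-level cache variable would give.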

…XFP4 prep _tp_slice_intermediate_axis() pre-pads I to i_padded_tp before slicing, and _get_weight_alignment() guarantees the alignment is a multiple of tp_size, so the helper already handles non-tp-divisible intermediate sizes (the original I is never reused downstream — only per_rank_i is). The guard rejected exactly the shapes the helper was designed to support. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
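A sketch of the pad-then-slice order described above, showing that an intermediate size that is not divisible by ``tp_size`` still yields equal per-rank slices once it is padded to an alignment that is a multiple of ``tp_size``. `tp_slice_bounds` and the sizes used are hypothetical, and `pad_up` is a local stand-in.

```python
from typing import Tuple


def pad_up(x: int, multiple: int) -> int:
    """Round x up to the next multiple (local stand-in for the shared helper)."""
    return ((x + multiple - 1) // multiple) * multiple


def tp_slice_bounds(intermediate_size: int, alignment: int, tp_size: int, rank: int) -> Tuple[int, int]:
    """Return [start, end) of this rank's slice along the padded intermediate axis."""
    if alignment % tp_size != 0:
        raise ValueError("alignment must be a multiple of tp_size")
    i_padded = pad_up(intermediate_size, alignment)  # pad first ...
    per_rank_i = i_padded // tp_size                 # ... so this division is exact
    start = rank * per_rank_i
    return start, start + per_rank_i


# 10 is not divisible by 4, but the padded size 16 is.
bounds = [tp_slice_bounds(10, 8, 4, rank) for rank in range(4)]
```

With ``tp_size == 1`` the same function returns the whole padded range, so no separate early-return branch is needed in this sketch.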

…llers already contiguous where needed).
- _shuffle_per_expert: drop per-expert loop; batched torch.index_select on dim=1 (permute derived once on stacked[0], _PERMUTE_CACHE is shape-keyed).
- default.yaml: comment fuse_mxfp4_moe.expect_mem_change with alignment-padding rationale.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

- _register_mxfp4_expert_params (Triton): torch.zeros -> torch.empty with device=gu_w.device.
- _apply_trtllm: raw_specs + make_swiglu_param_tensors now use device=gu_w_t.device.
- Avoids materializing giant CPU buffers on meta-device builds (GPT-OSS-120B); load hook overwrites bytes anyway.
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

This reverts commit ef50383. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

… only for HF-loaded params Previous "Keep MXFP4 placeholders on existing param device" change (ef50383) also routed make_swiglu_param_tensors through param_device, which is meta on the normal build. swiglu_alpha/beta/limit (1.702/1.0/7.0) are NOT in HF safetensors, so meta tensors silently dropped the values and tanked GSM8K. Restore the memory-saving device reuse for raw HF buffers (blocks/scales/bias) and keep SwiGLU constants on CPU with real values; add a comment so it isn't re-broken. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Replace three inline ((x + a - 1) // a) * a expressions with tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
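For reference, the round-up arithmetic being replaced can be checked against a ceiling-based form. The `pad_up` below is a local re-implementation for illustration; its argument order is an assumption and it is not imported from ``tensorrt_llm.math_utils``.

```python
import math


def pad_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple, using the inline expression from the commit."""
    return ((x + multiple - 1) // multiple) * multiple


def pad_up_ceil(x: int, multiple: int) -> int:
    """Equivalent form via ceiling division, used only as a cross-check."""
    return math.ceil(x / multiple) * multiple


samples = [(0, 32), (1, 32), (256, 128), (2880, 128)]
padded = [pad_up(x, a) for x, a in samples]
```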

…t inline Pull the GPT-OSS sharding invariants into gpt_oss.yaml so the model registry is the single source of truth:
- detect_sharding.enabled=false
- sharding_transform_executor.enabled=false
Both were inlined in TestGPTOSS.test_mxfp4_gsm8k only for the tp2/ep2 parametrize cases; with them in yaml, trtllm-serve via this config now uses the same apply_sharding_hints-only sharding path as the test.
Test inline keeps only the per-parametrize dist_mapping override; the already-duplicated apply_sharding_hints.{enabled, requires_shape_prop, shard_layers} keys (also present in yaml) are dropped. pydantic-settings deep-merges init kwargs into yaml-sourced transforms, so the effective config is unchanged across all 4 parametrize cases.
Pattern matches _IR_SHARDING_TRANSFORMS used by the existing IR-sharding tests (TestNemotronSuperV3_IR, TestQwen3_5_MoE_IR).
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
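The deep-merge behavior this commit relies on can be pictured with a plain dictionary merge. This is a generic sketch of deep-merge semantics, not pydantic-settings' implementation; the nesting of ``dist_mapping`` under ``apply_sharding_hints`` and the ``moe_tp`` key are assumptions made for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override layered on top, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


yaml_cfg = {
    "transforms": {
        "detect_sharding": {"enabled": False},
        "sharding_transform_executor": {"enabled": False},
        "apply_sharding_hints": {"enabled": True},
    }
}
# Hypothetical per-parametrize override supplied inline by a test.
test_kwargs = {"transforms": {"apply_sharding_hints": {"dist_mapping": {"moe_tp": 2}}}}

effective = deep_merge(yaml_cfg, test_kwargs)
```

Because the merge descends into nested dicts, the inline override only adds its own key and the yaml-sourced ``enabled`` flags survive untouched.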

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

github-actions Bot assigned taylor-yb-lee May 14, 2026

taylor-yb-lee force-pushed the taylor/gpt-oss-0511_rebase_0511 branch 21 times, most recently from 93809db to 25a4e4e on May 21, 2026 01:09

Shixiaowei02 and others added 8 commits May 21, 2026 09:57

[https://nvbugs/6114141][test] Remove deprecated disagg trtllm_sample…

b3fcf08

…r test (NVIDIA#14335) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

[None][infra] Check in most recent lock file from nightly pipeline

57060a7

Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>

[None][doc] Add Claude skill for multimodal model onboarding (NVIDIA#…

67b0654

…13842) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Michal Guzek <mguzek@nvidia.com> Co-authored-by: Michal Guzek <mguzek@nvidia.com>

[https://nvbugs/6141803][fix] Skip Qwen3.5-4B tests pre-hopper (NVIDI…

09f6885

…A#14055) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

[None][fix] ADP router crashes on serve when scheduling_params.attent… (

e64c92b

NVIDIA#14267) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

[nvbug6185190][doc] fix invalid links in doc (NVIDIA#14337)

6a4a2a8

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

[None][feat] Refactor to support legacy and 1.x modelopt quant config…

c0b73b0

… format (NVIDIA#14088) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

[None][feature] Add env variables to help debugging mamba modules. (N…

57a1b84

…VIDIA#14170) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

taylor-yb-lee added 28 commits May 28, 2026 12:26

Revert "Keep MXFP4 placeholders on existing param device"

fab0a93

This reverts commit ef50383. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

[ad-mxfp4-moe] _compute_padded_dims: use pad_up helper

b5a6104

Replace three inline ((x + a - 1) // a) * a expressions with tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread. Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

Add acc test to CI

768e3e7

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

remove redundant configs

14e246d

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>

taylor-yb-lee force-pushed the taylor/gpt-oss-0511_rebase_0511 branch from 8d39fb2 to 14e246d on May 28, 2026 19:49

taylor-yb-lee closed this May 29, 2026

taylor-yb-lee wants to merge 437 commits into chenghao/gpt-oss-0505 from taylor/gpt-oss-0511_rebase_0511


20 participants
