
[tmp] Opt for MXFP4 (gpt-oss-120b) #251

Closed
taylor-yb-lee wants to merge 437 commits into chenghao/gpt-oss-0505 from taylor/gpt-oss-0511_rebase_0511

Conversation

@taylor-yb-lee


@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@taylor-yb-lee force-pushed the taylor/gpt-oss-0511_rebase_0511 branch 21 times, most recently from 93809db to 25a4e4e (May 21, 2026 01:09)
Shixiaowei02 and others added 8 commits May 21, 2026 09:57
…r test (NVIDIA#14335)

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
…13842)

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
…A#14055)

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
NVIDIA#14267)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
… format (NVIDIA#14088)

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
…VIDIA#14170)

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Deletes unused make_mxfp4_trtllm_load_hook + _get_default_dist_info +
_hook_dist_info_fn and stale V4-plan / make_mxfp4_ep_slice_load_hook
docstring references. Relocates make_mxfp4_sharding_load_hook to
transform/library/mxfp4_moe.py next to its only caller. Renames
swizzle_moe_mxfp4_weights{.py,()} -> prepare_trtllm_gen_moe_mxfp4_weights{.py,()}
and PreparedMXFP4Weights -> TRTLLMGenMXFP4MoEWeights.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…ability

Unifies the four near-identical _shuffle_per_expert_{w3_w1,w2,bias_*} helpers
into one _shuffle_per_expert(permute_fn, ...) plus a single-expert helper.
Extracts the long main function's six sections (de-interleave, TP slice,
pad weights, pad scales, shuffle, prepare biases) into private helpers
that absorb the scratch=None/else branching, so the top-level function is
~100 lines of sequential helper calls instead of ~400 lines of repeated
if/else blocks. No public API or behavior change.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
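
The helper unification described above can be pictured with a small higher-order function. This is only a sketch under stated assumptions: `shuffle_per_expert_sketch`, `reverse_order`, and `even_then_odd` are hypothetical names, plain Python lists stand in for tensors, and the two toy orderings are placeholders rather than the real w3/w1, w2, or bias permutes.

```python
from typing import Callable, List, Sequence


def shuffle_per_expert_sketch(
    permute_fn: Callable[[int], List[int]],
    stacked: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Apply the row order produced by permute_fn to every expert's rows."""
    out = []
    for expert_rows in stacked:
        order = permute_fn(len(expert_rows))
        out.append([expert_rows[i] for i in order])
    return out


def reverse_order(n: int) -> List[int]:
    # Toy permutation: last row first.
    return list(range(n - 1, -1, -1))


def even_then_odd(n: int) -> List[int]:
    # Toy permutation: even-indexed rows, then odd-indexed rows.
    return list(range(0, n, 2)) + list(range(1, n, 2))


# Two experts with four rows each.
experts = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
reversed_rows = shuffle_per_expert_sketch(reverse_order, experts)
interleaved_rows = shuffle_per_expert_sketch(even_then_odd, experts)
```

Passing the ordering as an argument is what lets one helper replace several near-identical copies that differ only in the permutation they apply.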
…ine TP-slice no-op

Compresses module + dataclass + helper docstrings (per-section helpers
already carry focused docstrings); removes redundant PT mirror text and
duplicated layout tables, keeps WHY notes. Moves make_swiglu_param_tensors
next to its only caller InsertMXFP4MLP. Folds the tp_size==1 early-return
into _tp_slice_intermediate_axis so the main function drops the if/else.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Matches the fused_moe.py convention (one file per MoE backend, holding
both pattern-matcher and post-load-fusion transforms together); the
``fused_moe_*`` prefix groups them naturally next to fused_moe.py. No
code change beyond the rename + a docstring reference in modeling_gpt_oss.py.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…FuseMXFP4Moe docstring

MatchMOEDenseMLP -> MatchMXFP4MoePattern (Match{Backend}MoePattern convention).
InsertMXFP4MLP{,Config} -> QuantizeMXFP4MOE{,Config} (quantize_*_moe convention,
matches QuantizeFP8MOE / QuantizeNVFP4MOE in quantize_moe.py). Drops the
verbose 1-5 step list + skipping rules from FuseMXFP4Moe — those are visible
in the code itself; keeps only the WHEN/WHERE-it-runs context.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…TE in default.yaml

examples/auto_deploy/model_registry/configs/gpt_oss_120b{,_tp2}.yaml: drop the
unused world_size key and switch on apply_sharding_hints for mha+moe.
tensorrt_llm/_torch/auto_deploy/config/default.yaml: remove the 10-line NOTE
above fuse_mxfp4_moe — the legacy transform reference is in git history and
the active transform pair is documented in their respective class docstrings.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…_gen_*

Aligns with the existing trtllm_quant_{fp8,nvfp4,finegrained_fp8}_moe_fused
naming and adds the ``trtllm_gen`` family marker (mirrors
trtllm_nvfp4_trtllm_gen_moe_fused). Affected ops:
* trtllm_mxfp4_w4a16_moe_fused -> trtllm_quant_mxfp4_trtllm_gen_w4a16_moe_fused
* trtllm_mxfp4_w4a8_moe_fused  -> trtllm_quant_mxfp4_trtllm_gen_w4a8_moe_fused

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Replaces trtllm_quant_mxfp4_trtllm_gen_{w4a16,w4a8}_moe_fused with a single
trtllm_quant_mxfp4_trtllm_gen_moe_fused op that branches on a required
``act_dtype: str`` arg ("bf16" → bf16_mxe2m1 runner, "mxfp8" → mxfp8_quantize
+ mxe4m3_mxe2m1 runner). Caller selects via config.trtllm_quant_act.
Also drops model-specific phrasing in two comments (linear.py cublas_mm
branch reframed as model-agnostic; fused_moe_mxfp4.py bf16-free comment
trimmed of GPT-OSS-120B size quantification).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
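
A minimal sketch of the `act_dtype` branching described in this commit. The function name `select_runner_sketch` and its return shape are assumptions made for illustration; only the two accepted strings and the runner names come from the commit message, and the `ValueError` follows the behavior stated in the later test commit. No real kernel is called.

```python
from typing import Tuple


def select_runner_sketch(act_dtype: str) -> Tuple[str, bool]:
    """Return (runner name, whether activations are quantized to MXFP8 first)."""
    if act_dtype == "bf16":
        # Activations stay in bf16; weights are MXFP4 (e2m1).
        return "bf16_mxe2m1", False
    if act_dtype == "mxfp8":
        # Activations go through an mxfp8_quantize step before the runner.
        return "mxe4m3_mxe2m1", True
    raise ValueError(f"unsupported act_dtype: {act_dtype!r}")


bf16_choice = select_runner_sketch("bf16")
mxfp8_choice = select_runner_sketch("mxfp8")
```

Making the argument required, rather than defaulting it, forces each caller to state which activation path it wants.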
…elpers

Modeling-side __init__ code no longer reads the active DistConfig via the
contextvars-backed get_active_dist_config (that path moved to the transform
sharding load hook + FuseMXFP4Moe), so the helpers in dist_config.py have
no callers. Removes _ACTIVE_DIST_CONFIG / get_active_dist_config /
use_dist_config and their dead imports. build_model.py is unchanged.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…ments from base

Brings back the architecture summary, AD-canonical-ops list, and inline
forward annotations from the 3ae0b70 base that got dropped during the
sharding-IR rewrite, while keeping the new sharding-hint sections of the
docstring + the existing code. Also trims the now-redundant lm_head /
registration comments (covered by the module docstring or stale).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Covers the new bias-aware fusion added to FuseGemms vs the 3ae0b70 base:
* All-bias siblings fuse into one linear with stacked bias (concat dim=0).
* Mixed bias / no-bias siblings on the same parent get bucketed separately
  (one fused with-bias linear + one fused no-bias linear).
Existing FusableModel3's stale "no bias support yet" note is updated to
reflect the new bucketing behavior.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
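
The two cases covered by these tests can be illustrated in plain Python. This is a sketch, not the FuseGemms implementation: `bucket_by_bias` and `linear` are hypothetical helpers, and lists stand in for tensors. It shows why concatenating weights and biases along dim 0 preserves the outputs, and how siblings might be grouped by bias presence.

```python
from collections import defaultdict


def linear(x, weight, bias=None):
    """y = W @ x (+ b), with weight given as a list of rows."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def bucket_by_bias(siblings):
    """Group (name, has_bias) pairs so each bucket can be fused on its own."""
    buckets = defaultdict(list)
    for name, has_bias in siblings:
        buckets[has_bias].append(name)
    return dict(buckets)


x = [1.0, 2.0]
w_a, b_a = [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]
w_b, b_b = [[2.0, 1.0]], [1.0]

# Fused: stack weight rows and bias entries along dim 0.
fused = linear(x, w_a + w_b, b_a + b_b)
# Unfused: run each sibling separately and concatenate the outputs.
separate = linear(x, w_a, b_a) + linear(x, w_b, b_b)

buckets = bucket_by_bias([("a", True), ("b", True), ("c", False), ("d", False)])
```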
…ly override

Pipeline trace confirms the only module-level dtype walk is
``QuantConfigReader.post_process_model``'s ``model.to(new_dtype)``, which
fires *before* PATTERN_MATCHER. At that point ``_dtype_protected_params`` is
unset (FuseMXFP4Moe sets it later) so the override degenerates to a normal
``nn.Module._apply``. No subsequent ``gm.to(dtype)`` exists. Removes the
override + both transform-side ``_dtype_protected_params`` setters + the
matching docstrings.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
…-out lines

* Remove "match PT test_w4_1gpu" trailing comment on GSM8K_MAX_OUTPUT_LEN.
* Remove the MODEL_PARAMS entry-format docstring — the parametrize names
  and the if-elif moe_topology dispatch below are already self-describing.
* Remove the commented-out ``marks=pytest.mark.skip_less_device(4)``.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
… only for MoE topology

Source of truth for ``world_size`` is the registry
(``_get_registry_yaml_extra``'s 2nd return value — defaults to 1 when yaml
doesn't carry an explicit ``world_size_N.yaml``); MODEL_PARAMS' 3rd column
is a thin ``world_size_override`` used only for the ``120b-tp2`` /
``120b-ep2`` cases that exercise MoE-TP / MoE-EP on top of the same yaml.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
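
One way to read the precedence described above, as a sketch only: `resolve_world_size` is a hypothetical helper, and the assumption that a non-`None` override wins over the registry value is inferred from the name `world_size_override` rather than confirmed by the test code.

```python
from typing import Optional


def resolve_world_size(
    registry_world_size: Optional[int] = None,
    world_size_override: Optional[int] = None,
) -> int:
    """Pick the effective world size for one parametrize case."""
    if world_size_override is not None:
        # Assumed: only the MoE-topology cases set this column.
        return world_size_override
    # The registry is the source of truth and defaults to 1.
    return registry_world_size if registry_world_size is not None else 1


default_case = resolve_world_size()
registry_case = resolve_world_size(registry_world_size=4)
override_case = resolve_world_size(registry_world_size=1, world_size_override=2)
```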
Consolidates the prep-helper invariants (previously in
test_prepare_trtllm_gen_moe_mxfp4_weights.py) and the unified op's
act_dtype-dispatch contract into a single file at
tests/unittest/auto_deploy/singlegpu/custom_ops/moe/test_trtllm_quant_mxfp4_trtllm_gen_moe.py.

Coverage:
* fc1 / fc2 bias rows must follow the SAME TMA permute as the weights
  (gated-act-gemm + epilogue-tile reorder for w3/w1; epilogue-tile reorder
  only for w2) — guard for the gpt-oss-120b GSM8K 2% bug.
* Byte-identical match against PT's MXFP4 reference loader.
* ``act_dtype="bf16"`` and ``act_dtype="mxfp8"`` both run end-to-end on
  Blackwell+; invalid ``act_dtype`` raises ``ValueError``.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
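
The first coverage item can be illustrated with a toy row permutation. This is a sketch under assumptions: `perm` is an arbitrary stand-in for the real TMA reorder, and lists stand in for tensors. It shows that when weight rows are permuted, the bias must be permuted the same way or the outputs no longer correspond.

```python
def linear(x, weight, bias):
    """y = W @ x + b, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


perm = [2, 0, 1]  # hypothetical row order, not the real TMA permute
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [10.0, 20.0, 30.0]
x = [1.0, 2.0]

reference = linear(x, weight, bias)

w_perm = [weight[i] for i in perm]
# Correct: bias rows follow the same permutation as the weight rows.
good = linear(x, w_perm, [bias[i] for i in perm])
# Incorrect: weights permuted, bias left in the original order.
bad = linear(x, w_perm, bias)

expected = [reference[i] for i in perm]
```

Here `good` is just `reference` in the permuted row order, while `bad` adds each bias entry to the wrong output row.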
Pins the POST_LOAD_FUSION contract of ``FuseMXFP4Moe``:

* Raw HF MXFP4 buffers (``gate_up_proj_{blocks,scales,bias}`` /
  ``down_proj_{blocks,scales,bias}``) are deleted and replaced by the six
  prepared ``*_trtllm`` params on the experts module.
* The ``trtllm_quant_mxfp4_trtllm_gen_moe_fused`` op's weight/bias arg
  slots (4..9) are re-pointed at the new prepared get_attr nodes.
* ``moe_tp_size > 1`` divides ONLY ``fc2_bias_trtllm`` by ``moe_tp_size``
  (so the post-AR sum reproduces the unsharded bias); all other prepared
  tensors match the TP=1 prep output byte-for-byte.
* Re-running on an already-prepped graph is a no-op (idempotent skip).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
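
The `fc2_bias_trtllm` scaling can be checked with a small row-parallel example. This is a sketch with made-up numbers and a hypothetical `fc2` helper: each rank multiplies its slice of the intermediate axis, adds `bias / tp_size`, and a sum across ranks stands in for the all-reduce.

```python
def fc2(x_part, w_part, b):
    """Partial down-projection: one output per weight row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x_part)) + bi for row, bi in zip(w_part, b)]


tp_size = 2
x = [1.0, 2.0, 3.0, 4.0]            # intermediate activations, I = 4
w2 = [[1.0, 0.0, 2.0, 1.0],          # fc2 weight, shape [hidden=2, I=4]
      [0.5, 1.0, 0.0, 3.0]]
bias = [0.25, -1.0]

reference = fc2(x, w2, bias)         # unsharded result

per_rank_i = len(x) // tp_size
partials = []
for rank in range(tp_size):
    lo, hi = rank * per_rank_i, (rank + 1) * per_rank_i
    scaled_bias = [b / tp_size for b in bias]   # the only tensor that is divided
    partials.append(fc2(x[lo:hi], [row[lo:hi] for row in w2], scaled_bias))

# Summing per-rank partials stands in for the all-reduce.
all_reduced = [sum(vals) for vals in zip(*partials)]
```

If the bias were left unscaled on every rank, the summed output would carry `tp_size` copies of it.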
…g_transform_executor disables

``apply_sharding_hints`` is the only sharding pass actually used here;
the explicit ``enabled: false`` lines for ``detect_sharding`` and
``sharding_transform_executor`` are no-ops (those passes are off by
default in this pipeline) and just add noise.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
- 20B now inherits the same MXFP4/sharding/fuse transforms as 120B (it
  was missing them despite being MXFP4 too).
- world_size moves out of the model yaml and is supplied by the
  registry's world_size_N.yaml overlay.
- models.yaml, cookbook, and supported-models.md all point at the
  unified config.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
get_sm_version() is already @lru_cache(maxsize=1), so the manual
_SM_VERSION cache adds nothing. The try/except fallback to 0 was
dead defensive code: this branch only triggers on CUDA bf16 tensors,
where torch.cuda.get_device_properties(0) cannot fail.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
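
A small sketch of why a manual cache on top of `functools.lru_cache` is redundant. `get_sm_version_sketch` is a hypothetical stand-in that returns a placeholder value instead of querying a device; the counter only records how often the body runs.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=1)
def get_sm_version_sketch() -> int:
    """Stand-in for a device query; the body runs once per process."""
    global calls
    calls += 1
    return 100  # placeholder value, not a real SM version lookup


first = get_sm_version_sketch()
second = get_sm_version_sketch()  # served from the lru_cache, body not re-run
```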
…XFP4 prep

_tp_slice_intermediate_axis() pre-pads I to i_padded_tp before
slicing, and _get_weight_alignment() guarantees the alignment is a
multiple of tp_size, so the helper already handles non-tp-divisible
intermediate sizes (the original I is never reused downstream — only
per_rank_i is). The guard rejected exactly the shapes the helper was
designed to support.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
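
The argument above can be sketched as index arithmetic. `tp_slice_bounds` and its parameters are hypothetical and only mirror the names in the commit message (`i_padded_tp`, `per_rank_i`); the real helper operates on tensors. The point is that once the intermediate size is padded to an alignment that is a multiple of `tp_size`, every rank gets an equal slice even when the original size is not divisible.

```python
def pad_up(x: int, a: int) -> int:
    """Round x up to the nearest multiple of a."""
    return ((x + a - 1) // a) * a


def tp_slice_bounds(intermediate_size: int, alignment: int, tp_size: int, rank: int):
    """Return the [start, end) slice of the padded intermediate axis for one rank."""
    if alignment % tp_size != 0:
        raise ValueError("alignment must be a multiple of tp_size")
    i_padded_tp = pad_up(intermediate_size, alignment)
    per_rank_i = i_padded_tp // tp_size
    start = rank * per_rank_i
    return start, start + per_rank_i


# intermediate_size=10 is not divisible by tp_size=4, but the padded size 16 is.
bounds = [tp_slice_bounds(10, alignment=8, tp_size=4, rank=r) for r in range(4)]
```

In this toy case the last six positions of the padded axis are padding, so the later ranks hold part or all padding rather than failing a divisibility check.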
…llers

  already contiguous where needed).
- _shuffle_per_expert: drop per-expert loop; batched torch.index_select on dim=1 (permute derived once on stacked[0], _PERMUTE_CACHE is shape-keyed).
- default.yaml: comment fuse_mxfp4_moe.expect_mem_change with alignment-padding rationale.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
- _register_mxfp4_expert_params (Triton): torch.zeros -> torch.empty with device=gu_w.device.
- _apply_trtllm: raw_specs + make_swiglu_param_tensors now use device=gu_w_t.device.
- Avoids materializing giant CPU buffers on meta-device builds (GPT-OSS-120B); load hook overwrites bytes anyway.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
This reverts commit ef50383.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
… only for HF-loaded params

Previous "Keep MXFP4 placeholders on existing param device" change
(ef50383) also routed make_swiglu_param_tensors through param_device,
which is meta on the normal build. swiglu_alpha/beta/limit (1.702/1.0/7.0)
are NOT in HF safetensors, so meta tensors silently dropped the values
and tanked GSM8K. Restore the memory-saving device reuse for raw HF
buffers (blocks/scales/bias) and keep SwiGLU constants on CPU with real
values; add a comment so it isn't re-broken.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Replace three inline ((x + a - 1) // a) * a expressions with
tensorrt_llm.math_utils.pad_up. Same arithmetic, less to misread.

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
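
For reference, the inline expression being replaced rounds `x` up to the next multiple of `a`. The function below is a local stand-in written for illustration, not the `tensorrt_llm.math_utils.pad_up` source.

```python
def pad_up_sketch(x: int, a: int) -> int:
    """Round x up to the nearest multiple of a (a > 0)."""
    return ((x + a - 1) // a) * a


rounded = pad_up_sketch(10, 4)          # 10 is not a multiple of 4
already_aligned = pad_up_sketch(8, 4)   # aligned values are unchanged
```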
…t inline

Pull the GPT-OSS sharding invariants into gpt_oss.yaml so the model
registry is the single source of truth:

- detect_sharding.enabled=false
- sharding_transform_executor.enabled=false

Both were inlined in TestGPTOSS.test_mxfp4_gsm8k only for the
tp2/ep2 parametrize cases; with them in yaml, trtllm-serve via this
config now uses the same apply_sharding_hints-only sharding path as
the test.

Test inline keeps only the per-parametrize dist_mapping override; the
already-duplicated apply_sharding_hints.{enabled, requires_shape_prop,
shard_layers} keys (also present in yaml) are dropped. pydantic-settings
deep-merges init kwargs into yaml-sourced transforms, so the effective
config is unchanged across all 4 parametrize cases.

Pattern matches _IR_SHARDING_TRANSFORMS used by the existing IR-sharding
tests (TestNemotronSuperV3_IR, TestQwen3_5_MoE_IR).

Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
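
The deep-merge behavior this commit relies on can be pictured with plain dictionaries. This is a sketch, not pydantic-settings' actual implementation: `deep_merge` is a hypothetical helper, and the placement and value of the `dist_mapping` key are made up for illustration.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Keys that now live in the model yaml.
yaml_transforms = {
    "detect_sharding": {"enabled": False},
    "sharding_transform_executor": {"enabled": False},
    "apply_sharding_hints": {"enabled": True},
}
# The test supplies only its per-case override (hypothetical shape).
test_kwargs = {"apply_sharding_hints": {"dist_mapping": {"moe_tp": 2}}}

effective = deep_merge(yaml_transforms, test_kwargs)
```

Because nested dictionaries are merged rather than replaced, the test no longer needs to repeat keys the yaml already sets.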
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
@taylor-yb-lee force-pushed the taylor/gpt-oss-0511_rebase_0511 branch from 8d39fb2 to 14e246d (May 28, 2026 19:49)
