[None][Fix] Enable INT8 weight-only (W8A16) MoE for non-gated activations #15550

Draft
Dorijan10 wants to merge 1 commit into NVIDIA:main from Dorijan10:feat/int8-woq-moe-nongated



Dorijan10 commented Jun 23, 2026


Description

The INT8 weight-only per-channel (W8A16) fused-MoE path assumes gated activations (SwiGLU/GeGLU). This prevents non-gated MoE models (squared-ReLU, e.g. Nemotron) from using it, even though the underlying CUTLASS mixed-GEMM kernels already support the non-gated layout. Non-gated experts are already handled for FP8 and NVFP4 (see the Nemotron weight mapper and the FP8 fused-MoE method); this PR brings the INT8 weight-only path to parity.

  • cpp/tensorrt_llm/thop/moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (INT8WoqPerChannelFusedMoEMethod): buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.
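The shape rule behind both changes can be sketched in a few lines of Python. This is only an illustration of the condition described above, not the code in moeOp.cpp or quantization.py; the function names `expected_fc1_inter` and `check_woq_shapes` are hypothetical.

```python
def expected_fc1_inter(fc2_inter: int, is_gated: bool) -> int:
    """Return the fc1 intermediate size implied by fc2's intermediate size.

    Gated activations (SwiGLU/GeGLU) stack the gate and up projections,
    so fc1 is twice as wide. Non-gated activations (e.g. squared-ReLU)
    have only the up projection, so the sizes match.
    """
    return 2 * fc2_inter if is_gated else fc2_inter


def check_woq_shapes(fc1_inter: int, fc2_inter: int, is_gated: bool) -> None:
    """Hypothetical stand-in for the weight-only shape validation."""
    expected = expected_fc1_inter(fc2_inter, is_gated)
    if fc1_inter != expected:
        raise ValueError(
            f"fc1 inter size {fc1_inter} does not match expected {expected} "
            f"(fc2 inter size {fc2_inter}, gated={is_gated})"
        )


# Gated expert: fc1 holds gate + up, so it is doubled.
check_woq_shapes(fc1_inter=2048, fc2_inter=1024, is_gated=True)
# Non-gated expert: fc1 holds only the up projection.
check_woq_shapes(fc1_inter=1024, fc2_inter=1024, is_gated=False)
```

The same factor (2 for gated, 1 for non-gated) would also drive the per-expert weight and scale buffer sizes on the Python side.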

Gated models are unaffected; they retain the original code path. W8A16 checkpoints are produced the standard way (ModelOpt int8_wo, or offline per-output-channel quantization); no new runtime quantizer is introduced. On Ampere (e.g. A100), where FP8/NVFP4 are unavailable, INT8 weight-only is the natural weight-compression format for MoE inference.
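For readers unfamiliar with offline per-output-channel quantization, a minimal pure-Python sketch follows. It assumes symmetric quantization with a clipping range of [-127, 127] and one scale per output channel (row); the helper names are hypothetical, and the exact scheme used by ModelOpt or a given checkpoint may differ.

```python
def quantize_per_output_channel(weight):
    """Quantize a 2-D weight (list of rows, one row per output channel).

    Returns (int8_rows, scales), where each row has its own scale so that
    row[i] is approximately int8_rows[i] * scale.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from int8 rows and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
q, s = quantize_per_output_channel(weights)
approx = dequantize(q, s)
```

Because the activations stay in 16-bit (the "A16" in W8A16), only the weights go through this step; the kernel applies the per-channel scales during the mixed-precision GEMM.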

Test Coverage

Validated on NVIDIA-Nemotron-3-Nano-30B-A3B (non-gated, ReLU² experts) on A100-SXM4-80GB, served via trtllm-serve from a pre-quantized W8A16 checkpoint (hf_quant_config.json with quant_algo: W8A16, non-expert linears excluded). Built and validated against current main with only the two files in this PR changed.

Throughput, measured with vllm bench serve (random dataset, ISL/OSL 512/512, concurrency 32, --ignore-eos, mean of 3 runs):

| Metric | BF16 | W8A16 (this PR) | Δ |
| --- | --- | --- | --- |
| Output throughput | ~1203 tok/s | ~1690 tok/s | +40% |
| Median TPOT | ~23.6 ms | ~16.5 ms | −30% |
| Weight memory | ~58.9 GB | ~31.5 GB | −27 GB |

Accuracy on gsm8k is unchanged within evaluation noise.

Validated end-to-end on Nemotron; I can add a unit test in whatever format you prefer.

The INT8 weight-only per-channel MoE path assumed gated activations (SwiGLU/GeGLU) in three places, rejecting or mishandling non-gated experts (squared-ReLU, e.g. Nemotron-H) that the underlying CUTLASS kernels already support:

- moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
- INT8WoqPerChannelFusedMoEMethod: buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.

Gated models are unaffected (they retain the original code path).

Signed-off-by: Dorijan10 <dorian.magasic@turintech.ai>