[None][Fix] Enable INT8 weight-only (W8A16) MoE for non-gated activations by Dorijan10 · Pull Request #15550 · NVIDIA/TensorRT-LLM

Dorijan10 · 2026-06-23T16:43:42Z

Description

The INT8 weight-only per-channel (W8A16) fused-MoE path assumes gated activations (SwiGLU/GeGLU), which prevents non-gated MoE models (squared-ReLU, e.g. Nemotron) from using it, even though the underlying CUTLASS mixed-GEMM kernels already support the non-gated layout. Non-gated experts are already handled for FP8 and NVFP4 (see the Nemotron weight mapper and the FP8 fused-MoE method); this PR brings the INT8 weight-only path to parity.

- cpp/tensorrt_llm/thop/moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (INT8WoqPerChannelFusedMoEMethod): buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.
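The shape rule behind both changes can be sketched in a few lines of Python. This is only an illustration of the relationship described above, not the code in this PR: the function names `expected_fc1_inter` and `check_fc1_fc2_inter` are hypothetical, and the real check lives in moeOp.cpp behind isGatedActivation().

```python
def expected_fc1_inter(fc2_inter: int, is_gated: bool) -> int:
    """Expected fc1 intermediate size for a given fc2 intermediate size.

    Hypothetical helper for illustration only.
    """
    # Gated activations (SwiGLU/GeGLU) fuse gate and up projections, so fc1 is doubled.
    # Non-gated activations (e.g. squared ReLU) carry only the up projection.
    return 2 * fc2_inter if is_gated else fc2_inter


def check_fc1_fc2_inter(fc1_inter: int, fc2_inter: int, is_gated: bool) -> None:
    """Raise ValueError if fc1 and fc2 intermediate sizes are inconsistent."""
    expected = expected_fc1_inter(fc2_inter, is_gated)
    if fc1_inter != expected:
        raise ValueError(
            f"fc1 inter size {fc1_inter} does not match expected {expected} "
            f"(fc2 inter size {fc2_inter}, gated={is_gated})"
        )


# A non-gated expert with equal fc1/fc2 intermediate sizes passes the conditioned check.
check_fc1_fc2_inter(fc1_inter=1024, fc2_inter=1024, is_gated=False)
# A gated expert still requires the doubled layout.
check_fc1_fc2_inter(fc1_inter=2048, fc2_inter=1024, is_gated=True)
```

With the previous hardcoded rule, the first call would have been rejected, which is the failure mode non-gated models hit before this change.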

Gated models are unaffected; they retain the original code path. W8A16 checkpoints are produced the standard way (ModelOpt int8_wo, or offline per-output-channel quantization); no new runtime quantizer is introduced. On Ampere (e.g. A100), where FP8/NVFP4 are unavailable, INT8 weight-only is the natural weight-compression format for MoE inference.
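For readers unfamiliar with offline per-output-channel quantization, here is a minimal pure-Python sketch. It assumes symmetric scaling to the range [-127, 127] with one scale per output channel (one row of the weight matrix); the function names are hypothetical and the exact recipe used by ModelOpt int8_wo may differ.

```python
def quantize_per_output_channel(weight):
    """Quantize a 2-D weight (list of rows, one row per output channel) to INT8.

    Returns (int8_rows, scales). Assumption: symmetric scaling, scale = max|w| / 127.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Avoid division by zero for an all-zero channel.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate 16-bit-style weights: w is approximately q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [2.0, 1.0, -2.0]]
q, s = quantize_per_output_channel(weights)
approx = dequantize(q, s)
```

Only the weights are stored in INT8; activations stay in 16-bit, which is what the "A16" in W8A16 refers to.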

Test Coverage

Validated on NVIDIA-Nemotron-3-Nano-30B-A3B (non-gated, ReLU² experts) on A100-SXM4-80GB, served via trtllm-serve from a pre-quantized W8A16 checkpoint (hf_quant_config.json → quant_algo: W8A16, non-expert linears excluded). Built and validated against current main with only the two files in this PR changed.

Throughput — vllm bench serve (random, ISL/OSL 512/512, concurrency 32, --ignore-eos, mean of 3):

| Metric | BF16 | W8A16 (this PR) | Δ |
| --- | --- | --- | --- |
| Output throughput | ~1203 tok/s | ~1690 tok/s | +40% |
| Median TPOT | ~23.6 ms | ~16.5 ms | −30% |
| Weight memory | ~58.9 GB | ~31.5 GB | −27 GB |
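The Δ column can be re-derived from the reported values. The short calculation below uses only the approximate numbers in the table; the variable names are illustrative.

```python
# Approximate values as reported in the table above.
bf16 = {"throughput_tok_s": 1203.0, "tpot_ms": 23.6, "weights_gb": 58.9}
w8a16 = {"throughput_tok_s": 1690.0, "tpot_ms": 16.5, "weights_gb": 31.5}

# Relative change in output throughput (higher is better).
throughput_gain = w8a16["throughput_tok_s"] / bf16["throughput_tok_s"] - 1.0
# Relative change in median time per output token (lower is better).
tpot_change = w8a16["tpot_ms"] / bf16["tpot_ms"] - 1.0
# Absolute reduction in weight memory.
memory_saved_gb = bf16["weights_gb"] - w8a16["weights_gb"]

print(f"throughput: {throughput_gain:+.0%}")
print(f"median TPOT: {tpot_change:+.0%}")
print(f"weight memory saved: {memory_saved_gb:.1f} GB")
```

The weight memory falls by a little less than half because the non-expert linears are excluded from quantization and remain in BF16.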

Accuracy on gsm8k matches BF16 within eval noise.

Validated end-to-end on Nemotron; I can add a unit test in whatever format you prefer.

The INT8 weight-only per-channel MoE path assumed gated activations (Swiglu/Geglu) in three places, rejecting or mis-handling non-gated experts (squared-ReLU, e.g. Nemotron-H) that the underlying CUTLASS kernels already support:

- moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
- INT8WoqPerChannelFusedMoEMethod: buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.

Gated models are unaffected (they retain the original code path).

Signed-off-by: Dorijan10 <dorian.magasic@turintech.ai>

github-actions Bot assigned Dorijan10 Jun 23, 2026

Dorijan10 wants to merge 1 commit into NVIDIA:main from Dorijan10:feat/int8-woq-moe-nongated
