[None][Fix] Enable INT8 weight-only (W8A16) MoE for non-gated activations #15550
Draft
Dorijan10 wants to merge 1 commit into
The INT8 weight-only per-channel MoE path assumed gated activations (Swiglu/Geglu) in three places, rejecting or mis-handling non-gated experts (squared-ReLU, e.g. Nemotron-H) that the underlying CUTLASS kernels already support:

- moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
- INT8WoqPerChannelFusedMoEMethod: buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.

Gated models are unaffected (they retain the original code path).

Signed-off-by: Dorijan10 <dorian.magasic@turintech.ai>
Description
The INT8 weight-only per-channel (W8A16) fused-MoE path assumes gated activations (SwiGLU/GeGLU), which prevents non-gated MoE models (squared-ReLU, e.g. Nemotron) from using it, even though the underlying CUTLASS mixed-GEMM kernels already support the non-gated layout. Non-gated experts are already handled for FP8 and NVFP4 (see the Nemotron weight mapper and the FP8 fused-MoE method); this PR brings the INT8 weight-only path to parity.
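To make the layout difference concrete, here is a minimal Python sketch of the shape relation involved. It is illustrative only: the function and parameter names are hypothetical and do not reproduce the actual code in moeOp.cpp or quantization.py. It assumes, as stated in this PR, that a gated expert stacks the gate and up projections in fc1 (so fc1 is twice as wide as fc2's inner dimension), while a non-gated expert has only the up projection.

```python
def expected_fc1_inter(fc2_inter: int, gated: bool) -> int:
    """Return the fc1 inner size expected for a given fc2 inner size.

    Hypothetical helper for illustration; not part of TensorRT-LLM.
    """
    # Gated activations (SwiGLU/GeGLU) stack gate + up projections in fc1,
    # so fc1 is 2x wide. Non-gated activations (e.g. squared-ReLU) carry
    # only the up projection, so fc1 matches fc2's inner size.
    return 2 * fc2_inter if gated else fc2_inter


def check_moe_shapes(fc1_inter: int, fc2_inter: int, gated: bool) -> None:
    """Raise ValueError if fc1/fc2 inner sizes disagree with the activation type."""
    expected = expected_fc1_inter(fc2_inter, gated)
    if fc1_inter != expected:
        kind = "gated" if gated else "non-gated"
        raise ValueError(
            f"{kind} MoE expects fc1 inner size {expected}, got {fc1_inter}"
        )


# A non-gated expert passes only when the check is conditioned on the
# activation type; a hardcoded 2x rule would reject it.
check_moe_shapes(fc1_inter=1024, fc2_inter=1024, gated=False)
check_moe_shapes(fc1_inter=2048, fc2_inter=1024, gated=True)
```

With an unconditional 2x rule, the first call above would be rejected, which is the failure mode described for non-gated experts.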
- cpp/tensorrt_llm/thop/moeOp.cpp: the woq validation hardcoded fc1.inter == 2 * fc2.inter; now conditioned on isGatedActivation(), mirroring the existing non-woq branch.
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (INT8WoqPerChannelFusedMoEMethod): buffer sizing, weight loading, and scale loading assumed the doubled (gate+up) layout; now handle the single up-projection when the gate weight is absent, mirroring UnquantizedFusedMoEMethod's existing non-gated handling.

Gated models are unaffected; they retain the original code path. W8A16 checkpoints are produced the standard way (ModelOpt int8_wo, or offline per-output-channel quantization); no new runtime quantizer is introduced. On Ampere (e.g. A100), where FP8/NVFP4 are unavailable, INT8 weight-only is the natural weight-compression format for MoE inference.

Test Coverage
Validated on NVIDIA-Nemotron-3-Nano-30B-A3B (non-gated, Relu² experts) on A100-SXM4-80GB, served via trtllm-serve from a pre-quantized W8A16 checkpoint (hf_quant_config.json → quant_algo: W8A16, non-expert linears excluded). Built and validated against current main with only the two files in this PR.

Throughput, measured with vllm bench serve (random, ISL/OSL 512/512, concurrency 32, --ignore-eos, mean of 3):

Lossless within eval noise on gsm8k.
Validated end-to-end on Nemotron; can add a unit test in the format you prefer.