Commit bceab03

[CUDA] Default QMoE GEMV fp16 accumulation for fp16 activations (#29166)
### Description

Make fp16 accumulation the default for the CUDA QMoE GEMV fast path when activations are fp16. The previous fp32 accumulation behavior remains available as an opt-in fallback with `ORT_MOE_GEMV_FP32_ACCUM=1`, and bf16 activations continue to use fp32 accumulation.

This is motivated by GPT-OSS-20B decode measurements where fp16 accumulation was close in accuracy to the fp32 path and materially faster.

### Changes

- Invert the QMoE GEMV accumulation environment knob:
  - default fp16 accumulation for fp16 activations
  - `ORT_MOE_GEMV_FP32_ACCUM=1` restores fp32 accumulation
  - bf16 stays on fp32 accumulation
- Document the new runtime knob in the QMoE CUDA docs.
- Add the standalone helper, full-model decode, and MMLU smoke measurements to the QMoE GEMV experiment log.

### Measurements

| Measurement | Default fp16 accumulation | `ORT_MOE_GEMV_FP32_ACCUM=1` |
|---|---:|---:|
| Standalone GPT-OSS QMoE helper latency | 0.0708 ms | 0.0812 ms |
| Helper FC1 SwiGLU GEMV avg | 13.93 us | 21.57 us |
| Helper FC2 GEMV avg | 10.14 us | 12.24 us |
| Full GPT-OSS CUDA-graph decode latency | 2.588930 ms/token | 2.827260 ms/token |
| Full GPT-OSS CUDA-graph decode throughput | 386.259956 tok/s | 353.699315 tok/s |

The full-model A/B shows about +9.2% decode throughput for the default fp16 accumulation path versus the fp32 fallback in this run.

### Accuracy

Prior 1000-sample MMLU smoke runs matched pooled accuracy for both modes:

| Mode | Pooled accuracy |
|---|---:|
| fp32 accumulation | 0.8260 |
| fp16 accumulation | 0.8260 |

### Testing

- `lintrunner -a onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu`
- `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- `git diff --check -- onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu docs/contrib_ops/cuda/qmoe_gemv_experiments.md docs/contrib_ops/cuda/moe_qmoe.md`
- Standalone QMoE helper A/B on `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`
- Full GPT-OSS CUDA-graph decode A/B

1 parent ad8e258 commit bceab03

3 files changed

Lines changed: 77 additions & 11 deletions

docs/contrib_ops/cuda/moe_qmoe.md

Lines changed: 14 additions & 0 deletions
@@ -989,6 +989,20 @@ per-column INT4, block-wise INT4/INT8, and interleaved-SwiGLU GEMV kernels.
 | Kernel instantiation | `moe_gemv.cu` adds `__nv_bfloat16` details/instantiations (group sizes 0/32/64/128, INT4/INT8, bias on/off) under `ENABLE_BF16`. | The custom FC1/FC2 GEMV kernels run for BF16; no grouped-GEMM fallback when the FP16 gate would route. |
 | Profiling | GPT-OSS-20B, Qwen3.6-35B-A3B, and Gemma model shapes profiled with `block_size=64` for both dtypes. | BF16 matches FP16 routing and latency within noise (about 1.3x–1.5x faster than grouped GEMM); SwiGLU BF16 parity tests pass. |
 
+#### Accumulation policy
+
+The QMoE GEMV fast path accumulates fp16 activations in fp16 by default. Set
+`ORT_MOE_GEMV_FP32_ACCUM=1` before process start to restore the previous fp32
+accumulation path for fp16 activations. BF16 activations always use fp32
+accumulation because bf16 accumulation is too lossy.
+
+On the GPT-OSS-20B decode-shaped helper case
+`gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, the default fp16-accumulation path was
+0.0708 ms versus 0.0812 ms with `ORT_MOE_GEMV_FP32_ACCUM=1`. In a full GPT-OSS
+CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
+353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
+accuracy at 0.8260 for both modes.
+
 #### Experiments rejected after profiling
 
 | Experiment | Why it was rejected |
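
The accumulation policy added to `moe_qmoe.md` in this hunk can be sketched in plain C++ as follows. This is a minimal illustration, not the ONNX Runtime implementation: `ActDtype`, `UseFp32Accum`, and `Fp32KnobFromEnv` are hypothetical names, and `std::getenv` with an exact `"1"` comparison stands in for ORT's own environment helper, which parses the value as an integer.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the activation dtypes; not an ORT type.
enum class ActDtype { kFp16, kBf16 };

// Pure policy: every non-fp16 activation type accumulates in fp32, and fp16
// activations accumulate in fp32 only when the fallback knob is set.
constexpr bool UseFp32Accum(ActDtype act, bool fp32_knob_set) {
  return act != ActDtype::kFp16 || fp32_knob_set;
}

// Reads the knob once per process, which is why it has to be set before
// process start. Assumption: a simplified parse that accepts only "1".
inline bool Fp32KnobFromEnv() {
  static const bool enabled = [] {
    const char* value = std::getenv("ORT_MOE_GEMV_FP32_ACCUM");
    return value != nullptr && std::strcmp(value, "1") == 0;
  }();
  return enabled;
}
```

A caller would evaluate `UseFp32Accum(ActDtype::kFp16, Fp32KnobFromEnv())` once per launch. Because the knob is cached in a function-local static, changing the environment variable after the first call has no effect in this sketch.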

docs/contrib_ops/cuda/qmoe_gemv_experiments.md

Lines changed: 53 additions & 0 deletions
@@ -978,3 +978,56 @@ Every case reported `has_invalid_output=false`.
 per-column case for INT4 and INT8.
 - Per-column INT8 W8A16 decode shapes route to GEMV for both FP16 and BF16 and
 beat the grouped-GEMM fallback at every profiled shape.
+
+## 2026-06-19 FP16 Accumulation Default: SM90, GPT-OSS Decode Shape
+
+### Setup
+
+- Goal: make fp16 accumulation the default for fp16 QMoE GEMV, while preserving
+the previous fp32 accumulation path behind `ORT_MOE_GEMV_FP32_ACCUM=1`.
+- GPU: single H200 (SM90).
+- ONNX Runtime build: `~/onnxruntime/build/cu130/Release`, CUDA 13.0.
+- QMoE helper case: `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, warmup 5,
+repeat 20.
+- Full-model case: GPT-OSS-20B INT4 QMoE, batch 1, prompt 512, generation 128,
+warmup 2, repeat 5, CUDA graph enabled, XQA enabled, deterministic MoE tactic
+selection.
+
+### Standalone QMoE Helper
+
+Lower is better. Both modes reported `has_invalid_output=false`.
+
+| Mode | Helper latency ms | FC1 SwiGLU GEMV avg us | FC2 GEMV avg us |
+|------|------------------:|-----------------------:|----------------:|
+| default fp16 accumulation | 0.0708 | 13.93 | 10.14 |
+| `ORT_MOE_GEMV_FP32_ACCUM=1` | 0.0812 | 21.57 | 12.24 |
+
+The new default is about 12.8% faster than the fp32 fallback in the isolated
+GPT-OSS decode-shaped QMoE helper. The gain comes from the expected GEMV rows:
+FC1 interleaved SwiGLU is about 35% faster and FC2 GEMV is about 17% faster.
+
+### Full GPT-OSS Decode
+
+| Mode | Decode latency ms/token | Decode throughput tok/s |
+|------|-------------------------:|-------------------------:|
+| default fp16 accumulation | 2.588930 | 386.259956 |
+| `ORT_MOE_GEMV_FP32_ACCUM=1` | 2.827260 | 353.699315 |
+
+The full-model A/B shows the default fp16 accumulation path is about 9.2% faster
+in decode throughput than the fp32 fallback in this run.
+
+### Accuracy Smoke
+
+Prior 1000-sample MMLU runs found no pooled-accuracy difference between the old
+fp32 default and the fp16-accumulation experiment:
+
+| Mode | Output dir | Pooled accuracy |
+|------|------------|-----------------|
+| fp32 accumulation | `~/eval_runs/mmlu1000_default_20260619_001348` | 0.8260 |
+| fp16 accumulation | `~/eval_runs/mmlu1000_fp16accum_20260619_001352` | 0.8260 |
+
+### Decision
+
+- Make fp16 accumulation the default for fp16 QMoE GEMV.
+- Keep bf16 on fp32 accumulation.
+- Keep `ORT_MOE_GEMV_FP32_ACCUM=1` as the opt-in numerical fallback and A/B knob.
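
The percentages quoted in this experiment log can be reproduced from the table values. The helper figures (12.8%, 35%, 17%) appear to be relative latency reductions, while the full-model 9.2% is a relative throughput gain. The sketch below recomputes both under that reading; `LatencyReduction` and `ThroughputGain` are hypothetical helper names introduced only for illustration.

```cpp
// Relative latency reduction of `candidate` versus `baseline`:
// (baseline - candidate) / baseline. Inputs share the same time unit.
constexpr double LatencyReduction(double baseline, double candidate) {
  return (baseline - candidate) / baseline;
}

// Relative throughput gain of `candidate` versus `baseline`:
// candidate / baseline - 1. Inputs share the same rate unit.
constexpr double ThroughputGain(double baseline, double candidate) {
  return candidate / baseline - 1.0;
}

// Values copied from the tables; fp32 fallback is the baseline in every case.
constexpr double kHelperReduction = LatencyReduction(0.0812, 0.0708);    // helper latency, ms
constexpr double kFc1Reduction = LatencyReduction(21.57, 13.93);         // FC1 SwiGLU GEMV, us
constexpr double kFc2Reduction = LatencyReduction(12.24, 10.14);         // FC2 GEMV, us
constexpr double kDecodeGain = ThroughputGain(353.699315, 386.259956);   // decode, tok/s
constexpr double kDecodeReduction = LatencyReduction(2.827260, 2.588930);  // decode, ms/token
```

The two conventions are not interchangeable: the same full-model run that shows a throughput gain of about 9.2% corresponds to a per-token latency reduction of roughly 8.4%, so it matters which one a given figure uses when comparing rows.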

onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu

Lines changed: 10 additions & 11 deletions
@@ -397,13 +397,13 @@ struct TypeTag {
   using type = T;
 };
 
-// Opt-in: accumulate the GEMV inner product in 16-bit (fp16) instead of the default
-// fp32. Honored only for fp16 activations (bf16 always accumulates in fp32). Set
-// ORT_MOE_GEMV_FP16_ACCUM=1 to measure the perf/accuracy tradeoff of 16-bit accumulation.
-inline bool MoeGemvUseFp16Accum() {
+// Opt-in: accumulate the GEMV inner product in fp32 instead of the default fp16
+// for fp16 activations. bf16 always accumulates in fp32 because 16-bit bf16
+// accumulation is too lossy.
+inline bool MoeGemvUseFp32Accum() {
   // Parsed once via ORT's environment helper (consistent parsing/thread-safety across platforms).
   static bool const enabled =
-      onnxruntime::ParseEnvironmentVariableWithDefault<int>("ORT_MOE_GEMV_FP16_ACCUM", 0) == 1;
+      onnxruntime::ParseEnvironmentVariableWithDefault<int>("ORT_MOE_GEMV_FP32_ACCUM", 0) == 1;
   return enabled;
 }
 
@@ -502,10 +502,9 @@ void launch_moe_gemv_int_symmetric(T const* act, WeightType const* weight, T con
   ORT_UNUSED_PARAMETER(sm);
   using Details = typename DetailsForTAndWeight<T, WeightType>::Details;
   using TypeA = typename DetailsForTAndWeight<T, WeightType>::TypeA;
-  // Accumulate in fp32 by default. fp16 activations may opt back into 16-bit accumulation
-  // via ORT_MOE_GEMV_FP16_ACCUM=1; bf16 always accumulates in fp32 (16-bit bf16 accumulation
-  // is too lossy). use_fp32_accum selects the kernel's AccT at runtime.
-  bool const use_fp32_accum = !std::is_same_v<T, half> || !MoeGemvUseFp16Accum();
+  // Accumulate fp16 activations in fp16 by default. ORT_MOE_GEMV_FP32_ACCUM=1
+  // restores the previous fp32 accumulation path; bf16 always uses fp32.
+  bool const use_fp32_accum = !std::is_same_v<T, half> || MoeGemvUseFp32Accum();
   auto launch = [&](auto acc_tag) {
     using AccT = typename decltype(acc_tag)::type;
     fiv::dispatch_moe_gemv_group_size<Details, kCtaN, kThreads, TypeA, AccT>(
@@ -532,8 +531,8 @@ void launch_moe_gemv_int_symmetric_interleaved_swiglu(
   ORT_UNUSED_PARAMETER(sm);
   using Details = typename DetailsForTAndWeight<T, WeightType>::Details;
   using TypeA = typename DetailsForTAndWeight<T, WeightType>::TypeA;
-  // Accumulate in fp32 by default (see launch_moe_gemv_int_symmetric for the policy).
-  bool const use_fp32_accum = !std::is_same_v<T, half> || !MoeGemvUseFp16Accum();
+  // Accumulation policy matches launch_moe_gemv_int_symmetric.
+  bool const use_fp32_accum = !std::is_same_v<T, half> || MoeGemvUseFp32Accum();
   auto launch = [&](auto acc_tag) {
     using AccT = typename decltype(acc_tag)::type;
     fiv::dispatch_moe_gemv_interleaved_swiglu_group_size<Details, kCtaN, kThreads, TypeA, AccT>(
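
The hunks above pick the accumulator type `AccT` at runtime by passing a `TypeTag` to a generic lambda. The call site that chooses the tag is outside the diff context, so the sketch below assumes one plausible shape for it. `Fp16` and `Bf16` are hypothetical host-side stand-ins for CUDA's `half` and `__nv_bfloat16`, and `AccumulatorBytes` is an illustrative function that returns `sizeof(AccT)` instead of launching a kernel. It requires C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Same shape as the TypeTag struct visible in the diff context.
template <typename T>
struct TypeTag {
  using type = T;
};

// Hypothetical 16-bit stand-ins so the sketch compiles without CUDA headers.
struct Fp16 { std::uint16_t bits; };
struct Bf16 { std::uint16_t bits; };

// Returns the size in bytes of the accumulator type selected for activation
// type T. A real launcher would instantiate and launch a kernel on AccT.
template <typename T>
constexpr std::size_t AccumulatorBytes(bool fp32_knob_set) {
  // Same boolean expression as the diff, with Fp16 in place of half.
  const bool use_fp32_accum = !std::is_same_v<T, Fp16> || fp32_knob_set;
  auto launch = [](auto acc_tag) {
    using AccT = typename decltype(acc_tag)::type;
    return sizeof(AccT);
  };
  // Assumption: the 16-bit path accumulates in T itself.
  return use_fp32_accum ? launch(TypeTag<float>{}) : launch(TypeTag<T>{});
}
```

Both branches instantiate the lambda body at compile time, one per accumulator type, and the runtime boolean only chooses which instantiation runs. That is what lets an environment variable select between two differently typed kernels.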
