[CUDA] Default QMoE GEMV fp16 accumulation for fp16 activations (#29166)

tianleiwu · web-flow · commit bceab03c2496 · 2026-06-19T22:48:01.000Z
### Description

Make fp16 accumulation the default for the CUDA QMoE GEMV fast path when
activations are fp16. The previous fp32 accumulation behavior remains
available as an opt-in fallback with `ORT_MOE_GEMV_FP32_ACCUM=1`, and
bf16 activations continue to use fp32 accumulation.

This is motivated by GPT-OSS-20B decode measurements in which fp16
accumulation matched the fp32 path on a 1000-sample MMLU smoke test
(pooled accuracy 0.8260 in both modes) and delivered about 9.2% higher
decode throughput.

### Changes

- Invert the QMoE GEMV accumulation environment knob:
  - default fp16 accumulation for fp16 activations
  - `ORT_MOE_GEMV_FP32_ACCUM=1` restores fp32 accumulation
  - bf16 stays on fp32 accumulation
- Document the new runtime knob in the QMoE CUDA docs.
- Add the standalone helper, full-model decode, and MMLU smoke
measurements to the QMoE GEMV experiment log.
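
The policy in the list above can be sketched as a host-only C++ snippet. This is a minimal illustration, not the code in `moe_gemv.cu`: `HalfStandIn` and `BFloat16StandIn` are hypothetical placeholders for `half` and `__nv_bfloat16`, `DispatchAccum` and `AccumulatesInFp32` are names invented here, and the `if`/`else` that picks the accumulator tag is an assumption about how `use_fp32_accum` is consumed, since that part is outside the diff context.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for CUDA's `half` and `__nv_bfloat16`, so the sketch
// compiles without the CUDA toolkit.
struct HalfStandIn {};
struct BFloat16StandIn {};

// Same shape as the TypeTag helper visible in the diff context.
template <typename T>
struct TypeTag {
  using type = T;
};

// Policy from this change: only fp16 activations without the opt-in flag
// accumulate in 16-bit; every other combination accumulates in fp32.
template <typename T>
constexpr bool UseFp32Accum(bool fp32_accum_requested) {
  return !std::is_same_v<T, HalfStandIn> || fp32_accum_requested;
}

// Turns the runtime decision into a compile-time accumulator type by calling
// a generic callable with a tag (assumed to mirror the `launch` lambda).
template <typename T, typename Launch>
void DispatchAccum(bool fp32_accum_requested, Launch&& launch) {
  if (UseFp32Accum<T>(fp32_accum_requested)) {
    launch(TypeTag<float>{});
  } else {
    launch(TypeTag<T>{});
  }
}

// Reports which accumulator type the dispatch selected for activation type T.
template <typename T>
bool AccumulatesInFp32(bool fp32_accum_requested) {
  bool is_fp32 = false;
  DispatchAccum<T>(fp32_accum_requested, [&](auto acc_tag) {
    using AccT = typename decltype(acc_tag)::type;
    is_fp32 = std::is_same_v<AccT, float>;
  });
  return is_fp32;
}

// Compile-time view of the same truth table.
static_assert(!UseFp32Accum<HalfStandIn>(false), "fp16 defaults to fp16 accumulation");
static_assert(UseFp32Accum<HalfStandIn>(true), "the flag restores fp32 accumulation");
static_assert(UseFp32Accum<BFloat16StandIn>(false), "bf16 always accumulates in fp32");
```

Calling `AccumulatesInFp32<HalfStandIn>(false)` corresponds to the new default, and passing `true` corresponds to running with `ORT_MOE_GEMV_FP32_ACCUM=1`. For the bf16 stand-in the flag has no effect.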

### Measurements

| Measurement | Default fp16 accumulation | `ORT_MOE_GEMV_FP32_ACCUM=1` |
|---|---:|---:|
| Standalone GPT-OSS QMoE helper latency | 0.0708 ms | 0.0812 ms |
| Helper FC1 SwiGLU GEMV avg | 13.93 us | 21.57 us |
| Helper FC2 GEMV avg | 10.14 us | 12.24 us |
| Full GPT-OSS CUDA-graph decode latency | 2.588930 ms/token | 2.827260 ms/token |
| Full GPT-OSS CUDA-graph decode throughput | 386.259956 tok/s | 353.699315 tok/s |

The full-model A/B shows about +9.2% decode throughput for the default
fp16 accumulation path versus the fp32 fallback in this run.
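
As a consistency check, the throughput figure follows directly from the two full-model rows. The same comparison expressed as a latency reduction gives a smaller percentage, because it is normalized by the slower fp32 time rather than by the fp32 throughput:

$$
\frac{386.259956}{353.699315} - 1 \approx 0.092,
\qquad
1 - \frac{2.588930}{2.827260} \approx 0.084
$$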

### Accuracy

Prior 1000-sample MMLU smoke runs produced the same pooled accuracy in
both modes:

| Mode | Pooled accuracy |
|---|---:|
| fp32 accumulation | 0.8260 |
| fp16 accumulation | 0.8260 |

### Testing

- `lintrunner -a onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu`
- `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- `git diff --check -- onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu docs/contrib_ops/cuda/qmoe_gemv_experiments.md docs/contrib_ops/cuda/moe_qmoe.md`
- Standalone QMoE helper A/B on `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`
- Full GPT-OSS CUDA-graph decode A/B
diff --git a/docs/contrib_ops/cuda/moe_qmoe.md b/docs/contrib_ops/cuda/moe_qmoe.md
@@ -989,6 +989,20 @@ per-column INT4, block-wise INT4/INT8, and interleaved-SwiGLU GEMV kernels.
 | Kernel instantiation | `moe_gemv.cu` adds `__nv_bfloat16` details/instantiations (group sizes 0/32/64/128, INT4/INT8, bias on/off) under `ENABLE_BF16`. | The custom FC1/FC2 GEMV kernels run for BF16; no grouped-GEMM fallback when the FP16 gate would route. |
 | Profiling | GPT-OSS-20B, Qwen3.6-35B-A3B, and Gemma model shapes profiled with `block_size=64` for both dtypes. | BF16 matches FP16 routing and latency within noise (about 1.3x–1.5x faster than grouped GEMM); SwiGLU BF16 parity tests pass. |
 
+#### Accumulation policy
+
+The QMoE GEMV fast path accumulates fp16 activations in fp16 by default. Set
+`ORT_MOE_GEMV_FP32_ACCUM=1` before process start to restore the previous fp32
+accumulation path for fp16 activations. BF16 activations always use fp32
+accumulation because bf16 accumulation is too lossy.
+
+On the GPT-OSS-20B decode-shaped helper case
+`gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, the default fp16-accumulation path was
+0.0708 ms versus 0.0812 ms with `ORT_MOE_GEMV_FP32_ACCUM=1`. In a full GPT-OSS
+CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
+353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
+accuracy at 0.8260 for both modes.
+
 #### Experiments rejected after profiling
 
 | Experiment | Why it was rejected |
diff --git a/docs/contrib_ops/cuda/qmoe_gemv_experiments.md b/docs/contrib_ops/cuda/qmoe_gemv_experiments.md
@@ -978,3 +978,56 @@ Every case reported `has_invalid_output=false`.
   per-column case for INT4 and INT8.
 - Per-column INT8 W8A16 decode shapes route to GEMV for both FP16 and BF16 and
   beat the grouped-GEMM fallback at every profiled shape.
+
+## 2026-06-19 FP16 Accumulation Default: SM90, GPT-OSS Decode Shape
+
+### Setup
+
+- Goal: make fp16 accumulation the default for fp16 QMoE GEMV, while preserving
+  the previous fp32 accumulation path behind `ORT_MOE_GEMV_FP32_ACCUM=1`.
+- GPU: single H200 (SM90).
+- ONNX Runtime build: `~/onnxruntime/build/cu130/Release`, CUDA 13.0.
+- QMoE helper case: `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, warmup 5,
+  repeat 20.
+- Full-model case: GPT-OSS-20B INT4 QMoE, batch 1, prompt 512, generation 128,
+  warmup 2, repeat 5, CUDA graph enabled, XQA enabled, deterministic MoE tactic
+  selection.
+
+### Standalone QMoE Helper
+
+Lower is better. Both modes reported `has_invalid_output=false`.
+
+| Mode | Helper latency ms | FC1 SwiGLU GEMV avg us | FC2 GEMV avg us |
+|------|------------------:|-----------------------:|----------------:|
+| default fp16 accumulation | 0.0708 | 13.93 | 10.14 |
+| `ORT_MOE_GEMV_FP32_ACCUM=1` | 0.0812 | 21.57 | 12.24 |
+
+The new default is about 12.8% faster than the fp32 fallback in the isolated
+GPT-OSS decode-shaped QMoE helper. The gain comes from the expected GEMV rows:
+FC1 interleaved SwiGLU is about 35% faster and FC2 GEMV is about 17% faster.
+
+### Full GPT-OSS Decode
+
+| Mode | Decode latency ms/token | Decode throughput tok/s |
+|------|-------------------------:|-------------------------:|
+| default fp16 accumulation | 2.588930 | 386.259956 |
+| `ORT_MOE_GEMV_FP32_ACCUM=1` | 2.827260 | 353.699315 |
+
+The full-model A/B shows the default fp16 accumulation path is about 9.2% faster
+in decode throughput than the fp32 fallback in this run.
+
+### Accuracy Smoke
+
+Prior 1000-sample MMLU runs found no pooled-accuracy difference between the old
+fp32 default and the fp16-accumulation experiment:
+
+| Mode | Output dir | Pooled accuracy |
+|------|------------|-----------------|
+| fp32 accumulation | `~/eval_runs/mmlu1000_default_20260619_001348` | 0.8260 |
+| fp16 accumulation | `~/eval_runs/mmlu1000_fp16accum_20260619_001352` | 0.8260 |
+
+### Decision
+
+- Make fp16 accumulation the default for fp16 QMoE GEMV.
+- Keep bf16 on fp32 accumulation.
+- Keep `ORT_MOE_GEMV_FP32_ACCUM=1` as the opt-in numerical fallback and A/B knob.
diff --git a/onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu b/onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu
@@ -397,13 +397,13 @@ struct TypeTag {
   using type = T;
 };
 
-// Opt-in: accumulate the GEMV inner product in 16-bit (fp16) instead of the default
-// fp32. Honored only for fp16 activations (bf16 always accumulates in fp32). Set
-// ORT_MOE_GEMV_FP16_ACCUM=1 to measure the perf/accuracy tradeoff of 16-bit accumulation.
-inline bool MoeGemvUseFp16Accum() {
+// Opt-in: accumulate the GEMV inner product in fp32 instead of the default fp16
+// for fp16 activations. bf16 always accumulates in fp32 because 16-bit bf16
+// accumulation is too lossy.
+inline bool MoeGemvUseFp32Accum() {
   // Parsed once via ORT's environment helper (consistent parsing/thread-safety across platforms).
   static bool const enabled =
-      onnxruntime::ParseEnvironmentVariableWithDefault<int>("ORT_MOE_GEMV_FP16_ACCUM", 0) == 1;
+      onnxruntime::ParseEnvironmentVariableWithDefault<int>("ORT_MOE_GEMV_FP32_ACCUM", 0) == 1;
   return enabled;
 }
 
@@ -502,10 +502,9 @@ void launch_moe_gemv_int_symmetric(T const* act, WeightType const* weight, T con
   ORT_UNUSED_PARAMETER(sm);
   using Details = typename DetailsForTAndWeight<T, WeightType>::Details;
   using TypeA = typename DetailsForTAndWeight<T, WeightType>::TypeA;
-  // Accumulate in fp32 by default. fp16 activations may opt back into 16-bit accumulation
-  // via ORT_MOE_GEMV_FP16_ACCUM=1; bf16 always accumulates in fp32 (16-bit bf16 accumulation
-  // is too lossy). use_fp32_accum selects the kernel's AccT at runtime.
-  bool const use_fp32_accum = !std::is_same_v<T, half> || !MoeGemvUseFp16Accum();
+  // Accumulate fp16 activations in fp16 by default. ORT_MOE_GEMV_FP32_ACCUM=1
+  // restores the previous fp32 accumulation path; bf16 always uses fp32.
+  bool const use_fp32_accum = !std::is_same_v<T, half> || MoeGemvUseFp32Accum();
   auto launch = [&](auto acc_tag) {
     using AccT = typename decltype(acc_tag)::type;
     fiv::dispatch_moe_gemv_group_size<Details, kCtaN, kThreads, TypeA, AccT>(
@@ -532,8 +531,8 @@ void launch_moe_gemv_int_symmetric_interleaved_swiglu(
   ORT_UNUSED_PARAMETER(sm);
   using Details = typename DetailsForTAndWeight<T, WeightType>::Details;
   using TypeA = typename DetailsForTAndWeight<T, WeightType>::TypeA;
-  // Accumulate in fp32 by default (see launch_moe_gemv_int_symmetric for the policy).
-  bool const use_fp32_accum = !std::is_same_v<T, half> || !MoeGemvUseFp16Accum();
+  // Accumulation policy matches launch_moe_gemv_int_symmetric.
+  bool const use_fp32_accum = !std::is_same_v<T, half> || MoeGemvUseFp32Accum();
   auto launch = [&](auto acc_tag) {
     using AccT = typename decltype(acc_tag)::type;
     fiv::dispatch_moe_gemv_interleaved_swiglu_group_size<Details, kCtaN, kThreads, TypeA, AccT>(