Commit bceab03
[CUDA] Default QMoE GEMV fp16 accumulation for fp16 activations (#29166)
### Description
Make fp16 accumulation the default for the CUDA QMoE GEMV fast path when
activations are fp16. The previous fp32 accumulation behavior remains
available as an opt-in fallback with `ORT_MOE_GEMV_FP32_ACCUM=1`, and
bf16 activations continue to use fp32 accumulation.
This is motivated by GPT-OSS-20B decode measurements where fp16
accumulation was close in accuracy to the fp32 path and materially
faster (about 9.2% higher full-model decode throughput in the A/B run
reported under Measurements).
### Changes
- Invert the QMoE GEMV accumulation environment knob:
  - default fp16 accumulation for fp16 activations
  - `ORT_MOE_GEMV_FP32_ACCUM=1` restores fp32 accumulation
  - bf16 stays on fp32 accumulation
- Document the new runtime knob in the QMoE CUDA docs.
- Add the standalone helper, full-model decode, and MMLU smoke
measurements to the QMoE GEMV experiment log.
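
The selection rule described above can be sketched on the host side roughly as follows. This is a minimal illustration, not the code in `moe_gemv.cu`: the enum and function names are hypothetical, and treating only the exact string `1` as "enabled" is an assumption about how the variable is parsed.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical types for illustration; the real kernel dispatch may differ.
enum class ActivationType { kFp16, kBf16 };
enum class AccumType { kFp16, kFp32 };

// Returns true when the opt-in fallback variable is set to exactly "1".
// Assumption: any other value (or an unset variable) leaves the default.
inline bool Fp32AccumRequested() {
  const char* value = std::getenv("ORT_MOE_GEMV_FP32_ACCUM");
  return value != nullptr && std::string(value) == "1";
}

// Chooses the GEMV accumulation type according to the rules in this change:
// bf16 activations always accumulate in fp32; fp16 activations accumulate in
// fp16 unless the fp32 fallback is requested through the environment.
inline AccumType SelectGemvAccumType(ActivationType activation) {
  if (activation == ActivationType::kBf16) {
    return AccumType::kFp32;
  }
  return Fp32AccumRequested() ? AccumType::kFp32 : AccumType::kFp16;
}
```

Reading the variable on each call keeps the sketch simple; a real implementation would more likely read it once and cache the result.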
### Measurements
| Measurement | Default fp16 accumulation | `ORT_MOE_GEMV_FP32_ACCUM=1` |
|---|---:|---:|
| Standalone GPT-OSS QMoE helper latency | 0.0708 ms | 0.0812 ms |
| Helper FC1 SwiGLU GEMV avg | 13.93 us | 21.57 us |
| Helper FC2 GEMV avg | 10.14 us | 12.24 us |
| Full GPT-OSS CUDA-graph decode latency | 2.588930 ms/token | 2.827260 ms/token |
| Full GPT-OSS CUDA-graph decode throughput | 386.259956 tok/s | 353.699315 tok/s |
The full-model A/B shows about +9.2% decode throughput for the default
fp16 accumulation path versus the fp32 fallback in this run.
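
For reference, the +9.2% figure follows directly from the two full-model throughput numbers reported in the table. The short sketch below reproduces the arithmetic; the function and constant names are illustrative only and do not come from the repository.

```cpp
#include <cstdio>

// Relative change of `candidate` over `baseline`, in percent.
constexpr double RelativeGainPercent(double candidate, double baseline) {
  return (candidate / baseline - 1.0) * 100.0;
}

// Full GPT-OSS CUDA-graph decode throughput, copied from the table above.
constexpr double kFp16AccumTokPerSec = 386.259956;  // default path
constexpr double kFp32AccumTokPerSec = 353.699315;  // ORT_MOE_GEMV_FP32_ACCUM=1

// Prints the throughput gain of the fp16 accumulation path over the fallback.
inline void PrintDecodeGain() {
  std::printf("decode throughput gain: %+.1f%%\n",
              RelativeGainPercent(kFp16AccumTokPerSec, kFp32AccumTokPerSec));
}
```

Calling `PrintDecodeGain()` prints `decode throughput gain: +9.2%`. The per-token latencies give the same ratio, since 2.827260 / 2.588930 is also about 1.092.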
### Accuracy
Prior 1000-sample MMLU smoke runs produced identical pooled accuracy in
both modes:
| Mode | Pooled accuracy |
|---|---:|
| fp32 accumulation | 0.8260 |
| fp16 accumulation | 0.8260 |
### Testing
- `lintrunner -a onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu`
- `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- `git diff --check -- onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu docs/contrib_ops/cuda/qmoe_gemv_experiments.md docs/contrib_ops/cuda/moe_qmoe.md`
- Standalone QMoE helper A/B on `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`
- Full GPT-OSS CUDA-graph decode A/B

1 parent ad8e258 commit bceab03
3 files changed
Lines changed: 77 additions & 11 deletions
File tree
- docs/contrib_ops/cuda
- onnxruntime/contrib_ops/cuda/llm/moe_gemm