Commit d936717
Hoist W4A8 activation quantization out of GEMM K-loop
Add dedicated _quantize_activations_int8_kernel and _silu_quantize_int8_kernel
that pre-quantize activations to INT8 with per-row-per-tile FP32 scales before
GEMM1 and GEMM2, respectively. The existing _fused_moe_batched_int8_kernel and
_fused_moe_silu_batched_int8_kernel are rewritten to consume pre-quantized
activations + scales, eliminating ~256 redundant tl.max reductions per program
(cdiv(K, BLOCK_K) tiles * BLOCK_M rows) and halving activation HBM bandwidth in
the K-loop (bf16 -> int8). BLOCK_SIZE_K is fixed at PREQUANT_BLOCK_K (= 128)
so per-tile activation scales align with the GEMM K-loop.
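
To make the kernel split concrete, below is a minimal Triton sketch of the per-row-per-tile quantization pass. The function name, signature, and rounding mode are illustrative assumptions, not the committed `_quantize_activations_int8_kernel`, and the SiLU-and-multiply fusion of `_silu_quantize_int8_kernel` is omitted; contiguous layout along K is assumed.

```python
import triton
import triton.language as tl


@triton.jit
def quantize_act_int8_sketch(
    x_ptr, xq_ptr, scale_ptr,          # bf16 activations in; int8 values + fp32 scales out
    K,                                 # row length (hidden dim)
    stride_xm, stride_qm, stride_sm,   # row strides; K assumed contiguous (stride 1)
    BLOCK_K: tl.constexpr,             # PREQUANT_BLOCK_K = 128, matching the GEMM BLOCK_SIZE_K
):
    # One program quantizes one (row, K-tile) pair: compute the tile's absmax
    # once, derive a symmetric INT8 scale, and store quantized values + scale.
    # The GEMM K-loop can then skip its per-tile tl.max reduction entirely.
    pid_m = tl.program_id(0)           # row index
    pid_k = tl.program_id(1)           # K-tile index
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
    mask = offs_k < K
    x = tl.load(x_ptr + pid_m * stride_xm + offs_k, mask=mask, other=0.0).to(tl.float32)
    amax = tl.max(tl.abs(x), axis=0)
    scale = tl.where(amax > 0.0, amax / 127.0, 1.0)               # guard all-zero tiles
    q = x / scale
    q = tl.where(q >= 0, tl.floor(q + 0.5), -tl.floor(-q + 0.5))  # round half away from zero
    q = tl.minimum(tl.maximum(q, -127.0), 127.0)
    tl.store(xq_ptr + pid_m * stride_qm + offs_k, q.to(tl.int8), mask=mask)
    tl.store(scale_ptr + pid_m * stride_sm + pid_k, scale)
```

With a launch grid of (M, cdiv(K, 128)) this produces an INT8 tensor of shape (M, K) plus FP32 scales of shape (M, ceil(K/128)): one scale per row per 128-wide K tile, which is exactly the granularity the GEMM K-loop consumes.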
Correctness: 7/7 microbenchmark configs pass with rel diff <1.5% vs BF16 ref.
End-to-end (Qwen3.5 MoE, 1600 prefill + 512 decode tokens, --cuda_graph, A100):
prefill 5727 -> 6171 tok/s (+7.7%), decode 92.6 -> 99.0 tok/s (+6.9%).
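
For reference, the scale-aligned accumulation the rewritten GEMM kernels perform can be written in plain PyTorch. This is a hypothetical illustration, not the project's test code, and it simplifies the weights to per-output-channel INT8 scales rather than the packed INT4 of the real W4A8 path; it also shows why BLOCK_SIZE_K must equal PREQUANT_BLOCK_K, since each K tile's partial product is rescaled by exactly one activation scale.

```python
import torch


def w4a8_gemm_reference(a_q, a_scale, b_q, b_scale, block_k=128):
    """Hypothetical reference for the scale-aligned K-loop (K divisible by block_k).

    a_q:     (M, K) int8 activations from the pre-quantization kernel
    a_scale: (M, K // block_k) fp32 per-row-per-tile activation scales
    b_q:     (K, N) int8 weights (stand-in for the unpacked INT4 weights)
    b_scale: (N,) fp32 per-output-channel weight scales
    """
    M, K = a_q.shape
    N = b_q.shape[1]
    acc = torch.zeros(M, N, dtype=torch.float32)
    # Walk K in block_k-wide tiles, mirroring the GEMM K-loop: the integer
    # partial product of each tile is rescaled by that tile's activation scale.
    for t in range(K // block_k):
        ks = slice(t * block_k, (t + 1) * block_k)
        partial = a_q[:, ks].float() @ b_q[ks, :].float()
        acc += partial * a_scale[:, t : t + 1]
    return acc * b_scale[None, :]
```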
1 parent 87c9947
5 files changed
Lines changed: 261 additions & 109 deletions
File tree
- backends/cuda
- tests
- triton/kernels
- examples/models/qwen3_5_moe