Commit 32c49a3
Add W4A8 INT8 activation kernels for batched MoE prefill
Adds INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per-row, per-tile and
dequantize INT4 weights directly to INT8, skipping the bf16 intermediate.
Uses tl.dot(int8, int8) → int32 accumulation with a per-tile float32
rescale. Yields a 1.7× MoE speedup on A100 at M=1024 with 0.9998 cosine
similarity vs the bf16 baseline.
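
The description above maps to a short inner loop. Below is a minimal Triton sketch of that loop, not the committed kernels: the packed-weight layout (two nibbles per byte along K, zero-point 8), the per-(K-tile, column) fp32 weight scales, and all names and tile shapes are illustrative assumptions.

```python
# Sketch of a W4A8 tile loop: per-row per-tile INT8 activation quantization,
# INT4 -> INT8 weight unpacking, int8 x int8 -> int32 dot, fp32 rescale.
# Assumed layouts (not the committed format): weights packed two nibbles per
# byte along K with zero-point 8; one fp32 weight scale per (K-tile, column).
import triton
import triton.language as tl


@triton.jit
def w4a8_gemm_sketch(
    a_ptr, b_ptr, b_scale_ptr, c_ptr,  # bf16 acts, packed uint8 weights, fp32 scales, bf16 out
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,  # BLOCK_K even, >= 16
):
    offs_m = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = tl.program_id(1) * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k0 in range(0, K, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K),
                    other=0.0).to(tl.float32)

        # Dynamic quantization: one fp32 scale per activation row of this
        # K-tile, then round half away from zero (fp -> int truncates).
        a_scale = tl.max(tl.abs(a), axis=1) / 127.0 + 1e-8          # (BLOCK_M,)
        q = a / a_scale[:, None]
        a_i8 = (q + tl.where(q >= 0, 0.5, -0.5)).to(tl.int8)

        # INT4 -> INT8 directly, no bf16 intermediate: unpack the nibble and
        # subtract the zero-point; zero out-of-range K positions.
        b_bytes = tl.load(b_ptr + (offs_k[:, None] // 2) * stride_bk
                          + offs_n[None, :] * stride_bn,
                          mask=(offs_k[:, None] < K) & (offs_n[None, :] < N),
                          other=0)
        nib = (b_bytes.to(tl.int32) >> ((offs_k[:, None] % 2) * 4)) & 0xF
        b_i8 = tl.where(offs_k[:, None] < K, nib - 8, 0).to(tl.int8)  # [-8, 7]

        # int8 x int8 tensor-core dot accumulates in int32; fold the per-tile
        # activation and weight scales back in as a float32 rescale.
        b_scale = tl.load(b_scale_ptr + (k0 // BLOCK_K) * N + offs_n,
                          mask=offs_n < N, other=0.0)               # (BLOCK_N,)
        acc += tl.dot(a_i8, b_i8).to(tl.float32) * a_scale[:, None] * b_scale[None, :]

    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
             acc.to(tl.bfloat16),
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```

A host launch would tile the grid as (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N)) per expert; the batched-expert indexing is omitted here. The accuracy figure can be checked with a comparison of this shape, where out_w4a8 and out_bf16 are hypothetical output tensors from the two paths:

```python
import torch

cos = torch.nn.functional.cosine_similarity(
    out_w4a8.float().flatten(), out_bf16.float().flatten(), dim=0)
print(f"cosine similarity vs bf16: {cos:.4f}")  # commit reports 0.9998 at M=1024
```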
Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: a153b52
Pull Request resolved: #191871
parent: cb4e5ae
4 files changed: 501 additions, 4 deletions
File tree
- backends/cuda
- benchmarks
- triton/kernels
- examples/models/qwen3_5_moe