Commit dbcc10f
committed
Update on "Add W4A8 INT8 activation kernels for batched MoE prefill"
INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per-row per-tile and
dequantize INT4 weights directly to INT8 (skipping bf16 conversion).
Uses tl.dot(int8, int8) → int32 accumulation with per-tile float32
rescale. 1.7× MoE speedup on A100 at M=1024 with 0.9998 cosine
similarity vs bf16 baseline.
Co-authored-by: Claude <noreplyanthropic.com>
[ghstack-poisoned]1 parent 594009d commit dbcc10f
3 files changed
Lines changed: 13 additions & 8 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
213 | 213 | | |
214 | 214 | | |
215 | 215 | | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
216 | 221 | | |
217 | 222 | | |
218 | 223 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
535 | 535 | | |
536 | 536 | | |
537 | 537 | | |
538 | | - | |
| 538 | + | |
539 | 539 | | |
540 | 540 | | |
541 | 541 | | |
542 | 542 | | |
543 | | - | |
| 543 | + | |
544 | 544 | | |
545 | 545 | | |
546 | 546 | | |
| |||
783 | 783 | | |
784 | 784 | | |
785 | 785 | | |
786 | | - | |
787 | | - | |
| 786 | + | |
| 787 | + | |
788 | 788 | | |
789 | 789 | | |
790 | 790 | | |
| |||
949 | 949 | | |
950 | 950 | | |
951 | 951 | | |
952 | | - | |
| 952 | + | |
953 | 953 | | |
954 | 954 | | |
955 | | - | |
| 955 | + | |
956 | 956 | | |
957 | 957 | | |
958 | 958 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
479 | 479 | | |
480 | 480 | | |
481 | 481 | | |
482 | | - | |
| 482 | + | |
483 | 483 | | |
484 | 484 | | |
485 | 485 | | |
| |||
498 | 498 | | |
499 | 499 | | |
500 | 500 | | |
501 | | - | |
| 501 | + | |
502 | 502 | | |
503 | 503 | | |
504 | 504 | | |
| |||
0 commit comments