
Commit 2b1e1eb

Update on "Add W4A8 INT8 activation kernels for batched MoE prefill"
INT8 tensor core variants of the batched MoE GEMM kernels: bf16 activations are dynamically quantized to INT8 per-row per-tile, and INT4 weights are dequantized directly to INT8 (skipping the bf16 conversion). Accumulation uses tl.dot(int8, int8) → int32 with a per-tile float32 rescale. Measured 1.7× MoE speedup on A100 at M=1024, with 0.9998 cosine similarity vs the bf16 baseline.

Co-authored-by: Claude <noreply@anthropic.com>

[ghstack-poisoned]
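
For intuition, the per-tile math the commit message describes reduces to the minimal PyTorch sketch below. The function names, the int4 packing convention (two unsigned nibbles per byte with a zero-point of 8), and the per-full-row activation scaling are illustrative assumptions; the actual Triton kernels quantize per K-tile and fuse all of this into the batched GEMM.

# Minimal PyTorch reference for the W4A8 quantize / int8-matmul / rescale math.
# Packing convention and per-row (rather than per-tile) scaling are assumptions.
import torch


def quantize_activations_per_row(x: torch.Tensor):
    """Dynamically quantize bf16/fp32 activations to int8 with one scale per row."""
    x = x.float()
    # Symmetric quantization: map each row's max magnitude onto [-127, 127].
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale  # q: (M, K) int8, scale: (M, 1) float32


def unpack_int4_to_int8(w_packed: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit weights per byte straight to signed int8 (no bf16 step).

    Assumes stored nibbles v in [0, 15] represent the signed weight v - 8.
    """
    low = (w_packed & 0x0F).to(torch.int8) - 8
    high = ((w_packed >> 4) & 0x0F).to(torch.int8) - 8
    # Interleave the two nibbles back into the original K dimension.
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)  # (N, K) int8


def w4a8_matmul_reference(x: torch.Tensor, w_packed: torch.Tensor,
                          w_scale: torch.Tensor) -> torch.Tensor:
    """int8 x int8 -> int32 matmul plus a float32 rescale, mirroring the
    tl.dot(int8, int8) -> int32 accumulation with per-tile rescale."""
    a_q, a_scale = quantize_activations_per_row(x)        # (M, K) int8, (M, 1) f32
    w_q = unpack_int4_to_int8(w_packed)                   # (N, K) int8
    acc = a_q.to(torch.int32) @ w_q.to(torch.int32).t()   # (M, N) int32 accumulate
    # Per-row activation scale times per-output-channel weight scale.
    return acc.float() * a_scale * w_scale.view(1, -1)    # (M, N) float32
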
1 parent dbcc10f commit 2b1e1eb

2 files changed: 3 additions & 2 deletions


.ci/scripts/export_model_artifact.sh

Lines changed: 2 additions & 1 deletion
@@ -418,7 +418,8 @@ if [ "$MODEL_NAME" = "qwen3_5_moe" ]; then
   TORCHINDUCTOR_CACHE_DIR="$INDUCTOR_CACHE" \
   python -m executorch.examples.models.qwen3_5_moe.export \
     --prequantized "$LOCAL_MODEL_DIR" \
-    --output-dir "${OUTPUT_DIR}"
+    --output-dir "${OUTPUT_DIR}" \
+    --moe-activation-dtype int8
   echo "::endgroup::"
 
   test -f "${OUTPUT_DIR}/model.pte"

examples/models/qwen3_5_moe/export.py

Lines changed: 1 addition & 1 deletion
@@ -952,7 +952,7 @@ def main(): # noqa: C901
         "--moe-activation-dtype",
         choices=["bf16", "int8"],
         default="bf16",
-        help="MoE activation dtype for prefill only. Decode always uses bf16. bf16 (default): W4A16 batched GEMM. int8: W4A8 with INT8 tensor cores (~1.5x faster prefill).",
+        help="MoE activation dtype for prefill only. Decode always uses bf16. bf16 (default): W4A16 batched GEMM. int8: W4A8 with INT8 tensor cores.",
     )
     args = parser.parse_args()
 
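
For reference, enabling the new mode from the command line mirrors the CI change above (the "$LOCAL_MODEL_DIR" and "${OUTPUT_DIR}" variables come from the CI script; substitute your own paths):

python -m executorch.examples.models.qwen3_5_moe.export \
  --prequantized "$LOCAL_MODEL_DIR" \
  --output-dir "${OUTPUT_DIR}" \
  --moe-activation-dtype int8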
