
Commit 2b1e1eb

Update on "Add W4A8 INT8 activation kernels for batched MoE prefill"
INT8 tensor core variants of the batched MoE GEMM kernels: bf16 activations are dynamically quantized to INT8 per-row per-tile, and INT4 weights are dequantized directly to INT8 (skipping the bf16 conversion). Accumulation uses tl.dot(int8, int8) → int32 with a per-tile float32 rescale. Measured 1.7× MoE speedup on A100 at M=1024, with 0.9998 cosine similarity vs the bf16 baseline.

Co-authored-by: Claude <noreply@anthropic.com>

[ghstack-poisoned]
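
For intuition, the per-tile math the commit message describes reduces to the minimal PyTorch sketch below. The function names, the int4 packing convention (two unsigned nibbles per byte with a zero-point of 8), and the per-full-row activation scaling are illustrative assumptions; the actual Triton kernels quantize per K-tile and fuse all of this into the batched GEMM.

# Minimal PyTorch reference for the W4A8 quantize / int8-matmul / rescale math.
# Packing convention and per-row (rather than per-tile) scaling are assumptions.
import torch


def quantize_activations_per_row(x: torch.Tensor):
    """Dynamically quantize bf16/fp32 activations to int8 with one scale per row."""
    x = x.float()
    # Symmetric quantization: map each row's max magnitude onto [-127, 127].
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale  # q: (M, K) int8, scale: (M, 1) float32


def unpack_int4_to_int8(w_packed: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit weights per byte straight to signed int8 (no bf16 step).

    Assumes stored nibbles v in [0, 15] represent the signed weight v - 8.
    """
    low = (w_packed & 0x0F).to(torch.int8) - 8
    high = ((w_packed >> 4) & 0x0F).to(torch.int8) - 8
    # Interleave the two nibbles back into the original K dimension.
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)  # (N, K) int8


def w4a8_matmul_reference(x: torch.Tensor, w_packed: torch.Tensor,
                          w_scale: torch.Tensor) -> torch.Tensor:
    """int8 x int8 -> int32 matmul plus a float32 rescale, mirroring the
    tl.dot(int8, int8) -> int32 accumulation with per-tile rescale."""
    a_q, a_scale = quantize_activations_per_row(x)        # (M, K) int8, (M, 1) f32
    w_q = unpack_int4_to_int8(w_packed)                   # (N, K) int8
    acc = a_q.to(torch.int32) @ w_q.to(torch.int32).t()   # (M, N) int32 accumulate
    # Per-row activation scale times per-output-channel weight scale.
    return acc.float() * a_scale * w_scale.view(1, -1)    # (M, N) float32
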
1 parent dbcc10f commit 2b1e1eb

2 files changed: 3 additions & 2 deletions


.ci/scripts/export_model_artifact.sh

Lines changed: 2 additions & 1 deletion
@@ -418,7 +418,8 @@ if [ "$MODEL_NAME" = "qwen3_5_moe" ]; then
   TORCHINDUCTOR_CACHE_DIR="$INDUCTOR_CACHE" \
   python -m executorch.examples.models.qwen3_5_moe.export \
     --prequantized "$LOCAL_MODEL_DIR" \
-    --output-dir "${OUTPUT_DIR}"
+    --output-dir "${OUTPUT_DIR}" \
+    --moe-activation-dtype int8
   echo "::endgroup::"
 
   test -f "${OUTPUT_DIR}/model.pte"

examples/models/qwen3_5_moe/export.py

Lines changed: 1 addition & 1 deletion
@@ -952,7 +952,7 @@ def main(): # noqa: C901
         "--moe-activation-dtype",
         choices=["bf16", "int8"],
         default="bf16",
-        help="MoE activation dtype for prefill only. Decode always uses bf16. bf16 (default): W4A16 batched GEMM. int8: W4A8 with INT8 tensor cores (~1.5x faster prefill).",
+        help="MoE activation dtype for prefill only. Decode always uses bf16. bf16 (default): W4A16 batched GEMM. int8: W4A8 with INT8 tensor cores.",
     )
     args = parser.parse_args()
 
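
For reference, enabling the new mode from the command line mirrors the CI change above (the "$LOCAL_MODEL_DIR" and "${OUTPUT_DIR}" variables come from the CI script; substitute your own paths):

python -m executorch.examples.models.qwen3_5_moe.export \
  --prequantized "$LOCAL_MODEL_DIR" \
  --output-dir "${OUTPUT_DIR}" \
  --moe-activation-dtype int8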
