Commit 9dede18
Update on "Add GEMM-based standard SDPA benchmark"
Add bench_sdpa.cpp with a standalone GEMM-based SDPA implementation (run_standard_sdpa) alongside ExecuTorch's tiled flash attention (custom_sdpa_out) for comparative benchmarking. The standalone SDPA uses a full GEMM per head with a 3-pass softmax, and supports both [B,S,H,D] and [B,H,S,D] layouts via BLAS leading-dimension parameters, so that algorithm effects can be isolated from layout effects. Includes validation tests that verify the GEMM-based implementation matches custom_sdpa_out within tolerance.

Differential Revision: [D96044313](https://our.internmc.facebook.com/intern/diff/D96044313/)

[ghstack-poisoned]
2 parents a37f82c + 68dbb0c commit 9dede18

158 files changed

Lines changed: 14353 additions & 1466 deletions
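For context on what the commit message calls a "GEMM-based SDPA with 3-pass softmax", here is a minimal C++ sketch of that structure. It is illustrative only: the function name `sdpa_one_head` and the `ld_*` stride parameters are assumptions rather than the actual bench_sdpa.cpp API, and plain loops stand in for the BLAS GEMM calls.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-head SDPA: O = softmax(Q * K^T * scale) * V.
// q/k/v/o point at the first row of one head; each ld_* gives the distance
// (in floats) between consecutive sequence positions. For a [B,H,S,D] layout
// that stride is D; for [B,S,H,D] it is H*D. The algorithm is identical in
// both cases, which is what lets the benchmark separate layout from algorithm.
void sdpa_one_head(const float* q, const float* k, const float* v, float* o,
                   int seq_len, int head_dim,
                   int ld_q, int ld_k, int ld_v, int ld_o, float scale) {
  // Full score matrix S, materialized per head (unlike tiled flash attention).
  std::vector<float> s(static_cast<std::size_t>(seq_len) * seq_len);

  // GEMM 1: S = Q * K^T * scale (naive loops standing in for a BLAS sgemm).
  for (int i = 0; i < seq_len; ++i) {
    for (int j = 0; j < seq_len; ++j) {
      float acc = 0.0f;
      for (int d = 0; d < head_dim; ++d) {
        acc += q[i * ld_q + d] * k[j * ld_k + d];
      }
      s[static_cast<std::size_t>(i) * seq_len + j] = acc * scale;
    }
  }

  // 3-pass softmax over each row of S.
  for (int i = 0; i < seq_len; ++i) {
    float* row = s.data() + static_cast<std::size_t>(i) * seq_len;
    // Pass 1: row maximum, for numerical stability.
    const float m = *std::max_element(row, row + seq_len);
    // Pass 2: exponentiate and accumulate the normalizer.
    float sum = 0.0f;
    for (int j = 0; j < seq_len; ++j) {
      row[j] = std::exp(row[j] - m);
      sum += row[j];
    }
    // Pass 3: normalize.
    for (int j = 0; j < seq_len; ++j) {
      row[j] /= sum;
    }
  }

  // GEMM 2: O = P * V.
  for (int i = 0; i < seq_len; ++i) {
    for (int d = 0; d < head_dim; ++d) {
      float acc = 0.0f;
      for (int j = 0; j < seq_len; ++j) {
        acc += s[static_cast<std::size_t>(i) * seq_len + j] * v[j * ld_v + d];
      }
      o[i * ld_o + d] = acc;
    }
  }
}
```

The `ld_*` parameters carry the layout trick the commit message describes: in [B,H,S,D] consecutive rows of one head are `head_dim` floats apart, while in [B,S,H,D] they are `num_heads * head_dim` apart, so the same routine handles both layouts and any timing difference is attributable to memory stride rather than to the math.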


.ci/scripts/test_lora.sh

Lines changed: 1 addition & 2 deletions
```diff
@@ -139,8 +139,7 @@ Okay, so I need to calculate 15% of 80."
 EXPECTED_QUANT_LORA_PREFIX="
 <|im_start|>user Calculate 15% of 80?<|im_end|><|im_start|>assistant
 To calculate 15% of 80, we can multiply 80 by 15/100.
-80 * 15/100 = 12.
-So, 15% of 80 is 12.
+So, 15% of 80 is equal to (80 * 15) / 100 = 1200 / 100 = 12.
 #### 12
 The answer is: 12<|im_end|>"
 
```

.ci/scripts/test_model_e2e.sh

Lines changed: 22 additions & 1 deletion
```diff
@@ -354,7 +354,7 @@ EOF
     fi
     ;;
   qwen3_5_moe)
-    RUNNER_ARGS="$RUNNER_ARGS --tokenizer_path ${MODEL_DIR}/$TOKENIZER_FILE --prompt 'What is the capital of France?' --max_new_tokens 128 --temperature 0"
+    RUNNER_ARGS="$RUNNER_ARGS --tokenizer_path ${MODEL_DIR}/$TOKENIZER_FILE --prompt 'What is the capital of France?' --max_new_tokens 128 --temperature 0 --cuda_graph"
     ;;
   voxtral_realtime)
     RUNNER_ARGS="--model_path ${MODEL_DIR}/model.pte --tokenizer_path ${MODEL_DIR}/$TOKENIZER_FILE --preprocessor_path ${MODEL_DIR}/$PREPROCESSOR --audio_path ${MODEL_DIR}/$AUDIO_FILE --temperature 0"
@@ -397,6 +397,27 @@ if [ -n "$EXPECTED_OUTPUT" ]; then
 else
   echo "SUCCESS: Runner completed successfully"
 fi
+
+# Validate GPU peak memory usage for models with known memory budgets.
+# The runner prints "GPU peak memory usage: XXXX.X MiB" at the end.
+case "$MODEL_NAME" in
+  qwen3_5_moe)
+    MAX_MEMORY_MIB=20480 # 20 GB — must fit on a single GPU (e.g. 4090)
+    PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)
+    if [ -n "$PEAK_MEM" ]; then
+      # Compare as integers (truncate decimals)
+      PEAK_MEM_INT=${PEAK_MEM%%.*}
+      if [ "$PEAK_MEM_INT" -gt "$MAX_MEMORY_MIB" ]; then
+        echo "FAIL: GPU peak memory ${PEAK_MEM} MiB exceeds budget ${MAX_MEMORY_MIB} MiB"
+        exit 1
+      else
+        echo "Success: GPU peak memory ${PEAK_MEM} MiB within budget (max ${MAX_MEMORY_MIB} MiB)"
+      fi
+    else
+      echo "WARNING: GPU peak memory usage not found in output"
+    fi
+    ;;
+esac
 echo "::endgroup::"
 
 popd
```

.claude/skills/qualcomm/SKILL.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -93,6 +93,5 @@ Required flags: `-m` (SoC model), `-b` (Android build dir). Optional: `-s` (devi
 | `TestExampleLLMScript` | LLM script tests |
 | `TestExampleMultimodalityScript` | Multimodality script tests |
 | `TestExampleOssScript` | OSS model script tests |
-| `TestExampleQaihubScript` | QAI Hub script tests |
 | `TestExampleScript` | General example script tests |
 | `TestUtilsScript` | Utility script tests |
```

.github/pytorch-probot.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -6,6 +6,7 @@ ciflow_push_tags:
 - ciflow/cuda
 - ciflow/cuda-perf
 - ciflow/metal
+- ciflow/mlx
 - ciflow/nightly
 - ciflow/trunk
 - ciflow/binaries
```

.github/workflows/cuda.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -145,8 +145,8 @@ jobs:
         # Run CUDA backend Python tests
         python -m pytest backends/cuda/tests backends/cuda/passes/tests -v -o "addopts="
 
-        # Run Qwen 3.5 MoE tests (quantize roundtrip + TurboQuant KV cache)
-        python -m pytest examples/models/qwen3_5_moe/test_quantize_roundtrip.py examples/models/qwen3_5_moe/test_turboquant.py -v -o "addopts="
+        # Run Qwen 3.5 MoE tests (quantize roundtrip + TurboQuant KV cache + sampler)
+        python -m pytest examples/models/qwen3_5_moe/test_quantize_roundtrip.py examples/models/qwen3_5_moe/test_turboquant.py examples/models/qwen3_5_moe/test_sampler.py -v -o "addopts="
 
   export-model-cuda-artifact:
     name: export-model-cuda-artifact
```

.github/workflows/mlx.yml

Lines changed: 19 additions & 0 deletions
```diff
@@ -5,6 +5,8 @@ on:
     branches:
       - main
       - release/*
+    tags:
+      - ciflow/mlx/*
   pull_request:
     paths:
       - .github/workflows/mlx.yml
@@ -16,6 +18,10 @@ on:
       - examples/models/qwen3_5_moe/**
   workflow_dispatch:
 
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
+  cancel-in-progress: true
+
 permissions: {}
 
 jobs:
@@ -218,6 +224,10 @@ jobs:
         echo "::endgroup::"
 
   test-mlx-voxtral:
+    # Requires HuggingFace secrets — skip on fork PRs.
+    # Maintainers can opt-in by applying the ciflow/mlx label, which
+    # pushes a ciflow/mlx/<PR> tag that re-runs this workflow with secrets.
+    if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request'
     uses: pytorch/test-infra/.github/workflows/macos_job.yml@main
     secrets: inherit
     with:
@@ -275,6 +285,9 @@ jobs:
         echo "::endgroup::"
 
   test-mlx-voxtral-realtime:
+    # Requires HuggingFace secrets — skip on fork PRs.
+    # Maintainers can opt-in by applying the ciflow/mlx label.
+    if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request'
     uses: pytorch/test-infra/.github/workflows/macos_job.yml@main
     secrets: inherit
     with:
@@ -347,6 +360,9 @@ jobs:
         echo "::endgroup::"
 
   test-mlx-whisper:
+    # Requires HuggingFace secrets — skip on fork PRs.
+    # Maintainers can opt-in by applying the ciflow/mlx label.
+    if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request'
     uses: pytorch/test-infra/.github/workflows/macos_job.yml@main
     secrets: inherit
     with:
@@ -458,6 +474,9 @@ jobs:
         echo "::endgroup::"
 
   test-mlx-llm:
+    # Requires HuggingFace secrets — skip on fork PRs.
+    # Maintainers can opt-in by applying the ciflow/mlx label.
+    if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name != 'pull_request'
     strategy:
       fail-fast: false
       matrix:
```

CMakeLists.txt

Lines changed: 28 additions & 0 deletions
```diff
@@ -1124,6 +1124,8 @@ if(EXECUTORCH_BUILD_EXTENSION_TRAINING)
 endif()
 
 if(EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
+  add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/memory_allocator)
+  list(APPEND _executorch_extensions extension_memory_allocator)
   add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/extension/llm/runner)
   list(APPEND _executorch_extensions extension_llm_runner)
 endif()
@@ -1228,6 +1230,32 @@ if(NOT EXECUTORCH_SELECT_OPS_YAML STREQUAL ""
   )
   list(APPEND _executorch_kernels executorch_selected_kernels)
 
+  # Auto-right-size the kernel registry unless the user has pinned
+  # MAX_KERNEL_NUM.
+  if(NOT DEFINED CACHE{MAX_KERNEL_NUM} AND NOT DEFINED MAX_KERNEL_NUM)
+    gen_selected_max_kernel_num(
+      LIB_NAME "executorch_selected_kernels" OPLIST_YAMLS
+      ${gen_selected_ops_output_yaml}
+    )
+    target_include_directories(
+      executorch_core
+      PRIVATE ${executorch_selected_kernels_max_kernel_num_include_dir}
+    )
+    add_dependencies(
+      executorch_core executorch_selected_kernels_max_kernel_num_header
+    )
+    if(TARGET executorch_core_shared)
+      target_include_directories(
+        executorch_core_shared
+        PRIVATE ${executorch_selected_kernels_max_kernel_num_include_dir}
+      )
+      add_dependencies(
+        executorch_core_shared
+        executorch_selected_kernels_max_kernel_num_header
+      )
+    endif()
+  endif()
+
 install(
   TARGETS executorch_selected_kernels
   EXPORT ExecuTorchTargets
```

Makefile

Lines changed: 21 additions & 1 deletion
```diff
@@ -91,7 +91,7 @@
 #
 # ==============================================================================
 
-.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral-mlx voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal voxtral_realtime-mlx whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-mlx parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu llava-cpu gemma3-cuda gemma3-cpu qwen3_5_moe-cuda qwen3_5_moe-metal clean help
+.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral-mlx voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal voxtral_realtime-mlx voxtral_tts-cpu voxtral_tts-cuda whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-mlx parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu llava-cpu gemma3-cuda gemma3-cpu qwen3_5_moe-cuda qwen3_5_moe-metal clean help
 
 help:
 	@echo "This Makefile adds targets to build runners for various models on various backends. Run using \`make <target>\`. Available targets:"
@@ -103,6 +103,8 @@ help:
 	@echo " voxtral_realtime-cpu - Build Voxtral Realtime runner with CPU backend"
 	@echo " voxtral_realtime-metal - Build Voxtral Realtime runner with Metal backend (macOS only)"
 	@echo " voxtral_realtime-mlx - Build Voxtral Realtime runner with MLX backend"
+	@echo " voxtral_tts-cpu - Build Voxtral TTS runner (CPU)"
+	@echo " voxtral_tts-cuda - Build Voxtral TTS runner with CUDA backend"
 	@echo " whisper-cuda - Build Whisper runner with CUDA backend"
 	@echo " whisper-cuda-debug - Build Whisper runner with CUDA backend (debug mode)"
 	@echo " whisper-cpu - Build Whisper runner with CPU backend"
@@ -396,6 +398,24 @@ gemma3-cpu:
 	@echo "✓ Build complete!"
 	@echo " Binary: cmake-out/examples/models/gemma3/gemma3_e2e_runner"
 
+voxtral_tts-cpu:
+	@echo "==> Building and installing ExecuTorch..."
+	cmake --workflow --preset llm-release
+	@echo "==> Building Voxtral TTS runner (CPU)..."
+	cd examples/models/voxtral_tts && cmake --workflow --preset voxtral-tts-cpu
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo " Binary: cmake-out/examples/models/voxtral_tts/voxtral_tts_runner"
+
+voxtral_tts-cuda:
+	@echo "==> Building and installing ExecuTorch with CUDA..."
+	cmake --workflow --preset llm-release-cuda
+	@echo "==> Building Voxtral TTS runner with CUDA..."
+	cd examples/models/voxtral_tts && cmake --workflow --preset voxtral-tts-cuda
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo " Binary: cmake-out/examples/models/voxtral_tts/voxtral_tts_runner"
+
 qwen3_5_moe-cuda:
 	@echo "==> Building and installing ExecuTorch with CUDA..."
 	cmake --workflow --preset llm-release-cuda
```

backends/aoti/aoti_backend.py

Lines changed: 0 additions & 11 deletions
```diff
@@ -25,7 +25,6 @@
 
 class COMPILE_SPEC_KEYS(Enum):
     METHOD_NAME = "method_name"
-    SHARE_KV_CACHE_ACROSS_METHODS = "share_kv_cache_across_methods"
 
 
 @experimental(
@@ -287,13 +286,3 @@ def method_name_from_compile_specs(
         raise RuntimeError(
             f"Could not find method name in compile specs: {compile_specs}"
         )
-
-
-    @classmethod
-    def generate_share_kv_cache_compile_spec(cls) -> CompileSpec:
-        """
-        Generate a CompileSpec to enable cross-method KV cache sharing.
-        """
-        return CompileSpec(
-            COMPILE_SPEC_KEYS.SHARE_KV_CACHE_ACROSS_METHODS.value,
-            bytes([1]),
```

backends/apple/metal/metal_backend.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -35,6 +35,7 @@ def get_supported_fallback_kernels(cls) -> Dict[str, Any]:
         "aoti_torch_mps_convolution": None,
         "aoti_torch_mps_mm_out": None,
         "at::_ops::_scaled_dot_product_attention_math_for_mps::call": None,
+        "at::_ops::_scaled_dot_product_attention_math_for_mps_v2::call": None,
         "torchao::_linear_fp_act_4bit_weight": None,
         "at::_ops::topk::call": None,
         "metal::gather_qmv": None,
```
