
Commit 6bb983b

Add LLM support for cuda backend (#17316)
## Summary

This PR extends CUDA support to text-only LLM workflows and adds CI coverage for Qwen3-0.6B artifacts and pybind execution.

## Why

We already validate CUDA multimodal paths, but text-generation CUDA coverage (especially Qwen3) was incomplete. This change adds export/run support and CI wiring so that CUDA text-generation artifacts are exercised in automated tests.

## What changed

### CUDA LLM runner/build support

- Added `llama-cuda` and `llama-cuda-debug` Makefile targets.
- Added CUDA configure, build, and workflow presets in `examples/models/llama/CMakePresets.json`.
- Updated `examples/models/llama/CMakeLists.txt` to link the CUDA backend when `EXECUTORCH_BUILD_CUDA=ON`.
- Updated `examples/models/llama/main.cpp`:
  - Added a `--data_path` convenience flag (single PTD path).
  - Added `--prompt_file` support for file-based prompts.

### Gemma3 runner usability

- Updated `examples/models/gemma3/e2e_runner.cpp`:
  - Added `--max_new_tokens`.
  - Added `--stop_sequence` early-stop behavior.

### Optimum exporter integration and CI pin

- Bumped the optimum-executorch CI pin to `a9592258daacad7423fd5f39aaa59c6e36471520`.
- Added `Qwen/Qwen3-0.6B` handling to `.ci/scripts/export_model_artifact.sh` for `text-generation`.

### HuggingFace optimum CUDA test path

- Updated `test_text_generation` in `.ci/scripts/test_huggingface_optimum_model.py`:
  - Supports `recipe=cuda` export (`--device cuda --dtype bfloat16`).
  - Supports CUDA quantization on this path: `--qlinear 4w`, `--qlinear_packing_format tile_packed_to_4d`, `--qembedding 8w`.
  - Validates the presence of `aoti_cuda_blob.ptd`.
  - Passes the blob path into `TextLLMRunner`.

### CUDA workflow updates

- Updated `.github/workflows/cuda.yml`:
  - Added `Qwen/Qwen3-0.6B` to the CUDA export matrix.
  - Updated the `test-cuda-pybind` matrix to an explicit artifact mapping.
  - Added Qwen non-quantized and quantized-int4-tile-packed artifact runs to the pybind test.
  - Switched `download-artifact` to the matrix-provided artifact name.

## Validation

Relies on the new CI jobs.
1 parent 9f2f005 commit 6bb983b

9 files changed

Lines changed: 215 additions & 12 deletions

optimum-executorch CI commit pin

Lines changed: 1 addition & 1 deletion

```diff
@@ -1 +1 @@
-5bf1aeb587e9b1f3572b0bd60265c5dafd007b73
+a9592258daacad7423fd5f39aaa59c6e36471520
```

.ci/scripts/export_model_artifact.sh

Lines changed: 9 additions & 1 deletion

```diff
@@ -141,6 +141,14 @@ case "$HF_MODEL" in
     PREPROCESSOR_FEATURE_SIZE=""
     PREPROCESSOR_OUTPUT=""
     ;;
+  Qwen/Qwen3-0.6B)
+    MODEL_NAME="qwen3"
+    TASK="text-generation"
+    MAX_SEQ_LEN="64"
+    EXTRA_PIP=""
+    PREPROCESSOR_FEATURE_SIZE=""
+    PREPROCESSOR_OUTPUT=""
+    ;;
   nvidia/parakeet-tdt)
     MODEL_NAME="parakeet"
     TASK=""
@@ -159,7 +167,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, nvidia/parakeet-tdt"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt"
     exit 1
     ;;
 esac
```
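For a text-generation CUDA export such as Qwen3, the script's output is what the pybind test below validates. A sketch of the expected artifact layout; the directory name is illustrative, while the file names come from this PR's test changes:

```bash
# Expected contents of a CUDA text-generation artifact directory
# (directory name is hypothetical; the file names are the ones asserted
# by test_huggingface_optimum_model.py below).
ls qwen3-cuda-artifacts/
# model.pte           - exported ExecuTorch program
# aoti_cuda_blob.ptd  - named-data blob carrying the AOTInductor CUDA payload
```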

.ci/scripts/test_huggingface_optimum_model.py

Lines changed: 31 additions & 2 deletions

```diff
@@ -142,21 +142,50 @@ def test_text_generation(model_id, model_dir, recipe, *, quantize=True, run_only
             "--qembedding",
             "8w",
         ]
+    elif recipe == "cuda":
+        command += [
+            "--dtype",
+            "bfloat16",
+            "--device",
+            "cuda",
+        ]
+        if quantize:
+            command += [
+                "--qlinear",
+                "4w",
+                "--qlinear_packing_format",
+                "tile_packed_to_4d",
+                "--qembedding",
+                "8w",
+            ]
     else:
         assert (
             not quantize
-        ), "Quantization is only supported for XnnPack and CoreML recipes at the moment."
+        ), "Quantization is only supported for XnnPack, CoreML, and CUDA recipes at the moment."
 
     if not run_only:
         cli_export(command, model_dir)
 
+    if recipe == "cuda":
+        model_path = Path(model_dir) / "model.pte"
+        cuda_blob_path = Path(model_dir) / "aoti_cuda_blob.ptd"
+        assert model_path.exists(), f"Main model file not found: {model_path}"
+        assert cuda_blob_path.exists(), f"CUDA blob not found: {cuda_blob_path}"
+
     tokenizer = AutoTokenizer.from_pretrained(model_id)
     saved_files = tokenizer.save_pretrained(model_dir)
     tokenizer_path = get_tokenizer_path(model_dir, saved_files)
 
     from executorch.extension.llm.runner import GenerationConfig, TextLLMRunner
 
-    runner = TextLLMRunner(f"{model_dir}/model.pte", tokenizer_path)
+    if recipe == "cuda":
+        runner = TextLLMRunner(
+            f"{model_dir}/model.pte",
+            tokenizer_path,
+            f"{model_dir}/aoti_cuda_blob.ptd",
+        )
+    else:
+        runner = TextLLMRunner(f"{model_dir}/model.pte", tokenizer_path)
     tokens = []
     runner.generate(
         "Simply put, the theory of relativity states that",
```

.ci/scripts/test_model_e2e.sh

Lines changed: 30 additions & 3 deletions

```diff
@@ -21,6 +21,7 @@ Arguments:
   - mistralai/Voxtral-Mini-3B-2507
   - openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo})
   - google/gemma-3-4b-it
+  - Qwen/Qwen3-0.6B
   - nvidia/parakeet-tdt
   - mistralai/Voxtral-Mini-4B-Realtime-2602
 
@@ -151,6 +152,18 @@ case "$HF_MODEL" in
     AUDIO_FILE=""
     IMAGE_PATH="docs/source/_static/img/et-logo.png"
     ;;
+  Qwen/Qwen3-0.6B)
+    MODEL_NAME="qwen3"
+    RUNNER_TARGET="llama_main"
+    RUNNER_PATH="llama"
+    EXPECTED_OUTPUT="Paris"
+    PREPROCESSOR=""
+    TOKENIZER_URL="https://huggingface.co/Qwen/Qwen3-0.6B/resolve/main" # @lint-ignore
+    TOKENIZER_FILE=""
+    AUDIO_URL=""
+    AUDIO_FILE=""
+    IMAGE_PATH=""
+    ;;
   nvidia/parakeet-tdt)
     MODEL_NAME="parakeet"
     RUNNER_TARGET="parakeet_runner"
@@ -177,7 +190,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, nvidia/parakeet-tdt"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt"
     exit 1
     ;;
 esac
@@ -246,9 +259,14 @@ if [ "$(uname -s)" = "Darwin" ] && [ -f "$RUNNER_BIN" ]; then
    install_name_tool -change /opt/llvm-openmp/lib/libomp.dylib @rpath/libomp.dylib "$RUNNER_BIN"
   fi
 fi
-# For CUDA, add data_path argument (Metal embeds data in .pte)
+# For CUDA, add named data argument (Metal embeds data in .pte).
+# Llama runner uses --data_paths, other runners use --data_path.
 if [ "$DEVICE" = "cuda" ]; then
-  RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
+  if [ "$RUNNER_PATH" = "llama" ]; then
+    RUNNER_ARGS="$RUNNER_ARGS --data_paths ${MODEL_DIR}/aoti_cuda_blob.ptd"
+  else
+    RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
+  fi
 fi
 
 # Add model-specific arguments
@@ -262,6 +280,15 @@ case "$MODEL_NAME" in
   gemma3)
     RUNNER_ARGS="$RUNNER_ARGS --tokenizer_path ${MODEL_DIR}/ --image_path $IMAGE_PATH"
     ;;
+  qwen3)
+    PROMPT_FILE="${MODEL_DIR}/qwen3_prompt.txt"
+    cat > "${PROMPT_FILE}" << 'EOF'
+<|im_start|>user
+What is the capital of France?<|im_end|>
+<|im_start|>assistant
+EOF
+    RUNNER_ARGS="$RUNNER_ARGS --tokenizer_path ${MODEL_DIR}/ --prompt_file ${PROMPT_FILE}"
+    ;;
   parakeet)
     RUNNER_ARGS="--model_path ${MODEL_DIR}/model.pte --audio_path ${MODEL_DIR}/$AUDIO_FILE --tokenizer_path ${MODEL_DIR}/$TOKENIZER_FILE"
     # For CUDA, add data_path argument (Metal embeds data in .pte)
```
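Putting these pieces together, the Qwen3-on-CUDA invocation the script assembles looks roughly like the sketch below; `--model_path` is the runner's pre-existing model flag and the `MODEL_DIR` value is illustrative, both assumptions rather than lines from this diff:

```bash
# Approximate runner invocation assembled above for Qwen3 on CUDA.
# --model_path is assumed to be the runner's existing model flag;
# MODEL_DIR is illustrative.
MODEL_DIR=qwen3-cuda-artifacts
./cmake-out/examples/models/llama/llama_main \
  --model_path "${MODEL_DIR}/model.pte" \
  --data_paths "${MODEL_DIR}/aoti_cuda_blob.ptd" \
  --tokenizer_path "${MODEL_DIR}/" \
  --prompt_file "${MODEL_DIR}/qwen3_prompt.txt"
# The script then checks the output for EXPECTED_OUTPUT ("Paris").
```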

.github/workflows/cuda.yml

Lines changed: 17 additions & 4 deletions

```diff
@@ -138,6 +138,8 @@ jobs:
           name: "whisper-large-v3-turbo"
         - repo: "google"
           name: "gemma-3-4b-it"
+        - repo: "Qwen"
+          name: "Qwen3-0.6B"
        - repo: "nvidia"
           name: "parakeet-tdt"
       quant:
@@ -236,12 +238,23 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        model: ["gemma3-4b"]
-        quantize: ["", "--quantize"]
+        include:
+          - model: "gemma3-4b"
+            quantize: ""
+            artifact: "google-gemma-3-4b-it-cuda-non-quantized"
+          - model: "gemma3-4b"
+            quantize: "--quantize"
+            artifact: "google-gemma-3-4b-it-cuda-quantized-int4-tile-packed"
+          - model: "qwen3-0.6b"
+            quantize: ""
+            artifact: "Qwen-Qwen3-0.6B-cuda-non-quantized"
+          - model: "qwen3-0.6b"
+            quantize: "--quantize"
+            artifact: "Qwen-Qwen3-0.6B-cuda-quantized-int4-tile-packed"
     with:
       timeout: 120
       secrets-env: EXECUTORCH_HF_TOKEN
-      download-artifact: google-gemma-3-4b-it-cuda-${{ matrix.quantize && 'quantized-int4-tile-packed' || 'non-quantized' }}
+      download-artifact: ${{ matrix.artifact }}
       runner: linux.g5.4xlarge.nvidia.gpu
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
@@ -280,7 +293,7 @@ jobs:
        pip install git+https://github.com/huggingface/optimum-executorch.git@${OPTIMUM_ET_VERSION}
        echo "::endgroup::"
 
-       echo "::group::Test CUDA Multimodal: ${{ matrix.model }} ${{ matrix.quantize }}"
+       echo "::group::Test CUDA Model: ${{ matrix.model }} ${{ matrix.quantize }}"
        python .ci/scripts/test_huggingface_optimum_model.py \
          --model ${{ matrix.model }} \
          --recipe cuda \
```
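The matrix rows map one-to-one onto test invocations. To reproduce the new Qwen3 pybind legs locally (assuming the exported artifacts are already in place and that the matrix's `--quantize` string is passed through to the script verbatim):

```bash
# Local equivalent of the new qwen3-0.6b CI legs; the quantized variant
# forwards the matrix's "--quantize" value to the test script.
python .ci/scripts/test_huggingface_optimum_model.py \
  --model qwen3-0.6b --recipe cuda              # non-quantized
python .ci/scripts/test_huggingface_optimum_model.py \
  --model qwen3-0.6b --recipe cuda --quantize   # int4-tile-packed
```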

Makefile

Lines changed: 21 additions & 1 deletion

```diff
@@ -91,7 +91,7 @@
 #
 # ==============================================================================
 
-.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral_realtime-cpu voxtral_realtime-metal whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal sortformer-cpu silero-vad-cpu llama-cpu llava-cpu gemma3-cuda gemma3-cpu clean help
+.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral_realtime-cpu voxtral_realtime-metal whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu llava-cpu gemma3-cuda gemma3-cpu clean help
 
 help:
 	@echo "This Makefile adds targets to build runners for various models on various backends. Run using \`make <target>\`. Available targets:"
@@ -110,6 +110,8 @@ help:
 	@echo "  parakeet-metal   - Build Parakeet runner with Metal backend (macOS only)"
 	@echo "  sortformer-cpu   - Build Sortformer runner with CPU backend"
 	@echo "  silero-vad-cpu   - Build Silero VAD runner with CPU backend"
+	@echo "  llama-cuda       - Build Llama runner with CUDA backend"
+	@echo "  llama-cuda-debug - Build Llama runner with CUDA backend (debug mode)"
 	@echo "  llama-cpu        - Build Llama runner with CPU backend"
 	@echo "  llava-cpu        - Build Llava runner with CPU backend"
 	@echo "  gemma3-cuda      - Build Gemma3 runner with CUDA backend"
@@ -265,6 +267,24 @@ llama-cpu:
 	@echo "✓ Build complete!"
 	@echo "  Binary: cmake-out/examples/models/llama/llama_main"
 
+llama-cuda:
+	@echo "==> Building and installing ExecuTorch with CUDA..."
+	cmake --workflow --preset llm-release-cuda
+	@echo "==> Building Llama runner with CUDA..."
+	cd examples/models/llama && cmake --workflow --preset llama-cuda
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo "  Binary: cmake-out/examples/models/llama/llama_main"
+
+llama-cuda-debug:
+	@echo "==> Building and installing ExecuTorch with CUDA (debug mode)..."
+	cmake --workflow --preset llm-debug-cuda
+	@echo "==> Building Llama runner with CUDA (debug mode)..."
+	cd examples/models/llama && cmake --workflow --preset llama-cuda-debug
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo "  Binary: cmake-out/examples/models/llama/llama_main"
+
 llava-cpu:
 	@echo "==> Building and installing ExecuTorch..."
 	cmake --workflow --preset llm-release
```
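Both new targets chain a top-level CUDA install preset with the runner's workflow preset and drop the binary in the same place as `llama-cpu`:

```bash
# Using the new targets; each runs the top-level CUDA install preset and
# then the runner preset, producing the same binary path in both modes.
make llama-cuda          # llm-release-cuda + llama-cuda presets
make llama-cuda-debug    # llm-debug-cuda + llama-cuda-debug presets
ls cmake-out/examples/models/llama/llama_main
```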

examples/models/llama/CMakeLists.txt

Lines changed: 9 additions & 0 deletions

```diff
@@ -163,6 +163,15 @@ if(TARGET xnnpack_backend)
   executorch_target_link_options_shared_lib(xnnpack_backend)
 endif()
 
+# CUDA backend
+if(EXECUTORCH_BUILD_CUDA)
+  find_package(CUDAToolkit REQUIRED)
+  list(APPEND link_libraries aoti_cuda_backend)
+  if(NOT MSVC)
+    executorch_target_link_options_shared_lib(aoti_cuda_backend)
+  endif()
+endif()
+
 # Vulkan backend
 if(TARGET vulkan_backend)
   list(APPEND link_libraries vulkan_backend)
```
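For reference, a direct (non-preset) configure that exercises this branch might look like the sketch below; it assumes ExecuTorch has already been built and installed into `cmake-out`, which is what the new presets encode via `CMAKE_FIND_ROOT_PATH`:

```bash
# Hedged sketch of a manual configure hitting the EXECUTORCH_BUILD_CUDA
# branch above; the presets added below wrap the same cache variables.
cmake -S examples/models/llama -B cmake-out/examples/models/llama \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_FIND_ROOT_PATH="$(pwd)/cmake-out" \
  -DEXECUTORCH_BUILD_CUDA=ON
cmake --build cmake-out/examples/models/llama --target llama_main
```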

examples/models/llama/CMakePresets.json

Lines changed: 70 additions & 0 deletions

```diff
@@ -18,6 +18,36 @@
         "CMAKE_BUILD_TYPE": "Debug",
         "CMAKE_FIND_ROOT_PATH": "${sourceDir}/../../../cmake-out"
       }
+    },
+    {
+      "name": "llama-cuda-debug",
+      "displayName": "Llama runner in Debug mode with CUDA backend",
+      "binaryDir": "${sourceDir}/../../../cmake-out/examples/models/llama",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "Debug",
+        "CMAKE_FIND_ROOT_PATH": "${sourceDir}/../../../cmake-out",
+        "EXECUTORCH_BUILD_CUDA": "ON"
+      },
+      "condition": {
+        "type": "inList",
+        "string": "${hostSystemName}",
+        "list": ["Linux", "Windows"]
+      }
+    },
+    {
+      "name": "llama-cuda",
+      "displayName": "Llama runner with CUDA backend",
+      "binaryDir": "${sourceDir}/../../../cmake-out/examples/models/llama",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "Release",
+        "CMAKE_FIND_ROOT_PATH": "${sourceDir}/../../../cmake-out",
+        "EXECUTORCH_BUILD_CUDA": "ON"
+      },
+      "condition": {
+        "type": "inList",
+        "string": "${hostSystemName}",
+        "list": ["Linux", "Windows"]
+      }
     }
   ],
   "buildPresets": [
@@ -32,6 +62,18 @@
       "displayName": "Build Llama runner in Debug mode",
       "configurePreset": "llama-debug",
       "targets": ["llama_main"]
+    },
+    {
+      "name": "llama-cuda-debug",
+      "displayName": "Build Llama runner in Debug mode with CUDA backend",
+      "configurePreset": "llama-cuda-debug",
+      "targets": ["llama_main"]
+    },
+    {
+      "name": "llama-cuda",
+      "displayName": "Build Llama runner with CUDA backend",
+      "configurePreset": "llama-cuda",
+      "targets": ["llama_main"]
     }
   ],
   "workflowPresets": [
@@ -62,6 +104,34 @@
           "name": "llama-debug"
         }
       ]
+    },
+    {
+      "name": "llama-cuda-debug",
+      "displayName": "Configure and build Llama runner in Debug mode with CUDA backend",
+      "steps": [
+        {
+          "type": "configure",
+          "name": "llama-cuda-debug"
+        },
+        {
+          "type": "build",
+          "name": "llama-cuda-debug"
+        }
+      ]
+    },
+    {
+      "name": "llama-cuda",
+      "displayName": "Configure and build Llama runner with CUDA backend",
+      "steps": [
+        {
+          "type": "configure",
+          "name": "llama-cuda"
+        },
+        {
+          "type": "build",
+          "name": "llama-cuda"
+        }
+      ]
     }
   ]
 }
```
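These workflow presets are what the new Make targets call after the top-level CUDA install; they can also be driven directly:

```bash
# Driving the new workflow presets directly (requires a prior ExecuTorch
# CUDA install into cmake-out, which the Make targets perform first).
cd examples/models/llama
cmake --workflow --preset llama-cuda         # Release configure + build
cmake --workflow --preset llama-cuda-debug   # Debug configure + build
```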

examples/models/llama/main.cpp

Lines changed: 27 additions & 0 deletions

```diff
@@ -9,6 +9,7 @@
 
 #include <executorch/examples/models/llama/runner/runner.h>
 #include <gflags/gflags.h>
+#include <fstream>
 #include <sstream>
 #include <vector>
 
@@ -34,6 +35,10 @@ DEFINE_string(
 DEFINE_string(tokenizer_path, "tokenizer.bin", "Tokenizer stuff.");
 
 DEFINE_string(prompt, "The answer to the ultimate question is", "Prompt.");
+DEFINE_string(
+    prompt_file,
+    "",
+    "Optional path to a file containing the prompt. If set, this overrides --prompt.");
 
 DEFINE_double(
     temperature,
@@ -102,6 +107,17 @@ std::vector<std::string> parseStringList(const std::string& input) {
   return result;
 }
 
+bool readFileToString(const std::string& path, std::string& out) {
+  std::ifstream file(path, std::ios::in | std::ios::binary);
+  if (!file) {
+    return false;
+  }
+  std::ostringstream ss;
+  ss << file.rdbuf();
+  out = ss.str();
+  return true;
+}
+
 int32_t main(int32_t argc, char** argv) {
   gflags::ParseCommandLineFlags(&argc, &argv, true);
 
@@ -114,7 +130,18 @@ int32_t main(int32_t argc, char** argv) {
 
   const char* tokenizer_path = FLAGS_tokenizer_path.c_str();
 
+  std::string prompt_storage;
   const char* prompt = FLAGS_prompt.c_str();
+  if (!FLAGS_prompt_file.empty()) {
+    if (!readFileToString(FLAGS_prompt_file, prompt_storage)) {
+      ET_LOG(
+          Error,
+          "Failed to read prompt file at path: %s",
+          FLAGS_prompt_file.c_str());
+      return 1;
+    }
+    prompt = prompt_storage.c_str();
+  }
 
   float temperature = FLAGS_temperature;
 
```
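With `--prompt_file`, multi-line chat-template prompts no longer need shell escaping; the file contents simply replace `--prompt`. A usage sketch (the `--model_path` flag is the runner's pre-existing model argument, assumed here; paths are illustrative):

```bash
# Hedged usage sketch of the new flag: the file contents override --prompt.
printf '<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n' \
  > prompt.txt
./cmake-out/examples/models/llama/llama_main \
  --model_path model.pte \
  --tokenizer_path tokenizer.bin \
  --prompt_file prompt.txt
```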