Commit f4019c3

Revert "Add Qwen 3.6 MoE model and switch CI to Qwen3.6-35B-A3B-HQQ-INT4" (#18965)
1 parent: 7fdd306

5 files changed: 19 additions, 48 deletions


.ci/scripts/export_model_artifact.sh (2 additions, 2 deletions)
```diff
@@ -184,7 +184,7 @@ case "$HF_MODEL" in
     PREPROCESSOR_FEATURE_SIZE=""
     PREPROCESSOR_OUTPUT=""
     ;;
-  SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
+  SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
     MODEL_NAME="qwen3_5_moe"
     TASK=""
     MAX_SEQ_LEN=""
@@ -194,7 +194,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/diar_streaming_sortformer_4spk-v2, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/diar_streaming_sortformer_4spk-v2, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4"
     exit 1
     ;;
 esac
```
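
The script above dispatches on `$HF_MODEL` with a shell `case` statement: one branch per supported model id, and a `*)` fallback that rejects everything else. A minimal, self-contained sketch of that pattern (model names are taken from the diff; the helper name and the second branch's `MODEL_NAME` value are illustrative, not the script's actual values):

```shell
#!/bin/sh
# Sketch of the case-based model dispatch used by export_model_artifact.sh.
# Maps a HF model id to its per-model export setting; unknown ids fail loudly.
select_model_config() {
  case "$1" in
    SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
      echo "qwen3_5_moe"
      ;;
    Qwen/Qwen3-0.6B)
      echo "qwen3"   # illustrative value, not from the script
      ;;
    *)
      echo "Error: Unsupported model '$1'" >&2
      return 1
      ;;
  esac
}
```

The `*)` fallback is what makes typos fail fast in CI instead of silently exporting the wrong artifact.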

.ci/scripts/test_model_e2e.sh (2 additions, 2 deletions)
```diff
@@ -216,7 +216,7 @@ case "$HF_MODEL" in
     AUDIO_FILE="test_audio.wav"
     IMAGE_PATH=""
     ;;
-  SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
+  SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
     MODEL_NAME="qwen3_5_moe"
     RUNNER_TARGET="qwen3_5_moe_runner"
     RUNNER_PATH="qwen3_5_moe"
@@ -230,7 +230,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, nvidia/diar_streaming_sortformer_4spk-v2, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, nvidia/diar_streaming_sortformer_4spk-v2, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4"
     exit 1
     ;;
 esac
```

.github/workflows/cuda.yml (8 additions, 8 deletions)
```diff
@@ -180,7 +180,7 @@ jobs:
         - repo: "facebook"
           name: "dinov2-small-imagenet1k-1-layer"
         - repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
       quant:
         - "non-quantized"
         - "quantized-int4-tile-packed"
@@ -194,11 +194,11 @@ jobs:
       # Qwen3.5 MoE uses a prequantized checkpoint, only tile-packed
       - model:
           repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
         quant: "non-quantized"
       - model:
           repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
         quant: "quantized-int4-weight-only"
       # Voxtral Realtime only supports int4-tile-packed on CUDA
       - model:
@@ -254,7 +254,7 @@ jobs:
     with:
       timeout: 90
       secrets-env: EXECUTORCH_HF_TOKEN
-      runner: ${{ matrix.model.name == 'Qwen3.6-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+      runner: ${{ matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
@@ -310,7 +310,7 @@ jobs:
         - repo: "facebook"
           name: "dinov2-small-imagenet1k-1-layer"
         - repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
       quant:
         - "non-quantized"
         - "quantized-int4-tile-packed"
@@ -324,11 +324,11 @@ jobs:
       # Qwen3.5 MoE uses a prequantized checkpoint, only tile-packed
       - model:
           repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
         quant: "non-quantized"
       - model:
           repo: "SocialLocalMobile"
-          name: "Qwen3.6-35B-A3B-HQQ-INT4"
+          name: "Qwen3.5-35B-A3B-HQQ-INT4"
         quant: "quantized-int4-weight-only"
       # Voxtral Realtime only supports int4-tile-packed on CUDA
       - model:
@@ -378,7 +378,7 @@ jobs:
         quant: "non-quantized"
     with:
       timeout: 90
-      runner: ${{ matrix.model.name == 'Qwen3.6-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+      runner: ${{ matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
```
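
GitHub Actions expressions have no ternary operator, so the workflow's `runner:` line uses the `cond && a || b` short-circuit idiom, which evaluates to `a` when the condition holds (safe here because the runner label is a non-empty, truthy string) and `b` otherwise. The same selection logic as a shell sketch (runner labels are taken from the diff; the function name is illustrative):

```shell
#!/bin/sh
# Sketch of the workflow's runner selection: route the 35B MoE model to an
# A100-class machine, everything else to the default CI GPU runner.
pick_runner() {
  if [ "$1" = "Qwen3.5-35B-A3B-HQQ-INT4" ]; then
    echo "linux.aws.a100"               # large INT4 checkpoint needs A100-class VRAM
  else
    echo "linux.g5.4xlarge.nvidia.gpu"  # default CUDA runner
  fi
}
```

Putting the condition on the model *name* (not the quant variant) means every matrix entry for that model, quantized or not, lands on the larger runner.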

examples/models/qwen3_5_moe/README.md (7 additions, 25 deletions)
````diff
@@ -30,24 +30,6 @@ Export produces a `model.pte` and `aoti_cuda_blob.ptd` containing the
 compiled CUDA kernels and quantized weights. Int4 quantization is
 recommended — the model is too large to fit in VRAM at bf16.
 
-### Quick start: prequantized weights
-
-The fastest path is to export from prequantized weights, which skips
-the slow quantization step entirely.
-
-Prequantized checkpoints are available for download:
-- [SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
-- [SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
-
-```bash
-python export.py --prequantized <path-to-bundle>
-```
-
-See [Generating Prequantized Weights](#generating-prequantized-weights)
-to create your own.
-
-### Quantize and Export
-
 ```bash
 python export.py \
   --model-id Qwen/Qwen3.5-35B-A3B \
@@ -78,7 +60,7 @@ python export.py \
 | `--qlinear-group-size` | `32` | Group size for linear quantization |
 | `--qembedding` | (none) | Embedding quantization: `8w` |
 | `--hqq` | off | Use HQQ scale-only optimization for expert quantization (slower, better accuracy) |
-| `--prequantized` | (none) | Path to prequantized checkpoint directory (skips quantization) |
+| `--prequantized` | (none) | Path to prequantized bundle directory (skips quantization) |
 | `--turboquant` | off | Enable TurboQuant TQ4 KV cache compression (3.8x cache savings) |
 
 ### TurboQuant KV Cache Compression
@@ -90,11 +72,11 @@ KV cache compression (3.8x savings) on the 10 full-attention layers.
 python export.py --prequantized qwen35_moe_int4_hqq --turboquant
 ```
 
-### Generating Prequantized Weights
+### Prequantized Export
 
 Quantization is slow (~30 min with HQQ). To avoid re-quantizing on every
-export, use `quantize_and_save.py` to create a prequantized checkpoint
-directory, then export from it:
+export, use `quantize_and_save.py` to create a self-contained bundle, then
+export from it:
 
 ```bash
 # Step 1: Quantize once (slow)
@@ -106,13 +88,13 @@ python quantize_and_save.py \
   --hqq \
   --output qwen35_moe_int4_hqq
 
-# Step 2: Export from prequantized checkpoint (fast, no --model-dir needed)
+# Step 2: Export from bundle (fast, no --model-dir needed)
 python export.py \
   --prequantized qwen35_moe_int4_hqq
 ```
 
-The output directory contains `model.safetensors`, `config.json`, and
-tokenizer files. It can be uploaded to HuggingFace Hub for easy sharing.
+The bundle contains `model.safetensors`, `config.json`, and tokenizer files.
+It can be uploaded to HuggingFace Hub for easy sharing.
 
 ## Build
````
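
Per the README text in the diff, a prequantized bundle is a directory holding `model.safetensors`, `config.json`, and tokenizer files. A hypothetical pre-flight check before running `python export.py --prequantized <dir>` (the helper name is made up, and it only verifies the two files the README names explicitly):

```shell
#!/bin/sh
# Illustrative sanity check: confirm a prequantized bundle directory
# contains the files the README says it should, before a slow export run.
check_bundle() {
  dir="$1"
  for f in model.safetensors config.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $f in $dir" >&2
      return 1
    fi
  done
  echo "bundle ok"
}
```

A check like this catches a mistyped `--prequantized` path immediately instead of partway through export.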

examples/models/qwen3_6_moe/README.md (0 additions, 11 deletions)

This file was deleted.
