
Commit 7fdd306

Add Qwen 3.6 MoE model and switch CI to Qwen3.6-35B-A3B-HQQ-INT4 (pytorch#18955)
Qwen 3.6 MoE shares architecture and runner with Qwen 3.5 MoE. Add a stub README pointing to the existing qwen3_5_moe example. Update CI scripts and cuda.yml to use the Qwen 3.6 prequantized checkpoint. Improve qwen3_5_moe README: add quick-start section for prequantized weights, list available prequantized checkpoints, and clean up terminology.
Parent commit: 3998693

5 files changed: 48 additions & 19 deletions

`.ci/scripts/export_model_artifact.sh` (2 additions, 2 deletions)

```diff
@@ -184,7 +184,7 @@ case "$HF_MODEL" in
     PREPROCESSOR_FEATURE_SIZE=""
     PREPROCESSOR_OUTPUT=""
     ;;
-  SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
+  SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
     MODEL_NAME="qwen3_5_moe"
     TASK=""
     MAX_SEQ_LEN=""
@@ -194,7 +194,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/diar_streaming_sortformer_4spk-v2, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, openai/whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}, google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/diar_streaming_sortformer_4spk-v2, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4"
     exit 1
     ;;
 esac
```

`.ci/scripts/test_model_e2e.sh` (2 additions, 2 deletions)

```diff
@@ -216,7 +216,7 @@ case "$HF_MODEL" in
     AUDIO_FILE="test_audio.wav"
     IMAGE_PATH=""
     ;;
-  SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
+  SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
     MODEL_NAME="qwen3_5_moe"
     RUNNER_TARGET="qwen3_5_moe_runner"
     RUNNER_PATH="qwen3_5_moe"
@@ -230,7 +230,7 @@ case "$HF_MODEL" in
     ;;
   *)
     echo "Error: Unsupported model '$HF_MODEL'"
-    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, nvidia/diar_streaming_sortformer_4spk-v2, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4"
+    echo "Supported models: mistralai/Voxtral-Mini-3B-2507, mistralai/Voxtral-Mini-4B-Realtime-2602, nvidia/diar_streaming_sortformer_4spk-v2, openai/whisper series (whisper-{small, medium, large, large-v2, large-v3, large-v3-turbo}), google/gemma-3-4b-it, Qwen/Qwen3-0.6B, nvidia/parakeet-tdt, facebook/dinov2-small-imagenet1k-1-layer, SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4"
     exit 1
     ;;
 esac
```
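Both CI scripts dispatch on `$HF_MODEL` with a shell `case` statement, as the hunks above show. A minimal standalone sketch of that pattern (only the Qwen 3.6 branch is taken from the diff; the real scripts set many more variables per model):

```shell
#!/bin/sh
# Map a Hugging Face model id to the local example name, mirroring the
# case-dispatch pattern in the CI scripts above (condensed sketch).
HF_MODEL="${1:-SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4}"

case "$HF_MODEL" in
  SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
    # Qwen 3.6 MoE reuses the Qwen 3.5 MoE export path and runner.
    MODEL_NAME="qwen3_5_moe"
    ;;
  *)
    # Unknown ids fail fast, as in both scripts.
    echo "Error: Unsupported model '$HF_MODEL'" >&2
    exit 1
    ;;
esac

echo "$MODEL_NAME"
```

Keeping the supported-model list and the error message in one `case` per script means adding a model is a two-line change per file, which is exactly what this commit does.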

`.github/workflows/cuda.yml` (8 additions, 8 deletions)

```diff
@@ -180,7 +180,7 @@ jobs:
           - repo: "facebook"
             name: "dinov2-small-imagenet1k-1-layer"
           - repo: "SocialLocalMobile"
-            name: "Qwen3.5-35B-A3B-HQQ-INT4"
+            name: "Qwen3.6-35B-A3B-HQQ-INT4"
         quant:
           - "non-quantized"
           - "quantized-int4-tile-packed"
@@ -194,11 +194,11 @@ jobs:
           # Qwen3.5 MoE uses a prequantized checkpoint, only tile-packed
           - model:
               repo: "SocialLocalMobile"
-              name: "Qwen3.5-35B-A3B-HQQ-INT4"
+              name: "Qwen3.6-35B-A3B-HQQ-INT4"
             quant: "non-quantized"
           - model:
               repo: "SocialLocalMobile"
-              name: "Qwen3.5-35B-A3B-HQQ-INT4"
+              name: "Qwen3.6-35B-A3B-HQQ-INT4"
             quant: "quantized-int4-weight-only"
           # Voxtral Realtime only supports int4-tile-packed on CUDA
           - model:
@@ -254,7 +254,7 @@ jobs:
     with:
       timeout: 90
       secrets-env: EXECUTORCH_HF_TOKEN
-      runner: ${{ matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+      runner: ${{ matrix.model.name == 'Qwen3.6-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
@@ -310,7 +310,7 @@ jobs:
           - repo: "facebook"
             name: "dinov2-small-imagenet1k-1-layer"
           - repo: "SocialLocalMobile"
-            name: "Qwen3.5-35B-A3B-HQQ-INT4"
+            name: "Qwen3.6-35B-A3B-HQQ-INT4"
         quant:
           - "non-quantized"
           - "quantized-int4-tile-packed"
@@ -324,11 +324,11 @@ jobs:
           # Qwen3.5 MoE uses a prequantized checkpoint, only tile-packed
           - model:
               repo: "SocialLocalMobile"
-              name: "Qwen3.5-35B-A3B-HQQ-INT4"
+              name: "Qwen3.6-35B-A3B-HQQ-INT4"
             quant: "non-quantized"
           - model:
               repo: "SocialLocalMobile"
-              name: "Qwen3.5-35B-A3B-HQQ-INT4"
+              name: "Qwen3.6-35B-A3B-HQQ-INT4"
             quant: "quantized-int4-weight-only"
           # Voxtral Realtime only supports int4-tile-packed on CUDA
           - model:
@@ -378,7 +378,7 @@ jobs:
             quant: "non-quantized"
     with:
       timeout: 90
-      runner: ${{ matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+      runner: ${{ matrix.model.name == 'Qwen3.6-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
```
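The `runner:` lines in this workflow use the GitHub Actions expression idiom `cond && a || b`, which behaves like a ternary as long as `a` is truthy (here it is, a non-empty string literal). A minimal sketch of the pattern with an illustrative matrix (the job name and second matrix entry are assumptions, not taken from this workflow; the runner labels match the diff above):

```yaml
jobs:
  test-model:   # illustrative job name
    strategy:
      matrix:
        model:
          - { repo: "SocialLocalMobile", name: "Qwen3.6-35B-A3B-HQQ-INT4" }
          - { repo: "Qwen", name: "Qwen3-0.6B" }
    # Route only the large MoE checkpoint to an A100 runner;
    # everything else stays on the default GPU runner.
    runs-on: ${{ matrix.model.name == 'Qwen3.6-35B-A3B-HQQ-INT4' && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
    steps:
      - run: echo "Testing ${{ matrix.model.repo }}/${{ matrix.model.name }}"
```

Because the selector keys on the model name, renaming the checkpoint (3.5 to 3.6) requires updating both `runner:` expressions, which is why this commit touches them in two places.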

`examples/models/qwen3_5_moe/README.md` (25 additions, 7 deletions)

````diff
@@ -30,6 +30,24 @@ Export produces a `model.pte` and `aoti_cuda_blob.ptd` containing the
 compiled CUDA kernels and quantized weights. Int4 quantization is
 recommended — the model is too large to fit in VRAM at bf16.
 
+### Quick start: prequantized weights
+
+The fastest path is to export from prequantized weights, which skips
+the slow quantization step entirely.
+
+Prequantized checkpoints are available for download:
+- [SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
+- [SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
+
+```bash
+python export.py --prequantized <path-to-bundle>
+```
+
+See [Generating Prequantized Weights](#generating-prequantized-weights)
+to create your own.
+
+### Quantize and Export
+
 ```bash
 python export.py \
   --model-id Qwen/Qwen3.5-35B-A3B \
@@ -60,7 +78,7 @@ python export.py \
 | `--qlinear-group-size` | `32` | Group size for linear quantization |
 | `--qembedding` | (none) | Embedding quantization: `8w` |
 | `--hqq` | off | Use HQQ scale-only optimization for expert quantization (slower, better accuracy) |
-| `--prequantized` | (none) | Path to prequantized bundle directory (skips quantization) |
+| `--prequantized` | (none) | Path to prequantized checkpoint directory (skips quantization) |
 | `--turboquant` | off | Enable TurboQuant TQ4 KV cache compression (3.8x cache savings) |
 
 ### TurboQuant KV Cache Compression
@@ -72,11 +90,11 @@ KV cache compression (3.8x savings) on the 10 full-attention layers.
 python export.py --prequantized qwen35_moe_int4_hqq --turboquant
 ```
 
-### Prequantized Export
+### Generating Prequantized Weights
 
 Quantization is slow (~30 min with HQQ). To avoid re-quantizing on every
-export, use `quantize_and_save.py` to create a self-contained bundle, then
-export from it:
+export, use `quantize_and_save.py` to create a prequantized checkpoint
+directory, then export from it:
 
 ```bash
 # Step 1: Quantize once (slow)
@@ -88,13 +106,13 @@ python quantize_and_save.py \
   --hqq \
   --output qwen35_moe_int4_hqq
 
-# Step 2: Export from bundle (fast, no --model-dir needed)
+# Step 2: Export from prequantized checkpoint (fast, no --model-dir needed)
 python export.py \
   --prequantized qwen35_moe_int4_hqq
 ```
 
-The bundle contains `model.safetensors`, `config.json`, and tokenizer files.
-It can be uploaded to HuggingFace Hub for easy sharing.
+The output directory contains `model.safetensors`, `config.json`, and
+tokenizer files. It can be uploaded to HuggingFace Hub for easy sharing.
 
 ## Build
 
````

New file: stub README for Qwen 3.6 MoE (11 additions, 0 deletions)

```diff
@@ -0,0 +1,11 @@
+# Qwen 3.6 MoE
+
+Qwen 3.6 MoE uses the same architecture and runner as Qwen 3.5 MoE.
+See [examples/models/qwen3_5_moe](../qwen3_5_moe/) for export, build,
+and inference instructions.
+
+Prequantized INT4 weights are available at
+[SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4).
+
+**Note:** This model has not been tested or evaluated. It is provided
+mainly for development purposes.
```
