@@ -30,6 +30,24 @@ Export produces a `model.pte` and `aoti_cuda_blob.ptd` containing the
 compiled CUDA kernels and quantized weights. Int4 quantization is
 recommended — the model is too large to fit in VRAM at bf16.
 
+### Quick start: prequantized weights
+
+The fastest path is to export from prequantized weights, which skips
+the slow quantization step entirely.
+
+Prequantized checkpoints are available for download:
+- [SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
+- [SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
+
+```bash
+python export.py --prequantized <path-to-checkpoint>
+```
+
+See [Generating Prequantized Weights](#generating-prequantized-weights)
+to create your own.
+
+### Quantize and Export
+
 ```bash
 python export.py \
   --model-id Qwen/Qwen3.5-35B-A3B \
@@ -60,7 +78,7 @@ python export.py \
 | `--qlinear-group-size` | `32` | Group size for linear quantization |
 | `--qembedding` | (none) | Embedding quantization: `8w` |
 | `--hqq` | off | Use HQQ scale-only optimization for expert quantization (slower, better accuracy) |
-| `--prequantized` | (none) | Path to prequantized bundle directory (skips quantization) |
+| `--prequantized` | (none) | Path to prequantized checkpoint directory (skips quantization) |
 | `--turboquant` | off | Enable TurboQuant TQ4 KV cache compression (3.8x cache savings) |
 
 ### TurboQuant KV Cache Compression
@@ -72,11 +90,11 @@ KV cache compression (3.8x savings) on the 10 full-attention layers.
 python export.py --prequantized qwen35_moe_int4_hqq --turboquant
 ```
 
-### Prequantized Export
+### Generating Prequantized Weights
 
 Quantization is slow (~30 min with HQQ). To avoid re-quantizing on every
-export, use `quantize_and_save.py` to create a self-contained bundle, then
-export from it:
+export, use `quantize_and_save.py` to create a prequantized checkpoint
+directory, then export from it:
 
 ```bash
 # Step 1: Quantize once (slow)
@@ -88,13 +106,13 @@ python quantize_and_save.py \
   --hqq \
   --output qwen35_moe_int4_hqq
 
-# Step 2: Export from bundle (fast, no --model-dir needed)
+# Step 2: Export from prequantized checkpoint (fast, no --model-dir needed)
 python export.py \
   --prequantized qwen35_moe_int4_hqq
 ```
 
-The bundle contains `model.safetensors`, `config.json`, and tokenizer files.
-It can be uploaded to HuggingFace Hub for easy sharing.
+The output directory contains `model.safetensors`, `config.json`, and
+tokenizer files. It can be uploaded to HuggingFace Hub for easy sharing.
 
 ## Build
 
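Before running the fast export path, it can help to confirm the checkpoint directory actually has the layout the docs above describe. The snippet below is a minimal sketch, not part of the repo: `check_ckpt` is a hypothetical helper, and the required file names (`model.safetensors`, `config.json`) are taken from the paragraph above; tokenizer files are not checked.

```shell
# Sketch: sanity-check a prequantized checkpoint directory before export.
# check_ckpt is a hypothetical helper; required file names come from the
# docs above (tokenizer files are not verified here).
check_ckpt() {
  local dir=$1 f missing=0
  for f in model.safetensors config.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return $missing
}

# Example: check the Step 1 output directory before calling export.py.
check_ckpt qwen35_moe_int4_hqq || echo "checkpoint incomplete; re-run quantize_and_save.py"
```

If the function prints nothing and returns 0, the directory has the two files `export.py` reads when `--prequantized` is passed.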