@@ -30,24 +30,6 @@ Export produces a `model.pte` and `aoti_cuda_blob.ptd` containing the
 compiled CUDA kernels and quantized weights. Int4 quantization is
 recommended — the model is too large to fit in VRAM at bf16.
 
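As a rough sanity check on that claim, assuming the ~35B parameter count implied by the model name, the weight footprint alone works out as follows (illustrative numbers; KV cache and activations come on top):

```bash
# Back-of-envelope weight memory: bf16 is 2 bytes/param, int4 ~0.5 bytes/param
python -c "p = 35e9; print(f'bf16 ~{2*p/2**30:.0f} GiB, int4 ~{0.5*p/2**30:.1f} GiB')"
```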
-### Quick start: prequantized weights
-
-The fastest path is to export from prequantized weights, which skips
-the slow quantization step entirely.
-
-Prequantized checkpoints are available for download:
-- [SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4)
-- [SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4](https://huggingface.co/SocialLocalMobile/Qwen3.6-35B-A3B-HQQ-INT4)
-
-```bash
-python export.py --prequantized <path-to-bundle>
-```
-
-See [Generating Prequantized Weights](#generating-prequantized-weights)
-to create your own.
-
-### Quantize and Export
-
 ```bash
 python export.py \
   --model-id Qwen/Qwen3.5-35B-A3B \
@@ -78,7 +60,7 @@ python export.py \
 | `--qlinear-group-size` | `32` | Group size for linear quantization |
 | `--qembedding` | (none) | Embedding quantization: `8w` |
 | `--hqq` | off | Use HQQ scale-only optimization for expert quantization (slower, better accuracy) |
-| `--prequantized` | (none) | Path to prequantized checkpoint directory (skips quantization) |
+| `--prequantized` | (none) | Path to prequantized bundle directory (skips quantization) |
 | `--turboquant` | off | Enable TurboQuant TQ4 KV cache compression (3.8x cache savings) |
 
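Taken together, a representative invocation combining the flags above might look like this (a sketch only; the values are the defaults and examples from the table, not a recommended configuration):

```bash
# Illustrative flag combination - adjust values for your setup
python export.py \
  --model-id Qwen/Qwen3.5-35B-A3B \
  --qlinear-group-size 32 \
  --qembedding 8w \
  --hqq
```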
 ### TurboQuant KV Cache Compression
@@ -90,11 +72,11 @@ KV cache compression (3.8x savings) on the 10 full-attention layers.
 python export.py --prequantized qwen35_moe_int4_hqq --turboquant
 ```
 
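The 3.8x figure (rather than a clean 16-bit to 4-bit 4x) presumably reflects quantization metadata stored alongside the 4-bit payload. One plausible accounting, assuming an fp16 scale per 64-element group (both numbers are assumptions, not confirmed TurboQuant internals):

```bash
# Hypothetical accounting: 16-bit baseline vs. 4-bit payload + fp16 scale per 64 elems
python -c "print(f'{16 / (4 + 16/64):.2f}x')"  # -> 3.76x, roughly the quoted 3.8x
```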
-### Generating Prequantized Weights
+### Prequantized Export
 
 Quantization is slow (~30 min with HQQ). To avoid re-quantizing on every
-export, use `quantize_and_save.py` to create a prequantized checkpoint
-directory, then export from it:
+export, use `quantize_and_save.py` to create a self-contained bundle, then
+export from it:
 
 ```bash
 # Step 1: Quantize once (slow)
@@ -106,13 +88,13 @@ python quantize_and_save.py \
   --hqq \
   --output qwen35_moe_int4_hqq
 
-# Step 2: Export from prequantized checkpoint (fast, no --model-dir needed)
+# Step 2: Export from bundle (fast, no --model-dir needed)
 python export.py \
   --prequantized qwen35_moe_int4_hqq
 ```
 
-The output directory contains `model.safetensors`, `config.json`, and
-tokenizer files. It can be uploaded to HuggingFace Hub for easy sharing.
+The bundle contains `model.safetensors`, `config.json`, and tokenizer files.
+It can be uploaded to HuggingFace Hub for easy sharing.
 
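For the upload step, something like this should work (assumes the `huggingface_hub` CLI is installed and you are logged in; the repo name is a placeholder):

```bash
# Push the bundle directory to a (placeholder) Hub repo
huggingface-cli upload your-username/qwen35-moe-int4-hqq qwen35_moe_int4_hqq
```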
 ## Build
 