
Commit 3089088

chilukam-qti authored and Trishansh Bhardwaj committed

Addition of Qwen2.5-0.5B-Instruct, Qwen2.5-Coder-0.5B-Instruct and Qwen2.5-Coder-1.5B-Instruct recipes for QNN
1 parent 8540cd3 commit 3089088

12 files changed: 666 additions & 0 deletions


Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Qwen2.5-0.5B-Instruct Model Optimization

This repository demonstrates the optimization of the [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model using **post-training quantization (PTQ)** techniques.

### Quantization Python Environment Setup

Quantization is resource-intensive and requires GPU acceleration. In an x64 Python environment, install the required packages:

```bash
pip install -r requirements.txt

# Disable CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0

# Install GptqModel from source
pip install --no-build-isolation git+https://github.com/CodeLinaro/GPTQModel.git@rel_4.2.5
pip install --no-build-isolation git+https://github.com/Dao-AILab/fast-hadamard-transform.git@e7706faf8d1c3b9f241e36860640ad1dac644ede
```
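
Quantization runs on the GPU (the GptqModel pass in [config.json](config.json) sets `"device": "cuda"`), so it is worth confirming that this environment can actually see CUDA before starting. A minimal sanity check, assuming the pinned `torch` from requirements.txt is installed:

```bash
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```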

### AOT Compilation Python Environment Setup

Model compilation using the QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate Python environment, install the required packages:

```bash
# Install Olive
pip install olive-ai==0.11.0

# Install ONNX Runtime QNN
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.23.2" --no-deps
```
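
After installing, you can sanity-check this environment by listing the providers the onnxruntime-qnn build exposes (whether the QNN EP is usable at runtime also depends on the QNN libraries available on your platform):

```bash
python -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"
```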

Replace `/path/to/qnn/env/bin` in [config.json](config.json) with the path to the directory containing your QNN environment's Python executable. This path can be found by running the following command in that environment:

```bash
# Linux
command -v python
# Windows
# where python
```

This command will return the path to the Python executable. Set the parent directory of the executable as `/path/to/qnn/env/bin` in the config file.
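
For example, if the command returns the hypothetical path `/opt/qnn-env/bin/python`, the `systems` entry in [config.json](config.json) would be edited to:

```json
"systems": {
    "qnn_system": {
        "type": "PythonEnvironment",
        "python_environment_path": "/opt/qnn-env/bin",
        "accelerators": [ { "execution_providers": [ "QNNExecutionProvider" ] } ]
    }
}
```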

### Run the Quantization + Compilation Config

Change the `soc_model` parameter in [config.json](config.json) to the value corresponding to the target platform (see the fragment after the command below), then activate the **Quantization Python Environment** and run the workflow:

```bash
olive run --config config.json
```
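
For reference, `soc_model` sits in the provider options of the context-binary generation pass in [config.json](config.json); the recipe ships with `"60"`. Consult Qualcomm's QNN documentation for the value matching your target SoC:

```json
"cb": {
    "type": "EPContextBinaryGenerator",
    "provider_options": {
        "htp_performance_mode": "burst",
        "htp_graph_finalization_optimization_mode": "3",
        "soc_model": "60"
    },
    "weight_sharing": true
}
```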

Olive will run the AOT compilation step in the **AOT Compilation Python Environment** specified in the config file using a subprocess. All other steps will run in the **Quantization Python Environment** natively.

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

> ⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
### QNN-GPU: Run the Quantization Config

Running QNN-GPU configs requires features and fixes that are not available in the released Olive version 0.9.3. To ensure compatibility, install Olive directly from source at the required commit:

```bash
pip install git+https://github.com/microsoft/Olive.git@da24463e14ed976503dc5871572b285bc5ddc4b2
```

If you previously installed Olive from PyPI or pinned it to version 0.9.3, uninstall it first, then install from the commit above:

```bash
pip uninstall olive-ai
```
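
To confirm which Olive the environment now resolves to (a source install from the commit above reports its own version metadata, distinct from a PyPI pin), a quick check:

```bash
pip show olive-ai
```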

Replace `/path/to/qnn/env/bin` in [config_gpu.json](config_gpu.json) with the path to the directory containing your QNN environment's Python executable.

Activate the **Quantization Python Environment** and run the workflow:

```bash
olive run --config config_gpu.json
```

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

### QNN-GPU: Run the Context Binary Compilation Config

Replace `/path/to/model/` in [config_gpu_ctxbin.json](config_gpu_ctxbin.json) with the output path generated by the step above.

Activate the **AOT Compilation Python Environment** and run the workflow:

```bash
olive run --config config_gpu_ctxbin.json
```

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

> ⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
{
    "input_model": { "type": "HfModel", "model_path": "Qwen/Qwen2.5-0.5B-Instruct" },
    "systems": {
        "qnn_system": {
            "type": "PythonEnvironment",
            "python_environment_path": "/path/to/qnn/env/bin",
            "accelerators": [ { "execution_providers": [ "QNNExecutionProvider" ] } ]
        }
    },
    "data_configs": [
        {
            "name": "wikitext2_train_joined",
            "type": "HuggingfaceContainer",
            "load_dataset_config": { "data_name": "wikitext", "subset": "wikitext-2-raw-v1", "split": "train" },
            "pre_process_data_config": {
                "strategy": "join",
                "add_special_tokens": false,
                "max_seq_len": 4096,
                "max_samples": 128
            }
        },
        {
            "name": "wikitext2_train_act",
            "type": "HuggingfaceContainer",
            "load_dataset_config": { "data_name": "wikitext", "subset": "wikitext-2-raw-v1", "split": "train" },
            "pre_process_data_config": {
                "strategy": "line-by-line",
                "add_special_tokens": true,
                "max_samples": 256,
                "max_seq_len": 4096
            }
        }
    ],
    "passes": {
        "cs": { "type": "CaptureSplitInfo", "num_splits": 1, "unique_embeds_lm_head_splits": true },
        "g": {
            "type": "GptqModel",
            "bits": 4,
            "sym": true,
            "group_size": -1,
            "lm_head": true,
            "rotation": "hadamard",
            "device": "cuda",
            "data_config": "wikitext2_train_joined",
            "dynamic": {
                "+:.*lm_head*": { "bits": 8, "sym": true, "group_size": 32, "desc_act": false }
            }
        },
        "mb": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_block_size": 32,
            "int4_accuracy_level": 4,
            "int4_op_types_to_quantize": [ "Gather" ]
        },
        "mq": {
            "type": "MatMulNBitsToQDQ",
            "use_int4": true,
            "add_zero_point": true,
            "nodes_to_exclude": [ "/lm_head/MatMulNBits" ],
            "save_as_external_data": true
        },
        "gs": {
            "type": "GraphSurgeries",
            "surgeries": [
                { "surgeon": "RemoveRopeMultiCache" },
                { "surgeon": "AttentionMaskToSequenceLengths" },
                { "surgeon": "RemoveGidxFromMatMulNBits" },
                { "surgeon": "SimplifiedLayerNormToL2Norm" }
            ],
            "save_as_external_data": true
        },
        "sq": {
            "type": "OnnxStaticQuantization",
            "data_config": "wikitext2_train_act",
            "activation_type": "uint16",
            "precision": "uint8",
            "calibration_providers": [ "CUDAExecutionProvider" ],
            "quant_preprocess": true,
            "op_types_to_exclude": [ "GatherBlockQuantized", "GroupQueryAttention", "MatMulNBits" ],
            "save_as_external_data": true
        },
        "sp": { "type": "SplitModel" },
        "st": { "type": "StaticLLM", "batch_size": 1, "context_length": 64 },
        "cb": {
            "type": "EPContextBinaryGenerator",
            "provider_options": {
                "htp_performance_mode": "burst",
                "htp_graph_finalization_optimization_mode": "3",
                "soc_model": "60"
            },
            "weight_sharing": true
        },
        "cp": { "type": "ComposeOnnxModels" }
    },
    "target": "qnn_system",
    "log_severity_level": 1,
    "output_dir": "models/qwen_2.5_0.5b_Instruct",
    "cache_dir": "cache",
    "no_artifacts": true
}
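
As the README notes, `soc_model` in the config above must match the target platform before running. A hypothetical one-liner for editing it with `jq` (assuming `jq` is installed; any JSON editor works, and `<YOUR_SOC_MODEL>` is a placeholder for your device's value):

```bash
# Rewrite the soc_model provider option in place; "60" is what the recipe ships with.
jq '.passes.cb.provider_options.soc_model = "<YOUR_SOC_MODEL>"' config.json > config.json.tmp \
  && mv config.json.tmp config.json
```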
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
keywords:
  - foundry-local
  - qnn
arch: qwen2
recipes:
  - file: "config.json"
    devices:
      - npu
    ep: QNNExecutionProvider
name: qwen2.5-0.5b-instruct
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
datasets
setuptools==81
wheel
# these are the versions the recipes were last validated with
torch==2.8.0
torchvision==0.23.0
olive-ai==0.11.0
onnxruntime-genai-cuda==0.11.2
onnxruntime-gpu==1.23.2
optimum
# newer transformers might have incompatibility with gptq passes
transformers==4.57.3
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Qwen2.5-Coder-0.5B-Instruct Model Optimization

This repository demonstrates the optimization of the [Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) model using **post-training quantization (PTQ)** techniques.

### Quantization Python Environment Setup

Quantization is resource-intensive and requires GPU acceleration. In an x64 Python environment, install the required packages:

```bash
pip install -r requirements.txt

# Disable CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0

# Install GptqModel from source
pip install --no-build-isolation git+https://github.com/CodeLinaro/GPTQModel.git@rel_4.2.5
pip install --no-build-isolation git+https://github.com/Dao-AILab/fast-hadamard-transform.git@e7706faf8d1c3b9f241e36860640ad1dac644ede
```

### AOT Compilation Python Environment Setup

Model compilation using the QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate Python environment, install the required packages:

```bash
# Install Olive
pip install olive-ai==0.11.0

# Install ONNX Runtime QNN
pip install -r https://raw.githubusercontent.com/refs/heads/main/requirements.txt
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.23.2" --no-deps
```

Replace `/path/to/qnn/env/bin` in [config.json](config.json) with the path to the directory containing your QNN environment's Python executable. This path can be found by running the following command in that environment:

```bash
# Linux
command -v python
# Windows
# where python
```

This command will return the path to the Python executable. Set the parent directory of the executable as `/path/to/qnn/env/bin` in the config file.

### Run the Quantization + Compilation Config

Change the `soc_model` parameter in [config.json](config.json) to the value corresponding to the target platform, then activate the **Quantization Python Environment** and run the workflow:

```bash
olive run --config config.json
```

Olive will run the AOT compilation step in the **AOT Compilation Python Environment** specified in the config file using a subprocess. All other steps will run in the **Quantization Python Environment** natively.

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

> ⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.

### QNN-GPU: Run the Quantization Config

Running QNN-GPU configs requires features and fixes that are not available in the released Olive version 0.9.3. To ensure compatibility, install Olive directly from source at the required commit:

```bash
pip install git+https://github.com/microsoft/Olive.git@da24463e14ed976503dc5871572b285bc5ddc4b2
```

If you previously installed Olive from PyPI or pinned it to version 0.9.3, uninstall it first, then install from the commit above:

```bash
pip uninstall olive-ai
```

Replace `/path/to/qnn/env/bin` in [config_gpu.json](config_gpu.json) with the path to the directory containing your QNN environment's Python executable.

Activate the **Quantization Python Environment** and run the workflow:

```bash
olive run --config config_gpu.json
```

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

### QNN-GPU: Run the Context Binary Compilation Config

Replace `/path/to/model/` in [config_gpu_ctxbin.json](config_gpu_ctxbin.json) with the output path generated by the step above.

Activate the **AOT Compilation Python Environment** and run the workflow:

```bash
olive run --config config_gpu_ctxbin.json
```

✅ Optimized model saved in: `models/qwen_2.5_0.5b_Instruct/`

> ⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
