
Commit b146f17

ynankani authored and danielkorzekwa committed

Ynankani/update windows benchmark md (#762)

## What does this PR do?

**Type of change:** Documentation

**Overview:** Markdown update to add perplexity and KL-divergence benchmark info.

## Before your PR is "Ready for review"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: NA
- **Did you write any new necessary tests?**: NA
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: NA

## Summary by CodeRabbit

* **Documentation**
  * Expanded the accuracy comparison section with three detailed benchmark metrics: MMLU scores, Perplexity (PPL), and KL-divergence.
  * Added comprehensive tables showing results across models and quantization configurations.
  * Included evaluation guides and references for each metric.

Signed-off-by: unknown <ynankani@nvidia.com>
1 parent 7739972 commit b146f17

File tree

2 files changed: +131 −0 lines changed

examples/windows/Benchmark.md

Lines changed: 55 additions & 0 deletions
@@ -24,6 +24,8 @@ Memory savings and inference speedup are compared to the ONNX FP16 baseline.

### 1.2 Accuracy Comparison

#### 1.2.1 MMLU

For accuracy evaluation, the [Massive Multitask Language Understanding (MMLU)](https://arxiv.org/abs/2009.03300) benchmark has been utilized. Please refer to the [detailed instructions](./accuracy_benchmark/README.md) for running the MMLU accuracy benchmark.

The table below shows the MMLU 5-shot score for some models.

@@ -39,3 +41,56 @@ The table below shows the MMLU 5-shot score for some models.

| [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 61.76 | 60.73 |
| [Llama3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 60.8 | 57.71 |
| [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 37.01 | 37.2 |
#### 1.2.2 Perplexity (PPL)

Perplexity measures how well a probability model predicts a sample. Lower perplexity values indicate better model quality. The following table shows perplexity values at input sequence length 1024 with a chunk size of 512.

**Learn more about Perplexity:** [Perplexity - Wikipedia](https://en.wikipedia.org/wiki/Perplexity) | [Hugging Face - Perplexity of Fixed-Length Models](https://huggingface.co/docs/transformers/en/perplexity)

- **FP16-MB**: Baseline FP16 genai model (Model Builder)
- **Mixed AWQ-MO**: Important linear layers in INT8, rest in INT4 (AWQ), with ModelOpt
- **Mixed RTN-MO**: Important linear layers in INT8, rest in INT4 (RTN), with ModelOpt
- **Pure INT4 AWQ-MO**: All linear layers in INT4 (AWQ), with ModelOpt
- **Pure INT4 RTN-MO**: All linear layers in INT4 (RTN), with ModelOpt
- **Pure INT8 RTN-MO**: All linear layers in INT8 (RTN), with ModelOpt
- **Pure INT8 AWQ-MO**: All linear layers in INT8 (AWQ), with ModelOpt
- **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0

| Model | FP16-MB | Mixed AWQ-MO | Mixed RTN-MO | Pure INT4 AWQ-MO | Pure INT4 RTN-MO | Pure INT8 RTN-MO | Pure INT8 AWQ-MO |
|:------|:--------|:-------------|:-------------|:-----------------|:-----------------|:-----------------|:-----------------|
| DeepSeek R1 Distill Qwen 1.5B | 39.447 | 41.699 | 44.332 | 44.213 | 46.304 | 39.802 | 39.713 |
| Llama 3.2 1B Instruct | 12.631 | 13.852 | 14.176 | 14.549 | 16.900 | 12.664 | 12.637 |
| Phi-3.5 Mini Instruct | 6.046 | 6.500 | 6.599 | 6.711 | 7.070 | - | - |
| Phi-4 Mini Instruct | 9.039 | 9.673 | 9.712 | 10.015 | 10.911 | - | - |
| Qwen 2.5 1.5B Instruct | 9.216 | 10.084 | 10.338 | 10.495 | 10.933 | 9.227 | 9.232 |

For detailed instructions on evaluating perplexity, please refer to the [Perplexity Evaluation Guide](./accuracy_benchmark/perplexity_metrics/README.md).
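As a reminder of the arithmetic behind the table (a toy sketch, not the evaluation harness itself), perplexity is the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy example: four tokens predicted with probabilities 0.5, 0.25, 0.5, 0.125.
# Mean log-likelihood is -7*ln(2)/4, so perplexity is 2**1.75 ≈ 3.36.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.125)]
print(perplexity(log_probs))
```

In the benchmark above, the log-likelihoods come from the model's next-token predictions over 1024-token inputs, processed in 512-token chunks.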
#### 1.2.3 KL-divergence

KL-divergence (Kullback-Leibler divergence) quantifies the distributional difference between the quantized model and the baseline model. Lower KL-divergence values indicate that the quantized model's output distribution is closer to the original model's.

**Learn more about KL-divergence:** [KL Divergence - Wikipedia](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) | [Understanding KL Divergence](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

**Supported backends:** PyTorch, onnxruntime-cuda, and onnxruntime-trt-rtx-ep are all supported for evaluation.

- **Baseline model**: Hugging Face FP16 model
- **Quantized models**: Models where quantization is simulated (a.k.a. fake quantization), typically using the PyTorch-CUDA backend for evaluation. In fake quantization, weights are quantized and then dequantized to simulate quantization error. The inference backend column in the table below indicates whether the reported results come from PyTorch simulation or ONNX Runtime-based inference.
- **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0

| Model | Quantization Method | Quantization Granularity | KL-divergence | Inference Backend |
|:-----------------------|:-------------------------------------------------|:--------------------------------------------------------------------|:--------------|:------------------------------|
| Qwen2.5-1.5B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
| Qwen2.5-1.5B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.336 | PyTorch (fake quantization) |
| Qwen2.5-1.5B-Instruct | int4+int8 max_algo-mixed_quant (simulated, per-channel) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.337 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
| Llama-3.2-3B-Instruct | int4+int8 Blockwise-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.228 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 per-channel-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.230 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 per-channel-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4 Blockwise-max_algo only (simulated) | INT4: per-block (block-size=128) | 0.334 | PyTorch (fake quantization) |

*All KL-divergence results above are obtained via PyTorch fake quantization simulation unless otherwise noted. Inference with ONNX Runtime can also be evaluated.*

For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md).
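As a toy illustration of the quantity being measured (not the evaluation script itself), KL-divergence compares the baseline and quantized models' next-token probability distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (illustrative values).
baseline  = [0.7, 0.2, 0.1]    # FP16 model's distribution for some context
quantized = [0.6, 0.25, 0.15]  # quantized model's distribution, same context
print(kl_divergence(baseline, quantized))
```

KL(P || Q) is zero only when the two distributions match; the benchmark reports it averaged over token positions, so lower values mean the quantized model tracks the FP16 baseline more closely.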

examples/windows/onnx_ptq/genai_llm/README.md

Lines changed: 76 additions & 0 deletions
@@ -82,6 +82,82 @@ Note:

Please refer to `quantize.py` for further details on command-line parameters.

#### Mixed Precision Quantization (INT4 + INT8)

ModelOpt-Windows supports **mixed precision quantization**, where different layers in the model can be quantized to different bit-widths. This approach combines INT4 quantization for most layers (for maximum compression and speed) with INT8 quantization for important or sensitive layers (to preserve accuracy).

##### Why Use Mixed Precision?

Mixed precision quantization provides an optimal balance between:

- **Model Size**: Primarily INT4 keeps the model small
- **Inference Speed**: INT4 layers run faster and use less memory bandwidth
- **Accuracy Preservation**: Critical layers in INT8 maintain model quality

Based on benchmark results, mixed precision quantization shows significant advantages:

| Model | Metric | INT4 RTN | Mixed RTN (INT4+INT8) | Improvement |
|:------|:-------|:-------------|:---------------------|:-----------|
| DeepSeek R1 1.5B | MMLU | 32.40% | 33.90% | +1.5% |
| | Perplexity | 46.304 | 44.332 | -2.0 (lower is better) |
| Llama 3.2 1B | MMLU | 39.90% | 44.70% | +4.8% |
| | Perplexity | 16.900 | 14.176 | -2.7 (lower is better) |
| Qwen 2.5 1.5B | MMLU | 56.70% | 57.50% | +0.8% |
| | Perplexity | 10.933 | 10.338 | -0.6 (lower is better) |

As shown above, mixed precision significantly improves accuracy with minimal disk size increase (~85-109 MB).

##### How Mixed Precision Works

The quantization strategy selects which layers to quantize to INT8 vs INT4:

1. **INT8 Layers** (Higher Precision): Important layers that significantly impact model quality. Quantized per-channel.
2. **INT4 Layers** (Maximum Compression): All other layers. Quantized block-wise.

This strategy preserves accuracy for the most sensitive layers while maintaining aggressive compression elsewhere.
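To make the per-channel vs block-wise distinction concrete, here is a minimal sketch of symmetric fake quantization (quantize-then-dequantize) over one weight row; the scaling choices are illustrative assumptions, not ModelOpt's exact kernels:

```python
def fake_quant(values, num_bits):
    """Symmetric fake quantization: round to an integer grid, then dequantize.

    Assumes at least one non-zero value so the scale is well-defined.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

row = [0.05, -0.42, 0.31, -0.07]

# Per-channel INT8: one scale shared by the whole row.
int8_row = fake_quant(row, 8)

# Block-wise INT4: an independent scale per block (toy block size of 2 here;
# the default in this doc is 128).
int4_row = [v for block in (row[0:2], row[2:4]) for v in fake_quant(block, 4)]
```

On this toy row the INT8 reconstruction error is much smaller than the INT4 one, which is exactly why the sensitive layers are kept at INT8.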
##### Using Mixed Precision Quantization

###### Method 1: Use the default mixed precision strategy

```bash
python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \
    --onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \
    --output_path="E:\models\llama3.2-1b-int4-int8-mixed\model.onnx" \
    --algo=awq_lite \
    --calib_size=32 \
    --enable_mixed_quant
```

The `--enable_mixed_quant` flag automatically applies the default strategy.

###### Method 2: Specify custom layers for INT8

```bash
python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \
    --onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \
    --output_path="E:\models\llama3.2-1b-int4-int8-custom\model.onnx" \
    --algo=awq_lite \
    --calib_size=32 \
    --layers_8bit="layers.0,layers.1,layers.15,layers.16"
```

The `--layers_8bit` option allows you to manually specify which layers to quantize to INT8. You can use:

- Layer indices: `layers.0,layers.5,layers.10`
- Layer paths: `model/layers.0/attn/qkv_proj`
- Partial names: `qkv_proj,down_proj`
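Conceptually, selection by indices, paths, or partial names all amount to name matching against the model's layers. A hypothetical sketch (the actual matching logic in `quantize.py` may differ):

```python
def select_8bit_layers(layer_names, layers_8bit):
    """Illustrative substring-style matching for a --layers_8bit value."""
    patterns = [p.strip() for p in layers_8bit.split(",") if p.strip()]
    return [name for name in layer_names if any(p in name for p in patterns)]

# Hypothetical ONNX node names for illustration only.
layers = [
    "model/layers.0/attn/qkv_proj/MatMul",
    "model/layers.0/mlp/down_proj/MatMul",
    "model/layers.1/attn/qkv_proj/MatMul",
]
print(select_8bit_layers(layers, "qkv_proj"))  # matches layers 0 and 1
```

Note that with plain substring matching, an index pattern like `layers.1` would also match `layers.15`; prefer full paths when that ambiguity matters.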
##### Technical Details

- **Block Size**: INT4 layers use block-wise quantization (default block-size=128); INT8 layers use per-channel quantization
- **Quantization Axis**: INT4 (per-block), INT8 (per-channel, row-wise)
- **Compatibility**: Works with both `awq_lite` and `rtn_dq` algorithms
- **Automatic Detection**: The `--layers_8bit` option automatically enables mixed quantization

For more benchmark results and detailed accuracy metrics, refer to the [Benchmark Guide](../../Benchmark.md).
### Evaluate the Quantized Model

To evaluate the quantized model, please refer to the [accuracy benchmarking](../../accuracy_benchmark/README.md) and [onnxruntime-genai performance benchmarking](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python).
