Commit 96a8dee

Update openvino_quantizer.rst
1 parent 6397dec commit 96a8dee

1 file changed

Lines changed: 72 additions & 14 deletions

unstable_source/openvino_quantizer.rst

@@ -36,27 +36,27 @@ The high-level architecture of this flow could look like this:
3636
float_model(Python) Example Input
3737
\ /
3838
\ /
39-
--------------------------------------------------------
39+
---------------------------------------------------------
4040
| export |
41-
--------------------------------------------------------
41+
---------------------------------------------------------
4242
|
4343
FX Graph in ATen
4444
|
4545
| OpenVINOQuantizer
4646
| /
47-
--------------------------------------------------------
47+
---------------------------------------------------------
4848
| prepare_pt2e |
4949
| | |
5050
| Calibrate
5151
| | |
5252
| convert_pt2e |
53-
--------------------------------------------------------
53+
---------------------------------------------------------
5454
|
5555
Quantized Model
5656
|
57-
--------------------------------------------------------
57+
---------------------------------------------------------
5858
| Lower into Inductor |
59-
--------------------------------------------------------
59+
---------------------------------------------------------
6060
|
6161
OpenVINO model
6262

@@ -164,10 +164,15 @@ Below is the list of essential parameters and their description:
164164
subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
165165
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))
166166
167+
* ``target_device`` - defines the target device whose characteristics are taken into account during optimization. Supported values: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
168+
169+
.. code-block:: python
170+
171+
OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)
167172
168173
For further details on `OpenVINOQuantizer` please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.OpenVINOQuantizer>`_.
169174

170-
After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
175+
After we import the backend-specific Quantizer, we will prepare the model for post-training quantization or weights-only quantization.
171176
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
172177

173178
.. code-block:: python
@@ -209,31 +214,84 @@ The optimized model is using low-level kernels designed specifically for Intel C
209214
This should significantly speed up inference time in comparison with the eager model.
210215

211216
4. Optional: Improve quantized model metrics
212-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
217+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
213218

214-
NNCF implements advanced quantization algorithms like `SmoothQuant <https://arxiv.org/abs/2211.10438>`_ and `BiasCorrection <https://arxiv.org/abs/1906.04721>`_ for static activation and weights quantization. For weights-only quantization, there are `AWQ https://arxiv.org/abs/2306.00978`_ and `Scale Estimation https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py`_ algorithms. These techniques help in improving the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
215-
These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API for static activation and weights or `compress_pt2e` for weights-only quantization:
219+
NNCF implements advanced quantization algorithms that help improve the metrics of a compressed model while minimizing the output discrepancies between the original and compressed models. These are accessed via the NNCF ``quantize_pt2e`` API for static activation and weights quantization, or ``compress_pt2e`` for weights-only quantization.
220+
221+
Post-Training Quantization
222+
""""""""""""""""""""""""""
223+
224+
``quantize_pt2e`` can be applied on top of any ``torchao`` Quantizer to improve the accuracy of the quantized model. Key algorithms:
225+
226+
- `SmoothQuant <https://arxiv.org/abs/2211.10438>`_ - Reduces activation quantization error by inserting smoothing scales before weighted layers, migrating quantization difficulty from hard-to-quantize activations onto the weights.
227+
- `BiasCorrection <https://arxiv.org/abs/1906.04721>`_ - Compares quantized and original layer outputs layer-by-layer and adjusts convolution biases to align them, compensating for the error introduced by quantization.
216228

217229
.. code-block:: python
218230
219231
from nncf.experimental.torch.fx import quantize_pt2e
220232
221233
calibration_loader = torch.utils.data.DataLoader(...)
222234
223-
224235
def transform_fn(data_item):
225236
images, _ = data_item
226237
return images
227238
228-
229239
calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
230240
quantized_model = quantize_pt2e(
231241
exported_model, quantizer, calibration_dataset, smooth_quant=True, fast_bias_correction=False
232242
)
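
To make the smoothing idea concrete, the following is a minimal pure-Python sketch of the per-channel scale described in the SmoothQuant paper, ``s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)``. It is an illustration only and not NNCF's implementation; the helper names ``smooth_scales`` and ``matvec`` and the toy values are assumptions made for this example.

.. code-block:: python

    def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
        """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
        return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]

    def matvec(x, weight):
        """Compute y_k = sum_j x[j] * weight[j][k] for a weight stored as [in][out]."""
        out_features = len(weight[0])
        return [sum(x[j] * weight[j][k] for j in range(len(x))) for k in range(out_features)]

    # Toy layer (hypothetical values): input channel 0 carries an activation outlier.
    x = [8.0, 0.5]
    weight = [[0.1, -0.2], [0.4, 0.3]]

    act_absmax = [abs(v) for v in x]
    weight_absmax = [max(abs(v) for v in row) for row in weight]
    scales = smooth_scales(act_absmax, weight_absmax)

    # Divide activations by s and multiply the matching weight rows by s.
    x_smooth = [v / s for v, s in zip(x, scales)]
    w_smooth = [[v * s for v in row] for row, s in zip(weight, scales)]

    y_ref = matvec(x, weight)
    y_smooth = matvec(x_smooth, w_smooth)

The rescaling leaves the layer output unchanged up to floating-point rounding while shrinking the activation outlier, which is what makes the activations easier to quantize.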
233243
244+
Weights-Only Quantization
245+
"""""""""""""""""""""""""
246+
247+
``compress_pt2e`` applies weight compression to a ``torch.fx.GraphModule``, targeting LLM deployment. The following activation-aware algorithms use a small calibration subset to capture activation statistics:
248+
249+
- `AWQ <https://arxiv.org/abs/2306.00978>`_ - Activation-aware Weight Quantization that finds per-channel scales to minimize quantization error based on activation distributions.
250+
- `Scale Estimation <https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py>`_ - Estimates scales to minimize the layer-wise output error for INT4 weight layers, iteratively refining the scales on a calibration subset.
251+
252+
.. code-block:: python
253+
254+
from nncf.experimental.torch.fx import compress_pt2e
255+
256+
calibration_loader = torch.utils.data.DataLoader(...)
257+
258+
def transform_fn(data_item):
259+
images, _ = data_item
260+
return images
261+
262+
calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
263+
compressed_model = compress_pt2e(
264+
exported_model, quantizer, calibration_dataset, awq=True, scale_estimation=True
265+
)
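
The intuition behind activation-aware scaling can be shown with a toy, pure-Python sketch: scaling up the weight of the input channel with the largest activation before rounding, and dividing that activation by the same factor, can reduce the output error. This is a simplified illustration under stated assumptions, not the algorithm as implemented in NNCF; ``fake_quantize``, ``output_error``, the candidate scales, and all values are hypothetical.

.. code-block:: python

    def fake_quantize(weights, bits=4):
        """Symmetric round-trip: quantize to signed `bits`-bit integers and back."""
        qmax = 2 ** (bits - 1) - 1
        absmax = max(abs(w) for w in weights)
        if absmax == 0.0:
            return list(weights)
        scale = absmax / qmax
        return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

    def output_error(weights, activations, channel_scales):
        """Absolute output error after quantizing the channel-scaled weights."""
        scaled_w = [w * s for w, s in zip(weights, channel_scales)]
        scaled_x = [x / s for x, s in zip(activations, channel_scales)]
        reference = sum(w * x for w, x in zip(weights, activations))
        approx = sum(w * x for w, x in zip(fake_quantize(scaled_w), scaled_x))
        return abs(reference - approx)

    # Toy data (hypothetical): channel 0 has a small weight but a large activation.
    weights = [0.5, 7.0]
    activations = [10.0, 0.1]
    salient = max(range(len(activations)), key=lambda j: abs(activations[j]))

    def scales_for(s):
        """Apply scale `s` to the salient channel only."""
        return [s if j == salient else 1.0 for j in range(len(weights))]

    # Small grid search over candidate scales, keeping the one with the lowest error.
    candidates = [1.0, 2.0, 4.0]
    baseline_error = output_error(weights, activations, scales_for(1.0))
    best_scale = min(candidates, key=lambda s: output_error(weights, activations, scales_for(s)))
    best_error = output_error(weights, activations, scales_for(best_scale))

In this toy case the unscaled weight ``0.5`` rounds to zero and the salient channel's contribution is lost, while a scale greater than one moves it onto a representable grid point.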
266+
267+
Data-free algorithms
268+
~~~~~~~~~~~~~~~~~~~~
269+
270+
When no calibration data is available, ``compress_pt2e`` can perform weight compression relying solely on the pretrained weights. Data-Free Compression uses only the weight tensor statistics, with no activations observed at any point. It can be combined with the AWQ and Mixed Precision algorithms when richer behavior is needed without giving up the no-dataset workflow.
271+
272+
.. code-block:: python
273+
274+
from nncf.experimental.torch.fx import compress_pt2e
275+
276+
compressed_model = compress_pt2e(exported_model, quantizer, awq=True, ratio=0.8)
277+
278+
Mixed Precision algorithms
279+
~~~~~~~~~~~~~~~~~~~~~~~~~~
280+
281+
Mixed Precision assigns different bit-widths (e.g. INT4 vs INT8) to individual layers based on their sensitivity, keeping more sensitive layers at higher precision while aggressively compressing the rest. NNCF supports several sensitivity-ranking criteria:
282+
283+
- **Weight Quantization Error** - Data-free metric that measures the per-layer error introduced by quantizing the weights themselves, requiring no calibration data.
284+
- **Hessian** - Activation-aware metric that uses second-order information about the loss to estimate how much the model output changes when a layer's weights are perturbed by quantization.
285+
- **Mean Variance** and **Max Variance** - Activation-aware metrics that rank layers by the mean or maximum variance of their input activations, on the intuition that layers with more spread-out activations are harder to quantize.
286+
- **Mean Magnitude** - Activation-aware metric that ranks layers by the average magnitude of their input activations.
287+
288+
.. code-block:: python
289+
from nncf import SensitivityMetric
290+
compressed_model = compress_pt2e(
291+
exported_model, quantizer, calibration_dataset, awq=True, scale_estimation=True, ratio=0.8, sensitivity_metric=SensitivityMetric.MAX_ACTIVATION_VARIANCE
292+
)
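
As a rough illustration of sensitivity-based bit-width assignment, the pure-Python sketch below ranks layers by a data-free weight quantization error and keeps the least sensitive ones at the lower precision. It is a simplified sketch and not NNCF's actual criterion or implementation: it assumes that ``ratio`` is the share of parameters placed in the lower bit-width, and ``quantization_error``, ``assign_precisions``, and the toy layers are hypothetical.

.. code-block:: python

    def quantize_dequantize(weights, bits):
        """Symmetric per-tensor round-trip to signed `bits`-bit integers and back."""
        qmax = 2 ** (bits - 1) - 1
        absmax = max(abs(w) for w in weights)
        if absmax == 0.0:
            return list(weights)
        scale = absmax / qmax
        return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

    def quantization_error(weights, bits):
        """Mean squared error introduced by quantizing the weights alone."""
        restored = quantize_dequantize(weights, bits)
        return sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)

    def assign_precisions(layers, ratio, low_bits=4, high_bits=8):
        """Give low_bits to the least sensitive layers until `ratio` of parameters is covered."""
        total = sum(len(w) for w in layers.values())
        ranked = sorted(layers, key=lambda name: quantization_error(layers[name], low_bits))
        plan, covered = {}, 0
        for name in ranked:
            if covered + len(layers[name]) <= ratio * total:
                plan[name] = low_bits
                covered += len(layers[name])
            else:
                plan[name] = high_bits
        return plan

    # Toy layers (hypothetical): "outlier" has one large weight that hurts INT4 accuracy.
    layers = {
        "smooth": [0.07, -0.14, 0.21, -0.07],
        "outlier": [0.1, 0.2, -0.3, 4.0],
    }
    plan = assign_precisions(layers, ratio=0.5)

Here the layer with the weight outlier has the larger INT4 error, so it stays at 8 bits while the other layer is compressed to 4 bits.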
234293
235-
For further details, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.quantize_pt2e>`_
236-
and `for some examples with llama and stable_diffusion checkout <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_. For `YoloV26 example with this API <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`
294+
Check out the `resnet <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_, `llama <https://github.com/pytorch/executorch/tree/main/examples/openvino/llama>`_, `stable diffusion <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`_ and `Yolo26 <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`_ examples that use this API.
237295

238296
Conclusion
239297
------------
