* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
For further details on `OpenVINOQuantizer` please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.OpenVINOQuantizer>`_.

-After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
+After we import the backend-specific Quantizer, we will prepare the model for post-training quantization/weights-only quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.

.. code-block:: python
@@ -209,31 +214,84 @@ The optimized model is using low-level kernels designed specifically for Intel C
This should significantly speed up inference time in comparison with the eager model.

4. Optional: Improve quantized model metrics
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-NNCF implements advanced quantization algorithms like `SmoothQuant <https://arxiv.org/abs/2211.10438>`_ and `BiasCorrection <https://arxiv.org/abs/1906.04721>`_ for static activation and weights quantization. For weights-only quantization, there are `AWQ https://arxiv.org/abs/2306.00978`_ and `Scale Estimation https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py`_ algorithms. These techniques help in improving the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
-These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API for static activation and weights or `compress_pt2e` for weights-only quantization:
+NNCF implements advanced quantization algorithms that help improve the metrics of a compressed model while minimizing the output discrepancies between the original and compressed models. These are accessed via the NNCF ``quantize_pt2e`` API for static activation and weights quantization, or ``compress_pt2e`` for weights-only quantization.
+
+Post Training Quantization
+""""""""""""""""""""""""""
+
+``quantize_pt2e`` can be applied on top of any ``torchao`` Quantizer to improve the accuracy of the quantized model. Key algorithms:
+
+- `SmoothQuant <https://arxiv.org/abs/2211.10438>`_ - Reduces activation quantization error by inserting smoothing scales before weighted layers, migrating quantization difficulty from hard-to-quantize activations onto the weights.
+- `BiasCorrection <https://arxiv.org/abs/1906.04721>`_ - Compares quantized and original layer outputs layer-by-layer and adjusts convolution biases to align them, compensating for the error introduced by quantization.
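
To make the SmoothQuant idea concrete, here is a minimal pure-Python sketch. It is not NNCF's implementation: it assumes a single linear layer with activations ``x`` and weights ``w``, a migration strength ``alpha`` of 0.5, and nonzero per-channel maxima. The helper names ``matmul`` and ``smooth`` are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Move quantization difficulty from activations to weights.

    x: (samples x in_features) activations, w: (in_features x out_features).
    Assumes every input channel has a nonzero max in both x and w.
    """
    # Per-input-channel maximum magnitude of activations and of weights.
    act_max = [max(abs(v) for v in col) for col in zip(*x)]
    w_max = [max(abs(v) for v in row) for row in w]
    # Smoothing scale per input channel.
    scales = [a ** alpha / m ** (1 - alpha) for a, m in zip(act_max, w_max)]
    # Divide activations and multiply weights by the same scale,
    # so the layer output x @ w is mathematically unchanged.
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_s, w_s, scales


# Toy data: input channel 1 carries activation outliers.
x = [[0.5, -40.0, 1.0], [-1.0, 60.0, 0.25]]
w = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.6]]
x_s, w_s, scales = smooth(x, w)

print("activation range before:", max(abs(v) for row in x for v in row))
print("activation range after: ", max(abs(v) for row in x_s for v in row))
```

The smoothed activations span a much narrower range, which is what makes them easier to quantize, while the product of smoothed activations and smoothed weights stays equal to the original layer output up to floating-point error.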

.. code-block:: python

    from nncf.experimental.torch.fx import quantize_pt2e
``compress_pt2e`` applies weight compression to a ``torch.fx.GraphModule``, targeting LLM deployment. The following activation-aware algorithms use a small calibration subset to capture activation statistics:
+
+- `AWQ <https://arxiv.org/abs/2306.00978>`_ - Activation-aware Weight Quantization that finds per-channel scales to minimize quantization error based on activation distributions.
+- `Scale Estimation <https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py>`_ - Estimates scales to minimize the layer-wise output error for INT4 weight layers, iteratively refining the scales on a calibration subset.
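
The common thread of both algorithms is choosing quantization scales by looking at the layer output on calibration data rather than at the weights alone. The sketch below illustrates that principle with a simple grid search over candidate scales for one INT4 weight row. It is a simplified stand-in and not the AWQ or Scale Estimation procedure from NNCF; the helpers ``dequant``, ``output_error`` and ``search_scale``, the candidate ratios, and the toy data are all assumptions made for this example.

```python
def dequant(w, scale, qmax=7):
    """Symmetric round-to-nearest INT4 quantization, mapped back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in w]


def output_error(w, w_hat, calib):
    """Squared error of the layer output (x . w) over calibration samples."""
    return sum(
        sum(x_i * (a - b) for x_i, a, b in zip(x, w, w_hat)) ** 2
        for x in calib
    )


def search_scale(w, calib, qmax=7, ratios=(1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7)):
    """Pick the candidate scale with the lowest output error on calib data."""
    base = max(abs(v) for v in w) / qmax  # plain max-based scale
    candidates = [base * r for r in ratios]
    return min(candidates, key=lambda s: output_error(w, dequant(w, s, qmax), calib))


# Toy weight row and calibration activations (hypothetical values).
w = [0.91, -0.33, 0.27, 0.08, -0.52]
calib = [[0.1, 2.0, -1.5, 0.3, 0.2], [0.0, -1.0, 2.5, 0.1, 0.4]]

base = max(abs(v) for v in w) / 7
best = search_scale(w, calib)
print("max-based scale error:", output_error(w, dequant(w, base), calib))
print("searched scale error: ", output_error(w, dequant(w, best), calib))
```

Because the plain max-based scale is one of the candidates, the searched scale can never do worse than it on the calibration samples.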
+
+.. code-block:: python
+
+    from nncf.experimental.torch.fx import compress_pt2e
When no calibration data is available, ``compress_pt2e`` can perform weight compression relying solely on the pretrained weights. Data-Free Compression uses only the weight tensor statistics, with no activations observed at any point. It can be combined with the AWQ and Mixed Precision algorithms when richer behavior is needed without giving up the no-dataset workflow.
+
+.. code-block:: python
+
+    from nncf.experimental.torch.fx import compress_pt2e
Mixed Precision assigns different bit-widths (e.g. INT4 vs INT8) to individual layers based on their sensitivity, keeping more sensitive layers at higher precision while aggressively compressing the rest. NNCF supports several sensitivity-ranking criteria:
+
+- **Weight Quantization Error** - Data-free metric that measures the per-layer error introduced by quantizing the weights themselves, requiring no calibration data.
+- **Hessian** - Activation-aware metric that uses second-order information about the loss to estimate how much the model output changes when a layer's weights are perturbed by quantization.
+- **Mean Variance** and **Max Variance** - Activation-aware metrics that rank layers by the mean or maximum variance of their input activations, on the intuition that layers with more spread-out activations are harder to quantize.
+- **Mean Magnitude** - Activation-aware metric that ranks layers by the average magnitude of their input activations.
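
As a rough illustration of sensitivity-based bit-width assignment, the sketch below ranks toy layers by a data-free weight quantization error and keeps the most sensitive one at INT8. It is a simplified stand-in under stated assumptions, not NNCF's metric or API: the error is a plain per-tensor mean squared error under symmetric INT4 rounding, and ``quantize_dequantize``, ``quant_error``, ``assign_bits`` and the layer names are hypothetical.

```python
def quantize_dequantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def quant_error(weights, bits=4):
    """Mean squared error between the weights and their quantized version."""
    deq = quantize_dequantize(weights, bits)
    return sum((w - d) ** 2 for w, d in zip(weights, deq)) / len(weights)


def assign_bits(layers, n_int8):
    """Keep the n_int8 most sensitive layers at INT8, compress the rest to INT4."""
    ranked = sorted(layers, key=lambda name: quant_error(layers[name]), reverse=True)
    return {name: (8 if i < n_int8 else 4) for i, name in enumerate(ranked)}


# Toy flattened weight tensors (hypothetical values).
layers = {
    "attn": [0.9, -0.05, 0.02, 0.01],  # one outlier dominates the scale
    "mlp": [0.7, -0.7, 0.4, -0.3],     # lands almost exactly on the INT4 grid
    "head": [0.5, 0.26, -0.12, 0.03],
}

print(assign_bits(layers, n_int8=1))
```

The layer whose small weights are wiped out by an outlier-driven scale has the largest error and therefore stays at the higher precision.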
For further details, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.quantize_pt2e>`_
-and `for some examples with llama and stable_diffusion checkout <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_. For `YoloV26 example with this API <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`
+Check out the `resnet <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_, `llama <https://github.com/pytorch/executorch/tree/main/examples/openvino/llama>`_, `stable diffusion <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`_ and `Yolo26 <https://github.com/pytorch/executorch/tree/main/examples/models/yolo26>`_ examples that use this API.