PyTorch 2 Export Quantization for OpenVINO torch.compile Backend

Authors: Daniil Lyakhov, Aamir Nazir, Alexander Suslov, Yamini Nimmagadda, Alexander Kozlov

Prerequisites

Introduction

Note

This is an experimental feature; the quantization API is subject to change.

This tutorial demonstrates how to use OpenVINOQuantizer from ExecuTorch in the PyTorch 2 Export Quantization flow to generate a quantized model customized for the OpenVINO torch.compile backend, and explains how to lower the quantized model into the OpenVINO representation. OpenVINOQuantizer places quantizers specifically for OpenVINO, which lets the quantized model take full advantage of low-precision OpenVINO kernels.

The PyTorch 2 export quantization flow uses torch.export to capture the model into a graph and performs quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage, improved flexibility, and a simplified UX. The OpenVINO backend compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

The quantization flow mainly includes four steps:

Step 1: Capture the FX Graph from the eager Model based on the torch export mechanism.
Step 2: Apply the PyTorch 2 Export Quantization flow with OpenVINOQuantizer based on the captured FX Graph.
Step 3: Lower the quantized model into OpenVINO representation with the torch.compile API.
Optional step 4: Improve quantized model metrics via the quantize_pt2e method.

The high-level architecture of this flow looks like this:

float_model(Python)                          Example Input
    \                                              /
     \                                            /
---------------------------------------------------------
|                         export                        |
---------------------------------------------------------
                            |
                    FX Graph in ATen
                            |
                            |           OpenVINOQuantizer
                            |                 /
---------------------------------------------------------
|                      prepare_pt2e                     |
|                           |                           |
|                       Calibrate                       |
|                           |                           |
|                      convert_pt2e                     |
---------------------------------------------------------
                            |
                     Quantized Model
                            |
---------------------------------------------------------
|                  Lower into OpenVINO                  |
---------------------------------------------------------
                            |
                      OpenVINO model

Post Training Quantization

We will now walk through post-training quantization step by step, using the torchvision resnet18 model.

Prerequisite: OpenVINO and NNCF installation

OpenVINO and NNCF can be installed with pip:

pip install -U pip
pip install openvino nncf

1. Capture FX Graph

We start by performing the necessary imports and capturing the FX Graph from the eager module.

import copy
import openvino.torch
import torch
import torchvision.models as models
from torch.ao.quantization.quantize_pt2e import convert_pt2e
from torch.ao.quantization.quantize_pt2e import prepare_pt2e

import nncf.torch

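# Create the eager model with pretrained ImageNet weights
# (the `weights` argument replaces the deprecated `pretrained=True`)
model_name = "resnet18"
model = models.__dict__[model_name](weights="DEFAULT")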

# Set the model to eval mode
model = model.eval()

# Create the data, using the dummy data here as an example
traced_bs = 50
x = torch.randn(traced_bs, 3, 224, 224)
example_inputs = (x,)

# Capture the FX Graph to be quantized
with torch.no_grad(), nncf.torch.disable_patching():
    exported_model = torch.export.export(model, example_inputs).module()

2. Apply Quantization

After capturing the FX module to be quantized, we import the OpenVINOQuantizer.

from executorch.backends.openvino.quantizer import OpenVINOQuantizer
from executorch.backends.openvino.quantizer import QuantizationMode

quantizer = OpenVINOQuantizer()

OpenVINOQuantizer has several optional parameters that allow tuning the quantization process to get a more accurate model. Below is the list of essential parameters and their descriptions:

mode - defines the quantization scheme for the model. Multiple modes are supported:
- INT8_SYM (default) - symmetric quantization of both weights and activations. This is the best mode for performance.
- INT8_MIXED - weights are quantized symmetrically and activations are quantized asymmetrically. This mode is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.
- INT8_TRANSFORMER - a special quantization scheme that preserves accuracy after quantization of Transformer models (BERT, Llama, etc.).
- INT8WO_SYM, INT8WO_ASYM, INT4WO_SYM, INT4WO_ASYM - weights-only quantization schemes. They apply simple min-max quantization of the model weights to INT8/INT4 with symmetric or asymmetric schemes.
```
OpenVINOQuantizer(mode=QuantizationMode.INT8_SYM)
```
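To make the difference between the symmetric and asymmetric schemes above more concrete, here is a minimal pure-Python sketch of the textbook formulas. It is only an illustration: the helper names (`symmetric_params`, `asymmetric_params`, `fake_quantize`) and the sample activations are hypothetical, and the code does not reproduce how OpenVINO or NNCF compute quantization parameters internally.

```python
def symmetric_params(values, num_bits=8):
    """Scale for symmetric quantization; the zero point is fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, 0


def asymmetric_params(values, num_bits=8):
    """Scale and zero point for asymmetric quantization onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1             # 0..255 for 8 bits
    lo = min(min(values), 0.0)                    # the range must contain 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def fake_quantize(values, scale, zero_point, qmin, qmax):
    """Quantize, clamp, and dequantize each value."""
    result = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(qmin, min(qmax, q))
        result.append((q - zero_point) * scale)
    return result


# Hypothetical activations with a short negative tail (GELU-like shape).
activations = [-0.17, 0.0, 0.5, 1.3, 4.0]

s_scale, s_zp = symmetric_params(activations)
a_scale, a_zp = asymmetric_params(activations)

sym = fake_quantize(activations, s_scale, s_zp, -128, 127)
asym = fake_quantize(activations, a_scale, a_zp, 0, 255)

sym_err = max(abs(x - y) for x, y in zip(activations, sym))
asym_err = max(abs(x - y) for x, y in zip(activations, asym))
print(f"symmetric:  scale={s_scale:.5f} zero_point={s_zp} max_error={sym_err:.5f}")
print(f"asymmetric: scale={a_scale:.5f} zero_point={a_zp} max_error={asym_err:.5f}")
```

For this skewed input the asymmetric scheme spends all 256 levels on the observed range and ends up with a smaller worst-case error, which is the intuition behind recommending INT8_MIXED for non-ReLU activations. The symmetric scheme needs no zero point, which is typically cheaper to execute.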

ignored_scope - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy, for example, to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

# Exclude by layer name:
names = ['layer_1', 'layer_2', 'layer_3']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

# Exclude by layer type:
types = ['Conv2d', 'Linear']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

# Exclude by regular expression (patterns takes a list of regex strings):
regex = '.*layer_.*'
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=[regex]))

# Exclude by subgraphs:
# In this case, all nodes along all simple paths in the graph
# from input to output nodes will be excluded from the quantization process.
subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))
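The name, type, and pattern rules above can be pictured with a small stand-alone sketch. The `is_ignored` helper and the node list are hypothetical and exist only for illustration; in particular, matching patterns with `re.match` is an assumption of this sketch and not a statement about NNCF's actual matching logic.

```python
import re


def is_ignored(node_name, node_type, names=(), types=(), patterns=()):
    """Return True if a node is selected by any of the three rule kinds."""
    if node_name in names:
        return True
    if node_type in types:
        return True
    # Assumption of this sketch: a pattern is matched from the start of the name.
    return any(re.match(p, node_name) for p in patterns)


# Hypothetical (name, type) pairs standing in for graph nodes.
nodes = [
    ("layer_1", "Conv2d"),
    ("layer_2", "ReLU"),
    ("head", "Linear"),
    ("pool", "AvgPool2d"),
]

by_name = [n for n, t in nodes if is_ignored(n, t, names=["layer_1"])]
by_type = [n for n, t in nodes if is_ignored(n, t, types=["Linear"])]
by_pattern = [n for n, t in nodes if is_ignored(n, t, patterns=[".*layer_.*"])]

print(by_name)     # ['layer_1']
print(by_type)     # ['head']
print(by_pattern)  # ['layer_1', 'layer_2']
```

Real node names in an exported graph are generated by torch.export, so it is worth printing the graph nodes first to see which names and types are actually available before writing an ignored scope.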

For further details on OpenVINOQuantizer please see the documentation.

After importing the backend-specific quantizer, we prepare the model for post-training quantization. prepare_pt2e folds BatchNorm operators into the preceding Conv2d operators and inserts observers at appropriate places in the model.

prepared_model = prepare_pt2e(exported_model, quantizer)

Next, we calibrate prepared_model now that the observers are inserted.

# We use the dummy data as an example here
prepared_model(*example_inputs)
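As a rough mental model of what calibration collects, the sketch below implements a toy min-max observer in plain Python. The `ToyMinMaxObserver` class and the sample batches are hypothetical; the observers that prepare_pt2e actually inserts are more elaborate, so treat this as an illustration of the idea only.

```python
class ToyMinMaxObserver:
    """Tracks the running minimum and maximum of every value it is shown."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        # Widen the recorded range so that it covers the new batch.
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def symmetric_scale(self, qmax=127):
        # Map the largest observed magnitude to the largest integer level.
        max_abs = max(abs(self.min_val), abs(self.max_val))
        return max_abs / qmax


observer = ToyMinMaxObserver()

# Hypothetical activation values seen during two calibration batches.
calibration_batches = [[-0.5, 0.2, 1.0], [0.1, 2.54, -1.27]]
for batch in calibration_batches:
    observer.observe(batch)

print(observer.min_val, observer.max_val, observer.symmetric_scale())
```

Because the resulting scale depends entirely on the values seen during calibration, random dummy data is fine for a smoke test, but a representative sample of real inputs is generally needed for good accuracy.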

Finally, we convert the calibrated model to a quantized model. convert_pt2e takes a calibrated model and produces a quantized model.

quantized_model = convert_pt2e(prepared_model, fold_quantize=False)

After these steps, the quantization flow is complete and quantized_model holds the quantized model.

3. Lower into OpenVINO representation

After that, the FX Graph can take advantage of OpenVINO optimizations through torch.compile(..., backend="openvino").

with torch.no_grad(), nncf.torch.disable_patching():
    optimized_model = torch.compile(quantized_model, backend="openvino")

    # Run the compiled model; the first call triggers compilation
    optimized_model(*example_inputs)

The optimized model uses low-level kernels designed specifically for Intel CPUs, which is expected to significantly speed up inference compared with the eager model.
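To check the speed-up on your own hardware, a small timing helper is usually enough. The sketch below is generic Python: the `benchmark` function, its `warmup` and `iters` parameters, and the `dummy_model` stand-in are all hypothetical names introduced for illustration.

```python
import time


def benchmark(fn, *args, warmup=3, iters=10):
    """Return the mean wall-clock time in seconds of one call to fn(*args)."""
    # Warm-up calls are excluded from timing; for a compiled model the
    # first call also includes the compilation itself.
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters


# Stand-in callable so that the sketch runs without any model.
calls = []


def dummy_model(x):
    calls.append(x)
    return x * 2


mean_seconds = benchmark(dummy_model, 21)
print(f"mean latency: {mean_seconds * 1e3:.4f} ms over 10 iterations")
```

With the objects from this tutorial, the same helper could be called as `benchmark(optimized_model, *example_inputs)` and `benchmark(model, *example_inputs)` inside a `torch.no_grad()` block to compare the compiled and eager models.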

4. Optional: Improve quantized model metrics

NNCF implements advanced quantization algorithms like SmoothQuant and BiasCorrection, which help to improve the quantized model metrics while minimizing the output discrepancies between the original and compressed models. These advanced NNCF algorithms can be accessed via the NNCF quantize_pt2e API:

from nncf.experimental.torch.fx import quantize_pt2e

calibration_loader = torch.utils.data.DataLoader(...)


def transform_fn(data_item):
    images, _ = data_item
    return images


calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
quantized_model = quantize_pt2e(
    exported_model, quantizer, calibration_dataset, smooth_quant=True, fast_bias_correction=False
)

For further details, please see the documentation and a complete example on Resnet18 quantization.
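One simple way to judge whether the advanced algorithms helped is to compare the outputs of the original and quantized models on the same inputs. The sketch below computes two such measures over plain Python lists; the function names and the logits are made up for illustration, and in practice the rows would come from running both models on a validation batch.

```python
def argmax(row):
    """Index of the largest element in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(reference_logits, quantized_logits):
    """Fraction of samples for which both models predict the same class."""
    matches = sum(
        argmax(ref) == argmax(quant)
        for ref, quant in zip(reference_logits, quantized_logits)
    )
    return matches / len(reference_logits)


def max_abs_diff(reference_logits, quantized_logits):
    """Largest element-wise absolute difference between the two outputs."""
    return max(
        abs(r - q)
        for ref, quant in zip(reference_logits, quantized_logits)
        for r, q in zip(ref, quant)
    )


# Made-up logits for three samples and three classes.
reference = [[2.0, 0.1, -1.0], [0.3, 0.9, 0.2], [0.5, 0.4, 0.6]]
quantized = [[1.9, 0.2, -0.9], [0.35, 0.8, 0.25], [0.62, 0.4, 0.58]]

agreement = top1_agreement(reference, quantized)
discrepancy = max_abs_diff(reference, quantized)
print(f"top-1 agreement: {agreement:.2f}, max abs diff: {discrepancy:.2f}")
```

Agreement with the original model is only a proxy; the task metric on a labeled validation set remains the more reliable measure of quantization quality.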

Conclusion

This tutorial introduces how to use torch.compile with the OpenVINO backend and the OpenVINO quantizer. For more details on NNCF and the NNCF Quantization Flow for PyTorch models, refer to the NNCF Quantization Guide. For additional information, check out the OpenVINO Deployment via torch.compile Documentation.
