docsrc/dynamo/torch_compile.rst
Lines changed: 56 additions & 3 deletions
@@ -46,7 +46,20 @@ Custom Setting Usage
 "optimization_level": 4,
 "use_python_runtime": False,})
 
-.. note:: Quantization/INT8 support is slated for a future release; currently, we support FP16 and FP32 precision layers.
+.. note:: Torch-TensorRT supports FP32, FP16, and INT8 precision layers. For INT8 quantization, use the TensorRT Model Optimizer (modelopt) for post-training quantization (PTQ). See :ref:`vgg16_ptq` for an example.
+
+Advanced Precision Control
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For fine-grained control over mixed precision execution, TensorRT 10.12+ provides additional settings:
+
+* ``use_explicit_typing``: Enable explicit type specification (required for TensorRT 10.12+)
+* ``enable_autocast``: Enable rule-based autocast for automatic precision selection
+* ``autocast_low_precision_type``: Target precision for autocast (e.g., ``torch.float16``)
+* ``autocast_excluded_nodes``: Specific nodes to exclude from autocast
+* ``autocast_excluded_ops``: Operation types to exclude from autocast
+
+For detailed information and examples, see :ref:`mixed_precision`.
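As a minimal sketch, the settings listed above might be gathered into the ``options`` dictionary passed to ``torch.compile``. Only the dictionary keys come from the list above. The helper name ``autocast_options``, the value types (sets of strings), and the example op name are assumptions made for illustration, so check :ref:`mixed_precision` for the accepted values.

```python
def autocast_options(low_precision_type, excluded_nodes=(), excluded_ops=()):
    """Collect the precision-control settings into one options dict.

    Hypothetical helper: only the keys come from the settings list;
    the value types chosen here are assumptions.
    """
    return {
        "use_explicit_typing": True,
        "enable_autocast": True,
        "autocast_low_precision_type": low_precision_type,
        "autocast_excluded_nodes": set(excluded_nodes),
        "autocast_excluded_ops": set(excluded_ops),
    }


# "float16" is a placeholder so the sketch runs without PyTorch;
# with PyTorch installed, pass torch.float16 instead.
opts = autocast_options("float16", excluded_ops=["softmax"])
for key in sorted(opts):
    print(key, "=", opts[key])

# Intended use (assumes torch and torch_tensorrt are installed):
#   torch.compile(model, backend="torch_tensorrt", options=opts)
```

Keeping the settings in one helper makes it easier to see which precision options travel together.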
 
 Compilation
 -----------------
@@ -98,14 +111,54 @@ Compilation can also be helpful in demonstrating graph breaks and the feasibilit
+Engine caching can significantly reduce recompilation times by saving built TensorRT engines to disk and reusing them when possible. This is particularly useful for JIT workflows where graphs may be invalidated and recompiled. When enabled, engines are saved with a hash of their corresponding PyTorch subgraph and can be reloaded in subsequent compilations, even across different Python sessions.
+
+To enable engine caching, use the ``cache_built_engines`` and ``reuse_cached_engines`` options:
+.. note:: To use engine caching, ``immutable_weights`` must be set to ``False`` to allow engine refitting. When a cached engine is loaded, weights are refitted rather than rebuilding the entire engine, which can reduce compilation times by orders of magnitude.
+
+For more details and examples, see :ref:`engine_caching_example`.
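The caching options and the ``immutable_weights`` requirement described above can be sketched as a small consistency check. The three option names come from the text. The helper ``check_engine_cache_options`` is hypothetical and not part of Torch-TensorRT, and it deliberately insists on an explicit ``False`` rather than assuming any library default.

```python
def check_engine_cache_options(options):
    """Return options unchanged if engine caching is configured consistently.

    Hypothetical helper for illustration; not part of Torch-TensorRT.
    """
    uses_cache = bool(
        options.get("cache_built_engines") or options.get("reuse_cached_engines")
    )
    # Cached engines are refitted with the current weights, so caching
    # needs immutable_weights=False. This sketch requires it explicitly.
    if uses_cache and options.get("immutable_weights") is not False:
        raise ValueError("engine caching requires immutable_weights=False")
    return options


options = check_engine_cache_options(
    {
        "cache_built_engines": True,   # save built engines to disk
        "reuse_cached_engines": True,  # reload engines on a hash match
        "immutable_weights": False,    # allow refitting
    }
)
print(options)

# Intended use (assumes torch and torch_tensorrt are installed):
#   torch.compile(model, backend="torch_tensorrt", options=options)
```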
+
 Dynamic Shape Support
 ---------------------
 
-The Torch-TensorRT `torch.compile` backend will currently require recompilation for each new batch size encountered, and it is preferred to use the `dynamic=False` argument when compiling with this backend. Full dynamic shape support is planned for a future release.
+The Torch-TensorRT ``torch.compile`` backend now supports dynamic shapes, allowing models to handle varying input dimensions without recompilation. You can specify dynamic dimensions using the ``torch._dynamo.mark_dynamic`` API:
+Without dynamic shapes, the model will recompile for each new input shape encountered. For more control over dynamic shapes, consider using the AOT compilation path with ``torch_tensorrt.compile`` as described in :ref:`dynamic_shapes`. For a complete tutorial on dynamic shape compilation, see :ref:`compile_with_dynamic_inputs`.
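A minimal sketch of marking the batch dimension as dynamic with ``torch._dynamo.mark_dynamic`` follows. The helper ``mark_batch_dynamic`` and its ``mark`` parameter are hypothetical. The parameter exists only so the sketch can be exercised without PyTorch installed, and by default the real ``torch._dynamo.mark_dynamic`` is used. The batch range and input shape in the usage comment are illustrative assumptions.

```python
def mark_batch_dynamic(tensors, min_batch, max_batch, mark=None):
    """Mark dimension 0 of each tensor as dynamic in [min_batch, max_batch].

    ``mark`` defaults to ``torch._dynamo.mark_dynamic``; it is injectable
    only so this sketch can run without PyTorch.
    """
    if mark is None:
        import torch._dynamo  # imported lazily; requires PyTorch

        mark = torch._dynamo.mark_dynamic
    for tensor in tensors:
        mark(tensor, 0, min=min_batch, max=max_batch)
    return tensors


# Intended use (assumes torch, torch_tensorrt, and a CUDA device):
#   x = torch.randn(8, 3, 224, 224).cuda()
#   mark_batch_dynamic([x], min_batch=2, max_batch=16)
#   compiled = torch.compile(model, backend="torch_tensorrt")
#   compiled(x)
```

Marking must happen before the first call to the compiled model, since the dynamic dimensions are read when the graph is traced.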
 
 Recompilation Conditions
 ------------------------
 
-Once the model has been compiled, subsequent inference inputs with the same shape and data type, which traverse the graph in the same way will not require recompilation. Furthermore, each new recompilation will be cached for the duration of the Python session. For instance, if inputs of batch size 4 and 8 are provided to the model, causing two recompilations, no further recompilation would be necessary for future inputs with those batch sizes during inference within the same session. Support for engine cache serialization is planned for a future release.
+Once the model has been compiled, subsequent inference inputs with the same shape and data type that traverse the graph in the same way will not require recompilation. Furthermore, each new recompilation will be cached for the duration of the Python session. For instance, if inputs of batch size 4 and 8 are provided to the model, causing two recompilations, no further recompilation would be necessary for future inputs with those batch sizes during inference within the same session.
+
+To persist engine caches across Python sessions, use the ``cache_built_engines`` and ``reuse_cached_engines`` options as described in the Engine Caching section above.
 
 Recompilation is generally triggered by one of two events: encountering inputs of different sizes or inputs that traverse the model code differently. The latter scenario can occur when the model code includes conditional logic, complex loops, or data-dependent shapes. ``torch.compile`` handles guarding in both of these scenarios and determines when recompilation is necessary.
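The batch-size example above can be illustrated with a toy model of a session-level cache. This is a conceptual sketch only: ``ToyCompileCache`` is a hypothetical class, and ``torch.compile`` decides on recompilation through its guard mechanism rather than a dictionary like this one.

```python
class ToyCompileCache:
    """Conceptual per-session cache keyed on input shape and dtype.

    Illustration only; torch.compile uses guards, not this class.
    """

    def __init__(self):
        self._cache = {}
        self.compile_count = 0

    def run(self, shape, dtype):
        key = (tuple(shape), dtype)
        if key not in self._cache:
            # First time this shape/dtype is seen: "compile" and remember it.
            self.compile_count += 1
            self._cache[key] = f"engine#{self.compile_count}"
        return self._cache[key]


cache = ToyCompileCache()
for batch in (4, 8, 4, 8, 4):
    cache.run((batch, 3, 224, 224), "float32")
print(cache.compile_count)  # 2: one compilation per distinct batch size
```

Inputs with batch sizes 4 and 8 each trigger one compilation. Every later call with those shapes reuses the cached entry for the rest of the session.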
docsrc/ts/ptq.rst
Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@
 Post Training Quantization (PTQ)
 =================================
 
+.. warning::
+    This guide describes the legacy PTQ workflow for the TorchScript frontend. **For new projects, use the TensorRT Model Optimizer (modelopt) with the Dynamo frontend instead.** See :ref:`vgg16_ptq` for the recommended approach.
+
 Post Training Quantization (PTQ) is a technique to reduce the required computational resources for inference
 while still preserving the accuracy of your model by mapping the traditional FP32 activation space to a reduced
 INT8 space. TensorRT uses a calibration step which executes your model with sample data from the target domain