Revert non-essential modifications to core torch_tensorrt files,
keeping only what TTA strictly requires:
_compile.py:
- Restore module_type == _ModuleType.ep branch (preserve EP input handling)
- Restore load() with extra_files/kwargs support
- Restore save() with all original params (extra_files, use_legacy_exporter,
dynamic_shapes, Input type annotations, full docstring)
- Restore original imports (inspect, Dict/Tuple, default_device, etc.)
- Keep only the post-trace hook loop as the TTA addition
_defaults.py / _settings.py:
- Remove editable_timing_cache, error_on_timing_cache_miss (autotune, out of scope)
- Restore DECOMPOSE_ATTENTION and decompose_attention field
- Restore cpu_memory_budget: Optional[int]
- Keep profiling_verbosity (needed for ILayer.metadata inspection)
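The field changes above can be summarized as a dataclass sketch. Field names mirror the commit message; the class name, defaults, and the string type for `profiling_verbosity` are assumptions for illustration, not the real `CompilationSettings` definition.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default; in the real code this would come from _defaults.py
# as DECOMPOSE_ATTENTION.
DECOMPOSE_ATTENTION = False


@dataclass
class CompilationSettingsSketch:
    # Restored: toggles decomposition of attention ops into smaller ops.
    decompose_attention: bool = DECOMPOSE_ATTENTION
    # Restored as Optional[int]: None means no CPU memory budget.
    cpu_memory_budget: Optional[int] = None
    # Kept: needed so ILayer.metadata is populated for inspection
    # ("detailed" is an assumed default, not taken from the source).
    profiling_verbosity: str = "detailed"
```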
_TRTInterpreter.py:
- Remove algorithm_selector parameter (autotune, out of scope)
- Remove _mark_debug_candidates / mark_debug logic (debug feature, out of scope)
- Remove editable_timing_cache / error_on_timing_cache_miss flag handling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
py/torch_tensorrt/dynamo/_settings.py (+5 -7 lines)
@@ -20,17 +20,16 @@
     DLA_GLOBAL_DRAM_SIZE,
     DLA_LOCAL_DRAM_SIZE,
     DLA_SRAM_SIZE,
+    DECOMPOSE_ATTENTION,
     DYNAMICALLY_ALLOCATE_RESOURCES,
     DRYRUN,
-    EDITABLE_TIMING_CACHE,
     ENABLE_AUTOCAST,
     ENABLE_CROSS_COMPILE_FOR_WINDOWS,
     ENABLE_EXPERIMENTAL_DECOMPOSITIONS,
     ENABLE_RESOURCE_PARTITIONING,
     ENABLE_WEIGHT_STREAMING,
     ENABLED_PRECISIONS,
     ENGINE_CAPABILITY,
-    ERROR_ON_TIMING_CACHE_MISS,
     HARDWARE_COMPATIBLE,
     IMMUTABLE_WEIGHTS,
     L2_LIMIT_FOR_TILING,
@@ -110,8 +109,6 @@ class CompilationSettings:
         True will enable cross-platform compatibility which allows the engine to be built on Linux and run on Windows
         tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
         l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
-        editable_timing_cache (bool): Allow TensorRT to write new timing measurements into the timing cache during build (TRT 10.8+). Enable this on the first run so the cache is fully populated; subsequent runs can then load the cache and reproduce the same tactic selection. Default: False.
-        error_on_timing_cache_miss (bool): Raise a build error if any tactic's timing is not found in the loaded timing cache (TRT 10.8+). Use in combination with a pre-populated ``timing_cache_path`` to guarantee that no re-profiling occurs and tactic selection is identical to the seed run, producing bitwise-identical engines. Default: False.
         use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
         enable_autocast (bool): Whether to enable autocast. If enabled, use_explicit_typing will be set to True.
         autocast_low_precision_type (Optional[Union[torch.dtype, dtype]]): The precision to reduce to. We currently support torch.float16 and torch.bfloat16. Default is None, which means no low precision is used.
@@ -122,6 +119,7 @@ class CompilationSettings:
         autocast_calibration_dataloader (Optional[torch.utils.data.DataLoader]): The dataloader to use for autocast calibration. Default is None.
         offload_module_to_cpu (bool): Offload the model to CPU to reduce memory footprint during compilation
         dynamically_allocate_resources (bool): Dynamically allocate resources for TensorRT engines
+        decompose_attention (bool): Whether to decompose attention layers. We have converters for handling attention ops, but if you want to decompose them into smaller ops, you can set this to True.