Commit 8e81fe0

resolve comments
Signed-off-by: Will Guo <willg@nvidia.com>
1 parent a857f47 commit 8e81fe0

2 files changed

Lines changed: 1 addition & 1 deletion

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ NVIDIA Model Optimizer Changelog
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
 - ``pass_through_bwd`` in the quantization config is now default to True. Please set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
+- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
 
 **Misc**
 

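Note on the ``compute_quantization_mse`` entry in the context lines above: the changelog describes per-quantizer mean-squared error with wildcard and callable filtering, but this commit does not show the API's signature. The sketch below only illustrates that idea in plain Python and is not the modelopt implementation. The names ``fake_quantize``, ``per_quantizer_mse``, and ``name_filter`` are hypothetical, and the symmetric round-to-nearest INT8 scheme and the quantizer names in the usage note are assumptions chosen for simplicity.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Union

# A filter is either a shell-style wildcard string or a predicate on the name.
NameFilter = Union[str, Callable[[str], bool]]


def fake_quantize(values: List[float], num_bits: int = 8) -> List[float]:
    """Symmetric round-to-nearest quantize/dequantize (assumed scheme)."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)  # all zeros: nothing to quantize
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    return [round(v / scale) * scale for v in values]


def per_quantizer_mse(
    tensors: Dict[str, List[float]],
    name_filter: NameFilter = "*",
    num_bits: int = 8,
) -> Dict[str, float]:
    """Return the MSE between each selected non-empty tensor and its quantized copy."""

    def selected(name: str) -> bool:
        if callable(name_filter):
            return bool(name_filter(name))
        return fnmatchcase(name, name_filter)

    result: Dict[str, float] = {}
    for name, values in tensors.items():
        if not selected(name):
            continue
        quantized = fake_quantize(values, num_bits)
        squared_error = sum((a - b) ** 2 for a, b in zip(values, quantized))
        result[name] = squared_error / len(values)
    return result
```

With a dictionary keyed by hypothetical names such as ``layers.0.mlp.weight_quantizer``, passing ``"*mlp*"`` selects by wildcard, while passing ``lambda name: "attn" in name`` selects by predicate.
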
pyproject.toml

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ dependencies = [
 [project.optional-dependencies]
 onnx = [
 "cppimport",
-"cuda-python",
 "cupy-cuda12x; platform_machine != 'aarch64' and platform_system != 'Darwin'",
 "lief",
 "ml_dtypes",