Why introduce the design of an `output_quantizer` within quantization operators?

I'm puzzled by one thing: why does modelopt add an `output_quantizer` to the quantized modules (`QuantInputBase`, `QuantLinearConvBase`) and then disable it? (code from [here](https://github.com/NVIDIA/Model-Optimizer/blob/b02e8885509c53b4e187f9fd5f56c5497e937d7e/modelopt/torch/quantization/nn/modules/quant_module.py#L205))

I can think of two places that might explain it:
1. KV cache quantization:
* `run_auto_quantize.py` sets `"*output_quantizer"` to enabled (code from [here](https://github.com/NVIDIA/Model-Optimizer/blob/b02e8885509c53b4e187f9fd5f56c5497e937d7e/examples/llm_autodeploy/run_auto_quantize.py#L117)).
* `quantization/algorithms.py` sets `"*output_quantizer"` to false to disable KV cache quantization (code from [here](https://github.com/NVIDIA/Model-Optimizer/blob/b02e8885509c53b4e187f9fd5f56c5497e937d7e/modelopt/torch/quantization/algorithms.py#L132)).

However, it now appears that KV cache quantization is configured via `"*[kv]_bmm_quantizer"` instead.
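
To make the difference between the two patterns concrete, here is a minimal sketch of which quantizers I would expect each wildcard to select. It assumes the config keys are matched glob-style against quantizer names (modeled here with Python's `fnmatchcase`, which may not be exactly what modelopt does internally), and the module names are hypothetical rather than taken from a real model.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names, for illustration only.
names = [
    "model.layers.0.self_attn.q_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.layers.0.self_attn.q_proj.output_quantizer",
    "model.layers.0.self_attn.k_bmm_quantizer",
    "model.layers.0.self_attn.v_bmm_quantizer",
    "model.layers.0.self_attn.q_bmm_quantizer",
]


def matched(pattern, candidates):
    """Return the candidates selected by a glob-style pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


# "*output_quantizer" selects only the per-module output quantizers.
output_matches = matched("*output_quantizer", names)

# "*[kv]_bmm_quantizer" selects the K and V BMM quantizers, but not Q,
# because "[kv]" is a character class that matches a single "k" or "v".
kv_matches = matched("*[kv]_bmm_quantizer", names)

print(output_matches)
print(kv_matches)
```

Under that assumption the two patterns select disjoint sets of quantizers, which is why I am unsure whether `output_quantizer` still plays any role in KV cache quantization.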

2. LayerNorm output: `_FP8_MHA_OVERRIDE` sets `"*output_quantizer"` to enabled (code from [here](https://github.com/NVIDIA/Model-Optimizer/blob/b02e8885509c53b4e187f9fd5f56c5497e937d7e/examples/torch_onnx/torch_quant_to_onnx.py#L94)). I am unclear on what "fuse the shared Q/DQ across all downstream Q/K/V/FC consumers" implies. Does it mean that the Q, K, V, and FC inputs are all expected to be FP8-quantized?

Are there other reasons behind the design of `output_quantizer` that I am unaware of? Looking forward to your reply. Thanks in advance!

