I'm puzzled by a specific design question: why does modelopt define an output_quantizer within QuantOP (QuantInputBase, QuantLinearConvBase) and then disable it? (code from here)
I can think of two places that might explain it:
- For KV cache quantization:
  - run_auto_quantize.py enables "*output_quantizer". (code from here)
  - quantization/algorithms.py sets "*output_quantizer" to false to disable KV cache quantization. (code from here)

  However, KV cache quantization now appears to be configured via "*[kv]_bmm_quantizer" instead.
- For LayerNorm output, _FP8_MHA_OVERRIDE enables "*output_quantizer". (code from here) I am unclear on what "fuse the shared Q/DQ across all downstream Q/K/V/FC consumers" implies. Does it mean that the Q, K, V, and FC inputs are all expected to be FP8-quantized?
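
To make the wildcard keys above concrete, here is a minimal sketch of how I understand them to select quantizers. It assumes the keys are matched against quantizer names as shell-style globs; I have not confirmed that modelopt resolves them this way, and the module names below are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names, invented for illustration only.
names = [
    "model.layers.0.self_attn.q_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.layers.0.self_attn.q_proj.output_quantizer",
    "model.layers.0.self_attn.k_bmm_quantizer",
    "model.layers.0.self_attn.v_bmm_quantizer",
    "model.layers.0.self_attn.q_bmm_quantizer",
]


def matched(pattern, candidates):
    """Return the names selected by a glob-style pattern (assumed semantics)."""
    return [n for n in candidates if fnmatchcase(n, pattern)]


# "*output_quantizer" selects every output quantizer, regardless of layer.
output_hits = matched("*output_quantizer", names)

# "*[kv]_bmm_quantizer" selects only the K and V BMM quantizers, not Q.
kv_hits = matched("*[kv]_bmm_quantizer", names)

for name in output_hits + kv_hits:
    print(name)
```

If this reading is right, the two patterns select disjoint sets of quantizers, which is why I suspect "*output_quantizer" is no longer what controls KV cache quantization.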
Are there other reasons behind the definition of output_quantizer that I am unaware of? Looking forward to your reply. Thanks in advance!