
Why introduce the design of an output_quantizer within quantization operators? #1533

Description

@bestzsq

I'm puzzled by one specific design choice: why does modelopt define an output_quantizer in the quantized operator base classes (QuantInputBase, QuantLinearConvBase) and then disable it? (code from here)

I can think of two places that might explain it:

  1. For KV cache quantization:
  • run_auto_quantize.py sets "*output_quantizer" to enabled. (code from here)
  • quantization/algorithms.py sets "*output_quantizer" to false to disable KV cache quantization. (code from here)

However, it now appears that KV cache quantization is configured via "*[kv]_bmm_quantizer" instead.

  2. For LayerNorm output, _FP8_MHA_OVERRIDE sets "*output_quantizer" to enabled (code from here). I am unclear on what "fuse the shared Q/DQ across all downstream Q/K/V/FC consumers" implies. Does this mean that the inputs to Q, K, V, and FC are all expected to be FP8-quantized?
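
To make my reading of the wildcard keys concrete, here is a minimal sketch of how I assume such patterns select quantizers by name. This is not modelopt's actual matching code: the `resolve` helper, the module names, the `num_bits` values, and the "later entries override earlier ones" rule are all hypothetical, and I am only assuming the keys behave like glob patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern -> attribute mapping, modeled on the wildcard keys quoted above.
# The attribute values are illustrative assumptions, not modelopt defaults.
quant_cfg = {
    "*input_quantizer": {"num_bits": 8, "enable": True},
    "*weight_quantizer": {"num_bits": 8, "enable": True},
    "*output_quantizer": {"enable": False},
    "*[kv]_bmm_quantizer": {"num_bits": 8, "enable": True},
}


def resolve(name, cfg):
    """Merge the attributes of every pattern that matches a quantizer name.

    Assumption: patterns are glob-style and later entries override earlier ones.
    """
    settings = {}
    for pattern, attrs in cfg.items():
        if fnmatchcase(name, pattern):
            settings.update(attrs)
    return settings


# Hypothetical quantizer names as they might appear in a transformer block.
names = [
    "layers.0.attn.q_proj.input_quantizer",
    "layers.0.attn.q_proj.output_quantizer",
    "layers.0.attn.k_bmm_quantizer",
    "layers.0.attn.v_bmm_quantizer",
    "layers.0.attn.q_bmm_quantizer",
]

for name in names:
    print(name, "->", resolve(name, quant_cfg))
```

Under these assumptions, "*output_quantizer" would hit the output quantizer of every linear layer, while "*[kv]_bmm_quantizer" would only hit the K and V BMM quantizers and leave a q_bmm_quantizer untouched, which is why the two keys look like separate mechanisms to me.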

Are there other reasons behind the definition of output_quantizer that I am unaware of? Looking forward to your reply. Thanks in advance!
