Quantizing models with bitsandbytes #1161

@raunaks13

Description

I have a few questions about exactly how the quantization is done:

  1. Does bitsandbytes do fp16 -> int8 quantization after transferring the tensors to the GPU? And if you want to dequantize, are those operations done on the GPU as well?
  2. I traced the workflow of Linear8BitLt(), which leads me to believe that quantization is happening in this line - https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2419. Could someone please confirm this? If not, where is the quantization occurring?
  3. Is the quantization method absmax or zero-point, and is it done row-wise? There is some mention of column-wise features, but when I load quantized models with huggingface the scale factors seem to differ for each row, not each column (see the toy sketch after this list for what I mean by row-wise absmax).
  4. When you quantize a model, do you treat outliers separately as described in the LLM.int8() paper? If so, then where does this happen in the source code?
