Quantizing models with bitsandbytes #1161

@raunaks13

Description

I have a few questions about exactly how the quantization is done:

  1. Does bitsandbytes do fp16 -> int8 quantization after transferring the tensors to the GPU? And if you want to dequantize, are those operations done on the GPU as well?
  2. I traced the workflow of Linear8BitLt(), which leads me to believe that quantization is happening in this line - https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2419. Could someone please confirm this? If not, where is the quantization occurring?
  3. Is the quantization method absmax or zero-point, and is it done row-wise? There is some mention of column-wise features, but when I load quantized models with huggingface the scale factors seem to differ for each row, not each column (see the toy sketch after this list for what I mean by row-wise absmax).
  4. When you quantize a model, do you treat outliers separately as described in the LLM.int8() paper? If so, then where does this happen in the source code?
