Hello,
I am currently working on a fork of OmniQuant, and your work is truly brilliant. I have one question about activation quantization: what is the reason you still activate it?
To replicate your paper's results, I run:
```sh
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b \
--epochs 0 --output_dir ./log/test \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/Pretrained/Parameters
```
But what I find odd is that in `OmniQuant/quantize/omniquant.py` (lines 216 to 221 at feffe8e) you still enable activation quantization:
```python
# init smooth parameters
set_quant_state(qlayer, weight_quant=False, act_quant=True)  # weight will be manually quantized before forward
qlayer.let = args.let
use_shift = True
if is_llama or args.abits == 16:
    use_shift = False  # deactivate channel-wise shifting for llama model and weight-only quantization
```
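For context, here is how I understand the flag propagation; this is a minimal sketch of my reading, assuming the helper simply walks the layer's submodules (the real code in `quantize/` may differ in detail):

```python
# Hypothetical sketch of the propagation I am describing, not the repo's code.
from quantize.int_matmul import QuantMatMul  # assumed import path

def set_quant_state(qlayer, weight_quant=False, act_quant=False):
    for module in qlayer.modules():
        if isinstance(module, QuantMatMul):  # QuantLinear would be handled the same way
            module.use_weight_quant = weight_quant
            module.use_act_quant = act_quant  # ends up True here, even when abits == 16
```

If that reading is right, `act_quant=True` reaches every `QuantMatMul` regardless of `abits`.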
This can seem harmless since we are using 16-bit activations, but wouldn't it force FP16 into INT16, resulting in a loss of information? Especially since you init `QuantMatMul` in `OmniQuant/models/int_llama_layer.py` (lines 90 to 95 at feffe8e):
```python
self.qkt_matmul = QuantMatMul(
    args.q_quant_params, args.k_quant_params, matmul_func=torch.matmul
)
self.pv_matmul = QuantMatMul(
    args.p_quant_params, args.v_quant_params, matmul_func=torch.matmul
)
```
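My reading of what `quant_x1`/`quant_x2` then do is the following; again a hedged sketch under the assumption that they simply gate on `use_act_quant`, not the verbatim class:

```python
import torch
import torch.nn as nn
from quantize.quantizer import UniformAffineQuantizer  # assumed import path

class QuantMatMul(nn.Module):
    # Sketch of the behavior described in this issue, not the repo's exact code.
    def __init__(self, x1_quant_params, x2_quant_params, matmul_func=torch.matmul):
        super().__init__()
        self.x1_quantizer = UniformAffineQuantizer(**x1_quant_params)
        self.x2_quantizer = UniformAffineQuantizer(**x2_quant_params)
        self.matmul_func = matmul_func
        self.use_act_quant = False  # flipped to True by set_quant_state above

    def quant_x1(self, x1):
        # With use_act_quant=True this fires even for abits == 16
        return self.x1_quantizer(x1) if self.use_act_quant else x1

    def quant_x2(self, x2):
        return self.x2_quantizer(x2) if self.use_act_quant else x2

    def forward(self, x1, x2):
        return self.matmul_func(x1, x2)
```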
And this will trigger the quantization procedure on every activation, as in `OmniQuant/models/int_llama_layer.py` (lines 137 to 144 at feffe8e):
```python
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)

query_states = self.qkt_matmul.quant_x1(query_states)
key_states = self.qkt_matmul.quant_x2(key_states)
attn_weights = self.qkt_matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
```
This will result in degraded performance, while the desired behavior would be to keep the activations in FP16. That is not what happens, because the flag `use_act_quant` is set for all `QuantMatMul` modules; this results in a call to `UniformAffineQuantizer` and produces a quantized INT16 form of the activations.
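To make the concern concrete, here is a small self-contained demonstration of the information loss I would expect. It uses a plain min-max uniform affine fake-quantizer as an assumption, not OmniQuant's exact `UniformAffineQuantizer`:

```python
# Sketch assuming a min-max uniform affine fake-quantizer: even a 16-bit
# uniform grid loses information relative to FP16, whose spacing is
# non-uniform and much finer near zero.
import torch

torch.manual_seed(0)
x = (torch.randn(4096) * 4).half().float()  # FP16-representable activation values

n_bits = 16
qmax = 2 ** n_bits - 1
scale = (x.max() - x.min()) / qmax           # one uniform step across the full range
zero_point = torch.round(-x.min() / scale)

x_int = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
x_deq = (x_int - zero_point) * scale         # "fake-quantized" activations

err = (x - x_deq).abs()
print(f"step={scale.item():.3e}  max|err|={err.max().item():.3e}  mean|err|={err.mean().item():.3e}")
```

Even with 65536 levels, the uniform step is set by the full dynamic range, so small activations are rounded much more coarsely than FP16 would round them.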