INT4 W4A16 kernel for AWQ#57
Conversation
@Xia-Weiwen could you help review this one?
I can comment a bit on point 3 about …
On point 4 …
qzeros = (qzeros.unsqueeze(-1) >> bitshifts) & 0xF
qzeros = qzeros.flatten(-2).to(torch.uint8)
return qweight, qzeros, scales
By convention, scales is put before qzeros. How about reordering them?
This PR uses integer zero points, which aligns with the AWQ format.
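For context, a runnable sketch of the nibble-unpacking step in the diff above. The int32 packing and the plain little-endian nibble order (shifts 0, 4, …, 28) are assumptions for illustration; real AWQ checkpoints use a permuted pack order.

import torch

def unpack_qzeros(qzeros):
    # one 4-bit shift per nibble: 0, 4, ..., 28
    bitshifts = torch.arange(8, dtype=torch.int32) * 4
    # broadcast the shifts over a new trailing dim, then mask out each nibble
    qzeros = (qzeros.unsqueeze(-1) >> bitshifts) & 0xF
    return qzeros.flatten(-2).to(torch.uint8)

packed = torch.tensor([[0x76543210]], dtype=torch.int32)
print(unpack_qzeros(packed))  # tensor([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=torch.uint8)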
@MingxuZh will help collect data. We will review the data internally first. Thanks.
For points 3 and 4, I think the comments from @gau-nernst are very helpful, and I believe @gau-nernst has thought about them seriously. Since I am not very familiar with the code of vLLM or SGLang, I am afraid I cannot give further comments on these issues. Thanks.
We tested the performance numbers and found that when the number of concurrent requests is small, the improvement of int4 over int8 is not big, but the improvement grows as concurrent requests increase. A little bit out of my expectation; anyway, I will debug deeper to find out why.
mingfeima left a comment
Merging this one now. I will help refine the kernels later on. We tested that the functionality is OK; performance-wise it still needs some work.
Thanks for the contribution! @gau-nernst
* fix
* fix AWQ for DSv3
* don't use absorb MLA for AWQ
* lint
* more fixes
* add w4a16 kernel
* remove unnecessary name
* add note
* use prefetch. simplify impl
* clean up. add brgemm (WIP)
* fix brgemm
* fix mismatch BLOCK_N
* use at::quint4x2 to signify type better
* change type of zero point back to uint8
* add FusedMoE interface
* use FusedMoE kernel
* fix types
* fix MoE
* update deepseek.cpp
Hi @gau-nernst. Thanks for your amazing work! I am a little curious why the bf16_lut is … If I didn't misunderstand it, the visible values should be … The (w - z) should range over [-15, 15], so (w - z + 15) should range over [0, 30]. I understand the positive values, which map 15-30 to 0-15, but I don't understand why the negative values range over [-4, 0] instead of [-15, 0]. I would appreciate it if you could give me some instructions. Thanks!
import torch

# int16 bit patterns that, reinterpreted, give the bf16 LUT entries
x = torch.tensor([
0x0000, 0x4170, 0x4160, 0x4150, 0x4140, 0x4130, 0x4120, 0x4110,
0x4100, 0x40E0, 0x40C0, 0x40A0, 0x4080, 0x4040, 0x4000, 0x3F80,
0x0000,-0x4080,-0x4000,-0x3FC0,-0x3F80,-0x3F60,-0x3F40,-0x3F20,
-0x3F00,-0x3EF0,-0x3EE0,-0x3ED0,-0x3EC0,-0x3EB0,-0x3EA0,-0x3E90
], dtype=torch.int16)
x.view(torch.bfloat16).view(2, -1)

I wrote this a long time ago, so I don't remember the exact details. From the snippet above, it looks like it's working correctly? I think negative hexadecimal numbers can be confusing. I can't remember how I obtained the LUT, but I think because …
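For anyone double-checking by hand, a minimal decode of a single entry; the specific entry here is just an example, using the standard bf16 layout of 1 sign, 8 exponent, and 7 mantissa bits:

import torch

# -0x4080 as an int16 is the bit pattern 0xBF80 (two's complement), and
# 0xBF80 reinterpreted as bf16 is -1.0 — the "negative" hex literals are
# just a signed view of the underlying bits, not negative bf16 literals.
entry = torch.tensor([-0x4080], dtype=torch.int16)
print(entry.view(torch.bfloat16))  # tensor([-1.], dtype=torch.bfloat16)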
Hi @gau-nernst. I made a mistake when converting the BF16 to FP32. You were right. Thanks for your help!!!
Hi @gau-nernst. We are going to apply your amazing work in bitsandbytes. Could you please provide any guidance on the licensing/attribution we need for the code here? Thanks!
Modifications
Add INT4 W4A16 kernel for AWQ to replace torch._weight_int4pack_mm_for_cpu()
Key features
Benchmarks: Intel 8481C (VM access, not bare metal)
TechxGenus/DeepSeek-V2-Lite-Chat-AWQ (MoE) - 5.3GB active
deepseek_v2.py, which can get quite messy.
Let me know if you need me to add a correctness test. I did have some quick tests while developing the kernel, as well as a sanity check by looking at model outputs.
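For readers unfamiliar with W4A16, here is a minimal pure-PyTorch emulation of the dequantize-then-matmul that such a kernel fuses. The shapes, group size, and already-unpacked uint8 layout are assumptions for illustration, not the actual kernel interface:

import torch

def awq_w4a16_ref(x, qweight, scales, qzeros, group_size=128):
    # qweight: (K, N) uint8 nibbles in [0, 15]; scales/qzeros: (K // group_size, N)
    z = qzeros.repeat_interleave(group_size, dim=0).to(torch.int16)
    s = scales.repeat_interleave(group_size, dim=0)
    # integer zero points, AWQ-style: dequantize as (w - z) * s per group
    w = (qweight.to(torch.int16) - z).to(torch.bfloat16) * s
    return x @ w  # activations stay bf16 (the "A16" in W4A16)

K, N = 256, 64
x = torch.randn(4, K, dtype=torch.bfloat16)
qweight = torch.randint(0, 16, (K, N), dtype=torch.uint8)
scales = torch.randn(K // 128, N, dtype=torch.bfloat16)
qzeros = torch.randint(0, 16, (K // 128, N), dtype=torch.uint8)
out = awq_w4a16_ref(x, qweight, scales, qzeros)  # (4, N)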
Checklist