add fp4 matmul kernels for deepseek v4 flash #867
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@IlyasMoutawwakil please help review
Shouldn't they be w4a8?
No, the activations stay 16-bit, since fp8_act_quant is not called on the activations in the fp4 kernel path.
To do FP8 × FP4, the activations cannot be used as-is. They must first be quantized, typically per-token or per-block, together with the corresponding scale factors. In the MoE grouped path, there is already additional overhead from routing, sorting, the activation function, and the second projection, so activation quantization, dequantization, and scale movement are not free. FP8 activations only provide a clear speedup when the backend fuses activation quantization and GEMM efficiently. DeepGEMM does that; this Triton fallback does not.
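For readers following along, here is a minimal pure-Python sketch of what per-token FP8 activation quantization with scale factors involves. It is not the PR's `fp8_act_quant` and not a Triton kernel: the helper names `round_to_e4m3`, `quantize_per_token`, and `dequantize` are hypothetical, and it assumes the FP8 E4M3 format (largest finite value 448), simulating the rounding in software.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Simulate rounding x to the nearest FP8 E4M3 value (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_token(rows):
    """Quantize each row (token) with its own scale; returns (values, scales)."""
    quantized, scales = [], []
    for row in rows:
        absmax = max(abs(v) for v in row)
        # Map the row's largest magnitude onto the FP8 maximum.
        scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Undo the scaling; rounding error from the FP8 grid remains."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two "tokens" with very different dynamic ranges.
activations = [[0.5, -2.0, 1.0], [100.0, 0.0, -25.0]]
q_values, q_scales = quantize_per_token(activations)
restored = dequantize(q_values, q_scales)
```

Each token carries one extra scale that has to be computed, stored, and applied again after the GEMM, which is the overhead being discussed when the quantization step is not fused into the kernel.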
Yes, exactly, and that's why we do it in the batched and grouped fp8 paths as well, so why not in these fp4 ones? 😅 For me this is less a matter of choice than of staying as close as possible to the original dsv4 implementation.
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Verified on XPU.