add fp4 matmul kernels for deepseek v4 flash by sywangyi · Pull Request #867 · huggingface/kernels-community

sywangyi · 2026-05-18T08:13:03Z

Verified on XPU.

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

sywangyi · 2026-05-18T08:13:18Z

@IlyasMoutawwakil please help review

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

IlyasMoutawwakil · 2026-05-19T01:31:51Z

Shouldn't they be W4A8?

sywangyi · 2026-05-19T11:38:56Z

No, the activations are 16-bit, since fp8_act_quant is not called on the activations in the FP4 kernel path.

sywangyi · 2026-05-20T07:03:23Z

To do FP8 × FP4, the activations cannot be used as-is. They must first be quantized, typically per-token or per-block, together with the corresponding scale factors. In the MoE grouped path, there is already additional overhead from routing, sorting, the activation function, and the second projection, so activation quantization, dequantization, and scale movement are not free. FP8 activations only provide a clear speedup when the backend fuses activation quantization and GEMM efficiently. DeepGEMM does that; this Triton fallback does not.
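To make the cost of that extra step concrete, here is a minimal sketch of per-block activation quantization for a single token. It is not the code from this PR: the helper names (`round_to_e4m3`, `quantize_per_block`, `dequantize_per_block`) and the block size of 128 are assumptions chosen for illustration, and the FP8 rounding is only approximated in pure Python (subnormals and NaN are ignored). The only fixed fact used is that the largest finite E4M3 value is 448.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 E4M3 (fn) format


def round_to_e4m3(x: float) -> float:
    """Approximate rounding to E4M3 precision (4 significant bits).

    Simplification for illustration: subnormals and NaN are not modeled.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)    # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    mantissa = round(mantissa * 16) / 16  # keep 1 implicit + 3 explicit mantissa bits
    value = math.ldexp(mantissa, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))  # saturate to the FP8 range


def quantize_per_block(row, block_size=128):
    """Quantize one token's activations block by block.

    Returns the quantized values and one scale factor per block.
    The block size is a hypothetical choice, not taken from the PR.
    """
    quantized, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the largest FP8 value.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def dequantize_per_block(quantized, scales, block_size=128):
    """Undo the scaling; the rounding error from quantization remains."""
    return [v * scales[i // block_size] for i, v in enumerate(quantized)]


if __name__ == "__main__":
    activations = [0.01 * i - 1.0 for i in range(256)]
    q, s = quantize_per_block(activations)
    restored = dequantize_per_block(q, s)
    worst = max(abs(a - b) for a, b in zip(activations, restored))
    print(f"blocks: {len(s)}, worst absolute error: {worst:.4f}")
```

Each block needs a reduction to find its maximum, a division per element, and an extra scale tensor that the GEMM must read. A fused backend hides this work inside the matmul kernel; running it as a separate pass is the overhead described above.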

IlyasMoutawwakil · 2026-05-21T01:17:21Z

Yes, exactly, and that's why we do it in the batched and grouped FP8 paths as well, so why not in these FP4 ones? 😅 For me this is less a matter of choice than of staying as close as possible to the original DeepSeek V4 implementation.

sywangyi added 2 commits May 18, 2026 16:11

add fp4 matmul kernels for deepseek v4 flash

99c182a

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

add missing source code..

0225c79

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

sywangyi requested review from danieldk and drbh as code owners May 18, 2026 08:13

update

ccd9d65

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

update to w4a8

fa8bd0c

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

IlyasMoutawwakil reviewed May 25, 2026

Comment thread finegrained-fp8/torch-ext/finegrained_fp8/batched.py Outdated

update

f6ebbb4

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

sywangyi wants to merge 5 commits into huggingface:main from sywangyi:deepseek_v4_fp4


2 participants
