activation: Add CPU backend for `silu_and_mul` by jiqing-feng · Pull Request #942 · huggingface/kernels-community

jiqing-feng · 2026-06-10T03:25:14Z

Summary

Adds an optimized CPU implementation of silu_and_mul (SwiGLU activation) to the
activation kernel, so the op can run natively on CPU in addition to CUDA/Metal.

What's included

Runtime-dispatched AVX512-BF16 kernel. A dispatcher TU (compiled without
AVX512 flags) selects, at runtime via CPUID feature detection, between an
AVX512-BF16 vectorized path and a generic ATen fallback. The AVX512 path and the
dispatcher are split into separate translation units so the binary stays loadable
on machines without AVX512.
Numerically exact. The BF16 path rounds the SiLU result back through BF16
before the multiply, matching F.silu(x[..., :d]) * x[..., d:] bit-for-bit.
Fallback. Non-BF16 / non-contiguous inputs, or CPUs without AVX512-BF16, use
an ATen-based path, so correctness is preserved everywhere.

Files

File	Purpose
`activation_cpu/silu_and_mul_cpu_torch.cpp`	Op entry point matching the existing binding signature
`activation_cpu/silu_and_mul_cpu.{cpp,hpp}`	Runtime dispatcher (AVX512 vs. ATen fallback)
`activation_cpu/silu_and_mul_avx512.{cpp,hpp}`	AVX512-BF16 vectorized kernel (header is intrinsic-free)
`activation_cpu/cpu_features.hpp`	CPUID-based AVX512-BF16 detection
`build.toml`	`cpu` backend + `activation_cpu` / `activation_cpu_avx512` kernel sections
`torch-ext/torch_binding.cpp`	Registers the CPU impl for `silu_and_mul`

Validation

Tested against the nix-built artifact via kernels.get_local_kernel, on an
Intel Xeon 6 machine bound to a single NUMA node
(numactl --cpunodebind=0 --membind=0, OMP_NUM_THREADS=32, AVX512-BF16).

Correctness — bf16 / fp16 / fp32 across shapes
[(7,512), (83,512), (2048,512), (2048,13824), (1,4096)]: max_diff = 0.0 for all.

Performance (bf16, torch.utils.benchmark median, speedup vs. eager
F.silu(gate) * up):

ntok	d	speedup
2048	512	1.72×
2048	4096	1.26×
2048	13824	1.96×
8192	4096	2.17×

Notes

Only silu_and_mul gets a CPU impl in this PR; the other *_and_mul / GELU
variants remain CUDA/Metal-only.
The AVX512 TU is built with -mavx512f/bf16/vl/dq/bw. dq and bw are required
because ATen's at::vec BF16↔FP32 conversion uses _mm512_extracti32x8_epi32.

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng · 2026-06-10T03:26:19Z

Referenced script:

import os
from pathlib import Path

import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark
from kernels import get_local_kernel

REPO = Path("/home/jiqing/qrdgvgjak43jp7w90ckydgz3righnyi2-activation-torch-ext")

act = get_local_kernel(REPO, backend="cpu")
print(f"loaded: {act.__name__}  silu_and_mul={act.silu_and_mul}\n")


def kernel(x):
    d = x.shape[-1] // 2
    out = torch.empty(x.shape[:-1] + (d,), dtype=x.dtype)
    act.silu_and_mul(out, x.contiguous())
    return out


def ref(x):
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]


def run_correctness():
    print("=== Correctness ===")
    torch.manual_seed(0)
    dtypes = {torch.bfloat16: 2e-2, torch.float16: 1e-3, torch.float32: 1e-5}
    shapes = [(7, 512), (83, 512), (2048, 512), (2048, 13824), (1, 4096)]
    ok_all = True
    for dt, tol in dtypes.items():
        for ntok, d in shapes:
            x = torch.randn(ntok, 2 * d, dtype=dt)
            got, exp = kernel(x), ref(x)
            diff = (got.float() - exp.float()).abs().max().item()
            ok = torch.allclose(got, exp, atol=tol, rtol=tol)
            ok_all &= ok
            print(f"  {str(dt):17s} ntok={ntok:5d} d={d:6d}  "
                  f"max_diff={diff:.3e}  {'OK' if ok else 'FAIL'}")
    print("ALL CORRECT\n" if ok_all else "SOME FAILED\n")
    return ok_all


def run_perf():
    print("=== Performance (bf16, torch.utils.benchmark) ===")
    nt = torch.get_num_threads()
    print(f"  threads={nt}  OMP_NUM_THREADS={os.environ.get('OMP_NUM_THREADS')}")
    torch.manual_seed(0)
    for ntok, d in [(2048, 512), (2048, 4096), (2048, 13824), (8192, 4096)]:
        x = torch.randn(ntok, 2 * d, dtype=torch.bfloat16)
        out = torch.empty(ntok, d, dtype=torch.bfloat16)

        tk = benchmark.Timer(
            stmt="act.silu_and_mul(out, x)",
            globals={"act": act, "x": x, "out": out}, num_threads=nt,
        ).blocked_autorange(min_run_time=2.0)
        tr = benchmark.Timer(
            stmt="out.copy_(ref(x))",
            globals={"ref": ref, "x": x, "out": out}, num_threads=nt,
        ).blocked_autorange(min_run_time=2.0)

        k, r = tk.median * 1e6, tr.median * 1e6
        print(f"  ntok={ntok:5d} d={d:6d}  kernel={k:8.1f}us  "
              f"torch={r:8.1f}us  speedup={r / k:5.2f}x")


if __name__ == "__main__":
    ok = run_correctness()
    run_perf()
    raise SystemExit(0 if ok else 1)

jiqing-feng · 2026-06-10T03:29:11Z

This kernel was written following the
cpu-kernels skill
in kernel-builder, which guides the runtime-dispatch + AVX512 + ATen-fallback design.

danieldk · 2026-06-23T09:52:39Z

Sorry, this slipped through the cracks.

/kernel-bot security-and-build activation

jiqing-feng · 2026-06-30T03:06:41Z

Hi @danieldk . Is it okay to merge the PR? Thanks!

add silu_and_mul cpu kernel

e88580e

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

sayakpaul requested a review from danieldk June 12, 2026 13:20

jiqing-feng changed the title ~~Add CPU backend for silu_and_mul~~ activation: Add CPU backend for silu_and_mul Jun 16, 2026

jiqing-feng added 2 commits June 16, 2026 09:10

Merge branch 'main' into silu_and_mul

2bbe293

Merge branch 'main' into silu_and_mul

aeb494f

Merge branch 'main' into silu_and_mul

3583e71

Merge branch 'main' into silu_and_mul

12ec888

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

activation: Add CPU backend for `silu_and_mul`#942

activation: Add CPU backend for `silu_and_mul`#942
jiqing-feng wants to merge 5 commits into
huggingface:mainfrom
jiqing-feng:silu_and_mul

jiqing-feng commented Jun 10, 2026

Uh oh!

jiqing-feng commented Jun 10, 2026

Uh oh!

jiqing-feng commented Jun 10, 2026 •

edited

Loading

Uh oh!

danieldk commented Jun 23, 2026

Uh oh!

jiqing-feng commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jiqing-feng commented Jun 10, 2026

Summary

What's included

Files

Validation

Notes

Uh oh!

jiqing-feng commented Jun 10, 2026

Uh oh!

jiqing-feng commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danieldk commented Jun 23, 2026

Uh oh!

jiqing-feng commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jiqing-feng commented Jun 10, 2026 •

edited

Loading