Skip to content

activation: Add CPU backend for silu_and_mul#942

Open
jiqing-feng wants to merge 5 commits into
huggingface:mainfrom
jiqing-feng:silu_and_mul
Open

activation: Add CPU backend for silu_and_mul#942
jiqing-feng wants to merge 5 commits into
huggingface:mainfrom
jiqing-feng:silu_and_mul

Conversation

@jiqing-feng

Copy link
Copy Markdown
Contributor

Summary

Adds an optimized CPU implementation of silu_and_mul (SwiGLU activation) to the
activation kernel, so the op can run natively on CPU in addition to CUDA/Metal.

What's included

  • Runtime-dispatched AVX512-BF16 kernel. A dispatcher TU (compiled without
    AVX512 flags) selects, at runtime via CPUID feature detection, between an
    AVX512-BF16 vectorized path and a generic ATen fallback. The AVX512 path and the
    dispatcher are split into separate translation units so the binary stays loadable
    on machines without AVX512.
  • Numerically exact. The BF16 path rounds the SiLU result back through BF16
    before the multiply, matching F.silu(x[..., :d]) * x[..., d:] bit-for-bit.
  • Fallback. Non-BF16 / non-contiguous inputs, or CPUs without AVX512-BF16, use
    an ATen-based path, so correctness is preserved everywhere.

Files

File Purpose
activation_cpu/silu_and_mul_cpu_torch.cpp Op entry point matching the existing binding signature
activation_cpu/silu_and_mul_cpu.{cpp,hpp} Runtime dispatcher (AVX512 vs. ATen fallback)
activation_cpu/silu_and_mul_avx512.{cpp,hpp} AVX512-BF16 vectorized kernel (header is intrinsic-free)
activation_cpu/cpu_features.hpp CPUID-based AVX512-BF16 detection
build.toml cpu backend + activation_cpu / activation_cpu_avx512 kernel sections
torch-ext/torch_binding.cpp Registers the CPU impl for silu_and_mul

Validation

Tested against the nix-built artifact via kernels.get_local_kernel, on an
Intel Xeon 6 machine bound to a single NUMA node
(numactl --cpunodebind=0 --membind=0, OMP_NUM_THREADS=32, AVX512-BF16).

Correctnessbf16 / fp16 / fp32 across shapes
[(7,512), (83,512), (2048,512), (2048,13824), (1,4096)]: max_diff = 0.0 for all.

Performance (bf16, torch.utils.benchmark median, speedup vs. eager
F.silu(gate) * up):

ntok d speedup
2048 512 1.72×
2048 4096 1.26×
2048 13824 1.96×
8192 4096 2.17×

Notes

  • Only silu_and_mul gets a CPU impl in this PR; the other *_and_mul / GELU
    variants remain CUDA/Metal-only.
  • The AVX512 TU is built with -mavx512f/bf16/vl/dq/bw. dq and bw are required
    because ATen's at::vec BF16↔FP32 conversion uses _mm512_extracti32x8_epi32.

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng

Copy link
Copy Markdown
Contributor Author

Referenced script:

import os
from pathlib import Path

import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark
from kernels import get_local_kernel

REPO = Path("/home/jiqing/qrdgvgjak43jp7w90ckydgz3righnyi2-activation-torch-ext")

act = get_local_kernel(REPO, backend="cpu")
print(f"loaded: {act.__name__}  silu_and_mul={act.silu_and_mul}\n")


def kernel(x):
    d = x.shape[-1] // 2
    out = torch.empty(x.shape[:-1] + (d,), dtype=x.dtype)
    act.silu_and_mul(out, x.contiguous())
    return out


def ref(x):
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]


def run_correctness():
    print("=== Correctness ===")
    torch.manual_seed(0)
    dtypes = {torch.bfloat16: 2e-2, torch.float16: 1e-3, torch.float32: 1e-5}
    shapes = [(7, 512), (83, 512), (2048, 512), (2048, 13824), (1, 4096)]
    ok_all = True
    for dt, tol in dtypes.items():
        for ntok, d in shapes:
            x = torch.randn(ntok, 2 * d, dtype=dt)
            got, exp = kernel(x), ref(x)
            diff = (got.float() - exp.float()).abs().max().item()
            ok = torch.allclose(got, exp, atol=tol, rtol=tol)
            ok_all &= ok
            print(f"  {str(dt):17s} ntok={ntok:5d} d={d:6d}  "
                  f"max_diff={diff:.3e}  {'OK' if ok else 'FAIL'}")
    print("ALL CORRECT\n" if ok_all else "SOME FAILED\n")
    return ok_all


def run_perf():
    print("=== Performance (bf16, torch.utils.benchmark) ===")
    nt = torch.get_num_threads()
    print(f"  threads={nt}  OMP_NUM_THREADS={os.environ.get('OMP_NUM_THREADS')}")
    torch.manual_seed(0)
    for ntok, d in [(2048, 512), (2048, 4096), (2048, 13824), (8192, 4096)]:
        x = torch.randn(ntok, 2 * d, dtype=torch.bfloat16)
        out = torch.empty(ntok, d, dtype=torch.bfloat16)

        tk = benchmark.Timer(
            stmt="act.silu_and_mul(out, x)",
            globals={"act": act, "x": x, "out": out}, num_threads=nt,
        ).blocked_autorange(min_run_time=2.0)
        tr = benchmark.Timer(
            stmt="out.copy_(ref(x))",
            globals={"ref": ref, "x": x, "out": out}, num_threads=nt,
        ).blocked_autorange(min_run_time=2.0)

        k, r = tk.median * 1e6, tr.median * 1e6
        print(f"  ntok={ntok:5d} d={d:6d}  kernel={k:8.1f}us  "
              f"torch={r:8.1f}us  speedup={r / k:5.2f}x")


if __name__ == "__main__":
    ok = run_correctness()
    run_perf()
    raise SystemExit(0 if ok else 1)

@jiqing-feng

jiqing-feng commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

This kernel was written following the
cpu-kernels skill
in kernel-builder, which guides the runtime-dispatch + AVX512 + ATen-fallback design.

@sayakpaul sayakpaul requested a review from danieldk June 12, 2026 13:20
@jiqing-feng jiqing-feng changed the title Add CPU backend for silu_and_mul activation: Add CPU backend for silu_and_mul Jun 16, 2026
@danieldk

Copy link
Copy Markdown
Member

Sorry, this slipped through the cracks.

/kernel-bot security-and-build activation

@jiqing-feng

Copy link
Copy Markdown
Contributor Author

Hi @danieldk . Is it okay to merge the PR? Thanks!

@drbh drbh added area: build-system build.toml, Nix flakes, packaging, and kernel-builder integration backend: cpu CPU kernels new-backend Adds a backend to an existing kernel performance Autotuning, speedups, build-speed improvements size: L Diff <= 1000 lines type: feature New functionality / capability labels Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: build-system build.toml, Nix flakes, packaging, and kernel-builder integration backend: cpu CPU kernels new-backend Adds a backend to an existing kernel performance Autotuning, speedups, build-speed improvements size: L Diff <= 1000 lines type: feature New functionality / capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants