Skip to content

Commit d9d3a5d

Browse files
danielfleischersayakpauldrbhCopilotdanieldk
authored
Add xpu-kernels skill - Intel XPU Triton kernel development (#547)
* XPU Skill Adds a new skill under kernel-builder/skills/xpu-kernels/, alongside the existing cuda-kernels and rocm-kernels skills, bringing Intel XPU support to kernel-builder. Target hardware is Intel Battlemage / Arc Pro B70 (Xe2) via the Intel XPU Backend for Triton (https://github.com/intel/intel-xpu-backend-for-triton). The skill packages the Xe-Forge (https://github.com/IntelLabs/Xe-Forge) workflow — an LLM-driven loop that transforms PyTorch code into optimized Triton kernels for Intel XPU — into the hf-kernels skill format. Xe-Forge has been used to produce measured speedups on KernelBench Level 2 fused kernels (bf16) and Flash Attention forward (fp16); full results live in that repo. * hook up skill in the CI and add docs (#1) * fix: remove existing test repo before upload (#519) * fix: remove existing test repo before upload * fix: add missing content type * fix: prefer removing repos via hub library * fix: use lib from nix shell on runner * fix: disallow more than one instance of E2E running at once to avoid race conditions * fix: prefer using ci token * fix: update e2e to use trust_remote_code for the dummy user * fix: prefer using latest kernels-data in test * fix: update nix warns to throws (#540) * feat: bump cute dsl/cutlass (#545) * feat: add to vouched (#551) * hook up skill in the cli and add docs. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Copilot <copilot@github.com> * Update version bumping scripts with the `--major` option (#550) * Update version bumping scripts with the `--major` option With this change the script supports both major and minor version bumping. For example: Codebase at `0.10.1.dev0` ``` (none) -> 0.10.1 --major -> 0.11.0 --dev -> 0.10.1.dev1 --dev --major -> 0.11.0.dev0 ``` Codebase at `0.10.1`: ``` (none) -> 0.10.2 --major -> 0.11.0 --dev -> 0.10.2.dev0 ``` These are the typical version bumping workflows within the project. * Sync .PHONY targets * upload: fix benchmark deletion filter to match upload filter (#543) * get_local_kernel api changed, leaving backend (second arg) empty for auto discovery (#555) * Paths fix Fixing some paths due to the skill living in the agent-specific location, outside of `kernel-builder/skills/xpu-kernels/`. * update enum --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
1 parent 61bcfe8 commit d9d3a5d

30 files changed

Lines changed: 7632 additions & 2 deletions

docs/source/builder-cli.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -308,7 +308,7 @@ Install a kernels skill for an AI assistant
308308

309309
Default value: `cuda-kernels`
310310

311-
Possible values: `cuda-kernels`, `rocm-kernels`
311+
Possible values: `cuda-kernels`, `rocm-kernels`, `xpu-kernels`
312312

313313
* `--claude` — Install for Claude
314314
* `--codex` — Install for Codex

docs/source/builder/agents-guide.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
Code agents are a good fit to build custom kernels because the hard part is not just writing in Domain Specific Language (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks.
44

5-
Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels` and `rocm-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.
5+
Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels`, `rocm-kernels`, and `xpu-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.
66

77
This guide is for **authoring new kernels**. If you only want to **load an existing precompiled kernel**, use `get_kernel()` instead.
88

docs/source/cli-skills.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ Use `kernel-builder skills add` to install the skills for AI coding assistants l
44
Supported skills include:
55
- `cuda-kernels` (default)
66
- `rocm-kernels`
7+
- `xpu-kernels`
78

89
Skill files are downloaded from the `huggingface/kernels` directory in this [repository](https://github.com/huggingface/kernels/tree/main/kernel-builder/skills).
910

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
# XPU Kernels Skill
2+
3+
This skill was adapted from [Xe-Forge](https://github.com/IntelLabs/Xe-Forge) — an LLM-driven optimization framework that transforms PyTorch code into fast Triton kernels for Intel XPU GPUs.
4+
5+
The skill includes Xe-Forge's CLI tools (`scripts/`), knowledge base (`references/`), and the optimization workflow, all integrated into the hf-kernels skill format.
6+
7+
## Full Experience
8+
9+
For the complete Xe-Forge setup — including the ai-bench harness, test kernels, GEMM/reduction templates, annotated examples, and VTune profiling — clone the full project:
10+
11+
```bash
12+
# Clone the repository
13+
git clone https://github.com/IntelLabs/Xe-Forge
14+
cd Xe-Forge
15+
16+
# Install for Intel XPU
17+
uv sync --extra intel
18+
```
19+
20+
## Prerequisites
21+
22+
- Python 3.10+
23+
- PyTorch with XPU support
24+
- [Intel XPU Backend for Triton](https://github.com/intel/intel-xpu-backend-for-triton)
25+
- Intel XPU hardware (tested on Battlemage G21 / Arc Pro B50)
26+
- Intel VTune Profiler 2025+ *(optional — set `vtune_enabled: false` in `scripts/config.yaml` to skip)*
27+
28+
## Install Dependencies
29+
30+
```bash
31+
pip install -r scripts/requirements.txt
32+
```

kernel-builder/skills/xpu-kernels/SKILL.md

Lines changed: 290 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
# Files for xpu-kernels skill
2+
SKILL.md
3+
README.md
4+
references/correctness.yaml
5+
references/dtype_optimizations.yaml
6+
references/fusion_patterns.yaml
7+
references/huggingface-kernels-integration.md
8+
references/implementation_reference.md
9+
references/kernelbench-classification.md
10+
references/memory_patterns.yaml
11+
references/optimization_levels.yaml
12+
references/optimization_strategies.md
13+
references/persistent_kernel_patterns.yaml
14+
references/workflow_details.md
15+
references/xpu_optimizations.yaml
16+
scripts/analyze_kernel.py
17+
scripts/benchmark.py
18+
scripts/benchmark_kernels.py
19+
scripts/config.py
20+
scripts/config.yaml
21+
scripts/huggingface_kernels_example.py
22+
scripts/requirements.txt
23+
scripts/transformers_injection_example.py
24+
scripts/trial_manager.py
25+
scripts/validate_triton.py
26+
scripts/xpu_profiler.py
Lines changed: 145 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,145 @@
1+
constraints:
2+
- id: outputs_must_match
3+
name: "Outputs must match original"
4+
severity: info
5+
description: |
6+
The verification tool will check that outputs match the original.
7+
If it fails, try a different optimization approach.
8+
9+
- id: streamk_output_must_be_prezeroed
10+
name: "Pre-zero output buffer when using atomic accumulation (Stream K)"
11+
severity: critical
12+
description: |
13+
When partial tiles use tl.atomic_add to accumulate results, the output
14+
tensor MUST be initialized to zero (torch.zeros, NOT torch.empty).
15+
Otherwise partial sums will include garbage values.
16+
17+
WRONG:
18+
```python
19+
c = torch.empty((M, N), device=a.device, dtype=torch.float32)
20+
first_wave[grid](a, b, c, ...) # atomic_add onto garbage
21+
```
22+
23+
CORRECT:
24+
```python
25+
c = torch.zeros((M, N), device=a.device, dtype=torch.float32)
26+
first_wave[grid](a, b, c, ...) # atomic_add safely onto zeros
27+
```
28+
29+
- id: streamk_atomic_add_needs_mask
30+
name: "Atomic adds on partial tiles must be masked for boundary safety"
31+
severity: critical
32+
description: |
33+
When falling back to tl.atomic_add for partial tiles, you MUST apply
34+
boundary masks (rm < M, rn < N) to avoid writing out-of-bounds.
35+
36+
```python
37+
rm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
38+
rn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
39+
mask = (rm < M)[:, None] & (rn < N)[None, :]
40+
tl.atomic_add(c_ptr_, acc, mask=mask, sem='relaxed')
41+
```
42+
43+
- id: int64_cast_for_large_batch_offsets
44+
name: "Cast batch/stride products to int64 to prevent pointer overflow"
45+
severity: critical
46+
description: |
47+
When computing pointer offsets for batched operations, the product of
48+
a batch index and a stride can exceed int32 range for large tensors.
49+
Triton program_id returns int32 by default. You MUST cast to int64
50+
before multiplying by strides.
51+
52+
WRONG (silent int32 overflow → wrong memory addresses):
53+
```python
54+
bid = tl.program_id(axis=1)
55+
offset_a = bid * stride_az # int32 * int32 → overflow for large tensors
56+
a_ptrs = a_ptr + offset_a + ...
57+
```
58+
59+
CORRECT:
60+
```python
61+
bid = tl.program_id(axis=1)
62+
offset_a = bid.to(tl.int64) * stride_az # safe for large tensors
63+
a_ptrs = a_ptr + offset_a + ...
64+
```
65+
66+
This applies whenever a program_id or loop index is multiplied by a
67+
stride that could produce values > 2^31 (≈2 billion elements). Common
68+
in batched GEMM, multi-head attention, and any kernel with a batch
69+
dimension over large tensors.
70+
71+
- id: autotune_no_defaults
72+
name: "Do not put default values on @triton.autotune meta-parameters"
73+
severity: critical
74+
description: |
75+
When using @triton.autotune, the meta-parameters (BLOCK_M, BLOCK_N, etc.)
76+
must NOT have default values in the kernel signature. Default values cause
77+
a "Conflicting meta-parameters" error at runtime.
78+
79+
WRONG:
80+
```python
81+
@triton.autotune(configs=[...], key=['M', 'N', 'K'])
82+
@triton.jit
83+
def kernel(..., BLOCK_M: tl.constexpr = 128, ...):
84+
...
85+
```
86+
87+
CORRECT:
88+
```python
89+
@triton.autotune(configs=[...], key=['M', 'N', 'K'])
90+
@triton.jit
91+
def kernel(..., BLOCK_M: tl.constexpr, ...):
92+
...
93+
```
94+
95+
- id: model_class_pattern
96+
name: "Model class must be compatible with ai-bench loading"
97+
severity: critical
98+
description: |
99+
ai-bench creates Model via direct `__init__()` and uses standard
100+
`load_state_dict()` for weight synchronization between reference
101+
and optimized models.
102+
103+
The Model class should use standard nn.Module patterns:
104+
105+
```python
106+
class Model(nn.Module):
107+
def __init__(self, input_size, hidden_size, ...):
108+
super().__init__()
109+
self.gemm = nn.Linear(input_size, hidden_size)
110+
self._packed = False
111+
112+
def _pack_weights(self):
113+
device = torch.device("xpu")
114+
w = self.gemm.weight.data.detach()
115+
b = self.gemm.bias.data.detach()
116+
self.weight_t = w.to(device, torch.float16).t().contiguous()
117+
self.bias_xpu = b.to(device, torch.float16).contiguous()
118+
self._packed = True
119+
120+
def forward(self, x):
121+
if not self._packed:
122+
self._pack_weights()
123+
# ... launch triton kernel ...
124+
```
125+
126+
- id: descriptor_no_boundary_check_arg
127+
name: "Tensor descriptor .load() does NOT accept boundary_check"
128+
severity: critical
129+
description: |
130+
Tensor descriptors are the preferred memory access API on XPU.
131+
Unlike block pointers which use tl.load(ptr, boundary_check=(0, 1)),
132+
tensor descriptors handle boundaries internally. The .load() method
133+
takes only a coordinate list.
134+
135+
WRONG:
136+
```python
137+
desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...)
138+
data = desc.load([row, col], boundary_check=(0, 1))
139+
```
140+
141+
CORRECT:
142+
```python
143+
desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...)
144+
data = desc.load([row, col])
145+
```
Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Dtype Optimization Patterns for Intel XPU
2+
3+
patterns:
4+
- id: dtype_float64_to_float32
5+
name: "Float64 to Float32 Accumulator"
6+
stage: dtype_fix
7+
description: "Replace float64 accumulators with float32"
8+
rationale: |
9+
float64 throughput is 16-32x slower than float32 on GPUs/XPUs.
10+
This is the single biggest performance killer in many kernels.
11+
Using float64 alone can cap performance at around 2 TFLOPS on Intel XPU.
12+
pattern_before: |
13+
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float64)
14+
a = a_fp32.to(tl.float64)
15+
b = b_fp32.to(tl.float64)
16+
pattern_after: |
17+
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
18+
# No need to convert inputs - keep as float32
19+
expected_speedup: "5-10x"
20+
applies_to:
21+
- gemm
22+
- matmul
23+
- reduction
24+
examples:
25+
- before: |
26+
@triton.jit
27+
def kernel(a_ptr, b_ptr, c_ptr, ...):
28+
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float64)
29+
for k in range(K):
30+
a = tl.load(a_ptr + ...).to(tl.float64)
31+
b = tl.load(b_ptr + ...).to(tl.float64)
32+
acc = tl.dot(a, b, acc)
33+
after: |
34+
@triton.jit
35+
def kernel(a_ptr, b_ptr, c_ptr, ...):
36+
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
37+
for k in range(K):
38+
a = tl.load(a_ptr + ...)
39+
b = tl.load(b_ptr + ...)
40+
acc = tl.dot(a, b, acc)
41+
42+
- id: dtype_input_conversion
43+
name: "Remove Unnecessary Type Conversions"
44+
stage: dtype_fix
45+
description: "Avoid converting inputs to higher precision unnecessarily"
46+
rationale: |
47+
Converting float16 inputs to float64 for computation wastes bandwidth
48+
and compute. Use float32 accumulators with float16 inputs for best
49+
performance on modern accelerators.
50+
pattern_before: |
51+
x = tl.load(x_ptr + offsets).to(tl.float64)
52+
result = x * x # float64 computation
53+
pattern_after: |
54+
x = tl.load(x_ptr + offsets) # Keep as float16
55+
x_fp32 = x.to(tl.float32) # Upcast to float32 only if needed
56+
result = x_fp32 * x_fp32
57+
expected_speedup: "2-4x"
58+
applies_to:
59+
- elementwise
60+
- reduction
61+
62+
- id: dtype_prepack_bf16
63+
name: "Pre-pack weights and inputs to bf16 before kernel launch"
64+
stage: dtype_fix
65+
description: |
66+
Convert weights to bf16 at _pack_weights() time and inputs to bf16
67+
before kernel launch, instead of loading fp32 and converting in-kernel.
68+
rationale: |
69+
Loading fp32 data and converting to bf16 inside the kernel wastes
70+
memory bandwidth:
71+
- fp32 load: 4 bytes per element from global memory
72+
- In-kernel .to(tl.bfloat16): discards half the loaded data
73+
- Net: 2x wasted bandwidth in the K-loop (the hottest path)
74+
75+
Pre-packing to bf16 means the kernel loads 2 bytes per element directly.
76+
For a GEMM with K-loop iterations, this halves the memory traffic for
77+
both A and B tiles — often the difference between 2x and 4x+ speedup.
78+
pattern_before: |
79+
# In _pack_weights():
80+
self.weight_t = w.to(device).t().contiguous() # stored as fp32
81+
82+
# In forward():
83+
x = x.to(device).contiguous() # fp32 input
84+
85+
# In kernel K-loop:
86+
a = tl.load(a_block_ptr, boundary_check=(0, 1)) # loads 4B per element
87+
a = a.to(tl.bfloat16) # converts to 2B — 2x waste
88+
b = tl.load(b_block_ptr, boundary_check=(0, 1))
89+
b = b.to(tl.bfloat16)
90+
acc += tl.dot(a, b)
91+
pattern_after: |
92+
# In _pack_weights():
93+
self.weight_t = w.to(device).t().contiguous().to(torch.bfloat16) # bf16
94+
95+
# In forward():
96+
x = x.to(device, torch.bfloat16).contiguous() # bf16 input
97+
98+
# In kernel K-loop (no conversion needed):
99+
a = tl.load(a_block_ptr, boundary_check=(0, 1)) # loads 2B directly
100+
b = tl.load(b_block_ptr, boundary_check=(0, 1))
101+
acc = tl.dot(a, b, acc=acc) # fused accumulate
102+
expected_speedup: "1.5-2x (halves K-loop memory traffic)"
103+
applies_to:
104+
- gemm
105+
- matmul
106+
- attention
107+
- inference
108+
notes: |
109+
- Keep bias and epilogue vectors in fp32 (small, precision-sensitive)
110+
- Combine with grf_mode='256' and tl.dot(a, b, acc=acc) for best results
111+
- Only for inference; training needs fp32 gradients
112+
- Works with both block pointers and tensor descriptors

0 commit comments

Comments
 (0)