Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/source/builder-cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -308,7 +308,7 @@ Install a kernels skill for an AI assistant

Default value: `cuda-kernels`

Possible values: `cuda-kernels`, `rocm-kernels`
Possible values: `cuda-kernels`, `rocm-kernels`, `xpu-kernels`

* `--claude` — Install for Claude
* `--codex` — Install for Codex
Expand Down
2 changes: 1 addition & 1 deletion docs/source/builder/agents-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

Code agents are a good fit to build custom kernels because the hard part is not just writing in Domain Specific Language (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks.

Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels` and `rocm-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.
Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels`, `rocm-kernels`, and `xpu-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.

This guide is for **authoring new kernels**. If you only want to **load an existing precompiled kernel**, use `get_kernel()` instead.

Expand Down
1 change: 1 addition & 0 deletions docs/source/cli-skills.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ Use `kernel-builder skills add` to install the skills for AI coding assistants l
Supported skills include:
- `cuda-kernels` (default)
- `rocm-kernels`
- `xpu-kernels`

Skill files are downloaded from the `huggingface/kernels` directory in this [repository](https://github.com/huggingface/kernels/tree/main/kernel-builder/skills).

Expand Down
32 changes: 32 additions & 0 deletions kernel-builder/skills/xpu-kernels/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# XPU Kernels Skill

This skill was adapted from [Xe-Forge](https://github.com/IntelLabs/Xe-Forge) — an LLM-driven optimization framework that transforms PyTorch code into fast Triton kernels for Intel XPU GPUs.

The skill includes Xe-Forge's CLI tools (`scripts/`), knowledge base (`references/`), and the optimization workflow, all integrated into the hf-kernels skill format.

## Full Experience

For the complete Xe-Forge setup — including the ai-bench harness, test kernels, GEMM/reduction templates, annotated examples, and VTune profiling — clone the full project:

```bash
# Clone the repository
git clone https://github.com/IntelLabs/Xe-Forge
cd Xe-Forge

# Install for Intel XPU
uv sync --extra intel
```

## Prerequisites

- Python 3.10+
- PyTorch with XPU support
- [Intel XPU Backend for Triton](https://github.com/intel/intel-xpu-backend-for-triton)
- Intel XPU hardware (tested on Battlemage G21 / Arc Pro B50)
- Intel VTune Profiler 2025+ *(optional — set `vtune_enabled: false` in `scripts/config.yaml` to skip)*

## Install Dependencies

```bash
pip install -r scripts/requirements.txt
```
290 changes: 290 additions & 0 deletions kernel-builder/skills/xpu-kernels/SKILL.md

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions kernel-builder/skills/xpu-kernels/manifest.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Files for xpu-kernels skill
SKILL.md
README.md
references/correctness.yaml
references/dtype_optimizations.yaml
references/fusion_patterns.yaml
references/huggingface-kernels-integration.md
references/implementation_reference.md
references/kernelbench-classification.md
references/memory_patterns.yaml
references/optimization_levels.yaml
references/optimization_strategies.md
references/persistent_kernel_patterns.yaml
references/workflow_details.md
references/xpu_optimizations.yaml
scripts/analyze_kernel.py
scripts/benchmark.py
scripts/benchmark_kernels.py
scripts/config.py
scripts/config.yaml
scripts/huggingface_kernels_example.py
scripts/requirements.txt
scripts/transformers_injection_example.py
scripts/trial_manager.py
scripts/validate_triton.py
scripts/xpu_profiler.py
145 changes: 145 additions & 0 deletions kernel-builder/skills/xpu-kernels/references/correctness.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
constraints:
- id: outputs_must_match
name: "Outputs must match original"
severity: info
description: |
The verification tool will check that outputs match the original.
If it fails, try a different optimization approach.

- id: streamk_output_must_be_prezeroed
name: "Pre-zero output buffer when using atomic accumulation (Stream K)"
severity: critical
description: |
When partial tiles use tl.atomic_add to accumulate results, the output
tensor MUST be initialized to zero (torch.zeros, NOT torch.empty).
Otherwise partial sums will include garbage values.

WRONG:
```python
c = torch.empty((M, N), device=a.device, dtype=torch.float32)
first_wave[grid](a, b, c, ...) # atomic_add onto garbage
```

CORRECT:
```python
c = torch.zeros((M, N), device=a.device, dtype=torch.float32)
first_wave[grid](a, b, c, ...) # atomic_add safely onto zeros
```

- id: streamk_atomic_add_needs_mask
name: "Atomic adds on partial tiles must be masked for boundary safety"
severity: critical
description: |
When falling back to tl.atomic_add for partial tiles, you MUST apply
boundary masks (rm < M, rn < N) to avoid writing out-of-bounds.

```python
rm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
rn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
mask = (rm < M)[:, None] & (rn < N)[None, :]
tl.atomic_add(c_ptr_, acc, mask=mask, sem='relaxed')
```

- id: int64_cast_for_large_batch_offsets
name: "Cast batch/stride products to int64 to prevent pointer overflow"
severity: critical
description: |
When computing pointer offsets for batched operations, the product of
a batch index and a stride can exceed int32 range for large tensors.
Triton program_id returns int32 by default. You MUST cast to int64
before multiplying by strides.

WRONG (silent int32 overflow → wrong memory addresses):
```python
bid = tl.program_id(axis=1)
offset_a = bid * stride_az # int32 * int32 → overflow for large tensors
a_ptrs = a_ptr + offset_a + ...
```

CORRECT:
```python
bid = tl.program_id(axis=1)
offset_a = bid.to(tl.int64) * stride_az # safe for large tensors
a_ptrs = a_ptr + offset_a + ...
```

This applies whenever a program_id or loop index is multiplied by a
stride that could produce values > 2^31 (≈2 billion elements). Common
in batched GEMM, multi-head attention, and any kernel with a batch
dimension over large tensors.

- id: autotune_no_defaults
name: "Do not put default values on @triton.autotune meta-parameters"
severity: critical
description: |
When using @triton.autotune, the meta-parameters (BLOCK_M, BLOCK_N, etc.)
must NOT have default values in the kernel signature. Default values cause
a "Conflicting meta-parameters" error at runtime.

WRONG:
```python
@triton.autotune(configs=[...], key=['M', 'N', 'K'])
@triton.jit
def kernel(..., BLOCK_M: tl.constexpr = 128, ...):
...
```

CORRECT:
```python
@triton.autotune(configs=[...], key=['M', 'N', 'K'])
@triton.jit
def kernel(..., BLOCK_M: tl.constexpr, ...):
...
```

- id: model_class_pattern
name: "Model class must be compatible with ai-bench loading"
severity: critical
description: |
ai-bench creates Model via direct `__init__()` and uses standard
`load_state_dict()` for weight synchronization between reference
and optimized models.

The Model class should use standard nn.Module patterns:

```python
class Model(nn.Module):
def __init__(self, input_size, hidden_size, ...):
super().__init__()
self.gemm = nn.Linear(input_size, hidden_size)
self._packed = False

def _pack_weights(self):
device = torch.device("xpu")
w = self.gemm.weight.data.detach()
b = self.gemm.bias.data.detach()
self.weight_t = w.to(device, torch.float16).t().contiguous()
self.bias_xpu = b.to(device, torch.float16).contiguous()
self._packed = True

def forward(self, x):
if not self._packed:
self._pack_weights()
# ... launch triton kernel ...
```

- id: descriptor_no_boundary_check_arg
name: "Tensor descriptor .load() does NOT accept boundary_check"
severity: critical
description: |
Tensor descriptors are the preferred memory access API on XPU.
Unlike block pointers which use tl.load(ptr, boundary_check=(0, 1)),
tensor descriptors handle boundaries internally. The .load() method
takes only a coordinate list.

WRONG:
```python
desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...)
data = desc.load([row, col], boundary_check=(0, 1))
```

CORRECT:
```python
desc = tl.make_tensor_descriptor(base=ptr, shape=(M, K), ...)
data = desc.load([row, col])
```
112 changes: 112 additions & 0 deletions kernel-builder/skills/xpu-kernels/references/dtype_optimizations.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# Dtype Optimization Patterns for Intel XPU

patterns:
- id: dtype_float64_to_float32
name: "Float64 to Float32 Accumulator"
stage: dtype_fix
description: "Replace float64 accumulators with float32"
rationale: |
float64 throughput is 16-32x slower than float32 on GPUs/XPUs.
This is the single biggest performance killer in many kernels.
Using float64 alone can cap performance at around 2 TFLOPS on Intel XPU.
pattern_before: |
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float64)
a = a_fp32.to(tl.float64)
b = b_fp32.to(tl.float64)
pattern_after: |
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
# No need to convert inputs - keep as float32
expected_speedup: "5-10x"
applies_to:
- gemm
- matmul
- reduction
examples:
- before: |
@triton.jit
def kernel(a_ptr, b_ptr, c_ptr, ...):
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float64)
for k in range(K):
a = tl.load(a_ptr + ...).to(tl.float64)
b = tl.load(b_ptr + ...).to(tl.float64)
acc = tl.dot(a, b, acc)
after: |
@triton.jit
def kernel(a_ptr, b_ptr, c_ptr, ...):
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
for k in range(K):
a = tl.load(a_ptr + ...)
b = tl.load(b_ptr + ...)
acc = tl.dot(a, b, acc)

- id: dtype_input_conversion
name: "Remove Unnecessary Type Conversions"
stage: dtype_fix
description: "Avoid converting inputs to higher precision unnecessarily"
rationale: |
Converting float16 inputs to float64 for computation wastes bandwidth
and compute. Use float32 accumulators with float16 inputs for best
performance on modern accelerators.
pattern_before: |
x = tl.load(x_ptr + offsets).to(tl.float64)
result = x * x # float64 computation
pattern_after: |
x = tl.load(x_ptr + offsets) # Keep as float16
x_fp32 = x.to(tl.float32) # Upcast to float32 only if needed
result = x_fp32 * x_fp32
expected_speedup: "2-4x"
applies_to:
- elementwise
- reduction

- id: dtype_prepack_bf16
name: "Pre-pack weights and inputs to bf16 before kernel launch"
stage: dtype_fix
description: |
Convert weights to bf16 at _pack_weights() time and inputs to bf16
before kernel launch, instead of loading fp32 and converting in-kernel.
rationale: |
Loading fp32 data and converting to bf16 inside the kernel wastes
memory bandwidth:
- fp32 load: 4 bytes per element from global memory
- In-kernel .to(tl.bfloat16): discards half the loaded data
- Net: 2x wasted bandwidth in the K-loop (the hottest path)

Pre-packing to bf16 means the kernel loads 2 bytes per element directly.
For a GEMM with K-loop iterations, this halves the memory traffic for
both A and B tiles — often the difference between 2x and 4x+ speedup.
pattern_before: |
# In _pack_weights():
self.weight_t = w.to(device).t().contiguous() # stored as fp32

# In forward():
x = x.to(device).contiguous() # fp32 input

# In kernel K-loop:
a = tl.load(a_block_ptr, boundary_check=(0, 1)) # loads 4B per element
a = a.to(tl.bfloat16) # converts to 2B — 2x waste
b = tl.load(b_block_ptr, boundary_check=(0, 1))
b = b.to(tl.bfloat16)
acc += tl.dot(a, b)
pattern_after: |
# In _pack_weights():
self.weight_t = w.to(device).t().contiguous().to(torch.bfloat16) # bf16

# In forward():
x = x.to(device, torch.bfloat16).contiguous() # bf16 input

# In kernel K-loop (no conversion needed):
a = tl.load(a_block_ptr, boundary_check=(0, 1)) # loads 2B directly
b = tl.load(b_block_ptr, boundary_check=(0, 1))
acc = tl.dot(a, b, acc=acc) # fused accumulate
expected_speedup: "1.5-2x (halves K-loop memory traffic)"
applies_to:
- gemm
- matmul
- attention
- inference
notes: |
- Keep bias and epilogue vectors in fp32 (small, precision-sensitive)
- Combine with grf_mode='256' and tl.dot(a, b, acc=acc) for best results
- Only for inference; training needs fp32 gradients
- Works with both block pointers and tensor descriptors
Loading
Loading