
Commit 37e30e1

Authored by burtenshaw, merveenoyan, pcuenca, and danieldk

[DOCS] guide for using agents (#459)

* add a guide for using agents to write kernels based on the blog post
* Update docs/source/builder/agents-guide.md
* Apply suggestions from code review

Co-authored-by: Merve Noyan <merve@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
1 parent 1e09d03 commit 37e30e1

2 files changed

Lines changed: 221 additions & 0 deletions

File tree

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -39,6 +39,8 @@
       title: Metal Notes
     - local: builder/build-variants
       title: Build Variants
+    - local: builder/agents-guide
+      title: Building kernels with agents
     title: Building kernels
 - sections:
   - local: api/kernels
```
docs/source/builder/agents-guide.md

Lines changed: 219 additions & 0 deletions
# Writing custom kernels with code agents

Code agents are a good fit for building custom kernels because the hard part is not just writing in a domain-specific language (DSL) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks.

Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels` and `rocm-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.

This guide is for **authoring new kernels**. If you only want to **load an existing precompiled kernel**, use `get_kernel()` instead.

## Before you start
You need:

- a coding agent that supports skills, such as Claude Code, Codex, Cursor, or OpenCode
- a clear target: library, model, operation, GPU, dtype, and representative shapes

The skill currently focuses on NVIDIA GPUs such as **H100**, **A100**, and **T4**, and on integration patterns for **transformers** and **diffusers**.

Install the skill into your agent. If you need the latest version from `main`, use:

```shell
cargo install --git https://github.com/huggingface/kernels hf-kernel-builder

# Install the skill for your agent. Use --claude, --codex, or --opencode
kernel-builder skills add --claude
```

> [!NOTE]
> Check [this example](https://github.com/burtenshaw/kernel-skill/tree/main/examples/ltx_video) to see what generated kernels look like.
## 1. Give the agent a precise task prompt

Writing kernels is a hard problem, so be specific with your agent. A robust prompt declares all the core attributes of the task, including:

- the library, for example `transformers` or `diffusers`
- the model id, for example `Qwen3-8B` or `LTX-Video`
- the operation, for example `RMSNorm`, attention, RoPE, `GEGLU`, or `AdaLN`
- the target GPU, for example `H100`, `A100`, or `T4`
- the dtype, for example `bfloat16`, `float16`, or `float32`
- the outputs you expect: kernel code, bindings, tests, and benchmarks

In practice you can often skip some of these and the agent will infer them from common practice, but if you know a detail, declare it.

For example:

```
Build a vectorized RMSNorm kernel for H100 targeting Qwen3-8B in transformers.
Create the full kernel-builder project, PyTorch bindings, correctness tests, and benchmark scripts.
```

Or for diffusers:

```
Build an H100 RMSNorm kernel for LTX-Video in diffusers.
Patch the pipeline correctly, benchmark it against the PyTorch baseline, and report end-to-end impact.
```

If you prefer, you can first scaffold a project with `kernel-builder init --name <org>/<kernel>` and then ask the agent to fill in the implementation.
## 2. Verify that the agent produces a complete kernel project

A useful result is a full `kernel-builder` project, not just a `.cu` file. The exact layout can vary, but it should include at least:

```
examples/your_model/
├── kernel_src/
│   └── rmsnorm.cu               # Vectorized CUDA kernel
├── torch-ext/
│   ├── your_kernels/__init__.py
│   └── torch_binding.cpp        # PyTorch C++ bindings
├── benchmark_rmsnorm.py         # Micro-benchmark script
├── build.toml                   # kernel-builder config
├── setup.py                     # pip install -e .
└── pyproject.toml
```

The agent skills contain example scripts to help you verify the project, so you can quickly test it yourself by prompting:

```
Verify the kernel project works with a transformers example.
```
## 3. Review the generated files

Let's dive deeper into the generated files and explore how to validate the project.

### `build.toml`

This is the main configuration file for `kernel-builder`. It tells `kernel-builder` what to build and how, so it should contain all the core information about your kernel project.
```toml
[general]
name = "your_kernels"
backends = ["cuda"]
version = 1

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
cuda-capabilities = ["9.0"] # H100
```

First check that:

- `backends = ["cuda"]` is correct for your project
- the kernel source files are listed correctly
- the Torch binding sources are included under `[torch]`
- `cuda-capabilities` is only set when the kernel truly targets specific architectures

For architecture-specific kernels, typical capability values are:

- H100: `9.0`
- A100: `8.0`
- T4: `7.5`

If the kernel does **not** require a specific capability, the kernels docs recommend leaving `cuda-capabilities` unset so the builder can target all supported capabilities. In practice, you can prompt your agent to review the `build.toml` for over-specified capabilities, since agents tend to pin more architectures than necessary.
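If the kernel is portable across architectures, the same file simply drops the capability pin. A sketch of that variant, reusing the hypothetical file names from the example above:

```toml
[general]
name = "your_kernels"
backends = ["cuda"]
version = 1

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
# No cuda-capabilities: kernel-builder targets all supported architectures.
```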
### Torch bindings

The kernel should be registered as Torch ops in `torch-ext/torch_binding.cpp`, with declarations in a header and a small Python wrapper in `torch-ext/<name>/__init__.py`. This is what makes the kernel callable from Python and is the right foundation for `torch.compile` compatibility.
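The compiled op itself cannot run outside a CUDA build, but the shape of the Python wrapper is easy to sketch. Below, `_compiled_rmsnorm` is a plain-Python stand-in for the compiled op, not the real binding; the real wrapper would forward to the registered Torch op instead:

```python
def _compiled_rmsnorm(out, x, weight, eps):
    # Stand-in for the compiled CUDA op: writes RMSNorm(x) * weight into `out`.
    ms = sum(v * v for v in x) / len(x)          # mean of squares
    scale = (ms + eps) ** -0.5                   # 1 / sqrt(ms + eps)
    for i, v in enumerate(x):
        out[i] = v * scale * (weight[i] if weight is not None else 1.0)


def rmsnorm(x, weight=None, eps=1e-6):
    """Public entry point: allocate the output and call the op."""
    out = [0.0] * len(x)
    _compiled_rmsnorm(out, x, weight, eps)
    return out
```

The same two-layer shape (thin Python entry point over an out-parameter op) is what keeps the real binding friendly to `torch.compile`.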
### Model integration code

Make sure the integration matches the library:

- **transformers**: patch the target modules directly, often RMSNorm modules whose class name contains `RMSNorm`
- **diffusers**: inspect the actual pipeline structure before patching, because modules and attention processors can differ across pipelines

> [!NOTE]
> One common issue is that the agent does not integrate the kernel at all, typically because the project's context is long.
A few patterns matter in practice for the integration code:

- In **transformers**, RMSNorm modules generally have weights, but epsilon may be exposed as `variance_epsilon` or `eps` depending on the model.
- In **diffusers**, some RMSNorm modules may have `weight=None`, so the integration code needs to handle both weighted and unweighted cases.
- In **diffusers**, checking `type(module).__name__` is often more reliable than `isinstance(...)` for matching RMSNorm modules across implementations.
- If a diffusers pipeline uses CPU offloading, inject custom kernels **before** enabling offload.

For attention, prefer the model library's existing optimized path when one already exists. For example, in `transformers`, Flash Attention 2 is usually the right baseline for attention, while custom kernels are especially useful for operations like RMSNorm and other targeted hotspots.
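The matching patterns above can be sketched without the real libraries. The classes below are hypothetical stand-ins for transformers- and diffusers-style norm modules, chosen only to illustrate name-based matching and the two epsilon attribute names:

```python
class Qwen3RMSNorm:
    """Stand-in for a transformers-style norm module."""
    def __init__(self):
        self.weight = [1.0] * 8           # transformers norms carry a weight
        self.variance_epsilon = 1e-6      # ...and expose epsilon as variance_epsilon


class RMSNorm:
    """Stand-in for a diffusers-style norm module."""
    def __init__(self):
        self.weight = None                # diffusers norms may be unweighted
        self.eps = 1e-5                   # ...and expose epsilon as eps


def is_rmsnorm(module):
    # Name-based matching survives across implementations where isinstance() fails.
    return "RMSNorm" in type(module).__name__


def get_eps(module, default=1e-6):
    # Epsilon may live under either attribute name, so probe both.
    for attr in ("variance_epsilon", "eps"):
        if hasattr(module, attr):
            return getattr(module, attr)
    return default


modules = [Qwen3RMSNorm(), RMSNorm(), object()]
matched = [m for m in modules if is_rmsnorm(m)]
```

A real integration would walk `model.named_modules()` and swap the forward of every match, handling `weight is None` as discussed above.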
## 5. Build and test

Kernel Hub kernels must support all recent PyTorch and CUDA configurations. The kernel-builder Nix flake handles this automatically. Copy the [example `flake.nix`](https://github.com/huggingface/kernels/blob/main/builder/examples/relu/flake.nix) into your project and run:

```shell
nix flake update
nix run .#build-and-copy -L
```

This builds the kernel for every required PyTorch/CUDA variant and places the results in `build/`. For faster builds, enable the Hugging Face Nix cache:

```shell
nix run nixpkgs#cachix -- use huggingface
```
## 6. Benchmark

There are two main benchmarks to consider:

1. an isolated kernel micro-benchmark
2. an end-to-end benchmark in the real model or pipeline
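Whatever the backend, the micro-benchmark loop has the same shape: warm up, time repeated calls, report a mean. A CPU-only sketch with a pure-Python RMSNorm standing in for the kernel under test (on GPU you would also need to synchronize the device before reading the clock):

```python
import time


def rmsnorm_baseline(xs, eps=1e-6):
    # Pure-Python stand-in for the baseline op: x / sqrt(mean(x^2) + eps)
    ms = sum(v * v for v in xs) / len(xs)
    scale = (ms + eps) ** -0.5
    return [v * scale for v in xs]


def bench(fn, xs, warmup=3, iters=20):
    for _ in range(warmup):                    # warm up before timing
        fn(xs)
    start = time.perf_counter()
    for _ in range(iters):
        fn(xs)
    return (time.perf_counter() - start) / iters   # mean seconds per call


xs = [float(i % 7 - 3) for i in range(4096)]
mean_s = bench(rmsnorm_baseline, xs)
```

Running the same harness over the custom kernel and the baseline, shape by shape, yields exactly the kind of speedup table shown below.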
The agent will generate both benchmarks based on the examples in the agent skills, typically as a script such as `benchmark_example.py`. If you have access to the target hardware, you can run it to verify the kernel works. For example, the agent will generate a table like this:
```markdown
| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| :---- | :---: | :---: | :---: |
| [1x128x4096] | 0.040 | 0.062 | **1.58x** |
| [1x512x4096] | 0.038 | 0.064 | **1.69x** |
| [1x1024x4096] | 0.037 | 0.071 | **1.90x** |
| [1x2048x4096] | 0.045 | 0.091 | **2.03x** |
| [1x4096x4096] | 0.071 | 0.150 | **2.12x** |
| [4x512x4096] | 0.056 | 0.093 | **1.67x** |
| [8x256x4096] | 0.045 | 0.092 | **2.06x** |
| [1x8192x4096] | 0.109 | 0.269 | **2.47x** |
```

Interpret the results carefully. A kernel can show a large isolated speedup but only a modest end-to-end gain if that operation is a small fraction of total runtime. In the LTX-Video example from [the blog we wrote](https://huggingface.co/blog/custom-cuda-kernels-agent-skills), the generated RMSNorm kernel improved the isolated benchmark by about **1.88x** on average, but end-to-end video generation improved by about **6%**, which matched the fact that RMSNorm accounted for only a small share of total compute.
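The gap between isolated and end-to-end speedup is just Amdahl's law. A quick check, where the 12% runtime share is an illustrative assumption chosen to match the blog's numbers, not a measured value:

```python
def end_to_end_speedup(fraction, kernel_speedup):
    # Amdahl's law: only `fraction` of the total runtime is accelerated.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# If RMSNorm were ~12% of runtime and the kernel 1.88x faster in isolation,
# the pipeline as a whole would gain only about 6%.
s = end_to_end_speedup(0.12, 1.88)
```

This is why profiling the model first, to find operations with a meaningful runtime share, matters as much as the kernel itself.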
## 7. Publish to the Hub

Once the project is correct and benchmarked, you can build Hub-compatible artifacts and upload them. First authenticate with the `hf` CLI, then ask the agent to push:

```shell
# Install the agent skills via the hf CLI (if you have not already)
hf skills add

# Authenticate
hf auth login

# Push to the Hub
<agent-prompt>
Push the kernel to the Hub.
</agent-prompt>
```
Or, you can manually create the repository and upload the artifacts:

```shell
# Create the repository
hf repo create your-org/your-kernel --type model

# Upload the artifacts
# Run inside the main kernel directory, where build/ is.
kernel-builder upload
```
After pushing to the Hub, users can load the kernel without compiling:

```py
from kernels import get_kernel

kernel = get_kernel("your-org/your-kernel", version=1)
```

Well done! You have now built a custom kernel and published it to the Hub.
