| 1 | +# kernels benchmark |
| 2 | + |
| 3 | +Use `kernels benchmark` to run benchmark scripts shipped with a kernel repository. |
| 4 | + |
| 5 | +The command: |
| 6 | + |
| 7 | +- Downloads the kernel repo at a specific **branch** or **version** |
| 8 | +- Runs all `benchmarks/benchmark*.py` scripts |
| 9 | +- Times each `benchmark_*` workload and prints a results table |
| 10 | +- Optionally saves results as JSON |
| 11 | + |
| 12 | +## Installation |
| 13 | + |
| 14 | +`kernels benchmark` requires extra dependencies: |
| 15 | + |
| 16 | +```bash |
| 17 | +uv pip install 'kernels[benchmark]' # or pip install 'kernels[benchmark]' |
| 18 | +``` |
| 19 | + |
| 20 | +## Example |
| 21 | + |
| 22 | +```bash |
| 23 | +kernels benchmark kernels-community/activation --version 1 |
| 24 | +``` |
| 25 | + |
| 26 | +Example output: |
| 27 | + |
| 28 | +```text |
| 29 | +Downloading kernels-community/activation@v1... |
| 30 | +Running benchmark.py... |
| 31 | + |
| 32 | + GPU Apple M3 Max (30 cores) |
| 33 | + CPU Apple M3 Max |
| 34 | + OS Darwin 25.2.0 |
| 35 | + PyTorch 2.10.0 |
| 36 | + |
| 37 | + Running SiluWorkloads on mps |
| 38 | + |
| 39 | +┌───────────────┬────────────┬─────┬───────────┬────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬───────────┬─────────┐ |
| 40 | +│ Benchmark │ Workload │ N │ Speedup │ Mean(ms) │ Std(ms) │ Min(ms) │ Max(ms) │ IQR(ms) │ Outliers │ Ref(ms) │ Match │ |
| 41 | +├───────────────┼────────────┼─────┼───────────┼────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼───────────┼─────────┤ |
| 42 | +│ SiluWorkloads │ large │ 100 │ 1.72x │ 6.5153 │ 0.4343 │ 6.2883 │ 8.4699 │ 0.1701 │ 8 │ 11.2048 │ ✓ │ |
| 43 | +│ SiluWorkloads │ medium │ 100 │ 2.48x │ 1.1813 │ 0.3976 │ 1.04 │ 4.2146 │ 0.0698 │ 5 │ 2.9332 │ ✓ │ |
| 44 | +│ SiluWorkloads │ small │ 100 │ 1.96x │ 0.4909 │ 0.2175 │ 0.4407 │ 2.6438 │ 0.0085 │ 16 │ 0.9622 │ ✓ │ |
| 45 | +└───────────────┴────────────┴─────┴───────────┴────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴───────────┴─────────┘ |
| 46 | + |
| 47 | + large: 1.72x faster (95% CI: 6.4302-6.6004ms vs ref 11.2048ms) ✓ significant |
| 48 | + medium: 2.48x faster (95% CI: 1.1034-1.2592ms vs ref 2.9332ms) ✓ significant |
| 49 | + small: 1.96x faster (95% CI: 0.4483-0.5335ms vs ref 0.9622ms) ✓ significant |
| 50 | + |
| 51 | +Kernel: 2385e44 Benchmark: 5b53516 |
| 52 | +``` |
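
The summary lines in the sample output are consistent with a speedup of `Ref(ms) / Mean(ms)` and a normal-approximation 95% confidence interval, `mean ± 1.96 * std / sqrt(N)`. This is inferred from the printed numbers only, not from the runner's source, and the helper names `ci95` and `speedup` are hypothetical. A minimal sketch that reproduces the `large` row under those assumptions:

```python
import math


def ci95(mean_ms, std_ms, n):
    """Normal-approximation 95% CI for the mean.

    Assumption: this formula is inferred from the sample output above,
    not taken from the kernels source code.
    """
    half_width = 1.96 * std_ms / math.sqrt(n)
    return mean_ms - half_width, mean_ms + half_width


def speedup(ref_ms, mean_ms):
    """Ratio of reference time to kernel time (assumed definition)."""
    return ref_ms / mean_ms


# Values copied from the "large" row of the table above.
low, high = ci95(mean_ms=6.5153, std_ms=0.4343, n=100)
print(f"{speedup(11.2048, 6.5153):.2f}x faster (95% CI: {low:.4f}-{high:.4f}ms)")
# prints: 1.72x faster (95% CI: 6.4302-6.6004ms)
```

Under this reading, a row is plausibly marked "significant" when the reference time falls outside the interval, as it does for all three workloads shown.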
| 53 | + |
| 54 | +## Usage |
| 55 | + |
| 56 | +When benchmarking a repository by id, you must specify the revision, either with `--version`/`--branch` or with an `@...` suffix on the repo id: |
| 57 | + |
| 58 | +```bash |
| 59 | +kernels benchmark <repo_id> --version <N> |
| 60 | +kernels benchmark <repo_id> --branch <name> |
| 61 | +kernels benchmark <repo_id>@v<N> |
| 62 | +kernels benchmark <repo_id>@<branch> |
| 63 | +``` |
| 64 | + |
| 65 | +## Examples |
| 66 | + |
| 67 | +Benchmark a tagged kernel version: |
| 68 | + |
| 69 | +```bash |
| 70 | +kernels benchmark kernels-community/activation --version 1 |
| 71 | +``` |
| 72 | + |
| 73 | +Equivalent shorthand: |
| 74 | + |
| 75 | +```bash |
| 76 | +kernels benchmark kernels-community/activation@v1 |
| 77 | +``` |
| 78 | + |
| 79 | +Benchmark a branch: |
| 80 | + |
| 81 | +```bash |
| 82 | +kernels benchmark kernels-community/activation --branch main |
| 83 | +``` |
| 84 | + |
| 85 | +Tune warmup and iteration count: |
| 86 | + |
| 87 | +```bash |
| 88 | +kernels benchmark kernels-community/activation@v1 --warmup 20 --iterations 200 |
| 89 | +``` |
| 90 | + |
| 91 | +Save results to a file (JSON): |
| 92 | + |
| 93 | +```bash |
| 94 | +kernels benchmark kernels-community/activation@v1 --output results.json |
| 95 | +``` |
| 96 | + |
| 97 | +Benchmark a local kernel checkout (must contain `benchmarks/`): |
| 98 | + |
| 99 | +```bash |
| 100 | +kernels benchmark ./my_kernel |
| 101 | +``` |
| 102 | + |
| 103 | +## Output |
| 104 | + |
| 105 | +- By default, a table is printed (timings in ms). |
| 106 | +- `--output <file>.json` writes a JSON payload to disk. |
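
The JSON schema is not described here, so the sketch below makes no assumptions about keys. It only loads the file and reports its top-level shape, which is a reasonable first step before writing any post-processing. The function names `load_results` and `describe` are hypothetical.

```python
import json
from pathlib import Path


def load_results(path):
    """Load a JSON file written by `--output`; the payload schema is not assumed."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def describe(payload):
    """Summarize the top-level shape without relying on specific keys."""
    if isinstance(payload, dict):
        return f"object with keys: {sorted(payload)}"
    if isinstance(payload, list):
        return f"array with {len(payload)} item(s)"
    return f"scalar of type {type(payload).__name__}"


if __name__ == "__main__":
    results_path = Path("results.json")  # file name from the --output example
    if results_path.exists():
        print(describe(load_results(results_path)))
```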
| 107 | + |
| 108 | +## Writing Benchmark Scripts |
| 109 | + |
| 110 | +Benchmark scripts must live under `benchmarks/` in the kernel repository and match `benchmark*.py`. |
| 111 | +Each script should define one or more subclasses of `kernels.benchmark.Benchmark`. |
| 112 | + |
| 113 | +Minimal example (`benchmarks/benchmark_activation.py`): |
| 114 | + |
| 115 | +```python |
| 116 | +import torch |
| 117 | + |
| 118 | +from kernels.benchmark import Benchmark |
| 119 | + |
| 120 | + |
| 121 | +class ActivationBenchmark(Benchmark): |
| 122 | + seed = 0 |
| 123 | + |
| 124 | + def setup(self): |
| 125 | + self.x = torch.randn(128, 1024, device=self.device, dtype=torch.float16) |
| 126 | + self.out = torch.empty(128, 512, device=self.device, dtype=torch.float16) |
| 127 | + |
| 128 | + def benchmark_silu_and_mul(self): |
| 129 | + self.kernel.silu_and_mul(self.out, self.x) |
| 130 | + |
| 131 | + def verify_silu_and_mul(self): |
| 132 | + # Return reference tensor; runner compares with self.out |
| 133 | + return torch.nn.functional.silu(self.x[..., :512]) * self.x[..., 512:] |
| 134 | +``` |
| 135 | + |
| 136 | +The runner will: |
| 137 | + |
| 138 | +- Call `setup()` once per workload (or `setup_<workload>()` if present) |
| 139 | +- Warm up (`--warmup`) |
| 140 | +- Time `benchmark_<workload>()` for `--iterations` |
| 141 | +- If `verify_<workload>()` exists, check that outputs match (`torch.allclose(..., atol=1e-2)`) and show a speedup vs the reference computation |
| 142 | + |
| 143 | +## Troubleshooting |
| 144 | + |
| 145 | +- If the repo has no `benchmarks/` directory, or the directory contains no `benchmark*.py` files, the command exits with an error. |
| 146 | +- If a benchmark script defines no `Benchmark` subclasses, the command exits with an error. |
| 147 | +- If `verify_<workload>()` exists and the outputs do not match, the command exits with an error. |
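
To check a local checkout for the first failure mode before running the command, the discovery rule described above (`benchmarks/` directory, files matching `benchmark*.py`) can be approximated as follows. `find_benchmark_scripts` is a hypothetical helper and the error messages are illustrative, not the CLI's own.

```python
from pathlib import Path


def find_benchmark_scripts(repo_dir):
    """Return sorted benchmark scripts under <repo_dir>/benchmarks.

    Hypothetical pre-flight check mirroring the documented layout;
    raises FileNotFoundError if the directory or scripts are missing.
    """
    bench_dir = Path(repo_dir) / "benchmarks"
    if not bench_dir.is_dir():
        raise FileNotFoundError(f"no benchmarks/ directory in {repo_dir}")

    scripts = sorted(bench_dir.glob("benchmark*.py"))
    if not scripts:
        raise FileNotFoundError(f"no benchmark*.py files in {bench_dir}")
    return scripts


if __name__ == "__main__":
    try:
        for script in find_benchmark_scripts("./my_kernel"):
            print(script)
    except FileNotFoundError as err:
        print(f"error: {err}")
```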