Commit 2bf7117

[Skill] Add liger-kernel-perf skill for Triton kernel performance optimization (#1185)
## Summary

Adds a new Claude Code skill (`liger-kernel-perf`) that optimizes existing Liger Kernel Triton kernels through a 3-stage pipeline: Profile -> Optimize -> Finalize. The skill automatically diagnoses bottlenecks, generates multiple optimization variants, benchmarks them against the full suite, and creates a PR with the winning variant -- all while maintaining correctness. Tested on `fused_add_rms_norm` (see PR #1187), which achieved up to 70% backward speedup on H100 via register pressure reduction.

## What the skill does

- **Stage 1 (Profiler):** Runs baseline benchmarks, detects GPU architecture (Ampere/Hopper/Blackwell), optionally runs NCU profiling, classifies the bottleneck (memory-bound vs compute-bound), and produces an optimization profile with a recommended strategy order
- **Stage 2 (Optimizer):** Autonomous optimization loop -- tries parameter tuning first (manual sweep, NOT @triton.autotune), then diagnosis-driven techniques. Each variant gets a versioned file + lab notebook tracking hypothesis/changes/results/learnings. Stops on budget exhaustion, diminishing returns, or target met
- **Stage 3 (Finalizer):** Applies the winner in-place, runs the full test suite (hard gate), generates 3-way comparison plots (original vs optimized vs baseline), updates the benchmark CSV, and creates a PR with a descriptive body

## Key design decisions

- **No @triton.autotune**: Incompatible with Liger's forward-backward ctx coupling pattern and NPU backends. Uses manual parameter sweeps instead.
- **Full benchmark suite every iteration**: No lightweight shortcuts -- ensures the optimization holds across ALL input sizes, not just cherry-picked ones.
- **Balanced guardrails**: Rejects variants that regress the non-target metric >5%, or that regress one pass >10% to improve the other.
- **Comment preservation**: All existing comments are kept, and explanatory comments are added for every optimization change.
- **Autonomous + interactive modes**: Runs end-to-end when told "just optimize it", or pauses at human checkpoints between stages.

## Files (7 files, ~2,100 lines)

| File | Lines | Purpose |
|------|-------|---------|
| `SKILL.md` | 116 | Orchestration, input parsing, pipeline flow, guardrails |
| `profiler.md` | 156 | Stage 1: baseline benchmarks, GPU detection, bottleneck diagnosis |
| `optimizer.md` | 395 | Stage 2: optimization loop with accumulated learning |
| `finalizer.md` | 417 | Stage 3: apply, test, plot, update CSV, create PR, report |
| `optimization-strategies.md` | 794 | Technique catalog: parameter tuning, memory-bound, compute-bound, architecture-specific |
| `templates/optimization-profile.md` | 140 | Cross-stage contract between Profiler and Optimizer |
| `templates/variant-notes.md` | 70 | Per-variant lab notebook format for learning accumulation |

## Test plan

- [x] Skill triggers correctly on "optimize the X kernel" prompts
- [x] Tested end-to-end on `fused_add_rms_norm` (PR #1187)
- [x] Verified against Claude Code skill best practices (conciseness, progressive disclosure, SKILL.md under 500 lines)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 969d4ab commit 2bf7117

7 files changed

Lines changed: 2095 additions & 0 deletions

`SKILL.md` -- 116 additions & 0 deletions
---
name: liger-kernel-perf
description: "Optimizes the performance of existing Liger Kernel Triton kernels. Profiles kernels, diagnoses bottlenecks (memory-bound vs compute-bound), generates multiple optimization variants with benchmarking, and applies the best variant while maintaining correctness. Supports GPU architecture-specific optimization (Ampere, Hopper, Blackwell). Use when a user asks to optimize, speed up, tune, profile, or reduce memory of an existing Liger kernel."
---

# Liger Kernel Perf

Optimizes existing Liger Kernel Triton kernels through a 3-stage pipeline: Profile, Optimize, Finalize. Supports interactive mode (human checkpoints between stages) and autonomous mode (runs end-to-end). NVIDIA GPUs only.

## Mode Detection

- **Interactive mode** (default): Human checkpoints between each stage
- **Autonomous mode**: User says "just optimize it", "run without asking me", or "optimize autonomously" → all stages run end-to-end, and the user sees only the final report
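The mode-detection rule above amounts to phrase matching on the user's request. A minimal sketch, assuming the three opt-out phrases listed above; the constant and function names are illustrative, not part of the skill:

```python
# Hypothetical sketch of the mode-detection rule; the phrase list mirrors
# the bullet above, everything else is an illustrative assumption.
AUTONOMOUS_PHRASES = (
    "just optimize it",
    "run without asking me",
    "optimize autonomously",
)

def detect_autonomy(request: str) -> str:
    """Return 'autonomous' if the request contains an opt-out phrase,
    otherwise the default 'interactive'."""
    lowered = request.lower()
    if any(phrase in lowered for phrase in AUTONOMOUS_PHRASES):
        return "autonomous"
    return "interactive"
```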
## Input Parsing

Extract from the user's request:

| Field | Description | Default |
|-------|-------------|---------|
| `target_kernel` | Which kernel to optimize (e.g., "rms_norm", "cross_entropy") | **Required** |
| `optimization_goal` | speed / memory / balanced | balanced |
| `scope` | Specific pass (forward/backward), input regime, or general | general |
| `target_gpu` | Ampere / Hopper / Blackwell / auto-detect | auto-detect |
| `autonomy` | interactive / autonomous | interactive |
| `max_variants` | Max optimization variants to try | 8 |
| `target_metric` | Optional concrete target (e.g., "forward under 0.3ms at hidden_size=4096") | none |
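One way to picture the parsed request is as a typed record with the defaults from the table. The class and its field names below are a hypothetical sketch, not the skill's actual representation:

```python
# Illustrative record for the parsed request; fields and defaults mirror
# the Input Parsing table, but this class itself is an assumption.
from dataclasses import dataclass
from typing import Optional

@dataclass
class OptimizationRequest:
    target_kernel: str                   # required, e.g. "rms_norm"
    optimization_goal: str = "balanced"  # speed / memory / balanced
    scope: str = "general"               # forward / backward / input regime / general
    target_gpu: str = "auto-detect"      # Ampere / Hopper / Blackwell / auto-detect
    autonomy: str = "interactive"        # interactive / autonomous
    max_variants: int = 8
    target_metric: Optional[str] = None  # e.g. "forward under 0.3ms at hidden_size=4096"

# "optimize rms_norm for speed" would parse to roughly:
req = OptimizationRequest(target_kernel="rms_norm", optimization_goal="speed")
```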

## Pre-Flight Validation

Before starting the pipeline, validate:

1. Kernel file exists: `src/liger_kernel/ops/{kernel}.py`
2. Benchmark script exists: `benchmark/scripts/benchmark_{kernel}.py`
3. Test file exists: `test/transformers/test_{kernel}.py`
4. GPU is available and CUDA works
5. Project is installed in dev mode (`pip install -e ".[dev]"`)

If any validation fails, report clearly and stop.
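Checks 1-3 are plain file-existence gates and could be sketched as below; the function name is hypothetical. Checks 4-5 would additionally need `torch.cuda.is_available()` and an import probe, omitted here:

```python
# Sketch of pre-flight checks 1-3 (file-existence gates only); the
# function name and repo_root parameter are illustrative assumptions.
from pathlib import Path

def missing_preflight_files(kernel: str, repo_root: str = ".") -> list[str]:
    """Return the required paths that do not exist for this kernel."""
    root = Path(repo_root)
    required = [
        root / f"src/liger_kernel/ops/{kernel}.py",
        root / f"benchmark/scripts/benchmark_{kernel}.py",
        root / f"test/transformers/test_{kernel}.py",
    ]
    return [str(p) for p in required if not p.is_file()]
```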

## Pipeline

### Stage 1: Profile

Spawn a **Profiler** agent (read [profiler.md](profiler.md)).

The agent:

1. Creates the workspace directory `optimization/{kernel}/`
2. Copies the original kernel as a snapshot
3. Runs baseline benchmarks using the existing benchmark script
4. Detects GPU architecture (or uses user-specified target)
5. Optionally runs NCU profiling (if `ncu` is available)
6. Analyzes the kernel code (tier classification, patterns, optimization opportunities)
7. Classifies the bottleneck: memory-bound vs compute-bound
8. Produces an optimization profile with a recommended strategy order
9. Saves the profile to `optimization/{kernel}/profile.md`

**Human checkpoint (interactive mode):** Present the optimization profile with bottleneck diagnosis and proposed strategy order. Confirm before proceeding.
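Step 4 of Stage 1 (GPU architecture detection) could map the CUDA compute capability to one of the supported families roughly as follows. The mapping reflects NVIDIA's published capability numbers, but the function itself is a sketch, not the profiler's actual code:

```python
# Illustrative sketch: map CUDA compute capability to the architecture
# families the skill targets. The function name is an assumption; for
# simplicity it folds e.g. Ada (sm_89) into the Ampere bucket, since the
# skill only distinguishes Ampere / Hopper / Blackwell.
def arch_from_capability(major: int, minor: int) -> str:
    if major == 8:
        return "Ampere"      # sm_80 / sm_86 (A100, RTX 30xx)
    if major == 9:
        return "Hopper"      # sm_90 (H100)
    if major >= 10:
        return "Blackwell"   # sm_100 and later (B100/B200)
    return "pre-Ampere"

# On a live machine the capability would come from PyTorch, e.g.:
#   major, minor = torch.cuda.get_device_capability()
```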

### Stage 2: Optimize

Spawn an **Optimizer** agent (read [optimizer.md](optimizer.md)).

The agent runs an autonomous optimization loop:

1. Read the optimization profile and original kernel
2. **Always try parameter tuning first** (BLOCK_SIZE, num_warps, num_stages manual sweep -- NOT @triton.autotune)
3. Then apply diagnosis-driven techniques from [optimization-strategies.md](optimization-strategies.md)
4. For each variant:
   a. Generate the variant code → `optimization/{kernel}/{kernel}_vN.py`
   b. Write the variant lab notebook → `optimization/{kernel}/{kernel}_vN_notes.md`
   c. Run a quick smoke test (single shape, float32, forward+backward) → discard on failure
   d. Run the **full existing benchmark script** → `optimization/{kernel}/benchmarks/vN_results.csv`
   e. Check guardrails (no catastrophic regressions)
   f. Update the variant notes with actual results
5. Read all prior variant notes before generating the next variant
6. **Stop when:** budget exhausted, 2 consecutive variants with <1% improvement, or target metric met
7. Produce a comparison table of ALL variants
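The stopping rule in step 6 can be expressed as a small predicate. A minimal sketch, assuming each entry in `improvements` is one variant's fractional speedup over the best-so-far; all names are illustrative:

```python
# Sketch of the Stage 2 stopping rule: budget exhausted, two consecutive
# variants under 1% improvement, or target metric met. The improvement
# representation (fraction per variant) is an assumption for illustration.
def should_stop(improvements: list[float], max_variants: int,
                target_met: bool = False) -> bool:
    if target_met:
        return True
    if len(improvements) >= max_variants:
        return True   # budget exhausted
    if len(improvements) >= 2 and all(i < 0.01 for i in improvements[-2:]):
        return True   # diminishing returns: 2 consecutive variants <1%
    return False
```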
**Human checkpoint (interactive mode):** Present the comparison table across all variants. The user approves the winner (or the skill picks the best variant if autonomous).

### Stage 3: Finalize

Spawn a **Finalizer** agent (read [finalizer.md](finalizer.md)).

The agent:

1. Applies the winning variant in-place to `src/liger_kernel/ops/{kernel}.py`
2. Runs the full test suite: `python -m pytest test/transformers/test_{kernel}.py -xvs` (hard gate)
3. Runs checkstyle: `make checkstyle` (auto-fix with `ruff check . --fix && ruff format .`)
4. Generates 3-way comparison plots (original Liger vs optimized Liger vs Hugging Face baseline) using `benchmarks_visualizer.py`
5. Generates the final optimization report → `optimization/{kernel}/report.md`
6. Creates a PR with only the kernel code changes (no plots or optimization workspace files)
7. Presents the before/after summary with plots

**Human checkpoint (interactive mode):** Present the final report with before/after numbers, comparison plots, and test results.
## Guardrails

These apply to EVERY variant, regardless of mode:

| Guardrail | Threshold | Action |
|-----------|-----------|--------|
| Non-target metric regression | >5% worse | Reject variant |
| Cross-pass regression | >10% on one pass to marginally improve the other | Reject variant |
| Smoke test failure | Any correctness failure | Discard variant immediately |
| Full test suite failure | Any | Do NOT apply winner, report failure, stop |
| Checkstyle failure | Any | Auto-fix with ruff, retry once |
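The first two guardrail rows reduce to simple relative-change checks on benchmark numbers. A simplified sketch, assuming metrics are latencies in milliseconds (lower is better) and collapsing the cross-pass rule to "any pass regressing >10% is rejected"; every name here is an illustrative assumption:

```python
# Sketch of the numeric guardrails; baseline/variant map a metric name
# (e.g. "forward", "backward") to latency in ms. All names and the
# exact cross-pass simplification are assumptions for illustration.
def violates_guardrails(baseline: dict, variant: dict, target: str) -> bool:
    for metric, base_ms in baseline.items():
        change = (variant[metric] - base_ms) / base_ms  # >0 means slower
        if metric != target and change > 0.05:
            return True   # non-target metric regressed >5%
        if change > 0.10:
            return True   # a pass regressed >10%
    return False
```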
## Reference Files

- [profiler.md](profiler.md) -- Profiler Agent specification
- [optimizer.md](optimizer.md) -- Optimizer Agent specification
- [finalizer.md](finalizer.md) -- Finalizer Agent specification
- [optimization-strategies.md](optimization-strategies.md) -- Catalog of optimization techniques
- [templates/optimization-profile.md](templates/optimization-profile.md) -- Profiling output format (cross-stage contract)
- [templates/variant-notes.md](templates/variant-notes.md) -- Per-variant lab notebook format