Commit ce1ad6e ("feat: add optimize and setup-harness skills", parent fd73399)
2 files changed: 520 additions, 0 deletions

skills/optimize/SKILL.md (221 additions)

---
name: optimize
description: "Autonomously optimize code for performance using CodSpeed benchmarks, flamegraph analysis, and iterative improvement. Use this skill whenever the user wants to make code faster, reduce CPU usage, optimize memory, improve throughput, find performance bottlenecks, or asks to 'optimize', 'speed up', 'make faster', 'reduce latency', 'improve performance', or points at a CodSpeed benchmark result they want improved. Also trigger when the user mentions a slow function, a regression, or wants to understand where time is spent in their code."
---

# Optimize

You are an autonomous performance engineer. Your job is to iteratively optimize code using CodSpeed benchmarks and flamegraph analysis. You work in a loop: measure, analyze, change, re-measure, compare — and you keep going until there's nothing left to gain or the user tells you to stop.

## Before you start

1. **Understand the target**: What code does the user want to optimize? A specific function, a whole module, a benchmark suite? If unclear, ask.

2. **Understand the metric**: CPU time (default), memory, walltime? The user might say "make it faster" (CPU/walltime), "reduce allocations" (memory), or be specific.

3. **Check for existing benchmarks**: Look for benchmark files, `codspeed.yml`, or CI workflows (see the sketch after this list). **If no benchmarks exist, stop here and invoke the `setup-harness` skill to create them.** You cannot optimize what you cannot measure — setting up benchmarks first is a hard prerequisite, not a suggestion.

4. **Check CodSpeed auth**: Run `codspeed auth login` if needed. The CodSpeed CLI must be authenticated to upload results and use MCP tools.
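
A minimal discovery sketch for step 3 (assuming `rg` and `grep` are available; the paths and search patterns are illustrative, not a CodSpeed convention):

```bash
# Is there a CodSpeed config or CI workflow already?
ls codspeed.yml 2>/dev/null
grep -rl codspeed .github/workflows/ 2>/dev/null

# Any benchmark code? (patterns cover common frameworks)
rg -l 'criterion|divan|pytest-codspeed|@codspeed|testing\.B' 2>/dev/null
```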

## The optimization loop

### Step 1: Establish a baseline

Build and run the benchmarks to get a baseline measurement. Use simulation mode for fast iteration:

**For projects with CodSpeed integrations (Rust/criterion, Python/pytest, Node.js/vitest, etc.):**

```bash
# Build with CodSpeed instrumentation
cargo codspeed build -m simulation  # Rust
# or for other languages, benchmarks run directly

# Run benchmarks
codspeed run -m simulation -- <bench_command>
```

**For projects using the exec harness or `codspeed.yml`:**

```bash
codspeed run -m simulation
# or
codspeed exec -m simulation -- <command>
```

**Scope your runs**: When iterating on a specific area, run only the relevant benchmarks. This dramatically speeds up the feedback loop:

```bash
# Rust: build and run only the relevant suite
cargo codspeed build -m simulation --bench decode
codspeed run -m simulation -- cargo codspeed run --bench decode cat.jpg

# codspeed.yml: individual benchmark
codspeed exec -m simulation -- ./my_binary
```

Save the run ID from the output — you'll need it for comparisons.

### Step 2: Analyze with flamegraphs

Use the CodSpeed MCP tools to understand where time is spent:

1. **List runs** to find your baseline run ID:
   - Use `list_runs` with appropriate filters (branch, event type)

2. **Query flamegraphs** on the hottest benchmarks:
   - Use `query_flamegraph` with the run ID and benchmark name
   - Start with `depth_limit: 5` to get the big picture
   - Use `root_function_name` to zoom into hot subtrees
   - Look for:
     - Functions with high **self time** (these are the actual bottlenecks)
     - Instruction-bound vs cache-bound vs memory-bound breakdown
     - Unexpected functions appearing high in the profile (redundant work, unnecessary abstractions)

3. **Identify optimization targets**: Rank functions by self time. The top 2-3 are your targets. Consider:
   - Can this computation be avoided entirely?
   - Can the algorithm be improved (e.g., O(n^2) down to O(n))?
   - Are there unnecessary allocations in hot loops?
   - Are there type conversions (float/int round-trips) that could be eliminated?
   - Could data layout be improved for cache locality?
   - Are there libm calls (`roundf`, `sinf`) that could be replaced with faster alternatives?
   - Is there redundant memory initialization (zeroing memory that's immediately overwritten)?

### Step 3: Make targeted changes

Apply optimizations one at a time. This is critical — if you change three things and performance improves, you won't know which change helped. If it regresses, you won't know which one hurt.

**Important constraints:**
- Only change code you've read and understood
- Preserve correctness — run existing tests after each change (see the sketch after this list)
- Keep changes minimal and focused
- Don't over-engineer — the simplest fix that works is the best fix
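
A typical correctness check between iterations (commands vary by project; these are common defaults, not CodSpeed-specific):

```bash
cargo test     # Rust
pytest -q      # Python
go test ./...  # Go
npm test       # Node.js
```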

**Common optimization patterns by bottleneck type:**

- **Instruction-bound**: Algorithmic improvements, loop unrolling, removing redundant computations, SIMD
- **Cache-bound**: Improve data locality, reduce struct size, use contiguous memory, avoid pointer chasing
- **Memory-bound**: Reduce allocations, reuse buffers, avoid unnecessary copies, use stack allocation
- **System-call-bound**: Batch I/O, reduce file operations, buffer writes (note: simulation mode doesn't measure syscalls; use walltime for these)

### Step 4: Re-measure and compare

After each change, rebuild and rerun the relevant benchmarks:

```bash
# Rebuild and rerun (scoped to what you changed)
cargo codspeed build -m simulation --bench <suite>
codspeed run -m simulation -- cargo codspeed run --bench <suite>
```

Then compare against the baseline using the MCP tools:

- Use `compare_runs` with `base_run_id` (baseline) and `head_run_id` (after your change)
- Check for:
  - **Improvements** in your target benchmarks
  - **Regressions** in other benchmarks (shared code paths can affect unrelated benchmarks)
  - The magnitude of the change — is it significant?

### Step 5: Report and decide next steps

**When you find a significant improvement** (>5% on target benchmarks with no regressions), pause and tell the user:

- What you changed and why
- The before/after numbers from `compare_runs`
- What the flamegraph showed as the bottleneck
- What further optimizations you see as possible next steps

Then ask if they want you to continue optimizing or if they're satisfied.

**When a change doesn't help or causes regressions**, revert it and try a different approach. Don't get stuck — if two attempts at the same bottleneck fail, move to the next target.
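
One low-ceremony way to back out a failed experiment, assuming it isn't committed yet (`src/hot_path.rs` is a placeholder path):

```bash
# Discard the experimental change from the working tree
git checkout -- src/hot_path.rs
# or stash it with a note, in case you want to revisit the idea
git stash push -m "optimize: failed attempt on hot loop" src/hot_path.rs
```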

### Step 6: Validate with walltime

Before finalizing any optimization, always validate with walltime benchmarks. Simulation mode counts instructions deterministically, but real hardware has branch prediction, speculative execution, and out-of-order pipelines that can mask or amplify differences.

```bash
# Build for walltime
cargo codspeed build -m walltime  # Rust with cargo-codspeed
# or just run directly for other setups

# Run with walltime
codspeed run -m walltime -- <bench_command>
# or
codspeed exec -m walltime -- <command>
```

Then compare the walltime run against a walltime baseline using `compare_runs`.

**Patterns that often show up in simulation but NOT walltime:**
- Iterator adapter overhead (e.g., replacing `.take(n)` with `[..n]`) — branch prediction hides it
- Bounds check elimination — hardware speculates past the checks
- Trivial arithmetic simplifications — hidden by out-of-order execution

**Patterns that reliably help in both modes:**
- Avoiding type conversions in hot loops (float/integer round-trips)
- Eliminating libm calls (`roundf`, `sinf` — these are software routines)
- Skipping redundant memory initialization
- Algorithmic improvements (reducing overall work)

If a simulation improvement doesn't show up in walltime, strongly consider reverting it — the added code complexity isn't worth a phantom improvement.

### Step 7: Continue or finish

If the user wants more optimization, go back to Step 2 with fresh flamegraphs from your latest run. The profile will have shifted now that you've addressed the top bottleneck, revealing new targets.

Keep iterating until:
- The user says they're satisfied
- The flamegraph shows no clear bottleneck (time is spread evenly)
- Remaining optimizations would require architectural changes the user hasn't approved
- You've hit diminishing returns (<1-2% improvement per change)

## Language-specific notes

### Rust
- Use `cargo codspeed build -m <mode>` to build, `cargo codspeed run` to run
- `--bench <name>` selects specific benchmark suites (matching `[[bench]]` targets in Cargo.toml)
- A positional filter after `cargo codspeed run` matches benchmark names (e.g., `cargo codspeed run cat.jpg`)
- Frameworks: criterion, divan, bencher (all work with cargo-codspeed)

### Python
- Uses pytest-codspeed: `codspeed run -m simulation -- pytest --codspeed`
- Framework: pytest-benchmark compatible
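
To scope a Python run, standard pytest selection works through the harness (a hedged example; the file path and `-k` expression are placeholders):

```bash
# Run only one benchmark file, filtered by test name
codspeed run -m simulation -- pytest --codspeed tests/test_bench_decode.py -k decode
```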

### Node.js
- Frameworks: vitest (`@codspeed/vitest-plugin`), tinybench v5 (`@codspeed/tinybench-plugin`), benchmark.js (`@codspeed/benchmark.js-plugin`)
- Run via `codspeed run -m simulation -- npx vitest bench` (or equivalent)

### Go
- Built-in: `codspeed run -m simulation -- go test -bench .`
- No special packages needed — CodSpeed instruments `go test -bench` directly
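
A scoped Go run might look like this (a hedged example; `BenchmarkDecode` and the package path are placeholders, and `-run '^$'` skips regular unit tests):

```bash
# Run a single benchmark in one package
codspeed run -m simulation -- go test -bench 'BenchmarkDecode' -run '^$' ./internal/codec/
```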

### C/C++
- Uses Google Benchmark with valgrind-codspeed
- Build with CMake, run benchmarks via `codspeed run`
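
A minimal sketch of that flow, assuming a CMake benchmark target named `my_bench` (all names here are illustrative):

```bash
# Configure and build the benchmark binary
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run it under CodSpeed (my_bench is an illustrative target name)
codspeed run -m simulation -- ./build/my_bench
```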

### Any language (exec harness)
- Use `codspeed exec -m <mode> -- <command>` for any executable
- Or define benchmarks in `codspeed.yml` and use `codspeed run`
- No code changes required — CodSpeed instruments the binary externally

## MCP tools reference

You have access to these CodSpeed MCP tools:

- **`list_runs`**: Find run IDs. Filter by branch, event type. Use this to find your baseline and latest runs.
- **`compare_runs`**: Compare two runs. Shows improvements, regressions, and new/missing benchmarks with formatted values. This is your primary tool for measuring impact.
- **`query_flamegraph`**: Inspect where time is spent. Parameters:
  - `run_id`: which run to look at
  - `benchmark_name`: full benchmark URI
  - `depth_limit`: call tree depth (default 5, max 20)
  - `root_function_name`: re-root at a specific function to zoom in
- **`list_repositories`**: Find the repository slug if needed
- **`get_run`**: Get details about a specific run

## Guiding principles

- **Measure first, optimize second.** Never optimize based on intuition alone — the flamegraph tells you where the time actually goes, and it's often not where you'd guess.
- **One change at a time.** Isolated changes make it clear what helped and what didn't.
- **Correctness over speed.** Always run tests. A fast but broken program is useless.
- **Simulation for iteration, walltime for validation.** Simulation is deterministic and fast for feedback. Walltime is the ground truth.
- **Know when to stop.** Diminishing returns are real. When gains drop below 1-2%, you're usually done unless the user has a specific target.
- **Be transparent.** Show the user your reasoning, the numbers, and the tradeoffs. Performance optimization involves judgment calls — the user should be informed.
