Commit e29c2ba

Merge pull request #377 from lanmogu98/fix/mlx_metal_kernel_opt_eval_validity

Fixes #372: Reliability issues in `examples/mlx_metal_kernel_opt`

2 parents 67aca71 + 69d2dab commit e29c2ba

14 files changed: +1109 / -1258 lines
.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -41,6 +41,8 @@ ENV/
 examples/*/output/
 openevolve_output*/
 *.log
+demo_output*/
+pr_sanity*/

 # Test cache
 .pytest_cache/
```

New file, lines changed: 252 additions & 0 deletions
# Evolution Analysis: Why Optimization Failed

This document analyzes the evolution experiment results after applying the validity fixes, and proposes improvements for future work.

## Experiment Results

After applying the validity fixes, we ran 25 evolution iterations to verify that the evaluation now works correctly.

**Note**: The `maximum_context_stress_test` benchmark was disabled to reduce memory requirements on the test hardware.

### Evolution Summary

| Metric | Value |
| ------ | ----- |
| Total Iterations | 25 |
| Programs Evaluated | 25 |
| Compilation Failures (bf16) | 8 (32%) |
| Best Program Found | Iteration 23 |
| Best combined_score | 2.96 |
| Benchmarks Used | 4 (stress test disabled) |

### Performance of Best Evolved Kernel

| Benchmark | Baseline (tok/s) | Custom (tok/s) | Change |
| --------- | ---------------- | -------------- | ------ |
| short_context_quick | 59.1 | 63.1 | **+6.9%** |
| code_generation | 58.3 | 58.1 | -0.4% |
| long_context_detailed | 54.7 | 46.0 | **-15.9%** |
| long_generation | 48.0 | 46.4 | -3.4% |
| **Average** | **55.0** | **53.4** | **-3.2%** |

### Key Finding

> **The best evolved kernel is still 3.2% SLOWER than MLX's baseline implementation.**

Evolution improved the initial -11.5% regression only to a -3.2% regression; it never exceeded baseline performance.

### Evolution Trajectory

```text
Iteration 0 (Initial): -11.5% regression
Iterations 1-4: Failed (bf16 compilation errors)
Iteration 5: -23.6% regression
...
Iteration 19: -3.6% regression (first "positive" score)
Iteration 23: -3.2% regression (best found)
Iteration 25: Evolution complete, no improvement
```

---

## Why Evolution Failed

The failure reveals fundamental limitations in the current evolution mechanism. Framed through a **Reinforcement Learning lens**:

| RL Concept | Current State | Problem |
| ---------- | ------------- | ------- |
| **Reward Signal** | Detailed metrics but abstract ranking score | LLM sees metrics but selection uses opaque `combined_score` |
| **State Representation** | Code text + char-level features | Doesn't capture performance-relevant program properties |
| **Observability** | No GPU profiling data | Partially Observable MDP; agent blind to actual bottlenecks |
| **Credit Assignment** | Per-program metrics, no diff-level attribution | Cannot identify which code mutation caused improvement |
| **Exploration** | 1 parent + 5 samples per iteration | Severely underutilizes available information (128K context) |

### 1. Meaningless Feature Dimensions

The current MAP-Elites dimensions are inadequate for kernel optimization:

| Dimension | Current Implementation | Problem for Kernels |
| --------- | ---------------------- | ------------------- |
| `complexity` | Code character count | Two kernels with different algorithms can have similar length |
| `diversity` | Character-level diff | Renaming variables looks "diverse"; algorithmic changes don't |

**What would be meaningful**: tiling strategy, vectorization width, memory access pattern, thread block size.

### 2. Fitness Feedback Interpretability

The LLM receives detailed metrics (decode speed, prefill speed, per-benchmark results), but:

- **Relative performance unclear**: A raw `53.4 tok/s` means little without knowing the baseline is `55.0 tok/s`
- **No performance diagnosis**: Cannot tell whether a kernel is memory-bound or compute-bound
- **Selection uses an abstract score**: MAP-Elites ranking uses `combined_score`, not the individual metrics
- **Missing actionable guidance**: "Score: 2.96" doesn't tell the LLM what to fix

### 3. Lack of Profiling Data

Without GPU profiling feedback, the LLM is essentially optimizing blind. Metal performance depends heavily on:

- Memory coalescing patterns
- Register pressure
- Warp divergence
- Cache utilization

None of this information is available to guide evolution.

### 4. Conservative Parent Selection

The default configuration uses 70% exploitation (selecting from elites). For kernel optimization, where the search space has many local optima, this may cause premature convergence to suboptimal solutions.

### 5. Underutilized LLM Context Window

Each iteration feeds the LLM only:

- 1 parent program
- 3 top programs (inspirations)
- 2 diverse programs

This is extremely conservative given modern LLM context capabilities (128K+ tokens).

**The real cost**: Each evolution iteration is expensive (~10 minutes for model loading plus benchmarking), yet the LLM receives minimal information to guide its optimization. This is a **massive waste of resources**.

**Better approach**: Feed the LLM as much context as possible: all programs from the current population, complete benchmark results, and the historical evolution trajectory. Only apply context pruning when approaching the actual model limits.

### 6. High Failure Rate

32% of generated kernels failed to compile with bfloat16. The LLM generates syntactically valid Metal code but often uses float-only operations that are incompatible with bf16.

### 7. Benchmarking Feedback Quality

While the evaluator returns detailed metrics, ranking and selection use a single `combined_score`:

```python
# Detailed metrics ARE available to the LLM:
performance_metrics = {
    "avg_decode_speed": 53.4,
    "baseline_comparison": {"avg_decode_improvement_pct": -3.2},
}

# But MAP-Elites selection uses:
combined_score = 2.96  # What does this mean? Is 3.0 good? Is 10.0 possible?
```
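One mitigation is to translate the raw metrics into an explicitly relative message before they reach the prompt. A minimal sketch, assuming the evaluator exposes `avg_decode_speed` and a known baseline (the field names here are illustrative, not the project's actual schema):

```python
def format_feedback(metrics: dict, baseline_decode_speed: float) -> str:
    """Turn raw benchmark metrics into a relative, actionable summary.

    `metrics` is assumed to carry the evaluator's average decode speed;
    the key name is hypothetical.
    """
    custom = metrics["avg_decode_speed"]
    speedup = custom / baseline_decode_speed
    gap_pct = (speedup - 1.0) * 100
    verdict = "faster" if gap_pct >= 0 else "slower"
    return (
        f"Custom kernel: {custom:.1f} tok/s vs baseline "
        f"{baseline_decode_speed:.1f} tok/s "
        f"({abs(gap_pct):.1f}% {verdict}); target is speedup > 1.0x."
    )

print(format_feedback({"avg_decode_speed": 53.4}, baseline_decode_speed=55.0))
```

A message like this gives the LLM both the gap and the target, instead of an uninterpretable scalar.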
---

## KernelBench Comparison

[KernelBench](https://github.com/ScalingIntelligence/KernelBench) provides a complete, evolution-ready metric system that could address many of these issues:

### KernelBench Evaluation Structure

**1. Binary Correctness Gates**:

```python
class KernelExecResult:
    compiled: bool     # Did the kernel compile?
    correctness: bool  # Did it pass numerical correctness? (multiple trials)
    metadata: dict     # max_difference, avg_difference, error details
```

**2. Primary Optimization Objective** (direct speedup ratio):

```python
speedup = baseline_time / custom_time  # 1.2 = 20% faster, directly interpretable
```

**3. Statistical Rigor**:

```python
runtime_stats = {
    "mean": 3.68,       # Average runtime (ms)
    "std": 0.011,       # Standard deviation
    "min": 3.65,        # Best case
    "max": 3.74,        # Worst case
    "num_trials": 100,  # With warmup runs
}
```
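Gathering statistics of this shape is straightforward; a generic sketch with CPU wall-clock timing (a real harness would time GPU kernels with proper synchronization):

```python
import statistics
import time

def time_kernel(fn, num_trials: int = 100, warmup: int = 10) -> dict:
    """Time `fn` over repeated trials after warmup runs, returning summary
    statistics in milliseconds (same shape as runtime_stats above)."""
    for _ in range(warmup):  # warm caches / compilation before measuring
        fn()
    samples = []
    for _ in range(num_trials):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.mean(samples),
        "std": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
        "num_trials": num_trials,
    }
```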
**4. Multi-threshold Performance Metrics**:

```python
# fast_p: fraction of kernels that are BOTH correct AND achieve speedup > p
fast_p = {
    0.0: 0.85,  # 85% correct
    1.0: 0.42,  # 42% faster than baseline
    1.5: 0.18,  # 18% achieve 1.5x speedup
    2.0: 0.05,  # 5% achieve 2x speedup
}
```

**5. Population-level Metrics**:

```python
geometric_mean_speedup = 1.23  # Average 23% improvement across the population
pass_at_1 = 0.42
pass_at_5 = 0.78
```
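Both metric families can be derived from per-kernel `(correct, speedup)` records. A sketch (not KernelBench's actual implementation):

```python
import math

def fast_p(results: list[tuple[bool, float]], p: float) -> float:
    """Fraction of kernels that are BOTH correct AND achieve speedup > p."""
    return sum(1 for ok, s in results if ok and s > p) / len(results)

def geometric_mean_speedup(results: list[tuple[bool, float]]) -> float:
    """Geometric mean of speedups over the correct kernels."""
    speedups = [s for ok, s in results if ok]
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Example population: (passed correctness, speedup vs baseline)
results = [(True, 1.2), (True, 0.9), (False, 1.5), (True, 2.1)]
print(fast_p(results, 1.0))             # -> 0.5 (half are correct and faster)
print(geometric_mean_speedup(results))  # geomean over the 3 correct kernels
```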
### How KernelBench Metrics Could Integrate with Evolution

| OpenEvolve Component | Current | KernelBench-style Improvement |
| -------------------- | ------- | ----------------------------- |
| **Fitness Score** | Abstract `combined_score` | Direct `speedup` ratio |
| **Correctness Gate** | Binary pass/fail | Binary + `max_difference`, `avg_difference` for a gradient |
| **Performance Feedback** | Single number | `mean ± std` with confidence intervals |
| **MAP-Elites Features** | Code length, char diff | Speedup tier (0.5x, 1x, 1.5x, 2x), runtime variance |
| **Early Stopping** | Fixed threshold | `fast_p` targets: stop when `fast_1.5 > 0.1` |
| **Prompt Feedback** | "Score: 2.96" | "Speedup: 0.85x (15% slower), need to beat 1.0x" |

The key insight: **KernelBench's metrics are designed to be directly actionable**. The LLM can understand "this kernel is 15% slower than baseline" but cannot learn from "combined_score = 2.96".

Additionally, KernelBench enables **temporal credit assignment**:

- Compare child speedup vs parent speedup (not just vs baseline)
- Track which mutations led to improvement
- Provide mutation-specific feedback: "Adding SIMD vectorization improved prefill by 23%"

---

## Proposed Improvements

### Priority 1: Adopt KernelBench-style Evaluation

- Replace `combined_score` with the direct speedup ratio: `baseline_time / custom_time`
- Return statistical timing data: mean, std, min, max, num_trials
- Use `fast_p` milestones as early-stopping targets
- Report correctness metrics: `max_difference`, `avg_difference`, tolerance margin
- Provide actionable prompt feedback: "Speedup: 0.85x, need to beat 1.0x"
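An evaluator return value along these lines could look as follows (a sketch with illustrative field names; the actual OpenEvolve evaluator interface may differ):

```python
def evaluate_kernel(baseline_ms: float, custom_ms: float,
                    max_diff: float, tolerance: float = 1e-2) -> dict:
    """Score a candidate kernel by direct speedup, gated on correctness.

    Inputs are assumed to come from the benchmarking harness."""
    correct = max_diff <= tolerance
    speedup = baseline_ms / custom_ms if correct else 0.0
    return {
        "correct": correct,
        "speedup": speedup,  # primary fitness: >1.0 beats baseline
        "max_difference": max_diff,
        "feedback": (
            f"Speedup: {speedup:.2f}x, need to beat 1.00x"
            if correct
            else f"Incorrect: max diff {max_diff:.3g} exceeds tol {tolerance:.3g}"
        ),
    }
```

Incorrect kernels score zero rather than being silently ranked, so the correctness gate stays binary while the fitness signal stays interpretable.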
### Priority 2: Performance-based MAP-Elites Features

- `speedup_tier`: (0-0.5x, 0.5-1x, 1-1.5x, 1.5-2x, >2x) instead of code length
- `runtime_variance`: (low/medium/high std) for consistency tracking
- `correctness_margin`: distance from the tolerance threshold
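Binning measurements into these feature dimensions is mechanical; a sketch with illustrative, untuned bin edges:

```python
def map_elites_features(speedup: float, runtime_std_ms: float,
                        max_diff: float, tolerance: float) -> dict:
    """Map performance measurements to discrete MAP-Elites cells.

    Bin edges are hypothetical examples, not tuned values."""
    tiers = [0.5, 1.0, 1.5, 2.0]
    speedup_tier = sum(speedup >= t for t in tiers)  # 0..4
    if runtime_std_ms < 0.05:
        variance = "low"
    elif runtime_std_ms < 0.5:
        variance = "medium"
    else:
        variance = "high"
    return {
        "speedup_tier": speedup_tier,
        "runtime_variance": variance,
        # 1.0 = exact match, 0.0 = at (or past) the tolerance threshold
        "correctness_margin": max(0.0, 1.0 - max_diff / tolerance),
    }
```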
### Priority 3: Integrate Metal GPU Profiling

- Feed occupancy, bandwidth, and cache stats back to the LLM
- Use profiling data as additional feature dimensions

### Priority 4: Domain-specific Strategy Tracking

- `uses_simd_vectorization: 0-3` (none/2/4/8-wide)
- `memory_access_pattern: coalesced/strided/random`
- `algorithm_type: 2pass/3pass/online`

### Priority 5: Maximize LLM Context Utilization

- Feed the entire population (or the top N by speedup) instead of just 1 parent + 5 samples
- Include complete benchmark results with statistical breakdowns
- Show the evolution history: what worked, what failed, and why
- Only prune context when approaching actual model limits (128K+ tokens)

### Priority 6: Curated Metal bf16 Examples

- Add few-shot examples of correct bf16 Metal syntax
- Include common pitfalls in the system prompt

---

## References

- [KernelBench](https://github.com/ScalingIntelligence/KernelBench)
- MAP-Elites: Mouret & Clune, 2015

---

*Experiment run: 2026-01-05 18:09 - 21:20 (3h 11m)*
*Note: `maximum_context_stress_test` disabled for this validation run*
