# Evolution Analysis: Why Optimization Failed

This document analyzes the evolution experiment results after applying the validity fixes, and proposes improvements for future work.

## Experiment Results

After applying the validity fixes, we ran 25 evolution iterations to verify that the evaluation now works correctly.

**Note**: The `maximum_context_stress_test` benchmark was disabled to reduce memory requirements on the test hardware.

### Evolution Summary

| Metric | Value |
| ------ | ----- |
| Total Iterations | 25 |
| Programs Evaluated | 25 |
| Compilation Failures (bf16) | 8 (32%) |
| Best Program Found | Iteration 23 |
| Best combined_score | 2.96 |
| Benchmarks Used | 4 (stress test disabled) |

### Performance of Best Evolved Kernel

| Benchmark | Baseline (tok/s) | Custom (tok/s) | Change |
| --------- | ---------------- | -------------- | ------ |
| short_context_quick | 59.1 | 63.1 | **+6.9%** ✓ |
| code_generation | 58.3 | 58.1 | -0.4% |
| long_context_detailed | 54.7 | 46.0 | **-15.9%** |
| long_generation | 48.0 | 46.4 | -3.4% |
| **Average** | **55.0** | **53.4** | **-3.2%** |

### Key Finding

> **The best evolved kernel is still 3.2% SLOWER than MLX's baseline implementation.**

The evolution only improved from an initial -11.5% regression to a -3.2% regression; it never exceeded baseline performance.

### Evolution Trajectory

```text
Iteration 0 (Initial): -11.5% regression
Iterations 1-4: Failed (bf16 compilation errors)
Iteration 5: -23.6% regression
...
Iteration 19: -3.6% regression (first "positive" score)
Iteration 23: -3.2% regression (best found)
Iteration 25: Evolution complete, no improvement
```

---

## Why Evolution Failed

The failure reveals fundamental limitations in the current evolution mechanism. Framing it through a **reinforcement learning lens**:

| RL Concept | Current State | Problem |
| ---------- | ------------- | ------- |
| **Reward Signal** | Detailed metrics but abstract ranking score | LLM sees metrics, but selection uses an opaque `combined_score` |
| **State Representation** | Code text + character-level features | Doesn't capture performance-relevant program properties |
| **Observability** | No GPU profiling data | Partially observable MDP; the agent is blind to actual bottlenecks |
| **Credit Assignment** | Per-program metrics, no diff-level attribution | Cannot identify which code mutation caused an improvement |
| **Exploration** | 1 parent + 5 samples per iteration | Severely underutilizes available information (128K context) |

### 1. Meaningless Feature Dimensions

The current MAP-Elites dimensions are inadequate for kernel optimization:

| Dimension | Current Implementation | Problem for Kernels |
| --------- | ---------------------- | ------------------- |
| `complexity` | Code character count | Two kernels with different algorithms can have similar length |
| `diversity` | Character-level diff | Renaming variables looks "diverse"; algorithmic changes don't |

**What would be meaningful**: tiling strategy, vectorization width, memory access pattern, thread block size.

### 2. Fitness Feedback Interpretability

The LLM receives detailed metrics (decode speed, prefill speed, per-benchmark results), but:

- **Relative performance is unclear**: a raw `53.4 tok/s` means little without knowing the baseline is `55.0 tok/s`
- **No performance diagnosis**: there is no way to tell whether a kernel is memory-bound or compute-bound
- **Selection uses an abstract score**: MAP-Elites ranking uses `combined_score`, not the individual metrics
- **Missing actionable guidance**: "Score: 2.96" doesn't tell the LLM what to fix

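A cheap first step is to normalize feedback against the baseline before it reaches the prompt. A minimal sketch, using the numbers from the tables above (the function name and message format are illustrative, not OpenEvolve's actual API):

```python
# Sketch: convert raw throughput into baseline-relative, actionable feedback.
# Illustrative helper; not part of the current evaluator.
def relative_feedback(custom_tok_s: float, baseline_tok_s: float) -> str:
    pct = (custom_tok_s / baseline_tok_s - 1) * 100
    direction = "faster" if pct >= 0 else "slower"
    return (f"{custom_tok_s:.1f} tok/s vs baseline {baseline_tok_s:.1f} tok/s "
            f"({abs(pct):.1f}% {direction})")

relative_feedback(53.4, 55.0)  # → '53.4 tok/s vs baseline 55.0 tok/s (2.9% slower)'
```
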
### 3. Lack of Profiling Data

Without GPU profiling feedback, the LLM is essentially optimizing blind. Metal performance depends heavily on:

- Memory coalescing patterns
- Register pressure
- Warp divergence
- Cache utilization

None of this information is available to guide evolution.

### 4. Conservative Parent Selection

The default configuration uses 70% exploitation (selecting from elites). For kernel optimization, where the search space has many local optima, this may cause premature convergence to suboptimal solutions.

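For reference, the 70/30 split amounts to a selection rule like the following sketch (names are illustrative; the actual OpenEvolve logic may differ):

```python
import random

# Sketch: exploitation-vs-exploration parent selection with the 70% ratio
# described above. `elites` and `population` are hypothetical program lists.
def select_parent(elites, population, exploitation=0.7, rng=random):
    if elites and rng.random() < exploitation:
        return rng.choice(elites)    # exploit: mutate a known-good program
    return rng.choice(population)    # explore: mutate anything in the population
```

Lowering `exploitation` (or annealing it over iterations) is the usual lever against premature convergence.
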
### 5. Underutilized LLM Context Window

Each iteration feeds the LLM only:

- 1 parent program
- 3 top programs (inspirations)
- 2 diverse programs

This is extremely conservative given modern LLM context capabilities (128K+ tokens).

**The real cost**: each evolution iteration is expensive (~10 minutes for model loading plus benchmarking), yet the LLM receives minimal information to guide its optimization. This is a massive waste of resources.

**Better approach**: feed the LLM as much context as possible, including all programs from the current population, complete benchmark results, and the historical evolution trajectory. Only apply context pruning when approaching actual model limits.

### 6. High Failure Rate

32% of generated kernels failed to compile with bfloat16. The LLM generates syntactically valid Metal code but often uses float-only operations that are incompatible with bf16.

### 7. Benchmarking Feedback Quality

While the evaluator returns detailed metrics, **ranking and selection** use a single `combined_score`:

```python
# Detailed metrics ARE available to the LLM:
performance_metrics = {
    "avg_decode_speed": 53.4,
    "baseline_comparison": {"avg_decode_improvement_pct": -3.2},
}

# But MAP-Elites selection uses:
combined_score = 2.96  # What does this mean? Is 3.0 good? Is 10.0 possible?
```

---

## KernelBench Comparison

[KernelBench](https://github.com/ScalingIntelligence/KernelBench) provides a complete, evolution-ready metric system that could address many of these issues:

### KernelBench Evaluation Structure

**1. Binary Correctness Gates**:

```python
class KernelExecResult:
    compiled: bool     # Did the kernel compile?
    correctness: bool  # Did it pass numerical correctness? (multiple trials)
    metadata: dict     # max_difference, avg_difference, error details
```

**2. Primary Optimization Objective** (direct speedup ratio):

```python
speedup = baseline_time / custom_time  # 1.2 = 20% faster, directly interpretable
```

**3. Statistical Rigor**:

```python
runtime_stats = {
    "mean": 3.68,       # Average runtime (ms)
    "std": 0.011,       # Standard deviation
    "min": 3.65,        # Best case
    "max": 3.74,        # Worst case
    "num_trials": 100,  # With warmup runs
}
```

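This level of statistical reporting is straightforward to produce. A sketch of a timing harness that returns the same shape of data (the warmup and trial counts are illustrative):

```python
import statistics
import time

# Sketch: time a kernel invocation with warmup runs, returning stats in the
# same shape as the runtime_stats dict above. `run` is any zero-arg callable.
def time_kernel(run, num_trials=100, warmup=10):
    for _ in range(warmup):
        run()  # warmup: amortize compilation and cache effects
    samples_ms = []
    for _ in range(num_trials):
        t0 = time.perf_counter()
        run()
        samples_ms.append((time.perf_counter() - t0) * 1000)
    return {
        "mean": statistics.fmean(samples_ms),
        "std": statistics.stdev(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "num_trials": num_trials,
    }
```
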
**4. Multi-threshold Performance Metrics**:

```python
# fast_p: fraction of kernels that are BOTH correct AND achieve speedup > p
fast_p = {
    0.0: 0.85,  # 85% correct
    1.0: 0.42,  # 42% faster than baseline
    1.5: 0.18,  # 18% achieve 1.5x speedup
    2.0: 0.05,  # 5% achieve 2x speedup
}
```

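Computing `fast_p` from per-kernel results is a one-liner. A sketch, where `results` is a hypothetical list of `(correct, speedup)` pairs:

```python
# Sketch: fraction of kernels that are both correct and faster than p× baseline.
def fast_p(results, p):
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

results = [(True, 1.6), (True, 0.9), (False, 2.0), (True, 1.2)]
fast_p(results, 1.0)  # → 0.5: half the kernels are correct and beat baseline
```
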
**5. Population-level Metrics**:

```python
geometric_mean_speedup = 1.23  # Average 23% improvement across the population
pass_at_1 = 0.42
pass_at_5 = 0.78
```

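The geometric mean is the natural aggregate for speedup ratios, since arithmetic means over-reward outliers. A sketch:

```python
import math

# Sketch: geometric mean of a population's speedups (all values must be > 0).
def geometric_mean(speedups):
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

Note that a 2x speedup and a 0.5x slowdown correctly cancel out to 1.0, which an arithmetic mean would report as 1.25.
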
### How KernelBench Metrics Could Integrate with Evolution

| OpenEvolve Component | Current | KernelBench-style Improvement |
| -------------------- | ------- | ----------------------------- |
| **Fitness Score** | Abstract `combined_score` | Direct `speedup` ratio |
| **Correctness Gate** | Binary pass/fail | Binary + `max_difference`, `avg_difference` for gradient |
| **Performance Feedback** | Single number | `mean ± std` with confidence intervals |
| **MAP-Elites Features** | Code length, char diff | Speedup tier (0.5x, 1x, 1.5x, 2x), runtime variance |
| **Early Stopping** | Fixed threshold | `fast_p` targets: stop when `fast_1.5 > 0.1` |
| **Prompt Feedback** | "Score: 2.96" | "Speedup: 0.85x (15% slower), need to beat 1.0x" |

The key insight: **KernelBench's metrics are designed to be directly actionable**. The LLM can understand "this kernel is 15% slower than baseline" but cannot learn from "combined_score = 2.96".

Additionally, KernelBench enables **temporal credit assignment**:

- Compare child speedup against parent speedup (not just against baseline)
- Track which mutations led to improvement
- Provide mutation-specific feedback, e.g. "Adding SIMD vectorization improved prefill by 23%"

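Parent-relative comparison is cheap to add once speedup is the fitness. A sketch (the message wording is illustrative):

```python
# Sketch: credit a mutation by comparing the child's speedup to its parent's,
# not just to the baseline. Illustrative helper, not an existing API.
def mutation_feedback(parent_speedup, child_speedup):
    delta = child_speedup - parent_speedup
    verdict = "improved" if delta > 0 else "regressed"
    return (f"Mutation {verdict}: parent {parent_speedup:.2f}x -> "
            f"child {child_speedup:.2f}x ({delta:+.2f})")
```
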
---

## Proposed Improvements

### Priority 1: Adopt KernelBench-style Evaluation

- Replace `combined_score` with the direct speedup ratio: `baseline_time / custom_time`
- Return statistical timing data: mean, std, min, max, num_trials
- Use `fast_p` milestones as early-stopping targets
- Report correctness metrics: `max_difference`, `avg_difference`, tolerance margin
- Provide actionable prompt feedback: "Speedup: 0.85x, need to beat 1.0x"

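Concretely, the evaluator's return value could look like this sketch (field names and the tolerance are illustrative, not OpenEvolve's actual schema):

```python
# Sketch: a KernelBench-style evaluation result replacing combined_score.
def evaluate(baseline_time_ms, custom_time_ms, max_difference, tolerance=1e-2):
    speedup = baseline_time_ms / custom_time_ms
    return {
        "correct": max_difference <= tolerance,  # binary correctness gate
        "speedup": speedup,                      # 1.25 = 25% faster, directly interpretable
        "feedback": f"Speedup: {speedup:.2f}x, need to beat 1.00x",
    }
```
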
### Priority 2: Performance-based MAP-Elites Features

- `speedup_tier`: bins of (0-0.5x, 0.5-1x, 1-1.5x, 1.5-2x, >2x) instead of code length
- `runtime_variance`: (low/medium/high std) for consistency tracking
- `correctness_margin`: distance from the tolerance threshold

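The tier binning could be as simple as the following sketch (bin edges copied from the list above):

```python
# Sketch: bin a speedup into MAP-Elites tiers (0-0.5x, 0.5-1x, 1-1.5x, 1.5-2x, >2x).
def speedup_tier(speedup):
    edges = [0.5, 1.0, 1.5, 2.0]
    for tier, edge in enumerate(edges):
        if speedup < edge:
            return tier
    return len(edges)  # tier 4: > 2x

speedup_tier(0.97)  # → 1: the best kernel from this run lands in the 0.5-1x bin
```
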
### Priority 3: Integrate Metal GPU Profiling

- Feed occupancy, bandwidth, and cache statistics back to the LLM
- Use profiling data as additional feature dimensions

### Priority 4: Domain-specific Strategy Tracking

- `uses_simd_vectorization: 0-3` (none/2/4/8-wide)
- `memory_access_pattern: coalesced/strided/random`
- `algorithm_type: 2pass/3pass/online`

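Some of these features can be extracted statically from the kernel source. A rough sketch for vector width (the regex is a crude heuristic over Metal's `floatN`/`halfN` vector type names, not a real parser):

```python
import re

# Sketch: heuristic detection of SIMD vector width from Metal source text.
# Matches type names like float4 or half2; returns 1 for scalar-only code.
def simd_width_feature(kernel_src):
    widths = [int(m) for m in re.findall(r"\b(?:float|half|bfloat)(\d)\b", kernel_src)]
    return max(widths, default=1)
```
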
### Priority 5: Maximize LLM Context Utilization

- Feed the entire population (or the top N by speedup) instead of just 1 parent + 5 samples
- Include complete benchmark results with statistical breakdowns
- Show evolution history: what worked, what failed, and why
- Only prune context when approaching actual model limits (128K+ tokens)

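A simple budget-aware packer illustrates the idea (the 4-chars-per-token estimate is a crude stand-in for a real tokenizer, and the field names are illustrative):

```python
# Sketch: pack as many programs as fit in the context budget, best-first.
# `programs` is a hypothetical list of dicts with 'code' and 'speedup' keys.
def build_context(programs, max_tokens=128_000):
    selected, used = [], 0
    for prog in sorted(programs, key=lambda p: p["speedup"], reverse=True):
        cost = len(prog["code"]) // 4  # crude token estimate
        if used + cost > max_tokens:
            break  # only prune once the budget is actually exhausted
        selected.append(prog)
        used += cost
    return selected
```
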
### Priority 6: Curated Metal bf16 Examples

- Add few-shot examples of correct bf16 Metal syntax
- Include common pitfalls in the system prompt

---

## References

- [KernelBench](https://github.com/ScalingIntelligence/KernelBench)
- MAP-Elites: Mouret & Clune, 2015

---

*Experiment run: 2026-01-05 18:09 - 21:20 (3h 11m)*
*Note: `maximum_context_stress_test` disabled for this validation run*