This example uses OpenEvolve to automatically discover optimized Metal GPU kernels for Grouped Query Attention (GQA) in Qwen3-0.6B on Apple Silicon.
- Model: Qwen3-0.6B-bf16
- Architecture: 16 query heads : 8 KV heads (2:1 ratio), 2048 hidden size, 128 head dimension
- Hardware: Apple M-series GPUs with unified memory
- Baseline: `mx.fast.scaled_dot_product_attention` via `mlx_lm.generate`
- Goal: Evolve custom Metal kernel source code to outperform the baseline
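To make the 16:8 head layout concrete, here is a minimal pure-Python sketch of how query heads map onto KV heads under GQA. It assumes the common convention that consecutive query heads share one KV head; the constant and function names are illustrative and are not taken from the example's source files.

```python
# Sketch of the GQA head layout listed above (pure Python, no MLX required).
# Assumption: consecutive query heads are grouped onto the same KV head.
NUM_Q_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_head_for(q_head: int) -> int:
    """Return the index of the KV head that a given query head attends with."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 2 query heads per KV head
    return q_head // group_size


mapping = [kv_head_for(q) for q in range(NUM_Q_HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]

# Total width of the query heads: 16 heads x 128 dims each.
print(NUM_Q_HEADS * HEAD_DIM)  # 2048
```

Each KV head is read by two query heads, which is why a GQA kernel can reuse loaded K/V data across a pair of query heads.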
pip install mlx mlx-lm openevolve

# Set API key (Gemini via OpenAI-compatible endpoint)
export OPENAI_API_KEY="your-gemini-key"

cd openevolve/examples/mlx_metal_kernel_opt
# Using the experiment runner script
./run_evolve_experiment.sh --run-name test_run --iterations 25
# Or directly
python -m openevolve.cli \
--initial-program initial_program.py \
--evaluator evaluator.py \
--config config.yaml \
--iterations 25 \
--output ./openevolve_output

# Run a single benchmark with verbose output
python -c "
from evaluator import Qwen3GQAEvaluator
e = Qwen3GQAEvaluator()
result = e.evaluate('initial_program.py')
print(result['summary'])
"

| File | Purpose |
|---|---|
| `initial_program.py` | Starting Metal kernel (to be evolved) |
| `evaluator.py` | Correctness + performance evaluation |
| `config.yaml` | Evolution configuration |
| `qwen3_benchmark_suite.py` | Benchmark definitions |
| `mlx_lm_generate_with_hook.py` | Subprocess hook wrapper |
| `run_evolve_experiment.sh` | Experiment runner script |

This PR corrects critical issues that invalidated prior evaluation results:
- Subprocess Kernel Hook: Evolved kernels are now properly applied in benchmark subprocesses via `mlx_lm_generate_with_hook.py`
- bfloat16 Correctness Gate: Correctness tests now use `mx.bfloat16` inputs to match the actual inference dtype
- Architecture Alignment: Fixed head ratio from 40:8 to the correct 16:8 (2:1 GQA pattern)
- Evaluation Flow Optimizations: Early exit on compilation errors, correctness-before-baseline ordering, GPU state cleanup between runs
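
The evaluation ordering in the last item can be sketched roughly as follows. This is only an illustration of the control flow, not the code in `evaluator.py`: the function name `evaluate_candidate`, its callable parameters, and the returned dictionary keys are all hypothetical.

```python
from typing import Callable, Dict


def evaluate_candidate(
    compile_kernel: Callable[[], None],
    check_correctness: Callable[[], bool],
    run_benchmark: Callable[[], float],
    run_baseline: Callable[[], float],
    cleanup: Callable[[], None],
) -> Dict[str, object]:
    """Hypothetical evaluation flow: fail fast before any expensive timing run."""
    # 1. Early exit on compilation errors: nothing else runs for a broken kernel.
    try:
        compile_kernel()
    except Exception as exc:
        return {"status": "compile_error", "error": str(exc)}

    # 2. Correctness before baseline: skip all benchmarking for wrong kernels.
    if not check_correctness():
        return {"status": "incorrect"}

    # 3. Time baseline and evolved kernel, resetting state between the runs.
    baseline = run_baseline()
    cleanup()
    evolved = run_benchmark()
    cleanup()
    return {"status": "ok", "speedup": evolved / baseline}
```

Ordering the cheap checks first means a candidate that fails to compile or produces wrong output never costs a baseline measurement.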
After fixing validity issues, we ran 25 evolution iterations.
Result: The best evolved kernel is 3.2% SLOWER than MLX's baseline implementation.
Evolution narrowed the gap from an initial 11.5% slowdown to a 3.2% slowdown, but never matched or exceeded the baseline. This indicates fundamental limitations in the current evolution mechanism that require further investigation.
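
For readers interpreting these percentages, the sketch below shows one way such a figure can be computed, assuming it is the relative change in a higher-is-better metric such as decode throughput. The helper name and the throughput numbers are purely illustrative; they were chosen to reproduce the percentages and are not measured values from this run.

```python
def percent_vs_baseline(evolved: float, baseline: float) -> float:
    """Relative change of a higher-is-better metric, in percent.

    Negative values mean the evolved kernel is slower than the baseline.
    """
    return (evolved / baseline - 1.0) * 100.0


# Illustrative numbers only (not measurements from the experiment).
baseline_tps = 100.0
initial_tps = 88.5
best_tps = 96.8

print(round(percent_vs_baseline(initial_tps, baseline_tps), 1))  # -11.5
print(round(percent_vs_baseline(best_tps, baseline_tps), 1))     # -3.2
```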
For detailed experiment results and analysis, see EVOLUTION_ANALYSIS.md.
For review and reproducibility, this example repo includes a committed snapshot of one post-fix evolution run:
- `best_program.py`: best evolved program (iteration 23)
- `best_program_info.json`: metrics + baseline comparisons (includes the -3.2% result)
The full run output directory is intentionally git-ignored (see .gitignore) to avoid committing large run artifacts.
- MAP-Elites selection uses an abstract `combined_score` instead of direct speedup ratios
- LLM context underutilized (only 1 parent + 5 samples per iteration)
- No GPU profiling data to guide optimization
- 32% bf16 compilation failure rate
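
To illustrate the first point, here is a minimal sketch of MAP-Elites elite replacement with a pluggable fitness function. It is not OpenEvolve's implementation: the `insert_elite` helper, the cell key, and the candidate values are hypothetical, chosen only to show that ranking by an abstract score and ranking by speedup can retain different elites in the same cell.

```python
from typing import Callable, Dict, Tuple

Candidate = Dict[str, float]
Cell = Tuple[int, int]


def insert_elite(
    archive: Dict[Cell, Candidate],
    cell: Cell,
    candidate: Candidate,
    fitness: Callable[[Candidate], float],
) -> bool:
    """Keep the candidate if its cell is empty or it beats the current elite."""
    current = archive.get(cell)
    if current is None or fitness(candidate) > fitness(current):
        archive[cell] = candidate
        return True
    return False


# Two hypothetical candidates that fall into the same feature cell.
a = {"combined_score": 0.70, "speedup": 0.95}
b = {"combined_score": 0.65, "speedup": 0.97}

by_score: Dict[Cell, Candidate] = {}
by_speedup: Dict[Cell, Candidate] = {}
for cand in (a, b):
    insert_elite(by_score, (0, 0), cand, lambda c: c["combined_score"])
    insert_elite(by_speedup, (0, 0), cand, lambda c: c["speedup"])

print(by_score[(0, 0)]["speedup"])    # 0.95 (the slower kernel is retained)
print(by_speedup[(0, 0)]["speedup"])  # 0.97
```

Under these made-up values, selecting on the abstract score discards the faster candidate, which is the kind of mismatch the first bullet describes.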