Comparing AI-driven evolution frameworks on systems code optimization.
Run AI-driven search frameworks (GEPA, K-Search, SkyDiscover, Nous) and coding agents against the same compiled system target, with the same optimization objective.
Each framework is optional; install only the ones you want to compare.
| Framework | Type | Repo | Paper |
|---|---|---|---|
| GEPA | Reflective Pareto evolution | github.com/gepa-ai/gepa | arxiv:2507.19457 |
| SkyDiscover | Adaptive population (AdaEvolve, EvoX) | github.com/skydiscover-ai/skydiscover | AdaEvolve, EvoX |
| K-Search | World-model guided tree search | github.com/caoshiyi/K-Search | arxiv:2602.19128 |
| ShinkaEvolve | Async population, UCB model selection | github.com/SakanaAI/ShinkaEvolve | arxiv:2509.19349 |
| CORAL | Multi-agent shared memory | github.com/Human-Agent-Society/CORAL | arxiv:2604.01658 |
| AutoScientists | Multi-agent self-organizing teams | github.com/mims-harvard/AutoScientists | arxiv:2605.28655 |
| Nous | Hypothesis-driven controlled experiments | github.com/AI-native-Systems-Research/agentic-strategy-evolution | — |
| Coding Agent | Single Claude session, iterative | (built-in) | — |
# Clone this repo and the certus target:
git clone https://github.com/AI-native-Systems-Research/ai-native-storage-certus-workbench.git
git clone https://github.com/AI-native-Systems-Research/ai-native-storage-certus.git -b unstable
# Clone frameworks you want to compare:
git clone https://github.com/gepa-ai/gepa.git
git clone https://github.com/skydiscover-ai/skydiscover.git
git clone https://github.com/caoshiyi/K-Search.git
git clone https://github.com/SakanaAI/ShinkaEvolve.git
# Install workbench:
cd ai-native-storage-certus-workbench
pip install -e .
# Install frameworks (editable from cloned source):
pip install -e ../gepa
pip install -e ../skydiscover
pip install -e ../K-Search
pip install -e ../ShinkaEvolve
# Run a single framework on a target:
workbench --framework gepa --target certus_p2p --config 1d_1c_1g
# Compare multiple frameworks (same budget, same seed):
workbench-compare --target certus_p2p --config 1d_1c_1g \
--frameworks gepa,ksearch,skydiscover,nous,coding_agent \
  --budget 15

targets/                        System targets to evolve
  certus_p2p/                   SSD→GPU data path optimization
    target.yaml                 Files, build command, scoring, hardware ceilings
    evaluate.py                 Multi-config evaluator
    configs/                    Hardware configurations (drives × clients × GPUs)
    initial_programs/           Seed code per framework format
frameworks/                     Per-framework runner scripts
  gepa.py                       GEPA optimize_anything wrapper
  ksearch.py                    K-Search world-model wrapper
  skydiscover.py                SkyDiscover (AdaEvolve/EvoX) wrapper
  nous.py                       Nous campaign orchestrator wrapper
  coding_agent.py               Raw Claude coding agent
  random_baseline.py            Random mutation control
orchestrator/                   Run & compare infrastructure
  run.py                        Run one framework on one target+config
  compare.py                    Run multiple, generate comparison report
  preflight.py                  Smoke-test frameworks before full run
knowledge/                      Accumulated learnings across experiments
  experiment_experience.md      Operational lessons from prior runs
  evolution_strategy.md         Framework comparison analysis
results/                        Experiment outputs
Create a folder under targets/ with:
- target.yaml: defines files in scope, build command, scoring formula, hardware ceilings
- evaluate.py: evaluator script (build → test → bench → score)
- configs/: hardware configuration variants
- initial_programs/: seed code for frameworks that need it
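
The build → test → bench → score pipeline of an evaluator can be sketched roughly as below. This is a minimal illustration, not the workbench's actual evaluate.py: the `evaluate` function, the `EvalResult` type, the injected stage callables, and the scoring rule (measured throughput as a fraction of a hardware ceiling, capped at 1.0) are all assumptions made for the example. A real evaluator would presumably read the build command, scoring formula, and ceilings from target.yaml.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    stage: str                          # last stage reached: "build", "test", or "score"
    score: float                        # 0.0 when a stage before scoring fails
    throughput: Optional[float] = None  # benchmark measurement, if one was taken


def evaluate(
    build: Callable[[], bool],
    test: Callable[[], bool],
    bench: Callable[[], float],
    ceiling: float,
) -> EvalResult:
    """Run build -> test -> bench -> score, stopping at the first failure."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    if not build():
        return EvalResult(stage="build", score=0.0)
    if not test():
        return EvalResult(stage="test", score=0.0)
    throughput = bench()
    # Hypothetical scoring rule: fraction of the hardware ceiling, capped at 1.0.
    score = min(throughput / ceiling, 1.0)
    return EvalResult(stage="score", score=score, throughput=throughput)


# Usage with stand-in stages: a candidate that builds, passes tests,
# and reaches half of the hardware ceiling.
result = evaluate(build=lambda: True, test=lambda: True, bench=lambda: 3.5, ceiling=7.0)
print(result.score)  # 0.5
```

Returning a zero score together with the failing stage, rather than raising, lets a search framework treat broken candidates as low-fitness results while still reporting why they failed.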
Add a script to frameworks/ that:
- Accepts a target config (YAML path)
- Accepts an iteration budget
- Runs the framework's optimization loop
- Writes results to results/<target>/<config>/<framework>/
No abstract base class (ABC) is required: any script that takes these arguments and produces scored candidates will work.
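
A runner that meets the requirements above might look roughly like this sketch. Everything in it is hypothetical: the flag names (`--target-config`, `--config`, `--budget`, `--results-root`, `--seed`), the `candidates.json` file name, and the rule that the target name is the directory containing target.yaml are assumptions for illustration, not the workbench's real interface. The random score stands in for the framework's own propose-and-evaluate step.

```python
import argparse
import json
import random
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical framework runner")
    parser.add_argument("--target-config", type=Path, required=True,
                        help="path to targets/<target>/target.yaml")
    parser.add_argument("--config", default="1d_1c_1g",
                        help="hardware configuration name")
    parser.add_argument("--budget", type=int, default=15,
                        help="iteration budget")
    parser.add_argument("--results-root", type=Path, default=Path("results"))
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def run(args, framework="example"):
    rng = random.Random(args.seed)
    # Assumption: the target name is the directory that holds target.yaml.
    target = args.target_config.parent.name
    out_dir = args.results_root / target / args.config / framework
    out_dir.mkdir(parents=True, exist_ok=True)

    candidates = []
    for iteration in range(args.budget):
        # Placeholder for the framework's optimization step; a real runner
        # would call the framework and the target's evaluator here.
        candidates.append({"iteration": iteration, "score": rng.random()})

    out_file = out_dir / "candidates.json"
    out_file.write_text(json.dumps(candidates, indent=2))
    return out_file
```

To use it as a script, call `run(parse_args())` under an `if __name__ == "__main__":` guard. Fixing the seed and budget on the command line keeps runs comparable across frameworks.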
See knowledge/evolution_strategy.md for the full framework comparison from the P2P experiment (7 frameworks × 2 conditions, single-drive). Key finding: the evolutionary frameworks (GEPA, SkyDiscover, ShinkaEvolve) could not handle multi-file Rust FFI code; only the hypothesis-driven (Nous) and agentic (coding agent) approaches produced meaningful results.