Certus Workbench

Comparing AI-driven evolution frameworks on systems code optimization.

Run evolutionary search frameworks (GEPA, K-Search, SkyDiscover, Nous, and others) and coding agents against the same compiled system target, with the same optimization objective.

Frameworks

Each framework is optional — install only the ones you want to compare.

Framework	Type	Repo	Paper
GEPA	Reflective Pareto evolution	`github.com/gepa-ai/gepa`	arxiv:2507.19457
SkyDiscover	Adaptive population (AdaEvolve, EvoX)	`github.com/skydiscover-ai/skydiscover`	AdaEvolve, EvoX
K-Search	World-model guided tree search	`github.com/caoshiyi/K-Search`	arxiv:2602.19128
ShinkaEvolve	Async population, UCB model selection	`github.com/SakanaAI/ShinkaEvolve`	arxiv:2509.19349
CORAL	Multi-agent shared memory	`github.com/Human-Agent-Society/CORAL`	arxiv:2604.01658
AutoScientists	Multi-agent self-organizing teams	`github.com/mims-harvard/AutoScientists`	arxiv:2605.28655
Nous	Hypothesis-driven controlled experiments	`github.com/AI-native-Systems-Research/agentic-strategy-evolution`	—
Coding Agent	Single Claude session, iterative	(built-in)	—

Setup

# Clone this repo and the certus target:
git clone https://github.com/AI-native-Systems-Research/ai-native-storage-certus-workbench.git
git clone https://github.com/AI-native-Systems-Research/ai-native-storage-certus.git -b unstable

# Clone frameworks you want to compare:
git clone https://github.com/gepa-ai/gepa.git
git clone https://github.com/skydiscover-ai/skydiscover.git
git clone https://github.com/caoshiyi/K-Search.git
git clone https://github.com/SakanaAI/ShinkaEvolve.git

# Install workbench:
cd ai-native-storage-certus-workbench
pip install -e .

# Install frameworks (editable from cloned source):
pip install -e ../gepa
pip install -e ../skydiscover
pip install -e ../K-Search
pip install -e ../ShinkaEvolve

Usage

# Run a single framework on a target:
workbench --framework gepa --target certus_p2p --config 1d_1c_1g

# Compare multiple frameworks (same budget, same seed):
workbench-compare --target certus_p2p --config 1d_1c_1g \
    --frameworks gepa,ksearch,skydiscover,nous,coding_agent \
    --budget 15
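The "same budget, same seed" idea behind the comparison can be illustrated with a minimal sketch. This is not the actual `workbench-compare` implementation: the `Runner` signature, `compare`, and `stub_runner` are hypothetical names introduced here, and the stub stands in for a real framework run.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical runner signature (an assumption): (budget, seed) -> best score found.
Runner = Callable[[int, int], float]


def compare(runners: Dict[str, Runner], budget: int, seed: int) -> List[Tuple[str, float]]:
    """Run every framework with an identical budget and seed; rank best first."""
    scores = {name: run(budget, seed) for name, run in runners.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def stub_runner(bias: float) -> Runner:
    """Stand-in for a real framework: best of `budget` seeded random draws plus a bias."""
    def run(budget: int, seed: int) -> float:
        rng = random.Random(seed)
        return bias + max(rng.random() for _ in range(budget))
    return run


# Both stubs see the same random draws, so only the bias separates them.
ranking = compare({"a": stub_runner(0.0), "b": stub_runner(0.5)}, budget=15, seed=0)
```

Holding budget and seed fixed means that differences in the ranking come from the frameworks themselves rather than from unequal search effort or a lucky random stream.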

Structure

targets/                    System targets to evolve
  certus_p2p/               SSD→GPU data path optimization
    target.yaml             Files, build command, scoring, hardware ceilings
    evaluate.py             Multi-config evaluator
    configs/                Hardware configurations (drives × clients × GPUs)
    initial_programs/       Seed code per framework format

frameworks/                 Per-framework runner scripts
  gepa.py                   GEPA optimize_anything wrapper
  ksearch.py                K-Search world-model wrapper
  skydiscover.py            SkyDiscover (AdaEvolve/EvoX) wrapper
  nous.py                   Nous campaign orchestrator wrapper
  coding_agent.py           Raw Claude coding agent
  random_baseline.py        Random mutation control

orchestrator/               Run & compare infrastructure
  run.py                    Run one framework on one target+config
  compare.py                Run multiple, generate comparison report
  preflight.py              Smoke-test frameworks before full run

knowledge/                  Accumulated learnings across experiments
  experiment_experience.md  Operational lessons from prior runs
  evolution_strategy.md     Framework comparison analysis

results/                    Experiment outputs
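Config names such as `1d_1c_1g` appear to encode the drives × clients × GPUs counts described for `configs/`. A minimal sketch of parsing that name follows, assuming the `d`/`c`/`g` suffixes map to drives, clients, and GPUs in that order; `HardwareConfig` and `parse_config_name` are hypothetical helpers, not part of the workbench.

```python
import re
from typing import NamedTuple


class HardwareConfig(NamedTuple):
    drives: int
    clients: int
    gpus: int


# Assumed naming scheme: <drives>d_<clients>c_<gpus>g, e.g. "1d_1c_1g".
_PATTERN = re.compile(r"(\d+)d_(\d+)c_(\d+)g")


def parse_config_name(name: str) -> HardwareConfig:
    """Split a config name into its three counts, rejecting anything else."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name!r}")
    return HardwareConfig(*map(int, match.groups()))
```

For example, `parse_config_name("1d_1c_1g")` yields one drive, one client, and one GPU under this assumed scheme.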

Adding a New Target

Create a folder under targets/ with:

target.yaml — defines files in scope, build command, scoring formula, hardware ceilings
evaluate.py — evaluator script (build → test → bench → score)
configs/ — hardware configuration variants
initial_programs/ — seed code for frameworks that need it
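The build → test → bench → score flow of an evaluator could look roughly like the sketch below. It is not the repository's `evaluate.py`: the function names, the `EvalResult` type, and the scoring rule (measured throughput divided by the hardware ceiling, capped at 1.0, and zero on any failed stage) are assumptions made for illustration.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    stage: str    # last stage reached: "build", "test", "bench", or "score"
    score: float  # 0.0 unless every stage succeeded


def run_step(cmd: Sequence[str]) -> tuple[bool, str]:
    """Run one command; return (succeeded, captured stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout


def evaluate(
    build_cmd: Sequence[str],
    test_cmd: Sequence[str],
    bench_cmd: Sequence[str],
    ceiling: float,
    parse_throughput: Callable[[str], float],
    step: Callable[[Sequence[str]], tuple[bool, str]] = run_step,
) -> EvalResult:
    """Stop at the first failing stage; otherwise score against the ceiling."""
    for stage, cmd in (("build", build_cmd), ("test", test_cmd)):
        ok, _ = step(cmd)
        if not ok:
            return EvalResult(stage, 0.0)
    ok, output = step(bench_cmd)
    if not ok:
        return EvalResult("bench", 0.0)
    measured = parse_throughput(output)
    # Assumed scoring rule: fraction of the hardware ceiling, capped at 1.0.
    return EvalResult("score", min(measured / ceiling, 1.0))
```

Passing the step runner in as a parameter lets the pipeline be dry-run with a fake step before any real build or benchmark command is wired in.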

Adding a New Framework

Add a script to frameworks/ that:

Accepts a target config (YAML path)
Accepts an iteration budget
Runs the framework's optimization loop
Writes results to results/<target>/<config>/<framework>/

No abstract base class (ABC) is required: any script that takes these arguments and produces scored candidates will work.
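A runner meeting the four requirements above might be sketched as follows. Everything here is illustrative: the flag names (`--target-config`, `--config`, `--budget`, `--results-root`, `--seed`), the `my_framework` name, the `candidates.json` filename, and the random-score placeholder loop are all assumptions, not the interface of the existing runners.

```python
import argparse
import json
import random
from pathlib import Path

FRAMEWORK = "my_framework"  # hypothetical framework name


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical framework runner")
    parser.add_argument("--target-config", type=Path, required=True,
                        help="path to the target's target.yaml")
    parser.add_argument("--config", required=True,
                        help="hardware config name, e.g. 1d_1c_1g")
    parser.add_argument("--budget", type=int, default=15, help="iteration budget")
    parser.add_argument("--results-root", type=Path, default=Path("results"))
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def optimize(budget, seed):
    """Placeholder loop: swap in the framework's own propose/evaluate cycle."""
    rng = random.Random(seed)
    return [{"iteration": i, "score": rng.random()} for i in range(budget)]


def main(argv=None):
    args = parse_args(argv)
    # Assumes the target name is the folder holding target.yaml,
    # e.g. targets/certus_p2p/target.yaml -> certus_p2p.
    target = args.target_config.parent.name
    out_dir = args.results_root / target / args.config / FRAMEWORK
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "candidates.json"
    out_file.write_text(json.dumps(optimize(args.budget, args.seed), indent=2))
    return out_file
```

To use this as a script, call `main()` from an `if __name__ == "__main__":` guard; the `argv` parameter exists so the same entry point can be exercised from a test without touching the real command line.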

Prior Results

See knowledge/evolution_strategy.md for the full framework comparison from the P2P experiment (7 frameworks × 2 conditions, single-drive). Key finding: in that experiment, the evolutionary frameworks (GEPA, SkyDiscover, ShinkaEvolve) could not handle multi-file Rust FFI code. Only the hypothesis-driven (Nous) and agentic (coding agent) approaches produced meaningful results.
