| Metric | Value |
|---|---|
| Total cycles | 163 |
| Score range | 35 → 862 (24.6× improvement) |
| Population size | 163 prompts |
| Evaluation method | 400+ signal lexical scoring |
| Evolution loops | Lexical + Grounded (dual) |
The original skeleton already contained a mature lexical-evolution loop. Prompts are scored by keyword/signal presence (tech stack, code quality, testing, security, etc.). The best prompt scored 862/1000 — a 24.6× improvement over the starting seed.
langgraph(+4),ollama(+5),local(+3) — local LLM stack consistently winspydantic(+3),httpx(+2) — type-safe HTTP clients valuederror handling(+2),logging(+2),tests(+2) — code quality signalsdocker(+2),CI / GitHub Actions(+2) — deployment awareness- Async, streaming, caching, retry primitives
- Whether the generated code actually compiles
- Whether tests pass
- Whether the project is well-structured (functions, classes, modules)
- Whether dependencies are declared
- Linting / code style violations
This limitation is the motivation for the grounded loop.
The grounded loop closes the gap by:
- Generating real code from the prompt (via LLM →
generator.py) - Validating execution (AST parse, pytest, flake8 —
runtime_evaluator.py) - Scoring on real outcomes not keyword counts
- Backpropagating the execution score into the population as prompt fitness
| Component | Max score | What it measures |
|---|---|---|
| Syntax validity | 20 | AST parse succeeds on all .py files |
| Test pass rate | 25 | pytest exit code 0 |
| Lint cleanliness | 10 | flake8 no errors (E/F/W) |
| Runtime execution | 15 | main.py can be imported and run |
| Structure | 10 | ≥3 functions + ≥2 classes |
| Tests exist | 5 | test_*.py files present |
| Dependencies | 3 | requirements.txt or pyproject.toml |
| Documentation | 2 | README.md present |
| File count | 5 | ≥3 .py files |
| AST complexity | 5 | >50 AST nodes |
| Total | 100 |
- Prompts that mention specific library versions tend to generate broken
requirements.txt(hard to match against current PyPI) - Vague prompts ("generate a clean Python project") produce higher pass rates than specific ones ("build a CLI task manager with argparse") because the LLM has more freedom
- High lexical score ≠ high execution score; some 800+ prompts generate code that fails flake8
Lexical and grounded scores are weakly correlated. The most "impressive" prompts (long, many keywords) often produce the messiest code. The grounded loop acts as a regularizer — rewarding prompts that produce simple, correct, and tested output over prompts that read well on paper.
The lexical loop was stuck at 862/1000 until a root-cause analysis revealed a bug in mutate.py's get_missing_keywords function:
- Bug: Single-keyword conditions (e.g.,
if "ollama" in content:) — 286 out of 592 total — were completely ignored because the function only handledand/orpatterns, missing the simpleelsebranch. - Impact: 180 uncovered signals were invisible to the mutation engine, creating an artificial plateau.
- Fix: Added an
elsebranch for single-keyword conditions. Combined with the earlier fix (relaxedlen(k) > 3→len(k) >= 3with a short-word blocklist).
| Cycle | Score | Missing signals |
|---|---|---|
| 0 | 896 | 181 |
| 5 | 914 | 172 |
| 15 | 932 | 164 |
| 16 | 950 | 155 |
| 21 | 968 | 144 |
| 22 | 986 | 134 |
| 23 | 1000 | 124 |
6 prompts now score the maximum 1000/1000. Population: 218 prompts.
| Metric | Value |
|---|---|
| Total cycles | 218 |
| Score range | 35 → 1000 (28.6× improvement) |
| Champions at 1000 | 6 |
| Population size | 218 prompts |
| Evaluator signals | 592 keyword checks |
| Ceiling progression | 500 → 862 → 1000 |
| Evaluation method | 400+ signal lexical scoring (fully capped) |
| Evolution loops | Lexical (capped) + Grounded (needs API key) |
- Run the grounded loop with an LLM API key set (
LLM_API_KEY) — shifts from keyword scoring to execution validation - Add new signal pools to
evolve_forever.pyto raise the ceiling beyond 1000 - Run a Spearman correlation between lexical score and grounded score across the population
- Archive generated projects from top-scoring grounded prompts for manual review
- Diversify benchmarks in
benchmarks/tasks.json