Commit ec85f6e

Plateau analysis, super_merge strategy, docs update, 13 new prompts
1 parent 74701cc commit ec85f6e

21 files changed

Lines changed: 10136 additions & 20 deletions

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -2,6 +2,27 @@

 All notable changes to Grounded Evolution will be documented in this file.

+## [0.2.1] - 2026-05-28
+
+### Added
+- Type hints to all core modules: generator, infinite_research_loop, runtime_evaluator, mutation_engine, population_manager, beautify_readme, regression_tracker
+- `.github/CODEOWNERS`, `FUNDING.yml`, `dependabot.yml` — repository governance
+- ROADMAP.md — near/medium/long-term development plan
+- EVOLUTION_REPORT.md — complete experiment summary with grounded/lexical analysis
+- `super_merge` mutation strategy — combines top-5 prompts for plateau breaking
+- Auto-detection of real scores in beautify_readme.py (reads population.json)
+
+### Changed
+- Fixed stale score marker in auto_evolve.py (500→1000)
+- Updated all stale URLs to grounded_evolution (SECURITY.md, config.yml, badge.yml, script.sh)
+- beautify_readme.py now non-destructive (marker-based section update)
+- generator.py uses env-var LLM config (LLM_API_KEY, LLM_MODEL, LLM_BASE_URL) — no hardcoded keys
+- requirements.txt pinned with versions
+- README badges/metrics updated to 163 prompts
+
+### Fixed
+- `datetime.UTC` → `datetime.timezone.utc` in regression_tracker.py (Python 3.14 compat)
+
 ## [0.2.0] - 2026-05-28

 ### Added
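
The 0.2.1 entry above says generator.py now takes its LLM settings from `LLM_API_KEY`, `LLM_MODEL`, and `LLM_BASE_URL`; that file is not part of this diff view, so the following is only a minimal sketch of what such a loader could look like. The `LLMConfig` class, the `load_llm_config` function, and both default values are hypothetical and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical container for the three settings named in the changelog."""
    api_key: str
    model: str
    base_url: str


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> LLMConfig:
    """Read LLM settings from environment variables (sketch, not the repo's code)."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY", "").strip()
    if not api_key:
        # Fail loudly instead of falling back to a key embedded in the source.
        raise RuntimeError("LLM_API_KEY is not set")
    return LLMConfig(
        api_key=api_key,
        # Both defaults below are placeholders chosen for illustration only.
        model=env.get("LLM_MODEL", "example-model"),
        base_url=env.get("LLM_BASE_URL", "http://localhost:8000/v1"),
    )
```

Passing a plain dict as `env` keeps the function testable without touching the real process environment.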

EVOLUTION_REPORT.md

Lines changed: 14 additions & 6 deletions
@@ -4,9 +4,9 @@

 | Metric | Value |
 |---|---|
-| Total cycles | ~150 |
+| Total cycles | 163 |
 | Score range | 35 → 862 (24.6× improvement) |
-| Population size | ~150 prompts |
+| Population size | 163 prompts |
 | Evaluation method | 400+ signal lexical scoring |
 | Evolution loops | Lexical + Grounded (dual) |

@@ -63,8 +63,16 @@ The grounded loop closes the gap by:

 **Lexical and grounded scores are weakly correlated.** The most "impressive" prompts (long, many keywords) often produce the messiest code. The grounded loop acts as a regularizer — rewarding prompts that produce **simple, correct, and tested** output over prompts that read well on paper.

-## Next Analysis
+## Lexical Plateau

-- Run a Spearman correlation between lexical score and grounded score across the full population
-- Track per-cycle cost (LLM tokens + wall time) to identify regression
-- Archive generated projects from the top-10 grounded prompts for manual review
+After 163 generations and 40+ injected signal pools, the lexical loop has converged at **862/1000**. All 500+ keyword checks have been injected into `evaluate.py`. The remaining 138 points require niche keywords that no single prompt can practically cover without becoming an incoherent keyword salad.
+
+The `super_merge` strategy (added to `mutate.py`) attempts to combine all top-5 prompts into one maximally broad prompt, but even this can't bridge the gap — the signals are inherently contradictory (e.g., language-specific keywords for Python vs JavaScript).
+
+## Next Steps
+
+1. **Run the grounded loop** with an LLM API key set (`LLM_API_KEY`) — this shifts from keyword scoring to execution validation
+2. **Run a Spearman correlation** between lexical score and grounded score across the full population
+3. **Track per-cycle cost** (LLM tokens + wall time) to identify regression
+4. **Archive generated projects** from the top-10 grounded prompts for manual review
+5. **Diversify benchmarks** in `benchmarks/tasks.json` — the current single-benchmark mode limits grounded evolution breadth
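
Step 2 of the list above proposes a Spearman correlation between lexical and grounded scores. A dependency-free sketch of that computation follows: it ranks each score list (ties get the average rank) and takes the Pearson correlation of the ranks. The function names are hypothetical, and how the per-prompt score pairs would be loaded from the population is left out because the report does not specify it.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A value near 0 across the full population would back the report's "weakly correlated" claim; a strongly negative value would suggest that keyword-heavy prompts actively hurt grounded results.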

README.md

Lines changed: 14 additions & 8 deletions
@@ -7,9 +7,9 @@
 [![Status: Active](https://img.shields.io/badge/status-active-success?style=flat-square&logo=github)](https://github.com/NullLabTests/grounded_evolution)
 [![License: MIT](https://img.shields.io/badge/License-MIT-67ac09?style=flat-square)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/Python-3.12%2B-007ec6?style=flat-square&logo=python&logoColor=white)](https://python.org)
-[![Generations](https://img.shields.io/badge/Generations-150-8b5cf6?style=flat-square)](#results)
+[![Generations](https://img.shields.io/badge/Generations-163-8b5cf6?style=flat-square)](#results)
 [![Best Score](https://img.shields.io/badge/Best%20Score-862%2F1000-22c55e?style=flat-square)](#results)
-[![Population](https://img.shields.io/badge/Population-150%20prompts-f59e0b?style=flat-square)](#results)
+[![Population](https://img.shields.io/badge/Population-163%20prompts-f59e0b?style=flat-square)](#results)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-8b5cf6?style=flat-square)](CONTRIBUTING.md)

 [Overview](#overview)
@@ -27,10 +27,10 @@

 <!-- EVOLUTION_STATUS_START -->

-> **Last Evolution Cycle:** 2026-05-28T15:47:04.958439 UTC
-> **Generation:** 0
-> **Best Score:** 0
-> **Population Size:** 0
+> **Last Evolution Cycle:** 2026-05-28T16:03:22.335068+00:00 UTC
+> **Generation:** 5
+> **Best Score:** 96.0
+> **Population Size:** 5

 <!-- EVOLUTION_STATUS_END -->

@@ -394,11 +394,17 @@ git checkout prompt.txt

 | Metric | Value |
 |--------|-------|
-| **Generations** | 150 |
-| **Population** | 150 prompts |
+| **Generations** | 163 |
+| **Population** | 163 prompts |
 | **Best Lexical Score** | 862 / 1000 |
 | **Score Range** | 35 → 862 (24.6×) |
 | **Ceiling Progression** | 500 → 1000 |
+| **Grounded Best** | 96.0 / 100 |
+
+> **Note: Lexical Plateau at 862/1000.** The lexical loop has converged — all 40+ signal pools
+> have been exhausted and prompts saturate at 862. Remaining signals (138 points) are too niche
+> for any single prompt to cover. To break through, use the **grounded loop** (execution-based
+> scoring, max 100 points) which rewards code quality, not keyword breadth.

 ### Top 10 Prompts

auto_evolve.py

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ def inject_new_signals(count=8):
         content = f.read()

     # Find the insertion point (before 'scores[f] = round(min(500.0, score), 1)')
-    insert_marker = "scores[f] = round(min(500.0, score), 1)"
+    insert_marker = "scores[f] = round(min(1000.0, score), 1)"
     if insert_marker not in content:
         print("WARN: could not find insertion point in evaluate.py")
         return 0
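
The hunk above matters because the injection step locates its insertion point by exact string match: once `evaluate.py` clamps at `1000.0`, a marker that still says `500.0` never matches and the function returns early. The sketch below reproduces that behaviour in isolation; `insert_before_marker` and the two-line `evaluate_src` sample are hypothetical stand-ins, not code from the repository.

```python
from typing import Optional


def insert_before_marker(content: str, marker: str, block: str) -> Optional[str]:
    """Insert `block` immediately before the first occurrence of `marker`.

    Returns None when the marker is absent, mirroring the early-return path
    in the hunk above.
    """
    idx = content.find(marker)
    if idx == -1:
        return None
    return content[:idx] + block + content[idx:]


# Hypothetical stand-in for the tail of evaluate.py after the ceiling was raised.
evaluate_src = "score += bonus\nscores[f] = round(min(1000.0, score), 1)\n"

stale_marker = "scores[f] = round(min(500.0, score), 1)"
current_marker = "scores[f] = round(min(1000.0, score), 1)"

# The stale marker finds nothing, so no signals would be injected.
assert insert_before_marker(evaluate_src, stale_marker, "score += 5\n") is None

# The updated marker matches and the new check lands before the clamp line.
patched = insert_before_marker(evaluate_src, current_marker, "score += 5\n")
```

Matching on the full clamp expression is brittle by design: any future change to the ceiling needs the same one-line marker update.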

beautify_readme.py

Lines changed: 16 additions & 1 deletion
@@ -47,4 +47,19 @@ def beautify(best_score: float = 0, generation: int = 0, population_size: int =


 if __name__ == "__main__":
-    beautify()
+    import json
+    pop = Path("population/population.json")
+    if pop.exists():
+        data = json.loads(pop.read_text())
+        if data:
+            best = max(float(d.get("score", 0)) for d in data)
+            beautify(best_score=best, generation=len(data), population_size=len(data))
+        else:
+            beautify()
+    else:
+        import glob
+        txt_files = list(Path("population").glob("*.txt"))
+        if txt_files:
+            beautify(population_size=len(txt_files))
+        else:
+            beautify()
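
The changelog describes beautify_readme.py as non-destructive through a "marker-based section update", and the README hunk shows the `EVOLUTION_STATUS_START` / `EVOLUTION_STATUS_END` comment pair. The body of `beautify()` is not part of this diff, so the following is only a sketch of how such an update might work; `replace_status_section` is a hypothetical name.

```python
START = "<!-- EVOLUTION_STATUS_START -->"
END = "<!-- EVOLUTION_STATUS_END -->"


def replace_status_section(readme: str, body: str) -> str:
    """Replace only the text between the two markers, keeping everything else.

    If either marker is missing, the README is returned unchanged rather than
    being overwritten.
    """
    start = readme.find(START)
    end = readme.find(END, start + len(START)) if start != -1 else -1
    if start == -1 or end == -1:
        return readme
    head = readme[: start + len(START)]
    tail = readme[end:]
    return head + "\n\n" + body.strip("\n") + "\n\n" + tail
```

Because the function touches nothing outside the marker pair, hand-written README sections survive repeated runs.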

infinite_research_loop.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 import subprocess
 import sys
 import time
-from datetime import datetime, timezone
+from datetime import datetime, timezone, timezone
 from pathlib import Path
 from typing import Any
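
For context on the `timezone` import and the changelog's `datetime.timezone.utc` fix: an aware UTC datetime already carries its offset when serialized with `isoformat()`, which would explain why the new README status line reads `...+00:00 UTC`. The helper below is a hypothetical sketch of that behaviour, not code from the repository.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 timestamp in UTC (hypothetical helper).

    The result already ends in "+00:00", so appending a literal " UTC"
    afterwards states the zone twice.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    else:
        # Aware datetimes are converted; naive ones are treated as local time.
        now = now.astimezone(timezone.utc)
    return now.isoformat()


fixed = datetime(2026, 5, 28, 16, 3, 22, 335068, tzinfo=timezone.utc)
stamp = utc_timestamp(fixed)
```

With the fixed input above, `stamp` is `2026-05-28T16:03:22.335068+00:00`, matching the timestamp portion of the README status line.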

mutate.py

Lines changed: 25 additions & 3 deletions
@@ -128,8 +128,8 @@ def mutate():
         KEYWORD_BONUS.append(f"\n- {kw}: support, implementation, integration, configuration, management, monitoring, optimization")

     strategy = random.choices(
-        ["append", "crossover", "rewrite_section", "combine", "signal_hunt"],
-        weights=[0.2, 0.2, 0.15, 0.15, 0.3],
+        ["append", "crossover", "rewrite_section", "combine", "signal_hunt", "super_merge"],
+        weights=[0.1, 0.15, 0.1, 0.1, 0.25, 0.3],
     )[0]

     new_content = content
@@ -161,13 +161,35 @@ def mutate():
         new_content = half1 + "\n" + half2

     elif strategy == "signal_hunt":
-        # Add a batch of missing keywords to boost score
         additions = []
         if KEYWORD_BONUS:
             additions = random.sample(KEYWORD_BONUS, min(10, len(KEYWORD_BONUS)))
         additions.append(random.choice(ADDITIONS_POOL))
         new_content = content + "\n=== SIGNAL COVERAGE ===\n" + "\n".join(additions)

+    elif strategy == "super_merge":
+        merged_parts = [content]
+        scored = read_scores()
+        top_prompts = sorted(scored, key=lambda x: -x[1]) if scored else []
+        taken = 0
+        for fname, _ in top_prompts:
+            if fname == best_file or taken >= 4:
+                continue
+            fpath = os.path.join("population", fname)
+            if os.path.exists(fpath):
+                with open(fpath) as f:
+                    merged_parts.append(f.read())
+                taken += 1
+        all_lines = []
+        seen = set()
+        for part in merged_parts:
+            for line in part.split("\n"):
+                stripped = line.strip().lower()
+                if stripped not in seen and stripped:
+                    seen.add(stripped)
+                    all_lines.append(line)
+        new_content = "\n".join(all_lines)
+
     new_name = f"prompt_{len(files)+1:03d}.txt"
     with open(os.path.join("population", new_name), "w") as f:
         f.write(new_content)
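
The core of the `super_merge` branch above is an order-preserving, case-insensitive line de-duplication across several prompt texts. Pulled out as a standalone function it can be exercised without the `population/` directory or `read_scores()`; the name `merge_unique_lines` is hypothetical and the sample strings are made up for illustration.

```python
from typing import Iterable, List


def merge_unique_lines(parts: Iterable[str]) -> str:
    """Concatenate prompt texts, keeping the first occurrence of each line.

    Lines are compared after stripping whitespace and lowercasing, and blank
    lines are dropped, which matches the loop in the hunk above.
    """
    seen = set()
    merged: List[str] = []
    for part in parts:
        for line in part.split("\n"):
            key = line.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(line)  # keep the original casing and indentation
    return "\n".join(merged)


# Made-up sample: the second prompt repeats one line with different casing.
merged = merge_unique_lines(["  Use pytest\nAdd docs", "use PYTEST  \nPin versions"])
```

Since blank lines are dropped, a merged prompt loses its paragraph breaks. In the hunk, the current content is merged with up to four other top-scoring files, which is where the changelog's "top-5" comes from.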
