CHANGELOG.md
+21 lines changed: 21 additions & 0 deletions

@@ -2,6 +2,27 @@
 
 All notable changes to Grounded Evolution will be documented in this file.
 
+## [0.2.1] - 2026-05-28
+
+### Added
+- Type hints to all core modules: generator, infinite_research_loop, runtime_evaluator, mutation_engine, population_manager, beautify_readme, regression_tracker

EVOLUTION_REPORT.md
+14 -6 lines changed: 14 additions & 6 deletions

@@ -4,9 +4,9 @@
 
 | Metric | Value |
 |---|---|
-| Total cycles | ~150 |
+| Total cycles | 163 |
 | Score range | 35 → 862 (24.6× improvement) |
-| Population size | ~150 prompts |
+| Population size | 163 prompts |
 | Evaluation method | 400+ signal lexical scoring |
 | Evolution loops | Lexical + Grounded (dual) |
 
@@ -63,8 +63,16 @@ The grounded loop closes the gap by:
 
 **Lexical and grounded scores are weakly correlated.** The most "impressive" prompts (long, many keywords) often produce the messiest code. The grounded loop acts as a regularizer, rewarding prompts that produce **simple, correct, and tested** output over prompts that read well on paper.
 
-## Next Analysis
+## Lexical Plateau
 
-- Run a Spearman correlation between lexical score and grounded score across the full population
-- Archive generated projects from the top-10 grounded prompts for manual review
+After 163 generations and 40+ injected signal pools, the lexical loop has converged at **862/1000**. All 500+ keyword checks have been injected into `evaluate.py`. The remaining 138 points require niche keywords that no single prompt can practically cover without becoming an incoherent keyword salad.
+
+The `super_merge` strategy (added to `mutate.py`) attempts to combine all top-5 prompts into one maximally broad prompt, but even this can't bridge the gap: the signals are inherently contradictory (e.g., language-specific keywords for Python vs JavaScript).
+
+## Next Steps
+
+1. **Run the grounded loop** with an LLM API key set (`LLM_API_KEY`); this shifts from keyword scoring to execution validation
+2. **Run a Spearman correlation** between lexical score and grounded score across the full population
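
Review note: the Spearman step listed under Next Steps could be sketched roughly as below. This is a minimal, standard-library-only illustration, not code from the repository. The function names `average_ranks` and `spearman` and the five score pairs are hypothetical; in practice the inputs would be the lexical and grounded scores of the full population, however the project stores them.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical (lexical, grounded) score pairs, for illustration only.
lexical = [862, 790, 640, 512, 35]
grounded = [41, 77, 58, 80, 12]

rho = spearman(lexical, grounded)
print(f"rho = {rho:.3f}")  # prints "rho = 0.100" for this made-up data
```

A rho near zero on the real population would back up the "weakly correlated" claim in the report, while a value near 1 would suggest the lexical score is already a usable proxy for the grounded one. Ranks are used instead of raw scores because the two scales are not assumed to be comparable.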