NullLabTests
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 21 additions & 0 deletions
diff --git a/EVOLUTION_REPORT.md b/EVOLUTION_REPORT.md
Lines changed: 14 additions & 6 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 8 deletions
diff --git a/auto_evolve.py b/auto_evolve.py
Lines changed: 1 addition & 1 deletion
diff --git a/beautify_readme.py b/beautify_readme.py
Lines changed: 16 additions & 1 deletion
diff --git a/infinite_research_loop.py b/infinite_research_loop.py
Lines changed: 1 addition & 1 deletion
diff --git a/mutate.py b/mutate.py
Lines changed: 25 additions & 3 deletions
@@ -2,6 +2,27 @@
 
 All notable changes to Grounded Evolution will be documented in this file.
 
+## [0.2.1] - 2026-05-28
+
+### Added
+- Type hints to all core modules: generator, infinite_research_loop, runtime_evaluator, mutation_engine, population_manager, beautify_readme, regression_tracker
+- `.github/CODEOWNERS`, `FUNDING.yml`, `dependabot.yml` — repository governance
+- ROADMAP.md — near/medium/long-term development plan
+- EVOLUTION_REPORT.md — complete experiment summary with grounded/lexical analysis
+- `super_merge` mutation strategy — combines top-5 prompts for plateau breaking
+- Auto-detection of real scores in beautify_readme.py (reads population.json)
+
+### Changed
+- Fixed stale score marker in auto_evolve.py (500→1000)
+- Updated all stale URLs to grounded_evolution (SECURITY.md, config.yml, badge.yml, script.sh)
+- beautify_readme.py now non-destructive (marker-based section update)
+- generator.py uses env-var LLM config (LLM_API_KEY, LLM_MODEL, LLM_BASE_URL) — no hardcoded keys
+- requirements.txt pinned with versions
+- README badges/metrics updated to 163 prompts
+
+### Fixed
+- `datetime.UTC` → `datetime.timezone.utc` in regression_tracker.py (Python 3.14 compat)
+
 ## [0.2.0] - 2026-05-28
 
 ### Added
 
@@ -4,9 +4,9 @@
 
 | Metric | Value |
 |---|---|
-| Total cycles | ~150 |
+| Total cycles | 163 |
 | Score range | 35 → 862 (24.6× improvement) |
-| Population size | ~150 prompts |
+| Population size | 163 prompts |
 | Evaluation method | 400+ signal lexical scoring |
 | Evolution loops | Lexical + Grounded (dual) |
 
@@ -63,8 +63,16 @@ The grounded loop closes the gap by:
 
 **Lexical and grounded scores are weakly correlated.** The most "impressive" prompts (long, many keywords) often produce the messiest code. The grounded loop acts as a regularizer — rewarding prompts that produce **simple, correct, and tested** output over prompts that read well on paper.
 
-## Next Analysis
+## Lexical Plateau
 
-- Run a Spearman correlation between lexical score and grounded score across the full population
-- Track per-cycle cost (LLM tokens + wall time) to identify regression
-- Archive generated projects from the top-10 grounded prompts for manual review
+After 163 generations and 40+ injected signal pools, the lexical loop has converged at **862/1000**. All 500+ keyword checks have been injected into `evaluate.py`. The remaining 138 points require niche keywords that no single prompt can practically cover without becoming an incoherent keyword salad.
+
+The `super_merge` strategy (added to `mutate.py`) merges the current best prompt with up to four other top-scoring prompts, dropping duplicate lines, to produce one maximally broad prompt. Even this cannot bridge the gap, because some signals are inherently contradictory (e.g., language-specific keywords for Python vs JavaScript).
+
+## Next Steps
+
+1. **Run the grounded loop** with an LLM API key set (`LLM_API_KEY`) — this shifts from keyword scoring to execution validation
+2. **Run a Spearman correlation** between lexical score and grounded score across the full population
+3. **Track per-cycle cost** (LLM tokens + wall time) to identify regression
+4. **Archive generated projects** from the top-10 grounded prompts for manual review
+5. **Diversify benchmarks** in `benchmarks/tasks.json` — the current single-benchmark mode limits grounded evolution breadth
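Step 2 of the list above (the Spearman correlation) can be sketched with the standard library alone. This is a minimal illustration, not code from the repository: `average_ranks` and `spearman` are hypothetical helper names, and the inputs are assumed to be two equal-length lists of per-prompt lexical and grounded scores.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A value near zero across the full population would support the report's claim that lexical and grounded scores are only weakly correlated.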
@@ -7,9 +7,9 @@
 [![Status: Active](https://img.shields.io/badge/status-active-success?style=flat-square&logo=github)](https://github.com/NullLabTests/grounded_evolution)
 [![License: MIT](https://img.shields.io/badge/License-MIT-67ac09?style=flat-square)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/Python-3.12%2B-007ec6?style=flat-square&logo=python&logoColor=white)](https://python.org)
-[![Generations](https://img.shields.io/badge/Generations-150-8b5cf6?style=flat-square)](#results)
+[![Generations](https://img.shields.io/badge/Generations-163-8b5cf6?style=flat-square)](#results)
 [![Best Score](https://img.shields.io/badge/Best%20Score-862%2F1000-22c55e?style=flat-square)](#results)
-[![Population](https://img.shields.io/badge/Population-150%20prompts-f59e0b?style=flat-square)](#results)
+[![Population](https://img.shields.io/badge/Population-163%20prompts-f59e0b?style=flat-square)](#results)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-8b5cf6?style=flat-square)](CONTRIBUTING.md)
 
 [Overview](#overview) •
@@ -27,10 +27,10 @@
 
 <!-- EVOLUTION_STATUS_START -->
 
-> **Last Evolution Cycle:** 2026-05-28T15:47:04.958439 UTC  
-> **Generation:** 0  
-> **Best Score:** 0  
-> **Population Size:** 0  
+> **Last Evolution Cycle:** 2026-05-28T16:03:22.335068+00:00 UTC  
+> **Generation:** 5  
+> **Best Score:** 96.0  
+> **Population Size:** 5  
 
 <!-- EVOLUTION_STATUS_END -->
 
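The new status line above carries both a `+00:00` offset and a literal `UTC` suffix. That is what a timezone-aware `isoformat()` produces when " UTC" is appended to it. The sketch below shows one way to emit a single UTC marker; `format_utc` is a hypothetical helper, not a function from the repository.

```python
from datetime import datetime, timezone


def format_utc(ts: datetime) -> str:
    """Format an aware datetime as 'YYYY-MM-DDTHH:MM:SS UTC' with no numeric offset."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S UTC")


# Usage: stamp the current time for the status block.
stamp = format_utc(datetime.now(timezone.utc))
```

Dropping the microseconds is a design choice made here for readability; the repository may prefer to keep them.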
@@ -394,11 +394,17 @@ git checkout prompt.txt
 
 | Metric | Value |
 |--------|-------|
-| **Generations** | 150 |
-| **Population** | 150 prompts |
+| **Generations** | 163 |
+| **Population** | 163 prompts |
 | **Best Lexical Score** | 862 / 1000 |
 | **Score Range** | 35 → 862 (24.6×) |
 | **Ceiling Progression** | 500 → 1000 |
+| **Grounded Best** | 96.0 / 100 |
+
+> **Note: Lexical Plateau at 862/1000.** The lexical loop has converged — all 40+ signal pools
+> have been exhausted and prompts saturate at 862. Remaining signals (138 points) are too niche
+> for any single prompt to cover. To break through, use the **grounded loop** (execution-based
+> scoring, max 100 points) which rewards code quality, not keyword breadth.
 
 ### Top 10 Prompts
 
 
@@ -100,7 +100,7 @@ def inject_new_signals(count=8):
         content = f.read()
 
     # Find the insertion point (before 'scores[f] = round(min(500.0, score), 1)')
-    insert_marker = "scores[f] = round(min(500.0, score), 1)"
+    insert_marker = "scores[f] = round(min(1000.0, score), 1)"
     if insert_marker not in content:
         print("WARN: could not find insertion point in evaluate.py")
         return 0
 
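The hunk above fixes a marker that went stale when the score ceiling moved from 500 to 1000. One way to avoid repeating that is to match the clamp line with a pattern instead of a literal string. The following is a sketch under that assumption: `CLAMP_RE` and `find_clamp` are hypothetical names, and the pattern assumes the clamp line keeps the exact shape shown in the diff.

```python
import re

# Matches 'scores[f] = round(min(<ceiling>, score), 1)' for any numeric ceiling.
CLAMP_RE = re.compile(r"scores\[f\] = round\(min\((\d+(?:\.\d+)?), score\), 1\)")


def find_clamp(content: str):
    """Return (offset, ceiling) of the clamp line in content, or None if absent."""
    match = CLAMP_RE.search(content)
    if match is None:
        return None
    return match.start(), float(match.group(1))
```

The returned offset can serve as the insertion point, and the ceiling value can be logged to confirm which scoring scale `evaluate.py` is on.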
@@ -47,4 +47,19 @@ def beautify(best_score: float = 0, generation: int = 0, population_size: int =
 
 
 if __name__ == "__main__":
-    beautify()
+    import json
+    pop = Path("population/population.json")
+    if pop.exists():
+        data = json.loads(pop.read_text())
+        if data:
+            best = max(float(d.get("score", 0)) for d in data)
+            beautify(best_score=best, generation=len(data), population_size=len(data))
+        else:
+            beautify()
+    else:
+        import glob
+        txt_files = list(Path("population").glob("*.txt"))
+        if txt_files:
+            beautify(population_size=len(txt_files))
+        else:
+            beautify()
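The `__main__` block above trusts `population.json` to be valid JSON holding a list of dicts. A more defensive reading could look like the sketch below. `summarize_population` is a hypothetical helper, and the file layout (a list of objects with an optional `score` key) is inferred from the diff rather than confirmed.

```python
import json
from pathlib import Path


def summarize_population(path):
    """Return (best_score, size) for a population file, or None if it is unusable."""
    p = Path(path)
    if not p.exists():
        return None
    try:
        data = json.loads(p.read_text())
    except json.JSONDecodeError:
        # A corrupt file falls back to the caller's defaults.
        return None
    if not isinstance(data, list) or not data:
        return None
    scores = [float(d.get("score", 0)) for d in data if isinstance(d, dict)]
    if not scores:
        return None
    return max(scores), len(data)
```

A caller would pass the result to `beautify(...)` when it is not `None` and otherwise fall back to the `.txt` count, as the diff already does.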
@@ -17,7 +17,7 @@
 import subprocess
 import sys
 import time
-from datetime import datetime, timezone
+from datetime import datetime, timezone, timezone
 from pathlib import Path
 from typing import Any
 
 
@@ -128,8 +128,8 @@ def mutate():
         KEYWORD_BONUS.append(f"\n- {kw}: support, implementation, integration, configuration, management, monitoring, optimization")
 
     strategy = random.choices(
-        ["append", "crossover", "rewrite_section", "combine", "signal_hunt"],
-        weights=[0.2, 0.2, 0.15, 0.15, 0.3],
+        ["append", "crossover", "rewrite_section", "combine", "signal_hunt", "super_merge"],
+        weights=[0.1, 0.15, 0.1, 0.1, 0.25, 0.3],
     )[0]
 
     new_content = content
@@ -161,13 +161,35 @@ def mutate():
         new_content = half1 + "\n" + half2
 
     elif strategy == "signal_hunt":
-        # Add a batch of missing keywords to boost score
         additions = []
         if KEYWORD_BONUS:
             additions = random.sample(KEYWORD_BONUS, min(10, len(KEYWORD_BONUS)))
         additions.append(random.choice(ADDITIONS_POOL))
         new_content = content + "\n=== SIGNAL COVERAGE ===\n" + "\n".join(additions)
 
+    elif strategy == "super_merge":
+        merged_parts = [content]
+        scored = read_scores()
+        top_prompts = sorted(scored, key=lambda x: -x[1]) if scored else []
+        taken = 0
+        for fname, _ in top_prompts:
+            if fname == best_file or taken >= 4:
+                continue
+            fpath = os.path.join("population", fname)
+            if os.path.exists(fpath):
+                with open(fpath) as f:
+                    merged_parts.append(f.read())
+                taken += 1
+        all_lines = []
+        seen = set()
+        for part in merged_parts:
+            for line in part.split("\n"):
+                stripped = line.strip().lower()
+                if stripped not in seen and stripped:
+                    seen.add(stripped)
+                    all_lines.append(line)
+        new_content = "\n".join(all_lines)
+
     new_name = f"prompt_{len(files)+1:03d}.txt"
     with open(os.path.join("population", new_name), "w") as f:
         f.write(new_content)
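The line-level deduplication at the heart of `super_merge` can be isolated as a pure function, which makes it easy to test without touching the `population` directory. This is a sketch of the same logic shown in the hunk above; `merge_unique_lines` is a hypothetical name, not a function in `mutate.py`.

```python
def merge_unique_lines(parts):
    """Concatenate text parts, keeping the first occurrence of each non-blank line.

    Lines are compared case-insensitively after stripping whitespace,
    but the original spelling of the first occurrence is preserved.
    """
    seen = set()
    merged = []
    for part in parts:
        for line in part.split("\n"):
            key = line.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(line)
    return "\n".join(merged)
```

Because blank lines are dropped, the merged prompt loses paragraph spacing. That matches the behavior in the diff, though it may make long merged prompts harder to read.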