NullLabTests
diff --git a/.env.example b/.env.example
Lines changed: 10 additions & 16 deletions
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 46 additions & 5 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 55 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/analysis/plot_convergence.py b/analysis/plot_convergence.py
Lines changed: 4 additions & 5 deletions
@@ -1,20 +1,14 @@
-# LLM Provider Configuration
-# Pick one provider and set the corresponding variables.
+# LLM provider configuration for Grounded Evolution
+# The grounded loop (infinite_research_loop.py) requires LLM_API_KEY.
+# The lexical loop (eval.py / auto_evolve.py) does not use the LLM.
 
-# --- Mistral AI (default) ---
-LLM_API_KEY="your_mistral_api_key"
+# Required for grounded evolution
+LLM_API_KEY="your_api_key_here"
+
+# Model selection (default: mistral-large-latest)
 LLM_MODEL="mistral-large-latest"
-LLM_BASE_URL="https://api.mistral.ai/v1"
 
-# --- OpenAI ---
-# LLM_API_KEY="sk-..."
-# LLM_MODEL="gpt-4o"
+# API base URL: Mistral, OpenAI, or any OpenAI-compatible endpoint
+LLM_BASE_URL="https://api.mistral.ai/v1"
 # LLM_BASE_URL="https://api.openai.com/v1"
-
-# --- Ollama (local) ---
-# LLM_API_KEY="ollama"
-# LLM_MODEL="qwen2.5:7b"
-# LLM_BASE_URL="http://localhost:11434/v1"
-
-# --- Alternative: OPENAI_API_KEY (used by generator.py fallback) ---
-# OPENAI_API_KEY="sk-..."
+# LLM_BASE_URL="http://localhost:11434/v1"  # Ollama
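The three variables in the new `.env.example` could be consumed along the following lines. This is a minimal sketch, not the project's actual loader: `LLMConfig` and `load_llm_config` are hypothetical names, and the defaults simply mirror the values shown in the file above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical container for the three settings in .env.example."""

    api_key: str
    model: str
    base_url: str


def load_llm_config(env: Mapping[str, str] | None = None) -> LLMConfig:
    """Read LLM settings from a mapping (os.environ by default).

    Raises RuntimeError when LLM_API_KEY is missing or blank, since the
    grounded loop cannot run without it.
    """
    source = os.environ if env is None else env
    api_key = source.get("LLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLM_API_KEY is required for the grounded loop")
    return LLMConfig(
        api_key=api_key,
        # Defaults assumed from .env.example, not from the real code.
        model=source.get("LLM_MODEL", "mistral-large-latest"),
        base_url=source.get("LLM_BASE_URL", "https://api.mistral.ai/v1"),
    )
```

Accepting an explicit mapping keeps the function testable without modifying the process environment; calling `load_llm_config()` with no argument falls back to `os.environ`.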
@@ -1,11 +1,15 @@
-name: CI
+name: Validate
 
 on:
   push:
     branches: [main]
   pull_request:
     branches: [main]
 
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
 jobs:
   lint:
     runs-on: ubuntu-latest
@@ -15,14 +19,51 @@ jobs:
         with:
           python-version: "3.12"
       - run: pip install ruff
-      - run: ruff check . --ignore=E501,F821
+      - run: ruff check . --ignore=E501,F821 --statistics
+
+  imports:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install openai
+      - name: Check all modules load
+        run: |
+          for mod in generator population_manager mutation_engine; do
+            echo "=== $mod ==="
+            python -c "import $mod; print('OK')"
+          done
+          for mod in evaluator.runtime_evaluator; do
+            echo "=== $mod ==="
+            python -c "import $mod; print('OK')"
+          done
+      - name: Check scripts parse (no-exec)
+        run: |
+          for script in infinite_research_loop.py run_experiment.py beautify_readme.py \
+                        eval.py auto_evolve.py; do
+            echo "=== $script ==="
+            python -c "compile(open('$script').read(), '$script', 'exec'); print('syntax OK')"
+          done
 
-  evaluate:
+  experiment_design:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
           python-version: "3.12"
-      - name: Quick sanity check
-        run: python -c "from evaluate import evaluate; print('evaluate.py loads OK')"
+      - name: Validate benchmarks/tasks.json
+        run: |
+          python3 -c "
+          import json
+          with open('benchmarks/tasks.json') as f:
+              tasks = json.load(f)
+          assert len(tasks) >= 3
+          for t in tasks:
+              assert 'name' in t
+              assert 'hidden_test_files' in t
+              assert len(t['hidden_test_files']) >= 1
+          print(f'{len(tasks)} benchmarks OK')
+          "
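The inline check in the `experiment_design` job can also be written as a standalone function with explicit error messages, which is easier to run locally. This is a sketch under two assumptions: that `benchmarks/tasks.json` is a JSON list of objects, and that `hidden_test_files` maps file names to file contents. The names `validate_tasks`, `validate_tasks_file`, and `MIN_TASKS` are hypothetical and do not exist in the repository.

```python
import json
from pathlib import Path
from typing import Any

MIN_TASKS: int = 3  # mirrors the `len(tasks) >= 3` assertion in the workflow


def validate_tasks(tasks: list[dict[str, Any]]) -> int:
    """Apply the same checks as the CI step and return the number of tasks."""
    if len(tasks) < MIN_TASKS:
        raise ValueError(f"expected at least {MIN_TASKS} tasks, found {len(tasks)}")
    for index, task in enumerate(tasks):
        for key in ("name", "hidden_test_files"):
            if key not in task:
                raise ValueError(f"task {index} is missing '{key}'")
        if len(task["hidden_test_files"]) < 1:
            raise ValueError(f"task {task['name']!r} has no hidden test files")
    return len(tasks)


def validate_tasks_file(path: Path) -> int:
    """Load a tasks file from disk and validate its contents."""
    with path.open(encoding="utf-8") as handle:
        return validate_tasks(json.load(handle))
```

Raising `ValueError` with a message identifies the offending task, whereas a bare `assert` in the workflow only reports that some check failed.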
@@ -0,0 +1,55 @@
+# AGENTS.md — Grounded Evolution Conventions
+
+## Project Identity
+This is a **research platform** for execution-grounded prompt evolution.
+Framing: evolutionary software optimization, NOT AGI/sentience claims.
+Repository is public at `NullLabTests/grounded_evolution`.
+
+## Conventions
+
+### Code Style
+- Type hints on **all** function signatures and public variables
+- No comments unless absolutely necessary (the code should explain itself)
+- Max line length: loosely 120 (no hard enforcement, follow existing style)
+- Imports: stdlib, then blank line, then third-party, then blank line, then local
+  (stdlib comes first; no `isort`/`ruff` ordering enforced — be practical)
+- Use `Any` from `typing` for dynamic types; never leave generic type parameters unspecified (write `dict[str, Any]`, not bare `dict`)
+- Prefer `Path` from `pathlib` over `os.path`
+- File-level docstrings on every `.py` file
+
+### Project Structure
+- `generator.py` — LLM code generation, returns `(text, usage_dict)` tuple
+- `evaluator/runtime_evaluator.py` — execution-grounded validation (AST, pytest, hidden tests)
+- `mutation_engine.py` — prompt mutation/crossover operators
+- `mutation.py` / `evaluate.py` / `evolve_forever.py` / `auto_evolve.py` — lexical-only loop (legacy)
+- `population_manager.py` — JSON-based population persistence
+- `infinite_research_loop.py` — main grounded loop (calls generator → evaluator → population_manager)
+- `run_experiment.py` — orchestrated ablation experiments
+- `benchmarks/tasks.json` — 3 benchmark definitions with inline hidden test files
+- `experiments/` — all experiment output (logs, archives, ablation runs)
+
+### Two Loops
+1. **Lexical loop** (`evaluate.py`/`evolve_forever.py`): keyword-matching fitness. Currently at 218 prompts, best score 1000/1000. Less important now.
+2. **Grounded loop** (`infinite_research_loop.py`/`generator.py`/`runtime_evaluator.py`): real code execution fitness. This is the primary focus.
+
+### Environment Variables (never hardcode secrets)
+- `LLM_API_KEY` — required for grounded loop
+- `LLM_MODEL` — model name (default: `mistral-large-latest`)
+- `LLM_BASE_URL` — API base URL (default: `https://api.mistral.ai/v1`)
+
+### Testing
+- No test suite for the project itself yet (TODO for future)
+- Hidden benchmark tests live in `benchmarks/tasks.json` as `hidden_test_files` dict
+- Rust-based tests (`cargo test`) exist in the `generated_projects/` output (not our code)
+
+### Git
+- Auto-commits on score improvement from the grounded loop
+- Manual commits for structural changes (new features, refactors, docs)
+- Commit messages: concise, descriptive, no emoji
+
+### Adding New Features
+1. Check if the feature already exists (grep for related terms)
+2. Follow the existing pattern (if it's a mutation, add to `mutation_engine.py`)
+3. Type hints everywhere
+4. Add the new feature to `run_experiment.py` if it's an experimental variable
+5. Update EXPERIMENT_DESIGN.md if the experiment protocol changes
@@ -27,10 +27,10 @@
 
 <!-- EVOLUTION_STATUS_START -->
 
-> **Last Evolution Cycle:** 2026-05-28T17:27:44.326471+00:00 UTC  
-> **Generation:** 46  
+> **Last Evolution Cycle:** 2026-05-28T17:39:51.337015+00:00 UTC  
+> **Generation:** 50  
 > **Best Score:** 96.0  
-> **Population Size:** 46  
+> **Population Size:** 50  
 
 <!-- EVOLUTION_STATUS_END -->
 
 
@@ -188,14 +188,13 @@ def plot_ablation_convergence(conditions: dict[str, list[dict[str, Any]]]) -> No
 
 def main() -> None:
     """Main entry point."""
+    global ROLLING_WINDOW
     use_ablation: bool = "--ablation" in sys.argv
-    rolling_window: int = ROLLING_WINDOW
+    window: int = ROLLING_WINDOW
     for arg in sys.argv:
         if arg.startswith("--rolling="):
-            rolling_window = int(arg.split("=")[1])
-
-    global ROLLING_WINDOW
-    ROLLING_WINDOW = rolling_window
+            window = int(arg.split("=")[1])
+    ROLLING_WINDOW = window
 
     if use_ablation:
         conditions = load_ablation_runs()
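The hunk above moves `global ROLLING_WINDOW` to the top of `main()` because Python raises a `SyntaxError` when a name is used in a function body before its `global` declaration. One way to avoid mutating a module-level name at all is to parse the flag in a pure function and pass the result on explicitly. The sketch below assumes a default of 10, which the diff does not show; `parse_rolling_window` and `DEFAULT_ROLLING_WINDOW` are hypothetical names.

```python
DEFAULT_ROLLING_WINDOW: int = 10  # assumed default; the real value is not shown in the diff


def parse_rolling_window(argv: list[str], default: int = DEFAULT_ROLLING_WINDOW) -> int:
    """Return the value of the last `--rolling=N` flag in argv, or `default`."""
    window: int = default
    for arg in argv:
        if arg.startswith("--rolling="):
            # Split once so only the first '=' separates flag from value.
            window = int(arg.split("=", 1)[1])
    return window
```

A caller could then write `window = parse_rolling_window(sys.argv[1:])` and hand `window` to the plotting functions as an argument.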