NullLabTests
diff --git a/.gitignore b/.gitignore
Lines changed: 6 additions & 0 deletions
diff --git a/EXPERIMENT_DESIGN.md b/EXPERIMENT_DESIGN.md
Lines changed: 96 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/analysis/__init__.py b/analysis/__init__.py
diff --git a/analysis/correlation.py b/analysis/correlation.py
Lines changed: 155 additions & 0 deletions
diff --git a/benchmarks/tasks.json b/benchmarks/tasks.json
Lines changed: 16 additions & 52 deletions
@@ -45,6 +45,12 @@ runtime_logs/*
 memory/*
 !memory/.gitkeep
 
+# Experiments
+experiments/projects/*
+!experiments/projects/.gitkeep
+# Track run_log.jsonl (it's structured experiment data)
+!experiments/run_log.jsonl
+
 # Environment
 .env
 *.key
@@ -0,0 +1,96 @@
+# Experiment Design
+
+This document describes the experimental framework for studying execution-grounded
+prompt evolution. The goal is to move from a research prototype to a reproducible
+experiment with publishable results.
+
+## Research Questions
+
+1. **Does execution-grounded scoring select for different prompts than lexical scoring?**
+   - Hypothesis: Lexical and grounded scores are weakly correlated (Spearman |ρ| < 0.3).
+   - A prompt that "reads well" (mentions many keywords) does not necessarily produce
+     good code. The grounded loop should surface prompts that produce correct, testable,
+     well-structured output — even if they are shorter and less keyword-dense.
+
+2. **Does grounded evolution improve generated code quality across generations?**
+   - Hypothesis: Grounded scores increase over successive evolution cycles,
+     with diminishing returns after 50–100 cycles per benchmark.
+
+3. **Which mutation operator contributes most to grounded score improvement?**
+   - Ablation: compare the full configuration against mutation only, crossover only, and both disabled (random walk).
+   - Compare convergence rates and final scores.
+
+## Benchmarks
+
+Three benchmarks, each with real hidden test code that validates behavior:
+
+| Benchmark | Domain | Hidden tests | Max score |
+|-----------|--------|-------------|-----------|
+| `cli_task_manager` | CLI tool with argparse | Add/list/complete/delete tasks | 100 |
+| `async_web_scraper` | Async HTTP with rate limiting | Concurrent fetch, error handling, rate limit | 100 |
+| `data_pipeline` | CSV→transform→JSON pipeline | Read, filter, aggregate, write | 100 |
+
+Hidden test files live inside `benchmarks/tasks.json` as `hidden_test_files`.
+They are copied into the generated project directory before pytest runs.
+
+## Metrics
+
+Per-cycle logging (JSONL at `experiments/run_log.jsonl`):
+
+| Field | Source | Description |
+|-------|--------|-------------|
+| `score` | runtime_evaluator | Composite execution score (0–100) |
+| `syntax_valid` | AST parser | Whether all .py files parse |
+| `pytest_pass` | pytest | Whether all tests pass |
+| `hidden_tests_pass` | hidden tests | Whether benchmark behavioral tests pass |
+| `mutation` | mutation_engine | Which operator was applied |
+| `llm_total_tokens` | generator | Total LLM tokens consumed |
+| `llm_duration_s` | generator | Wall time for LLM generation |
+| `cycle_duration_s` | timed | Total cycle wall time |
+| `ast_nodes` | AST counter | Code complexity proxy |
+| `functions` / `classes` | AST counter | Code structure proxy |
+
+## Archiving
+
+Every generated project is archived to `experiments/projects/{benchmark}/cycle_{n:04d}/`
+with full source code and a `metadata.json`. This enables manual review of top-10
+vs bottom-10 outputs.
+
+## Ablation Studies
+
+Configure by setting `ABLATION` flags in `infinite_research_loop.py`:
+
+```python
+ABLATION = {
+    "mutation": True,      # toggle prompt mutation
+    "crossover": True,     # toggle prompt crossover
+    "mutation_rate": 0.7,  # probability of mutation when both enabled
+}
+```
+
+Four conditions, each run for 100 cycles per benchmark:
+
+| Condition | mutation | crossover | Expectation |
+|-----------|----------|-----------|-------------|
+| Full | True | True | Best convergence (baseline) |
+| Mutation only | True | False | Tests crossover contribution |
+| Crossover only | False | True | Tests mutation contribution |
+| Random walk | False | False | No-op baseline (chance improvement) |
+
+## Procedure
+
+1. Run 100 grounded cycles per benchmark (full config) → ~300 cycles total
+2. Run 100 grounded cycles per benchmark for each of the three remaining ablation conditions → ~900 more cycles (~1200 cumulative)
+3. Run Spearman correlation (`python analysis/correlation.py --refresh`)
+4. Manually review top-10 and bottom-10 generated projects from each condition
+5. Publish results with score distributions, convergence curves, and qualitative analysis
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `experiments/run_log.jsonl` | Per-cycle structured log |
+| `experiments/projects/` | Archived generated projects |
+| `analysis/correlation.py` | Spearman ρ computation |
+| `benchmarks/tasks.json` | Benchmark + hidden test definitions |
+| `infinite_research_loop.py` | Experiment runner (main entry point) |
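
Review note on the `ABLATION` block in `EXPERIMENT_DESIGN.md`: this diff does not show how `infinite_research_loop.py` consumes the flags, so the following is only a minimal sketch of one plausible reading, in which `mutation_rate` is the probability of choosing mutation when both operators are enabled. The function name `choose_operator`, the `CONDITIONS` mapping, and the returned labels are hypothetical and not taken from the repository.

```python
import random


def choose_operator(ablation: dict, rng: random.Random) -> str:
    """Pick the operator for one cycle under the given ablation flags.

    Hypothetical helper: the labels "mutation", "crossover" and "none"
    are illustrative, not names from the repository.
    """
    use_mutation = bool(ablation.get("mutation", True))
    use_crossover = bool(ablation.get("crossover", True))
    if use_mutation and use_crossover:
        # mutation_rate only matters when both operators are enabled.
        rate = float(ablation.get("mutation_rate", 0.7))
        return "mutation" if rng.random() < rate else "crossover"
    if use_mutation:
        return "mutation"
    if use_crossover:
        return "crossover"
    return "none"  # "Random walk" row: neither operator is applied


# The four conditions from the ablation table, keyed by an illustrative name.
CONDITIONS = {
    "full": {"mutation": True, "crossover": True, "mutation_rate": 0.7},
    "mutation_only": {"mutation": True, "crossover": False},
    "crossover_only": {"mutation": False, "crossover": True},
    "random_walk": {"mutation": False, "crossover": False},
}
```

Passing a seeded `random.Random` keeps the operator sequence reproducible across reruns of the same condition, which matters if convergence curves are to be compared between conditions.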
@@ -27,10 +27,10 @@
 
 <!-- EVOLUTION_STATUS_START -->
 
-> **Last Evolution Cycle:** 2026-05-28T16:25:12.375152+00:00 UTC  
-> **Generation:** 15  
+> **Last Evolution Cycle:** 2026-05-28T16:26:52.331163+00:00 UTC  
+> **Generation:** 16  
 > **Best Score:** 96.0  
-> **Population Size:** 15  
+> **Population Size:** 16  
 
 <!-- EVOLUTION_STATUS_END -->
 
 
@@ -0,0 +1,155 @@
+"""Spearman correlation analysis: lexical vs grounded scores.
+
+Usage:
+    python analysis/correlation.py                  # Use cached grounded scores
+    python analysis/correlation.py --refresh        # Re-run grounded eval for all prompts
+
+Requires LLM_API_KEY to be set for the --refresh mode.
+"""
+
+import json
+import subprocess
+import sys
+import os
+from pathlib import Path
+from typing import Any
+
+
+POPULATION_DIR: Path = Path("population")
+POPULATION_JSON: Path = Path("population/population.json")
+RESULTS_LOG: Path = Path("results.log")
+GROUNDED_CACHE: Path = Path("analysis/grounded_scores_cache.json")
+
+
+def load_lexical_scores() -> dict[str, float]:
+    """Load lexical scores from results.log (lexical evaluation output)."""
+    scores: dict[str, float] = {}
+    if not RESULTS_LOG.exists():
+        print("WARN: results.log not found — run evaluate.py first")
+        return scores
+    import re
+    for line in RESULTS_LOG.read_text().splitlines():
+        m = re.match(r"(\S+):\s*([\d.]+)", line)
+        if m:
+            scores[m.group(1)] = float(m.group(2))
+    return scores
+
+
+def load_grounded_scores() -> dict[str, float]:
+    """Load previously cached grounded scores."""
+    if GROUNDED_CACHE.exists():
+        return json.loads(GROUNDED_CACHE.read_text())
+    return {}
+
+
+def save_grounded_scores(scores: dict[str, float]) -> None:
+    """Cache grounded scores to disk."""
+    GROUNDED_CACHE.parent.mkdir(exist_ok=True)
+    GROUNDED_CACHE.write_text(json.dumps(scores, indent=2))
+
+
+def evaluate_grounded(prompt_file: str) -> float | None:
+    """Run a single prompt through the grounded evaluator.
+
+    This requires a valid LLM_API_KEY. The prompt is sent to the LLM,
+    the generated code is evaluated, and the execution score is returned.
+    """
+    prompt_path: Path = POPULATION_DIR / prompt_file
+    if not prompt_path.exists():
+        return None
+    prompt_text: str = prompt_path.read_text()
+    try:
+        from generator import generate_code, write_project_files
+        from evaluator.runtime_evaluator import evaluate_project
+        import tempfile
+
+        generated_text: str
+        usage: dict[str, Any]
+        generated_text, usage = generate_code(prompt_text, temperature=0.3)
+        # Use a context manager so the scratch project is removed after scoring
+        # instead of leaving one temp directory behind per prompt.
+        with tempfile.TemporaryDirectory(prefix="grounded_eval_") as tmpdir:
+            write_project_files(tmpdir, generated_text)
+            metrics: dict[str, Any] = evaluate_project(tmpdir)
+        return metrics.get("final_score", 0.0)
+    except Exception as e:
+        print(f"  ERROR evaluating {prompt_file}: {e}")
+        return None
+
+
+def compute_spearman(x: list[float], y: list[float]) -> float:
+    """Compute Spearman rank correlation coefficient between two lists."""
+    n: int = len(x)
+    if n < 3:
+        return 0.0
+
+    from scipy.stats import spearmanr
+    # Unpack as a tuple so this does not depend on the result's attribute name.
+    rho, _pvalue = spearmanr(x, y)
+    return float(rho)
+
+
+def main() -> None:
+    """Compare lexical vs grounded scores across the population."""
+    lexical: dict[str, float] = load_lexical_scores()
+    if not lexical:
+        print("No lexical scores found. Run evaluate.py first.")
+        sys.exit(1)
+
+    grounded: dict[str, float] = load_grounded_scores()
+    refresh: bool = "--refresh" in sys.argv
+
+    if refresh:
+        grounded = {}
+        print("Refreshing all grounded scores (this may take a while)...")
+        prompt_files: list[str] = sorted(f.name for f in POPULATION_DIR.glob("*.txt"))
+        for i, pf in enumerate(prompt_files):
+            if pf in lexical:
+                score = evaluate_grounded(pf)
+                if score is not None:
+                    grounded[pf] = score
+                print(f"  [{i+1}/{len(prompt_files)}] {pf}: {grounded.get(pf, 'FAIL')}")
+        save_grounded_scores(grounded)
+
+    # Find prompts that have both lexical and grounded scores
+    common: list[str] = sorted(set(lexical.keys()) & set(grounded.keys()))
+    if not common:
+        print("No prompts have both lexical and grounded scores.")
+        print("Run with --refresh to compute grounded scores; they are cached in")
+        print(f"{GROUNDED_CACHE} for later runs.")
+        sys.exit(1)
+
+    lex_vals: list[float] = [lexical[p] for p in common]
+    grd_vals: list[float] = [grounded[p] for p in common]
+
+    print(f"\n{'='*60}")
+    print(f"CORRELATION ANALYSIS: {len(common)} prompts")
+    print(f"{'='*60}")
+    print(f"{'Prompt':<25} {'Lexical':>8} {'Grounded':>10}")
+    print(f"{'-'*25} {'-'*8} {'-'*10}")
+    for p in common[:20]:
+        print(f"{p:<25} {lexical[p]:>8.1f} {grounded[p]:>10.1f}")
+
+    if len(common) > 20:
+        print(f"  ... and {len(common) - 20} more")
+
+    print(f"\nLexical  — min: {min(lex_vals):.1f}, max: {max(lex_vals):.1f}, mean: {sum(lex_vals)/len(lex_vals):.1f}")
+    print(f"Grounded — min: {min(grd_vals):.1f}, max: {max(grd_vals):.1f}, mean: {sum(grd_vals)/len(grd_vals):.1f}")
+
+    if len(common) < 3:
+        # compute_spearman returns 0.0 below 3 pairs, which would read as a real result.
+        print("\nToo few paired prompts (< 3) to compute Spearman's ρ.")
+        return
+
+    try:
+        rho: float = compute_spearman(lex_vals, grd_vals)
+    except ImportError:
+        print("\nInstall scipy for Spearman correlation: pip install scipy")
+        print(f"Raw pair count: {len(common)}")
+        return
+
+    print(f"\nSpearman's ρ = {rho:.4f}")
+    if rho != rho:  # NaN, e.g. when one score list is constant
+        print("Interpretation: Undefined; at least one score list has no variance.")
+    elif abs(rho) < 0.2:
+        print("Interpretation: Weak or no correlation; lexical and grounded scores")
+        print("appear to measure different things. This is consistent with the")
+        print("grounded loop capturing signal the lexical loop misses.")
+    elif abs(rho) < 0.6:
+        print("Interpretation: Moderate correlation; prompts that score well")
+        print("lexically tend to produce better code, but with significant noise.")
+    else:
+        print("Interpretation: Strong correlation; lexical scoring may be a")
+        print("sufficient proxy for execution quality in this domain.")
+
+
+if __name__ == "__main__":
+    main()
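
Review note on `compute_spearman`: the script relies on `scipy.stats.spearmanr`. For cross-checking its output, or for environments without SciPy, a dependency-free version can be sketched as the Pearson correlation of average ranks. This is an illustrative alternative, not part of the PR; the names `average_ranks` and `spearman_rho` are hypothetical.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho as the Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        return float("nan")  # not enough pairs to define a correlation
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one list is constant
    return cov / (var_x * var_y) ** 0.5
```

Returning NaN for degenerate inputs, rather than 0.0, keeps "no data" distinguishable from "no correlation" when the result is interpreted downstream.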
@@ -1,62 +1,26 @@
 [
   {
-    "name": "flask_api",
-    "prompt": "Create a modular Flask REST API with health endpoint and clean structure",
-    "requirements": ["flask", "pytest"],
-    "hidden_tests": ["from flask import Flask", "def test_health_endpoint"]
-  },
-  {
-    "name": "cli_tool",
-    "prompt": "Create a Python CLI task manager using argparse",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import argparse", "if __name__ == '__main__'"]
-  },
-  {
-    "name": "websocket_server",
-    "prompt": "Create an async websocket echo server in Python",
-    "requirements": ["pytest", "pytest-asyncio"],
-    "hidden_tests": ["import asyncio", "import websockets"]
-  },
-  {
-    "name": "data_processor",
-    "prompt": "Create a Python data processing pipeline that reads CSV, transforms data, and outputs JSON",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import csv", "import json"]
-  },
-  {
-    "name": "math_library",
-    "prompt": "Create a Python math utility library with functions for statistics, algebra, and geometry",
+    "name": "cli_task_manager",
+    "prompt": "Create a Python CLI task manager using argparse. Commands: add TASK, list, complete ID, delete ID. Tasks persist to a JSON file called tasks.json. The main entry point must be in main.py with if __name__ == '__main__' block. Each task has id, description, done fields.",
     "requirements": ["pytest"],
-    "hidden_tests": ["def test_", "import math"]
+    "hidden_test_files": {
+      "test_hidden_cli.py": "import subprocess\nimport json\nimport os\n\n\ndef setup_module():\n    if os.path.exists('tasks.json'):\n        os.remove('tasks.json')\n\n\ndef teardown_module():\n    if os.path.exists('tasks.json'):\n        os.remove('tasks.json')\n\n\ndef test_add_creates_task():\n    result = subprocess.run(['python', 'main.py', 'add', 'test task'], capture_output=True, text=True)\n    assert result.returncode == 0, f'add failed: {result.stderr}'\n    assert os.path.exists('tasks.json'), 'tasks.json not created'\n    with open('tasks.json') as f:\n        tasks = json.load(f)\n    assert len(tasks) == 1\n    assert tasks[0]['description'] == 'test task'\n\n\ndef test_list_shows_tasks():\n    subprocess.run(['python', 'main.py', 'add', 'visible task'], capture_output=True)\n    result = subprocess.run(['python', 'main.py', 'list'], capture_output=True, text=True)\n    assert result.returncode == 0\n    assert 'visible task' in result.stdout\n\n\ndef test_complete_marks_done():\n    subprocess.run(['python', 'main.py', 'add', 'finish me'], capture_output=True)\n    result = subprocess.run(['python', 'main.py', 'complete', '1'], capture_output=True, text=True)\n    assert result.returncode == 0\n    with open('tasks.json') as f:\n        tasks = json.load(f)\n    assert any(t['id'] == 1 and t.get('done') for t in tasks)\n\n\ndef test_delete_removes_task():\n    subprocess.run(['python', 'main.py', 'add', 'delete me'], capture_output=True)\n    result = subprocess.run(['python', 'main.py', 'delete', '1'], capture_output=True, text=True)\n    assert result.returncode == 0\n    with open('tasks.json') as f:\n        tasks = json.load(f)\n    assert not any(t['id'] == 1 for t in tasks)\n"
+    }
   },
   {
-    "name": "file_indexer",
-    "prompt": "Create a Python file indexer that walks directories, computes file hashes, and outputs a JSON index",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import hashlib", "os.walk"]
-  },
-  {
-    "name": "config_parser",
-    "prompt": "Create a Python configuration parser that reads YAML/JSON/INI formats and validates schemas",
-    "requirements": ["pytest", "pyyaml"],
-    "hidden_tests": ["import yaml", "import json"]
-  },
-  {
-    "name": "rate_limiter",
-    "prompt": "Create a Python rate limiter implementation with token bucket or sliding window algorithm",
-    "requirements": ["pytest"],
-    "hidden_tests": ["class RateLimiter", "def test_rate_limit"]
-  },
-  {
-    "name": "caching_layer",
-    "prompt": "Create a Python caching library with TTL support, LRU eviction, and decorator interface",
-    "requirements": ["pytest"],
-    "hidden_tests": ["functools.lru_cache", "class Cache"]
+    "name": "async_web_scraper",
+    "prompt": "Create an async Python web scraper in main.py. Must have an async function fetch_url(url, session) that fetches a single URL and returns the response text. Must have an async function scrape_all(urls) that fetches multiple URLs concurrently using asyncio.gather or similar, limits to max 3 concurrent requests via asyncio.Semaphore, retries failed requests once, and returns a list of (url, success, text_or_error) tuples. Must have a main() entry point that accepts a comma-separated list of URLs from sys.argv and prints results as JSON: [{url, success, length, error}]. Include proper error handling for network failures and timeouts.",
+    "requirements": ["pytest", "pytest-asyncio", "aiohttp"],
+    "hidden_test_files": {
+      "test_hidden_scraper.py": "import subprocess\nimport json\nimport sys\n\n\ndef test_scraper_imports():\n    import importlib.util\n    spec = importlib.util.find_spec('main')\n    assert spec is not None, 'main.py not importable'\n\n\ndef test_main_entry_point():\n    result = subprocess.run(\n        [sys.executable, 'main.py', 'https://example.com'],\n        capture_output=True, text=True, timeout=15\n    )\n    assert result.returncode == 0, f'main failed: {result.stderr}'\n    output = json.loads(result.stdout)\n    assert isinstance(output, list)\n    assert len(output) == 1\n    entry = output[0]\n    assert 'url' in entry\n    assert 'success' in entry\n    assert 'length' in entry\n\n\ndef test_scraper_handles_bad_url():\n    result = subprocess.run(\n        [sys.executable, 'main.py', 'https://nonexistent.example.test'],\n        capture_output=True, text=True, timeout=15\n    )\n    output = json.loads(result.stdout)\n    entry = output[0]\n    assert not entry['success'], 'should fail for bad URL'\n    assert 'error' in entry\n\n\ndef test_scraper_concurrent_flag():\n    source = open('main.py').read()\n    assert 'asyncio.gather' in source or 'asyncio.wait' in source or 'asyncio.TaskGroup' in source, 'no concurrent fetching'\n    assert 'asyncio.Semaphore' in source or 'asyncio.BoundedSemaphore' in source, 'no rate limiting'\n"
+    }
   },
   {
-    "name": "log_analyzer",
-    "prompt": "Create a Python log file analyzer that parses timestamps, counts levels, and generates reports",
+    "name": "data_pipeline",
+    "prompt": "Create a Python data pipeline in main.py. Must have a function read_csv(path) that reads a comma-delimited CSV file with a header row and returns a list of dicts. Must have a function filter_data(data, key, value) that filters rows where data[key] == value. Must have a function aggregate(data, group_key, agg_key, agg_func) that groups by group_key and applies agg_func ('sum', 'avg', 'count') to agg_key. Must have a function write_json(data, path) that writes the processed data as JSON. Must have a main() that accepts input CSV path from sys.argv, reads, filters where status=active, aggregates by department summing salary, and writes to {input}_report.json. Include error handling for missing files.",
     "requirements": ["pytest"],
-    "hidden_tests": ["import re", "def test_parse"]
+    "hidden_test_files": {
+      "test_hidden_pipeline.py": "import subprocess\nimport json\nimport os\nimport tempfile\n\n\ndef test_read_csv():\n    from main import read_csv\n    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:\n        f.write('name,age,city\\nAlice,30,NYC\\nBob,25,LA\\n')\n        path = f.name\n    data = read_csv(path)\n    os.unlink(path)\n    assert len(data) == 2\n    assert data[0]['name'] == 'Alice'\n\n\ndef test_filter_data():\n    from main import filter_data\n    data = [{'name': 'Alice', 'status': 'active'}, {'name': 'Bob', 'status': 'inactive'}]\n    result = filter_data(data, 'status', 'active')\n    assert len(result) == 1\n    assert result[0]['name'] == 'Alice'\n\n\ndef test_aggregate_sum():\n    from main import aggregate\n    data = [{'dept': 'eng', 'salary': 100}, {'dept': 'eng', 'salary': 200}, {'dept': 'hr', 'salary': 150}]\n    result = aggregate(data, 'dept', 'salary', 'sum')\n    assert result['eng'] == 300\n    assert result['hr'] == 150\n\n\ndef test_aggregate_avg():\n    from main import aggregate\n    data = [{'dept': 'eng', 'salary': 100}, {'dept': 'eng', 'salary': 200}]\n    result = aggregate(data, 'dept', 'salary', 'avg')\n    assert result['eng'] == 150\n\n\ndef test_main_pipeline():\n    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:\n        f.write('name,department,salary,status\\nAlice,Engineering,100000,active\\nBob,Engineering,120000,active\\nCharlie,HR,80000,inactive\\n')\n        input_path = f.name\n    output_path = input_path.replace('.csv', '_report.json')\n    result = subprocess.run(['python', 'main.py', input_path], capture_output=True, text=True, timeout=10)\n    os.unlink(input_path)\n    assert result.returncode == 0, f'pipeline failed: {result.stderr}'\n    assert os.path.exists(output_path), f'report not created at {output_path}'\n    with open(output_path) as f:\n        report = json.load(f)\n    os.unlink(output_path)\n    assert isinstance(report, dict)\n"
+    }
   }
 ]