Commit 202c905

Experiment infrastructure: 3 diverse benchmarks with hidden tests, token tracking, project archiving, ablation config, Spearman correlation script, experiment design doc
1 parent 8eb3a38 commit 202c905

12 files changed

Lines changed: 481 additions & 276 deletions

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -45,6 +45,12 @@ runtime_logs/*
 memory/*
 !memory/.gitkeep

+# Experiments
+experiments/projects/*
+!experiments/projects/.gitkeep
+# Track run_log.jsonl (it's structured experiment data)
+!experiments/run_log.jsonl
+
 # Environment
 .env
 *.key

EXPERIMENT_DESIGN.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Experiment Design

This document describes the experimental framework for studying execution-grounded
prompt evolution. The goal is to move from a research prototype to a reproducible
experiment with publishable results.

## Research Questions

1. **Does execution-grounded scoring select for different prompts than lexical scoring?**
   - Hypothesis: Lexical and grounded scores are weakly correlated (Spearman |ρ| < 0.3).
   - A prompt that "reads well" (mentions many keywords) does not necessarily produce
     good code. The grounded loop should surface prompts that produce correct, testable,
     well-structured output, even if they are shorter and less keyword-dense.

2. **Does grounded evolution improve generated code quality across generations?**
   - Hypothesis: Grounded scores increase over successive evolution cycles,
     with diminishing returns after 50–100 cycles per benchmark.

3. **Which evolution operator (mutation or crossover) contributes most to grounded score improvement?**
   - Ablation: run with mutation only, crossover only, and both disabled (random walk),
     and compare each against the full configuration.
   - Compare convergence rates and final scores.

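The hypothesis in question 1 can be illustrated on toy data. The following is a minimal sketch of Spearman's rank correlation in pure Python, assuming no tied scores. The helper names `ranks` and `spearman_rho` and the five score pairs are made up for illustration and are not part of the repository.

```python
def ranks(values):
    """Return 1-based ranks of the values; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for five prompts (illustrative values only).
lexical = [90.0, 75.0, 60.0, 45.0, 30.0]
grounded = [40.0, 85.0, 55.0, 70.0, 20.0]

rho = spearman_rho(lexical, grounded)
print(f"rho = {rho:.2f}")  # rho = 0.30
```

With these made-up scores the top lexical prompt ranks only fourth on the grounded score, which pulls ρ down to the edge of the hypothesized range.
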
## Benchmarks

Three benchmarks, each with real hidden test code that validates behavior:

| Benchmark | Domain | Hidden tests | Max score |
|-----------|--------|-------------|-----------|
| `cli_task_manager` | CLI tool with argparse | Add/list/complete/delete tasks | 100 |
| `async_web_scraper` | Async HTTP with rate limiting | Concurrent fetch, error handling, rate limit | 100 |
| `data_pipeline` | CSV→transform→JSON pipeline | Read, filter, aggregate, write | 100 |

Hidden test files live inside `benchmarks/tasks.json` as `hidden_test_files`.
They are copied into the generated project directory before pytest runs.

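The copy step can be sketched as follows. This is an illustration only: `install_hidden_tests` is a hypothetical helper, and the toy task merely mirrors the shape of an entry in `benchmarks/tasks.json` (a mapping from file name to test source).

```python
import tempfile
from pathlib import Path


def install_hidden_tests(task: dict, project_dir: Path) -> list[Path]:
    """Write each entry of task['hidden_test_files'] into project_dir."""
    written = []
    for filename, source in task.get("hidden_test_files", {}).items():
        # Path(filename).name keeps the file inside project_dir even if the
        # entry contains directory components.
        target = project_dir / Path(filename).name
        target.write_text(source, encoding="utf-8")
        written.append(target)
    return written


# Toy task in the same shape as a benchmark entry (illustrative only).
task = {
    "name": "demo",
    "hidden_test_files": {
        "test_hidden_demo.py": "def test_ok():\n    assert True\n",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    paths = install_hidden_tests(task, Path(tmp))
    names = [p.name for p in paths]

print(names)  # ['test_hidden_demo.py']
```

After this step, running pytest with the project directory as the working directory would collect the hidden tests alongside any tests the generated project ships with.
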
## Metrics

Per-cycle logging (JSONL at `experiments/run_log.jsonl`):

| Field | Source | Description |
|-------|--------|-------------|
| `score` | runtime_evaluator | Composite execution score (0–100) |
| `syntax_valid` | AST parser | Whether all .py files parse |
| `pytest_pass` | pytest | Whether all tests pass |
| `hidden_tests_pass` | hidden tests | Whether benchmark behavioral tests pass |
| `mutation` | mutation_engine | Which operator was applied |
| `llm_total_tokens` | generator | Total LLM tokens consumed |
| `llm_duration_s` | generator | Wall time for LLM generation |
| `cycle_duration_s` | timed | Total cycle wall time |
| `ast_nodes` | AST counter | Code complexity proxy |
| `functions` / `classes` | AST counter | Code structure proxy |

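A JSONL log stores one JSON object per line, so records can be appended without rewriting the file. The sketch below shows one way to write and read such a log. The helpers `append_record` and `read_records`, the `cycle` field, and all values are assumptions for illustration; only the other field names come from the table above.

```python
import json
import tempfile
from pathlib import Path


def append_record(log_path: Path, record: dict) -> None:
    """Append one cycle record as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_records(log_path: Path) -> list[dict]:
    """Read all records back, skipping blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "experiments" / "run_log.jsonl"
    # "cycle" is a hypothetical field; the values are made up.
    append_record(log, {"cycle": 1, "score": 62.0, "syntax_valid": True,
                        "pytest_pass": False, "hidden_tests_pass": False})
    append_record(log, {"cycle": 2, "score": 80.0, "syntax_valid": True,
                        "pytest_pass": True, "hidden_tests_pass": True})
    records = read_records(log)

best = max(records, key=lambda r: r["score"])
print(best["cycle"], best["score"])  # 2 80.0
```
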
## Archiving

Every generated project is archived to `experiments/projects/{benchmark}/cycle_{n:04d}/`
with full source code and a `metadata.json`. This enables manual review of top-10
vs bottom-10 outputs.

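The archive layout can be sketched with the standard library. This is a minimal illustration, assuming the project is a plain directory; `archive_project` is a hypothetical helper and the metadata content is made up.

```python
import json
import shutil
import tempfile
from pathlib import Path


def archive_project(src_dir: Path, archive_root: Path, benchmark: str,
                    cycle: int, metadata: dict) -> Path:
    """Copy src_dir to <archive_root>/<benchmark>/cycle_NNNN/ and add metadata.json."""
    dest = archive_root / benchmark / f"cycle_{cycle:04d}"
    # copytree creates missing parents and raises FileExistsError if the
    # destination already exists, so a cycle is never silently overwritten.
    shutil.copytree(src_dir, dest)
    (dest / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return dest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    project = root / "generated"
    project.mkdir()
    (project / "main.py").write_text("print('hello')\n", encoding="utf-8")

    dest = archive_project(project, root / "experiments" / "projects",
                           "cli_task_manager", 7, {"score": 81.5})
    relative = dest.relative_to(root).as_posix()
    archived = sorted(p.name for p in dest.iterdir())

print(relative)  # experiments/projects/cli_task_manager/cycle_0007
print(archived)  # ['main.py', 'metadata.json']
```

The zero-padded cycle number keeps directory listings in cycle order, which helps when reviewing outputs by hand.
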
## Ablation Studies

Configure by setting `ABLATION` flags in `infinite_research_loop.py`:

```python
ABLATION = {
    "mutation": True,       # toggle prompt mutation
    "crossover": True,      # toggle prompt crossover
    "mutation_rate": 0.7,   # probability of mutation when both enabled
}
```

Four conditions, each run for 100 cycles per benchmark:

| Condition | mutation | crossover | Expectation |
|-----------|----------|-----------|-------------|
| Full | True | True | Best convergence (baseline) |
| Mutation only | True | False | Tests crossover contribution |
| Crossover only | False | True | Tests mutation contribution |
| Random walk | False | False | No-op baseline (chance improvement) |

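One plausible reading of these flags is sketched below. How `infinite_research_loop.py` actually interprets `ABLATION` is not shown in this commit, so `choose_operator` and its return values are assumptions for illustration only.

```python
import random


def choose_operator(ablation: dict, rng: random.Random) -> str:
    """Pick the operator for one cycle from the ablation flags (hypothetical logic)."""
    mutation = ablation.get("mutation", False)
    crossover = ablation.get("crossover", False)
    if mutation and crossover:
        # With both enabled, mutation is chosen with probability mutation_rate.
        rate = ablation.get("mutation_rate", 0.5)
        return "mutation" if rng.random() < rate else "crossover"
    if mutation:
        return "mutation"
    if crossover:
        return "crossover"
    return "none"  # random-walk condition: neither operator is applied


rng = random.Random(0)  # seeded so the sketch is repeatable
full = {"mutation": True, "crossover": True, "mutation_rate": 0.7}
draws = [choose_operator(full, rng) for _ in range(1000)]
share = draws.count("mutation") / len(draws)
print(f"mutation share under Full: {share:.2f}")
```

Under the Full condition, the share of mutation draws should land near the configured `mutation_rate`, while the single-operator and random-walk conditions are deterministic.
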
## Procedure

1. Run 100 grounded cycles per benchmark with the full config (3 benchmarks → ~300 cycles).
2. Run 100 grounded cycles per benchmark for each ablation condition
   (4 conditions × 3 benchmarks × 100 cycles ≈ 1200 cycles in total; this count includes the Full condition).
3. Run the Spearman correlation (`python analysis/correlation.py --refresh`).
4. Manually review the top-10 and bottom-10 generated projects from each condition.
5. Publish results with score distributions, convergence curves, and qualitative analysis.

## Files

| File | Purpose |
|------|---------|
| `experiments/run_log.jsonl` | Per-cycle structured log |
| `experiments/projects/` | Archived generated projects |
| `analysis/correlation.py` | Spearman ρ computation |
| `benchmarks/tasks.json` | Benchmark + hidden test definitions |
| `infinite_research_loop.py` | Experiment runner (main entry point) |

README.md

Lines changed: 3 additions & 3 deletions
@@ -27,10 +27,10 @@

 <!-- EVOLUTION_STATUS_START -->

-> **Last Evolution Cycle:** 2026-05-28T16:25:12.375152+00:00 UTC
-> **Generation:** 15
+> **Last Evolution Cycle:** 2026-05-28T16:26:52.331163+00:00 UTC
+> **Generation:** 16
 > **Best Score:** 96.0
-> **Population Size:** 15
+> **Population Size:** 16

 <!-- EVOLUTION_STATUS_END -->

analysis/__init__.py

Whitespace-only changes.

analysis/correlation.py

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
"""Spearman correlation analysis: lexical vs grounded scores.

Usage:
    python analysis/correlation.py            # Use cached grounded scores
    python analysis/correlation.py --refresh  # Re-run grounded eval for all prompts

Requires LLM_API_KEY to be set for the --refresh mode.
"""

import json
import re
import sys
from pathlib import Path
from typing import Any


POPULATION_DIR: Path = Path("population")
POPULATION_JSON: Path = Path("population/population.json")
RESULTS_LOG: Path = Path("results.log")
GROUNDED_CACHE: Path = Path("analysis/grounded_scores_cache.json")


def load_lexical_scores() -> dict[str, float]:
    """Load lexical scores from results.log (lexical evaluation output)."""
    scores: dict[str, float] = {}
    if not RESULTS_LOG.exists():
        print("WARN: results.log not found; run evaluate.py first")
        return scores
    # Only lines that start with "<name>: <number>" are treated as scores.
    for line in RESULTS_LOG.read_text().splitlines():
        m = re.match(r"(\S+):\s*([\d.]+)", line)
        if m:
            scores[m.group(1)] = float(m.group(2))
    return scores


def load_grounded_scores() -> dict[str, float]:
    """Load previously cached grounded scores."""
    if GROUNDED_CACHE.exists():
        return json.loads(GROUNDED_CACHE.read_text())
    return {}


def save_grounded_scores(scores: dict[str, float]) -> None:
    """Cache grounded scores to disk."""
    GROUNDED_CACHE.parent.mkdir(exist_ok=True)
    GROUNDED_CACHE.write_text(json.dumps(scores, indent=2))


def evaluate_grounded(prompt_file: str) -> float | None:
    """Run a single prompt through the grounded evaluator.

    This requires a valid LLM_API_KEY. The prompt is sent to the LLM,
    the generated code is evaluated, and the execution score is returned.
    Returns None if the prompt file is missing or the evaluation fails.
    """
    prompt_path: Path = POPULATION_DIR / prompt_file
    if not prompt_path.exists():
        return None
    prompt_text: str = prompt_path.read_text()
    try:
        # Imported inside the try block so that a missing module is reported
        # as a failed evaluation instead of aborting the whole run.
        from generator import generate_code, write_project_files
        from evaluator.runtime_evaluator import evaluate_project
        import tempfile

        generated_text: str
        usage: dict[str, Any]
        generated_text, usage = generate_code(prompt_text, temperature=0.3)
        # The temporary project directory is removed once scoring finishes.
        with tempfile.TemporaryDirectory(prefix="grounded_eval_") as tmpdir:
            write_project_files(tmpdir, generated_text)
            metrics: dict[str, Any] = evaluate_project(tmpdir)
        return float(metrics.get("final_score", 0.0))
    except Exception as e:
        print(f" ERROR evaluating {prompt_file}: {e}")
        return None


def compute_spearman(x: list[float], y: list[float]) -> float:
    """Compute Spearman rank correlation coefficient between two lists.

    Returns 0.0 when fewer than three pairs are available.
    """
    n: int = len(x)
    if n < 3:
        return 0.0

    from scipy.stats import spearmanr
    # Tuple unpacking works across SciPy versions; the `.statistic`
    # attribute is only available on newer releases.
    rho, _p_value = spearmanr(x, y)
    return float(rho)


def main() -> None:
    """Compare lexical vs grounded scores across the population."""
    lexical: dict[str, float] = load_lexical_scores()
    if not lexical:
        print("No lexical scores found. Run evaluate.py first.")
        sys.exit(1)

    grounded: dict[str, float] = load_grounded_scores()
    refresh: bool = "--refresh" in sys.argv

    if refresh:
        grounded = {}
        print("Refreshing all grounded scores (this may take a while)...")
        prompt_files: list[str] = sorted(f.name for f in POPULATION_DIR.glob("*.txt"))
        for i, pf in enumerate(prompt_files):
            if pf in lexical:
                score = evaluate_grounded(pf)
                if score is not None:
                    grounded[pf] = score
                print(f" [{i+1}/{len(prompt_files)}] {pf}: {grounded.get(pf, 'FAIL')}")
        save_grounded_scores(grounded)

    # Find prompts that have both lexical and grounded scores
    common: list[str] = sorted(set(lexical.keys()) & set(grounded.keys()))
    if not common:
        print("No prompts have both lexical and grounded scores.")
        print("Run with --refresh to compute grounded scores, or")
        print("set up the grounded loop and let it populate population.json.")
        sys.exit(1)

    lex_vals: list[float] = [lexical[p] for p in common]
    grd_vals: list[float] = [grounded[p] for p in common]

    print(f"\n{'='*60}")
    print(f"CORRELATION ANALYSIS: {len(common)} prompts")
    print(f"{'='*60}")
    print(f"{'Prompt':<25} {'Lexical':>8} {'Grounded':>10}")
    print(f"{'-'*25} {'-'*8} {'-'*10}")
    for p in common[:20]:
        print(f"{p:<25} {lexical[p]:>8.1f} {grounded[p]:>10.1f}")

    if len(common) > 20:
        print(f" ... and {len(common) - 20} more")

    print(f"\nLexical - min: {min(lex_vals):.1f}, max: {max(lex_vals):.1f}, mean: {sum(lex_vals)/len(lex_vals):.1f}")
    print(f"Grounded - min: {min(grd_vals):.1f}, max: {max(grd_vals):.1f}, mean: {sum(grd_vals)/len(grd_vals):.1f}")

    try:
        rho: float = compute_spearman(lex_vals, grd_vals)
        print(f"\nSpearman's ρ = {rho:.4f}")
        if len(common) < 3:
            # compute_spearman reports 0.0 for fewer than three pairs.
            print("Interpretation: Fewer than 3 prompts; ρ is not meaningful.")
        elif abs(rho) < 0.2:
            print("Interpretation: Weak or no correlation. Lexical and grounded scores")
            print("measure different things. This is consistent with the grounded loop")
            print("capturing signal the lexical loop misses.")
        elif abs(rho) < 0.6:
            # The sign of rho decides the direction of the relationship.
            direction: str = "better" if rho > 0 else "worse"
            print("Interpretation: Moderate correlation. Prompts that score well")
            print(f"lexically tend to produce {direction} code, but with significant noise.")
        elif rho > 0:
            print("Interpretation: Strong correlation. Lexical scoring may be a")
            print("sufficient proxy for execution quality in this domain.")
        else:
            print("Interpretation: Strong negative correlation. Prompts that score well")
            print("lexically tend to produce worse code in this domain.")
    except ImportError:
        print("\nInstall scipy for Spearman correlation: pip install scipy")
        print(f"Raw pair count: {len(common)}")


if __name__ == "__main__":
    main()

benchmarks/tasks.json

Lines changed: 16 additions & 52 deletions
@@ -1,62 +1,26 @@
 [
   {
-    "name": "flask_api",
-    "prompt": "Create a modular Flask REST API with health endpoint and clean structure",
-    "requirements": ["flask", "pytest"],
-    "hidden_tests": ["from flask import Flask", "def test_health_endpoint"]
-  },
-  {
-    "name": "cli_tool",
-    "prompt": "Create a Python CLI task manager using argparse",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import argparse", "if __name__ == '__main__'"]
-  },
-  {
-    "name": "websocket_server",
-    "prompt": "Create an async websocket echo server in Python",
-    "requirements": ["pytest", "pytest-asyncio"],
-    "hidden_tests": ["import asyncio", "import websockets"]
-  },
-  {
-    "name": "data_processor",
-    "prompt": "Create a Python data processing pipeline that reads CSV, transforms data, and outputs JSON",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import csv", "import json"]
-  },
-  {
-    "name": "math_library",
-    "prompt": "Create a Python math utility library with functions for statistics, algebra, and geometry",
+    "name": "cli_task_manager",
+    "prompt": "Create a Python CLI task manager using argparse. Commands: add TASK, list, complete ID, delete ID. Tasks persist to a JSON file called tasks.json. The main entry point must be in main.py with if __name__ == '__main__' block. Each task has id, description, done fields.",
     "requirements": ["pytest"],
-    "hidden_tests": ["def test_", "import math"]
+    "hidden_test_files": {
+      "test_hidden_cli.py": "import subprocess\nimport json\nimport os\n\n\ndef setup_module():\n if os.path.exists('tasks.json'):\n os.remove('tasks.json')\n\n\ndef teardown_module():\n if os.path.exists('tasks.json'):\n os.remove('tasks.json')\n\n\ndef test_add_creates_task():\n result = subprocess.run(['python', 'main.py', 'add', 'test task'], capture_output=True, text=True)\n assert result.returncode == 0, f'add failed: {result.stderr}'\n assert os.path.exists('tasks.json'), 'tasks.json not created'\n with open('tasks.json') as f:\n tasks = json.load(f)\n assert len(tasks) == 1\n assert tasks[0]['description'] == 'test task'\n\n\ndef test_list_shows_tasks():\n subprocess.run(['python', 'main.py', 'add', 'visible task'], capture_output=True)\n result = subprocess.run(['python', 'main.py', 'list'], capture_output=True, text=True)\n assert result.returncode == 0\n assert 'visible task' in result.stdout\n\n\ndef test_complete_marks_done():\n subprocess.run(['python', 'main.py', 'add', 'finish me'], capture_output=True)\n result = subprocess.run(['python', 'main.py', 'complete', '1'], capture_output=True, text=True)\n assert result.returncode == 0\n with open('tasks.json') as f:\n tasks = json.load(f)\n assert any(t['id'] == 1 and t.get('done') for t in tasks)\n\n\ndef test_delete_removes_task():\n subprocess.run(['python', 'main.py', 'add', 'delete me'], capture_output=True)\n result = subprocess.run(['python', 'main.py', 'delete', '1'], capture_output=True, text=True)\n assert result.returncode == 0\n with open('tasks.json') as f:\n tasks = json.load(f)\n assert not any(t['id'] == 1 for t in tasks)\n"
+    }
   },
   {
-    "name": "file_indexer",
-    "prompt": "Create a Python file indexer that walks directories, computes file hashes, and outputs a JSON index",
-    "requirements": ["pytest"],
-    "hidden_tests": ["import hashlib", "os.walk"]
-  },
-  {
-    "name": "config_parser",
-    "prompt": "Create a Python configuration parser that reads YAML/JSON/INI formats and validates schemas",
-    "requirements": ["pytest", "pyyaml"],
-    "hidden_tests": ["import yaml", "import json"]
-  },
-  {
-    "name": "rate_limiter",
-    "prompt": "Create a Python rate limiter implementation with token bucket or sliding window algorithm",
-    "requirements": ["pytest"],
-    "hidden_tests": ["class RateLimiter", "def test_rate_limit"]
-  },
-  {
-    "name": "caching_layer",
-    "prompt": "Create a Python caching library with TTL support, LRU eviction, and decorator interface",
-    "requirements": ["pytest"],
-    "hidden_tests": ["functools.lru_cache", "class Cache"]
+    "name": "async_web_scraper",
+    "prompt": "Create an async Python web scraper in main.py. Must have an async function fetch_url(url, session) that fetches a single URL and returns the response text. Must have an async function scrape_all(urls) that fetches multiple URLs concurrently using asyncio.gather or similar, limits to max 3 concurrent requests via asyncio.Semaphore, retries failed requests once, and returns a list of (url, success, text_or_error) tuples. Must have a main() entry point that accepts a comma-separated list of URLs from sys.argv and prints results as JSON: [{url, success, length, error}]. Include proper error handling for network failures and timeouts.",
+    "requirements": ["pytest", "pytest-asyncio", "aiohttp"],
+    "hidden_test_files": {
+      "test_hidden_scraper.py": "import subprocess\nimport json\nimport sys\n\n\ndef test_scraper_imports():\n import importlib\n spec = importlib.util.find_spec('main')\n assert spec is not None, 'main.py not importable'\n\n\ndef test_main_entry_point():\n result = subprocess.run(\n [sys.executable, 'main.py', 'https://example.com'],\n capture_output=True, text=True, timeout=15\n )\n assert result.returncode == 0, f'main failed: {result.stderr}'\n output = json.loads(result.stdout)\n assert isinstance(output, list)\n assert len(output) == 1\n entry = output[0]\n assert 'url' in entry\n assert 'success' in entry\n assert 'length' in entry\n\n\ndef test_scraper_handles_bad_url():\n result = subprocess.run(\n [sys.executable, 'main.py', 'https://nonexistent.example.test'],\n capture_output=True, text=True, timeout=15\n )\n output = json.loads(result.stdout)\n entry = output[0]\n assert not entry['success'], 'should fail for bad URL'\n assert 'error' in entry\n\n\ndef test_scraper_concurrent_flag():\n source = open('main.py').read()\n assert 'asyncio.gather' in source or 'asyncio.wait' in source or 'asyncio.TaskGroup' in source, 'no concurrent fetching'\n assert 'asyncio.Semaphore' in source or 'asyncio.BoundedSemaphore' in source, 'no rate limiting'\n"
+    }
   },
   {
-    "name": "log_analyzer",
-    "prompt": "Create a Python log file analyzer that parses timestamps, counts levels, and generates reports",
+    "name": "data_pipeline",
+    "prompt": "Create a Python data pipeline in main.py. Must have a function read_csv(path) that reads a CSV file (handle both comma and header row) and returns a list of dicts. Must have a function filter_data(data, key, value) that filters rows where data[key] == value. Must have a function aggregate(data, group_key, agg_key, agg_func) that groups by group_key and applies agg_func ('sum', 'avg', 'count') to agg_key. Must have a function write_json(data, path) that writes the processed data as JSON. Must have a main() that accepts input CSV path from sys.argv, reads, filters where status=active, aggregates by department summing salary, and writes to {input}_report.json. Include error handling for missing files.",
     "requirements": ["pytest"],
-    "hidden_tests": ["import re", "def test_parse"]
+    "hidden_test_files": {
+      "test_hidden_pipeline.py": "import subprocess\nimport json\nimport os\nimport tempfile\n\n\ndef test_read_csv():\n from main import read_csv\n with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:\n f.write('name,age,city\\nAlice,30,NYC\\nBob,25,LA\\n')\n path = f.name\n data = read_csv(path)\n os.unlink(path)\n assert len(data) == 2\n assert data[0]['name'] == 'Alice'\n\n\ndef test_filter_data():\n from main import filter_data\n data = [{'name': 'Alice', 'status': 'active'}, {'name': 'Bob', 'status': 'inactive'}]\n result = filter_data(data, 'status', 'active')\n assert len(result) == 1\n assert result[0]['name'] == 'Alice'\n\n\ndef test_aggregate_sum():\n from main import aggregate\n data = [{'dept': 'eng', 'salary': 100}, {'dept': 'eng', 'salary': 200}, {'dept': 'hr', 'salary': 150}]\n result = aggregate(data, 'dept', 'salary', 'sum')\n assert result['eng'] == 300\n assert result['hr'] == 150\n\n\ndef test_aggregate_avg():\n from main import aggregate\n data = [{'dept': 'eng', 'salary': 100}, {'dept': 'eng', 'salary': 200}]\n result = aggregate(data, 'dept', 'salary', 'avg')\n assert result['eng'] == 150\n\n\ndef test_main_pipeline():\n with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:\n f.write('name,department,salary,status\\nAlice,Engineering,100000,active\\nBob,Engineering,120000,active\\nCharlie,HR,80000,inactive\\n')\n input_path = f.name\n output_path = input_path.replace('.csv', '_report.json')\n result = subprocess.run(['python', 'main.py', input_path], capture_output=True, text=True, timeout=10)\n os.unlink(input_path)\n assert result.returncode == 0, f'pipeline failed: {result.stderr}'\n assert os.path.exists(output_path), f'report not created at {output_path}'\n with open(output_path) as f:\n report = json.load(f)\n os.unlink(output_path)\n assert isinstance(report, dict)\n"
+    }
   }
 ]
