| 1 | +# AGENTS.md — Grounded Evolution Conventions |
| 2 | + |
| 3 | +## Project Identity |
| 4 | +This is a **research platform** for execution-grounded prompt evolution. |
| 5 | +Framing: evolutionary software optimization, NOT AGI/sentience claims. |
| 6 | +Repository is public at `NullLabTests/grounded_evolution`. |
| 7 | + |
| 8 | +## Conventions |
| 9 | + |
| 10 | +### Code Style |
| 11 | +- Type hints on **all** function signatures and public variables |
| 12 | +- No comments unless absolutely necessary (the code should explain itself) |
| 13 | +- Max line length: roughly 120 characters (not enforced; follow the existing style) |
| 14 | +- Imports: stdlib, then blank line, then third-party, then blank line, then local |
| 15 | + (stdlib comes first; no `isort`/`ruff` ordering enforced — be practical) |
| 16 | +- Use `Any` from `typing` for dynamic types; never leave generic type parameters off (write `dict[str, Any]`, not bare `dict`) |
| 17 | +- Prefer `Path` from `pathlib` over `os.path` |
| 18 | +- File-level docstrings on every `.py` file |
| 19 | + |
| 20 | +### Project Structure |
| 21 | +- `generator.py` — LLM code generation, returns `(text, usage_dict)` tuple |
| 22 | +- `evaluator/runtime_evaluator.py` — execution-grounded validation (AST, pytest, hidden tests) |
| 23 | +- `mutation_engine.py` — prompt mutation/crossover operators |
| 24 | +- `mutation.py` / `evaluate.py` / `evolve_forever.py` / `auto_evolve.py` — lexical-only loop (legacy) |
| 25 | +- `population_manager.py` — JSON-based population persistence |
| 26 | +- `infinite_research_loop.py` — main grounded loop (calls generator → evaluator → population_manager) |
| 27 | +- `run_experiment.py` — orchestrated ablation experiments |
| 28 | +- `benchmarks/tasks.json` — 3 benchmark definitions with inline hidden test files |
| 29 | +- `experiments/` — all experiment output (logs, archives, ablation runs) |
| 30 | + |
| 31 | +### Two Loops |
| 32 | +1. **Lexical loop** (`evaluate.py`/`evolve_forever.py`): keyword-matching fitness. Currently at 218 prompts with a best score of 1000/1000. Lower priority now. |
| 33 | +2. **Grounded loop** (`infinite_research_loop.py`/`generator.py`/`runtime_evaluator.py`): real code execution fitness. This is the primary focus. |
| 34 | + |
| 35 | +### Environment Variables (never hardcode secrets) |
| 36 | +- `LLM_API_KEY` — required for grounded loop |
| 37 | +- `LLM_MODEL` — model name (default: `mistral-large-latest`) |
| 38 | +- `LLM_BASE_URL` — API base URL (default: `https://api.mistral.ai/v1`) |
| 39 | + |
| 40 | +### Testing |
| 41 | +- No test suite for the project itself yet (TODO) |
| 42 | +- Hidden benchmark tests live in `benchmarks/tasks.json` as `hidden_test_files` dict |
| 43 | +- Rust-based tests (`cargo test`) exist in the `generated_projects/` output (not our code) |
| 44 | + |
| 45 | +### Git |
| 46 | +- Auto-commits on score improvement from the grounded loop |
| 47 | +- Manual commits for structural changes (new features, refactors, docs) |
| 48 | +- Commit messages: concise, descriptive, no emoji |
| 49 | + |
| 50 | +### Adding New Features |
| 51 | +1. Check if the feature already exists (grep for related terms) |
| 52 | +2. Follow the existing pattern (if it's a mutation, add to `mutation_engine.py`) |
| 53 | +3. Type hints everywhere |
| 54 | +4. Add the new feature to `run_experiment.py` if it's an experimental variable |
| 55 | +5. Update `EXPERIMENT_DESIGN.md` if the experiment protocol changes |
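
The Code Style and Environment Variables conventions above can be sketched together in one small module. This is a minimal illustration, not code from the repository: `LLMSettings`, `load_settings`, and the two `DEFAULT_*` constants are hypothetical names, and only the three variable names and their defaults come from the conventions file. The comments are there to flag those assumptions; the "no comments" rule would normally drop them.

```python
"""Hypothetical settings loader illustrating the AGENTS.md conventions."""

# Standard-library imports come first. Third-party and local imports would
# follow in their own groups, each separated by a blank line.
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional

# Defaults taken from the Environment Variables section.
DEFAULT_MODEL: str = "mistral-large-latest"
DEFAULT_BASE_URL: str = "https://api.mistral.ai/v1"


@dataclass(frozen=True)
class LLMSettings:
    # Hypothetical container; the real project may structure this differently.
    api_key: str
    model: str
    base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    # Read from os.environ unless a mapping is injected (handy for tests).
    source: Mapping[str, str] = os.environ if env is None else env
    api_key: str = source.get("LLM_API_KEY", "")
    if not api_key:
        # The key is never hardcoded; a missing key is an error.
        raise RuntimeError("LLM_API_KEY is required for the grounded loop")
    return LLMSettings(
        api_key=api_key,
        model=source.get("LLM_MODEL", DEFAULT_MODEL),
        base_url=source.get("LLM_BASE_URL", DEFAULT_BASE_URL),
    )
```

Calling `load_settings({"LLM_API_KEY": "..."})` with only the key set falls back to the documented defaults for the model and base URL. Passing the environment as a parameter is a design choice made for this sketch, so the function can be exercised without touching the real process environment.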