mert-cemri
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 0 additions & 118 deletions
diff --git a/SEARCH_STRATEGIES.md b/SEARCH_STRATEGIES.md
Lines changed: 32 additions & 16 deletions
diff --git a/examples/math_mas/README.md b/examples/math_mas/README.md
Lines changed: 45 additions & 112 deletions
@@ -59,3 +59,7 @@ problems
 # Local debug logs
 prompt_builder_logs.jsonl
 try.txt
+
+# Claude-generated documentation
+CLAUDE.md
+CONFIGURATION_GUIDE.md
@@ -278,7 +278,7 @@ All strategies share the following OpenEvolve components:
 To compare all strategies:
 
 ```bash
-# Run all strategies with same parameters
+# Run all strategies with same parameters (each saves to separate directory)
 for strategy in "" "--best-of-n" "--beam-search" "--mcts"; do
   python openevolve-run.py \
     examples/math_mas/initial_program.py \
@@ -287,8 +287,11 @@ for strategy in "" "--best-of-n" "--beam-search" "--mcts"; do
     --iterations 50
 done
 
-# Compare results
-# Each run saves to: openevolve_output/best/best_program_info.json
+# Compare results - each strategy has its own directory:
+# openevolve_output/best_of_n/best/best_program_info.json
+# openevolve_output/beam_search/best/best_program_info.json
+# openevolve_output/mcts/best/best_program_info.json
+# openevolve_output/best/best_program_info.json (default MAP-Elites)
 ```
 
 ## Configuration
@@ -372,24 +375,37 @@ database:
 
 ## Output
 
-All strategies produce the same output structure:
+Each strategy saves results to its own subdirectory to prevent overwriting:
 
 ```
 openevolve_output/
-├── best/
-│   ├── best_program.py          # Best program code
-│   └── best_program_info.json   # Metrics and metadata
-├── checkpoints/
-│   ├── checkpoint_5/
-│   │   ├── strategy.json        # Strategy state
-│   │   ├── best_program.py
-│   │   └── best_program_info.json
-│   └── checkpoint_10/
-│       └── ...
-└── logs/
-    └── openevolve_YYYYMMDD_HHMMSS.log
+├── best_of_n/                   # Best-of-N results
+│   ├── best/
+│   │   ├── best_program.py          # Best program code
+│   │   └── best_program_info.json   # Metrics and metadata
+│   ├── checkpoints/
+│   │   ├── checkpoint_5/
+│   │   │   ├── strategy.json        # Strategy state
+│   │   │   ├── best_program.py
+│   │   │   └── best_program_info.json
+│   │   └── checkpoint_10/
+│   │       └── ...
+│   └── logs/
+│       └── openevolve_YYYYMMDD_HHMMSS.log
+├── beam_search/                 # Beam Search results
+│   ├── best/
+│   ├── checkpoints/
+│   └── logs/
+├── mcts/                        # MCTS results
+│   ├── best/
+│   ├── checkpoints/
+│   └── logs/
+└── [default MAP-Elites at root if no strategy specified]
 ```
 
+**Note**: Each strategy uses `openevolve_output/<strategy_name>/` to keep results separate.
+You can override with `--output custom_dir/`.
+
 ## Extending with New Strategies
 
 To add a new search strategy:
 
@@ -1,150 +1,83 @@
-# Multi-Agent Math Solver Evolution
+# Multi-Agent Math Solving System Evolution
 
-This example uses OpenEvolve to evolve a multi-agent system for solving mathematical problems from the Math500 dataset.
+This directory contains scripts for evolving and testing multi-agent systems that solve mathematical problems from the OlympiadBench dataset.
 
-## Overview
+## Quick Start
 
-The system evolves a collaborative multi-agent architecture with up to 4 agents:
-- **Solver**: Initial problem-solving
-- **Verifier**: Solution verification
-- **Reviser**: Error correction based on feedback
-- **Refiner**: Final answer polishing
-
-OpenEvolve optimizes:
-- Agent system prompts (roles and expertise)
-- Communication protocols (interaction patterns)
-- Workflow structure (agent coordination)
-
-## Setup
-
-### 1. Install Dependencies
+### Run Evolution (Single Strategy)
 
 ```bash
-# Be in the main code
-pip install -e ".[dev]"
-
-# Install additional dependencies for this example
-pip install langchain langchain-openai datasets word2number sympy latex2sympy2
-```
-
-### 2. Set Environment Variables
+# Run MAP-Elites for 100 iterations with 100 problems
+./run_map_elites.sh 100 100
 
-```bash
-# OpenAI API key (used for both evolution and multi-agent system)
-export OPENAI_API_KEY="your-openai-api-key"
+# Run Best-of-N
+./run_best_of_n.sh 100 100
 
-# Optional: Configure the model used by agents (inside the multi-agent system)
-export OPENEVOLVE_MODEL="gpt-4o-mini"  # Default model for agents in the multi-agent system
+# Run Beam Search
+./run_beam_search.sh 100 100
 
-# Optional: Number of test problems per evaluation
-export MATH_EVAL_PROBLEMS="10"  # Default: 10 problems
+# Run MCTS
+./run_mcts.sh 100 100
 ```
 
-### 3. Test the Initial System
+### Run All Strategies in Parallel
 
 ```bash
-cd examples/math_mas
-
-# Test the initial multi-agent system
-python initial_program.py
-
-# Test the evaluator
-python evaluator.py
+# Run all 4 strategies simultaneously
+./run_all_strategies.sh 100 100
 ```
 
-## Running Evolution
-
-### Basic Evolution Run
+### Test a Program
 
 ```bash
-# From the repository root
-python openevolve-run.py \
-  examples/math_mas/initial_program.py \
-  examples/math_mas/evaluator.py \
-  --config examples/math_mas/config.yaml \
-  --iterations 50
-```
+# Test initial program with 100 problems (seed=42)
+python test_program.py initial_program.py
 
-### Resume from Checkpoint
+# Test evolved program with different seed for test set
+python test_program.py openevolve_output/best/best_program.py --seed 99
 
-```bash
-python openevolve-run.py \
-  examples/math_mas/initial_program.py \
-  examples/math_mas/evaluator.py \
-  --config examples/math_mas/config.yaml \
-  --checkpoint examples/math_mas/openevolve_output/checkpoints/checkpoint_40 \
-  --iterations 20
+# Test with more problems
+python test_program.py path/to/program.py --num-problems 200 --seed 1234
 ```
 
-## Configuration
+---
 
-Key configuration options in `config.yaml`:
+## Testing Script: test_program.py
 
-### Evolution Settings
-- `max_iterations: 50` - Number of evolution iterations
-- `diff_based_evolution: false` - Use full rewrites instead of diffs
-- `early_stopping_patience: 20` - Stop if no improvement for 20 iterations
+Standalone script to evaluate any program on math problems with configurable random seed.
 
-### Database (MAP-Elites)
-- `population_size: 100` - Maximum programs in population
-- `num_islands: 4` - Isolated populations for diversity
-- `feature_dimensions: [accuracy, completion_rate]` - Quality-diversity space
-
-### Evaluator
-- `cascade_evaluation: true` - Fast-fail for bad programs
-- `parallel_evaluations: 4` - Run 4 evaluations concurrently
-- `timeout: 600` - 10 minute timeout per evaluation
-
-
-
-
-### Adjust Problem Difficulty
+**Usage:**
 ```bash
-# Use fewer problems for faster iterations
-export MATH_EVAL_PROBLEMS="5"
+python test_program.py <program_path> [options]
 
-# Use more problems for better evaluation
-export MATH_EVAL_PROBLEMS="20"
+Options:
+  -n, --num-problems N  Number of problems (default: 100, use -1 for all 675)
+  -s, --seed N          Random seed for sampling (default: 42)
+  -o, --output FILE     Output JSON file
 ```
 
-### Customize Agent Model
+**Examples:**
 ```bash
-# Use a more powerful model for agents
-export OPENEVOLVE_MODEL="gpt-4o"
+# Test with default settings (100 problems, seed=42)
+python test_program.py initial_program.py
 
-# Or use GPT-5 for agents too (expensive but powerful)
-export OPENEVOLVE_MODEL="gpt-5"
-```
+# Test on DIFFERENT problems (seed=99 instead of 42)
+python test_program.py openevolve_output/best/best_program.py --seed 99
 
-### Visualize Evolution
-```bash
-# After evolution completes
-python scripts/visualizer.py \
-  --path examples/math_mas/openevolve_output/checkpoints/checkpoint_50/
+# Test on full dataset
+python test_program.py path/to/program.py --num-problems -1
 ```
 
-## Troubleshooting
+## Train/Test Split with Seeds
 
-### "Import langchain_openai could not be resolved"
-```bash
-pip install langchain langchain-openai
-```
+Use different random seeds to create train/test splits:
 
-### "No problems loaded"
 ```bash
-# Install datasets library
-pip install datasets
+# Evolution uses seed=42 (from config.yaml)
+./run_map_elites.sh 100 100
 
-# Or problems will fall back to synthetic test problems
+# Test on different problems (seed=99)
+python test_program.py openevolve_output/best/best_program.py --seed 99
 ```
 
-### API Rate Limits
-- Reduce `parallel_evaluations` in config.yaml
-- Increase `timeout` if models are slow
-- Use faster models (gpt-4o-mini instead of gpt-5)
-
-### Out of Memory
-- Reduce `population_size` in config.yaml
-- Reduce `MATH_EVAL_PROBLEMS` environment variable
-- Enable cascade evaluation to fail fast on bad programs
-
+See full README for more details.
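
Note on the "Train/Test Split with Seeds" section added to `examples/math_mas/README.md`: changing the seed changes which problems are sampled, but two independent samples from the same 675-problem pool can still share problems. The sketch below illustrates this under a loud assumption: that sampling works like `random.Random(seed).sample(...)` over problem indices. The diff does not show how `test_program.py` or the evaluator actually sample, so `sample_indices`, `overlap`, and `held_out` are hypothetical helpers, not functions from the repository.

```python
import random
from typing import Set

TOTAL_PROBLEMS = 675  # dataset size quoted in the --num-problems help text


def sample_indices(seed: int, num_problems: int, total: int = TOTAL_PROBLEMS) -> Set[int]:
    """Hypothetical stand-in for seeded sampling: draw indices without replacement."""
    rng = random.Random(seed)
    return set(rng.sample(range(total), num_problems))


def overlap(seed_a: int, seed_b: int, num_problems: int = 100) -> Set[int]:
    """Indices that appear in both the seed_a sample and the seed_b sample."""
    return sample_indices(seed_a, num_problems) & sample_indices(seed_b, num_problems)


def held_out(train_seed: int, num_train: int, total: int = TOTAL_PROBLEMS) -> Set[int]:
    """Indices never drawn by the training seed: a strictly disjoint test pool."""
    return set(range(total)) - sample_indices(train_seed, num_train, total)


if __name__ == "__main__":
    shared = overlap(42, 99)
    print(f"{len(shared)} of 100 problems appear in both the seed=42 and seed=99 samples")
    print(f"{len(held_out(42, 100))} problems are untouched by the seed=42 sample")
```

For uniform sampling without replacement, two independent 100-problem draws from 675 share about 100 × 100 / 675 ≈ 15 problems on average. If a strictly held-out evaluation matters, testing on the complement of the training sample (as in `held_out`) avoids that overlap, while a different seed alone does not guarantee it.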