
Evaluator

Executes LLM-generated functions in isolated sandboxes using fork-based process isolation.

  1. Parse: extract code from XML tags, markdown fences, or raw text, then validate with the AST.
  2. Integrate: replace the priority function body in the evaluation script.
  3. Cache graphs: load graphs from LMDB on first use and cache them in evaluator memory.
  4. Distribute: submit each test input (with its graph) to the ThreadPoolExecutor worker pool.
  5. Compile: workers compile the program; the eval script is cached, only priority is recompiled.
  6. Fork: the child process inherits the parent's memory (including cached graphs) via copy-on-write.
  7. Sandbox: the child sets memory limits, runs the evaluation, and returns the result via a pipe.
  8. Collect: the parent reads results and publishes scores to the database queue.
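Steps 1 and 2 might be sketched as follows. This is a minimal illustration, not the project's actual parser: the tag/fence patterns and the `parse_priority` name are assumptions.

```python
import ast
import re

def parse_priority(llm_output: str) -> str:
    """Extract candidate code from LLM output and validate it (illustrative sketch)."""
    # Try XML tags first, then markdown fences, then fall back to raw text.
    for pattern in (r"<code>(.*?)</code>", r"```(?:python)?\n(.*?)```"):
        match = re.search(pattern, llm_output, re.DOTALL)
        if match:
            code = match.group(1)
            break
    else:
        code = llm_output
    # Validate syntax with the AST before accepting the candidate.
    ast.parse(code)  # raises SyntaxError on invalid code
    return code.strip()
```

AST validation rejects syntactically broken candidates early, before any fork or compilation work is spent on them.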

Process Hierarchy

Evaluator process (caches graphs in memory)
├── ThreadPoolExecutor
│   ├── Thread 1 shares memory with evaluator (including graph cache)
│   ├── Thread 2
│   └── Thread N
│
└── Each thread forks:
    └── Child process (inherits graphs via copy-on-write), killed after timeout

Fork Sandboxing

Uses os.fork() instead of subprocess.Popen(), which yields a large performance win:

  • Fork is fast: ~1ms vs ~50ms for subprocess
  • No graph reload: Child inherits parent's cached graphs via copy-on-write
  • No serialization: Function and input already in child's memory
  • Full isolation: Child crash doesn't affect parent
Worker thread                          Forked child
      │                                       │
      ├─── fork() ──────────────────────────► │ inherits memory (graphs, compiled code)
      │                                       │ sets memory limit
      │                                       │ executes priority function
      │  ◄─── pipe (result) ──────────────────┤ writes result
      │                                       │ os._exit(0)

The child:

  1. Inherits parent's memory space (copy-on-write)
  2. Sets memory limit via resource.setrlimit
  3. Runs evaluation with cached graph
  4. Writes result to pipe
  5. Exits via os._exit() or gets killed after timeout
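The fork/pipe/exit cycle above can be sketched as below. This is a simplified sketch assuming a POSIX system; the function name, pickle-based result encoding, and error handling are illustrative, not the evaluator's actual code.

```python
import os
import pickle
import resource

def run_in_fork(fn, arg, memory_limit_gb=1.0):
    """Run fn(arg) in a forked child with a memory cap (sketch of the fork sandbox)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: inherits parent memory (graphs, compiled code) via copy-on-write
        os.close(read_fd)
        limit = int(memory_limit_gb * 1024 ** 3)
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))  # cap address space
        try:
            result = fn(arg)
            os.write(write_fd, pickle.dumps(("ok", result)))
        except Exception as exc:
            os.write(write_fd, pickle.dumps(("error", str(exc))))
        finally:
            os._exit(0)  # skip interpreter cleanup in the child
    # parent: collect the result from the pipe, then reap the child
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        status, payload = pickle.loads(pipe.read())
    os.waitpid(pid, 0)
    return status, payload
```

Because the child writes only a small result through the pipe and exits with `os._exit()`, no serialization of the graph or function is ever needed, and a crash inside `fn` surfaces as an error tuple rather than taking down the parent.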

Compilation Caching

Because threads share memory, the compilation cache persists between evaluations. The program is split into two parts:

  • Base (imports, solve, evaluate): cached after first compilation
  • Priority (the LLM-generated function): recompiled each time

Graph Caching

Graphs are loaded lazily on first use via FastGraph, a C++ graph class with a NetworkX-compatible API that stores edges in CSR (Compressed Sparse Row) format for minimal memory.

  • All threads share the same graph cache (thread-safe with Lock)
  • Forked children inherit the cache via copy-on-write
  • Graphs are never reloaded during evaluation

LLM priority functions use the same API: G.neighbors(node), G.degree(node), G.nodes.
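A lazy, lock-protected cache of the kind described above might look like this. The `GraphCache` class and its `loader` callback are illustrative; the real loader would read from LMDB and build a FastGraph:

```python
import threading

class GraphCache:
    """Lazy, thread-safe graph cache shared by all worker threads (sketch)."""

    def __init__(self, loader):
        self._loader = loader        # e.g. a function reading LMDB into a graph object
        self._graphs = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:             # serialize loading so each graph loads exactly once
            if key not in self._graphs:
                self._graphs[key] = self._loader(key)
            return self._graphs[key]
```

Forked children see whatever is in `_graphs` at fork time via copy-on-write, so a graph loaded once in the evaluator is free for every subsequent sandbox.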

Performance

Fork overhead is ~35-40ms per evaluation regardless of graph size. Compare to subprocess approach which would reload graphs from LMDB each time (2+ minutes for IDS n=10).

| Graph Size | Fork + Eval Time | Graph Reload Time (old) |
|------------|------------------|-------------------------|
| 1K nodes   | ~40ms            | ~30ms                   |
| 65K nodes  | ~40ms            | ~3s                     |
| 1M nodes   | ~40ms            | ~2 min                  |
| 4M nodes   | ~40ms            | ~75 min                 |

Memory Requirements

Each evaluator process loads and caches graphs. All values measured with C++ FastGraph. Total Time includes graph loading plus one greedy MIS evaluation with a trivial priority function.

Deletion Binary s=1

| n  | Nodes | Edges  | Memory | Total Time |
|----|-------|--------|--------|------------|
| 6  | 64    | 543    | < 1 MB | 5 ms       |
| 7  | 128   | 1,471  | < 1 MB | 4 ms       |
| 8  | 256   | 3,839  | < 1 MB | 47 ms      |
| 9  | 512   | 9,727  | < 1 MB | 23 ms      |
| 10 | 1,024 | 24,063 | < 1 MB | 13 ms      |
| 11 | 2,048 | 58,367 | < 1 MB | 24 ms      |

Total for n=6 to n=11: < 5 MB per evaluator

Deletion Binary s=2

| n  | Nodes | Edges     | Memory | Total Time |
|----|-------|-----------|--------|------------|
| 7  | 128   | 5,173     | < 1 MB | 5 ms       |
| 8  | 256   | 17,183    | < 1 MB | 58 ms      |
| 9  | 512   | 54,895    | < 1 MB | 20 ms      |
| 10 | 1,024 | 169,162   | < 1 MB | 48 ms      |
| 11 | 2,048 | 504,451   | < 1 MB | 137 ms     |
| 12 | 4,096 | 1,460,525 | < 1 MB | 386 ms     |

Total for n=7 to n=12: < 10 MB per evaluator

IDS Quaternary s=1

Memory has two values: Final (after loading, in-use) and Peak (during LMDB parsing spike).

| n  | Nodes     | Edges         | Final Memory | Peak Memory | Eval Time |
|----|-----------|---------------|--------------|-------------|-----------|
| 6  | 4,096     | 392,358       | 40 MB        | 40 MB       | 0.1 s     |
| 7  | 16,384    | 2,206,374     | 90 MB        | 350 MB      | 0.7 s     |
| 8  | 65,536    | 11,815,590    | 200 MB       | 1.4 GB      | 3.4 s     |
| 9  | 262,144   | 60,992,166    | 620 MB       | 5.0 GB      | 19.7 s    |
| 10 | 1,048,576 | 305,965,734   | 2.5 GB       | 20 GB       | 2.2 min   |
| 11 | 4,194,304 | 1,500,162,726 | 12.2 GB      | ~50 GB      | 75 min    |

Cumulative for n=6 to n=9: ~1.2 GB final, ~6.5 GB peak during loading.
Cumulative for n=6 to n=10: ~3.5 GB final, ~25 GB peak during loading.

WARNING: Peak memory during LMDB loading is much higher than final memory. When starting multiple evaluators simultaneously, the peak spikes can cause OOM. Solutions:

  1. Stagger evaluator startup - don't start all at once
  2. Pre-load graphs in main process before forking evaluators
  3. Start with fewer evaluators and scale up after graphs are cached
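Mitigation 1 (staggered startup) can be sketched as below. The launcher function, stagger interval, and fork start method are assumptions for illustration, not the project's actual orchestration code:

```python
import time
import multiprocessing as mp

def start_evaluators(target, count, stagger_seconds=30.0):
    """Launch evaluator processes one at a time so LMDB-loading peaks don't overlap."""
    ctx = mp.get_context("fork")  # POSIX fork start method, matching the fork-based design
    procs = []
    for idx in range(count):
        p = ctx.Process(target=target, args=(idx,))
        p.start()
        procs.append(p)
        if idx < count - 1:
            # Wait before launching the next evaluator so its parsing spike
            # does not coincide with this one's.
            time.sleep(stagger_seconds)
    return procs
```

For the IDS n=10 figures above, a stagger on the order of the ~2 min load time keeps only one ~20 GB peak in flight at a time.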

Configuration

EvaluatorConfig(
    timeout=30,                    # Seconds before sandbox killed
    max_workers=2,                 # Parallel threads per evaluator
    sandbox_memory_limit_gb=1.0,   # Memory limit per sandbox
    debug_samples=False,           # Save debug sample files to log_dir/debug_samples
    prefetch_count=15,             # RabbitMQ message buffer

    evaluation_script_path="...",  # Path to evaluation script with evaluate and priority
    initial_functions_dir="...",   # Seed functions to start evolution
    s_values=[1],                  # Problem parameters
    start_n=[6],                   # Range start
    end_n=[11],                    # Range end
    mode="last_relative",          # Score aggregation: last, average, weighted, relative_difference, last_relative
)

Debugging

Graphs are cached in memory, and no files are written to disk during evaluation (the sandbox is fork-based).

To save debug samples (raw LLM output, parsed code, evaluation results) for inspection, enable debug_samples in your config:

EvaluatorConfig(
    debug_samples=True,  # Default: False. Saves debug files to log_dir/debug_samples/.
)

Sandbox errors (compile errors, runtime exceptions, timeouts) are returned as strings in the result and logged by the evaluator. Check the main log for lines containing "Compile error", "Runtime error", or "Timeout".

# Check memory usage (graphs cached in evaluator)
ps aux | grep evaluator

# Check evaluator logs for errors
grep -i "error\|timeout" logs/main.log

# Monitor child processes (short lived forks)
watch -n 0.1 'ps --ppid <evaluator_pid>'