Skip to content

Global learnings 2#1

Draft
dtdannen wants to merge 7 commits into
mainfrom
global_learnings_2
Draft

Global learnings 2#1
dtdannen wants to merge 7 commits into
mainfrom
global_learnings_2

Conversation

@dtdannen
Copy link
Copy Markdown
Owner

@dtdannen dtdannen commented Oct 8, 2025

No description provided.

dtdannen and others added 3 commits October 6, 2025 15:21
Implements a global learnings aggregator that collects insights across all
iterations and islands, injecting them into prompts to help the LLM avoid
repeated mistakes and learn from successful patterns.

Features:
- Tracks common failures: syntax errors, runtime errors, performance regressions
- Optionally tracks successful patterns with improvement metrics
- Configurable aggregation window (default: last 50 iterations)
- Flexible injection (system/user prompts) with verbosity levels
- Full checkpoint save/load support
- Zero overhead when disabled (default: off)

Implementation Details:
- New GlobalLearningsConfig in config.py with 16+ configuration options
- GlobalLearnings class extracts patterns using regex on stderr/artifacts
- Pattern detection: NameError, IndexError, TypeError, SyntaxError, etc.
- Performance regression tracking (>10% metric decreases)
- Prompt integration via PromptSampler.build_prompt()
- ProcessParallelController passes learnings to workers and updates from results
- Checkpoint persistence to global_learnings.json

Configuration Example:
  global_learnings:
    enabled: true
    track_failures: true
    track_successes: false
    window_size: 50
    max_learnings: 5
    inject_in_system_prompt: true
    verbosity: "concise"

Output Format:
  ## Evolution Insights (Global Learnings)

  ### Common Pitfalls:
  ❌ Undefined variable: temp (seen 5x)
  ❌ IndexError: list index out of range (seen 3x)
  ⚠️  score decreased by 15.2% (seen 4x)

Bug Fixes:
- Fix worker process API key propagation by calling update_model_params()
  with overwrite=True after LLMConfig reconstruction

Files Changed:
- openevolve/config.py: Add GlobalLearningsConfig
- openevolve/global_learnings.py: New core implementation
- openevolve/controller.py: Initialize and checkpoint management
- openevolve/prompt/sampler.py: Inject learnings into prompts
- openevolve/process_parallel.py: Worker integration and updates
- examples/config_with_global_learnings.yaml: Example config
- GLOBAL_LEARNINGS_IMPLEMENTATION.md: Complete documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@dtdannen
Copy link
Copy Markdown
Owner Author

dtdannen commented Oct 8, 2025

@claude can you review this?

@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Claude finished @dtdannen's task —— View job


PR Review: Global Learnings System

I've completed a thorough review of the Global Learnings implementation. This is a well-designed feature that adds cross-island learning capabilities to OpenEvolve. Here's my comprehensive feedback:


✅ Overall Assessment

Strengths:

  • Well-documented implementation with clear design doc
  • Clean separation of concerns across modules
  • Comprehensive configuration options
  • Proper checkpoint/resume support
  • Good error handling and pattern extraction

Architecture Grade: A-
The implementation follows existing patterns well and integrates cleanly with the controller/database/parallel processing layers.


🔍 Detailed Findings

1. Core Implementation (openevolve/global_learnings.py)

Strengths:

  • Clean dataclass design for FailurePattern and SuccessPattern
  • Good regex patterns for error extraction (openevolve/global_learnings.py:165-201)
  • Proper serialization with save()/load() methods
  • Three verbosity levels (minimal/concise/detailed) provide flexibility

Issues Found:

🔴 Critical - Regex Escape Issue (openevolve/global_learnings.py:168)

(r"NameError: name '(\w+)' is not defined", lambda m: f"Undefined variable: {m.group(1)}"),

The single quotes in the regex pattern should be escaped. This will fail to match actual Python NameErrors which use single quotes. Should be:

(r"NameError: name \'(\w+)\' is not defined", lambda m: f"Undefined variable: {m.group(1)}"),

🟡 Medium - Performance Regression Threshold Hardcoded (openevolve/global_learnings.py:223)

if parent_value > 0 and child_value < parent_value * 0.9:

The 10% threshold (0.9) is hardcoded. Consider making this configurable via GlobalLearningsConfig for different problem domains.

🟡 Medium - Limited Pattern Normalization (openevolve/global_learnings.py:265-266)

key = f"{pattern_type}:{description}"

Similar errors might not group together well (e.g., "list index out of range" vs "tuple index out of range"). Consider more sophisticated normalization.

🟢 Minor - Improvement Calculation (openevolve/global_learnings.py:232-255)
The _calculate_improvement() averages all metrics equally. For better results, consider using combined_score when available (similar to how the rest of the codebase prioritizes it).

2. Configuration (openevolve/config.py)

Strengths:

  • Comprehensive GlobalLearningsConfig dataclass (openevolve/config.py:330-364)
  • Good defaults (enabled=False, track_failures=True)
  • Proper integration into master Config class

Issues Found:

🟢 Minor - Documentation (openevolve/config.py:336-339)
The track_both flag overrides individual flags, but this behavior isn't obvious from just the field names. Consider adding a docstring to the dataclass explaining this precedence.

3. Controller Integration (openevolve/controller.py)

Strengths:

  • Clean initialization (openevolve/controller.py:195-198)
  • Proper checkpoint save/load (openevolve/controller.py:460-461, 512-513)
  • Good logging

Issues Found:

🟡 Medium - No Cleanup on Disable (openevolve/controller.py:459-461)

if self.config.global_learnings.enabled:
    self.global_learnings.save(Path(checkpoint_path))

If global learnings is disabled mid-run, old checkpoint files persist. Consider cleanup logic or documenting this behavior.

4. Parallel Processing (openevolve/process_parallel.py)

Strengths:

  • Clean snapshot integration (openevolve/process_parallel.py:401-403)
  • Proper Result object creation for compatibility (openevolve/process_parallel.py:518-529)
  • Updates happen after successful iterations

Issues Found:

🔴 Critical - Potential None Reference (openevolve/process_parallel.py:525-526)

parent_program = self.database.get(result.parent_id) if result.parent_id else None
parent_metrics = parent_program.metrics if parent_program else None

If parent_program is None, this will fail. The code checks for None before accessing .metrics, but there's a window where result.parent_id exists but the program was evicted from the database. Should add defensive check:

parent_program = self.database.get(result.parent_id) if result.parent_id else None
parent_metrics = parent_program.metrics if parent_program and hasattr(parent_program, 'metrics') else None

🟡 Medium - Update Frequency Not Honored (openevolve/process_parallel.py:527-529)
The config has update_interval: 10 but updates happen every iteration. Should check:

if completed_iteration % self.global_learnings.config.update_interval == 0:
    self.global_learnings.update_from_iteration(...)

5. Prompt Integration (openevolve/prompt/sampler.py)

Strengths:

  • Clean injection point (openevolve/prompt/sampler.py:113-115)
  • Simple parameter passing

Issues Found:

🟢 Minor - Duplicate Newlines (openevolve/prompt/sampler.py:115)

system_message = f"{system_message}\n\n{global_learnings}"

The generate_prompt_section() already includes header with newlines. This might result in excessive spacing. Review formatting.

6. Example Configurations

Strengths:

  • Clear example config (examples/config_with_global_learnings.yaml)
  • Good comments explaining each option
  • Reasonable defaults

Issues Found:

🟢 Minor - Blocksworld Config Missing Global Learnings (examples/blocksworld/config.yaml)
The blocksworld example would greatly benefit from global learnings (to avoid repeated planning failures), but it's not enabled in the config. Consider adding it with appropriate thresholds.


📝 Documentation Review

GLOBAL_LEARNINGS_IMPLEMENTATION.md

  • ✅ Comprehensive and well-organized
  • ✅ Good examples of prompt injection output
  • ✅ Clear benefits explanation
  • 🟡 Missing: Performance impact discussion (token costs, memory overhead)

CLAUDE.md Updates

  • ✅ Good high-level architecture updates
  • ✅ Proper integration into development workflow docs
  • 🟡 Consider adding troubleshooting section for global learnings

🧪 Testing Recommendations

Missing Tests (should add):

  1. Unit Tests for Global Learnings:

    • Pattern extraction from various error types
    • Aggregation with window size limits
    • Serialization/deserialization
    • Prompt formatting at different verbosity levels
  2. Integration Tests:

    • Full evolution run with global learnings enabled
    • Checkpoint save/load with learnings data
    • Verify learnings actually appear in prompts
    • Test with both failure-only and success tracking
  3. Edge Cases:

    • What happens with empty/None metrics?
    • Handling when parent program is not in database
    • Unicode/special characters in error messages

🔧 Additional Issues

Documentation Files

🟢 Minor - Large Markdown Files Added (MULTI_BLOCK_IMPLEMENTATION_PLAN.md: 2018 lines, SYSTEM_PROMPT_EVOLUTION_PLAN.md: 1309 lines)
These are design documents but very large. Consider:

  • Moving to a docs/design/ directory
  • Splitting into multiple files by topic
  • Adding a table of contents

Papers Directory

🟡 Medium - New Papers Download Scripts (papers/*.py)
These scripts appear unrelated to global learnings. Should they be in a separate PR? If they're for future LLM-based learning features, add a README explaining the connection.


🎯 Recommendations

Before Merging:

  1. Fix critical issues:

    • Escape quotes in NameError regex pattern
    • Add defensive check for parent_program None case
    • Honor update_interval configuration
  2. Consider for follow-up:

    • Make regression threshold configurable
    • Add unit tests for pattern extraction
    • Document performance impact
    • Clean up large design docs into organized structure
  3. Nice to have:

    • Add global learnings to blocksworld example config
    • Improve pattern normalization for better grouping
    • Use combined_score in improvement calculations

✅ Conclusion

This is a solid implementation of a useful feature. The critical issues are minor and easily fixed. The architecture integrates well with the existing codebase. With the suggested fixes, this is ready to merge.

Recommendation: Approve with minor changes requested for the critical issues.


Review completed: 2025-10-08
Files reviewed: 26 changed files
Branch: global_learnings_2

Critical Fixes:
- Fix regex pattern for NameError to match both single and double quotes
  Pattern now: r"NameError: name ['\"](\w+)['\"] is not defined"
  This will properly match Python NameError messages

Medium Improvements:
- Add configurable performance_regression_threshold (default: 0.10 = 10%)
  Allows tuning sensitivity for different problem domains
- Improve error pattern normalization for better grouping:
  * Normalize container types (list/tuple/str/dict → sequence)
  * Normalize numeric values (123 → N)
  * Lowercase for consistency
  Example: "list index 5 out of range" → "sequence index n out of range"

Minor Improvements:
- Prioritize combined_score in improvement calculation (consistency with codebase)
  Falls back to averaging all metrics if combined_score unavailable
- Add comprehensive docstring to GlobalLearningsConfig explaining track_both precedence

Files Changed:
- openevolve/config.py: Add performance_regression_threshold field, docstring
- openevolve/global_learnings.py: All pattern detection and normalization improvements

All changes are backward compatible with existing configurations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Pull Request Review: Global Learnings System

Summary

This PR introduces a Global Learnings System (~500 lines) that aggregates insights across evolution iterations, plus new blocksworld examples and documentation.

Strengths

  • Clean architecture with proper dataclasses (FailurePattern, SuccessPattern)
  • Comprehensive docstrings and type hints throughout
  • Smart error normalization for better pattern grouping
  • Proper checkpoint integration with save/load
  • Well-designed configuration with sensible defaults

Critical Issues

1. Missing Test Coverage (HIGH)

No unit tests found for the 510-line global_learnings.py module. Need tests for:

  • Regex pattern extraction
  • Regression detection edge cases (zero, negative values)
  • Pattern normalization
  • Checkpoint serialization
  • Prompt generation at all verbosity levels

2. Security: Artifact Injection Risk (MEDIUM)

Location: global_learnings.py:160-162

stderr is injected into prompts without sanitization. Malicious programs could craft stderr to manipulate LLM behavior. Recommend adding sanitization to truncate long errors, remove prompt injection patterns, and filter control characters.

3. Bug: Division Issues (MEDIUM)

Location: global_learnings.py:223-229

Regression calculation checks parent_value > 0 but doesn't handle overflow or extreme values. Add try-except wrapper and clamp percentages.

Medium Priority

4. Unbounded Pattern Growth (LOW)

failure_patterns dict can grow unbounded. Add pattern limit (~1000) with LRU eviction.

5. Code Smell: Nested Conditionals (LOW)

Lines 238-243 have triple-nested ifs with no logging. Use early returns and add debug logs.

6. Documentation: Missing Performance Details (LOW)

Excellent docs but missing runtime/memory impact, token cost, and guidance on when NOT to use.

Other Changes

Blocksworld Example: Excellent quality (737 lines) with proper validation and comprehensive metrics.

Checkpoint Files: 150+ checkpoint files add 10K+ lines. These should be removed and added to .gitignore.

Summary

Metric Value Status
Core Implementation 500 lines Good
Test Coverage 0% Critical
Documentation Excellent Good
API Design Clean Good

Must Fix Before Merge

  1. Add unit tests for global_learnings.py
  2. Remove checkpoint data files from commit
  3. Add error message sanitization

Approval: Conditional

Solid design and valuable feature, but needs tests and cleanup. Estimate 3-4 hours to address critical issues.

Great work on the implementation and documentation!

- Remove 441 accidentally committed checkpoint files from global_learning_failures_only_window_50/
- Add global_learning*/ to .gitignore to prevent future checkpoint commits
- Checkpoint files are local experiment outputs and should not be in version control

This cleanup removes ~10K+ lines of JSON checkpoint data that were accidentally
included in the previous commit.
@claude claude Bot mentioned this pull request Oct 8, 2025
@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Pull Request Review - Global Learnings 2

Summary

This PR introduces a Global Learnings System to help the evolution process learn from past failures and successes across all islands. Additionally, it includes new examples (blocksworld), planning documents, and paper download utilities. Overall, the core implementation is solid, but there are concerns about test coverage, security practices, and organizational aspects.

🟢 Strengths

1. Well-Designed Global Learnings System

The global_learnings.py implementation (509 lines) demonstrates thoughtful architecture:

  • Clean dataclass patterns (FailurePattern, SuccessPattern)
  • Pattern normalization and grouping logic
  • Configurable tracking modes (failures only, successes only, both)
  • Sliding window aggregation with configurable verbosity
  • Proper checkpoint save/load integration

Code Quality: The error extraction regexes are robust and the formatting methods adapt well to different verbosity levels.

2. Comprehensive Documentation

  • CLAUDE.md improvements are excellent - much clearer guidance on testing, configuration, and troubleshooting
  • GLOBAL_LEARNINGS_IMPLEMENTATION.md provides thorough implementation summary with usage examples
  • Documentation explains the distinction between "window_size" and "migration_interval" (generations vs iterations)

3. Configuration Design

The GlobalLearningsConfig dataclass is well thought out:

  • Sensible defaults (disabled by default, failures-only when enabled)
  • Flexible thresholds and filtering options
  • Clear parameter names and inline comments

🟡 Areas of Concern

1. Test Coverage - CRITICAL

Issue: No unit tests were added for the global learnings system.

Impact: This is a complex feature with pattern extraction, aggregation logic, and prompt injection. Lack of tests means:

  • Error extraction regexes could fail on edge cases
  • Performance regression detection could have calculation errors
  • Checkpoint save/load could corrupt data
  • Prompt formatting could break with certain patterns

Recommendation: Add at minimum:

tests/test_global_learnings.py:
  - test_extract_syntax_errors()
  - test_extract_runtime_errors()
  - test_performance_regression_detection()
  - test_pattern_normalization()
  - test_failure_aggregation()
  - test_success_tracking()
  - test_prompt_generation()
  - test_checkpoint_persistence()

2. Papers Directory - Security & Purpose

Issues:

  • papers/download_papers.py (627 lines) downloads PDFs from multiple sources using API keys
  • Includes hardcoded email: openevolvetesting.worrier295@passmail.net
  • Uses external APIs (Semantic Scholar, OpenReview, ArXiv, Unpaywall)
  • No clear connection to OpenEvolve core functionality

Security Concerns:

  • API key handling via .env (good) but no validation
  • Rate limiting logic could be bypassed
  • PDF downloads could consume significant disk space (~5GB mentioned)
  • No integrity checks on downloaded files

Questions:

  • Why is this in the main repository? Should it be a separate tool/repo?
  • How do research papers relate to code evolution?
  • Is this used for benchmarking or documentation purposes?

Recommendation:

  • Add a README explaining the purpose and relationship to OpenEvolve
  • Consider moving to scripts/research/ or separate repo
  • Add checksum verification for downloaded files
  • Document the papers/ directory purpose in main README

3. Large Planning Documents

Three markdown files add 3,547 lines:

  • MULTI_BLOCK_IMPLEMENTATION_PLAN.md (2,018 lines)
  • SYSTEM_PROMPT_EVOLUTION_PLAN.md (1,309 lines)
  • GLOBAL_LEARNINGS_IMPLEMENTATION.md (220 lines)

Issue: These are massive planning documents that:

  • Describe features NOT in this PR (multi-block evolution, system prompt evolution)
  • Make the PR diff hard to review (8,758 additions total)
  • Could be GitHub issues or discussions instead

Recommendation:

  • Move MULTI_BLOCK_IMPLEMENTATION_PLAN.md and SYSTEM_PROMPT_EVOLUTION_PLAN.md to GitHub Issues or Wiki
  • Keep GLOBAL_LEARNINGS_IMPLEMENTATION.md as it documents implemented features
  • Or create a docs/design/ directory for architectural planning

4. Blocksworld Example - Incomplete?

Two config files for the same example:

  • examples/blocksworld/config.yaml
  • examples/blocksworld/blocksworld_config.yaml

Issue: Unclear which is canonical or if one is outdated.

Recommendation: Remove duplicate or add comment explaining the difference.


🔴 Bugs & Code Issues

1. Missing Type Annotations

In global_learnings.py:72, the result parameter uses Any:

def update_from_iteration(
    self,
    iteration: int,
    result: Any,  # ← Should be typed
    parent_metrics: Optional[Dict[str, float]] = None,
) -> None:

Recommendation: Define a proper type or use Protocol/TypedDict for result.

2. Pattern Normalization Could Lose Information

# openevolve/global_learnings.py:269-273
def _normalize_error_description(self, description: str) -> str:
    description = re.sub(r'\b(list|tuple|str|dict)\b index', 'sequence index', description)
    description = re.sub(r'\b\d+\b', 'N', description)  # All numbers → N
    return description.lower()

Issue: This might over-normalize. For example:

  • "IndexError at line 42" → "indexerror at line n"
  • "Array index 999 out of range" → "array index n out of range"

The first case loses line number context that could be valuable.

Recommendation: Consider preserving line numbers or adding configuration for normalization level.

3. Config.py - Missing @DataClass Decorator

# openevolve/config.py:330
class GlobalLearningsConfig:  # ← Missing @dataclass decorator!
    """Configuration for global learnings aggregation"""
    enabled: bool = False
    ...

Impact: This will fail at runtime - the class needs @dataclass to work with field defaults.

Fix Required: Add @dataclass decorator before the class definition.


🧪 Testing Strategy Needed

Given the lack of tests, here is a testing roadmap:

Unit Tests (high priority):

tests/test_global_learnings.py:
  - Pattern extraction accuracy (various error formats)
  - Aggregation with windowing
  - Checkpoint serialization roundtrip
  - Prompt formatting (all verbosity levels)
  - Edge cases: empty patterns, missing metrics

tests/test_global_learnings_integration.py:
  - Integration with controller
  - Multi-iteration tracking
  - Cross-island aggregation
  - Resume from checkpoint with learnings

Integration Tests (medium priority):

make test-integration  # Verify learnings appear in prompts

📝 Documentation & Code Quality

Positives:

  • Docstrings are comprehensive and helpful
  • Type hints used consistently (except noted issues)
  • Configuration is well-documented with inline comments
  • CLAUDE.md updates significantly improve usability

Improvements Needed:

  • Add module-level docstring examples in global_learnings.py
  • Document the papers/ directory purpose
  • Add usage example to blocksworld README

🔒 Security Assessment

Low-to-Medium Risk:

  1. API Key Handling: Uses environment variables (good) but no validation
  2. External Downloads: Papers downloader fetches arbitrary URLs - could be risky
  3. Disk Usage: Could fill disk with PDFs if misconfigured
  4. No Input Validation: Download scripts do not validate conference names/years

Recommendations:

  • Add API key validation before making requests
  • Implement disk space checks before downloads
  • Add URL whitelist for paper sources
  • Document security considerations in papers/README.md

📊 Performance Considerations

Global Learnings:

  • Pattern aggregation is O(n) per iteration - acceptable
  • Window trimming keeps memory bounded
  • Prompt injection adds ~100-500 tokens per generation (moderate cost)

Recommendation: Consider adding metrics:

  • Token count for injected learnings
  • Time spent in pattern extraction
  • Memory usage of pattern dictionaries

✅ Recommendations Summary

Must Fix Before Merge:

  1. ❌ Add @dataclass decorator to GlobalLearningsConfig (breaks at runtime)
  2. ❌ Add unit tests for global learnings system
  3. ❌ Document or remove duplicate blocksworld configs
  4. ❌ Add README to papers/ explaining purpose

Should Fix:

  1. ⚠️ Type result parameter properly instead of Any
  2. ⚠️ Move planning documents to GitHub Issues or separate directory
  3. ⚠️ Add integration test verifying learnings injection
  4. ⚠️ Review pattern normalization to preserve useful context

Nice to Have:

  1. 💡 Add performance metrics for learnings system
  2. 💡 Add examples of evolved programs with learnings enabled
  3. 💡 Document papers directory relationship to OpenEvolve
  4. 💡 Add checksum verification for downloaded papers

🎯 Overall Assessment

Score: 6.5/10

Strengths: The global learnings implementation is architecturally sound and well-integrated. Documentation improvements are excellent.

Weaknesses: Critical lack of tests, unclear purpose of papers directory, missing @dataclass decorator will cause runtime errors, and PR is bloated with unrelated planning documents.

Recommendation: Request Changes - Fix the critical issues (especially the missing decorator and test coverage) before merging. Consider splitting this into multiple PRs:

  1. Global learnings (with tests)
  2. Blocksworld example
  3. Papers utilities (if needed at all)
  4. Documentation/planning (as GitHub issues)

🤝 Final Thoughts

This is ambitious work with good ideas, but needs polish before production readiness. The global learnings concept is valuable - seeing LLMs avoid repeated mistakes could significantly improve convergence. However, without tests, we cannot be confident it works correctly.

Happy to discuss any of these points! The core work is solid - let us just ensure it is production-ready.

References:

  • Global learnings implementation: openevolve/global_learnings.py:1-509
  • Config issue: openevolve/config.py:330
  • Documentation: CLAUDE.md, GLOBAL_LEARNINGS_IMPLEMENTATION.md

Implements automatic summarization of failure/success patterns using LLM to generate actionable, natural language guidance instead of raw metric changes.

Key changes:
- Enhanced FailurePattern and SuccessPattern dataclasses with code context fields (parent_code, child_code, code_diff, metrics)
- Added _summarize_pattern_with_llm_sync() that calls LLM with pattern + code context to generate 2-sentence summaries
- Handles full rewrites by including both parent and child code when code_diff="Full rewrite"
- Uses thread pool to handle async LLM calls when event loop is already running
- Caches LLM summaries to avoid redundant calls
- Logs warnings when code context is missing
- Updated prompt to ask LLM to: 1) describe the code change/pattern, 2) state what to avoid/apply
- Format changed to "(seen Nx) {summary}" for better readability
- Lowered min_failure_count to 1 for testing
- Changed log_level to DEBUG in circle packing config for visibility
- Updated circle packing evaluator to return EvaluationResult with artifacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Pull Request Review: Global Learnings 2

Summary

This PR adds significant new functionality (Global Learnings system) and a new example (blocksworld), but includes ~19,000 lines of test output/checkpoint files that should not be committed to the repository.


🚨 Critical Issues

1. Checkpoint/Test Output Files Should Not Be in Git

The PR includes hundreds of checkpoint files and test outputs that should be gitignored:

  • examples/circle_packing/test_global_learnings_working/ - entire directory with 300+ checkpoint JSON files
  • These add ~15,000+ lines and will bloat the repository unnecessarily

Action Required:

  • Remove examples/circle_packing/test_global_learnings_working/ directory
  • Verify .gitignore already covers this (it has openevolve_output*/ and global_learning*/)
  • Consider adding examples/*/test*/ to .gitignore

2. Missing Core Code Changes

The implementation docs reference changes to core files, but I cannot verify these are included:

  • openevolve/config.py - Should have GlobalLearningsConfig
  • openevolve/controller.py - Should initialize and save/load global learnings
  • openevolve/process_parallel.py - Should pass learnings to workers
  • openevolve/prompt/sampler.py - Should inject learnings into prompts

Action Required: Confirm these core files were modified and are part of the PR.


✅ Positive Aspects

Global Learnings System (openevolve/global_learnings.py)

Well-structured implementation with:

  • ✅ Clean dataclass design (FailurePattern, SuccessPattern)
  • ✅ Comprehensive pattern extraction (syntax, runtime, performance)
  • ✅ Checkpoint save/load support
  • ✅ Configurable tracking modes (failures-only, successes, both)
  • ✅ Good separation of concerns

Documentation

  • ✅ Excellent GLOBAL_LEARNINGS_IMPLEMENTATION.md with usage examples
  • ✅ Updated CLAUDE.md with helpful developer guidance
  • ✅ Clear configuration examples

Blocksworld Example

  • ✅ Well-structured domain (simulator + planner separation)
  • ✅ Comprehensive evaluator with timeout handling
  • ✅ Good test problem generation (both stack and rearrange)
  • ✅ Proper use of EvaluationResult API

🔧 Code Quality Issues

blocksworld_evaluator.py

  1. Threading for Timeout is Risky (lines 308-312, 471-474)

    thread = threading.Thread(target=solve_wrapper)
    thread.daemon = True
    thread.start()
    thread.join(timeout=timeout_seconds)
    • Issue: Daemon threads can't be truly killed; they'll keep running after timeout
    • Impact: May cause resource leaks and interfere with subsequent evaluations
    • Suggestion: Use multiprocessing with Process.terminate() for hard timeouts, or rely on OpenEvolve's built-in cascade evaluation timeouts
  2. Code Duplication

    • Lines 308-356 and 462-502 have nearly identical timeout/evaluation logic
    • Suggestion: Extract to a helper function like _run_with_timeout(problem, timeout)
  3. Magic Number

    • Line 93: optimal_length = 2 * (n_blocks - 1) - no comment explaining why
    • Suggestion: Add comment: "Each block needs pickup + stack actions"
  4. Inconsistent Error Handling

    • Lines 350-356: Some exceptions set solve_time = 0, others preserve it
    • Suggestion: Standardize error handling behavior

blocksworld_planner.py

  1. Infinite Loop by Design (line 112)

    while True:  # Run until solution found or timeout
    • Concern: Relies entirely on external timeout; no internal stopping condition
    • Risk: If evaluator timeout fails, this hangs forever
    • Suggestion: Add max attempts parameter: for attempt in range(max_attempts):
  2. No Seeding for Random Search

    • Random planner uses random.choice() but no seed control
    • Issue: Evolution runs won't be reproducible even with fixed config seed
    • Suggestion: Accept optional seed parameter and call random.seed(seed)

circle_packing/evaluator.py

  1. Complex Subprocess Approach (line 77-100)
    • The run_with_timeout function is very complex with temp files
    • Question: Why is this needed instead of OpenEvolve's built-in timeout?
    • Risk: Platform-specific issues (signal handling on Windows)

🐛 Potential Bugs

blocksworld_evaluator.py

  1. Race Condition After Timeout (line 316)

    if thread.is_alive():
        timed_out = True
    elif result and result['success']:
    • Bug: result may still be None if thread just barely timed out
    • Fix: Check result is not None before accessing
  2. Division by Zero Risk (line 508)

    progress_score = problems_solved / TOTAL_PROBLEMS  # TOTAL_PROBLEMS=95
    • If TOTAL_PROBLEMS were ever 0, this would crash
    • Low risk since it's hardcoded, but could add assertion

blocksworld_planner.py

  1. State Hashing May Fail (line 19)
    def __hash__(self):
        return hash((self.on, self.clear, self.holding))
    • If self.on or self.clear contain unhashable items, this fails
    • Current code seems safe (tuples of tuples), but fragile to changes

🎯 Design Considerations

Global Learnings System

  1. String-Based Pattern Matching

    • Uses basic string matching to group similar errors
    • Limitation: May not catch semantically similar errors with different syntax
    • Future: Consider fuzzy matching or LLM-based clustering
  2. Cross-Island Only

    • Learnings are global, not island-specific
    • Trade-off: Good for avoiding repeated mistakes, but might suppress island diversity
    • Future: Consider per-island learnings with periodic global aggregation
  3. Prompt Injection Point

    • Injects into system message by default
    • Consideration: Large learnings may consume significant context
    • Suggestion: Add token budget limit for learnings section

Blocksworld Example

  1. Random Search as Initial Program

    • Very inefficient (exponential search space)
    • Good for evolution: Provides clear improvement gradient
    • Question: What's the expected evolved strategy? (heuristic search, A*, etc.)
  2. Optimal Length Estimation (line 154)

    optimal_length = max(2 * misplaced, n_blocks)  # rough estimate
    • This is just a heuristic, not true optimal
    • Impact: Efficiency scores may be misleading
    • Suggestion: Rename to estimated_optimal_length or compute true optimal with A*

🔒 Security Concerns

Low Risk

  • No credential handling
  • No network calls (except LLM which is existing code)
  • File operations are properly scoped to temp directories

Observations

  • Subprocess creation in evaluators is properly isolated
  • No SQL injection or command injection vectors
  • Pickle usage in circle_packing evaluator (line 14) - acceptable for internal use

🧪 Test Coverage

Missing Tests

  • No unit tests for GlobalLearnings class
  • No integration tests for global learnings feature
  • Blocksworld example has no automated tests

Recommendations

Per GLOBAL_LEARNINGS_IMPLEMENTATION.md:

  1. Add unit tests for pattern extraction (_extract_syntax_errors, _extract_runtime_errors)
  2. Add test for prompt injection (generate_prompt_section)
  3. Add checkpoint save/load test
  4. Run small integration test (50 iterations) to verify end-to-end

📊 Performance Considerations

  1. Checkpoint File Size

    • Global learnings adds JSON file per checkpoint
    • Impact: Small (< 10KB typically based on examples)
    • Acceptable
  2. Prompt Token Usage

    • Default max_learnings=5 is reasonable
    • With verbosity="detailed" could add 200-500 tokens
    • Suggestion: Document expected token overhead in config
  3. Blocksworld Timeout

    • 5 seconds per problem for 95 problems = 7.9 min max evaluation time
    • Concern: Very slow for evolution iterations
    • Suggestion: Consider cascade evaluation to fail fast on small problems

📝 Documentation Issues

CLAUDE.md

  • ✅ Well structured and comprehensive
  • ✅ Good command examples
  • ✅ Clear troubleshooting section
  • ⚠️ Line 164: "not iterations!" - slightly confusing phrasing
  • Suggestion: Rephrase to "measured in generations per island, not total iterations"

Implementation Plans

  • MULTI_BLOCK_IMPLEMENTATION_PLAN.md - 2,018 lines
  • SYSTEM_PROMPT_EVOLUTION_PLAN.md - 1,309 lines
  • Question: Should these live in repo root or in docs/ or design docs folder?

✨ Suggestions for Improvement

High Priority

  1. ❗ Remove test output directories from PR
  2. ❗ Verify core code changes are included
  3. Fix threading-based timeout in blocksworld evaluator
  4. Add reproducibility (seeding) to blocksworld planner

Medium Priority

  1. Add unit tests for global learnings
  2. Extract duplicated timeout logic
  3. Document token overhead of global learnings feature
  4. Add max_attempts to random search to prevent infinite loops

Low Priority

  1. Consider per-island learnings in future
  2. Optimize blocksworld evaluation speed
  3. Move implementation plan docs to docs/ folder

🎓 Final Verdict

Recommendation: Request Changes

The global learnings feature is well-designed and the blocksworld example is solid, but the PR has critical issues:

  • 15K+ lines of test output files that must be removed
  • Missing verification of core code integration
  • Threading-based timeouts need fixing for production use

Once these are addressed, this will be a valuable addition to OpenEvolve.

Estimated Effort to Fix: 1-2 hours

  • Remove test output files
  • Switch to multiprocessing-based timeouts
  • Verify core integration
  • Add basic unit tests

Let me know if you'd like me to help with any of these fixes!

- Enable global learnings system with failure tracking
- Set window_size to 100 iterations and max_learnings to 10
- Configure concise verbosity for learnings injection
- Add DEBUG log level for detailed execution monitoring
- Track syntax, runtime, and performance regression errors

This configuration allows the blocksworld planner to learn from
common failure patterns across islands and avoid repeating mistakes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Oct 9, 2025

Pull Request Review: Global Learnings 2

Thank you for this substantial contribution! I've reviewed the PR and have observations organized by category.

Overall Assessment

This PR focuses on planning and documentation for major OpenEvolve enhancements. The changes (953 files, ~20K additions) consist of:

  • Planning documents (3 new markdown files)
  • Test/checkpoint artifacts (~900+ files in examples/)
  • Minimal code changes (3 Python files modified)

Documentation & Planning

Strengths:

  • GLOBAL_LEARNINGS_IMPLEMENTATION.md: Excellent comprehensive design with clear architecture
  • MULTI_BLOCK_IMPLEMENTATION_PLAN.md: Very thorough with detailed phases and time estimates
  • SYSTEM_PROMPT_EVOLUTION_PLAN.md: Solid evaluation methodology and experimental design
  • Updated CLAUDE.md: Improved project documentation

Concerns:

  • Planning vs. Implementation Gap: These are implementation plans, not actual implementations
  • Documentation Scope: 2,000+ lines of planning docs but no corresponding test files or implementation code

Repository Hygiene - MAJOR ISSUES

Test Artifacts Committed (~900 files)

Problem: examples/circle_packing/test_global_learnings_working/ contains hundreds of checkpoint JSON files

Recommendation:

  • Remove: git rm -r examples/circle_packing/test_global_learnings_working/
  • Update .gitignore to exclude: test outputs, checkpoints, openevolve_output, pkl files

Code Changes

Modified files:

  • examples/blocksworld/blocksworld_evaluator.py (+571 lines)
  • examples/blocksworld/blocksworld_planner.py (+168 lines)
  • examples/circle_packing/evaluator.py (+85, -43 lines)

Cannot review without diffs. Questions:

  • Are these new examples or modifications?
  • Do they follow Black formatting (line length 100)?
  • Any security concerns?
  • Proper sandboxing/timeouts?

Security Considerations

Based on documentation:

  1. Global Learnings: Needs sanitization of error messages before LLM injection
  2. System Prompt Evolution: Needs cost limits and validation
  3. Multi-Block Evolution: Needs robust validation of evolved code blocks

Testing - MISSING

  • No test files added despite 3 implementation plans
  • No CI/CD updates
  • Cannot assess test coverage

Recommendations:

  • Add tests/test_global_learnings.py
  • Add tests/test_multi_block_utils.py
  • Ensure make test passes

Performance Considerations

From documentation:

  • ✅ Global learnings: Minimal overhead with configurable intervals
  • ⚠️ Multi-block: Could significantly increase prompt token usage
  • ⚠️ System prompt evolution: No caching or rate limiting mentioned

Recommendations

Critical:

  1. Remove test artifacts from git
  2. Update .gitignore for outputs/checkpoints
  3. Clarify PR scope - Planning or implementation?

High Priority:

  1. Add unit tests for new functionality
  2. Share diffs for the 3 modified Python files
  3. Validate examples run successfully

Nice to Have:

  1. Move plans to docs/design/ directory
  2. Consider incremental implementation

Positive Highlights

  1. Excellent, thorough planning
  2. Clear backward compatibility considerations
  3. High-quality documentation
  4. Thoughtful design with edge case coverage

Conclusion

This PR demonstrates excellent planning and design work but needs cleanup before merging.

Accept if: Design proposal PR + artifacts removed + .gitignore updated

Request changes if: Intended as feature implementation (needs code + tests) OR test artifacts remain

Overall: Request changes to remove test artifacts and clarify scope.

Questions for Author

  1. Is this PR primarily for planning/design or implementation?
  2. Are features in markdown files actually implemented?
  3. Were test artifacts committed intentionally?
  4. Review priority (urgent, normal, low)?
  5. Should plans move to docs/ directory?

Review per CLAUDE.md conventions. Focused on code quality, security, testing, and best practices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant