Skip to content

Add plot tab#3

Draft
dtdannen wants to merge 4 commits into
mainfrom
add-plot-tab
Draft

Add plot tab#3
dtdannen wants to merge 4 commits into
mainfrom
add-plot-tab

Conversation

@dtdannen
Copy link
Copy Markdown
Owner

@dtdannen dtdannen commented Oct 8, 2025

No description provided.

dtdannen and others added 3 commits October 6, 2025 15:21
This commit adds a fourth "Plot" tab to the evolution visualizer that
displays how metrics evolve across checkpoint iterations.

Changes:
- Backend (visualizer.py): Scan sibling checkpoints and aggregate best
  program metrics per iteration
- Frontend (index.html): Add Plot tab UI element and view container
- Plot visualization (plot.js): D3.js line chart with zoom/pan support,
  tooltips, and metric dropdown integration
- Tab management (mainUI.js): Add "plot" to tabs array for switching
- Data loading (main.js): Pass iteration_metrics to plot renderer

Features:
- Line chart showing metric value vs iteration number
- Responsive to metric dropdown changes
- Zoom and pan support consistent with other tabs
- Hover tooltips showing exact values
- Grid lines for readability
- Works with existing checkpoint_interval (default: 10)
- Gracefully handles missing data (single checkpoint mode)

This enables easy comparison of evolution progress across different runs
and helps identify convergence speed and plateaus.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@dtdannen
Copy link
Copy Markdown
Owner Author

dtdannen commented Oct 8, 2025

@claude can you review this?

@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Claude finished @dtdannen's task —— View job


Code Review for Plot Tab PR

I've completed a thorough review of this PR. Overall, the plot tab implementation is solid and the new examples/documentation add significant value. However, there are several issues that need attention before merging.


✅ Strengths

  1. Plot Tab Implementation - Well-architected visualization with D3.js

    • Clean separation of concerns in plot.js
    • Proper zoom/pan behavior with state preservation
    • Responsive design with window resize handling
    • Good UX: tooltips, animations, empty state messaging
  2. Backend Integration - Smart checkpoint scanning approach

    • load_iteration_metrics_from_checkpoints() in visualizer.py:29-86 efficiently loads metrics from sibling checkpoints
    • Proper error handling for missing/malformed data
  3. Documentation - Significant CLAUDE.md improvements

    • Much more comprehensive development guidance
    • Clear architecture documentation
    • Good troubleshooting section

🐛 Issues to Fix

Critical Issues

  1. Broken mainUI.js - Dark mode initialization error (scripts/static/js/mainUI.js:6-8)

    const darkToggleContainer = document.getElementById('darkmode-toggle').parentElement;
    const darkToggleInput = document.getElementById('darkmode-toggle');
    const darkToggleLabel = document.getElementById('darkmode-label');

    Problem: Code references DOM elements before checking if they exist, then tries to create custom toggle at line 10-28. This will fail if the elements are missing on initial load.

    Fix: Add null checks before accessing .parentElement and other properties:

    const darkToggleInput = document.getElementById('darkmode-toggle');
    const darkToggleLabel = document.getElementById('darkmode-label');
    if (!darkToggleInput || !darkToggleLabel) return;
    const darkToggleContainer = darkToggleInput.parentElement;
  2. Memory leak in plot.js (scripts/static/js/plot.js:256-267)
    The tooltip div is created every time updatePlot() is called but never removed. On repeated renders (metric changes, window resize), this creates multiple tooltip divs.

    Fix: Reuse existing tooltip or remove old one:

    // Remove old tooltip if exists
    d3.select('#view-plot .plot-tooltip').remove();
    
    const tooltip = d3.select('#view-plot')
        .append('div')
        .attr('class', 'plot-tooltip')
        // ... rest of styles
  3. Race condition in plot rendering (scripts/static/js/main.js:116-119)

    if (window.renderPlot && data.iteration_metrics) {
        window.renderPlot(data.iteration_metrics);
    }

    window.renderPlot is set at the end of plot.js (line 326) but may not be loaded yet when main.js runs. This is an ES6 module import issue.

    Fix: Import properly at top of main.js:

    import { renderPlot } from './plot.js';
    // Then use: renderPlot(data.iteration_metrics);

Major Issues

  1. Missing error handling in visualizer.py (scripts/visualizer.py:64-78)
    The loop that loads best_program_info.json files catches json.JSONDecodeError and KeyError but not FileNotFoundError or PermissionError which could occur at line 64.

    Fix: Broader exception handling:

    try:
        with open(best_info_path, "r") as f:
            best_info = json.load(f)
        # ...
    except (json.JSONDecodeError, KeyError, IOError) as e:
        logger.debug(f"Error loading best_program_info from {ckpt_dir}: {e}")
        continue
  2. Inconsistent metric handling (scripts/static/js/plot.js:76-83)
    The code filters for valid data but doesn't handle the case where d.metrics exists but is not an object (could be null, array, etc.).

    Fix: Add type check:

    const validData = data.filter(d =>
        d.metrics &&
        typeof d.metrics === 'object' &&
        !Array.isArray(d.metrics) &&
        typeof d.metrics[metric] === 'number' &&
        isFinite(d.metrics[metric])
    );

Minor Issues

  1. Code duplication in plot.js

    • Lines 53-68 and 86-99 have nearly identical "no data" message rendering
    • Should extract to shared function
  2. Missing JSDoc comments

    • plot.js exports and functions lack documentation
    • Makes maintenance harder
  3. Potential XSS in tooltip (scripts/static/js/plot.js:279-282)
    While metric comes from a controlled dropdown, it's interpolated directly into HTML without escaping. If metric names ever come from user data, this could be a security issue.

    Recommendation: Use textContent or escape HTML:

    .html(`
        <strong>Iteration ${d.iteration}</strong><br/>
        <span class="metric-name"></span>: ${d.metrics[metric].toFixed(4)}
    `)
    // Then: tooltip.select('.metric-name').textContent = metric;
  4. Tab switching logic duplication (scripts/static/js/mainUI.js:31-56)
    Very similar pattern to existing tab code - could be refactored to reduce duplication


🤔 Questions & Concerns

Blocksworld Example

  1. Evaluator complexity (examples/blocksworld/blocksworld_evaluator.py)

    • 571 lines is quite large for an evaluator
    • Contains substantial domain logic (state validation, problem generation)
    • Should some of this be extracted to a separate module?
  2. Timeout configuration (examples/blocksworld/config.yaml:65)

    evaluator:
      timeout: 1000  # Comment says "5s × 95 problems = 475s max"

    Why 1000s if max expected is 475s? Is this intentional buffer or should it be tuned?

Papers Scripts

  1. Massive PR scope - 5 different paper downloading scripts totaling ~2800 lines

    • download_papers.py, download_papers_v2.py, download_papers_openreview.py, etc.
    • Are all of these needed? Seems like exploratory code
    • Should these be in a separate PR or cleaned up before merge?
  2. API credentials in code (papers/download_papers.py:36-37)

    SEMANTIC_SCHOLAR_API_KEY = os.getenv("SEMANTIC_SCHOLAR_API_KEY")  # Optional
    EMAIL = os.getenv("EMAIL", "openevolvetesting.worrier295@passmail.net")

    Default email is hardcoded - is this intentional for testing or should it require explicit configuration?

Planning Documents

  1. Large uncommitted design docs

    • MULTI_BLOCK_IMPLEMENTATION_PLAN.md (2018 lines)
    • SYSTEM_PROMPT_EVOLUTION_PLAN.md (1309 lines)

    These are excellent planning documents but:

    • Should they be in /docs directory instead of repo root?
    • Are they meant as temporary planning or permanent documentation?
    • Consider adding a README section explaining their purpose

📋 Recommendations

High Priority

Medium Priority

Low Priority

  • Consider moving planning docs to /docs
  • Add documentation about paper scripts purpose
  • Evaluate if all 5 paper download scripts are necessary

Before Merge

  • Add PR description explaining the changes
  • Consider splitting papers scripts into separate PR (they're 75% of the additions)
  • Run linting: make lint or python -m black openevolve examples tests scripts

📊 Test Coverage

Not tested (should add before merge):

  • Plot tab with empty iteration_metrics
  • Plot tab with missing metrics
  • Metric dropdown changes while plot is visible
  • Window resize during plot interaction
  • Multiple checkpoint loads

Summary

The plot tab feature is well-implemented and adds valuable visualization capability. The main concerns are:

  1. 3 critical bugs that will cause runtime errors (dark mode init, memory leak, race condition)
  2. Large PR scope - mixing feature implementation with exploratory code (paper scripts)
  3. Missing tests for the new visualization features

I recommend fixing the critical bugs, cleaning up the paper scripts (or moving to separate PR), and adding basic tests before merging.

Estimated effort to fix critical issues: 1-2 hours


Job: https://github.com/dtdannen/openevolve/actions/runs/
Branch: add-plot-tab

@claude
Copy link
Copy Markdown

claude Bot commented Oct 8, 2025

Pull Request Review: Add Plot Tab

Overview

This PR adds significant new functionality including visualization features, a new blocksworld example, documentation improvements, and paper download utilities. Overall, the changes are well-structured but there are several areas that need attention before merging.

🟢 Strengths

1. Excellent Documentation Updates (CLAUDE.md)

  • Comprehensive improvements with clear command examples
  • Better organized architecture descriptions
  • Added library API examples showing the public interface
  • Improved troubleshooting section with actionable guidance
  • Good use of code examples throughout

2. Well-Designed Plot Visualization Feature

  • Clean separation of concerns in plot.js (326 lines)
  • Good use of D3.js for interactive plotting
  • Proper zoom behavior preservation
  • Responsive design with window resize handling
  • Helpful tooltips and grid lines for better UX
  • Graceful error handling for missing data

3. Solid Blocksworld Example Implementation

  • Clean domain modeling with immutable State dataclass
  • Comprehensive evaluator with multiple test strategies
  • Good separation between simulator, planner, and evaluator
  • Thorough validation of state validity (no cycles, proper structure)
  • Supports both hardcoded and random problem generation
  • Early termination on failures for efficient cascade evaluation

🟡 Issues & Concerns

Critical: Paper Download Scripts May Not Belong in Core Repository

The papers/ directory (5 Python scripts, 627+ lines) appears to be research tooling rather than core OpenEvolve functionality:

Concerns:

  1. Scope creep: Paper downloading is tangential to an evolutionary coding framework
  2. API keys required: Requires SEMANTIC_SCHOLAR_API_KEY and email in .env (security consideration)
  3. Large data footprint: Comments mention "~5GB disk space for PDFs"
  4. Multiple external dependencies: requests, dotenv, tqdm, tenacity
  5. Rate limiting complexity: Different rate limits for 4 different APIs

Recommendation:

  • Consider moving to a separate research/ or tools/ subdirectory
  • Add clear documentation about why this is included
  • OR move to a separate repository and reference it in docs
  • At minimum: Add a dedicated README in papers/ explaining its purpose and that it's optional

If keeping in repo:

  • Add papers/README.md explaining this is optional research tooling
  • Ensure papers/requirements.txt is properly documented in setup docs
  • Add to .gitignore properly (already has papers/data and papers/.env ✓)

Major: Planning Documents in Root Directory

Two large planning documents added to root:

  • MULTI_BLOCK_IMPLEMENTATION_PLAN.md (2018 lines)
  • SYSTEM_PROMPT_EVOLUTION_PLAN.md (1309 lines)

Issues:

  1. Not implemented yet: These are planning docs for future work, not current functionality
  2. Root clutter: Root directory should contain only essential docs
  3. Misleading: Users might think these features exist

Recommendation:

  • Move to docs/design/ or docs/planning/ subdirectory
  • Add clear headers stating these are future work proposals
  • OR remove from this PR and add them when the features are actually implemented

Minor: Visualizer Changes Mixed with Feature Addition

File: scripts/visualizer.py

Changes:

  1. Added iteration metrics loading (lines 29-86) ✓
  2. Modified data endpoint to include metrics ✓
  3. Overall well-implemented

Minor concern: The load_iteration_metrics_from_checkpoints() function has some edge case handling that could be cleaner:

# Line 68: Prefer current_iteration but fallback to iteration
iteration = best_info.get("current_iteration", best_info.get("iteration", 0))

Suggestion: Document why this fallback exists (checkpoint format changes over time?).

Code Quality Issues

1. Blocksworld Evaluator - Threading Safety

File: examples/blocksworld/blocksworld_evaluator.py

Lines 308-312, 462-474:

# Threading timeout implementation
thread = threading.Thread(target=solve_wrapper)
thread.daemon = True
thread.start()
thread.join(timeout=timeout_seconds)

Issues:

  • Daemon threads may not clean up properly
  • No mechanism to actually stop the thread after timeout (thread keeps running)
  • Using nonlocal result in nested function could cause race conditions

Recommendation:

  • Use multiprocessing with proper process termination instead of threading
  • OR use concurrent.futures.ThreadPoolExecutor with proper cancellation
  • OR document that threads will continue running until natural completion

2. Hardcoded Values

File: examples/blocksworld/blocksworld_evaluator.py

Lines 507-508:

TOTAL_PROBLEMS = 95  # Hardcoded constant for problems 6-100
progress_score = problems_solved / TOTAL_PROBLEMS

Issue: If problem_sizes parameter changes, this hardcoded value becomes incorrect.

Fix:

# Calculate total from range
TOTAL_PROBLEMS = len(range(6, 101))  # = 95
# OR better: pass as parameter

3. Potential State Validation Bug

File: examples/blocksworld/blocksworld_evaluator.py

Lines 148-149:

assert is_valid_state(initial, blocks), "Generated invalid initial state"
assert is_valid_state(goal, blocks), "Generated invalid goal state"

Issue: Using assert for runtime validation. Asserts are disabled with python -O.

Fix:

if not is_valid_state(initial, blocks):
    raise ValueError("Generated invalid initial state")
if not is_valid_state(goal, blocks):
    raise ValueError("Generated invalid goal state")

4. Plot Tab Addition - Missing Error Boundary

File: scripts/static/js/plot.js

Line 76-99: Good error handling for missing metrics ✓

Suggestion: Add error boundary for D3.js initialization failures (e.g., if D3 CDN fails to load).

5. Index.html - CDN Dependency

File: scripts/templates/index.html

Line 6:

<script src="https://d3js.org/d3.v7.min.js"></script>

Concern: CDN dependency means visualizer won't work offline.

Recommendation:

  • Document this requirement in visualizer docs
  • OR vendor D3.js locally for offline support
  • Consider adding integrity hash for security: integrity="sha384-..."

🔵 Best Practices & Suggestions

1. Testing Coverage

No tests added for new functionality. Consider adding:

  • Unit tests for load_iteration_metrics_from_checkpoints()
  • Integration test for plot tab rendering
  • Tests for blocksworld problem generation and validation

2. Configuration Files

File: examples/blocksworld/config.yaml and blocksworld_config.yaml

Two config files exist - this seems redundant. Document the difference or remove one.

3. Code Comments

The plot.js code is well-structured but lacks JSDoc comments for exported functions. Consider adding:

/**
 * Renders the evolution progress plot with iteration metrics
 * @param {Array} data - Array of {iteration, metrics} objects
 */
export function renderPlot(data) { ... }

4. Blocksworld Planner Evolution Block

File: examples/blocksworld/blocksworld_planner.py

Lines 95-137: The evolve block implements random search.

Suggestion for evolution: This is a great starting point, but consider adding comments inside the evolve block guiding the LLM:

# EVOLVE-BLOCK-START
# TODO: Improve this planner! Ideas:
# - Add heuristic guidance (goal distance)
# - Implement BFS/A* search
# - Add state caching to avoid loops
def plan_blocks(...):
    ...
# EVOLVE-BLOCK-END

🔒 Security Considerations

  1. Paper Download Scripts - API Keys: .env file with API keys added to .gitignore
  2. CDN Integrity: D3.js loaded from CDN without integrity hash (see above)
  3. Email Exposure: Hardcoded email in download_papers.py line 37 - consider this is public code

📊 Performance Considerations

  1. Iteration Metrics Loading: load_iteration_metrics_from_checkpoints() reads all sibling checkpoints on every data request. Consider:

    • Caching this data
    • Only reloading when filesystem changes detected
  2. Plot Rendering: Re-renders entire plot on metric change - this is fine for typical dataset sizes but could be optimized for large evolution runs

✅ Recommendations Summary

Before Merging:

  1. Decision required: Keep or relocate paper download scripts?
  2. Move planning documents to docs/ subdirectory (or remove from PR)
  3. Fix threading safety in blocksworld evaluator (use multiprocessing)
  4. Replace asserts with proper runtime validation
  5. Resolve duplicate config files in blocksworld example
  6. Document the optional vs. required nature of papers/ directory

Nice to Have:

  1. Add tests for new visualization features
  2. Add JSDoc comments to plot.js
  3. Add CDN integrity hashes
  4. Add caching for iteration metrics loading
  5. Add error boundary for D3.js initialization

📝 Final Verdict

Conditional Approval - The core visualization features and blocksworld example are well-implemented, but the PR mixes several concerns:

  • ✅ Plot tab feature (good)
  • ✅ Blocksworld example (good, minor fixes needed)
  • ✅ Documentation improvements (excellent)
  • ⚠️ Paper download scripts (questionable fit)
  • ⚠️ Unimplemented planning docs (should be in docs/ or separate PR)

Suggested Actions:

  1. Address threading safety and assert issues in blocksworld
  2. Make a clear decision on papers/ directory location
  3. Move planning documents to appropriate location
  4. Add brief release notes or update main README mentioning new plot tab feature

Great work overall! The plot visualization is a valuable addition to the project. 🚀

Bug fixes:
- Fix dark mode initialization error with null checks (mainUI.js)
- Fix memory leak by removing old tooltip before creating new (plot.js)
- Fix race condition with proper ES6 import of renderPlot (main.js)
- Add IOError handling for file operations (visualizer.py)
- Fix inconsistent metric type validation (plot.js)

New feature:
- Add plot_metrics.py: standalone matplotlib plotting script
- Support grouped runs with mean line and min/max shaded range
- Use ':' separator to define groups of checkpoint directories
- Configurable output, metrics, and plot styles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Oct 9, 2025

Pull Request Review: Add Plot Tab

Summary

This PR adds significant new functionality including a plotting tab for the visualizer, a blocksworld example, planning documents, and paper download scripts. The changes span 24 files with 8,525 additions.

Positive Aspects ✅

1. Well-Structured Visualization Code

  • scripts/plot_metrics.py is professionally written with clear documentation, proper error handling, statistical aggregation, and good matplotlib practices

2. Clean JavaScript Implementation

  • scripts/static/js/plot.js demonstrates solid D3.js skills with proper module pattern, responsive design, zoom/pan functionality, and interactive tooltips

3. Comprehensive Documentation

  • CLAUDE.md improvements are excellent with clear testing instructions, CLI entry points, expanded architecture explanations, and troubleshooting section

4. Code Quality

  • Consistent formatting (Black with line-length 100), type hints, good naming conventions, proper dataclass usage

Issues and Concerns ⚠️

1. Security Concerns 🔴 HIGH PRIORITY

papers/download_papers.py Line 37: Email address exposed in code (openevolvetesting.worrier295@passmail.net). Move to environment variable only - do not hardcode email as fallback. Add exponential backoff for 429 responses. PDF downloads: No validation of downloaded file content.

2. Missing Test Coverage 🟡 MEDIUM PRIORITY

No tests found for scripts/plot_metrics.py statistical aggregation logic, examples/blocksworld/blocksworld_evaluator.py critical evaluation logic, or scripts/visualizer.py new load_iteration_metrics_from_checkpoints function.

3. Performance Issues 🟡 MEDIUM PRIORITY

scripts/plot_metrics.py Line 216: Subplot layout calculation is incorrect. scripts/visualizer.py Line 29-86: Synchronous checkpoint file reading could be slow for large experiments (1000+ checkpoints).

4. Code Quality Issues 🟢 LOW PRIORITY

examples/blocksworld/blocksworld_evaluator.py Line 20-55: is_valid_state function has O(n²) worst case. scripts/plot_metrics.py Line 210: Bare except clause should specify exception type.

Specific Recommendations

Must Fix Before Merge:

  1. Remove hardcoded email from papers/download_papers.py:37
  2. Add .env.example showing required environment variables
  3. Add tests for blocksworld_evaluator.is_valid_state() and plot_metrics.compute_grouped_statistics()

Should Fix Before Merge:

  1. Fix subplot layout calculation in plot_metrics.py:216
  2. Add PDF content validation in paper download scripts
  3. Fix bare except in plot_metrics.py:210

Questions for Author

  1. Planning Documents: The MULTI_BLOCK_IMPLEMENTATION_PLAN.md is 2018 lines. Is this intended for this PR or future work?
  2. Papers Directory: Are the paper download scripts core to OpenEvolve functionality or research tooling?
  3. Blocksworld Example: What is the target audience? Educational or performance benchmarking?
  4. Testing Strategy: Would you prefer to add tests in a follow-up PR or as part of this one?

Overall Assessment

Recommendation: Approve with changes required

This PR adds valuable functionality but has critical security issues (hardcoded email) and missing test coverage. The core plotting and visualization code is high quality. Once the security concern is addressed and basic tests are added, this will be a solid addition.

Estimated effort to address: ~2-4 hours

Code Review Checklist

  • ✅ Code follows project conventions
  • ✅ Documentation updated
  • ⚠️ Tests added (MISSING)
  • ⚠️ Security concerns addressed (hardcoded email)
  • ✅ Performance considerations reasonable
  • ✅ Error handling present
  • ⚠️ Edge cases handled (some areas need work)
  • ✅ Backwards compatibility maintained

Great work overall! The visualization features are well-implemented. Please address the security issue and consider adding tests before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant