Optimization-Driven Experimental Design Guide

Complete guide to using the MicroGrowAgents optimization system for data-driven v14 design generation.

Table of Contents

  1. Overview
  2. Quick Start
  3. Pipeline Architecture
  4. Interpreting Optimization Reports
  5. Understanding v14 Recommendations
  6. Command Reference
  7. Troubleshooting
  8. Advanced Usage

Overview

The optimization system converts experimental data into actionable recommendations for the next design iteration. It combines:

  • Gaussian Process (GP) models - Surrogate models for response surface prediction
  • Pareto frontier analysis - Multi-objective optimization (OD600 + Nd_uM)
  • Sobol sensitivity analysis - Variance decomposition to identify important factors
  • Bayesian optimization - Expected Improvement acquisition for next experiments
  • Boundary effect detection - Identify factors saturated at design space limits
  • LinkML validation - Structured, validated recommendations for v14 design

Scientific Goal: Identify which ingredient ranges to expand, contract, or shift for the next experimental design iteration based on quantitative evidence from v10/v12 data.


Quick Start

1. Run Full Pipeline (Recommended)

# Start with experimental data directory
just analyze-experimental-full data/experimental/plate_designs_v10_results/

# This runs ALL steps:
# Step 1: Analysis (replicate statistics)
# Step 2: Clustering (hierarchical clustering)
# Step 3: Visualization (growth curves, heatmaps, PCA)
# Step 4: Response surfaces (GP model fitting)
# Step 5: Optimization report (Pareto, Sobol, BO, boundaries)
# Step 6: v14 recommendations (LinkML-validated YAML)

2. Check Outputs

# Optimization report
cat outputs/plate_designs_v10_results_experimental_analysis_absolute/optimization/OPTIMIZATION_REPORT_absolute.md

# v14 recommendations
cat outputs/recommendations/recommendations_v14.yaml

3. Review Visualizations

# Open key plots
open outputs/.../optimization/sensitivity_tornado_absolute.pdf
open outputs/.../optimization/acquisition_surface_absolute.pdf
open outputs/.../optimization/uncertainty_heatmap_absolute.pdf

Pipeline Architecture

Data Flow:
  Raw Experimental Data (plate1.tsv, plate2.tsv, plate3.tsv)
        ↓
  [Step 1-3] Analysis + Clustering + Visualization
        ↓
  replicate_statistics_{mode}.tsv
        ↓
  [Step 4] Response Surface Analysis (GP Fitting)
        ↓
  GP models (pickled), surface predictions
        ↓
  [Step 5] Optimization Report Generation
        ↓
  Pareto frontier, Sobol indices, BO suggestions, boundary effects
        ↓
  [Step 6] v14 Recommendations (absolute mode only)
        ↓
  recommendations_v14.yaml (LinkML-validated)

Step 5: Optimization Report

Inputs:

  • replicate_statistics_{mode}.tsv - Experimental data with ingredient concentrations and measurements
  • GP models (fitted or refitted from experimental data)

Outputs:

optimization/
├── OPTIMIZATION_REPORT_{mode}.md              # 6-section comprehensive report
├── pareto_experimental_{mode}.csv             # Experimental Pareto points (if 2+ measurements)
├── pareto_predicted_{mode}.csv                # GP surface Pareto points (novel optima)
├── next_experiments_bayesian_{mode}.csv       # Top 20 by Expected Improvement
├── sensitivity_sobol_{mode}.csv               # Sobol indices (ST, S1)
├── boundary_effects_{mode}.csv                # Boundary flags per ingredient
├── uncertainty_grid_{mode}.csv                # GP σ predictions (5000 points)
├── sensitivity_tornado_{mode}.pdf/png         # Ingredient ranking visualization
├── acquisition_surface_{mode}.pdf/png         # EI surface with top-5 markers
└── uncertainty_heatmap_{mode}.pdf/png         # GP uncertainty hexbin

Analyses Performed:

  1. Pareto Frontier (2+ measurements only):

    • Computes non-dominated points on GP surface (10,000-point Sobol grid)
    • Identifies novel optima not in experimental data
    • Compares experimental vs predicted Pareto frontiers
  2. Bayesian Optimization:

    • Computes Expected Improvement (EI) acquisition function
    • Suggests top 20 next experiments by EI score
    • Balances exploitation (high predicted response) vs exploration (high uncertainty)
  3. Sobol Sensitivity:

    • Variance-based global sensitivity analysis
    • Computes total-order (ST) and first-order (S1) indices
    • Ranks ingredients by importance (ST captures main + interaction effects)
  4. Boundary Effects:

    • Detects whether optima cluster at design space boundaries (within a 5% margin of either bound)
    • Flags ingredients needing range expansion (UPPER, LOWER, BOTH)
  5. Uncertainty Quantification:

    • Evaluates GP prediction std (σ) across design space
    • Identifies high-uncertainty regions for targeted exploration

Step 6: v14 Recommendations

Inputs:

  • All optimization report outputs (CSV files)

Outputs:

recommendations/
├── recommendations_v14.yaml               # LinkML-validated recommendations
├── v14_factor_recommendations.csv         # Tabular summary
└── v14_generation_provenance.json         # Decision log

Recommendation Logic:

for each ingredient:
    if boundary_effect == 'UPPER':
        adjustment_type = 'EXPAND_UPPER'
        new_max = current_max * 1.5  # Raise the upper bound by 50%

    elif boundary_effect == 'LOWER':
        adjustment_type = 'EXPAND_LOWER'
        new_min = max(0, current_min * 0.5)  # Halve the lower bound, never below 0

    elif sobol_index < 0.05:
        adjustment_type = 'CONTRACT'
        # Narrow to ±20% of median

    elif Pareto optimum is >30% away from the current range center:
        adjustment_type = 'SHIFT'
        # Center on Pareto median, keep width

    else:
        adjustment_type = 'MAINTAIN'
        # No change needed

    priority_score = 0.5*Sobol_ST + 0.3*(1-EI_rank/20) + 0.2*boundary_flag

    if priority_score > 0.6:
        priority = 'VERY HIGH'
    elif priority_score > 0.4:
        priority = 'HIGH'
    elif priority_score > 0.2:
        priority = 'MEDIUM'
    else:
        priority = 'LOW'
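
The range-adjustment branches above can be sketched as a small runnable function. This is a minimal illustration, not the code in scripts/recommend_v14_design.py: the name `recommend_range` and its arguments are hypothetical, "median" is assumed to mean the median of the Pareto-optimal values, and "far from center" is assumed to mean more than 30% of the center value.

```python
from statistics import median


def recommend_range(current_min, current_max, boundary_effect, sobol_st, pareto_values):
    """Return (adjustment_type, new_min, new_max) for one ingredient.

    Hypothetical helper that mirrors the branch order of the pseudocode above.
    """
    center = (current_min + current_max) / 2
    width = current_max - current_min
    pareto_median = median(pareto_values)

    if boundary_effect == "UPPER":
        return "EXPAND_UPPER", current_min, current_max * 1.5
    if boundary_effect == "LOWER":
        return "EXPAND_LOWER", max(0.0, current_min * 0.5), current_max
    if sobol_st < 0.05:
        # Assumption: "median" is the median of the Pareto-optimal values.
        return "CONTRACT", 0.8 * pareto_median, 1.2 * pareto_median
    # Assumption: "far from center" means more than 30% of the center value.
    if abs(pareto_median - center) > 0.3 * center:
        return "SHIFT", pareto_median - width / 2, pareto_median + width / 2
    return "MAINTAIN", current_min, current_max


# Ranges taken from the examples in this guide; Pareto values are hypothetical.
print(recommend_range(1.73, 28.25, "UPPER", 0.525, [27.0, 28.0]))
print(recommend_range(3, 63, "NONE", 0.38, [48, 50, 52]))
```

Under these assumptions the first call expands the (NH4)2SO4 upper bound and the second shifts the Succinate range so that it is centered on 50 mM with its width unchanged.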

Interpreting Optimization Reports

Section 1: Executive Summary

What to look for:

  • Best experimental condition (highest observed response)
  • Best predicted condition (GP model optimum)
  • Number of Pareto frontier points (novel optima discovered)
  • Number of BO suggestions (high-EI candidates)

Example:

**Best Experimental Condition:** OD600 = 1.113
**Best Predicted Condition:** OD600 = 0.912
**Pareto Frontier Points:** 15 novel optima identified
**Next Experiments Suggested:** 20 high-EI conditions

Interpretation:

  • Experimental best (1.113) is higher than predicted best (0.912) → GP model may be underestimating optimal region
  • 15 novel Pareto points → GP surface suggests unexplored high-performing conditions
  • 20 BO suggestions → Candidate conditions with the highest expected improvement over the current best

Section 2: Pareto Frontier Analysis

Key Insights:

  • Experimental Pareto: Actual non-dominated conditions from v10 data
  • Predicted Pareto: GP-predicted non-dominated points (10,000-point grid)
  • Novel optima: Predicted points NOT in experimental data
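
As a rough illustration of what "non-dominated" means here, the sketch below filters a handful of (OD600, Nd_uM) pairs. It assumes both objectives are maximized, which this guide does not state explicitly; the function name `pareto_front` and the data points are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(qk >= pk for qk, pk in zip(q, p)) and any(qk > pk for qk, pk in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (OD600, Nd_uM) pairs, for illustration only.
conditions = [(1.10, 0.20), (0.90, 0.50), (0.80, 0.40), (0.60, 0.70), (0.50, 0.10)]
print(pareto_front(conditions))
```

A point is dropped when some other point is at least as good on every objective and strictly better on at least one. The same filter applies to experimental data and to GP predictions on the sampled grid.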

What to look for:

  • Are predicted Pareto points far from experimental data? → Exploration opportunity
  • Do predictions cluster in specific ingredient regions? → Important factor
  • Are there multiple competing optima? → Trade-offs exist

Visualization: pareto_comparison_{mode}.pdf

  • Blue circles = Experimental Pareto
  • Red triangles = Predicted Pareto
  • Look for red triangles away from blue → novel optima

Section 3: Bayesian Optimization Results

Top 20 Next Experiments:

  • Ranked by Expected Improvement (EI) score
  • Higher EI = Better balance of predicted performance + uncertainty

What to look for:

  • High EI (>0.01): Strong candidates, likely to improve on the current best response
  • Clustered suggestions: Specific ingredient region worth exploring
  • Diverse suggestions: Model uncertain across design space

Example:

1. EI=0.0108, OD600_pred=0.912, σ=0.145
2. EI=0.0040, OD600_pred=0.874, σ=0.151

Interpretation:

  • Top suggestion has high EI (0.0108) → Run this condition first in v14
  • High σ (0.145) → Model uncertain in this region, needs data
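
To make the exploitation/exploration trade-off concrete, here is a minimal sketch of the standard closed-form Expected Improvement for maximization under a Gaussian prediction. The pipeline's own EI implementation may differ (for example in the incumbent it uses or an exploration offset), and the candidate values below are hypothetical.

```python
import math


def expected_improvement(mu, sigma, best_observed):
    """Closed-form EI for maximization under a Gaussian predictive distribution."""
    if sigma <= 0.0:
        return max(mu - best_observed, 0.0)
    z = (mu - best_observed) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_observed) * cdf + sigma * pdf


# Hypothetical candidates: (predicted OD600, GP sigma); best observed OD600 = 1.0.
candidates = [(0.90, 0.15), (0.90, 0.05), (0.98, 0.05)]
ranked = sorted(candidates, key=lambda c: expected_improvement(c[0], c[1], 1.0), reverse=True)
print(ranked)
```

In this toy ranking the candidate with the lower predicted mean but larger σ comes first, which is the exploration side of the balance described above.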

Visualization: acquisition_surface_{mode}.pdf

  • Stars mark top-5 EI maxima
  • Color intensity = EI value
  • Look for star clusters → promising regions

Section 4: Sensitivity Analysis (Sobol Indices)

Sobol Indices Explained:

  • S1 (First-order): Main effect of ingredient alone
  • ST (Total-order): Main effect + all interactions
  • ST - S1: Interaction effects
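
The sketch below estimates S1 and ST on a toy function using Jansen's Monte Carlo estimators. It is only an illustration of the definitions above, not the pipeline's implementation; it assumes independent inputs uniform on [0, 1], and the function `sobol_indices` is hypothetical.

```python
import random


def sobol_indices(f, dim, n=20000, seed=0):
    """Estimate first-order (S1) and total-order (ST) indices with Jansen's estimators.

    Inputs are assumed independent and uniform on [0, 1].
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fa = [f(x) for x in a]
    fb = [f(x) for x in b]

    pooled = fa + fb
    mean = sum(pooled) / len(pooled)
    variance = sum((y - mean) ** 2 for y in pooled) / len(pooled)

    s1, st = [], []
    for i in range(dim):
        # AB_i: rows of A with column i taken from B.
        fab = [f(xa[:i] + [xb[i]] + xa[i + 1:]) for xa, xb in zip(a, b)]
        s1.append(1.0 - sum((yb - y) ** 2 for yb, y in zip(fb, fab)) / (2 * n * variance))
        st.append(sum((ya - y) ** 2 for ya, y in zip(fa, fab)) / (2 * n * variance))
    return s1, st


# Toy response: two interacting factors (x0 * x1) and one inert factor (x2).
s1, st = sobol_indices(lambda x: x[0] * x[1], dim=3)
for name, first, total in zip(["x0", "x1", "x2"], s1, st):
    print(f"{name}: S1={first:.2f} ST={total:.2f} interaction={total - first:.2f}")
```

For this toy function the analytic values are S1 = 3/7 and ST = 4/7 for each active factor and 0 for the inert one, so the ST values sum to 8/7. The excess over 1 is the interaction term counted once for each factor involved.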

Interpretation Guidelines:

  • ST > 0.3: High impact factor (prioritize exploration)
  • ST 0.1-0.3: Moderate impact (standard exploration)
  • ST < 0.1: Low impact (consider contracting range)

Example:

| Rank | Ingredient      | ST    |
|------|-----------------|-------|
| 1    | (NH4)2SO4       | 0.520 |
| 2    | Succinate       | 0.380 |
| 3    | Phosphate       | 0.322 |
| 4    | Methanol        | 0.268 |
| 5    | PQQ             | 0.249 |
| 6    | CoCl2           | 0.224 |

Interpretation:

  • (NH4)2SO4 is most important (ST=0.52) → Focus v14 optimization here
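  • Sum of top 3 (1.22) > 1.0 → Interaction effects are present: each interaction is counted in the ST of every factor involved, so ST values can sum to more than 1 only when factors interact
  • CoCl2 lowest (ST=0.22) → Least critical of the six, though at moderate impact (0.1-0.3) it does not meet the CONTRACT trigger (ST < 0.05)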
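
The interpretation bands can be applied to the example table mechanically. The snippet below is a small sketch using the ST values shown above; the helper name `impact_band` is hypothetical, and placing ST values of exactly 0.1 or 0.3 in the moderate band is an assumption.

```python
# ST values from the example table above.
sobol_st = {
    "(NH4)2SO4": 0.520, "Succinate": 0.380, "Phosphate": 0.322,
    "Methanol": 0.268, "PQQ": 0.249, "CoCl2": 0.224,
}


def impact_band(st):
    """Map a total-order index to the guide's interpretation bands."""
    if st > 0.3:
        return "high"
    if st >= 0.1:
        return "moderate"
    return "low"


bands = {name: impact_band(st) for name, st in sobol_st.items()}
top3_sum = sum(sorted(sobol_st.values(), reverse=True)[:3])
print(bands)
print(round(top3_sum, 3))
```

With these values the top three ingredients fall in the high band and the remaining three in the moderate band.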

Visualization: sensitivity_tornado_{mode}.pdf

  • Horizontal bars ranked by ST
  • Color-coded by magnitude
  • Focus on top 3-5 ingredients

Section 5: Boundary Effects

Flags:

  • UPPER: Optima at upper boundary → Expand upper bound in v14
  • LOWER: Optima at lower boundary → Expand lower bound in v14
  • BOTH: Multiple optima at boundaries → Expand both bounds
  • NONE: Interior optimum → Range well-calibrated
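
A minimal sketch of how such a flag could be derived for one ingredient follows. It assumes the margin is 5% of the range width and that a flag requires more than half of the optima inside that margin; BOTH is left out because its exact rule is not described in this guide. The function name `boundary_flag` and the sample optima are hypothetical.

```python
def boundary_flag(optima, range_min, range_max, margin=0.05, majority=0.5):
    """Flag an ingredient whose optima cluster near a design-space bound.

    Assumptions: margin is a fraction of the range width; a flag needs more
    than `majority` of the optima inside that margin. BOTH is not handled.
    """
    band = margin * (range_max - range_min)
    near_upper = sum(1 for v in optima if v >= range_max - band) / len(optima)
    near_lower = sum(1 for v in optima if v <= range_min + band) / len(optima)
    if near_upper > majority:
        return "UPPER"
    if near_lower > majority:
        return "LOWER"
    return "NONE"


# Hypothetical optima for an ingredient with range 1.73-28.25 mM.
print(boundary_flag([27.5, 28.0, 28.25, 15.0], 1.73, 28.25))
```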

Example:

| Ingredient  | Boundary | Recommendation     |
|-------------|----------|---------------------|
| (NH4)2SO4   | UPPER    | Expand upper bound |
| Succinate   | NONE     | Maintain range     |

Interpretation:

  • (NH4)2SO4 at UPPER → Increase max from 100 mM to 150 mM in v14
  • Succinate NONE → Current range (10-100 mM) is good

Section 6: Uncertainty Quantification

Metrics:

  • Mean σ: Average prediction uncertainty
  • Max σ: Highest uncertainty region
  • High-uncertainty fraction: % of design space with σ > 50% of max

Example:

OD600:
- Mean σ: 0.179
- Max σ: 0.193
- High-uncertainty regions: 99.8%

Interpretation:

  • Mean σ=0.179 relative to OD600 range (0-1.2) → ~15% uncertainty
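  • 99.8% high-uncertainty → σ is nearly uniform (mean 0.179 vs. max 0.193), so almost the entire design space exceeds the 50%-of-max threshold; data coverage is sparse and the model needs more data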
  • Prioritize BO suggestions to reduce uncertainty
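
The three metrics are simple summaries of the σ values on the uncertainty grid. The sketch below computes them for a short hypothetical list of σ values; the function name `uncertainty_summary` and the `response_range` argument (used for the relative figure, as in 0.179 / 1.2 ≈ 15%) are illustrative assumptions.

```python
def uncertainty_summary(sigmas, response_range):
    """Summarize GP standard deviations sampled across the design space."""
    mean_sigma = sum(sigmas) / len(sigmas)
    max_sigma = max(sigmas)
    # Fraction of points whose sigma exceeds 50% of the maximum sigma.
    high_fraction = sum(1 for s in sigmas if s > 0.5 * max_sigma) / len(sigmas)
    return {
        "mean_sigma": mean_sigma,
        "max_sigma": max_sigma,
        "high_uncertainty_fraction": high_fraction,
        "relative_uncertainty": mean_sigma / response_range,
    }


# Hypothetical sigma samples; the report evaluates 5000 grid points.
summary = uncertainty_summary([0.05, 0.17, 0.18, 0.19, 0.193], response_range=1.2)
print(summary)
```

Because the threshold is relative to the maximum σ, the high-uncertainty fraction approaches 100% whenever σ is nearly uniform, regardless of its absolute size.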

Visualization: uncertainty_heatmap_{mode}.pdf

  • Hexbin color intensity = GP σ
  • White crosses = Experimental conditions
  • Dark regions (high σ) = Unexplored, run BO suggestions there

Understanding v14 Recommendations

YAML Structure

metadata:
  version: v14_experimental
  generation_date: '2026-02-15'
  organism: Methylorubrum extorquens AM1

optimization_recommendations:
  - ingredient: (NH4)2SO4_mM_first
    current_range_min: 1.73
    current_range_max: 28.25
    recommended_range_min: 1.73
    recommended_range_max: 42.38    # EXPAND_UPPER: 28.25 * 1.5
    adjustment_type: EXPAND_UPPER
    sobol_index: 0.525
    boundary_effect: UPPER
    priority: HIGH
    rationale: "Optima clustered at upper boundary (UPPER). High impact factor (ST=0.525)."

Adjustment Types

  1. EXPAND_UPPER

    • Trigger: >50% of Pareto points at upper boundary
    • Action: new_max = current_max * 1.5
    • Example: (NH4)2SO4 range 1.73-28.25 → 1.73-42.38 mM
  2. EXPAND_LOWER

    • Trigger: >50% of Pareto points at lower boundary
    • Action: new_min = max(0, current_min * 0.5)
    • Example: PQQ range 0.002-0.005 → 0.001-0.005 µM
  3. CONTRACT

    • Trigger: Sobol ST < 0.05 (low impact)
    • Action: Narrow to ±20% of median
    • Example: CoCl2 range 0.001-0.067 µM, median 0.030 → 0.024-0.036 µM
  4. SHIFT

    • Trigger: Pareto optimal >30% away from current center
    • Action: Center on Pareto median, keep width
    • Example: Succinate 3-63 mM, optimal at 50 → shift to 20-80 mM
  5. MAINTAIN

    • Trigger: None of the above
    • Action: Keep current range unchanged
    • Example: Well-calibrated range with interior optimum

Priority Levels

Scoring Formula:

score = 0.5 * Sobol_ST + 0.3 * (1 - EI_rank/20) + 0.2 * boundary_flag

Thresholds:

  • VERY HIGH (>0.6): High Sobol + top EI + boundary effect
  • HIGH (>0.4): High Sobol or top EI
  • MEDIUM (>0.2): Moderate Sobol or mid-range EI
  • LOW (≤0.2): Low Sobol, low EI, no boundary

Usage:

  • Focus v14 optimization on VERY HIGH and HIGH priority factors
  • Maintain or contract LOW priority factors

Provenance Tracking

Every recommendation includes complete provenance:

provenance:
  decisions:
    - timestamp: '2026-02-15T00:03:21'
      phase: v14_recommendation
      decision_type: factor_range_adjustment
      decision: Generated 6 optimization-driven recommendations
      rationale: Applied optimization-driven range adjustments...

  skills_used:
    - skill_name: generate_optimization_report
      invocation_count: 1
      purpose: Compute Pareto frontier, Sobol sensitivity...

  data_sources:
    - source_type: experimental
      path: outputs/.../optimization
      description: Optimization analysis outputs (absolute mode)

Command Reference

Full Pipeline

# Recommended: Run complete analysis
just analyze-experimental-full data/experimental/v10_results/

# Dual mode (absolute + relative)
uv run python scripts/run_dual_analysis.py data/experimental/v10_results/

# Absolute mode only (faster for testing)
uv run python scripts/run_dual_analysis.py data/experimental/v10_results/ --skip-relative

Standalone Commands

# Optimization report only (requires existing analysis)
just generate-optimization-report outputs/v10_analysis_absolute/ absolute

# v14 recommendations only (requires optimization report)
just recommend-v14-design outputs/v10_analysis_absolute/optimization/ outputs/recommendations/ absolute

# Validate v14 recommendations
just validate-linkml outputs/recommendations/recommendations_v14.yaml

Feature Flags

# Disable optimization (Steps 5-6)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-optimization-report

# Disable v14 recommendations (Step 6 only)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-v14-recommendations

# Disable response surfaces (skips Steps 4-6)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-response-surfaces

Troubleshooting

"Response surface analysis skipped"

Cause: Missing measurements (e.g., Nd_uM not in data)

Solution:

# Run with single measurement
uv run python scripts/generate_optimization_report.py ANALYSIS_DIR --mode absolute --measurements OD600

"Pareto frontier skipped (need 2+ measurements)"

Cause: Only 1 measurement available

Effect: Pareto analysis disabled, BO and Sobol still run

Solution: This is normal for single-objective data. Bayesian optimization and Sobol analysis still provide actionable insights.

"Optimization report failed"

Cause: GP model fitting failed (insufficient data, <5 points)

Check:

# Verify replicate statistics file exists and has at least 5 data rows (wc -l also counts the header line)
wc -l outputs/v10_analysis_absolute/*replicate_statistics*.tsv

Solution:

  • Need at least 5 experimental conditions for GP fitting
  • Check data quality (no all-NaN measurements)

"v14 recommendations failed"

Cause: Optimization report CSVs missing

Check:

ls outputs/v10_analysis_absolute/optimization/*.csv

Solution:

  • Run optimization report first
  • Check for errors in Step 5 output

LinkML Validation Errors

Common Issues:

  1. Missing required fields

    ERROR: 'purpose' is a required property
    

    Solution: Update script to include all required schema fields

  2. Pattern mismatch

    ERROR: 'v14_optimization' does not match '^v[0-9]+(_agentic|_experimental)?$'
    

    Solution: Use valid version patterns (e.g., v14_experimental)

  3. Type mismatch

    ERROR: {...} is not of type 'string'
    

    Solution: Convert dicts to JSON strings for parameters field
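
Issues 2 and 3 can be caught before validation with a quick pre-check. The sketch below uses only the Python standard library; the regular expression is copied from the error message above, while the helper names and the example parameter keys are hypothetical.

```python
import json
import re

# Pattern copied from the validation error message above.
VERSION_PATTERN = re.compile(r"^v[0-9]+(_agentic|_experimental)?$")


def is_valid_version(version):
    """Check a version string against the schema pattern."""
    return VERSION_PATTERN.fullmatch(version) is not None


def stringify_parameters(parameters):
    """Serialize a dict so it fits a schema slot typed as string."""
    if isinstance(parameters, str):
        return parameters
    return json.dumps(parameters, sort_keys=True)


print(is_valid_version("v14_optimization"), is_valid_version("v14_experimental"))
print(stringify_parameters({"mode": "absolute", "top_n": 20}))
```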


Advanced Usage

Custom Measurements

# Multi-measurement optimization (OD600 + Nd_uM + custom)
uv run python scripts/generate_optimization_report.py \
    outputs/analysis_absolute/ \
    --mode absolute \
    --measurements OD600 Nd_uM GFP_fluorescence

Adjust Constraints

Edit scripts/recommend_v14_design.py:

# Relaxed constraints (current default)
osmolarity_limit = 900  # mOsm
cn_ratio_range = (1.5, 120)

# Strict constraints (uncomment to replace the defaults above)
# osmolarity_limit = 800  # mOsm
# cn_ratio_range = (2, 100)

Priority Score Tuning

Edit scripts/recommend_v14_design.py:

def _compute_priority(self, sobol_index, ei_rank, boundary_effect):
    # Current weights: 0.5 Sobol, 0.3 EI rank, 0.2 boundary flag
    sobol_score = 0.5 * sobol_index
    ei_score = 0.3 * (1 - ei_rank / 20)
    boundary_score = 0.2 if boundary_effect in ['UPPER', 'LOWER', 'BOTH'] else 0.0

    # Alternative: emphasize Sobol more (swap in for the three lines above)
    # sobol_score = 0.7 * sobol_index  # Increased weight
    # ei_score = 0.2 * (1 - ei_rank / 20)
    # boundary_score = 0.1 if boundary_effect in ['UPPER', 'LOWER', 'BOTH'] else 0.0

Batch Processing

# Process multiple datasets
for dataset in data/experimental/v*/; do
    just analyze-experimental-full "$dataset"
done

# Compare v14 recommendations
diff outputs/recommendations_v10/recommendations_v14.yaml \
     outputs/recommendations_v12/recommendations_v14.yaml

Workflow Summary

Input: Experimental plate data (v10/v12 designs)

Process:

  1. Run full pipeline: just analyze-experimental-full DATA_DIR
  2. Review optimization report: cat outputs/.../optimization/OPTIMIZATION_REPORT_absolute.md
  3. Inspect visualizations: open outputs/.../optimization/*.pdf
  4. Check v14 recommendations: cat outputs/recommendations/recommendations_v14.yaml
  5. Generate v14 design using recommended ranges
  6. Run v14 experiments
  7. Iterate (run pipeline on v14 data for v15 recommendations)

Output: LinkML-validated v14 recommendations with full provenance

Timeline:

  • Pipeline runtime: ~2 minutes (69 conditions, 6 ingredients)
  • Manual review: ~15 minutes
  • v14 design generation: Variable (depends on DOE method)
  • Total: <1 hour from data to a ready v14 design


Generated: 2026-02-15
Version: 1.0
Optimization System: v14
Contact: MicroGrowAgents Development Team