Optimization-Driven Experimental Design Guide

Complete guide to using the MicroGrowAgents optimization system for data-driven v14 design generation.

Overview
Quick Start
Pipeline Architecture
Interpreting Optimization Reports
Understanding v14 Recommendations
Command Reference
Troubleshooting
Advanced Usage

Overview

The optimization system converts experimental data into actionable recommendations for the next design iteration. It combines:

Gaussian Process (GP) models - Surrogate models for response surface prediction
Pareto frontier analysis - Multi-objective optimization (OD600 + Nd_uM)
Sobol sensitivity analysis - Variance decomposition to identify important factors
Bayesian optimization - Expected Improvement acquisition for next experiments
Boundary effect detection - Identify factors saturated at design space limits
LinkML validation - Structured, validated recommendations for v14 design

Scientific Goal: Identify which ingredient ranges to expand, contract, or shift for the next experimental design iteration based on quantitative evidence from v10/v12 data.

Quick Start

1. Run Full Pipeline (Recommended)

# Start with experimental data directory
just analyze-experimental-full data/experimental/plate_designs_v10_results/

# This runs ALL steps:
# Step 1: Analysis (replicate statistics)
# Step 2: Clustering (hierarchical clustering)
# Step 3: Visualization (growth curves, heatmaps, PCA)
# Step 4: Response surfaces (GP model fitting)
# Step 5: Optimization report (Pareto, Sobol, BO, boundaries)
# Step 6: v14 recommendations (LinkML-validated YAML)

2. Check Outputs

# Optimization report
cat outputs/plate_designs_v10_results_experimental_analysis_absolute/optimization/OPTIMIZATION_REPORT_absolute.md

# v14 recommendations
cat outputs/recommendations/recommendations_v14.yaml

3. Review Visualizations

# Open key plots
open outputs/.../optimization/sensitivity_tornado_absolute.pdf
open outputs/.../optimization/acquisition_surface_absolute.pdf
open outputs/.../optimization/uncertainty_heatmap_absolute.pdf

Pipeline Architecture

Data Flow:
  Raw Experimental Data (plate1.tsv, plate2.tsv, plate3.tsv)
        ↓
  [Step 1-3] Analysis + Clustering + Visualization
        ↓
  replicate_statistics_{mode}.tsv
        ↓
  [Step 4] Response Surface Analysis (GP Fitting)
        ↓
  GP models (pickled), surface predictions
        ↓
  [Step 5] Optimization Report Generation
        ↓
  Pareto frontier, Sobol indices, BO suggestions, boundary effects
        ↓
  [Step 6] v14 Recommendations (absolute mode only)
        ↓
  recommendations_v14.yaml (LinkML-validated)

Step 5: Optimization Report

Inputs:

replicate_statistics_{mode}.tsv - Experimental data with ingredient concentrations and measurements
GP models (fitted or refitted from experimental data)

Outputs:

optimization/
├── OPTIMIZATION_REPORT_{mode}.md              # 6-section comprehensive report
├── pareto_experimental_{mode}.csv             # Experimental Pareto points (if 2+ measurements)
├── pareto_predicted_{mode}.csv                # GP surface Pareto points (novel optima)
├── next_experiments_bayesian_{mode}.csv       # Top 20 by Expected Improvement
├── sensitivity_sobol_{mode}.csv               # Sobol indices (ST, S1)
├── boundary_effects_{mode}.csv                # Boundary flags per ingredient
├── uncertainty_grid_{mode}.csv                # GP σ predictions (5000 points)
├── sensitivity_tornado_{mode}.pdf/png         # Ingredient ranking visualization
├── acquisition_surface_{mode}.pdf/png         # EI surface with top-5 markers
└── uncertainty_heatmap_{mode}.pdf/png         # GP uncertainty hexbin

Analyses Performed:

Pareto Frontier (2+ measurements only):
- Computes non-dominated points on GP surface (10,000-point Sobol grid)
- Identifies novel optima not in experimental data
- Compares experimental vs predicted Pareto frontiers
Bayesian Optimization:
- Computes Expected Improvement (EI) acquisition function
- Suggests top 20 next experiments by EI score
- Balances exploitation (high predicted response) vs exploration (high uncertainty)
Sobol Sensitivity:
- Variance-based global sensitivity analysis
- Computes total-order (ST) and first-order (S1) indices
- Ranks ingredients by importance (ST captures main + interaction effects)
Boundary Effects:
- Detects if optima clustered at design space boundaries (±5% margin)
- Flags ingredients needing range expansion (UPPER, LOWER, BOTH)
Uncertainty Quantification:
- Evaluates GP prediction std (σ) across design space
- Identifies high-uncertainty regions for targeted exploration

Step 6: v14 Recommendations

Inputs:

All optimization report outputs (CSV files)

Outputs:

recommendations/
├── recommendations_v14.yaml               # LinkML-validated recommendations
├── v14_factor_recommendations.csv         # Tabular summary
└── v14_generation_provenance.json         # Decision log

Recommendation Logic:

for each ingredient:
    if boundary_effect == 'UPPER':
        adjustment_type = 'EXPAND_UPPER'
        new_max = current_max * 1.5  # Expand by 50%

    elif boundary_effect == 'LOWER':
        adjustment_type = 'EXPAND_LOWER'
        new_min = max(0, current_min * 0.5)  # Halve the lower bound (floored at 0)

    elif sobol_index < 0.05:
        adjustment_type = 'CONTRACT'
        # Narrow to ±20% of median

    elif pareto_optimal is >30% away from current center:
        adjustment_type = 'SHIFT'
        # Center on Pareto median, keep width

    else:
        adjustment_type = 'MAINTAIN'
        # No change needed

    priority_score = 0.5*Sobol_ST + 0.3*(1-EI_rank/20) + 0.2*boundary_flag

    if priority_score > 0.6:
        priority = 'VERY HIGH'
    elif priority_score > 0.4:
        priority = 'HIGH'
    elif priority_score > 0.2:
        priority = 'MEDIUM'
    else:
        priority = 'LOW'
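
The adjustment cascade above can be written as a small runnable function. This is a minimal sketch, not the code in scripts/recommend_v14_design.py: the name `choose_adjustment` and its parameters are hypothetical, and "far from center" is assumed to mean more than 30% of the center value (an interpretation chosen because it reproduces the Succinate SHIFT example later in this guide). The pseudocode does not say how a BOTH flag is handled, so the sketch treats it like NONE.

```python
def choose_adjustment(boundary_effect, sobol_st, range_min, range_max, pareto_median):
    """Return the adjustment type for one ingredient, mirroring the cascade above."""
    if boundary_effect == "UPPER":
        return "EXPAND_UPPER"
    if boundary_effect == "LOWER":
        return "EXPAND_LOWER"
    if sobol_st < 0.05:
        return "CONTRACT"
    center = (range_min + range_max) / 2
    # Assumption: distance is measured relative to the center value.
    if center > 0 and abs(pareto_median - center) / center > 0.30:
        return "SHIFT"
    return "MAINTAIN"


# Illustrative inputs (the Pareto medians are made up for the example)
print(choose_adjustment("UPPER", 0.525, 1.73, 28.25, 27.0))  # EXPAND_UPPER
print(choose_adjustment("NONE", 0.380, 3.0, 63.0, 50.0))     # SHIFT
print(choose_adjustment("NONE", 0.380, 3.0, 63.0, 35.0))     # MAINTAIN
```

Because the checks run in order, a boundary flag always wins over a low Sobol index, which matches the ordering of the pseudocode.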

Interpreting Optimization Reports

Section 1: Executive Summary

What to look for:

Best experimental condition (highest observed response)
Best predicted condition (GP model optimum)
Number of Pareto frontier points (novel optima discovered)
Number of BO suggestions (high-EI candidates)

Example:

**Best Experimental Condition:** OD600 = 1.113
**Best Predicted Condition:** OD600 = 0.912
**Pareto Frontier Points:** 15 novel optima identified
**Next Experiments Suggested:** 20 high-EI conditions

Interpretation:

Experimental best (1.113) is higher than predicted best (0.912) → GP model may be underestimating optimal region
15 novel Pareto points → GP surface suggests unexplored high-performing conditions
20 BO suggestions → Next experiments to improve model accuracy

Section 2: Pareto Frontier Analysis

Key Insights:

Experimental Pareto: Actual non-dominated conditions from v10 data
Predicted Pareto: GP-predicted non-dominated points (10,000-point grid)
Novel optima: Predicted points NOT in experimental data

What to look for:

Are predicted Pareto points far from experimental data? → Exploration opportunity
Do predictions cluster in specific ingredient regions? → Important factor
Are there multiple competing optima? → Trade-offs exist

Visualization: pareto_comparison_{mode}.pdf

Blue circles = Experimental Pareto
Red triangles = Predicted Pareto
Look for red triangles away from blue → novel optima
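
To make "non-dominated" concrete, here is a minimal sketch of a Pareto filter. It assumes both objectives (for example OD600 and Nd_uM) are maximized, which this guide does not state explicitly. The function name `pareto_front` and the sample values are hypothetical, and the pipeline may use a different, faster algorithm on its 10,000-point grid.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    front = []
    for p in points:
        # q dominates p if q is at least as good everywhere and strictly better somewhere
        dominated = any(
            all(qi >= pi for qi, pi in zip(q, p))
            and any(qi > pi for qi, pi in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Made-up (OD600, Nd_uM) pairs for illustration only
conditions = [(1.10, 2.0), (0.90, 5.0), (0.80, 4.0), (1.00, 3.5), (0.50, 1.0)]
print(pareto_front(conditions))
# [(1.1, 2.0), (0.9, 5.0), (1.0, 3.5)]
```

The three surviving points illustrate a trade-off: no single condition is best on both measurements at once.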

Section 3: Bayesian Optimization Results

Top 20 Next Experiments:

Ranked by Expected Improvement (EI) score
Higher EI = Better balance of predicted performance + uncertainty

What to look for:

High EI (>0.01): Strong candidates, likely to improve on the current best response
Clustered suggestions: Specific ingredient region worth exploring
Diverse suggestions: Model uncertain across design space

Example:

1. EI=0.0108, OD600_pred=0.912, σ=0.145
2. EI=0.0040, OD600_pred=0.874, σ=0.151

Interpretation:

Top suggestion has high EI (0.0108) → Run this condition first in v14
High σ (0.145) → Model uncertain in this region, needs data

Visualization: acquisition_surface_{mode}.pdf

Stars mark top-5 EI maxima
Color intensity = EI value
Look for star clusters → promising regions
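
The way EI trades predicted response against uncertainty can be seen in the standard closed-form expression for a Gaussian prediction. The sketch below uses only the Python standard library. The function name, the `xi` exploration parameter, and the choice of incumbent (`best_observed`) are assumptions, so the values it returns need not match the EI column in the report.

```python
import math


def expected_improvement(mu, sigma, best_observed, xi=0.0):
    """Closed-form EI for maximization, given GP mean `mu` and std `sigma`."""
    improvement = mu - best_observed - xi
    if sigma <= 0:
        # No uncertainty: improvement is either certain or impossible
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return improvement * cdf + sigma * pdf


# Same predicted mean, different uncertainty: the more uncertain point scores higher
low_sigma = expected_improvement(mu=0.90, sigma=0.05, best_observed=1.00)
high_sigma = expected_improvement(mu=0.90, sigma=0.15, best_observed=1.00)
print(low_sigma < high_sigma)  # True
```

The first term rewards a high predicted mean (exploitation) and the second rewards a large σ (exploration), which is why a candidate predicted below the current best can still rank highly.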

Section 4: Sensitivity Analysis (Sobol Indices)

Sobol Indices Explained:

S1 (First-order): Main effect of ingredient alone
ST (Total-order): Main effect + all interactions
ST - S1: Interaction effects

Interpretation Guidelines:

ST > 0.3: High impact factor (prioritize exploration)
ST 0.1-0.3: Moderate impact (standard exploration)
ST < 0.1: Low impact (consider contracting range)

Example:

| Rank | Ingredient      | ST    |
|------|-----------------|-------|
| 1    | (NH4)2SO4       | 0.520 |
| 2    | Succinate       | 0.380 |
| 3    | Phosphate       | 0.322 |
| 4    | Methanol        | 0.268 |
| 5    | PQQ             | 0.249 |
| 6    | CoCl2           | 0.224 |

Interpretation:

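(NH4)2SO4 is most important (ST=0.52) → Focus v14 optimization here
Sum of top 3 ST values (1.22) > 1.0 → Total-order indices sum to more than 1 only when factors interact, so interaction effects are substantial
CoCl2 lowest (ST=0.22) → Least critical of the six, but still moderate impact (ST 0.1-0.3); it does not meet the CONTRACT trigger (ST < 0.05)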

Visualization: sensitivity_tornado_{mode}.pdf

Horizontal bars ranked by ST
Color-coded by magnitude
Focus on top 3-5 ingredients
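
The interpretation guidelines can be applied mechanically to the example table. The snippet below is a small sketch: the helper name `impact_level` is hypothetical, and the boundary cases (exactly 0.1 or 0.3) are assigned to the moderate band as an assumption, since the guidelines do not specify them.

```python
# ST values copied from the example table above
sobol_st = {
    "(NH4)2SO4": 0.520,
    "Succinate": 0.380,
    "Phosphate": 0.322,
    "Methanol": 0.268,
    "PQQ": 0.249,
    "CoCl2": 0.224,
}


def impact_level(st):
    """Map a total-order Sobol index to the guideline bands."""
    if st > 0.3:
        return "high"
    if st >= 0.1:
        return "moderate"
    return "low"


for name, st in sorted(sobol_st.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} ST={st:.3f} {impact_level(st)}")

# A total well above 1 indicates that interactions are being counted in several factors' ST
print(f"Sum of ST = {sum(sobol_st.values()):.3f}")
```

In this example three ingredients fall in the high band and three in the moderate band; none is low enough to suggest contracting its range.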

Section 5: Boundary Effects

Flags:

UPPER: Optima at upper boundary → Expand upper bound in v14
LOWER: Optima at lower boundary → Expand lower bound in v14
BOTH: Multiple optima at boundaries → Expand both bounds
NONE: Interior optimum → Range well-calibrated

Example:

| Ingredient  | Boundary | Recommendation     |
|-------------|----------|---------------------|
| (NH4)2SO4   | UPPER    | Expand upper bound |
| Succinate   | NONE     | Maintain range     |

Interpretation:

(NH4)2SO4 at UPPER → Increase max by 50% in v14 (e.g., from 100 mM to 150 mM)
Succinate NONE → Current range (10-100 mM) is good
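
A boundary flag can be derived from where the optima sit within an ingredient's range. The following is a sketch under stated assumptions, not the pipeline's implementation: the margin is taken as 5% of the range width, UPPER or LOWER requires more than 50% of optima within that margin (the trigger quoted under Adjustment Types), and BOTH is assigned when neither side alone exceeds 50% but the two sides together do. The function name `boundary_flag` and the sample optima are hypothetical.

```python
def boundary_flag(optima, range_min, range_max, margin=0.05, threshold=0.5):
    """Classify optima for one ingredient as UPPER, LOWER, BOTH, or NONE."""
    if not optima:
        return "NONE"
    width = range_max - range_min
    lower_share = sum(v <= range_min + margin * width for v in optima) / len(optima)
    upper_share = sum(v >= range_max - margin * width for v in optima) / len(optima)
    if upper_share > threshold:
        return "UPPER"
    if lower_share > threshold:
        return "LOWER"
    # Assumption: optima split across both edges count as BOTH
    if lower_share > 0 and upper_share > 0 and lower_share + upper_share > threshold:
        return "BOTH"
    return "NONE"


# Made-up optima for illustration
print(boundary_flag([27.5, 28.0, 28.25, 15.0], 1.73, 28.25))   # UPPER
print(boundary_flag([40.0, 55.0, 60.0, 35.0], 10.0, 100.0))    # NONE
print(boundary_flag([10.0, 100.0, 50.0], 10.0, 100.0))         # BOTH
```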

Section 6: Uncertainty Quantification

Metrics:

Mean σ: Average prediction uncertainty
Max σ: Highest uncertainty region
High-uncertainty fraction: % of design space with σ > 50% of max

Example:

OD600:
- Mean σ: 0.179
- Max σ: 0.193
- High-uncertainty regions: 99.8%

Interpretation:

Mean σ=0.179 relative to OD600 range (0-1.2) → ~15% uncertainty
99.8% high-uncertainty → Sparse data coverage, model needs more data
Prioritize BO suggestions to reduce uncertainty

Visualization: uncertainty_heatmap_{mode}.pdf

Hexbin color intensity = GP σ
White crosses = Experimental conditions
Dark regions (high σ) = Unexplored, run BO suggestions there
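
The three metrics can be reproduced from any list of GP σ values. This is a minimal sketch using the definitions given under Metrics; the function name `uncertainty_summary` and the sample values are hypothetical, and the column layout of uncertainty_grid_{mode}.csv is not assumed here.

```python
def uncertainty_summary(sigmas):
    """Return (mean sigma, max sigma, fraction of points with sigma > 50% of max)."""
    mean_sigma = sum(sigmas) / len(sigmas)
    max_sigma = max(sigmas)
    high_fraction = sum(s > 0.5 * max_sigma for s in sigmas) / len(sigmas)
    return mean_sigma, max_sigma, high_fraction


# Made-up sigma values for illustration
mean_sigma, max_sigma, high_fraction = uncertainty_summary([0.05, 0.12, 0.18, 0.19, 0.20])
print(f"Mean sigma: {mean_sigma:.3f}")
print(f"Max sigma: {max_sigma:.3f}")
print(f"High-uncertainty fraction: {high_fraction:.1%}")
```

Because the threshold is relative to the maximum, a surface whose mean σ is close to its max σ (0.179 vs 0.193 in the example above) will report nearly all of the design space as high-uncertainty. The fraction therefore says more about how flat the σ surface is than about its absolute level.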

Understanding v14 Recommendations

YAML Structure

metadata:
  version: v14_experimental
  generation_date: '2026-02-15'
  organism: Methylorubrum extorquens AM1

optimization_recommendations:
  - ingredient: (NH4)2SO4_mM_first
    current_range_min: 1.73
    current_range_max: 28.25
    recommended_range_min: 1.73
    recommended_range_max: 42.38    # EXPAND_UPPER: 28.25 * 1.5
    adjustment_type: EXPAND_UPPER
    sobol_index: 0.525
    boundary_effect: UPPER
    priority: HIGH
    rationale: "Optima clustered at upper boundary (UPPER). High impact factor (ST=0.525)."

Adjustment Types

EXPAND_UPPER
- Trigger: >50% of Pareto points at upper boundary
- Action: new_max = current_max * 1.5
- Example: (NH4)2SO4 range 1.73-28.25 → 1.73-42.38 mM
EXPAND_LOWER
- Trigger: >50% of Pareto points at lower boundary
- Action: new_min = max(0, current_min * 0.5)
- Example: PQQ range 0.002-0.005 → 0.001-0.005 µM
CONTRACT
- Trigger: Sobol ST < 0.05 (low impact)
- Action: Narrow to ±20% of median
- Example: CoCl2 range 0.001-0.067 → 0.015-0.045 µM
SHIFT
- Trigger: Pareto optimal >30% away from current center
- Action: Center on Pareto median, keep width
- Example: Succinate 3-63 mM, optimal at 50 → shift to 20-80 mM
MAINTAIN
- Trigger: None of the above
- Action: Keep current range unchanged
- Example: Well-calibrated range with interior optimum
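
The actions above can be checked against the worked examples with a short function. This is a sketch only: `apply_adjustment` and its parameters are hypothetical names, and for CONTRACT the "median" is assumed to be supplied by the caller because the guide does not say which median is meant.

```python
def apply_adjustment(adjustment_type, range_min, range_max, median=None):
    """Return (new_min, new_max) for one ingredient."""
    if adjustment_type == "EXPAND_UPPER":
        return range_min, range_max * 1.5
    if adjustment_type == "EXPAND_LOWER":
        return max(0.0, range_min * 0.5), range_max
    if adjustment_type == "CONTRACT":
        # Assumption: narrow to +/-20% around the supplied median
        return median * 0.8, median * 1.2
    if adjustment_type == "SHIFT":
        half_width = (range_max - range_min) / 2
        return median - half_width, median + half_width
    return range_min, range_max  # MAINTAIN


print(apply_adjustment("EXPAND_UPPER", 1.73, 28.25))      # (1.73, 42.375)
print(apply_adjustment("EXPAND_LOWER", 0.002, 0.005))     # (0.001, 0.005)
print(apply_adjustment("SHIFT", 3.0, 63.0, median=50.0))  # (20.0, 80.0)
```

The first result rounds to the 42.38 mM shown in the YAML example, and the SHIFT result matches the Succinate example (width of 60 mM preserved, centered on 50 mM).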

Priority Levels

Scoring Formula:

score = 0.5 * Sobol_ST + 0.3 * (1 - EI_rank/20) + 0.2 * boundary_flag

Thresholds:

VERY HIGH (>0.6): High Sobol + top EI + boundary effect
HIGH (>0.4): High Sobol or top EI
MEDIUM (>0.2): Moderate Sobol or mid-range EI
LOW (≤0.2): Low Sobol, low EI, no boundary

Usage:

Focus v14 optimization on VERY HIGH and HIGH priority factors
Maintain or contract LOW priority factors
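
A worked example shows how the formula and thresholds combine. The sketch below assumes `boundary_flag` is 1 for UPPER, LOWER, or BOTH and 0 otherwise, and that `ei_rank` is a rank between 1 and 20. How an EI rank is assigned to an individual ingredient is not described in this guide, so the ranks used here are hypothetical, as are the function names.

```python
def priority_score(sobol_st, ei_rank, boundary_effect, n_suggestions=20):
    """score = 0.5 * Sobol_ST + 0.3 * (1 - EI_rank/20) + 0.2 * boundary_flag"""
    boundary_flag = 1.0 if boundary_effect in ("UPPER", "LOWER", "BOTH") else 0.0
    return 0.5 * sobol_st + 0.3 * (1 - ei_rank / n_suggestions) + 0.2 * boundary_flag


def priority_label(score):
    """Map a score to the priority thresholds listed above."""
    if score > 0.6:
        return "VERY HIGH"
    if score > 0.4:
        return "HIGH"
    if score > 0.2:
        return "MEDIUM"
    return "LOW"


# ST=0.525 with an UPPER boundary flag and a hypothetical EI rank of 20:
# 0.5*0.525 + 0.3*0 + 0.2*1 = 0.4625
score = priority_score(0.525, ei_rank=20, boundary_effect="UPPER")
print(f"{score:.4f} {priority_label(score)}")  # 0.4625 HIGH
```

With the same Sobol index and boundary flag, an EI rank of 1 would raise the score to 0.7475 and the label to VERY HIGH, so the EI term alone can move a factor across a threshold.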

Provenance Tracking

Every recommendation includes complete provenance:

provenance:
  decisions:
    - timestamp: '2026-02-15T00:03:21'
      phase: v14_recommendation
      decision_type: factor_range_adjustment
      decision: Generated 6 optimization-driven recommendations
      rationale: Applied optimization-driven range adjustments...

  skills_used:
    - skill_name: generate_optimization_report
      invocation_count: 1
      purpose: Compute Pareto frontier, Sobol sensitivity...

  data_sources:
    - source_type: experimental
      path: outputs/.../optimization
      description: Optimization analysis outputs (absolute mode)

Command Reference

Full Pipeline

# Recommended: Run complete analysis
just analyze-experimental-full data/experimental/v10_results/

# Dual mode (absolute + relative)
uv run python scripts/run_dual_analysis.py data/experimental/v10_results/

# Absolute mode only (faster for testing)
uv run python scripts/run_dual_analysis.py data/experimental/v10_results/ --skip-relative

Standalone Commands

# Optimization report only (requires existing analysis)
just generate-optimization-report outputs/v10_analysis_absolute/ absolute

# v14 recommendations only (requires optimization report)
just recommend-v14-design outputs/v10_analysis_absolute/optimization/ outputs/recommendations/ absolute

# Validate v14 recommendations
just validate-linkml outputs/recommendations/recommendations_v14.yaml

Feature Flags

# Disable optimization (Steps 5-6)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-optimization-report

# Disable v14 recommendations (Step 6 only)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-v14-recommendations

# Disable response surfaces (skips Steps 4-6)
uv run python scripts/run_dual_analysis.py DATA_DIR --disable-response-surfaces

Troubleshooting

"Response surface analysis skipped"

Cause: Missing measurements (e.g., Nd_uM not in data)

Solution:

# Run with single measurement
uv run python scripts/generate_optimization_report.py ANALYSIS_DIR --mode absolute --measurements OD600

"Pareto frontier skipped (need 2+ measurements)"

Cause: Only 1 measurement available

Effect: Pareto analysis disabled, BO and Sobol still run

Solution: This is normal for single-objective data. Bayesian optimization and Sobol analysis still provide actionable insights.

"Optimization report failed"

Cause: GP model fitting failed (insufficient data, <5 points)

Check:

# Verify replicate statistics file exists and has at least 5 data rows
# (wc -l also counts the header line, so expect 6 or more)
wc -l outputs/v10_analysis_absolute/*replicate_statistics*.tsv

Solution:

Need at least 5 experimental conditions for GP fitting
Check data quality (no all-NaN measurements)

"v14 recommendations failed"

Cause: Optimization report CSVs missing

Check:

ls outputs/v10_analysis_absolute/optimization/*.csv

Solution:

Run optimization report first
Check for errors in Step 5 output

LinkML Validation Errors

Common Issues:

Missing required fields
```
ERROR: 'purpose' is a required property
```
Solution: Update script to include all required schema fields

Pattern mismatch

ERROR: 'v14_optimization' does not match '^v[0-9]+(_agentic|_experimental)?$'

Solution: Use valid version patterns (e.g., v14_experimental)

Type mismatch
```
ERROR: {...} is not of type 'string'
```
Solution: Convert dicts to JSON strings for parameters field

Advanced Usage

Custom Measurements

# Multi-measurement optimization (OD600 + Nd_uM + custom)
uv run python scripts/generate_optimization_report.py \
    outputs/analysis_absolute/ \
    --mode absolute \
    --measurements OD600 Nd_uM GFP_fluorescence

Adjust Constraints

Edit scripts/recommend_v14_design.py:

# Relaxed constraints (current default)
osmolarity_limit = 900  # mOsm
cn_ratio_range = (1.5, 120)

# Strict constraints (alternative: use in place of the defaults above)
# osmolarity_limit = 800  # mOsm
# cn_ratio_range = (2, 100)

Priority Score Tuning

Edit scripts/recommend_v14_design.py:

def _compute_priority(self, sobol_index, ei_rank, boundary_effect):
    # Current weights
    sobol_score = 0.5 * sobol_index
    ei_score = 0.3 * (1 - ei_rank / 20)
    boundary_score = 0.2 if boundary_effect in ['UPPER', 'LOWER', 'BOTH'] else 0.0

    # Alternative: Emphasize Sobol more (swap in for the three lines above)
    # sobol_score = 0.7 * sobol_index  # Increased weight
    # ei_score = 0.2 * (1 - ei_rank / 20)
    # boundary_score = 0.1 if ...

Batch Processing

# Process multiple datasets
for dataset in data/experimental/v*/; do
    just analyze-experimental-full "$dataset"
done

# Compare v14 recommendations
diff outputs/recommendations_v10/recommendations_v14.yaml \
     outputs/recommendations_v12/recommendations_v14.yaml

Workflow Summary

Input: Experimental plate data (v10/v12 designs)

Process:

Run full pipeline: just analyze-experimental-full DATA_DIR
Review optimization report: cat outputs/.../optimization/OPTIMIZATION_REPORT_absolute.md
Inspect visualizations: open outputs/.../optimization/*.pdf
Check v14 recommendations: cat outputs/recommendations/recommendations_v14.yaml
Generate v14 design using recommended ranges
Run v14 experiments
Iterate (run pipeline on v14 data for v15 recommendations)

Output: LinkML-validated v14 recommendations with full provenance

Timeline:

Pipeline runtime: ~2 minutes (69 conditions, 6 ingredients)
Manual review: ~15 minutes
v14 design generation: Variable (depends on DOE method)
Total: <1 hour from data to v14 design ready

References

Generated: 2026-02-15
Version: 1.0
Optimization System: v14
Contact: MicroGrowAgents Development Team
