MicroGrowAgents Project Status

Last Updated: 2026-05-15

Quick Summary

Two parallel workstreams are active:

  1. DBTL campaign for M. extorquens AM1 ΔmxaF media optimisation — rounds 1 and 2 of the v10 MaxPro+OptBlock design executed; round 3 (v16) planned. See §DBTL Campaign Status below.
  2. MP Medium Ingredient Properties dataset with citation tracking — see §Ingredient Properties Dataset below.

DBTL Campaign Status

Rounds executed

| Round | Date | Design | Growth assay | Nd assay | Status |
|---|---|---|---|---|---|
| 1 | Feb–Mar 2026 | v10 MaxPro+OptBlock (69 conditions, 4 plates) | OD600 @ 600 nm, 3 timepoints | Arsenazo III @ 660 nm | analysed |
| 2 | May 2026 | v10 (repeat with minor adjustments) | Biolog PM08, 740/590 nm, 144 timepoints | Arsenazo III @ 660 nm (15 µM Nd dose) | analysed |
| 3 | planned | v16 (proposal pending) | TBD | LanM-fluorescence (proposed) | planning |

Round-2 readouts (in outputs/)

  • Adapter + pipeline: scripts/build_round2_replicate_statistics.py ingests Biolog + arsenazo data into the round-1 analysis schema. Recipes: just analyze-experimental-round2[-nd|-redox].
  • Joint OD600 × Nd Pareto: 8 winners (MPOB_008, _019, _020, _022, _024, _035, _058, _066). See outputs/round2_3way_pareto/.
  • Cross-cluster join: only MPOB_008 is a majority double-winner (top-growth ∩ top-Nd-uptake clusters). See outputs/round2_double_winners/.
  • Round-1 vs round-2 reproducibility: Spearman ρ ≈ 0. This is attributed to measurement-modality drift (600 nm → 740 nm for growth; raw abs660 → calibrated µM for Nd), not biology drift. See outputs/round1_vs_round2/REPRODUCIBILITY_REPORT.md.
  • Bayesian optimisation seeds for v16: 10 BO suggestions. The top predicted OD600 is 0.268 versus a round-2 best of 0.265, which suggests round 2 is already close to the practical optimum of the 6-factor design. Phosphate is the dominant factor (Sobol total-order index ST = 0.65). See outputs/round2_recommendations/v16_bo_seeds.md.
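The round-1 vs round-2 reproducibility check above rests on a Spearman rank correlation across shared conditions. The following is a minimal standard-library sketch of that statistic, not the project's actual implementation; the condition names and endpoint values are hypothetical and chosen only to illustrate a ρ near zero.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-condition endpoint values (not project data).
round1 = {"cond_A": 0.10, "cond_B": 0.20, "cond_C": 0.30, "cond_D": 0.40}
round2 = {"cond_A": 0.25, "cond_B": 0.21, "cond_C": 0.26, "cond_D": 0.22}

shared = sorted(round1.keys() & round2.keys())  # only conditions run in both rounds
rho = spearman_rho([round1[c] for c in shared], [round2[c] for c in shared])
print(f"Spearman rho over {len(shared)} shared conditions: {rho:.2f}")
```

Because the statistic depends only on ranks, it is insensitive to any monotone rescaling within a round (for example a wavelength change that preserves ordering), so a ρ near zero indicates that the condition ordering itself changed between rounds.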

Round-2 falsifiable findings (added 2026-05-15)

  • Precipitation-risk analysis (outputs/round2_precipitation_risk/): Q/Ksp model predicted 7 of 8 Pareto winners as HIGH NdPO₄-precipitation risk. ⚠️ Model REFUTED by empirical abiotic data (see next bullet).
  • Abiotic-correction diagnostic (outputs/round2_abiotic_correction/): An updated round-2 file with paired abs660_abiotic_t{1,2} readings was added on 2026-05-15. The observed abiotic drift does not match the Q/Ksp ranking (r = +0.23, wrong sign). 54 of 62 model-HIGH-risk conditions show a stable abiotic signal. Conversely, MPOB_058 (model-predicted MEDIUM) shows the strongest abiotic chemistry signal. Either precipitation completed before t1 (so the t1→t2 drift cannot capture it) or the model is mis-calibrated.
  • Uncertainty-aware MC Pareto (outputs/round2_mc_pareto/): Of the 8 deterministic Pareto winners, only 2 (MPOB_008, MPOB_058) remain stable (frontier frequency ≥ 0.8) when the condition means are perturbed by their replicate σ in Monte Carlo draws. (This finding is independent of the chemistry interpretation and still holds.)
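The uncertainty-aware Pareto idea in the last bullet can be sketched as follows: redraw each condition's objectives from a normal distribution centred on its mean with its replicate σ, recompute the non-dominated set, and record how often each condition lands on the frontier. This is an illustrative sketch only; the function names, the Gaussian noise model, the number of draws and the example values are all assumptions rather than the project's pipeline. Only the 0.8 frequency threshold comes from the text above.

```python
import random


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Names of non-dominated points; every objective is maximised."""
    return {
        name
        for name, vals in points.items()
        if not any(dominates(other, vals) for o, other in points.items() if o != name)
    }


def mc_front_frequency(means, sigmas, n_draws=2000, seed=0):
    """Fraction of Monte Carlo draws in which each condition is on the frontier."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = dict.fromkeys(means, 0)
    for _ in range(n_draws):
        draw = {
            name: tuple(rng.gauss(m, s) for m, s in zip(means[name], sigmas[name]))
            for name in means
        }
        for name in pareto_front(draw):
            counts[name] += 1
    return {name: count / n_draws for name, count in counts.items()}


# Hypothetical (growth, Nd uptake) means and replicate sigmas, not project data.
means = {"cond_A": (0.26, 5.0), "cond_B": (0.20, 9.0), "cond_C": (0.19, 8.8)}
sigmas = {name: (0.01, 0.3) for name in means}

freq = mc_front_frequency(means, sigmas)
stable = {name for name, f in freq.items() if f >= 0.8}
print("deterministic front:", sorted(pareto_front(means)))
print("MC-stable (freq >= 0.8):", sorted(stable))
```

The same functions accept tuples of any length, so a three-objective frontier needs no code change. A condition that sits on the deterministic frontier by a margin smaller than its replicate noise will show a frequency well below the threshold.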

Best-defended interpretation as of this pass: the abiotic data do not let us declare any condition unambiguously biology-driven or chemistry-driven without cell-pellet ICP-MS. The original "8 Pareto winners × 2 reps" anchor allocation in v16 remains the most defensible plan. The earlier (now-superseded) reallocation that nominated MPOB_058 as a 4-rep biology anchor was based on the then-unrefuted model and has been reverted in outputs/round2_recommendations/v16_design_recommendation.md.

  • Paired biology signal at t2 (outputs/round2_t2_paired_biology/): with the t1+t2 abiotic data we can compute (biotic - abiotic) at t2 directly, which is the cleanest chemistry-vs-biology analysis the round-2 data supports. MPOB_008 is the strongest convergent biology candidate (MC-stable, borderline chemistry-quiet, and a clear t2 biology signal). MPOB_058 has real biology mixed with real chemistry. MPOB_022 and MPOB_019 show a clean biology signal and quiet chemistry but are MC-fragile. The other 4 winners show no biology signal at t2; their t3 depletion happened after our paired-control window and needs t3 ICP-MS to interpret.
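The (biotic - abiotic) comparison described above amounts to a difference of means per condition at a single timepoint. Below is a minimal sketch assuming at least two replicate readings on each side and an unpaired standard error; the function name and the example values are hypothetical, and the real analysis may pair wells differently.

```python
from statistics import mean, stdev


def paired_signal(biotic, abiotic):
    """Return (biotic mean - abiotic mean, standard error of that difference).

    Assumes independent replicate wells with at least two readings per side.
    """
    diff = mean(biotic) - mean(abiotic)
    se = (stdev(biotic) ** 2 / len(biotic) + stdev(abiotic) ** 2 / len(abiotic)) ** 0.5
    return diff, se


# Hypothetical supernatant Nd readings in µM at t2 (not project data).
biotic_t2 = [9.0, 9.4]
abiotic_t2 = [14.8, 15.2]

diff, se = paired_signal(biotic_t2, abiotic_t2)
print(f"biotic - abiotic = {diff:.2f} µM (SE {se:.2f})")
```

With a supernatant readout, a negative difference means more Nd disappeared in the presence of cells than in the cell-free control, which is the portion attributable to biology under this simple model.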

Pre-t2 v16 anchor allocation (superseded by the t2-endpoint allocation in the 4th-pass section below; see v16 doc for the per-condition reps): MPOB_008 = 4 reps (biology anchor), MPOB_058/_022/_019 = 2 reps each (biology candidates needing replication or paired ICP-MS), MPOB_024/_035/_020/_066 = 1 rep each (late-uptake outliers, flag for ICP-MS). Total: 14 anchor wells across plates.

2026-05-15 (4th pass): t2 (6 h) adopted as canonical analysis endpoint

The round-2 analysis stack now defaults to t2 (6 h) instead of t3 (9 h) as the endpoint timepoint. Reasoning: only t2 has a paired abiotic control in the round-2 data, making it the only timepoint where chemistry-vs-biology attribution is empirically possible. All analysis scripts accept --endpoint-timepoint {t1,t2,t3}; default is t2; existing t3-format fallback retained for backward compatibility.
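For orientation, a flag with this behaviour could be declared with argparse roughly as below. This is a sketch under assumptions: the parser description and help text are illustrative, and the actual scripts may define the option differently. Only the flag name, its three choices and the t2 default come from the text above.

```python
import argparse


def build_parser():
    """Build an illustrative parser exposing only the endpoint option."""
    parser = argparse.ArgumentParser(description="Round-2 analysis (illustrative sketch)")
    parser.add_argument(
        "--endpoint-timepoint",
        choices=["t1", "t2", "t3"],
        default="t2",  # t2 is the only timepoint with a paired abiotic control
        help="timepoint used as the analysis endpoint (default: %(default)s)",
    )
    return parser


# With no arguments, the default applies; argparse maps the flag to args.endpoint_timepoint.
args = build_parser().parse_args([])
print(args.endpoint_timepoint)
```

Restricting the option with `choices` makes argparse reject any other value before the analysis starts, so a typo such as `t4` fails fast instead of silently selecting a fallback.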

At t2 the 3-way Pareto frontier collapses from 8 conditions to 3: MPOB_058, MPOB_008, MPOB_019. MC-stable: MPOB_058 only (freq 0.99). Five conditions previously on the t3 frontier (MPOB_022/_066/_020/_035/_024) fall off at t2 — their t3 winning status was driven by depletion that happened between t2 and t3, in a window without paired abiotic data.

Authoritative v16 anchor allocation (replaces the pre-t2 allocations above): MPOB_058 = 4 reps + ICP-MS, MPOB_008 = 3 reps, MPOB_019 = 2 reps, MPOB_022 = 1 rep, MPOB_024/_035/_020/_066 = 1 rep each (precipitation + late-uptake controls, flag for ICP-MS). Total: 14 anchor wells (4 + 3 + 2 + 1 + 4).

Round-3 (v16) planning artifacts

  • Factor-range proposal: outputs/round2_recommendations/v16_design_recommendation.md (inherits v15 ranges, adds Nd³⁺ as a 7th factor, raises the methanol upper bound).
  • BO point candidates: outputs/round2_recommendations/v16_bo_seeds.md (5 new condition recipes recommended as 2-rep anchor wells).
  • Assay alternatives report: outputs/round3_recommendations/nd_assay_alternatives_report.md
    • nd_assay_alternatives_1pager.md — recommends lanmodulin (LanM) fluorescence as primary HT readout + cell-pellet ICP-MS on the anchor subset.

Data sources (round-2)

  • data/experimental/plate_designs_v10_maxprooptblock_long__round2_results/ — Biolog 740 + 590 nm raw + collaborator rollup. SHA256s logged in data/checksums.txt.
  • data/experimental/plate_designs_v10_maxprooptblock_long__round2_results_asezuran/ — arsenazo III calibrated Nd predictions. SHA256s logged.
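The logged SHA256s can be used to confirm that the raw files have not changed since ingestion. The sketch below assumes data/checksums.txt follows the `sha256sum` layout (`<hex digest>  <relative path>` per line); that format is an assumption, as are the function names, so adjust the parser if the file is laid out differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large plate exports need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse sha256sum-style lines into {relative path: lowercase hex digest}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, rel_path = line.split(maxsplit=1)
        entries[rel_path.lstrip("*")] = digest.lower()  # '*' marks binary mode
    return entries


def verify(checksum_file, root="."):
    """Return the relative paths that are missing or whose digest differs."""
    expected = parse_checksums(Path(checksum_file).read_text())
    mismatches = []
    for rel_path, digest in expected.items():
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != digest:
            mismatches.append(rel_path)
    return mismatches
```

Calling `verify("data/checksums.txt")` from the repository root would return an empty list when every logged file is present and unchanged.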

Ingredient Properties Dataset

This project also manages the MP Medium Ingredient Properties dataset with comprehensive citation tracking and validation.

Current Metrics

  • Total Ingredients: 158 rows
  • Total Columns: 68 (47 data + 21 organism context columns)
  • DOI Citations: 158 unique DOIs
  • Citation Coverage: 90.5% (143/158 DOIs with evidence)
    • PDFs: 92 (58.2%)
    • Abstracts: 44 (27.8%)
    • Missing: 15 (9.5%)

Recent Work: DOI Corrections (2026-01-07)

7 invalid DOIs successfully corrected (14 instances in CSV)

  • Improved coverage from 86.1% → 90.5% (+4.4 percentage points)
  • See: notes/DOI_CORRECTIONS_FINAL_UPDATED.md for complete details

Key Files

Main Data

  • CSV: data/raw/mp_medium_ingredient_properties.csv (68 columns)
  • Schema: src/microgrowagents/schema/mp_medium_schema.yaml (LinkML)

DOI Corrections & Validation

  • Final Report: notes/DOI_CORRECTIONS_FINAL_UPDATED.md (most important)
  • Corrections Applied:
    • Batch 1: data/results/doi_corrections_applied.json (4 DOIs → 10 cells)
    • Batch 2: data/results/additional_corrections_applied.json (3 DOIs → 4 cells)
  • Correction Definitions:
    • data/corrections/doi_corrections_17_invalid.yaml
    • data/corrections/additional_corrections_found.yaml
  • Validation Results:
    • data/results/doi_validation_22.json (validation of 22 invalid DOIs)
    • data/results/csv_all_dois_results.json (all CSV DOIs)

Citation Resources

  • All DOIs: data/results/all_doi_links.txt (158 unique DOIs)
  • Missing Citations: data/results/missing_citations_report.txt (77 missing)
  • Coverage Summary: notes/CITATION_COVERAGE_SUMMARY.md

Scripts

Located in scripts/, organized by function:

DOI Validation: scripts/doi_validation/

  • validate_failed_dois.py - Validate DOI HTTP resolution
  • validate_new_corrections.py - Validate correction candidates
  • find_correct_dois.py - Research correct DOI alternatives
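As an illustration of what a DOI resolution check can look like, the sketch below first screens the string syntactically and then asks the doi.org handle API whether the DOI is registered. It is not the code in validate_failed_dois.py: the function names and the loose regular expression are assumptions, and the reliance on `responseCode == 1` for a registered handle and HTTP 404 for an unknown one reflects how the public doi.org proxy is understood to behave, so confirm that against its documentation before depending on it.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request

# Loose syntactic screen: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(doi):
    """Cheap offline check that rejects PMIDs, URLs and empty cells."""
    return bool(DOI_PATTERN.match(doi.strip()))


def handle_api_url(doi):
    """Build the doi.org handle-API URL for a DOI (suffix characters are escaped)."""
    return "https://doi.org/api/handles/" + urllib.parse.quote(doi.strip(), safe="/")


def doi_is_registered(doi, timeout=10.0):
    """Network call: True if the proxy reports the handle as registered."""
    try:
        with urllib.request.urlopen(handle_api_url(doi), timeout=timeout) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # unknown handle
            return False
        raise
    return payload.get("responseCode") == 1
```

A registered DOI is not necessarily the right DOI for a given claim, so a check like this can only flag entries that fail to resolve; confirming that a resolving DOI matches the cited paper still needs a metadata comparison or manual review.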

DOI Corrections: scripts/doi_corrections/

  • apply_doi_corrections.py - Apply validated corrections
  • apply_additional_corrections.py - Batch corrections
  • clean_invalid_dois.py - Remove invalid DOIs

PDF Downloads: scripts/pdf_downloads/

  • download_all_pdfs_automated.py - Automated PDF retrieval
  • retry_failed_dois_with_fallbackpdf.py - Fallback PDF service

Schema: scripts/schema/

  • add_role_columns.py - Add organism/role columns
  • migrate_schema.py - Schema migration utility

Enrichment: scripts/enrichment/

  • enrich_ingredient_effects.py - Enrich ingredient data

Remaining Issues

1. Invalid DOIs Still Unresolved (6 total)

1 Pre-DOI Era Publication (should be removed/marked):

  • Thiamin + Cu/Fe (PMID 9481873) - published 1997, no DOI exists
  • Action: mark in CSV as "Not available"

5 Unable to Locate (may need institutional access):

  • Thiamin autoclave stability (10.1002/cbdv.201700122)
  • Cobalt upper bound toxicity (10.1007/s00424-010-0920-y)
  • Iron hydrolysis (10.1016/S0016-7037(14)00566-3)
  • Dysprosium EDTA chelation (10.1016/S0304386X23001494)
  • Cobalamin light sensitivity (10.1073/pnas.0804699108)

See notes/DOI_CORRECTIONS_FINAL_UPDATED.md for details.

2. Empty Organism Context Columns

21 organism context columns were added but are not yet populated:

  • Pattern: {Property} Citation Organism
  • Allowed values: scientific names, strain names, taxonomy, or "general"
  • File: data/raw/mp_medium_ingredient_properties.csv (columns 48-68)
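Progress on populating these columns can be tracked by counting non-empty cells per organism context column. The sketch below selects columns by the `{Property} Citation Organism` naming pattern given above; the function name and the sample header names are hypothetical, and it reads CSV text from memory so it can be tried without the real file.

```python
import csv
import io

SUFFIX = " Citation Organism"


def organism_column_fill(csv_text):
    """Return {column name: number of non-empty cells} for organism context columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [name for name in (reader.fieldnames or []) if name.endswith(SUFFIX)]
    filled = dict.fromkeys(columns, 0)
    for row in reader:
        for name in columns:
            if (row.get(name) or "").strip():  # treat whitespace-only cells as empty
                filled[name] += 1
    return filled


# Hypothetical two-row sample; header names are illustrative only.
sample = (
    "Ingredient,Solubility Citation Organism,Toxicity Citation Organism\n"
    "ingredient_1,general,\n"
    "ingredient_2,,\n"
)
fill = organism_column_fill(sample)
empty_columns = [name for name, count in fill.items() if count == 0]
print(fill)
print("still empty:", empty_columns)
```

For the real dataset, the text of data/raw/mp_medium_ingredient_properties.csv can be passed in directly; a column drops out of `empty_columns` as soon as one cell is filled.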

3. Missing Citations

77 missing citations identified across 18 ingredients:

  • See: data/results/missing_citations_report.txt

File Organization

MicroGrowAgents/
├── docs/
│   └── STATUS.md                    # ← You are here
├── notes/                           # Research & documentation
│   ├── DOI_CORRECTIONS_FINAL_UPDATED.md  # ⭐ Most important
│   ├── CITATION_COVERAGE_SUMMARY.md
│   └── ... (25+ other notes)
├── data/
│   ├── raw/
│   │   └── mp_medium_ingredient_properties.csv
│   ├── corrections/                 # DOI correction definitions (YAML/JSON)
│   └── results/                     # Validation & processing logs
├── scripts/                         # Organized by function
│   ├── doi_validation/
│   ├── doi_corrections/
│   ├── pdf_downloads/
│   ├── enrichment/
│   └── schema/
└── src/microgrowagents/schema/
    └── mp_medium_schema.yaml        # LinkML schema

Next Actions

  1. Remove/mark pre-DOI publication (1 entry, PMID 9481873; no DOI exists)
  2. Populate organism context columns (21 columns currently empty)
  3. Fill missing citations (77 missing DOI cells)
  4. Consider institutional access for 5 unable-to-locate DOIs

How to Use This Project

Validate DOIs

uv run python scripts/doi_validation/validate_failed_dois.py

Apply Corrections

# Edit data/corrections/doi_corrections_17_invalid.yaml first
uv run python scripts/doi_corrections/apply_doi_corrections.py

Download PDFs

uv run python scripts/pdf_downloads/download_all_pdfs_automated.py

View Results

  • DOI corrections: data/results/doi_corrections_applied.json
  • Validation: data/results/doi_validation_22.json
  • Full report: notes/DOI_CORRECTIONS_FINAL_UPDATED.md

References

  • Main CSV: 68 columns (47 data + 21 organism)
  • LinkML Schema: Defined in src/microgrowagents/schema/
  • Citation Coverage: 90.5% (143/158 DOIs)
  • Corrections Applied: 7 DOIs (14 CSV cells updated)

For detailed history, see files in notes/ directory.