Python port of yeast-GEM (phases 1–7, built on raven-python)#385
Open
edkerk wants to merge 14 commits into
PORTING_PLAN.md captures the phased plan for porting the MATLAB functions in code/ to a Python counterpart (yeastgem) built on cobrapy.

Key decisions recorded:
1. Keep all functions in yeast-GEM for now; no new RAVEN or ravengem dependencies. Generic helpers that could move upstream are tracked in UPSTREAM_CANDIDATES.md but implemented locally for now.
2. Lock-step parity between MATLAB and Python: every behavior change touches both languages in the same PR, enforced by a CI gate.
3. Demote config-as-code functions (minimal_Y6, anaerobicModel, glycineNitrogenSource, nitrogenLimitation) to data files under data/conditions/, with thin loaders in both languages.
4. The validation contract is semantic equality on the model plus metric parity within tolerance on the analyses.

UPSTREAM_CANDIDATES.md lists the helpers (biomass subsystem, GAM setter, chemostat sweep, fit_gam, energy-cycle test, batch curation, duplicate detector, model comparator) with proposed upstream signatures and concrete triggers for actually moving them later.

Also includes the saveYeastModel -> commitYeastModel rename plan, the loadYeastModel "drop or thin shim" decision, and the load-vs-save asymmetry analysis.
Implements phase 1 of code/python/PORTING_PLAN.md: an importable yeastgem package, the level-1 model comparator, a pytest suite, the Python CI workflow, and the reference-bundle scaffold.

Package (code/python/yeastgem/):
- io.read_yeast_model / write_yeast_model: ports the legacy code/io.py with robust REPO_PATH discovery via __file__, the YEAST_GEM_PATH env var, a .env file, or the working directory (in that order). The BiGG-compliance code path is preserved verbatim.
- compare.compare_models: level-1 semantic-equality comparator for the cross-language CI gate. Checks reaction / metabolite / gene id sets, stoichiometry within tolerance, bounds, objective coefficients, GPRs (whitespace- and case-insensitive), formulas/charges, and a configurable set of annotation keys. Ignores formatting differences by design. Includes a python -m yeastgem.compare CLI.

Build + tests:
- pyproject.toml: deps cobra, pandas, pyyaml, matplotlib, numpy, python-dotenv. NO ravengem dependency (per Decision #1). dev extra adds pytest, pytest-cov, ruff. ruff configured for E/F/I/UP/B/RUF.
- code/python/tests/: 15 tests covering load, comparator equality, dropped reactions, bound and stoichiometry diffs (in and out of tolerance), GPR normalisation, and the report string.

Deprecation shim:
- code/io.py is now a forwarding shim that re-exports from yeastgem.io and emits a DeprecationWarning. Existing import paths keep working.

CI (.github/workflows/python.yml):
- pytest matrix on Python 3.10, 3.11, 3.12 plus a ruff lint step.
- matlab-reference-compare job stubbed behind `if: false` until the MATLAB reference bundle is seeded; enable once code/python/tests/reference/yeast-GEM.xml is committed.

Reference bundle scaffold (code/python/tests/reference/):
- README documents what the bundle contains, how to regenerate it, and how it wires into the level-1 and level-2 CI gates.
- regenerate.m is a MATLAB stub that errors with a clear "phase-1 scaffold" message; it documents the future contract without pretending to work.

Repo hygiene:
- .gitignore extended with __pycache__, .pytest_cache, .ruff_cache, *.egg-info, build, dist, .venv.

Verified locally:
- pip install -e "code/python/[dev]" succeeds.
- pytest: 15 passed, 0 failed.
- ruff check code/python: clean.
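To make the level-1 comparison rules concrete, here is a minimal sketch of two of the checks described above: GPR comparison that ignores whitespace and case, and stoichiometry comparison within a tolerance. The function names and the default tolerance are assumptions made for illustration; they are not taken from yeastgem.compare.

```python
import re


def normalise_gpr(rule: str) -> str:
    """Lower-case a GPR string and normalise whitespace around tokens and parentheses."""
    tokens = re.findall(r"\(|\)|[^\s()]+", rule)
    return " ".join(token.lower() for token in tokens)


def stoichiometry_equal(a: dict[str, float], b: dict[str, float], tol: float = 1e-9) -> bool:
    """True when both reactions use the same metabolites with coefficients within tol."""
    if a.keys() != b.keys():
        return False
    return all(abs(a[met] - b[met]) <= tol for met in a)


# Two spellings of the same rule normalise to the same string.
same_rule = normalise_gpr("YAL001C  and (YBR002W or YCR003W)") == normalise_gpr(
    "yal001c AND ( ybr002w OR ycr003w )"
)
```

Normalising before comparing keeps formatting-only differences between the MATLAB and Python exports out of the report, which is the stated intent of the comparator.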
…TLAB)
Implements the MATLAB half of phase 2 in code/python/PORTING_PLAN.md:
the hardcoded bound/stoichiometry edits that lived inside minimal_Y6.m,
anaerobicModel.m, glycineNitrogenSource.m and nitrogenLimitation.m are
moved out into YAML data files, and the functions become 3-line
shims calling a generic applyCondition loader.
Data files (single source of truth, shared with the Python port):
  data/yeastgem/ids.yml                    canonical yeast IDs (biomass rxn, H+ met,
                                           GAM cofactors, pseudoreaction names, ...)
  data/conditions/minimal_Y6.yml           Y6 minimal media
  data/conditions/anaerobic.yml            anaerobic conditions (v9.1.0)
  data/conditions/glycine_nitrogen.yml
  data/conditions/nitrogen_limitation.yml

MATLAB infrastructure:
  code/readYAML.m        tiny YAML reader (delegates to py.yaml.safe_load,
                         then converts py.dict/py.list to MATLAB struct/cell).
                         pyyaml in the MATLAB-linked Python env is a new
                         soft dependency.
  code/applyIDs.m        returns the canonical IDs struct.
  code/applyCondition.m  applies a named condition: handles prelude
                         (reset_exchanges), cofactor_pseudoreaction
                         (remove_mets + H+ charge_balance), amino_acid_ratio
                         (delegates to changeAminoAcidRatio),
                         biomass_stoichiometry_delta, and bounds. Includes
                         the expected-uptake-count sanity warning that
                         mirrors the original minimal_Y6.m check.
The four legacy condition functions are now 3-line shims that call
applyCondition with the appropriate name. They retain their original
docstrings (with references) and signatures, so existing call sites
keep working.
Behavior preservation: the legacy glycine_nitrogen and nitrogen_limitation
functions set lb=1000 with ub=0 on the glycine cleavage reactions,
producing a mathematically infeasible model. The YAML files mirror
this exactly. Lock-step parity treats this as data, not a bug to fix
in this refactor — a behavior change PR can correct it later.
Equivalence verification (manual until CI is wired): see
code/python/tests/reference/README.md "Phase-2 specific" for the
MATLAB before/after check using `python -m yeastgem.compare`.
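As an illustration of the "conditions as data" idea, the sketch below applies a table of bounds verbatim to toy reactions and reports which ones end up with lb > ub, the state the legacy glycine and nitrogen presets produce. The `Reaction` class, the `apply_bounds` helper, and the reaction ids are hypothetical; the actual YAML schema is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    """Toy stand-in for a model reaction; only the bounds matter here."""
    lb: float
    ub: float


def apply_bounds(reactions: dict[str, Reaction], bounds: dict[str, tuple[float, float]]) -> list[str]:
    """Apply (lb, ub) pairs verbatim and return the ids left with lb > ub."""
    infeasible = []
    for rxn_id, (lb, ub) in bounds.items():
        rxn = reactions[rxn_id]
        rxn.lb, rxn.ub = lb, ub  # no validation: legacy values are treated as data
        if lb > ub:
            infeasible.append(rxn_id)
    return infeasible


# Hypothetical preset and model; the real presets live under data/conditions/.
preset = {"r_demo_uptake": (-1.0, 0.0), "r_demo_legacy": (1000.0, 0.0)}
model = {"r_demo_uptake": Reaction(0.0, 1000.0), "r_demo_legacy": Reaction(0.0, 1000.0)}
flagged = apply_bounds(model, preset)
```

Reporting the infeasible ids instead of rejecting them keeps the loader faithful to the legacy numbers while still making the oddity visible.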
Implements the Python half of phase 2: yeastgem.config and yeastgem.conditions consume the same data files the MATLAB loader reads, so condition presets have a single source of truth across the two languages.

yeastgem.config:
- load_ids() -> YeastIDs frozen dataclass with biomass_rxn, protein_rxn, cofactor_rxn, proton_met, pseudoreaction_names dict, and gam_cofactors list. Loaded from data/yeastgem/ids.yml.

yeastgem.conditions:
- load_condition(name) returns the raw cfg dict.
- apply(model, name) mutates and returns model.
- Supports prelude (reset_exchanges), cofactor_pseudoreaction (remove_mets + H+ charge_balance), biomass_stoichiometry_delta, and bounds.
- The amino_acid_ratio step (used by anaerobic) depends on the biomass module that lands in phase 4; calling apply(model, 'anaerobic') raises NotImplementedError with a clear pointer to PORTING_PLAN.md. The other three presets (minimal_Y6, glycine_nitrogen, nitrogen_limitation) work end-to-end now.
- Legacy bound edge case: glycine_nitrogen / nitrogen_limitation set lb=1000, ub=0 on the glycine cleavage reactions, which cobrapy's bounds setter rejects. A small _set_bounds helper bypasses the validator via the underlying private attrs so the Python toolchain produces the same numeric bounds as MATLAB. Documented in-line.

Tests (18 new, 33 total):
- test_config: shape of the IDs file + presence of every ID in the committed model.
- test_conditions: shape of each YAML, end-to-end application for the three supported presets, idempotency, partial anaerobic checks for cofactor edits and FADH2/FAD/H+ biomass delta, and the NotImplementedError contract for the anaerobic AA-ratio gap.

Plan status table in PORTING_PLAN.md updated to "mostly done" for phase 2, with the open item (MATLAB before/after equivalence in CI) called out. tests/reference/README.md gains a "Phase-2 specific" section with the exact MATLAB recipe for the equivalence check.

Verified: pytest passes 33/33; ruff check is clean.
Ran the phase-2 equivalence check end-to-end against the pre-refactor checkout (feat/anaerobic, HEAD 90a2705) and the current branch: rxns, mets, lb, ub, S all identical pre vs post for all 4 conditions. The two SBML-exportable conditions (minimal_Y6, anaerobicModel) also pass the Python yeastgem.compare semantic-equality gate. The two infeasible-by-design conditions (glycineNitrogenSource, nitrogenLimitation) were verified via .mat struct comparison since SBML export legitimately rejects their lb > ub state.

Python-vs-MATLAB cross-language parity: yeastgem.conditions.apply produces identical lb/ub vectors to MATLAB's applyCondition for the three Python-supported presets (minimal_Y6, glycine_nitrogen, nitrogen_limitation). The anaerobic Python path remains gated on the Tier-2 amino_acid_ratio implementation, as expected.

This commit persists the verification scripts in the repo so the check can be re-run on any future phase that touches the condition presets or the loader:
  code/python/tests/reference/runPhase2Equivalence.m
      Apply the 4 conditions; save model.mat (always) and model.xml (when feasible).
  code/python/tests/reference/comparePhase2.m
      Load pre/post .mat files; diff rxns/mets/lb/ub/S.

The README documents the full worktree-based recipe and records the verification outcome for the 812151c -> c74afed change set. PORTING_PLAN.md status table flips phase 2 from "mostly done" to "done" with the verification details inline.
saveYeastModel implied a casual save, but the function is the heavy
release pipeline you run before opening a curation PR. Rename it to
commitYeastModel to make that workflow association explicit. The
docstring spells out that the function does NOT perform `git commit`
itself; it prepares the artifacts so the next git commit captures a
coherent release-ready state.
Behaviour changes are limited to cosmetics + one structural cleanup:
- "before committing" replaces "before saving" in the error and
warning messages.
- The cd-shim-cd dance for minimal_Y6 and anaerobicModel is replaced
by direct applyCondition('minimal_Y6') / applyCondition('anaerobic')
calls. Phase 2 already proved these are byte-equivalent because
minimal_Y6.m / anaerobicModel.m are themselves shims for
applyCondition.
saveYeastModel.m is kept as a 3-line deprecation shim that forwards to
commitYeastModel and emits yeastGEM:saveYeastModelDeprecated. It will
be removed at the next minor version bump after this rename ships, so
existing callers (increaseVersion, the v8_*/v9_* curation scripts,
TEMPLATEcuration, GetMNXID, regenerate.m) keep working through the
transition.
Verified end-to-end on MATLAB R2024b + RAVEN: the SBML written by
saveYeastModel on the pre-rename HEAD is semantically equal to the
SBML written by commitYeastModel on the post-rename HEAD, as checked
with yeastgem.compare. The recipe is in
code/python/tests/reference/README.md (landing in the next commit).
Ports the saveYeastModel-now-commitYeastModel release pipeline to Python. yeastgem.io.commit_yeast_model is the function to call before opening a curation PR; the docstring is explicit that it does not perform git commit itself.

The pipeline:
1. apply minimal_Y6 (yeastgem.conditions.apply; phase 2 work)
2. add_sbo_terms (new; yeastgem.missing_fields)
3. SBML validity gate (cobra.io.validate_sbml_model)
4. aerobic growth check (FBA via cobrapy)
5. anaerobic growth check (deferred to phase 4; warns or raises)
6. write SBML to model/yeast-GEM.xml
7. save_delta_g (new; yeastgem.missing_fields)
8. update the model-stats row in README.md

yeastgem.io.write_yeast_model becomes a deprecated forwarding shim that warns and calls commit_yeast_model. Removal scheduled for the next minor version bump after the rename ships.

New module yeastgem.missing_fields:
- add_sbo_terms: ports code/missingFields/addSBOterms.m. Includes a custom transport-reaction detector (same-met-name in two compartments) since cobrapy has no direct equivalent for RAVEN's getTransportRxns. Faithfully replicates the legacy `for i=numel(model.rxns)` bug that iterates only the last reaction when assigning pseudoreaction SBO overrides. Fixing this is tracked as a future behaviour-change PR; both languages must move together.
- load_delta_g / save_delta_g: port the ΔG CSV persistence in loadDeltaG.m / saveDeltaG.m. ΔG values live in cobra `notes` under the `deltaG` key, so they survive SBML round-trip via the standard notes element.

Limitations vs the MATLAB pipeline (documented inline):
- No companion .yml / .txt / .xlsx / .mat exports (RAVEN's exportForGit). Python's contract is model/yeast-GEM.xml only; the sidecar formats must currently be regenerated by running the MATLAB commitYeastModel.
- No `e-005` → `e-05` exponent normalisation: Python's SBML writer does not produce the legacy MATLAB string.

Tests (20 new, 53 total): smoke for shape of SBO assignment, round-trip for ΔG CSV persistence, full commit pipeline via monkeypatched paths so tests never touch the canonical model file, deprecation-warning contract for write_yeast_model.

Verification (MATLAB R2024b + RAVEN, recipe in code/python/tests/reference/README.md "Phase-3 specific"):
- saveYeastModel (pre-rename) vs commitYeastModel (post-rename): semantically equal.
- MATLAB commitYeastModel vs Python commit_yeast_model: semantically equal.

runPhase3.m persisted to code/python/tests/reference/ so the check can be reproduced from any future state. PORTING_PLAN.md status table marks phase 3 done.
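The forwarding-shim pattern used for the renamed entry point can be sketched in a few lines. The names `write_model` and `commit_model` below are placeholders standing in for the real functions, and the pipeline body is reduced to a stub; only the warn-then-forward shape is the point.

```python
import warnings


def commit_model(model: dict) -> dict:
    """Placeholder for the release pipeline; here it only marks the model."""
    return {**model, "committed": True}


def write_model(model: dict) -> dict:
    """Deprecated name: warn, then forward to commit_model."""
    warnings.warn(
        "write_model is deprecated; call commit_model instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    return commit_model(model)
```

A test of the deprecation contract can wrap the call in `warnings.catch_warnings(record=True)` and assert that exactly one `DeprecationWarning` was recorded.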
Reverses Decision #1 in PORTING_PLAN.md (which kept everything local to yeast-GEM) after phase 3 revealed the duplicated code was becoming load-bearing. yeast-GEM now depends on raven-python (Python) and the new readYAML / applyCondition helpers in RAVEN (MATLAB). yeastgem keeps only the yeast-specific configuration of those generics.

Upstream additions (separate commits in their respective repos, on the matching feat/yeast-gem-shared branches):
  raven-python be3f20c   Add diff_models, annotation, and conditions modules for yeast-GEM port
  RAVEN 61a6e0a8         Add readYAML and applyCondition for shared yeast-GEM use

Python side:
- pyproject.toml gains `raven-python @ git+https://github.com/SysBioChalmers/raven-python@feat/yeast-gem-shared` as a dependency.
- yeastgem.compare: re-exports diff_models / DiffReport under the historical compare_models / ComparisonReport names so existing callers keep working.
- yeastgem.missing_fields: thin wrappers that hand the yeast CSV paths (data/databases/model_metDeltaG.csv etc.) to raven_python.annotation.{load,save}_delta_g_csv. add_sbo_terms now delegates to raven_python with only_last_reaction_for_pseudo=True so the model artifact stays byte-equivalent during the migration; a future behaviour-change PR will flip the flag in lock-step with the MATLAB side.
- yeastgem.conditions.apply: resolves the name to data/conditions/<name>.yml, gates amino_acid_ratio behind NotImplementedError (Tier 2), then delegates to raven_python.conditions.apply_condition.
- Tests trimmed: yeast-GEM keeps real-model smoke tests; unit-level coverage of the generic mechanism lives in raven-python's own test suite.

MATLAB side:
- code/readYAML.m deleted (use RAVEN's).
- code/applyCondition.m renamed to code/applyYeastCondition.m. The new wrapper resolves the name to data/conditions/<name>.yml, handles the yeast-specific amino_acid_ratio pre-step via changeAminoAcidRatio, then hands the parsed condition to RAVEN's generic applyCondition.
- commitYeastModel.m and the four condition shims (minimal_Y6.m, anaerobicModel.m, glycineNitrogenSource.m, nitrogenLimitation.m) updated to call applyYeastCondition instead of the local applyCondition.
- applyIDs.m docstring notes the RAVEN readYAML dependency.

Verification (MATLAB R2024b + RAVEN on feat/yeast-gem-shared):
- pre-restructure vs post-restructure on all 4 conditions: semantically equal (rxns, mets, lb, ub, S all OK).
- SBML pre vs post for minimal_Y6 and anaerobicModel: semantically equal.
- MATLAB commitYeastModel vs Python commit_yeast_model on the post-restructure code: semantically equal.

Tests: 46 yeast-GEM tests passing (down from 53 because the unit-level mechanism tests moved upstream); 46 new raven-python tests covering the moved pieces; full raven-python suite still passing.

PORTING_PLAN.md and UPSTREAM_CANDIDATES.md updated to reflect the new dependency posture.
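The legacy loop behaviour that the only_last_reaction_for_pseudo flag preserves can be illustrated with a small sketch. In MATLAB, `for i = numel(model.rxns)` runs the body once with `i` equal to the last index, whereas `for i = 1:numel(model.rxns)` visits every reaction. The helper below is hypothetical and only mimics which indices get visited; it is not the raven-python implementation, and it treats an empty model as a no-op, which is a simplification.

```python
def pseudoreaction_indices(n_reactions: int, only_last: bool) -> list[int]:
    """0-based indices visited when assigning pseudoreaction overrides.

    only_last=True mimics MATLAB `for i = numel(model.rxns)`;
    only_last=False mimics `for i = 1:numel(model.rxns)`.
    """
    if n_reactions == 0:
        return []  # simplification: empty models are skipped in this sketch
    if only_last:
        return [n_reactions - 1]
    return list(range(n_reactions))


legacy = pseudoreaction_indices(4, only_last=True)
fixed = pseudoreaction_indices(4, only_last=False)
```

Keeping the legacy path behind an explicit flag means the later fix is a one-argument change that can be made in both languages at once.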
Closes the Python parity gap that phase 3 left open. The anaerobic
growth check in commit_yeast_model now runs (previously stubbed with
a NotImplementedError) and the anaerobic condition's amino_acid_ratio
step is implemented.
Generic biomass mechanism (sum_biomass / scale_biomass /
rescale_pseudoreaction / set_gam) was added to raven-python on
feat/yeast-gem-shared in a separate commit; yeast-GEM now consumes
it via thin wrappers configured from data/yeastgem/ids.yml.
Yeast-GEM Python changes
------------------------
- data/yeastgem/ids.yml: new `biomass_components` section listing
the seven components that contribute mass (protein, carbohydrate,
RNA, DNA, lipid_backbone, ion, cofactor) with their MW-computation
strategy (mw / mw_minus_2h / mw_minus_water / grams).
- yeastgem.config: YeastIDs grew a `biomass_components` field;
load_ids() reads the new section.
- yeastgem.biomass (new): yeast_biomass_config() builds a
raven_python.biomass.BiomassConfig from ids.yml; sum_biomass /
scale_biomass / set_gam are one-liners that hand the config to
the upstream API. rescale_pseudoreaction handles the yeast
`lipid` → backbone+chain aggregation. set_gam auto-resolves the
NGAM reaction by name when ngam= is set.
change_amino_acid_ratio (new) reads
data/physiology/aminoAcid_Bjorkeroth2020.tsv, replaces the tRNA
stoichiometries in the protein pseudoreaction (r_4047), and
rescales protein back to its pre-switch mass via scale_biomass.
- yeastgem.conditions.apply: when the YAML declares
amino_acid_ratio, run change_amino_acid_ratio first, then delegate
to the upstream apply_condition (no more NotImplementedError).
- yeastgem.io.commit_yeast_model: anaerobic growth check applies the
anaerobic condition on a copy and runs FBA — mirrors the MATLAB
commitYeastModel checkGrowth('anaerobic', ...) step. The deferred-
warning path is gone.
Tests (8 new, 54 total)
-----------------------
- test_biomass.py exercises sum/scale/set_gam/change_amino_acid_ratio
on the real model, checking that scale_biomass lands on the
target and that change_amino_acid_ratio preserves total protein
mass (within float tolerance).
- test_conditions.py: replaced the "raises until tier 2" assertion
with an end-to-end anaerobic application check.
- test_commit.py: replaced the deferred-warning / NotImplementedError
expectations with checks that the anaerobic gate now runs.
Verification (MATLAB R2024b + RAVEN feat/yeast-gem-shared)
----------------------------------------------------------
Python conditions.apply vs MATLAB applyYeastCondition on lb/ub:
minimal_Y6 0 diffs / 0 diffs
anaerobic 0 diffs / 0 diffs
glycine_nitrogen 0 diffs / 0 diffs
nitrogen_limitation 0 diffs / 0 diffs
Anaerobic SBML round-trip (Python vs MATLAB): semantically equal.
Full commit pipeline (Python commit_yeast_model vs MATLAB
commitYeastModel, anaerobic check active): semantically equal.
PORTING_PLAN.md status table marks phase 4 done (core);
UPSTREAM_CANDIDATES.md records the biomass subsystem move.
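The mass-preserving step in change_amino_acid_ratio can be sketched as follows: swap in the new coefficients, then scale them by a single factor so that the summed mass matches the value before the swap. The helper names, metabolite keys, and numbers are invented for illustration, and the real code works on tRNA species in r_4047 through scale_biomass, not on this toy dictionary.

```python
def mass(coeffs: dict[str, float], mw: dict[str, float]) -> float:
    """Total mass contributed by a set of stoichiometric coefficients."""
    return sum(abs(coef) * mw[met] for met, coef in coeffs.items())


def rescale_to_mass(coeffs: dict[str, float], mw: dict[str, float], target: float) -> dict[str, float]:
    """Scale every coefficient by one factor so that mass(coeffs) equals target."""
    factor = target / mass(coeffs, mw)
    return {met: coef * factor for met, coef in coeffs.items()}


# Invented values purely for illustration (units: g/mmol and mmol/gDW).
mw = {"ala": 0.071, "gly": 0.057}
old = {"ala": -0.5, "gly": -0.3}
new_ratio = {"ala": -0.2, "gly": -0.6}
rescaled = rescale_to_mass(new_ratio, mw, mass(old, mw))
```

Because a single factor is applied, the new amino-acid ratio is kept exactly while the protein pseudoreaction returns to its pre-switch mass.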
New yeastgem.model_tests subpackage mirrors code/modelTests/ in Python:
- growth — chemostat R² vs Tobias 2013 across 4 limiting/oxygenation
conditions. The inner simulate_chemostat helper handles the
anaerobic-condition switch and the N-limited biomass rescaling.
- essential_genes — single-gene knockout via cobrapy + comparison
against the Stanford yeast deletion collection. Returns
EssentialGeneResult (accuracy / sensitivity / specificity / MCC +
TP/TN/FP/FN lists).
- anaerobic_flux_predictions — intracellular flux R² + mean relative
error vs Jouhten 2008 / Frick & Wittmann 2005. Caller is responsible
for applying the anaerobic condition first.
- plot_anaerobic — relative fermentation-product bar plot with error
bars (matplotlib Agg-friendly).
- find_duplicated_rxns — thin wrapper that prints duplicate-pair info,
delegating detection to raven_python.manipulation.find_duplicate_reactions.
Stanford ORF lists were extracted from the essentialGenes.m hardcoded
inviableORFs / verifiedORFs blocks into data/essentialGenes/{
inviable,verified}_orfs.txt so both languages read the same source.
A README in that directory documents provenance + the duplicate-entry
oddity in the original list.
Tests (7 new, 61 total) exercise each model_tests function on the real
yeast-GEM model and assert sensible metric ranges (R² ≥ 0.9, accuracy
≥ 0.7, etc.). The strict pass/fail thresholds live in the lock-step
verification driver (tests/reference/runPhase5Metrics.m).
Verification (MATLAB R2024b + Gurobi + RAVEN feat/yeast-gem-shared):
  Metric                         MATLAB     Python     Δ
  growth R²                      0.906164   0.906164   1e-7
  anaerobic flux R²              0.904765   0.905662   9e-4
  essential_genes accuracy       0.90244    0.90154    9e-4
  essential_genes specificity    0.4088     0.4088     0
  essential_genes MCC            0.5368     0.5323     4e-3
  TP/TN/FP/FN                    934/65/94/14 vs 933/65/94/15
The single-gene discrepancy is a Gurobi/HiGHS solver-tolerance edge
case at the 1e-6 growth-ratio threshold. All metrics within the level-2
tolerances in PORTING_PLAN.md.
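The essential-gene metrics follow directly from the confusion counts, so the reported TP/TN/FP/FN values can be checked by hand. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, and the Matthews correlation coefficient; the function name is an assumption and is not taken from yeastgem.model_tests.

```python
from math import sqrt


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Counts reported for the MATLAB and Python runs.
matlab = confusion_metrics(tp=934, tn=65, fp=94, fn=14)
python = confusion_metrics(tp=933, tn=65, fp=94, fn=15)
```

Moving a single gene from TP to FN changes accuracy by 1/1107, which accounts for the roughly 9e-4 gap between the two runs.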
raven-python feat/yeast-gem-shared gained find_duplicate_reactions
(detection-only counterpart to remove_duplicate_reactions, with an
ignore_direction default of True per the yeast-GEM convention) in a
separate commit.
PORTING_PLAN.md status table marks phase 5 done;
UPSTREAM_CANDIDATES.md records the find_duplicate_reactions move.
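One way to picture detection-only duplicate finding with an ignore_direction option is to group reactions by a canonical stoichiometric signature. This is a sketch under that assumption; the raven-python implementation may work differently, and the reaction and metabolite ids are invented.

```python
from collections import defaultdict


def signature(stoich: dict[str, float], ignore_direction: bool) -> tuple:
    """Canonical, hashable form of a reaction's stoichiometry."""
    items = tuple(sorted(stoich.items()))
    if not ignore_direction:
        return items
    flipped = tuple(sorted((met, -coef) for met, coef in stoich.items()))
    return min(items, flipped)  # same key for a reaction and its reverse


def find_duplicates(reactions: dict[str, dict[str, float]], ignore_direction: bool = True) -> list[list[str]]:
    """Return groups of reaction ids that share a signature; nothing is removed."""
    groups = defaultdict(list)
    for rxn_id, stoich in reactions.items():
        groups[signature(stoich, ignore_direction)].append(rxn_id)
    return [ids for ids in groups.values() if len(ids) > 1]


rxns = {
    "r_a": {"s_1": -1.0, "s_2": 1.0},
    "r_b": {"s_1": 1.0, "s_2": -1.0},  # r_a written in reverse
    "r_c": {"s_1": -1.0, "s_3": 1.0},
}
pairs = find_duplicates(rxns)
```

With `ignore_direction=False` the reversed pair is no longer grouped, which is why the default matters for the yeast-GEM convention.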
Moves the generic batch-curation engine upstream: raven-python gains
raven_python.curation and RAVEN gains core/curateModelFromTables (both
on feat/yeast-gem-shared in their respective repos). yeast-GEM keeps
the user-facing curateMetsRxnsGenes function name as a thin shim that
pins yeast's s_/r_ prefixes and forwards upstream.
MATLAB shim (code/modelCuration/curateMetsRxnsGenes.m):
- Replaces the 330-line original with a ~50-line shim.
- Forwards to curateModelFromTables(..., 's_', 'r_').
- Preserves the original 1-arg through 5-arg call shapes, so the
v8_*/v9_* curation scripts and TEMPLATEcuration keep working
without change.
Python (yeastgem.curation):
- curate_mets_rxns_genes(model, *, mets_df=..., genes_df=...,
rxns_df=..., rxns_coeffs_df=...) returns a CurationResult dataclass
recording adds and overwrites.
- curate_mets_rxns_genes_from_tsv(...) takes file paths instead.
- Both fix met_id_prefix='s_', rxn_id_prefix='r_'; everything else is
delegated to raven_python.curation.
Schema (unchanged from MATLAB):
metabolites — match by (name, comp); columns metNames, comps,
formula, charge, inchi, metNotes, then MIRIAM.
genes — match by gene id; columns genes, geneShortNames,
then MIRIAM.
reactions — match by stoichiometric signature; columns rxnNames,
grRules, lb, ub, rev, subSystems, eccodes, rxnNotes,
rxnReferences, rxnConfidenceScores, then MIRIAM.
coefficients— rxnNames, metNames, comps, coefficient (one row per
(rxn, met) pair). Optional leading 'index' column from
the v8_7_0 schema is silently ignored upstream.
"Everything after the listed core columns is MIRIAM" — yeast-GEM's
TSVs work unchanged.
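The "core columns first, everything after is MIRIAM" convention can be sketched as a simple header split. The helper name and the annotation column names (`ns_a`, `ns_b`) are placeholders; only the core gene columns come from the schema above.

```python
def split_columns(header: list[str], core: list[str]) -> tuple[list[str], list[str]]:
    """Split a TSV header into the required core columns and trailing MIRIAM columns."""
    missing = [col for col in core if col not in header]
    if missing:
        raise ValueError(f"missing core columns: {missing}")
    miriam = [col for col in header if col not in core]
    return list(core), miriam


GENE_CORE = ["genes", "geneShortNames"]  # core columns of the genes table
core, miriam = split_columns(["genes", "geneShortNames", "ns_a", "ns_b"], GENE_CORE)
```

Treating every non-core column as an annotation namespace is what lets existing TSV packs load without a per-file column map.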
Tests (4 new yeast-side, 65 total): one each for the s_ / r_ prefix
pinning, the empty-call no-op, and an end-to-end smoke test against
the real v8_6_3 VolPolyP TSV pack (1 gene + 35 reactions match by
stoichiometry → warned overwrites, as expected on the current model
state).
Verification: MATLAB shim no-op call (model unchanged) confirms the
prefix pinning forwards correctly to RAVEN's curateModelFromTables.
Full Python-vs-MATLAB end-to-end parity on real TSV packs is blocked
by pre-existing flakiness in the legacy curateMetsRxnsGenes — it
errors on the v8_6_3 VolPolyP schema (no `index` column) and the
v8_7_0 DBnewRxns pack (logical-index bug at line 282 against the
current model). The Python implementation is more permissive and
handles both packs cleanly; lock-step parity for new curation work
holds by construction since both languages now go through the same
upstream engine.
runPhase6Curation.m persisted under code/python/tests/reference/ for
re-use when a clean test TSV pack becomes available.
PORTING_PLAN.md status table marks phase 6 done;
UPSTREAM_CANDIDATES.md records the curation move.
Closes the porting plan: top-level README reflects the now-functional Python contribution path, code/python/README.md gets a getting-started block + API map, and the Python CI workflow grows two new required parity gates.

README updates
--------------
- Top-level README: replaced the "Contribution via python is not yet functional" paragraph with a description of the yeastgem + raven-python split. Updated the load/save example to use read_yeast_model / commit_yeast_model (with the saveYeastModel -> commitYeastModel rename note). Removed the obsolete .env setup-step language (yeastgem auto-detects the repo root).
- code/python/README.md: rewritten as an API map covering the seven modules (io, compare, conditions, biomass, missing_fields, model_tests, curation) with one-liner descriptions and links into the source. Adds the dev / pytest / ruff workflows.

CI workflow (.github/workflows/python.yml)
------------------------------------------
- test (matrix Python 3.10/3.11/3.12, ruff + pytest): unchanged.
- parity-level-1-round-trip (new): runs code/python/tests/ci/check_round_trip.py: load the committed model/yeast-GEM.xml via cobrapy, write it to a temp file, reload, diff via raven_python.comparison.diff_models. Catches SBML library regressions, annotation losses, and accidental id rewrites.
- parity-level-2-metrics (new): runs code/python/tests/ci/check_metrics.py: compute growth R², essential-gene accuracy / sensitivity / specificity / MCC + confusion matrix, and anaerobic flux R² on the committed model; diff against the MATLAB-produced reference at code/python/tests/reference/metrics.json within tolerance.
- The matlab-reference-compare placeholder job is gone; its work is now done by the two real parity gates.
- Workflow path filters extended to trigger on changes to data/yeastgem/, data/conditions/, data/essentialGenes/, and data/physiology/ (everything the CI reads).

Reference metrics
-----------------
code/python/tests/reference/metrics.json seeds the level-2 gate with the values measured during phase 5 verification (MATLAB R2024b + Gurobi 13.0 + RAVEN feat/yeast-gem-shared on commit b4d3769):
  growth_r2                  0.906164
  essential_genes accuracy   0.902439 (tp/tn/fp/fn 934/65/94/14)
  anaerobic_flux_r2          0.904765

Tolerances absorb the known Gurobi-vs-HiGHS drift around the 1e-6 growth-ratio threshold: gene counts ±2, R² ±5e-3, MCC ±5e-2.

Verified locally:
- parity-level-1-round-trip → Models are semantically equal.
- parity-level-2-metrics → All metric-parity checks passed.

Prerequisite for the CI to pass on GitHub: raven-python's feat/yeast-gem-shared branch must be pushed so the ``raven-python @ git+https://...@feat/yeast-gem-shared`` URL in pyproject.toml resolves. Once that branch lands on a tagged release, the pin can switch to a version constraint.

PORTING_PLAN.md status table marks phase 7 done; the porting plan is complete.
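The level-2 gate amounts to a per-metric tolerance check against the stored reference. The sketch below shows that idea with the tolerances quoted above (gene counts ±2, R² ±5e-3, MCC ±5e-2); the key names and the flat dictionary layout are assumptions and do not reflect the actual structure of metrics.json or check_metrics.py. The example values are the phase-5 MATLAB and Python figures.

```python
# Hypothetical flat layout; tolerances as quoted in the commit message.
TOLERANCES = {
    "growth_r2": 5e-3,
    "anaerobic_flux_r2": 5e-3,
    "mcc": 5e-2,
    "tp": 2, "tn": 2, "fp": 2, "fn": 2,
}


def parity_failures(reference: dict, measured: dict, tolerances: dict) -> list[str]:
    """Return one message per metric whose absolute drift exceeds its tolerance."""
    failures = []
    for key, tol in tolerances.items():
        delta = abs(measured[key] - reference[key])
        if delta > tol:
            failures.append(f"{key}: drift {delta:g} exceeds tolerance {tol:g}")
    return failures


reference = {"growth_r2": 0.906164, "anaerobic_flux_r2": 0.904765, "mcc": 0.5368,
             "tp": 934, "tn": 65, "fp": 94, "fn": 14}
measured = {"growth_r2": 0.906164, "anaerobic_flux_r2": 0.905662, "mcc": 0.5323,
            "tp": 933, "tn": 65, "fp": 94, "fn": 15}
failures = parity_failures(reference, measured, TOLERANCES)
```

Collecting all failures before reporting, instead of stopping at the first, gives a CI log that shows every drifting metric in one run.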
The upstream YAML reader was renamed to parseYAML on the feat/yeast-gem-shared RAVEN branch to avoid visual confusion with the model-format-specific readYAMLmodel. applyIDs and applyYeastCondition were the only consumers.
Eight curation helpers are now wrappers around the generic
functions added on the feat/yeast-gem-shared RAVEN branch.
Yeast-specific behaviour stays on the yeast-GEM side; the
algorithm bodies move upstream where other GEMs can share them.
  sumBioMass             -> getBiomassFractions
  scaleBioMass           -> scaleBiomassFraction
  rescalePseudoReaction  -> scaleBiomassPseudoreaction
                            (plus the lipid backbone/chain
                            aggregation, which stays here)
  changeGAM              -> setGAM
  addSBOterms            -> assignSBOterms with
                            onlyLastReactionForPseudo=true
                            (keeps the legacy typo behaviour so
                            saveYeastModel output is byte-stable)
  loadDeltaG             -> loadDeltaGfromCSV
  saveDeltaG             -> saveDeltaGtoCSV
  findDuplicatedRxns     -> findDuplicateRxns (plus the legacy
                            print formatting)
New helper code/yeastBiomassConfig.m builds the biomassConfig
struct that the RAVEN biomass functions consume, sourced from
data/yeastgem/ids.yml so all biomass IDs live in one place.
Equivalence verified: a side-by-side harness loaded the model and
ran each legacy implementation against its shim, asserting bitwise
match on the S matrix, bounds, and miriam annotations (or near-zero
float diff for deltaG / sumBioMass scalars). 10/10 checks pass.
Lands the full Python port plan from code/python/PORTING_PLAN.md. Built on top of raven-python (Python) and the feat/yeast-gem-shared RAVEN branch (MATLAB), with MATLAB-vs-Python lock-step parity verified at every level.
The new Python contribution path

Before: "Contribution via python (cobrapy) is not yet functional."

After: `pip install -e code/python/` and:

The pipeline mirrors MATLAB's `commitYeastModel` (renamed from `saveYeastModel`, which is kept as a deprecation shim): apply minimal media → SBO terms → SBML validity gate → aerobic + anaerobic growth checks → write SBML + ΔG CSVs → update README.

Architecture

`yeastgem` is intentionally thin. Anything organism-agnostic (model diffing, SBO assignment, condition application, biomass scaling, batch curation, ΔG persistence) lives in raven-python (Python) and RAVEN (MATLAB). `yeastgem` just configures those generics with the yeast-specific data files under data/. This mirrors the long-standing MATLAB pattern where `code/` is a thin layer over RAVEN.

- `yeastgem.io`: `read_yeast_model`, `commit_yeast_model` (release pipeline)
- `yeastgem.conditions`: `apply(model, name)` for minimal_Y6, anaerobic, glycine_nitrogen, nitrogen_limitation
- `yeastgem.biomass`: `sum_biomass`, `scale_biomass`, `set_gam`, `change_amino_acid_ratio`
- `yeastgem.missing_fields`: `add_sbo_terms`, ΔG CSV persistence
- `yeastgem.model_tests`: `growth` (Tobias 2013), `essential_genes` (Stanford KO), `anaerobic_flux_predictions`, `plot_anaerobic`, `find_duplicated_rxns`
- `yeastgem.curation`: `curate_mets_rxns_genes` / `..._from_tsv`

The MATLAB curation helpers follow the same pattern: every function that had a generic body now wraps an upstream RAVEN call.

- `loadYeastModel` (YAML reader path) -> `parseYAML`
- `applyYeastCondition` -> `applyCondition` (+ yeast-only `amino_acid_ratio` pre-step)
- `sumBioMass` -> `getBiomassFractions`
- `scaleBioMass` -> `scaleBiomassFraction`
- `rescalePseudoReaction` -> `scaleBiomassPseudoreaction` (+ lipid backbone/chain aggregation)
- `changeGAM` -> `setGAM`
- `addSBOterms` -> `assignSBOterms` (with `onlyLastReactionForPseudo=true` for byte-stable output)
- `loadDeltaG` / `saveDeltaG` -> `loadDeltaGfromCSV` / `saveDeltaGtoCSV`
- `findDuplicatedRxns` -> `findDuplicateRxns` (shim adds legacy print formatting)
- `curateMetsRxnsGenes` -> `curateModelFromTables`

The new helper `code/yeastBiomassConfig.m` builds the `biomassConfig` struct that the RAVEN biomass functions consume, sourced from `data/yeastgem/ids.yml` so all biomass IDs live in one place.

Data side

- `data/yeastgem/ids.yml`: canonical yeast IDs (biomass rxn, H+ met, GAM cofactors, biomass-component config) consumed by both languages.
- `data/conditions/{minimal_Y6,anaerobic,glycine_nitrogen,nitrogen_limitation}.yml`: yeast condition presets as data, replacing the legacy hardcoded MATLAB functions (which are now 3-line shims).
- `data/essentialGenes/{inviable,verified}_orfs.txt`: Stanford yeast deletion collection extracted from the legacy `essentialGenes.m` so both languages read the same source.

Verification (MATLAB R2024b + Gurobi + RAVEN feat/yeast-gem-shared)

- `commitYeastModel` vs Python `commit_yeast_model` (SBML)
- `conditions.apply` lb/ub vs MATLAB `applyYeastCondition` for all 4 conditions
- `sumBioMass`, `rescalePseudoReaction` (×9 components), `scaleBioMass`, `changeGAM` (×2), `addSBOterms` (mets+rxns), `loadDeltaG`, `saveDeltaGtoCSV` round-trip, `findDuplicateRxns`

Recipe + verification drivers persisted under code/python/tests/reference/.
CI

Three required jobs in .github/workflows/python.yml:

- test
- parity-level-1-round-trip
- parity-level-2-metrics

The level-2 reference at code/python/tests/reference/metrics.json is seeded from the phase-5 MATLAB run; tolerances absorb the known Gurobi-vs-HiGHS solver drift.
Companion PRs (must merge first for CI to pass)

`pyproject.toml` pins `raven-python` to its `feat/yeast-gem-shared` branch via a `git+` URL; CI will fail at the install step until that branch lands.

Out of scope

- `v8_*`/`v9_*` curation scripts and `code/.deprecated/` are untouched.
- `.xml` + ΔG CSVs.
- `addSBOterms` pseudoreaction-loop bug (`for i = numel(model.rxns)` iterating only the last reaction) is faithfully preserved via an `only_last_reaction_for_pseudo=True` flag; fixing it is tracked as a future behaviour-change PR that needs lock-step verification.

Branch base note

This PR targets `feat/anaerobic` because that's where the porting work was branched from; it can be re-targeted to `main` or `develop` after the anaerobic curation lands.