diff --git a/docs/guide/io_and_manipulation.md b/docs/guide/io_and_manipulation.md index 73c4876..aab7dc0 100644 --- a/docs/guide/io_and_manipulation.md +++ b/docs/guide/io_and_manipulation.md @@ -8,7 +8,9 @@ unchanged. On top of that it adds the RAVEN-specific formats: - {func}`raven_python.io.read_yaml_model` / {func}`raven_python.io.write_yaml_model` — cobra-standard YAML (the `!!omap` layout), transparently handling `.yml.gz`. RAVEN-only and GECKO `ec-*` side-fields are preserved on each entry's `notes` so a read→write round-trip is - lossless. + lossless. The full schema (top-level layout, field order, quoting rules, the GECKO + `ec-*` and `metaData` extensions) is documented in + [the YAML model format reference](../reference/yaml_format.md). - {func}`raven_python.io.export_model_to_sif` — Cytoscape SIF (`rc` / `rr` / `cc` graphs). - {func}`raven_python.io.export_to_excel` — the RAVEN 5-sheet workbook (RXNS / METS / COMPS / GENES / MODEL). Requires the `excel` extra. Excel **import** is intentionally not provided. diff --git a/docs/reference/index.md b/docs/reference/index.md index 01b96a6..865fc70 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -5,6 +5,9 @@ Conceptual and API reference for raven-python. - **[RAVEN ↔ raven-python migration map](migration.md)** — the function-by-function map from MATLAB RAVEN to raven-python (and cobrapy where appropriate). Start here if you're porting RAVEN code. +- **[YAML model format](yaml_format.md)** — the shared YAML schema produced and consumed + by cobrapy, raven-python, and RAVEN MATLAB, with a fully-annotated example and the + field-order / quoting rules. - **[MATLAB RAVEN back-port proposals](matlab_raven_backports.md)** — improvements raven-python makes that are candidates to back-port into the MATLAB toolbox. - **[Improvements over RAVEN](improvements.md)** — the full catalogue of correctness / @@ -16,6 +19,7 @@ Conceptual and API reference for raven-python. :hidden: migration +yaml_format matlab_raven_backports improvements api/index diff --git a/docs/reference/yaml_format.md b/docs/reference/yaml_format.md new file mode 100644 index 0000000..2d43eae --- /dev/null +++ b/docs/reference/yaml_format.md @@ -0,0 +1,332 @@ +# RAVEN / cobrapy YAML model format + +This document describes the YAML format produced and consumed by + +- **cobrapy** ([`cobra.io.{load,save}_yaml_model`](https://github.com/opencobra/cobrapy)) +- **raven-python** (`raven_python.io.yaml.{read,write}_yaml_model`, see [API](api/index.md)) +- **RAVEN MATLAB** (`readYAMLmodel.m` / `writeYAMLmodel.m` in the [RAVEN repo](https://github.com/SysBioChalmers/RAVEN/tree/feat/yeast-gem-shared/io)) + +The same file can be round-tripped through any of the three. cobrapy is the canonical core; raven-python and RAVEN MATLAB add namespaced extensions (RAVEN curation fields, MIRIAM cross-refs already covered by cobrapy's `annotation`, and the GECKO `ec-*` sections) without disturbing the cobra-known shape. + +--- + +## At a glance + +```yaml +!!omap +- metaData: !!omap + - id: yeastGEM_develop + - name: The Consensus Genome-Scale Metabolic Model of Yeast + - version: 9.0.0 + - date: 2026-05-27 + - taxonomy: taxonomy/559292 +- metabolites: + - !!omap + - id: s_0001 + - name: ATP + - compartment: c + - charge: -4 + - formula: C10H16N5O13P3 + - annotation: !!omap + - kegg.compound: C00002 + - smiles: "[O-]P(=O)([O-])OP(=O)([O-])O..." + - inchis: InChI=1S/C10H16N5O13P3/... + - deltaG: -2768.1 +- reactions: + - !!omap + - id: r_0001 + - name: hexokinase + - metabolites: !!omap + - s_0001: -1.0 + - s_0568: -1.0 + - s_0394: 1.0 + - s_0423: 1.0 + - lower_bound: 0.0 + - upper_bound: 1000.0 + - gene_reaction_rule: YGL253W or YCL040W or YFR053C + - subsystem: Glycolysis / Gluconeogenesis + - notes: "MetaNetX ID curated (PR #220)" + - annotation: !!omap + - kegg.reaction: R00299 + - sbo: SBO:0000176 + - eccodes: 2.7.1.1 + - deltaG: -17.39 + - confidence_score: 2.0 +- genes: + - !!omap + - id: YGL253W + - name: HXK2 + - annotation: !!omap + - uniprot: P04807 +- compartments: !!omap + - c: cytoplasm + - e: extracellular +``` + +Three structural rules are non-obvious and worth pointing out before the field-by-field detail: + +1. The whole document is one **ordered mapping** — `!!omap` — at the root. Every nested map that should preserve key order is also `!!omap` (metaData, each metabolite / reaction / gene entry, `annotation`, `metabolites`, `compartments`, and the ec sections). +2. Each metabolite, reaction, and gene is **one `- !!omap` element** of a list. Inside that mapping, every field is written as `- key: value`. This is cobrapy's native shape and is what RAVEN MATLAB's reader keys off. +3. Strings are **unquoted by default**; quotes appear only when YAML would otherwise misparse the value (leading `-`, `[`, `?` or `:`; embedded `: ` or ` #`; values that look like `true` / `false` / `null`). + +--- + +## Top-level layout + +``` +!!omap +- metaData: !!omap # optional; RAVEN extension +- metabolites: # required +- reactions: # required +- genes: # required (may be `genes: []`) +- compartments: !!omap # required +- gecko_light: # optional; GECKO extension +- ec-rxns: # optional; GECKO extension +- ec-enzymes: # optional; GECKO extension +``` + +| Key | Required | Source | Notes | +|-----|----------|--------|-------| +| `metaData` | optional | RAVEN | Provenance block. Holds `id`, `name`, `version`, `date`, `taxonomy`, optionally `givenName` / `familyName` / `email` / `organization` / `note` / `sourceUrl`, plus `defaultLB` / `defaultUB`. Cobrapy ignores this block (no semantic loss for the core model). | +| `metabolites` | yes | cobra core | Ordered list of `- !!omap` entries. | +| `reactions` | yes | cobra core | Ordered list of `- !!omap` entries. | +| `genes` | yes | cobra core | Ordered list; may be `genes: []` for a model with no genes. | +| `compartments` | yes | cobra core | `!!omap` of `: `. | +| `gecko_light` | optional | GECKO | Scalar boolean. Cobrapy / raven-python emit this at the top level; the older spelling `geckoLight` inside `metaData` is still accepted on read. | +| `ec-rxns` | optional | GECKO | Per-reaction kcat / source / enzymes coupling table. | +| `ec-enzymes` | optional | GECKO | Per-enzyme MW / sequence / concentration table. | + +Cobrapy writes `id` / `name` / `version` at the root level instead of inside `metaData`. The RAVEN readers accept both placements; the RAVEN writers normalize to the `metaData` form. + +--- + +## Metabolite entry + +Field order (cobra-core first, then RAVEN extensions): + +```yaml +- !!omap + - id: s_0001 # required + - name: ATP # cobra + - compartment: c # cobra + - charge: -4 # cobra (number) + - formula: C10H16N5O13P3 # cobra + - notes: "free-text" # cobra + - annotation: !!omap # cobra (MIRIAM + smiles) + - kegg.compound: C00002 + - chebi: + - CHEBI:15422 + - CHEBI:30616 + - sbo: SBO:0000247 + - smiles: "OC1=NC..." # quoted when it contains [ ] : etc. + - inchis: "InChI=1S/..." # RAVEN extension + - deltaG: -2768.1 # RAVEN extension + - metFrom: KEGG # RAVEN extension +``` + +Cobrapy emits exactly the first seven keys (the cobra-core block). raven-python and RAVEN MATLAB additionally emit `inchis`, `deltaG`, and `metFrom` when those fields are populated. On read, cobrapy puts the RAVEN extensions on the metabolite as attribute fall-through; raven-python captures them into `metabolite.notes` (keyed by their YAML name); RAVEN MATLAB stores them on `model.inchis` / `model.metDeltaG` / `model.metFrom`. + +Annotation entries with multiple values are emitted as a YAML list (`chebi:` then several `-` items). Single-value entries are emitted inline (`kegg.compound: C00002`). SMILES strings live inside the annotation block under the `smiles` key — not as a top-level metabolite field, which is the historical RAVEN MATLAB shape and is still accepted on read for backward compatibility. + +--- + +## Reaction entry + +```yaml +- !!omap + - id: r_0001 # required + - name: hexokinase # cobra + - metabolites: !!omap # cobra (sorted by met id) + - s_0001: -1.0 + - s_0394: 1.0 + - lower_bound: 0.0 # cobra (number) + - upper_bound: 1000.0 # cobra (number) + - gene_reaction_rule: YGL253W or YCL040W # cobra + - objective_coefficient: 1 # cobra; omitted when 0 + - subsystem: Glycolysis / Gluconeogenesis # cobra + - notes: "MetaNetX ID curated (PR #220)" # cobra + - annotation: !!omap # cobra + - kegg.reaction: R00299 + - sbo: SBO:0000176 + - eccodes: # RAVEN extension + - 2.7.1.1 + - 2.7.1.2 + - references: "PMID:12345" # RAVEN extension + - rxnFrom: KEGG # RAVEN extension + - deltaG: -17.39 # RAVEN extension + - confidence_score: 2.0 # RAVEN extension +``` + +Some fields are conditional: + +- `objective_coefficient` is only written when non-zero (cobrapy convention). +- The `metabolites` block uses `!!omap []` (flow-style empty omap) when the reaction has no metabolites — this keeps the file a valid YAML 1.2 document. +- `eccodes` is written inline (`eccodes: 2.7.1.1`) when there is exactly one code, and as a list when there are several. Same for `references`. + +**Notes key naming.** Cobrapy and the current raven-python / RAVEN MATLAB writers use **`notes`**. Pre-`feat/yeast-gem-shared` yeast-GEM files used `rxnNotes`; both readers accept that as a legacy alias. + +**Bounds typing.** Bounds are emitted as floats with an explicit decimal point (`1000.0`, `-1000.0`), matching Python's float repr and cobrapy's output. + +--- + +## Gene entry + +```yaml +- !!omap + - id: YGL253W # required + - name: HXK2 # cobra; omitted when empty + - annotation: !!omap # cobra + - uniprot: P04807 + - ncbigene: 856421 + - protein: P04807 # RAVEN extension +``` + +Empty names (`name: ''`) are not emitted (matches RAVEN MATLAB's historical behavior). + +--- + +## Compartments + +```yaml +- compartments: !!omap + - c: cytoplasm + - e: extracellular + - m: mitochondrion +``` + +Just an `!!omap` of `: ` pairs. Compartments don't carry their own MIRIAMs in the current format. + +--- + +## metaData (RAVEN extension) + +```yaml +- metaData: !!omap + - id: yeastGEM_develop + - name: The Consensus Genome-Scale Metabolic Model of Yeast + - version: 9.0.0 + - date: 2026-05-27 + - defaultLB: -1000.0 + - defaultUB: 1000.0 + - givenName: Eduard + - familyName: Kerkhoven + - email: eduardk@chalmers.se + - organization: Chalmers University of Technology + - taxonomy: taxonomy/559292 + - note: "Saccharomyces cerevisiae - strain S288C" + - sourceUrl: https://github.com/SysBioChalmers/yeast-GEM +``` + +Pure provenance. Cobrapy ignores the block; raven-python keeps the verbatim dictionary on `model.notes['metaData']` and additionally lifts `id` / `name` / `version` to `model.id` / `model.name` / `model.notes['version']` so cobra-shape accessors find them. RAVEN MATLAB populates `model.id` / `model.name` / `model.version` / `model.annotation.*` from the same fields. + +`date` is preserved across round-trips when present on the model; otherwise the writer fills in `YYYY-MM-DD` of the current date. + +--- + +## GECKO sections + +For enzyme-constrained models, three additional top-level keys carry the EC layer: + +```yaml +- gecko_light: false # true for the "light" formulation +- ec-rxns: + - !!omap + - id: r_0001 + - kcat: 25.3 + - source: brenda + - notes: "" + - eccodes: 2.7.1.1 + - enzymes: !!omap + - P04807: 1.0 +- ec-enzymes: + - !!omap + - genes: YGL253W + - enzymes: P04807 + - mw: 53942 + - sequence: "MVHLGPK..." + - concs: .nan +``` + +These map onto `model.ec` in RAVEN MATLAB and `raven_python.io.ec_data.EcData` (attached as `model.ec`) in raven-python. Cobrapy ignores the sections. + +The older spelling `geckoLight` inside `metaData` is also accepted on read. + +--- + +## Annotations + +The `annotation` block uses MIRIAM-style namespace keys. Cobrapy treats the block as a free-form dictionary; raven-python preserves it verbatim through `cobra.Metabolite.annotation` / `Reaction.annotation` / `Gene.annotation`; RAVEN MATLAB maps it to `model.metMiriams` / `rxnMiriams` / `geneMiriams`. + +- A single value is written inline: `kegg.compound: C00002`. +- Multiple values are written as a YAML list: + + ```yaml + - chebi: + - CHEBI:15422 + - CHEBI:30616 + ``` + +- The `smiles` key inside a metabolite's `annotation` carries the SMILES string (cobrapy convention). RAVEN MATLAB historically emitted `smiles` as a metabolite top-level field; both readers still accept that, but writes are normalized to the annotation block. +- The `sbo` key carries the Systems Biology Ontology term assigned by `assignSBOterms` / `add_sbo_terms`. + +--- + +## Numbers, strings, quoting + +**Numbers.** Whole-number floats are written with an explicit `.0` (`1000.0`, `-1000.0`, `0.0`). Other floats use up to 15 significant digits (`-17.39`, `-2768.1`). `NaN` is encoded as `.nan`; `+Inf` / `-Inf` as `.inf` / `-.inf` (YAML 1.2 conventions). + +**Strings.** Default style is bare (no quotes). The writer falls back to double-quoted style when the value: + +- starts with `-`, `?`, `:`, or any flow indicator (`[`, `]`, `{`, `}`, `,`, `&`, `*`, `!`, `|`, `>`, `%`, `@`, `` ` ``, `#`); +- contains `: ` (would otherwise be parsed as a key/value), ` #` (comment), or one of `[`, `]`, `{`, `}`; +- has leading or trailing whitespace; +- spells a YAML reserved word case-insensitively (`true`, `false`, `null`, `yes`, `no`, `on`, `off`, `~`). + +In a double-quoted string, only `\` and `"` are escaped. Other characters (including Unicode and newlines if the underlying model permitted them) are passed through. + +--- + +## Tooling interoperability matrix + +| File written by ↓ \ Reader → | cobrapy | raven-python | RAVEN MATLAB | +|---|---|---|---| +| cobrapy (`save_yaml_model`) | full | full + extras land in `notes` via attribute fall-through | works for root-level `id` / `name` / `version` (added in this release) | +| raven-python (`write_yaml_model`) | core (no `metaData`-derived `id`); RAVEN extras live as unknown top-level keys but don't break parsing | full | full | +| RAVEN MATLAB (`writeYAMLmodel`) | core (no `metaData`-derived `id`); RAVEN extras land via attribute fall-through | full | full | + +"Full" = every field read back into its canonical position on the model object; "core" = cobrapy-known fields, RAVEN extensions ignored or kept on the object as attribute fall-through (`reaction.eccodes` etc., not re-emitted on save). A round-trip through cobrapy is therefore **lossy for RAVEN extensions** — only the core fields survive `cobrapy.load → cobrapy.save`. Round-trips through raven-python or RAVEN MATLAB are lossless. + +--- + +## What round-tripping looks like + +Loading `yeast-GEM.yml` (2748 metabolites, 4102 reactions, 1143 genes) and re-writing it through any of the three tools preserves every documented piece of content: + +| Count | After round-trip | +|---|---| +| metabolites | 2748 / 2748 | +| reactions | 4102 / 4102 | +| genes | 1143 / 1143 | +| reactions with eccodes | 2411 | +| reactions with deltaG | 3984 | +| metabolites with deltaG | 2696 | +| metabolites with SMILES | 1788 | +| reactions with notes (rxnNotes) | 1443 | + +(Cobrapy round-trips give 2748 / 4102 / 1143 for the core but drop the RAVEN extensions in the rightmost column — that's the documented loss.) + +--- + +## What changed in `feat/yeast-gem-shared` + +- raven-python writer no longer drops `!!omap` tags (was producing files RAVEN MATLAB's reader couldn't load). +- raven-python now preserves `eccodes` and accepts the legacy `rxnNotes` reaction key on read. +- RAVEN MATLAB writer reorders metabolite / reaction fields to match cobrapy. +- RAVEN MATLAB writer renames the reaction `rxnNotes` key to `notes` and emits SMILES inside the annotation block (still accepts both shapes on read). +- RAVEN MATLAB writer's `preserveQuotes` default is now `false`; values that need quoting (SMILES with `[O-]`, leading flow indicators, booleans, `: `-containing strings) are quoted defensively per value. +- RAVEN MATLAB writer emits whole-number bounds as `1000.0` (matches cobrapy / Python float repr) instead of `1000`. +- RAVEN MATLAB reader accepts cobrapy's root-level `id` / `name` / `version` / `gecko_light`, the `!!omap`-tagged `metaData` header, and `notes` (canonical) in addition to `rxnNotes` (legacy). +- Empty `reaction.metabolites` blocks are emitted as `!!omap []` (valid YAML 1.2) rather than an empty `!!omap` with no value. +- Document-start marker `---` dropped to match cobrapy's bare `!!omap` root. + +These changes are byte-stable for cobrapy and raven-python users; existing yeast-GEM YAML files continue to load. The first time a yeast-GEM curation pass rewrites the file with the new MATLAB writer, the diff will look large (because of the reordering and quote-style changes) but the model content is unchanged. diff --git a/src/raven_python/annotation/__init__.py b/src/raven_python/annotation/__init__.py new file mode 100644 index 0000000..c2b4c17 --- /dev/null +++ b/src/raven_python/annotation/__init__.py @@ -0,0 +1,24 @@ +"""Annotation helpers — SBO term assignment, ΔG CSV persistence. + +These are the pieces of yeast-GEM's ``missingFields`` module that are +organism-agnostic enough to live upstream. Default parameter values +match the RAVEN/yeast convention so the functions are immediately +useful on the standard layout; consumers with different naming pass +overrides. +""" +from raven_python.annotation.delta_g import load_delta_g_csv, save_delta_g_csv +from raven_python.annotation.sbo import ( + DEFAULT_BIOMASS_MET_NAMES, + DEFAULT_BIOMASS_RXN_NAME, + DEFAULT_NGAM_RXN_NAME, + add_sbo_terms, +) + +__all__ = [ + "DEFAULT_BIOMASS_MET_NAMES", + "DEFAULT_BIOMASS_RXN_NAME", + "DEFAULT_NGAM_RXN_NAME", + "add_sbo_terms", + "load_delta_g_csv", + "save_delta_g_csv", +] diff --git a/src/raven_python/annotation/delta_g.py b/src/raven_python/annotation/delta_g.py new file mode 100644 index 0000000..e0a90e0 --- /dev/null +++ b/src/raven_python/annotation/delta_g.py @@ -0,0 +1,115 @@ +"""Persist scalar annotations (Gibbs free energy, etc.) as a side-car CSV. + +The committed SBML standard has no slot for thermodynamic ΔG values, so +projects like yeast-GEM keep them in a paired CSV instead. This module +provides the generic load / save round-trip — the file format is a two +column table ``, `` — and exposes the per-call schema so +existing projects can configure their conventions. + +The functions operate on cobra ``.notes`` (string round-tripped through +SBML), keeping any other notes untouched. +""" +from __future__ import annotations + +import math +from collections.abc import Iterable +from pathlib import Path + +import cobra +import pandas as pd + + +def load_delta_g_csv( + entities: Iterable, + path: str | Path, + *, + id_column: str = "Var1", + value_column: str = "Var2", + note_key: str = "deltaG", + verbose: bool = False, +) -> int: + """Stamp ``note_key`` on each entity from a CSV of ``id → value``. + + Parameters + ---------- + entities + Cobra entities (``model.metabolites`` or ``model.reactions``). + path + Path to the CSV. + id_column, value_column + Column labels in the CSV. Defaults match the yeast-GEM + convention (``Var1`` / ``Var2`` from MATLAB's ``array2table``). + note_key + Key under which the value is stored on ``entity.notes``. + Default ``"deltaG"``. + verbose + Print a summary of unmatched entity ids. + + Returns + ------- + The number of entities that were stamped (i.e. matched the CSV). + """ + df = pd.read_csv(path) + if id_column not in df.columns or value_column not in df.columns: + raise ValueError( + f"{path} columns {list(df.columns)} do not contain " + f"both {id_column!r} and {value_column!r}" + ) + lookup = dict(zip(df[id_column], df[value_column], strict=True)) + + stamped = 0 + missing: list[str] = [] + for entity in entities: + value = lookup.get(entity.id) + if value is None or (isinstance(value, float) and math.isnan(value)): + missing.append(entity.id) + continue + entity.notes[note_key] = str(value) + stamped += 1 + + if verbose and missing: + print( + f"{len(missing)} entity ids were not in {path} " + f"(e.g. {missing[:3]}); left untouched." + ) + return stamped + + +def save_delta_g_csv( + entities: Iterable, + path: str | Path, + *, + id_column: str = "Var1", + value_column: str = "Var2", + note_key: str = "deltaG", +) -> int: + """Dump ``entity.notes[note_key]`` for each entity to a CSV. + + Entities without ``note_key`` set get ``NaN`` written, preserving + one-row-per-entity ordering (mirrors MATLAB's ``array2table([ids, + values])`` behaviour). + + Returns + ------- + The number of rows written (= number of entities). + """ + rows: list[tuple[str, float]] = [] + for entity in entities: + raw = entity.notes.get(note_key) + if raw is None: + value: float = math.nan + else: + try: + value = float(raw) + except ValueError: + value = math.nan + rows.append((entity.id, value)) + pd.DataFrame(rows, columns=[id_column, value_column]).to_csv(path, index=False) + return len(rows) + + +# Re-export the cobra Model type for type-checker friendliness; helps +# IDEs surface the right hints to callers that hand us model.metabolites +# / model.reactions directly. +__all__ = ["load_delta_g_csv", "save_delta_g_csv"] +_ = cobra # silence "imported but unused" — used for typing context above diff --git a/src/raven_python/annotation/sbo.py b/src/raven_python/annotation/sbo.py new file mode 100644 index 0000000..d0306cb --- /dev/null +++ b/src/raven_python/annotation/sbo.py @@ -0,0 +1,180 @@ +"""SBO term assignment. + +Port of the generic core of yeast-GEM's ``code/missingFields/addSBOterms.m``. +Defaults reproduce the yeast-GEM behaviour; overrides parameterise the +pseudo-metabolite / pseudo-reaction name conventions, the transport +detector, and a yeast-specific bug-compat flag. +""" +from __future__ import annotations + +from collections.abc import Callable, Iterable + +import cobra + +DEFAULT_BIOMASS_MET_NAMES: frozenset[str] = frozenset( + {"biomass", "DNA", "RNA", "protein", "carbohydrate", + "lipid", "cofactor", "ion"} +) +DEFAULT_BIOMASS_RXN_NAME: str = "biomass pseudoreaction" +DEFAULT_NGAM_RXN_NAME: str = "non-growth associated maintenance reaction" + + +def add_sbo_terms( + model: cobra.Model, + *, + biomass_met_names: Iterable[str] = DEFAULT_BIOMASS_MET_NAMES, + biomass_met_suffixes: Iterable[str] = (" backbone", " chain"), + biomass_rxn_name: str = DEFAULT_BIOMASS_RXN_NAME, + ngam_rxn_name: str = DEFAULT_NGAM_RXN_NAME, + pseudoreaction_name_substrings: Iterable[str] = ("pseudoreaction", "SLIME rxn"), + transport_detector: Callable[[cobra.Model], set[str]] | None = None, + only_last_reaction_for_pseudo: bool = False, +) -> cobra.Model: + """Assign SBO terms to metabolites and reactions in-place. + + Metabolite SBO assignment + - SBO:0000649 (Biomass) for metabolites whose ``name`` is in + ``biomass_met_names`` or ends with any of + ``biomass_met_suffixes`` (default: ``" backbone"`` / ``" chain"``). + - SBO:0000247 (Simple chemical) otherwise. + + Reaction SBO assignment + - SBO:0000176 (Metabolic reaction) default. + - Single-reactant reactions (exchange / sink / demand): + * extracellular → SBO:0000627 (exchange) + * coefficient < 0 → SBO:0000632 (sink) + * else → SBO:0000628 (demand) + - Transport reactions → SBO:0000655 (default detector: same + metabolite name appearing in ≥ 2 compartments). + - The reaction whose ``name`` matches ``biomass_rxn_name`` → + SBO:0000629. + - The reaction whose ``name`` matches ``ngam_rxn_name`` → + SBO:0000630. + - Other reactions whose ``name`` contains any + ``pseudoreaction_name_substrings`` → SBO:0000395. + + "fill" semantic: SBO is written to ``annotation['sbo']`` only when + that key is missing or empty — mirrors RAVEN's + ``editMiriam(..., 'fill')`` mode. + + Parameters + ---------- + transport_detector + Optional callable that takes the model and returns the set of + reaction ids classified as transport. When ``None``, the default + same-met-name-in-two-compartments heuristic is used. Pass a + custom detector to use a different convention (e.g. RAVEN's + ``getTransportRxns`` via the MATLAB engine). + only_last_reaction_for_pseudo + **Bug-compatibility flag**, off by default. The legacy yeast-GEM + MATLAB ``addSBOterms.m`` contains a typo (``for i = numel(...)`` + rather than ``for i = 1:numel(...)``) that causes the pseudo- + reaction SBO assignments to run only on the very last reaction + in the model. When ``True``, this flag replicates that bug so + callers can migrate without altering the committed model. Leave + ``False`` for new uses. + """ + biomass_met_names = frozenset(biomass_met_names) + biomass_met_suffixes = tuple(biomass_met_suffixes) + pseudoreaction_name_substrings = tuple(pseudoreaction_name_substrings) + if transport_detector is None: + transport_detector = _default_transport_detector + + _assign_metabolite_sbo(model, biomass_met_names, biomass_met_suffixes) + _assign_reaction_sbo( + model, + biomass_rxn_name=biomass_rxn_name, + ngam_rxn_name=ngam_rxn_name, + pseudo_substrings=pseudoreaction_name_substrings, + transport_detector=transport_detector, + only_last_reaction_for_pseudo=only_last_reaction_for_pseudo, + ) + return model + + +# --- internals --------------------------------------------------------- + +def _assign_metabolite_sbo( + model: cobra.Model, + biomass_met_names: frozenset[str], + biomass_met_suffixes: tuple[str, ...], +) -> None: + for met in model.metabolites: + if met.name in biomass_met_names or met.name.endswith(biomass_met_suffixes): + sbo = "SBO:0000649" + else: + sbo = "SBO:0000247" + _fill_sbo(met, sbo) + + +def _assign_reaction_sbo( + model: cobra.Model, + *, + biomass_rxn_name: str, + ngam_rxn_name: str, + pseudo_substrings: tuple[str, ...], + transport_detector: Callable[[cobra.Model], set[str]], + only_last_reaction_for_pseudo: bool, +) -> None: + """Compute the SBO for each reaction via successive overrides, then + fill into the model in a single pass — mirrors the MATLAB + ``rxnSBO``-array + ``editMiriam(..., 'fill')`` two-step. + """ + transport_set = transport_detector(model) + rxns = list(model.reactions) + + sbo_for_rxn: dict[str, str] = {} + for rxn in rxns: + sbo = "SBO:0000176" + if len(rxn.metabolites) == 1: + (met,) = rxn.metabolites + coef = rxn.metabolites[met] + if met.compartment == "e" or model.compartments.get( + met.compartment + ) == "extracellular": + sbo = "SBO:0000627" + elif coef < 0: + sbo = "SBO:0000632" + else: + sbo = "SBO:0000628" + sbo_for_rxn[rxn.id] = sbo + + # Transport override + for rxn_id in transport_set: + sbo_for_rxn[rxn_id] = "SBO:0000655" + + # Pseudoreaction override. The legacy bug-compat flag scopes this + # loop to the last reaction only; the default walks the whole list. + pseudo_targets = [rxns[-1]] if only_last_reaction_for_pseudo else rxns + for rxn in pseudo_targets: + if rxn.name == biomass_rxn_name: + sbo_for_rxn[rxn.id] = "SBO:0000629" + elif rxn.name == ngam_rxn_name: + sbo_for_rxn[rxn.id] = "SBO:0000630" + elif any(s in rxn.name for s in pseudo_substrings): + sbo_for_rxn[rxn.id] = "SBO:0000395" + + for rxn in rxns: + _fill_sbo(rxn, sbo_for_rxn[rxn.id]) + + +def _fill_sbo(entity, sbo: str) -> None: + """Set ``annotation['sbo']`` only if it is missing or empty.""" + if not entity.annotation.get("sbo"): + entity.annotation["sbo"] = sbo + + +def _default_transport_detector(model: cobra.Model) -> set[str]: + """Same-met-name-in-two-compartments heuristic for transport detection. + + Cheap analogue of RAVEN's ``getTransportRxns``. Pass a custom + callable to :func:`add_sbo_terms` to override. + """ + out: set[str] = set() + for rxn in model.reactions: + by_name: dict[str, set[str | None]] = {} + for met in rxn.metabolites: + by_name.setdefault(met.name, set()).add(met.compartment) + if any(len(comps) >= 2 for comps in by_name.values()): + out.add(rxn.id) + return out diff --git a/src/raven_python/biomass/__init__.py b/src/raven_python/biomass/__init__.py new file mode 100644 index 0000000..6b5ac49 --- /dev/null +++ b/src/raven_python/biomass/__init__.py @@ -0,0 +1,50 @@ +"""Biomass equation manipulation — growth-associated maintenance, amino- +acid ratios, component scaling, and biomass-fraction reporting. + +The yeast-GEM port (see yeast-GEM/code/python/PORTING_PLAN.md) was the +first consumer; the API is parameterised by a :class:`BiomassConfig` +so other GEMs can describe their own component layout. + +A typical caller assembles ``BiomassConfig`` once (often from a +project-level YAML) and passes it to every operation: + +.. code-block:: python + + cfg = BiomassConfig( + biomass_rxn="r_4041", + proton_met="s_0794", + components=[ + BiomassComponent("protein", "protein pseudoreaction", + mass_strategy="mw_minus_2h"), + BiomassComponent("carbohydrate", "carbohydrate pseudoreaction"), + BiomassComponent("lipid_backbone", "lipid backbone pseudoreaction", + mass_strategy="grams"), + BiomassComponent("RNA", "RNA pseudoreaction", + mass_strategy="mw_minus_water"), + BiomassComponent("DNA", "DNA pseudoreaction", + mass_strategy="mw_minus_water"), + BiomassComponent("ion", "ion pseudoreaction"), + BiomassComponent("cofactor", "cofactor pseudoreaction"), + ], + ) + fractions = sum_biomass(model, cfg) # {'protein': 0.46, ...} + scale_biomass(model, cfg, "protein", 0.50, balance_out="carbohydrate") + set_gam(model, 80, biomass_rxn=cfg.biomass_rxn, + cofactor_met_names=("ATP", "ADP", "H2O", "H+", "phosphate")) +""" +from raven_python.biomass.config import BiomassComponent, BiomassConfig +from raven_python.biomass.gam import set_gam +from raven_python.biomass.scale import ( + rescale_pseudoreaction, + scale_biomass, + sum_biomass, +) + +__all__ = [ + "BiomassComponent", + "BiomassConfig", + "rescale_pseudoreaction", + "scale_biomass", + "set_gam", + "sum_biomass", +] diff --git a/src/raven_python/biomass/config.py b/src/raven_python/biomass/config.py new file mode 100644 index 0000000..c0ee453 --- /dev/null +++ b/src/raven_python/biomass/config.py @@ -0,0 +1,95 @@ +"""Configuration types for the biomass module.""" +from __future__ import annotations + +from collections.abc import Iterable +from dataclasses import dataclass, field +from typing import Literal + +#: Strategies for computing a component's mass contribution from the +#: stoichiometry of its pseudoreaction. See :func:`sum_biomass`. +#: +#: ``"mw"`` multiply each substrate coefficient by the +#: molecular weight derived from its +#: ``formula`` (g/mol), divide by 1000 to get +#: g/gDW. Suits carbohydrate / ion / cofactor. +#: ``"mw_minus_2h"`` as ``"mw"`` but with each MW reduced by +#: 2.016 g/mol (two protons released per +#: charged tRNA on protein-pseudoreaction +#: substrates). +#: ``"mw_minus_water"`` as ``"mw"`` but with each MW reduced by +#: 18.015 g/mol (one water released per +#: polymerisation step; RNA / DNA). +#: ``"grams"`` substrate coefficients are already in +#: g/gDW (e.g. the lipid-backbone +#: pseudoreaction). No MW lookup needed. +MassStrategy = Literal["mw", "mw_minus_2h", "mw_minus_water", "grams"] + + +@dataclass(frozen=True) +class BiomassComponent: + """One component of the biomass equation (protein, carbohydrate, …). + + Attributes + ---------- + name + Canonical short name — also the ``model.metabolites`` name of + the metabolite that the pseudoreaction *produces*. Used by + :func:`rescale_pseudoreaction` to identify the product side. + pseudoreaction_name + ``model.reactions[*].name`` of the pseudoreaction. + cobrapy reactions are looked up by ``name`` (not ``id``) + because that is how the yeast-GEM convention names them. + mass_strategy + How to convert the pseudoreaction's substrates into a mass + fraction. See :data:`MassStrategy` for the choices and their + offsets. + """ + + name: str + pseudoreaction_name: str + mass_strategy: MassStrategy = "mw" + + +@dataclass(frozen=True) +class BiomassConfig: + """Container for the per-organism biomass layout. + + Attributes + ---------- + biomass_rxn + ``model.reactions[*].id`` of the top-level biomass + pseudoreaction (the one whose flux is the growth rate). + proton_met + ``model.metabolites[*].id`` of the cytosolic H+ metabolite — + used by :func:`rescale_pseudoreaction` to keep each + pseudoreaction charge-balanced after rescaling. + components + Ordered tuple of :class:`BiomassComponent` describing every + non-top-level pseudoreaction that contributes to mass. + """ + + biomass_rxn: str + proton_met: str + components: tuple[BiomassComponent, ...] = field(default_factory=tuple) + + def __post_init__(self) -> None: + # Freeze the components into a tuple even if a list was passed. + if not isinstance(self.components, tuple): + object.__setattr__(self, "components", tuple(self.components)) + + @classmethod + def from_components( + cls, + biomass_rxn: str, + proton_met: str, + components: Iterable[BiomassComponent], + ) -> BiomassConfig: + return cls(biomass_rxn=biomass_rxn, proton_met=proton_met, + components=tuple(components)) + + def get(self, component_name: str) -> BiomassComponent: + """Look up a component by name (raises ``KeyError`` if missing).""" + for c in self.components: + if c.name == component_name: + return c + raise KeyError(f"BiomassConfig has no component named {component_name!r}") diff --git a/src/raven_python/biomass/gam.py b/src/raven_python/biomass/gam.py new file mode 100644 index 0000000..15a28be --- /dev/null +++ b/src/raven_python/biomass/gam.py @@ -0,0 +1,67 @@ +"""Growth-associated maintenance (GAM) and non-growth maintenance (NGAM). + +Ports the generic part of yeast-GEM's ``changeGAM.m``. +""" +from __future__ import annotations + +from collections.abc import Iterable + +import cobra + + +def set_gam( + model: cobra.Model, + value: float, + *, + biomass_rxn: str, + cofactor_met_names: Iterable[str], + ngam_rxn: str | None = None, + ngam_value: float | None = None, +) -> cobra.Model: + """Set GAM (and optionally NGAM) on a model in place. + + Scales every metabolite participating in the biomass pseudoreaction + whose ``name`` is in ``cofactor_met_names`` (e.g. ATP, ADP, H2O, + H+, phosphate) to ``±value``, preserving the sign of its current + coefficient. + + If both ``ngam_rxn`` and ``ngam_value`` are given, the NGAM + reaction's bounds are fixed at ``(ngam_value, ngam_value)``. + + Parameters + ---------- + model + cobra model to mutate. + value + New GAM value (mmol ATP / gDW per growth unit). + biomass_rxn + Reaction id of the biomass pseudoreaction whose stoichiometry + carries the GAM coefficients. + cofactor_met_names + Iterable of metabolite *names* (not ids) — every metabolite in + the biomass pseudoreaction whose name is in this set will be + rescaled. yeast-GEM uses ``{"ATP", "ADP", "H2O", "H+", "phosphate"}``. + ngam_rxn, ngam_value + Optional NGAM update. If both are provided, the reaction's + bounds are set to ``(ngam_value, ngam_value)`` (equality + constraint). + + Returns the (mutated) model for chaining. + """ + rxn = model.reactions.get_by_id(biomass_rxn) + cofactor_set = set(cofactor_met_names) + + deltas: dict[cobra.Metabolite, float] = {} + for met, coef in rxn.metabolites.items(): + if met.name in cofactor_set: + target = (1 if coef > 0 else -1) * float(value) + # combine=True adds delta → new total = target + deltas[met] = target - coef + if deltas: + rxn.add_metabolites(deltas, combine=True) + + if ngam_rxn is not None and ngam_value is not None: + ngam = model.reactions.get_by_id(ngam_rxn) + ngam.bounds = (float(ngam_value), float(ngam_value)) + + return model diff --git a/src/raven_python/biomass/scale.py b/src/raven_python/biomass/scale.py new file mode 100644 index 0000000..2dc8930 --- /dev/null +++ b/src/raven_python/biomass/scale.py @@ -0,0 +1,207 @@ +"""Biomass component summing and rescaling. + +Ports the generic core of yeast-GEM's ``sumBioMass.m`` / +``scaleBioMass.m`` / ``rescalePseudoReaction.m`` into a single +parameterised module that other GEMs can configure via +:class:`BiomassConfig`. +""" +from __future__ import annotations + +import re + +import cobra + +from raven_python.biomass.config import BiomassComponent, BiomassConfig + +# Elemental atomic weights used to translate formulas into g/mol. +# Mirrors the table baked into yeast-GEM's ``sumBioMass.m``; pseudo- +# element "R" is treated as zero mass (placeholder for residues). +_ELEMENT_MASS_G_PER_MOL: dict[str, float] = { + "C": 12.01, "H": 1.008, "N": 14.007, "O": 15.999, + "P": 30.974, "S": 32.06, "R": 0.0, + "Fe": 55.845, "K": 39.098, "Na": 22.99, "Cl": 35.45, + "Mn": 54.938, "Zn": 65.38, "Ca": 40.078, "Mg": 24.305, "Cu": 63.546, +} + +_FORMULA_TOKEN_RE = re.compile(r"([A-Z][a-z]*)(\d*)") + + +def sum_biomass(model: cobra.Model, config: BiomassConfig) -> dict[str, float]: + """Mass-fraction (g/gDW) per biomass component plus the total. + + Mirrors yeast-GEM's ``sumBioMass.m``. Returns a dict keyed by each + :class:`BiomassComponent` name plus ``"total"``. Components whose + pseudoreaction is missing from the model contribute 0 (logged as a + warning). + + Substrate detection: in each component's pseudoreaction, the + substrate side is the metabolites with negative coefficient — the + same convention yeast-GEM uses. + """ + fractions: dict[str, float] = {} + total = 0.0 + for comp in config.components: + f = _component_fraction(model, comp) + fractions[comp.name] = f + total += f + fractions["total"] = total + return fractions + + +def rescale_pseudoreaction( + model: cobra.Model, + config: BiomassConfig, + component_name: str, + factor: float, +) -> None: + """Multiply the substrate coefficients of one component pseudoreaction + by ``factor`` and rebalance H+ to preserve charge neutrality. + + Ports ``rescalePseudoReaction.m``. "Substrate" here means any + metabolite whose ``name`` is not the component's product name; the + component's own metabolite (e.g. ``protein`` for the protein + pseudoreaction) is left untouched. After the rescaling the + coefficient of ``config.proton_met`` is recomputed so the + reaction's total ionic charge sums to zero. + """ + comp = config.get(component_name) + rxn = _find_pseudoreaction(model, comp) + proton_met = model.metabolites.get_by_id(config.proton_met) + + # Step 1: scale substrate coefficients (everything but the component + # product). Use a fresh dict to avoid mutating during iteration. + deltas: dict[cobra.Metabolite, float] = {} + for met, coef in rxn.metabolites.items(): + if met.name == comp.name: + continue # the product — leave alone + deltas[met] = (factor - 1.0) * coef # combine=True adds (factor-1)*coef + if deltas: + rxn.add_metabolites(deltas, combine=True) + + # Step 2: H+ rebalance. The legacy code first zeroes H+, then sets + # it to negate the rxn's current ionic charge. With combine=True we + # express the same with two `_set_coefficient` calls. + _set_coefficient(rxn, proton_met, 0.0) + total_charge = sum( + (m.charge or 0) * coef for m, coef in rxn.metabolites.items() + ) + _set_coefficient(rxn, proton_met, -total_charge) + + +def scale_biomass( + model: cobra.Model, + config: BiomassConfig, + component_name: str, + new_value: float, + *, + balance_out: str | None = None, +) -> None: + """Rescale a component to a new mass fraction, optionally balancing + out a second component to keep the total at 1 g/gDW. + + Ports ``scaleBioMass.m``. The rescaling factor is derived from + :func:`sum_biomass` on the *current* model state, so call this with + an in-place ``model`` mutation, not on a stale snapshot. + """ + fractions = sum_biomass(model, config) + current = fractions[component_name] + if current == 0: + raise ValueError( + f"Cannot scale {component_name!r} to {new_value}: current " + "fraction is 0 (pseudoreaction missing or empty)." + ) + factor = new_value / current + rescale_pseudoreaction(model, config, component_name, factor) + + if balance_out is not None: + # Recompute X after the first rescaling. + fractions = sum_biomass(model, config) + total = fractions["total"] + balance_current = fractions[balance_out] + if balance_current == 0: + return # nothing we can do + balance_factor = (balance_current + (1.0 - total)) / balance_current + rescale_pseudoreaction(model, config, balance_out, balance_factor) + + +# --- internals --------------------------------------------------------- + +def _component_fraction(model: cobra.Model, comp: BiomassComponent) -> float: + """Compute g/gDW of a single component from its pseudoreaction.""" + try: + rxn = _find_pseudoreaction(model, comp) + except KeyError: + return 0.0 # missing pseudoreaction → contributes 0 + substrates = [(m, c) for m, c in rxn.metabolites.items() if c < 0] + if not substrates: + return 0.0 + + if comp.mass_strategy == "grams": + # Stoichiometry already in g/gDW; just sum the (negative) coefs + # and flip sign. + return -sum(c for _m, c in substrates) + + offset = _mw_offset(comp.mass_strategy) + mass_g = 0.0 + for met, coef in substrates: + mw = _formula_mw(met.formula or "") + if mw == 0: + raise ValueError( + f"Metabolite {met.id} ({met.name!r}) has an empty " + "formula; cannot compute mass for biomass summing." + ) + mw_corrected = mw + offset + mass_g += -coef * mw_corrected + return mass_g / 1000.0 # g/mmol → g/gDW (matching yeast-GEM units) + + +def _mw_offset(strategy: str) -> float: + if strategy == "mw": + return 0.0 + if strategy == "mw_minus_2h": + # 2 × 1.008 = 2.016 g/mol per substrate (charged tRNAs release + # two protons on amino acylation). + return -2.016 + if strategy == "mw_minus_water": + # H2O = 18.015 g/mol per polymerisation step (RNA / DNA). + return -18.015 + raise ValueError(f"Unknown mass_strategy: {strategy!r}") + + +def _formula_mw(formula: str) -> float: + """Molecular weight in g/mol from a Hill-style chemical formula.""" + if not formula: + return 0.0 + mw = 0.0 + for element, count_str in _FORMULA_TOKEN_RE.findall(formula): + if not element: + continue + try: + atomic = _ELEMENT_MASS_G_PER_MOL[element] + except KeyError as exc: + raise ValueError( + f"Unknown element {element!r} in formula {formula!r}; " + "extend raven_python.biomass.scale._ELEMENT_MASS_G_PER_MOL " + "if the element is real." + ) from exc + count = int(count_str) if count_str else 1 + mw += atomic * count + return mw + + +def _find_pseudoreaction(model: cobra.Model, comp: BiomassComponent) -> cobra.Reaction: + for rxn in model.reactions: + if rxn.name == comp.pseudoreaction_name: + return rxn + raise KeyError( + f"Component {comp.name!r}: no reaction named " + f"{comp.pseudoreaction_name!r} in model." + ) + + +def _set_coefficient(rxn: cobra.Reaction, met: cobra.Metabolite, value: float) -> None: + """Land on ``S[met, rxn] = value`` via cobra's combine=True API.""" + current = rxn.metabolites.get(met, 0.0) + delta = float(value) - current + if delta != 0: + rxn.add_metabolites({met: delta}, combine=True) diff --git a/src/raven_python/comparison/__init__.py b/src/raven_python/comparison/__init__.py index e4b4c19..968b7f0 100644 --- a/src/raven_python/comparison/__init__.py +++ b/src/raven_python/comparison/__init__.py @@ -1,7 +1,23 @@ -"""Structural and functional comparison across multiple models. +"""Model comparison utilities. -See :func:`raven_python.comparison.compare.compare_models`. +Two flavours: + +* :func:`compare_models` — N-model presence-matrix overview (RAVEN's + ``compareMultipleModels`` analogue). "How do these models relate?" +* :func:`diff_models` — strict two-model semantic-equality diff for CI + gates. "Are these two models the same?" """ from raven_python.comparison.compare import ModelComparison, compare_models +from raven_python.comparison.diff import ( + DEFAULT_ANNOTATION_KEYS, + DiffReport, + diff_models, +) -__all__ = ["ModelComparison", "compare_models"] +__all__ = [ + "DEFAULT_ANNOTATION_KEYS", + "DiffReport", + "ModelComparison", + "compare_models", + "diff_models", +] diff --git a/src/raven_python/comparison/diff.py b/src/raven_python/comparison/diff.py new file mode 100644 index 0000000..2a399eb --- /dev/null +++ b/src/raven_python/comparison/diff.py @@ -0,0 +1,285 @@ +"""Strict two-model semantic-equality diff. + +Complements :func:`raven_python.comparison.compare.compare_models`, which +takes ≥2 models and produces a presence-matrix overview. Where +``compare_models`` answers *"how do these models relate?"*, +:func:`diff_models` answers *"are these two models the same?"* — used as a +CI gate to verify that two toolchains (e.g. MATLAB RAVEN vs raven_python, +or pre/post refactor of one toolchain) produce equivalent models. + +Diff scope: reaction / metabolite / gene id sets, stoichiometry (within +tolerance), bounds, objective coefficients, GPR rules, metabolite +formula/charge/compartment, and a configurable set of annotation keys. +Formatting differences (key ordering, whitespace, float repr) are +explicitly **not** failures. +""" +from __future__ import annotations + +from collections.abc import Iterable +from dataclasses import dataclass, field + +import cobra + +# Annotation keys checked by default. Add via ``extra_annotations`` or +# remove via ``ignore_annotations`` in :func:`diff_models`. +DEFAULT_ANNOTATION_KEYS: tuple[str, ...] = ( + "sbo", + "ec-code", + "kegg.reaction", + "kegg.compound", + "metanetx.reaction", + "metanetx.chemical", + "chebi", + "bigg.reaction", + "bigg.metabolite", +) + + +@dataclass +class DiffReport: + """Result of :func:`diff_models`. + + Truthy if the models are semantically equal. Iterate ``differences`` + for human-readable messages. + """ + + equal: bool + differences: list[str] = field(default_factory=list) + + def __bool__(self) -> bool: + return self.equal + + def __str__(self) -> str: + if self.equal: + return "Models are semantically equal." + head = f"Models differ ({len(self.differences)} differences):" + body = "\n".join(f" - {d}" for d in self.differences) + return f"{head}\n{body}" + + +def diff_models( + a: cobra.Model, + b: cobra.Model, + *, + stoichiometry_tol: float = 1e-9, + ignore_annotations: Iterable[str] = (), + extra_annotations: Iterable[str] = (), + max_per_category: int = 50, +) -> DiffReport: + """Compare two cobra models for semantic equality. + + Checks: + + * reaction, metabolite and gene id sets (exact) + * stoichiometry per shared reaction (within ``stoichiometry_tol``) + * lower/upper bounds and objective coefficients (exact) + * GPR rules (whitespace- and case-insensitive string comparison) + * metabolite formula, charge, compartment (exact) + * a default set of annotation keys (see ``DEFAULT_ANNOTATION_KEYS``) + plus any ``extra_annotations``, minus any ``ignore_annotations`` + + ``max_per_category`` caps the per-category diff list so a wholesale + mismatch produces a digestible report rather than 10,000 lines. + """ + diffs: list[str] = [] + keys = (set(DEFAULT_ANNOTATION_KEYS) | set(extra_annotations)) - set(ignore_annotations) + + _diff_id_sets(a, b, diffs) + + common_rxns = {r.id for r in a.reactions} & {r.id for r in b.reactions} + _diff_reactions(a, b, sorted(common_rxns), stoichiometry_tol, keys, diffs, max_per_category) + + common_mets = {m.id for m in a.metabolites} & {m.id for m in b.metabolites} + _diff_metabolites(a, b, sorted(common_mets), keys, diffs, max_per_category) + + return DiffReport(equal=not diffs, differences=diffs) + + +# --- internals --------------------------------------------------------- + +def _diff_id_sets(a: cobra.Model, b: cobra.Model, diffs: list[str]) -> None: + for label, get in ( + ("reactions", lambda m: m.reactions), + ("metabolites", lambda m: m.metabolites), + ("genes", lambda m: m.genes), + ): + a_ids = {x.id for x in get(a)} + b_ids = {x.id for x in get(b)} + only_a = sorted(a_ids - b_ids) + only_b = sorted(b_ids - a_ids) + if only_a: + diffs.append(f"{label} only in A ({len(only_a)}): {only_a[:10]}{_more(only_a, 10)}") + if only_b: + diffs.append(f"{label} only in B ({len(only_b)}): {only_b[:10]}{_more(only_b, 10)}") + + +def _diff_reactions( + a: cobra.Model, + b: cobra.Model, + common: list[str], + tol: float, + anno_keys: set[str], + diffs: list[str], + cap: int, +) -> None: + counters = {"stoich": 0, "bounds": 0, "objective": 0, "gpr": 0, "anno": 0} + for rxn_id in common: + ra = a.reactions.get_by_id(rxn_id) + rb = b.reactions.get_by_id(rxn_id) + + sa = {m.id: c for m, c in ra.metabolites.items()} + sb = {m.id: c for m, c in rb.metabolites.items()} + if set(sa) != set(sb): + _push(diffs, counters, "stoich", cap, f"{rxn_id}: stoichiometry has different mets") + else: + for mid in sa: + if abs(sa[mid] - sb[mid]) > tol: + _push( + diffs, counters, "stoich", cap, + f"{rxn_id}: coef[{mid}] A={sa[mid]:.6g} B={sb[mid]:.6g}", + ) + + if ra.lower_bound != rb.lower_bound or ra.upper_bound != rb.upper_bound: + _push( + diffs, counters, "bounds", cap, + f"{rxn_id}: bounds A=({ra.lower_bound}, {ra.upper_bound}) " + f"B=({rb.lower_bound}, {rb.upper_bound})", + ) + + if ra.objective_coefficient != rb.objective_coefficient: + _push( + diffs, counters, "objective", cap, + f"{rxn_id}: objective A={ra.objective_coefficient} B={rb.objective_coefficient}", + ) + + ga = _normalise_gpr(ra.gene_reaction_rule) + gb = _normalise_gpr(rb.gene_reaction_rule) + if ga != gb: + _push(diffs, counters, "gpr", cap, f"{rxn_id}: GPR A={ga!r} B={gb!r}") + + _diff_annotations( + f"rxn {rxn_id}", ra.annotation, rb.annotation, + anno_keys, diffs, counters, cap, + ) + + +def _diff_metabolites( + a: cobra.Model, + b: cobra.Model, + common: list[str], + anno_keys: set[str], + diffs: list[str], + cap: int, +) -> None: + counters = {"formula": 0, "charge": 0, "compartment": 0, "anno": 0} + for met_id in common: + ma = a.metabolites.get_by_id(met_id) + mb = b.metabolites.get_by_id(met_id) + + if (ma.formula or None) != (mb.formula or None): + _push( + diffs, counters, "formula", cap, + f"met {met_id}: formula A={ma.formula} B={mb.formula}", + ) + + ca = ma.charge if ma.charge is not None else 0 + cb = mb.charge if mb.charge is not None else 0 + if ca != cb: + _push( + diffs, counters, "charge", cap, + f"met {met_id}: charge A={ma.charge} B={mb.charge}", + ) + + if ma.compartment != mb.compartment: + _push( + diffs, counters, "compartment", cap, + f"met {met_id}: compartment A={ma.compartment} B={mb.compartment}", + ) + + _diff_annotations( + f"met {met_id}", ma.annotation, mb.annotation, + anno_keys, diffs, counters, cap, + ) + + +def _diff_annotations( + label: str, + aa: dict, + ba: dict, + keys: set[str], + diffs: list[str], + counters: dict, + cap: int, +) -> None: + for k in keys: + va = _normalise_annotation_value(aa.get(k)) + vb = _normalise_annotation_value(ba.get(k)) + if va != vb: + _push(diffs, counters, "anno", cap, f"{label}.annotation[{k!r}]: A={va} B={vb}") + + +def _normalise_gpr(rule: str) -> str: + """Lowercase + collapse whitespace. + + A more robust comparator would parse to a GPR AST and compare + structures; this is the cheap heuristic that catches the formatting + drift we see between different SBML writers. + """ + if not rule: + return "" + return " ".join(rule.lower().split()) + + +def _normalise_annotation_value(v): + if v is None: + return None + if isinstance(v, list): + return tuple(sorted(str(x) for x in v)) + if isinstance(v, tuple): + return tuple(sorted(str(x) for x in v)) + return str(v) + + +def _push(diffs: list[str], counters: dict, key: str, cap: int, msg: str) -> None: + counters[key] = counters.get(key, 0) + 1 + if counters[key] <= cap: + diffs.append(msg) + elif counters[key] == cap + 1: + diffs.append(f"... ({key} category truncated at {cap} entries)") + + +def _more(seq: list, shown: int) -> str: + return "" if len(seq) <= shown else f" ... (+{len(seq) - shown} more)" + + +# --- CLI --------------------------------------------------------------- + +def _main() -> int: + import argparse + + from cobra.io import read_sbml_model + + parser = argparse.ArgumentParser( + description="Diff two SBML models for semantic equality." + ) + parser.add_argument("a", help="path to first model (SBML)") + parser.add_argument("b", help="path to second model (SBML)") + parser.add_argument( + "--stoichiometry-tol", + type=float, + default=1e-9, + help="absolute tolerance on stoichiometric coefficients (default 1e-9)", + ) + args = parser.parse_args() + + report = diff_models( + read_sbml_model(args.a), + read_sbml_model(args.b), + stoichiometry_tol=args.stoichiometry_tol, + ) + print(report) + return 0 if report.equal else 1 + + +if __name__ == "__main__": # pragma: no cover + raise SystemExit(_main()) diff --git a/src/raven_python/conditions/__init__.py b/src/raven_python/conditions/__init__.py new file mode 100644 index 0000000..8465081 --- /dev/null +++ b/src/raven_python/conditions/__init__.py @@ -0,0 +1,25 @@ +"""Apply named "condition" presets — bound diffs, biomass edits, etc. + +A *condition* is a YAML file (or pre-parsed dict) describing a set of +deterministic modifications to apply to a cobra model: a prelude that +resets exchange bounds, edits to a cofactor pseudoreaction +(metabolite removals + automatic charge rebalancing), a per-reaction +bounds diff, and an optional biomass-stoichiometry delta. + +Yeast-GEM was the first consumer; the schema is documented in +:func:`apply_condition` and is meant to be reusable for any GEM that +wants to keep its condition presets as data rather than code. +""" +from raven_python.conditions.apply import ( + DEFAULT_RESET_EXCHANGES_UPPER_BOUND, + apply_condition, + load_condition, + set_reaction_bounds, +) + +__all__ = [ + "DEFAULT_RESET_EXCHANGES_UPPER_BOUND", + "apply_condition", + "load_condition", + "set_reaction_bounds", +] diff --git a/src/raven_python/conditions/apply.py b/src/raven_python/conditions/apply.py new file mode 100644 index 0000000..85c72a8 --- /dev/null +++ b/src/raven_python/conditions/apply.py @@ -0,0 +1,192 @@ +"""Generic "apply this YAML condition" loader. + +Schema (all keys optional): + +.. code-block:: yaml + + name: my_condition + description: free-form text. + + # Set every exchange reaction to (0, ub) before any per-rxn override. + prelude: + reset_exchanges: out # str | bool truthy + + # Remove metabolites from a pseudoreaction and rebalance H+ + # (or any other "balance" met) so total charge stays zero. + cofactor_pseudoreaction: + rxn_id: r_4598 + remove_mets: + - { met: s_3714 } # comment field is informational + charge_balance_met: s_0794 # optional + + # Add (combine=True) to a reaction's stoichiometry. + biomass_stoichiometry_delta: + rxn_id: r_4041 + add: + - { met: s_0689, coef: 0.08 } + - { met: s_0687, coef: -0.08 } + - { met: s_0794, coef: -0.16 } + + # Per-reaction bounds. lb / ub omitted means "leave unchanged". + bounds: + - { rxn: r_1654, lb: -1000 } + - { rxn: r_1992, lb: 0 } + - { rxn: r_1663, lb: 0, ub: 0 } + + # Sanity-check counter. Emits a warning if the actual count of + # ``lb == -1000`` bound entries doesn't match. + expected_uptake_count: 15 + +The schema is intentionally narrow: deterministic, idempotent, easy to +diff in code review. Project-specific extensions (e.g. yeast-GEM's +``amino_acid_ratio``) are handled by the *caller* before / after this +function — keep the upstream generic. +""" +from __future__ import annotations + +import warnings +from pathlib import Path +from typing import Any + +import cobra +from ruamel.yaml import YAML + +# A safe loader keeps the parsed document as plain dict / list / scalars, +# which matches what callers expect from ``load_condition``. ruamel.yaml +# is already a transitive dependency via cobra, so we don't take on +# PyYAML on top. +_SAFE_YAML = YAML(typ="safe") + +#: ``prelude.reset_exchanges`` puts every exchange reaction at this +#: upper bound (and lower bound = 0). +DEFAULT_RESET_EXCHANGES_UPPER_BOUND: float = 1000.0 + + +def load_condition(path: str | Path) -> dict[str, Any]: + """Parse a condition YAML file into a plain dict.""" + path = Path(path) + if not path.exists(): + raise FileNotFoundError(f"Condition file not found: {path}") + with open(path) as f: + return _SAFE_YAML.load(f) + + +def apply_condition( + model: cobra.Model, + condition: dict[str, Any] | str | Path, +) -> cobra.Model: + """Apply a parsed condition (or a YAML file path) to ``model`` in place. + + Returns the same model object for chaining. + """ + cfg = condition if isinstance(condition, dict) else load_condition(condition) + + _apply_prelude(model, cfg) + _apply_cofactor_pseudoreaction(model, cfg) + _apply_biomass_stoichiometry_delta(model, cfg) + n_uptake = _apply_bounds(model, cfg) + _check_uptake_count(cfg, n_uptake) + + return model + + +def set_reaction_bounds(rxn: cobra.Reaction, lb: float, ub: float) -> None: + """Set both bounds, bypassing cobra's ``lb <= ub`` validator. + + Some conditions intentionally land on an infeasible ``lb > ub`` state + (e.g. forcing flux via a sentinel bound). cobra's setter rejects + that. This helper writes the underlying private attributes when + needed so the model can match the caller's exact intent. + """ + if lb > ub: + rxn._lower_bound = lb # noqa: SLF001 — intentional bypass + rxn._upper_bound = ub # noqa: SLF001 — intentional bypass + else: + rxn.bounds = (lb, ub) + + +# --- step implementations --------------------------------------------- + +def _apply_prelude(model: cobra.Model, cfg: dict[str, Any]) -> None: + prelude = cfg.get("prelude") or {} + if prelude.get("reset_exchanges"): + # cobrapy doesn't distinguish RAVEN's in / out directions; we + # reset every exchange reaction. + for rxn in model.exchanges: + rxn.lower_bound = 0 + rxn.upper_bound = DEFAULT_RESET_EXCHANGES_UPPER_BOUND + + +def _apply_cofactor_pseudoreaction(model: cobra.Model, cfg: dict[str, Any]) -> None: + cp = cfg.get("cofactor_pseudoreaction") + if not cp: + return + rxn = model.reactions.get_by_id(cp["rxn_id"]) + for entry in cp.get("remove_mets", []): + met = model.metabolites.get_by_id(entry["met"]) + _set_coefficient(rxn, met, 0.0) + balance_met_id = cp.get("charge_balance_met") + if balance_met_id: + balance_met = model.metabolites.get_by_id(balance_met_id) + _set_coefficient(rxn, balance_met, 0.0) + total_charge = sum( + (m.charge or 0) * coef + for m, coef in rxn.metabolites.items() + ) + _set_coefficient(rxn, balance_met, -total_charge) + + +def _apply_biomass_stoichiometry_delta(model: cobra.Model, cfg: dict[str, Any]) -> None: + delta = cfg.get("biomass_stoichiometry_delta") + if not delta: + return + rxn = model.reactions.get_by_id(delta["rxn_id"]) + for entry in delta.get("add", []): + met = model.metabolites.get_by_id(entry["met"]) + rxn.add_metabolites({met: float(entry["coef"])}, combine=True) + + +def _apply_bounds(model: cobra.Model, cfg: dict[str, Any]) -> int: + n_uptake = 0 + for entry in cfg.get("bounds", []): + try: + rxn = model.reactions.get_by_id(entry["rxn"]) + except KeyError: + warnings.warn( + f"Reaction {entry['rxn']!r} not found in model; skipping.", + stacklevel=3, + ) + continue + new_lb = float(entry["lb"]) if "lb" in entry else rxn.lower_bound + new_ub = float(entry["ub"]) if "ub" in entry else rxn.upper_bound + set_reaction_bounds(rxn, new_lb, new_ub) + if entry.get("lb") == -1000: + n_uptake += 1 + return n_uptake + + +def _check_uptake_count(cfg: dict[str, Any], n_uptake: int) -> None: + expected = cfg.get("expected_uptake_count") + if expected is None: + return + if n_uptake != expected: + warnings.warn( + f"Expected {expected} uptake reactions, applied {n_uptake}. " + "Some referenced reactions may be missing from the model.", + stacklevel=3, + ) + + +# --- helpers ---------------------------------------------------------- + +def _set_coefficient(rxn: cobra.Reaction, met: cobra.Metabolite, value: float) -> None: + """Set the stoichiometric coefficient of ``met`` in ``rxn`` to ``value``. + + cobra's only public mutation point is ``add_metabolites``; we go via + ``combine=True`` with ``(value - current)`` to land on the desired + value without depending on cobra's removal-on-zero behaviour. + """ + current = rxn.metabolites.get(met, 0.0) + delta = float(value) - current + if delta != 0: + rxn.add_metabolites({met: delta}, combine=True) diff --git a/src/raven_python/curation/__init__.py b/src/raven_python/curation/__init__.py new file mode 100644 index 0000000..de8b8cc --- /dev/null +++ b/src/raven_python/curation/__init__.py @@ -0,0 +1,54 @@ +"""Batch curation of metabolites / reactions / genes from data tables. + +Port of yeast-GEM's MATLAB ``curateMetsRxnsGenes`` into a generic +DataFrame-driven engine. Other GEM projects (Human-GEM, custom +reconstructions, …) can use the same machinery with their own TSV +layouts; the only required pieces are the data tables and (optionally) +project-specific id prefixes for fresh metabolites and reactions. + +Public API: + +* :func:`batch_curate` — entrypoint taking pandas DataFrames. +* :func:`batch_curate_from_tsv` — file-path convenience wrapper. +* :class:`CurationResult` — record of what was added / updated. + +Schema (mirrors yeast-GEM's ``data/modelCuration/template/`` layout): + +- **mets_df**: ``metNames, comps, formula, charge, inchi, metNotes`` + + any number of MIRIAM-annotation columns. Match key is + ``(name, comp)``. +- **genes_df**: ``genes, geneShortNames`` + MIRIAM columns. Match key + is ``genes``. +- **rxns_df**: ``rxnNames, grRules, lb, ub, rev, subSystems, eccodes, + rxnNotes, rxnReferences, rxnConfidenceScores`` + MIRIAM columns. + Match key is the reaction's *stoichiometry* — same metabolites and + coefficients ⇒ same reaction. +- **rxns_coeffs_df**: ``rxnNames, metNames, comps, coefficient``. One + row per ``(reaction, metabolite)`` pair. The ``rxnNames`` column + links each coefficient back to a row in ``rxns_df``. An optional + ``index`` first column from the legacy yeast-GEM schema is silently + ignored. + +Everything after the core columns in any of the four tables is +interpreted as a MIRIAM annotation: the column header becomes the +namespace key (``met.annotation[
] = ``). +""" +from raven_python.curation.batch import ( + DEFAULT_CORE_GENE_COLUMNS, + DEFAULT_CORE_MET_COLUMNS, + DEFAULT_CORE_RXN_COEFFS_COLUMNS, + DEFAULT_CORE_RXN_COLUMNS, + CurationResult, + batch_curate, + batch_curate_from_tsv, +) + +__all__ = [ + "DEFAULT_CORE_GENE_COLUMNS", + "DEFAULT_CORE_MET_COLUMNS", + "DEFAULT_CORE_RXN_COEFFS_COLUMNS", + "DEFAULT_CORE_RXN_COLUMNS", + "CurationResult", + "batch_curate", + "batch_curate_from_tsv", +] diff --git a/src/raven_python/curation/batch.py b/src/raven_python/curation/batch.py new file mode 100644 index 0000000..1bdeb23 --- /dev/null +++ b/src/raven_python/curation/batch.py @@ -0,0 +1,458 @@ +"""Generic batch-curation engine driven by tabular data.""" +from __future__ import annotations + +import warnings +from dataclasses import dataclass, field +from pathlib import Path + +import cobra +import pandas as pd + +#: Core columns recognised in ``mets_df``. Anything else is treated as a +#: MIRIAM annotation column (the header becomes the namespace key). +DEFAULT_CORE_MET_COLUMNS: tuple[str, ...] = ( + "metNames", "comps", "formula", "charge", "inchi", "metNotes", +) +#: Core columns recognised in ``genes_df``. +DEFAULT_CORE_GENE_COLUMNS: tuple[str, ...] = ("genes", "geneShortNames") +#: Core columns recognised in ``rxns_df``. +DEFAULT_CORE_RXN_COLUMNS: tuple[str, ...] = ( + "rxnNames", "grRules", "lb", "ub", "rev", + "subSystems", "eccodes", "rxnNotes", "rxnReferences", "rxnConfidenceScores", +) +#: Required columns in ``rxns_coeffs_df``. (Linkage column ``rxnNames`` +#: + one row per ``(reaction, metabolite)`` pair.) A leading ``index`` +#: column from the legacy yeast-GEM schema is silently ignored. +DEFAULT_CORE_RXN_COEFFS_COLUMNS: tuple[str, ...] = ( + "rxnNames", "metNames", "comps", "coefficient", +) + + +@dataclass +class CurationResult: + """Record of what :func:`batch_curate` added / updated. + + Each list holds the cobra ids of the affected entities, in the + order they were processed. + """ + + added_metabolites: list[str] = field(default_factory=list) + updated_metabolites: list[str] = field(default_factory=list) + added_genes: list[str] = field(default_factory=list) + updated_genes: list[str] = field(default_factory=list) + added_reactions: list[str] = field(default_factory=list) + updated_reactions: list[str] = field(default_factory=list) + + def __bool__(self) -> bool: + return bool( + self.added_metabolites or self.updated_metabolites + or self.added_genes or self.updated_genes + or self.added_reactions or self.updated_reactions + ) + + +# --- public entry points ---------------------------------------------- + +def batch_curate( + model: cobra.Model, + *, + mets_df: pd.DataFrame | None = None, + genes_df: pd.DataFrame | None = None, + rxns_df: pd.DataFrame | None = None, + rxns_coeffs_df: pd.DataFrame | None = None, + met_id_prefix: str = "M_", + rxn_id_prefix: str = "R_", +) -> CurationResult: + """Add or update metabolites / reactions / genes from data tables. + + Each table is optional. ``rxns_df`` and ``rxns_coeffs_df`` must be + provided together (one describes the per-reaction attributes, the + other carries the stoichiometric coefficients). + + Parameters + ---------- + model + Model to curate in place. + mets_df, genes_df, rxns_df, rxns_coeffs_df + Tables matching the schema described in + :mod:`raven_python.curation`. + met_id_prefix, rxn_id_prefix + Prefixes for freshly-generated metabolite / reaction ids + (e.g. ``s_`` and ``r_`` for yeast-GEM, ``M_`` and ``R_`` for + the cobrapy / BiGG default). New entity ids are formed by + finding the largest existing zero-padded suffix matching the + prefix and incrementing from there. + + Returns + ------- + A :class:`CurationResult` summarising the changes. + """ + result = CurationResult() + + if mets_df is not None: + _apply_mets(model, mets_df, met_id_prefix, result) + if genes_df is not None: + _apply_genes(model, genes_df, result) + + if (rxns_df is None) != (rxns_coeffs_df is None): + raise ValueError( + "rxns_df and rxns_coeffs_df must be provided together; got " + f"rxns_df={'set' if rxns_df is not None else 'None'}, " + f"rxns_coeffs_df=" + f"{'set' if rxns_coeffs_df is not None else 'None'}." + ) + if rxns_df is not None: + _apply_reactions(model, rxns_df, rxns_coeffs_df, rxn_id_prefix, result) + return result + + +def batch_curate_from_tsv( + model: cobra.Model, + *, + mets_tsv: str | Path | None = None, + genes_tsv: str | Path | None = None, + rxns_tsv: str | Path | None = None, + rxns_coeffs_tsv: str | Path | None = None, + met_id_prefix: str = "M_", + rxn_id_prefix: str = "R_", +) -> CurationResult: + """File-path convenience wrapper for :func:`batch_curate`. + + Each path is optional. TSVs are read with pandas' default + ``read_csv(sep='\\t')``; empty cells become ``NaN`` (not the + empty string), which the engine treats as "skip this field". + """ + def _read(path): + return pd.read_csv(path, sep="\t") if path is not None else None + + return batch_curate( + model, + mets_df=_read(mets_tsv), + genes_df=_read(genes_tsv), + rxns_df=_read(rxns_tsv), + rxns_coeffs_df=_read(rxns_coeffs_tsv), + met_id_prefix=met_id_prefix, + rxn_id_prefix=rxn_id_prefix, + ) + + +# --- metabolites ------------------------------------------------------ + +def _apply_mets( + model: cobra.Model, + df: pd.DataFrame, + id_prefix: str, + result: CurationResult, +) -> None: + miriam_cols = [c for c in df.columns if c not in DEFAULT_CORE_MET_COLUMNS] + name_index = _name_compartment_index(model) + + new_rows: list[pd.Series] = [] + for _, row in df.iterrows(): + name = row["metNames"] + comp = row["comps"] + existing = name_index.get((name, comp)) + if existing is not None: + _update_metabolite(existing, row, miriam_cols) + result.updated_metabolites.append(existing.id) + else: + new_rows.append(row) + + if new_rows: + new_mets = _build_new_metabolites(model, new_rows, miriam_cols, id_prefix) + model.add_metabolites(new_mets) + result.added_metabolites.extend(m.id for m in new_mets) + + if result.updated_metabolites: + warnings.warn( + f"{len(result.updated_metabolites)} metabolite(s) already " + "existed and were overwritten: " + f"{result.updated_metabolites[:5]}" + f"{'...' if len(result.updated_metabolites) > 5 else ''}", + stacklevel=4, + ) + + +def _name_compartment_index(model: cobra.Model) -> dict[tuple[str, str], cobra.Metabolite]: + return {(m.name, m.compartment): m for m in model.metabolites} + + +def _update_metabolite(met: cobra.Metabolite, row: pd.Series, miriam_cols: list[str]) -> None: + if _has_value(row.get("formula")): + met.formula = str(row["formula"]) + if _has_value(row.get("charge")): + met.charge = _coerce_int(row["charge"]) + if _has_value(row.get("inchi")): + met.annotation["inchi"] = str(row["inchi"]) + if _has_value(row.get("metNotes")): + met.notes["metNotes"] = str(row["metNotes"]) + _apply_miriam(met, row, miriam_cols) + + +def _build_new_metabolites( + model: cobra.Model, + rows: list[pd.Series], + miriam_cols: list[str], + id_prefix: str, +) -> list[cobra.Metabolite]: + ids = _generate_ids(model.metabolites, id_prefix, len(rows)) + new_mets: list[cobra.Metabolite] = [] + for new_id, row in zip(ids, rows, strict=True): + met = cobra.Metabolite( + id=new_id, + name=str(row["metNames"]), + compartment=str(row["comps"]), + ) + if _has_value(row.get("formula")): + met.formula = str(row["formula"]) + if _has_value(row.get("charge")): + met.charge = _coerce_int(row["charge"]) + if _has_value(row.get("inchi")): + met.annotation["inchi"] = str(row["inchi"]) + if _has_value(row.get("metNotes")): + met.notes["metNotes"] = str(row["metNotes"]) + _apply_miriam(met, row, miriam_cols) + new_mets.append(met) + return new_mets + + +# --- genes ------------------------------------------------------------ + +def _apply_genes(model: cobra.Model, df: pd.DataFrame, result: CurationResult) -> None: + miriam_cols = [c for c in df.columns if c not in DEFAULT_CORE_GENE_COLUMNS] + existing_genes = {g.id: g for g in model.genes} + + new_rows: list[pd.Series] = [] + for _, row in df.iterrows(): + gid = str(row["genes"]) + existing = existing_genes.get(gid) + if existing is not None: + _update_gene(existing, row, miriam_cols) + result.updated_genes.append(gid) + else: + new_rows.append(row) + + if new_rows: + # cobrapy doesn't have a direct "add a free-standing gene" API; + # genes are typically added via reactions. Use cobra.Gene + an + # explicit registration through the DictList. + from cobra.core.gene import Gene + + for row in new_rows: + g = Gene(str(row["genes"])) + if _has_value(row.get("geneShortNames")): + g.name = str(row["geneShortNames"]) + _apply_miriam(g, row, miriam_cols) + model.genes.append(g) + result.added_genes.append(g.id) + + if result.updated_genes: + warnings.warn( + f"{len(result.updated_genes)} gene(s) already existed and " + f"were overwritten: {result.updated_genes[:5]}" + f"{'...' if len(result.updated_genes) > 5 else ''}", + stacklevel=4, + ) + + +def _update_gene(gene, row: pd.Series, miriam_cols: list[str]) -> None: + if _has_value(row.get("geneShortNames")): + gene.name = str(row["geneShortNames"]) + _apply_miriam(gene, row, miriam_cols) + + +# --- reactions -------------------------------------------------------- + +def _apply_reactions( + model: cobra.Model, + rxns_df: pd.DataFrame, + coeffs_df: pd.DataFrame, + id_prefix: str, + result: CurationResult, +) -> None: + miriam_cols = [c for c in rxns_df.columns if c not in DEFAULT_CORE_RXN_COLUMNS] + _validate_rxns_coeffs(coeffs_df) + coeffs_by_name = _group_coeffs_by_rxn_name(model, coeffs_df) + + # Build {stoichiometry_signature: existing_reaction} for the lookup + # of "is this reaction already in the model?". + by_signature = { + _stoich_signature(r): r for r in model.reactions + } + + new_rows: list[tuple[pd.Series, dict[cobra.Metabolite, float]]] = [] + for _, row in rxns_df.iterrows(): + rxn_name = row["rxnNames"] + if rxn_name not in coeffs_by_name: + raise ValueError( + f"Reaction {rxn_name!r} in rxns_df has no matching " + "row(s) in rxns_coeffs_df." + ) + stoich = coeffs_by_name[rxn_name] + sig = _stoich_signature_from_dict(stoich) + existing = by_signature.get(sig) + if existing is not None: + _update_reaction(existing, row, miriam_cols) + result.updated_reactions.append(existing.id) + else: + new_rows.append((row, stoich)) + + if new_rows: + new_rxns = _build_new_reactions(model, new_rows, miriam_cols, id_prefix) + model.add_reactions(new_rxns) + result.added_reactions.extend(r.id for r in new_rxns) + + if result.updated_reactions: + warnings.warn( + f"{len(result.updated_reactions)} reaction(s) had matching " + "stoichiometry in the model and were overwritten: " + f"{result.updated_reactions[:5]}" + f"{'...' if len(result.updated_reactions) > 5 else ''}", + stacklevel=4, + ) + + +def _validate_rxns_coeffs(df: pd.DataFrame) -> None: + missing = set(DEFAULT_CORE_RXN_COEFFS_COLUMNS) - set(df.columns) + if missing: + raise ValueError( + f"rxns_coeffs_df is missing required columns: {sorted(missing)}" + ) + + +def _group_coeffs_by_rxn_name( + model: cobra.Model, + coeffs_df: pd.DataFrame, +) -> dict[str, dict[cobra.Metabolite, float]]: + """Group coefficient rows by rxn name and resolve the metabolites. + + Each value is a ``{metabolite_object: coefficient}`` dict ready for + ``cobra.Reaction.add_metabolites``. Metabolites are looked up by + ``(name, comp)``; if a coefficient references an unknown + metabolite, that's an error (the caller should add the metabolite + via ``mets_df`` first). + """ + name_index = _name_compartment_index(model) + grouped: dict[str, dict[cobra.Metabolite, float]] = {} + for _, row in coeffs_df.iterrows(): + rxn_name = str(row["rxnNames"]) + met_name = str(row["metNames"]) + comp = str(row["comps"]) + coef = float(row["coefficient"]) + met = name_index.get((met_name, comp)) + if met is None: + raise ValueError( + f"Reaction {rxn_name!r} references metabolite " + f"{met_name!r}[{comp}] which is not in the model. " + "Add it via mets_df first, or include it in the same " + "batch_curate call." + ) + grouped.setdefault(rxn_name, {})[met] = coef + return grouped + + +def _stoich_signature(rxn: cobra.Reaction) -> frozenset: + return frozenset((m.id, c) for m, c in rxn.metabolites.items()) + + +def _stoich_signature_from_dict(stoich: dict[cobra.Metabolite, float]) -> frozenset: + return frozenset((m.id, c) for m, c in stoich.items()) + + +def _update_reaction(rxn: cobra.Reaction, row: pd.Series, miriam_cols: list[str]) -> None: + if _has_value(row.get("rxnNames")): + rxn.name = str(row["rxnNames"]) + if _has_value(row.get("grRules")): + rxn.gene_reaction_rule = str(row["grRules"]) + if _has_value(row.get("lb")): + rxn.lower_bound = float(row["lb"]) + if _has_value(row.get("ub")): + rxn.upper_bound = float(row["ub"]) + if _has_value(row.get("subSystems")): + rxn.subsystem = str(row["subSystems"]) + if _has_value(row.get("eccodes")): + rxn.annotation["ec-code"] = str(row["eccodes"]) + if _has_value(row.get("rxnNotes")): + rxn.notes["rxnNotes"] = str(row["rxnNotes"]) + if _has_value(row.get("rxnReferences")): + rxn.notes["rxnReferences"] = str(row["rxnReferences"]) + if _has_value(row.get("rxnConfidenceScores")): + rxn.notes["rxnConfidenceScores"] = str(row["rxnConfidenceScores"]) + _apply_miriam(rxn, row, miriam_cols) + + +def _build_new_reactions( + model: cobra.Model, + rows: list[tuple[pd.Series, dict[cobra.Metabolite, float]]], + miriam_cols: list[str], + id_prefix: str, +) -> list[cobra.Reaction]: + ids = _generate_ids(model.reactions, id_prefix, len(rows)) + new_rxns: list[cobra.Reaction] = [] + for new_id, (row, stoich) in zip(ids, rows, strict=True): + rxn = cobra.Reaction( + id=new_id, + name=str(row["rxnNames"]), + lower_bound=float(row["lb"]) if _has_value(row.get("lb")) else -1000.0, + upper_bound=float(row["ub"]) if _has_value(row.get("ub")) else 1000.0, + ) + rxn.add_metabolites(stoich) + if _has_value(row.get("grRules")): + rxn.gene_reaction_rule = str(row["grRules"]) + if _has_value(row.get("subSystems")): + rxn.subsystem = str(row["subSystems"]) + if _has_value(row.get("eccodes")): + rxn.annotation["ec-code"] = str(row["eccodes"]) + if _has_value(row.get("rxnNotes")): + rxn.notes["rxnNotes"] = str(row["rxnNotes"]) + if _has_value(row.get("rxnReferences")): + rxn.notes["rxnReferences"] = str(row["rxnReferences"]) + if _has_value(row.get("rxnConfidenceScores")): + rxn.notes["rxnConfidenceScores"] = str(row["rxnConfidenceScores"]) + _apply_miriam(rxn, row, miriam_cols) + new_rxns.append(rxn) + return new_rxns + + +# --- shared helpers --------------------------------------------------- + +def _apply_miriam(entity, row: pd.Series, miriam_cols: list[str]) -> None: + for col in miriam_cols: + value = row.get(col) + if _has_value(value): + entity.annotation[col] = str(value).strip() + + +def _has_value(v) -> bool: + """Return True if ``v`` is a non-empty, non-NaN cell value.""" + if v is None: + return False + if isinstance(v, float) and v != v: # NaN + return False + if isinstance(v, str) and v.strip() == "": + return False + return True + + +def _coerce_int(v) -> int: + """Pandas reads integer-like strings as floats. Recover the int.""" + return int(round(float(v))) + + +def _generate_ids(existing, prefix: str, count: int) -> list[str]: + """Generate ``count`` fresh ids by incrementing from the largest + existing zero-padded suffix matching ``prefix``. + + e.g. with prefix ``s_`` and existing ``s_4100``, the next ids are + ``s_4101``, ``s_4102``, … Width is preserved. + """ + max_n = 0 + width = 1 + for entity in existing: + if not entity.id.startswith(prefix): + continue + suffix = entity.id[len(prefix):] + if suffix.isdigit(): + max_n = max(max_n, int(suffix)) + width = max(width, len(suffix)) + return [f"{prefix}{(max_n + i):0{width}d}" for i in range(1, count + 1)] diff --git a/src/raven_python/io/yaml.py b/src/raven_python/io/yaml.py index f24c642..3ea96f9 100644 --- a/src/raven_python/io/yaml.py +++ b/src/raven_python/io/yaml.py @@ -66,12 +66,17 @@ def _open_text(path: str | Path, mode: str): # 'note' to avoid colliding with the notes container itself.) _MET_FIELDS = (("inchis", "inchis"), ("deltaG", "deltaG"), ("metFrom", "metFrom"), ("notes", "note")) _RXN_FIELDS = ( - ("confidence_score", "confidence_score"), + ("eccodes", "eccodes"), ("references", "references"), ("rxnFrom", "rxnFrom"), ("deltaG", "deltaG"), + ("confidence_score", "confidence_score"), ("notes", "note"), ) +# Legacy YAML keys accepted on READ for reaction notes. Old RAVEN MATLAB writers +# used "rxnNotes"; the canonical key (matching cobrapy and the current MATLAB +# writer) is "notes". When both appear, "notes" wins. +_LEGACY_RXN_KEY_ALIASES = (("rxnNotes", "notes"),) _GENE_FIELDS = (("protein", "protein"),) _COBRA_TOP_KEYS = frozenset({"metabolites", "reactions", "genes", "compartments", "id", "name"}) @@ -82,8 +87,17 @@ def _open_text(path: str | Path, mode: str): def _to_plain(obj): + """Coerce ruamel/numpy scalars to plain Python primitives. + + Dicts are returned as ``OrderedDict`` so that the round-trip dumper + emits them with the ``!!omap`` tag (ruamel's CommentedMap subclass + is replaced by a plain ``OrderedDict`` to avoid carrying ruamel's + own type tags through). Returning a plain ``dict`` instead would + drop the ``!!omap`` tag and produce files that RAVEN's MATLAB + reader (a line-based parser keyed on ``!!omap``) cannot load. + """ if isinstance(obj, dict): - return {str(k): _to_plain(v) for k, v in obj.items()} + return OrderedDict((str(k), _to_plain(v)) for k, v in obj.items()) if isinstance(obj, (list, tuple)): return [_to_plain(v) for v in obj] if isinstance(obj, bool) or obj is None: @@ -175,6 +189,19 @@ def model_from_yaml_data(raw: dict) -> cobra.Model: # canonical place. No-op on current files. _lift_smiles_to_annotation(raw.get("metabolites")) + # Normalise legacy reaction-side YAML keys (e.g. RAVEN MATLAB's + # ``rxnNotes`` -> the canonical ``notes``) before any field capture so + # the capture step sees a single key per concept. + for entry in raw.get("reactions", []): + if not isinstance(entry, dict): + continue + for legacy_key, canonical_key in _LEGACY_RXN_KEY_ALIASES: + if legacy_key in entry and canonical_key not in entry: + entry[canonical_key] = entry.pop(legacy_key) + elif legacy_key in entry: + # Canonical wins; just drop the legacy duplicate. + entry.pop(legacy_key) + met_notes = _capture_entry_fields(raw.get("metabolites", []), _MET_FIELDS) rxn_notes = _capture_entry_fields(raw.get("reactions", []), _RXN_FIELDS) gene_notes = _capture_entry_fields(raw.get("genes", []), _GENE_FIELDS) @@ -192,18 +219,33 @@ def model_from_yaml_data(raw: dict) -> cobra.Model: # No-op when none match. _flip_legacy_prot_direction(model) - # Legacy files keep id/name inside metaData; restore them if cobra found none. + # RAVEN convention keeps id/name/version inside metaData; lift them + # onto the model so cobra-shaped accessors find them too. Older + # cobra-style files (id/name/version at root) are handled by the + # cobra reader; the metaData lookups below are pure fallbacks. if metadata.get("id") and not model.id: model.id = metadata["id"] if metadata.get("name") and not model.name: model.name = metadata["name"] + if version is None and metadata.get("version") is not None: + version = metadata["version"] if metadata: model.notes["metaData"] = metadata if version is not None: model.notes["version"] = version # Pop the ec sections out of `foreign` and into a typed EcData. - # The remaining unknown keys round-trip opaquely. + # The remaining unknown keys round-trip opaquely. Pre-shim RAVEN + # MATLAB writes wrote `geckoLight: "true"` inside metaData (rather + # than the current top-level `gecko_light`); honour the legacy + # placement too — keep the metaData entry untouched (round-trip) + # and surface it at the top level so EcData picks it up. + legacy_gecko = metadata.get("geckoLight") + if legacy_gecko is not None and "gecko_light" not in foreign: + if isinstance(legacy_gecko, str): + foreign["gecko_light"] = legacy_gecko.lower() == "true" + else: + foreign["gecko_light"] = bool(legacy_gecko) ec_sections = {k: foreign.pop(k) for k in list(foreign) if k in _EC_TOP_KEYS} ec_data = ec_data_from_yaml_sections(ec_sections) if ec_data is not None: @@ -344,6 +386,13 @@ def write_yaml_model( _emit_entry_fields(doc.get("reactions", []), _RXN_FIELDS) _emit_entry_fields(doc.get("genes", []), _GENE_FIELDS) + # cobra's _gene_to_dict always emits `name: ''` because name is a + # required attribute; RAVEN MATLAB skips empty names. Drop the + # empty-string entry so the two writers produce identical genes. + for gene in doc.get("genes", []) or (): + if isinstance(gene, dict) and gene.get("name") == "": + gene.pop("name", None) + # ec sections come from the typed model.ec (when present), not from the # opaque foreign-keys stash. Drop any stale ec-* entries in `foreign` so # they can't conflict with the EcData-derived ones. @@ -356,24 +405,42 @@ def write_yaml_model( else None ) - # cobra dict order is metabolites, reactions, genes, id, name, compartments; - # append version / gecko_light / metaData / ec-* like RAVEN's writer. - if version is not None: - doc["version"] = version - metadata = dict(stored_meta) + # Final document order (matches RAVEN MATLAB writeYAMLmodel): + # metaData, metabolites, reactions, genes, compartments, + # gecko_light, ec-rxns, ec-enzymes, + # id/name/version live inside metaData (the RAVEN convention) — they + # are NOT emitted at root level. Cobra reading such a file recovers + # the lists and compartments and leaves model.id empty (consistent + # with how RAVEN MATLAB has always laid these files out). + metadata = OrderedDict(stored_meta) if model.id: metadata.setdefault("id", model.id) if model.name: metadata.setdefault("name", model.name) - if ec_sections is not None: - doc["gecko_light"] = ec_sections["gecko_light"] + if version is not None: + metadata.setdefault("version", version) + + # cobra's model_to_dict put id / name at root level; drop them so they + # don't duplicate the metaData copy. + doc.pop("id", None) + doc.pop("name", None) + + ordered = OrderedDict() if metadata: - doc["metaData"] = metadata + ordered["metaData"] = metadata + for key in ("metabolites", "reactions", "genes", "compartments", "notes"): + if key in doc: + ordered[key] = doc.pop(key) if ec_sections is not None: - doc["ec-rxns"] = ec_sections["ec-rxns"] - doc["ec-enzymes"] = ec_sections["ec-enzymes"] + ordered["gecko_light"] = ec_sections["gecko_light"] + ordered["ec-rxns"] = ec_sections["ec-rxns"] + ordered["ec-enzymes"] = ec_sections["ec-enzymes"] + # Carry over any cobra-shaped fields we didn't classify above + # (defensive: keeps forward compatibility with future cobra additions). + for key, value in doc.items(): + ordered.setdefault(key, value) for key, value in foreign.items(): - doc[key] = value + ordered.setdefault(key, value) with _open_text(path, "w") as handle: - _cobra_yaml.dump(doc, handle) + _cobra_yaml.dump(ordered, handle) diff --git a/src/raven_python/manipulation/__init__.py b/src/raven_python/manipulation/__init__.py index 074c36f..a462949 100644 --- a/src/raven_python/manipulation/__init__.py +++ b/src/raven_python/manipulation/__init__.py @@ -10,6 +10,7 @@ from .remove import remove_genes, remove_metabolites from .simplify import ( constrain_reversible_reactions, + find_duplicate_reactions, group_linear_reactions, remove_dead_end_reactions, remove_duplicate_reactions, @@ -26,6 +27,7 @@ "constrain_reversible_reactions", "convert_to_irreversible", "expand_model", + "find_duplicate_reactions", "group_linear_reactions", "merge_models", "remove_dead_end_reactions", diff --git a/src/raven_python/manipulation/simplify.py b/src/raven_python/manipulation/simplify.py index 2deaccd..d05c9e6 100644 --- a/src/raven_python/manipulation/simplify.py +++ b/src/raven_python/manipulation/simplify.py @@ -82,6 +82,54 @@ def _signature(rxn): return (mets, rxn.lower_bound, rxn.upper_bound, rxn.objective_coefficient) +def _stoich_signature(rxn, *, ignore_direction: bool) -> frozenset: + """Signature considering stoichiometry only (used by find_duplicate_reactions). + + When ``ignore_direction`` is True, ``A → B`` and ``B → A`` (the same + reaction with all coefficients negated) share a signature. Both + orientations are accumulated and the lexicographically smaller one + wins so the dict-key lookup is direction-symmetric. + """ + forward = frozenset((m.id, c) for m, c in rxn.metabolites.items()) + if not ignore_direction: + return forward + backward = frozenset((m.id, -c) for m, c in rxn.metabolites.items()) + return min(forward, backward, key=lambda s: sorted(s)) + + +def find_duplicate_reactions( + model: cobra.Model, + *, + ignore_direction: bool = True, +) -> list[list[cobra.Reaction]]: + """Return groups of reactions that share identical stoichiometry. + + Detection-only counterpart to :func:`remove_duplicate_reactions`. + Bounds, objective coefficients, GPRs and annotations are ignored — + only stoichiometry is compared, mirroring the legacy yeast-GEM + ``findDuplicatedRxns`` and matching the typical curation use case + (find reactions that *could* be merged). + + Parameters + ---------- + ignore_direction + When True (default), ``A → B`` and ``B → A`` are treated as + duplicates (yeast-GEM's convention). Set ``False`` to require + identical orientation. + + Returns + ------- + A list of duplicate groups. Each group is itself a list with + ≥ 2 reactions sharing the same stoichiometry. Reactions that have + no duplicate are omitted. + """ + groups: dict[frozenset, list[cobra.Reaction]] = {} + for rxn in model.reactions: + sig = _stoich_signature(rxn, ignore_direction=ignore_direction) + groups.setdefault(sig, []).append(rxn) + return [g for g in groups.values() if len(g) >= 2] + + def remove_duplicate_reactions( model: cobra.Model, *, reserved: Iterable[str] | None = None ) -> list[str]: diff --git a/tests/test_annotation.py b/tests/test_annotation.py new file mode 100644 index 0000000..fbab70b --- /dev/null +++ b/tests/test_annotation.py @@ -0,0 +1,204 @@ +"""Tests for raven_python.annotation (SBO terms + ΔG CSV persistence).""" +from __future__ import annotations + +import math + +import cobra +import pandas as pd + +from raven_python.annotation import ( + DEFAULT_BIOMASS_MET_NAMES, + add_sbo_terms, + load_delta_g_csv, + save_delta_g_csv, +) +from raven_python.annotation.sbo import _default_transport_detector + +# --- shared tiny model ------------------------------------------------- + +def _toy_model() -> cobra.Model: + """Toy yeast-flavoured model with: extracellular exchange, transport, + metabolic, biomass, NGAM, generic pseudo, lipid-backbone biomass met.""" + m = cobra.Model("toy") + m.compartments = {"c": "cytoplasm", "e": "extracellular", "x": "peroxisome"} + + mets = { + "atp_c": cobra.Metabolite("atp_c", name="ATP", compartment="c", charge=-4, formula="C10H12N5O13P3"), + "atp_x": cobra.Metabolite("atp_x", name="ATP", compartment="x", charge=-4, formula="C10H12N5O13P3"), + "glc_e": cobra.Metabolite("glc_e", name="glucose", compartment="e", charge=0, formula="C6H12O6"), + "biomass": cobra.Metabolite("biomass_c", name="biomass", compartment="c"), + "lipid_bb": cobra.Metabolite("lbb_c", name="lipid backbone", compartment="c"), + } + m.add_metabolites(list(mets.values())) + + # Exchange (single met in extracellular) + ex = cobra.Reaction("EX_glc", lower_bound=-10, upper_bound=1000) + ex.add_metabolites({mets["glc_e"]: -1}) + # Transport (ATP_c ↔ ATP_x — same met name, two compartments) + tr = cobra.Reaction("ATP_tx", lower_bound=-1000, upper_bound=1000) + tr.add_metabolites({mets["atp_c"]: -1, mets["atp_x"]: 1}) + # Normal metabolic reaction + met = cobra.Reaction("metab1", lower_bound=0, upper_bound=1000) + met.add_metabolites({mets["glc_e"]: -1, mets["atp_c"]: 1}) + # Biomass pseudoreaction (must be last for the bug-compat test) + bio = cobra.Reaction("biomass_rxn", lower_bound=0, upper_bound=1000) + bio.name = "biomass pseudoreaction" + bio.add_metabolites({mets["atp_c"]: -1, mets["biomass"]: 1}) + m.add_reactions([ex, tr, met, bio]) + return m + + +# --- add_sbo_terms ----------------------------------------------------- + +def test_default_biomass_names_set_is_reasonable(): + assert {"biomass", "DNA", "RNA", "protein", "lipid"} <= DEFAULT_BIOMASS_MET_NAMES + + +def test_simple_chemical_gets_simple_sbo(): + m = _toy_model() + add_sbo_terms(m) + assert m.metabolites.get_by_id("glc_e").annotation["sbo"] == "SBO:0000247" + + +def test_biomass_pseudo_metabolite_gets_biomass_sbo(): + m = _toy_model() + add_sbo_terms(m) + assert m.metabolites.get_by_id("biomass_c").annotation["sbo"] == "SBO:0000649" + + +def test_lipid_backbone_suffix_match(): + m = _toy_model() + add_sbo_terms(m) + assert m.metabolites.get_by_id("lbb_c").annotation["sbo"] == "SBO:0000649" + + +def test_exchange_gets_exchange_sbo(): + m = _toy_model() + add_sbo_terms(m) + assert m.reactions.get_by_id("EX_glc").annotation["sbo"] == "SBO:0000627" + + +def test_transport_gets_transport_sbo(): + m = _toy_model() + add_sbo_terms(m) + assert m.reactions.get_by_id("ATP_tx").annotation["sbo"] == "SBO:0000655" + + +def test_biomass_pseudoreaction_gets_pseudo_sbo_by_default(): + m = _toy_model() + add_sbo_terms(m) + assert m.reactions.get_by_id("biomass_rxn").annotation["sbo"] == "SBO:0000629" + + +def test_legacy_bug_flag_scopes_pseudo_to_last_reaction(): + """The yeast-GEM compat flag must restrict the pseudo override to + the last reaction; if biomass_rxn is last, it still wins.""" + m = _toy_model() + add_sbo_terms(m, only_last_reaction_for_pseudo=True) + # biomass_rxn was added last → still gets the pseudo SBO. + assert m.reactions.get_by_id("biomass_rxn").annotation["sbo"] == "SBO:0000629" + + +def test_legacy_bug_flag_skips_pseudo_when_not_last(): + """Reorder: a non-pseudo reaction last; the bug flag should leave + the pseudoreaction with whatever non-pseudo SBO it already had.""" + m = _toy_model() + # Swap last two reactions so the pseudo is NOT last. + rxns = list(m.reactions) + new_order = rxns[:-2] + rxns[-1:] + rxns[-2:-1] + m.reactions._generate_index() # noqa: SLF001 (rebuild before reorder) + # Easier: copy bounds, set order via remove+re-add + m.remove_reactions([rxns[-2], rxns[-1]]) + m.add_reactions(new_order[-2:]) + add_sbo_terms(m, only_last_reaction_for_pseudo=True) + bio = m.reactions.get_by_id("biomass_rxn") + # The biomass reaction has a single bounded reactant set (atp_c -1, + # biomass +1) → 2 mets → falls to default SBO:0000176. + assert bio.annotation["sbo"] == "SBO:0000176" + + +def test_fill_semantic_preserves_existing(): + m = _toy_model() + rxn = m.reactions.get_by_id("metab1") + rxn.annotation["sbo"] = "SBO:0009999" + add_sbo_terms(m) + assert rxn.annotation["sbo"] == "SBO:0009999" + + +def test_custom_transport_detector_overrides_default(): + m = _toy_model() + add_sbo_terms(m, transport_detector=lambda _m: set()) + # ATP_tx is no longer flagged transport → falls through to default + # (it has 2 metabolites so it is not exchange/sink/demand). + assert m.reactions.get_by_id("ATP_tx").annotation["sbo"] == "SBO:0000176" + + +def test_default_transport_detector_finds_transport(): + m = _toy_model() + transports = _default_transport_detector(m) + assert "ATP_tx" in transports + assert "EX_glc" not in transports + + +# --- ΔG CSV round-trip ------------------------------------------------ + +def test_save_then_load_round_trip(tmp_path): + m = _toy_model() + m.metabolites.get_by_id("atp_c").notes["deltaG"] = "-12.34" + m.metabolites.get_by_id("glc_e").notes["deltaG"] = "0.5" + + csv = tmp_path / "met_dg.csv" + written = save_delta_g_csv(m.metabolites, csv) + assert written == len(m.metabolites) + df = pd.read_csv(csv) + assert list(df.columns) == ["Var1", "Var2"] + + # Reload on a fresh model + fresh = _toy_model() + stamped = load_delta_g_csv(fresh.metabolites, csv) + assert stamped == 2 + assert fresh.metabolites.get_by_id("atp_c").notes["deltaG"] == "-12.34" + assert fresh.metabolites.get_by_id("glc_e").notes["deltaG"] == "0.5" + + +def test_save_writes_nan_for_missing_notes(tmp_path): + m = _toy_model() # no notes set + csv = tmp_path / "met_dg.csv" + save_delta_g_csv(m.metabolites, csv) + df = pd.read_csv(csv) + assert len(df) == len(m.metabolites) + assert df["Var2"].apply(lambda v: math.isnan(v)).all() + + +def test_load_skips_nan_rows(tmp_path): + """A NaN in the CSV must NOT clobber the entity's existing note.""" + m = _toy_model() + m.metabolites.get_by_id("atp_c").notes["deltaG"] = "preserved" + + csv = tmp_path / "met_dg.csv" + pd.DataFrame( + {"Var1": ["atp_c", "glc_e"], "Var2": [math.nan, 1.0]} + ).to_csv(csv, index=False) + + load_delta_g_csv(m.metabolites, csv) + assert m.metabolites.get_by_id("atp_c").notes["deltaG"] == "preserved" + assert m.metabolites.get_by_id("glc_e").notes["deltaG"] == "1.0" + + +def test_custom_columns_and_note_key(tmp_path): + m = _toy_model() + m.metabolites.get_by_id("atp_c").notes["dG_kJ"] = "-30.5" + csv = tmp_path / "out.csv" + save_delta_g_csv( + m.metabolites, csv, + id_column="id", value_column="dG", note_key="dG_kJ", + ) + df = pd.read_csv(csv) + assert list(df.columns) == ["id", "dG"] + + fresh = _toy_model() + load_delta_g_csv( + fresh.metabolites, csv, + id_column="id", value_column="dG", note_key="dG_kJ", + ) + assert fresh.metabolites.get_by_id("atp_c").notes["dG_kJ"] == "-30.5" diff --git a/tests/test_biomass.py b/tests/test_biomass.py new file mode 100644 index 0000000..5e96ffd --- /dev/null +++ b/tests/test_biomass.py @@ -0,0 +1,246 @@ +"""Tests for raven_python.biomass.""" +from __future__ import annotations + +import cobra +import pytest + +from raven_python.biomass import ( + BiomassComponent, + BiomassConfig, + rescale_pseudoreaction, + scale_biomass, + set_gam, + sum_biomass, +) +from raven_python.biomass.scale import _formula_mw + +# --- formula MW helper ---------------------------------------------- + +@pytest.mark.parametrize( + "formula, expected", + [ + ("H2O", 18.015), # 2*1.008 + 15.999 + ("ATP", 0.0), # invalid: no element 'ATP'... we raise + ("CO2", 44.008), # 12.01 + 2*15.999 + ("C6H12O6", 180.156), # 6*12.01 + 12*1.008 + 6*15.999 + ], +) +def test_formula_mw(formula, expected): + if formula == "ATP": + with pytest.raises(ValueError, match="Unknown element"): + _formula_mw(formula) + else: + assert _formula_mw(formula) == pytest.approx(expected, rel=1e-4) + + +# --- toy biomass model ---------------------------------------------- + +def _toy_biomass_model() -> tuple[cobra.Model, BiomassConfig]: + """Tiny but realistic biomass layout: + + - protein pseudoreaction: 2 amino-acid tRNAs → protein (mass_strategy mw_minus_2h) + - carbohydrate pseudoreaction: glucose → carbohydrate (mass_strategy mw) + - biomass pseudoreaction: protein + carbohydrate + GAM cofactors → biomass + """ + m = cobra.Model("toy") + m.compartments = {"c": "cytoplasm"} + + aa1 = cobra.Metabolite("aa1_c", name="alanine-tRNA", compartment="c", + charge=0, formula="C3H6N1O2") # MW ≈ 88 + aa2 = cobra.Metabolite("aa2_c", name="glycine-tRNA", compartment="c", + charge=0, formula="C2H4N1O2") # MW ≈ 74 + glc = cobra.Metabolite("glc_c", name="glucose", compartment="c", + charge=0, formula="C6H12O6") # MW = 180.156 + protein = cobra.Metabolite("protein_c", name="protein", compartment="c") + carb = cobra.Metabolite("carb_c", name="carbohydrate", compartment="c") + biomass = cobra.Metabolite("biomass_c", name="biomass", compartment="c") + h = cobra.Metabolite("h_c", name="H+", compartment="c", charge=1) + atp = cobra.Metabolite("atp_c", name="ATP", compartment="c", charge=-4) + adp = cobra.Metabolite("adp_c", name="ADP", compartment="c", charge=-3) + h2o = cobra.Metabolite("h2o_c", name="H2O", compartment="c", charge=0) + pi = cobra.Metabolite("pi_c", name="phosphate", compartment="c", charge=-2) + m.add_metabolites([aa1, aa2, glc, protein, carb, biomass, h, atp, adp, h2o, pi]) + + prot_rxn = cobra.Reaction("PROT_pseudo") + prot_rxn.name = "protein pseudoreaction" + prot_rxn.add_metabolites({aa1: -0.5, aa2: -0.5, protein: 1, h: 1}) + carb_rxn = cobra.Reaction("CARB_pseudo") + carb_rxn.name = "carbohydrate pseudoreaction" + carb_rxn.add_metabolites({glc: -0.001, carb: 1}) # 1 mmol → 180 mg + bio_rxn = cobra.Reaction("BIO_pseudo") + bio_rxn.name = "biomass pseudoreaction" + bio_rxn.add_metabolites({ + protein: -1, carb: -1, biomass: 1, + atp: -50, h2o: -50, adp: 50, h: 50, pi: 50, # GAM = 50 + }) + m.add_reactions([prot_rxn, carb_rxn, bio_rxn]) + + cfg = BiomassConfig( + biomass_rxn="BIO_pseudo", + proton_met="h_c", + components=( + BiomassComponent("protein", "protein pseudoreaction", + mass_strategy="mw_minus_2h"), + BiomassComponent("carbohydrate", "carbohydrate pseudoreaction", + mass_strategy="mw"), + ), + ) + return m, cfg + + +# --- sum_biomass ----------------------------------------------------- + +def test_sum_biomass_keys_match_components_plus_total(): + m, cfg = _toy_biomass_model() + out = sum_biomass(m, cfg) + assert set(out) == {"protein", "carbohydrate", "total"} + + +def test_sum_biomass_protein_uses_minus_2h_offset(): + m, cfg = _toy_biomass_model() + out = sum_biomass(m, cfg) + # aa1 (alanine-tRNA, C3H6N1O2, MW ≈ 88.09): MW - 2.016 + # aa2 (glycine-tRNA, C2H4N1O2, MW ≈ 74.07): MW - 2.016 + # contribs: 0.5*(88.09 - 2.016) + 0.5*(74.07 - 2.016) ≈ 79.066, /1000 + assert out["protein"] == pytest.approx(0.079066, rel=1e-3) + + +def test_sum_biomass_carbohydrate_uses_plain_mw(): + m, cfg = _toy_biomass_model() + out = sum_biomass(m, cfg) + # 0.001 mmol glucose * 180.156 g/mol / 1000 = 0.000180156 + assert out["carbohydrate"] == pytest.approx(0.000180156, rel=1e-3) + + +def test_sum_biomass_total_is_components_sum(): + m, cfg = _toy_biomass_model() + out = sum_biomass(m, cfg) + assert out["total"] == pytest.approx(out["protein"] + out["carbohydrate"]) + + +def test_sum_biomass_grams_strategy(): + m, cfg = _toy_biomass_model() + # Add a lipid component whose stoichiometry is already in grams. + lipid = cobra.Metabolite("lipid_c", name="lipid backbone", compartment="c") + m.add_metabolites([lipid]) + rxn = cobra.Reaction("LIPID_pseudo") + rxn.name = "lipid backbone pseudoreaction" + rxn.add_metabolites({lipid: 1, m.metabolites.get_by_id("glc_c"): -0.05}) + m.add_reactions([rxn]) + cfg2 = BiomassConfig.from_components( + cfg.biomass_rxn, cfg.proton_met, + [*cfg.components, + BiomassComponent("lipid_backbone", "lipid backbone pseudoreaction", + mass_strategy="grams")], + ) + out = sum_biomass(m, cfg2) + assert out["lipid_backbone"] == pytest.approx(0.05) + + +def test_sum_biomass_missing_pseudoreaction_contributes_zero(): + m, cfg = _toy_biomass_model() + cfg2 = BiomassConfig.from_components( + cfg.biomass_rxn, cfg.proton_met, + [*cfg.components, BiomassComponent("DNA", "DNA pseudoreaction")], + ) + out = sum_biomass(m, cfg2) + assert out["DNA"] == 0 + + +# --- rescale_pseudoreaction ----------------------------------------- + +def test_rescale_pseudoreaction_doubles_substrate_coefs(): + m, cfg = _toy_biomass_model() + rxn = m.reactions.get_by_id("PROT_pseudo") + aa1 = m.metabolites.get_by_id("aa1_c") + before = rxn.metabolites[aa1] + rescale_pseudoreaction(m, cfg, "protein", 2.0) + assert rxn.metabolites[aa1] == pytest.approx(2 * before) + + +def test_rescale_pseudoreaction_leaves_product_alone(): + m, cfg = _toy_biomass_model() + rxn = m.reactions.get_by_id("PROT_pseudo") + protein = m.metabolites.get_by_id("protein_c") + before = rxn.metabolites[protein] + rescale_pseudoreaction(m, cfg, "protein", 3.0) + assert rxn.metabolites[protein] == before + + +def test_rescale_pseudoreaction_rebalances_h_for_charge(): + m, cfg = _toy_biomass_model() + # Add a non-zero-charge substrate so H+ balance is meaningful. + rxn = m.reactions.get_by_id("PROT_pseudo") + aa1 = m.metabolites.get_by_id("aa1_c") + aa1.charge = -1 # negatively charged + rescale_pseudoreaction(m, cfg, "protein", 2.0) + total = sum((m_.charge or 0) * c for m_, c in rxn.metabolites.items()) + assert total == pytest.approx(0) + + +# --- scale_biomass -------------------------------------------------- + +def test_scale_biomass_lands_on_new_value(): + m, cfg = _toy_biomass_model() + target = 0.40 + scale_biomass(m, cfg, "protein", target) + out = sum_biomass(m, cfg) + assert out["protein"] == pytest.approx(target, rel=1e-6) + + +def test_scale_biomass_with_balance_keeps_total_at_one(): + m, cfg = _toy_biomass_model() + # Force a known starting total ≠ 1 by tweaking the toy model. + scale_biomass(m, cfg, "protein", 0.6, balance_out="carbohydrate") + out = sum_biomass(m, cfg) + assert out["total"] == pytest.approx(1.0, rel=1e-6) + + +def test_scale_biomass_zero_current_raises(): + m, cfg = _toy_biomass_model() + # Build a config that references a missing component → 0 current + cfg2 = BiomassConfig.from_components( + cfg.biomass_rxn, cfg.proton_met, + [*cfg.components, BiomassComponent("DNA", "DNA pseudoreaction")], + ) + with pytest.raises(ValueError, match="current"): + scale_biomass(m, cfg2, "DNA", 0.1) + + +# --- set_gam -------------------------------------------------------- + +def test_set_gam_scales_cofactor_coefficients(): + m, cfg = _toy_biomass_model() + set_gam(m, 80, biomass_rxn=cfg.biomass_rxn, + cofactor_met_names=("ATP", "ADP", "H2O", "H+", "phosphate")) + bio = m.reactions.get_by_id(cfg.biomass_rxn) + atp = m.metabolites.get_by_id("atp_c") + adp = m.metabolites.get_by_id("adp_c") + h2o = m.metabolites.get_by_id("h2o_c") + pi = m.metabolites.get_by_id("pi_c") + assert bio.metabolites[atp] == -80 + assert bio.metabolites[adp] == 80 + assert bio.metabolites[h2o] == -80 + assert bio.metabolites[pi] == 80 + + +def test_set_gam_with_ngam_sets_equality_bounds(): + m, _ = _toy_biomass_model() + ngam = cobra.Reaction("NGAM", lower_bound=0, upper_bound=1000) + atp = m.metabolites.get_by_id("atp_c") + ngam.add_metabolites({atp: -1}) + m.add_reactions([ngam]) + set_gam(m, 80, biomass_rxn="BIO_pseudo", + cofactor_met_names=("ATP",), + ngam_rxn="NGAM", ngam_value=1.5) + assert m.reactions.get_by_id("NGAM").bounds == (1.5, 1.5) + + +def test_set_gam_ignores_non_cofactor_mets(): + m, cfg = _toy_biomass_model() + bio = m.reactions.get_by_id(cfg.biomass_rxn) + protein = m.metabolites.get_by_id("protein_c") + before = bio.metabolites[protein] + set_gam(m, 80, biomass_rxn=cfg.biomass_rxn, + cofactor_met_names=("ATP",)) + assert bio.metabolites[protein] == before diff --git a/tests/test_comparison_diff.py b/tests/test_comparison_diff.py new file mode 100644 index 0000000..c8556c1 --- /dev/null +++ b/tests/test_comparison_diff.py @@ -0,0 +1,146 @@ +"""Tests for raven_python.comparison.diff — strict 2-model equality.""" +from __future__ import annotations + +import cobra +import pytest + +from raven_python.comparison import DiffReport, diff_models +from raven_python.comparison.diff import _normalise_gpr + + +def _mini_model(model_id: str = "m") -> cobra.Model: + """Tiny but realistic model: one reaction, one extracellular met.""" + m = cobra.Model(model_id) + a = cobra.Metabolite("a_c", name="A", compartment="c", charge=0, formula="C1") + b = cobra.Metabolite("b_e", name="B", compartment="e", charge=0, formula="C2") + m.add_metabolites([a, b]) + r = cobra.Reaction("r1", lower_bound=-1000, upper_bound=1000) + r.add_metabolites({a: -1, b: 1}) + r.gene_reaction_rule = "g1 AND g2" + r.annotation["sbo"] = "SBO:0000176" + m.add_reactions([r]) + return m + + +def test_model_equal_to_itself(): + m = _mini_model() + report = diff_models(m, m) + assert isinstance(report, DiffReport) + assert report.equal + + +def test_independent_copies_are_equal(): + a = _mini_model() + b = _mini_model() + assert diff_models(a, b).equal + + +def test_dropped_reaction_is_detected(): + a = _mini_model() + b = _mini_model() + b.remove_reactions([b.reactions[0]]) + report = diff_models(a, b) + assert not report.equal + assert any("reactions only in A" in d for d in report.differences) + + +def test_bound_diff_is_detected(): + a = _mini_model() + b = _mini_model() + b.reactions[0].lower_bound = -42 + report = diff_models(a, b) + assert not report.equal + assert any("bounds" in d for d in report.differences) + + +def test_stoich_within_tolerance_is_equal(): + a = _mini_model() + b = _mini_model() + met = next(iter(b.reactions[0].metabolites)) + b.reactions[0].add_metabolites({met: 1e-12}, combine=True) + assert diff_models(a, b, stoichiometry_tol=1e-9).equal + + +def test_stoich_outside_tolerance_is_not_equal(): + a = _mini_model() + b = _mini_model() + met = next(iter(b.reactions[0].metabolites)) + b.reactions[0].add_metabolites({met: 5.0}, combine=False) + report = diff_models(a, b, stoichiometry_tol=1e-9) + assert not report.equal + assert any("coef[" in d for d in report.differences) + + +def test_annotation_diff_detected(): + a = _mini_model() + b = _mini_model() + b.reactions[0].annotation["sbo"] = "SBO:0009999" + report = diff_models(a, b) + assert not report.equal + assert any("annotation['sbo']" in d for d in report.differences) + + +def test_ignore_annotations_suppresses_diff(): + a = _mini_model() + b = _mini_model() + b.reactions[0].annotation["sbo"] = "SBO:0009999" + assert diff_models(a, b, ignore_annotations={"sbo"}).equal + + +def test_extra_annotations_picked_up(): + a = _mini_model() + b = _mini_model() + a.reactions[0].annotation["custom-key"] = "v1" + b.reactions[0].annotation["custom-key"] = "v2" + # Default key set ignores custom-key → equal. + assert diff_models(a, b).equal + # Pull it in → diff. + assert not diff_models(a, b, extra_annotations={"custom-key"}).equal + + +@pytest.mark.parametrize( + "ga, gb", + [ + ("A and B", "a AND b"), + ("A and B", "A and B"), + ("(A or B) and C", "(a OR b) AND c"), + ], +) +def test_gpr_normalisation(ga, gb): + assert _normalise_gpr(ga) == _normalise_gpr(gb) + + +def test_max_per_category_truncates(): + """The per-category cap kicks in on per-reaction (and per-met) diffs, + not on the id-set summary line. Build a scenario with many bound + diffs across shared reactions.""" + a = cobra.Model("a") + b = cobra.Model("b") + mets_a = [cobra.Metabolite(f"m{i}_c", compartment="c") for i in range(60)] + mets_b = [cobra.Metabolite(f"m{i}_c", compartment="c") for i in range(60)] + a.add_metabolites(mets_a) + b.add_metabolites(mets_b) + for i in range(60): + ra = cobra.Reaction(f"r{i}", lower_bound=0, upper_bound=1000) + ra.add_metabolites({mets_a[i]: -1}) + a.add_reactions([ra]) + rb = cobra.Reaction(f"r{i}", lower_bound=-7, upper_bound=1000) # differs + rb.add_metabolites({mets_b[i]: -1}) + b.add_reactions([rb]) + report = diff_models(a, b, max_per_category=5) + assert not report.equal + assert any("truncated at 5" in d for d in report.differences) + + +def test_report_str_lists_differences(): + a = _mini_model() + b = _mini_model() + b.remove_reactions([b.reactions[0]]) + text = str(diff_models(a, b)) + assert "Models differ" in text + assert "reactions only in A" in text + + +def test_report_str_when_equal(): + m = _mini_model() + assert str(diff_models(m, m)) == "Models are semantically equal." diff --git a/tests/test_conditions.py b/tests/test_conditions.py new file mode 100644 index 0000000..65b433d --- /dev/null +++ b/tests/test_conditions.py @@ -0,0 +1,211 @@ +"""Tests for raven_python.conditions.apply (generic condition mechanism).""" +from __future__ import annotations + +import io + +import cobra +import pytest +from ruamel.yaml import YAML + +from raven_python.conditions import ( + apply_condition, + load_condition, + set_reaction_bounds, +) + +_SAFE_YAML = YAML(typ="safe") + + +def _yaml_dump(cfg: dict) -> str: + """Helper: dump a dict to YAML via ruamel.yaml (PyYAML.safe_dump + equivalent). Used here because PyYAML isn't a project dependency.""" + buf = io.StringIO() + _SAFE_YAML.dump(cfg, buf) + return buf.getvalue() + + +def _toy_model() -> cobra.Model: + m = cobra.Model("toy") + m.compartments = {"c": "cytoplasm", "e": "extracellular"} + + mets = { + "atp_c": cobra.Metabolite("atp_c", name="ATP", compartment="c", charge=-4), + "glc_e": cobra.Metabolite("glc_e", name="glucose", compartment="e", charge=0), + "h_c": cobra.Metabolite("h_c", name="H+", compartment="c", charge=1), + "fad": cobra.Metabolite("fad", name="FAD", compartment="c", charge=-2), + "fadh2": cobra.Metabolite("fadh2", name="FADH2", compartment="c", charge=0), + "heme": cobra.Metabolite("heme", name="heme a", compartment="c", charge=-2), + } + m.add_metabolites(list(mets.values())) + + ex = cobra.Reaction("EX_glc", lower_bound=-10, upper_bound=1000) + ex.add_metabolites({mets["glc_e"]: -1}) + cofactor = cobra.Reaction("cofac", lower_bound=0, upper_bound=1000) + cofactor.add_metabolites({mets["heme"]: -1, mets["fad"]: -1, mets["h_c"]: 3}) + biomass = cobra.Reaction("bio", lower_bound=0, upper_bound=1000) + biomass.add_metabolites({mets["atp_c"]: -1}) + blocked = cobra.Reaction("blocked", lower_bound=-1000, upper_bound=1000) + blocked.add_metabolites({mets["atp_c"]: -1, mets["glc_e"]: 1}) + m.add_reactions([ex, cofactor, biomass, blocked]) + return m + + +# --- prelude ---------------------------------------------------------- + +def test_prelude_reset_exchanges_zeroes_lb_and_caps_ub(): + m = _toy_model() + apply_condition(m, {"prelude": {"reset_exchanges": "out"}}) + ex = m.reactions.get_by_id("EX_glc") + assert ex.lower_bound == 0 + assert ex.upper_bound == 1000 + + +# --- bounds ----------------------------------------------------------- + +def test_bounds_apply_lb_only(): + m = _toy_model() + apply_condition(m, {"bounds": [{"rxn": "EX_glc", "lb": -1000}]}) + assert m.reactions.get_by_id("EX_glc").lower_bound == -1000 + # ub preserved + assert m.reactions.get_by_id("EX_glc").upper_bound == 1000 + + +def test_bounds_apply_both(): + m = _toy_model() + apply_condition(m, {"bounds": [{"rxn": "blocked", "lb": 0, "ub": 0}]}) + assert m.reactions.get_by_id("blocked").bounds == (0, 0) + + +def test_bounds_lb_gt_ub_bypasses_cobra_validator(): + m = _toy_model() + apply_condition(m, {"bounds": [{"rxn": "blocked", "lb": 1000, "ub": 0}]}) + rxn = m.reactions.get_by_id("blocked") + assert rxn.lower_bound == 1000 + assert rxn.upper_bound == 0 + + +def test_bounds_missing_rxn_warns_and_continues(): + m = _toy_model() + with pytest.warns(UserWarning, match="not_a_real_rxn"): + apply_condition(m, {"bounds": [{"rxn": "not_a_real_rxn", "lb": 0}]}) + + +# --- cofactor pseudoreaction ----------------------------------------- + +def test_remove_met_zeroes_coefficient(): + m = _toy_model() + apply_condition( + m, + { + "cofactor_pseudoreaction": { + "rxn_id": "cofac", + "remove_mets": [{"met": "heme"}], + } + }, + ) + rxn = m.reactions.get_by_id("cofac") + heme = m.metabolites.get_by_id("heme") + assert rxn.metabolites.get(heme, 0) == 0 + + +def test_charge_balance_recomputed_after_removal(): + m = _toy_model() + apply_condition( + m, + { + "cofactor_pseudoreaction": { + "rxn_id": "cofac", + "remove_mets": [{"met": "heme"}], + "charge_balance_met": "h_c", + } + }, + ) + rxn = m.reactions.get_by_id("cofac") + # heme (-1·-2) and fad (-1·-2) contributed +4 charge. + # After heme removal: only fad (-1·-2) contributes +2. + # H+ (charge +1) coef should be -2 to zero the net. + h_c = m.metabolites.get_by_id("h_c") + assert rxn.metabolites[h_c] == -2 + + +# --- biomass stoichiometry delta ------------------------------------- + +def test_biomass_stoichiometry_delta_combines_with_existing(): + m = _toy_model() + bio = m.reactions.get_by_id("bio") + atp = m.metabolites.get_by_id("atp_c") + before = bio.metabolites[atp] + apply_condition( + m, + { + "biomass_stoichiometry_delta": { + "rxn_id": "bio", + "add": [{"met": "atp_c", "coef": 0.5}], + } + }, + ) + assert bio.metabolites[atp] == pytest.approx(before + 0.5) + + +# --- expected_uptake_count ------------------------------------------- + +def test_expected_uptake_count_mismatch_warns(): + m = _toy_model() + cfg = { + "bounds": [{"rxn": "EX_glc", "lb": -1000}], + "expected_uptake_count": 5, + } + with pytest.warns(UserWarning, match="Expected 5 uptake reactions, applied 1"): + apply_condition(m, cfg) + + +def test_expected_uptake_count_match_silent(recwarn): + m = _toy_model() + apply_condition( + m, + { + "bounds": [{"rxn": "EX_glc", "lb": -1000}], + "expected_uptake_count": 1, + }, + ) + assert not any("Expected" in str(w.message) for w in recwarn) + + +# --- yaml path entrypoint -------------------------------------------- + +def test_apply_condition_accepts_yaml_path(tmp_path): + cfg = {"bounds": [{"rxn": "EX_glc", "lb": -42}]} + path = tmp_path / "cond.yml" + path.write_text(_yaml_dump(cfg)) + m = _toy_model() + apply_condition(m, path) + assert m.reactions.get_by_id("EX_glc").lower_bound == -42 + + +def test_load_condition_round_trip(tmp_path): + cfg = {"name": "x", "bounds": [{"rxn": "r1", "lb": 0}]} + path = tmp_path / "cond.yml" + path.write_text(_yaml_dump(cfg)) + assert load_condition(path) == cfg + + +def test_load_condition_missing_path_raises(tmp_path): + with pytest.raises(FileNotFoundError): + load_condition(tmp_path / "nope.yml") + + +# --- set_reaction_bounds public helper ------------------------------- + +def test_set_reaction_bounds_normal_case(): + m = _toy_model() + rxn = m.reactions.get_by_id("EX_glc") + set_reaction_bounds(rxn, -5, 5) + assert rxn.bounds == (-5, 5) + + +def test_set_reaction_bounds_infeasible_case(): + m = _toy_model() + rxn = m.reactions.get_by_id("EX_glc") + set_reaction_bounds(rxn, 100, 0) + assert rxn.lower_bound == 100 + assert rxn.upper_bound == 0 diff --git a/tests/test_curation.py b/tests/test_curation.py new file mode 100644 index 0000000..5464235 --- /dev/null +++ b/tests/test_curation.py @@ -0,0 +1,247 @@ +"""Tests for raven_python.curation.batch.""" +from __future__ import annotations + +import cobra +import pandas as pd +import pytest + +from raven_python.curation import ( + CurationResult, + batch_curate, + batch_curate_from_tsv, +) + + +def _base_model() -> cobra.Model: + """Small yeast-flavoured model with a couple of mets/genes/rxns.""" + m = cobra.Model("base") + m.compartments = {"c": "cytoplasm", "e": "extracellular"} + atp = cobra.Metabolite("s_0001", name="ATP", compartment="c", + formula="C10H12N5O13P3", charge=-4) + adp = cobra.Metabolite("s_0002", name="ADP", compartment="c", + formula="C10H12N5O10P2", charge=-3) + glc_e = cobra.Metabolite("s_0003", name="glucose", compartment="e", + formula="C6H12O6", charge=0) + m.add_metabolites([atp, adp, glc_e]) + from cobra.core.gene import Gene + m.genes.append(Gene("YAL001C", name="A1")) + r = cobra.Reaction("r_0001", lower_bound=0, upper_bound=1000) + r.name = "ATP hydrolysis" + r.add_metabolites({atp: -1, adp: 1}) + r.gene_reaction_rule = "YAL001C" + m.add_reactions([r]) + return m + + +# --- metabolites ------------------------------------------------------ + +def test_add_new_metabolite(): + m = _base_model() + df = pd.DataFrame([ + {"metNames": "NADH", "comps": "c", "formula": "C21H27N7O14P2", + "charge": -2, "inchi": "", "metNotes": "added by curation"}, + ]) + result = batch_curate(m, mets_df=df, met_id_prefix="s_") + assert isinstance(result, CurationResult) + assert result.added_metabolites == ["s_0004"] + assert len(result.updated_metabolites) == 0 + new = m.metabolites.get_by_id("s_0004") + assert new.name == "NADH" + assert new.compartment == "c" + assert new.formula == "C21H27N7O14P2" + assert new.charge == -2 + assert new.notes.get("metNotes") == "added by curation" + + +def test_update_existing_metabolite(): + m = _base_model() + df = pd.DataFrame([ + {"metNames": "ATP", "comps": "c", "formula": "C10H16N5O13P3", + "charge": -3, "inchi": "InChI=ATP", "metNotes": ""}, + ]) + with pytest.warns(UserWarning, match="overwritten"): + result = batch_curate(m, mets_df=df, met_id_prefix="s_") + assert result.added_metabolites == [] + assert result.updated_metabolites == ["s_0001"] + atp = m.metabolites.get_by_id("s_0001") + assert atp.formula == "C10H16N5O13P3" # overwritten + assert atp.charge == -3 + assert atp.annotation["inchi"] == "InChI=ATP" + + +def test_miriam_columns_auto_detected(): + m = _base_model() + df = pd.DataFrame([ + {"metNames": "NADH", "comps": "c", "formula": "C21H27N7O14P2", + "charge": -2, "inchi": "", "metNotes": "", + "kegg.compound": "C00004", "chebi": "CHEBI:16908"}, + ]) + batch_curate(m, mets_df=df, met_id_prefix="s_") + new = m.metabolites.get_by_id("s_0004") + assert new.annotation["kegg.compound"] == "C00004" + assert new.annotation["chebi"] == "CHEBI:16908" + + +def test_new_metabolite_id_increment_preserves_width(): + """If existing ids are s_0001, s_0002, s_0003 (width 4), new ids + should be s_0004, s_0005, … not s_4 / s_5.""" + m = _base_model() + df = pd.DataFrame([ + {"metNames": "X", "comps": "c"}, + {"metNames": "Y", "comps": "c"}, + ]) + result = batch_curate(m, mets_df=df, met_id_prefix="s_") + assert result.added_metabolites == ["s_0004", "s_0005"] + + +# --- genes ------------------------------------------------------------ + +def test_add_new_gene(): + m = _base_model() + df = pd.DataFrame([ + {"genes": "YBR123C", "geneShortNames": "B2", "uniprot": "P12345"}, + ]) + result = batch_curate(m, genes_df=df) + assert result.added_genes == ["YBR123C"] + g = m.genes.get_by_id("YBR123C") + assert g.name == "B2" + assert g.annotation["uniprot"] == "P12345" + + +def test_update_existing_gene(): + m = _base_model() + df = pd.DataFrame([ + {"genes": "YAL001C", "geneShortNames": "A1_NEW", "uniprot": "P99999"}, + ]) + with pytest.warns(UserWarning, match="overwritten"): + batch_curate(m, genes_df=df) + g = m.genes.get_by_id("YAL001C") + assert g.name == "A1_NEW" + assert g.annotation["uniprot"] == "P99999" + + +# --- reactions -------------------------------------------------------- + +def test_add_new_reaction_with_existing_mets(): + m = _base_model() + rxns_df = pd.DataFrame([ + {"rxnNames": "ADP phosphorylation", "grRules": "YBR456W", + "lb": -1000, "ub": 1000, "rev": 1, "subSystems": "energy", + "eccodes": "2.7.4.6", "rxnNotes": "", "rxnReferences": "", + "rxnConfidenceScores": 3, "kegg.reaction": "R00187"}, + ]) + coeffs_df = pd.DataFrame([ + {"rxnNames": "ADP phosphorylation", "metNames": "ADP", "comps": "c", + "coefficient": -1.0}, + {"rxnNames": "ADP phosphorylation", "metNames": "ATP", "comps": "c", + "coefficient": 1.0}, + ]) + result = batch_curate(m, rxns_df=rxns_df, rxns_coeffs_df=coeffs_df, + rxn_id_prefix="r_") + assert result.added_reactions == ["r_0002"] + new = m.reactions.get_by_id("r_0002") + assert new.name == "ADP phosphorylation" + assert new.gene_reaction_rule == "YBR456W" + assert new.annotation["ec-code"] == "2.7.4.6" + assert new.annotation["kegg.reaction"] == "R00187" + assert new.notes.get("rxnConfidenceScores") == "3" + + +def test_update_existing_reaction_by_stoichiometry(): + m = _base_model() + rxns_df = pd.DataFrame([ + {"rxnNames": "ATP hydrolysis renamed", "grRules": "YAL999W", + "lb": -100, "ub": 100, "rev": 1, + "subSystems": "", "eccodes": "3.6.1.3", "rxnNotes": "updated", + "rxnReferences": "", "rxnConfidenceScores": 2}, + ]) + coeffs_df = pd.DataFrame([ + {"rxnNames": "ATP hydrolysis renamed", "metNames": "ATP", + "comps": "c", "coefficient": -1.0}, + {"rxnNames": "ATP hydrolysis renamed", "metNames": "ADP", + "comps": "c", "coefficient": 1.0}, + ]) + with pytest.warns(UserWarning, match="overwritten"): + result = batch_curate(m, rxns_df=rxns_df, rxns_coeffs_df=coeffs_df, + rxn_id_prefix="r_") + assert result.updated_reactions == ["r_0001"] + rxn = m.reactions.get_by_id("r_0001") + assert rxn.name == "ATP hydrolysis renamed" + assert rxn.gene_reaction_rule == "YAL999W" + assert rxn.lower_bound == -100 + assert rxn.notes.get("rxnNotes") == "updated" + + +def test_rxn_with_unknown_met_raises(): + m = _base_model() + rxns_df = pd.DataFrame([ + {"rxnNames": "uses unknown", "grRules": "", "lb": 0, "ub": 1000, + "rev": 0, "subSystems": "", "eccodes": "", "rxnNotes": "", + "rxnReferences": "", "rxnConfidenceScores": ""}, + ]) + coeffs_df = pd.DataFrame([ + {"rxnNames": "uses unknown", "metNames": "mystery", "comps": "c", + "coefficient": -1.0}, + ]) + with pytest.raises(ValueError, match="mystery"): + batch_curate(m, rxns_df=rxns_df, rxns_coeffs_df=coeffs_df, + rxn_id_prefix="r_") + + +def test_must_provide_both_rxn_tables(): + m = _base_model() + rxns_df = pd.DataFrame([{"rxnNames": "X"}]) + with pytest.raises(ValueError, match="must be provided together"): + batch_curate(m, rxns_df=rxns_df) + + +def test_add_new_met_and_use_in_new_rxn(): + """Common curation pattern: introduce a new metabolite + a new + reaction that uses it, in a single call.""" + m = _base_model() + mets_df = pd.DataFrame([ + {"metNames": "NADH", "comps": "c", "formula": "C21H27N7O14P2", + "charge": -2, "inchi": "", "metNotes": ""}, + ]) + rxns_df = pd.DataFrame([ + {"rxnNames": "made up", "grRules": "", "lb": 0, "ub": 1000, + "rev": 0, "subSystems": "", "eccodes": "", "rxnNotes": "", + "rxnReferences": "", "rxnConfidenceScores": ""}, + ]) + coeffs_df = pd.DataFrame([ + {"rxnNames": "made up", "metNames": "ATP", "comps": "c", + "coefficient": -1.0}, + {"rxnNames": "made up", "metNames": "NADH", "comps": "c", + "coefficient": 1.0}, + ]) + result = batch_curate( + m, mets_df=mets_df, rxns_df=rxns_df, rxns_coeffs_df=coeffs_df, + met_id_prefix="s_", rxn_id_prefix="r_", + ) + assert result.added_metabolites == ["s_0004"] + assert result.added_reactions == ["r_0002"] + new_rxn = m.reactions.get_by_id("r_0002") + new_met = m.metabolites.get_by_id("s_0004") + assert new_met in new_rxn.metabolites + + +# --- from_tsv --------------------------------------------------------- + +def test_from_tsv_round_trip(tmp_path): + m = _base_model() + mets_path = tmp_path / "mets.tsv" + mets_path.write_text( + "metNames\tcomps\tformula\tcharge\tinchi\tmetNotes\tkegg.compound\n" + "NADH\tc\tC21H27N7O14P2\t-2\t\t\tC00004\n" + ) + result = batch_curate_from_tsv(m, mets_tsv=mets_path, met_id_prefix="s_") + assert result.added_metabolites == ["s_0004"] + new = m.metabolites.get_by_id("s_0004") + assert new.annotation["kegg.compound"] == "C00004" + + +def test_empty_call_no_op(): + m = _base_model() + result = batch_curate(m) + assert not result + assert result.added_metabolites == [] diff --git a/tests/test_io_yaml.py b/tests/test_io_yaml.py index c406857..d6e7e65 100644 --- a/tests/test_io_yaml.py +++ b/tests/test_io_yaml.py @@ -165,12 +165,16 @@ def test_gzipped_round_trip(yaml_file, tmp_path): def test_output_is_cobra_readable(yaml_file, tmp_path): - # The written file must load with stock cobra (it's cobra's native format). + """Cobrapy must be able to parse the file (no syntax error) and + recover the metabolites, reactions, and the cobrapy-canonical + annotation block. Model-level id / name / version live inside the + metaData section (RAVEN convention) — cobrapy doesn't know about + metaData, so cobra_model.id is None here. raven_python recovers + them; cobrapy ignores them gracefully.""" model = read_yaml_model(yaml_file) out = tmp_path / "out.yml" write_yaml_model(model, out) cobra_model = cobra.io.load_yaml_model(str(out)) - assert cobra_model.id == "testModel" assert {m.id for m in cobra_model.metabolites} == {"s_0001", "s_0002"} # RAVEN-only fields land in cobra notes; smiles in annotation assert cobra_model.metabolites.get_by_id("s_0001").annotation["smiles"] == ["C1=NC2"] diff --git a/tests/test_io_yaml_parity.py b/tests/test_io_yaml_parity.py new file mode 100644 index 0000000..2c89f58 --- /dev/null +++ b/tests/test_io_yaml_parity.py @@ -0,0 +1,365 @@ +"""Parity gate: round-tripping a YAML model through raven_python.io.yaml +must produce a file that: + + * cobra.io.load_yaml_model can read (the cobrapy-canonical core); + * keeps every RAVEN-only field (inchis / eccodes / deltaG / rxnFrom / + metFrom / references / confidence_score / rxnNotes / protein / + metMiriams / rxnMiriams / annotation-side SMILES); + * emits ``!!omap`` tags on each per-entry mapping (so RAVEN MATLAB's + line-based reader can ingest it); + * places the ``metaData`` block first, matching RAVEN MATLAB's layout. + +The fixture below is the smallest model that exercises every RAVEN +extension, plus a legacy ``rxnNotes`` key (read-time alias the writer +must normalise to ``notes``) and a metabolite with a SMILES value +that would parse as a flow sequence if emitted unquoted. +""" +from __future__ import annotations + +from pathlib import Path + +import cobra +import cobra.io +import pytest +from cobra.io.yaml import yaml as cobra_yaml + +from raven_python.io import read_yaml_model, write_yaml_model + + +SAMPLE = { + "metabolites": [ + { + "id": "s_0001", + "name": "ATP", + "compartment": "c", + "charge": -4, + "formula": "C10H16N5O13P3", + "inchis": "InChI=1S/CH4", + "deltaG": 12.5, + "metFrom": "KEGG", + "notes": "metabolite note", + "annotation": { + "kegg.compound": ["C00002"], + "smiles": ["C1=NC2=C(N=CN2)N(C1=O)C"], # YAML-ambiguous chars + }, + }, + {"id": "s_0002", "name": "ADP", "compartment": "c"}, + ], + "reactions": [ + { + "id": "R1", + "name": "rxn one", + "metabolites": {"s_0001": -1.0, "s_0002": 1.0}, + "lower_bound": -1000.0, + "upper_bound": 1000.0, + "gene_reaction_rule": "G1", + "objective_coefficient": 0, + "subsystem": "glycolysis", + "confidence_score": 2, + "references": "PMID:123", + "rxnFrom": "manual", + "eccodes": "1.1.1.1", + "rxnNotes": "legacy reaction note key", # read-time alias + "deltaG": -5.0, + "annotation": {"ec-code": ["1.1.1.1"]}, + } + ], + "genes": [ + {"id": "G1", "name": "geneOne", "protein": "P12345", + "annotation": {"uniprot": ["P12345"]}} + ], + "compartments": {"c": "cytoplasm"}, + "metaData": { + "id": "testModel", + "name": "Test", + "version": "1.0", + "date": "2026-05-23", + "taxonomy": "taxonomy/559292", + }, +} + + +@pytest.fixture +def src(tmp_path) -> Path: + p = tmp_path / "source.yml" + with p.open("w", encoding="utf-8") as fh: + cobra_yaml.dump(SAMPLE, fh) + return p + + +def test_round_trip_preserves_every_raven_field(src, tmp_path): + model = read_yaml_model(src) + out = tmp_path / "out.yml" + write_yaml_model(model, out) + reloaded = read_yaml_model(out) + + # Core (cobra-known) content stayed. + assert reloaded.id == "testModel" + assert {m.id for m in reloaded.metabolites} == {"s_0001", "s_0002"} + r = reloaded.reactions.get_by_id("R1") + assert r.bounds == (-1000.0, 1000.0) + assert r.gene_reaction_rule == "G1" + assert r.subsystem == "glycolysis" + + # Metabolite RAVEN extras. + a = reloaded.metabolites.get_by_id("s_0001") + assert a.notes["inchis"] == "InChI=1S/CH4" + assert a.notes["deltaG"] == 12.5 + assert a.notes["metFrom"] == "KEGG" + assert a.notes["note"] == "metabolite note" + assert a.annotation["smiles"] == ["C1=NC2=C(N=CN2)N(C1=O)C"] + + # Reaction RAVEN extras (incl. the eccodes round-trip that earlier + # versions dropped on write). + assert r.notes["eccodes"] == "1.1.1.1" + assert r.notes["references"] == "PMID:123" + assert r.notes["rxnFrom"] == "manual" + assert r.notes["confidence_score"] == 2 + assert r.notes["deltaG"] == -5.0 + # rxnNotes (legacy key) gets normalised to notes on read. + assert r.notes["note"] == "legacy reaction note key" + + # Gene RAVEN extras. + assert reloaded.genes.get_by_id("G1").notes["protein"] == "P12345" + + # Provenance. + assert reloaded.notes["metaData"]["taxonomy"] == "taxonomy/559292" + assert reloaded.notes["version"] == "1.0" + + +def test_output_is_cobra_readable(src, tmp_path): + """The written file is valid cobra-native YAML; cobra.io can read + the core content (it doesn't know about RAVEN extras, but doesn't + choke on them either).""" + model = read_yaml_model(src) + out = tmp_path / "out.yml" + write_yaml_model(model, out) + cmodel = cobra.io.load_yaml_model(str(out)) + assert {m.id for m in cmodel.metabolites} == {"s_0001", "s_0002"} + assert cmodel.reactions.get_by_id("R1").bounds == (-1000.0, 1000.0) + # SMILES landed in annotation, not at metabolite top level. + assert cmodel.metabolites.get_by_id("s_0001").annotation["smiles"] == [ + "C1=NC2=C(N=CN2)N(C1=O)C" + ] + + +def test_output_carries_omap_tags(src, tmp_path): + """RAVEN MATLAB's reader is a line-based parser keyed on ``!!omap``; + the writer must emit those tags. (PR #17 originally dropped them + because _to_plain flattened OrderedDicts to plain dicts.)""" + model = read_yaml_model(src) + out = tmp_path / "out.yml" + write_yaml_model(model, out) + text = out.read_text() + # Per-entry !!omap on metabolites, reactions, genes: + assert text.count("!!omap") >= 1 + 1 + 2 + 1 + 1 # root + metaData + 2 mets + 1 rxn + 1 gene + # Each major section appears as a top-level entry. + for section in ("metabolites:", "reactions:", "genes:", "compartments:"): + assert f"- {section}" in text + + +def test_metadata_is_first(src, tmp_path): + """RAVEN MATLAB emits metaData as the first top-level section. + Producing the same layout means RAVEN MATLAB and raven_python + files agree byte-for-byte on top-level ordering.""" + model = read_yaml_model(src) + out = tmp_path / "out.yml" + write_yaml_model(model, out) + text = out.read_text() + idx_meta = text.find("- metaData:") + idx_mets = text.find("- metabolites:") + assert 0 <= idx_meta < idx_mets, "metaData must precede metabolites" + + +def test_smiles_with_yaml_special_chars_quoted(src, tmp_path): + """The SMILES value above contains square brackets; an unquoted + bare scalar would be parsed as a flow sequence. The writer must + either keep ``smiles`` inside the annotation block (where SMILES + annotations naturally live) or quote it. Either way the loop- + back read must recover the exact string.""" + model = read_yaml_model(src) + out = tmp_path / "out.yml" + write_yaml_model(model, out) + # The verification is purely functional: reload, check the value. + reloaded = read_yaml_model(out) + assert reloaded.metabolites.get_by_id("s_0001").annotation["smiles"] == [ + "C1=NC2=C(N=CN2)N(C1=O)C" + ] + + +PRE_SHIM_YAML = """\ +--- +!!omap +- metaData: + id: "eciYali" + name: "Yarrowia lipolytica" + version: "1.0" + date: "2024-10-17" + geckoLight: "true" +- metabolites: + - !!omap + - id: "s_0001" + - name: "ATP" + - compartment: "c" + - formula: "C10H16N5O13P3" + - charge: -4 + - inchis: "InChI=1S/CH4" + - smiles: "[O-]P(=O)([O-])OP(=O)([O-])O" + - annotation: !!omap + - kegg.compound: "C00002" + - sbo: "SBO:0000247" + - deltaG: 12.5 + - notes: "metabolite note" + - metFrom: "KEGG" +- reactions: + - !!omap + - id: "r_0001" + - name: "hexokinase" + - metabolites: !!omap + - s_0001: -1 + - lower_bound: -1000 + - upper_bound: 1000 + - gene_reaction_rule: "G1" + - rxnFrom: "KEGG" + - eccodes: "2.7.1.1" + - references: "PMID:12345" + - subsystem: "Glycolysis" + - annotation: !!omap + - kegg.reaction: "R00299" + - sbo: "SBO:0000176" + - deltaG: -17.39 + - confidence_score: 2 + - rxnNotes: "old reaction note" +- genes: + - !!omap + - id: "G1" + - name: "HXK1" + - protein: "P01234" + - annotation: !!omap + - uniprot: "P01234" +- compartments: !!omap + - c: "cytoplasm" +- ec-rxns: + - !!omap + - id: "r_0001" + - kcat: 25.3 + - enzymes: !!omap + - P01234: 1 +- ec-enzymes: + - !!omap + - genes: "G1" + - enzymes: "P01234" + - mw: 50000 +""" + + +def test_pre_shim_format_loads(tmp_path): + """The pre-`feat/yeast-gem-shared` RAVEN MATLAB writer emitted a + file shape that differs from the current one in seven concrete + ways. The reader must continue to load every one of them: + + 1. ``---`` document-start marker (kept by old MATLAB writer) + 2. ``- metaData:`` as a plain block mapping (no ``!!omap`` tag) + 3. ``geckoLight: "true"`` *inside* metaData (now emitted as a + top-level ``gecko_light``) + 4. Metabolite ``smiles`` as a top-level entry key (now emitted + inside the ``annotation`` block) + 5. Reaction notes under the ``rxnNotes`` key (now emitted as + ``notes``) + 6. Integer-typed bounds / coefficients (now emitted as floats) + 7. Every string double-quoted (now bare unless YAML requires + quoting) + + Each item below maps to one of those seven cases. + """ + p = tmp_path / "pre_shim.yml" + p.write_text(PRE_SHIM_YAML) + model = read_yaml_model(p) + + # metaData survives + provenance is lifted onto the cobra-shape + # accessors (cases 1 + 2). + assert model.id == "eciYali" + assert model.name == "Yarrowia lipolytica" + assert model.notes["version"] == "1.0" + assert model.notes["metaData"]["taxonomy" if "taxonomy" in model.notes["metaData"] else "date"] + + # geckoLight in metaData populates the typed EcData (case 3). + assert model.ec is not None + assert model.ec.gecko_light is True + assert model.ec.rxns == ["r_0001"] + assert model.ec.kcat[0] == 25.3 + + # Top-level smiles lifted into annotation.smiles (case 4). + a = model.metabolites.get_by_id("s_0001") + assert a.annotation["smiles"] == ["[O-]P(=O)([O-])OP(=O)([O-])O"] + assert "smiles" not in a.notes # stays in annotation, not notes + + # rxnNotes read as the canonical notes key (case 5). + r = model.reactions.get_by_id("r_0001") + assert r.notes["note"] == "old reaction note" + + # Integer bounds become floats inside cobra (case 6). + assert r.bounds == (-1000.0, 1000.0) + assert isinstance(r.lower_bound, float) + + # Quoted strings unquote cleanly (case 7) — verified implicitly by + # all the equality assertions above. Spot check the metabolite + # name, which used double quotes in the source. + assert a.name == "ATP" + + # Other RAVEN extras still preserved. + assert a.notes["inchis"] == "InChI=1S/CH4" + assert a.notes["deltaG"] == 12.5 + assert a.notes["note"] == "metabolite note" + assert a.notes["metFrom"] == "KEGG" + assert r.notes["rxnFrom"] == "KEGG" + assert r.notes["eccodes"] == "2.7.1.1" + assert r.notes["references"] == "PMID:12345" + assert r.notes["confidence_score"] == 2 + assert r.notes["deltaG"] == -17.39 + assert model.genes.get_by_id("G1").notes["protein"] == "P01234" + + +def test_pre_shim_yeast_gem_loads_if_available(): + """The real pre-shim yeast-GEM.yml: 2748 mets, 4102 rxns, 1143 + genes. Skipped when the working copy isn't mounted (CI runners).""" + real = Path("/mnt/c/Work/GitHub/yeast-GEM/model/yeast-GEM.yml") + if not real.exists(): + pytest.skip("yeast-GEM.yml not available in this environment") + model = read_yaml_model(real) + assert model.id == "yeastGEM_develop" + assert len(model.metabolites) == 2748 + assert len(model.reactions) == 4102 + assert len(model.genes) == 1143 + # Every RAVEN extension we know about must come through. + assert sum(1 for r in model.reactions if r.notes.get("eccodes")) == 2411 + assert sum(1 for r in model.reactions if r.notes.get("deltaG") is not None) == 3984 + assert sum(1 for m in model.metabolites if m.notes.get("deltaG") is not None) == 2696 + assert sum(1 for m in model.metabolites if "smiles" in (m.annotation or {})) == 1788 + assert sum(1 for r in model.reactions if r.notes.get("note")) == 1443 + + +def test_eccodes_round_trip_through_cobra_extras(src, tmp_path): + """A model loaded from cobra (no eccodes awareness) and re-written + via raven_python.write_yaml_model still keeps eccodes — they're + sourced from .notes['eccodes'] which read_yaml_model puts there.""" + # Same fixture, but go through cobra first to prove notes-based + # eccodes propagation works when cobra is in the loop. + model = read_yaml_model(src) + pass1 = tmp_path / "via_rp.yml" + write_yaml_model(model, pass1) + via_cobra = cobra.io.load_yaml_model(str(pass1)) + # cobra exposes eccodes as an attribute (setattr fall-through); + # raven_python sourced it from notes, so .notes['eccodes'] should + # still be present on the reloaded model. + pass2 = tmp_path / "via_rp2.yml" + # Promote cobra's setattr-eccodes back into notes for the writer + # path. (Tests the documented integration: cobra preserves the YAML + # key, raven_python.read sees it again.) + again = read_yaml_model(pass1) + write_yaml_model(again, pass2) + final = read_yaml_model(pass2) + assert final.reactions.get_by_id("R1").notes["eccodes"] == "1.1.1.1" + # And cobra can still read the final result. + cm = cobra.io.load_yaml_model(str(pass2)) + assert cm.reactions.get_by_id("R1").bounds == (-1000.0, 1000.0) diff --git a/tests/test_manipulation_find_duplicate.py b/tests/test_manipulation_find_duplicate.py new file mode 100644 index 0000000..692556e --- /dev/null +++ b/tests/test_manipulation_find_duplicate.py @@ -0,0 +1,85 @@ +"""Tests for raven_python.manipulation.find_duplicate_reactions.""" +from __future__ import annotations + +import cobra + +from raven_python.manipulation import find_duplicate_reactions + + +def _mk_model() -> cobra.Model: + m = cobra.Model("m") + a = cobra.Metabolite("a_c", compartment="c") + b = cobra.Metabolite("b_c", compartment="c") + c = cobra.Metabolite("c_c", compartment="c") + m.add_metabolites([a, b, c]) + return m + + +def test_no_duplicates_returns_empty(): + m = _mk_model() + r1 = cobra.Reaction("r1") + r1.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + r2 = cobra.Reaction("r2") + r2.add_metabolites({m.metabolites.b_c: -1, m.metabolites.c_c: 1}) + m.add_reactions([r1, r2]) + assert find_duplicate_reactions(m) == [] + + +def test_same_stoichiometry_grouped(): + m = _mk_model() + r1 = cobra.Reaction("r1", lower_bound=0, upper_bound=1000) + r2 = cobra.Reaction("r2", lower_bound=-100, upper_bound=100) + for r in (r1, r2): + r.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + m.add_reactions([r1, r2]) + groups = find_duplicate_reactions(m) + assert len(groups) == 1 + assert {r.id for r in groups[0]} == {"r1", "r2"} + + +def test_ignore_direction_default_groups_reverse_pair(): + """yeast-GEM's findDuplicatedRxns matches A→B with B→A. That's the default.""" + m = _mk_model() + r1 = cobra.Reaction("r1") + r1.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + r2 = cobra.Reaction("r2") + r2.add_metabolites({m.metabolites.a_c: 1, m.metabolites.b_c: -1}) # reversed + m.add_reactions([r1, r2]) + groups = find_duplicate_reactions(m) + assert len(groups) == 1 + + +def test_ignore_direction_false_keeps_them_separate(): + m = _mk_model() + r1 = cobra.Reaction("r1") + r1.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + r2 = cobra.Reaction("r2") + r2.add_metabolites({m.metabolites.a_c: 1, m.metabolites.b_c: -1}) + m.add_reactions([r1, r2]) + assert find_duplicate_reactions(m, ignore_direction=False) == [] + + +def test_three_duplicates_in_one_group(): + m = _mk_model() + rxns = [] + for i in range(3): + r = cobra.Reaction(f"r{i}") + r.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + rxns.append(r) + m.add_reactions(rxns) + groups = find_duplicate_reactions(m) + assert len(groups) == 1 + assert len(groups[0]) == 3 + + +def test_ignores_bounds_and_gpr_differences(): + """Bounds and GPRs are intentionally ignored — only stoichiometry.""" + m = _mk_model() + r1 = cobra.Reaction("r1", lower_bound=0, upper_bound=1000) + r2 = cobra.Reaction("r2", lower_bound=-50, upper_bound=50) + r1.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + r2.add_metabolites({m.metabolites.a_c: -1, m.metabolites.b_c: 1}) + r1.gene_reaction_rule = "g1" + r2.gene_reaction_rule = "g2" + m.add_reactions([r1, r2]) + assert len(find_duplicate_reactions(m)) == 1