Skip to content

Commit cbe5f59

Browse files
committed
feat(io.yaml): typed model.ec via EcData; absorb legacy GECKO normalisations
Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m, which populate the model.ec struct whenever the YAML defines it. Downstream consumers (geckopy / GECKO) operate on the populated struct rather than re-parsing the YAML themselves. New: src/raven_python/io/ec_data.py - EcData dataclass with the MATLAB-GECKO field shape (per-rxn arrays: rxns/kcat/source/notes/eccodes; per-enzyme arrays: genes/enzymes/mw/ sequence/concs; sparse rxn_enz_mat coupling; gecko_light flag). - ec_data_from_yaml_sections: parses ec-rxns/ec-enzymes/gecko_light into a typed EcData, validating that every enzyme referenced from an ec-rxns row exists in ec-enzymes (catches the common authoring bug where the two sections drift apart). - ec_data_to_yaml_sections: serialises an EcData back to the list-of-mappings YAML form. Empty source/notes/eccodes/sequence and NaN mw/concs are omitted to keep files compact; kcat is always written (0 == "no kcat assigned", matching MATLAB GECKO). - _canonicalize_eccodes / _eccodes_to_yaml handle the scalar-or-list YAML representation for EC numbers. Extended: src/raven_python/io/yaml.py - model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light out of the foreign-keys stash, builds an EcData, and attaches it as model.ec. Other unknown top-level keys still round-trip opaquely via model.notes['_yaml_sections']. - write_yaml_model now serialises model.ec to the top-level ec-rxns / ec-enzymes / gecko_light sections when present, and drops any stale ec-* in _yaml_sections so the file isn't ambiguous. - read_yaml_model also accepts the very old RAVEN shape where the document root is a bare `-` sequence of single-key mappings; the reader merges them into one dict before parsing. - model_from_yaml_data now also normalises two legacy ecModel quirks in line with MATLAB GECKO behaviour: * per-metabolite top-level `smiles` -> annotation['smiles'] (older writers placed SMILES at the metabolite top level); * `usage_prot_*` / `prot_pool_exchange` reactions with negative lower bound and swapped stoichiometry are flipped to the forward convention (warns once per load). Tests - tests/test_io_yaml_ec_data.py (new): 18 focused tests covering load (model.ec population, sentinel handling for omitted optional fields, gecko_light flag, eccodes scalar-or-list, no-ec models), save (sections emitted, NaN/empty omission, numpy-scalar coercion, stale _yaml_sections overridden), legacy quirks (top-level smiles lifted, reverse-direction prot flip with warn, bare-sequence root merge), and error paths (half-pair of ec-* sections, dangling enzyme reference). - tests/test_io_yaml.py (updated): RAVEN_DOC fixture grew a complete ec-rxns/ec-enzymes pair so the round-trip test now verifies typed EcData survives, not just an opaque _yaml_sections stash.
1 parent 8c7672e commit cbe5f59

5 files changed

Lines changed: 875 additions & 27 deletions

File tree

src/raven_python/io/__init__.py

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,15 @@
1-
"""RAVEN-specific I/O: YAML (cobra + Metabolic Atlas / Human-GEM extensions), SIF,
2-
Excel export, and the Standard-GEM ``model/<fmt>/…`` git layout.
1+
"""RAVEN-specific I/O: YAML (cobra + Metabolic Atlas / Human-GEM extensions, plus
2+
the GECKO ec-model substructure), SIF, Excel export, and the Standard-GEM
3+
``model/<fmt>/…`` git layout.
34
"""
5+
from raven_python.io.ec_data import EcData
46
from raven_python.io.excel import export_to_excel
57
from raven_python.io.git import export_for_git
68
from raven_python.io.sif import export_model_to_sif
79
from raven_python.io.yaml import read_yaml_model, write_yaml_model
810

911
__all__ = [
12+
"EcData",
1013
"export_for_git",
1114
"export_model_to_sif",
1215
"export_to_excel",

src/raven_python/io/ec_data.py

Lines changed: 306 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,306 @@
1+
"""Typed enzyme-constrained ecModel substructure (``model.ec``).
2+
3+
Aligned with MATLAB GECKO's ``model.ec`` struct. Holds the per-reaction and
4+
per-enzyme arrays that the GECKO toolbox attaches to a metabolic model when
5+
making it enzyme-constrained: kcat values, enzyme molecular weights, the
6+
sparse reaction-to-enzyme coupling matrix, and assorted bookkeeping
7+
(provenance source tags, free-text notes, EC numbers, sequences).
8+
9+
This module owns:
10+
11+
- the ``EcData`` dataclass (in-memory shape),
12+
- the YAML schema for the ``ec-rxns`` / ``ec-enzymes`` / ``gecko_light``
13+
top-level sections,
14+
- the (de)serialisation helpers `ec_data_from_yaml_sections` and
15+
`ec_data_to_yaml_sections`.
16+
17+
It does NOT touch the cobra-shaped portion of the document — that stays
18+
with ``raven_python.io.yaml``, which calls into here when ec sections are
19+
present. The split mirrors RAVEN MATLAB: ``readYAMLmodel.m`` populates
20+
``model.ec`` whenever the YAML defines it; downstream consumers
21+
(geckopy / GECKO) operate on the populated struct.
22+
23+
YAML schema (one entry per row):
24+
25+
::
26+
27+
ec-rxns:
28+
- id: R1_EXP_1
29+
kcat: 12.5 # turnover number, 1/s (0 == "no kcat assigned")
30+
source: brenda # optional, omitted if empty
31+
notes: free-text # optional, omitted if empty
32+
eccodes: "1.1.1.1" # optional; scalar OR a list when multiple
33+
enzymes: # column -> stoichiometric subunit count
34+
P12345: 1.0
35+
P67890: 2.0 # heteromeric complex with 2 copies of P67890
36+
ec-enzymes:
37+
- genes: G1 # gene name as it appears in cobra
38+
enzymes: P12345 # uniprot accession (or KEGG id)
39+
mw: 50000.0 # Da; omitted when unknown
40+
sequence: MAGIC # protein sequence; omitted when empty
41+
concs: 0.005 # proteomics-measured concentration mg/gDCW;
42+
# omitted when not measured
43+
gecko_light: false # top-level bool; defaults to false on load
44+
"""
45+
from __future__ import annotations
46+
47+
import math
48+
from dataclasses import dataclass, field
49+
from typing import Any
50+
51+
import numpy as np
52+
from scipy import sparse
53+
54+
55+
@dataclass
56+
class EcData:
57+
"""Typed enzyme-constrained ecModel substructure attached as ``model.ec``.
58+
59+
Field semantics match MATLAB GECKO's ``model.ec`` struct one-to-one.
60+
Two parallel index spaces:
61+
62+
- per-reaction arrays (``rxns``, ``kcat``, ``source``, ``notes``,
63+
``eccodes``) of length ``n_rxns``;
64+
- per-enzyme arrays (``genes``, ``enzymes``, ``mw``, ``sequence``,
65+
``concs``) of length ``n_enzymes``.
66+
67+
Connected by the sparse ``rxn_enz_mat`` of shape ``(n_rxns, n_enzymes)``
68+
whose ``[i, j]`` entry is the subunit count of enzyme j in reaction i
69+
(typically 0 or 1; >1 for heteromeric complexes).
70+
71+
Sentinels (mirror MATLAB GECKO):
72+
73+
- ``kcat == 0`` means "no kcat assigned" (zero is the unset state;
74+
real turnover numbers are always positive).
75+
- ``mw == nan`` means "MW unknown" (the writer omits NaN mw entries).
76+
- ``concs == nan`` means "not measured" (the writer omits NaN concs).
77+
- empty strings in ``source`` / ``notes`` / ``eccodes`` / ``sequence``
78+
are omitted on write and restored as ``""`` on load.
79+
80+
``gecko_light`` marks the gecko-light layout: cobra reactions stay
81+
singular, ec.rxns carries one entry per isozyme distinguished by a
82+
``###_`` counter prefix, and per-enzyme ``prot_<id>`` / usage reactions
83+
are skipped in favour of the shared protein pool. ``False`` is the
84+
default (full layout, where ``ec.rxns`` matches cobra reactions
85+
one-to-one after isozyme expansion).
86+
"""
87+
gecko_light: bool = False
88+
rxns: list[str] = field(default_factory=list)
89+
kcat: np.ndarray = field(default_factory=lambda: np.empty(0, dtype=float))
90+
source: list[str] = field(default_factory=list)
91+
notes: list[str] = field(default_factory=list)
92+
eccodes: list[str] = field(default_factory=list)
93+
genes: list[str] = field(default_factory=list)
94+
enzymes: list[str] = field(default_factory=list)
95+
mw: np.ndarray = field(default_factory=lambda: np.empty(0, dtype=float))
96+
sequence: list[str] = field(default_factory=list)
97+
concs: np.ndarray = field(default_factory=lambda: np.empty(0, dtype=float))
98+
rxn_enz_mat: sparse.csr_matrix = field(
99+
default_factory=lambda: sparse.csr_matrix((0, 0), dtype=float),
100+
)
101+
102+
@property
103+
def n_rxns(self) -> int:
104+
return len(self.rxns)
105+
106+
@property
107+
def n_enzymes(self) -> int:
108+
return len(self.enzymes)
109+
110+
111+
# --------------------------------------------------------------------------- #
112+
# Load
113+
# --------------------------------------------------------------------------- #
114+
115+
def ec_data_from_yaml_sections(sections: dict) -> EcData | None:
116+
"""Build an ``EcData`` from the ec-* top-level YAML sections.
117+
118+
Returns ``None`` when ``ec-rxns`` and ``ec-enzymes`` are both absent —
119+
the caller treats that as "this YAML is not an ec-model" and leaves
120+
``model.ec = None``. If exactly one of the two is present, the YAML
121+
is malformed: raise ValueError.
122+
123+
``sections`` is the dict of foreign top-level keys captured by the YAML
124+
loader. ``gecko_light`` defaults to ``False`` when the key is absent.
125+
"""
126+
has_rxns = "ec-rxns" in sections
127+
has_enzymes = "ec-enzymes" in sections
128+
if not has_rxns and not has_enzymes:
129+
return None
130+
if has_rxns != has_enzymes:
131+
missing = "ec-enzymes" if has_rxns else "ec-rxns"
132+
raise ValueError(
133+
f"ecModel YAML is missing the `{missing}` top-level section; "
134+
"both ec-rxns and ec-enzymes are required."
135+
)
136+
137+
return _build_ec_data(
138+
sections["ec-rxns"],
139+
sections["ec-enzymes"],
140+
gecko_light=bool(sections.get("gecko_light", False)),
141+
)
142+
143+
144+
def _build_ec_data(
145+
ec_rxns_raw: list,
146+
ec_enzymes_raw: list,
147+
*,
148+
gecko_light: bool,
149+
) -> EcData:
150+
"""Construct an ``EcData`` from the parsed YAML lists.
151+
152+
Missing optional fields are filled with sentinels (NaN for mw/concs,
153+
empty string for source/notes/eccodes/sequence, 0.0 for kcat).
154+
Validates that every enzyme referenced from an ec-rxns row exists in
155+
ec-enzymes; raises ValueError otherwise (catches a common authoring
156+
bug where the two sections drifted out of sync).
157+
"""
158+
n_e = len(ec_enzymes_raw)
159+
genes = [str(e["genes"]) for e in ec_enzymes_raw]
160+
enzymes = [str(e["enzymes"]) for e in ec_enzymes_raw]
161+
mw = np.array(
162+
[float(e.get("mw", np.nan)) for e in ec_enzymes_raw], dtype=float,
163+
)
164+
sequence = [str(e.get("sequence", "")) for e in ec_enzymes_raw]
165+
concs = np.array(
166+
[float(e.get("concs", np.nan)) for e in ec_enzymes_raw], dtype=float,
167+
)
168+
169+
enz_index = {eid: i for i, eid in enumerate(enzymes)}
170+
171+
n_r = len(ec_rxns_raw)
172+
rxns = [str(r["id"]) for r in ec_rxns_raw]
173+
# 0 == "no kcat assigned"; real turnover numbers are always positive.
174+
kcat = np.array(
175+
[float(r.get("kcat", 0.0)) for r in ec_rxns_raw], dtype=float,
176+
)
177+
source = [str(r.get("source", "")) for r in ec_rxns_raw]
178+
notes = [str(r.get("notes", "")) for r in ec_rxns_raw]
179+
eccodes = [_canonicalize_eccodes(r.get("eccodes", "")) for r in ec_rxns_raw]
180+
181+
mat = sparse.lil_matrix((n_r, n_e), dtype=float)
182+
for i, r in enumerate(ec_rxns_raw):
183+
for enz_id, stoich in (r.get("enzymes") or {}).items():
184+
j = enz_index.get(str(enz_id))
185+
if j is None:
186+
raise ValueError(
187+
f"ec-rxns[{i}] (id={r.get('id')!r}) references enzyme "
188+
f"{enz_id!r} that is not present in ec-enzymes."
189+
)
190+
mat[i, j] = float(stoich)
191+
192+
return EcData(
193+
gecko_light=gecko_light,
194+
rxns=rxns,
195+
kcat=kcat,
196+
source=source,
197+
notes=notes,
198+
eccodes=eccodes,
199+
genes=genes,
200+
enzymes=enzymes,
201+
mw=mw,
202+
sequence=sequence,
203+
concs=concs,
204+
rxn_enz_mat=mat.tocsr(),
205+
)
206+
207+
208+
# --------------------------------------------------------------------------- #
209+
# Save
210+
# --------------------------------------------------------------------------- #
211+
212+
def ec_data_to_yaml_sections(ec: EcData) -> dict[str, Any]:
213+
"""Serialise an ``EcData`` to a dict suitable for YAML emission.
214+
215+
Returns a fresh dict with three keys: ``gecko_light`` (bool),
216+
``ec-rxns`` (list of mappings), ``ec-enzymes`` (list of mappings).
217+
Values are native Python primitives — no numpy/ruamel scalars — so
218+
the YAML writer can dump them directly without further coercion.
219+
220+
Empty optional fields are omitted to keep the file compact; the
221+
loader fills them back in.
222+
"""
223+
return {
224+
"gecko_light": bool(ec.gecko_light),
225+
"ec-rxns": _build_ec_rxns_list(ec),
226+
"ec-enzymes": _build_ec_enzymes_list(ec),
227+
}
228+
229+
230+
def _build_ec_rxns_list(ec: EcData) -> list[dict[str, Any]]:
231+
"""Translate per-rxn ec fields + ``rxn_enz_mat`` rows to the
232+
list-of-mappings YAML form.
233+
234+
Empty ``source`` / ``notes`` / ``eccodes`` strings are omitted.
235+
``kcat`` is always written: a real turnover number when set,
236+
otherwise ``0`` (0 marks "no kcat assigned").
237+
"""
238+
coo = ec.rxn_enz_mat.tocoo()
239+
per_row_enzymes: list[dict[str, float]] = [{} for _ in range(ec.n_rxns)]
240+
for i, j, v in zip(coo.row, coo.col, coo.data):
241+
per_row_enzymes[int(i)][ec.enzymes[int(j)]] = float(v)
242+
243+
out: list[dict[str, Any]] = []
244+
for i in range(ec.n_rxns):
245+
entry: dict[str, Any] = {
246+
"id": ec.rxns[i],
247+
"kcat": float(ec.kcat[i]),
248+
}
249+
if ec.source[i]:
250+
entry["source"] = ec.source[i]
251+
if ec.notes[i]:
252+
entry["notes"] = ec.notes[i]
253+
if ec.eccodes[i]:
254+
entry["eccodes"] = _eccodes_to_yaml(ec.eccodes[i])
255+
entry["enzymes"] = per_row_enzymes[i]
256+
out.append(entry)
257+
return out
258+
259+
260+
def _build_ec_enzymes_list(ec: EcData) -> list[dict[str, Any]]:
261+
"""Translate per-enzyme ec fields to the list-of-mappings YAML form.
262+
263+
NaN ``mw`` / ``concs`` and empty ``sequence`` are omitted; the loader
264+
restores them as NaN / empty string.
265+
"""
266+
out: list[dict[str, Any]] = []
267+
for j in range(ec.n_enzymes):
268+
entry: dict[str, Any] = {
269+
"genes": ec.genes[j],
270+
"enzymes": ec.enzymes[j],
271+
}
272+
if not math.isnan(ec.mw[j]):
273+
entry["mw"] = float(ec.mw[j])
274+
if ec.sequence[j]:
275+
entry["sequence"] = ec.sequence[j]
276+
if not math.isnan(ec.concs[j]):
277+
entry["concs"] = float(ec.concs[j])
278+
out.append(entry)
279+
return out
280+
281+
282+
# --------------------------------------------------------------------------- #
283+
# eccodes representation helpers
284+
# --------------------------------------------------------------------------- #
285+
286+
def _canonicalize_eccodes(value) -> str:
287+
"""Coerce an EC-codes field to a single `;`-joined string.
288+
289+
The schema accepts either a scalar string (`"1.1.1.1"`) or a list of
290+
strings (`["1.1.1.1", "1.1.99.40"]`); both round-trip to the same
291+
internal representation.
292+
"""
293+
if value is None:
294+
return ""
295+
if isinstance(value, str):
296+
return value
297+
return ";".join(str(v) for v in value)
298+
299+
300+
def _eccodes_to_yaml(eccodes: str):
301+
"""Convert the internal `;`-joined eccodes string back to the YAML form:
302+
a scalar string for one EC, a list for multiple."""
303+
parts = [p for p in eccodes.split(";") if p]
304+
if len(parts) <= 1:
305+
return eccodes
306+
return parts

0 commit comments

Comments
 (0)