Commit 8e31c9a

io.yaml: document the format + accept legacy geckoLight-in-metaData
Adds docs/reference/yaml_format.md as the canonical schema reference for the cross-toolchain YAML format (cobrapy / raven-python / RAVEN MATLAB). Covers the full document shape, per-entry field order, RAVEN extensions, the GECKO ec-* sections, the metaData provenance block, number / string / quoting rules, and the cross-reader interoperability matrix. Linked from docs/reference/index.md and the I/O guide.

Reader fix: pre-shim RAVEN MATLAB writers emitted GECKO models with geckoLight: "true" inside the metaData block (not as a top-level gecko_light). The reader now surfaces that legacy key at the top level, leaving the metaData entry untouched, so model.ec.gecko_light is populated whichever placement the file used. Round-trip writes always use the new top-level form.

Regression tests:

- test_pre_shim_format_loads: synthetic fixture covering every legacy quirk we know about (--- doc marker, plain metaData, geckoLight inside metaData, top-level metabolite smiles, rxnNotes reaction key, integer bounds, double-quoted strings). Each quirk has its own assertion + comment.
- test_pre_shim_yeast_gem_loads_if_available: sanity-loads the real yeast-GEM.yml (2748 mets, 4102 rxns, 1143 genes) and asserts the documented preserved-counts table from the format reference. Skipped on CI runners where the working copy isn't mounted.
1 parent 36e74b4 commit 8e31c9a

5 files changed

Lines changed: 502 additions & 2 deletions

docs/guide/io_and_manipulation.md

Lines changed: 3 additions & 1 deletion
@@ -8,7 +8,9 @@ unchanged. On top of that it adds the RAVEN-specific formats:
88
- {func}`raven_python.io.read_yaml_model` / {func}`raven_python.io.write_yaml_model`
99
cobra-standard YAML (the `!!omap` layout), transparently handling `.yml.gz`. RAVEN-only and
1010
GECKO `ec-*` side-fields are preserved on each entry's `notes` so a read→write round-trip is
11-
lossless.
11+
lossless. The full schema (top-level layout, field order, quoting rules, the GECKO
12+
`ec-*` and `metaData` extensions) is documented in
13+
[the YAML model format reference](../reference/yaml_format.md).
1214
- {func}`raven_python.io.export_model_to_sif` — Cytoscape SIF (`rc` / `rr` / `cc` graphs).
1315
- {func}`raven_python.io.export_to_excel` — the RAVEN 5-sheet workbook (RXNS / METS / COMPS /
1416
GENES / MODEL). Requires the `excel` extra. Excel **import** is intentionally not provided.

docs/reference/index.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,9 @@ Conceptual and API reference for raven-python.
55
- **[RAVEN ↔ raven-python migration map](migration.md)** — the function-by-function map
66
from MATLAB RAVEN to raven-python (and cobrapy where appropriate). Start here if you're
77
porting RAVEN code.
8+
- **[YAML model format](yaml_format.md)** — the shared YAML schema produced and consumed
9+
by cobrapy, raven-python, and RAVEN MATLAB, with a fully-annotated example and the
10+
field-order / quoting rules.
811
- **[MATLAB RAVEN back-port proposals](matlab_raven_backports.md)** — improvements
912
raven-python makes that are candidates to back-port into the MATLAB toolbox.
1013
- **[Improvements over RAVEN](improvements.md)** — the full catalogue of correctness /
@@ -16,6 +19,7 @@ Conceptual and API reference for raven-python.
1619
:hidden:
1720
1821
migration
22+
yaml_format
1923
matlab_raven_backports
2024
improvements
2125
api/index

docs/reference/yaml_format.md

Lines changed: 332 additions & 0 deletions
@@ -0,0 +1,332 @@
1+
# RAVEN / cobrapy YAML model format
2+
3+
This document describes the YAML format produced and consumed by
4+
5+
- **cobrapy** ([`cobra.io.{load,save}_yaml_model`](https://github.com/opencobra/cobrapy))
6+
- **raven-python** (`raven_python.io.yaml.{read,write}_yaml_model`, see [API](api/index.md))
7+
- **RAVEN MATLAB** (`readYAMLmodel.m` / `writeYAMLmodel.m` in the [RAVEN repo](https://github.com/SysBioChalmers/RAVEN/tree/feat/yeast-gem-shared/io))
8+
9+
The same file can be round-tripped through any of the three. cobrapy is the canonical core; raven-python and RAVEN MATLAB add namespaced extensions (RAVEN curation fields, MIRIAM cross-refs already covered by cobrapy's `annotation`, and the GECKO `ec-*` sections) without disturbing the cobra-known shape.
10+
11+
---
12+
13+
## At a glance
14+
15+
```yaml
16+
!!omap
17+
- metaData: !!omap
18+
- id: yeastGEM_develop
19+
- name: The Consensus Genome-Scale Metabolic Model of Yeast
20+
- version: 9.0.0
21+
- date: 2026-05-27
22+
- taxonomy: taxonomy/559292
23+
- metabolites:
24+
- !!omap
25+
- id: s_0001
26+
- name: ATP
27+
- compartment: c
28+
- charge: -4
29+
- formula: C10H16N5O13P3
30+
- annotation: !!omap
31+
- kegg.compound: C00002
32+
- smiles: "[O-]P(=O)([O-])OP(=O)([O-])O..."
33+
- inchis: InChI=1S/C10H16N5O13P3/...
34+
- deltaG: -2768.1
35+
- reactions:
36+
- !!omap
37+
- id: r_0001
38+
- name: hexokinase
39+
- metabolites: !!omap
40+
- s_0001: -1.0
41+
- s_0568: -1.0
42+
- s_0394: 1.0
43+
- s_0423: 1.0
44+
- lower_bound: 0.0
45+
- upper_bound: 1000.0
46+
- gene_reaction_rule: YGL253W or YCL040W or YFR053C
47+
- subsystem: Glycolysis / Gluconeogenesis
48+
- notes: "MetaNetX ID curated (PR #220)"
49+
- annotation: !!omap
50+
- kegg.reaction: R00299
51+
- sbo: SBO:0000176
52+
- eccodes: 2.7.1.1
53+
- deltaG: -17.39
54+
- confidence_score: 2.0
55+
- genes:
56+
- !!omap
57+
- id: YGL253W
58+
- name: HXK2
59+
- annotation: !!omap
60+
- uniprot: P04807
61+
- compartments: !!omap
62+
- c: cytoplasm
63+
- e: extracellular
64+
```
65+
66+
Three structural rules are non-obvious and worth pointing out before the field-by-field detail:
67+
68+
1. The whole document is one **ordered mapping** — `!!omap` — at the root. Every nested map that should preserve key order is also `!!omap` (metaData, each metabolite / reaction / gene entry, `annotation`, `metabolites`, `compartments`, and the ec sections).
69+
2. Each metabolite, reaction, and gene is **one `- !!omap` element** of a list. Inside that mapping, every field is written as `- key: value`. This is cobrapy's native shape and is what RAVEN MATLAB's reader keys off.
70+
3. Strings are **unquoted by default**; quotes appear only when YAML would otherwise misparse the value (leading `-`, `[`, `?` or `:`; embedded `: ` or ` #`; values that look like `true` / `false` / `null`).
71+
72+
---
73+
74+
## Top-level layout
75+
76+
```
77+
!!omap
78+
- metaData: !!omap # optional; RAVEN extension
79+
- metabolites: # required
80+
- reactions: # required
81+
- genes: # required (may be `genes: []`)
82+
- compartments: !!omap # required
83+
- gecko_light: <bool> # optional; GECKO extension
84+
- ec-rxns: # optional; GECKO extension
85+
- ec-enzymes: # optional; GECKO extension
86+
```
87+
88+
| Key | Required | Source | Notes |
89+
|-----|----------|--------|-------|
90+
| `metaData` | optional | RAVEN | Provenance block. Holds `id`, `name`, `version`, `date`, `taxonomy`, optionally `givenName` / `familyName` / `email` / `organization` / `note` / `sourceUrl`, plus `defaultLB` / `defaultUB`. Cobrapy ignores this block (no semantic loss for the core model). |
91+
| `metabolites` | yes | cobra core | Ordered list of `- !!omap` entries. |
92+
| `reactions` | yes | cobra core | Ordered list of `- !!omap` entries. |
93+
| `genes` | yes | cobra core | Ordered list; may be `genes: []` for a model with no genes. |
94+
| `compartments` | yes | cobra core | `!!omap` of `<code>: <full name>`. |
95+
| `gecko_light` | optional | GECKO | Scalar boolean. Cobrapy / raven-python emit this at the top level; the older spelling `geckoLight` inside `metaData` is still accepted on read. |
96+
| `ec-rxns` | optional | GECKO | Per-reaction kcat / source / enzymes coupling table. |
97+
| `ec-enzymes` | optional | GECKO | Per-enzyme MW / sequence / concentration table. |
98+
99+
Cobrapy writes `id` / `name` / `version` at the root level instead of inside `metaData`. The RAVEN readers accept both placements; the RAVEN writers normalize to the `metaData` form.
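
The two header placements can be reconciled with a small lookup. The following is a minimal sketch, not the raven-python implementation: `resolve_header` is a hypothetical helper, the document is assumed to be already parsed into plain dictionaries, and giving `metaData` precedence over the root is an assumption that the text above does not state.

```python
def resolve_header(doc: dict) -> dict:
    """Collect id / name / version from metaData or, failing that, the root.

    Hypothetical helper; precedence (metaData first) is an assumption.
    """
    meta = doc.get("metaData") or {}
    header = {}
    for key in ("id", "name", "version"):
        if key in meta:
            header[key] = meta[key]
        elif key in doc:
            header[key] = doc[key]
    return header


# cobrapy-style file: header fields at the root.
cobra_style = {"id": "yeastGEM_develop", "version": "9.0.0", "metabolites": []}
# RAVEN-style file: header fields inside metaData.
raven_style = {
    "metaData": {"id": "yeastGEM_develop", "version": "9.0.0"},
    "metabolites": [],
}

# Both placements resolve to the same header dictionary.
same = resolve_header(cobra_style) == resolve_header(raven_style)
```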
100+
101+
---
102+
103+
## Metabolite entry
104+
105+
Field order (cobra-core first, then RAVEN extensions):
106+
107+
```yaml
108+
- !!omap
109+
- id: s_0001 # required
110+
- name: ATP # cobra
111+
- compartment: c # cobra
112+
- charge: -4 # cobra (number)
113+
- formula: C10H16N5O13P3 # cobra
114+
- notes: "free-text" # cobra
115+
- annotation: !!omap # cobra (MIRIAM + smiles)
116+
- kegg.compound: C00002
117+
- chebi:
118+
- CHEBI:15422
119+
- CHEBI:30616
120+
- sbo: SBO:0000247
121+
- smiles: "OC1=NC..." # quoted when it contains [ ] : etc.
122+
- inchis: "InChI=1S/..." # RAVEN extension
123+
- deltaG: -2768.1 # RAVEN extension
124+
- metFrom: KEGG # RAVEN extension
125+
```
126+
127+
Cobrapy emits exactly the first seven keys (the cobra-core block). raven-python and RAVEN MATLAB additionally emit `inchis`, `deltaG`, and `metFrom` when those fields are populated. On read, cobrapy puts the RAVEN extensions on the metabolite as attribute fall-through; raven-python captures them into `metabolite.notes` (keyed by their YAML name); RAVEN MATLAB stores them on `model.inchis` / `model.metDeltaG` / `model.metFrom`.
128+
129+
Annotation entries with multiple values are emitted as a YAML list (`chebi:` then several `-` items). Single-value entries are emitted inline (`kegg.compound: C00002`). SMILES strings live inside the annotation block under the `smiles` key — not as a top-level metabolite field, which is the historical RAVEN MATLAB shape and is still accepted on read for backward compatibility.
130+
131+
---
132+
133+
## Reaction entry
134+
135+
```yaml
136+
- !!omap
137+
- id: r_0001 # required
138+
- name: hexokinase # cobra
139+
- metabolites: !!omap # cobra (sorted by met id)
140+
- s_0001: -1.0
141+
- s_0394: 1.0
142+
- lower_bound: 0.0 # cobra (number)
143+
- upper_bound: 1000.0 # cobra (number)
144+
- gene_reaction_rule: YGL253W or YCL040W # cobra
145+
- objective_coefficient: 1 # cobra; omitted when 0
146+
- subsystem: Glycolysis / Gluconeogenesis # cobra
147+
- notes: "MetaNetX ID curated (PR #220)" # cobra
148+
- annotation: !!omap # cobra
149+
- kegg.reaction: R00299
150+
- sbo: SBO:0000176
151+
- eccodes: # RAVEN extension
152+
- 2.7.1.1
153+
- 2.7.1.2
154+
- references: "PMID:12345" # RAVEN extension
155+
- rxnFrom: KEGG # RAVEN extension
156+
- deltaG: -17.39 # RAVEN extension
157+
- confidence_score: 2.0 # RAVEN extension
158+
```
159+
160+
Some fields are conditional:
161+
162+
- `objective_coefficient` is only written when non-zero (cobrapy convention).
163+
- The `metabolites` block uses `!!omap []` (flow-style empty omap) when the reaction has no metabolites — this keeps the file a valid YAML 1.2 document.
164+
- `eccodes` is written inline (`eccodes: 2.7.1.1`) when there is exactly one code, and as a list when there are several. Same for `references`.
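
The conditional rules above could be expressed roughly as follows. This is an illustrative sketch under stated assumptions: `reaction_optional_fields` and `as_list` are hypothetical names, not functions of cobrapy or raven-python, and only the `objective_coefficient` and `eccodes` rules are modelled.

```python
from __future__ import annotations


def reaction_optional_fields(objective_coefficient: float, eccodes: list[str]) -> list[tuple[str, object]]:
    """Return the (key, value) pairs a writer would emit for these two fields."""
    fields: list[tuple[str, object]] = []
    # objective_coefficient is omitted entirely when it is zero.
    if objective_coefficient != 0:
        fields.append(("objective_coefficient", objective_coefficient))
    # A single EC code is written inline; several are written as a list.
    if len(eccodes) == 1:
        fields.append(("eccodes", eccodes[0]))
    elif eccodes:
        fields.append(("eccodes", list(eccodes)))
    return fields


def as_list(value) -> list:
    """Reader-side counterpart: accept either the inline or the list form."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]
```

Normalizing to a list on read keeps downstream code from having to care which of the two spellings the file used.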
165+
166+
**Notes key naming.** Cobrapy and the current raven-python / RAVEN MATLAB writers use **`notes`**. Pre-`feat/yeast-gem-shared` yeast-GEM files used `rxnNotes`; both readers accept that as a legacy alias.
167+
168+
**Bounds typing.** Bounds are emitted as floats with an explicit decimal point (`1000.0`, `-1000.0`), matching Python's float repr and cobrapy's output.
169+
170+
---
171+
172+
## Gene entry
173+
174+
```yaml
175+
- !!omap
176+
- id: YGL253W # required
177+
- name: HXK2 # cobra; omitted when empty
178+
- annotation: !!omap # cobra
179+
- uniprot: P04807
180+
- ncbigene: 856421
181+
- protein: P04807 # RAVEN extension
182+
```
183+
184+
Empty names (`name: ''`) are not emitted (matches RAVEN MATLAB's historical behavior).
185+
186+
---
187+
188+
## Compartments
189+
190+
```yaml
191+
- compartments: !!omap
192+
- c: cytoplasm
193+
- e: extracellular
194+
- m: mitochondrion
195+
```
196+
197+
Just an `!!omap` of `<short code>: <human-readable name>` pairs. Compartments don't carry their own MIRIAMs in the current format.
198+
199+
---
200+
201+
## metaData (RAVEN extension)
202+
203+
```yaml
204+
- metaData: !!omap
205+
- id: yeastGEM_develop
206+
- name: The Consensus Genome-Scale Metabolic Model of Yeast
207+
- version: 9.0.0
208+
- date: 2026-05-27
209+
- defaultLB: -1000.0
210+
- defaultUB: 1000.0
211+
- givenName: Eduard
212+
- familyName: Kerkhoven
213+
- email: eduardk@chalmers.se
214+
- organization: Chalmers University of Technology
215+
- taxonomy: taxonomy/559292
216+
- note: "Saccharomyces cerevisiae - strain S288C"
217+
- sourceUrl: https://github.com/SysBioChalmers/yeast-GEM
218+
```
219+
220+
Pure provenance. Cobrapy ignores the block; raven-python keeps the verbatim dictionary on `model.notes['metaData']` and additionally lifts `id` / `name` / `version` to `model.id` / `model.name` / `model.notes['version']` so cobra-shape accessors find them. RAVEN MATLAB populates `model.id` / `model.name` / `model.version` / `model.annotation.*` from the same fields.
221+
222+
`date` is preserved across round-trips when present on the model; otherwise the writer fills in the current date in `YYYY-MM-DD` form.
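
As a rough illustration of the date rule, assuming a hypothetical helper `resolve_date` (the name and signature are not taken from either writer):

```python
import datetime


def resolve_date(existing) -> str:
    """Keep an existing date; otherwise fall back to today's date."""
    if existing:
        # str() also covers datetime.date objects, which some YAML
        # loaders produce for unquoted values such as 2026-05-27.
        return str(existing)
    return datetime.date.today().isoformat()  # YYYY-MM-DD
```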
223+
224+
---
225+
226+
## GECKO sections
227+
228+
For enzyme-constrained models, three additional top-level keys carry the EC layer:
229+
230+
```yaml
231+
- gecko_light: false # true for the "light" formulation
232+
- ec-rxns:
233+
- !!omap
234+
- id: r_0001
235+
- kcat: 25.3
236+
- source: brenda
237+
- notes: ""
238+
- eccodes: 2.7.1.1
239+
- enzymes: !!omap
240+
- P04807: 1.0
241+
- ec-enzymes:
242+
- !!omap
243+
- genes: YGL253W
244+
- enzymes: P04807
245+
- mw: 53942
246+
- sequence: "MVHLGPK..."
247+
- concs: .nan
248+
```
249+
250+
These map onto `model.ec` in RAVEN MATLAB and `raven_python.io.ec_data.EcData` (attached as `model.ec`) in raven-python. Cobrapy ignores the sections.
251+
252+
The older spelling `geckoLight` inside `metaData` is also accepted on read.
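
The legacy handling can be sketched as a standalone function. It mirrors the reader change in this commit but is written against plain dictionaries; `lift_legacy_gecko_light` is a hypothetical name, and unlike the reader it returns a copy rather than mutating its input.

```python
def lift_legacy_gecko_light(metadata: dict, top_level: dict) -> dict:
    """Surface a legacy metaData `geckoLight` as top-level `gecko_light`.

    The metaData entry itself is left untouched, and an explicit
    top-level `gecko_light` always wins over the legacy key.
    """
    result = dict(top_level)
    legacy = metadata.get("geckoLight")
    if legacy is not None and "gecko_light" not in result:
        if isinstance(legacy, str):
            # Older files stored the flag as the string "true" / "false".
            result["gecko_light"] = legacy.lower() == "true"
        else:
            result["gecko_light"] = bool(legacy)
    return result
```

Comparing the lowered string against `"true"` matters here: `bool("false")` is `True` in Python, so a plain `bool()` cast would misread the legacy string form.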
253+
254+
---
255+
256+
## Annotations
257+
258+
The `annotation` block uses MIRIAM-style namespace keys. Cobrapy treats the block as a free-form dictionary; raven-python preserves it verbatim through `cobra.Metabolite.annotation` / `Reaction.annotation` / `Gene.annotation`; RAVEN MATLAB maps it to `model.metMiriams` / `rxnMiriams` / `geneMiriams`.
259+
260+
- A single value is written inline: `kegg.compound: C00002`.
261+
- Multiple values are written as a YAML list:
262+
263+
```yaml
264+
- chebi:
265+
- CHEBI:15422
266+
- CHEBI:30616
267+
```
268+
269+
- The `smiles` key inside a metabolite's `annotation` carries the SMILES string (cobrapy convention). RAVEN MATLAB historically emitted `smiles` as a metabolite top-level field; both readers still accept that, but writes are normalized to the annotation block.
270+
- The `sbo` key carries the Systems Biology Ontology term assigned by `assignSBOterms` / `add_sbo_terms`.
271+
272+
---
273+
274+
## Numbers, strings, quoting
275+
276+
**Numbers.** Whole-number floats are written with an explicit `.0` (`1000.0`, `-1000.0`, `0.0`). Other floats use up to 15 significant digits (`-17.39`, `-2768.1`). `NaN` is encoded as `.nan`; `+Inf` / `-Inf` as `.inf` / `-.inf` (YAML 1.2 conventions).
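
A minimal sketch of these number rules, assuming a hypothetical `format_number` helper. Using the `.15g` format specifier is one way to obtain "up to 15 significant digits" and is an assumption, not necessarily what either writer does; values large enough to fall into exponent notation are left as formatted.

```python
import math


def format_number(x: float) -> str:
    """Format a number following the rules described above (sketch)."""
    if math.isnan(x):
        return ".nan"
    if math.isinf(x):
        return ".inf" if x > 0 else "-.inf"
    text = format(x, ".15g")
    # Whole numbers come out of ".15g" without a decimal point; add ".0".
    if "." not in text and "e" not in text:
        text += ".0"
    return text
```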
277+
278+
**Strings.** Default style is bare (no quotes). The writer falls back to double-quoted style when the value:
279+
280+
- starts with `-`, `?`, `:`, or another YAML indicator character (`[`, `]`, `{`, `}`, `,`, `&`, `*`, `!`, `|`, `>`, `%`, `@`, `` ` ``, `#`);
281+
- contains `: ` (would otherwise be parsed as a key/value), ` #` (comment), or one of `[`, `]`, `{`, `}`;
282+
- has leading or trailing whitespace;
283+
- spells a YAML reserved word case-insensitively (`true`, `false`, `null`, `yes`, `no`, `on`, `off`, `~`).
284+
285+
In a double-quoted string, only `\` and `"` are escaped. Other characters (including Unicode and newlines if the underlying model permitted them) are passed through.
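
The quoting rules above translate fairly directly into a predicate plus an emitter. This is a sketch under stated assumptions rather than either writer's actual code: `needs_quotes` and `emit_string` are hypothetical names, and quoting the empty string is an assumption suggested by the `notes: ""` line in the GECKO example, not a rule stated in the list.

```python
LEADING_INDICATORS = set("-?:[]{},&*!|>%@`#")
EMBEDDED_BRACKETS = set("[]{}")
RESERVED_WORDS = {"true", "false", "null", "yes", "no", "on", "off", "~"}


def needs_quotes(value: str) -> bool:
    """Return True when the bare (unquoted) style would be unsafe."""
    if not value:
        return True  # assumption: empty strings are written as ""
    if value[0] in LEADING_INDICATORS:
        return True
    if ": " in value or " #" in value:
        return True
    if any(ch in EMBEDDED_BRACKETS for ch in value):
        return True
    if value != value.strip():
        return True  # leading or trailing whitespace
    return value.lower() in RESERVED_WORDS


def emit_string(value: str) -> str:
    """Emit a scalar bare, or double-quoted with only \\ and " escaped."""
    if not needs_quotes(value):
        return value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

With these rules `CHEBI:15422` stays bare because the colon is not followed by a space, while `MetaNetX ID curated (PR #220)` is quoted because of the embedded ` #`.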
286+
287+
---
288+
289+
## Tooling interoperability matrix
290+
291+
| File written by ↓ \ Reader → | cobrapy | raven-python | RAVEN MATLAB |
292+
|---|---|---|---|
293+
| cobrapy (`save_yaml_model`) | full | full + extras land in `notes` via attribute fall-through | works for root-level `id` / `name` / `version` (added in this release) |
294+
| raven-python (`write_yaml_model`) | core (no `metaData`-derived `id`); RAVEN extras live as unknown top-level keys but don't break parsing | full | full |
295+
| RAVEN MATLAB (`writeYAMLmodel`) | core (no `metaData`-derived `id`); RAVEN extras land via attribute fall-through | full | full |
296+
297+
"Full" = every field read back into its canonical position on the model object; "core" = cobrapy-known fields, RAVEN extensions ignored or kept on the object as attribute fall-through (`reaction.eccodes` etc., not re-emitted on save). A round-trip through cobrapy is therefore **lossy for RAVEN extensions** — only the core fields survive `cobrapy.load → cobrapy.save`. Round-trips through raven-python or RAVEN MATLAB are lossless.
298+
299+
---
300+
301+
## What round-tripping looks like
302+
303+
Loading `yeast-GEM.yml` (2748 metabolites, 4102 reactions, 1143 genes) and re-writing it through any of the three tools preserves every documented piece of content:
304+
305+
| Count | After round-trip |
306+
|---|---|
307+
| metabolites | 2748 / 2748 |
308+
| reactions | 4102 / 4102 |
309+
| genes | 1143 / 1143 |
310+
| reactions with eccodes | 2411 |
311+
| reactions with deltaG | 3984 |
312+
| metabolites with deltaG | 2696 |
313+
| metabolites with SMILES | 1788 |
314+
| reactions with notes (rxnNotes) | 1443 |
315+
316+
(Cobrapy round-trips give 2748 / 4102 / 1143 for the core counts but do not preserve the RAVEN-extension rows of the table; that is the documented loss.)
317+
318+
---
319+
320+
## What changed in `feat/yeast-gem-shared`
321+
322+
- raven-python writer no longer drops `!!omap` tags (was producing files RAVEN MATLAB's reader couldn't load).
323+
- raven-python now preserves `eccodes` and accepts the legacy `rxnNotes` reaction key on read.
324+
- RAVEN MATLAB writer reorders metabolite / reaction fields to match cobrapy.
325+
- RAVEN MATLAB writer renames the reaction `rxnNotes` key to `notes` and emits SMILES inside the annotation block (still accepts both shapes on read).
326+
- RAVEN MATLAB writer's `preserveQuotes` default is now `false`; values that need quoting (SMILES with `[O-]`, leading flow indicators, booleans, `: `-containing strings) are quoted defensively per value.
327+
- RAVEN MATLAB writer emits whole-number bounds as `1000.0` (matches cobrapy / Python float repr) instead of `1000`.
328+
- RAVEN MATLAB reader accepts cobrapy's root-level `id` / `name` / `version` / `gecko_light`, the `!!omap`-tagged `metaData` header, and `notes` (canonical) in addition to `rxnNotes` (legacy).
329+
- Empty `reaction.metabolites` blocks are emitted as `!!omap []` (valid YAML 1.2) rather than an empty `!!omap` with no value.
330+
- Document-start marker `---` dropped to match cobrapy's bare `!!omap` root.
331+
332+
These changes are byte-stable for cobrapy and raven-python users; existing yeast-GEM YAML files continue to load. The first time a yeast-GEM curation pass rewrites the file with the new MATLAB writer, the diff will look large (because of the reordering and quote-style changes) but the model content is unchanged.

src/raven_python/io/yaml.py

Lines changed: 11 additions & 1 deletion
@@ -235,7 +235,17 @@ def model_from_yaml_data(raw: dict) -> cobra.Model:
235235
model.notes["version"] = version
236236

237237
# Pop the ec sections out of `foreign` and into a typed EcData.
238-
# The remaining unknown keys round-trip opaquely.
238+
# The remaining unknown keys round-trip opaquely. Pre-shim RAVEN
239+
# MATLAB writers wrote `geckoLight: "true"` inside metaData (rather
240+
# than the current top-level `gecko_light`); honour the legacy
241+
# placement too — keep the metaData entry untouched (round-trip)
242+
# and surface it at the top level so EcData picks it up.
243+
legacy_gecko = metadata.get("geckoLight")
244+
if legacy_gecko is not None and "gecko_light" not in foreign:
245+
if isinstance(legacy_gecko, str):
246+
foreign["gecko_light"] = legacy_gecko.lower() == "true"
247+
else:
248+
foreign["gecko_light"] = bool(legacy_gecko)
239249
ec_sections = {k: foreign.pop(k) for k in list(foreign) if k in _EC_TOP_KEYS}
240250
ec_data = ec_data_from_yaml_sections(ec_sections)
241251
if ec_data is not None:
