| 1 | +# RAVEN / cobrapy YAML model format |
| 2 | + |
| 3 | +This document describes the YAML format produced and consumed by: |
| 4 | + |
| 5 | +- **cobrapy** ([`cobra.io.{load,save}_yaml_model`](https://github.com/opencobra/cobrapy)) |
| 6 | +- **raven-python** (`raven_python.io.yaml.{read,write}_yaml_model`, see [API](api/index.md)) |
| 7 | +- **RAVEN MATLAB** (`readYAMLmodel.m` / `writeYAMLmodel.m` in the [RAVEN repo](https://github.com/SysBioChalmers/RAVEN/tree/feat/yeast-gem-shared/io)) |
| 8 | + |
| 9 | +The same file can be read and written by any of the three. cobrapy is the canonical core; raven-python and RAVEN MATLAB add namespaced extensions (RAVEN curation fields, the GECKO `ec-*` sections, and MIRIAM cross-references, which reuse cobrapy's existing `annotation` block) without disturbing the cobra-known shape.
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## At a glance |
| 14 | + |
| 15 | +```yaml |
| 16 | +!!omap |
| 17 | +- metaData: !!omap |
| 18 | +  - id: yeastGEM_develop |
| 19 | +  - name: The Consensus Genome-Scale Metabolic Model of Yeast |
| 20 | +  - version: 9.0.0 |
| 21 | +  - date: 2026-05-27 |
| 22 | +  - taxonomy: taxonomy/559292 |
| 23 | +- metabolites: |
| 24 | +  - !!omap |
| 25 | +    - id: s_0001 |
| 26 | +    - name: ATP |
| 27 | +    - compartment: c |
| 28 | +    - charge: -4 |
| 29 | +    - formula: C10H16N5O13P3 |
| 30 | +    - annotation: !!omap |
| 31 | +      - kegg.compound: C00002 |
| 32 | +      - smiles: "[O-]P(=O)([O-])OP(=O)([O-])O..." |
| 33 | +    - inchis: InChI=1S/C10H16N5O13P3/... |
| 34 | +    - deltaG: -2768.1 |
| 35 | +- reactions: |
| 36 | +  - !!omap |
| 37 | +    - id: r_0001 |
| 38 | +    - name: hexokinase |
| 39 | +    - metabolites: !!omap |
| 40 | +      - s_0001: -1.0 |
| 41 | +      - s_0568: -1.0 |
| 42 | +      - s_0394: 1.0 |
| 43 | +      - s_0423: 1.0 |
| 44 | +    - lower_bound: 0.0 |
| 45 | +    - upper_bound: 1000.0 |
| 46 | +    - gene_reaction_rule: YGL253W or YCL040W or YFR053C |
| 47 | +    - subsystem: Glycolysis / Gluconeogenesis |
| 48 | +    - notes: "MetaNetX ID curated (PR #220)" |
| 49 | +    - annotation: !!omap |
| 50 | +      - kegg.reaction: R00299 |
| 51 | +      - sbo: SBO:0000176 |
| 52 | +    - eccodes: 2.7.1.1 |
| 53 | +    - deltaG: -17.39 |
| 54 | +    - confidence_score: 2.0 |
| 55 | +- genes: |
| 56 | +  - !!omap |
| 57 | +    - id: YGL253W |
| 58 | +    - name: HXK2 |
| 59 | +    - annotation: !!omap |
| 60 | +      - uniprot: P04807 |
| 61 | +- compartments: !!omap |
| 62 | +  - c: cytoplasm |
| 63 | +  - e: extracellular |
| 64 | +``` |
| 65 | +
|
| 66 | +Three structural rules are non-obvious and worth pointing out before the field-by-field detail: |
| 67 | +
|
| 68 | +1. The whole document is one **ordered mapping** (`!!omap`) at the root. Every nested map that should preserve key order is also `!!omap` (`metaData`, each metabolite / reaction / gene entry, `annotation`, a reaction's `metabolites`, `compartments`, and the entries of the `ec-*` sections).
| 69 | +2. Each metabolite, reaction, and gene is **one `- !!omap` element** of a list. Inside that mapping, every field is written as `- key: value`. This is cobrapy's native shape and is what RAVEN MATLAB's reader keys off. |
| 70 | +3. Strings are **unquoted by default**; quotes appear only when YAML would otherwise misparse the value (leading `-`, `[`, `?` or `:`; embedded `: ` or ` #`; values that look like `true` / `false` / `null`). |
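
The nested `!!omap` shape described in rules 1 and 2 is easy to flatten once it has been parsed. The following is a minimal sketch, assuming the YAML parser hands back each `!!omap` as a list of `(key, value)` tuples or as a list of single-key mappings; `omap_to_dict` and `_as_pair` are hypothetical helper names and are not part of cobrapy, raven-python, or RAVEN.

```python
def _as_pair(item):
    """Return (key, value) if item looks like one omap entry, else None."""
    if isinstance(item, tuple) and len(item) == 2:
        return item
    if isinstance(item, dict) and len(item) == 1:
        return next(iter(item.items()))
    return None


def omap_to_dict(node):
    """Recursively convert parsed !!omap nodes into insertion-ordered dicts."""
    if isinstance(node, list):
        pairs = [_as_pair(item) for item in node]
        if node and all(pair is not None for pair in pairs):
            # Every element is a key/value entry: treat the list as an omap.
            return {key: omap_to_dict(value) for key, value in pairs}
        # Plain sequence, e.g. the metabolites list or several chebi ids.
        # An empty omap (`!!omap []`) also ends up here and stays an empty list.
        return [omap_to_dict(item) for item in node]
    return node


parsed = [
    ("metabolites", [[("id", "s_0001"),
                      ("annotation", [("chebi", ["CHEBI:15422", "CHEBI:30616"])])]]),
    ("compartments", [{"c": "cytoplasm"}, {"e": "extracellular"}]),
]
model = omap_to_dict(parsed)
```

Python dictionaries keep insertion order, so the key order of the file survives the conversion.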
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Top-level layout |
| 75 | + |
| 76 | +``` |
| 77 | +!!omap |
| 78 | +- metaData: !!omap # optional; RAVEN extension |
| 79 | +- metabolites: # required |
| 80 | +- reactions: # required |
| 81 | +- genes: # required (may be `genes: []`) |
| 82 | +- compartments: !!omap # required |
| 83 | +- gecko_light: <bool> # optional; GECKO extension |
| 84 | +- ec-rxns: # optional; GECKO extension |
| 85 | +- ec-enzymes: # optional; GECKO extension |
| 86 | +``` |
| 87 | +
|
| 88 | +| Key | Required | Source | Notes | |
| 89 | +|-----|----------|--------|-------| |
| 90 | +| `metaData` | optional | RAVEN | Provenance block. Holds `id`, `name`, `version`, `date`, `taxonomy`, optionally `givenName` / `familyName` / `email` / `organization` / `note` / `sourceUrl`, plus `defaultLB` / `defaultUB`. Cobrapy ignores this block (no semantic loss for the core model). | |
| 91 | +| `metabolites` | yes | cobra core | Ordered list of `- !!omap` entries. | |
| 92 | +| `reactions` | yes | cobra core | Ordered list of `- !!omap` entries. | |
| 93 | +| `genes` | yes | cobra core | Ordered list; may be `genes: []` for a model with no genes. | |
| 94 | +| `compartments` | yes | cobra core | `!!omap` of `<code>: <full name>`. | |
| 95 | +| `gecko_light` | optional | GECKO | Scalar boolean. Cobrapy / raven-python emit this at the top level; the older spelling `geckoLight` inside `metaData` is still accepted on read. | |
| 96 | +| `ec-rxns` | optional | GECKO | Per-reaction kcat / source / enzymes coupling table. | |
| 97 | +| `ec-enzymes` | optional | GECKO | Per-enzyme MW / sequence / concentration table. | |
| 98 | +
|
| 99 | +Cobrapy writes `id` / `name` / `version` at the root level instead of inside `metaData`. The RAVEN readers accept both placements; the RAVEN writers normalize to the `metaData` form. |
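
A reader that accepts both placements might resolve the header fields along these lines. This is only a sketch over an already-flattened document (plain dictionaries); `resolve_header` is a hypothetical name, and the precedence of `metaData` over the root level when both are present is an assumption, since the text above does not specify it.

```python
def resolve_header(doc):
    """Collect id / name / version / gecko_light from either placement."""
    meta = doc.get("metaData") or {}
    header = {}
    for key in ("id", "name", "version"):
        # Assumption: the metaData block wins when both placements exist.
        if key in meta:
            header[key] = meta[key]
        elif key in doc:
            header[key] = doc[key]
    # Top-level `gecko_light` is the current spelling; `geckoLight` inside
    # metaData is the older one that is still accepted on read.
    if "gecko_light" in doc:
        header["gecko_light"] = bool(doc["gecko_light"])
    elif "geckoLight" in meta:
        header["gecko_light"] = bool(meta["geckoLight"])
    return header


cobra_style = resolve_header({"id": "m", "name": "n", "version": "1"})
raven_style = resolve_header({"metaData": {"id": "yeastGEM_develop", "geckoLight": True}})
```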
| 100 | +
|
| 101 | +--- |
| 102 | +
|
| 103 | +## Metabolite entry |
| 104 | +
|
| 105 | +Field order (cobra-core first, then RAVEN extensions): |
| 106 | +
|
| 107 | +```yaml |
| 108 | +- !!omap |
| 109 | +  - id: s_0001 # required |
| 110 | +  - name: ATP # cobra |
| 111 | +  - compartment: c # cobra |
| 112 | +  - charge: -4 # cobra (number) |
| 113 | +  - formula: C10H16N5O13P3 # cobra |
| 114 | +  - notes: "free-text" # cobra |
| 115 | +  - annotation: !!omap # cobra (MIRIAM + smiles) |
| 116 | +    - kegg.compound: C00002 |
| 117 | +    - chebi: |
| 118 | +      - CHEBI:15422 |
| 119 | +      - CHEBI:30616 |
| 120 | +    - sbo: SBO:0000247 |
| 121 | +    - smiles: "OC1=NC..." # quoted when it contains [ ] : etc. |
| 122 | +  - inchis: "InChI=1S/..." # RAVEN extension |
| 123 | +  - deltaG: -2768.1 # RAVEN extension |
| 124 | +  - metFrom: KEGG # RAVEN extension |
| 125 | +``` |
| 126 | + |
| 127 | +Cobrapy emits exactly the first seven keys (the cobra-core block). raven-python and RAVEN MATLAB additionally emit `inchis`, `deltaG`, and `metFrom` when those fields are populated. On read, cobrapy puts the RAVEN extensions on the metabolite as attribute fall-through; raven-python captures them into `metabolite.notes` (keyed by their YAML name); RAVEN MATLAB stores them on `model.inchis` / `model.metDeltaG` / `model.metFrom`. |
| 128 | + |
| 129 | +Annotation entries with multiple values are emitted as a YAML list (`chebi:` then several `-` items). Single-value entries are emitted inline (`kegg.compound: C00002`). SMILES strings live inside the annotation block under the `smiles` key — not as a top-level metabolite field, which is the historical RAVEN MATLAB shape and is still accepted on read for backward compatibility. |
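
The split between the cobra-core block and the RAVEN extensions, together with the legacy top-level `smiles` placement, can be sketched as below. The input is assumed to be a metabolite entry already flattened to a plain dictionary; `split_metabolite` and `CORE_KEYS` are hypothetical names used for illustration, not the actual reader code of any of the three tools.

```python
# The seven cobra-core keys, in the order shown in the entry above.
CORE_KEYS = ("id", "name", "compartment", "charge", "formula", "notes", "annotation")


def split_metabolite(entry):
    """Return (core, extras) without modifying the input entry."""
    core = {key: entry[key] for key in CORE_KEYS if key in entry}
    extras = {key: value for key, value in entry.items()
              if key not in CORE_KEYS and key != "smiles"}
    if "smiles" in entry:
        # Legacy shape: move a top-level SMILES into the annotation block,
        # keeping an annotation-level value if one is already present.
        annotation = dict(core.get("annotation", {}))
        annotation.setdefault("smiles", entry["smiles"])
        core["annotation"] = annotation
    return core, extras


core, extras = split_metabolite({
    "id": "s_0001",
    "name": "ATP",
    "smiles": "OC1=NC...",
    "deltaG": -2768.1,
    "metFrom": "KEGG",
})
```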
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## Reaction entry |
| 134 | + |
| 135 | +```yaml |
| 136 | +- !!omap |
| 137 | +  - id: r_0001 # required |
| 138 | +  - name: hexokinase # cobra |
| 139 | +  - metabolites: !!omap # cobra (sorted by met id) |
| 140 | +    - s_0001: -1.0 |
| 141 | +    - s_0394: 1.0 |
| 142 | +  - lower_bound: 0.0 # cobra (number) |
| 143 | +  - upper_bound: 1000.0 # cobra (number) |
| 144 | +  - gene_reaction_rule: YGL253W or YCL040W # cobra |
| 145 | +  - objective_coefficient: 1 # cobra; omitted when 0 |
| 146 | +  - subsystem: Glycolysis / Gluconeogenesis # cobra |
| 147 | +  - notes: "MetaNetX ID curated (PR #220)" # cobra |
| 148 | +  - annotation: !!omap # cobra |
| 149 | +    - kegg.reaction: R00299 |
| 150 | +    - sbo: SBO:0000176 |
| 151 | +  - eccodes: # RAVEN extension |
| 152 | +    - 2.7.1.1 |
| 153 | +    - 2.7.1.2 |
| 154 | +  - references: "PMID:12345" # RAVEN extension |
| 155 | +  - rxnFrom: KEGG # RAVEN extension |
| 156 | +  - deltaG: -17.39 # RAVEN extension |
| 157 | +  - confidence_score: 2.0 # RAVEN extension |
| 158 | +``` |
| 159 | +
|
| 160 | +Some fields are conditional: |
| 161 | +
|
| 162 | +- `objective_coefficient` is only written when non-zero (cobrapy convention). |
| 163 | +- The `metabolites` block uses `!!omap []` (flow-style empty omap) when the reaction has no metabolites — this keeps the file a valid YAML 1.2 document. |
| 164 | +- `eccodes` is written inline (`eccodes: 2.7.1.1`) when there is exactly one code, and as a list when there are several. Same for `references`. |
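
The first two conditional rules can be illustrated with a small line emitter. This is a sketch under stated assumptions: `reaction_lines` is a hypothetical helper, the four-space base indent is chosen for illustration, only a subset of the reaction fields is emitted, and string quoting is left out.

```python
def reaction_lines(rxn, indent="    "):
    """Emit YAML lines for a subset of reaction fields, applying the conditionals."""
    lines = [f"{indent}- id: {rxn['id']}"]
    mets = rxn.get("metabolites") or {}
    if mets:
        lines.append(f"{indent}- metabolites: !!omap")
        for met_id in sorted(mets):  # stoichiometry is sorted by metabolite id
            lines.append(f"{indent}  - {met_id}: {float(mets[met_id])}")
    else:
        # Flow-style empty omap keeps the document valid.
        lines.append(f"{indent}- metabolites: !!omap []")
    lines.append(f"{indent}- lower_bound: {float(rxn['lower_bound'])}")
    lines.append(f"{indent}- upper_bound: {float(rxn['upper_bound'])}")
    coefficient = rxn.get("objective_coefficient", 0)
    if coefficient != 0:  # only written when non-zero
        lines.append(f"{indent}- objective_coefficient: {coefficient}")
    return lines


hexokinase = reaction_lines({
    "id": "r_0001",
    "metabolites": {"s_0394": 1, "s_0001": -1},
    "lower_bound": 0,
    "upper_bound": 1000,
})
empty = reaction_lines({
    "id": "r_x",
    "metabolites": {},
    "lower_bound": 0,
    "upper_bound": 0,
    "objective_coefficient": 1,
})
```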
| 165 | + |
| 166 | +**Notes key naming.** Cobrapy and the current raven-python / RAVEN MATLAB writers use **`notes`**. Pre-`feat/yeast-gem-shared` yeast-GEM files used `rxnNotes`; the raven-python and RAVEN MATLAB readers both accept that as a legacy alias.
| 167 | + |
| 168 | +**Bounds typing.** Bounds are emitted as floats with an explicit decimal point (`1000.0`, `-1000.0`), matching Python's float repr and cobrapy's output. |
| 169 | + |
| 170 | +--- |
| 171 | + |
| 172 | +## Gene entry |
| 173 | + |
| 174 | +```yaml |
| 175 | +- !!omap |
| 176 | +  - id: YGL253W # required |
| 177 | +  - name: HXK2 # cobra; omitted when empty |
| 178 | +  - annotation: !!omap # cobra |
| 179 | +    - uniprot: P04807 |
| 180 | +    - ncbigene: 856421 |
| 181 | +  - protein: P04807 # RAVEN extension |
| 182 | +``` |
| 183 | + |
| 184 | +Empty names (`name: ''`) are not emitted (matches RAVEN MATLAB's historical behavior). |
| 185 | +
|
| 186 | +--- |
| 187 | +
|
| 188 | +## Compartments |
| 189 | +
|
| 190 | +```yaml |
| 191 | +- compartments: !!omap |
| 191 | +  - c: cytoplasm |
| 192 | +  - e: extracellular |
| 193 | +  - m: mitochondrion |
| 195 | +``` |
| 196 | +
|
| 197 | +Just an `!!omap` of `<short code>: <human-readable name>` pairs. Compartments don't carry their own MIRIAMs in the current format. |
| 198 | + |
| 199 | +--- |
| 200 | + |
| 201 | +## metaData (RAVEN extension) |
| 202 | + |
| 203 | +```yaml |
| 204 | +- metaData: !!omap |
| 205 | +  - id: yeastGEM_develop |
| 206 | +  - name: The Consensus Genome-Scale Metabolic Model of Yeast |
| 207 | +  - version: 9.0.0 |
| 208 | +  - date: 2026-05-27 |
| 209 | +  - defaultLB: -1000.0 |
| 210 | +  - defaultUB: 1000.0 |
| 211 | +  - givenName: Eduard |
| 212 | +  - familyName: Kerkhoven |
| 213 | +  - email: eduardk@chalmers.se |
| 214 | +  - organization: Chalmers University of Technology |
| 215 | +  - taxonomy: taxonomy/559292 |
| 216 | +  - note: "Saccharomyces cerevisiae - strain S288C" |
| 217 | +  - sourceUrl: https://github.com/SysBioChalmers/yeast-GEM |
| 218 | +``` |
| 219 | + |
| 220 | +Pure provenance. Cobrapy ignores the block; raven-python keeps the verbatim dictionary on `model.notes['metaData']` and additionally lifts `id` / `name` / `version` to `model.id` / `model.name` / `model.notes['version']` so cobra-shape accessors find them. RAVEN MATLAB populates `model.id` / `model.name` / `model.version` / `model.annotation.*` from the same fields. |
| 221 | + |
| 222 | +`date` is preserved across round-trips when present on the model; otherwise the writer fills in the current date in `YYYY-MM-DD` form.
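
The `date` rule amounts to a one-line fallback. A minimal sketch, assuming the stored value is either a string or a `datetime.date`; `metadata_date` is a hypothetical helper name, and its `today` parameter exists only to make the behavior reproducible in tests.

```python
from datetime import date


def metadata_date(existing=None, today=None):
    """Keep an existing date; otherwise fall back to today's date as YYYY-MM-DD."""
    if existing:
        # str() of a datetime.date is already in ISO YYYY-MM-DD form.
        return str(existing)
    return (today or date.today()).isoformat()
```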
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## GECKO sections |
| 227 | + |
| 228 | +For enzyme-constrained models, three additional top-level keys carry the EC layer: |
| 229 | + |
| 230 | +```yaml |
| 231 | +- gecko_light: false # true for the "light" formulation |
| 232 | +- ec-rxns: |
| 233 | +  - !!omap |
| 234 | +    - id: r_0001 |
| 235 | +    - kcat: 25.3 |
| 236 | +    - source: brenda |
| 237 | +    - notes: "" |
| 238 | +    - eccodes: 2.7.1.1 |
| 239 | +    - enzymes: !!omap |
| 240 | +      - P04807: 1.0 |
| 241 | +- ec-enzymes: |
| 242 | +  - !!omap |
| 243 | +    - genes: YGL253W |
| 244 | +    - enzymes: P04807 |
| 245 | +    - mw: 53942 |
| 246 | +    - sequence: "MVHLGPK..." |
| 247 | +    - concs: .nan |
| 248 | +``` |
| 249 | + |
| 250 | +These map onto `model.ec` in RAVEN MATLAB and `raven_python.io.ec_data.EcData` (attached as `model.ec`) in raven-python. Cobrapy ignores the sections. |
| 251 | + |
| 252 | +The older spelling `geckoLight` inside `metaData` is also accepted on read. |
| 253 | + |
| 254 | +--- |
| 255 | + |
| 256 | +## Annotations |
| 257 | + |
| 258 | +The `annotation` block uses MIRIAM-style namespace keys. Cobrapy treats the block as a free-form dictionary; raven-python preserves it verbatim through `cobra.Metabolite.annotation` / `Reaction.annotation` / `Gene.annotation`; RAVEN MATLAB maps it to `model.metMiriams` / `rxnMiriams` / `geneMiriams`. |
| 259 | + |
| 260 | +- A single value is written inline: `kegg.compound: C00002`. |
| 261 | +- Multiple values are written as a YAML list: |
| 262 | + |
| 263 | +  ```yaml |
| 264 | +  - chebi: |
| 265 | +    - CHEBI:15422 |
| 266 | +    - CHEBI:30616 |
| 267 | +  ``` |
| 268 | + |
| 269 | +- The `smiles` key inside a metabolite's `annotation` carries the SMILES string (cobrapy convention). RAVEN MATLAB historically emitted `smiles` as a metabolite top-level field; both readers still accept that, but writes are normalized to the annotation block. |
| 270 | +- The `sbo` key carries the Systems Biology Ontology term assigned by `assignSBOterms` / `add_sbo_terms`. |
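
The single-value versus multi-value convention can be handled symmetrically on write and on read. The sketch below is illustrative only: `annotation_lines` and `as_list` are hypothetical helpers, the six-space indent is an assumption about where the annotation block sits, and string quoting is not applied.

```python
def annotation_lines(key, values, indent="      "):
    """Write one annotation key: inline for one value, a YAML list for several."""
    if isinstance(values, str):
        values = [values]
    values = list(values)
    if not values:
        return []  # nothing to emit for an empty annotation
    if len(values) == 1:
        return [f"{indent}- {key}: {values[0]}"]
    return [f"{indent}- {key}:"] + [f"{indent}  - {value}" for value in values]


def as_list(value):
    """Read side: normalize a scalar or a list to a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


single = annotation_lines("kegg.compound", "C00002")
several = annotation_lines("chebi", ["CHEBI:15422", "CHEBI:30616"])
```

Normalizing with `as_list` on read lets downstream code treat both spellings the same way.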
| 271 | + |
| 272 | +--- |
| 273 | + |
| 274 | +## Numbers, strings, quoting |
| 275 | + |
| 276 | +**Numbers.** Whole-number floats are written with an explicit `.0` (`1000.0`, `-1000.0`, `0.0`). Other floats use up to 15 significant digits (`-17.39`, `-2768.1`). `NaN` is encoded as `.nan`; `+Inf` / `-Inf` as `.inf` / `-.inf` (YAML 1.2 conventions). |
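
One way to reproduce the number formatting described above is sketched here. It is an illustration of the stated rules rather than the code used by any of the three writers; `format_number` is a hypothetical name, and the handling of integers and booleans (written as-is, `true` / `false`) is an assumption based on the examples in this document (`charge: -4`, `gecko_light: false`).

```python
import math


def format_number(value):
    """Format a scalar following the number conventions of this format."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if math.isnan(value):
        return ".nan"
    if math.isinf(value):
        return ".inf" if value > 0 else "-.inf"
    text = format(value, ".15g")  # up to 15 significant digits
    if "." not in text and "e" not in text:
        text += ".0"  # whole-number floats keep an explicit decimal point
    return text
```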
| 277 | + |
| 278 | +**Strings.** Default style is bare (no quotes). The writer falls back to double-quoted style when the value: |
| 279 | + |
| 280 | +- starts with `-`, `?`, `:`, or another YAML indicator character (the flow indicators `[`, `]`, `{`, `}`, `,`, as well as `&`, `*`, `!`, `|`, `>`, `%`, `@`, `` ` ``, `#`);
| 281 | +- contains `: ` (would otherwise be parsed as a key/value), ` #` (comment), or one of `[`, `]`, `{`, `}`; |
| 282 | +- has leading or trailing whitespace; |
| 283 | +- spells a YAML reserved word case-insensitively (`true`, `false`, `null`, `yes`, `no`, `on`, `off`, `~`). |
| 284 | + |
| 285 | +In a double-quoted string, only `\` and `"` are escaped. Other characters (including Unicode and newlines if the underlying model permitted them) are passed through. |
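
The quoting rules listed above translate fairly directly into a predicate. The following is a sketch that implements only those listed rules; `needs_quotes` and `quote` are hypothetical names, and quoting the empty string is an assumption taken from the `notes: ""` line in the GECKO example rather than from the rule list.

```python
RESERVED = {"true", "false", "null", "yes", "no", "on", "off", "~"}
LEADING = set("-?:[]{},&*!|>%@`#")
BRACKETS = set("[]{}")


def needs_quotes(value):
    """Return True when a bare (unquoted) scalar would be unsafe."""
    if value == "":
        return True  # assumption: empty strings are written as ""
    if value != value.strip():
        return True  # leading or trailing whitespace
    if value[0] in LEADING:
        return True
    if ": " in value or " #" in value:
        return True  # would read as a key/value pair or a comment
    if any(ch in BRACKETS for ch in value):
        return True
    return value.lower() in RESERVED


def quote(value):
    """Double-quote a string only when needed, escaping backslash and quote."""
    if not needs_quotes(value):
        return value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

Escaping the backslash before the double quote matters: doing it in the other order would double the backslashes introduced for the quotes.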
| 286 | + |
| 287 | +--- |
| 288 | + |
| 289 | +## Tooling interoperability matrix |
| 290 | + |
| 291 | +| File written by ↓ \ Reader → | cobrapy | raven-python | RAVEN MATLAB | |
| 292 | +|---|---|---|---| |
| 293 | +| cobrapy (`save_yaml_model`) | full | full + extras land in `notes` via attribute fall-through | works for root-level `id` / `name` / `version` (added in this release) | |
| 294 | +| raven-python (`write_yaml_model`) | core (no `metaData`-derived `id`); RAVEN extras live as unknown top-level keys but don't break parsing | full | full | |
| 295 | +| RAVEN MATLAB (`writeYAMLmodel`) | core (no `metaData`-derived `id`); RAVEN extras land via attribute fall-through | full | full | |
| 296 | + |
| 297 | +"Full" = every field read back into its canonical position on the model object; "core" = cobrapy-known fields, RAVEN extensions ignored or kept on the object as attribute fall-through (`reaction.eccodes` etc., not re-emitted on save). A round-trip through cobrapy is therefore **lossy for RAVEN extensions** — only the core fields survive `cobrapy.load → cobrapy.save`. Round-trips through raven-python or RAVEN MATLAB are lossless. |
| 298 | + |
| 299 | +--- |
| 300 | + |
| 301 | +## What round-tripping looks like |
| 302 | + |
| 303 | +Loading `yeast-GEM.yml` (2748 metabolites, 4102 reactions, 1143 genes) and re-writing it through any of the three tools preserves every documented piece of content: |
| 304 | + |
| 305 | +| Item | Count after round-trip |
| 306 | +|---|---| |
| 307 | +| metabolites | 2748 / 2748 | |
| 308 | +| reactions | 4102 / 4102 | |
| 309 | +| genes | 1143 / 1143 | |
| 310 | +| reactions with eccodes | 2411 | |
| 311 | +| reactions with deltaG | 3984 | |
| 312 | +| metabolites with deltaG | 2696 | |
| 313 | +| metabolites with SMILES | 1788 | |
| 314 | +| reactions with notes (rxnNotes) | 1443 | |
| 315 | + |
| 316 | +(Cobrapy round-trips give 2748 / 4102 / 1143 for the core but drop RAVEN extensions such as `eccodes` and `deltaG`; that's the documented loss.)
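
A round-trip check of this kind can be approximated by counting the same items before and after. The sketch below works on models already flattened to plain dictionaries and lists; `count_fields` and `roundtrip_losses` are hypothetical helpers, not functions shipped by cobrapy, raven-python, or RAVEN, and the tiny models are made-up test data rather than yeast-GEM content.

```python
def count_fields(model):
    """Count the items tracked in the round-trip table."""
    mets = model.get("metabolites", [])
    rxns = model.get("reactions", [])
    return {
        "metabolites": len(mets),
        "reactions": len(rxns),
        "genes": len(model.get("genes", [])),
        "reactions with eccodes": sum(1 for r in rxns if r.get("eccodes")),
        "reactions with deltaG": sum(1 for r in rxns if "deltaG" in r),
        "metabolites with deltaG": sum(1 for m in mets if "deltaG" in m),
        "metabolites with SMILES": sum(
            1 for m in mets if "smiles" in m.get("annotation", {})),
        # `rxnNotes` is the legacy alias of `notes`.
        "reactions with notes": sum(
            1 for r in rxns if r.get("notes") or r.get("rxnNotes")),
    }


def roundtrip_losses(before, after):
    """Return {item: (count_before, count_after)} for every count that changed."""
    b, a = count_fields(before), count_fields(after)
    return {key: (b[key], a[key]) for key in b if b[key] != a[key]}


original = {
    "metabolites": [{"id": "s_0001", "deltaG": -2768.1}],
    "reactions": [{"id": "r_0001", "eccodes": "2.7.1.1", "deltaG": -17.39}],
    "genes": [{"id": "YGL253W"}],
}
# Simulated core-only rewrite: the extension fields are gone.
core_only = {
    "metabolites": [{"id": "s_0001"}],
    "reactions": [{"id": "r_0001"}],
    "genes": [{"id": "YGL253W"}],
}
losses = roundtrip_losses(original, core_only)
```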
| 317 | + |
| 318 | +--- |
| 319 | + |
| 320 | +## What changed in `feat/yeast-gem-shared` |
| 321 | + |
| 322 | +- raven-python writer no longer drops `!!omap` tags (was producing files RAVEN MATLAB's reader couldn't load). |
| 323 | +- raven-python now preserves `eccodes` and accepts the legacy `rxnNotes` reaction key on read. |
| 324 | +- RAVEN MATLAB writer reorders metabolite / reaction fields to match cobrapy. |
| 325 | +- RAVEN MATLAB writer renames the reaction `rxnNotes` key to `notes` and emits SMILES inside the annotation block (still accepts both shapes on read). |
| 326 | +- RAVEN MATLAB writer's `preserveQuotes` default is now `false`; values that need quoting (SMILES with `[O-]`, leading flow indicators, booleans, `: `-containing strings) are quoted defensively per value. |
| 327 | +- RAVEN MATLAB writer emits whole-number bounds as `1000.0` (matches cobrapy / Python float repr) instead of `1000`. |
| 328 | +- RAVEN MATLAB reader accepts cobrapy's root-level `id` / `name` / `version` / `gecko_light`, the `!!omap`-tagged `metaData` header, and `notes` (canonical) in addition to `rxnNotes` (legacy). |
| 329 | +- Empty `reaction.metabolites` blocks are emitted as `!!omap []` (valid YAML 1.2) rather than an empty `!!omap` with no value. |
| 330 | +- Document-start marker `---` dropped to match cobrapy's bare `!!omap` root. |
| 331 | + |
| 332 | +These changes are byte-stable for cobrapy and raven-python users; existing yeast-GEM YAML files continue to load. The first time a yeast-GEM curation pass rewrites the file with the new MATLAB writer, the diff will look large (because of the reordering and quote-style changes) but the model content is unchanged. |