Commit 4d81625

Add raven_python.curation — batch_curate / batch_curate_from_tsv
Generic batch curation engine extracted from yeast-GEM's MATLAB
curateMetsRxnsGenes (yeast-GEM phase 6). Adds or updates metabolites,
reactions and genes from pandas DataFrames; a batch_curate_from_tsv
convenience wrapper reads the equivalent TSVs.

Schema (matches yeast-GEM's data/modelCuration/template/ layout):

  mets_df         metNames, comps, formula, charge, inchi, metNotes
                  + any number of MIRIAM-namespace columns
  genes_df        genes, geneShortNames + MIRIAM columns
  rxns_df         rxnNames, grRules, lb, ub, rev, subSystems, eccodes,
                  rxnNotes, rxnReferences, rxnConfidenceScores
                  + MIRIAM columns
  rxns_coeffs_df  rxnNames, metNames, comps, coefficient
                  (one row per (reaction, metabolite) pair)

Match keys:

  metabolites     (name, compartment) tuple
  genes           gene id
  reactions       stoichiometric signature

Existing entities get their annotations overwritten (warning emitted);
new entities are added with fresh ids generated from the supplied
``met_id_prefix`` / ``rxn_id_prefix`` (defaults M_ / R_ per the BiGG
convention; yeast-GEM passes s_ / r_). Width of the existing
zero-padded suffix is preserved so s_0001 → s_0002, not s_2.

"Everything after the core columns is MIRIAM": the header of any extra
column becomes the annotation namespace key. Matches MATLAB behaviour
exactly so yeast-GEM's existing TSVs work unchanged, and projects with
different MIRIAM column sets need no code change.

CurationResult dataclass records what was added vs updated so callers
can verify in tests / CI.

Tests: 13 new (add/update mets, add/update genes, add/update rxns by
stoichiometry, miriam auto-detect, id-width preservation, combined
mets+rxns in one call, missing-metabolite error, batch_curate_from_tsv
round trip).
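
The id-width rule in the commit message (s_0001 → s_0002, not s_2) can be illustrated with a minimal sketch. This is not the code from this commit: `next_id` is a hypothetical helper, and the four-digit fallback for a prefix with no existing ids is an assumption made only for the example.

```python
import re


def next_id(existing_ids, prefix):
    """Return the next free id for `prefix`, keeping the zero-padded width.

    Hypothetical helper for illustration; not part of raven_python.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    # Collect the numeric suffixes of ids that use this prefix.
    suffixes = [m.group(1) for i in existing_ids if (m := pattern.match(i))]
    if not suffixes:
        # Assumption: start at 0001 when the prefix is not yet in use.
        return f"{prefix}0001"
    width = max(len(s) for s in suffixes)      # widest existing padding
    nxt = max(int(s) for s in suffixes) + 1    # one past the highest number
    return f"{prefix}{nxt:0{width}d}"


print(next_id(["s_0001"], "s_"))
print(next_id(["s_0009", "s_0010", "r_0500"], "s_"))
```

Padding is a minimum width, so a counter that outgrows it (s_9999) simply gains a digit rather than wrapping.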
1 parent b482859 commit 4d81625

3 files changed

Lines changed: 759 additions & 0 deletions

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+"""Batch curation of metabolites / reactions / genes from data tables.
+
+Port of yeast-GEM's MATLAB ``curateMetsRxnsGenes`` into a generic
+DataFrame-driven engine. Other GEM projects (Human-GEM, custom
+reconstructions, …) can use the same machinery with their own TSV
+layouts; the only required pieces are the data tables and (optionally)
+project-specific id prefixes for fresh metabolites and reactions.
+
+Public API:
+
+* :func:`batch_curate` — entrypoint taking pandas DataFrames.
+* :func:`batch_curate_from_tsv` — file-path convenience wrapper.
+* :class:`CurationResult` — record of what was added / updated.
+
+Schema (mirrors yeast-GEM's ``data/modelCuration/template/`` layout):
+
+- **mets_df**: ``metNames, comps, formula, charge, inchi, metNotes``
+  + any number of MIRIAM-annotation columns. Match key is
+  ``(name, comp)``.
+- **genes_df**: ``genes, geneShortNames`` + MIRIAM columns. Match key
+  is ``genes``.
+- **rxns_df**: ``rxnNames, grRules, lb, ub, rev, subSystems, eccodes,
+  rxnNotes, rxnReferences, rxnConfidenceScores`` + MIRIAM columns.
+  Match key is the reaction's *stoichiometry* — same metabolites and
+  coefficients ⇒ same reaction.
+- **rxns_coeffs_df**: ``rxnNames, metNames, comps, coefficient``. One
+  row per ``(reaction, metabolite)`` pair. The ``rxnNames`` column
+  links each coefficient back to a row in ``rxns_df``. An optional
+  ``index`` first column from the legacy yeast-GEM schema is silently
+  ignored.
+
+Everything after the core columns in any of the four tables is
+interpreted as a MIRIAM annotation: the column header becomes the
+namespace key (``met.annotation[<header>] = <cell>``).
+"""
+from raven_python.curation.batch import (
+    DEFAULT_CORE_GENE_COLUMNS,
+    DEFAULT_CORE_MET_COLUMNS,
+    DEFAULT_CORE_RXN_COEFFS_COLUMNS,
+    DEFAULT_CORE_RXN_COLUMNS,
+    CurationResult,
+    batch_curate,
+    batch_curate_from_tsv,
+)
+
+__all__ = [
+    "DEFAULT_CORE_GENE_COLUMNS",
+    "DEFAULT_CORE_MET_COLUMNS",
+    "DEFAULT_CORE_RXN_COEFFS_COLUMNS",
+    "DEFAULT_CORE_RXN_COLUMNS",
+    "CurationResult",
+    "batch_curate",
+    "batch_curate_from_tsv",
+]
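
The "everything after the core columns is MIRIAM" rule from the docstring can be sketched without pandas by working on a plain header list and a row dict. The names `CORE_MET_COLUMNS`, `split_columns` and `row_annotations` are hypothetical, chosen for this example, and are not the package's `DEFAULT_CORE_*` constants or its internals. Skipping empty cells is also an assumption; the commit does not say how blanks are handled.

```python
# Core metabolite columns as listed in the docstring schema.
CORE_MET_COLUMNS = ("metNames", "comps", "formula", "charge", "inchi", "metNotes")


def split_columns(header, core_columns):
    """Split a table header into (core, MIRIAM) column names, keeping order."""
    core = [c for c in header if c in core_columns]
    miriam = [c for c in header if c not in core_columns]
    return core, miriam


def row_annotations(row, miriam_columns):
    """Build an annotation dict: column header -> cell value.

    Assumption: empty or missing cells produce no annotation entry.
    """
    return {ns: row[ns] for ns in miriam_columns if row.get(ns) not in (None, "")}


header = ["metNames", "comps", "formula", "charge", "inchi", "metNotes",
          "kegg.compound", "chebi"]
core, miriam = split_columns(header, CORE_MET_COLUMNS)

row = {"metNames": "D-glucose", "comps": "c", "kegg.compound": "C00031", "chebi": ""}
annotations = row_annotations(row, miriam)
print(miriam)
print(annotations)
```

With a DataFrame, the header would come from `list(df.columns)` and each row from `df.to_dict("records")`. Because nothing is hard-coded beyond the core list, a project with a different set of MIRIAM columns needs no code change, which is the property the commit message highlights.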

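
Matching reactions by stoichiometry ("same metabolites and coefficients ⇒ same reaction") needs a key that ignores row order. One way to build such a key from `rxns_coeffs_df`-style rows is sketched below. `stoich_signatures` is a hypothetical name, and the engine in this commit may compute its signature differently, for example against model metabolite ids rather than `(name, comp)` pairs.

```python
from collections import defaultdict


def stoich_signatures(coeff_rows):
    """Map each rxnNames value to a hashable, order-independent signature.

    Each row is a dict with the rxns_coeffs_df columns:
    rxnNames, metNames, comps, coefficient.
    """
    per_rxn = defaultdict(set)
    for row in coeff_rows:
        met_key = (row["metNames"], row["comps"])  # metabolite match key
        per_rxn[row["rxnNames"]].add((met_key, float(row["coefficient"])))
    return {name: frozenset(terms) for name, terms in per_rxn.items()}


rows = [
    {"rxnNames": "rxn1", "metNames": "A", "comps": "c", "coefficient": -1},
    {"rxnNames": "rxn1", "metNames": "B", "comps": "c", "coefficient": 1},
    # Same stoichiometry as rxn1, rows in a different order.
    {"rxnNames": "rxn2", "metNames": "B", "comps": "c", "coefficient": 1},
    {"rxnNames": "rxn2", "metNames": "A", "comps": "c", "coefficient": -1},
    # Product sits in a different compartment, so this is a different reaction.
    {"rxnNames": "rxn3", "metNames": "A", "comps": "c", "coefficient": -1},
    {"rxnNames": "rxn3", "metNames": "B", "comps": "m", "coefficient": 1},
]
sigs = stoich_signatures(rows)
print(sigs["rxn1"] == sigs["rxn2"])
print(sigs["rxn1"] == sigs["rxn3"])
```

A `frozenset` is hashable, so signatures of existing reactions can serve as dictionary keys for lookup. This sketch treats a reaction written in the reverse direction as a different reaction; whether the real engine does the same is not stated in the commit.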