Commit 36e5306

Documentation index, known-issues catalogue and MATLAB back-port doc (#10)
* Project scaffold: pyproject + package skeleton + README + LICENSE
* Add GitHub Actions CI and the maintainer-scripts README
* Add the foundation utilities: GPR, balance, parse, sort, validate
* Add the model-manipulation layer (add, remove, transport, merge, etc.)
* Add binary + data resolvers for external tools and published artefacts
* Add YAML and SIF model I/O
* Add Excel export and the Standard-GEM git-layout export
* Add BLAST and DIAMOND wrappers for protein-homology searches
* Add the homology-based draft model builder (getModelFromHomology port)
* Add KEGG download, dump parser and taxonomy parser
* Add KEGG HMM-library build and HMM-based KO assignment
* Add KEGG species-model assembly (per-organism reconstruction)
* Add KEGG artefact-build scripts and HMM-cutoff calibration docs
* Add metabolic-task parsing and the check_tasks validator
* Add connectivity gap-filling (MILP) against template models
* Add the tINIT (INIT) MILP and its supporting machinery
* Add the ftINIT pipeline and task-aware gap-filling
* Add Human-GEM validation, parameter studies and cross-solver tests
* Add HPA omics ingestion (proteomics + RNA-seq)
* Add FSEOF, reporter metabolites and flux sampling
* Add N-model comparison (presence + Jaccard + optional task check)
* Add subcellular-localisation prediction (MILP) with pluggable predictors
* Add the yeast-GEM localization benchmark (real-data validation)
* Add the documentation index, RAVEN migration map and CHANGELOG
* Add known-issues catalogue with closed sweep A–F regression notes
* Add the consolidated MATLAB RAVEN back-port proposals doc
* feat(io.yaml): factor model_from_yaml_data out of read_yaml_model

  read_yaml_model now opens+parses the file then delegates the post-parse work (capturing per-entry side-fields onto notes, restoring legacy metaData id/name, stashing unknown top-level sections onto model.notes['_yaml_sections']) to a new model_from_yaml_data(raw: dict) helper.

  This lets downstream packages that need to pre-normalise their YAML before cobra reads it (e.g. geckopy, which lifts legacy MATLAB ec-model quirks like top-level per-metabolite `smiles` into `annotation` and merges bare-`-` sequence-of-single-key-maps back to a mapping) hand the cleaned dict directly to the post-parse pipeline, without round-tripping through a temp file. Both functions are exported from raven_python.io.yaml. Pure refactor on the read side; no behaviour change for existing read_yaml_model callers.

* fix(io.yaml): drop metaData/version/_yaml_sections from doc['notes']

  cobra's model_to_dict serialises model.notes verbatim into the output doc as the 'notes' section. write_yaml_model already pops these three management keys from a local copy of model.notes to use them as top-level YAML fields, but the originals remained on model.notes and therefore also leaked into doc['notes'], producing duplicate sections in the file (the legitimate top-level emit AND a nested copy inside notes). Strip them from doc['notes'] post-model_to_dict and drop the notes section entirely when nothing else is left. Discovered while round-tripping a geckopy ecModel (it stashes ec-rxns / ec-enzymes / gecko_light on model.notes['_yaml_sections']); was visible as duplicated GECKO sections in the written YAML.
1 parent e8c26fa commit 36e5306

12 files changed

Lines changed: 1736 additions & 1 deletion

CHANGELOG.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
1+
# Changelog
2+
3+
Milestones in the raven-python port. For function-level status see
4+
[docs/raven_migration.md](docs/raven_migration.md); for open work see
5+
[docs/todo.md](docs/todo.md).
6+
7+
## Infrastructure
8+
9+
* **GitHub Actions CI** ([.github/workflows/ci.yml](.github/workflows/ci.yml)) —
10+
ruff + pytest matrix over Python 3.11/3.12/3.13. Tests that require Gurobi
11+
auto-skip (no Gurobi on free runners); the known HiGHS upstream blocker
12+
(`hybrid_interface.Configuration` rejects `lp_method='primal'`) is marked
13+
`xfail(strict=True)` so CI flips red when optlang fixes it.
14+
15+
## Quality sweep — known-issues section F (design-choice divergences)
16+
17+
Closed the five items in section F (the "design choices that differ from RAVEN"
18+
backlog from the original review). Three docstring/comment fixes; two code
19+
fixes with matching MATLAB back-port proposals in IMPROVEMENTS.md (FS4, B2).
20+
21+
* `run_init` docstring spells out the score-0 semantics divergence between
22+
classic INIT and ftINIT.
23+
* `get_init_model`: the inaccurate "same regime" comment is replaced with an accurate
24+
description of the conservative pre-filter.
25+
* `fseof` classifier now uses the slope of `|flux|` (`linregress(enforced, |flux|)`)
26+
instead of first-vs-last endpoints. A track whose endpoints straddle a
27+
peak/trough no longer ends up mislabelled.
28+
* `reporter_metabolites` docstring documents the one-sided p-value + z-score
29+
ordering vs RAVEN's two-tailed sort, and points at the up/down split via
30+
`gene_fold_changes`.
31+
* `get_elemental_balance` now reports `unknown` for empty-stoichiometry
32+
reactions (previously vacuously `balanced`). Original review attributed the
33+
bug to `check_model`; the actual code is in `balance.py`.
34+
35+
Two new regression tests (F3 in `test_analysis_fseof.py`, F5 in
36+
`test_utils_balance.py`). [docs/known_issues.md](docs/known_issues.md) now
37+
fully closed (all sections A–F).
38+
39+
## Quality sweep — known-issues sections C / D / E
40+
41+
Closed all the robustness, efficiency, and dead-code items in one pass.
42+
43+
**Robustness (C):**
44+
* `constrain_reversible_reactions` wraps FVA in try/except + NaN check; both
45+
backend-raised `OptimizationError` and silent-NaN returns now surface as one
46+
clear `RuntimeError` (the original `abs(NaN) < eps` silently no-op'd).
47+
* `ensure_binary` downloads through `.part` + `os.replace`, matching `data.py`;
47+
an interrupted download leaves a `.part`, never a half-complete `.zip`.
49+
* `parse_task_list` (.xlsx) checks `wb.sheetnames` before lookup; missing
50+
`TASKS` sheet now raises a clear `ValueError` instead of a bare `KeyError`.
51+
* `parse_taxonomy` pads with explicit `""` when a depth level is skipped and
52+
warns once.
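The `.part` + `os.replace` pattern from the `ensure_binary` item can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `binaries.py`: `write_atomically` and `demo` are hypothetical names, and an iterable of byte chunks stands in for the network response.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(dest: Path, chunks: Iterable[bytes]) -> Path:
    """Stream chunks into '<dest>.part', then rename onto dest in one step."""
    part = dest.with_name(dest.name + ".part")
    with open(part, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
    # os.replace overwrites the target; the rename is atomic on POSIX
    # when source and target live on the same filesystem.
    os.replace(part, dest)
    return dest


def _interrupted():
    """Simulated download that fails after the first chunk."""
    yield b"PK"
    raise ConnectionError("simulated drop")


def demo() -> dict:
    """Run one failed and one successful write; report which files exist."""
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp) / "tool.zip"
        part = Path(tmp) / "tool.zip.part"
        try:
            write_atomically(dest, _interrupted())
        except ConnectionError:
            pass
        after_failure = (dest.exists(), part.exists())
        write_atomically(dest, [b"PK", b"data"])
        after_success = (dest.read_bytes(), part.exists())
    return {"after_failure": after_failure, "after_success": after_success}
```

After the simulated failure only the `.part` file exists, so a later cache check on the final name cannot mistake a truncated archive for a complete one.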
53+
54+
**Efficiency (D):**
55+
* `group_linear_reactions` rewritten with a metabolite worklist (re-enqueue
56+
the mets touched by each merge); same observable result, O(n+m) work per
57+
pass instead of restarting the full scan after every merge.
58+
* `parse_kegg_reactions` now caches the parsed stoichiometry on each
59+
`KeggReaction.stoichiometry`; `build_reference_model` reuses it instead of
60+
re-parsing.
61+
62+
**Dead code (E):**
63+
* Dropped `KeggReaction.modules` and `.rhea` (parsed but never consumed).
64+
* Dropped the vestigial `only_genes_in_models` parameter from `_ortholog_map`.
65+
66+
Six new regression tests; the only item without a test is the `.part` atomic
67+
download (defensive, needs urlopen mocking).
68+
69+
## Quality sweep — known-issues section B
70+
71+
Closed all four "silent misbehaviour" items from [docs/known_issues.md](docs/known_issues.md):
72+
* `merge_models` warns on `formula` / `charge` conflicts when two source models
73+
share a name[comp] but disagree (used to silently keep the first-seen).
74+
* `add_reactions_from_equations` warns when creating a metabolite in an
75+
unregistered compartment — both the `mets_by="id"` and `mets_by="name"` paths
76+
(id-mode used to skip the check entirely, an asymmetry).
77+
* `parse_task_list` warns when continuation data appears before any task ID
78+
has been seen (used to silently drop the orphan row).
79+
* `export_model_to_sif` warns up front when a custom label map sends two
80+
distinct ids to the same label (used to silently collapse nodes).
81+
Four new regression tests cover them.
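As an illustration of the `merge_models` item, the sketch below warns when two sources disagree on a formula for the same `name[comp]` key. It is a toy stand-in, not the raven-python code: models are plain dicts here, `merge_metabolites` is a hypothetical name, and keeping the first-seen value after warning is an assumption.

```python
import warnings


def merge_metabolites(models):
    """Merge {name[comp]: formula} dicts, keeping the first-seen formula."""
    merged = {}
    for model in models:
        for key, formula in model.items():
            kept = merged.setdefault(key, formula)
            if kept != formula:
                # Surface the disagreement instead of silently dropping it.
                warnings.warn(
                    f"{key}: formula conflict, keeping {kept!r} and ignoring {formula!r}",
                    stacklevel=2,
                )
    return merged


model_a = {"glucose[c]": "C6H12O6", "ATP[c]": "C10H12N5O13P3"}
model_b = {"glucose[c]": "C6H12O6", "ATP[c]": "C10H16N5O13P3"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_metabolites([model_a, model_b])
```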
82+
83+
## Quality sweep — known-issues section A
84+
85+
Closed all six "latent edge-case bug" items from [docs/known_issues.md](docs/known_issues.md):
86+
* `add_reactions_from_equations` no longer misparses `"2 oxoglutarate"` (or any
87+
leading-number metabolite name) — the resolver tries the full token before
88+
splitting off a coefficient.
89+
* `add_reactions_from_equations` warns when an equation's terms cancel to a
90+
zero-metabolite reaction.
91+
* `add_reactions_from_model` tracks ids minted within the batch so two source
92+
metabolites whose ids both collide with the draft don't collapse onto the
93+
same generated id.
94+
* `add_transport_reactions` warns on duplicate metabolite names in the source
95+
or target compartment instead of silently dropping all but one.
96+
* `connect_blocked_reactions` membership-guards the FVA result before
97+
`.at[]` lookup.
98+
* `assign_kos` rejects `cutoff >= 1` up front — would have crashed inside the
99+
ratio filter at `log(best_evalue) == 0`.
100+
Six new regression tests cover the user-reachable cases.
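The "try the full token before splitting off a coefficient" rule from the first item can be sketched like this. The helper name `split_term` and the `known_names` lookup are hypothetical; the real resolver in `add_reactions_from_equations` may differ.

```python
def split_term(term: str, known_names: set[str]) -> tuple[float, str]:
    """Return (coefficient, metabolite name) for one side-term of an equation."""
    token = term.strip()
    # 1. Prefer the whole token: "2 oxoglutarate" may itself be a metabolite name.
    if token in known_names:
        return 1.0, token
    # 2. Otherwise try to read a leading number as the coefficient.
    head, _, rest = token.partition(" ")
    if rest:
        try:
            return float(head), rest.strip()
        except ValueError:
            pass
    # 3. No numeric prefix: treat the whole token as a name with coefficient 1.
    return 1.0, token


known = {"2 oxoglutarate", "ATP"}
whole_name = split_term("2 oxoglutarate", known)
with_coefficient = split_term("2 ATP", known)
both = split_term("3 2 oxoglutarate", known)
```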
101+
102+
## Phase 7 — Localization
103+
104+
* **Sub-cellular localisation by MILP.** [`localization.predict_localization`](src/raven_python/localization/predict.py)
105+
+ [`apply_localization`](src/raven_python/localization/predict.py). Deterministic (not simulated
106+
annealing); caller-passed `reactions_to_relocate` set with everything else pinned;
107+
incomplete-model tolerant (no silent reaction removal); `apply=False` returns a diff
108+
preview; multi-compartment by default with primary-free, extras-penalised scoring.
109+
* **Predictor loaders.** [`load_wolfpsort`, `load_deeploc`](src/raven_python/localization/scores.py),
110+
with the `gene × compartment` DataFrame contract open for any predictor.
111+
* **Compartment helpers** ([`manipulation/compartments.py`](src/raven_python/manipulation/compartments.py)):
112+
`merge_compartments`, `copy_to_compartment` — useful standalone for model curation.
113+
* **Real-data validation on yeast-GEM** ([docs/yeast_localization_benchmark.md](docs/yeast_localization_benchmark.md))
114+
— accuracy 0.72 → 0.39 on 298 GPR'd reactions as confident predictor mis-scoring rises
115+
from 0 % to 50 %; perfect on compartments with disjoint gene sets (c/g/lp/p/v/vm), and
116+
surfaces a `transport_cost` calibration insight for soft-probability score tables.
117+
118+
## Phase 5 — Data integration & analysis
119+
120+
* **Reporter metabolites, FSEOF, random sampling** ([`analysis/`](src/raven_python/analysis/)).
121+
* **HPA omics ingestion** ([`omics.parse_hpa`, `parse_hpa_rna`, `hpa_gene_scores`, `rna_gene_scores`](src/raven_python/omics/hpa.py))
122+
— pandas-tidy DataFrames replace RAVEN's sparse-matrix layout; scoring adapters reuse the
123+
existing GPR walk.
124+
* **N-model comparison** ([`comparison.compare_models`](src/raven_python/comparison/compare.py)).
125+
* **Dynamic FBA** is **not ported** — established Python packages cover it (`dfba`,
126+
`reframed`, `mewpy`).
127+
128+
## Phase 4d — ftINIT
129+
130+
* **ftINIT pipeline** ([`init.ftinit`](src/raven_python/init/ftinit.py)) — staged MILP, linear merge,
131+
task-aware gap-filling, gene pruning.
132+
* **Validated against MATLAB RAVEN on Human-GEM.** 5 Hart2015 cell-line models;
133+
Jaccard 0.973–0.977 (no-task) and 0.978–0.980 (task-constrained). See
134+
[docs/humangem_validation.md](docs/humangem_validation.md).
135+
* **Parameter calibration & input-robustness study** ([docs/init_param_calibration.md](docs/init_param_calibration.md)):
136+
`mip_gap=0.01` is the genome-scale full-pipeline sweet spot (~37% faster than 0.001 at
137+
Jaccard 0.995); pipeline is robust to expression noise (Jaccard 0.92–0.95) but sensitive
138+
to sparsity (50–70% dropout → Jaccard 0.59–0.71); the task + gap-fill layer keeps the
139+
essential-task pass-rate at 67–69/69 across the gradient, whereas tINIT-without-it passes
140+
only 35/69 even on clean data.
141+
* **Cross-solver portability** ([docs/init_solver_benchmark.md](docs/init_solver_benchmark.md))
142+
+ [`tests/test_init_solvers.py`](tests/test_init_solvers.py): Gurobi and GLPK pass at toy
143+
scale; only Gurobi is viable at genome scale today (HiGHS hits an upstream optlang
144+
`clone()` bug; GLPK ignores `configuration.timeout` on MIP).
145+
* **Engineering wins surfaced by the genome-scale work:** `check_tasks` and
146+
`fill_tasks._feasible` rewritten in-place (~12× each); MILP expressions are built with
147+
`optlang.symbolics.add` (the O(n²) sympy `sum()` blow-up was the original genome-scale
148+
blocker); bounded gap-fill MILP; `rescaleModelForINIT` ported.
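For readers unfamiliar with the Jaccard figures quoted above, the index is the size of the intersection of two sets divided by the size of their union. The sketch assumes the sets are reaction ids, which the changelog does not state explicitly; the ids below are made up.

```python
def jaccard(a, b) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


python_model = {"R1", "R2", "R3"}
matlab_model = {"R2", "R3", "R4"}
similarity = jaccard(python_model, matlab_model)  # 2 shared ids out of 4 distinct ids
```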
149+
150+
## Phase 4c — tINIT
151+
152+
* **INIT MILP and the tINIT pipeline** ([`init.run_init`](src/raven_python/init/init.py),
153+
[`init.get_init_model`](src/raven_python/init/build.py)). Clean optlang reformulation;
154+
RNA-seq scoring via a clamped `5·ln(level/ref)`.
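A minimal sketch of a clamped `5·ln(level/ref)` score follows. The clamp bounds `lo` and `hi`, and the choice to map non-positive levels to the lower bound, are placeholders chosen for illustration; the changelog does not give the values `run_init` actually uses.

```python
import math


def expression_score(level: float, ref: float, lo: float = -5.0, hi: float = 10.0) -> float:
    """Score 5*ln(level/ref), clamped to [lo, hi] (bounds are hypothetical)."""
    if level <= 0:
        # ln is undefined at or below zero; assume the lower clamp applies.
        return lo
    return max(lo, min(hi, 5.0 * math.log(level / ref)))


at_reference = expression_score(1.0, 1.0)      # level equal to ref
silent_gene = expression_score(0.0, 1.0)       # no expression detected
very_high = expression_score(1e6, 1.0)         # far above ref, hits the upper clamp
```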
155+
156+
## Phase 4b — Gap-filling
157+
158+
* **Connectivity gap-filling** ([`gapfilling.connect_blocked_reactions`](src/raven_python/gapfilling/fill.py))
159+
— MILP. Targeted (toward objective) mode delegates to `cobra.gapfill`.
160+
161+
## Phase 4a — Metabolic tasks
162+
163+
* **Task list parsing + `check_tasks`** ([`tasks/`](src/raven_python/tasks/)).
164+
165+
## Phase 3 — Reconstruction
166+
167+
* **Homology-based draft** from a template GEM + BLAST/DIAMOND wrappers
168+
([`reconstruction/homology/`](src/raven_python/reconstruction/homology/)) — with structured
169+
improvements over RAVEN's `getModelFromHomology` (see IMPROVEMENTS H1–H6).
170+
* **KEGG five-step pipeline** ([`reconstruction/kegg/`](src/raven_python/reconstruction/kegg/)):
171+
dump → parser → HMM library builder → species model → HMM-query draft.
172+
* **MetaCyc reconstruction** is **not ported** (and is flagged for removal from MATLAB RAVEN;
173+
see IMPROVEMENTS R-MetaCyc).
174+
175+
## Phase 2 — I/O
176+
177+
* **YAML** aligned to cobra's `!!omap` writer + RAVEN-only fields preserved into `.notes`,
178+
plus geckopy `ec-*` for enzyme-constrained models
179+
([`io/yaml.py`](src/raven_python/io/yaml.py)).
180+
* **SIF**, **Excel export**, and **Standard-GEM `model/<fmt>/…` git layout**
181+
([`io/`](src/raven_python/io/)). Excel import intentionally excluded.
182+
183+
## Phase 1 — Foundation
184+
185+
* **GPR / balance / validation / parsing helpers** ([`utils/`](src/raven_python/utils/)) —
186+
cobra-absent bits only; the rest are cheatsheeted.
187+
* **Manipulation ergonomic layer** ([`manipulation/`](src/raven_python/manipulation/)) —
188+
add/change/remove/transport/transfer/merge/simplify/variance + adopted transforms.
189+
* **External-binary resolver** ([`binaries.py`](src/raven_python/binaries.py)) — version-pinned
190+
release-ZIP registry, SHA256-verified cache.
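The SHA256 verification step mentioned for the binary cache can be sketched with `hashlib`. The helper names are hypothetical and the registry lookup is omitted; the expected digest is simply passed in.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path: Path, expected_hex: str) -> bool:
    """True when the cached file matches the pinned digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


def demo() -> tuple:
    """Check one matching and one mismatching digest against a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = Path(tmp) / "tool.zip"
        cached.write_bytes(b"abc")
        good = hashlib.sha256(b"abc").hexdigest()
        bad = hashlib.sha256(b"abd").hexdigest()
        return is_verified(cached, good.upper()), is_verified(cached, bad)
```

A cache entry that fails the check would typically be deleted and downloaded again rather than used.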
191+
192+
## Phase 0 — Scaffold
193+
194+
* Project structure, packaging, pytest skeleton, license alignment with MATLAB RAVEN
195+
(GPL-3.0-or-later).
