diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..3d42fa9 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,195 @@ +# Changelog + +Milestones in the raven-python port. For function-level status see +[docs/raven_migration.md](docs/raven_migration.md); for open work see +[docs/todo.md](docs/todo.md). + +## Infrastructure + +* **GitHub Actions CI** ([.github/workflows/ci.yml](.github/workflows/ci.yml)) — + ruff + pytest matrix over Python 3.11/3.12/3.13. Tests that require Gurobi + auto-skip (no Gurobi on free runners); the known HiGHS upstream blocker + (`hybrid_interface.Configuration` rejects `lp_method='primal'`) is marked + `xfail(strict=True)` so CI flips red when optlang fixes it. + +## Quality sweep — known-issues section F (design-choice divergences) + +Closed the five items in section F (the "design choices that differ from RAVEN" +backlog from the original review). Three docstring/comment fixes; two code +fixes with matching MATLAB back-port proposals in IMPROVEMENTS.md (FS4, B2). + +* `run_init` docstring spells out the score-0 semantics divergence between + classic INIT and ftINIT. +* `get_init_model` inaccurate "same regime" comment replaced with an accurate + description of the conservative pre-filter. +* `fseof` classifier now uses the slope of `|flux|` (`linregress(enforced, |flux|)`) + instead of first-vs-last endpoints. A track whose endpoints straddle a + peak/trough no longer ends up mislabelled. +* `reporter_metabolites` docstring documents the one-sided p-value + z-score + ordering vs RAVEN's two-tailed sort, and points at the up/down split via + `gene_fold_changes`. +* `get_elemental_balance` now reports `unknown` for empty-stoichiometry + reactions (previously vacuously `balanced`). Original review attributed the + bug to `check_model`; the actual code is in `balance.py`. + +Two new regression tests (F3 in `test_analysis_fseof.py`, F5 in +`test_utils_balance.py`). [docs/known_issues.md](docs/known_issues.md) now +fully closed (all sections A–F). + +## Quality sweep — known-issues sections C / D / E + +Closed all the robustness, efficiency, and dead-code items in one pass. + +**Robustness (C):** +* `constrain_reversible_reactions` wraps FVA in try/except + NaN check; both + backend-raised `OptimizationError` and silent-NaN returns now surface as one + clear `RuntimeError` (the original `abs(NaN) < eps` silently no-op'd). +* `ensure_binary` downloads through `.part` + `os.replace`, matching `data.py` — + an interrupted download leaves a `.part`, never a half-complete `.zip`. +* `parse_task_list` (.xlsx) checks `wb.sheetnames` before lookup; missing + `TASKS` sheet now raises a clear `ValueError` instead of a bare `KeyError`. +* `parse_taxonomy` pads with explicit `""` when a depth level is skipped and + warns once. + +**Efficiency (D):** +* `group_linear_reactions` rewritten with a metabolite worklist (re-enqueue + the mets touched by each merge); same observable result, O(n+m) work per + pass instead of restarting the full scan after every merge. +* `parse_kegg_reactions` now caches the parsed stoichiometry on each + `KeggReaction.stoichiometry`; `build_reference_model` reuses it instead of + re-parsing. + +**Dead code (E):** +* Dropped `KeggReaction.modules` and `.rhea` (parsed but never consumed). +* Dropped the vestigial `only_genes_in_models` parameter from `_ortholog_map`. + +Six new regression tests; the only one without a test is the `.part` atomic +download (defensive, needs urlopen mocking). + +## Quality sweep — known-issues section B + +Closed all four "silent misbehaviour" items from [docs/known_issues.md](docs/known_issues.md): +* `merge_models` warns on `formula` / `charge` conflicts when two source models + share a name[comp] but disagree (used to silently keep the first-seen). +* `add_reactions_from_equations` warns when creating a metabolite in an + unregistered compartment — both the `mets_by="id"` and `mets_by="name"` paths + (id-mode used to skip the check entirely, an asymmetry). +* `parse_task_list` warns when continuation data appears before any task ID + has been seen (used to silently drop the orphan row). +* `export_model_to_sif` warns up front when a custom label map sends two + distinct ids to the same label (used to silently collapse nodes). +Four new regression tests cover them. + +## Quality sweep — known-issues section A + +Closed all six "latent edge-case bug" items from [docs/known_issues.md](docs/known_issues.md): +* `add_reactions_from_equations` no longer misparses `"2 oxoglutarate"` (or any + leading-number metabolite name) — the resolver tries the full token before + splitting off a coefficient. +* `add_reactions_from_equations` warns when an equation's terms cancel to a + zero-metabolite reaction. +* `add_reactions_from_model` tracks ids minted within the batch so two source + metabolites whose ids both collide with the draft don't collapse onto the + same generated id. +* `add_transport_reactions` warns on duplicate metabolite names in the source + or target compartment instead of silently dropping all but one. +* `connect_blocked_reactions` membership-guards the FVA result before + `.at[]` lookup. +* `assign_kos` rejects `cutoff >= 1` up front — would have crashed inside the + ratio filter at `log(best_evalue) == 0`. +Six new regression tests cover the user-reachable cases. + +## Phase 7 — Localization + +* **Sub-cellular localisation by MILP.** [`localization.predict_localization`](src/raven_python/localization/predict.py) + + [`apply_localization`](src/raven_python/localization/predict.py). Deterministic (not simulated + annealing); caller-passed `reactions_to_relocate` set with everything else pinned; + incomplete-model tolerant (no silent reaction removal); `apply=False` returns a diff + preview; multi-compartment by default with primary-free, extras-penalised scoring. +* **Predictor loaders.** [`load_wolfpsort`, `load_deeploc`](src/raven_python/localization/scores.py), + with the `gene × compartment` DataFrame contract open for any predictor. +* **Compartment helpers** ([`manipulation/compartments.py`](src/raven_python/manipulation/compartments.py)): + `merge_compartments`, `copy_to_compartment` — useful standalone for model curation. +* **Real-data validation on yeast-GEM** ([docs/yeast_localization_benchmark.md](docs/yeast_localization_benchmark.md)) + — accuracy 0.72 → 0.39 on 298 GPR'd reactions as confident predictor mis-scoring rises + from 0 % to 50 %; perfect on compartments with disjoint gene sets (c/g/lp/p/v/vm), and + surfaces a `transport_cost` calibration insight for soft-probability score tables. + +## Phase 5 — Data integration & analysis + +* **Reporter metabolites, FSEOF, random sampling** ([`analysis/`](src/raven_python/analysis/)). +* **HPA omics ingestion** ([`omics.parse_hpa`, `parse_hpa_rna`, `hpa_gene_scores`, `rna_gene_scores`](src/raven_python/omics/hpa.py)) + — pandas-tidy DataFrames replace RAVEN's sparse-matrix layout; scoring adapters reuse the + existing GPR walk. +* **N-model comparison** ([`comparison.compare_models`](src/raven_python/comparison/compare.py)). +* **Dynamic FBA** is **not ported** — established Python packages cover it (`dfba`, + `reframed`, `mewpy`). + +## Phase 4d — ftINIT + +* **ftINIT pipeline** ([`init.ftinit`](src/raven_python/init/ftinit.py)) — staged MILP, linear merge, + task-aware gap-filling, gene pruning. +* **Validated against MATLAB RAVEN on Human-GEM.** 5 Hart2015 cell-line models; + Jaccard 0.973–0.977 (no-task) and 0.978–0.980 (task-constrained). See + [docs/humangem_validation.md](docs/humangem_validation.md). +* **Parameter calibration & input-robustness study** ([docs/init_param_calibration.md](docs/init_param_calibration.md)) + — `mip_gap=0.01` is the genome-scale full-pipeline sweet spot (~37% faster than 0.001 at + Jaccard 0.995); pipeline is robust to expression noise (Jaccard 0.92–0.95) but sensitive + to sparsity (50–70% dropout → Jaccard 0.59–0.71); the task + gap-fill layer keeps the + essential-task pass-rate at 67–69/69 across the gradient, whereas tINIT-without-it passes + only 35/69 even on clean data. +* **Cross-solver portability** ([docs/init_solver_benchmark.md](docs/init_solver_benchmark.md)) + + [`tests/test_init_solvers.py`](tests/test_init_solvers.py): Gurobi and GLPK pass at toy + scale; only Gurobi is viable at genome scale today (HiGHS hits an upstream optlang + `clone()` bug; GLPK ignores `configuration.timeout` on MIP). +* **Engineering wins surfaced by the genome-scale work:** `check_tasks` and + `fill_tasks._feasible` rewritten in-place (~12× each); `optlang.symbolics.add` builds + in the MILP construction (the O(n²) sympy `sum()` blow-up was the original genome-scale + blocker); bounded gap-fill MILP; `rescaleModelForINIT` ported. + +## Phase 4c — tINIT + +* **INIT MILP and the tINIT pipeline** ([`init.run_init`](src/raven_python/init/init.py), + [`init.get_init_model`](src/raven_python/init/build.py)). Clean optlang reformulation; + RNA-seq scoring via `5·ln(level/ref)`-clamped. + +## Phase 4b — Gap-filling + +* **Connectivity gap-filling** ([`gapfilling.connect_blocked_reactions`](src/raven_python/gapfilling/fill.py)) + — MILP. Targeted (toward objective) mode delegates to `cobra.gapfill`. + +## Phase 4a — Metabolic tasks + +* **Task list parsing + `check_tasks`** ([`tasks/`](src/raven_python/tasks/)). + +## Phase 3 — Reconstruction + +* **Homology-based draft** from a template GEM + BLAST/DIAMOND wrappers + ([`reconstruction/homology/`](src/raven_python/reconstruction/homology/)) — with structured + improvements over RAVEN's `getModelFromHomology` (see IMPROVEMENTS H1–H6). +* **KEGG five-step pipeline** ([`reconstruction/kegg/`](src/raven_python/reconstruction/kegg/)): + dump → parser → HMM library builder → species model → HMM-query draft. +* **MetaCyc reconstruction** **not ported** (and flagged for removal from MATLAB RAVEN — + see IMPROVEMENTS R-MetaCyc). + +## Phase 2 — I/O + +* **YAML** aligned to cobra's `!!omap` writer + RAVEN-only fields preserved into `.notes`, + plus geckopy `ec-*` for enzyme-constrained models + ([`io/yaml.py`](src/raven_python/io/yaml.py)). +* **SIF**, **Excel export**, and **Standard-GEM `model//…` git layout** + ([`io/`](src/raven_python/io/)). Excel import intentionally excluded. + +## Phase 1 — Foundation + +* **GPR / balance / validation / parsing helpers** ([`utils/`](src/raven_python/utils/)) — + cobra-absent bits only; the rest are cheatsheeted. +* **Manipulation ergonomic layer** ([`manipulation/`](src/raven_python/manipulation/)) — + add/change/remove/transport/transfer/merge/simplify/variance + adopted transforms. +* **External-binary resolver** ([`binaries.py`](src/raven_python/binaries.py)) — version-pinned + release-ZIP registry, SHA256-verified cache. + +## Phase 0 — Scaffold + +* Project structure, packaging, pytest skeleton, license alignment with MATLAB RAVEN + (GPL-3.0-or-later). diff --git a/IMPROVEMENTS.md b/IMPROVEMENTS.md new file mode 100644 index 0000000..a6a9877 --- /dev/null +++ b/IMPROVEMENTS.md @@ -0,0 +1,339 @@ +# Proposed improvements over RAVEN + +This project is **not** a one-to-one MATLAB→Python transcription. Where a RAVEN function can be +made smarter/faster, or where a logical gap in RAVEN's feature set is worth filling, we record the +change here — with enough detail that it can **also be back-ported to MATLAB RAVEN** later. + +Each entry states: what RAVEN does today, the proposed improvement, the rationale, and whether it +is a candidate to upstream into MATLAB RAVEN. + +Categories: +- **EFFICIENCY** — same behavior, faster/smarter implementation. +- **ERGONOMICS** — same job, less friction / fewer foot-guns / clearer contract. +- **NEW** — functionality RAVEN lacks but that fits naturally alongside what it already has. +- **REMOVAL** — functionality that should be dropped (here *and* in MATLAB RAVEN) because it does + more harm than good. + +Status legend: 💡 proposed · 🔨 implemented in raven-python · ⬆️ upstreamed to MATLAB RAVEN · +🗑️ dropped (and to remove from MATLAB RAVEN) + +--- + +## R-MetaCyc: drop MetaCyc reconstruction (REMOVAL — also remove from MATLAB RAVEN) + +**Decision 2026-05-24:** MetaCyc-based reconstruction is **not ported to raven-python** and should be +**removed from MATLAB RAVEN**. Status: 🗑️. + +**What RAVEN does:** `getMetaCycModelForOrganism` builds a draft by BLAST/DIAMOND of the query +proteome against `protseq.fsa` — MetaCyc's **single representative protein sequence per enzyme** +(~11.6k sequences) — keeping each gene's best hit above a bitscore/positives cutoff and assigning +the linked reaction. With one representative per enzyme there is no profile to tell true family +members from look-alikes. + +**Evidence (this repo, real MetaCyc + KEGG 118 data):** a leave-organism-out precision/recall test +(query each representative against the others, excluding its own organism; ground truth = +MetaCyc's own MONOMER→reaction): + +| bitscore (ppos≥45) | reaction precision | EC-family precision | EC recall | +|---|---|---|---| +| 50 | 0.34 | 0.55 | 0.33 | +| 100 *(RAVEN default)* | 0.36 | 0.59 | 0.32 | +| 200 | 0.40 | 0.62 | 0.26 | +| 300 | 0.44 | 0.65 | 0.22 | + +At the default cutoff **~64 % of reaction assignments are wrong** (~41 % wrong even at EC-family +level); **no cutoff rescues precision** — tightening to bitscore 300 reaches only ~44 %/65 % while +recall halves. Real proteomes (with non-enzyme decoys, not in this test) would be worse. Test +scripts/artifacts: `/home/eduardk/metacyc_test/` (not committed). + +**Why drop rather than fix:** the low precision is intrinsic to MetaCyc's one-representative-per- +enzyme data (can't build KEGG-quality HMMs from it). Accurate gene-calling already exists via KEGG +HMMs (3b) and homology-to-template-models (3a). MetaCyc's genuine value (extra reactions/pathways/ +compound structures) does not justify a separate, data-heavy, low-precision track. + +**MATLAB RAVEN removal list** (`external/metacyc/`): `getMetaCycModelForOrganism.m`, +`getModelFromMetaCyc.m`, `getRxnsFromMetaCyc.m`, `getMetsFromMetaCyc.m`, `getEnzymesFromMetaCyc.m`, +`linkMetaCycKEGGRxns.m`, `addSpontaneousRxns.m`, and data `metaCycEnzymes.mat` / `metaCycMets.mat` +/ `metaCycRxns.mat` / `protseq.fsa`; plus any `combineMetaCycKEGGModels` and MetaCyc references in +tutorials/tests/docs. (`addSpontaneousRxns` could be kept as a small standalone helper if wanted — +it is only incidentally in the MetaCyc folder.) + +--- + +## getModelFromHomology (Phase 3a — implemented) + +Design + rationale in [docs/plan_get_model_from_homology.md](docs/plan_get_model_from_homology.md); +implemented in `reconstruction/homology/homology.py`. *Logic* improvements over RAVEN's algorithm +(RAVEN's own comments flag several of these spots as uncertain). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| H1 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Split the overloaded `strictness` 1/2/3 into two orthogonal params: `bidirectional` (reciprocal hits) and `best_hits_only`. RBH = both true. `strictness=` kept as a compat alias. | +| H2 | EFFICIENCY (robustness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Rewrite GPRs on the **cobra GPR AST**, not `regexprep` string substitution — eliminates partial-match hazards and the `OLD_… or` regex cleanup pass RAVEN needs. | +| H3 | ERGONOMICS (correctness) | raven-python 🔨 | 🔨 | Explicit `complex_policy` (default **`flag`** = RAVEN-compatible `OLD_`; plus `keep`/`drop`) for AND-subunits lacking an ortholog, via correct OR/AND **AST** semantics. | +| H4 | (correctness) | both 🔨/💡 | 🔨 | Best-hit selection by **bitscore** (db-size-independent, the RBH standard); `score="evalue"` optional. | +| H5 | EFFICIENCY | raven-python 🔨 | 🔨 | DataFrame ortholog map (pandas merge + dict) replaces `allGenes`/`allTo`/`allFrom` sparse-matrix `sub2ind` index juggling. | +| H6 | NEW | raven-python 🔨 | 🔨 | Structured provenance: `HomologyResult.gene_map` + per-reaction `notes['homology_source']`. | + +## KEGG download / dump parsing / HMM build (Phase 3b.1 / 3b.2 / 3b.3 — implemented) + +`fetch_keggdb.sh` → `reconstruction/kegg/download.py` (3b.1); parsing core of +`getRxnsFromKEGG` / `getMetsFromKEGG` / `getGenesFromKEGG` / `getModelFromKEGG` +→ `reconstruction/kegg/parse.py` (3b.2); `constructMultiFasta` + the +cluster/align/train stages of `getKEGGModelForOrganism` → `reconstruction/kegg/hmm.py` +and `taxonomy.py` (3b.3). Maintainer-side, build-time tooling (PLAN.md §2.3b). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| K1 | EFFICIENCY (robustness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Read each reaction's equation from its **own `EQUATION` field**, not from `reaction.lst` matched by line order. RAVEN reads `reaction.lst` line *i* into reaction *i*, assuming the two files stay perfectly aligned — brittle. **MATLAB back-port:** parse the `EQUATION` field already present in `reaction`. | +| K2 | ERGONOMICS (correctness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Undefined-stoichiometry terms (`n C00001`, `(n+1) C00002`) keep their **real compound id** with coefficient 1 and the reaction is *flagged*, instead of minting `"n C00001"` pseudo-metabolites later renamed `undefined_N`. Cleaner metabolite graph; flag still drives the `keep*` filters. | +| K3 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Reaction quality labels become a tidy boolean **`rxn_flags` table** (spontaneous/undefined-stoich/incomplete/general) instead of free-text appended to `rxnNotes`, so downstream filters join on a column rather than substring-matching notes. | +| K4 | EFFICIENCY | raven-python 🔨 | 🔨 | **Gene-free reference model** + separate `organism_gene_ko` table (the big one), instead of RAVEN's giant `rxnGeneMat` baked into the global model. Per-organism GPRs are built only at runtime (3b.4/3b.5), keeping the published artefact small. | +| K5 | EFFICIENCY (portability) | raven-python 🔨 | 🔨 | **KEGG download in pure Python stdlib** (`urllib`/`tarfile`/`gzip`/`netrc`), porting `fetch_keggdb.sh`. Drops the script's `wget`/`tar`/`gunzip` (and Cygwin-on-Windows) requirement, so it runs unchanged on Linux/macOS/Windows; tar extraction uses the `data` filter (no path traversal); same `~/.netrc` credential hygiene. The arrange step is split out (`extract_kegg_dump`) so it's network-free and unit-tested. | +| K6 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Per-KO multi-FASTA via a stdlib offset index** (`_index_fasta` → seek), replacing `constructMultiFasta`'s Java-`Hashtable` byte scan with 5M-element preallocation. One streaming pass, only wanted ids retained; no MATLAB/Java heap tuning. | +| K7 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Concatenate per-KO HMMs and `hmmpress` into one pressed library**, so the query path (3b.5) runs a single `hmmscan` against the database instead of RAVEN's thousands of per-KO `hmmsearch` invocations. | +| K8 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Drop the `getPhylDist` distance matrix.** Its only uses in RAVEN were per-organism HMM-sequence subsampling (`maxPhylDist`/`nSequences`) and the kingdom filter. Our fixed prok90/euk90 libraries (3b.3) remove the subsampling rationale, and domain mode (3b.4) uses the taxonomy domain classification directly — so the O(n²) matrix is never built. Simpler, faster, less code. | +| K9 | EFFICIENCY (memory) | raven-python 🔨 | 🔨 | **Stream `organism_gene_ko` to disk** in `parse_kegg_dump` instead of building it in memory. Real KEGG has **9.05M** gene↔KO associations; the in-memory DataFrame build OOMs in a few GB. Streaming (now via the external merge sort of K14) runs the full parse with flat, bounded peak memory. (Found by validating against a real KEGG FTP dump.) | +| K10 | EFFICIENCY (size) | raven-python 🔨 | 🔨 | **Reference model as gzipped RAVEN/cobra YAML** (`reference_model.yml.gz`) rather than SBML: RAVEN-native, MATLAB-readable, and ~1.1 MB vs ~30 MB SBML for the real 12k-reaction model. Made `io/yaml.py` gzip-aware on a `.gz` suffix (general-purpose). | +| K11 | ERGONOMICS | raven-python 🔨 | 🔨 | **`ensure_data`** (`data.py`): version-pinned registry that fetches/verifies/caches the published KEGG artefacts under `~/.cache/raven-python/data/`, mirroring `ensure_binary`. End users get a draft model with no KEGG access and no manual data handling — the `…_from_artefacts` entry points auto-fetch when no local dir is supplied. | +| K12 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Fast MAFFT (FFT-NS-2) for HMM training** instead of RAVEN's `--auto`, which selects slow iterative refinement (`dvtditr`) on medium/large KOs — observed ~2.5 min/KO (days for a domain) on real KEGG 118. FFT-NS-2 (`--retree 2 --maxiterate 0`) is seconds/KO and ample for profile-HMM building. **PartTree cutover is residue-based and memory-auto-tuned**: MAFFT memory tracks residues (count × length), not sequence count, so a count threshold let long-protein KOs (K00901: 2,788 seqs, 2.55 M residues) OOM under FFT-NS-2 — measured ~5 GB MAFFT RSS with FFT-NS-2 vs **0.69 GB with PartTree** for the same alignment. The cutover is **length-aware and memory-auto-tuned**: FFT-NS-2 memory is driven by the progressive-alignment **DP cost ≈ n_seqs × mean_len²** (= residues²/n_seqs), *not* residue count — a few hundred long proteins cost far more than the same residues in many short ones. (First tried a residue-only model `RSS≈1.32R²+1.84R`; it then OOM'd on K12047 — 452 seqs but mean length 2082, 0.94 M residues — because long proteins blow the per-residue cost.) Calibrated `RSS_GB ≈ 4.2e-9 × (n_seqs × mean_len²)` across real KEGG KOs (250k/266→0.67 GB … 1.5M/1624→5.73 GB; K12047 cost 1.96e9 = the largest, hence its OOM). `_auto_cost_budget` switches to PartTree when the DP cost exceeds `0.65 × (total − 2.5 GB overhead) / 4.2e-9` (≈7.9e8 on a 7.6 GB box), **warns on low-memory hosts**, and `parttree_residues` overrides with a manual residue cutoff. Back-portable to RAVEN. | +| K13 | EFFICIENCY | raven-python 🗑️ | 🗑️ | ~~Per-KO sequence cap (`max_sequences`)~~ — **removed.** Briefly added as a count-based cap, but the residue-based PartTree cutover (K12) bounds MAFFT memory without dropping any sequences, so the cap was redundant complexity. All deduplicated sequences are kept. | +| K14 | EFFICIENCY (size) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Sort `organism_gene_ko` by `(organism, gene)` and store it xz-compressed** (`organism_gene_ko.tsv.xz`), cutting the dominant artefact **≈78 → 27 MB (2.9×)**. Gene IDs within an organism share long prefixes (locus tags, numeric runs), so sorting makes them adjacent and far more compressible (sort alone: 78→48 MB; xz vs gzip captures the cross-row redundancy gzip's 32 KB window misses: →27 MB). The sort is an **external merge sort** bounded to `chunk_rows` rows in memory (sorted runs spooled to gzipped temp files, merged with `heapq.merge`), so it keeps K9's flat memory profile. Both `lzma` and `gzip` are Python stdlib (native on Windows/macOS/Linux, no extra binary); small tables stay gzipped TSV (MATLAB-native), only the big one is xz (MATLAB needs an external `unxz`). Sorted order also matches the by-organism query in `get_kegg_model_for_organism`, enabling a future `searchsorted` slice instead of loading all 9M rows. Back-portable to RAVEN. | +| K15 | ERGONOMICS (correctness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Recalibrate the HMM-query KO-assignment defaults** (`assign_kos`): cut-off `1e-50 → 1e-30`, `min_score_ratio_g 0.8 → 0.9`; `min_score_ratio_ko` left at 0.3 but **documented as empirically inert**. Cross-validated the full 3b.5 pipeline against the true KEGG gene→KO annotation of four organisms across both libraries and the well-/lesser-studied axis — *S. cerevisiae*, *Cyanidioschyzon merolae* (red alga), *E. coli* K-12, *Mycoplasma genitalium* (minimal genome). Real annotations score overwhelmingly (median E ≈ 1e-100…1e-155; even the weakest 1% ≈ 1e-15…1e-36) while spurious hits cluster at ≈1e-8 — a ~20-order-of-magnitude gap. RAVEN's `1e-50` therefore sits **inside the true-positive tail** and silently drops real-but-divergent hits for no noise-rejection gain: gene→KO recall on *M. genitalium* was only 0.84 (reaction recall 0.87). At `1e-30` + `ratio_g=0.9`: *M. genitalium* recall **0.84→0.94** (rxn 0.87→0.97), *E. coli* 0.95→0.97 with **fewer** unannotated reactions (198→173, the tighter gene-ratio prunes spurious multi-KO genes), *S. cerevisiae*/*C. merolae* held or improved. The three sweep tables showed `min_score_ratio_ko` produced identical output at 0.0/0.3/0.5 across all four organisms — a magic-number knob that does nothing; `min_score_ratio_g` is the real precision lever. Full numbers in [docs/kegg_hmm_cutoff_calibration.md](docs/kegg_hmm_cutoff_calibration.md) (reproduce with `scripts/analyze_hmm_cutoffs.py`). Back-portable to RAVEN. | + +## FSEOF (Phase 5 — implemented, redesigned) + +RAVEN `core/FSEOF.m` → `analysis/fseof.py` (`fseof`). User was unhappy with RAVEN's +output; redesigned substantially. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| FS1 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Robust trend via linear regression** (slope + correlation) over the whole scan, instead of RAVEN's strict step-by-step monotonicity that discards a target on a single noisy step (LP alternative optima). |r| is a quality score for filtering. pFBA per step keeps the scan stable. | +| FS2 | NEW | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Reports knockdown/knockout targets**, not just amplification. RAVEN only flags reactions whose \|flux\| *rises* with the enforced product; reactions driven *toward zero* — the down-regulation/deletion candidates, arguably the most actionable — are classified here (`knockdown`/`knockout`). | +| FS3 | ERGONOMICS | raven-python 🔨 | 🔨 | **Gene-level aggregation** (`gene_targets`) mapping reaction targets to genes, plus the **full flux scan** retained — all as DataFrames, vs RAVEN's printed TSV + endpoint-only slope. | +| FS4 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | Slope is the **regression slope** consistent with the selection criterion, not RAVEN's endpoint difference that disagreed with its own monotonicity filter. **Proposed back-port:** record the enforced target-flux level at each iteration into a vector and replace both the reported slope `abs(fseof.results(num,iterations) - fseof.results(num,1)) / abs(targetMax - targetMax/iterations)` and the `targets.slope` field with `polyfit(enforcedFlux, fseof.results(num,:), 1)` slope across all iterations. The label classification (amplify/knockdown/knockout, see FS2) should also use the slope of `|flux|` instead of comparing endpoints. | + +## randomSampling (Phase 5 — implemented) + +RAVEN `core/randomSampling.m` → `analysis/sampling.py` (`random_sampling`). The +random-objective / extreme-point method of Bordel et al. (2010) — **not** what +`cobra.sampling` (OptGP/ACHR) does (those draw a near-uniform MCMC sample of the +polytope interior), so it is a genuine addition, and it was wrongly listed as +cobra-covered in the PLAN cheatsheet. Each sample maximises a small random linear +combination of reactions. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| SAMP1 | EFFICIENCY | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **`good_reactions` (loop-free, flux-carrying objective candidates) via one `cobra` FVA pass** rather than RAVEN's hand-rolled per-reaction `parfor` that solves a separate LP maximising and minimising every reaction. FVA computes the same min/max ranges, optimised and optionally `loopless` (`cycleFreeFlux`), in far less code. | +| SAMP2 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Reproducible** (`seed`) and **`n_objectives` is a parameter** — RAVEN has no seed and hard-codes 2 objective reactions (its docstring even claims 3, a transcription bug worth fixing upstream). | +| SAMP3 | ERGONOMICS | raven-python 🔨 | 🔨 | **Output is a samples × reactions DataFrame** (the `cobra.sampling` layout, directly usable with pandas/`analyzeSampling`-style stats) plus the reusable `good_reactions` list — instead of a reactions × samples matrix and a parallel `goodRxns` index vector that the caller must re-thread. | +| SAMP4 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **`replace_max_bound` (RAVEN's `replaceBoundsWithInf`) defaults off and is scoped to sampling only.** It applies *after* `good_reactions` is found (FVA cannot evaluate `inf` bounds — it errors as 'unbounded'), and it can open unbounded loop directions; documented to pair with `min_flux`. RAVEN ran goodRxns detection on the inf-replaced model, conflating "loop reaction" with "unbounded objective". | + +## reporterMetabolites (Phase 5 — implemented) + +RAVEN `core/reporterMetabolites.m` → `analysis/reporter.py` (`reporter_metabolites`). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| RM1 | EFFICIENCY + CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Exact closed-form background correction** instead of RAVEN's 100 000-random-set Monte Carlo *per neighbour-count*. RAVEN samples with replacement from the scored-gene Z pool, so a random aggregate `Σz/√n` provably has mean `√n·μ` and std `σ` — the corrected score is exactly `(metZ − √n·μ)/σ`. Removes the slow sampling **and** its run-to-run randomness (deterministic results); back-portable to RAVEN. | +| RM2 | ERGONOMICS | raven-python 🔨 | 🔨 | Returns a sorted **DataFrame** per test (`all`/`up`/`down`) and takes gene→p-value / gene→fold-change **dicts**, vs RAVEN's parallel arrays + struct array + print/file side-effects. Neighbour genes come from cobra's metabolite→reaction→gene graph (no `rxnGeneMat`). | + +## runINIT (Phase 4c — MILP core implemented) + +RAVEN `INIT/runINIT.m` → `init/init.py` (`run_init`). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| I1 | ERGONOMICS | raven-python 🔨 | 🔨 | **Clean optlang reformulation** of the INIT MILP instead of RAVEN's hand-built sparse `prob.A`/`blc`/`buc`/`vartype` arrays + fake "FAKEFORPM" metabolites. Standard include-indicator form `eps·x ≤ v ≤ ub·x` with objective `max Σ score·x + prod_weight·Σ sink`. Far more readable/reviewable; functional equivalence is the bar (PLAN §0). | +| I2 | ERGONOMICS | raven-python 🔨 | 🔨 | **`no_rev_loops` as a single `x_fwd + x_rev ≤ 1`** per reversible reaction, replacing RAVEN's auxiliary A/B/C metabolites with int1/int2 reactions and `C ub=-1` construction. Same effect (no spurious forward/back connectivity loop), a fraction of the machinery. | +| I3 | ERGONOMICS | raven-python 🔨 | 🔨 | **`present_mets` producibility via a small LP feasibility test** (sum of compartment-form drains ≥ 1), instead of mutating the live MILP's RHS one metabolite at a time. | +| I4 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **MILP big-M is each reaction's own `ub`** (`v ≤ ub·x`), not RAVEN's fixed 1000; and `eps`/`prod_weight` are exposed parameters. RAVEN's hard-coded 1/1000/0.1/0.5 only suit ±1000-bounded models with O(1) scores — flagged as scale-dependent and tunable (don't blindly trust them). | +| I5 | ERGONOMICS | raven-python 🔨 | 🔨 | **Predictor-agnostic scoring**: `get_init_model` takes gene *or* reaction scores; gene scoring is generic (`gene_scores_from_expression` for the common RNA-seq path), so single-cell/HPA are just alternative upstream sources feeding the same gene→score table — rather than RAVEN's HPA/array-specific structs baked into `getINITModel`. | + +## ftINIT (Phase 4d — in progress) + +RAVEN `INIT/ftINITInternalAlg.m` (+ orchestration) → `init/ftinit.py` (`run_ftinit`), +`tasks/check.py` (`find_task_essential_reactions`). See docs/ftinit_review_and_plan.md. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| FT1 | EFFICIENCY | raven-python 🔨 | 🔨 | **Essential-reaction discovery via one FVA pass** (`find_task_essential_reactions`) — a reaction is essential for a task iff its FVA range excludes 0 (= RAVEN's constrain-to-0→infeasible), restricted to the flux-carrying candidates of a `pfba` solution. Replaces `getEssentialRxns`' per-reaction knockout loop. | +| FT2 | ERGONOMICS | raven-python 🔨 | 🔨 | **Clean optlang reformulation** of the 6-category MILP instead of RAVEN's hand-built block `prob.a` with the `pi/ni/ei/vprb/vnrvm…` figure. Positive-score reactions keep the continuous-indicator trick (no binary — the ftINIT speedup), encoded as `net_flux ≥ force_on·y`; only negative scores and reversible-direction get binaries. | +| FT3 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Big-M is each reaction's own bound**, not RAVEN's fixed 100/1000; `force_on`/`force_on_ess` exposed. Scale-dependent, calibrated in 4d.7 (cf. I4). | +| FT4 | (caveat) | RAVEN parity | — | **No loopless constraint** (matches RAVEN): the bare MILP can "include" an internal thermodynamically-infeasible cycle if it carries positive net score. Loop-free models rely on the staged pipeline + exchange handling and, at genome scale, real exchanges making cycles non-optimal. A loopless option could be added later. | +| FT9 | CORRECTNESS (robustness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Per-reaction essential forcing, clamped to capacity** (4d.7 calibration). Staging fixes previous-step reactions as essential; RAVEN forces them at `min(0.99·\|prev flux\|, 0.1)` so none is forced above the flux it carried (`essential_force`). On top, the forced magnitude is **clamped to the reaction's bound** so a low-capacity essential never produces an `lb>ub` error — it is forced to its capacity instead (RAVEN would error). | +| FT-met | (deferred) | — | ⬜ | **Metabolomics production bonus (4d.6) deferred.** The linear merge eliminates degree-2 detected metabolites, so it needs RAVEN's producer-group-mapping + `mon`/`vnrbm`/`vnrvm`/`vnim` negative-producer force-flux block — the most intricate MILP in ftINIT, for its least-used input. `ftinit(metabolomics=…)` raises `NotImplementedError`. | +| FT8 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **`remove_low_score_genes`** (port of `removeLowScoreGenes`): prune negative-scoring genes from GPRs via a recursive walk of **cobra's GPR AST** — isozymes (OR) dropped keeping ≥1 (least-negative if all negative), complex subunits (AND) kept intact, nested groups handled — instead of RAVEN's regex subset-extraction with `#n#` placeholders. The all-negative tie-break is **deterministic** (reproducible) vs RAVEN's random. Wired into `ftinit(gene_scores=…)` (with `fill_tasks` for gap-filling) to complete the pipeline. Matches RAVEN's three docstring example cases. | +| FT7 | EFFICIENCY | raven-python 🔨 | 🔨 | **Task gap-filling** (`fill_tasks`, port of `ftINITFillGapsForAllTasks` + the fill MILP): per task, only if it is **infeasible** in the current model (a cheap LP gates the MILP), add the minimum-cost set of reference reactions (cost = `−min(score, −0.1)`) to satisfy the task's ranged metabolite bounds; accumulate across tasks; exchange reactions excluded as candidates. Built on the shared `apply_task_constraints` (factored out of `check_tasks`) + an optlang on/off MILP, instead of RAVEN's hand-built `[S pos neg int b var]` block and the custom `rxnScores`-field-on-the-model hack. Matches RAVEN tinitTests T0003 (gap at R7 → R7 added back). | +| FT6 | ERGONOMICS | raven-python 🔨 | 🔨 | **Staged pipeline** (`prep_init_model` → `PrepData`, `get_init_steps`/`InitStep`, `ftinit`): one-time reaction classification (`classify_reactions` = the `toIgnore` masks) + essential discovery + linear merge bundled into `PrepData`, reused per sample; the staged `'1+1'`/`'2+1'`/`full` schedule runs `run_ftinit` per step with previous-step reactions fixed as essential **in their flux direction** (carried as `essential_directions`, no model-flipping). RAVEN's Gurobi-specific per-step MIPGap retry schedule is dropped (solver-agnostic; tuning → 4d.7). Matches RAVEN tinitTests T0001 (no tasks → {R1,R4,R6,R8,R9,R10}; with R7/R10 spontaneous → {R1,R2,R4,R6,R7,R8}) and T0002 (task → essentials {R1,R7}, model {R1,R2,R4,R6,R7,R8,R9,R10}). | +| FT5 | EFFICIENCY | raven-python 🔨 | 🔨 | **Linear merge** (`merge_linear` + `group_rxn_scores`, port of `mergeLinear`/`groupRxnScores`): contracts degree-2-metabolite reaction chains (one producer, one consumer; reversibles included) into single reactions, losslessly shrinking the MILP (~⅓ fewer reactions on Human-GEM). Clean Python on a working representation with reversibility ≡ `lb<0` (RAVEN's `rev1·rev2` falls out of the most-constraining bound recompute); drops genes and objective on the reduced model (it exists only to feed the MILP, which scores via `group_rxn_scores`). Matches RAVEN's tinitTests T0004 exactly (group ids, merged bounds/reversibility, flipped reactions, grouped scores incl. the 0→0.01 handling). | + +## parseTaskList / checkTasks (Phase 4a — implemented) + +RAVEN `core/parseTaskList.m` + `core/checkTasks.m` → `tasks/tasklist.py` + `tasks/check.py`. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| T1 | ERGONOMICS | raven-python 🔨 | 🔨 | **Structured `Task` dataclass + `TaskResult`** instead of RAVEN's parallel-array struct and a printed report; programmatic access to per-task pass/fail/feasibility/error. | +| T2 | ERGONOMICS | raven-python 🔨 | 🔨 | **TSV-first task files** (stdlib `csv`); `.xlsx` still supported but via the lazy `[excel]` extra — no hard Excel dependency just to read a task list. | +| T3 | EFFICIENCY | raven-python 🔨 | 🔨 | Inputs/outputs imposed directly on cobra's **metabolite mass-balance constraint bounds** (the analogue of RAVEN's two-column `model.b`), and existing boundary reactions are auto-closed — so a model with open exchanges is handled correctly (RAVEN assumes a closed model and silently misbehaves otherwise). | + +## fillGaps (Phase 4b — implemented) + +RAVEN `core/fillGaps.m`. Only the **connectivity** mode is ported, as +`connect_blocked_reactions` ([gapfilling/fill.py](src/raven_python/gapfilling/fill.py)) — +MILP via cobra/optlang (GLPK). RAVEN's other mode (fill to make the objective feasible) +is `cobra.flux_analysis.gapfill` and is **cheatsheeted, not re-wrapped** (PLAN §1). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| GF1 | NEW (vs cobra) | raven-python 🔨 | 🔨 | **Connectivity gap-fill has no cobra equivalent**: add the minimum-penalty set of template reactions so *blocked* draft reactions can carry flux (cobra's `gapfill` only fills toward the objective). Ported as `connect_blocked_reactions` — a name that avoids confusion with `cobra.gapfill` and says what it does, vs RAVEN's overloaded `fillGaps(useModelConstraints=...)` boolean. | +| GF2 | ERGONOMICS | raven-python 🔨 | 🔨 | **Templates matched by `name[comp]`** (via `add_reactions_from_model`), so a template in a different identifier namespace than the draft still contributes — as RAVEN's name-based merge does. (For the targeted `cobra.gapfill` path, ids must be aligned first, since cobra matches by id — noted in the cheatsheet.) | + +## addRxns + +RAVEN `core/addRxns.m` — add reactions from equation strings (or mets+coeffs), auto-creating +metabolites/genes. Ported as `add_reactions_from_equations` +([manipulation/add.py](src/raven_python/manipulation/add.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| A1 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Readable matching mode instead of the `eqnType` integer.** RAVEN takes `eqnType=1/2/3` (match by id / by name in a given compartment / `name[comp]`), which is opaque at call sites. raven-python uses `mets_by="id"\|"name"` and auto-detects `name[comp]` per token. **MATLAB back-port:** accept a string keyword. | +| A2 | ERGONOMICS (bug-class) | raven-python 🔨 | 🔨 | **Error on duplicate reaction IDs explicitly.** RAVEN errors; cobra's `add_reactions` *silently ignores* a duplicate. raven-python keeps RAVEN's stricter behaviour (raise) rather than cobra's silent drop. | +| A3 | EFFICIENCY (reuse) | raven-python 🔨 | 🔨 | **Delegate equation/arrow/coefficient parsing and gene/met creation to cobra** (`build_reaction_from_string` semantics, GPR auto-creation) instead of re-implementing RAVEN's `constructS`/`addGenesRaven`. Only the genuinely cobra-absent pieces (name matching, compartment for new mets, strict policies) are hand-written. | +| A4 | NEW | both 💡 | 💡 | **Infer compartment from a structured metabolite ID** (e.g. `atp_c` → `c`) as an alternative to requiring `compartment`. Not yet implemented; would reduce boilerplate for SBML-style IDs. Revisit alongside `addMets`. | + +## changeGrRules + +Ported as `change_gene_reaction_rules` ([manipulation/change.py](src/raven_python/manipulation/change.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| G8 | EFFICIENCY (reuse) | raven-python 🔨 | 🔨 | **changeGrRules: delegate gene creation + normalization to cobra.** RAVEN calls `getGenesFromGrRules` + `addGenesRaven` + `standardizeGrRules` + rebuilds `rxnGeneMat`; cobra does all of that automatically on `gene_reaction_rule =`. The port keeps only the batch loop and the append (`(old) or (new)`) option. | + +## simplifyModel + +Gap modes ported in [manipulation/simplify.py](src/raven_python/manipulation/simplify.py). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| S3 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Only the cobra-absent modes ported as focused functions**, not a monolithic 8-flag `simplifyModel`. `deleteMinMax`→`find_blocked_reactions`, `deleteZeroInterval`→filter+prune, `deleteUnconstrained`→moot are cheatsheeted. dead-end / duplicate / constrain-reversible / group-linear are standalone, composable functions. | + +## mergeModels + +Ported as `merge_models` ([manipulation/merge.py](src/raven_python/manipulation/merge.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| M1 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **~560 lines of struct field-padding + manual S-matrix assembly dropped.** On `cobra.Model` the merge is just: unify metabolites by `name[comp]`, re-add reactions remapped to the merged metabolites, let cobra rebuild S and create genes. | +| M2 | ERGONOMICS | raven-python 🔨 | 🔨 | **Provenance via `notes['origin']`** (one place) instead of three parallel `rxnFrom`/`metFrom`/`geneFrom` fields. `match_by="name"|"id"` keyword replaces RAVEN's `metParam` string. | + +## checkModelStruct + +Ported (curation subset) as `check_model` ([utils/validate.py](src/raven_python/utils/validate.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| V1 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Drop the struct/type/duplicate-ID/`lb>ub`/`rev` checks** — cobra's object model enforces or precludes them (DictList forbids duplicate IDs, Reaction rejects `lb>ub`, no `rev` field). Only the curation checks cobra lacks survive. | +| V2 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Return structured `ModelIssue`s, not printed warnings** (RAVEN prints / throws). Programmatically filterable by `category`. **MATLAB back-port:** return an issues struct array. | + +## setParam / getElementalBalance + +Ported as `set_parameters` ([manipulation/parameters.py](src/raven_python/manipulation/parameters.py)) +and `get_elemental_balance` ([utils/balance.py](src/raven_python/utils/balance.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| ~~P1~~ | ERGONOMICS | — | ↩ revised | A 6-mode keyword `set_parameters` was built then **trimmed** (review: not Pythonic — it re-wrapped cobra one-liners for `lb`/`ub`/`eq`/`obj`/`unc`). Only the `var` ±% band, which cobra has no idiom for, is kept as `set_variance_bounds`; the rest are documented as cobra idioms in the §1 cheatsheet. | +| B1 | ERGONOMICS (correctness) | raven-python 🔨 | 🔨 | **getElementalBalance: report `unknown` for missing formulas.** cobra's `check_mass_balance` silently treats a metabolite with no formula as contributing nothing, so the reaction can read as (un)balanced on incomplete data. raven-python flags those as `unknown` rather than guessing — preserving RAVEN's distinction (its `-1` status). | +| B2 | CORRECTNESS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Empty-stoichiometry reactions report `unknown`, not vacuous `balanced`.** A reaction with no metabolites used to fall through to `balanced` (any-over-empty is False; the zero-imbalance dict is empty). Same bug in MATLAB `core/getElementalBalance.m`: with no entries in `model.S(:,j)`, `balanceStatus` stays NaN through both loops and the final `isnan→1` step labels it `balanced`. **Proposed back-port:** add an `emptyRxns = full(sum(model.S~=0,1))==0` mask and `balanceStatus(emptyRxns) = min(-1, balanceStatus(emptyRxns))` before the `isnan→1` line, so empty reactions are marked "missing information". | + +## getRxnsInComp / getMetsInComp — not ported + +Briefly ported, then **removed** (user review): too thin over cobra (`metabolite.compartment` / +`reaction.compartments` one-liners). Mapped in the §1 migration cheatsheet instead. Reconsider only +if a downstream consumer needs the `include_partial` (fully-contained vs touching) distinction in +several places — and ask before re-adding (see process note: argue pros/cons for marginal WRAPs). + +Ported as `remove_metabolites` / `remove_genes` +([manipulation/remove.py](src/raven_python/manipulation/remove.py)). `removeReactions` was **not** +ported: with orphan cleanup kept coupled (decision: don't separate metabolites from genes), it is +identical to `cobra.Model.remove_reactions(remove_orphans=...)`. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| ~~R1~~ | ERGONOMICS | — | ❌ rejected | Separable orphan-met vs orphan-gene cleanup was considered, then **dropped** by decision — keep them coupled like cobra. (Removed the `remove_reactions` wrapper entirely as a result.) | +| R2 | EFFICIENCY (reuse) | raven-python 🔨 | 🔨 | **GPR rewriting delegated to cobra's AST**, not RAVEN's `eval` of a `&&`/`\|\|`-substituted rule string. cobra's `remove_genes` already gives correct AND/OR semantics (removing one gene of `A and B` empties the rule; of `A or B` keeps the other). **MATLAB back-port:** replace `canRxnCarryFlux`'s `eval` with a parsed boolean tree (safer, no eval). | +| R3 | ERGONOMICS | raven-python 🔨 | 🔨 | **`blocked_reactions` policy as a clear keyword** (`remove`/`constrain`/`keep`) instead of RAVEN's `removeBlockedRxns` boolean — and `keep` (rewrite GPR, leave bounds) is a third option RAVEN lacks. | +| R4 | (review) | raven-python ⚠️ | 💡 | **`remove_metabolites` is a deletion candidate.** Its only value over cobra is `by_name` cross-compartment deletion, likely rarely used; revisit and possibly drop the wrapper. | + +## readYAMLmodel / writeYAMLmodel + +RAVEN `io/readYAMLmodel.m` + `writeYAMLmodel.m` (+ private legacy parser). Ported as +`read_yaml_model`/`write_yaml_model` ([io/yaml.py](src/raven_python/io/yaml.py)). + +**Lens correction (no separate legacy parser).** RAVEN ships a 462-line `parseYAMLLegacy.m` for the +`!!omap` dialect, and geckopy refuses it ("re-save from MATLAB"). But `!!omap` is **cobra's own YAML +format**: `cobra.io.load_yaml_model` reads a real yeast-GEM.yml (4102 rxns) directly. So the +raven-python-unique capability the PLAN imagined (a legacy reader) is unnecessary; the real cobra-absent +value is preserving `metaData` identity and RAVEN-only per-entry fields, which is what was built. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| Y1 | EFFICIENCY (scope) | raven-python 🔨 | 🔨 | **Drop the bespoke legacy YAML parser; delegate to cobra's `!!omap` loader.** ~460 lines of RAVEN parsing not reimplemented. raven-python reads old RAVEN/Human-GEM YAML in pure Python with no MATLAB needed (geckopy can't — it tells users to re-save from MATLAB). | +| Y2 | ERGONOMICS (data loss) | raven-python 🔨 | 🔨 | **Don't silently drop model identity/provenance or RAVEN-only fields.** A plain `cobra.io.load_yaml_model` of a RAVEN file yields `model.id is None` and discards `smiles`/`deltaG`/`confidence_score`/etc. raven-python preserves them. **Routed by meaning** (not blindly to notes): chemical-structure identifiers `smiles`/`inchis` → cobra `annotation` (the MIRIAM-style store other tools read); genuinely non-standard data (`deltaG`, `confidence_score`, `metFrom`/`rxnFrom`, `protein`) → `notes`. Not invented as attributes (`met.deltaG`), since cobra only persists `annotation`/`notes` through copy/SBML/JSON/YAML. | +| Y4 | NEW | both 💡 | 💡 | **Upstream candidate: a first-class thermodynamics/confidence field.** `deltaG` and `confidence_score` live in `notes` because neither cobra nor SBML core has a home; if a standard slot (e.g. SBML fbc/groups or a cobra attribute) emerges, migrate them there. Also applies to MATLAB RAVEN's `metDeltaG`/`rxnConfidenceScores` consistency. | +| Y3 | NEW | raven-python 🔨 | 🔨 | **Emit cobra-native `!!omap` output** (via cobra's own dumper) — done, matching RAVEN `fa281a1`. Verified `cobra.io.load_yaml_model` reads the output. | +| Y5 | ERGONOMICS (correctness) | raven-python 🔨 | 🔨 | **Field placement realigned to `fa281a1`:** `smiles`/`ec-code` are in the cobra-owned `annotation` block (not top-level), `inchis` is top-level, and the top-level `notes` *string* (metNotes/rxnNotes) is handled rather than crashing a notes-as-dict assumption. | + +## changeRxns + +RAVEN `core/changeRxns.m` — change reaction equations. Ported as +`change_reaction_equations` ([manipulation/change.py](src/raven_python/manipulation/change.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| C1 | EFFICIENCY | raven-python 🔨 | 🔨 | **Edit the reaction in place** instead of RAVEN's copy-all-fields → `removeReactions` → `addRxns` → `permuteModel` round-trip. cobra mutates the same `Reaction` object, so every other field and the model order are preserved for free, with no O(n) re-sort. (Not a MATLAB back-port — the round-trip is inherent to the struct layout there.) | + +## getIndexes + +RAVEN `core/getIndexes.m` — resolve a list of IDs / logical mask / index vector into positional +indexes (or a logical array) for `rxns` / `mets` / `genes` / `metNames` / `metcomps` (and GECKO +`ec.*` fields). + +**Decision (raven-python): do NOT port the function.** cobra is object-oriented, so the central +index-resolver that RAVEN's struct-of-parallel-arrays design requires is largely unnecessary. +cobra's `DictList` already covers the use cases more idiomatically — `get_by_any` (mixed +id/object/index → objects), `get_by_id` (O(1)), `query` (name/substring/regex), `index` (position), +list comprehensions for filtering. Porting a 1-based-index resolver would be redundant and +un-Pythonic. **Only** the `name[comp]` composite resolver is kept (G7), as a small internal helper. + +The improvement insights below still hold for **MATLAB RAVEN**, where the function remains — flagged +as upstream-only back-port candidates. + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| G1 | EFFICIENCY | MATLAB RAVEN only | 💡 | **Hash-based lookup instead of per-query linear scan.** RAVEN loops `find(strcmp(obj(i), searchIn))` per query → O(n·m). Build a `containers.Map` `{id: position}` once (O(n)), look up in O(1); `metcomps` likewise. (Moot for raven-python — cobra's `DictList` is already hashed.) | +| G5 | ERGONOMICS (bug) | MATLAB RAVEN only | 💡 | **Disambiguate the `[1 1 1]` mask-vs-index bug.** RAVEN's `if all(objects)` conflates a logical all-true mask with the index vector `[1 1 1]` (its own comment: "This gets weird if it's all 1"). Test `islogical(objects)` explicitly instead of `all(objects)`. (Moot for raven-python — input kind is decided by dtype.) | +| G7 | NEW | raven-python helper | 💡 | **Extract a reusable `name[comp]` parser/resolver.** The composite-id parsing buried in `getIndexes`'s `metcomps` branch is the one capability cobra lacks. Expose as a standalone `parse_name_comp` / `resolve_metabolite_by_name_comp`, reused by `addRxns`/`addTransport`/`mergeModels`. This is the only piece carried into raven-python. | + +**Obsoleted by cobra (no action — these were earlier raven-python proposals now covered by `DictList`):** +predictable return type, return objects-not-positions, configurable missing-object policy across a +batch, and substring/regex matching — all already provided by `get_by_any` / `get_by_id` / `query`. + +--- + +## standardizeGrRules + +RAVEN `core/standardizeGrRules.m` — normalize grRule syntax + flag rules not in simple +OR-of-AND-complex (DNF) form (`findPotentialErrors`). + +**Decision (raven-python): port the lint half only.** cobra auto-normalizes a GPR on assignment +(`"(G1 AND G2) OR G3"` is stored as `"(G1 and G2) or G3"`), so the normalization half is +redundant. The non-DNF lint has no cobra equivalent and was ported as `find_non_dnf_grrules`/`is_dnf` +([utils/gpr.py](src/raven_python/utils/gpr.py)). + +| # | Cat | Target | Status | Improvement | +|---|---|---|---|---| +| S1 | ERGONOMICS | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Return structured lint results, don't just print.** RAVEN's `findPotentialErrors` only emits a `warning()` string; you can't act on it programmatically. raven-python returns a list of `GPRIssue(reaction_id, gpr, reason)`. **MATLAB back-port:** return the `indexes2check`/messages as a struct array (it already computes `indexes2check` — just surface it cleanly instead of only warning). | +| S2 | EFFICIENCY (robustness) | raven-python 🔨 + MATLAB RAVEN 💡 | 🔨 | **Detect non-DNF via the boolean AST, not substring search.** RAVEN scans for the `) and (`, `) and`, `and (` substrings, which is brittle (sensitive to spacing/bracketing and to gene IDs containing those characters). raven-python walks cobra's GPR AST (`is_dnf`: no OR beneath any AND), which is exact. **MATLAB back-port:** parse the rule to a tree rather than string-matching. | diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..d39ae69 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,52 @@ +# Documentation + +Start with the [top-level README](../README.md). The docs are organised as: + +## RAVEN ↔ raven-python reference + +* **[raven_migration.md](raven_migration.md)** — function-by-function map from MATLAB RAVEN + to raven-python (and cobrapy where appropriate). Read this if you're coming from RAVEN. +* **[../IMPROVEMENTS.md](../IMPROVEMENTS.md)** — improvements raven-python makes over RAVEN that + are also candidates to back-port into the MATLAB toolbox. + +## Open work + +* **[todo.md](todo.md)** — what's still on the books. +* **[known_issues.md](known_issues.md)** — low-priority backlog from the full-codebase + review. None affects correctness on well-formed inputs. + +## Empirical studies & calibrations + +* **[humangem_validation.md](humangem_validation.md)** — raven-python ftINIT vs MATLAB RAVEN on + 5 Hart2015 cell lines (Jaccard 0.975–0.980). +* **[init_param_calibration.md](init_param_calibration.md)** — clean-data calibration + + input-robustness study for (f)tINIT (sweeps of `mip_gap` / `big_m` / `force_on` / `eps` / + `prod_weight` / scaling; dropout / noise / downsample robustness). +* **[init_solver_benchmark.md](init_solver_benchmark.md)** — Gurobi vs HiGHS vs GLPK on + genome-scale ftINIT. +* **[yeast_localization_benchmark.md](yeast_localization_benchmark.md)** — + `predict_localization` against curated yeast-GEM, with a predictor-noise sweep + (accuracy 0.72 → 0.39 as confident mis-scoring rises from 0 % to 50 %). +* **[kegg_hmm_cutoff_calibration.md](kegg_hmm_cutoff_calibration.md)** — HMM E-value / + score-ratio sensitivity for the KEGG HMM-query reconstruction path. + +## Data formats & maintenance + +* **[kegg_data_format.md](kegg_data_format.md)** — layout of the KEGG artefact bundle. +* **[maintaining_kegg_data.md](maintaining_kegg_data.md)** — building and publishing the + KEGG artefact releases. +* **[maintaining_binaries.md](maintaining_binaries.md)** — building and publishing the + external-binary (BLAST/DIAMOND/HMMER) ZIP releases. + +## Archive + +Historical design notes (preserved for reference but no longer part of the user-facing +documentation): + +* **[archive/ftinit_review_and_plan.md](archive/ftinit_review_and_plan.md)** — the + pre-implementation critical review of `ftINIT` that drove the Phase 4d port plan. +* **[archive/localization_design.md](archive/localization_design.md)** — the + pre-implementation critical review of `predictLocalization` that drove the Phase 7 + redesign (caller-passed relocate set, MILP instead of SA, partial-update mode). +* **[archive/plan_get_model_from_homology.md](archive/plan_get_model_from_homology.md)** — + early planning notes for the homology-based reconstruction. diff --git a/docs/archive/ftinit_review_and_plan.md b/docs/archive/ftinit_review_and_plan.md new file mode 100644 index 0000000..63e0d66 --- /dev/null +++ b/docs/archive/ftinit_review_and_plan.md @@ -0,0 +1,184 @@ +# ftINIT (Phase 4d) — critical review and implementation plan + +Critical read of RAVEN's ftINIT before porting (memory: *review very critically, don't +transcribe*). Covers `ftINIT.m`, `ftINITInternalAlg.m`, `getINITSteps.m`, +`INITStepDesc.m`, `prepINITModel.m`, `mergeLinear.m`, `groupRxnScores.m`, +`rescaleModelForINIT.m`, `ftINITFillGaps*.m` (~3000 lines). + +## 1. What ftINIT is, vs the tINIT we already have + +tINIT (Phase 4c, `init/`) solves **one** MILP: pick the reaction subset that best +matches expression scores while staying flux-consistent. We ported its core as +`run_init` (+ `get_init_model`). ftINIT ("fast tINIT") keeps the *objective* but +attacks the **runtime** of the MILP on genome-scale models (Human-GEM: ~12k rxns) +with four ideas: + +1. **One-time preprocessing (`prepINITModel` → `prepData`).** All omics-independent + work — simplification, essential-reaction discovery via tasks, reaction + classification, linear merging, scaling — is done once per *template* model and + reused across every sample. (RAVEN comment: prep is ~1 h; each model is then fast.) +2. **Linear merging (`mergeLinear`).** Reactions connected through a degree-2 + metabolite (one producer, one consumer) are contracted into one, summing scores. + Human-GEM 11888 → 7922 rxns — a ~33 % smaller MILP, losslessly. +3. **Staged MILP (`getINITSteps`, default `'1+1'`).** Instead of one big MILP, run + two smaller ones: step 1 decides only GPR-associated reactions (transport/ + exchange/spontaneous/no-GPR *ignored* = left in, not in the problem); step 2 fixes + step-1 reactions as **essential** and decides the remaining no-GPR reactions. The + `'full'` series is a single MILP ≈ classic tINIT. +4. **A cheaper MILP formulation (`ftINITInternalAlg`).** Positive-score reactions use + a **continuous** on-indicator instead of a binary (see §3) — roughly halving the + integer count, the dominant cost. + +After the MILP: map merged groups back to original reactions, delete the rest, +**gap-fill so all metabolic tasks are feasible** (`ftINITFillGapsForAllTasks`), +re-add exchanges, optionally `removeLowScoreGenes`. + +## 2. Architecture (data flow) + +``` +template model ──prepINITModel──> prepData {refModel, refModelWithBM, minModel(merged+scaled), + (once) groupIds, origRxnIds, essentialRxns, + essentialMetsForTasks, taskStruct, toIgnore* masks} + │ +omics (tissue) ──scoreComplexModel──> rxnScores │ (per sample) + ▼ + ftINIT step loop (getINITSteps) + ├ step i: groupRxnScores(mask_i) ─> ftINITInternalAlg (MILP) ─> on/off + │ prev results folded in as 'essential' + └ ... + ▼ + delete off-reactions (mapped through groupIds) + ▼ + ftINITFillGapsForAllTasks (per-task gap-fill MILP) + ▼ + re-add exchanges, removeLowScoreGenes ─> model +``` + +## 3. The core MILP (`ftINITInternalAlg`) — formulation + +Reactions are partitioned into **six categories**, each with its own variables. +`forceOnLim = 0.1` (min flux to count a reaction "on"); big-M `= 100` (rev split) and +`1000`/own-ub elsewhere; `intTol = 1e-7`. + +| category | indicator | key constraint | binary? | +|---|---|---|---| +| **PosIrrev** (score>0, irrev) | `Y∈[0,1]` continuous | `v ≥ forceOnLim·Y` | **no** | +| **PosRev** (score>0, rev) | `Y∈[0,1]` + dir-binary | split `v=v⁺−v⁻`; `v⁺+v⁻ ≥ 0.1·Y`; binary picks direction | 1 (dir only) | +| **NegIrrev** (score<0, irrev) | `Y∈{0,1}` | `v ≤ 100·Y` | **yes** | +| **NegRev** (score<0, rev) | `Y∈{0,1}` | split; `v⁺+v⁻ ≤ 0.1·Y` | yes | +| **EssIrrev** (forced on) | — | `lb = min(0.99·|prev flux|, 0.1)` | no | +| **EssRev** (forced on, rare) | dir-binary | split + force ≥0.1 | 1 | + +**The key insight:** a *positive*-score reaction needs no binary — a continuous `Y` +suffices, because the objective *maximises* `Σ score·Y` so it pushes `Y→1` whenever +flux can flow; the `v ≥ 0.1·Y` gate makes `Y>0` impossible without flux. Only +*negative*-score reactions need a true binary (the objective would otherwise leave +their `Y` at 0 for free). Reactions with **score 0 are left out of the problem +entirely** (always present, can carry flux) — this is how the `toIgnore*` categories +are excluded (their score is zeroed by `groupRxnScores`). + +Objective: minimise `Σ(−score)·Y − prodWeight·Σ mon` (metabolite-production bonus). +`prodWeight = 5` (passed by `ftINIT.m`, overriding the docstring default 0.5). + +We already have a *simpler* version of this in `run_init`: binaries for all, +`eps·x ≤ v ≤ ub·x`, reversible split, `prod_weight` sinks. ftINIT's formulation is +the optimised superset. + +## 4. Critical findings (what NOT to transcribe) + +**Magic numbers** — collected, to be named constants and (where they bite) calibrated +like we did for the HMM cut-offs (K15): + +| value | where | role | concern | +|---|---|---|---| +| `forceOnLim = 0.1` | InternalAlg | min "on" flux | scale-dependent; RAVEN itself muses "test 0.01" | +| big-M `100` / `1000` | InternalAlg / FillGapsMILP | on/off gates | must exceed real fluxes; ties to ±1000 clamp | +| `intTol 1e-7` / `1e-9` | InternalAlg / FillGapsMILP | integer tol | solver-specific (Gurobi v10 numerics) | +| `prodWeight = 5` | ftINIT.m | met-production bonus | "has not been evaluated" (their words) | +| score clamps `±0.1`, `0.01` | ftINIT.m / groupRxnScores | avoid 0 scores | MILP can't have exactly-0 score (binary flips freely) | +| `maxStoichDiff = 25` | rescale | coeff-ratio cap | docstring says 250 — **doc/code disagree** | +| MIPGap schedule `0.0004→0.003`, abs-gap `10/20`, TimeLimit `120/5000` | getINITSteps | per-step solver tuning | Gurobi-tuned; HiGHS will differ | +| int-extraction `>1e-3`, `>0.5` | FillGapsMILP / InternalAlg | read binaries | tolerance-driven | + +**RAVEN/Gurobi coupling & quirks** +- `ftINITFillGapsMILP` **hard-errors on glpk-via-COBRA**; "only tested with Gurobi". + raven-python uses **optlang**, so we are solver-agnostic — but the magic numbers and + `Seed=26` reproducibility were Gurobi-tuned. Validation must expect *equally optimal + but not identical* reaction sets (alternative optima); compare on objective value + and task feasibility, not exact rxn identity. +- Several quirks worth fixing upstream too: `rescaleModelForINIT` doc/code mismatch; + `ftINITFillGaps` has a dead default-score branch referencing an undefined `models`; + `prepData` does **not** store `origRxnIds` (downstream relies on implicit ordering — + we will store it explicitly); the 2-column `b` (ranged mass-balance RHS) is not a + cobra concept and must be built as optlang ranged constraints. + +**Genuine simplification/improvement opportunities** (log in IMPROVEMENTS, back-port): +- `simplifyModel`'s ~1 h reversibility-removal pass is an FVA in disguise; cobra's + optimised (optionally parallel) FVA does it far faster — the single biggest prep + speedup available to us. +- Carry `rxnScores` as an aligned array, not a smuggled model field threaded through + `simplifyModel`/`removeReactions`/`mergeModels`. +- Expose the magic numbers as parameters with documented, calibrated defaults rather + than burying them; `forceOnLim`/`prodWeight`/big-M are scale-dependent and deserve + the same empirical treatment as K15. +- `removeLowScoreGenes` (gene pruning by isozyme/complex role) is a clean standalone + utility, reusable beyond ftINIT. + +**Prerequisite gap in our codebase:** `check_tasks` currently returns only +pass/feasible per task. `prepINITModel` needs **essential-reaction discovery** (which +reactions, and in which direction, are essential for each task) and the task +metabolite set. That work was explicitly deferred from 4a/4c to here. + +## 5. What cobra/raven-python already gives us + +- MILP via **optlang** (HiGHS bundled, Gurobi/CPLEX if licensed) — `run_init` + already builds INIT-style MILPs this way; ftINIT extends that builder. +- FVA / pFBA / `find_blocked_reactions` (`simplifyModel`, gap-fill feasibility). +- `check_tasks` (relaxed mass-balance via constraint bounds = RAVEN's `b`), to be + extended with essential-reaction output. +- `connect_blocked_reactions` (connectivity MILP) — related to but not the same as + the task gap-fill MILP (which adds min-cost template reactions to make a task + feasible); the task gap-filler is new. +- `score_reactions_from_genes` / `gene_scores_from_expression` (scoring). + +## 6. Implementation plan — phased sub-steps + +Correctness first on small models (the RAVEN `tinitTests.m` cases define reaction +scores directly via single-gene expression — ideal oracles), then the speed layers. + +| sub-phase | deliverable | builds on | risk | +|---|---|---|---| +| **4d.0** | **Test oracles**: port the `tinitTests.m` toy models + `getINITSteps`-style score→expression helper, so every later sub-phase is checked against RAVEN's expected on/off sets. | tasks, score | low | +| **4d.1** | **Essential-reaction discovery**: extend `check_tasks` to return the per-task essential-reaction matrix + direction + task-metabolite set (the missing `prepINITModel` step-3 input). | tasks/check.py | med | +| **4d.2** | **prepData preprocessing** (`prep_init_model`): simplify (cobra FVA for the reversibility pass), essential rxns (4d.1), reaction-classification masks (exchange/import/transport/spont/no-GPR), **linear merge** (`merge_linear` + `group_ids`, storing `orig_rxn_ids`), rescale. | manipulation/simplify, 4d.1 | **high** (merge is fiddly) | +| **4d.3** | **Core staged MILP** (`ftinit_internal` + step loop): the 6-category formulation extending `run_init`; `INITStep` dataclass + `get_init_steps('1+1'/'2+1'/'full'/…)`; prev-results-as-essential folding; per-step MIPGap retry schedule. Start **without** metabolomics. | init/init.py | **high** | +| **4d.4** | **Task gap-filling** (`ftinit_fill_gaps_for_tasks`): per-task, LP-feasibility-gated, min-cost template-reaction MILP (ranged `b` constraints in optlang). | gapfilling, 4d.1 | med | +| **4d.5** | **Assembly + `removeLowScoreGenes`**: map merged groups back, delete, re-add exchanges, prune low-score genes; top-level `ftinit(...)` entry. | score, 4d.2–4 | low | +| **4d.6** | **Metabolomics block** — **deferred** (decided 2026-05-26). The linear merge eliminates degree-2 detected metabolites, so a clean flux-based bonus is impossible; it needs RAVEN's producer-group-mapping + `mon`/`vnrbm`/`vnrvm`/`vnim` negative-producer force-flux blocks — the most intricate MILP in ftINIT, for its least-used input (not in the RNA-seq/single-cell/HPA ranking; "not designed for metabolomics only"), with a randomness-laden oracle (T0008). `ftinit(metabolomics=…)` raises `NotImplementedError`. | 4d.3 | med | +| **4d.7** | **Validation + calibration**: run `'full'` vs `'1+1'` on a real template (Human-GEM or a smaller curated GEM), compare to RAVEN by objective/task-feasibility/size (not exact rxns); calibrate `forceOnLim`/`prodWeight`/big-M; document like K15. | all | med | + +Suggested module layout under `init/`: `prep.py` (4d.1–2), `ftinit.py` (4d.3 + top +level), `taskfill.py` (4d.4), `genes.py` (`remove_low_score_genes`), reusing +`init.py`/`score.py`/`build.py`. + +## 7. Decisions (locked 2026-05-25) + +1. **Solver: agnostic.** Write the MILP to optlang's generic interface; let cobra's + configured solver decide. Calibrate the default constants on **HiGHS** (open, + bundled) but do not hardcode anything Gurobi-specific. Validation compares on + objective/task-feasibility, never exact reaction identity. +2. **Correctness-first.** Implement the single-step `'full'` MILP (extend `run_init`, + **no** linear merge) and match RAVEN on the toy oracles *first*; then add linear + merge and `'1+1'`/`'2+1'` staging as separately-verified speed layers. → reorders + the sub-phases below: **4d.3 (full MILP) precedes 4d.2 (merge)**. +3. **Metabolomics deferred to 4d.6.** +4. **Validation includes Human-GEM** (in 4d.7): toy oracles + small GEM for fast + iteration, then Human-GEM + a tissue dataset for the real ftINIT regime. + +### Resulting execution order + +`4d.0 oracles → 4d.1 essential-rxn discovery → 4d.3 full MILP (correctness gate) → +4d.2 linear merge → 4d.3b '1+1'/'2+1' staging → 4d.4 task gap-fill → 4d.5 assembly + +removeLowScoreGenes → 4d.6 metabolomics → 4d.7 calibration + Human-GEM validation.` + +The §6 table lists deliverables by id; this is the order they'll actually be built. diff --git a/docs/archive/localization_design.md b/docs/archive/localization_design.md new file mode 100644 index 0000000..12a1366 --- /dev/null +++ b/docs/archive/localization_design.md @@ -0,0 +1,251 @@ +# Localization (Phase 7) — critical review + redesign + +Design note for porting RAVEN's `predictLocalization.m` to raven-python. **Status: proposal, +not yet implemented**; the user-facing API and the algorithmic choices are settled here +before code lands, because RAVEN's all-or-nothing redistribution doesn't match how the +function actually gets used. + +## 1. What RAVEN's `predictLocalization` does, and where it falls short + +The MATLAB function takes a single-compartment (or to-be-merged) model + a per-gene per- +compartment score table (`GSS`) and assigns every reaction to a compartment by simulated +annealing, maximising `Σ gene_score(g,c) − transport_cost`. Decisions baked into that +shape that we should not transcribe blindly: + +1. **Destroys existing compartmentalization.** "If the model contains several + compartments they will be merged." Existing curated assignments — typically the most + reliable signal in a draft GEM — are thrown out and recreated from the predictor + alone. This is fine for a single-compartment Phase-3 draft but obviously wrong as a + default for a refinement step on a multi-compartment model. +2. **Closed-model requirement.** "The input model should also not include any exchange, + demand or sink reactions, otherwise this function would not provide any results." + The caller must pre-process; the function has no graceful path for open models. +3. **Silent connectivity surgery.** "Reactions that are unconnected are removed and + saved in removedRxns" — and the iterative "metabolite must be produced somewhere" + loop quietly deletes any reaction whose substrate has no producer. **This is the + "model must be complete" assumption**: a draft with dangling substrates loses + reactions without the caller necessarily noticing. +4. **Reactions without GPRs get fake genes** (`&&FAKE&&N`) scored 0.5 everywhere — they + carry no signal but still have to be placed. +5. **Mono-localization for genes** ("a gene can only be assigned to one compartment … + This is a simplification to keep the problem size down"). Biologically wrong for the + many dual-localized enzymes (mitochondrial/cytosolic isoforms etc.); a constraint of + the solver, not the biology. +6. **Simulated annealing** with a `maxTime` cap — stochastic, no optimality certificate, + slow on large models. +7. **`expandModel` first**, splitting isozyme reactions before optimisation. Changes the + reaction set rather than handling isozymes in scoring. +8. **No partial-update mode.** Cannot say *"I added these 50 reactions, only place + those, keep the existing assignments"*. +9. **No suggest-only mode.** Mutates the model in place; no preview. +10. **Plotting and the predictor are baked into the function** (`plotResults`, + GSS-shape input). + +Items 1–3 and 8–9 are exactly what the user flagged. Items 5–7 are also worth +reconsidering on a modern stack. + +> **Final API (post-implementation):** the proposal below was refined during +> implementation. The shipped API: +> +> * `reactions_to_relocate` is a **required** caller-passed set of IDs (no +> `notes['localization']='uncertain'` auto-detect — one mechanism is clearer). +> * **Multi-compartment is the default scoring model.** No `multi_compartment_genes` +> boolean. The highest-scoring compartment a gene lands in is "free"; every +> additional compartment costs `multi_compartment_penalty` *plus* its (typically +> lower) predictor score is its own implicit penalty. Pick a large penalty for +> effectively mono-localised genes. +> * `mergeCompartments` and `copyToComps` are ported separately as +> `raven_python.manipulation.merge_compartments` / `copy_to_compartment` (they're useful +> independently of `predict_localization` — for flattening for analysis or building +> dual-localised pathways). +> * `mapCompartments` is **not** ported — its main use case overlaps with +> `compare_models`. + +## 2. Proposed `predict_localization` for raven-python + +Decompose the function into independent concerns: + +* **A pluggable predictor input** — a `pandas.DataFrame` of *gene × compartment* scores + (higher = stronger evidence). Loaders for WoLF PSORT, DeepLoc, … convert their output + to this table; the algorithm is predictor-agnostic. Ships with `load_wolfpsort()` and + documents the `gene_id × compartment → score` contract so users can plug in DeepLoc / + TargetP / their own. +* **A deterministic MILP** instead of simulated annealing. +* **An explicit scope of reactions to (re-)place** — by default *all reactions whose + compartment is unset or marked uncertain*; user can pass `reactions_to_relocate=[…]` + to lock everything else. +* **Existing compartments respected by default** — they are the high-confidence prior, + not noise to be merged away. +* **Open models tolerated.** Boundary reactions (exchange/sink/demand) are pinned to + their current compartment automatically and excluded from relocation. +* **Incomplete models tolerated.** Dangling metabolites are allowed; no silent reaction + removal. The MILP places what it can and reports the rest in `unplaced_reactions`, + not by deleting them. +* **`apply=False`** returns a `LocalizationProposal` (a diff: which reactions move, + which transports are added) without mutating the input. + +### 2.1 API sketch + +```python +@dataclass +class LocalizationScores: + """Per-gene compartment scores. Index = gene_id; columns = compartment ids. + Missing genes / NaN scores fall back to a uniform prior.""" + df: pd.DataFrame + +@dataclass +class LocalizationProposal: + """What the predictor proposes, before applying it.""" + moved: pd.DataFrame # rxn_id, from_compartment, to_compartment, score_delta + added_transports: pd.DataFrame # met_id, from, to, cost + unplaced_reactions: list[str] # couldn't be placed (e.g. no scored genes) + objective: float + +@dataclass +class LocalizationResult: + """Result of applying a proposal (or running with apply=True directly).""" + model: cobra.Model + gene_compartments: dict[str, str | set[str]] # set when multi_compartment_genes=True + added_transports: list[cobra.Reaction] + proposal: LocalizationProposal + +def predict_localization( + model: cobra.Model, + scores: LocalizationScores, + *, + # === The two requested features === + reactions_to_relocate: Iterable[str] | None = None, # None ⇒ pick automatically (see below) + keep_existing: bool = True, # don't merge compartments first + apply: bool = True, # False ⇒ return proposal only + # === Other knobs === + default_compartment: str = "c", + transport_cost: float | Mapping[str, float] = 0.5, + multi_compartment_genes: bool = False, # relax mono-localization + require_producibility: bool = False, # off ⇒ don't drop unproducible mets + mip_gap: float = 0.001, + time_limit: float | None = None, +) -> LocalizationResult | LocalizationProposal: + """Assign reactions to subcellular compartments via a deterministic MILP. + + By default (``reactions_to_relocate=None``, ``keep_existing=True``), only reactions + that lack a compartment assignment (or are marked uncertain via a notes flag) are + placed; reactions already in a compartment stay put. Pass an explicit + ``reactions_to_relocate`` list to override. + + With ``apply=False`` the function returns a :class:`LocalizationProposal` describing + what would change — useful for review or for diffing against a curator's choices. + """ +``` + +### 2.2 What gets placed by default + +| reaction state | default behaviour | +|---|---| +| has a (single) compartment assignment in the existing model | **pinned** (kept as-is) | +| boundary reaction (exchange / sink / demand) | **pinned** to its current compartment | +| no compartment set OR marked uncertain (e.g. `notes['localization'] == 'uncertain'`) | **relocated** | +| caller passed `reactions_to_relocate=[…]` | **only those** relocated; rest pinned | + +This handles both user requests in one design: + +* *Incomplete model*: dangling mets / missing producers are tolerated; reactions are + placed if any scored gene supports them, and unplaceable ones land in + `unplaced_reactions`. +* *Selective re-localization*: pass the subset; defaults respect what's there. + +### 2.3 MILP formulation (mono-localization variant, `multi_compartment_genes=False`) + +Let `R*` = reactions to relocate, `G*` = unique genes appearing in those reactions, `C` += compartments, `M` = metabolites occurring in `R*` reactions. + +Variables (all binary): +* `x[r, c]` for `r ∈ R*`, `c ∈ C` — reaction `r` placed in compartment `c`. +* `y[g, c]` for `g ∈ G*`, `c ∈ C` — gene `g` assigned to compartment `c`. +* `t[m, c]` for `m ∈ M`, `c ∈ C ∖ {default}` — transport of `m` between `default` and + `c`. + +Constraints: +* Each relocated reaction goes to exactly one compartment: `Σ_c x[r,c] = 1`. +* Each gene goes to exactly one compartment: `Σ_c y[g,c] = 1`. +* Gene-reaction coupling: if reaction `r` requires gene `g` (per its GPR), then placing + `r` in `c` requires `g` in `c`: `x[r,c] ≤ y[g,c]` for the AND-clause; OR-clauses use + the standard linearisation (any isozyme satisfying the placement is enough). +* Metabolite presence in each compartment: if any reaction touching `m` is placed in + `c`, then `m` must exist in `c` — modelled by adding a per-compartment "demand for + transport" when needed. Concretely: if a metabolite participates in reactions placed + in `c ≠ default`, a transport `t[m,c]` is required unless another reaction in `c` + balances it. + +Objective: +``` +maximise Σ_{g,c} gene_score(g,c) · y[g,c] − Σ_{m,c} transport_cost(m) · t[m,c] +``` + +With `multi_compartment_genes=True`, the `Σ_c y[g,c] = 1` constraint is relaxed (gene +can be in several compartments), and gene-reaction coupling becomes +`x[r,c] ≤ y[g,c]` per compartment as before — but the gene now contributes its +compartment-specific score once per assigned compartment. (The cost of allowing this is +that the same gene's score is double-counted across compartments; an alternative is to +include a small penalty per extra compartment, controllable via a `multi_compartment_penalty` +argument — leave this to a later iteration.) + +### 2.4 Why MILP rather than simulated annealing + +RAVEN's SA was chosen in an era when MATLAB MILP solvers were less accessible. Today +we have Gurobi already wired in. The MILP gives: +* deterministic, reproducible answers (important for science), +* an explicit optimality certificate / gap, +* faster solves on the small/medium problems this is (≪ 30k binaries even for a + whole-genome model with 10 compartments), +* the same well-understood `mip_gap` / `time_limit` controls the rest of raven-python uses, +* graceful degradation: at the time limit, return the best incumbent rather than the + "current SA state" with no quality guarantee. + +The SA's main practical benefit was *good-enough answers without a license-encumbered +MILP solver*. We've already documented that genome-scale (f)tINIT needs Gurobi +([docs/init_solver_benchmark.md](init_solver_benchmark.md)); reusing the same backend +is consistent. (For users on open-source solvers, GLPK on this MILP should be tractable +because it's smaller than ftINIT — the per-reaction big-M is at most `|C|`, not 1000.) + +## 3. Pluggable predictors + +The algorithm consumes a `gene × compartment` score table. raven-python ships: +* `load_wolfpsort(path)` — RAVEN-compatible WoLF PSORT output → score table. +* (later) a thin adapter for **DeepLoc** (TSV with per-class probabilities). +* The format is open: a user can build the DataFrame from any source. + +A small `omics.scoring` helper could compute a uniform "no-evidence" row for genes +absent from the predictor's output (RAVEN's 0.5-everywhere fallback). + +## 4. What this delivers for the user's two requests + +* **Incomplete model OK** — `require_producibility=False` (default) means unbalanced + metabolites are tolerated; reactions are placed when their genes have signal, and + unplaceable ones are *reported*, not silently deleted. No "model must be complete" + precondition. +* **Selective re-localization** — `reactions_to_relocate=[…]` pins everything else. + With `apply=False` the user gets a diff to review; with `apply=True` only the + selected reactions move. The default (no `reactions_to_relocate` argument) places + only the uncompartmentalised / uncertain reactions, which is the natural "update the + draft" workflow. + +## 5. Open questions to resolve before implementing + +1. **How does raven-python mark "uncertain" compartmentalization?** Proposal: a + `notes['localization'] = 'uncertain'` flag on reactions; or a passed-in set of ids. + The "auto-pick-what-to-place" mode reads this. +2. **Multi-compartment gene scoring**: simple multi-counting (the same score in every + compartment the gene lands in) vs a penalty per extra compartment. Start with simple + multi-counting; add the penalty later if users ask. +3. **DeepLoc adapter** — ship now or later? DeepLoc has a stable TSV format; a 30-line + loader is trivial when there's a real use case. Defer until requested. + +## 6. Implementation plan + +1. `localization/scores.py` — `LocalizationScores`, `load_wolfpsort`. +2. `localization/milp.py` — `predict_localization` (MILP). +3. `localization/apply.py` — convert a proposal into a re-compartmentalised cobra model + (added transports, updated metabolite compartments). +4. Tests on a small handcrafted model + a regression test against the RAVEN paper's + yeast example. +5. PROGRESS / PLAN updates. diff --git a/docs/archive/plan_get_model_from_homology.md b/docs/archive/plan_get_model_from_homology.md new file mode 100644 index 0000000..f7185ae --- /dev/null +++ b/docs/archive/plan_get_model_from_homology.md @@ -0,0 +1,120 @@ +# Design plan: `get_model_from_homology` (port of RAVEN `getModelFromHomology`) + +Detailed design for the core of homology reconstruction (Phase 3a). Companion to +PLAN.md §2.3a. Goal: faithful *intent*, but a clearer, more robust implementation — +RAVEN's own comments flag several soft spots, and the user asked for logic +improvements. + +--- + +## 1. What it is trying to do (the intent) + +Build a draft GEM for a **new organism** by transferring reactions from one or more +**template GEMs**, using **orthology** (bidirectional protein BLAST/DIAMOND hits) to +map template genes → new-organism genes and rewrite each reaction's GPR accordingly. + +A reaction is transferred if at least one of its genes has an ortholog in the new +organism; its GPR is rewritten so template genes become their new-organism orthologs. +When several templates are given, an optional **preferred order** makes each gene's +reactions come from a single (highest-priority) template; otherwise reactions from all +templates are merged. Metabolites are unified across templates by **name[compartment]**. + +## 2. RAVEN's algorithm, in steps + +1. Strip cross-species gene metadata from templates (`geneShortNames`, `geneMiriams`, + `proteins`, `geneFrom`); standardize template grRules. +2. Validate: gene ids without stray `()`; blast `fromId`s match template ids; ≥5 % of a + template's genes appear in the hits (else the FASTA/model id styles disagree). +3. **Filter hits** by `evalue ≤ maxE`, `aligLen ≥ minLen`, `identity ≥ minIde`. +4. Drop template reactions with no genes (and genes with no reactions). +5. `onlyGenesInModels`: drop hits whose template gene isn't in a model. +6. `strictness==3`: reduce to best-by-**E-value** hits per `fromGene`. +7. Build sparse matrices `allTo` (new→template hits) and `allFrom` (template→new) per + template, indexed by gene position. +8. **Ortholog map** = per strictness: `1`/`3` → `allTo & allFrom` (reciprocal); + `2` → just `allFrom` or `allTo` (one-directional, per `mapNewGenesToOld`). +9. Drop mappings to genes not in the model; simplify. +10. Per template, keep reactions associated with a mapped gene (keeping AND-complexes + where *any* subunit mapped). +11. `preferredOrder`: remove genes already claimed by an earlier template so each gene's + reactions come from one template. +12. **Rewrite GPRs** by `regexprep` string substitution: replace each template gene with + its new ortholog(s) — `(new1 or new2)` if several; template genes with no ortholog + are renamed `OLD__` (kept so AND-complexes survive). +13. `mergeModels(..., 'metNames')`; then `regexprep` away `OLD_…` genes that ended up in + `or` relations; set id/name/notes/confidence; standardize; delete unused genes. + +## 3. Proposed Python design + +```python +def get_model_from_homology( + models, # list[cobra.Model], or one model + hits, # bidirectional hits DataFrame (run_blast / make_ortholog_hits) + model_for, # target organism id + *, + preferred_order=None, # list[model_id]; if set, each gene's reactions from one model + bidirectional=True, # require reciprocal hits (RBH-style) [improvement A] + best_hits_only=False, # keep only best-scoring hit per gene first [improvement A] + map_direction="new_to_old", # used only when bidirectional=False + score="bitscore", # best-hit criterion: "bitscore" | "evalue" [improvement D] + complex_policy="flag", # AND-subunits lacking orthologs: flag|keep|drop [improvement C] + # default "flag" = RAVEN-compatible (OLD__) + only_genes_in_models=False, + max_evalue=1e-30, min_align_len=200, min_identity=40, +) -> HomologyResult # .model + .gene_map + .reaction_sources [improvement F] +``` + +**Data flow (all on the hits DataFrame + cobra GPR AST, no sparse-matrix juggling):** + +1. Validate + filter hits (`max_evalue`/`min_align_len`/`min_identity`) — a DataFrame mask. +2. Optionally reduce to best hit per gene by `score` (groupby-idxmax). +3. Build the **ortholog map** `new_gene -> {model_id: {template_gene, ...}}`: + - `bidirectional`: inner-join the two directions on (new_gene, template_gene) — a pandas + merge, i.e. reciprocal hits. (`bidirectional` + `best_hits_only` = reciprocal best hits.) + - else: take the one direction (`map_direction`). +4. For each template, for each reaction: rewrite the GPR on the **cobra GPR AST** + (improvement B) — substitute each leaf gene by the OR of its new orthologs; handle + unmapped leaves per `complex_policy`; keep the reaction iff ≥1 leaf mapped. +5. Resolve `preferred_order`: a new gene's reactions come from the first template (in order) + that maps it (improvement E — a dict lookup, not matrix index math). +6. Transfer the surviving reactions into the draft, unifying metabolites by `name[comp]` + (reuse `merge_models` / `add_reactions_from_model`); record provenance (improvement F). +7. Finalize: id=`model_for`, name, per-reaction note/confidence, prune unused genes, + lint GPRs (`find_non_dnf_grrules`). + +`make_ortholog_hits` (PLAN §2.3a) feeds the same DataFrame, so the whole thing is +testable without BLAST. + +## 4. Logic improvements (the point of this exercise) + +| # | Improvement | Why it's better than RAVEN | +|---|---|---| +| **H1** | **Two orthogonal params** (`bidirectional`, `best_hits_only`) replace the overloaded `strictness` 1/2/3. | RAVEN's `strictness` conflates two independent axes — *directionality* (one-way vs reciprocal) and *best-hit filtering*. Splitting them is clearer and exposes all 4 combinations (incl. one-way best-hit). RAVEN mapping documented: `1`→bidir; `2`→one-way; `3`→bidir+best-hits = **reciprocal best hits (RBH)**. | +| **H2** | **AST-based GPR rewriting** (cobra `GPR`) instead of `regexprep` on rule strings. | RAVEN's author flags the string substitution as uncertain ("I hope that it's ok…") and needs a follow-up regex to clean `OLD_… or` leftovers. Substituting leaves on the parsed boolean tree is robust (no partial-match hazards, no cleanup pass) and reuses the AST machinery already in `expand_model`/`find_non_dnf_grrules`. | +| **H3** | **Explicit complex policy** for AND-subunits lacking an ortholog: `keep` (drop the unmapped subunit, transfer the reaction), `drop` (require a fully-mapped AND-clause, else drop the reaction), `flag` (RAVEN's `OLD__` placeholder). | RAVEN's behaviour is the implicit, fragile `OLD_`+regex path its comments distrust. Making it an explicit, AST-correct policy (OR = keep isozyme branches that mapped; AND = configurable) is principled and lets the user choose optimism vs strictness. `flag` preserves RAVEN compatibility. | +| **H4** | **Best-hit selection by `bitscore`** (default), E-value optional. | E-value depends on database size and ties at ~0 for strong hits; **bitscore is database-size-independent** and the standard criterion for reciprocal-best-hit orthology. RAVEN uses min E-value only. | +| **H5** | **DataFrame ortholog map** replaces the `allGenes`/`allTo`/`allFrom` sparse-matrix + `sub2ind` index juggling. | The reciprocal mapping and preferred-order resolution become a pandas merge + dict lookup — far less error-prone than the position-index gymnastics RAVEN itself annotates with uncertainty. | +| **H6** | **Provenance** in the result: per reaction, which template + which ortholog pairs supported it (`reaction.notes`), and a returned `gene_map`. | RAVEN returns only `hitGenes` (flat old/new lists) and a fixed note. Structured provenance aids curation and debugging. | + +**RAVEN-compatibility:** accept a `strictness=` alias that sets `bidirectional`/`best_hits_only` +(1/2/3) and default `complex_policy="flag"` *iff* `strictness` is passed, so legacy calls reproduce +RAVEN behaviour; otherwise use the clearer defaults above. + +## 5. Resolved: `complex_policy` default + +**Decided: default `complex_policy="flag"`** (RAVEN-compatible) — for an AND-complex reaction with a +subunit lacking an ortholog, keep the reaction and mark the missing subunit (`OLD__`), +matching current RAVEN output. `keep` (drop the unmapped subunit) and `drop` (require a fully-mapped +AND-clause) remain available for users wanting cleaner or higher-confidence drafts. Implementation +note: even under `flag`, the GPR is rebuilt on the **AST** (improvement H2) — the `OLD_` marking and +the removal of `OLD_` genes left in `or` branches are done as AST operations, not the regex passes +RAVEN uses. + +## 6. Test strategy + +- **Core, no BLAST:** drive with `make_ortholog_hits` + small template models; assert reactions + transfer, GPRs rewrite correctly (one-to-one, one-to-many isozymes, AND-complex under each + `complex_policy`), `bidirectional`/`best_hits_only` select the right hits, `preferred_order` + routes a gene's reactions to one template, metabolites unify by name[comp]. +- **Edge cases:** unmapped subunit policies; gene mapping to multiple orthologs (`(a or b)`); + reaction supported by only one isozyme; the ≥5 % overlap validation; reciprocal vs one-way. diff --git a/docs/known_issues.md b/docs/known_issues.md new file mode 100644 index 0000000..1ed394f --- /dev/null +++ b/docs/known_issues.md @@ -0,0 +1,139 @@ +# Known issues — deferred findings from the full-codebase review + +Low-priority issues found during the 2026-05-26 full critical review (see PROGRESS +commits `db5a1fa`, `2b85830` for the bugs that *were* fixed). These are deliberately +not fixed yet: each is an edge case, robustness gap, efficiency concern, dead code, or +a documented design choice — none affects correctness on normal, well-formed inputs. +Line numbers are indicative; refer to the named function. + +## A. Latent edge-case bugs + +All six items in this section were closed in a quality-sweep pass (see CHANGELOG +"Quality sweep" entry); regression tests live alongside each fixed function. Kept +here for traceability of the original review. + +- ✅ **`manipulation/add.py` — `add_reactions_from_equations` (name mode):** a + metabolite whose *name* starts with a number (e.g. `"2 oxoglutarate"`) was + misparsed — the leading number was taken as a coefficient. Fixed by trying the + full token as a name first and only splitting off a coefficient when the + remainder names something resolvable. Test: + `tests/test_manipulation_add.py::test_name_mode_preserves_leading_number_name`. +- ✅ **`manipulation/add.py` — `add_reactions_from_equations`:** an equation whose + terms net to zero produced a reaction with no metabolites and no warning. Now + warns. Test: `test_empty_stoichiometry_warns`. +- ✅ **`manipulation/transfer.py` — `add_reactions_from_model` (`_new_met_id`):** + two source metabolites sharing an `id` but different `name[comp]`, both + needing minted ids, used to collide. Now tracks ids minted in the batch. + Test: `tests/test_manipulation_transfer.py::test_intra_batch_id_minting_unique`. +- ✅ **`manipulation/transport.py` — `add_transport_reactions`:** the + source/target metabolite lookup was keyed by name, silently dropping + same-name duplicates. Now warns on collision. Test: + `tests/test_manipulation_transport.py::test_duplicate_name_in_source_compartment_warns`. +- ✅ **`gapfilling/fill.py` — `connect_blocked_reactions`:** the + `fva.at[r, "maximum"]` access used to crash with `KeyError` if FVA dropped a + candidate. Now membership-guarded (defensive — the original is unreachable + with cobra's default FVA, no regression test). +- ✅ **`reconstruction/kegg/query.py` — `assign_kos`:** `cutoff >= 1` would let + `log(best_evalue) == 0` through and crash inside the ratio filter. Now + rejected up front with a clear error. Test: + `tests/test_reconstruction_kegg_query.py::test_cutoff_ge_one_rejected`. + +## B. Silent misbehaviour on unusual inputs + +All four items in this section were closed in a quality-sweep pass (see CHANGELOG +"Quality sweep — known-issues section B" entry); regression tests live alongside +each fixed function. Kept here for traceability of the original review. + +- ✅ **`manipulation/merge.py` — `merge_models`:** name[comp]-matched metabolites + with differing `formula` or `charge` across models now warn instead of silently + unifying to the first-seen. Tests + `tests/test_manipulation_merge.py::test_formula_conflict_warns`, + `test_charge_conflict_warns`. +- ✅ **`manipulation/add.py`:** both the `mets_by="id"` and `mets_by="name"` paths + now warn when a new metabolite is created in a compartment that hasn't been + registered yet (the id-mode path used to skip the check entirely). Tests + `tests/test_manipulation_add.py::test_id_mode_unknown_compartment_warns`, + `test_name_comp_unknown_compartment_warns`. +- ✅ **`tasks/tasklist.py` — `parse_task_list`:** continuation rows appearing + *before* the first ID-bearing row used to be silently dropped. Now warns + with the file:row pointer. Test + `tests/test_tasks.py::test_parse_warns_on_data_row_before_first_id`. +- ✅ **`io/sif.py` — `export_model_to_sif`:** when the caller's custom label + map sends two distinct ids to the same label, the target-side dedup used + to silently merge the nodes. Now warns up front. Test + `tests/test_io_sif.py::test_collapsing_label_map_warns`. + +## C. Robustness gaps + +All four items closed in the same quality sweep (see CHANGELOG); regression +tests live alongside each fixed function. + +- ✅ **`manipulation/simplify.py` — `constrain_reversible_reactions`:** the + FVA call is now wrapped in a try/except + NaN check; both backend-raised + `OptimizationError` and silent-NaN returns surface as a single clear + `RuntimeError`. Test + `tests/test_manipulation_simplify.py::test_constrain_reversible_raises_on_infeasible`. +- ✅ **`binaries.py` — `ensure_binary`:** downloads through a `.part` sibling + and `os.replace`s into the final name on success, mirroring `data.py`. + An interrupted download leaves a `.part` (never a half-written `.zip`). + Defensive — no regression test (needs urlopen mocking). +- ✅ **`tasks/tasklist.py`:** the xlsx reader checks `wb.sheetnames` before + the `wb["TASKS"]` lookup; a missing sheet now raises a clear `ValueError` + listing the actual sheets. Test + `tests/test_tasks.py::test_parse_task_list_xlsx_missing_tasks_sheet`. +- ✅ **`reconstruction/kegg/taxonomy.py`:** depth handling pads with explicit + `""` placeholders and warns once when a level is skipped (e.g. `####` + directly under `##`). Test + `tests/test_reconstruction_kegg_hmm.py::test_parse_taxonomy_handles_skipped_depth`. + +## D. Efficiency (correct but slow at scale) + +- ✅ **`manipulation/simplify.py` — `group_linear_reactions`:** rewritten with + a metabolite worklist (re-enqueue the mets touched by each merge) instead of + the restart-after-every-merge loop. Same observable behaviour, O(n+m) work + per pass instead of O(n²·m). Test + `tests/test_manipulation_simplify.py::test_group_linear_merges_long_chain_in_one_pass`. +- ✅ **`reconstruction/kegg/parse.py`:** `parse_kegg_reactions` now caches the + parsed stoichiometry on each `KeggReaction.stoichiometry`; `build_reference_model` + reuses it instead of re-parsing. Test + `tests/test_reconstruction_kegg_parse.py::test_stoichiometry_cached`. + +## E. Dead / vestigial code + +- ✅ **`reconstruction/kegg/parse.py`:** removed `KeggReaction.modules` and + `.rhea` (parsed but never consumed by the artefact builders). +- ✅ **`reconstruction/homology/homology.py`:** removed the vestigial + `only_genes_in_models` parameter from `_ortholog_map` (the actual filtering + happens earlier in `get_model_from_homology`). + +## F. Documented design choices that differ from RAVEN (not bugs) + +All five items addressed in the quality-sweep pass (see CHANGELOG). The three +docstring/comment items got documentation fixes; the two correctness items +(F3, F5) got code fixes plus matching RAVEN-back-port proposals in +[IMPROVEMENTS.md](../IMPROVEMENTS.md) (FS4, B2). + +- ✅ **`init/init.py` — `run_init`:** docstring now spells out the score-0 + semantics divergence between classic INIT (score-0 = removable) and ftINIT + (score-0 = kept by default). +- ✅ **`init/build.py` — `get_init_model`:** the inaccurate "same regime + run_init will use" comment is replaced — the pre-filter is now correctly + described as conservative (only reactions blocked under the *most* permissive + regime are removed, so the strict run_init path never loses a candidate). +- ✅ **`analysis/fseof.py` — `fseof`:** classifier now uses the slope of + `|flux|` (a real `linregress(enforced, |flux|)` fit) instead of the first vs + last endpoints. A track whose endpoints straddle a peak/trough no longer + ends up with the wrong amplify/knockdown label. Matching MATLAB back-port + proposed as IMPROVEMENTS FS4. Test + `tests/test_analysis_fseof.py::test_amplify_label_uses_abs_slope_not_endpoint_difference`. +- ✅ **`analysis/reporter.py` — `reporter_metabolites`:** docstring now + documents the one-sided ``1 - Φ(z)`` `p_value` and the `z_score`-descending + sort, explicitly contrasted with RAVEN's two-tailed `allPValues` ordering. + Internally consistent; same information available via the up/down `gene_fold_changes` + partition. +- ✅ **`utils/balance.py` — `get_elemental_balance`:** empty-stoichiometry + reactions now report `unknown` instead of `balanced`. (Original review + attributed the bug to `utils/validate.py::check_model`; the actual code + lives in `balance.py`.) Matching MATLAB back-port proposed as + IMPROVEMENTS B2. Test + `tests/test_utils_balance.py::test_empty_reaction_is_unknown`. diff --git a/docs/matlab_raven_backports.md b/docs/matlab_raven_backports.md new file mode 100644 index 0000000..b281896 --- /dev/null +++ b/docs/matlab_raven_backports.md @@ -0,0 +1,255 @@ +# MATLAB RAVEN — proposed back-ports + +This is a consolidated list of fixes and improvements that the Python port surfaced +and that are worth carrying upstream into MATLAB RAVEN. Each item names a RAVEN +file/function, briefly diagnoses the current behaviour, and proposes the minimal +MATLAB-side patch. Items are sourced from [IMPROVEMENTS.md](../IMPROVEMENTS.md) +(the `MATLAB RAVEN 💡` rows) plus two new items from the section-F quality sweep +in [docs/known_issues.md](known_issues.md). + +The matching Python-side implementation is linked for context — in most cases the +fix is identical in spirit; the patch shape just differs. + +> Status legend used below: 💡 *proposed back-port* · 🐛 *bug* · ⚡ *efficiency* · +> 🧹 *ergonomics / readability*. + +--- + +## `core/getModelFromHomology.m` + +Implemented in raventoolbox as [`reconstruction.homology.get_model_from_homology`](../src/raventoolbox/reconstruction/homology/homology.py). + +* **H1** 🧹 — Split the overloaded `strictness` 1/2/3 into two orthogonal + options: `bidirectional` (reciprocal hits) and `bestHitsOnly`. Reciprocal + best hits = both true. Keep `strictness=` as a compat alias. Easier to read + at call sites and easier to combine. +* **H2** 🐛 — Rewrite GPRs by walking a parsed boolean tree (the same AST + RAVEN already builds for other checks), not via `regexprep` substring + substitution. The current regex approach has partial-match hazards on gene + IDs that contain `OLD_` or share prefixes, and needs a separate + cleanup pass to fix degenerate `(OLD_… or …)` rules. Doing it on the tree + is exact and removes the cleanup step. +* **H4** 🐛 — Default best-hit selection to **bitscore**, not E-value. Bitscore + is database-size-independent (the RBH standard); the current E-value + preference can flip the chosen template when the database grows. Keep + E-value as an option. + +## `core/getKEGGModelForOrganism.m` and the KEGG pipeline + +Implemented across [`reconstruction.kegg`](../src/raventoolbox/reconstruction/kegg/). + +* **K1** 🐛 — In the KEGG flat-file parser, read each reaction's equation from + its **own `EQUATION` field**, not by matching line *i* of `reaction.lst` + against reaction *i*. The current line-matched approach silently corrupts + the model if `reaction.lst` and `reaction` ever drift out of perfect order. + The `EQUATION` field is already in the entry — just read it. +* **K2** 🧹 — Undefined-stoichiometry terms (`n C00001`, `(n+1) C00002`) keep + the **real compound id** with coefficient 1 and *flag the reaction* as + undefined-stoichiometry, instead of minting `"n C00001"` pseudo-metabolites + that get renamed `undefined_N` later. Cleaner metabolite graph; the flag + still drives the `keep*` filters. +* **K3** 🧹 — Replace the free-text reaction-quality strings appended to + `rxnNotes` (`spontaneous`, `undefined stoichiometry`, `incomplete reaction`, + `general reaction`) with a structured boolean **`rxn_flags`** table. + Downstream filters join on a column instead of substring-matching notes. +* **K6** ⚡ — Per-KO multi-FASTA construction via a single streaming pass + with an offset index (`seek`-based), replacing the Java `Hashtable` + byte-scan in `constructMultiFasta`. Drops the 5M-element preallocation + and the Java heap tuning. +* **K7** ⚡ — Concatenate all per-KO HMMs and `hmmpress` them into one + pressed library, so the query path runs a single `hmmscan` against the + database instead of thousands of per-KO `hmmsearch` invocations. Same + result, orders-of-magnitude less I/O. +* **K12** ⚡ — Use fast MAFFT (`FFT-NS-2`, i.e. `--retree 2 --maxiterate 0`) + for HMM training instead of `--auto`. `--auto` selects slow iterative + refinement on medium/large KOs (observed ~2.5 min/KO, days for a domain + on real KEGG); `FFT-NS-2` is seconds/KO and ample for profile-HMM + building. Switch to `PartTree` when the alignment DP cost + ≈ `n_seqs × mean_len²` exceeds the host's available memory budget (the + Python implementation calibrated this as ≈4.2e-9 × DP-cost in GB and + switches when DP-cost > 0.65 × (RAM − 2.5 GB) / 4.2e-9). Same protein + set, no sequences dropped. +* **K14** ⚡ — Sort `organism_gene_ko` by `(organism, gene)` and store it + xz-compressed (`organism_gene_ko.tsv.xz`). Cuts the dominant artefact from + ~78 MB to ~27 MB (2.9×): gene IDs within an organism share long prefixes + so adjacency makes them far more compressible, and xz's larger window + catches cross-row redundancy gzip's 32 KB window misses. MATLAB needs an + external `unxz` to read the file; the small tables remain plain gzipped TSV. +* **K15** 🐛 — Recalibrate the HMM-query KO-assignment defaults: change + `cutoff` from `1e-50` to `1e-30` and `min_score_ratio_g` from `0.8` to + `0.9`. `min_score_ratio_ko` can stay at `0.3` but is empirically inert + (identical output at 0.0/0.3/0.5 across the validation organisms). + Cross-validated against the true KEGG gene→KO annotation of *S. cerevisiae*, + *Cyanidioschyzon merolae*, *E. coli* K-12, and *Mycoplasma genitalium*: + the current `1e-50` cutoff sits inside the *true-positive* tail + (median true E ≈ 1e-100…1e-155; spurious ≈ 1e-8), silently dropping + divergent real hits. At the proposed values, *M. genitalium* gene→KO recall + rose from 0.84 → 0.94 (reaction recall 0.87 → 0.97) with no precision loss. + Full numbers in [docs/kegg_hmm_cutoff_calibration.md](kegg_hmm_cutoff_calibration.md). + +## `core/FSEOF.m` + +Implemented in raventoolbox as [`analysis.fseof`](../src/raventoolbox/analysis/fseof.py). + +* **FS1** 🐛 — Replace the strict step-by-step monotonicity gate (a target is + discarded if any single step's flux fails to exceed the previous) with a + linear-regression filter: a reaction is a target if its flux track is + reasonably correlated with the enforced product flux (|r| above a + threshold). One noisy step from LP alternative optima no longer kills a + valid target. Run `pFBA` per step to keep the scan stable. +* **FS2** 🧹 — Report **knockdown** and **knockout** targets as first-class + outputs, not just amplification. The currently-flagged "amplify" set are + reactions whose `|flux|` rises with the enforced product; the reactions + driven *toward zero* (knock-down / knock-out candidates — arguably the most + actionable for strain design) are silently dropped. +* **FS4** 🐛 — Replace the reported "Slope" — currently + `abs(fseof.results(num, iterations) - fseof.results(num, 1)) / abs(targetMax - targetMax/iterations)`, + i.e. an endpoint difference — with the **regression slope across all + iterations**: + + ```matlab + enforcedFlux = (1:iterations) * targetMax / iterations; % record per-iter + ... + for num = 1:length(model.rxns) + p = polyfit(enforcedFlux, fseof.results(num,:), 1); + slope(num) = p(1); + end + ``` + + An endpoint difference disagrees with the monotonicity filter on traces + whose first and last values happen to be similar; the regression slope + matches the selection criterion. The amplify/knockdown/knockout label + (after FS2) should use the slope of `|flux|`, not endpoint comparison. + +## `core/randomSampling.m` + +Implemented in raventoolbox as [`analysis.random_sampling`](../src/raventoolbox/analysis/sampling.py). + +* **SAMP1** ⚡ — Compute `goodRxns` (loop-free, flux-carrying objective + candidates) via a **single FVA pass**, not the current per-reaction `parfor` + that solves a separate LP minimising and maximising every reaction. FVA + computes the same min/max ranges, optionally `loopless` (`cycleFreeFlux`), + in far less code. +* **SAMP2** 🐛 — Add a `seed` argument for reproducibility, and make + `nObjectives` a parameter. The current code hard-codes 2 objective + reactions per sample (the docstring even claims 3 — a transcription bug + worth fixing alongside). +* **SAMP4** 🐛 — Run `goodRxns` detection **before** `replaceBoundsWithInf` + applies, not after. FVA errors as "unbounded" on infinite bounds, and + running goodRxns on the inf-replaced model conflates "loop reaction" with + "unbounded objective". Scope `replaceBoundsWithInf` to sampling only, after + the goodRxns set is found. + +## `core/reporterMetabolites.m` + +Implemented in raventoolbox as [`analysis.reporter_metabolites`](../src/raventoolbox/analysis/reporter.py). + +* **RM1** ⚡🐛 — Replace the per-neighbour-count Monte Carlo background + correction (currently 100 000 random sets drawn *for each distinct + neighbour count*) with the exact closed-form expression. RAVEN samples with + replacement from the scored-gene Z pool, so a random aggregate Z = Σz/√n + has mean `√n · μ` and standard deviation `σ` provably — the corrected + metabolite Z is just `(metZ − √n·μ) / σ`. This removes both the slow + sampling step and its run-to-run randomness (results become deterministic). + +## `core/runINIT.m` + +Implemented in raventoolbox as [`init.run_init`](../src/raventoolbox/init/init.py). + +* **I4** 🐛 — Drop the hard-coded big-M (1000) in the MILP and use **each + reaction's own upper bound** instead: `v ≤ ub · x`. Expose `eps` and + `prod_weight` as parameters. The current 1/1000/0.1/0.5 quadruple only + fits ±1000-bounded models with O(1) scores — flagged in RAVEN as + scale-dependent and tunable, but no API knobs are exposed. + +## `core/ftINIT.m` and the ftINIT pipeline + +Implemented across [`init.ftinit`](../src/raventoolbox/init/ftinit.py), [`init.taskfill`](../src/raventoolbox/init/taskfill.py), [`init.merge`](../src/raventoolbox/init/merge.py), [`init.prep`](../src/raventoolbox/init/prep.py). + +* **FT3** 🐛 — Same big-M issue as `runINIT` — use each reaction's own bound + as big-M instead of the fixed 100/1000. Expose `force_on` and `force_on_ess` + as parameters (the calibrated values for genome-scale Human-GEM are + documented in [docs/init_param_calibration.md](init_param_calibration.md)). +* **FT9** 🐛 — Per-reaction essential forcing must be **clamped to the + reaction's bound**. The staged pipeline fixes previous-step reactions as + essential at `min(0.99 · |prev flux|, 0.1)`; if a low-capacity reaction + cannot carry that magnitude, the current code produces `lb > ub` and + errors. Clamp the forced magnitude to `rxn.ub` (in the positive direction) + / `rxn.lb` (in the negative direction) so a low-capacity essential is + forced *to* its capacity instead. +* **FT8** 🧹 — Rewrite `removeLowScoreGenes` on the parsed boolean tree + instead of regex subset-extraction with `#n#` placeholders. Same logic + (isozyme OR-branches dropped while keeping at least one; complex + AND-subunits kept intact; nested groups handled), but exact and tractable + to test. Make the all-negative tie-break **deterministic** (sort + take + the least-negative) instead of random. + +## `core/addRxns.m` + +Implemented in raventoolbox as [`manipulation.add_reactions_from_equations`](../src/raventoolbox/manipulation/add.py). + +* **A1** 🧹 — Accept a string keyword (`metsBy = 'id'` / `'name'`) instead of + the opaque `eqnType = 1 / 2 / 3` integer. Call-sites become self-documenting. + +## `core/checkModelStruct.m` (or its curation subset) + +Implemented in raventoolbox as [`utils.check_model`](../src/raventoolbox/utils/validate.py). + +* **V2** 🧹 — Return a struct array of issues (one per finding, with + `category` / `target` / `message` fields) instead of printing warnings or + throwing on the first problem. Programmatically filterable. + +## `core/getElementalBalance.m` + +Implemented in raventoolbox as [`utils.get_elemental_balance`](../src/raventoolbox/utils/balance.py). + +* **B2** 🐛 — A reaction with no metabolites (an empty `S(:,j)`) currently + falls through to `balanceStatus = 1` ("balanced") because both pre-loops + leave `balanceStatus` at NaN and the final `isnan→1` step maps NaN to + balanced. Fix: + + ```matlab + % Right before the final `balanceStatus(isnan(...)) = 1;` line: + emptyRxns = full(sum(model.S ~= 0, 1)) == 0; + balanceStatus(emptyRxns) = min(-1, balanceStatus(emptyRxns)); + ``` + + Empty reactions then report `-1` ("missing information") rather than + vacuously balanced. + +## `core/findPotentialErrors.m` (GPR linting) + +Implemented in raventoolbox as [`utils.find_non_dnf_grrules`](../src/raventoolbox/utils/gpr.py). + +* **S1** 🧹 — Return the `indexes2check` array (which the function already + computes) plus per-reaction reason strings as a struct array, instead of + only emitting a `warning()`. Callers cannot act on the warning text. +* **S2** 🐛 — Detect non-DNF GPRs via the parsed boolean tree, not substring + search for `) and (`, `) and`, `and (`. The substring approach is sensitive + to spacing and bracketing and to gene IDs containing those characters; the + tree walk (no OR beneath any AND) is exact. + +## `core/getIndexes.m` + +* **G1** ⚡ — Build a `containers.Map` `{id: position}` once per call and use + hashed lookup (O(1)) instead of looping `find(strcmp(obj(i), searchIn))` + per query (O(n · m)). Same for the `metcomps` branch. +* **G5** 🐛 — `if all(objects)` conflates a logical all-true mask with the + index vector `[1 1 1]` (the function's own comment notes "this gets weird + if it's all 1"). Test `islogical(objects)` explicitly instead. + +## Cross-cutting + +* **Y4** 🧹 — A first-class home for `metDeltaG` / `rxnConfidenceScores` (in + RAVEN) and `deltaG` / `confidence_score` (in cobra/raventoolbox `.notes`). + Neither cobra nor SBML core has a standard slot yet; if one emerges (SBML + fbc/groups, or a cobra attribute) both projects should migrate. For now + this is a watching brief. + +## R-MetaCyc — removal candidate + +Already flagged in `IMPROVEMENTS.md` as a removal target on both sides — the +MetaCyc reconstruction path is dropped in raventoolbox and proposed for removal +from MATLAB RAVEN (`external/metacyc/*`). Pasting here for visibility because +the actual removal still needs to land upstream. See +[IMPROVEMENTS.md § R-MetaCyc](../IMPROVEMENTS.md) for the full rationale. diff --git a/docs/raven_migration.md b/docs/raven_migration.md new file mode 100644 index 0000000..6de2067 --- /dev/null +++ b/docs/raven_migration.md @@ -0,0 +1,120 @@ +# RAVEN → raven-python migration reference + +A function-by-function map from the MATLAB RAVEN Toolbox to raven-python (and cobrapy where +appropriate). For each RAVEN function we record one of four outcomes: + +* ✅ **ported** — there's a direct raven-python replacement; see the link. +* 🗒️ **cheatsheet** — cobrapy already covers it; a one-liner or short idiom does the job + (recorded under each row). +* ⛔ **not ported** — explicit decision not to bring it across, with rationale. +* 🆕 **new in raven-python** — functionality raven-python adds that has no RAVEN counterpart. + +For raven-python's deliberate improvements over RAVEN (and which of them are candidates to +upstream into MATLAB RAVEN), see [IMPROVEMENTS.md](../IMPROVEMENTS.md). + +## Design principle + +The in-memory object is always a [`cobra.Model`](https://cobrapy.readthedocs.io). There is +no parallel RAVEN struct, no `ravenCobraWrapper`-style adapter. RAVEN fields that cobra +doesn't model natively (`rxnMiriams`, `metDeltaG`, `rxnConfidenceScores`, …) live in +cobra's `annotation` / `notes` dictionaries. raven-python's YAML I/O follows the cobra YAML +standard plus the geckopy enzyme-constrained extension, so ecModels round-trip. + +--- + +## Foundation: utilities, manipulation, I/O + +| RAVEN | raven-python | Notes | +|---|---|---| +| `addRxns` (equations) | ✅ [`manipulation.add_reactions_from_equations`](../src/raven_python/manipulation/add.py) | The keystone: equation-string → reactions, matching mets by id, name, or `name[comp]`; strict / auto policies for new mets and genes. | +| `addTransport` | ✅ [`manipulation.add_transport_reactions`](../src/raven_python/manipulation/transport.py) | Transport across compartments, matching mets by name, sequential `tr_NNNN` ids. cobra has no transport primitive. | +| `addRxnsGenesMets` | ✅ [`manipulation.add_reactions_from_model`](../src/raven_python/manipulation/transfer.py) | Copy reactions from a source model, matching mets by `name[comp]` (vs cobra's strict-by-id merge). | +| `mergeModels` | ✅ [`manipulation.merge_models`](../src/raven_python/manipulation/merge.py) | N-model merge, unify mets by `name[comp]` (or id), keep all reactions (id collisions renamed). cobra's `merge` is pairwise / strict-by-id. | +| `changeRxns`, `changeGrRules` | ✅ [`manipulation.change_reaction_equations`](../src/raven_python/manipulation/change.py), `change_gene_reaction_rules` | Stoichiometry change in place; batch GPR set/append (`(old) or (new)`). | +| `setParam('var', …)` | ✅ [`manipulation.set_variance_bounds`](../src/raven_python/manipulation/parameters.py) | ±% band around measured values. **Other modes (lb/ub/eq/obj/unc)**: 🗒️ cobra one-liners (`reaction.bounds`, `model.objective`, `Configuration().bounds`). | +| `setExchangeBounds` | ⛔ not ported | cobra's `model.medium = {ex_id: uptake}` covers it. | +| `simplifyModel` (gap modes) | ✅ [`manipulation.remove_dead_end_reactions`](../src/raven_python/manipulation/simplify.py), `remove_duplicate_reactions`, `constrain_reversible_reactions`, `group_linear_reactions` | Cobra-covered modes (no-flux→`find_blocked_reactions`, zero-interval, unconstrained) are 🗒️ cheatsheeted. `group_linear` is lossy (drops genes), per RAVEN. | +| `removeMets`, `removeGenes` | ✅ [`manipulation.remove_metabolites`](../src/raven_python/manipulation/remove.py), `remove_genes` | Delegate to cobra; add `by_name` cross-compartment deletion (mets) and a `blocked_reactions` policy (genes). `removeReactions` itself ⛔ — cobra's `remove_reactions` covers it. | +| `convertToIrrev` | ✅ [`manipulation.convert_to_irreversible`](../src/raven_python/manipulation/irreversible.py) | Splits reversible non-exchange reactions into forward + `_REV`. Adopted from geckopy. | +| `expandModel` | ✅ [`manipulation.expand_model`](../src/raven_python/manipulation/expand.py) | Splits OR-GPR (isozyme) reactions into one per AND-clause. Adopted from geckopy. | +| `mergeCompartments` | ✅ [`manipulation.merge_compartments`](../src/raven_python/manipulation/compartments.py) | Collapse a multi-compartment model into one; deduplicate identical reactions; optionally drop one-met collapses (`drop_single_metabolite_reactions`). | +| `copyToComps` | ✅ [`manipulation.copy_to_compartment`](../src/raven_python/manipulation/compartments.py) | Duplicate reactions into a target compartment (idempotent; `delete_original=True` makes it a move). | +| `mapCompartments` | ⛔ not ported | Overlaps with `comparison.compare_models` on the reaction-id intersection. | +| `getElementalBalance` | ✅ [`utils.get_elemental_balance`](../src/raven_python/utils/balance.py) | Graded `balanced` / `unbalanced` / `unknown` — `unknown` catches a missing formula that cobra's `check_mass_balance` silently miscounts. | +| `checkModelStruct` (curation subset) | ✅ [`utils.check_model`](../src/raven_python/utils/validate.py) | Structured curation report. RAVEN's struct/type checks are moot in cobra. | +| `is_dnf` / GPR check (from `standardizeGrRules`) | ✅ [`utils.is_dnf`](../src/raven_python/utils/gpr.py), `find_non_dnf_grrules` | Lint-only half; cobra auto-normalises GPRs on assignment, so the rewriting half isn't ported. | +| `getIndexes` (`metcomps` sliver) | ✅ [`utils.parse_name_comp`](../src/raven_python/utils/parse.py) | The only `getIndexes` bit cobra doesn't already cover. | +| `editMiriam`, `extractMiriam` | ⛔ not ported | cobra's `.annotation` is already a `{namespace: id(s)}` dict — read/write it directly. | +| `getRxnsInComp`, `getMetsInComp` | ⛔ not ported | One-liners over cobra's `reaction.compartments` / `metabolite.compartment`. | +| `constructEquations` | ⛔ not ported | `reaction.build_reaction_string(use_metabolite_names=...)` already does both id and name equations. | +| `sortIdentifiers` | ✅ [`utils.sort_identifiers`](../src/raven_python/utils/sort.py) | Model-wide alphabetical sort; also via `sort_ids=` on `write_yaml_model`. | + +### I/O + +| RAVEN | raven-python | Notes | +|---|---|---| +| `readYAMLmodel`, `writeYAMLmodel` | ✅ [`io.read_yaml_model`, `write_yaml_model`](../src/raven_python/io/yaml.py) | Aligned to cobra's `!!omap` writer (RAVEN `fa281a1`). Adds the RAVEN-only top-level per-entry keys (inchis/deltaG/metFrom/notes, confidence_score/references/rxnFrom/deltaG, protein) into `.notes`, plus `version`/`metaData`/GECKO `ec-*`. cobra-readable output verified. | +| `exportModelToSIF` | ✅ [`io.export_model_to_sif`](../src/raven_python/io/sif.py) | Cytoscape SIF (`rc`/`rr`/`cc` graphs). | +| `exportToExcelFormat` (export only) | ✅ [`io.export_to_excel`](../src/raven_python/io/excel.py) | RAVEN 5-sheet xlsx (RXNS / METS / COMPS / GENES / MODEL). Excel **import** is intentionally excluded. | +| `exportForGit` | ✅ [`io.export_for_git`](../src/raven_python/io/git.py) | Standard-GEM repo layout (`model//…`). | +| `importYAML/SBML/Mat/Excel` | 🗒️ cobra's standard readers | `cobra.io.read_sbml_model` / `load_json_model` / etc.; Excel import not ported. | + +## Reconstruction + +| RAVEN | raven-python | Notes | +|---|---|---| +| `getModelFromHomology` + `getBlast`/`getDiamond` | ✅ [`reconstruction.homology.get_model_from_homology`](../src/raven_python/reconstruction/homology/homology.py), `run_blast`, `run_diamond`, `blast_from_table` | Core homology reconstruction with structured improvements (bidirectional / best-hits-only, AST GPR rewrite, complex policy, bitscore best-hits, DataFrame ortholog map). | +| KEGG download → species model (5 steps) | ✅ [`reconstruction.kegg.*`](../src/raven_python/reconstruction/kegg/) | All five steps: `download.fetch_keggdb`, `parse.read_kegg_table` + reference model, `hmm.build_libraries`, `organism.build_kegg_model_for_organism` (no-FASTA), `query.assign_kos` + `run_hmmscan`. | +| `getPhylDist` | ⛔ not ported | Fixed prok90/euk90 libraries make the distance matrix moot. | +| `getMetaCycModelForOrganism` | ⛔ not ported (and **flagged for removal from MATLAB RAVEN**) | BLAST-to-single-representatives is low-precision at every cutoff. See [IMPROVEMENTS.md](../IMPROVEMENTS.md) under `R-MetaCyc`. | + +## Tasks, gap-filling, INIT, ftINIT + +| RAVEN | raven-python | Notes | +|---|---|---| +| `parseTaskList`, `checkTasks` | ✅ [`tasks.parse_task_list`](../src/raven_python/tasks/tasklist.py), `check_tasks` ([tasks/check.py](../src/raven_python/tasks/check.py)) | `check_tasks` reuses one model across the task list (no per-task model copy) — at genome scale ~12× faster than the copy-per-task implementation it replaced. | +| `fillGaps` (connectivity mode) | ✅ [`gapfilling.connect_blocked_reactions`](../src/raven_python/gapfilling/fill.py) | MILP (min penalty-weighted template reactions s.t. blocked reactions carry flux). Targeted mode → 🗒️ `cobra.gapfill`. | +| `runINIT`, `scoreComplexModel`, `getINITModel` | ✅ [`init.run_init`](../src/raven_python/init/init.py), [`init.score_reactions_from_genes`, `gene_scores_from_expression`](../src/raven_python/init/score.py), [`init.get_init_model`](../src/raven_python/init/build.py) | INIT MILP rewritten in optlang (no sparse-matrix construction); RNA-seq scoring is `5·ln(level/ref)`-clamped. | +| `ftINIT`, `prepINITModel`, `ftINITInternalAlg`, `getINITSteps` | ✅ [`init.ftinit`](../src/raven_python/init/ftinit.py), [`init.prep_init_model`](../src/raven_python/init/prep.py), [`init.run_ftinit`](../src/raven_python/init/ftinit.py), [`init.get_init_steps`](../src/raven_python/init/steps.py) | Staged MILP + linear merge + scaling (`rescaleModelForINIT`). Validated against RAVEN on Human-GEM (Jaccard 0.975–0.980; see [humangem_validation.md](humangem_validation.md)). Metabolomics-based scoring is the one piece not yet implemented (raises `NotImplementedError`). | +| `ftINITFillGaps`, `ftINITFillGapsMILP`, `ftINITFillGapsForAllTasks` | ✅ [`init.fill_tasks`](../src/raven_python/init/taskfill.py) | Task-aware gap-filling within ftINIT; in-place `_feasible` check + bounded fill MILP (`mip_gap`, `time_limit`). | +| `mergeLinear` | ✅ [`init.merge_linear`](../src/raven_python/init/merge.py) | Linear merge of unit-stoichiometry chains; bookkeeping (`group_ids`, `reversed_rxns`) to map back to the reference model. | +| `removeLowScoreGenes` | ✅ [`init.remove_low_score_genes`](../src/raven_python/init/genes.py) | Final gene-prune step of ftINIT. | +| `fitTasks` | ⛔ not ported | Niche; tasks are normally consumed via `checkTasks` / ftINIT's task layer. | + +## Omics, analysis, comparison + +| RAVEN | raven-python | Notes | +|---|---|---| +| `parseHPA`, `parseHPArna`, `scoreModel` | ✅ [`omics.parse_hpa`, `parse_hpa_rna`, `hpa_gene_scores`, `rna_gene_scores`](../src/raven_python/omics/hpa.py) | Pandas-tidy DataFrames; scoring adapters reuse `score_reactions_from_genes` (single source of truth for the GPR walk). | +| `reporterMetabolites` | ✅ [`analysis.reporter_metabolites`](../src/raven_python/analysis/reporter.py) | Exact closed-form background replaces RAVEN's Monte-Carlo (RM1 in IMPROVEMENTS). | +| `FSEOF` | ✅ [`analysis.fseof`](../src/raven_python/analysis/fseof.py) | Regression slope + correlation, amplify/knockdown/knockout classes, gene aggregation (FS1–FS4 in IMPROVEMENTS). | +| `randomSampling` | ✅ [`analysis.sample`](../src/raven_python/analysis/sampling.py) | Wraps cobra's flux sampling. | +| `compareMultipleModels` | ✅ [`comparison.compare_models`](../src/raven_python/comparison/compare.py) | Tidy DataFrames (reactions / mets / genes / subsystems presence + pairwise Jaccard + optional `check_tasks` pass/fail). Plotting and tSNE/MDS are 🗒️ one-liners in seaborn/scikit-learn; intentionally not in the function. | +| `runDynamicFBA` | ⛔ not ported | Established Python implementations exist: [`dfba`](https://pypi.org/project/dfba/) (Pinheiro et al.; CVODES-backed), [`reframed`](https://pypi.org/project/reframed/) (Machado lab), [`mewpy`](https://pypi.org/project/mewpy/) (Cunha lab). Cobrapy itself has none, but re-porting would duplicate maintained prior art. | + +## Localisation (Phase 7) + +| RAVEN | raven-python | Notes | +|---|---|---| +| `predictLocalization` | ✅ [`localization.predict_localization`](../src/raven_python/localization/predict.py) | Deterministic MILP (not simulated annealing). Caller-passed `reactions_to_relocate` set (everything else pinned). Multi-compartment by default: primary "free", extras pay `multi_compartment_penalty`. Tolerates incomplete models (no silent reaction removal). `apply=False` returns a `LocalizationProposal` diff. Real-data validation against curated yeast-GEM in [yeast_localization_benchmark.md](yeast_localization_benchmark.md). | +| `getWoLFScores`, `parseScores('wolf')` | ✅ [`localization.load_wolfpsort`](../src/raven_python/localization/scores.py) | Parses WoLF PSORT summary output (RAVEN-compatible); row-normalised. Does not shell out to the WoLF PSORT binary — run that separately and feed in the output. | +| `parseScores('deeploc')` | ✅ [`localization.load_deeploc`](../src/raven_python/localization/scores.py) | DeepLoc 2 per-protein CSV (Protein_ID / Localizations / Signals + one column per compartment). | + +## Things deliberately not ported + +* **`ravenCobraWrapper` / RAVEN struct adapter** — cobra is the canonical object; no parallel struct. +* **`checkModelStruct` struct/type checks** — moot in cobra. +* **`runDynamicFBA`** — see Omics/analysis row. +* **`getMetaCycModelForOrganism`** — see Reconstruction row; flagged for upstream removal. +* **`getPhylDist`** — fixed prok90/euk90 libraries make it moot. +* **`mapCompartments`** — overlaps with `compare_models`. +* **`editMiriam`, `extractMiriam`, `getRxnsInComp`, `getMetsInComp`, `constructEquations`, `getIndexes`** (most), **`setExchangeBounds`**, **most `setParam` modes**, **`getBlastFromExcel` Excel branch** — cobra one-liners; recorded above. + +## "New in raven-python" entry points + +These are not direct RAVEN ports: + +* [`comparison.compare_models`](../src/raven_python/comparison/compare.py) — already in RAVEN, but the raven-python version returns tidy DataFrames suitable for downstream analysis (RAVEN's version embeds heatmap/tSNE plotting). +* [`reconstruction.homology.run_blast` / `run_diamond` / `blast_from_table`](../src/raven_python/reconstruction/homology/blast.py) — generic subprocess wrappers; a DataFrame is the canonical "hits" object (no `blastStructure` struct). +* [`binaries.resolve_binary`, `ensure_binary`](../src/raven_python/binaries.py) — version-pinned release-ZIP registry for external tools (BLAST/DIAMOND/HMMER), SHA256-verified. +* [`localization.LocalizationProposal`](../src/raven_python/localization/predict.py) — the diff-preview mode (`apply=False`). diff --git a/docs/todo.md b/docs/todo.md new file mode 100644 index 0000000..29f8571 --- /dev/null +++ b/docs/todo.md @@ -0,0 +1,48 @@ +# Open work + +What's still on the books. See [raven_migration.md](raven_migration.md) for the function-by- +function port status (most of it is done); see [IMPROVEMENTS.md](../IMPROVEMENTS.md) for the +catalogue of raven-python improvements that should also be back-ported into MATLAB RAVEN. + +## Major + +### Visualization (`plotting/`) + +Not started. RAVEN has limited plotting (`drawMap` etc., MATLAB-bound). For raven-python the most +useful targets are: + +* Pathway maps / Escher integration for context-specific models. +* Omics overlay (gene-score / expression heatmaps on the reaction set). +* Flux distribution overlays. + +cobrapy + Escher already covers a lot here — likely a thin integration layer rather than a port. + +### Metabolomics-based scoring in ftINIT + +The metabolomics-detected metabolite production-reward block in [`init.ftinit`](../src/raven_python/init/ftinit.py) +currently raises `NotImplementedError` if a non-empty `metabolomics` argument is passed. The +linear merge eliminates degree-2 detected metabolites, so it needs producer-group-mapping + +negative-producer force-flux constraints — the most intricate MILP piece, for the least-used +input. Worth doing only when a real user request lands. + +## Infrastructure + +* **Binary ZIP releases** for BLAST/DIAMOND (Phase 3a). The runtime resolver in + [`binaries.py`](../src/raven_python/binaries.py) is ready; the registry is empty until ZIPs are + published as GitHub release assets. +* **KEGG data artefact releases.** See [maintaining_kegg_data.md](maintaining_kegg_data.md). + +## Smaller items + +* [known_issues.md](known_issues.md) — backlog of low-priority edge cases / robustness gaps / + dead code from the full-codebase review. None affects correctness on well-formed inputs. +* [IMPROVEMENTS.md](../IMPROVEMENTS.md) — items marked 💡 *proposed* are candidates to + implement (and back-port). + +## Upstream blockers (not raven-python work, but worth tracking) + +* `optlang.hybrid_interface.Configuration.clone()` bug — blocks HiGHS at any scale (CI catches + it in `tests/test_init_solvers.py`). +* GLPK's MIP solve ignores `configuration.timeout` at genome scale — blocks GLPK on large MILPs. +* Both documented in [init_solver_benchmark.md](init_solver_benchmark.md) with concrete fix + suggestions. diff --git a/src/raven_python/io/yaml.py b/src/raven_python/io/yaml.py index 151954b..e3da8ac 100644 --- a/src/raven_python/io/yaml.py +++ b/src/raven_python/io/yaml.py @@ -86,13 +86,33 @@ def _capture_entry_fields(entries, fields): def read_yaml_model(path: str | Path) -> cobra.Model: - """Read a RAVEN/cobrapy YAML model into a ``cobra.Model``.""" + """Read a RAVEN/cobrapy YAML model into a ``cobra.Model``. + + Convenience wrapper around :func:`model_from_yaml_data` that opens the + file (transparently un-gzipping ``.gz``) and parses the YAML. Callers + that need to pre-process the document (e.g. lift legacy fields that + cobra doesn't recognise) can read+normalise themselves and call + :func:`model_from_yaml_data` with the resulting dict. + """ with _open_text(path, "r") as handle: raw = _to_plain(_cobra_yaml.load(handle)) if not isinstance(raw, dict): raise ValueError(f"{path}: top-level YAML is a {type(raw).__name__}, not a mapping.") + return model_from_yaml_data(raw) + +def model_from_yaml_data(raw: dict) -> cobra.Model: + """Build a ``cobra.Model`` from an already-parsed RAVEN/cobrapy YAML dict. + + Strips and restores the RAVEN per-entry side-fields onto each entry's + ``.notes``, lifts ``id``/``name`` out of legacy ``metaData``, and + stashes everything else (``version``, the ``metaData`` block itself, + and any unknown top-level keys such as ``ec-rxns``/``ec-enzymes``) + onto ``model.notes`` so a round-trip via :func:`write_yaml_model` + preserves them. ``raw`` is mutated in place — copy it first if the + caller needs the original. + """ metadata = raw.pop("metaData", None) or {} version = raw.pop("version", None) foreign = {k: raw.pop(k) for k in list(raw) if k not in _COBRA_TOP_KEYS} @@ -159,6 +179,17 @@ def write_yaml_model( doc = OrderedDict(_to_plain(model_to_dict(model))) + # cobra's model_to_dict serialises model.notes verbatim into doc["notes"], + # so the three management keys we just lifted out would otherwise also + # appear nested inside the notes section. Strip them; preserve any other + # genuine notes the caller stored on the model. + doc_notes = doc.get("notes") + if isinstance(doc_notes, dict): + for key in ("metaData", "version", "_yaml_sections"): + doc_notes.pop(key, None) + if not doc_notes: + doc.pop("notes", None) + if sort_ids: for section in ("metabolites", "reactions", "genes"): if section in doc: diff --git a/src/raven_python/plotting/__init__.py b/src/raven_python/plotting/__init__.py new file mode 100644 index 0000000..4fb3839 --- /dev/null +++ b/src/raven_python/plotting/__init__.py @@ -0,0 +1 @@ +"""Pathway-map and omics-overlay visualisation (stub — not yet implemented)."""