SysBioChalmers
diff --git a/‎docs/humangem_validation.md‎
Lines changed: 117 additions & 0 deletions b/‎docs/humangem_validation.md‎
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+# Human-GEM cell-type model validation: raven-python vs RAVEN
+
+Validation of raven-python's tINIT/ftINIT against MATLAB RAVEN on a real genome-scale
+reconstruction (Human-GEM) using the Hart2015 RNA-seq dataset (5 cell lines: DLD1,
+GBM, HCT116, HELA, RPE1). The goal is functional equivalence — do raven-python and RAVEN
+extract the *same* context-specific reaction sets from the same inputs?
+
+## Method
+
+* **Template & inputs.** RAVEN built the ftINIT reference model from Human-GEM
+  (`prepHumanModelForftINIT`: remove drug/exchange/artificial reactions, set
+  spontaneous/custom lists) and exported it as `raven_refModel.xml` (10198 reactions).
+  raven-python builds on that *same* exported model, so the candidate reaction universe is
+  identical and set comparison is exact.
+* **Scoring.** Gene scores from `log2(TPM+1)`-style expression via
+  `gene_scores_from_expression`, mapped to reactions through the GPR
+  (`score_reactions_from_genes`), matching RAVEN's `getExprForRxnScore`.
+* **ftINIT.** Series `1+1` (2 staged MILP steps). RAVEN run via `ftINIT.m` with Gurobi;
+  raven-python via `raven_python.init.ftinit` with Gurobi (`mip_gap=0.001`, `time_limit=600`).
+* **tINIT.** raven-python `get_init_model` (classic single-MILP INIT) on HCT116, compared to
+  the ftINIT result for the same cell line.
+* **Tasks.** Two raven-python ftINIT variants: *no-task* (expression only) and
+  *task-constrained* (essential metabolic tasks, `metabolicTasks_Essential.txt`, force
+  task-essential reactions to be kept). RAVEN's reference is task-constrained.
+* **Solver.** Gurobi 13.0.1 for both tools.
+
+## Engineering findings (raven-python tractability)
+
+Getting ftINIT to run at genome scale surfaced three issues, all now fixed and matching
+RAVEN's design:
+
+1. **O(n²) constraint construction.** Building the steady-state balances with Python
+   `sum()` re-canonicalises a growing sympy expression at each term; hub metabolites
+   (ATP/H⁺/H₂O in ~10³ reactions) made one constraint take ~minutes (≈154 s total build,
+   benchmark: 1500-term `sum` = 59 s vs `optlang.symbolics.add` = 0.01 s). Fixed by
+   building flat term lists once per reaction and summing with `optlang.symbolics.add`
+   (in both ftINIT and tINIT).
+2. **Big-M too loose.** The on/off indicator constraints used each reaction's own bound
+   (±1000) as big-M; with `force_on=0.1` that is a ~10⁴ ratio → very weak LP relaxation
+   → Gurobi never closes the gap. RAVEN uses a fixed big-M = 100. Adopted.
+3. **Stoichiometric rescaling.** A fixed big-M=100 is only valid if no reaction needs
+   flux ≫100; ported RAVEN's `rescaleModelForINIT` (cap each reaction's coefficient
+   dynamic range at 25×, normalise mean |coeff| to 1) into `prep_init_model`. Without it
+   the staged MILP is infeasible (step-1 caps transports that step-0 used freely).
+
+Net effect: a full ftINIT cell-line solve went from *not finishing* to ~200 s,
+comparable to RAVEN.
+
+## Results
+
+### Reaction counts
+
+| cell line | RAVEN ftINIT | raven-python ftINIT (no-task) | raven-python ftINIT (task) |
+|-----------|-------------:|--------------------------:|-----------------------:|
+| DLD1      | 7782 | 7744 | 7774 |
+| GBM       | 7668 | 7667 | 7680 |
+| HCT116    | 7780 | 7752 | 7776 |
+| HELA      | 7832 | 7789 | 7816 |
+| RPE1      | 7569 | 7564 | 7570 |
+
+Counts agree within ~0.5 % everywhere; the task-constrained run is closest (e.g. RPE1
+7570 vs 7569, HCT116 7776 vs 7780). raven-python tINIT (HCT116) gives 6024 reactions — a
+smaller model, as expected from the different (classic INIT) objective.
+
+### Agreement — raven-python (no-task) ftINIT vs RAVEN ftINIT
+
+| cell line | shared | only raven-python | only RAVEN | Jaccard |
+|-----------|-------:|--------------:|-----------:|--------:|
+| DLD1   | 7667 |  77 | 115 | 0.976 |
+| GBM    | 7562 | 105 | 106 | 0.973 |
+| HCT116 | 7675 |  77 | 105 | 0.977 |
+| HELA   | 7707 |  82 | 125 | 0.974 |
+| RPE1   | 7470 |  94 |  99 | 0.975 |
+
+**~97.5 % of reactions are identical** between the two independent implementations, even
+though this run is *expression-only* while RAVEN's reference is task-constrained. The
+"only RAVEN" surplus (≈99–125) is expected to include task-essential reactions that the
+task-constrained run (below) recovers.
+
+### Agreement — raven-python (task-constrained) ftINIT vs RAVEN ftINIT
+
+| cell line | shared | only raven-python | only RAVEN | Jaccard |
+|-----------|-------:|--------------:|-----------:|--------:|
+| DLD1   | 7699 | 75 | 83 | 0.980 |
+| GBM    | 7588 | 92 | 80 | 0.978 |
+| HCT116 | 7696 | 80 | 84 | 0.979 |
+| HELA   | 7735 | 81 | 97 | 0.978 |
+| RPE1   | 7493 | 77 | 76 | 0.980 |
+
+Adding the essential metabolic tasks (same task list RAVEN uses) raises agreement to
+**Jaccard 0.978–0.980** and makes the disagreement symmetric (only-raven-python ≈ only-RAVEN
+≈ 80), confirming the prediction: the task constraints recover RAVEN's task-essential
+reactions. The residual ≈80 reactions each way out of ~7700 is at the level expected from
+MIP-gap tolerance (both accept near-optimal incumbents) and alternate optima.
+
+### raven-python tINIT vs ftINIT (HCT116)
+
+tINIT 6024 rxns vs ftINIT 7752; shared 5957, Jaccard 0.762. tINIT is nearly a subset
+(only 67 reactions unique to it) — the two methods agree on a common core, with ftINIT
+keeping more (its staged formulation and task handling are less aggressive at removal).
+This matches the expected tINIT/ftINIT relationship rather than indicating a defect.
+
+## Conclusions
+
+From identical inputs on a genome-scale human reconstruction, raven-python reproduces RAVEN's
+ftINIT reaction selection to **97.5 % (no-task) and 98 % (task-constrained) set identity**
+across five cell lines — strong evidence of functional equivalence between the two
+independent implementations. Agreement is symmetric and the residual (~80 reactions each
+way) is consistent with MIP near-optimality and alternate optima rather than any
+systematic divergence.
+
+Reaching genome-scale tractability required matching RAVEN's numerical-conditioning
+choices and fixing optlang-specific construction costs (see *Engineering findings*):
+fixed big-M = 100, `rescaleModelForINIT`, `optlang.symbolics.add` instead of Python
+`sum()` in every MILP build (ftINIT, tINIT, and the gap-fill). With these, a
+task-constrained cell-line model builds in ~15–25 min (dominated by the
+essential-forced staged MILP) and a no-task one in ~3 min, comparable to RAVEN.