Empirical study of raven-toolbox's (f)tINIT parameters on a genome-scale model (Human-GEM, Hart2015 / HCT116). Two questions:
- Calibration — on clean data, which parameter values give the best speed/quality
trade-off? (
scripts/analyze_init_params.py) - Robustness — with the task layer always on (it is part of the pipeline, not a
variable), how does degrading the transcriptomics input affect the model, and which
parameters keep it functional and stable? (
scripts/analyze_init_robustness.py)
Both scripts are resumable and reusable on any model/dataset; the numbers below are HCT116. "Jaccard" is reaction-set overlap with the reference (tightest setting / clean data) — for a model-extraction tool the reaction set is the product, and a MIP gap bounds only the objective, so set-stability is tracked separately.
| mip_gap | time (s) | objective | rel.obj.gap | Jaccard vs tightest |
|---|---|---|---|---|
| 0.0002 | 48 | 49357 | ref | ref |
| 0.001 | 44 | 49357 | +0.0000 | 1.0000 |
| 0.003 | 42 | 49289 | +0.0014 | 0.9973 |
| 0.01 | 42 | 49185 | +0.0035 | 0.9935 |
| 0.03 | 52 | 49185 | +0.0035 | 0.9935 |
| 0.1 | 46 | 45615 | +0.0758 | 0.9469 |
Solve time is essentially flat across the gap (the model build dominates), so a tight
gap is nearly free. mip_gap=0.001 reproduces the proven optimum exactly (Jaccard 1.0);
quality only collapses at 0.1. → Default 0.001. (The genome-scale staged pipeline still
needs some gap + a time_limit because the full essential-forced MILP can be much harder
than this single step — see robustness timings.)
| big_m | time (s) | rel.obj.gap | Jaccard vs big_m=100 |
|---|---|---|---|
| 100 | 51 | ref | ref |
| 50 | 54 | +0.0006 | 0.983 |
| 25 | 53 | +0.0007 | 0.982 |
| 250 | 55 | +0.0005 | 0.984 |
| 1000 | 59 | +0.0001 | 0.986 |
At step-0 (on the scaled model) big_m barely affects objective or time, but shifts which
reactions are kept by ~2% (alternate optima). big_m=100 is RAVEN's value and is required
for the staged pipeline to stay feasible (a fixed 100 is only valid with stoichiometric
rescaling — see §1.4). → Default 100.
| force_on | time (s) | rel.obj.gap | Jaccard vs 0.1 |
|---|---|---|---|
| 0.1 | 63 | ref | ref |
| 0.02 | 69 | +0.0005 | 0.983 |
| 0.05 | 56 | +0.0000 | 0.990 |
| 0.2 | 59 | +0.0004 | 0.982 |
| 0.5 | 79 | +0.0005 | 0.985 |
force_on (minimum flux for a reaction to count as "on") changes the model, not just a
tolerance, but the reaction set is fairly insensitive (Jaccard ≥0.98) and the objective
hardly moves. → Default 0.1 (RAVEN), no strong reason to change.
| config | time (s) | rel.obj.gap | Jaccard vs scaled msd=25 |
|---|---|---|---|
| scale on, msd=25 | 51 | ref | ref |
| msd=10 | 49 | +0.0075 | 0.989 |
| msd=50 | 61 | +0.0003 | 0.982 |
| msd=100 | 62 | −0.0001 | 0.986 |
| scale off | 45 | +0.0129 | 0.973 |
At step-0 even scale=off is feasible, but it drifts most (Jaccard 0.973, objective +1.3%);
max_stoich_diff 10–100 are all within ~1%. This understates scaling's importance — at
step-0 there is no big-M cap on the held-out transports. In the full staged pipeline,
scale=off with big_m=100 is infeasible (step-1 caps transports that step-0 used
freely). → Keep scaling on, msd=25 (RAVEN's default).
Calibration summary (defaults are well-chosen): mip_gap=0.001, big_m=100,
force_on=0.1, scaling on (max_stoich_diff=25). For the genome-scale staged pipeline also
set a time_limit (≈120–600 s/step) so a hard essential-forced step returns a near-optimal
incumbent rather than grinding.
mip_gap (eps=1, prod_weight=0.5):
| mip_gap | time (s) | n_kept | Jaccard vs gap=0.001 |
|---|---|---|---|
| 0.001 | 901 | 6024 | ref |
| 0.003 | 869 | 6036 | 0.991 |
| 0.01 | 595 | 5967 | 0.968 |
Tightening the gap costs ~50% more wall time on this MILP (unlike ftINIT step-0, build
doesn't dominate); a 1% gap is ~30% faster with ~3% reaction-set drift.
→ mip_gap=0.001 for stability, 0.01 for a faster looser solve.
eps (gap=0.005, the connectivity-flux threshold — changes the model):
| eps | n_kept | Jaccard vs eps=1.0 |
|---|---|---|
| 0.1 | 6119 | 0.952 |
| 0.5 | 6123 | 0.952 |
| 1.0 | 6064 | ref |
| 2.0 | 6090 | 0.960 |
Each eps value gives a slightly different model (Jaccard ~0.95 across the range); the
reaction-set spread is ~5%. eps=1.0 is RAVEN's default; smaller values produce slightly
larger models (loosen the connectivity bar). Pick by what the data justifies — see the
caveat at the top of init.py.
prod_weight (gap=0.005, the metabolite-production reward — changes the model):
| prod_weight | n_kept | Jaccard vs 0.5 |
|---|---|---|
| 0.0 | 5973 | 0.961 |
| 0.25 | 6015 | 0.974 |
| 0.5 | 6064 | ref |
| 1.0 | 6106 | 0.955 |
A higher prod_weight keeps slightly more reactions (rewards more connectivity); 0.5
(RAVEN's default) is the middle of the range. Effect is modest (~5%).
big_m (gap=0.005, None = per-reaction ub):
| big_m | n_kept | Jaccard vs None |
|---|---|---|
| None (per-rxn ub) | 6064 | ref |
| 1000 | 6064 | 1.000 |
| 250 | 6114 | 0.953 |
| 100 | 6023 | 0.930 |
big_m=1000 is identical to big_m=None here because the model's ub is ±1000 already
(so the per-reaction cap is 1000). Smaller fixed caps (250, 100) shift alternate optima
by 5–7% but do not change the objective. Unlike ftINIT, tINIT has not been run through
rescaleModelForINIT, so dropping big_m below 1000 may invalidate the LP feasibility
region for reactions whose own bound is larger — keep the default (per-reaction ub).
tINIT calibration summary: mip_gap=0.001 (or 0.01 for ~30% speedup at ~3% drift);
eps, prod_weight, big_m defaults are fine — they all change the model, not just
tolerance, so tune by what the data and biology call for, not by these tables.
| config | time (s) | n_kept | Jaccard vs gap=0.001 |
|---|---|---|---|
| mip_gap=0.001 (default big_m=100) | 346 | 7752 | ref |
| mip_gap=0.003 | 288 | 7748 | 0.993 |
| mip_gap=0.01 | 218 | 7746 | 0.995 |
| big_m=50 (gap=0.003) | 738 | 7799 | 0.974 |
| big_m=250 (gap=0.003) | 345 | 7766 | 0.977 |
Unlike the single-step ftINIT MILP in §1.1 (where build time dominated and the gap was
free), the full pipeline does benefit from a looser gap: mip_gap=0.01 is ~37 %
faster than 0.001 with Jaccard 0.995 — essentially the same model. → For genome-scale
ftINIT, mip_gap=0.01 (or 0.005) is the sweet spot; keep 0.001 only if exact
reproducibility matters more than a few minutes.
big_m=50 is actually slower than the default 100 (738s vs 346s) — a tighter cap makes
the LP relaxation harder for borderline reactions; big_m=250 is the same speed as 100
but shifts the reaction set ~2 %. → Keep big_m=100 (RAVEN's value, what scaling is
designed for).
ftINIT's task layer (gap-fill) and tINIT's task layer (forcing essential_rxns) are
not equivalent. tINIT forces every essential reaction to carry flux ≥ eps. With
Human-GEM's 113 task-essential reactions (the validation set), the resulting steady-state
system is infeasible regardless of eps:
essentials passed to run_init |
result |
|---|---|
| 0 (the original validation call) | ✅ ok, 6024 reactions |
113 (merged-survivor IDs from prep.essential_rxns) |
❌ infeasible (proven by Gurobi presolve, ~330s) |
260 (pre-merge IDs from find_task_essential_reactions cache) |
❌ infeasible (~480s) |
Lowering eps (1.0 → 0.1) does not fix it; the issue is that 100+ reactions cannot
simultaneously each carry a fixed positive flux in their forced direction at steady state.
ftINIT avoids this by using an adaptive per-reaction forcing magnitude
(min(0.99·|previous flux|, force_on)) so each essential is forced at a value it
actually carried in a prior feasible solution. tINIT's one-size-fits-all eps
mechanism doesn't have that escape hatch.
Practical takeaway. For functional context-specific models on genome-scale data, use
ftINIT — the task layer (gap-fill, adaptive essential forcing) is what makes the pipeline
robust. tINIT remains useful for the small/no-essentials case (e.g. the
expression-only baseline in the validation), but pairing it with the full task-essential
set is a known incompatibility; the tINIT robustness study below is therefore reported
with essential_rxns=[].
The metabolic-task + gap-fill layer is held fixed; only the expression input is degraded.
frac = fraction of the 69 essential tasks the extracted model performs (check_tasks);
Jaccard = reaction-set overlap with the clean-data model.
| input | n_rxns | tasks pass | frac | Jaccard vs clean |
|---|---|---|---|---|
| clean | 7777 | 69/69 | 1.000 | ref |
| dropout 50% | 5968 | 67/69 | 0.971 | 0.713 |
| dropout 70% | 5113 | 68/69 | 0.986 | 0.594 |
| noise σ=1.0 | 7812 | 69/69 | 1.000 | 0.952 |
| noise σ=2.0 | 7768 | 69/69 | 1.000 | 0.919 |
| downsample 50% | 6765 | 68/69 | 0.986 | 0.815 |
| downsample 70% | 6123 | 68/69 | 0.986 | 0.728 |
(dropout = genes set to 0 → score −5; noise = ×exp(N(0,σ)); downsample = genes dropped →
no_gene_score.)
Findings:
- Robust to noise, sensitive to sparsity. Multiplicative expression noise barely changes the model (Jaccard 0.92–0.95, size stable, all tasks pass). Sparsity is far more damaging: 50% dropout already drops the reaction set to 0.71 Jaccard (and shrinks 7777→5968), 70% to 0.59.
- Sparsity shrinks the model toward the task-essential core. Missing/zeroed genes remove the expression evidence for a reaction; the task layer only adds back what tasks require, so sparse input yields smaller, more "generic" models. Dropout (−5) is harsher than downsampling (−2).
- Functionality is largely but not perfectly preserved. With the task layer,
fracstays ≥0.97, but dips to 67–68/69 under heavy sparsity — i.e. the bounded gap-fill plus the post-hoc low-score-gene pruning occasionally leave 1–2 essential tasks unsatisfied. (See the lever sweep below for whetherno_gene_score/force_onrecover them.) - Cost tracks damage. Dropout runs are slowest (more broken tasks → more gap-fill); noise is cheap.
Tractability note (a parameter that prevents failure): the gap-fill MILP must be bounded (
mip_gap/time_limit). Unbounded, severe degradation (which breaks many tasks at once) makes it solve a hard min-cost MILP per broken task to proven optimality — observed to run75 min for one 90%-dropout model. With the bound it returns a near-optimal fill quickly.
| config | n_rxns | frac | Jaccard vs clean |
|---|---|---|---|
| default (no_gene_score=−2, force_on=0.1) | 5113 | 0.986 | 0.594 |
| no_gene_score=−1.0 | 5110 | 0.986 | 0.593 |
| no_gene_score=−0.5 | 5128 | 0.986 | 0.593 |
| force_on=0.2 | 5159 | 0.986 | 0.600 |
No lever recovers the drift — Jaccard stays ~0.59 across all settings. Two reasons, both informative:
- The information dropout destroys is simply gone; no scoring/connectivity knob reconstructs the missing expression evidence. You cannot tune your way out of sparse input.
no_gene_scoreis the wrong knob for dropout specifically: dropout leaves genes present but zero (scored −5), whereasno_gene_scoreonly governs reactions whose genes are absent from the data — i.e. the downsampling failure mode. Sono_gene_scoreis a meaningful lever for missing-data sparsity (a less-negative value keeps more unmeasured reactions, growing the model back toward clean), but it has nothing to act on under dropout.
Practical takeaway. The robustness levers that matter are structural, not numeric: the
task + gap-fill layer (keeps the model functional regardless of input quality) and a bounded
gap-fill MILP (keeps it tractable). For missing-gene sparsity specifically, no_gene_score
trades model size against confidence. For noise, defaults are already robust. No parameter
restores fidelity lost to dropout — that is a property of the data, not the pipeline.
For the reasons in §1.5, tINIT cannot accept the full task-essential set as forced
reactions; this section runs get_init_model with essential_rxns=[] to show the
realistic tINIT behaviour on the same degradation gradient — i.e. the cost of not
having ftINIT's gap-fill safety net.
| input | n_rxns | tasks pass | frac | Jaccard vs clean |
|---|---|---|---|---|
| clean | 6277 | 35/69 | 0.507 | ref |
| dropout 50% | 4910 | 23/69 | 0.333 | 0.673 |
| dropout 70% | 2807 | 21/69 | 0.304 | 0.408 |
| noise σ=1.0 | 6661 | 25/69 | 0.362 | 0.878 |
| noise σ=2.0 | 6146 | 24/69 | 0.348 | 0.869 |
| downsample 50% | 5006 | 24/69 | 0.348 | 0.722 |
| downsample 70% | 3541 | 19/69 | 0.275 | 0.515 |
The headline contrast with ftINIT:
| ftINIT (task layer) | tINIT (no task layer) | |
|---|---|---|
| clean | 7777 rxns, 69/69 (1.000) | 6277 rxns, 35/69 (0.507) |
| dropout 0.7 | 5113 rxns, 68/69 (0.986), J 0.594 | 2807 rxns, 21/69 (0.304), J 0.408 |
| noise σ=2.0 | 7768 rxns, 69/69 (1.000), J 0.919 | 6146 rxns, 24/69 (0.348), J 0.869 |
| downsample 0.7 | 6123 rxns, 68/69 (0.986), J 0.728 | 3541 rxns, 19/69 (0.275), J 0.515 |
- tINIT-without-gap-fill fails roughly half the essential tasks even on clean data; ftINIT-with-gap-fill passes them all. Under degradation tINIT collapses further (down to 19/69 at 70 % downsample), ftINIT stays ≥67/69 throughout.
- Reaction-set drift is comparable under noise (Jaccard 0.87 vs 0.92) but worse for tINIT under sparsity (0.41 vs 0.59 at 70 % dropout) because there's no gap-fill to re-add structurally needed reactions.
This is not a critique of the tINIT algorithm — classic INIT was designed for the no-task-layer case. It is the empirical evidence for why ftINIT's design choices (task
- gap-fill, adaptive essential forcing) are the right ones for genome-scale tissue model extraction, and why tINIT is mostly useful here as a baseline.
| config | n_rxns | tasks pass | frac | Jaccard vs clean |
|---|---|---|---|---|
| default (prod_weight=0.5, eps=0.1) | 2807 | 21/69 | 0.304 | 0.408 |
| prod_weight=0.0 | 2791 | 21/69 | 0.304 | 0.416 |
| prod_weight=1.0 | 3386 | 22/69 | 0.319 | 0.485 |
| prod_weight=2.0 | 3888 | 21/69 | 0.304 | 0.458 |
| eps=0.5 | 2620 | 21/69 | 0.304 | 0.391 |
| eps=1.0 | 3311 | 22/69 | 0.319 | 0.460 |
Same conclusion as the ftINIT levers: parameter tuning can nudge (prod_weight≥1.0
or a larger eps modestly grows the model and lifts Jaccard from 0.41 to ~0.48), but
no tINIT parameter recovers anything close to ftINIT's functionality (22/69 at best
vs ftINIT's 67–69/69 at the same dropout). The gap-fill layer, not the parameter
choice, is what bridges the gap.
See init_solver_benchmark.md for the genome-scale
solver comparison (Gurobi/HiGHS/GLPK) and tests/test_init_solvers.py
for CI parameterised over installed MILP backends. Headline: at genome scale only Gurobi
is viable today; HiGHS fails on an upstream optlang hybrid_interface.clone() bug; GLPK
ignores configuration.timeout on MIP and ran 1 h+ without converging. Toy-scale
correctness is portable (Gurobi + GLPK give identical verdicts on the unit-test
networks), so local development works without a Gurobi licence.
python scripts/analyze_init_params.py --cell HCT116 --sweeps ftinit_milp,prep_scale,tinit,ftinit_full
python scripts/analyze_init_robustness.py --cell HCT116 --algo ftinit # then --algo tinitBoth reuse the cached Human-GEM preps from the validation run (docs/humangem_validation.md) and are resumable.