Skip to content

Commit e8c26fa

Browse files
authored
Subcellular localisation prediction and yeast-GEM validation (#9)
* Project scaffold: pyproject + package skeleton + README + LICENSE * Add GitHub Actions CI and the maintainer-scripts README * Add the foundation utilities: GPR, balance, parse, sort, validate * Add the model-manipulation layer (add, remove, transport, merge, etc.) * Add binary + data resolvers for external tools and published artefacts * Add YAML and SIF model I/O * Add Excel export and the Standard-GEM git-layout export * Add BLAST and DIAMOND wrappers for protein-homology searches * Add the homology-based draft model builder (getModelFromHomology port) * Add KEGG download, dump parser and taxonomy parser * Add KEGG HMM-library build and HMM-based KO assignment * Add KEGG species-model assembly (per-organism reconstruction) * Add KEGG artefact-build scripts and HMM-cutoff calibration docs * Add metabolic-task parsing and the check_tasks validator * Add connectivity gap-filling (MILP) against template models * Add the tINIT (INIT) MILP and its supporting machinery * Add the ftINIT pipeline and task-aware gap-filling * Add Human-GEM validation, parameter studies and cross-solver tests * Add HPA omics ingestion (proteomics + RNA-seq) * Add FSEOF, reporter metabolites and flux sampling * Add N-model comparison (presence + Jaccard + optional task check) * Add subcellular-localisation prediction (MILP) with pluggable predictors * Add the yeast-GEM localization benchmark (real-data validation)
1 parent 7deb8d4 commit e8c26fa

6 files changed

Lines changed: 1194 additions & 0 deletions

File tree

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
# yeast-GEM localisation benchmark
2+
3+
Real-data validation of [`localization.predict_localization`](../src/raven_python/localization/predict.py)
4+
on the curated yeast-GEM. The benchmark is end-to-end — model, scoring, MILP — and
5+
sweeps predictor noise so the failure modes are visible, not just the headline accuracy.
6+
7+
* Driver: [`scripts/benchmark_localization_yeast.py`](../scripts/benchmark_localization_yeast.py)
8+
* Yeast-GEM source: `pcSecYeastSpecies/Model/yeastGEM.xml` (3991 reactions, 1147 genes,
9+
14 compartments).
10+
* Run command:
11+
```bash
12+
python scripts/benchmark_localization_yeast.py \
13+
--yeast-gem ~/github/pcSecYeastSpecies/Model/yeastGEM.xml \
14+
--noise 0,0.1,0.25,0.5 \
15+
--max-reactions 300 \
16+
--transport-cost 0.05 \
17+
--time-limit 300 \
18+
--doc docs/yeast_localization_benchmark.md
19+
```
20+
21+
## Setup
22+
23+
1. **Truth set**: every yeast-GEM reaction that (a) has a GPR, (b) is non-boundary,
24+
and (c) has all metabolites in the same compartment. 2 155 reactions qualify;
25+
stratified subsampling to 298 keeps the per-compartment distribution. The 14
26+
compartments collapse to 12 placement targets in the truth set (extracellular and
27+
the lipid particle / vacuolar membrane variants stay distinct).
28+
2. **Flattening**: the model is squashed into one compartment with
29+
[`manipulation.merge_compartments`](../src/raven_python/manipulation/compartments.py)
30+
so the predictor cannot lean on metabolite-topology evidence. Without this step
31+
every GPR'd reaction's "predicted" compartment is just its current one — vacuous.
32+
3. **Reference scores**: each gene gets `1.0` in every compartment that hosts one of
33+
its reactions in the original (multi-compartment) model. This is the
34+
*perfect-predictor* upper bound; real WoLF PSORT / DeepLoc output will be noisier.
35+
4. **Noise injection**: at noise level `p` each gene independently has probability
36+
`p` of having a confidently *wrong* compartment grafted in as the new top score
37+
(the true compartment is demoted to half its score). This simulates a predictor
38+
that's right `1-p` of the time and confidently wrong otherwise — a more
39+
pessimistic stand-in than uniform Gaussian jitter.
40+
5. **MILP**: `transport_cost=0.05`, `multi_compartment_penalty=0.5`, `mip_gap=0.01`,
41+
`time_limit=300s`, Gurobi. The MILP has 7 982 binaries, 2 691 rows, 29 842
42+
nonzeros at this scale — solves in 30–50 s.
43+
44+
## Accuracy vs. predictor noise
45+
46+
Accuracy = fraction of relocated reactions that the MILP places back in the truth
47+
compartment.
48+
49+
| noise | seconds | n_total | n_correct | n_unplaced | accuracy |
50+
|------:|--------:|--------:|----------:|-----------:|---------:|
51+
| 0.00 | 46 | 298 | 213 | 0 | 0.715 |
52+
| 0.10 | 34 | 298 | 199 | 0 | 0.668 |
53+
| 0.25 | 41 | 298 | 175 | 0 | 0.587 |
54+
| 0.50 | 31 | 298 | 115 | 0 | 0.386 |
55+
56+
Monotone degradation, no MILP infeasibilities at any noise level. At 10 % confident
57+
mis-scoring the accuracy drops only ~4.7 pp — the algorithm largely shrugs off small
58+
predictor noise because each compartment's evidence is the sum of all its genes'
59+
scores, so a few wrong genes get out-voted by their neighbours. At 50 % the algorithm
60+
is still better than the 1/12 = 8.3 % uniform baseline, but the loss is steep.
61+
62+
## Confusion matrix at noise=0.00
63+
64+
Rows = curated (true) compartment; columns = predicted. The `c` column dominates
65+
because cytosolic genes are also active in many other compartments (so an mm-only
66+
reaction shares its genes with cytosolic reactions, and the algorithm picks `c`).
67+
68+
| true \ pred | c | ce | er | erm | g | gm | lp | m | mm | p | v | vm |
69+
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
70+
| **c** | 91 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
71+
| **ce** | 4 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
72+
| **e** | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
73+
| **er** | 6 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
74+
| **erm** | 18 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
75+
| **g** | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
76+
| **gm** | 4 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
77+
| **lp** | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 |
78+
| **m** | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 | 0 |
79+
| **mm** | 38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
80+
| **n** | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
81+
| **p** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
82+
| **v** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
83+
| **vm** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
84+
85+
## Per-compartment accuracy at noise=0.00
86+
87+
| compartment | n | n_correct | accuracy |
88+
|---|---:|---:|---:|
89+
| c | 91 | 91 | 1.000 |
90+
| ce | 15 | 11 | 0.733 |
91+
| e | 1 | 0 | 0.000 |
92+
| er | 13 | 7 | 0.538 |
93+
| erm | 48 | 30 | 0.625 |
94+
| g | 2 | 2 | 1.000 |
95+
| gm | 7 | 3 | 0.429 |
96+
| lp | 21 | 21 | 1.000 |
97+
| m | 28 | 20 | 0.714 |
98+
| mm | 41 | 3 | 0.073 |
99+
| n | 6 | 0 | 0.000 |
100+
| p | 16 | 16 | 1.000 |
101+
| v | 1 | 1 | 1.000 |
102+
| vm | 8 | 8 | 1.000 |
103+
104+
## What the failures mean
105+
106+
* **Compartments with self-contained gene sets are perfect.** `c`, `g`, `lp`, `p`,
107+
`v`, `vm` reach 100 %: their gene sets are largely disjoint from `c`'s, so the
108+
per-compartment evidence sum picks them cleanly.
109+
* **Inner-membrane vs. matrix collapses to cytosol.** The mitochondrial inner
110+
membrane (`mm`, 41 reactions, 7.3 % correct) and nucleus (`n`, 6 reactions, 0 %
111+
correct) lose to `c` because their genes are *also* annotated to cytosolic
112+
reactions. The algorithm sees gene `X` with score `1.0` in both `c` and `mm`,
113+
and `c` wins on the larger pool of co-localised reactions. This is faithful to
114+
the predictor evidence — a real WoLF PSORT / DeepLoc score table that distinguishes
115+
inner-membrane from matrix would do better here, but the
116+
derive-scores-from-the-model harness can't see that distinction.
117+
* **Membrane / non-membrane pairs split correctly.** `erm` vs `er`, `gm` vs `g`
118+
the algorithm prefers the membrane sub-compartment when its genes are more
119+
membrane-typical, which the score harness reproduces. 60–75 % is honest signal.
120+
* **No MILP infeasibilities.** Even at 50 % confident mis-scoring every reaction
121+
gets placed (the unplaced column stays 0).
122+
123+
## Calibration insight: `transport_cost` matters
124+
125+
The first smoke run used `transport_cost=0.5` (the default) and dumped almost every
126+
reaction into `c`. With ~5 metabolites per reaction the per-reaction transport bill
127+
overwhelmed the unit-scale gene reward, so the optimiser's best move was always
128+
"keep it in the default compartment, pay no transports." Dropping to
129+
`transport_cost=0.05` restored the per-compartment signal. For a real predictor with
130+
typical score magnitudes ≪ 1, the user should expect to dial `transport_cost` *down*
131+
into the same per-metabolite-per-compartment range as the typical gene-score-delta —
132+
the doc-string default of 0.5 is sized for clean integer-style scores, not
133+
soft-probability output.
134+
135+
## Reproducing
136+
137+
Make the smoke fast (subsampled, small noise grid):
138+
```bash
139+
python scripts/benchmark_localization_yeast.py \
140+
--noise 0,0.25 --max-reactions 100 --time-limit 60
141+
```
142+
143+
Plug in a real predictor (CSV of `gene_id` + one column per compartment):
144+
```bash
145+
python scripts/benchmark_localization_yeast.py \
146+
--scores-csv my_deeploc_yeast.csv \
147+
--noise 0 --max-reactions 300
148+
```

0 commit comments

Comments
 (0)