yeast-GEM localisation benchmark

Real-data validation of localization.predict_localization on the curated yeast-GEM. The benchmark is end-to-end — model, scoring, MILP — and sweeps predictor noise so the failure modes are visible, not just the headline accuracy.

Driver: scripts/benchmark_localization_yeast.py
Yeast-GEM source: pcSecYeastSpecies/Model/yeastGEM.xml (3991 reactions, 1147 genes, 14 compartments).

Run command:

python scripts/benchmark_localization_yeast.py \
    --yeast-gem ~/github/pcSecYeastSpecies/Model/yeastGEM.xml \
    --noise 0,0.1,0.25,0.5 \
    --max-reactions 300 \
    --transport-cost 0.05 \
    --time-limit 300 \
    --doc docs/yeast_localization_benchmark.md

Setup

Truth set: every yeast-GEM reaction that (a) has a GPR, (b) is non-boundary, and (c) has all metabolites in the same compartment. 2 155 reactions qualify; stratified subsampling to 298 keeps the per-compartment distribution. The 14 compartments collapse to 12 placement targets in the truth set (extracellular and the lipid particle / vacuolar membrane variants stay distinct).
Flattening: the model is squashed into one compartment with manipulation.merge_compartments so the predictor cannot lean on metabolite-topology evidence. Without this step every GPR'd reaction's "predicted" compartment is just its current one — vacuous.
Reference scores: each gene gets 1.0 in every compartment that hosts one of its reactions in the original (multi-compartment) model. This is the perfect-predictor upper bound; real WoLF PSORT / DeepLoc output will be noisier.
Noise injection: at noise level p each gene independently has probability p of having a confidently wrong compartment grafted in as the new top score (the true compartment is demoted to half its score). This simulates a predictor that's right 1-p of the time and confidently wrong otherwise — a more pessimistic stand-in than uniform Gaussian jitter.
MILP: transport_cost=0.05, multi_compartment_penalty=0.5, mip_gap=0.01, time_limit=300s, Gurobi. The MILP has 7 982 binaries, 2 691 rows, 29 842 nonzeros at this scale — solves in 30–50 s.

Accuracy vs. predictor noise

Accuracy = fraction of relocated reactions that the MILP places back in the truth compartment.

noise	seconds	n_total	n_correct	accuracy
0.00	46	298	213	0.715
0.10	34	298	199	0.668
0.25	41	298	175	0.587
0.50	31	298	115	0.386

Monotone degradation, no MILP infeasibilities at any noise level. At 10 % confident mis-scoring the accuracy drops only ~4.7 pp — the algorithm largely shrugs off small predictor noise because each compartment's evidence is the sum of all its genes' scores, so a few wrong genes get out-voted by their neighbours. At 50 % the algorithm is still better than the 1/12 = 8.3 % uniform baseline, but the loss is steep.

Confusion matrix at noise=0.00

Rows = curated (true) compartment; columns = predicted. The c column dominates because cytosolic genes are also active in many other compartments (so an mm-only reaction shares its genes with cytosolic reactions, and the algorithm picks c).

true \ pred	c	ce	er	erm	g	gm	lp	m	mm	p	v	vm
c	91	0	0	0	0	0	0	0	0	0	0	0
ce	4	11	0	0	0	0	0	0	0	0	0	0
e	1	0	0	0	0	0	0	0	0	0	0	0
er	6	0	7	0	0	0	0	0	0	0	0	0
erm	18	0	0	30	0	0	0	0	0	0	0	0
g	0	0	0	0	2	0	0	0	0	0	0	0
gm	4	0	0	0	0	3	0	0	0	0	0	0
lp	0	0	0	0	0	0	21	0	0	0	0	0
m	8	0	0	0	0	0	0	20	0	0	0	0
mm	38	0	0	0	0	0	0	0	3	0	0	0
n	3	0	0	0	0	3	0	0	0	0	0	0
p	0	0	0	0	0	0	0	0	0	16	0	0
v	0	0	0	0	0	0	0	0	0	0	1	0
vm	0	0	0	0	0	0	0	0	0	0	0	8

Per-compartment accuracy at noise=0.00

compartment	n	n_correct	accuracy
c	91	91	1.000
ce	15	11	0.733
e	1	0	0.000
er	13	7	0.538
erm	48	30	0.625
g	2	2	1.000
gm	7	3	0.429
lp	21	21	1.000
m	28	20	0.714
mm	41	3	0.073
n	6	0	0.000
p	16	16	1.000
v	1	1	1.000
vm	8	8	1.000

What the failures mean

Compartments with self-contained gene sets are perfect. c, g, lp, p, v, vm reach 100 %: their gene sets are largely disjoint from c's, so the per-compartment evidence sum picks them cleanly.
Inner-membrane vs. matrix collapses to cytosol. The mitochondrial inner membrane (mm, 41 reactions, 7.3 % correct) and nucleus (n, 6 reactions, 0 % correct) lose to c because their genes are also annotated to cytosolic reactions. The algorithm sees gene X with score 1.0 in both c and mm, and c wins on the larger pool of co-localised reactions. This is faithful to the predictor evidence — a real WoLF PSORT / DeepLoc score table that distinguishes inner-membrane from matrix would do better here, but the derive-scores-from-the-model harness can't see that distinction.
Membrane / non-membrane pairs split correctly. erm vs er, gm vs g — the algorithm prefers the membrane sub-compartment when its genes are more membrane-typical, which the score harness reproduces. 60–75 % is honest signal.
No MILP infeasibilities. Even at 50 % confident mis-scoring every reaction gets placed (the unplaced column stays 0).

Calibration insight: `transport_cost` matters

The first smoke run used transport_cost=0.5 (the default) and dumped almost every reaction into c. With ~5 metabolites per reaction the per-reaction transport bill overwhelmed the unit-scale gene reward, so the optimiser's best move was always "keep it in the default compartment, pay no transports." Dropping to transport_cost=0.05 restored the per-compartment signal. For a real predictor with typical score magnitudes ≪ 1, the user should expect to dial transport_cost down into the same per-metabolite-per-compartment range as the typical gene-score-delta — the doc-string default of 0.5 is sized for clean integer-style scores, not soft-probability output.

Reproducing

Make the smoke fast (subsampled, small noise grid):

python scripts/benchmark_localization_yeast.py \
    --noise 0,0.25 --max-reactions 100 --time-limit 60

Plug in a real predictor (CSV of gene_id + one column per compartment):

python scripts/benchmark_localization_yeast.py \
    --scores-csv my_deeploc_yeast.csv \
    --noise 0 --max-reactions 300

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

yeast-GEM localisation benchmark

Setup

Accuracy vs. predictor noise

Confusion matrix at noise=0.00

Per-compartment accuracy at noise=0.00

What the failures mean

Calibration insight: `transport_cost` matters

Reproducing

FilesExpand file tree

yeast_localization_benchmark.md

Latest commit

History

yeast_localization_benchmark.md

File metadata and controls

yeast-GEM localisation benchmark

Setup

Accuracy vs. predictor noise

Confusion matrix at noise=0.00

Per-compartment accuracy at noise=0.00

What the failures mean

Calibration insight: transport_cost matters

Reproducing

Calibration insight: `transport_cost` matters