Skip to content

Latest commit

 

History

History
148 lines (128 loc) · 7.58 KB

File metadata and controls

148 lines (128 loc) · 7.58 KB

yeast-GEM localisation benchmark

Real-data validation of localization.predict_localization on the curated yeast-GEM. The benchmark is end-to-end — model, scoring, MILP — and sweeps predictor noise so the failure modes are visible, not just the headline accuracy.

  • Driver: scripts/benchmark_localization_yeast.py
  • Yeast-GEM source: pcSecYeastSpecies/Model/yeastGEM.xml (3991 reactions, 1147 genes, 14 compartments).
  • Run command:
    python scripts/benchmark_localization_yeast.py \
        --yeast-gem ~/github/pcSecYeastSpecies/Model/yeastGEM.xml \
        --noise 0,0.1,0.25,0.5 \
        --max-reactions 300 \
        --transport-cost 0.05 \
        --time-limit 300 \
        --doc docs/yeast_localization_benchmark.md

Setup

  1. Truth set: every yeast-GEM reaction that (a) has a GPR, (b) is non-boundary, and (c) has all metabolites in the same compartment. 2 155 reactions qualify; stratified subsampling to 298 keeps the per-compartment distribution. The 14 compartments collapse to 12 placement targets in the truth set (extracellular and the lipid particle / vacuolar membrane variants stay distinct).
  2. Flattening: the model is squashed into one compartment with manipulation.merge_compartments so the predictor cannot lean on metabolite-topology evidence. Without this step every GPR'd reaction's "predicted" compartment is just its current one — vacuous.
  3. Reference scores: each gene gets 1.0 in every compartment that hosts one of its reactions in the original (multi-compartment) model. This is the perfect-predictor upper bound; real WoLF PSORT / DeepLoc output will be noisier.
  4. Noise injection: at noise level p each gene independently has probability p of having a confidently wrong compartment grafted in as the new top score (the true compartment is demoted to half its score). This simulates a predictor that's right 1-p of the time and confidently wrong otherwise — a more pessimistic stand-in than uniform Gaussian jitter.
  5. MILP: transport_cost=0.05, multi_compartment_penalty=0.5, mip_gap=0.01, time_limit=300s, Gurobi. The MILP has 7 982 binaries, 2 691 rows, 29 842 nonzeros at this scale — solves in 30–50 s.

Accuracy vs. predictor noise

Accuracy = fraction of relocated reactions that the MILP places back in the truth compartment.

noise seconds n_total n_correct n_unplaced accuracy
0.00 46 298 213 0 0.715
0.10 34 298 199 0 0.668
0.25 41 298 175 0 0.587
0.50 31 298 115 0 0.386

Monotone degradation, no MILP infeasibilities at any noise level. At 10 % confident mis-scoring the accuracy drops only ~4.7 pp — the algorithm largely shrugs off small predictor noise because each compartment's evidence is the sum of all its genes' scores, so a few wrong genes get out-voted by their neighbours. At 50 % the algorithm is still better than the 1/12 = 8.3 % uniform baseline, but the loss is steep.

Confusion matrix at noise=0.00

Rows = curated (true) compartment; columns = predicted. The c column dominates because cytosolic genes are also active in many other compartments (so an mm-only reaction shares its genes with cytosolic reactions, and the algorithm picks c).

true \ pred c ce er erm g gm lp m mm p v vm
c 91 0 0 0 0 0 0 0 0 0 0 0
ce 4 11 0 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 0 0 0 0 0 0 0
er 6 0 7 0 0 0 0 0 0 0 0 0
erm 18 0 0 30 0 0 0 0 0 0 0 0
g 0 0 0 0 2 0 0 0 0 0 0 0
gm 4 0 0 0 0 3 0 0 0 0 0 0
lp 0 0 0 0 0 0 21 0 0 0 0 0
m 8 0 0 0 0 0 0 20 0 0 0 0
mm 38 0 0 0 0 0 0 0 3 0 0 0
n 3 0 0 0 0 3 0 0 0 0 0 0
p 0 0 0 0 0 0 0 0 0 16 0 0
v 0 0 0 0 0 0 0 0 0 0 1 0
vm 0 0 0 0 0 0 0 0 0 0 0 8

Per-compartment accuracy at noise=0.00

compartment n n_correct accuracy
c 91 91 1.000
ce 15 11 0.733
e 1 0 0.000
er 13 7 0.538
erm 48 30 0.625
g 2 2 1.000
gm 7 3 0.429
lp 21 21 1.000
m 28 20 0.714
mm 41 3 0.073
n 6 0 0.000
p 16 16 1.000
v 1 1 1.000
vm 8 8 1.000

What the failures mean

  • Compartments with self-contained gene sets are perfect. c, g, lp, p, v, vm reach 100 %: their gene sets are largely disjoint from c's, so the per-compartment evidence sum picks them cleanly.
  • Inner-membrane vs. matrix collapses to cytosol. The mitochondrial inner membrane (mm, 41 reactions, 7.3 % correct) and nucleus (n, 6 reactions, 0 % correct) lose to c because their genes are also annotated to cytosolic reactions. The algorithm sees gene X with score 1.0 in both c and mm, and c wins on the larger pool of co-localised reactions. This is faithful to the predictor evidence — a real WoLF PSORT / DeepLoc score table that distinguishes inner-membrane from matrix would do better here, but the derive-scores-from-the-model harness can't see that distinction.
  • Membrane / non-membrane pairs split correctly. erm vs er, gm vs g — the algorithm prefers the membrane sub-compartment when its genes are more membrane-typical, which the score harness reproduces. 60–75 % is honest signal.
  • No MILP infeasibilities. Even at 50 % confident mis-scoring every reaction gets placed (the unplaced column stays 0).

Calibration insight: transport_cost matters

The first smoke run used transport_cost=0.5 (the default) and dumped almost every reaction into c. With ~5 metabolites per reaction the per-reaction transport bill overwhelmed the unit-scale gene reward, so the optimiser's best move was always "keep it in the default compartment, pay no transports." Dropping to transport_cost=0.05 restored the per-compartment signal. For a real predictor with typical score magnitudes ≪ 1, the user should expect to dial transport_cost down into the same per-metabolite-per-compartment range as the typical gene-score-delta — the doc-string default of 0.5 is sized for clean integer-style scores, not soft-probability output.

Reproducing

Make the smoke fast (subsampled, small noise grid):

python scripts/benchmark_localization_yeast.py \
    --noise 0,0.25 --max-reactions 100 --time-limit 60

Plug in a real predictor (CSV of gene_id + one column per compartment):

python scripts/benchmark_localization_yeast.py \
    --scores-csv my_deeploc_yeast.csv \
    --noise 0 --max-reactions 300