This note records the measurements behind the default KO-assignment parameters in
reconstruction/kegg/query.py (assign_kos / get_kegg_model_from_sequences,
pipeline step 3b.5) and IMPROVEMENTS K15. It is the evidence for moving away
from RAVEN's 1e-50 cut-off.
assign_kos turns an hmmscan KO×gene E-value matrix into gene→KO assignments
in three steps:
cutoff— keep hits withevalue <= cutoff.min_score_ratio_ko— within a KO, drop genes whoselog(evalue)/log(best_evalue_in_KO) < min_score_ratio_ko.min_score_ratio_g— within a gene, drop KOs whoselog(evalue)/log(best_evalue_for_gene) < min_score_ratio_g.
- Data: KEGG release 118. Libraries: the maintainer-built
prokaryotes.hmm(831 MB) andeukaryotes.hmm(692 MB), 90 %-clustered, FFT-NS-2/PartTree (K12). - Queries: each organism's full proteome, extracted from
genes.pep. - Ground truth: the organism's real KEGG gene→KO links, from the
organism_gene_kotable (restricted, as the table is, to reaction-linked KOs). - Prediction:
assign_kosoutput, with theorganism:prefix stripped from query gene IDs so they match the bare gene IDs in the ground truth. - Metrics (gene→KO level): precision = |pred ∩ truth| / |pred|,
recall = |pred ∩ truth| / |truth|, F1. Reaction-level:
rxn_rec= fraction of the organism's true reactions recovered (KO→reaction viako_reaction);rxn_novel= predicted reactions not in the annotation set. - Reproduce with
scripts/analyze_hmm_cutoffs.py.
All four organisms are present in the libraries' training set, so their own
sequences hit their KO profiles strongly and recall is an upper bound. The
calibration is therefore relative (how the parameters trade off, and where
RAVEN's default sits relative to the signal), not an absolute accuracy estimate.
A genome genuinely absent from KEGG would be the next validation. Also note that
rxn_novel / "precision < 1" partly reflects legitimate homology KEGG never
annotated for that organism (paralogs, un-curated genes), not pure error — so the
precision figures are a lower bound on real precision.
| code | organism | library | proteome (seqs) | true gene→KO pairs | true reactions |
|---|---|---|---|---|---|
sce |
Saccharomyces cerevisiae (budding yeast) | euk | 6021 | 841 | 1217 |
cme |
Cyanidioschyzon merolae (red alga) | euk | 5010 | 709 | 1157 |
eco |
Escherichia coli K-12 MG1655 | prok | 4288 | 1071 | 1548 |
mge |
Mycoplasmoides genitalium G37 (minimal genome) | prok | 476 | 110 | 211 |
sce/eco are model organisms; cme/mge are lesser-studied, mge
additionally being a small, divergent genome.
log10(E-value) percentiles of the best hit per (gene, KO) pair, split by whether
the pair is in the organism's annotation (matched) or not (novel). Smaller
(more negative) = stronger hit.
| organism | group | n | p50 | p90 | p95 | p99 |
|---|---|---|---|---|---|---|
sce |
matched | 835 | −155 | −75 | −59 | −33 |
sce |
novel | 9467 | −8 | −2 | −0 | 1 |
cme |
matched | 704 | −133 | −63 | −47 | −25 |
cme |
novel | 10170 | −8 | −2 | −2 | 0 |
eco |
matched | 1070 | −142 | −69 | −57 | −36 |
eco |
novel | 27357 | −7 | −2 | −1 | −0 |
mge |
matched | 110 | −100 | −42 | −35 | −15 |
mge |
novel | 1904 | −4 | −2 | −1 | −0 |
Reading: matched pairs cluster at E ≈ 1e-100…1e-155; even their weakest 1 %
sit at 1e-15…1e-36. Novel pairs cluster at ≈1e-8. The two are separated by ~20
orders of magnitude. RAVEN's 1e-50 lies inside the matched tail (between
the matched p90 and p95 for most organisms; past p90 for mge), so it discards
real-but-weakly-scoring annotations while gaining nothing against the (far weaker)
noise.
min_score_ratio_ko = 0.3, min_score_ratio_g = 0.8 fixed; gene→KO precision /
recall / F1 and reaction recovery vs the annotation.
| cutoff | gKO prec | gKO rec | gKO F1 | rxn rec | rxn novel |
|---|---|---|---|---|---|
| 1e-10 | 0.57 | 0.98 | 0.72 | 0.99 | 334 |
| 1e-20 | 0.65 | 0.98 | 0.78 | 0.97 | 283 |
| 1e-30 | 0.72 | 0.97 | 0.83 | 0.97 | 216 |
| 1e-50 | 0.78 | 0.95 | 0.86 | 0.96 | 157 |
| 1e-70 | 0.81 | 0.91 | 0.86 | 0.91 | 113 |
| 1e-100 | 0.84 | 0.76 | 0.80 | 0.79 | 68 |
| cutoff | gKO prec | gKO rec | gKO F1 | rxn rec | rxn novel |
|---|---|---|---|---|---|
| 1e-10 | 0.50 | 0.98 | 0.67 | 1.00 | 541 |
| 1e-20 | 0.57 | 0.98 | 0.72 | 1.00 | 421 |
| 1e-30 | 0.61 | 0.97 | 0.75 | 0.97 | 367 |
| 1e-50 | 0.70 | 0.93 | 0.80 | 0.94 | 307 |
| 1e-70 | 0.75 | 0.85 | 0.80 | 0.87 | 223 |
| 1e-100 | 0.80 | 0.70 | 0.75 | 0.71 | 136 |
| cutoff | gKO prec | gKO rec | gKO F1 | rxn rec | rxn novel |
|---|---|---|---|---|---|
| 1e-10 | 0.53 | 0.99 | 0.69 | 0.99 | 382 |
| 1e-20 | 0.57 | 0.99 | 0.73 | 0.99 | 300 |
| 1e-30 | 0.60 | 0.98 | 0.75 | 0.99 | 268 |
| 1e-50 | 0.67 | 0.95 | 0.78 | 0.98 | 198 |
| 1e-70 | 0.73 | 0.88 | 0.80 | 0.93 | 157 |
| 1e-100 | 0.82 | 0.74 | 0.77 | 0.80 | 96 |
| cutoff | gKO prec | gKO rec | gKO F1 | rxn rec | rxn novel |
|---|---|---|---|---|---|
| 1e-10 | 0.52 | 0.98 | 0.68 | 0.99 | 75 |
| 1e-20 | 0.62 | 0.96 | 0.75 | 0.98 | 51 |
| 1e-30 | 0.65 | 0.95 | 0.77 | 0.98 | 39 |
| 1e-50 | 0.77 | 0.84 | 0.80 | 0.87 | 29 |
| 1e-70 | 0.85 | 0.73 | 0.78 | 0.73 | 27 |
| 1e-100 | 0.87 | 0.50 | 0.64 | 0.47 | 21 |
Reading: recall is flat-and-high from 1e-10 to ~1e-30, then falls as the
cut-off eats into the matched tail — gently for model organisms, sharply for the
divergent mge (rxn recall 0.98 → 0.87 from 1e-30 → 1e-50, → 0.47 at 1e-100).
The recall lost to a stricter cut-off is not noise rejection (noise is at 1e-8);
it is real annotation. rxn_novel shrinks with stricter cut-offs because strong
un-annotated homologs are also removed.
| organism | ko ratio | g ratio | gKO prec | gKO rec | gKO F1 |
|---|---|---|---|---|---|
sce |
0.0 | 0.50 | 0.61 | 0.96 | 0.74 |
sce |
0.0 | 0.80 | 0.77 | 0.95 | 0.85 |
sce |
0.0 | 0.95 | 0.84 | 0.93 | 0.88 |
sce |
0.3 | 0.80 | 0.78 | 0.95 | 0.86 |
sce |
0.5 | 0.80 | 0.80 | 0.95 | 0.86 |
cme |
0.0 | 0.50 | 0.53 | 0.94 | 0.68 |
cme |
0.0 | 0.80 | 0.69 | 0.93 | 0.79 |
cme |
0.0 | 0.95 | 0.78 | 0.92 | 0.84 |
cme |
0.3 | 0.80 | 0.70 | 0.93 | 0.80 |
cme |
0.5 | 0.80 | 0.70 | 0.93 | 0.80 |
eco |
0.0 | 0.50 | 0.39 | 0.96 | 0.56 |
eco |
0.0 | 0.80 | 0.66 | 0.95 | 0.78 |
eco |
0.0 | 0.95 | 0.76 | 0.94 | 0.84 |
eco |
0.3 | 0.80 | 0.67 | 0.95 | 0.78 |
eco |
0.5 | 0.80 | 0.69 | 0.95 | 0.80 |
mge |
0.0 | 0.50 | 0.62 | 0.85 | 0.72 |
mge |
0.0 | 0.80 | 0.77 | 0.84 | 0.80 |
mge |
0.0 | 0.95 | 0.82 | 0.81 | 0.81 |
mge |
0.3 | 0.80 | 0.77 | 0.84 | 0.80 |
mge |
0.5 | 0.80 | 0.78 | 0.84 | 0.81 |
Reading:
min_score_ratio_kois inert — across all four organisms, varying it 0.0 → 0.3 → 0.5 changes precision/recall by ≤0.02 (mostly 0.00). It is a magic-number knob that does effectively nothing here. (Full 0.0/0.3/0.5 × g-grid in the script output; representative rows shown.)min_score_ratio_gis the real precision lever — 0.80 → 0.95 lifts precision ~0.07–0.10 for ~0.02 recall loss. 0.50 is clearly too loose.
| parameter | RAVEN / old | new default | rationale |
|---|---|---|---|
cutoff |
1e-50 | 1e-30 | recovers the matched tail (esp. divergent genomes); still ~22 orders above the 1e-8 noise floor |
min_score_ratio_g |
0.8 | 0.9 | the effective precision lever; offsets the looser cut-off |
min_score_ratio_ko |
0.3 | 0.3 (kept) | empirically inert; retained for RAVEN parity |
Old default (1e-50, 0.3, 0.8) vs new default (1e-30, 0.3, 0.9)
(min_score_ratio_ko 0.3 ≡ 0.0 here):
| organism | gKO prec | gKO rec | rxn rec | rxn novel |
|---|---|---|---|---|
sce |
0.78 → 0.76 | 0.95 → 0.96 | 0.96 → 0.96 | 157 → 137 |
cme |
0.70 → 0.67 | 0.93 → 0.96 | 0.94 → 0.97 | 307 → 305 |
eco |
0.67 → 0.67 | 0.95 → 0.97 | 0.98 → 0.99 | 198 → 173 |
mge |
0.77 → 0.69 | 0.84 → 0.94 | 0.87 → 0.97 | 29 → 35 |
The divergent minimal genome gains ~10 points of recall (the case the sequence
path exists for); model organisms improve slightly and eco emits fewer
unannotated reactions (the tighter gene-ratio prunes spurious multi-KO genes). The
small precision dip vs annotation is dominated by extra strong homologs, not
weak-hit noise.
Full reconstruction of S. cerevisiae two ways, at the old defaults:
| annotation path (3b.4) | HMM path (3b.5) | |
|---|---|---|
| reactions | 1355 | 1461 |
| metabolites | 1501 | 1567 |
| genes | 835 | 896 |
Reaction recall 96.3 % (1305/1355 shared, Jaccard 0.86); gene recall 96.6 %
(807/835 shared, Jaccard 0.87). The annotation path also exercises the new
organism_gene_ko.tsv.xz artefact (K14). hmmscan throughput ≈ 0.1 s/query
against either library on 12 threads (yeast: 6021 queries in 633 s).