Commit 4c23ed9

Merge pull request #573 from Knowledge-Graph-Hub/chore/metpo-release-integration
Integrate METPO release changes across transforms, mappings, and tooling
2 parents e5aa142 + 1390633 commit 4c23ed9

110 files changed

Lines changed: 70184 additions & 4773 deletions

.claude/skills/chemical-mapping/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@ win tie-breaks during duplicate-name merging.
3030
| `scripts/dump_unmapped_mediadive_ingredients.py` | Emits a MIM-compatible TSV of MediaDive ingredients still unmapped after the current mappings + `fuzzy_hydrate` retry, for curator review. |
3131
| `mappings/culturebotai_reviewed_ingredients.tsv` | Authoritative reviewed source from CultureBotAI (priority=10). |
3232
| `mappings/ingredient_mappings.sssom.tsv` | **Vendored copy** of the MediaIngredientMech SSSOM (priority=11). Auto-refreshed from the sibling repo on every consolidator run — never edit this file directly; edit upstream in MIM and let `sync_mim_sssom` overwrite it. |
33-
| `../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv` | **Source of truth** for MIM mappings. The MediaIngredientMech repo (https://github.com/KG-Hub/MediaIngredientMech) is expected to be checked out as a sibling of `kg-microbe`. The consolidator wins-from-sibling on content divergence. |
33+
| `../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv` | **Source of truth** for MIM mappings. The MediaIngredientMech repo (https://github.com/CultureBotAI/MediaIngredientMech) is expected to be checked out as a sibling of `kg-microbe`. The consolidator wins-from-sibling on content divergence. |
3434
| `mappings/chemical_mappings.tsv` | Legacy KEGG/BacDive primary mappings (may be absent). |
3535
| `mappings/README.md` | Schema + regeneration instructions. |
3636
| `scripts/consolidate_chemical_mappings.py` | Consolidator (run to rebuild). |
@@ -166,7 +166,7 @@ Rules:
166166
- New contributors must clone MIM as a sibling:
167167
```bash
168168
cd $(dirname $(pwd)) # parent of kg-microbe
169-
git clone https://github.com/KG-Hub/MediaIngredientMech.git
169+
git clone https://github.com/CultureBotAI/MediaIngredientMech.git
170170
```
171171

172172
### Add a new mapping source
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
---
2+
name: gtdb-phylo-diagram
3+
description: Render a GTDB phylogenetic diagram from a KG-Microbe merged release with each clade sized by the count of non-taxonomy edges (phenotypes, growth media, chemicals, etc.) incident on it. Folds NCBITaxon and kgmicrobe.strain edges onto their GTDB equivalent via in-graph close_match, GTDB metadata, and the published NCBI2GTDB tables. Persists the resolved mapping and a gap report. Use when you need to see *where in the GTDB tree the metadata is concentrated* — which clades are well-characterized vs sparse.
4+
---
5+
6+
# GTDB phylogenetic diagram
7+
8+
## Purpose
9+
10+
The KG-Microbe merged KG carries organism-level facts on three taxon identifier systems: `NCBITaxon:*` (~890K nodes), `kgmicrobe.strain:*` (~250K), and `GTDB:*` (~180K). Phenotypes, growth media, chemicals consumed, and isolation sources attach to whichever identifier the source data used — usually NCBITaxon or strain, rarely the GTDB equivalent.
11+
12+
This skill produces a phylogenetic view rooted in the GTDB taxonomy where each node is sized by the count of non-taxonomy edges incident on it (after folding NCBITaxon and strain edges onto their GTDB equivalent). It also persists the resolved NCBITaxon/strain → GTDB mapping and a small report listing gaps so the user can track curation debt.
13+
14+
It is intentionally a *tree* view, not a network view — for network views use the `scripts/examples/visualization/` patterns.
15+
16+
## What "non-taxonomy edges" means
17+
18+
Every edge in the merged KG is classified by looking at its two endpoints and predicate:
19+
20+
- If **both** endpoints are taxon-prefixed (`NCBITaxon:`, `GTDB:`, `kgmicrobe.strain:`, `kgmicrobe.genus:`) the edge is *structural* — it carries the GTDB hierarchy, GTDB↔NCBITaxon `close_match` links, or strain→species `subclass_of` links. **Excluded** from node sizing.
21+
- If the predicate is `biolink:subclass_of` (regardless of the other endpoint) the edge is *classification* — most importantly the ~732K `GenBank:<genome> --subclass_of--> GTDB:<species>` edges (one per GTDB genome). **Excluded** from node sizing, otherwise the circles would measure "how many genomes GTDB has for this clade" rather than metadata richness.
22+
- Otherwise the edge contributes +1 to the count for whichever endpoint is a taxon. These are the biologically interesting edges: `has_phenotype`, `location_of` (isolation source), growth-media (`METPO:2000517`), and the METPO trait predicates.
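
The three rules above can be sketched as a small classifier. This is a minimal illustration, not the skill's actual code: the function name `classify_edge`, the returned labels, and the `"ignored"` case for edges with no taxon endpoint are assumptions made for the example.

```python
# Prefixes listed in the rules above.
TAXON_PREFIXES = ("NCBITaxon:", "GTDB:", "kgmicrobe.strain:", "kgmicrobe.genus:")
SUBCLASS_OF = "biolink:subclass_of"


def is_taxon(curie: str) -> bool:
    """True when the CURIE uses one of the taxon prefixes."""
    return curie.startswith(TAXON_PREFIXES)


def classify_edge(subject: str, predicate: str, obj: str):
    """Return (kind, counted_taxon) for one edge (hypothetical helper).

    kind is "structural", "classification", "counted", or "ignored".
    counted_taxon is the endpoint that receives +1, or None.
    """
    s_tax, o_tax = is_taxon(subject), is_taxon(obj)
    if s_tax and o_tax:
        return ("structural", None)       # taxon-to-taxon: excluded
    if predicate == SUBCLASS_OF:
        return ("classification", None)   # e.g. GenBank genome -> GTDB species
    if s_tax:
        return ("counted", subject)
    if o_tax:
        return ("counted", obj)
    # Assumption: edges with no taxon endpoint contribute nothing.
    return ("ignored", None)
```

The rule order matters: the structural check runs first, so a `GTDB:* --subclass_of--> GTDB:*` edge is reported as structural rather than classification, although both kinds are excluded from sizing.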
23+
24+
Counts on NCBITaxon are folded onto their GTDB equivalent via the mapping described under "Mapping resolution" below; strain counts are folded onto the NCBITaxon parent and then onto GTDB. Cumulative counts then propagate up the tree.
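
As a rough sketch of the folding and propagation just described: the helper names `fold_counts` and `cumulative_counts`, the dict-based inputs, and the choice to add a node's own count to its subtree total are all assumptions for illustration, not the script's confirmed behaviour.

```python
from collections import Counter


def fold_counts(raw_counts, strain_parent, ncbi_to_gtdb):
    """Fold strain -> NCBITaxon -> GTDB; return (folded, unmapped) Counters."""
    folded, unmapped = Counter(), Counter()
    for taxon, n in raw_counts.items():
        if taxon.startswith("kgmicrobe.strain:"):
            taxon = strain_parent.get(taxon, taxon)   # strain -> NCBITaxon parent
        if taxon.startswith("NCBITaxon:"):
            taxon = ncbi_to_gtdb.get(taxon, taxon)    # NCBITaxon -> GTDB
        if taxon.startswith("GTDB:"):
            folded[taxon] += n
        else:
            unmapped[taxon] += n                      # a missing step drops the taxon
    return folded, unmapped


def cumulative_counts(children, root, own_counts):
    """Post-order sum over the tree, iterative so deep trees do not hit recursion limits."""
    totals = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, ())
        if expanded:
            # All children were totalled before this node is revisited.
            totals[node] = own_counts.get(node, 0) + sum(totals[k] for k in kids)
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return totals
```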
25+
26+
## Mapping resolution
27+
28+
NCBITaxon → GTDB is resolved by unioning three sources in priority order:
29+
30+
1. **`merged-kg-close-match`**: `biolink:close_match` edges directly in the merged KG (provenance `infores:gtdb`). Highest confidence: this is exactly what the GTDB transform wrote.
31+
2. **`gtdb-metadata`**: `bac120_metadata.tsv.gz` + `ar53_metadata.tsv.gz` map `ncbi_taxid → s__species_string`, joined to GTDB:N via the merged-KG node names. Same pattern as `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py::_load_gtdb_to_ncbi_mapping`.
32+
3. **`gtdb-published-r220`**: `data/raw/NCBI2GTDB.tsv.gz`, kept where `majority_fraction >= --min-majority-fraction` (default 0.5). Joined to GTDB:N via the species string.
33+
34+
Strains use the in-graph `kgmicrobe.strain:* --biolink:subclass_of--> NCBITaxon:*` edge to find their parent, then map the parent through the table above. Source label is `strain-via-parent:<inner-source>`.
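
One way to read "unioning in priority order" is first-source-wins: a lower-priority source only fills NCBITaxon IDs that no higher-priority source resolved. The sketch below assumes that reading. The source labels come from the list above; the function names and the dict-of-dicts input shape are hypothetical.

```python
# Source labels in descending priority, as listed above.
SOURCE_PRIORITY = ("merged-kg-close-match", "gtdb-metadata", "gtdb-published-r220")


def resolve_ncbi_to_gtdb(candidates):
    """Map NCBITaxon CURIE -> (GTDB CURIE, source label).

    candidates: {source_label: {ncbi_curie: gtdb_curie}} (assumed shape).
    """
    resolved = {}
    for source in SOURCE_PRIORITY:
        for ncbi, gtdb in candidates.get(source, {}).items():
            # setdefault keeps the entry from the highest-priority source.
            resolved.setdefault(ncbi, (gtdb, source))
    return resolved


def resolve_strains(strain_parent, resolved):
    """Map strains through their NCBITaxon parent, tagging provenance."""
    out = {}
    for strain, parent in strain_parent.items():
        hit = resolved.get(parent)
        if hit is not None:  # strains whose parent is unmapped are dropped
            gtdb, inner = hit
            out[strain] = (gtdb, f"strain-via-parent:{inner}")
    return out
```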
35+
36+
## Usage
37+
38+
```bash
39+
# Default: scan data/merged/20260523_nometatraits, write to data/processed/gtdb_phylo_diagram_<release>/
40+
poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py
41+
42+
# Pin a different release
43+
poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py \
44+
--merged-dir data/merged/20260523 \
45+
--out-dir data/processed/gtdb_phylo_diagram_20260523
46+
47+
# Data only, no figures (fast iteration on the mapping logic)
48+
poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --skip-render
49+
50+
# Skip the toytree-based full-species and interactive renders
51+
poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --skip-full --skip-interactive
52+
53+
# Tighten the published-mapping confidence floor
54+
poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --min-majority-fraction 0.8
55+
```
56+
57+
### Flags
58+
59+
| Flag | Default | Purpose |
60+
|---|---|---|
61+
| `--merged-dir PATH` | `data/merged/20260523_nometatraits` | Dir containing `merged-kg_nodes.tsv` + `merged-kg_edges.tsv` |
62+
| `--gtdb-raw-dir PATH` | `data/raw/gtdb` | Dir containing `bac120_metadata.tsv.gz` + `ar53_metadata.tsv.gz` |
63+
| `--gtdb-published-map PATH` | `data/raw/NCBI2GTDB.tsv.gz` | Published NCBI→GTDB mapping table |
64+
| `--out-dir PATH` | `data/processed/gtdb_phylo_diagram_<release>` | Output directory |
65+
| `--collapse-ranks STR` | `phylum,class,family` | Ranks to render rank-collapsed figures at |
66+
| `--min-majority-fraction FLOAT` | `0.5` | Floor for accepting published mappings |
67+
| `--skip-full` | false | Skip full-species toytree static figure |
68+
| `--skip-interactive` | false | Skip interactive toytree HTML |
69+
| `--skip-render` | false | Skip all rendering (TSV + Newick + iTOL only) |
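
For orientation, the flag table could map onto an `argparse` parser roughly as follows. The flag names and defaults are taken from the table; everything else is an assumption, in particular deriving `<release>` from the last path component of `--merged-dir` and the helper names `build_parser` and `parse_args`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="GTDB phylogenetic diagram (sketch)")
    p.add_argument("--merged-dir", type=Path,
                   default=Path("data/merged/20260523_nometatraits"))
    p.add_argument("--gtdb-raw-dir", type=Path, default=Path("data/raw/gtdb"))
    p.add_argument("--gtdb-published-map", type=Path,
                   default=Path("data/raw/NCBI2GTDB.tsv.gz"))
    p.add_argument("--out-dir", type=Path, default=None)
    p.add_argument("--collapse-ranks", default="phylum,class,family")
    p.add_argument("--min-majority-fraction", type=float, default=0.5)
    p.add_argument("--skip-full", action="store_true")
    p.add_argument("--skip-interactive", action="store_true")
    p.add_argument("--skip-render", action="store_true")
    return p


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.out_dir is None:
        # Assumption: <release> is the merged directory's name.
        release = args.merged_dir.name
        args.out_dir = Path("data/processed") / f"gtdb_phylo_diagram_{release}"
    # Turn the comma-separated rank string into a clean list.
    args.collapse_ranks = [r.strip() for r in args.collapse_ranks.split(",") if r.strip()]
    return args
```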
70+
71+
## Output structure
72+
73+
```
74+
data/processed/gtdb_phylo_diagram_<release>/
75+
├── gtdb_tree.nwk # full species-level Newick
76+
├── itol_node_sizes.txt # iTOL DATASET_SIMPLEBAR annotation
77+
├── gtdb_tree_phylum.{png,svg} # ~150 nodes, Bio.Phylo + matplotlib
78+
├── gtdb_tree_class.{png,svg} # ~500 nodes
79+
├── gtdb_tree_family.{png,svg} # ~5,000 nodes, large canvas
80+
├── gtdb_tree_full.{png,svg} # full species, toytree circular
81+
├── gtdb_tree_interactive.html # interactive toytree HTML
82+
├── ncbi_strain_to_gtdb.tsv # resolved mapping with provenance
83+
├── per_node_edge_counts.tsv # one row per GTDB clade
84+
└── report.md # gaps, statistics, top clades
85+
```
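
To give a feel for `itol_node_sizes.txt`, here is a minimal sketch of an iTOL `DATASET_SIMPLEBAR` annotation built from per-node counts. The header keywords follow iTOL's dataset template format; the function name, the label and colour values, and replacing the CURIE colon with an underscore are assumptions (node IDs only need to match the labels used in the Newick file).

```python
def format_itol_simplebar(counts, label="non-taxonomy edges", color="#1f77b4"):
    """Return the text of an iTOL simple-bar dataset for {node_id: count}."""
    header = [
        "DATASET_SIMPLEBAR",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        f"COLOR\t{color}",
        "DATA",
    ]
    # Assumption: colons are replaced so IDs match colon-free Newick labels.
    rows = [
        f"{node_id.replace(':', '_')}\t{value}"
        for node_id, value in sorted(counts.items())
    ]
    return "\n".join(header + rows) + "\n"
```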
86+
87+
## When to invoke
88+
89+
- Diagnosing whether a curation push improved coverage of a target clade (re-run, diff `per_node_edge_counts.tsv`).
90+
- Producing release-companion figures showing where the merged KG concentrates evidence.
91+
- Surfacing NCBITaxon nodes that carry trait edges but don't map to any GTDB species (the "unmapped predicate fingerprint" in the report).
92+
- Producing the Newick + iTOL annotations for an external viewer (iTOL, FigTree, dendroscope).
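
For the first use case, comparing `per_node_edge_counts.tsv` between two runs can be as simple as the sketch below. The column names `gtdb_id` and `cumulative_count` are assumptions about the file's header, and the inline sample data is illustrative only; adjust both to the real output.

```python
import csv
import io


def load_counts(handle, id_col="gtdb_id", count_col="cumulative_count"):
    """Read {clade_id: count} from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: int(row[count_col]) for row in reader}


def diff_counts(before, after):
    """Return (clade, old, new, delta) for changed clades, largest change first."""
    changes = []
    for clade in sorted(set(before) | set(after)):
        old, new = before.get(clade, 0), after.get(clade, 0)
        if new != old:
            changes.append((clade, old, new, new - old))
    changes.sort(key=lambda row: -abs(row[3]))  # stable: ties keep clade order
    return changes


# Illustrative data standing in for two runs' output files.
before = load_counts(io.StringIO("gtdb_id\tcumulative_count\nGTDB:1\t10\nGTDB:2\t5\n"))
after = load_counts(io.StringIO("gtdb_id\tcumulative_count\nGTDB:1\t14\nGTDB:2\t5\nGTDB:3\t2\n"))
for clade, old, new, delta in diff_counts(before, after):
    print(f"{clade}\t{old}\t{new}\t{delta:+d}")
```

With real files, pass handles opened with `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` stand-ins.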
93+
94+
## Implementation notes
95+
96+
- **One streaming pass over the 800+ MB edges TSV.** No DataFrame load. Memory ceiling is ~1.3M ints for the per-taxon counter plus ~180K Clade objects.
97+
- **The tree is taken from the merged KG itself**, not parsed from `bac120_taxonomy.tsv` — the merged KG already encodes the hierarchy via `GTDB:* --biolink:subclass_of--> GTDB:*` and that's what the GTDB integer IDs in the KG refer to.
98+
- **Synthetic root.** GTDB has two domain roots (`GTDB:639 d__Archaea`, `GTDB:640 d__Bacteria`); the script wraps both under a synthetic `GTDB:root`.
99+
- **`cumulative_count` is a post-order sum** of `leaf_count` over the clade's subtree, so a phylum's count reflects all its descendants' folded edges, not the phylum node itself.
100+
- **Toytree is optional.** If it isn't installed, the full-species static and the interactive HTML are skipped with a note in `report.md`. The Bio.Phylo rank-collapsed figures still render. To install: `poetry add toytree` (see also `pyproject.toml`).
101+
- **Toytree circular layout gotchas (already handled in code):** the layout knob is `layout="c"`, *not* `tree_style="c"`; the circular layout additionally requires `edge_type="c"`; the tree must carry non-zero branch lengths (the skill sets unit lengths) or every node collapses onto one point; and CURIE colons must be stripped from labels (the skill sanitizes them) or toytree's Newick parser fails. The toyplot canvas background is set to white explicitly.
102+
- **Rendering the large family tree.** The family-rank figure has ~5K leaves and renders as a roughly A0-sized canvas at 200 dpi. Expect a multi-MB PNG and a several-hundred-MB intermediate matplotlib figure during rendering.
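
The single streaming pass mentioned in the first note can be sketched as below. It assumes the edges TSV has a header row with KGX-style `subject`, `predicate`, and `object` columns; the function name is hypothetical, and the skip rules restate the "non-taxonomy edges" definition rather than copy the script.

```python
import csv
from collections import Counter

TAXON_PREFIXES = ("NCBITaxon:", "GTDB:", "kgmicrobe.strain:", "kgmicrobe.genus:")


def stream_taxon_counts(handle):
    """Count non-taxonomy edges per taxon, reading one row at a time."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Assumption: KGX-style column names.
    s_col, p_col, o_col = (header.index(c) for c in ("subject", "predicate", "object"))
    counts = Counter()
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        subj, pred, obj = row[s_col], row[p_col], row[o_col]
        s_tax = subj.startswith(TAXON_PREFIXES)
        o_tax = obj.startswith(TAXON_PREFIXES)
        if (s_tax and o_tax) or pred == "biolink:subclass_of":
            continue  # structural or classification edge: excluded
        if s_tax:
            counts[subj] += 1
        elif o_tax:
            counts[obj] += 1
    return counts
```

Only the `Counter` grows with the data, which is what keeps memory bounded by the number of distinct taxa rather than the number of edges. A call would look like `stream_taxon_counts(open(edges_path, newline="", encoding="utf-8"))`, ideally inside a `with` block.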
103+
104+
## Upstream transform issue (flagged in every report)
105+
106+
The GTDB transform currently mints `GTDB:N` integer CURIEs via a monotonic counter in `kg_microbe/transform_utils/gtdb/gtdb.py::_get_or_create_taxon_id` (line ~269). These IDs are **not resolvable** against the public GTDB and **not stable across builds** (different dict-iteration order yields different `N`). The right fix is to mint `GTDB:<slugified-taxon-string>` (e.g. `GTDB:s__Escherichia_coli`) and drop the counter; this skill works around the issue by joining via the node `name` column (the canonical taxon string), so it will keep working unchanged after the transform is fixed.
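
To illustrate why a slug-based ID is stable where a counter is not, here is a minimal sketch. The slug rule (collapse whitespace to underscores) and the function name `gtdb_curie` are assumptions chosen to reproduce the `GTDB:s__Escherichia_coli` example; the actual fix in the transform may normalise differently.

```python
import re


def gtdb_curie(taxon_string: str) -> str:
    """Derive a deterministic CURIE from a GTDB taxon string (hypothetical rule)."""
    slug = re.sub(r"\s+", "_", taxon_string.strip())
    return f"GTDB:{slug}"


# The ID depends only on the taxon string, so input order does not matter.
names = ["s__Escherichia coli", "g__Escherichia", "d__Bacteria"]
forward = {name: gtdb_curie(name) for name in names}
backward = {name: gtdb_curie(name) for name in reversed(names)}
```

Because each ID is a pure function of the name, `forward` and `backward` are identical mappings, whereas a monotonic counter would assign different integers to the same taxa in the two orders.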
107+
108+
## Known limitations
109+
110+
- The full-species figure compresses ~143K leaves into a single canvas; legible exploration belongs in iTOL (using `itol_node_sizes.txt`) or the interactive HTML.
111+
- NCBITaxon nodes with no published or in-graph mapping to a GTDB species (common for higher-rank taxa and `environmental sample` entries) land in the unmapped bucket; the "Unmapped NCBITaxon predicate fingerprint" section of `report.md` shows which edges are lost as a result.
112+
- Strain mapping requires both the strain→NCBITaxon parent edge AND an NCBITaxon→GTDB mapping; missing either step drops the strain.
113+
114+
## See also
115+
116+
- `kg-release-diff`: semantic diff between two merged releases (complements this view)
117+
- `kg-postprocess-report`: pipeline status; this skill is one of its consumers
118+
- `scripts/examples/visualization/`: network-based subgraph visualisations (different shape of output)
119+
- `kg_microbe/transform_utils/gtdb/gtdb.py`: defines how the synthetic `GTDB:N` integer IDs are minted
120+
- `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py`: the original GTDB↔NCBI mapping helper whose pattern this skill reuses
