| 1 | +--- |
| 2 | +name: gtdb-phylo-diagram |
| 3 | +description: Render a GTDB phylogenetic diagram from a KG-Microbe merged release with each clade sized by the count of non-taxonomy edges (phenotypes, growth media, chemicals, etc.) incident on it. Folds NCBITaxon and kgmicrobe.strain edges onto their GTDB equivalent via in-graph close_match, GTDB metadata, and the published NCBI2GTDB tables. Persists the resolved mapping and a gap report. Use when you need to see *where in the GTDB tree the metadata is concentrated* — which clades are well-characterized vs sparse. |
| 4 | +--- |
| 5 | + |
| 6 | +# GTDB phylogenetic diagram |
| 7 | + |
| 8 | +## Purpose |
| 9 | + |
| 10 | +The KG-Microbe merged KG carries organism-level facts on three taxon identifier systems: `NCBITaxon:*` (~890K nodes), `kgmicrobe.strain:*` (~250K), and `GTDB:*` (~180K). Phenotypes, growth media, chemicals consumed, and isolation sources attach to whichever identifier the source data used — usually NCBITaxon or strain, rarely the GTDB equivalent. |
| 11 | + |
| 12 | +This skill produces a phylogenetic view rooted in the GTDB taxonomy where each node is sized by the count of non-taxonomy edges incident on it (after folding NCBITaxon and strain edges onto their GTDB equivalent). It also persists the resolved NCBITaxon/strain → GTDB mapping and a small report listing gaps so the user can track curation debt. |
| 13 | + |
| 14 | +It is intentionally a *tree* view, not a network view — for network views use the `scripts/examples/visualization/` patterns. |
| 15 | + |
| 16 | +## What "non-taxonomy edges" means
| 17 | + |
| 18 | +Every edge in the merged KG is classified by looking at its two endpoints and predicate: |
| 19 | + |
| 20 | +- If **both** endpoints are taxon-prefixed (`NCBITaxon:`, `GTDB:`, `kgmicrobe.strain:`, `kgmicrobe.genus:`) the edge is *structural* — it carries the GTDB hierarchy, GTDB↔NCBITaxon `close_match` links, or strain→species `subclass_of` links. **Excluded** from node sizing. |
| 21 | +- If the predicate is `biolink:subclass_of` (regardless of the other endpoint) the edge is *classification* — most importantly the ~732K `GenBank:<genome> --subclass_of--> GTDB:<species>` edges (one per GTDB genome). **Excluded** from node sizing, otherwise the circles would measure "how many genomes GTDB has for this clade" rather than metadata richness. |
| 22 | +- Otherwise the edge contributes +1 to the count for whichever endpoint is a taxon. These are the biologically interesting edges: `has_phenotype`, `location_of` (isolation source), growth-media (`METPO:2000517`), and the METPO trait predicates. |
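
The three rules above can be sketched as a single streaming counter. This is a minimal illustration, not the skill's actual code: the function names and the `(subject, predicate, object)` tuple layout are assumptions, and the real edges TSV may order its columns differently.

```python
from collections import Counter
from typing import Iterable, Tuple

# Prefixes listed in the first rule above.
TAXON_PREFIXES = ("NCBITaxon:", "GTDB:", "kgmicrobe.strain:", "kgmicrobe.genus:")


def is_taxon(curie: str) -> bool:
    """True when the CURIE uses one of the taxon prefixes."""
    return curie.startswith(TAXON_PREFIXES)


def count_non_taxonomy_edges(edges: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count non-taxonomy edges per taxon endpoint.

    `edges` yields (subject, predicate, object) tuples (hypothetical layout),
    so it can be fed row by row from a csv.reader without loading the file.
    """
    counts: Counter = Counter()
    for subject, predicate, obj in edges:
        subject_is_taxon = is_taxon(subject)
        object_is_taxon = is_taxon(obj)
        if subject_is_taxon and object_is_taxon:
            continue  # structural edge: excluded from node sizing
        if predicate == "biolink:subclass_of":
            continue  # classification edge: excluded from node sizing
        if subject_is_taxon:
            counts[subject] += 1
        elif object_is_taxon:
            counts[obj] += 1
    return counts


# Tiny worked example with placeholder IDs.
example_edges = [
    ("GTDB:5", "biolink:subclass_of", "GTDB:2"),          # structural
    ("GenBank:GCA_1", "biolink:subclass_of", "GTDB:5"),   # classification
    ("NCBITaxon:562", "biolink:has_phenotype", "METPO:1"),
    ("ENVO:1", "biolink:location_of", "NCBITaxon:562"),
    ("CHEBI:1", "biolink:related_to", "CHEBI:2"),         # no taxon endpoint
]
example_counts = count_non_taxonomy_edges(example_edges)
```

Because the function only keeps a `Counter` keyed by taxon CURIE, memory grows with the number of taxa rather than the number of edges.
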
| 23 | + |
| 24 | +Counts on NCBITaxon nodes are folded onto their GTDB equivalent via the resolved mapping (see "Mapping resolution"); strain counts are folded onto the NCBITaxon parent and then onto GTDB. Cumulative counts then propagate up the tree.
| 25 | + |
| 26 | +## Mapping resolution |
| 27 | + |
| 28 | +NCBITaxon → GTDB is resolved by unioning three sources in priority order: |
| 29 | + |
| 30 | +1. **`merged-kg-close-match`** — `biolink:close_match` edges directly in the merged KG (provenance `infores:gtdb`). Highest confidence: this is exactly what the GTDB transform wrote. |
| 31 | +2. **`gtdb-metadata`** — `bac120_metadata.tsv.gz` + `ar53_metadata.tsv.gz` map `ncbi_taxid → s__species_string`, joined to GTDB:N via the merged-KG node names. Same pattern as `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py::_load_gtdb_to_ncbi_mapping`. |
| 32 | +3. **`gtdb-published-r220`** — `data/raw/NCBI2GTDB.tsv.gz`, kept where `majority_fraction >= --min-majority-fraction` (default 0.5). Joined to GTDB:N via the species string. |
| 33 | + |
| 34 | +Strains use the in-graph `kgmicrobe.strain:* --biolink:subclass_of--> NCBITaxon:*` edge to find their parent, then map the parent through the table above. Source label is `strain-via-parent:<inner-source>`. |
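
One way to picture the priority union and the strain fallback is the sketch below. It assumes that when two sources disagree on an NCBITaxon ID the higher-priority source wins, and it uses hypothetical in-memory dicts and tuple rows in place of the real tables; the helper names are illustrative only.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Mapping = Dict[str, Tuple[str, str]]  # taxon CURIE -> (GTDB CURIE, source label)


def filter_published(
    rows: Iterable[Tuple[str, str, float]], min_majority_fraction: float = 0.5
) -> Dict[str, str]:
    """Keep published rows (ncbi_id, gtdb_id, majority_fraction) at or above the floor."""
    return {ncbi: gtdb for ncbi, gtdb, frac in rows if frac >= min_majority_fraction}


def resolve_mapping(sources: List[Tuple[str, Dict[str, str]]]) -> Mapping:
    """Union the sources; `sources` is ordered highest priority first."""
    resolved: Mapping = {}
    for label, table in sources:
        for ncbi_id, gtdb_id in table.items():
            # setdefault keeps the first (highest-priority) hit for each ID.
            resolved.setdefault(ncbi_id, (gtdb_id, label))
    return resolved


def resolve_strains(strain_parent: Dict[str, str], resolved: Mapping) -> Mapping:
    """Map strains through their NCBITaxon parent; unmapped parents drop the strain."""
    out: Mapping = {}
    for strain, parent in strain_parent.items():
        hit: Optional[Tuple[str, str]] = resolved.get(parent)
        if hit is None:
            continue
        gtdb_id, inner_source = hit
        out[strain] = (gtdb_id, f"strain-via-parent:{inner_source}")
    return out


# Placeholder data, not real mappings.
published = filter_published(
    [("NCBITaxon:1", "GTDB:99", 0.9), ("NCBITaxon:3", "GTDB:30", 0.4)]
)
mapping = resolve_mapping(
    [
        ("merged-kg-close-match", {"NCBITaxon:1": "GTDB:10"}),
        ("gtdb-metadata", {"NCBITaxon:2": "GTDB:20"}),
        ("gtdb-published-r220", published),
    ]
)
strains = resolve_strains(
    {"kgmicrobe.strain:a": "NCBITaxon:2", "kgmicrobe.strain:b": "NCBITaxon:3"},
    mapping,
)
```

In this example `NCBITaxon:1` keeps the in-graph `close_match` target even though the published table offers a different one, and `kgmicrobe.strain:b` is dropped because its parent fell below the majority-fraction floor.
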
| 35 | + |
| 36 | +## Usage |
| 37 | + |
| 38 | +```bash |
| 39 | +# Default: scan data/merged/20260523_nometatraits, write to data/processed/gtdb_phylo_diagram_<release>/ |
| 40 | +poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py |
| 41 | + |
| 42 | +# Pin a different release |
| 43 | +poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py \ |
| 44 | + --merged-dir data/merged/20260523 \ |
| 45 | + --out-dir data/processed/gtdb_phylo_diagram_20260523 |
| 46 | + |
| 47 | +# Data only, no figures (fast iteration on the mapping logic) |
| 48 | +poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --skip-render |
| 49 | + |
| 50 | +# Skip the toytree-based full-species and interactive renders |
| 51 | +poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --skip-full --skip-interactive |
| 52 | + |
| 53 | +# Tighten the published-mapping confidence floor |
| 54 | +poetry run python .claude/skills/gtdb-phylo-diagram/gtdb_phylo_diagram.py --min-majority-fraction 0.8 |
| 55 | +``` |
| 56 | + |
| 57 | +### Flags |
| 58 | + |
| 59 | +| Flag | Default | Purpose | |
| 60 | +|---|---|---| |
| 61 | +| `--merged-dir PATH` | `data/merged/20260523_nometatraits` | Dir containing `merged-kg_nodes.tsv` + `merged-kg_edges.tsv` | |
| 62 | +| `--gtdb-raw-dir PATH` | `data/raw/gtdb` | Dir containing `bac120_metadata.tsv.gz` + `ar53_metadata.tsv.gz` | |
| 63 | +| `--gtdb-published-map PATH` | `data/raw/NCBI2GTDB.tsv.gz` | Published NCBI→GTDB mapping table | |
| 64 | +| `--out-dir PATH` | `data/processed/gtdb_phylo_diagram_<release>` | Output directory | |
| 65 | +| `--collapse-ranks STR` | `phylum,class,family` | Ranks to render rank-collapsed figures at | |
| 66 | +| `--min-majority-fraction FLOAT` | `0.5` | Floor for accepting published mappings | |
| 67 | +| `--skip-full` | false | Skip full-species toytree static figure | |
| 68 | +| `--skip-interactive` | false | Skip interactive toytree HTML | |
| 69 | +| `--skip-render` | false | Skip all rendering (TSV + Newick + iTOL only) | |
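
For orientation, the flag table maps onto `argparse` roughly as follows. This is a sketch under assumptions rather than the script's real parser: in particular, deriving `<release>` from the final component of `--merged-dir` is a guess based on the usage examples, and `parse_args` is a hypothetical helper.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the table with their documented defaults."""
    parser = argparse.ArgumentParser(prog="gtdb_phylo_diagram.py")
    parser.add_argument("--merged-dir", type=Path,
                        default=Path("data/merged/20260523_nometatraits"))
    parser.add_argument("--gtdb-raw-dir", type=Path, default=Path("data/raw/gtdb"))
    parser.add_argument("--gtdb-published-map", type=Path,
                        default=Path("data/raw/NCBI2GTDB.tsv.gz"))
    parser.add_argument("--out-dir", type=Path, default=None)
    parser.add_argument("--collapse-ranks", default="phylum,class,family")
    parser.add_argument("--min-majority-fraction", type=float, default=0.5)
    parser.add_argument("--skip-full", action="store_true")
    parser.add_argument("--skip-interactive", action="store_true")
    parser.add_argument("--skip-render", action="store_true")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse flags and fill in values that depend on other flags."""
    args = build_parser().parse_args(argv)
    if args.out_dir is None:
        # Assumption: <release> is the last path component of --merged-dir.
        release = args.merged_dir.name
        args.out_dir = Path("data/processed") / f"gtdb_phylo_diagram_{release}"
    args.collapse_ranks = [r.strip() for r in args.collapse_ranks.split(",") if r.strip()]
    return args


example_args = parse_args(
    ["--merged-dir", "data/merged/20260523", "--min-majority-fraction", "0.8", "--skip-render"]
)
```
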
| 70 | + |
| 71 | +## Output structure |
| 72 | + |
| 73 | +``` |
| 74 | +data/processed/gtdb_phylo_diagram_<release>/ |
| 75 | +├── gtdb_tree.nwk # full species-level Newick |
| 76 | +├── itol_node_sizes.txt # iTOL DATASET_SIMPLEBAR annotation |
| 77 | +├── gtdb_tree_phylum.{png,svg} # ~150 nodes, Bio.Phylo + matplotlib |
| 78 | +├── gtdb_tree_class.{png,svg} # ~500 nodes |
| 79 | +├── gtdb_tree_family.{png,svg} # ~5,000 nodes, large canvas |
| 80 | +├── gtdb_tree_full.{png,svg} # full species, toytree circular |
| 81 | +├── gtdb_tree_interactive.html # interactive toytree HTML |
| 82 | +├── ncbi_strain_to_gtdb.tsv # resolved mapping with provenance |
| 83 | +├── per_node_edge_counts.tsv # one row per GTDB clade |
| 84 | +└── report.md # gaps, statistics, top clades |
| 85 | +``` |
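
As a rough guide to what an iTOL `DATASET_SIMPLEBAR` annotation looks like, the sketch below writes a minimal tab-separated file in that format. The function name, dataset label, and colour are placeholders, and the skill's real `itol_node_sizes.txt` may carry additional fields.

```python
import io
from typing import Mapping, TextIO


def write_itol_simplebar(
    counts: Mapping[str, int],
    out: TextIO,
    label: str = "non-taxonomy edges",  # placeholder dataset label
    color: str = "#1f77b4",             # placeholder bar colour
) -> None:
    """Write a minimal iTOL simple-bar dataset: header block, then one row per node."""
    out.write("DATASET_SIMPLEBAR\n")
    out.write("SEPARATOR TAB\n")
    out.write(f"DATASET_LABEL\t{label}\n")
    out.write(f"COLOR\t{color}\n")
    out.write("DATA\n")
    for node_id, value in sorted(counts.items()):
        out.write(f"{node_id}\t{value}\n")


# Node IDs must match the labels used in the Newick file (placeholders here).
buffer = io.StringIO()
write_itol_simplebar({"GTDB_2": 7, "GTDB_1": 3}, buffer)
itol_text = buffer.getvalue()
```
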
| 86 | + |
| 87 | +## When to invoke |
| 88 | + |
| 89 | +- Diagnosing whether a curation push improved coverage of a target clade (re-run, diff `per_node_edge_counts.tsv`). |
| 90 | +- Producing release-companion figures showing where the merged KG concentrates evidence. |
| 91 | +- Surfacing NCBITaxon nodes that carry trait edges but don't map to any GTDB species (the "Unmapped NCBITaxon predicate fingerprint" section of `report.md`).
| 92 | +- Producing the Newick + iTOL annotations for an external viewer (iTOL, FigTree, Dendroscope).
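
For the re-run-and-diff use case, a comparison of two `per_node_edge_counts.tsv` files can be as small as the sketch below. The column names `gtdb_id` and `cumulative_count` are assumptions (check the header of the real file), and the helpers are hypothetical rather than part of the skill.

```python
import csv
import io
from typing import Dict, TextIO


def load_counts(
    handle: TextIO, id_col: str = "gtdb_id", count_col: str = "cumulative_count"
) -> Dict[str, int]:
    """Read one TSV into {node id: count}; column names are assumed, not confirmed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: int(row[count_col]) for row in reader}


def diff_counts(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return after - before for every node whose count changed."""
    deltas: Dict[str, int] = {}
    for node in before.keys() | after.keys():
        delta = after.get(node, 0) - before.get(node, 0)
        if delta != 0:
            deltas[node] = delta
    return deltas


# In-memory stand-ins for two releases' TSV files.
old_tsv = io.StringIO("gtdb_id\tcumulative_count\nGTDB:1\t10\nGTDB:2\t5\n")
new_tsv = io.StringIO("gtdb_id\tcumulative_count\nGTDB:1\t14\nGTDB:2\t5\nGTDB:3\t2\n")
deltas = diff_counts(load_counts(old_tsv), load_counts(new_tsv))
```

Because `GTDB:N` integers are not stable across builds, a diff between releases is safer when keyed on the taxon name column than on the integer ID.
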
| 93 | + |
| 94 | +## Implementation notes |
| 95 | + |
| 96 | +- **One streaming pass over the 800+ MB edges TSV.** No DataFrame load. Memory ceiling is ~1.3M ints for the per-taxon counter plus ~180K Clade objects. |
| 97 | +- **The tree is taken from the merged KG itself**, not parsed from `bac120_taxonomy.tsv` — the merged KG already encodes the hierarchy via `GTDB:* --biolink:subclass_of--> GTDB:*` and that's what the GTDB integer IDs in the KG refer to. |
| 98 | +- **Synthetic root.** GTDB has two domain roots (`GTDB:639 d__Archaea`, `GTDB:640 d__Bacteria`); the script wraps both under a synthetic `GTDB:root`. |
| 99 | +- **`cumulative_count` is the post-order sum** of `leaf_count` over the clade's subtree, so a phylum's count reflects all its descendants' folded edges, not the phylum node itself.
| 100 | +- **Toytree is optional.** If it isn't installed, the full-species static and the interactive HTML are skipped with a note in `report.md`. The Bio.Phylo rank-collapsed figures still render. To install: `poetry add toytree` (see also `pyproject.toml`). |
| 101 | +- **Toytree circular layout gotchas (already handled in code):** the layout knob is `layout="c"`, *not* `tree_style="c"`; the circular layout additionally requires `edge_type="c"`; the tree must carry non-zero branch lengths (the skill sets unit lengths) or every node collapses onto one point; and CURIE colons must be stripped from labels (the skill sanitizes them) or toytree's Newick parser fails. The toyplot canvas background is set to white explicitly. |
| 102 | +- **Rendering the large family tree.** The family-rank figure has ~5K leaves and renders as a ~A0-sized canvas at 200 dpi. Expect a multi-MB PNG and a several-hundred-MB intermediate matplotlib figure during rendering.
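
Several of the notes above (synthetic root, post-order `cumulative_count`, unit branch lengths, colon-free labels) fit into one small sketch. It is illustrative only: the helper names are hypothetical, replacing `:` with `_` is just one possible sanitization, and including a clade's own `leaf_count` in its cumulative total is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

ROOT = "GTDB:root"  # synthetic root wrapping the two domain roots


def build_children(parent_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert child->parent edges and hang parentless nodes under the synthetic root."""
    children: Dict[str, List[str]] = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    all_nodes = set(parent_of) | set(parent_of.values())
    for node in sorted(all_nodes):
        if node not in parent_of:
            children[ROOT].append(node)
    return children


def cumulative_counts(
    children: Dict[str, List[str]], leaf_count: Dict[str, int]
) -> Dict[str, int]:
    """Post-order sum of leaf_count over each clade's subtree."""
    totals: Dict[str, int] = {}

    def visit(node: str) -> int:
        total = leaf_count.get(node, 0)
        for child in children.get(node, []):
            total += visit(child)
        totals[node] = total
        return total

    visit(ROOT)  # the GTDB hierarchy is only a few ranks deep, so recursion is safe
    return totals


def to_newick(children: Dict[str, List[str]], node: str = ROOT) -> str:
    """Serialize with unit branch lengths and colon-free labels."""
    name = node.replace(":", "_")
    kids = sorted(children.get(node, []))
    if not kids:
        return f"{name}:1"
    inner = ",".join(to_newick(children, kid) for kid in kids)
    return f"({inner}){name}:1"


# Placeholder hierarchy: three leaves under the two domain roots.
parent_of = {"GTDB:1": "GTDB:640", "GTDB:2": "GTDB:640", "GTDB:3": "GTDB:639"}
leaf_count = {"GTDB:1": 3, "GTDB:2": 4, "GTDB:3": 5, "GTDB:640": 1}
tree = build_children(parent_of)
totals = cumulative_counts(tree, leaf_count)
newick = to_newick(tree) + ";"
```
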
| 103 | + |
| 104 | +## Upstream transform issue (flagged in every report) |
| 105 | + |
| 106 | +The GTDB transform currently mints `GTDB:N` integer CURIEs via a monotonic counter in `kg_microbe/transform_utils/gtdb/gtdb.py::_get_or_create_taxon_id` (line ~269). These IDs are **not resolvable** against the public GTDB and **not stable across builds** (different dict-iteration order yields different `N`). The right fix is to mint `GTDB:<slugified-taxon-string>` (e.g. `GTDB:s__Escherichia_coli`) and drop the counter; this skill works around the issue by joining via the node `name` column (the canonical taxon string), so it will keep working unchanged after the transform is fixed. |
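
A deterministic minting function along the lines proposed above might look like the following. It is a sketch of the suggested fix, not code from the transform: the function name and the exact slug rules are assumptions.

```python
import re


def mint_gtdb_curie(taxon_string: str) -> str:
    """Derive a stable CURIE from the canonical GTDB taxon string (hypothetical rules)."""
    slug = re.sub(r"\s+", "_", taxon_string.strip())   # spaces -> underscores
    slug = re.sub(r"[^A-Za-z0-9_.\-]", "_", slug)      # anything unusual -> underscore
    return f"GTDB:{slug}"


example_curie = mint_gtdb_curie("s__Escherichia coli")
```

Since the ID depends only on the taxon string, two builds that see the same taxon produce the same CURIE regardless of iteration order.
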
| 107 | + |
| 108 | +## Known limitations |
| 109 | + |
| 110 | +- The full-species figure compresses ~143K leaves into a single canvas; legible exploration belongs in iTOL (using `itol_node_sizes.txt`) or the interactive HTML. |
| 111 | +- NCBITaxon nodes with no published or in-graph mapping to a GTDB species (common for higher-rank taxa and `environmental sample` entries) land in the unmapped bucket. The "Unmapped NCBITaxon predicate fingerprint" section of `report.md` shows which edges are lost this way.
| 112 | +- Strain mapping requires both the strain→NCBITaxon parent edge AND an NCBITaxon→GTDB mapping; missing either step drops the strain. |
| 113 | + |
| 114 | +## See also |
| 115 | + |
| 116 | +- `kg-release-diff` — semantic diff between two merged releases (complements this view) |
| 117 | +- `kg-postprocess-report` — pipeline status; this skill is one of its consumers |
| 118 | +- `scripts/examples/visualization/` — network-based subgraph visualisations (different shape of output) |
| 119 | +- `kg_microbe/transform_utils/gtdb/gtdb.py` — defines how the synthetic `GTDB:N` integer IDs are minted |
| 120 | +- `kg_microbe/transform_utils/metatraits_gtdb/metatraits_gtdb.py` — the original GTDB↔NCBI mapping helper whose pattern this skill reuses