Extend stub-import transform to cover BTO #570
Merged
The 2026-05-18 MIM SSSOM republish (PR #564) added a `MIM:Cell_Lysate → BTO:0004304 cell lysate` row. Combined with the pre-existing `Wound-fluid → BTO:0003114 wound fluid` row in `mappings/isolation_source_to_ontology.tsv`, the BTO footprint in kg-microbe is now 2 IDs, past the threshold where the original ontologies-stubs design (PR #565) opted to leave BTO on the label-only inline path.

Promote BTO to the same SemSQL-backed enriched-stub treatment as NCIT and mesh, so the merged KG carries full label + synonyms + xrefs on both BTO nodes.

Changes:

- kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py: add `BTO` entry to `STUB_ONTOLOGY_SOURCES` (db_filename=bto.db, knowledge_source=infores:bto). Class docstring updated.
- kg_microbe/transform_utils/bacdive/bacdive.py:2991-3007: extend the inline-emit skip-list from `{"NCIT", "mesh"}` to `{"NCIT", "mesh", "BTO"}` so BacDive defers BTO stub-node emission to the new transform (avoids duplicate node rows). Code comment updated to reflect the new partitioning.
- kg_microbe/utils/isolation_source_mapping_utils.py: STUB_ONTOLOGY_PREFIXES docstring updated to document the new partitioning (NCIT/mesh/BTO enriched via SemSQL; PRIDE/PCO/GENEPIO/FAO/SNOMED stay on the label-only inline path).
- download.yaml: add `bto.db.gz` from s3.amazonaws.com/bbop-sqlite (~30 MB, same distribution as the NCIT and mesh SemSQL DBs).
- merge.yaml / merge.no_metatraits.yaml / merge_bakta.yaml: add `data/transformed/ontologies_stubs/bto_nodes.tsv` to the ontologies_stubs source filename list in each variant.
- tests/test_ontologies_stubs.py: rename + update `test_stub_ontology_sources_covers_ncit_and_mesh` → `test_stub_ontology_sources_covers_ncit_mesh_bto`; assert the set is now exactly `{"NCIT", "mesh", "BTO"}`.

Verified:

- `collect_stub_curies(['NCIT', 'mesh', 'BTO'])` finds 73 NCIT + 95 mesh + 2 BTO CURIEs from the committed mappings.
- 13 unit tests pass; integration test still skipped pending real SemSQL DB download.
- ruff clean.

End-to-end (requires `poetry run kg download` to fetch the three DBs, ~400 MB total):

    poetry run kg transform -s ontologies_stubs
    # → data/transformed/ontologies_stubs/{ncit,mesh,bto}_nodes.tsv
    poetry run pytest tests/test_ontologies_stubs.py -v
    # integration test no longer skipped; asserts every collector-
    # discovered CURIE has a corresponding stub-node row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
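The CURIE-collection step mentioned in the verification notes can be pictured roughly as follows. This is a minimal sketch, not the repository's `collect_stub_curies`: the helper name `collect_curies_sketch`, the regex, and the sample rows are assumptions made for illustration. Only the two BTO IDs and their labels come from the mappings described in this PR.

```python
import re
from collections import defaultdict


def collect_curies_sketch(lines, prefixes):
    """Group every PREFIX:LOCAL_ID token found in `lines` by prefix (hypothetical helper)."""
    # Match only the requested prefixes, followed by an alphanumeric local ID.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in prefixes) + r"):([A-Za-z0-9_]+)"
    )
    found = defaultdict(set)
    for line in lines:
        for prefix, local_id in pattern.findall(line):
            found[prefix].add(f"{prefix}:{local_id}")
    # Return a sorted list per requested prefix, empty when nothing matched.
    return {p: sorted(found[p]) for p in prefixes}


sample_rows = [
    "Wound-fluid\tBTO:0003114\twound fluid",
    "MIM:Cell_Lysate\tBTO:0004304\tcell lysate",
    "some-source\tPRIDE:0000000\tplaceholder row",  # made-up row; PRIDE is not requested
]
result = collect_curies_sketch(sample_rows, ["NCIT", "mesh", "BTO"])
print(result)
```

Restricting the pattern to the requested prefixes keeps source-side identifiers such as `MIM:Cell_Lysate` and inline-path prefixes such as PRIDE out of the result.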
Pull request overview
This PR extends the existing ontologies_stubs SemSQL-backed enrichment mechanism (previously for NCIT + MeSH) to also cover BTO, so BTO CURIEs referenced by in-repo mappings resolve to enriched stub nodes (label/synonyms/xrefs) rather than inline label-only stubs.
Changes:
- Add BTO to `STUB_ONTOLOGY_SOURCES` and skip inline BTO stub-node emission in the BacDive transform to avoid duplicate node rows.
- Add `bto.db.gz` to `download.yaml` and include `bto_nodes.tsv` in the `ontologies_stubs` source list across merge config variants.
- Update docs/tests to reflect the expanded stub-enrichment scope.
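The skip-list change can be read as a simple partition of stub prefixes. The sketch below is written under assumptions: `ENRICHED_PREFIXES`, `INLINE_PREFIXES`, and `should_emit_inline` are hypothetical names, not the identifiers used in `bacdive.py`. Only the two prefix sets are taken from this PR.

```python
# Prefixes whose stub nodes are emitted by the ontologies_stubs transform (per this PR).
ENRICHED_PREFIXES = {"NCIT", "mesh", "BTO"}
# Prefixes that this PR says stay on the label-only inline path.
INLINE_PREFIXES = {"PRIDE", "PCO", "GENEPIO", "FAO", "SNOMED"}


def should_emit_inline(curie: str) -> bool:
    """Return True when the source transform should write a label-only stub itself."""
    prefix = curie.split(":", 1)[0]
    # Enriched prefixes are deferred so the same node is never written twice.
    return prefix in INLINE_PREFIXES and prefix not in ENRICHED_PREFIXES


print(should_emit_inline("BTO:0004304"))
print(should_emit_inline("PCO:0000000"))  # placeholder ID, used only for its prefix
```

Keeping the two sets disjoint is what prevents duplicate node rows: each CURIE has exactly one transform responsible for it.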
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Adds BTO as a SemSQL-enriched stub ontology source alongside NCIT/MeSH. |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Extends the inline-stub emission skip-list to include BTO (defers to ontologies_stubs). |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | Updates documentation describing which stub prefixes are enriched via SemSQL vs emitted inline. |
| `download.yaml` | Adds the BTO SemSQL DB (`bto.db.gz`) to the download manifest. |
| `merge.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs. |
| `merge.no_metatraits.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs (no-metatraits variant). |
| `merge_bakta.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs (bakta variant). |
| `tests/test_ontologies_stubs.py` | Updates config test to require `{NCIT, mesh, BTO}` coverage. |
Comments suppressed due to low confidence (1)
kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py:88
- The module/class documentation is now out of date with the addition of BTO: the top-level docstring and the `__init__` param docs still describe only NCIT/MESH and `{ncit,mesh}_nodes.tsv` outputs. Please update those docstrings to reflect `{ncit,mesh,bto}` so future readers don't miss that BTO is handled by this transform.

    class OntologiesStubsTransform(Transform):
        """Emit one labelled stub node per referenced NCIT / MESH / BTO CURIE."""
        def __init__(
Comment on lines +71 to +75

    "BTO": {
        # BRENDA Tissue Ontology. Only ~2 CURIEs in current kg-microbe
        # mappings (wound fluid from BacDive isolation_source; cell lysate
        # added by the MIM 2026-05-18 republish). Added here so those nodes
        # carry full label + synonyms + xrefs instead of label-only stubs.
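For orientation, an entry of this kind and the paths it implies might look like the sketch below. It is an assumption-laden illustration: the key names `db_filename` and `knowledge_source` and their values are quoted from the PR description, but the dictionary name `STUB_SOURCES_SKETCH`, the helper `paths_for`, and the way paths are derived are hypothetical and may differ from the real transform.

```python
from pathlib import PurePosixPath

# Sketch of a single source entry; the values are the ones quoted in the PR description.
STUB_SOURCES_SKETCH = {
    "BTO": {
        "db_filename": "bto.db",
        "knowledge_source": "infores:bto",
    },
}

RAW_DIR = PurePosixPath("data/raw")
OUT_DIR = PurePosixPath("data/transformed/ontologies_stubs")


def paths_for(prefix: str):
    """Return (SemSQL DB path, output nodes TSV path) for a prefix (hypothetical helper)."""
    entry = STUB_SOURCES_SKETCH[prefix]
    db_path = RAW_DIR / entry["db_filename"]
    nodes_path = OUT_DIR / f"{prefix.lower()}_nodes.tsv"
    return db_path, nodes_path


db_path, nodes_path = paths_for("BTO")
print(db_path, nodes_path)
```

Lower-casing the prefix reproduces the `{ncit,mesh,bto}_nodes.tsv` naming that the merge configs reference.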
Comment on lines 443 to +447

      url: https://s3.amazonaws.com/bbop-sqlite/mesh.db.gz
      local_name: mesh.db.gz
    -
      url: https://s3.amazonaws.com/bbop-sqlite/bto.db.gz
      local_name: bto.db.gz
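The transform reads `data/raw/bto.db`, while the manifest downloads `bto.db.gz`. Whether `kg download` unpacks the archive automatically is not stated in this PR; if a manual step were needed, it could look like the following sketch. The helper `gunzip_db` is hypothetical, and the demo writes placeholder bytes to a temporary directory rather than touching a real SemSQL database.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_db(gz_path: Path) -> Path:
    """Decompress `<name>.db.gz` next to itself and return the `<name>.db` path."""
    db_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks; the real DBs are large
    return db_path


# Demo on a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    gz_file = Path(tmp) / "bto.db.gz"
    with gzip.open(gz_file, "wb") as fh:
        fh.write(b"placeholder bytes, not a real SemSQL DB")
    out_path = gunzip_db(gz_file)
    out_name = out_path.name
    round_trip = out_path.read_bytes()

print(out_name)
```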
Comment on lines +91 to +99

    def test_stub_ontology_sources_covers_ncit_mesh_bto():
        """
        Cover the three prefixes that need full SemSQL-backed enrichment.

        NCIT and mesh were added in the initial commit; BTO was added after the
        MIM 2026-05-18 republish brought in `BTO:0004304 cell lysate`, doubling
        the BTO footprint and crossing the "worth a SemSQL fetch" threshold.
        """
        assert set(STUB_ONTOLOGY_SOURCES.keys()) == {"NCIT", "mesh", "BTO"}
Comment on lines +93 to +97

    + # 1. NCIT, mesh, and BTO: a SemSQL-backed enriched stub source. The
    + #    OntologiesStubsTransform (kg_microbe/transform_utils/ontologies_stubs/)
    - #    queries data/raw/ncit.db and data/raw/mesh.db via OAK to fetch
    - #    rdfs:label, exact synonyms, and dbxrefs for every NCIT/mesh CURIE that
    - #    appears anywhere under mappings/. Output:
    - #    data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. This is the
    - #    preferred path — stubs carry full metadata, not just a label. The
    + #    queries data/raw/{ncit,mesh,bto}.db via OAK to fetch rdfs:label, exact
    + #    synonyms, and dbxrefs for every NCIT/mesh/BTO CURIE that appears
    + #    anywhere under mappings/. Output:
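To make "label, exact synonyms, and dbxrefs" concrete, the sketch below pulls those three fields out of a toy triple table with plain `sqlite3`. It is not how the transform works (the comment above says it goes through OAK). The table is only loosely modelled on the subject/predicate/value shape of a SemSQL `statements` table, and the reduced column set, the predicate CURIEs, the helper `fetch_stub`, and the synonym and xref values are all assumptions or placeholders.

```python
import sqlite3

# Toy stand-in for a SemSQL database; the column set is reduced for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?)",
    [
        ("BTO:0003114", "rdfs:label", None, "wound fluid"),
        ("BTO:0003114", "oio:hasExactSynonym", None, "placeholder synonym"),
        ("BTO:0003114", "oio:hasDbXref", None, "EXAMPLE:0000"),  # made-up xref
    ],
)

# Assumed mapping from predicate CURIE to stub-node field.
FIELD_FOR_PREDICATE = {
    "rdfs:label": "label",
    "oio:hasExactSynonym": "synonyms",
    "oio:hasDbXref": "xrefs",
}


def fetch_stub(connection, curie):
    """Collect label, synonyms, and xrefs for one CURIE (hypothetical helper)."""
    stub = {"id": curie, "label": None, "synonyms": [], "xrefs": []}
    rows = connection.execute(
        "SELECT predicate, value FROM statements WHERE subject = ? ORDER BY rowid",
        (curie,),
    )
    for predicate, value in rows:
        field = FIELD_FOR_PREDICATE.get(predicate)
        if field == "label":
            stub["label"] = value
        elif field is not None:
            stub[field].append(value)
    return stub


stub = fetch_stub(conn, "BTO:0003114")
print(stub)
```

A label-only inline stub would stop at the `label` field; the enriched path is what fills in the two lists.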
Summary
Promotes BTO to the same SemSQL-backed enriched-stub treatment as NCIT and mesh (introduced by PR #565), so the merged KG carries full label + synonyms + xrefs on both BTO nodes instead of label-only stubs.
Why now
The 2026-05-18 MIM SSSOM republish (PR #564, already merged) added a `MIM:Cell_Lysate → BTO:0004304 cell lysate` row. Combined with the pre-existing `Wound-fluid → BTO:0003114 wound fluid` row in `mappings/isolation_source_to_ontology.tsv`, BTO's in-repo footprint is now 2 IDs, past the threshold where the original ontologies-stubs design (PR #565) opted to leave BTO on the label-only inline path.

Changes

| File | Change |
|---|---|
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Add `BTO` to `STUB_ONTOLOGY_SOURCES` (db_filename=`bto.db`, knowledge_source=`infores:bto`) |
| `kg_microbe/transform_utils/bacdive/bacdive.py:2991-3007` | Extend the inline-emit skip-list `{NCIT, mesh}` → `{NCIT, mesh, BTO}` so BacDive defers BTO stub-node emission to the new transform (avoids duplicate node rows). Comment updated. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | `STUB_ONTOLOGY_PREFIXES` docstring updated to reflect new partitioning (NCIT/mesh/BTO enriched via SemSQL; PRIDE/PCO/GENEPIO/FAO/SNOMED stay on the label-only inline path). |
| `download.yaml` | Add `bto.db.gz` from `s3.amazonaws.com/bbop-sqlite` (~30 MB) |
| `merge.yaml` / `merge.no_metatraits.yaml` / `merge_bakta.yaml` | Add `data/transformed/ontologies_stubs/bto_nodes.tsv` to the `ontologies_stubs` source filename list in each variant |
| `tests/test_ontologies_stubs.py` | Rename `test_stub_ontology_sources_covers_ncit_and_mesh` → `test_stub_ontology_sources_covers_ncit_mesh_bto`; assert set is exactly `{NCIT, mesh, BTO}` |

Verification
Test plan
- `poetry run kg download`: fetches `bto.db.gz` (~30 MB, one-time; NCIT and mesh DBs already downloaded if PR #565, "Selective per-CURIE NCIT/MESH stub-import transform", was run end-to-end)
- `poetry run kg transform -s ontologies_stubs`: should write `data/transformed/ontologies_stubs/bto_nodes.tsv` with 2 rows + header (`BTO:0003114 wound fluid`, `BTO:0004304 cell lysate`), each carrying label/synonyms/xrefs from the BRENDA Tissue Ontology
- `poetry run pytest tests/test_ontologies_stubs.py -v`: integration test stops being skipped after the transform runs

🤖 Generated with Claude Code
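The second test-plan step (2 rows + header with the expected IDs and labels) could be spot-checked with a few lines of Python. This is a sketch under assumptions: the column names `id` and `name` follow the usual KGX node-file convention but are not confirmed by this PR, the helper `read_id_to_name` is hypothetical, and the sample text stands in for the real `bto_nodes.tsv`, which would also carry synonym and xref columns.

```python
import csv
import io

# IDs and labels stated in the test plan.
EXPECTED = {"BTO:0003114": "wound fluid", "BTO:0004304": "cell lysate"}


def read_id_to_name(tsv_text: str) -> dict:
    """Map each node ID to its label from tab-separated text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["id"]: row["name"] for row in reader}


# Stand-in for the file contents; in practice, read the text of bto_nodes.tsv instead.
sample_tsv = "id\tname\nBTO:0003114\twound fluid\nBTO:0004304\tcell lysate\n"

nodes = read_id_to_name(sample_tsv)
missing = sorted(set(EXPECTED) - set(nodes))
print(len(nodes), missing)
```

An empty `missing` list together with a node count of 2 corresponds to the outcome the test plan describes.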