
Extend stub-import transform to cover BTO #570

Merged
realmarcin merged 1 commit into master from add-bto-to-stub-import on May 20, 2026

@realmarcin

Collaborator

Summary

Promotes BTO to the same SemSQL-backed enriched-stub treatment as NCIT and mesh (introduced by PR #565), so the merged KG carries full label + synonyms + xrefs on both BTO nodes instead of label-only stubs.

Why now

The 2026-05-18 MIM SSSOM republish (PR #564, already merged) added a MIM:Cell_Lysate → BTO:0004304 cell lysate row. Combined with the pre-existing Wound-fluid → BTO:0003114 wound fluid row in mappings/isolation_source_to_ontology.tsv, BTO's in-repo footprint is now 2 IDs — past the threshold where the original ontologies-stubs design (PR #565) opted to leave BTO on the label-only inline path.

Changes

| File | Change |
|---|---|
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Add `BTO` to `STUB_ONTOLOGY_SOURCES` (db_filename=bto.db, knowledge_source=infores:bto) |
| `kg_microbe/transform_utils/bacdive/bacdive.py:2991-3007` | Extend inline-emit skip-list from `{NCIT, mesh}` to `{NCIT, mesh, BTO}` so BacDive defers BTO stub-node emission to the new transform (avoids duplicate node rows). Comment updated. |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | `STUB_ONTOLOGY_PREFIXES` docstring updated to reflect new partitioning (NCIT/mesh/BTO enriched via SemSQL; PRIDE/PCO/GENEPIO/FAO/SNOMED stay on the label-only inline path). |
| `download.yaml` | Add `bto.db.gz` from s3.amazonaws.com/bbop-sqlite (~30 MB) |
| `merge.yaml` / `merge.no_metatraits.yaml` / `merge_bakta.yaml` | Add `data/transformed/ontologies_stubs/bto_nodes.tsv` to the ontologies_stubs source filename list in each variant |
| `tests/test_ontologies_stubs.py` | Rename `test_stub_ontology_sources_covers_ncit_and_mesh` → `test_stub_ontology_sources_covers_ncit_mesh_bto`; assert set is exactly `{NCIT, mesh, BTO}` |
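
For orientation, the registry change in the first row might look roughly like the sketch below. Only the `BTO` values (`bto.db`, `infores:bto`) and the `ncit.db` / `mesh.db` filenames appear in this PR; the dictionary layout, the `infores:ncit` / `infores:mesh` values, and the `output_filename` helper are assumptions made for illustration, not the repository's actual code.

```python
# Hypothetical layout of the registry; the real STUB_ONTOLOGY_SOURCES in
# ontologies_stubs_transform.py may carry different or additional keys.
STUB_ONTOLOGY_SOURCES = {
    # knowledge_source values for NCIT and mesh are assumed, not quoted from the PR.
    "NCIT": {"db_filename": "ncit.db", "knowledge_source": "infores:ncit"},
    "mesh": {"db_filename": "mesh.db", "knowledge_source": "infores:mesh"},
    # Values for BTO are the ones stated in this PR.
    "BTO": {"db_filename": "bto.db", "knowledge_source": "infores:bto"},
}


def output_filename(prefix: str) -> str:
    """Hypothetical helper: map a registry prefix to its nodes file name."""
    if prefix not in STUB_ONTOLOGY_SOURCES:
        raise KeyError(f"{prefix!r} is not a SemSQL-enriched stub ontology")
    # Matches the {ncit,mesh,bto}_nodes.tsv naming used in the merge configs.
    return f"{prefix.lower()}_nodes.tsv"


print(output_filename("BTO"))  # bto_nodes.tsv
```

Keeping the per-ontology settings in one mapping means that adding an ontology is a data change rather than a code change, which is consistent with how small this PR's diff is.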

Verification

$ poetry run python -c "from kg_microbe.utils.stub_curie_collection import collect_stub_curies; print(collect_stub_curies(['NCIT','mesh','BTO']))"
{'NCIT': {…73 CURIEs…}, 'mesh': {…95 CURIEs…}, 'BTO': {'BTO:0003114', 'BTO:0004304'}}

$ poetry run pytest tests/test_stub_curie_collection.py tests/test_ontologies_stubs.py -v
13 passed, 1 skipped

$ poetry run python -m ruff check kg_microbe/ tests/
All checks passed!
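
The collector output above can be approximated with a small prefix scan. The sketch below is not the real `collect_stub_curies` (whose signature and file-walking logic are not shown in this PR); `collect_curies_from_lines` is a hypothetical stand-in that operates on already-read text lines, and the two sample rows assume a tab-separated layout for the mapping file.

```python
import re


def collect_curies_from_lines(lines, prefixes):
    """Hypothetical stand-in: group CURIEs found in text lines by prefix."""
    # Build one alternation over the requested prefixes, e.g. (NCIT|mesh|BTO).
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(rf"\b({alternation}):([A-Za-z0-9_]+)")
    found = {prefix: set() for prefix in prefixes}
    for line in lines:
        for prefix, local_id in pattern.findall(line):
            found[prefix].add(f"{prefix}:{local_id}")
    return found


# The two BTO rows named in this PR; the column layout is an assumption.
rows = [
    "Wound-fluid\tBTO:0003114\twound fluid",
    "MIM:Cell_Lysate\tBTO:0004304\tcell lysate",
]
result = collect_curies_from_lines(rows, ["NCIT", "mesh", "BTO"])
print(sorted(result["BTO"]))  # ['BTO:0003114', 'BTO:0004304']
```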

Test plan

  • poetry run kg download: fetches bto.db.gz (~30 MB, one-time; NCIT and mesh DBs already downloaded if PR #565, "Selective per-CURIE NCIT/MESH stub-import transform", was run end-to-end)
  • poetry run kg transform -s ontologies_stubs — should write data/transformed/ontologies_stubs/bto_nodes.tsv with 2 rows + header (BTO:0003114 wound fluid, BTO:0004304 cell lysate), each carrying label/synonyms/xrefs from the BRENDA Tissue Ontology
  • poetry run pytest tests/test_ontologies_stubs.py -v — integration test stops being skipped after the transform runs
  • After re-merge: spot-check that the two BTO nodes in the merged KG now carry non-empty synonym/xref fields
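
The final spot-check could be scripted along the lines below. This is a sketch under assumptions: it presumes the nodes file is a tab-separated table with KGX-style `id`, `synonym`, and `xref` columns, and `nodes_missing_enrichment` is a hypothetical helper. The synonym and xref values in the sample are placeholders, not real BTO data.

```python
import csv
import io


def nodes_missing_enrichment(handle, prefix, fields=("synonym", "xref")):
    """Return ids with the given prefix whose enrichment fields are empty."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = []
    for row in reader:
        if not row["id"].startswith(prefix + ":"):
            continue
        # Treat an absent column and a whitespace-only value as empty.
        if any(not (row.get(field) or "").strip() for field in fields):
            missing.append(row["id"])
    return missing


# Placeholder rows: the second one mimics a label-only stub.
sample = io.StringIO(
    "id\tname\tsynonym\txref\n"
    "BTO:0003114\twound fluid\tplaceholder synonym\tPLACEHOLDER:1\n"
    "BTO:0004304\tcell lysate\t\t\n"
)
print(nodes_missing_enrichment(sample, "BTO"))  # ['BTO:0004304']
```

An empty result for the `BTO` prefix on the merged nodes file would indicate that both nodes went through the enriched path.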

🤖 Generated with Claude Code

The 2026-05-18 MIM SSSOM republish (PR #564) added a `MIM:Cell_Lysate →
BTO:0004304 cell lysate` row. Combined with the pre-existing
`Wound-fluid → BTO:0003114 wound fluid` row in
`mappings/isolation_source_to_ontology.tsv`, the BTO footprint in
kg-microbe is now 2 IDs — past the threshold where the original
ontologies-stubs design (PR #565) opted to leave BTO on the label-only
inline path. Promote BTO to the same SemSQL-backed enriched-stub
treatment as NCIT and mesh, so the merged KG carries full label +
synonyms + xrefs on both BTO nodes.

Changes:

- kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py:
  add `BTO` entry to `STUB_ONTOLOGY_SOURCES` (db_filename=bto.db,
  knowledge_source=infores:bto). Class docstring updated.

- kg_microbe/transform_utils/bacdive/bacdive.py:2991-3007: extend the
  inline-emit skip-list from `{"NCIT", "mesh"}` to
  `{"NCIT", "mesh", "BTO"}` so BacDive defers BTO stub-node emission to
  the new transform (avoids duplicate node rows). Code comment updated
  to reflect the new partitioning.

- kg_microbe/utils/isolation_source_mapping_utils.py: STUB_ONTOLOGY_PREFIXES
  docstring updated to document the new partitioning (NCIT/mesh/BTO
  enriched via SemSQL; PRIDE/PCO/GENEPIO/FAO/SNOMED stay on the label-
  only inline path).

- download.yaml: add `bto.db.gz` from s3.amazonaws.com/bbop-sqlite
  (~30 MB, same distribution as the NCIT and mesh SemSQL DBs).

- merge.yaml / merge.no_metatraits.yaml / merge_bakta.yaml: add
  `data/transformed/ontologies_stubs/bto_nodes.tsv` to the
  ontologies_stubs source filename list in each variant.

- tests/test_ontologies_stubs.py: rename + update
  `test_stub_ontology_sources_covers_ncit_and_mesh` →
  `test_stub_ontology_sources_covers_ncit_mesh_bto`; assert the set
  is now exactly `{"NCIT", "mesh", "BTO"}`.
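
The skip-list change in `bacdive.py` amounts to a prefix check before inline emission. A minimal sketch, assuming the decision is made per CURIE; `ENRICHED_STUB_PREFIXES` and `should_emit_inline_stub` are hypothetical names, and the `PRIDE` identifier is a made-up example rather than one taken from the mappings.

```python
# Prefixes whose stub nodes are produced by the ontologies_stubs transform.
ENRICHED_STUB_PREFIXES = frozenset({"NCIT", "mesh", "BTO"})


def should_emit_inline_stub(curie: str) -> bool:
    """Return True when the caller should write a label-only stub itself."""
    prefix, separator, _local_id = curie.partition(":")
    if not separator:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Deferring enriched prefixes avoids two node rows for the same id.
    return prefix not in ENRICHED_STUB_PREFIXES


print(should_emit_inline_stub("BTO:0003114"))    # False
print(should_emit_inline_stub("PRIDE:0000000"))  # True (illustrative id)
```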

Verified:
- `collect_stub_curies(['NCIT', 'mesh', 'BTO'])` finds 73 NCIT + 95
  mesh + 2 BTO CURIEs from the committed mappings.
- 13 unit tests pass; integration test still skipped pending real
  SemSQL DB download.
- ruff clean.

End-to-end (requires `poetry run kg download` to fetch the three DBs,
~400 MB total):

  poetry run kg transform -s ontologies_stubs
  # → data/transformed/ontologies_stubs/{ncit,mesh,bto}_nodes.tsv
  poetry run pytest tests/test_ontologies_stubs.py -v
  # integration test no longer skipped; asserts every collector-
  # discovered CURIE has a corresponding stub-node row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 20, 2026 06:10

Copilot AI left a comment


Pull request overview

This PR extends the existing ontologies_stubs SemSQL-backed enrichment mechanism (previously for NCIT + MeSH) to also cover BTO, so BTO CURIEs referenced by in-repo mappings resolve to enriched stub nodes (label/synonyms/xrefs) rather than inline label-only stubs.

Changes:

  • Add BTO to STUB_ONTOLOGY_SOURCES and skip inline BTO stub-node emission in the BacDive transform to avoid duplicate node rows.
  • Add bto.db.gz to download.yaml and include bto_nodes.tsv in the ontologies_stubs source list across merge config variants.
  • Update docs/tests to reflect the expanded stub-enrichment scope.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py` | Adds BTO as a SemSQL-enriched stub ontology source alongside NCIT/MeSH. |
| `kg_microbe/transform_utils/bacdive/bacdive.py` | Extends the inline-stub emission skip-list to include BTO (defers to ontologies_stubs). |
| `kg_microbe/utils/isolation_source_mapping_utils.py` | Updates documentation describing which stub prefixes are enriched via SemSQL vs emitted inline. |
| `download.yaml` | Adds the BTO SemSQL DB (`bto.db.gz`) to the download manifest. |
| `merge.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs. |
| `merge.no_metatraits.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs (no-metatraits variant). |
| `merge_bakta.yaml` | Includes `bto_nodes.tsv` in the ontologies_stubs merge inputs (bakta variant). |
| `tests/test_ontologies_stubs.py` | Updates config test to require `{NCIT, mesh, BTO}` coverage. |

Comments suppressed due to low confidence (1)

kg_microbe/transform_utils/ontologies_stubs/ontologies_stubs_transform.py:88

  • The module/class documentation is now out of date with the addition of BTO: the top-level docstring and the __init__ param docs still describe only NCIT/MESH and {ncit,mesh}_nodes.tsv outputs. Please update those docstrings to reflect {ncit,mesh,bto} so future readers don’t miss that BTO is handled by this transform.
class OntologiesStubsTransform(Transform):

    """Emit one labelled stub node per referenced NCIT / MESH / BTO CURIE."""

    def __init__(


Comment on lines +71 to +75
"BTO": {
# BRENDA Tissue Ontology. Only ~2 CURIEs in current kg-microbe
# mappings (wound fluid from BacDive isolation_source; cell lysate
# added by the MIM 2026-05-18 republish). Added here so those nodes
# carry full label + synonyms + xrefs instead of label-only stubs.
Comment thread download.yaml
Comment on lines 443 to +447
url: https://s3.amazonaws.com/bbop-sqlite/mesh.db.gz
local_name: mesh.db.gz
-
url: https://s3.amazonaws.com/bbop-sqlite/bto.db.gz
local_name: bto.db.gz
Comment on lines +91 to +99
def test_stub_ontology_sources_covers_ncit_mesh_bto():
    """
    Cover the three prefixes that need full SemSQL-backed enrichment.

    NCIT and mesh were added in the initial commit; BTO was added after the
    MIM 2026-05-18 republish brought in `BTO:0004304 cell lysate`, doubling
    the BTO footprint and crossing the "worth a SemSQL fetch" threshold.
    """
    assert set(STUB_ONTOLOGY_SOURCES.keys()) == {"NCIT", "mesh", "BTO"}
Comment on lines +93 to +97
# 1. NCIT, mesh, and BTO: a SemSQL-backed enriched stub source. The
# OntologiesStubsTransform (kg_microbe/transform_utils/ontologies_stubs/)
# queries data/raw/ncit.db and data/raw/mesh.db via OAK to fetch
# rdfs:label, exact synonyms, and dbxrefs for every NCIT/mesh CURIE that
# appears anywhere under mappings/. Output:
# data/transformed/ontologies_stubs/{ncit,mesh}_nodes.tsv. This is the
# preferred path — stubs carry full metadata, not just a label. The
# queries data/raw/{ncit,mesh,bto}.db via OAK to fetch rdfs:label, exact
# synonyms, and dbxrefs for every NCIT/mesh/BTO CURIE that appears
# anywhere under mappings/. Output:
@realmarcin realmarcin merged commit 02ac444 into master May 20, 2026
7 checks passed
@realmarcin realmarcin deleted the add-bto-to-stub-import branch May 20, 2026 06:32