
Commit 8824afc

realmarcin and claude committed
Address kg-model-review findings: Abscess + BacDive dual-edge + Rhea→EC
Three fixes for the modeling violations the merged-kg review surfaced. After re-running the bacdive + rhea_mappings transforms and re-merging, expected impact is ~133K of the 149,388 domain/range warnings cleared (89%) without minting any new METPO term.

Finding #1: Abscess mapping pointed at a UBERON ID that does not exist (mappings/isolation_source_to_ontology.tsv:4)

The earlier codex_review_v2 fix replaced MONDO:0005227 with UBERON:0006548, but UBERON has no generic "abscess" term; the word appears only as descriptive text on unrelated anatomy entries. UBERON:0006548 is fictitious. Replaced with HP:0025615 (HP "Abscess", a phenotypic feature defined as 'a localized collection of purulent material surrounded by inflammation and granulation'). HP is present in data/transformed/ontologies/hp_nodes.tsv and avoids the disease-class concern that motivated the original fix.

Finding #2: BacDive dual-edge assay pattern violates predicate ranges (kg_microbe/transform_utils/bacdive/bacdive.py:2794-2812 + constants.py:295)

bacdive.py emits two edges per assay result: organism→ChEBI substrate (the trait claim) and organism→assay (the test that produced it). Both were emitted with the same trait predicate (METPO:2000002 assimilates, METPO:2000011 ferments, etc.), generating 125,119 organism→Procedure violations because those predicates' biolink-mapped range is biolink:ChemicalEntity, not biolink:Procedure. Two changes:

* The organism→assay edge now uses METPO:2000511 (has observation), the existing parent of the has-X-observation predicates. This correctly reads as "organism has the observation that was recorded on this assay procedure." The substrate edge predicate is unchanged.
* ASSAY_CATEGORY changed from "biolink:Procedure" to the multi-category "biolink:Procedure|METPO:1001000" (observation). biolink:Procedure carries the biolink semantic for downstream tooling; METPO:1001000 satisfies METPO:2000511's existing range without any upstream METPO range modification.

This avoids minting a new METPO term and keeps the proposal TSV (metpo_proposal_2026_05) untouched.

Finding #3: Rhea→EC enabled_by violates biolink range (kg_microbe/transform_utils/constants.py:237)

The edge from RHEA reactions (biolink:MolecularActivity) to EC enzymes (also typed biolink:MolecularActivity in this graph) was emitted with biolink:enabled_by, whose range is biolink:PhysicalEntity, generating 7,902 violations. EC_CATEGORY could not be changed without breaking the upstream "Protein → enables → EC" semantics, which requires EC to be an activity. So RHEA_TO_EC_EDGE is now biolink:close_match, biolink's purpose-built predicate for cross-vocabulary alignments, which is what Rhea↔EC actually is (both classify the same enzymatic reaction at different granularities). Its domain and range are unconstrained, and the semantic match is good.

Re-run scope before next merge:

* bacdive: picks up the bacdive.py edge change, the ASSAY_CATEGORY multi-category, and the load_isolation_source_mappings() reload of the Abscess fix.
* rhea_mappings: picks up the RHEA_TO_EC_EDGE constant.
* merge: produces a fresh merged-kg.tar.gz with all three fixes applied.

Validation: ROBOT template/merge/ELK reason clean; isolation_source validator OK; pytest tests/test_extract_metpo_proposals.py + tests/test_isolation_source_mapping_utils.py → 10 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
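
To make the violation counts in the message concrete, here is a minimal sketch of the kind of range check that would flag these edges. The `PREDICATE_RANGE` table and the `violates_range` helper are hypothetical names written for illustration; this is not the validator used in the kg-model review, and the range entries come only from what the commit message states.

```python
# Hypothetical range table, filled only from the ranges stated in the commit message.
PREDICATE_RANGE = {
    "METPO:2000002": "biolink:ChemicalEntity",  # assimilates (biolink-mapped range)
    "METPO:2000011": "biolink:ChemicalEntity",  # ferments (biolink-mapped range)
    "METPO:2000511": "METPO:1001000",           # has observation -> observation
}


def violates_range(predicate: str, object_category: str) -> bool:
    """Return True when no category of the object matches the predicate's range.

    `object_category` may be a pipe-delimited multi-category string.
    Predicates with no declared range are treated as unconstrained.
    Matching is exact; a real validator would also walk the class hierarchy.
    """
    expected = PREDICATE_RANGE.get(predicate)
    if expected is None:
        return False
    return expected not in object_category.split("|")


# Old organism->assay edge: trait predicate pointing at a Procedure node.
old_edge = violates_range("METPO:2000011", "biolink:Procedure")
# New organism->assay edge: has-observation predicate, multi-category assay node.
new_edge = violates_range("METPO:2000511", "biolink:Procedure|METPO:1001000")
# New Rhea->EC edge: close_match has no range constraint in this sketch.
rhea_edge = violates_range("biolink:close_match", "biolink:MolecularActivity")
print(old_edge, new_edge, rhea_edge)  # True False False
```

Under these assumptions the old assay edge is the only one flagged, which matches the intent of Findings #2 and #3.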
1 parent 8e1ac1e commit 8824afc

3 files changed

Lines changed: 20 additions & 7 deletions


kg_microbe/transform_utils/bacdive/bacdive.py

Lines changed: 12 additions & 4 deletions
@@ -2792,18 +2792,26 @@ def run(self, data_file: Union[Optional[Path], Optional[str]] = None, show_statu
 )

 # NEW: Add organism → assay edge (dual-edge pattern)
-# This is in addition to the direct organism→ChEBI edges above
+# This is in addition to the direct organism→ChEBI edges above.
 # Build assay ID: assay:{kit_name}_{well_name}
 assay_id = f"{ASSAY_PREFIX}{assay_name}_{test_label}".replace(" ", "_")

-# Write organism→assay edge using same METPO predicate
+# The assay node carries provenance for the trait claim, not the
+# trait itself, so it must NOT use the trait predicate (ferments /
+# assimilates etc.) — that would map an organism to a procedure as
+# if it were a substrate, violating the predicate's biolink-mapped
+# range. Use METPO:2000511 (has observation); the assay nodes carry
+# METPO:1001000 (observation) in their multi-category (see
+# ASSAY_CATEGORY in constants.py) so the predicate's existing
+# METPO:1001000 range is satisfied without an upstream change.
+assay_predicate = "METPO:2000511"
 knowledge_level, agent_type = self._add_edge_metadata(
-    metpo_predicate, "RO:0002434", assay_id
+    assay_predicate, "RO:0002434", assay_id
 )
 edge_writer.writerow(
     [
         organism_id,
-        metpo_predicate,
+        assay_predicate,
         assay_id,
         "RO:0002434",
         self.knowledge_source,
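
The effect of this hunk can be illustrated with a small self-contained sketch. The function `dual_edges`, the three-column edge shape, and the example IDs are assumptions made for illustration; the real transform writes more columns through `edge_writer` and derives metadata via `_add_edge_metadata`. The value of `ASSAY_PREFIX` is inferred from the `assay:{kit_name}_{well_name}` comment and may differ in constants.py.

```python
ASSAY_PREFIX = "assay:"  # assumed from the "assay:{kit_name}_{well_name}" comment
HAS_OBSERVATION = "METPO:2000511"  # has observation


def dual_edges(organism_id, trait_predicate, substrate_id, assay_name, test_label):
    """Build the two (subject, predicate, object) edges for one assay result."""
    assay_id = f"{ASSAY_PREFIX}{assay_name}_{test_label}".replace(" ", "_")
    return [
        # Trait claim: keeps the trait predicate (ferments, assimilates, ...).
        (organism_id, trait_predicate, substrate_id),
        # Provenance: organism has an observation recorded on this assay.
        (organism_id, HAS_OBSERVATION, assay_id),
    ]


# Illustrative inputs only; not taken from the BacDive data.
edges = dual_edges("NCBITaxon:562", "METPO:2000011", "CHEBI:17234", "API 50CHac", "GLU")
for edge in edges:
    print(edge)
```

Only the second tuple changes with this commit: before the fix it would have carried `METPO:2000011` as well.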

kg_microbe/transform_utils/constants.py

Lines changed: 7 additions & 2 deletions
@@ -234,7 +234,7 @@
 SUBSTRATE_TO_ASSAY_EDGE = "biolink:occurs_in" # [substrate -> assay]
 ENZYME_TO_SUBSTRATE_EDGE = "biolink:has_input" # [enzyme -> substrate]
 NCBI_TO_SUBSTRATE_EDGE = "biolink:consumes"
-RHEA_TO_EC_EDGE = "biolink:enabled_by"
+RHEA_TO_EC_EDGE = "biolink:close_match"

 # Assay → Entity predicates (methodological reference edges)
 ASSAY_HAS_OUTPUT_PREDICATE = "biolink:has_output" # [assay -> GO/EC]
@@ -292,7 +292,12 @@
 GENOME_CATEGORY = "biolink:Genome"

 # Procedure categories
-ASSAY_CATEGORY = "biolink:Procedure" # API kit assay tests
+# Multi-cat: biolink:Procedure carries the biolink semantic for downstream
+# tooling (Procedure is biolink's closest match for a microbial test kit / well),
+# and METPO:1001000 (observation) makes the node a valid object for the
+# METPO:2000511 (has observation) predicate that BacDive uses on the
+# organism→assay edge — so no METPO range-modification is needed upstream.
+ASSAY_CATEGORY = "biolink:Procedure|METPO:1001000" # API kit assay tests

 # Deprecated categories - do not use
 # CHEMICAL_SUBSTANCE_CATEGORY = "biolink:ChemicalSubstance" # removed from biolink; use CHEBI_CATEGORY

mappings/isolation_source_to_ontology.tsv

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 subject_label subject_label_normalized object_id object_label object_source predicate_id confidence mapping_justification curator source_dataset notes verified_date
 Abdomen abdomen UBERON:0000916 abdomen UBERON skos:exactMatch high semapv:LexicalMatching ols4_auto bacdive The subdivision of the vertebrate body between the thorax and pelvis. The ventral part of the abdomen contains the abdom 2026-05-01
 Abort abort codex_review_fix_v2 bacdive fix(codex-v2): was MONDO:0041526 (disease); abortion-as-event has no clean isolation-source ontology 2026-05-02
-Abscess abscess UBERON:0006548 abscess UBERON skos:exactMatch high semapv:ManualMappingCuration codex_review_fix_v2 bacdive fix(codex-v2): was MONDO:0005227 (disease); UBERON has abscess as tissue/structure 2026-05-02
+Abscess abscess HP:0025615 Abscess HP skos:closeMatch high semapv:ManualMappingCuration kg_review_fix bacdive fix(kg-review): was UBERON:0006548 which does not exist in current UBERON; UBERON has no generic 'abscess' term. HP:0025615 (phenotypic feature: 'a localized collection of purulent material surrounded by inflammation and granulation') is the closest available term and verified present in data/transformed/ontologies/hp_nodes.tsv 2026-05-02
 Acidic acidic PATO:0001429 acidic PATO skos:exactMatch high semapv:LexicalMatching ols4_auto bacdive An medium acidity quality inhering in a solution by virtue of the bearer's a high concentration of H+ ions. 2026-05-01
 Activated-sludge activated sludge ENVO:00002046 activated sludge ENVO skos:exactMatch high semapv:LexicalMatching ols4_auto bacdive 2026-05-01
 Agriculture agriculture ENVO:01001442 agriculture ENVO skos:exactMatch high semapv:LexicalMatching ols4_auto bacdive A land use process during which terrestrial environments are modified such that they can grow crop plants or allow the r 2026-05-01
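
The failure mode fixed in this file, a mapping row whose object_id is absent from the ontology, can be caught mechanically. The sketch below is an illustration under stated assumptions, not the repository's isolation_source validator: `missing_object_ids` is a hypothetical helper, the nodes table is assumed to expose an `id` column, and the inline sample rows stand in for the real TSV files.

```python
import csv
import io


def missing_object_ids(mapping_tsv, nodes_tsv):
    """Return object_id values from the mapping that are absent from the nodes table."""
    known = {row["id"] for row in csv.DictReader(nodes_tsv, delimiter="\t")}
    return [
        row["object_id"]
        for row in csv.DictReader(mapping_tsv, delimiter="\t")
        # Rows with an empty object_id (such as "Abort") are intentionally unmapped.
        if row["object_id"] and row["object_id"] not in known
    ]


# Stand-in data: a tiny nodes table and the pre-fix state of three mapping rows.
nodes = io.StringIO("id\tname\nHP:0025615\tAbscess\nUBERON:0000916\tabdomen\n")
mapping = io.StringIO(
    "subject_label\tobject_id\n"
    "Abdomen\tUBERON:0000916\n"
    "Abort\t\n"
    "Abscess\tUBERON:0006548\n"
)
missing = missing_object_ids(mapping, nodes)
print(missing)  # ['UBERON:0006548']
```

Run against the post-fix row (HP:0025615), the same check would return an empty list for this sample.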
