Merged
Commits
56 commits
0efec12
Carry source measurement value/unit through metatraits edges + consol…
realmarcin Apr 29, 2026
808c5d5
Regenerate unified mappings: absorb 10 new NCIT-mapped MIM ingredients
realmarcin Apr 30, 2026
e6a2444
Regenerate unified mappings: absorb 18 metatraits chemical mappings
realmarcin Apr 30, 2026
8f90809
Regenerate unified mappings: 4 MIM-side metatraits reconciliation fixes
realmarcin May 1, 2026
2acf02f
Regenerate unified mappings: +4 MIM rows from CultureMech ingredient …
realmarcin May 1, 2026
4617f84
Fix 17 chemical mappings via MIM reconciliation (2026-04-30)
realmarcin May 1, 2026
2808f5b
Block PubChem/CAS-RN-mangled CHEBI ids and retire legacy unified TSV
realmarcin May 1, 2026
dd66bb0
Extend kg-model-review with mapping-file validators and curation report
realmarcin May 1, 2026
1a87b55
Regenerate unified mappings after MIM chemistry+evidence backfill
realmarcin May 1, 2026
2e88521
Regenerate unified mappings: +3 FOODON ingredients (Bakers_Yeast, Bee…
realmarcin May 1, 2026
0b73b84
Regenerate unified mappings: +9 FOODON/ENVO ingredient upgrades
realmarcin May 1, 2026
9e4ad21
Regenerate unified mappings: +33 placeholder upgrades + 5 new MICRO r…
realmarcin May 1, 2026
d818999
Regenerate unified mappings: +55 unmapped resolutions across 6 ontolo…
realmarcin May 2, 2026
1f38ee2
Regenerate unified mappings: 5 peptone records FOODON→MICRO specifici…
realmarcin May 2, 2026
89e70da
Regenerate unified mappings: first kgmicrobe.ingredient:* term (Vermo…
realmarcin May 2, 2026
d261263
Merge branch 'master' into team-review-sssom
realmarcin May 2, 2026
6d51c15
Address Copilot review feedback (PR #558)
realmarcin May 2, 2026
3475bcb
Regenerate unified mappings: validation_method stamps from team-revie…
realmarcin May 2, 2026
cdcc456
Fix CI lint: D213 on LoaderFilteringTests docstring
realmarcin May 2, 2026
8ef355b
Regenerate unified mappings: PubChem chemistry backfill + 3 kgm.ingre…
realmarcin May 2, 2026
82f44ba
Regenerate unified mappings: +56 STEM_MATCH unmapped resolutions + fi…
realmarcin May 2, 2026
cb380e7
Add METPO:1007092 xerophilic + METPO:1007093 epibiont proposal terms
realmarcin May 2, 2026
01b9931
Add bacdive isolation_source → ontology mapping (358 rows, 93% coverage)
realmarcin May 2, 2026
5b253f9
Regenerate unified mappings: parent-term backfill + dual SSSOM emission
realmarcin May 2, 2026
355b506
Address Codex adversarial review of isolation_source_to_ontology.tsv
realmarcin May 2, 2026
aa8f4f0
Fix Wound: UBERON:0006988 does not exist; use mesh:D014947
realmarcin May 2, 2026
92a07f9
Fix 12 MONDO mis-mappings + add validator CI workflow
realmarcin May 2, 2026
c53c8e1
Codex review pass: family_mismatch_fix — 25 rows
realmarcin May 2, 2026
6edab82
Wire isolation_source_to_ontology mapping into BacDive transform
realmarcin May 2, 2026
8e1ac1e
consolidate_chemical_mappings: accept mesh/MICRO/BTO/kgmicrobe.ingred…
realmarcin May 2, 2026
8824afc
Address kg-model-review findings: Abscess + BacDive dual-edge + Rhea→EC
realmarcin May 2, 2026
e9e6f1e
Regenerate unified mappings: mesh/MICRO/BTO/kgmicrobe.ingredient roun…
realmarcin May 2, 2026
a7726e6
Fix CI: metpo.json fallback fetch + soft-gate culturebotai-claw
realmarcin May 2, 2026
c317d60
Phase 1 ontology coverage: PO + TAXRANK + MICRO load, PRIDE/PCO node …
realmarcin May 2, 2026
b1d2a0b
ontologies_transform: drop synonym entries missing 'val' before KGX read
realmarcin May 2, 2026
cf6cd77
Curation sweep: fix family_mismatch on 7 isolation-source mappings
realmarcin May 2, 2026
4b6b309
Drop validate-isolation-source workflow — redundant with QC pytest gate
realmarcin May 2, 2026
0444faf
Reconcile 38 special_chemical_mappings rows: kgmicrobe.compound:* → m…
realmarcin May 2, 2026
d53217b
Promote Currency → ENVO:00003896; drop 'currency note' false-positive
realmarcin May 3, 2026
3987db5
Regenerate unified mappings: post-dihydrate-fix consolidation
realmarcin May 3, 2026
3e03f36
Address Copilot review on PR #558
realmarcin May 3, 2026
d213f87
Apply audit-agent fixes: 23 mapping corrections from web/literature r…
realmarcin May 3, 2026
959baa6
Subclass plumbing: emit subclass_of edges for 4 custom-prefix groups
realmarcin May 3, 2026
f3a8199
Subclass plumbing for MIM narrowMatch + extractor over-generalization…
realmarcin May 3, 2026
7bc3fd7
Address Codex adversarial review #558 — three high-priority findings
realmarcin May 3, 2026
0626294
Re-audit 41 dropped closeMatch rows: 34 promoted, 7 stay dropped
realmarcin May 3, 2026
131a41e
Fix CI lint: D213 docstring formatting in test_loader_honors_manually…
realmarcin May 3, 2026
ff95d04
Address Codex adversarial review #558 round 2 — two high-priority fin…
realmarcin May 3, 2026
b132be6
Address Codex round-3 review: stop polluting parents + rename unified…
realmarcin May 3, 2026
5286b54
Sync vendored MIM SSSOM: drop 5 known-bad narrowMatch rows
realmarcin May 3, 2026
ed421d0
Sync MIM SSSOM and remove redundant KNOWN_BAD_NARROWMATCH filter
realmarcin May 3, 2026
b7a099b
Add local METPO alias overlay layer in load_metpo_mappings
realmarcin May 3, 2026
df62d07
Fix family-mismatch in madin_etal PATO + mediadive medium categorization
realmarcin May 3, 2026
d62011c
Upgrade kg-path-review and kg-model-review skills with session lessons
realmarcin May 3, 2026
0a4574d
Add Claude Code Review workflow for PRs
realmarcin May 3, 2026
ac4cc52
Fix ruff lint errors in new tests and helpers (D205, D213, D209)
realmarcin May 3, 2026
79 changes: 59 additions & 20 deletions .claude/skills/chemical-mapping/SKILL.md
@@ -1,6 +1,6 @@
---
name: chemical-mapping
description: Work with KG-Microbe's unified chemical mapping system (`mappings/unified_chemical_mappings.tsv.gz` and `kg_microbe/utils/chemical_mapping_utils.py`). Use when adding a new mapping source, regenerating the unified file, debugging a missing ChEBI lookup, validating mappings against OLS, or reasoning about which source wins when sources disagree.
description: Work with KG-Microbe's unified chemical mapping system (`mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` and `kg_microbe/utils/chemical_mapping_utils.py`). Use when adding a new mapping source, regenerating the unified file, debugging a missing ChEBI lookup, validating mappings against OLS, or reasoning about which source wins when sources disagree.
---

# KG-Microbe Chemical Mapping
@@ -12,11 +12,10 @@ KG-Microbe resolves free-text chemical names from many source transforms
etc.) to canonical **ChEBI** identifiers via a single consolidated
mapping set. All transforms that need "name → ChEBI" go through
`kg_microbe.utils.chemical_mapping_utils`, which reads
`mappings/unified_ingredient_mappings.sssom.tsv.gz` once per process
`mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` once per process
and reconstructs the in-memory name/xref/formula/category indices from
the SSSOM rows. The `unified_chemical_mappings.tsv.gz` TSV is retained
only as a complementary entity-centric export for external consumers —
the reader no longer loads it.
the SSSOM rows. The unified SSSOM is the **single source of truth** for
chemical mappings; legacy entity-centric TSV outputs have been retired.

The unified file is built by `scripts/consolidate_chemical_mappings.py`
from multiple source files with a **priority system** — higher-priority
@@ -27,11 +26,11 @@ win tie-breaks during duplicate-name merging.

| Path | Role |
|---|---|
| `mappings/unified_ingredient_mappings.sssom.tsv.gz` | **Primary mapping product.** Standards-compliant SSSOM mapping set covering xrefs (`skos:exactMatch`) + canonical names + free-text synonyms via synthetic `kgm.name:<slug>` subjects (`skos:exactMatch` / `skos:closeMatch`, justification `semapv:LexicalMatching`). Validated with the `sssom` Python package on every write. |
| `mappings/unified_chemical_mappings.tsv.gz` | **In-process runtime index.** 7-col gzipped TSV consumed by all transforms. Entity-centric (one row per primary CURIE). Needed because plain-string synonyms cannot be represented as SSSOM subjects. Holds CHEBI chemicals **and** non-CHEBI ingredients (FOODON foods, UBERON anatomy, ENVO environments) in a single file. |
| `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` | **Single source of truth.** Standards-compliant SSSOM mapping set covering xrefs (`skos:exactMatch`) + canonical names + free-text synonyms via synthetic `kgm.name:<slug>` subjects (`skos:exactMatch` / `skos:closeMatch`, justification `semapv:LexicalMatching`). Holds CHEBI chemicals **and** non-CHEBI ingredients (FOODON foods, UBERON anatomy, ENVO environments). Validated with the `sssom` Python package on every write. |
| `scripts/dump_unmapped_mediadive_ingredients.py` | Emits a MIM-compatible TSV of MediaDive ingredients still unmapped after the current mappings + `fuzzy_hydrate` retry, for curator review. |
| `mappings/culturebotai_reviewed_ingredients.tsv` | Authoritative reviewed source from CultureBotAI (priority=10). |
| `mappings/ingredient_mappings.sssom.tsv` | Authoritative SSSOM mapping set from the MediaIngredientMech sibling repo (priority=11). |
| `mappings/ingredient_mappings.sssom.tsv` | **Vendored copy** of the MediaIngredientMech SSSOM (priority=11). Auto-refreshed from the sibling repo on every consolidator run — never edit this file directly; edit upstream in MIM and let `sync_mim_sssom` overwrite it. |
| `../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv` | **Source of truth** for MIM mappings. The MediaIngredientMech repo (https://github.com/KG-Hub/MediaIngredientMech) is expected to be checked out as a sibling of `kg-microbe`. When the sibling and vendored copies diverge, the consolidator takes the sibling's content. |
| `mappings/chemical_mappings.tsv` | Legacy KEGG/BacDive primary mappings (may be absent). |
| `mappings/README.md` | Schema + regeneration instructions. |
| `scripts/consolidate_chemical_mappings.py` | Consolidator (run to rebuild). |
@@ -45,17 +44,29 @@ win tie-breaks during duplicate-name merging.
| `kg_microbe/transform_utils/madin_etal/chebi_manual_annotation.tsv` | Expert annotation source (may be absent). |
| `kg_microbe/transform_utils/bacdive/metabolite_mapping.json` | BacDive metabolite source (may be absent). |

## Schema: `unified_chemical_mappings.tsv.gz`
## Schema: SSSOM rows in `kgmicrobe_unified_entity_mappings.sssom.tsv.gz`

Per-entity attributes are reconstructed at read time by grouping rows on
`object_id`. Four row shapes (emitted by `export_unified_sssom`):

| Row shape | `subject_id` | `comment` | Carries |
|---|---|---|---|
| canonical name | `kgm.name:<slug>` | `canonical_name` | the entity's preferred label via `subject_label` / `object_label` |
| synonym | `kgm.name:<slug>` | `synonym` | one synonym per row via `subject_label` |
| xref | plain CURIE (e.g. `cas:7647-14-5`) | _empty_ | an equivalent identifier mapped to the entity |
| attribute carrier | _equal to_ `object_id` | _empty_ | extension columns only (when the entity has no other rows) |

Extension columns ride on every row as per-entity attributes:

| Column | Description |
|---|---|
| `id` | Primary key. Any supported ontology CURIE: `CHEBI:<int>` (preferred for chemicals), `FOODON:<int>`, `UBERON:<int>`, `ENVO:<int>`, etc. |
| `category` | Biolink category for the entry (`biolink:ChemicalSubstance`, `biolink:Food`, `biolink:AnatomicalEntity`, `biolink:EnvironmentalFeature`, …). Populated at consolidation time; downstream transforms read it directly instead of deriving category from the CURIE prefix. |
| `canonical_name` | Preferred name. Dominated by the highest-priority source. |
| `formula` | Molecular formula (chemicals only). Higher-priority wins. |
| `synonyms` | Pipe-delimited. Always unioned across all sources. |
| `xrefs` | Pipe-delimited. Union. Includes `cas:*`, `kegg.compound:*`, `pubchem.compound:*`, `MediaIngredientMech:*`, etc. |
| `sources` | Pipe-delimited provenance tags (one per contributing source loader). |
| `object_id` | Primary key — any supported ontology CURIE: `CHEBI:<int>`, `FOODON:<int>`, `UBERON:<int>`, `ENVO:<int>`, `pubchem.compound:<int>`, `cas:<dash-separated>`, etc. |
| `object_label` | The entity's canonical name. |
| `object_formula` | Molecular formula (chemicals only). Higher-priority source wins. |
| `object_category` | Biolink category (`biolink:ChemicalSubstance`, `biolink:Food`, `biolink:AnatomicalEntity`, `biolink:EnvironmentalFeature`, …). |
| `predicate_id` | `skos:exactMatch` (default) or `skos:closeMatch` / `skos:narrowMatch` / `skos:broadMatch` for asymmetric matches. |
| `mapping_justification` | `semapv:LexicalMatching` for synthetic name rows; `semapv:ManualMappingCuration` for curated xrefs. |
| `source` | Pipe-delimited provenance tags (one per contributing source loader). |
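
The read-time grouping described above can be sketched as follows. This is a minimal illustration, not the actual loader in `chemical_mapping_utils.py`: the function name `build_entity_index`, the in-memory record layout, and the `FOODON:0000000` placeholder ID are assumptions made for the example; only the column names and the four row shapes come from the tables above.

```python
from collections import defaultdict


def build_entity_index(rows):
    """Group SSSOM rows on object_id and rebuild per-entity attributes.

    Hypothetical helper: the real reader may differ in names and layout.
    """
    entities = defaultdict(
        lambda: {"canonical_name": None, "synonyms": set(), "xrefs": set(),
                 "formula": None, "category": None}
    )
    for row in rows:
        entity = entities[row["object_id"]]
        subject = row.get("subject_id", "")
        comment = row.get("comment", "")
        if subject.startswith("kgm.name:"):
            if comment == "canonical_name":
                # Canonical-name row: preferred label rides on subject_label.
                entity["canonical_name"] = row.get("subject_label") or row.get("object_label")
            elif comment == "synonym":
                # Synonym row: one synonym per row.
                entity["synonyms"].add(row["subject_label"])
        elif subject and subject != row["object_id"]:
            # Xref row: subject is a plain CURIE equivalent to the entity.
            entity["xrefs"].add(subject)
        # Attribute-carrier rows (subject == object) fall through to here.
        # Extension columns ride on every row, so take the first non-empty value.
        entity["canonical_name"] = entity["canonical_name"] or row.get("object_label") or None
        entity["formula"] = entity["formula"] or row.get("object_formula") or None
        entity["category"] = entity["category"] or row.get("object_category") or None
    return dict(entities)


# Illustrative rows only; values are examples, not taken from the real file.
nacl = {"object_id": "CHEBI:26710", "object_label": "sodium chloride",
        "object_formula": "ClNa", "object_category": "biolink:ChemicalSubstance"}
rows = [
    {**nacl, "subject_id": "kgm.name:sodium-chloride",
     "subject_label": "sodium chloride", "comment": "canonical_name"},
    {**nacl, "subject_id": "kgm.name:nacl", "subject_label": "NaCl", "comment": "synonym"},
    {**nacl, "subject_id": "cas:7647-14-5", "subject_label": "", "comment": ""},
    # Attribute carrier: subject equals object, extension columns only.
    {"subject_id": "FOODON:0000000", "subject_label": "", "comment": "",
     "object_id": "FOODON:0000000", "object_label": "example food",
     "object_formula": "", "object_category": "biolink:Food"},
]
index = build_entity_index(rows)
```

Grouping on `object_id` keeps the file itself row-oriented (and SSSOM-valid) while still giving transforms an entity-centric view in memory.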

## Priority system

@@ -120,15 +131,43 @@ poetry run python scripts/consolidate_chemical_mappings.py
```

Behaviour:
1. Seeds from the existing `mappings/unified_chemical_mappings.tsv.gz` (priority inferred per row from source labels).
1. Seeds from the existing `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` (the single source of truth; priority inferred per row from source labels).
2. Layers in any still-present legacy source files (absent ones are skipped).
3. Always loads `mappings/culturebotai_reviewed_ingredients.tsv` (priority=10).
4. Syncs + loads `mappings/ingredient_mappings.sssom.tsv` from the MIM sibling repo (priority=11).
4. Calls `sync_mim_sssom` to refresh `mappings/ingredient_mappings.sssom.tsv` from the MIM sibling repo at `../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv` (sibling wins on divergence; vendored is a cache, not a fork), then loads it (priority=11).
5. Enriches from `data/raw/chebi.db` via OAK (labels fill only when no higher-priority name is present; aliases always accumulate).
6. Harvests CHEBI xref labels via OAK into owning-record synonyms.
7. Propagates names across equivalent-CURIE records via xrefs (symmetric snapshot; no record merge).
8. Resolves name-index conflicts by source priority (no cross-CURIE merge pass).
9. Writes `mappings/unified_chemical_mappings.tsv.gz` and the SSSOM mapping product.
9. Writes `mappings/kgmicrobe_unified_entity_mappings.sssom.tsv.gz` (validated round-trip via the `sssom` package).

### MIM SSSOM source-of-truth contract

The MediaIngredientMech repo is the **authoritative** source for ingredient
mappings (priority=11). Its SSSOM lives at
`../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv` (sibling of
the kg-microbe repo). The vendored copy at
`mappings/ingredient_mappings.sssom.tsv` is a cache, refreshed on every
consolidator run by `sync_mim_sssom` (see `scripts/consolidate_chemical_mappings.py:182`):

| Sibling | Vendored | Sync action |
|---|---|---|
| present, content matches | present | no-op (`MIM SSSOM up-to-date`) |
| present, content differs | present | overwrite vendored (sibling wins) |
| present | absent | copy sibling → vendored |
| absent | present | warn, continue with stale vendored copy |
| absent | absent | **fatal** — script aborts with clone instructions |
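
The decision table above can be sketched as a small function. This is an illustration under stated assumptions, not the body of `sync_mim_sssom`: the name `sync_vendored`, its return strings, and the use of a byte-for-byte comparison are hypothetical choices made for the example.

```python
import shutil
import tempfile
import warnings
from pathlib import Path


def sync_vendored(sibling: Path, vendored: Path) -> str:
    """Refresh the vendored cache from the sibling checkout (hypothetical sketch)."""
    if sibling.exists():
        if vendored.exists() and sibling.read_bytes() == vendored.read_bytes():
            return "up-to-date"  # present + matching: no-op
        action = "overwritten" if vendored.exists() else "copied"
        vendored.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(sibling, vendored)  # sibling wins on divergence
        return action
    if vendored.exists():
        # Sibling missing: warn and carry on with the (possibly stale) cache.
        warnings.warn(f"{sibling} not found; using stale vendored copy {vendored}")
        return "stale"
    # Neither copy exists: nothing to load, so abort.
    raise FileNotFoundError(
        f"Neither {sibling} nor {vendored} exists; clone MediaIngredientMech as a sibling."
    )


# Walk through three rows of the table in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sibling = Path(tmp) / "MediaIngredientMech" / "ingredient_mappings.sssom.tsv"
    vendored = Path(tmp) / "kg-microbe" / "ingredient_mappings.sssom.tsv"
    sibling.parent.mkdir(parents=True)
    sibling.write_text("v1\n")
    first = sync_vendored(sibling, vendored)   # vendored absent
    second = sync_vendored(sibling, vendored)  # content matches
    sibling.write_text("v2\n")
    third = sync_vendored(sibling, vendored)   # content differs
    vendored_after = vendored.read_text()
```

Because the vendored file is always overwritten from the sibling, any local edit to it is lost on the next run, which is why the rules below forbid editing it directly.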

Rules:
- **Never edit the vendored copy directly** — your changes will be silently
overwritten by the next consolidator run.
- To change a mapping: edit `../MediaIngredientMech/mappings/ingredient_mappings.sssom.tsv`,
open a PR against MediaIngredientMech, and once it merges, re-run the consolidator.
- New contributors must clone MIM as a sibling:
```bash
cd "$(dirname "$(pwd)")"   # parent of kg-microbe; quoted so paths with spaces work
git clone https://github.com/KG-Hub/MediaIngredientMech.git
```

### Add a new mapping source

@@ -166,7 +205,7 @@ If a name should map but doesn't:

## Known limitations

- **Not ChEBI-only anymore**: `unified_chemical_mappings.tsv.gz` and the consolidator now support non-ChEBI primary IDs, including FOODON, UBERON, ENVO, NCIT, `pubchem.compound`, `cas`, `mediadive.ingredient`, and `kgmicrobe.compound`. Some downstream helpers and workflows are still ChEBI-oriented (for example `find_chebi_*` utilities), so callers that assume every row resolves to a ChEBI ID should handle non-ChEBI primary IDs explicitly.
- **Not ChEBI-only anymore**: the unified SSSOM and the consolidator support non-ChEBI primary IDs, including FOODON, UBERON, ENVO, NCIT, `pubchem.compound`, `cas`, `mediadive.ingredient`, and `kgmicrobe.compound`. Some downstream helpers and workflows are still ChEBI-oriented (for example `find_chebi_*` utilities), so callers that assume every row resolves to a ChEBI ID should handle non-ChEBI primary IDs explicitly.
- **CAS RN format**: stored as `cas:<dash-separated>` xrefs (e.g. `cas:7647-14-5`). Consumers must include the `cas:` prefix when calling `find_chebi_by_xref`.
- **Priority inference on baseline reseed**: when `load_existing_unified` re-ingests the current `.tsv.gz`, the priority field is reconstructed from the `sources` column via prefix matching. A brand-new priority tier requires updating `priority_for` in that loader as well.
- **ChEBI enrichment cost**: `enrich_with_chebi_synonyms` iterates every entry through an OAK adapter; it is the slowest step (~165k entries × label + aliases). If `data/raw/chebi.db` is absent, the enrichment is silently skipped.
54 changes: 54 additions & 0 deletions .claude/skills/kg-model-review/SKILL.md
@@ -82,8 +82,12 @@ biolink:EnvironmentalProcess
biolink:PathologicalProcess
biolink:Disease
biolink:GrowthMedium (KG-Microbe extension)
biolink:Procedure (used multi-cat as METPO:1001000|biolink:Procedure on kgmicrobe.assay nodes)
biolink:PhenotypicQuality (PATO terms, surfaced as ENVO has_attribute targets in madin_etal)
```

**Multi-category nodes**: a node's `category` column may carry pipe-delimited categories (e.g. `METPO:1001000|biolink:Procedure` on the 503 `kgmicrobe.assay:*` nodes). The reviewer accepts a node when at least one pipe-split component is a valid category; this is intentional, not a regression.
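
A minimal sketch of that acceptance rule, assuming a hypothetical helper `category_is_valid` and a deliberately shortened category set (the reviewer's real list is longer and its function names are not shown in this document):

```python
# Shortened, illustrative subset of the valid-category list.
VALID_CATEGORIES = {
    "biolink:Procedure",
    "biolink:PhenotypicQuality",
    "biolink:GrowthMedium",
    "biolink:Disease",
}


def category_is_valid(category: str, valid=VALID_CATEGORIES) -> bool:
    """Return True if any pipe-split component of `category` is in `valid`."""
    parts = [part.strip() for part in category.split("|") if part.strip()]
    return any(part in valid for part in parts)


multi = category_is_valid("METPO:1001000|biolink:Procedure")
only_metpo = category_is_valid("METPO:1001000")
empty = category_is_valid("")
```

Using `any` rather than `all` is what lets a METPO class ride alongside a Biolink category without being flagged.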

Valid predicates (flag anything not in this list):
```
biolink:has_phenotype
@@ -93,6 +97,7 @@ biolink:consumes
biolink:located_in
biolink:location_of
biolink:has_part
biolink:has_attribute (used by madin_etal for ENVO substrate → PATO quality attachment)
biolink:subclass_of
biolink:related_to
biolink:associated_with
@@ -120,6 +125,43 @@ biolink:contains_process
- Standard known prefixes: `NCBITaxon`, `CHEBI`, `GO`, `EC`, `RO`, `METPO`, `biolink`, `FOODON`, `UBERON`, `HP`, `MONDO`, `ENVO`, `infores`, `semapv`, `KGM`
- Flag any prefix not registered

#### Mapping files (curation TSVs)

Run with `--mappings` (in addition to a transform/merged scope) or
`--mappings-only` (just the curation TSVs) to validate every curation artifact
KG-Microbe ships and consumes. Four file groups are checked:

| Group | Example files | Schema |
|---|---|---|
| A — canonical | `kg_microbe/transform_utils/metatraits/mappings/{chemical,enzyme,metpo_alias,pathway,phenotype}_mappings.tsv` | 12-column canonical (`subject_label`, `object_id`, `predicate_id`, `mapping_justification`, `confidence`, …) |
| B — bespoke | `enzyme_name_to_go.tsv`, `special_chemical_mappings.tsv` | per-file expected columns |
| C — queues / audit / proposals | `mappings/mediadive_unmapped_ingredients_to_curate.tsv`, `mappings/culturebotai_reviewed_ingredients.tsv`, METPO proposal TSVs | status counts + sanity |
| D — SSSOM | `mappings/ingredient_mappings.sssom.tsv` | YAML metadata block + SSSOM required columns |

Per-row checks include: CURIE format on every `_id` column; `predicate_id`
restricted to the `skos:` namespace; `mapping_justification` restricted to
`semapv:`; `confidence` ∈ {high, medium, low}; deprecated biolink target
detection; METPO refs that don't exist in the ontology output; ontology IDs
(CHEBI/GO/EC/UBERON/ENVO/HP/MONDO/PATO/PR/CL/FOODON/NCBITaxon/OMP) that don't
resolve in `data/transformed/ontologies/*_nodes.tsv`.
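
The format-level checks in that list (CURIE shape, `skos:` predicates, `semapv:` justifications, confidence vocabulary) can be sketched as below. The function name `check_row`, the CURIE regular expression, and the message wording are assumptions for illustration; the ontology-resolution and deprecation checks need the transformed node files and are left out.

```python
import re

# Assumed CURIE shape: prefix (letters, digits, underscore, dot), colon, non-space local ID.
CURIE_RE = re.compile(r"^[A-Za-z][\w.]*:\S+$")
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def check_row(row: dict) -> list:
    """Return a list of human-readable problems for one mapping row."""
    problems = []
    for column, value in row.items():
        # Every non-empty *_id column must look like a CURIE.
        if column.endswith("_id") and value and not CURIE_RE.match(value):
            problems.append(f"{column}: not a CURIE: {value!r}")
    if not row.get("predicate_id", "").startswith("skos:"):
        problems.append("predicate_id: expected skos: namespace")
    if not row.get("mapping_justification", "").startswith("semapv:"):
        problems.append("mapping_justification: expected semapv: namespace")
    if row.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append(f"confidence: unexpected value {row.get('confidence')!r}")
    return problems


good = {"subject_label": "water", "object_id": "CHEBI:15377",
        "predicate_id": "skos:exactMatch",
        "mapping_justification": "semapv:LexicalMatching", "confidence": "high"}
bad = dict(good, object_id="CHEBI 15377", confidence="0.9")

good_problems = check_row(good)
bad_problems = check_row(bad)
```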

Cross-file checks:
- Same `subject_label` mapped to conflicting `object_id` across canonical files
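
One way to detect such conflicts is to index every `(subject_label, object_id)` pair by the file it came from. The sketch below assumes rows are already parsed into dicts; `find_conflicts`, the file names, and the example IDs are hypothetical and chosen only to illustrate the check.

```python
from collections import defaultdict


def find_conflicts(files: dict) -> dict:
    """Map each subject_label with more than one object_id to its targets.

    `files` maps a filename to a list of row dicts (hypothetical input shape).
    """
    seen = defaultdict(lambda: defaultdict(set))  # label -> object_id -> filenames
    for filename, rows in files.items():
        for row in rows:
            seen[row["subject_label"]][row["object_id"]].add(filename)
    return {
        label: {object_id: sorted(names) for object_id, names in targets.items()}
        for label, targets in seen.items()
        if len(targets) > 1  # same label, different targets
    }


files = {
    "chemical_mappings.tsv": [
        {"subject_label": "glucose", "object_id": "CHEBI:17234"},
        {"subject_label": "water", "object_id": "CHEBI:15377"},
    ],
    "phenotype_mappings.tsv": [
        {"subject_label": "glucose", "object_id": "CHEBI:4167"},
        {"subject_label": "water", "object_id": "CHEBI:15377"},
    ],
}
conflicts = find_conflicts(files)
```

Labels are compared exactly here; whether the real check normalizes case or whitespace first is not stated in this document.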

**Curation upgrade report.** When `--mappings` (or `--mappings-only`) is
specified, a markdown "Curation upgrade report" is appended to the artifact.
It summarises:
1. Top unmapped MediaDive ingredients by occurrence (drives MIM/CultureBotAI
curation priority).
2. Cross-file mapping conflicts.
3. Object IDs that don't resolve in the ontologies output.
4. Low-confidence canonical rows.
5. Prefix normalization candidates (e.g. `PUBCHEM.COMPOUND` → `pubchem.compound`).
6. CultureBotAI ingredient review queue status counts.

This section is what you hand to the upstream curation repos
(`CultureBotAI`, `MIM`, `CultureBotHT`) to drive new mappings.

## Usage

### Review all transforms
@@ -137,6 +179,16 @@ biolink:contains_process
/kg-model-review --merged
```

### Review mapping files only (no transforms)
```
/kg-model-review --mappings-only --format md
```

### Combined: merged KG plus mapping files
```
/kg-model-review --merged --mappings --format md
```

### Verbose output with example violations
```
/kg-model-review --transform bacdive --verbose
@@ -151,6 +203,8 @@ biolink:contains_process

- `--transform NAME` — review specific transform output in `data/transformed/NAME/`
- `--merged` — review `data/merged/` instead of individual transforms
- `--mappings` — additionally review curation TSVs (canonical, bespoke, queues, SSSOM) and append a curation upgrade report
- `--mappings-only` — review ONLY the curation TSVs (skip transforms/merged)
- `--format {text,md,json}` — output format (default: text)
- `--verbose` — show up to 5 example violating rows per check
- `--max-rows N` — limit rows sampled per file (default: 100000; use 0 for all)