Skip to content

Latest commit

 

History

History
287 lines (223 loc) · 6.9 KB

File metadata and controls

287 lines (223 loc) · 6.9 KB

Exact Matching Enhancement - Quick Start Guide

Overview

The enhanced SSSOM enrichment pipeline now includes:

  • Exact matching strategies for higher precision
  • Normalization to handle hydrated salts and unicode variations
  • OAK integration for multi-ontology search and synonym matching
  • Preprocessing for solutions, brand names, and gas abbreviations

Quick Start

# 1. Extract unique ingredients
just extract-ingredients

# 2. Generate base SSSOM file
just generate-sssom

# 3. Run enhanced enrichment
just enrich-sssom-exact

Command Options

Basic Enrichment (Original)

just enrich-sssom-with-ols
  • Uses OLS fuzzy search only
  • Lower precision, more false positives
  • Faster execution

Enhanced Enrichment (New)

just enrich-sssom-exact
  • Uses exact matching first
  • OAK synonym search
  • Multi-ontology support
  • Higher precision, fewer false positives

Manual Control

uv run python scripts/enrich_sssom_with_ols.py \
    --input-sssom output/culturemech_chebi_mappings.sssom.tsv \
    --input-ingredients output/ingredients_unique.tsv \
    --output output/culturemech_chebi_mappings_exact.sssom.tsv \
    --use-oak \          # Enable OAK client
    --exact-first \      # Try exact matching first
    --rate-limit 5 \     # API rate limit (req/sec)
    --verbose            # Show progress

What Gets Improved?

1. Hydrated Salts

Before: MnCl2・6H2O (unmapped) After: Normalized to MnCl2 x 6H2O → matches CHEBI Impact: +800-1,500 matches

2. Gas Abbreviations

Before: N2 (unmapped) After: Expands to ["nitrogen gas", "molecular nitrogen", "dinitrogen"] Impact: +200 matches

3. Bio-Materials

Before: Yeast extract (unmapped in CHEBI) After: Searches FOODON for biological materials Impact: +3,000-5,000 matches

4. Solutions

Before: 5% Na2S solution (unmapped) After: Extracts Na2S as base chemical Impact: +1,000-1,500 matches

5. Brand Names

Before: Bacto Peptone (unmapped) After: Strips to Peptone Impact: +300-500 matches

Understanding Confidence Scores

Range Meaning Strategy Predicate
0.95-1.0 Exact match OLS exact skos:exactMatch
0.92-0.94 Synonym match OAK synonym skos:exactMatch
0.85-0.91 Multi-ontology CHEBI/FOODON skos:closeMatch
0.80-0.84 Multi-ontology FOODON only skos:closeMatch
0.50-0.79 Fuzzy match OLS fuzzy skos:closeMatch

Recommendation: Trust scores ≥0.90 for automated use. Review scores 0.80-0.89 manually.

Validation

Compare Results

uv run python scripts/validate_exact_matches.py \
    --before output/culturemech_chebi_mappings_enriched.sssom.tsv \
    --after output/culturemech_chebi_mappings_exact.sssom.tsv \
    --verbose

Shows:

  • Coverage improvement
  • Confidence distribution changes
  • New vs. lost mappings
  • Strategy breakdown

Test Normalization

uv run python scripts/test_normalization.py

Validates:

  • Unicode normalization
  • Solution parsing
  • Brand name stripping
  • Gas expansion
  • Bio-material detection

Statistics

The enrichment script now reports:

Preprocessing Statistics

  • Normalized ingredients (unicode fixes)
  • Solution parsed (concentration removed)
  • Gas expanded (abbreviations → full names)
  • Brand stripped (brand names removed)

Mapping Strategy Breakdown

  • OLS exact matches (confidence: 0.95)
  • OAK synonym matches (confidence: 0.92)
  • Multi-ontology matches (confidence: 0.80-0.85)
  • OLS fuzzy matches (confidence: 0.50-0.80)

OAK Client Statistics

  • Total searches performed
  • Exact matches found
  • Synonym matches found
  • Success rate

OLS Client Statistics

  • Total API requests
  • Cache hits/misses
  • Cache hit rate
  • Errors

Troubleshooting

OAK Not Working?

Problem: ⚠ Warning: Could not initialize OAK client

Solution:

# OAK dependencies should already be installed
uv pip list | grep oaklib

# If missing, reinstall
uv pip install -e .

No Improvements Seen?

Check:

  1. Using --exact-first flag?
  2. Using --use-oak flag?
  3. Check input file has unmapped ingredients
  4. Review statistics output

Slow Performance?

Optimize:

  • Increase --rate-limit (default: 5)
  • OAK caches locally, subsequent runs faster
  • OLS caching enabled by default

Advanced Usage

OAK Client Direct Testing

# Search CHEBI
uv run python -m culturemech.ontology.oak_client \
    --search "glucose" \
    --ontology chebi \
    --synonyms

# Multi-ontology search
uv run python -m culturemech.ontology.oak_client \
    --search "yeast extract" \
    --multi \
    --synonyms

OLS Client Direct Testing

# Exact search
uv run python -m culturemech.ontology.ols_client \
    --search "glucose" \
    --exact

# Fuzzy search
uv run python -m culturemech.ontology.ols_client \
    --search "glucose"

Custom Preprocessing

Edit scripts/enrich_sssom_with_ols.py to customize:

  • GAS_ABBREVIATIONS - Add more gas mappings
  • BIO_MATERIAL_KEYWORDS - Add more bio-material keywords
  • BRAND_PATTERNS - Add more brand name patterns
  • CONCENTRATION_PATTERN - Modify concentration regex

Integration with Existing Workflows

Before Enrichment

just extract-ingredients    # Extract unique ingredients
just generate-sssom        # Generate base SSSOM

After Enrichment

just enrich-with-chebi     # Apply CHEBI IDs to YAML files
just validate-recipes      # Validate recipe files

Full Pipeline

just sssom-pipeline        # Original (fuzzy only)
# OR
just extract-ingredients && \
    just generate-sssom && \
    just enrich-sssom-exact  # Enhanced (exact-first)

Performance Notes

  • OLS API: Rate-limited to 5 req/sec (configurable)
  • OAK: Local caching, faster after first run
  • Expected runtime: 5-10 minutes for ~4,000 unmapped ingredients
  • Cache location: ~/.cache/culturemech/

Output Files

  • output/culturemech_chebi_mappings_exact.sssom.tsv - Enhanced SSSOM file
  • ~/.cache/culturemech/ols/*.json - OLS cache files
  • ~/.cache/culturemech/oak/ - OAK cache directory

Need Help?

  1. Check EXACT_MATCHING_IMPLEMENTATION.md for technical details
  2. Run tests: uv run python scripts/test_normalization.py
  3. Check logs: Use --verbose flag
  4. Review confidence scores in output SSSOM file

Example Session

# Start fresh
just extract-ingredients
# Output: 5,058 unique ingredients (4,058 unmapped)

# Generate base SSSOM
just generate-sssom
# Output: 1,194 mapped (23.7%)

# Run enhanced enrichment
just enrich-sssom-exact
# Expected: 2,500-3,000 mapped (50-60%)

# Validate
uv run python scripts/validate_exact_matches.py \
    --before output/culturemech_chebi_mappings_enriched.sssom.tsv \
    --after output/culturemech_chebi_mappings_exact.sssom.tsv \
    --verbose

Version: 1.0 Date: 2026-02-06 Status: Production Ready