Date: 2026-02-09
Status: ✅ Implemented
Added a mapping_method column to all SSSOM mappings to clearly distinguish between different mapping approaches:
- Curated dictionaries (BioProductDict from MicroMediaParam)
- Ontology-based matching (OAK/EBI OLS APIs)
- Manual curation (Original hand-curated mappings)
This addresses the need to track whether mappings come from ontology matching vs. other approaches.
curated_dictionary
- Source: Pre-curated biological products from MicroMediaParam
- Confidence: 0.98 (highest)
- Speed: Instant (no API calls)
- Examples: Yeast extract, Peptone, DNA, Agar, BSA, Blood, Serum

ontology_exact
- Source: Exact matches via OLS/OAK ontology APIs
- Confidence: 0.92-0.95
- Speed: Fast (cached API calls)
- Examples: Direct term matches, exact synonym matches

ontology_fuzzy
- Source: Fuzzy/approximate matches via OLS/OAK
- Confidence: 0.50-0.89
- Speed: Moderate (API calls)
- Examples: Multi-ontology searches, partial matches

manual_curation
- Source: Original manually curated mappings
- Confidence: 0.10-1.0 (varies)
- Speed: N/A (pre-existing)
- Examples: Pre-existing CHEBI mappings in knowledge base
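The confidence ranges listed above can double as a sanity check on a mappings file. The sketch below is a minimal illustration and not part of the project's scripts: the `EXPECTED_CONFIDENCE` table and the `out_of_range` helper are hypothetical names, the four keys are the `mapping_method` values used in the filtering examples in this document, and treating 0.98 as a fixed value for dictionary entries is an assumption.

```python
# Documented confidence range per mapping_method value (inclusive bounds).
# Assumption: curated dictionary entries always carry exactly 0.98.
EXPECTED_CONFIDENCE = {
    "curated_dictionary": (0.98, 0.98),
    "ontology_exact": (0.92, 0.95),
    "ontology_fuzzy": (0.50, 0.89),
    "manual_curation": (0.10, 1.00),
}


def out_of_range(rows, tolerance=1e-9):
    """Return rows whose confidence falls outside the range for their method."""
    flagged = []
    for row in rows:
        bounds = EXPECTED_CONFIDENCE.get(row["mapping_method"])
        if bounds is None:
            flagged.append(row)  # unknown method label
            continue
        low, high = bounds
        conf = float(row["confidence"])
        if not (low - tolerance <= conf <= high + tolerance):
            flagged.append(row)
    return flagged


# Made-up rows for illustration only.
rows = [
    {"subject_label": "Yeast extract", "mapping_method": "curated_dictionary", "confidence": 0.98},
    {"subject_label": "example A", "mapping_method": "ontology_exact", "confidence": 0.95},
    {"subject_label": "example B", "mapping_method": "ontology_fuzzy", "confidence": 0.93},
]

for row in out_of_range(rows):
    print(row["subject_label"], row["mapping_method"], row["confidence"])
```

A row flagged here is not necessarily wrong; it only means the confidence and the method label disagree with the ranges documented above and may deserve a second look.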
Quick analysis:
uv run python scripts/analyze_mapping_methods.py
Specific file:
uv run python scripts/analyze_mapping_methods.py output/culturemech_chebi_mappings_exact.sssom.tsv
Output:
Mapping Method Analysis
======================================================================
Total mappings: 2,900
Mapping Method Breakdown:
----------------------------------------------------------------------
Curated Dictionary (BioProductDict): 200 ( 6.9%)
Ontology Exact Match (OLS/OAK): 900 ( 31.0%)
Ontology Fuzzy Match (OLS/OAK): 350 ( 12.1%)
Manual Curation (Original): 1,450 ( 50.0%)
Total ontology-based (OAK/OLS): 1,250 ( 43.1%)
Total non-ontology (curated + manual): 1,650 ( 56.9%)
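A breakdown of this shape can be reproduced with a few lines of pandas. The following is a rough sketch of the idea, not the actual implementation of `analyze_mapping_methods.py`; the sample rows are made up, and a real run would load the SSSOM TSV instead.

```python
import pandas as pd

# Made-up rows for illustration; a real run would load the SSSOM TSV instead.
df = pd.DataFrame({
    "mapping_method": [
        "curated_dictionary", "ontology_exact", "ontology_exact",
        "ontology_fuzzy", "manual_curation", "manual_curation",
        "manual_curation", "manual_curation",
    ]
})

total = len(df)
counts = df["mapping_method"].value_counts()

# Per-method count and share of all mappings.
for method, n in counts.items():
    n = int(n)
    print(f"{method:20s} {n:6,d} ({n / total * 100:5.1f}%)")

# Combined ontology-based vs. everything else.
ontology_total = int(df["mapping_method"].isin(["ontology_exact", "ontology_fuzzy"]).sum())
non_ontology_total = total - ontology_total
print(f"Total ontology-based: {ontology_total:,d} ({ontology_total / total * 100:.1f}%)")
print(f"Total non-ontology:   {non_ontology_total:,d} ({non_ontology_total / total * 100:.1f}%)")
```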
import pandas as pd
# Load SSSOM file
df = pd.read_csv('output/culturemech_chebi_mappings_exact.sssom.tsv',
                 sep='\t', comment='#')
# Get only curated dictionary mappings
curated = df[df['mapping_method'] == 'curated_dictionary']
print(f"Curated biological products: {len(curated)}")
print(curated[['subject_label', 'object_id', 'confidence']].head(10))
# Get all ontology-based mappings (OAK/OLS combined)
ontology = df[df['mapping_method'].isin(['ontology_exact', 'ontology_fuzzy'])]
print(f"\nOntology-based mappings: {len(ontology)}")
# Get high-confidence automated mappings only
auto_high_conf = df[(df['mapping_method'].isin(['curated_dictionary', 'ontology_exact'])) &
                    (df['confidence'] >= 0.92)]
print(f"\nHigh-confidence automated: {len(auto_high_conf)}")
# Compare ontology vs non-ontology
ontology_count = len(df[df['mapping_method'].isin(['ontology_exact', 'ontology_fuzzy'])])
non_ontology_count = len(df) - ontology_count
print(f"\nOntology-based: {ontology_count} ({ontology_count/len(df)*100:.1f}%)")
print(f"Non-ontology: {non_ontology_count} ({non_ontology_count/len(df)*100:.1f}%)")

When running enrichment, you'll now see:
just enrich-sssom-exact
Output includes:
Enrichment Summary
======================================================================
Original mappings: 1302
Verified mappings: 1302
Invalid/deprecated: 0
New OLS mappings: 1248
Total enriched mappings: 2550
Confidence distribution:
0.9-1.0: 1450
0.8-0.9: 350
0.5-0.8: 600
0.0-0.5: 150
Mapping method breakdown:
Curated Dictionary (BioProductDict): 185 (7.3%)
Ontology Exact Match (OLS/OAK): 863 (33.8%)
Ontology Fuzzy Match (OLS/OAK): 200 (7.8%)
Manual Curation (Original): 1302 (51.1%)
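The confidence distribution in this summary groups mappings into four bands. One way to compute comparable bands from a loaded file is `pandas.cut`, sketched below. How the enrichment script treats values that fall exactly on a boundary is not documented here, so the left-closed bins are an assumption, and the sample values are made up.

```python
import pandas as pd

# Made-up confidence values for illustration only.
df = pd.DataFrame({"confidence": [0.98, 0.95, 0.92, 0.85, 0.80, 0.65, 0.50, 0.30, 1.0]})

# Assumption: bins are closed on the left, so 0.9 counts toward "0.9-1.0".
# The top edge is infinity so that a confidence of exactly 1.0 is kept.
edges = [0.0, 0.5, 0.8, 0.9, float("inf")]
labels = ["0.0-0.5", "0.5-0.8", "0.8-0.9", "0.9-1.0"]

binned = pd.cut(df["confidence"], bins=edges, labels=labels, right=False)

# Report the bands from highest to lowest, as in the summary above.
distribution = binned.value_counts().reindex(labels[::-1])

for label, n in distribution.items():
    print(f"{label}: {int(n)}")
```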
1. scripts/enrich_sssom_with_ols.py
- Updated create_mapping() to add mapping_method parameter
- Auto-determines method from mapping_tool if not specified
- Updated verify_existing_mappings() to preserve/add method
- Updated statistics output to show method breakdown
2. scripts/analyze_mapping_methods.py (NEW)
- Standalone script to analyze mapping methods
- Shows detailed breakdown by method and tool
- Confidence distribution by method
- Handles legacy files without mapping_method column
def create_mapping(..., mapping_method=None):
    # Auto-determine from the mapping tool string (`tool`) if not provided
    if mapping_method is None:
        if 'BioProductDict' in tool:
            mapping_method = 'curated_dictionary'
        elif 'OLS' in tool and 'exact' in tool:
            mapping_method = 'ontology_exact'
        elif 'OAK' in tool and 'synonym' in tool:
            mapping_method = 'ontology_exact'
        elif 'OLS' in tool and 'fuzzy' in tool:
            mapping_method = 'ontology_fuzzy'
        elif 'OAK' in tool or 'MultiOntology' in tool:
            mapping_method = 'ontology_fuzzy'
        else:
            mapping_method = 'manual_curation'

- Distinguish automated vs. manual mappings
- Track ontology-based vs. dictionary-based approaches
- Transparent methodology for publications
- Prioritize high-confidence methods
- Identify mappings needing manual review
- Filter by trust level
- Identify which methods work for specific ingredients
- Track success rates by method
- Optimize pipeline based on method performance
- Show percentage ontology-grounded
- Demonstrate automated vs. manual effort
- Quantify improvement from MicroMediaParam integration
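Several of the points above, such as tracking results by method or spotting methods whose mappings need review, come down to a per-method summary. The sketch below shows one possible way to build it with a pandas `groupby`; the sample rows are made up and the output column names (`count`, `mean`, `minimum`) are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "mapping_method": ["curated_dictionary", "ontology_exact", "ontology_exact",
                       "ontology_fuzzy", "ontology_fuzzy", "manual_curation"],
    "confidence": [0.98, 0.95, 0.92, 0.85, 0.60, 1.00],
})

# One row per method: how many mappings, and their mean and lowest confidence.
summary = (
    df.groupby("mapping_method")["confidence"]
      .agg(count="count", mean="mean", minimum="min")
      .sort_values("mean", ascending=False)
)

print(summary.round(3))
```

Methods with a low minimum or mean confidence are natural candidates for the manual review step.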
# Get mappings that are ontology-based and high confidence
export_df = df[(df['mapping_method'].isin(['ontology_exact', 'curated_dictionary'])) &
               (df['confidence'] >= 0.90)]
export_df.to_csv('high_confidence_ontology_mappings.tsv', sep='\t', index=False)

# Get fuzzy matches that might need manual verification
review_df = df[(df['mapping_method'] == 'ontology_fuzzy') &
               (df['confidence'] < 0.70)]
print("Ingredients needing manual review:")
for _, row in review_df.iterrows():
    print(f" {row['subject_label']:40s} → {row['object_label']:40s} ({row['confidence']:.2f})")

# Before MicroMediaParam integration
uv run python scripts/analyze_mapping_methods.py output/before_integration.sssom.tsv > before.txt
# After MicroMediaParam integration
uv run python scripts/analyze_mapping_methods.py output/after_integration.sssom.tsv > after.txt
# Compare
diff before.txt after.txt

Files created before 2026-02-09 won't have the mapping_method column.
Detection:
uv run python scripts/analyze_mapping_methods.py old_file.sssom.tsv
Output:
⚠️ Warning: 'mapping_method' column not found in SSSOM file
This column was added in the MicroMediaParam integration (2026-02-09)
Please re-run enrichment to add this column.
Mapping tool breakdown (legacy):
MicrobeMediaParam|v1.0: 629
CultureMech|manual: 395
EBI_OLS_API|fuzzy: 267
...
To upgrade: Re-run enrichment with the updated script.
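The same legacy check is easy to perform in other tooling before relying on the column. The sketch below uses only the standard library and is an illustration of the idea, not the analysis script's code: the `summarize` helper is a hypothetical name, the sample content and its `EX:` identifiers are placeholders, and it assumes a legacy file has at least a `mapping_tool` column.

```python
import csv
import io
from collections import Counter


def summarize(handle):
    """Count mappings by mapping_method, or by mapping_tool for legacy files."""
    # SSSOM TSV files may start with '#'-prefixed metadata lines; skip them.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = list(reader)
    fields = reader.fieldnames or []
    # Assumption: legacy files still carry a mapping_tool column.
    column = "mapping_method" if "mapping_method" in fields else "mapping_tool"
    return column, Counter(row[column] for row in rows)


# Made-up legacy content for illustration (no mapping_method column).
legacy = io.StringIO(
    "# example metadata line\n"
    "subject_label\tobject_id\tmapping_tool\n"
    "ingredient A\tEX:0001\tCultureMech|manual\n"
    "ingredient B\tEX:0002\tEBI_OLS_API|fuzzy\n"
    "ingredient C\tEX:0003\tEBI_OLS_API|fuzzy\n"
)

column, counts = summarize(legacy)
if column != "mapping_method":
    print("Warning: 'mapping_method' column not found; falling back to mapping_tool")
for value, n in counts.most_common():
    print(f"{value}: {n}")
```

For a file on disk, the same helper can be called on a handle opened with `open(path, newline="", encoding="utf-8")`.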
✅ Added mapping_method column to all SSSOM mappings
✅ Created analysis script (analyze_mapping_methods.py)
✅ Updated statistics output to show method breakdown
✅ Documented usage with examples and use cases
✅ Backward compatible (handles legacy files)
Next Steps:
- Re-run enrichment to add mapping_method column
- Use analyze_mapping_methods.py to see breakdown
- Filter by method for specific use cases
- Report ontology coverage statistics
Questions? See NORMALIZATION_IMPROVEMENTS.md for full integration details.