This document describes the workflow for curating media ingredient role assignments in MediaIngredientMech. Role curation assigns functional roles (e.g., CARBON_SOURCE, BUFFER, MINERAL) to ingredients based on evidence from CultureMech database annotations and scientific literature.
Current Status (as of 2026-03-15):
- 446 ingredients with roles (44.8% coverage of 996 mapped ingredients)
- 448 total role assignments (1.0 average per ingredient)
- Average confidence: 0.998 (on a 0.0-1.0 scale)
- 99.8% citation coverage (447/448 roles have structured evidence)
Role curation in MediaIngredientMech follows an evidence-first approach:
- CultureMech Baseline: Primary evidence comes from 8,644+ media formulations in the CultureMech database
- Occurrence-Weighted Confidence: Roles attested in more than 500 media receive the highest confidence scores
- Property-Based Scoring: "Defined component" vs "Undefined component" metadata informs confidence
- Structured Citations: All roles include DATABASE_ENTRY citations with occurrence statistics and property excerpts
Rather than auto-assigning roles to all 996 ingredients, we prioritize:
- High-occurrence ingredients first (top 100 by media usage)
- Clear, unambiguous roles (MINERAL, BUFFER) over complex metabolic functions
- Structured evidence with occurrence counts and property metadata
- Manual review for edge cases and conflicting roles
Source: 8,644+ media formulations from global culture collections
Coverage: 570 ingredients with role annotations
Format: Embedded in synonym text as "Role: [role_text]; Properties: [properties]"
Example:
Role: Mineral source; Properties: Defined component, Inorganic compound, Simple component
Mapping:
- CultureMech role text (e.g., "Mineral source") → IngredientRoleEnum value (e.g., MINERAL)
- See scripts/analyze_culturemech_roles.py::CULTUREMECH_ROLE_MAPPING for full mapping table
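The annotation format and mapping above can be illustrated with a minimal sketch. It assumes the annotation appears as a single "Role: ...; Properties: ..." string. The names `parse_annotation`, `map_role`, and `ROLE_MAPPING_SUBSET` are hypothetical, and the dictionary holds only the two pairs shown in this document, not the real CULTUREMECH_ROLE_MAPPING table.

```python
import re

# Illustrative subset only (assumption). The full table lives in
# scripts/analyze_culturemech_roles.py::CULTUREMECH_ROLE_MAPPING.
ROLE_MAPPING_SUBSET = {
    "Mineral source": "MINERAL",
    "Solvating media": "SALT",
}

# Matches the embedded format "Role: [role_text]; Properties: [properties]".
ANNOTATION_RE = re.compile(
    r"Role:\s*(?P<role>[^;]+);\s*Properties:\s*(?P<props>.+)"
)


def parse_annotation(text):
    """Return (role_text, properties) from an annotation string, or None."""
    match = ANNOTATION_RE.search(text)
    if match is None:
        return None
    role_text = match.group("role").strip()
    properties = [p.strip() for p in match.group("props").split(",") if p.strip()]
    return role_text, properties


def map_role(role_text):
    """Look up the enum value for a role text; None means 'unmapped'."""
    return ROLE_MAPPING_SUBSET.get(role_text)


example = (
    "Role: Mineral source; Properties: Defined component, "
    "Inorganic compound, Simple component"
)
role_text, properties = parse_annotation(example)
print(role_text, "->", map_role(role_text), properties)
```

Returning None for an unknown role text keeps unmapped annotations visible for a manual mapping decision instead of silently dropping them.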
Confidence Scoring Rules:
- "Defined component" + occurrence >500 → 1.0
- "Defined component" + occurrence 100-500 → 0.95
- "Defined component" + occurrence <100 → 0.9
- "Undefined component" → 0.8
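The scoring rules above translate directly into a small function. This is a sketch rather than the project's implementation: the name `score_confidence` is hypothetical, and returning None when neither property is present is an assumption, since the rules do not state a score for that case.

```python
def score_confidence(properties, occurrences):
    """Apply the documented confidence rules to one role annotation.

    properties:  list of property strings, e.g. ["Defined component", ...]
    occurrences: number of media in which the role annotation occurs
    """
    # Exact list membership, so "Undefined component" never counts as defined.
    if "Defined component" in properties:
        if occurrences > 500:
            return 1.0
        if occurrences >= 100:
            return 0.95
        return 0.9
    if "Undefined component" in properties:
        return 0.8
    # Assumption: the rules above do not cover missing properties,
    # so the caller must decide (e.g. route to manual review).
    return None


print(score_confidence(["Defined component", "Inorganic compound"], 6041))
```

An occurrence count of exactly 500 falls in the 100-500 band, matching the ">500" wording of the first rule.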
Status: Infrastructure built (src/mediaingredientmech/utils/doi_resolver.py), manual review deferred
Purpose: High-priority roles requiring publication-level evidence
Workflow: DOI resolution → APA citation generation → RoleCitation format
When to Use:
- Novel or unexpected role assignments
- Conflicting evidence between sources
- Roles requiring biochemical/metabolic context
Core Nutritional Roles:
- CARBON_SOURCE - Organic carbon for biosynthesis/energy
- NITROGEN_SOURCE - Nitrogen for amino acids/nucleotides
- MINERAL - Inorganic minerals (phosphate, sulfate, magnesium)
- TRACE_ELEMENT - Micronutrients (iron, zinc, cobalt)
- VITAMIN_SOURCE - Vitamins or vitamin precursors
- PROTEIN_SOURCE - Peptides, proteins, amino acids
- AMINO_ACID_SOURCE - Specific amino acids
Physicochemical Roles:
- BUFFER - pH buffering agents
- SALT - Ionic strength and osmotic balance
- SOLIDIFYING_AGENT - Gelling agents (agar)
Metabolic Roles:
- ENERGY_SOURCE - Primary energy substrate
- ELECTRON_ACCEPTOR - Terminal electron acceptor (nitrate, oxygen)
- ELECTRON_DONOR - Electron donor for chemolithotrophs
- COFACTOR_PROVIDER - Enzyme cofactors/prosthetic groups
Indicator Roles (added 2026-03-15):
- REDOX_INDICATOR - pH-dependent redox indicators (resazurin)
- PH_INDICATOR - pH indicator dyes
- SELECTIVE_AGENT - Antimicrobial/selective agents
- SURFACTANT - Surfactants/detergents

    media_roles:
      - role: MINERAL
        confidence: 1.0
        evidence:
          - reference_type: DATABASE_ENTRY
            reference_text: "CultureMech database (6041 occurrences as 'Mineral source')"
            url: "https://github.com/CultureBotAI/CultureMech"
            excerpt: "Role: Mineral source; Properties: Defined component, Inorganic compound, Simple component"
            curator_note: "Widespread use in media formulations (6041 occurrences). High confidence based on 'Defined component' property."

Fields:
- role: IngredientRoleEnum value (required)
- confidence: Float 0.0-1.0 (required)
- evidence: List of RoleCitation objects (required for quality)
  - reference_type: PUBLICATION, DATABASE_ENTRY, EXPERT_ANNOTATION
  - reference_text: Citation text with occurrence stats
  - url: Link to source
  - excerpt: Direct quote from source (role + properties)
  - curator_note: Contextual notes about assignment
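For readers who prefer types to YAML, the fields above can be modeled as two dataclasses. This is a hedged sketch: `RoleCitation` is named in this document, but `MediaRole`, the `REFERENCE_TYPES` set, and the validation in `__post_init__` are assumptions for illustration and may differ from the classes generated from the project schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Reference types listed in the field description above.
REFERENCE_TYPES = {"PUBLICATION", "DATABASE_ENTRY", "EXPERT_ANNOTATION"}


@dataclass
class RoleCitation:
    reference_type: str
    reference_text: str
    url: Optional[str] = None
    excerpt: Optional[str] = None
    curator_note: Optional[str] = None

    def __post_init__(self):
        if self.reference_type not in REFERENCE_TYPES:
            raise ValueError(f"unknown reference_type: {self.reference_type!r}")


@dataclass
class MediaRole:  # hypothetical name for one entry under media_roles
    role: str
    confidence: float
    evidence: List[RoleCitation] = field(default_factory=list)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


mineral = MediaRole(
    role="MINERAL",
    confidence=1.0,
    evidence=[
        RoleCitation(
            reference_type="DATABASE_ENTRY",
            reference_text="CultureMech database (6041 occurrences as 'Mineral source')",
            url="https://github.com/CultureBotAI/CultureMech",
        )
    ],
)
print(mineral.role, mineral.confidence, len(mineral.evidence))
```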
Each role is assigned independently with its own confidence score:
Example: Distilled Water

    media_roles:
      - role: MINERAL
        confidence: 1.0
        evidence: [...CultureMech 4105 occurrences as "Mineral source"...]
      - role: SALT  # Solvating media
        confidence: 1.0
        evidence: [...CultureMech 4105 occurrences as "Solvating media"...]

Some role combinations require additional context:
- CARBON_SOURCE + ELECTRON_ACCEPTOR: Rare, but valid for compounds like fumarate
- BUFFER + SELECTIVE_AGENT: Requires pH context (e.g., acidic pH inhibits some organisms)
Validation: scripts/validate_roles.py flags these for manual review
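A check of this kind can be sketched as a subset test over the roles assigned to one ingredient. The function name `flag_role_combinations` and the `REVIEW_COMBINATIONS` list are hypothetical; only the two combinations named above are included, and the actual logic in scripts/validate_roles.py may differ.

```python
# Combinations named in this document as needing additional context.
REVIEW_COMBINATIONS = [
    frozenset({"CARBON_SOURCE", "ELECTRON_ACCEPTOR"}),
    frozenset({"BUFFER", "SELECTIVE_AGENT"}),
]


def flag_role_combinations(roles):
    """Return the review-worthy combinations present among an ingredient's roles."""
    present = set(roles)
    return [sorted(combo) for combo in REVIEW_COMBINATIONS if combo <= present]


print(flag_role_combinations(["CARBON_SOURCE", "ELECTRON_ACCEPTOR", "MINERAL"]))
```

The function only reports combinations; whether a flagged pair is valid (as for fumarate) remains a manual decision.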
Purpose: Extract and analyze role data from CultureMech synonyms
Steps:
- Run scripts/analyze_culturemech_roles.py:
  - Parses 996 ingredient records for role annotations
  - Generates role distribution CSV
  - Creates top 100 cross-reference with confidence scores
  - Output: data/analysis/top100_role_crossref.yaml
- Review unmapped role texts:
  - Example: "Growth factor" (60 occurrences) - needs mapping decision
  - Update CULTUREMECH_ROLE_MAPPING if new enum values warranted
Outputs:
- data/analysis/culturemech_role_distribution.csv - Role frequency table
- data/analysis/top100_role_crossref.yaml - Top 100 ingredient cross-reference
Purpose: Add roles for highest-occurrence ingredients
Steps:
- Dry-run preview:
    PYTHONPATH=src python scripts/extract_top100_roles.py --dry-run
- Execute extraction:
    PYTHONPATH=src python scripts/extract_top100_roles.py
- Review curation history:
  - Check data/curated/mapped_ingredients.yaml for new role assignments
  - Verify confidence scores and citations
Features:
- Deduplication: Skips roles already present
- Structured citations: Includes occurrence counts and property metadata
- Audit trail: All changes logged in curation_history
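The deduplication and audit-trail behaviour listed above can be sketched on plain dictionaries. Everything here is illustrative: the function name `add_role_if_missing`, the shape of a curation_history entry, and the "ADD_MEDIA_ROLE" action label are assumptions, not the format used by scripts/extract_top100_roles.py.

```python
from datetime import datetime, timezone


def add_role_if_missing(ingredient, new_role):
    """Append new_role unless the same role is already present.

    Returns True if the role was added, False if it was skipped.
    """
    roles = ingredient.setdefault("media_roles", [])
    if any(existing["role"] == new_role["role"] for existing in roles):
        return False  # deduplication: role already present

    roles.append(new_role)
    # Assumed audit entry layout; the real curation_history format may differ.
    ingredient.setdefault("curation_history", []).append(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": "ADD_MEDIA_ROLE",
            "role": new_role["role"],
        }
    )
    return True


ingredient = {"media_roles": [{"role": "MINERAL", "confidence": 1.0}]}
print(add_role_if_missing(ingredient, {"role": "MINERAL", "confidence": 0.9}))
print(add_role_if_missing(ingredient, {"role": "SALT", "confidence": 1.0}))
```

Skipped duplicates leave both the role list and the history untouched, so re-running the extraction stays idempotent under this sketch.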
Purpose: Upgrade minimal citations with structured CultureMech metadata
Steps:
- Identify generic citations:
  - Pattern: "Imported from CultureMech pipeline" (no occurrence stats)
  - 82 roles upgraded in top 100
- Run enrichment:
    PYTHONPATH=src python scripts/enrich_existing_roles.py
- Validate improvements:
  - Check upgraded citations include occurrence counts
  - Verify property excerpts present
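The enrichment step amounts to replacing each generic citation with one built from CultureMech metadata. The sketch below is an assumption about how that could look, not the contents of scripts/enrich_existing_roles.py; the helper names `is_generic`, `build_structured_citation`, and `enrich_evidence` are hypothetical. It leaves curator_note out because that text is written by a curator.

```python
GENERIC_TEXT = "Imported from CultureMech pipeline"
CULTUREMECH_URL = "https://github.com/CultureBotAI/CultureMech"


def is_generic(citation):
    """True for the minimal citation pattern that carries no occurrence stats."""
    return citation.get("reference_text", "").strip() == GENERIC_TEXT


def build_structured_citation(role_text, occurrences, properties):
    """Build a DATABASE_ENTRY citation with occurrence count and property excerpt."""
    return {
        "reference_type": "DATABASE_ENTRY",
        "reference_text": (
            f"CultureMech database ({occurrences} occurrences as '{role_text}')"
        ),
        "url": CULTUREMECH_URL,
        "excerpt": f"Role: {role_text}; Properties: {', '.join(properties)}",
    }


def enrich_evidence(evidence, role_text, occurrences, properties):
    """Replace generic citations; keep already-structured ones unchanged."""
    return [
        build_structured_citation(role_text, occurrences, properties)
        if is_generic(citation)
        else citation
        for citation in evidence
    ]


before = [{"reference_text": GENERIC_TEXT, "reference_type": "DATABASE_ENTRY"}]
after = enrich_evidence(
    before,
    "pH dependent redox indicator",
    1307,
    ["Defined component", "Organic compound", "Simple component"],
)
print(after[0]["reference_text"])
```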
Before:

    evidence:
      - reference_text: "Imported from CultureMech pipeline"
        reference_type: DATABASE_ENTRY

After:

    evidence:
      - reference_text: "CultureMech database (1307 occurrences as 'pH dependent redox indicator')"
        reference_type: DATABASE_ENTRY
        url: "https://github.com/CultureBotAI/CultureMech"
        excerpt: "Role: pH dependent redox indicator; Properties: Defined component, Organic compound, Simple component"
        curator_note: "Widespread use in anaerobic media formulations (1307 occurrences). High confidence based on 'Defined component' property."

Purpose: Quality assurance and progress tracking
Steps:
- Run validation:
    PYTHONPATH=src python scripts/validate_roles.py
  Checks:
  - Enum validity (all roles in VALID_MEDIA_ROLES)
  - Citation coverage (all roles have evidence)
  - Confidence consistency (aligns with properties)
  - Multi-role coherence (no unexpected conflicts)
- Generate statistics:
    PYTHONPATH=src python scripts/generate_role_statistics.py
  Outputs: data/analysis/role_statistics_report.yaml
  - Summary metrics (coverage, confidence, citation types)
  - Role distribution histogram
  - Top 20 ingredients by occurrence
  - Confidence score distribution
- Review validation errors:
  - Example: "Water (base)" has invalid role SOLVENT (not in enum)
  - Action: Add to enum or change to SALT
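The summary metrics from the statistics step (coverage, average confidence, citation coverage) can be computed as in the sketch below. It assumes ingredients are dictionaries with an optional media_roles list, as in the YAML examples in this document; `summarize_roles` is a hypothetical name, not the function used by scripts/generate_role_statistics.py.

```python
def summarize_roles(ingredients):
    """Compute coverage, average confidence, and citation coverage."""
    total = len(ingredients)
    with_roles = [ing for ing in ingredients if ing.get("media_roles")]
    roles = [role for ing in with_roles for role in ing["media_roles"]]
    cited = [role for role in roles if role.get("evidence")]

    return {
        "ingredients_with_roles": len(with_roles),
        "coverage_pct": round(100 * len(with_roles) / total, 1) if total else 0.0,
        "total_roles": len(roles),
        "avg_confidence": (
            round(sum(role["confidence"] for role in roles) / len(roles), 3)
            if roles
            else None
        ),
        "citation_coverage_pct": (
            round(100 * len(cited) / len(roles), 1) if roles else None
        ),
    }


sample = [
    {"media_roles": [
        {"role": "MINERAL", "confidence": 1.0, "evidence": [{"reference_text": "a"}]},
        {"role": "SALT", "confidence": 0.9, "evidence": [{"reference_text": "b"}]},
    ]},
    {"media_roles": [{"role": "BUFFER", "confidence": 0.8, "evidence": []}]},
    {"media_roles": []},
    {},
]
print(summarize_roles(sample))
```

The same arithmetic reproduces the reported coverage: 446 of 996 ingredients rounds to 44.8%.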
Minimum Requirements:
- ✅ reference_type specified (DATABASE_ENTRY, PUBLICATION, etc.)
- ✅ reference_text with occurrence stats or publication details
- ✅ url to source (GitHub, DOI link)
- ✅ excerpt with direct quote from source
- ✅ curator_note with context (high occurrence, property-based confidence)
RED FLAGS:
- ❌ Empty reference_text
- ❌ Generic citations without metadata ("Imported from...")
- ❌ Missing occurrence counts for DATABASE_ENTRY
- ❌ Confidence >0.95 without "Defined component" property
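The red flags above lend themselves to an automated check. The following is a sketch under the assumption that a role is a dictionary shaped like the YAML examples in this document; `citation_red_flags` and the flag wording are hypothetical, and the properties are read from the excerpt field, which may not be how the project's validator does it.

```python
import re


def _properties(excerpt):
    """Extract the property names from a 'Role: ...; Properties: ...' excerpt."""
    _, _, props = excerpt.partition("Properties:")
    return {p.strip() for p in props.split(",") if p.strip()}


def citation_red_flags(role):
    """Return a list of red-flag descriptions for one role assignment."""
    flags = []
    evidence = role.get("evidence", [])

    for citation in evidence:
        text = (citation.get("reference_text") or "").strip()
        if not text:
            flags.append("empty reference_text")
        elif text.startswith("Imported from"):
            flags.append("generic citation without metadata")
        elif (
            citation.get("reference_type") == "DATABASE_ENTRY"
            and not re.search(r"\d+\s+occurrences", text)
        ):
            flags.append("missing occurrence count for DATABASE_ENTRY")

    defined = any(
        "Defined component" in _properties(c.get("excerpt") or "") for c in evidence
    )
    if role.get("confidence", 0.0) > 0.95 and not defined:
        flags.append("confidence >0.95 without 'Defined component' property")
    return flags


role = {
    "role": "MINERAL",
    "confidence": 1.0,
    "evidence": [
        {"reference_type": "DATABASE_ENTRY",
         "reference_text": "Imported from CultureMech pipeline"}
    ],
}
print(citation_red_flags(role))
```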
Before committing role changes, run:

    # 1. Validate all role assignments
    PYTHONPATH=src python scripts/validate_roles.py
    # 2. Generate updated statistics
    PYTHONPATH=src python scripts/generate_role_statistics.py
    # 3. Review outputs
    cat data/analysis/role_statistics_report.yaml
    # 4. Check for errors in validation
    #    - 0 errors: ✅ Proceed
    #    - Warnings only: ⚠️ Review and decide
    #    - Errors present: ❌ Fix before commit

High Confidence (0.95-1.0):
- "Defined component" in properties
- Occurrence >100 media
- Single, unambiguous role
Medium Confidence (0.8-0.94):
- "Undefined component" or missing properties
- Occurrence <100 media
- Multi-role ingredient with context
Low Confidence (<0.8):
- Provisional assignment pending expert review
- Conflicting evidence from sources
- Novel/unexpected role
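The three tiers can be expressed as a small classifier for use in reports. This sketch assumes the bands are contiguous, so a score between 0.94 and 0.95 (which the ranges above leave unspecified) is treated as medium; `confidence_tier` is a hypothetical name.

```python
def confidence_tier(confidence):
    """Map a confidence score in [0.0, 1.0] to 'high', 'medium', or 'low'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.95:
        return "high"
    if confidence >= 0.8:
        # Assumption: scores in the unspecified 0.94-0.95 gap count as medium.
        return "medium"
    return "low"


for score in (1.0, 0.95, 0.9, 0.8, 0.5):
    print(score, confidence_tier(score))
```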
Target: Achieve >80% coverage (800+ ingredients with roles)
Approach:
- Extend analysis to all 570 ingredients with CultureMech annotations
- Focus on occurrence >50 media first
- Manual review for <50 occurrence ingredients
High-Priority Roles:
- ELECTRON_ACCEPTOR/DONOR - requires metabolic context
- COFACTOR_PROVIDER - needs biochemical evidence
- SELECTIVE_AGENT - antimicrobial spectrum details
Workflow:
- Identify ingredients needing DOI citations
- Search PubMed/CrossRef for relevant publications
- Use doi_resolver.py to fetch metadata
- Add PUBLICATION citations alongside DATABASE_ENTRY
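The DOI step can be sketched with the public Crossref REST API, which returns work metadata as JSON under a "message" key. This is not the interface of doi_resolver.py: the function names are hypothetical, there is no caching or rate limiting, and the reference_text layout is a simplified author-year string rather than full APA style.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_work(doi, timeout=10.0):
    """Fetch the Crossref metadata record ('message') for a DOI."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["message"]


def to_publication_citation(work):
    """Convert a Crossref work record into a PUBLICATION citation dict."""
    authors = ", ".join(
        a["family"] for a in work.get("author", []) if a.get("family")
    )
    date_parts = work.get("issued", {}).get("date-parts", [])
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    title = (work.get("title") or [""])[0]
    journal = (work.get("container-title") or [""])[0]
    return {
        "reference_type": "PUBLICATION",
        # Simplified layout (assumption), not strict APA formatting.
        "reference_text": f"{authors} ({year}). {title}. {journal}.",
        "url": f"https://doi.org/{work.get('DOI', '')}",
    }
```

A call such as `to_publication_citation(fetch_crossref_work(doi))` would yield a citation that can be appended to a role's evidence list next to its DATABASE_ENTRY citation. Keeping the network call separate from the formatting makes the formatter testable offline.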
Features:
- Web-based role assignment interface
- Side-by-side evidence comparison (CultureMech vs PubMed)
- Batch approval for high-confidence assignments
- Export to KGX format for KG-Microbe integration
Goal: Import biochemical roles from ChEBI ontology
Example: CHEBI:15377 (water) has role "solvent" in ChEBI
Action: Map ChEBI roles to IngredientRoleEnum, auto-populate where confident
Symptom: Invalid media role at index 0: SOLVENT
Cause: Role value not in VALID_MEDIA_ROLES set
Fix:
- Check if role should be added to enum (src/mediaingredientmech/schema/mediaingredientmech.yaml)
- Update VALID_MEDIA_ROLES in src/mediaingredientmech/curation/ingredient_curator.py
- Or change role value to existing enum (e.g., SOLVENT → SALT)
Symptom: "Roles with citations: 250/500 (50%)"
Cause: Missing evidence entries in role assignments
Fix:
- Run scripts/enrich_existing_roles.py to upgrade generic citations
- For remaining gaps, add manual citations using curator.add_media_role()
Symptom: Warning about "Defined component" with low confidence
Cause: Manual override or incorrect property parsing
Fix:
- Check cross-reference data for ingredient
- Verify occurrence count and properties
- Recalculate confidence using rules in "Data Sources" section
Scripts:
- scripts/analyze_culturemech_roles.py - CultureMech data extraction
- scripts/extract_top100_roles.py - Top 100 role assignment
- scripts/enrich_existing_roles.py - Citation enrichment
- scripts/validate_roles.py - Validation checks
- scripts/generate_role_statistics.py - Statistics reporting
Data:
- data/curated/mapped_ingredients.yaml - Main ingredient database
- data/analysis/culturemech_role_distribution.csv - Role frequency
- data/analysis/top100_role_crossref.yaml - Top 100 cross-reference
- data/analysis/role_statistics_report.yaml - Comprehensive statistics
Utilities:
- src/mediaingredientmech/utils/doi_resolver.py - DOI resolution client
- src/mediaingredientmech/curation/ingredient_curator.py - Core curation logic
- CultureMech Repository: https://github.com/CultureBotAI/CultureMech
- MediaIngredientMech Schema: src/mediaingredientmech/schema/mediaingredientmech.yaml
- Role Enum Documentation: Lines 466-502 in schema
- Crossref API: https://api.crossref.org/ (for DOI resolution)
Phase 1: Schema Extensions
- Added 4 new IngredientRoleEnum values: REDOX_INDICATOR, PH_INDICATOR, SELECTIVE_AGENT, SURFACTANT
- Updated VALID_MEDIA_ROLES in ingredient_curator.py
Phase 2: DOI Infrastructure
- Created doi_resolver.py with Crossref API integration
- Caching and rate limiting implemented
- Ready for future DOI literature review
Phase 3: Top 100 Curation
- Analyzed 570 ingredients with CultureMech role annotations
- Extracted roles for top 100 high-occurrence ingredients
- Added 18 new role assignments (82 already existed)
- Enriched 82 existing citations with structured metadata
Phase 4: Validation and Statistics
- Generated comprehensive statistics report
- 446 ingredients with roles (44.8% coverage)
- 448 total roles, 99.8% citation coverage
- Average confidence: 0.998
Metrics:
- Analysis time: ~2 hours
- Lines of code added: ~2000
- Data quality: Production-ready