This document describes the detailed methodology used to create and maintain the cofactor reference data for the MicroGrowAgents system.
The cofactor curation process integrates data from multiple authoritative biological databases into structured reference files that power the CofactorMediaAgent. The methodology emphasizes:
- Multi-source validation: Cross-referencing across ChEBI, KEGG, BRENDA, and ExplorEnz
- Literature-based curation: Specialized cofactors validated against primary literature
- Pattern-based mapping: Scalable EC-to-cofactor relationships using wildcards
- Reproducibility: All curation steps documented with version control
Data Sources:
- ChEBI ontology search for term "cofactor" (CHEBI:23357)
- KEGG compound search for "coenzyme" and "prosthetic group"
- Literature review for specialized cofactors (lanthanides, PQQ, archaeal cofactors)
Inclusion Criteria:
- Essential for enzyme catalysis or electron transfer
- Chemically well-defined structure
- Present in at least one known microbial enzyme
- Has a ChEBI ID or KEGG compound ID
Exclusion Criteria:
- Generic terms (e.g., "organic cofactor")
- Insufficiently characterized compounds
- Eukaryote-specific cofactors not found in bacteria/archaea
Categories defined:
- Vitamins - Organic cofactors derived from dietary vitamins
- Metals - Metal ions and metalloenzyme prosthetic groups
- Nucleotides - Nucleotide-based redox carriers and energy currencies
- Energy Cofactors - Group transfer and energy metabolism cofactors
- Other - Specialized cofactors (PQQ, archaeal cofactors, etc.)
Categorization Process:

    # Pseudo-algorithm
    for cofactor in candidate_list:
        if has_vitamin_precursor(cofactor):
            category = "vitamins"
        elif contains_metal(cofactor):
            category = "metals"
        elif is_nucleotide_derivative(cofactor):
            category = "nucleotides"
        elif is_group_transfer_cofactor(cofactor):
            category = "energy_cofactors"
        else:
            category = "other"

Process:
- Search ChEBI ontology by cofactor name
- Verify chemical structure matches expected cofactor
- Select most specific ChEBI ID (e.g., CHEBI:15846 for NAD+ vs CHEBI:13389 for NAD)
- Record primary ID and all synonyms
ChEBI Search Tool: https://www.ebi.ac.uk/chebi/advancedSearchFT.do
Example:

    thiamine_pyrophosphate:
      id: "CHEBI:9532"  # Exact match from ChEBI search
      names: ["TPP", "ThDP", "Thiamine diphosphate"]  # All synonyms

Process:
- Search KEGG pathway database for cofactor biosynthesis
- Record pathway IDs (ko##### format)
- Validate pathway relevance to bacterial/archaeal metabolism
KEGG Resources:
- Pathway database: https://www.genome.jp/kegg/pathway.html
- Compound database: https://www.genome.jp/kegg/compound/
Example Pathways:
- ko00730: Thiamine metabolism (for TPP)
- ko00760: Nicotinate and nicotinamide metabolism (for NAD/NADP)
- ko00780: Biotin metabolism
Process:
- Query BRENDA for enzymes requiring specific cofactor
- Extract EC numbers from enzyme entries
- Generalize to EC patterns where appropriate
- Validate against ExplorEnz classifications
Example:

    pyridoxal_phosphate:
      ec_associations: ["2.6.1.-", "4.1.1.-", "5.1.1.-"]  # All aminotransferases, decarboxylases, racemases

Validation Checks:
- All cofactors have valid ChEBI IDs
- All synonyms verified against ChEBI
- KEGG pathway IDs are valid
- EC associations match BRENDA/ExplorEnz
- No duplicate cofactor entries
- All categories properly defined
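
Some of these checks can be automated. The following is a minimal sketch, assuming each hierarchy entry is a dictionary with an `id` field (as in the examples above) and a hypothetical `kegg_pathways` list; the function name `validate_hierarchy` is illustrative and is not part of the existing scripts. It only checks identifier syntax and duplicate ChEBI IDs, so it does not replace verification against ChEBI or KEGG themselves.

```python
import re

# Syntax-only patterns; they do not confirm that an ID exists in the database.
CHEBI_ID = re.compile(r"^CHEBI:\d+$")
KEGG_PATHWAY_ID = re.compile(r"^ko\d{5}$")


def validate_hierarchy(entries):
    """Return a list of problems found in a cofactor-key -> entry mapping.

    Assumed entry fields: "id" (ChEBI ID) and "kegg_pathways" (list of IDs).
    """
    problems = []
    seen_ids = {}  # ChEBI ID -> first cofactor key that used it
    for key, entry in entries.items():
        chebi_id = str(entry.get("id", ""))
        if not CHEBI_ID.match(chebi_id):
            problems.append(f"{key}: malformed ChEBI ID {chebi_id!r}")
        elif chebi_id in seen_ids:
            problems.append(f"{key}: duplicates {seen_ids[chebi_id]} ({chebi_id})")
        else:
            seen_ids[chebi_id] = key
        for pathway in entry.get("kegg_pathways", []):
            if not KEGG_PATHWAY_ID.match(pathway):
                problems.append(f"{key}: malformed KEGG pathway ID {pathway!r}")
    return problems
```

An empty list means no syntactic problems were found; each string in a non-empty list names the offending cofactor key.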
Process:
- For each EC class (1-6), query BRENDA for cofactor requirements
- Extract cofactor-EC relationships from enzyme entries
- Identify patterns across EC subclasses
BRENDA Query Example:
EC 1.1.1.- (NAD/NADP-dependent dehydrogenases)
→ Primary: NAD+, NADP+
→ Optional: Zn2+ (for some alcohol dehydrogenases)
Pattern Hierarchy:
- Exact match (highest priority): "1.1.1.27" - specific enzyme
- 3-digit pattern: "1.1.1.-" - EC sub-subclass
- 2-digit pattern: "1.1.-.-" - EC subclass (rarely used)
- 1-digit pattern: Never used (too broad)
Generalization Rules:
- If >80% of enzymes in EC subclass share cofactor → create pattern
- If cofactor is optional in <50% of enzymes → mark as "optional"
- If conflicting evidence → use exact EC number
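
The pattern hierarchy and the 80% rule can be sketched in code as follows. This is an illustration under stated assumptions, not the project's implementation: the function names `generalize`, `candidate_patterns`, and `lookup_cofactors` and the dictionary shapes are hypothetical. Only the >80% rule is implemented here; the "optional" and conflicting-evidence rules are left out.

```python
from collections import Counter, defaultdict


def generalize(ec_cofactors, threshold=0.8):
    """Propose 3-digit patterns from {exact EC number: set of cofactor keys}.

    A cofactor is attached to a pattern when it occurs in more than
    `threshold` of the enzymes sharing the first three EC fields.
    """
    groups = defaultdict(list)
    for ec, cofactors in ec_cofactors.items():
        prefix = ec.rsplit(".", 1)[0]  # "1.1.1.27" -> "1.1.1"
        groups[prefix].append(set(cofactors))

    patterns = {}
    for prefix, cofactor_sets in groups.items():
        counts = Counter(cf for s in cofactor_sets for cf in s)
        total = len(cofactor_sets)
        shared = sorted(cf for cf, n in counts.items() if n / total > threshold)
        if shared:
            patterns[prefix + ".-"] = shared
    return patterns


def candidate_patterns(ec_number):
    """Yield lookup keys for an EC number, most specific first."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    yield ec_number                          # exact match
    yield ".".join(parts[:3] + ["-"])        # 3-digit pattern
    yield ".".join(parts[:2] + ["-", "-"])   # 2-digit pattern
    # The 1-digit pattern is deliberately never tried.


def lookup_cofactors(ec_number, ec_map):
    """Return (matched pattern, mapping) using exact > 3-digit > 2-digit."""
    for pattern in candidate_patterns(ec_number):
        if pattern in ec_map:
            return pattern, ec_map[pattern]
    return None, None
```

With a map containing both "1.1.1.27" and "1.1.1.-", a lookup for "1.1.1.27" returns the exact entry, while any other 1.1.1.x number falls back to the 3-digit pattern.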
Definitions:
- Primary: Absolutely required for catalytic activity
- Optional: Enhances activity or is alternative substrate
Classification Process:

    "1.1.1.-":
      primary: ["nad", "nadp"]  # Required for all dehydrogenases
      optional: ["zinc_ion"]  # Only for alcohol dehydrogenases

Process:
- Query MetaCyc for EC number
- Verify cofactor requirements match BRENDA
- Resolve conflicts by checking primary literature
Validation Checks:
- All EC patterns are valid (e.g., "1.1.1.-" not "1.1.1")
- No conflicting mappings (same EC → different cofactors)
- All cofactor keys reference cofactor_hierarchy.yaml entries
- Pattern hierarchy respected (exact > 3-digit > 2-digit)
- Primary/optional classification justified
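
A minimal sketch of how the syntactic parts of these checks might be scripted. The names `EC_PATTERN` and `validate_ec_map` are hypothetical, and the map is assumed to have the `primary`/`optional` layout shown in the classification example. Because the map is keyed by EC pattern, the sketch flags a cofactor listed as both primary and optional rather than searching for duplicate keys; whether a classification is justified still needs manual review.

```python
import re

# Accepts an exact EC number ("1.1.1.27"), a 3-digit pattern ("1.1.1.-"),
# or a 2-digit pattern ("1.1.-.-"). Rejects "1.1.1" and "1.-.-.-".
EC_PATTERN = re.compile(r"^[1-9]\d*\.\d+\.(?:\d+\.(?:\d+|-)|-\.-)$")


def validate_ec_map(ec_map, known_cofactors):
    """Return a list of problems found in an EC pattern -> mapping dict."""
    problems = []
    for pattern, mapping in ec_map.items():
        if not EC_PATTERN.match(pattern):
            problems.append(f"{pattern}: not a valid EC pattern")
        primary = mapping.get("primary", [])
        optional = mapping.get("optional", [])
        # Every key must exist in the cofactor hierarchy.
        for key in primary + optional:
            if key not in known_cofactors:
                problems.append(f"{pattern}: unknown cofactor key {key!r}")
        # A cofactor cannot be both required and optional for one pattern.
        for key in sorted(set(primary) & set(optional)):
            problems.append(f"{pattern}: {key!r} is both primary and optional")
    return problems
```

Here `known_cofactors` would be the set of top-level keys loaded from cofactor_hierarchy.yaml.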
Input: data/raw/mp_medium_ingredient_properties.csv (158 ingredients)
Script: scripts/generate_ingredient_cofactor_mapping.py
Step 1: Component Name Parsing

    def extract_cofactor_from_component(component_name, chebi_id):
        """
        Rule-based pattern matching on component names.

        Examples:
        - "Thiamin" → thiamine_pyrophosphate
        - "FeSO₄·7H₂O" → iron_ion, heme, iron_sulfur_clusters
        - "MgCl₂·6H₂O" → magnesium_ion
        """
        cofactors = []
        name = component_name.lower()

        # Vitamin mappings
        if "thiamin" in name:
            cofactors.append("thiamine_pyrophosphate")
        elif "biotin" in name:
            cofactors.append("biotin")
        elif "cobal" in name:  # Cobalamin, cobalt
            cofactors.extend(["cobalamin", "cobalt_ion"])

        # Metal mappings (extract from chemical formula)
        metal = extract_metal(component_name)
        if metal == "Mg":
            cofactors.append("magnesium_ion")
        elif metal == "Fe":
            cofactors.extend(["iron_ion", "heme", "iron_sulfur_clusters"])
        elif metal == "Zn":
            cofactors.append("zinc_ion")
        # ... etc for all metals

        return cofactors

Step 2: ChEBI ID Validation
- Verify ChEBI ID from source CSV matches expected chemical
- Cross-check with ChEBI ontology using chebi_client.py
Step 3: Manual Curation
- Review edge cases (e.g., rare earth elements → lanthanides)
- Add lanthanide mapping for NdCl₃, PrCl₃, etc.
- Verify multi-cofactor assignments (e.g., Fe → iron_ion + heme + Fe-S)
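
The Step 1 function calls an `extract_metal` helper whose body is not given in this document. One way such a helper could work is sketched below, purely as an assumption: the `METALS` set is an illustrative subset, and the formula check is a simple heuristic that rejects ordinary words so that a name like "Niacin" is not misread as nickel.

```python
import re

# Map Unicode subscript digits (as in "FeSO₄·7H₂O") to ASCII digits.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

# Illustrative subset only; the real list of metals is an assumption.
METALS = {"Mg", "Fe", "Zn", "Mn", "Co", "Cu", "Ni", "Mo", "Ca", "Nd", "Pr"}

# A formula-like name consists solely of element symbols, digits,
# parentheses, hydration dots, and whitespace.
FORMULA = re.compile(r"(?:[A-Z][a-z]?|[\d()·.\s])+")


def extract_metal(component_name):
    """Return the first metal symbol in a formula-like name, else None."""
    formula = component_name.translate(SUBSCRIPTS)
    if not FORMULA.fullmatch(formula):
        return None  # ordinary word such as "Thiamin" or "Niacin"
    for symbol in re.findall(r"[A-Z][a-z]?", formula):
        if symbol in METALS:
            return symbol
    return None
```

A name that mixes a formula with free text would return None under this heuristic, which is one reason the manual curation step remains necessary.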
Validation Checks:
- All ingredient names from MP medium CSV present
- ChEBI IDs match source data
- Cofactor keys reference cofactor_hierarchy.yaml
- No missing cofactor assignments for known cofactor-providing ingredients
- Multi-cofactor assignments justified (e.g., Fe provides multiple forms)
E. coli K-12: Well-characterized model organism
- Expected cofactors: NAD, FAD, CoA, Fe, Mg, etc.
- Biosynthesis: Capable of synthesizing most vitamins
- Validation: Compare to EcoCyc pathways
M. radiotolerans: Methylotroph representative
- Expected cofactors: PQQ, lanthanides (XoxF-MDH)
- Validation: Compare to published genome analysis
Test 1: Cofactor Identification

    def test_cofactor_identification():
        """Test that all expected cofactors are identified from genome."""
        agent = CofactorMediaAgent()
        result = agent.analyze_cofactor_requirements("SAMN02604091")  # E. coli

        expected_cofactors = ["nad", "fad", "thiamine_pyrophosphate", "iron_ion"]
        found_cofactors = [req.cofactor_key for req in result]

        assert all(cf in found_cofactors for cf in expected_cofactors)

Test 2: Ingredient Mapping

    def test_ingredient_mapping():
        """Test that MP medium ingredients map to cofactors."""
        agent = CofactorMediaAgent()
        requirements = [
            CofactorRequirement(cofactor_key="magnesium_ion", ...)
        ]
        mapping = agent.map_ingredients_to_cofactors(requirements, "MP")

        # Mg should be mapped to MgCl₂·6H₂O
        assert any("MgCl" in ing.ingredient_name for ing in mapping["existing"])

Test 3: Cross-Validation with Literature

    def test_literature_validation():
        """Validate lanthanide cofactor for M. extorquens."""
        agent = CofactorMediaAgent()
        result = agent.analyze_cofactor_requirements("Methylobacterium_extorquens")

        # Should detect lanthanide requirement for XoxF
        cofactors = [req.cofactor_key for req in result]
        assert "lanthanides" in cofactors

Version Format: MAJOR.MINOR (e.g., 1.0, 1.1, 2.0)
Version Increments:
- MAJOR: Breaking changes (e.g., restructure categories, change keys)
- MINOR: Additions (e.g., new cofactors, new EC mappings)
Current Version: 1.0 (as of 2025-01-10)
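
The increment rules can be expressed as a small helper. This is a sketch only: `bump_version` is a hypothetical name, and resetting MINOR to 0 on a MAJOR bump is an assumption based on the "2.0" example rather than a stated rule.

```python
def bump_version(version, change):
    """Return the next MAJOR.MINOR version string.

    `change` is "major" for breaking changes or "minor" for additions.
    """
    major, minor = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0"  # assumed: MINOR resets on a MAJOR bump
    if change == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, adding a new cofactor to version 1.0 would yield 1.1, while restructuring the categories would yield 2.0.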
Adding New Cofactor:
- Verify ChEBI ID exists
- Assign to appropriate category
- Map EC associations from BRENDA
- Identify KEGG biosynthesis pathway
- Update cofactor_hierarchy.yaml
- Run tests to verify integration
- Increment MINOR version
Adding New EC Mapping:
- Query BRENDA for EC number
- Identify cofactor requirements
- Determine if pattern or exact match
- Update ec_to_cofactor_map.yaml
- Run tests
- Increment MINOR version
Citation Updates:
- Review annually for database updates (ChEBI, KEGG, BRENDA)
- Update DOIs if new papers published
- Maintain backward compatibility with existing ChEBI IDs
- Automated ChEBI ID Validation
  - Script to check all ChEBI IDs are still valid
  - Detect deprecated IDs and suggest replacements
- BRENDA API Integration
  - Automate EC-to-cofactor mapping updates
  - Quarterly refresh of cofactor-enzyme relationships
- Quantitative Cofactor Requirements
  - Add stoichiometry information (e.g., 2 Mg2+ per enzyme)
  - Concentration ranges for optimal activity
- Organism-Specific Cofactor Variants
  - Expand lanthanide mapping (La, Ce, Nd, Pr, Dy specificities)
  - Quinone pool variations (ubiquinone vs menaquinone preferences)
Annual Review (recommended: January each year):
- Check for updated versions of ChEBI, KEGG, BRENDA
- Update DOIs in YAML headers
- Verify URL accessibility
- Update BibTeX file (docs/references/cofactor_sources.bib)
Adding New Citations:
- Add to BibTeX file first
- Update YAML header with citation format: "Author et al. (YEAR) Journal. Vol(Issue):Pages"
- Include DOI
Citation Format:

    # Citation: LastName F, et al. (YEAR) Journal Name. Vol(Issue):Pages
    # DOI: 10.####/######

| Database | Access Method | Update Frequency |
|---|---|---|
| ChEBI | OWL download + API | Monthly |
| KEGG | Web interface | Quarterly |
| BRENDA | Web interface | Annually |
| ExplorEnz | Web interface | Annually |
| MetaCyc | Web interface | Biannually |

- scripts/generate_ingredient_cofactor_mapping.py - Ingredient mapping generator
- scripts/validate_cofactor_hierarchy.py - Quality control validator (TODO)
- scripts/update_chebi_ids.py - ChEBI ID validator (TODO)

For questions about curation methodology:
- GitHub Issues: https://github.com/CultureBotAI/MicroGrowAgents/issues
- Maintainer: MicroGrowAgents Team

Last Updated: 2025-01-10

Version: 1.0

Related Documentation: Cofactor Data Sources