Cofactor Curation Methodology

This document describes the detailed methodology used to create and maintain the cofactor reference data for the MicroGrowAgents system.

Overview

The cofactor curation process integrates data from multiple authoritative biological databases into structured reference files that power the CofactorMediaAgent. The methodology emphasizes:

Multi-source validation: Cross-referencing across ChEBI, KEGG, BRENDA, and ExplorEnz
Literature-based curation: Specialized cofactors validated against primary literature
Pattern-based mapping: Scalable EC-to-cofactor relationships using wildcards
Reproducibility: All curation steps documented with version control

1. Cofactor Hierarchy Curation

File: `cofactor_hierarchy.yaml`

Step 1: Initial Cofactor Identification

Data Sources:

ChEBI ontology search for term "cofactor" (CHEBI:23357)
KEGG compound search for "coenzyme" and "prosthetic group"
Literature review for specialized cofactors (lanthanides, PQQ, archaeal cofactors)

Inclusion Criteria:

Essential for enzyme catalysis or electron transfer
Chemically well-defined structure
Present in at least one known microbial enzyme
Has a ChEBI ID or KEGG compound ID

Exclusion Criteria:

Generic terms (e.g., "organic cofactor")
Insufficiently characterized compounds
Eukaryote-specific cofactors not found in bacteria/archaea

Step 2: Functional Categorization

Categories defined:

Vitamins - Organic cofactors derived from dietary vitamins
Metals - Metal ions and metalloenzyme prosthetic groups
Nucleotides - Nucleotide-based redox carriers and energy currencies
Energy Cofactors - Group transfer and energy metabolism cofactors
Other - Specialized cofactors (PQQ, archaeal cofactors, etc.)

Categorization Process:

# Pseudo-algorithm
for cofactor in candidate_list:
    if has_vitamin_precursor(cofactor):
        category = "vitamins"
    elif contains_metal(cofactor):
        category = "metals"
    elif is_nucleotide_derivative(cofactor):
        category = "nucleotides"
    elif is_group_transfer_cofactor(cofactor):
        category = "energy_cofactors"
    else:
        category = "other"
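Because the branches are tested in order, the first matching rule decides the category: a cofactor that is both vitamin-derived and metal-containing lands in "vitamins". A minimal runnable sketch of that first-match-wins behaviour follows. The record fields (`vitamin_precursor`, `metal`, and so on) and the sample records are hypothetical stand-ins for the predicates in the pseudo-algorithm, not the project's actual data model.

```python
from typing import Callable

# Hypothetical cofactor record; the field names are illustrative only.
Cofactor = dict

# Ordered (predicate, category) rules. Order matters: the first match wins.
RULES: list[tuple[Callable[[Cofactor], bool], str]] = [
    (lambda c: bool(c.get("vitamin_precursor")), "vitamins"),
    (lambda c: bool(c.get("metal")), "metals"),
    (lambda c: bool(c.get("nucleotide_derivative")), "nucleotides"),
    (lambda c: bool(c.get("group_transfer")), "energy_cofactors"),
]


def categorize(cofactor: Cofactor) -> str:
    """Return the category of the first rule that matches, else "other"."""
    for predicate, category in RULES:
        if predicate(cofactor):
            return category
    return "other"


# Illustrative candidates (assumed annotations, not curated data).
candidates = {
    "cobalamin": {"vitamin_precursor": "B12", "metal": "Co"},
    "zinc_ion": {"metal": "Zn"},
    "pqq": {},
}
categories = {name: categorize(record) for name, record in candidates.items()}
```

Keeping the rules in an ordered list makes the precedence explicit and easy to review when a new category is introduced.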

Step 3: ChEBI ID Assignment

Process:

Search ChEBI ontology by cofactor name
Verify chemical structure matches expected cofactor
Select most specific ChEBI ID (e.g., CHEBI:15846 for NAD+ vs CHEBI:13389 for NAD)
Record primary ID and all synonyms

ChEBI Search Tool: https://www.ebi.ac.uk/chebi/advancedSearchFT.do

Example:

thiamine_pyrophosphate:
  id: "CHEBI:9532"  # Exact match from ChEBI search
  names: ["TPP", "ThDP", "Thiamine diphosphate"]  # All synonyms

Step 4: KEGG Pathway Mapping

Process:

Search KEGG pathway database for cofactor biosynthesis
Record pathway IDs (ko##### format)
Validate pathway relevance to bacterial/archaeal metabolism

KEGG Resources:

Pathway database: https://www.genome.jp/kegg/pathway.html
Compound database: https://www.genome.jp/kegg/compound/

Example Pathways:

ko00730: Thiamine metabolism (for TPP)
ko00760: Nicotinate and nicotinamide metabolism (for NAD/NADP)
ko00780: Biotin metabolism

Step 5: EC Association Assignment

Process:

Query BRENDA for enzymes requiring specific cofactor
Extract EC numbers from enzyme entries
Generalize to EC patterns where appropriate
Validate against ExplorEnz classifications

Example:

pyridoxal_phosphate:
  ec_associations: ["2.6.1.-", "4.1.1.-", "5.1.1.-"]  # Transaminases (2.6.1), carboxy-lyases (4.1.1), amino-acid racemases and epimerases (5.1.1)

Step 6: Quality Control

Validation Checks:

All cofactors have valid ChEBI IDs
All synonyms verified against ChEBI
KEGG pathway IDs are valid
EC associations match BRENDA/ExplorEnz
No duplicate cofactor entries
All categories properly defined
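Several of these checks are mechanical and could be scripted. The sketch below is one possible shape for such a validator, not the planned `scripts/validate_cofactor_hierarchy.py`. It assumes the hierarchy has already been loaded into a flat dict of cofactor key to entry, and that each entry carries `id`, `category`, and `kegg_pathways` fields; the last two field names are assumptions. It checks ID format only, so it cannot tell whether an ID still exists in ChEBI or KEGG.

```python
import re

CHEBI_RE = re.compile(r"^CHEBI:\d+$")
KEGG_PATHWAY_RE = re.compile(r"^ko\d{5}$")
CATEGORIES = {"vitamins", "metals", "nucleotides", "energy_cofactors", "other"}


def validate_hierarchy(entries):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    seen_ids = {}  # ChEBI ID -> first cofactor key that used it
    for key, entry in entries.items():
        chebi_id = entry.get("id", "")
        if not CHEBI_RE.match(chebi_id):
            problems.append(f"{key}: malformed ChEBI ID {chebi_id!r}")
        elif chebi_id in seen_ids:
            problems.append(f"{key}: duplicates {seen_ids[chebi_id]} ({chebi_id})")
        else:
            seen_ids[chebi_id] = key
        if entry.get("category") not in CATEGORIES:
            problems.append(f"{key}: unknown category {entry.get('category')!r}")
        for pathway in entry.get("kegg_pathways", []):
            if not KEGG_PATHWAY_RE.match(pathway):
                problems.append(f"{key}: malformed KEGG pathway ID {pathway!r}")
    return problems


good = {
    "thiamine_pyrophosphate": {
        "id": "CHEBI:9532",
        "category": "vitamins",
        "kegg_pathways": ["ko00730"],
    }
}
print(validate_hierarchy(good))  # no problems expected for this entry
```

Returning a list of messages rather than raising on the first failure lets a curator see every issue in one run.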

2. EC-to-Cofactor Mapping

File: `ec_to_cofactor_map.yaml`

Step 1: BRENDA Data Mining

Process:

For each EC class (1-6), query BRENDA for cofactor requirements
Extract cofactor-EC relationships from enzyme entries
Identify patterns across EC subclasses

BRENDA Query Example:

EC 1.1.1.- (NAD/NADP-dependent dehydrogenases)
→ Primary: NAD+, NADP+
→ Optional: Zn²⁺ (for some alcohol dehydrogenases)

Step 2: Pattern Generalization

Pattern Hierarchy:

Exact match (highest priority): "1.1.1.27" - specific enzyme
3-digit pattern: "1.1.1.-" - EC sub-subclass
2-digit pattern: "1.1.-.-" - EC subclass (rarely used)
1-digit pattern: "1.-.-.-" - EC class (never used; too broad)
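A lookup that respects this hierarchy can try the most specific key first and fall back to broader patterns. The following is a minimal sketch under the assumption that the map is a plain dict keyed by EC number or pattern. The function names and the sample map contents are illustrative, not taken from `ec_to_cofactor_map.yaml`.

```python
def candidate_patterns(ec_number):
    """Return lookup keys for an EC number, most specific first."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec_number!r}")
    return [
        ec_number,                         # exact match
        ".".join(parts[:3] + ["-"]),       # 3-digit pattern
        ".".join(parts[:2] + ["-", "-"]),  # 2-digit pattern
    ]                                      # 1-digit patterns are never tried


def lookup_cofactors(ec_number, ec_map):
    """Return the mapping for the most specific matching key, or None."""
    for pattern in candidate_patterns(ec_number):
        if pattern in ec_map:
            return ec_map[pattern]
    return None


# Illustrative map contents (assumed, for demonstration only).
ec_map = {
    "1.1.1.-": {"primary": ["nad", "nadp"], "optional": ["zinc_ion"]},
    "1.1.1.27": {"primary": ["nad"], "optional": []},
}
exact = lookup_cofactors("1.1.1.27", ec_map)    # resolved by the exact entry
fallback = lookup_cofactors("1.1.1.1", ec_map)  # resolved by "1.1.1.-"
missing = lookup_cofactors("2.7.1.1", ec_map)   # no entry at any level
```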

Generalization Rules:

If >80% of enzymes in an EC sub-subclass share a cofactor → create a pattern
If a cofactor is used by <50% of enzymes in the sub-subclass → mark it as "optional"
If the evidence conflicts → use exact EC numbers instead of a pattern
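The threshold rules can be expressed as a small counting routine. This sketch reads the second rule as applying to cofactors found in fewer than 50% of the enzymes in a group, which is an interpretation. The rules do not say what happens between 50% and 80%, so the sketch sets those cofactors aside for manual review rather than guessing. The input data is synthetic and does not come from BRENDA.

```python
from collections import Counter


def classify_group(cofactors_by_enzyme, pattern_threshold=0.8, optional_threshold=0.5):
    """Split the cofactors of one EC group by the share of enzymes that use them."""
    result = {"pattern": set(), "optional": set(), "needs_review": set()}
    total = len(cofactors_by_enzyme)
    if total == 0:
        return result
    # Count each cofactor once per enzyme.
    counts = Counter(cf for used in cofactors_by_enzyme.values() for cf in set(used))
    for cofactor, n in counts.items():
        share = n / total
        if share > pattern_threshold:
            result["pattern"].add(cofactor)
        elif share < optional_threshold:
            result["optional"].add(cofactor)
        else:
            result["needs_review"].add(cofactor)  # 50-80%: not covered by the rules
    return result


# Synthetic group of ten enzymes (made-up usage, for illustration only).
group = {}
for i in range(1, 11):
    used = set()
    if i <= 9:
        used.add("nad")       # used by 9 of 10 enzymes
    if i <= 2:
        used.add("zinc_ion")  # used by 2 of 10 enzymes
    if i >= 5:
        used.add("fad")       # used by 6 of 10 enzymes
    group[f"enzyme_{i}"] = used

classified = classify_group(group)
```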

Step 3: Primary vs Optional Classification

Definitions:

Primary: Absolutely required for catalytic activity
Optional: Enhances activity or acts as an alternative substrate

Classification Process:

"1.1.1.-":
  primary: ["nad", "nadp"]     # Required for all dehydrogenases
  optional: ["zinc_ion"]       # Only for alcohol dehydrogenases

Step 4: Cross-Validation with MetaCyc

Process:

Query MetaCyc for EC number
Verify cofactor requirements match BRENDA
Resolve conflicts by checking primary literature
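The comparison itself reduces to a set difference per EC number. A minimal sketch, assuming the cofactor requirements from each database have already been exported to dicts of EC number to cofactor keys (neither database is queried here, and the sample values are illustrative):

```python
def find_conflicts(brenda, metacyc):
    """Return the EC numbers present in both sources whose cofactor sets differ."""
    conflicts = {}
    for ec in sorted(brenda.keys() & metacyc.keys()):
        only_brenda = brenda[ec] - metacyc[ec]
        only_metacyc = metacyc[ec] - brenda[ec]
        if only_brenda or only_metacyc:
            conflicts[ec] = {"brenda_only": only_brenda, "metacyc_only": only_metacyc}
    return conflicts


# Illustrative inputs (assumed values, not real database exports).
brenda = {"1.1.1.1": {"nad", "zinc_ion"}, "1.1.1.27": {"nad"}}
metacyc = {"1.1.1.1": {"nad"}, "1.1.1.27": {"nad"}, "2.6.1.1": {"pyridoxal_phosphate"}}

conflicts = find_conflicts(brenda, metacyc)
```

Each entry in `conflicts` is a candidate for the primary-literature check; EC numbers found in only one source are ignored by this sketch.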

Step 5: Quality Control

Validation Checks:

All EC patterns are valid (e.g., "1.1.1.-" not "1.1.1")
No conflicting mappings (same EC → different cofactors)
All cofactor keys reference cofactor_hierarchy.yaml entries
Pattern hierarchy respected (exact > 3-digit > 2-digit)
Primary/optional classification justified
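The first and third checks lend themselves to automation. The sketch below assumes the map is a dict of pattern to `primary`/`optional` lists, as in the Step 3 example, and that the set of valid cofactor keys has been read from `cofactor_hierarchy.yaml` beforehand. The regular expression is an assumption about what counts as a valid pattern: four fields, with dashes allowed only in the trailing one or two positions.

```python
import re

# Accepts "1.1.1.27", "1.1.1.-" and "1.1.-.-"; rejects "1.1.1" and "1.-.-.-".
EC_PATTERN_RE = re.compile(r"^\d+\.\d+\.(\d+\.(\d+|-)|-\.-)$")


def validate_ec_map(ec_map, known_cofactors):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for pattern, mapping in ec_map.items():
        if not EC_PATTERN_RE.match(pattern):
            problems.append(f"{pattern}: not a valid EC number or pattern")
        for role in ("primary", "optional"):
            for key in mapping.get(role, []):
                if key not in known_cofactors:
                    problems.append(f"{pattern}: unknown cofactor key {key!r} ({role})")
    return problems


known = {"nad", "nadp", "zinc_ion"}
ec_map = {"1.1.1.-": {"primary": ["nad", "nadp"], "optional": ["zinc_ion"]}}
print(validate_ec_map(ec_map, known))  # no problems expected for this map
```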

3. Ingredient-to-Cofactor Mapping

File: `ingredient_cofactor_mapping.csv`

Data Source

Input: data/raw/mp_medium_ingredient_properties.csv (158 ingredients)

Mapping Algorithm

Script: scripts/generate_ingredient_cofactor_mapping.py

Step 1: Component Name Parsing

def extract_cofactor_from_component(component_name, chebi_id):
    """
    Rule-based pattern matching on component names.

    chebi_id is part of the signature but is not used by the
    name-matching rules shown here.

    Examples:
    - "Thiamin" → thiamine_pyrophosphate
    - "FeSO₄·7H₂O" → iron_ion, heme, iron_sulfur_clusters
    - "MgCl₂·6H₂O" → magnesium_ion
    """
    cofactors = []
    name = component_name.lower()

    # Vitamin mappings (first match wins)
    if "thiamin" in name:
        cofactors.append("thiamine_pyrophosphate")
    elif "biotin" in name:
        cofactors.append("biotin")
    elif "cobal" in name:  # Matches both "cobalamin" and "cobalt"
        cofactors.extend(["cobalamin", "cobalt_ion"])

    # Metal mappings (extract from chemical formula).
    # extract_metal is assumed to return the element symbol of the
    # metal in the formula, or None when there is none.
    metal = extract_metal(component_name)
    if metal == "Mg":
        cofactors.append("magnesium_ion")
    elif metal == "Fe":
        cofactors.extend(["iron_ion", "heme", "iron_sulfur_clusters"])
    elif metal == "Zn":
        cofactors.append("zinc_ion")
    # ... etc for all metals

    return cofactors
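The function above relies on an `extract_metal` helper that is not shown. One way such a helper could work is sketched below; it is a hypothetical reimplementation, and the real helper in `scripts/generate_ingredient_cofactor_mapping.py` may differ. It scans for element-symbol tokens that are not followed by a lowercase letter, so that formulas such as "MgCl₂·6H₂O" are tokenized while ordinary words such as "Nicotinic" or "Cobalamin" are not mistaken for Ni or Co. The metal set is an assumed subset.

```python
import re

# Assumed subset of metals of interest; extend as needed.
METAL_SYMBOLS = {"Mg", "Fe", "Zn", "Co", "Mn", "Cu", "Ni", "Mo", "Ca"}

# An uppercase letter, an optional lowercase letter, and no further
# lowercase letter after it (which would indicate an ordinary word).
ELEMENT_TOKEN = re.compile(r"[A-Z][a-z]?(?![a-z])")


def extract_metal(component_name):
    """Return the first metal symbol found in a formula-like name, else None."""
    for symbol in ELEMENT_TOKEN.findall(component_name):
        if symbol in METAL_SYMBOLS:
            return symbol
    return None


print(extract_metal("MgCl₂·6H₂O"))  # Mg
print(extract_metal("Thiamin"))     # None
```

This remains a heuristic: a single capital letter in a name (for example the "K" in "Vitamin K") is read as an element symbol, so single-letter metals should only be added to the set with care.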

Step 2: ChEBI ID Validation

Verify ChEBI ID from source CSV matches expected chemical
Cross-check with ChEBI ontology using chebi_client.py

Step 3: Manual Curation

Review edge cases (e.g., rare earth elements → lanthanides)
Add lanthanide mapping for NdCl₃, PrCl₃, etc.
Verify multi-cofactor assignments (e.g., Fe → iron_ion + heme + Fe-S)

Quality Control

Validation Checks:

All ingredient names from MP medium CSV present
ChEBI IDs match source data
Cofactor keys reference cofactor_hierarchy.yaml
No missing cofactor assignments for known cofactor-providing ingredients
Multi-cofactor assignments justified (e.g., Fe provides multiple forms)

4. Integration Testing

Test Organisms

E. coli K-12: Well-characterized model organism

Expected cofactors: NAD, FAD, CoA, Fe, Mg, etc.
Biosynthesis: Capable of synthesizing most vitamins
Validation: Compare to EcoCyc pathways

M. radiotolerans: Methylotroph representative

Expected cofactors: PQQ, lanthanides (XoxF-MDH)
Validation: Compare to published genome analysis

Test Cases

Test 1: Cofactor Identification

def test_cofactor_identification():
    """Test that all expected cofactors are identified from genome."""
    agent = CofactorMediaAgent()
    result = agent.analyze_cofactor_requirements("SAMN02604091")  # E. coli

    expected_cofactors = ["nad", "fad", "thiamine_pyrophosphate", "iron_ion"]
    found_cofactors = [req.cofactor_key for req in result]

    assert all(cf in found_cofactors for cf in expected_cofactors)

Test 2: Ingredient Mapping

def test_ingredient_mapping():
    """Test that MP medium ingredients map to cofactors."""
    agent = CofactorMediaAgent()
    requirements = [
        CofactorRequirement(cofactor_key="magnesium_ion", ...)
    ]

    mapping = agent.map_ingredients_to_cofactors(requirements, "MP")

    # Mg should be mapped to MgCl₂·6H₂O
    assert any("MgCl" in ing.ingredient_name for ing in mapping["existing"])

Test 3: Cross-Validation with Literature

def test_literature_validation():
    """Validate lanthanide cofactor for M. extorquens."""
    agent = CofactorMediaAgent()
    result = agent.analyze_cofactor_requirements("Methylobacterium_extorquens")

    # Should detect lanthanide requirement for XoxF
    cofactors = [req.cofactor_key for req in result]
    assert "lanthanides" in cofactors

5. Version Control and Updates

Versioning Strategy

Version Format: MAJOR.MINOR (e.g., 1.0, 1.1, 2.0)

Version Increments:

MAJOR: Breaking changes (e.g., restructure categories, change keys)
MINOR: Additions (e.g., new cofactors, new EC mappings)

Current Version: 1.0 (as of 2025-01-10)
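The increment rules can be captured in a small helper. This is an illustrative sketch, not part of the project's scripts. Resetting MINOR to 0 on a MAJOR bump is an assumption inferred from the example sequence 1.0, 1.1, 2.0.

```python
def bump_version(version, change):
    """Return the next MAJOR.MINOR version for a "major" or "minor" change."""
    major, minor = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0"  # breaking change; MINOR resets (assumed)
    if change == "minor":
        return f"{major}.{minor + 1}"  # additive change
    raise ValueError(f"unknown change type {change!r}")


print(bump_version("1.0", "minor"))  # 1.1
print(bump_version("1.1", "major"))  # 2.0
```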

Update Procedure

Adding New Cofactor:

Verify ChEBI ID exists
Assign to appropriate category
Map EC associations from BRENDA
Identify KEGG biosynthesis pathway
Update cofactor_hierarchy.yaml
Run tests to verify integration
Increment MINOR version

Adding New EC Mapping:

Query BRENDA for EC number
Identify cofactor requirements
Determine if pattern or exact match
Update ec_to_cofactor_map.yaml
Run tests
Increment MINOR version

Citation Updates:

Review annually for database updates (ChEBI, KEGG, BRENDA)
Update DOIs if new papers are published
Maintain backward compatibility with existing ChEBI IDs

6. Future Enhancements

Planned Improvements

Automated ChEBI ID Validation
- Script to check all ChEBI IDs are still valid
- Detect deprecated IDs and suggest replacements

BRENDA API Integration
- Automate EC-to-cofactor mapping updates
- Quarterly refresh of cofactor-enzyme relationships

Quantitative Cofactor Requirements
- Add stoichiometry information (e.g., 2 Mg²⁺ per enzyme)
- Concentration ranges for optimal activity

Organism-Specific Cofactor Variants
- Expand lanthanide mapping (La, Ce, Nd, Pr, Dy specificities)
- Quinone pool variations (ubiquinone vs menaquinone preferences)

7. Citation Management

Citation Update Process

Annual Review (recommended: January each year):

Check for updated versions of ChEBI, KEGG, BRENDA
Update DOIs in YAML headers
Verify URL accessibility
Update BibTeX file (docs/references/cofactor_sources.bib)

Adding New Citations:

Add to BibTeX file first
Update YAML header with citation format: "Author et al. (YEAR) Journal. Vol(Issue):Pages"
Include DOI

Citation Format:

# Citation: LastName F, et al. (YEAR) Journal Name. Vol(Issue):Pages
# DOI: 10.####/######

Appendix: Tools and Resources

Database Access

| Database | Access Method | Update Frequency |
|----------|---------------|------------------|
| ChEBI | OWL download + API | Monthly |
| KEGG | Web interface | Quarterly |
| BRENDA | Web interface | Annually |
| ExplorEnz | Web interface | Annually |
| MetaCyc | Web interface | Biannually |

Scripts

scripts/generate_ingredient_cofactor_mapping.py - Ingredient mapping generator
scripts/validate_cofactor_hierarchy.py - Quality control validator (TODO)
scripts/update_chebi_ids.py - ChEBI ID validator (TODO)

Contacts

For questions about curation methodology:

GitHub Issues: https://github.com/CultureBotAI/MicroGrowAgents/issues
Maintainer: MicroGrowAgents Team

Last Updated: 2025-01-10
Version: 1.0
Related Documentation: Cofactor Data Sources
