CultureMech recipes include data_quality_flags to provide transparency about known data quality limitations. These flags help users understand the provenance and completeness of recipe data.
Flag: incomplete_composition
Description: Recipe has placeholder ingredients or incomplete composition data.
Common causes:
- PDF parsing failures (MediaDive, algae collections)
- Source data incomplete or unavailable
- Composition referenced but not provided
Example:
name: Bacillariophycean Medium
ingredients:
  - preferred_term: "See source for composition"
    concentration:
      value: variable
      unit: VARIABLE
data_quality_flags:
  - incomplete_composition
User guidance:
- Refer to the source URL in the notes field for the original composition
- Consider manual curation for high-priority media
- Flag indicates composition is not suitable for automated analysis
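As a minimal sketch of how a downstream analysis might honor this guidance, the snippet below skips recipes that carry the flag. The helper name `is_analysis_ready` and the second sample record are hypothetical; only the `data_quality_flags` field and the `incomplete_composition` value come from this document.

```python
def is_analysis_ready(recipe):
    """Return True if the recipe does not carry the incomplete_composition flag."""
    # A missing or null flags field is treated as "no flags".
    flags = (recipe or {}).get("data_quality_flags") or []
    return "incomplete_composition" not in flags


recipes = [
    {"name": "Bacillariophycean Medium",
     "data_quality_flags": ["incomplete_composition"]},
    {"name": "Hypothetical Complete Medium"},  # illustrative record, not real data
]

# Keep only recipes whose composition can be analyzed automatically.
usable = [r["name"] for r in recipes if is_analysis_ready(r)]
print(usable)
```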
Flag: pending_curation
Description: Recipe requires manual review or enrichment.
Common causes:
- Automated import flagged potential issues
- Missing critical metadata
- Conflicting information from multiple sources
User guidance:
- Recipe is structurally valid but needs human review
- Check curation_history for specific issues
- Contact maintainers if this is a priority recipe
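Checking curation_history can be sketched roughly as follows. The shape of a history entry is not specified in this document, so the `action` and `notes` keys, the sample recipe, and the helper `curation_entries` are all assumptions made for illustration.

```python
def curation_entries(recipe):
    """Return the recipe's curation_history as a list (empty if absent)."""
    history = recipe.get("curation_history") or []
    # Assumption: the field is normally a list; a lone entry is wrapped
    # so that callers can always iterate.
    return history if isinstance(history, list) else [history]


recipe = {
    "name": "Example Medium",  # hypothetical recipe
    "data_quality_flags": ["pending_curation"],
    "curation_history": [
        # Hypothetical entry shape; the real schema may differ.
        {"action": "imported", "notes": "conflicting pH values between sources"},
    ],
}

if "pending_curation" in (recipe.get("data_quality_flags") or []):
    for entry in curation_entries(recipe):
        print(entry)
```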
Flag: low_confidence
Description: Data source has lower reliability or confidence.
Common causes:
- Secondary or tertiary sources
- Automated web scraping
- Historical recipes without validation
User guidance:
- Use with caution for critical applications
- Cross-reference with authoritative sources when possible
- Recipe may require experimental validation
# Find all recipes with quality flags
grep -r "data_quality_flags" data/normalized_yaml/
# Count incomplete compositions
grep -r "incomplete_composition" data/normalized_yaml/ | wc -l
# List all flagged recipes
find data/normalized_yaml -name "*.yaml" -exec grep -l "incomplete_composition" {} \;

from pathlib import Path

import yaml


def find_flagged_recipes(normalized_dir, flag="incomplete_composition"):
    """Find recipes with a specific quality flag."""
    flagged = []
    for recipe_path in Path(normalized_dir).rglob("*.yaml"):
        with open(recipe_path, encoding="utf-8") as f:
            recipe = yaml.safe_load(f)
        # Skip empty or non-mapping files; treat a null flags field as no flags.
        if not isinstance(recipe, dict):
            continue
        flags = recipe.get("data_quality_flags") or []
        if flag in flags:
            flagged.append(recipe_path)
    return flagged


# Usage
flagged = find_flagged_recipes('data/normalized_yaml', 'incomplete_composition')
print(f"Found {len(flagged)} recipes with incomplete compositions")

(As of last pipeline run)
Total recipes: 10,595
Flagged recipes: 339 (3.2%)
- incomplete_composition: 339
- pending_curation: 0
- low_confidence: 0
Unfixable issues: 377 (3.6%)
(KOMODO recipes without matching DSMZ media)
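A summary of this shape could be recomputed from loaded recipes with a small counter, as in the sketch below. The function name `summarize_flags` and the four sample records are hypothetical and are not the pipeline's own reporting code; the sample counts are illustrative and unrelated to the figures above.

```python
from collections import Counter


def summarize_flags(recipes):
    """Return (number of flagged recipes, per-flag counts)."""
    counts = Counter()
    flagged = 0
    for recipe in recipes:
        flags = (recipe or {}).get("data_quality_flags") or []
        if flags:
            flagged += 1
        counts.update(flags)
    return flagged, counts


# Illustrative records only.
sample = [
    {"name": "A", "data_quality_flags": ["incomplete_composition"]},
    {"name": "B"},
    {"name": "C", "data_quality_flags": ["incomplete_composition", "low_confidence"]},
    {"name": "D", "data_quality_flags": []},
]

flagged, counts = summarize_flags(sample)
print(f"Total recipes: {len(sample)}")
print(f"Flagged recipes: {flagged} ({flagged / len(sample):.1%})")
for flag, n in counts.most_common():
    print(f"- {flag}: {n}")
```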
Flags are added automatically by the quality pipeline:
- During import: Importers add flags for known limitations
- During validation: Quality tagger detects placeholder ingredients
- During enrichment: Resolvers flag unresolvable issues
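The validation step can be pictured with a rough sketch like the one below. It is not the code in scripts/tag_placeholder_recipes.py; the two detection heuristics (a preferred_term beginning with "See source" and a concentration unit of VARIABLE) are assumptions inferred from the example recipe in this document, and the function names are hypothetical.

```python
def is_placeholder(ingredient):
    """Heuristic check for a placeholder ingredient (assumed rules)."""
    term = str(ingredient.get("preferred_term", "")).strip().lower()
    concentration = ingredient.get("concentration") or {}
    return term.startswith("see source") or concentration.get("unit") == "VARIABLE"


def tag_incomplete(recipe):
    """Add incomplete_composition if any ingredient looks like a placeholder."""
    ingredients = recipe.get("ingredients") or []
    if any(is_placeholder(i) for i in ingredients):
        flags = recipe.get("data_quality_flags") or []
        if "incomplete_composition" not in flags:  # avoid duplicate flags
            flags.append("incomplete_composition")
        recipe["data_quality_flags"] = flags
    return recipe


recipe = {
    "name": "Bacillariophycean Medium",
    "ingredients": [
        {"preferred_term": "See source for composition",
         "concentration": {"value": "variable", "unit": "VARIABLE"}},
    ],
}
tag_incomplete(recipe)
print(recipe["data_quality_flags"])
```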
Quality flags should not be removed unless:
- Composition data has been manually curated and verified
- Source data has been updated and re-imported
- Flag was added in error (rare)
Disliking a flag is not sufficient reason to remove it; flags provide important provenance information.
If you need a high-quality version of a flagged recipe:
- Check source: Visit the URL in the notes field for the original composition
- Manual curation: Extract the composition from the source and submit a PR
- Alternative sources: Search for same medium in other databases
- Contact maintainers: Request prioritization for critical recipes
To improve flagged recipes:
- Find flagged recipes: Use commands above
- Curate composition: Extract from authoritative source
- Update recipe file: Replace placeholder ingredients
- Add provenance: Update curation_history with the source
- Remove flag: Delete it from the data_quality_flags list
- Submit PR: Include evidence of curation quality
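The provenance and flag-removal steps might look something like this once the composition has been curated by hand. The helper `mark_curated`, the shape of the curation_history entry, and the example URL are hypothetical; check the project schema before relying on any of these names.

```python
def mark_curated(recipe, flag, source):
    """Record provenance, then drop a quality flag from an in-memory recipe."""
    # Append a provenance entry first so the removal is always documented.
    history = recipe.get("curation_history") or []
    history.append({"action": f"removed {flag}", "source": source})  # assumed entry shape
    recipe["curation_history"] = history

    # Remove only the requested flag; keep any others.
    remaining = [f for f in (recipe.get("data_quality_flags") or []) if f != flag]
    if remaining:
        recipe["data_quality_flags"] = remaining
    else:
        recipe.pop("data_quality_flags", None)
    return recipe


recipe = {
    "name": "Example Medium",  # hypothetical recipe
    "data_quality_flags": ["incomplete_composition", "low_confidence"],
}
mark_curated(recipe, "incomplete_composition", "https://example.org/medium")  # placeholder URL
print(recipe["data_quality_flags"])
print(recipe["curation_history"])
```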
To prevent flags on new imports:
- Improve PDF parsing: Use better extraction tools
- Structured data: Provide machine-readable formats (JSON, TSV)
- Complete records: Include full composition in source database
- Validation: Verify completeness before publication
Planned improvements to the quality system:
- Confidence scores: Numeric quality scores (0-100)
- Severity levels: WARN, INFO, ERROR classifications
- Automated fixes: ML-based composition inference
- Quality dashboard: Web UI showing quality metrics
- Validation rules: Custom quality checks per domain
- Implementation: docs/DATA_QUALITY_FIXES.md
- Tagger script: scripts/tag_placeholder_recipes.py
- Schema: src/culturemech/schema/culturemech.yaml
- Validation: src/culturemech/validation/