Three new tools have been created to streamline the systematic evidence curation process:
- apply_suggested_snippets.py - Applies fixes from curation report
- intelligent_snippet_fixer.py - Direct abstract fetching with AI-powered snippet suggestion
- batch_snippet_fixer.py - Batch processing for multiple files
apply_suggested_snippets.py parses evidence_curation_report.txt and applies the snippet fixes suggested by the curation audit.
# Interactive mode - review each suggestion
poetry run python scripts/apply_suggested_snippets.py --file Australian_Lead_Zinc_Polymetallic.yaml
# Auto-approve mode - automatically apply all suggestions
poetry run python scripts/apply_suggested_snippets.py --file Australian_Lead_Zinc_Polymetallic.yaml --auto-approve

- Parses curation report for SNIPPET_NOT_IN_SOURCE fixes
- Interactive review mode
- Auto-approve mode for batch processing
- Creates backups (.yaml.bak_snippets) before changes
- Shows current vs suggested snippets
- Only works for snippets that appear in the curation report
- Report may truncate some suggestions ("... and X more")
- Doesn't fetch new abstracts
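
The backup step listed above can be pictured as follows. This is a minimal sketch rather than the script's actual code: the helper name `backup_yaml` is hypothetical, and it assumes the backup is a plain copy written next to the original with the `.bak_snippets` suffix appended to the file name.

```python
import shutil
from pathlib import Path


def backup_yaml(path, suffix=".bak_snippets"):
    """Copy `path` to `path + suffix` and return the backup location.

    Hypothetical helper: the real script may name or place backups differently.
    """
    source = Path(path)
    backup = source.with_name(source.name + suffix)
    shutil.copy2(source, backup)  # copy2 keeps timestamps along with the content
    return backup
```

Calling `backup_yaml("Australian_Lead_Zinc_Polymetallic.yaml")` under these assumptions would leave `Australian_Lead_Zinc_Polymetallic.yaml.bak_snippets` beside the original, so a bad edit can be undone by copying the backup over the edited file.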
intelligent_snippet_fixer.py fetches abstracts directly from PMID/DOI references and uses context-aware analysis to suggest the most relevant snippets.
# Interactive mode - review suggestions for all evidence
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml
# Process only invalid snippets (short ones)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --only-invalid
# Auto-approve mode (applies top suggestion automatically)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --auto-approve
# Verbose mode (shows debugging info)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --verbose

- Direct abstract fetching - Uses EnhancedLiteratureFetcher to get abstracts on-demand
- Context-aware scoring - Uses organism name and functional roles to find relevant sentences
- Multiple suggestions - Shows top 3 snippet candidates ranked by relevance
- Confidence scoring - High/medium/low confidence based on relevance
- Smart filtering - Excludes author info, copyright, email addresses
- Interactive workflow - Review abstract, choose from suggestions, or skip
- Creates backups (.yaml.bak_intelligent) before changes
- Reads YAML file - Extracts organism, reference, current snippet
- Fetches abstract - Uses PMID or DOI to get full abstract text
- Extracts context - Gets organism name, functional roles, metabolic keywords
- Scores sentences - Ranks sentences by relevance using:
  - Organism name presence (full or partial)
  - Context keyword matching
  - Numerical data presence (%, pH, concentrations)
  - Scientific verbs (showed, demonstrated, found)
  - Penalties for generic sentences
- Suggests snippets - Returns top 3 most relevant sentences
- Interactive review - User can:
  - [1-3] Apply specific suggestion
  - [V] View full abstract
  - [S] Skip
  - [Q] Quit
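
The scoring and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the function names, the weights, the noise and numeric patterns, and the confidence thresholds are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re

# Assumed weights and patterns; tune them against real abstracts.
SCIENTIFIC_VERBS = ("showed", "demonstrated", "found")
NOISE = re.compile(r"@|copyright|©|author information", re.IGNORECASE)
NUMERIC = re.compile(r"\d\s*%|\bpH\s*\d|\d\s*(?:mM|mg/L|g/L)\b")


def split_sentences(abstract):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def score_sentence(sentence, organism, keywords):
    """Score one sentence using the criteria listed above (weights are assumed)."""
    text = sentence.lower()
    parts = organism.split()
    genus = parts[0].lower() if parts else ""
    score = 0.0
    if organism and organism.lower() in text:
        score += 3.0  # full organism name
    elif genus and genus in text:
        score += 1.5  # partial match on the genus
    # Substring matching, so "iron" would also match "environment".
    score += sum(1.0 for kw in keywords if kw.lower() in text)
    if NUMERIC.search(sentence):
        score += 1.0  # %, pH, or a concentration
    if any(verb in text for verb in SCIENTIFIC_VERBS):
        score += 0.5
    if len(sentence.split()) < 8:
        score -= 2.0  # penalty: very short sentences tend to be generic
    return score


def confidence(score):
    """Map a score to a label; the thresholds are assumptions."""
    if score >= 5.0:
        return "high"
    if score >= 2.0:
        return "medium"
    return "low"


def suggest_snippets(abstract, organism, keywords, top_n=3):
    """Return up to top_n (sentence, score) pairs, best first."""
    candidates = [s for s in split_sentences(abstract) if not NOISE.search(s)]
    scored = [(s, score_sentence(s, organism, keywords)) for s in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Sentences containing an email address or copyright notice are dropped before scoring, which mirrors the smart filtering feature. A production version would need a better sentence splitter, since abbreviated names such as "A. ferrooxidans" break this one.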
================================================================================
Item 1/13
================================================================================
Organism: Acidithiobacillus ferrooxidans
Reference: PMID:22092956
❌ CURRENT snippet:
Bacterial communities associated with a mineral weathering profile
🔍 Context keywords: degradation, iron, oxidation, sulfur
✅ SUGGESTED snippets (confidence: high):
[1] Acidithiobacillus ferrooxidans dominated the upper oxidized zones,
catalyzing both iron and sulfur oxidation at pH 1.5-2.5.
[2] The bacterial community showed distinct stratification with
Acidithiobacillus species exhibiting highest abundance (45%) in
surface layers.
[3] Metagenome analysis revealed genes for carbon fixation and sulfur
oxidation were predominantly from Acidithiobacillus.
👉 Enter number to apply, [S]kip, [V]iew abstract, [Q]uit:
- Works even when report doesn't have suggestions
- Fetches abstracts for DOI references (not just PMID)
- More context-aware snippet selection
- Handles truncated report entries
- Can process ALL evidence items, not just flagged ones
batch_snippet_fixer.py processes multiple YAML files in sequence using the intelligent snippet fixer.
# Process Phase 1 top 10 priority files
poetry run python scripts/batch_snippet_fixer.py --phase 1
# Process Phase 2 medium-priority files
poetry run python scripts/batch_snippet_fixer.py --phase 2
# Process all files from curation report (sorted by issue count)
poetry run python scripts/batch_snippet_fixer.py --from-report
# Process specific files
poetry run python scripts/batch_snippet_fixer.py --files Australian_Lead_Zinc_Polymetallic.yaml AMD_Acidophile_Heterotroph_Network.yaml
# Limit to first N files
poetry run python scripts/batch_snippet_fixer.py --phase 1 --limit 3
# Auto-approve mode (non-interactive)
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve

- Batch processing - Process multiple files in sequence
- Phase-based selection - Pre-configured file lists for Phase 1/2/3
- Report-based selection - Auto-load files from curation report
- Progress tracking - Shows files processed, issues fixed
- Pre/post validation - Runs validation before and after each file
- Summary report - Shows improvement metrics for all files
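
Report-based selection combined with `--limit` amounts to ordering files by issue count and keeping the first N. The sketch below shows that idea only. The function name `select_files` is hypothetical, descending order (worst files first) is an assumption, and the counts are the pre-fix issue counts from the example summary in this document.

```python
def select_files(issue_counts, limit=None):
    """Order filenames by descending issue count and optionally keep the first `limit`.

    Ties are broken alphabetically so the order is deterministic.
    """
    ordered = sorted(issue_counts, key=lambda name: (-issue_counts[name], name))
    return ordered if limit is None else ordered[:limit]


issue_counts = {
    "AMD_Acidophile_Heterotroph_Network.yaml": 25,
    "Chromium_Sulfur_Reduction_Enrichment.yaml": 24,
    "Australian_Lead_Zinc_Polymetallic.yaml": 34,
}

# Keeps the two files with the most issues.
print(select_files(issue_counts, limit=2))
```

With these counts the call returns `Australian_Lead_Zinc_Polymetallic.yaml` followed by `AMD_Acidophile_Heterotroph_Network.yaml`.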
# Start with Phase 1 (top 10 files), interactive mode
poetry run python scripts/batch_snippet_fixer.py --phase 1
# After familiarization, use auto-approve for faster processing
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve --limit 3

================================================================================
BATCH PROCESSING SUMMARY
================================================================================
Total files processed: 10
✅ Australian_Lead_Zinc_Polymetallic.yaml
Issues: 34 → 12 (fixed 22)
✅ AMD_Acidophile_Heterotroph_Network.yaml
Issues: 25 → 8 (fixed 17)
✅ Chromium_Sulfur_Reduction_Enrichment.yaml
Issues: 24 → 10 (fixed 14)
...
🎉 Total issues fixed across all files: 142
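
The per-file figures in a summary like this follow from the pre- and post-validation issue counts. A minimal sketch of that arithmetic is below. The function name `summarize` and the tuple layout are assumptions, and only the three files listed explicitly above are used, so the total is 53 rather than the 142 reported for all ten files.

```python
def summarize(results):
    """Build summary lines from (filename, issues_before, issues_after) tuples."""
    lines = [f"Total files processed: {len(results)}"]
    total_fixed = 0
    for name, before, after in results:
        fixed = before - after
        total_fixed += fixed
        lines.append(name)
        lines.append(f"Issues: {before} → {after} (fixed {fixed})")
    lines.append(f"Total issues fixed across all files: {total_fixed}")
    return "\n".join(lines)


results = [
    ("Australian_Lead_Zinc_Polymetallic.yaml", 34, 12),
    ("AMD_Acidophile_Heterotroph_Network.yaml", 25, 8),
    ("Chromium_Sulfur_Reduction_Enrichment.yaml", 24, 10),
]
print(summarize(results))
```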
# 1. Use intelligent fixer with interactive mode
poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml
# 2. Review suggestions carefully, apply best ones
# 3. Validate
poetry run python scripts/curate_evidence_with_pdfs.py --file FILENAME.yaml
# 4. Schema check
just validate kb/communities/FILENAME.yaml

# 1. Start with Phase 1, limited scope for testing
poetry run python scripts/batch_snippet_fixer.py --phase 1 --limit 3
# 2. Review results
# 3. Process full Phase 1 with auto-approve
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve
# 4. Continue with Phase 2
poetry run python scripts/batch_snippet_fixer.py --phase 2 --auto-approve

# Apply report suggestions (fast, limited scope)
poetry run python scripts/apply_suggested_snippets.py --file FILENAME.yaml --auto-approve
# Then use intelligent fixer for remaining issues
poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml --only-invalid

| Feature | apply_suggested_snippets.py | intelligent_snippet_fixer.py | batch_snippet_fixer.py |
|---|---|---|---|
| Data source | Curation report | Direct PMID/DOI fetch | Uses intelligent fixer |
| Coverage | Only reported issues | All evidence items | Multiple files |
| Context-aware | ❌ | ✅ | ✅ |
| Abstract fetching | ❌ | ✅ | ✅ |
| Multiple suggestions | ❌ | ✅ (top 3) | ✅ |
| Confidence scoring | ❌ | ✅ | ✅ |
| DOI support | Limited | ✅ | ✅ |
| Batch mode | ❌ | ❌ | ✅ |
| Validation | ❌ | ❌ | ✅ |
| Best for | Quick report fixes | Individual file curation | Phase-based processing |
- Read curation report
- Search for "SNIPPET_NOT_IN_SOURCE" issues
- Copy suggested snippet (if available)
- Open YAML file
- Find evidence item
- Manually update snippet
- Validate
- Time: 5-10 minutes per snippet
- Run intelligent fixer:
poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml --auto-approve
- Time: ~30 seconds per file (with auto-approve)
- Time: ~2-3 minutes per file (with interactive review)
- Individual file: 30-60 minutes → 2-5 minutes (10-30x faster)
- Phase 1 (10 files): 10-20 hours → 1-2 hours (10x faster)
- All 40 files: 20-40 hours → 2-5 hours (10x faster)
Process files using the batch tool:
# Phase 1: Top 10 priority files
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve
# Phase 2: Medium priority
poetry run python scripts/batch_snippet_fixer.py --phase 2 --auto-approve
# Phase 3: Remaining files
poetry run python scripts/batch_snippet_fixer.py --phase 3 --auto-approve

Focus on highest-impact files first:
# Process just the top 3 worst files
poetry run python scripts/batch_snippet_fixer.py \
--files Australian_Lead_Zinc_Polymetallic.yaml AMD_Acidophile_Heterotroph_Network.yaml Chromium_Sulfur_Reduction_Enrichment.yaml

- Some DOIs may not have abstracts in CrossRef
- PMID references are more reliable
- Check if reference format is correct (PMID:xxx or doi:xxx)
- Consider manual lookup for important items
- Abstract may not mention the organism by name
- Try viewing full abstract ([V] in interactive mode)
- Manually select the best sentence and use [E]dit mode, if the prompt offers it (the options listed above are [1-3], [V], [S], and [Q])
- YAML formatting may be unexpected
- Check for quotes, multi-line snippets
- Try manual editing as fallback
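
When an abstract cannot be fetched, the reference format is the first thing to check. The sketch below tests a reference string against the two formats named above (`PMID:xxx` and `doi:xxx`). It is an assumption-laden illustration: the function name `reference_kind` is hypothetical, the patterns are a reasonable guess at what counts as well-formed, and the tools themselves may accept other variants.

```python
import re

# Assumed patterns: digits after "PMID:", and a "10.<registrant>/<suffix>" DOI after "doi:".
PMID_RE = re.compile(r"^PMID:\d+$")
DOI_RE = re.compile(r"^doi:10\.\d{4,9}/\S+$", re.IGNORECASE)


def reference_kind(reference):
    """Return "pmid", "doi", or None if the reference matches neither format."""
    ref = reference.strip()
    if PMID_RE.match(ref):
        return "pmid"
    if DOI_RE.match(ref):
        return "doi"
    return None


for ref in ["PMID:22092956", "doi:10.1000/xyz123", "22092956", "PMID: 22092956"]:
    print(ref, "->", reference_kind(ref))
```

A result of `None` flags entries worth fixing by hand before re-running the fixer, such as a bare number with no prefix or a stray space after the colon.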
- /scripts/apply_suggested_snippets.py - Report-based snippet replacer
- /scripts/intelligent_snippet_fixer.py - Context-aware snippet suggester ⭐
- /scripts/batch_snippet_fixer.py - Batch processor for multiple files
- /docs/AUTOMATION_TOOLS.md - This documentation

All tools include:
- Help text (--help)
- Backup creation
- Error handling
- Progress indicators
- Summary statistics