Skip to content

Latest commit

 

History

History
332 lines (248 loc) · 10.9 KB

File metadata and controls

332 lines (248 loc) · 10.9 KB

Evidence Curation Automation Tools

Overview

Three new tools have been created to streamline the systematic evidence curation process:

  1. apply_suggested_snippets.py - Applies fixes from curation report
  2. intelligent_snippet_fixer.py - Direct abstract fetching with AI-powered snippet suggestion
  3. batch_snippet_fixer.py - Batch processing for multiple files

Tool 1: apply_suggested_snippets.py

Purpose

Parses the evidence_curation_report.txt and applies suggested snippet fixes that were identified by the curation audit.

Usage

# Interactive mode - review each suggestion
poetry run python scripts/apply_suggested_snippets.py --file Australian_Lead_Zinc_Polymetallic.yaml

# Auto-approve mode - automatically apply all suggestions
poetry run python scripts/apply_suggested_snippets.py --file Australian_Lead_Zinc_Polymetallic.yaml --auto-approve

Features

  • Parses curation report for SNIPPET_NOT_IN_SOURCE fixes
  • Interactive review mode
  • Auto-approve mode for batch processing
  • Creates backups (.yaml.bak_snippets) before changes
  • Shows current vs suggested snippets

Limitations

  • Only works for snippets that appear in the curation report
  • Report may truncate some suggestions ("... and X more")
  • Doesn't fetch new abstracts

Tool 2: intelligent_snippet_fixer.py ⭐ RECOMMENDED

Purpose

Directly fetches abstracts from PMID/DOI references and uses context-aware analysis to suggest the most relevant snippets.

Usage

# Interactive mode - review suggestions for all evidence
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml

# Process only invalid snippets (short ones)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --only-invalid

# Auto-approve mode (applies top suggestion automatically)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --auto-approve

# Verbose mode (shows debugging info)
poetry run python scripts/intelligent_snippet_fixer.py --file Australian_Lead_Zinc_Polymetallic.yaml --verbose

Features

  • Direct abstract fetching - Uses EnhancedLiteratureFetcher to get abstracts on-demand
  • Context-aware scoring - Uses organism name and functional roles to find relevant sentences
  • Multiple suggestions - Shows top 3 snippet candidates ranked by relevance
  • Confidence scoring - High/medium/low confidence based on relevance
  • Smart filtering - Excludes author info, copyright, email addresses
  • Interactive workflow - Review abstract, choose from suggestions, or skip
  • Creates backups (.yaml.bak_intelligent) before changes

How It Works

  1. Reads YAML file - Extracts organism, reference, current snippet
  2. Fetches abstract - Uses PMID or DOI to get full abstract text
  3. Extracts context - Gets organism name, functional roles, metabolic keywords
  4. Scores sentences - Ranks sentences by relevance using:
    • Organism name presence (full or partial)
    • Context keyword matching
    • Numerical data presence (%, pH, concentrations)
    • Scientific verbs (showed, demonstrated, found)
    • Penalties for generic sentences
  5. Suggests snippets - Returns top 3 most relevant sentences
  6. Interactive review - User can:
    • [1-3] Apply specific suggestion
    • [V] View full abstract
    • [S] Skip
    • [Q] Quit

Example Output

================================================================================
Item 1/13
================================================================================
Organism:  Acidithiobacillus ferrooxidans
Reference: PMID:22092956

❌ CURRENT snippet:
   Bacterial communities associated with a mineral weathering profile

🔍 Context keywords: degradation, iron, oxidation, sulfur

✅ SUGGESTED snippets (confidence: high):

  [1] Acidithiobacillus ferrooxidans dominated the upper oxidized zones,
      catalyzing both iron and sulfur oxidation at pH 1.5-2.5.

  [2] The bacterial community showed distinct stratification with
      Acidithiobacillus species exhibiting highest abundance (45%) in
      surface layers.

  [3] Metagenome analysis revealed genes for carbon fixation and sulfur
      oxidation were predominantly from Acidithiobacillus.

👉 Enter number to apply, [S]kip, [V]iew abstract, [Q]uit:

Advantages Over Report Parsing

  • Works even when report doesn't have suggestions
  • Fetches abstracts for DOI references (not just PMID)
  • More context-aware snippet selection
  • Handles truncated report entries
  • Can process ALL evidence items, not just flagged ones

Tool 3: batch_snippet_fixer.py

Purpose

Process multiple YAML files in sequence using the intelligent snippet fixer.

Usage

# Process Phase 1 top 10 priority files
poetry run python scripts/batch_snippet_fixer.py --phase 1

# Process Phase 2 medium-priority files
poetry run python scripts/batch_snippet_fixer.py --phase 2

# Process all files from curation report (sorted by issue count)
poetry run python scripts/batch_snippet_fixer.py --from-report

# Process specific files
poetry run python scripts/batch_snippet_fixer.py --files Australian_Lead_Zinc_Polymetallic.yaml AMD_Acidophile_Heterotroph_Network.yaml

# Limit to first N files
poetry run python scripts/batch_snippet_fixer.py --phase 1 --limit 3

# Auto-approve mode (non-interactive)
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve

Features

  • Batch processing - Process multiple files in sequence
  • Phase-based selection - Pre-configured file lists for Phase 1/2/3
  • Report-based selection - Auto-load files from curation report
  • Progress tracking - Shows files processed, issues fixed
  • Pre/post validation - Runs validation before and after each file
  • Summary report - Shows improvement metrics for all files

Example Workflow

# Start with Phase 1 (top 10 files), interactive mode
poetry run python scripts/batch_snippet_fixer.py --phase 1

# After familiarization, use auto-approve for faster processing
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve --limit 3

Output

================================================================================
BATCH PROCESSING SUMMARY
================================================================================
Total files processed: 10

✅ Australian_Lead_Zinc_Polymetallic.yaml
   Issues: 34 → 12 (fixed 22)

✅ AMD_Acidophile_Heterotroph_Network.yaml
   Issues: 25 → 8 (fixed 17)

✅ Chromium_Sulfur_Reduction_Enrichment.yaml
   Issues: 24 → 10 (fixed 14)

...

🎉 Total issues fixed across all files: 142

Recommended Workflow

For Individual Files (Careful Review)

# 1. Use intelligent fixer with interactive mode
poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml

# 2. Review suggestions carefully, apply best ones

# 3. Validate
poetry run python scripts/curate_evidence_with_pdfs.py --file FILENAME.yaml

# 4. Schema check
just validate kb/communities/FILENAME.yaml

For Batch Processing (Faster)

# 1. Start with Phase 1, limited scope for testing
poetry run python scripts/batch_snippet_fixer.py --phase 1 --limit 3

# 2. Review results

# 3. Process full Phase 1 with auto-approve
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve

# 4. Continue with Phase 2
poetry run python scripts/batch_snippet_fixer.py --phase 2 --auto-approve

For Quick Fixes from Report

# Apply report suggestions (fast, limited scope)
poetry run python scripts/apply_suggested_snippets.py --file FILENAME.yaml --auto-approve

# Then use intelligent fixer for remaining issues
poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml --only-invalid

Comparison Matrix

Feature apply_suggested_snippets.py intelligent_snippet_fixer.py batch_snippet_fixer.py
Data source Curation report Direct PMID/DOI fetch Uses intelligent fixer
Coverage Only reported issues All evidence items Multiple files
Context-aware
Abstract fetching
Multiple suggestions ✅ (top 3)
Confidence scoring
DOI support Limited
Batch mode
Validation
Best for Quick report fixes Individual file curation Phase-based processing

Key Improvements Over Manual Curation

Before (Manual)

  1. Read curation report
  2. Search for "SNIPPET_NOT_IN_SOURCE" issues
  3. Copy suggested snippet (if available)
  4. Open YAML file
  5. Find evidence item
  6. Manually update snippet
  7. Validate
  8. Time: 5-10 minutes per snippet

After (Automated)

  1. Run intelligent fixer: poetry run python scripts/intelligent_snippet_fixer.py --file FILENAME.yaml --auto-approve
  2. Time: ~30 seconds per file (with auto-approve)
  3. Time: ~2-3 minutes per file (with interactive review)

Time Savings

  • Individual file: 30-60 minutes → 2-5 minutes (10-30x faster)
  • Phase 1 (10 files): 10-20 hours → 1-2 hours (10x faster)
  • All 40 files: 20-41 hours → 2-5 hours (10x faster)

Next Steps

Option A: Systematic Curation (Recommended)

Process files using the batch tool:

# Phase 1: Top 10 priority files
poetry run python scripts/batch_snippet_fixer.py --phase 1 --auto-approve

# Phase 2: Medium priority
poetry run python scripts/batch_snippet_fixer.py --phase 2 --auto-approve

# Phase 3: Remaining files
poetry run python scripts/batch_snippet_fixer.py --phase 3 --auto-approve

Option B: Targeted Fixes

Focus on highest-impact files first:

# Process just the top 3 worst files
poetry run python scripts/batch_snippet_fixer.py \
  --files Australian_Lead_Zinc_Polymetallic.yaml AMD_Acidophile_Heterotroph_Network.yaml Chromium_Sulfur_Reduction_Enrichment.yaml

Troubleshooting

"Could not fetch abstract"

  • Some DOIs may not have abstracts in CrossRef
  • PMID references are more reliable
  • Check if reference format is correct (PMID:xxx or doi:xxx)
  • Consider manual lookup for important items

"No relevant sentences found"

  • Abstract may not mention the organism by name
  • Try viewing full abstract ([V] in interactive mode)
  • Manually select best sentence and use [E]dit mode

"Snippet not found in file"

  • YAML formatting may be unexpected
  • Check for quotes, multi-line snippets
  • Try manual editing as fallback

Files Created

  1. /scripts/apply_suggested_snippets.py - Report-based snippet replacer
  2. /scripts/intelligent_snippet_fixer.py - Context-aware snippet suggester ⭐
  3. /scripts/batch_snippet_fixer.py - Batch processor for multiple files
  4. /docs/AUTOMATION_TOOLS.md - This documentation

All tools include:

  • Help text (--help)
  • Backup creation
  • Error handling
  • Progress indicators
  • Summary statistics