diff --git a/docs/data_exploration.md b/docs/data_exploration.md new file mode 100644 index 0000000..a99d052 --- /dev/null +++ b/docs/data_exploration.md @@ -0,0 +1,177 @@ +# Data Exploration for Missing Charity Information + +This document describes the data exploration functionality created to identify organizations and charities with missing data elements. + +## Overview + +The data exploration system analyzes both Sites (charity service locations) and Organizations (parent charity entities) to identify missing essential and important data fields. This helps prioritize data collection efforts and improve data completeness. + +## Key Components + +### 1. OrganizationOperations (`src/tackle_hunger/organization_operations.py`) + +Provides GraphQL operations for fetching and analyzing organization data: + +- `get_organizations_for_ai()` - Fetch organizations with all relevant fields +- `get_organization_by_id()` - Fetch specific organization details +- `update_organization()` - Update organization information + +### 2. DataExplorer (`src/tackle_hunger/data_explorer.py`) + +Analyzes data completeness and generates insights: + +- **Field Classification**: + - **Essential Site Fields**: name, streetAddress, city, state, zip, publicEmail, publicPhone + - **Important Site Fields**: website, description, serviceArea, acceptsFoodDonations, ein, contact details + - **Essential Org Fields**: name + - **Important Org Fields**: address, contact, description, ein, Feeding America affiliation + +- **Analysis Functions**: + - `analyze_site_completeness()` - Analyze individual site data completeness + - `analyze_organization_completeness()` - Analyze individual organization data completeness + - `explore_sites_data()` - Comprehensive site data analysis + - `explore_organizations_data()` - Comprehensive organization data analysis + - `generate_comprehensive_report()` - Full analysis with recommendations + +### 3. 
Exploration Script (`scripts/explore_data_alesha.py`) + +Command-line tool for running data exploration: + +```bash +python scripts/explore_data_alesha.py [OPTIONS] +``` + +#### Options: +- `--sites-limit N` - Number of sites to analyze (default: 50) +- `--orgs-limit N` - Number of organizations to analyze (default: 50) +- `--output-file FILE` - Save the detailed JSON report to FILE (default: a timestamped file under `/tmp`) +- `--environment ENV` - API environment: dev, staging, or production (default: dev) +- `--summary-only` - Print the summary only and skip saving the detailed report + +## Usage Examples + +### Basic Analysis +```bash +# Analyze 50 sites and 50 organizations, show summary +python scripts/explore_data_alesha.py --summary-only + +# Analyze more records (the detailed report is saved to a timestamped file under /tmp) +python scripts/explore_data_alesha.py --sites-limit 100 --orgs-limit 75 +``` + +### Detailed Analysis with Report +```bash +# Generate full report with custom output file +python scripts/explore_data_alesha.py \ + --sites-limit 200 \ + --orgs-limit 150 \ + --output-file charity_data_analysis.json +``` + +### Programmatic Usage +```python +from src.tackle_hunger import TackleHungerClient, DataExplorer + +# Set up client +client = TackleHungerClient() +explorer = DataExplorer(client) + +# Generate comprehensive report +report = explorer.generate_comprehensive_report( + sites_limit=100, + orgs_limit=100 +) + +# Print summary +explorer.print_summary(report) + +# Save detailed report +explorer.save_report(report, "analysis_results.json") +``` + +## Report Structure + +The analysis generates a comprehensive report with: + +### Executive Summary +- Total entities analyzed +- Entities with essential data gaps +- Overall data gap percentage +- Average completeness scores + +### Detailed Analysis +- **Sites Analysis**: Missing field counts, most problematic sites +- **Organizations Analysis**: Missing field counts, most problematic organizations +- **Recommendations**: Prioritized actions based on findings + +### Field-Specific Insights +- Count of missing data by field +- 
Percentage of entities missing each field +- Priority ranking for data collection efforts + +## Completeness Scoring + +Each entity receives a completeness score (0.0 to 1.0, rounded to two decimal places) calculated as: +- Essential fields are weighted 2x +- Important fields are weighted 1x +- Score = (complete_essential × 2 + complete_important) / (total_essential × 2 + total_important) + +## Key Findings Categories + +### Essential Data Gaps +Entities missing critical fields required for basic functionality: +- Site: Missing name, address, or public contact information +- Organization: Missing name + +### Important Data Gaps +Entities missing valuable but not critical fields: +- Missing descriptions, websites, service details +- Missing EINs, affiliation information + +## Recommendations Engine + +The system automatically generates prioritized recommendations: + +1. **Priority**: Names the most commonly missing field for sites and for organizations, with its count and percentage +2. **High Priority**: Flags sites or organizations when more than 50% of them have essential data gaps +3. **System Improvements**: Pairs each high-priority flag with a suggestion for field validation + +## Integration with Existing Workflow + +This analysis integrates with the existing charity validation workflow: + +1. **Data Collection Planning**: Identify priority fields for AI/ETL operations +2. **Quality Assurance**: Monitor data completeness over time +3. **Volunteer Focus**: Direct volunteer efforts to highest-impact data gaps +4. **API Enhancement**: Inform required field validation improvements + +## Error Handling + +The exploration script catches errors, prints likely causes, and exits with a non-zero status. Conditions covered include: +- Network connectivity issues +- API authentication problems +- Missing or malformed data +- GraphQL schema changes + +## Future Enhancements + +Potential improvements for the data exploration system: + +1. **Historical Tracking**: Monitor data completeness trends over time +2. **Geographic Analysis**: Identify regional data quality patterns +3. **Automated Alerts**: Notify when data quality drops below thresholds +4. 
**Integration Testing**: Validate against different API environments +5. **Performance Optimization**: Handle larger datasets efficiently + +## Testing + +Comprehensive test coverage includes: +- Unit tests for all analysis functions +- Mock data scenarios for edge cases +- Integration tests for GraphQL operations +- Command-line interface testing + +Run tests with: +```bash +python -m pytest tests/ -v +``` \ No newline at end of file diff --git a/examples/demo_data_exploration.py b/examples/demo_data_exploration.py new file mode 100755 index 0000000..bbba066 --- /dev/null +++ b/examples/demo_data_exploration.py @@ -0,0 +1,170 @@ +#!/usr/bin/env python3 +""" +Demo script showing data exploration functionality with mock data. + +This demonstrates the DataExplorer functionality without requiring API access. +""" + +import sys +from pathlib import Path + +# Add src to path for imports +sys.path.insert(0, str(Path(__file__).parent.parent / "src")) + +from tackle_hunger.data_explorer import DataExplorer +from unittest.mock import Mock + + +def create_mock_data(): + """Create realistic mock data for demonstration.""" + + # Mock sites with varying levels of completeness + mock_sites = [ + { + 'id': 'site1', + 'name': 'Complete Community Food Bank', + 'streetAddress': '123 Main Street', + 'city': 'Anytown', + 'state': 'CA', + 'zip': '12345', + 'publicEmail': 'info@communityfood.org', + 'publicPhone': '555-0123', + 'website': 'https://communityfood.org', + 'description': 'Serving the community since 1985 with fresh food and support.', + 'serviceArea': 'City-wide, focus on downtown area', + 'acceptsFoodDonations': 'Yes', + 'ein': '12-3456789', + 'contactEmail': 'director@communityfood.org', + 'contactName': 'Jane Smith', + 'contactPhone': '555-0124' + }, + { + 'id': 'site2', + 'name': 'Partial Info Food Pantry', + 'streetAddress': '456 Oak Avenue', + 'city': 'Somewhere', + 'state': 'NY', + 'zip': '54321', + 'publicEmail': '', # Missing + 'publicPhone': '555-0456', + 
'website': '', # Missing + 'description': '', # Missing + 'acceptsFoodDonations': 'Yes', + # Missing many other fields + }, + { + 'id': 'site3', + 'name': 'Minimal Data Shelter', + 'streetAddress': '789 Pine Road', + 'city': 'Elsewhere', + 'state': 'TX', + 'zip': '', # Missing essential field + 'publicEmail': '', # Missing essential field + 'publicPhone': '', # Missing essential field + # Missing most fields + } + ] + + # Mock organizations with varying completeness + mock_orgs = [ + { + 'id': 'org1', + 'name': 'Complete Charity Organization', + 'streetAddress': '100 Charity Lane', + 'city': 'Generous City', + 'state': 'CA', + 'zip': '98765', + 'publicEmail': 'contact@charity.org', + 'publicPhone': '555-9876', + 'description': 'A well-established charity serving multiple communities.', + 'ein': '98-7654321', + 'isFeedingAmericaAffiliate': 'Yes', + 'sites': [{'id': 'site1', 'name': 'Site 1'}, {'id': 'site2', 'name': 'Site 2'}] + }, + { + 'id': 'org2', + 'name': 'Incomplete Organization', + 'streetAddress': '', # Missing + 'city': '', # Missing + 'publicEmail': '', # Missing + 'description': '', # Missing + 'ein': '', # Missing + 'sites': [{'id': 'site3', 'name': 'Site 3'}] + }, + { + 'id': 'org3', + 'name': '', # Missing essential field! 
+ 'streetAddress': '200 Hope Street', + 'city': 'Kindness', + 'state': 'FL', + 'sites': [] + } + ] + + return mock_sites, mock_orgs + + +def main(): + """Run the demonstration.""" + print("Data Exploration Demo - Mock Data Analysis") + print("=" * 50) + + # Create mock client and explorer + mock_client = Mock() + + # Create mock data + mock_sites, mock_orgs = create_mock_data() + + # Create data explorer + explorer = DataExplorer(mock_client) + + # Mock the data fetching methods + explorer.site_ops.get_sites_for_ai = Mock(return_value=mock_sites) + explorer.org_ops.get_organizations_for_ai = Mock(return_value=mock_orgs) + + print("Analyzing mock data...") + print(f"Sites: {len(mock_sites)}, Organizations: {len(mock_orgs)}") + + # Generate analysis report + report = explorer.generate_comprehensive_report( + sites_limit=len(mock_sites), + orgs_limit=len(mock_orgs) + ) + + # Print summary + explorer.print_summary(report) + + # Show detailed findings for each entity + print("\nDETAILED FINDINGS:") + print("-" * 40) + + print("\nSite Analysis Details:") + for analysis in report['sites_analysis']['all_site_analyses']: + print(f"• {analysis['name']} (ID: {analysis['site_id']})") + print(f" Completeness Score: {analysis['completeness_score']}") + if analysis['missing_essential']: + print(f" Missing Essential: {', '.join(analysis['missing_essential'])}") + if analysis['missing_important']: + print(f" Missing Important: {', '.join(analysis['missing_important'])}") + print() + + print("Organization Analysis Details:") + for analysis in report['organizations_analysis']['all_organization_analyses']: + print(f"• {analysis['name']} (ID: {analysis['org_id']})") + print(f" Completeness Score: {analysis['completeness_score']}") + print(f" Sites: {analysis['site_count']}") + if analysis['missing_essential']: + print(f" Missing Essential: {', '.join(analysis['missing_essential'])}") + if analysis['missing_important']: + print(f" Missing Important: {', 
'.join(analysis['missing_important'])}") + print() + + # Save report for inspection + filename = explorer.save_report(report, "/tmp/demo_analysis_report.json") + print(f"\n✓ Demo completed! Full report saved to: {filename}") + print("\nThis demonstrates how the data exploration system identifies") + print("missing data elements in both sites and organizations.") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/scripts/explore_data_alesha.py b/scripts/explore_data_alesha.py new file mode 100755 index 0000000..0d84908 --- /dev/null +++ b/scripts/explore_data_alesha.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 +""" +Data exploration script for Alesha - Explore missing data in organizations and charities. + +This script analyzes the GraphQL data to identify organizations and charities +that are missing essential data elements. +""" + +import os +import sys +import argparse +from pathlib import Path + +# Add src to path for imports +sys.path.insert(0, str(Path(__file__).parent.parent / "src")) + +from tackle_hunger.graphql_client import TackleHungerClient, TackleHungerConfig +from tackle_hunger.data_explorer import DataExplorer + + +def setup_environment(): + """Setup environment variables from .env file if available.""" + try: + from dotenv import load_dotenv + env_file = Path(__file__).parent.parent / ".env" + if env_file.exists(): + load_dotenv(env_file) + print(f"✓ Loaded environment from {env_file}") + else: + print("⚠ No .env file found, using defaults/environment variables") + except ImportError: + print("⚠ python-dotenv not available, relying on environment variables") + + +def main(): + """Main exploration function.""" + parser = argparse.ArgumentParser( + description="Explore Tackle Hunger data to identify missing elements" + ) + parser.add_argument( + "--sites-limit", + type=int, + default=50, + help="Number of sites to analyze (default: 50)" + ) + parser.add_argument( + "--orgs-limit", + type=int, + default=50, + help="Number 
of organizations to analyze (default: 50)" + ) + parser.add_argument( + "--output-file", + type=str, + help="Output file for detailed JSON report (optional)" + ) + parser.add_argument( + "--environment", + type=str, + choices=["dev", "staging", "production"], + default="dev", + help="GraphQL API environment to use (default: dev)" + ) + parser.add_argument( + "--summary-only", + action="store_true", + help="Show only summary (don't save detailed report)" + ) + + args = parser.parse_args() + + print("Tackle Hunger Data Exploration - Missing Data Analysis") + print("=" * 60) + + # Setup environment + setup_environment() + + # Configure client + try: + config = TackleHungerConfig(environment=args.environment) + client = TackleHungerClient(config) + print(f"✓ Connected to {args.environment} environment: {config.graphql_endpoint}") + except Exception as e: + print(f"✗ Failed to create GraphQL client: {e}") + print("\nPlease ensure you have:") + print("1. Set AI_SCRAPING_TOKEN in your environment or .env file") + print("2. Valid network access to the GraphQL endpoint") + return 1 + + # Create data explorer + explorer = DataExplorer(client) + + try: + # Generate comprehensive report + print(f"\nAnalyzing up to {args.sites_limit} sites and {args.orgs_limit} organizations...") + report = explorer.generate_comprehensive_report( + sites_limit=args.sites_limit, + orgs_limit=args.orgs_limit + ) + + # Print summary + explorer.print_summary(report) + + # Save detailed report if requested + if not args.summary_only: + output_file = args.output_file + saved_file = explorer.save_report(report, output_file) + print(f"\n✓ Detailed report saved to: {saved_file}") + + # Show most problematic entries + print("\nMOST PROBLEMATIC SITES (Top 5):") + for i, site in enumerate(report['sites_analysis']['most_problematic_sites'][:5], 1): + print(f"{i}. 
{site['name']} (ID: {site['site_id']})") + print(f" Completeness: {site['completeness_score']}, Missing: {site['total_missing']} fields") + if site['missing_essential']: + print(f" Essential missing: {', '.join(site['missing_essential'])}") + + print("\nMOST PROBLEMATIC ORGANIZATIONS (Top 5):") + for i, org in enumerate(report['organizations_analysis']['most_problematic_organizations'][:5], 1): + print(f"{i}. {org['name']} (ID: {org['org_id']})") + print(f" Completeness: {org['completeness_score']}, Missing: {org['total_missing']} fields") + print(f" Sites: {org['site_count']}") + if org['missing_essential']: + print(f" Essential missing: {', '.join(org['missing_essential'])}") + + print(f"\n✓ Data exploration completed successfully!") + print(f"Analyzed {report['executive_summary']['total_entities_analyzed']} total entities") + print(f"Found {report['executive_summary']['total_with_essential_data_gaps']} entities with essential data gaps") + + return 0 + + except Exception as e: + print(f"\n✗ Error during data exploration: {e}") + print("\nThis might be due to:") + print("1. Network connectivity issues") + print("2. Invalid API credentials") + print("3. GraphQL schema changes") + print("4. 
API rate limiting") + return 1 + + +if __name__ == "__main__": + sys.exit(main()) \ No newline at end of file diff --git a/src/tackle_hunger/__init__.py b/src/tackle_hunger/__init__.py index 292e20e..9045779 100644 --- a/src/tackle_hunger/__init__.py +++ b/src/tackle_hunger/__init__.py @@ -7,3 +7,16 @@ __version__ = "1.0.0" __author__ = "LNRS Tech for Good Volunteers" + +from .graphql_client import TackleHungerClient, TackleHungerConfig +from .site_operations import SiteOperations +from .organization_operations import OrganizationOperations +from .data_explorer import DataExplorer + +__all__ = [ + "TackleHungerClient", + "TackleHungerConfig", + "SiteOperations", + "OrganizationOperations", + "DataExplorer", +] diff --git a/src/tackle_hunger/data_explorer.py b/src/tackle_hunger/data_explorer.py new file mode 100644 index 0000000..c698d9b --- /dev/null +++ b/src/tackle_hunger/data_explorer.py @@ -0,0 +1,282 @@ +""" +Data exploration utilities for charity validation. + +Analyzes organizations and sites to identify missing data elements. 
+""" + +from typing import Dict, Any, List, Optional +import json +from datetime import datetime +from .graphql_client import TackleHungerClient +from .site_operations import SiteOperations +from .organization_operations import OrganizationOperations + + +class DataExplorer: + """Analyzes charity data to identify missing information.""" + + def __init__(self, client: TackleHungerClient): + self.client = client + self.site_ops = SiteOperations(client) + self.org_ops = OrganizationOperations(client) + + # Field definitions for analysis + ESSENTIAL_SITE_FIELDS = { + 'name', 'streetAddress', 'city', 'state', 'zip', 'publicEmail', 'publicPhone' + } + + IMPORTANT_SITE_FIELDS = { + 'website', 'description', 'serviceArea', 'acceptsFoodDonations', 'ein', + 'contactEmail', 'contactName', 'contactPhone' + } + + ESSENTIAL_ORG_FIELDS = { + 'name' + } + + IMPORTANT_ORG_FIELDS = { + 'streetAddress', 'city', 'state', 'zip', 'publicEmail', 'publicPhone', + 'description', 'ein', 'isFeedingAmericaAffiliate' + } + + def analyze_site_completeness(self, site: Dict[str, Any]) -> Dict[str, Any]: + """Analyze completeness of a single site.""" + missing_essential = [] + missing_important = [] + + for field in sorted(self.ESSENTIAL_SITE_FIELDS): # sorted for a stable report order + if site.get(field) in (None, ''): # False/0 are real values, not gaps + missing_essential.append(field) + + for field in sorted(self.IMPORTANT_SITE_FIELDS): + if site.get(field) in (None, ''): + missing_important.append(field) + + completeness_score = ( + (len(self.ESSENTIAL_SITE_FIELDS) - len(missing_essential)) * 2 + + (len(self.IMPORTANT_SITE_FIELDS) - len(missing_important)) + ) / (len(self.ESSENTIAL_SITE_FIELDS) * 2 + len(self.IMPORTANT_SITE_FIELDS)) + + return { + 'site_id': site.get('id'), + 'name': site.get('name') or 'Unknown', + 'completeness_score': round(completeness_score, 2), + 'missing_essential': missing_essential, + 'missing_important': missing_important, + 'has_essential_gaps': len(missing_essential) > 0, + 'total_missing': 
len(missing_essential) + len(missing_important) + } + + def analyze_organization_completeness(self, org: Dict[str, Any]) -> Dict[str, Any]: + """Analyze completeness of a single organization.""" + missing_essential = [] + missing_important = [] + + for field in sorted(self.ESSENTIAL_ORG_FIELDS): # sorted for a stable report order + if org.get(field) in (None, ''): # False/0 are real values, not gaps + missing_essential.append(field) + + for field in sorted(self.IMPORTANT_ORG_FIELDS): + if org.get(field) in (None, ''): + missing_important.append(field) + + completeness_score = ( + (len(self.ESSENTIAL_ORG_FIELDS) - len(missing_essential)) * 2 + + (len(self.IMPORTANT_ORG_FIELDS) - len(missing_important)) + ) / (len(self.ESSENTIAL_ORG_FIELDS) * 2 + len(self.IMPORTANT_ORG_FIELDS)) + + return { + 'org_id': org.get('id'), + 'name': org.get('name') or 'Unknown', + 'completeness_score': round(completeness_score, 2), + 'missing_essential': missing_essential, + 'missing_important': missing_important, + 'has_essential_gaps': len(missing_essential) > 0, + 'total_missing': len(missing_essential) + len(missing_important), + 'site_count': len(org.get('sites') or []) # tolerate an explicit null from the API + } + + def explore_sites_data(self, limit: int = 100) -> Dict[str, Any]: + """Explore sites data and identify missing information.""" + print(f"Fetching {limit} sites for analysis...") + sites = self.site_ops.get_sites_for_ai(limit=limit) + + analyses = [] + for site in sites: + analysis = self.analyze_site_completeness(site) + analyses.append(analysis) + + # Summary statistics + total_sites = len(analyses) + sites_with_essential_gaps = sum(1 for a in analyses if a['has_essential_gaps']) + avg_completeness = sum(a['completeness_score'] for a in analyses) / total_sites if total_sites > 0 else 0 + + # Field-specific missing counts + field_missing_counts = {} + for field in sorted(self.ESSENTIAL_SITE_FIELDS | self.IMPORTANT_SITE_FIELDS): + count = sum(1 for a in analyses if field in (a['missing_essential'] + a['missing_important'])) + field_missing_counts[field] = count + + # Sort by most 
problematic + most_problematic = sorted(analyses, key=lambda x: (-len(x['missing_essential']), -x['total_missing']))[:10] + + return { + 'analysis_timestamp': datetime.now().isoformat(), + 'summary': { + 'total_sites_analyzed': total_sites, + 'sites_with_essential_gaps': sites_with_essential_gaps, + 'average_completeness_score': round(avg_completeness, 2), + 'percentage_with_essential_gaps': round((sites_with_essential_gaps / total_sites * 100) if total_sites > 0 else 0, 1) + }, + 'field_missing_counts': field_missing_counts, + 'most_problematic_sites': most_problematic, + 'all_site_analyses': analyses + } + + def explore_organizations_data(self, limit: int = 100) -> Dict[str, Any]: + """Explore organizations data and identify missing information.""" + print(f"Fetching {limit} organizations for analysis...") + organizations = self.org_ops.get_organizations_for_ai(limit=limit) + + analyses = [] + for org in organizations: + analysis = self.analyze_organization_completeness(org) + analyses.append(analysis) + + # Summary statistics + total_orgs = len(analyses) + orgs_with_essential_gaps = sum(1 for a in analyses if a['has_essential_gaps']) + avg_completeness = sum(a['completeness_score'] for a in analyses) / total_orgs if total_orgs > 0 else 0 + + # Field-specific missing counts + field_missing_counts = {} + for field in self.ESSENTIAL_ORG_FIELDS | self.IMPORTANT_ORG_FIELDS: + count = sum(1 for a in analyses if field in (a['missing_essential'] + a['missing_important'])) + field_missing_counts[field] = count + + # Sort by most problematic + most_problematic = sorted(analyses, key=lambda x: (-len(x['missing_essential']), -x['total_missing']))[:10] + + return { + 'analysis_timestamp': datetime.now().isoformat(), + 'summary': { + 'total_organizations_analyzed': total_orgs, + 'organizations_with_essential_gaps': orgs_with_essential_gaps, + 'average_completeness_score': round(avg_completeness, 2), + 'percentage_with_essential_gaps': round((orgs_with_essential_gaps / 
total_orgs * 100) if total_orgs > 0 else 0, 1) + }, + 'field_missing_counts': field_missing_counts, + 'most_problematic_organizations': most_problematic, + 'all_organization_analyses': analyses + } + + def generate_comprehensive_report(self, sites_limit: int = 100, orgs_limit: int = 100) -> Dict[str, Any]: + """Generate a comprehensive data completeness report.""" + print("Generating comprehensive data completeness report...") + + sites_analysis = self.explore_sites_data(limit=sites_limit) + orgs_analysis = self.explore_organizations_data(limit=orgs_limit) + + # Cross-reference analysis + total_entities = sites_analysis['summary']['total_sites_analyzed'] + orgs_analysis['summary']['total_organizations_analyzed'] + total_with_gaps = sites_analysis['summary']['sites_with_essential_gaps'] + orgs_analysis['summary']['organizations_with_essential_gaps'] + + return { + 'report_timestamp': datetime.now().isoformat(), + 'executive_summary': { + 'total_entities_analyzed': total_entities, + 'total_with_essential_data_gaps': total_with_gaps, + 'overall_data_gap_percentage': round((total_with_gaps / total_entities * 100) if total_entities > 0 else 0, 1), + 'sites_average_completeness': sites_analysis['summary']['average_completeness_score'], + 'organizations_average_completeness': orgs_analysis['summary']['average_completeness_score'] + }, + 'sites_analysis': sites_analysis, + 'organizations_analysis': orgs_analysis, + 'recommendations': self._generate_recommendations(sites_analysis, orgs_analysis) + } + + def _generate_recommendations(self, sites_analysis: Dict[str, Any], orgs_analysis: Dict[str, Any]) -> List[str]: + """Generate recommendations based on the analysis.""" + recommendations = [] + + # Site recommendations + site_missing = sites_analysis['field_missing_counts'] + most_missing_site_field = max(site_missing.items(), key=lambda x: x[1]) if site_missing else ('', 0) + + if most_missing_site_field[1] > 0: + recommendations.append( + f"Priority: Focus on collecting 
'{most_missing_site_field[0]}' data for sites - " + f"missing in {most_missing_site_field[1]} out of {sites_analysis['summary']['total_sites_analyzed']} sites " + f"({round(most_missing_site_field[1]/sites_analysis['summary']['total_sites_analyzed']*100, 1)}%)" + ) + + # Organization recommendations + org_missing = orgs_analysis['field_missing_counts'] + most_missing_org_field = max(org_missing.items(), key=lambda x: x[1]) if org_missing else ('', 0) + + if most_missing_org_field[1] > 0: + recommendations.append( + f"Priority: Focus on collecting '{most_missing_org_field[0]}' data for organizations - " + f"missing in {most_missing_org_field[1]} out of {orgs_analysis['summary']['total_organizations_analyzed']} organizations " + f"({round(most_missing_org_field[1]/orgs_analysis['summary']['total_organizations_analyzed']*100, 1)}%)" + ) + + # General recommendations + if sites_analysis['summary']['percentage_with_essential_gaps'] > 50: + recommendations.append( + "High priority: More than 50% of sites have essential data gaps. Consider implementing automated data validation." + ) + + if orgs_analysis['summary']['percentage_with_essential_gaps'] > 50: + recommendations.append( + "High priority: More than 50% of organizations have essential data gaps. Consider implementing mandatory field validation." 
+ ) + + return recommendations + + def save_report(self, report: Dict[str, Any], filename: Optional[str] = None) -> str: + """Save the analysis report to a JSON file.""" + if filename is None: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename = f"/tmp/tackle_hunger_data_analysis_{timestamp}.json" + + with open(filename, 'w') as f: + json.dump(report, f, indent=2, default=str) + + print(f"Report saved to: {filename}") + return filename + + def print_summary(self, report: Dict[str, Any]) -> None: + """Print a human-readable summary of the report.""" + print("\n" + "="*80) + print("TACKLE HUNGER DATA COMPLETENESS ANALYSIS SUMMARY") + print("="*80) + + exec_summary = report['executive_summary'] + print(f"Total Entities Analyzed: {exec_summary['total_entities_analyzed']}") + print(f"Entities with Essential Data Gaps: {exec_summary['total_with_essential_data_gaps']}") + print(f"Overall Data Gap Percentage: {exec_summary['overall_data_gap_percentage']}%") + print(f"Sites Average Completeness: {exec_summary['sites_average_completeness']}") + print(f"Organizations Average Completeness: {exec_summary['organizations_average_completeness']}") + + print("\nRECOMMENDATIONS:") + for i, rec in enumerate(report['recommendations'], 1): + print(f"{i}. 
{rec}") + + print("\nTOP MISSING FIELDS - SITES:") + sites_missing = report['sites_analysis']['field_missing_counts'] + for field, count in sorted(sites_missing.items(), key=lambda x: x[1], reverse=True)[:5]: + if count > 0: + total_sites = report['sites_analysis']['summary']['total_sites_analyzed'] + percentage = round(count/total_sites*100, 1) if total_sites > 0 else 0 + print(f" - {field}: {count} missing ({percentage}%)") + + print("\nTOP MISSING FIELDS - ORGANIZATIONS:") + orgs_missing = report['organizations_analysis']['field_missing_counts'] + for field, count in sorted(orgs_missing.items(), key=lambda x: x[1], reverse=True)[:5]: + if count > 0: + total_orgs = report['organizations_analysis']['summary']['total_organizations_analyzed'] + percentage = round(count/total_orgs*100, 1) if total_orgs > 0 else 0 + print(f" - {field}: {count} missing ({percentage}%)") + + print("\n" + "="*80) \ No newline at end of file diff --git a/src/tackle_hunger/organization_operations.py b/src/tackle_hunger/organization_operations.py new file mode 100644 index 0000000..879f2ae --- /dev/null +++ b/src/tackle_hunger/organization_operations.py @@ -0,0 +1,115 @@ +""" +Organization operations for charity validation. + +Provides operations for fetching and analyzing charity organizations through GraphQL. 
+""" + +from typing import Dict, Any, List, Optional +from .graphql_client import TackleHungerClient + + +class OrganizationOperations: + """Operations for managing charity organizations.""" + + def __init__(self, client: TackleHungerClient): + self.client = client + + def get_organizations_for_ai(self, limit: int = 50) -> List[Dict[str, Any]]: + """Fetch organizations for AI processing.""" + query = ''' + query GetOrganizationsForAI($limit: Int) { + organizationsForAI(limit: $limit) { + id + sites { + id + name + } + name + streetAddress + addressLine2 + city + state + zip + publicEmail + publicPhone + email + phone + isFeedingAmericaAffiliate + description + ein + banner + logo + updatedAt + createdAt + } + } + ''' + + result = self.client.execute_query(query, {"limit": limit}) + return result.get("organizationsForAI", []) + + def get_organization_by_id(self, org_id: str) -> Optional[Dict[str, Any]]: + """Fetch a specific organization by ID.""" + query = ''' + query GetOrganizationForAI($orgId: ID!) { + organizationForAI(id: $orgId) { + id + sites { + id + name + streetAddress + city + state + zip + publicEmail + publicPhone + website + description + serviceArea + acceptsFoodDonations + status + ein + } + name + streetAddress + addressLine2 + city + state + zip + publicEmail + publicPhone + email + phone + isFeedingAmericaAffiliate + description + ein + banner + logo + updatedAt + createdAt + } + } + ''' + + try: + result = self.client.execute_query(query, {"orgId": org_id}) + return result.get("organizationForAI") + except Exception: + return None + + def update_organization(self, org_id: str, org_data: Dict[str, Any]) -> Dict[str, Any]: + """Update an existing organization.""" + mutation = ''' + mutation UpdateOrganizationFromAI($organizationId: String!, $input: organizationInputUpdate!) 
{ + updateOrganizationFromAI(organizationId: $organizationId, input: $input) { + id + name + updatedAt + } + } + ''' + + return self.client.execute_query( + mutation, + {"organizationId": org_id, "input": org_data} + ) \ No newline at end of file diff --git a/tests/test_data_explorer.py b/tests/test_data_explorer.py new file mode 100644 index 0000000..e4a84be --- /dev/null +++ b/tests/test_data_explorer.py @@ -0,0 +1,186 @@ +""" +Tests for data exploration functionality. +""" + +import pytest +from unittest.mock import Mock, MagicMock +from src.tackle_hunger.data_explorer import DataExplorer +from src.tackle_hunger.graphql_client import TackleHungerClient + + +@pytest.fixture +def mock_client(): + """Create a mock GraphQL client.""" + return Mock(spec=TackleHungerClient) + + +@pytest.fixture +def data_explorer(mock_client): + """Create a data explorer with mock client.""" + return DataExplorer(mock_client) + + +def test_analyze_site_completeness_complete_site(data_explorer): + """Test analysis of a complete site.""" + complete_site = { + 'id': 'site1', + 'name': 'Complete Food Bank', + 'streetAddress': '123 Main St', + 'city': 'Anytown', + 'state': 'CA', + 'zip': '12345', + 'publicEmail': 'info@foodbank.org', + 'publicPhone': '555-1234', + 'website': 'https://foodbank.org', + 'description': 'A complete food bank', + 'serviceArea': 'City wide', + 'acceptsFoodDonations': 'Yes', + 'ein': '12-3456789', + 'contactEmail': 'contact@foodbank.org', + 'contactName': 'John Doe', + 'contactPhone': '555-5678' + } + + result = data_explorer.analyze_site_completeness(complete_site) + + assert result['site_id'] == 'site1' + assert result['name'] == 'Complete Food Bank' + assert result['completeness_score'] == 1.0 + assert result['missing_essential'] == [] + assert result['missing_important'] == [] + assert result['has_essential_gaps'] == False + assert result['total_missing'] == 0 + + +def test_analyze_site_completeness_incomplete_site(data_explorer): + """Test analysis of an 
incomplete site.""" + incomplete_site = { + 'id': 'site2', + 'name': 'Incomplete Food Bank', + 'streetAddress': '456 Oak Ave', + 'city': 'Somewhere', + 'state': 'NY', + 'zip': '', # Missing zip + 'publicEmail': '', # Missing email + 'publicPhone': '555-9999', + # Missing many other fields + } + + result = data_explorer.analyze_site_completeness(incomplete_site) + + assert result['site_id'] == 'site2' + assert result['name'] == 'Incomplete Food Bank' + assert result['completeness_score'] < 1.0 + assert 'zip' in result['missing_essential'] + assert 'publicEmail' in result['missing_essential'] + assert result['has_essential_gaps'] == True + assert result['total_missing'] > 0 + + +def test_analyze_organization_completeness_complete_org(data_explorer): + """Test analysis of a complete organization.""" + complete_org = { + 'id': 'org1', + 'name': 'Complete Charity Org', + 'streetAddress': '789 Elm St', + 'city': 'Metropolis', + 'state': 'TX', + 'zip': '54321', + 'publicEmail': 'info@charity.org', + 'publicPhone': '555-0000', + 'description': 'A complete charity organization', + 'ein': '98-7654321', + 'isFeedingAmericaAffiliate': 'Yes', + 'sites': [{'id': 'site1', 'name': 'Site 1'}] + } + + result = data_explorer.analyze_organization_completeness(complete_org) + + assert result['org_id'] == 'org1' + assert result['name'] == 'Complete Charity Org' + assert result['completeness_score'] == 1.0 + assert result['missing_essential'] == [] + assert result['missing_important'] == [] + assert result['has_essential_gaps'] == False + assert result['total_missing'] == 0 + assert result['site_count'] == 1 + + +def test_analyze_organization_completeness_missing_name(data_explorer): + """Test analysis of an organization missing essential name field.""" + incomplete_org = { + 'id': 'org2', + 'name': '', # Missing essential name + 'streetAddress': '321 Pine St', + 'city': 'Smalltown', + 'state': 'FL', + 'sites': [] + } + + result = 
data_explorer.analyze_organization_completeness(incomplete_org) + + assert result['org_id'] == 'org2' + assert result['name'] == 'Unknown' # Default for missing/empty name + assert result['completeness_score'] < 1.0 + assert 'name' in result['missing_essential'] + assert result['has_essential_gaps'] == True + assert result['site_count'] == 0 + + +def test_field_definitions(data_explorer): + """Test that field definitions are properly set.""" + assert 'name' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'streetAddress' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'city' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'state' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'zip' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'publicEmail' in data_explorer.ESSENTIAL_SITE_FIELDS + assert 'publicPhone' in data_explorer.ESSENTIAL_SITE_FIELDS + + assert 'name' in data_explorer.ESSENTIAL_ORG_FIELDS + + assert 'website' in data_explorer.IMPORTANT_SITE_FIELDS + assert 'description' in data_explorer.IMPORTANT_SITE_FIELDS + + assert 'description' in data_explorer.IMPORTANT_ORG_FIELDS + assert 'ein' in data_explorer.IMPORTANT_ORG_FIELDS + + +def test_generate_recommendations_empty_data(data_explorer): + """Test recommendation generation with empty data.""" + sites_analysis = { + 'summary': {'total_sites_analyzed': 0, 'percentage_with_essential_gaps': 0}, + 'field_missing_counts': {} + } + + orgs_analysis = { + 'summary': {'total_organizations_analyzed': 0, 'percentage_with_essential_gaps': 0}, + 'field_missing_counts': {} + } + + recommendations = data_explorer._generate_recommendations(sites_analysis, orgs_analysis) + + # Should return empty list for no data + assert isinstance(recommendations, list) + + +def test_generate_recommendations_with_gaps(data_explorer): + """Test recommendation generation with data gaps.""" + sites_analysis = { + 'summary': {'total_sites_analyzed': 100, 'percentage_with_essential_gaps': 60}, + 'field_missing_counts': {'publicEmail': 50, 
'website': 30} + } + + orgs_analysis = { + 'summary': {'total_organizations_analyzed': 50, 'percentage_with_essential_gaps': 70}, + 'field_missing_counts': {'description': 40, 'ein': 25} + } + + recommendations = data_explorer._generate_recommendations(sites_analysis, orgs_analysis) + + assert isinstance(recommendations, list) + assert len(recommendations) > 0 + + # Should have recommendations for high gap percentages + high_priority_recs = [r for r in recommendations if 'High priority' in r] + assert len(high_priority_recs) >= 2 # Both sites and orgs have >50% gaps \ No newline at end of file diff --git a/tests/test_organization_operations.py b/tests/test_organization_operations.py new file mode 100644 index 0000000..d596ba2 --- /dev/null +++ b/tests/test_organization_operations.py @@ -0,0 +1,217 @@ +""" +Tests for organization operations functionality. +""" + +import pytest +from unittest.mock import Mock +from src.tackle_hunger.organization_operations import OrganizationOperations +from src.tackle_hunger.graphql_client import TackleHungerClient + + +@pytest.fixture +def mock_client(): + """Create a mock GraphQL client.""" + client = Mock(spec=TackleHungerClient) + return client + + +@pytest.fixture +def org_operations(mock_client): + """Create organization operations with mock client.""" + return OrganizationOperations(mock_client) + + +def test_init(mock_client): + """Test OrganizationOperations initialization.""" + ops = OrganizationOperations(mock_client) + assert ops.client == mock_client + + +def test_get_organizations_for_ai(org_operations, mock_client): + """Test fetching organizations for AI processing.""" + # Mock the response + mock_response = { + "organizationsForAI": [ + { + "id": "org1", + "name": "Test Organization", + "sites": [{"id": "site1", "name": "Test Site"}], + "streetAddress": "123 Main St", + "city": "Test City", + "state": "CA", + "zip": "12345", + "publicEmail": "info@test.org", + "ein": "12-3456789" + } + ] + } + + 
mock_client.execute_query.return_value = mock_response + + result = org_operations.get_organizations_for_ai(limit=10) + + # Verify the query was called + mock_client.execute_query.assert_called_once() + call_args = mock_client.execute_query.call_args + + # Check that the query contains expected fields + query = call_args[0][0] + assert "organizationsForAI" in query + assert "limit: $limit" in query + assert "id" in query + assert "name" in query + assert "sites" in query + + # Check variables + variables = call_args[0][1] + assert variables == {"limit": 10} + + # Check result + assert result == mock_response["organizationsForAI"] + assert len(result) == 1 + assert result[0]["id"] == "org1" + assert result[0]["name"] == "Test Organization" + + +def test_get_organizations_for_ai_default_limit(org_operations, mock_client): + """Test fetching organizations with default limit.""" + mock_client.execute_query.return_value = {"organizationsForAI": []} + + org_operations.get_organizations_for_ai() + + # Check that default limit was used + call_args = mock_client.execute_query.call_args + variables = call_args[0][1] + assert variables == {"limit": 50} + + +def test_get_organization_by_id_success(org_operations, mock_client): + """Test fetching a specific organization by ID successfully.""" + mock_response = { + "organizationForAI": { + "id": "org123", + "name": "Specific Organization", + "sites": [ + { + "id": "site1", + "name": "Site 1", + "streetAddress": "456 Oak Ave", + "city": "Another City", + "state": "NY", + "zip": "54321" + } + ], + "publicEmail": "contact@specific.org" + } + } + + mock_client.execute_query.return_value = mock_response + + result = org_operations.get_organization_by_id("org123") + + # Verify the query was called + mock_client.execute_query.assert_called_once() + call_args = mock_client.execute_query.call_args + + # Check query content + query = call_args[0][0] + assert "organizationForAI" in query + assert "$orgId: ID!" 
in query + + # Check variables + variables = call_args[0][1] + assert variables == {"orgId": "org123"} + + # Check result + assert result == mock_response["organizationForAI"] + assert result["id"] == "org123" + assert result["name"] == "Specific Organization" + assert len(result["sites"]) == 1 + + +def test_get_organization_by_id_not_found(org_operations, mock_client): + """Test fetching a non-existent organization by ID.""" + mock_client.execute_query.return_value = {"organizationForAI": None} + + result = org_operations.get_organization_by_id("nonexistent") + + assert result is None + + +def test_get_organization_by_id_exception(org_operations, mock_client): + """Test handling exception when fetching organization by ID.""" + mock_client.execute_query.side_effect = Exception("Network error") + + result = org_operations.get_organization_by_id("org123") + + assert result is None + + +def test_update_organization(org_operations, mock_client): + """Test updating an organization.""" + mock_response = { + "updateOrganizationFromAI": { + "id": "org123", + "name": "Updated Organization", + "updatedAt": "2024-01-01T12:00:00Z" + } + } + + mock_client.execute_query.return_value = mock_response + + update_data = { + "name": "Updated Organization", + "publicEmail": "updated@org.com", + "description": "Updated description" + } + + result = org_operations.update_organization("org123", update_data) + + # Verify the mutation was called + mock_client.execute_query.assert_called_once() + call_args = mock_client.execute_query.call_args + + # Check mutation content + mutation = call_args[0][0] + assert "updateOrganizationFromAI" in mutation + assert "$organizationId: String!" in mutation + assert "$input: organizationInputUpdate!" 
in mutation + + # Check variables + variables = call_args[0][1] + assert variables["organizationId"] == "org123" + assert variables["input"] == update_data + + # Check result + assert result == mock_response + assert result["updateOrganizationFromAI"]["id"] == "org123" + assert result["updateOrganizationFromAI"]["name"] == "Updated Organization" + + +def test_get_organizations_for_ai_empty_response(org_operations, mock_client): + """Test handling empty response from organizationsForAI query.""" + mock_client.execute_query.return_value = {} + + result = org_operations.get_organizations_for_ai() + + assert result == [] + + +def test_query_contains_all_expected_fields(org_operations, mock_client): + """Test that the organizationsForAI query contains all expected fields.""" + mock_client.execute_query.return_value = {"organizationsForAI": []} + + org_operations.get_organizations_for_ai() + + call_args = mock_client.execute_query.call_args + query = call_args[0][0] + + # Check for essential fields + expected_fields = [ + "id", "sites", "name", "streetAddress", "addressLine2", "city", "state", "zip", + "publicEmail", "publicPhone", "email", "phone", "isFeedingAmericaAffiliate", + "description", "ein", "banner", "logo", "updatedAt", "createdAt" + ] + + for field in expected_fields: + assert field in query, f"Field '{field}' missing from query" \ No newline at end of file
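The field-presence test above checks each name with a plain substring match (`field in query`), which would also pass if the query only contained a longer name that happens to include the field. A minimal sketch of a stricter, token-based check follows. `query_tokens`, `missing_fields`, and the `zipCode` field in the sample query are hypothetical names used only for illustration; they are not part of this PR or of the API schema.

```python
import re
from typing import Iterable, List, Set

# Hypothetical helper (not in the PR): matches GraphQL-style identifiers.
_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def query_tokens(query: str) -> Set[str]:
    """Return every identifier-like token that appears in the query text."""
    return set(_NAME.findall(query))


def missing_fields(query: str, expected: Iterable[str]) -> List[str]:
    """Return the expected names that do not appear as whole tokens."""
    tokens = query_tokens(query)
    return [field for field in expected if field not in tokens]


# Sample query for illustration only; `zipCode` is an assumed field name.
SAMPLE_QUERY = '''
query GetOrganizationsForAI($limit: Int) {
    organizationsForAI(limit: $limit) {
        id
        name
        zipCode
    }
}
'''

if __name__ == "__main__":
    expected = ["id", "name", "zip"]
    # A substring check accepts "zip" because it occurs inside "zipCode".
    print(all(field in SAMPLE_QUERY for field in expected))
    # The token check reports "zip" as missing.
    print(missing_fields(SAMPLE_QUERY, expected))
```

This is a tighter heuristic and not a GraphQL parser: it still matches operation names, variable names, and fields nested under `sites`, so it cannot tell at which level of the selection a field appears.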