177 changes: 177 additions & 0 deletions docs/data_exploration.md
@@ -0,0 +1,177 @@
# Data Exploration for Missing Charity Information

This document describes the data exploration functionality created to identify organizations and charities with missing data elements.

## Overview

The data exploration system analyzes both Sites (charity service locations) and Organizations (parent charity entities) to identify missing essential and important data fields. This helps prioritize data collection efforts and improve data completeness.

## Key Components

### 1. OrganizationOperations (`src/tackle_hunger/organization_operations.py`)

Provides GraphQL operations for fetching and analyzing organization data:

- `get_organizations_for_ai()` - Fetch organizations with all relevant fields
- `get_organization_by_id()` - Fetch specific organization details
- `update_organization()` - Update organization information

### 2. DataExplorer (`src/tackle_hunger/data_explorer.py`)

Analyzes data completeness and generates insights:

- **Field Classification**:
  - **Essential Site Fields**: name, streetAddress, city, state, zip, publicEmail, publicPhone
  - **Important Site Fields**: website, description, serviceArea, acceptsFoodDonations, ein, contact details
  - **Essential Org Fields**: name
  - **Important Org Fields**: address, contact, description, ein, Feeding America affiliation

- **Analysis Functions**:
  - `analyze_site_completeness()` - Analyze individual site data completeness
  - `analyze_organization_completeness()` - Analyze individual organization data completeness
  - `explore_sites_data()` - Comprehensive site data analysis
  - `explore_organizations_data()` - Comprehensive organization data analysis
  - `generate_comprehensive_report()` - Full analysis with recommendations
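
As a rough illustration of the field classification above, the sketch below checks a single site record against the essential and important field lists. The names `ESSENTIAL_SITE_FIELDS`, `IMPORTANT_SITE_FIELDS`, `is_missing`, and `find_missing_fields` are hypothetical and not taken from `data_explorer.py`. Expanding "contact details" into `contactEmail`, `contactName`, and `contactPhone`, and treating blank strings as missing, are both assumptions.

```python
# Hypothetical field lists mirroring the classification described above.
ESSENTIAL_SITE_FIELDS = [
    "name", "streetAddress", "city", "state", "zip", "publicEmail", "publicPhone",
]
# Assumption: "contact details" expands to these three contact fields.
IMPORTANT_SITE_FIELDS = [
    "website", "description", "serviceArea", "acceptsFoodDonations", "ein",
    "contactEmail", "contactName", "contactPhone",
]


def is_missing(value):
    """Treat None and blank strings as missing (an assumption, not confirmed behavior)."""
    return value is None or (isinstance(value, str) and not value.strip())


def find_missing_fields(record, fields):
    """Return the names in `fields` that are absent or blank in `record`."""
    return [field for field in fields if is_missing(record.get(field))]


# A partially filled site record, used only for illustration.
site = {
    "id": "site2",
    "name": "Partial Info Food Pantry",
    "streetAddress": "456 Oak Avenue",
    "city": "Somewhere",
    "state": "NY",
    "zip": "54321",
    "publicEmail": "",
    "publicPhone": "555-0456",
    "website": "",
    "description": "",
    "acceptsFoodDonations": "Yes",
}

missing_essential = find_missing_fields(site, ESSENTIAL_SITE_FIELDS)
missing_important = find_missing_fields(site, IMPORTANT_SITE_FIELDS)
print("Missing essential:", missing_essential)
print("Missing important:", missing_important)
```

Using `record.get(field)` handles both keys that are absent and keys that are present but empty, which are the two ways a field can go missing in a record like this one.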

### 3. Exploration Script (`scripts/explore_data_alesha.py`)

Command-line tool for running data exploration:

```bash
python scripts/explore_data_alesha.py [OPTIONS]
```

#### Options:
- `--sites-limit N` - Number of sites to analyze (default: 50)
- `--orgs-limit N` - Number of organizations to analyze (default: 50)
- `--output-file FILE` - Save detailed JSON report to file
- `--environment ENV` - API environment (dev/staging/production)
- `--summary-only` - Show only summary without saving detailed report
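
For readers who want to see how these options might be parsed, here is a minimal `argparse` sketch. It is not the actual contents of `explore_data_alesha.py`: the `build_parser` helper, the help strings, and leaving `--environment` and `--output-file` without defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Explore missing data in sites and organizations"
    )
    parser.add_argument("--sites-limit", type=int, default=50,
                        help="Number of sites to analyze")
    parser.add_argument("--orgs-limit", type=int, default=50,
                        help="Number of organizations to analyze")
    parser.add_argument("--output-file", default=None,
                        help="Save detailed JSON report to this file")
    # Assumption: no default environment is set here; the real script may choose one.
    parser.add_argument("--environment", choices=["dev", "staging", "production"],
                        default=None, help="API environment")
    parser.add_argument("--summary-only", action="store_true",
                        help="Show only the summary without saving a detailed report")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--sites-limit", "100", "--summary-only"])
print(args.sites_limit, args.orgs_limit, args.summary_only)
```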

## Usage Examples

### Basic Analysis
```bash
# Analyze 50 sites and 50 organizations, show summary
python scripts/explore_data_alesha.py --summary-only

# Analyze more data points
python scripts/explore_data_alesha.py --sites-limit 100 --orgs-limit 75
```

### Detailed Analysis with Report
```bash
# Generate full report with custom output file
python scripts/explore_data_alesha.py \
    --sites-limit 200 \
    --orgs-limit 150 \
    --output-file charity_data_analysis.json
```

### Programmatic Usage
```python
from src.tackle_hunger import TackleHungerClient, DataExplorer

# Setup client
client = TackleHungerClient()
explorer = DataExplorer(client)

# Generate comprehensive report
report = explorer.generate_comprehensive_report(
    sites_limit=100,
    orgs_limit=100
)

# Print summary
explorer.print_summary(report)

# Save detailed report
explorer.save_report(report, "analysis_results.json")
```

## Report Structure

The analysis generates a comprehensive report with:

### Executive Summary
- Total entities analyzed
- Entities with essential data gaps
- Overall data gap percentage
- Average completeness scores

### Detailed Analysis
- **Sites Analysis**: Missing field counts, most problematic sites
- **Organizations Analysis**: Missing field counts, most problematic organizations
- **Recommendations**: Prioritized actions based on findings

### Field-Specific Insights
- Count of missing data by field
- Percentage of entities missing each field
- Priority ranking for data collection efforts
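
The per-field counts and percentages listed above could be derived from the individual entity analyses roughly as follows. This is a sketch under assumptions: `missing_field_stats` is a hypothetical helper, and the `missing_essential` and `missing_important` keys are borrowed from the per-entity analyses printed by the demo script in this PR.

```python
from collections import Counter


def missing_field_stats(analyses):
    """Count how often each field is missing and rank fields by frequency."""
    total = len(analyses)
    counts = Counter()
    for analysis in analyses:
        counts.update(analysis.get("missing_essential", []))
        counts.update(analysis.get("missing_important", []))
    # most_common() sorts by count, highest first, which doubles as a priority ranking.
    return [
        {"field": field, "count": count, "percent": round(100 * count / total, 1)}
        for field, count in counts.most_common()
    ]


# Illustrative per-entity analyses (made-up values).
analyses = [
    {"missing_essential": ["zip", "publicEmail"], "missing_important": ["website"]},
    {"missing_essential": ["publicEmail"], "missing_important": ["website"]},
    {"missing_essential": [], "missing_important": []},
]

stats = missing_field_stats(analyses)
for row in stats:
    print(f"{row['field']}: {row['count']} missing ({row['percent']}%)")
```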

## Completeness Scoring

Each entity receives a completeness score (0.0 to 1.0) calculated as:
- Essential fields are weighted 2x
- Important fields are weighted 1x
- Score = (complete_essential × 2 + complete_important) / (total_essential × 2 + total_important)
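
The weighting above can be written as a small function. This is a sketch of the stated formula, not the code in `data_explorer.py`: the `completeness_score` name, its signature, and returning 0.0 when no fields are defined are assumptions.

```python
def completeness_score(complete_essential, total_essential,
                       complete_important, total_important):
    """Weighted completeness: essential fields count twice, important fields once."""
    denominator = total_essential * 2 + total_important
    if denominator == 0:
        # Assumption: with no fields defined, report 0.0 rather than divide by zero.
        return 0.0
    return (complete_essential * 2 + complete_important) / denominator


# Example: 5 of 7 essential and 3 of 6 important fields are filled in.
# (5 * 2 + 3) / (7 * 2 + 6) = 13 / 20 = 0.65
score = completeness_score(5, 7, 3, 6)
print(round(score, 2))
```

Because essential fields carry double weight, filling in one missing essential field raises the score as much as filling in two important ones.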

## Key Findings Categories

### Essential Data Gaps
Entities missing critical fields required for basic functionality:
- Site: Missing address, contact information
- Organization: Missing name

### Important Data Gaps
Entities missing valuable but not critical fields:
- Missing descriptions, websites, service details
- Missing EIN numbers, affiliation information

## Recommendations Engine

The system automatically generates prioritized recommendations:

1. **High Priority**: Focus on most commonly missing essential fields
2. **Medium Priority**: Address important fields with high miss rates
3. **System Improvements**: Suggestions for validation and data collection
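
One way to picture the first two priority tiers is the sketch below. It is purely illustrative: `build_recommendations`, its inputs, the `top_n` cutoff, and the 50% `high_miss_rate` threshold are all hypothetical and not taken from the actual recommendations logic.

```python
def build_recommendations(essential_counts, important_percents,
                          top_n=3, high_miss_rate=50.0):
    """Turn missing-field statistics into a prioritized list (hypothetical logic)."""
    recommendations = []

    # High priority: the most commonly missing essential fields.
    ranked_essential = sorted(essential_counts.items(),
                              key=lambda item: item[1], reverse=True)
    for field, count in ranked_essential[:top_n]:
        if count > 0:
            recommendations.append({
                "priority": "high",
                "field": field,
                "action": f"Collect '{field}' (missing in {count} entities)",
            })

    # Medium priority: important fields whose miss rate reaches the threshold.
    ranked_important = sorted(important_percents.items(),
                              key=lambda item: item[1], reverse=True)
    for field, percent in ranked_important:
        if percent >= high_miss_rate:
            recommendations.append({
                "priority": "medium",
                "field": field,
                "action": f"Fill in '{field}' (missing in {percent}% of entities)",
            })

    return recommendations


# Made-up statistics for illustration only.
recs = build_recommendations(
    essential_counts={"publicEmail": 2, "zip": 1, "name": 0},
    important_percents={"website": 66.7, "ein": 20.0},
)
for rec in recs:
    print(rec["priority"], "-", rec["action"])
```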

## Integration with Existing Workflow

This analysis integrates with the existing charity validation workflow:

1. **Data Collection Planning**: Identify priority fields for AI/ETL operations
2. **Quality Assurance**: Monitor data completeness over time
3. **Volunteer Focus**: Direct volunteer efforts to highest-impact data gaps
4. **API Enhancement**: Inform required field validation improvements

## Error Handling

The system includes robust error handling for:
- Network connectivity issues
- API authentication problems
- Missing or malformed data
- GraphQL schema changes

## Future Enhancements

Potential improvements for the data exploration system:

1. **Historical Tracking**: Monitor data completeness trends over time
2. **Geographic Analysis**: Identify regional data quality patterns
3. **Automated Alerts**: Notify when data quality drops below thresholds
4. **Integration Testing**: Validate against different API environments
5. **Performance Optimization**: Handle larger datasets efficiently

## Testing

Comprehensive test coverage includes:
- Unit tests for all analysis functions
- Mock data scenarios for edge cases
- Integration tests for GraphQL operations
- Command-line interface testing

Run tests with:
```bash
python -m pytest tests/ -v
```
170 changes: 170 additions & 0 deletions examples/demo_data_exploration.py
@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""
Demo script showing data exploration functionality with mock data.

This demonstrates the DataExplorer functionality without requiring API access.
"""

import sys
from pathlib import Path

# Add src to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from tackle_hunger.data_explorer import DataExplorer
from unittest.mock import Mock


def create_mock_data():
    """Create realistic mock data for demonstration."""

    # Mock sites with varying levels of completeness
    mock_sites = [
        {
            'id': 'site1',
            'name': 'Complete Community Food Bank',
            'streetAddress': '123 Main Street',
            'city': 'Anytown',
            'state': 'CA',
            'zip': '12345',
            'publicEmail': 'info@communityfood.org',
            'publicPhone': '555-0123',
            'website': 'https://communityfood.org',
            'description': 'Serving the community since 1985 with fresh food and support.',
            'serviceArea': 'City-wide, focus on downtown area',
            'acceptsFoodDonations': 'Yes',
            'ein': '12-3456789',
            'contactEmail': 'director@communityfood.org',
            'contactName': 'Jane Smith',
            'contactPhone': '555-0124'
        },
        {
            'id': 'site2',
            'name': 'Partial Info Food Pantry',
            'streetAddress': '456 Oak Avenue',
            'city': 'Somewhere',
            'state': 'NY',
            'zip': '54321',
            'publicEmail': '',  # Missing
            'publicPhone': '555-0456',
            'website': '',  # Missing
            'description': '',  # Missing
            'acceptsFoodDonations': 'Yes',
            # Missing many other fields
        },
        {
            'id': 'site3',
            'name': 'Minimal Data Shelter',
            'streetAddress': '789 Pine Road',
            'city': 'Elsewhere',
            'state': 'TX',
            'zip': '',  # Missing essential field
            'publicEmail': '',  # Missing essential field
            'publicPhone': '',  # Missing essential field
            # Missing most fields
        }
    ]

    # Mock organizations with varying completeness
    mock_orgs = [
        {
            'id': 'org1',
            'name': 'Complete Charity Organization',
            'streetAddress': '100 Charity Lane',
            'city': 'Generous City',
            'state': 'CA',
            'zip': '98765',
            'publicEmail': 'contact@charity.org',
            'publicPhone': '555-9876',
            'description': 'A well-established charity serving multiple communities.',
            'ein': '98-7654321',
            'isFeedingAmericaAffiliate': 'Yes',
            'sites': [{'id': 'site1', 'name': 'Site 1'}, {'id': 'site2', 'name': 'Site 2'}]
        },
        {
            'id': 'org2',
            'name': 'Incomplete Organization',
            'streetAddress': '',  # Missing
            'city': '',  # Missing
            'publicEmail': '',  # Missing
            'description': '',  # Missing
            'ein': '',  # Missing
            'sites': [{'id': 'site3', 'name': 'Site 3'}]
        },
        {
            'id': 'org3',
            'name': '',  # Missing essential field!
            'streetAddress': '200 Hope Street',
            'city': 'Kindness',
            'state': 'FL',
            'sites': []
        }
    ]

    return mock_sites, mock_orgs


def main():
    """Run the demonstration."""
    print("Data Exploration Demo - Mock Data Analysis")
    print("=" * 50)

    # Create mock client and explorer
    mock_client = Mock()

    # Create mock data
    mock_sites, mock_orgs = create_mock_data()

    # Create data explorer
    explorer = DataExplorer(mock_client)

    # Mock the data fetching methods
    explorer.site_ops.get_sites_for_ai = Mock(return_value=mock_sites)
    explorer.org_ops.get_organizations_for_ai = Mock(return_value=mock_orgs)

    print("Analyzing mock data...")
    print(f"Sites: {len(mock_sites)}, Organizations: {len(mock_orgs)}")

    # Generate analysis report
    report = explorer.generate_comprehensive_report(
        sites_limit=len(mock_sites),
        orgs_limit=len(mock_orgs)
    )

    # Print summary
    explorer.print_summary(report)

    # Show detailed findings for each entity
    print("\nDETAILED FINDINGS:")
    print("-" * 40)

    print("\nSite Analysis Details:")
    for analysis in report['sites_analysis']['all_site_analyses']:
        print(f"• {analysis['name']} (ID: {analysis['site_id']})")
        print(f"  Completeness Score: {analysis['completeness_score']}")
        if analysis['missing_essential']:
            print(f"  Missing Essential: {', '.join(analysis['missing_essential'])}")
        if analysis['missing_important']:
            print(f"  Missing Important: {', '.join(analysis['missing_important'])}")
        print()

    print("Organization Analysis Details:")
    for analysis in report['organizations_analysis']['all_organization_analyses']:
        print(f"• {analysis['name']} (ID: {analysis['org_id']})")
        print(f"  Completeness Score: {analysis['completeness_score']}")
        print(f"  Sites: {analysis['site_count']}")
        if analysis['missing_essential']:
            print(f"  Missing Essential: {', '.join(analysis['missing_essential'])}")
        if analysis['missing_important']:
            print(f"  Missing Important: {', '.join(analysis['missing_important'])}")
        print()

    # Save report for inspection
    filename = explorer.save_report(report, "/tmp/demo_analysis_report.json")
    print(f"\n✓ Demo completed! Full report saved to: {filename}")
    print("\nThis demonstrates how the data exploration system identifies")
    print("missing data elements in both sites and organizations.")


if __name__ == "__main__":
    main()