220 changes: 220 additions & 0 deletions docs/data_exploration.md
@@ -0,0 +1,220 @@
# Data Exploration Guide

This guide explains how to use the data exploration functionality to identify organizations and charities that are missing data elements.

## Overview

The data exploration system provides comprehensive analysis of charity data quality by:

- Fetching organizations and sites from the Tackle Hunger GraphQL API
- Analyzing missing or incomplete data fields
- Generating reports with actionable insights
- Calculating data completeness scores
- Providing recommendations for data quality improvements

## Quick Start

### Using the CLI Script

The easiest way to explore data is to use the provided CLI script:

```bash
# Basic usage - analyze 100 sites and organizations
python scripts/explore_data.py

# Analyze more data
python scripts/explore_data.py --sites 500 --organizations 200

# Get only summary (faster)
python scripts/explore_data.py --summary-only

# Export detailed report
python scripts/explore_data.py --output report.json

# Use different environment
python scripts/explore_data.py --environment staging
```
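
To make the flags above easier to reason about, here is a minimal sketch of how a script could parse them with `argparse`. It is not the actual `scripts/explore_data.py`: the `build_parser` helper is hypothetical, the defaults of 100 are inferred from the "Basic usage" comment, and the default for `--environment` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Explore missing charity data (illustrative sketch)."
    )
    # Assumed defaults: the basic-usage comment mentions 100 records.
    parser.add_argument("--sites", type=int, default=100,
                        help="maximum number of sites to analyze")
    parser.add_argument("--organizations", type=int, default=100,
                        help="maximum number of organizations to analyze")
    parser.add_argument("--summary-only", action="store_true",
                        help="print only the summary report")
    parser.add_argument("--output", default=None,
                        help="path of the JSON report to write")
    # Assumption: no default environment is applied by this sketch.
    parser.add_argument("--environment", default=None,
                        help="environment name, for example staging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` exposes `--summary-only` as `args.summary_only`.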

### Using the Python API

For more advanced usage, you can use the Python API directly:

```python
from src.tackle_hunger.graphql_client import TackleHungerClient
from src.tackle_hunger.data_explorer import DataExplorer

# Initialize client
client = TackleHungerClient()
explorer = DataExplorer(client)

# Get comprehensive analysis
analysis = explorer.get_missing_data_analysis(site_limit=100, org_limit=100)

# Get summary with recommendations
summary = explorer.get_data_completeness_summary(site_limit=100, org_limit=100)

# Export detailed report
explorer.export_missing_data_report("analysis.json", site_limit=100, org_limit=100)
```

## Report Structure

### Summary Report

The summary report includes:

- **Data Overview**: Total counts of sites and organizations analyzed
- **Completeness Scores**: Graded scores (A-F) for data completeness
- **Data Integrity Issues**: Orphaned sites, incomplete organizations
- **Recommendations**: Top actionable recommendations

### Detailed Analysis

The detailed analysis includes:

#### Sites Analysis
- Field-by-field missing data statistics
- List of sites with critical missing fields
- Percentage of missing data for each field

#### Organizations Analysis
- Field-by-field missing data statistics for organizations
- List of organizations with critical missing fields
- Percentage of missing data for each field

#### Combined Analysis
- Orphaned sites (sites without valid organization references)
- Sites with incomplete organization data
- Data integrity statistics
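
The orphaned-site check can be pictured as a lookup of each site's organization reference against the set of known organization IDs. The sketch below is not the `DataExplorer` implementation. It assumes each site carries an `organizationId` field (a hypothetical name; the real schema may differ) and each organization an `id`. Only the `count` key is known from the example script; `percentage` and `sites` are illustrative.

```python
from typing import Any


def find_orphaned_sites(
    sites: list[dict[str, Any]],
    organizations: list[dict[str, Any]],
) -> dict[str, Any]:
    """Return sites whose organization reference is missing or unknown."""
    known_ids = {org["id"] for org in organizations if org.get("id")}
    # A site is orphaned when its reference is absent or points to no known organization.
    orphaned = [
        site for site in sites
        if site.get("organizationId") not in known_ids  # hypothetical field name
    ]
    total = len(sites)
    return {
        "count": len(orphaned),
        "percentage": round(100 * len(orphaned) / total, 1) if total else 0.0,
        "sites": orphaned,
    }
```

Building the set of IDs once keeps the check linear in the number of sites rather than scanning the organization list for every site.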

## Field Classifications

### Critical Fields (Sites)
- `name` - Site name
- `streetAddress` - Physical address
- `city` - City location
- `state` - State location
- `zip` - ZIP code
- `publicEmail` - Public contact email
- `publicPhone` - Public contact phone
- `website` - Website URL
- `description` - Service description

### Optional Fields (Sites)
- `serviceArea` - Service coverage area
- `acceptsFoodDonations` - Food donation acceptance
- `ein` - Tax ID number

### Critical Fields (Organizations)
- `name` - Organization name
- `streetAddress` - Mailing address
- `city` - City location
- `state` - State location
- `zip` - ZIP code
- `publicEmail` - Public contact email
- `publicPhone` - Public contact phone

### Optional Fields (Organizations)
- `addressLine2` - Address line 2
- `email` - Internal contact email
- `phone` - Internal contact phone
- `website` - Website URL
- `description` - Organization description
- `ein` - Tax ID number
- `nonProfitStatus` - Non-profit status
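
As an illustration of how these classifications can drive the field-by-field statistics, the sketch below checks records against a list of field names. The helper names are hypothetical, and the rule for what counts as "missing" (`None`, empty, or whitespace-only strings) is an assumption rather than the documented behavior of `DataExplorer`.

```python
from collections import Counter
from typing import Any

# Critical site fields, copied from the list above.
CRITICAL_SITE_FIELDS = [
    "name", "streetAddress", "city", "state", "zip",
    "publicEmail", "publicPhone", "website", "description",
]


def is_missing(value: Any) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    return False


def missing_fields(record: dict[str, Any], fields: list[str]) -> list[str]:
    """List the given fields that are missing from one record."""
    return [field for field in fields if is_missing(record.get(field))]


def missing_percentages(
    records: list[dict[str, Any]], fields: list[str]
) -> dict[str, float]:
    """Compute, per field, the percentage of records missing that field."""
    counts = Counter(
        field for record in records for field in missing_fields(record, fields)
    )
    total = len(records)
    return {field: (100 * counts[field] / total if total else 0.0) for field in fields}
```

Under this rule a boolean `False`, such as a site that does not accept food donations, counts as present rather than missing.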

## Completeness Scoring

The system calculates weighted completeness scores:

- **Critical Fields**: 70% weight
- **Optional Fields**: 30% weight

Grades are assigned as follows:
- **A**: 90-100% complete
- **B**: 80-89% complete
- **C**: 70-79% complete
- **D**: 60-69% complete
- **F**: Below 60% complete
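
The weighting and grade bands can be sketched as follows. The function names are hypothetical and the real scoring code may differ in detail. In particular, the bands above do not say how fractional scores such as 89.5 are handled, so this sketch assumes each grade starts at its lower bound.

```python
# Weights from the list above, kept as integers to avoid float rounding noise.
CRITICAL_WEIGHT = 70
OPTIONAL_WEIGHT = 30

# Lower bound of each grade band, highest first.
GRADE_THRESHOLDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def completeness_score(critical_pct: float, optional_pct: float) -> float:
    """Combine critical and optional completeness (0-100) into one score."""
    return (CRITICAL_WEIGHT * critical_pct + OPTIONAL_WEIGHT * optional_pct) / 100


def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade."""
    for minimum, letter in GRADE_THRESHOLDS:
        if score >= minimum:
            return letter
    return "F"
```

For example, 90% critical completeness and 50% optional completeness give (70 × 90 + 30 × 50) / 100 = 78, which falls in the C band.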

## Common Use Cases

### 1. Initial Data Quality Assessment

```bash
python scripts/explore_data.py --summary-only
```

This provides a quick overview of data quality across your charity database.

### 2. Detailed Gap Analysis

```bash
python scripts/explore_data.py --sites 1000 --organizations 500 --output full_analysis.json
```

This generates a comprehensive analysis for further investigation.

### 3. Environment-Specific Analysis

```bash
python scripts/explore_data.py --environment staging --summary-only
```

This analyzes data quality in a specific environment, here `staging`.

### 4. Targeted Analysis

```python
# Assumes `explorer` was created as in the Python API example above.
# Focus on specific issues
analysis = explorer.get_missing_data_analysis(site_limit=50, org_limit=50)

# Find sites with missing critical contact info
sites_missing_contact = []
for site in analysis['sites']['sites_with_critical_missing']:
    missing_fields = site['missing_fields']
    if 'publicEmail' in missing_fields or 'publicPhone' in missing_fields:
        sites_missing_contact.append(site)
```
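
A related pattern is to turn the same list into a per-field worklist. The sketch below assumes each entry has the `id` and `missing_fields` keys used in the example script; the helper name `group_by_missing_field` is hypothetical.

```python
from collections import defaultdict
from typing import Any


def group_by_missing_field(entries: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Map each missing field name to the IDs of the records missing it."""
    grouped: dict[str, list[Any]] = defaultdict(list)
    for entry in entries:
        for field in entry.get("missing_fields", []):
            grouped[field].append(entry.get("id"))
    return dict(grouped)


# Sample entries shaped like items of 'sites_with_critical_missing'.
sample = [
    {"id": "site-1", "name": "Pantry A", "missing_fields": ["publicEmail", "website"]},
    {"id": "site-2", "name": "Pantry B", "missing_fields": ["publicEmail"]},
]
worklist = group_by_missing_field(sample)
```

With real data, the input would be `analysis['sites']['sites_with_critical_missing']`.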

## Troubleshooting

### Authentication Issues

Ensure your `.env` file contains valid credentials:

```
AI_SCRAPING_TOKEN=your_token_here
TKH_GRAPHQL_API_URL=https://devapi.sboc.us/graphql
```
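
Before running an analysis, it can help to confirm that both variables are actually visible to the process. This is a minimal sketch: the helper `missing_env_vars` is hypothetical, and it does not load the `.env` file itself, so it assumes the variables have already been exported or loaded by whatever mechanism the project uses.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_VARS = ("AI_SCRAPING_TOKEN", "TKH_GRAPHQL_API_URL")


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing configuration: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing a mapping explicitly makes the check easy to exercise in tests without touching the real environment.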

### Rate Limiting

If you encounter rate limits, reduce the number of records analyzed:

```bash
python scripts/explore_data.py --sites 50 --organizations 25
```

### Memory Issues

For large datasets, use the summary-only mode:

```bash
python scripts/explore_data.py --summary-only
```

## Contributing

When adding new data exploration features:

1. Add appropriate field classifications in `DataExplorer`
2. Update scoring algorithms if needed
3. Add comprehensive tests
4. Update this documentation

## Examples

See the test files for examples of using the data exploration API:
- `tests/test_data_explorer.py` - Comprehensive API examples
- `tests/test_organization_operations.py` - Organization data access examples
165 changes: 165 additions & 0 deletions examples/explore_missing_data.py
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
Example: Data exploration for missing charity information.

This example demonstrates how to use the data exploration functionality
to identify charities and organizations with missing data elements.
"""

import os
import sys
from pathlib import Path

# Add the src directory to Python path for imports
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from tackle_hunger import TackleHungerClient, DataExplorer


def main():
    """Demonstrate data exploration functionality."""

    print("Tackle Hunger Data Exploration Example")
    print("=" * 50)

    # Note: This example requires valid API credentials
    if not os.getenv("AI_SCRAPING_TOKEN"):
        print("\nNote: This example requires API credentials to run.")
        print("Set AI_SCRAPING_TOKEN and TKH_GRAPHQL_API_URL environment variables.")
        print("\nShowing example output structure instead...")
        show_example_output()
        return

    try:
        # Initialize client and explorer
        client = TackleHungerClient()
        explorer = DataExplorer(client)

        print("\nConnecting to Tackle Hunger API...")

        # Get summary analysis
        print("Fetching data completeness summary...")
        summary = explorer.get_data_completeness_summary(site_limit=10, org_limit=10)

        # Display results
        print_summary(summary)

        # Demonstrate detailed analysis
        print("\nFetching detailed missing data analysis...")
        analysis = explorer.get_missing_data_analysis(site_limit=10, org_limit=10)

        print_detailed_insights(analysis)

    except Exception as e:
        print(f"\nError: {e}")
        print("This is expected if API credentials are not configured.")
        show_example_output()


def print_summary(summary):
    """Print summary analysis results."""
    print("\n" + "="*50)
    print("DATA COMPLETENESS SUMMARY")
    print("="*50)

    # Basic stats
    basic_stats = summary.get('summary', {})
    print(f"\nAnalyzed Data:")
    print(f" • Sites: {basic_stats.get('total_sites', 0)}")
    print(f" • Organizations: {basic_stats.get('total_organizations', 0)}")

    # Completeness scores
    site_comp = summary.get('site_completeness', {})
    org_comp = summary.get('organization_completeness', {})

    print(f"\nCompleteness Scores:")
    print(f" • Sites: {site_comp.get('score', 0):.1f}/100 (Grade: {site_comp.get('grade', 'N/A')})")
    print(f" • Organizations: {org_comp.get('score', 0):.1f}/100 (Grade: {org_comp.get('grade', 'N/A')})")

    # Recommendations
    recommendations = summary.get('recommendations', [])
    if recommendations:
        print(f"\nTop Recommendations:")
        for i, rec in enumerate(recommendations[:3], 1):
            print(f" {i}. {rec}")


def print_detailed_insights(analysis):
    """Print detailed analysis insights."""
    print("\n" + "="*50)
    print("DETAILED ANALYSIS INSIGHTS")
    print("="*50)

    # Sites with missing critical data
    sites_analysis = analysis.get('sites', {})
    sites_missing = sites_analysis.get('sites_with_critical_missing', [])

    if sites_missing:
        print(f"\nSites with Critical Missing Data ({len(sites_missing)} found):")
        for site in sites_missing[:3]:  # Show first 3
            print(f" • {site.get('name', 'Unknown')} (ID: {site.get('id')})")
            print(f" Missing: {', '.join(site.get('missing_fields', []))}")

    # Organizations with missing critical data
    orgs_analysis = analysis.get('organizations', {})
    orgs_missing = orgs_analysis.get('organizations_with_critical_missing', [])

    if orgs_missing:
        print(f"\nOrganizations with Critical Missing Data ({len(orgs_missing)} found):")
        for org in orgs_missing[:3]:  # Show first 3
            print(f" • {org.get('name', 'Unknown')} (ID: {org.get('id')})")
            print(f" Missing: {', '.join(org.get('missing_fields', []))}")

    # Data integrity issues
    combined = analysis.get('combined', {})
    orphaned = combined.get('orphaned_sites', {})

    if orphaned.get('count', 0) > 0:
        print(f"\nData Integrity Issues:")
        print(f" • {orphaned.get('count')} orphaned sites found")


def show_example_output():
    """Show example output when API is not available."""
    print("\n" + "="*50)
    print("EXAMPLE OUTPUT (API not available)")
    print("="*50)

    print("""
Data Overview:
• Total Sites: 150
• Total Organizations: 75
• Analysis Timestamp: 2024-12-19T10:30:00

Data Completeness Scores:
• Sites: 78.5/100 (Grade: C)
- Critical fields: 85.2/100
- Optional fields: 65.8/100
• Organizations: 82.1/100 (Grade: B)
- Critical fields: 88.4/100
- Optional fields: 70.3/100

Data Integrity Issues:
• Orphaned Sites: 3 (2.0%)
• Sites w/ Incomplete Organizations: 12 (8.0%)

Key Recommendations:
1. Priority: 25% of sites missing critical field 'publicEmail'
2. Priority: 18% of organizations missing critical field 'streetAddress'
3. Data integrity issue: 3 sites have missing or invalid organization references
""")

    print("\nExample Missing Data Fields:")
    print(" Sites commonly missing:")
    print(" • publicEmail (25% missing)")
    print(" • website (35% missing)")
    print(" • description (40% missing)")
    print(" ")
    print(" Organizations commonly missing:")
    print(" • streetAddress (18% missing)")
    print(" • publicPhone (22% missing)")
    print(" • ein (45% missing)")


if __name__ == "__main__":
    main()