
S75-0326-AirToxics-Python-pandas-numpy-scikit-learn-matlotlib-streamlit

🌍 Air Quality Intelligence Dashboard for Indian Cities

✅ Project Status: FULLY BUILT & OPERATIONAL

This is a complete, working implementation of an interactive air quality monitoring system with machine learning predictions and health recommendations.


📌 Project Overview

Urban areas in India generate massive volumes of air pollution data daily through monitoring stations. However, this data is often complex, fragmented, and difficult for the general public to interpret.

This project bridges that gap by transforming raw air quality data into clear, interactive visualizations and predictive insights. Using data science and machine learning, the system enables citizens to understand pollution trends, assess health risks, and make informed daily decisions.


❗ Problem Statement

Despite the availability of air quality data in Indian cities:

  • Citizens struggle to interpret AQI (Air Quality Index) values
  • Lack of clear trend visualization over time
  • No accessible forecasting of pollution levels
  • Limited awareness of health risks associated with pollution

This results in poor decision-making regarding:

  • Outdoor activities
  • Travel planning
  • Health precautions

🎯 Objectives

  • ✅ Simplify complex pollution data for public understanding
  • ✅ Provide visual insights into AQI trends
  • ✅ Forecast future pollution levels using ML models
  • ✅ Enable health-conscious decision-making

🚀 Features (IMPLEMENTED)

📊 1. Data Visualization

  • ✅ Interactive charts showing daily AQI trends
  • ✅ Pollutant distribution analysis (PM2.5, PM10, NO₂, O₃, CO, SO₂)
  • ✅ City-wise comparison dashboards
  • ✅ Correlation heatmaps and distributions

📈 2. Trend Analysis

  • ✅ Historical data analysis
  • ✅ Seasonal pollution pattern detection
  • ✅ Monthly trends and patterns
  • ✅ City-specific analysis

🔮 3. AQI Forecasting

  • ✅ Random Forest ML models for AQI prediction
  • ✅ R² scores: 0.90-0.96 on the generated sample data
  • ✅ Multi-day forecasts (up to 14 days)
  • ✅ Individual models per city

⚠️ 4. Health Risk Indicators

  • ✅ 6-category AQI scale (Good → Hazardous)
  • ✅ Health impact recommendations
  • ✅ Risk group-specific guidelines
  • ✅ Activity recommendations
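
The 6-category scale can be expressed as a simple breakpoint lookup. The sketch below is illustrative only: it assumes US EPA-style breakpoints and labels (Good through Hazardous), since this README does not list the thresholds the project actually uses, and `aqi_category` is a hypothetical helper rather than a function from `src/`.

```python
from bisect import bisect_left

# Assumed inclusive upper bounds for the first five categories; anything
# above 300 falls into the last one. These follow the US EPA AQI scale and
# may differ from the thresholds used inside this project.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi: float) -> str:
    """Map a non-negative AQI value to one of six category labels."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left returns the index of the first bound that is >= aqi,
    # which is exactly the index of the matching label.
    return AQI_LABELS[bisect_left(AQI_UPPER_BOUNDS, aqi)]
```

Keeping the bounds and labels in two parallel lists makes it easy to swap in a different scale without touching the lookup logic.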

🖥️ 5. Interactive Dashboard

  • ✅ Multi-page Streamlit interface
  • ✅ Real-time data display and filtering
  • ✅ City-wise drill-down analysis
  • ✅ Responsive design

🛠️ Tech Stack

🐍 Programming Language

  • Python 3.13.6

📦 Data Processing

  • Pandas 2.0.3 – Data cleaning and manipulation
  • NumPy 1.24.3 – Numerical computations

🤖 Machine Learning

  • scikit-learn 1.3.0
    • Random Forest Regression for AQI prediction
    • Linear Regression as baseline
    • Model evaluation and metrics (R², RMSE, MAE)

📊 Visualization

  • Matplotlib 3.7.2 – Static plots and trend analysis
  • Seaborn 0.12.2 – Enhanced statistical visualizations
  • Plotly 5.17.0 – Interactive visualizations

🌐 Web Framework

  • Streamlit 1.27.0 – Interactive dashboard and real-time UI

📁 Project Structure

.
├── app.py                          # Main Streamlit application
├── setup.py                        # Project initialization script
├── requirements.txt                # Python dependencies
├── README.md                       # Project documentation
│
├── src/                            # Source code modules
│   ├── data_loader.py              # Data loading and preprocessing
│   ├── ml_models.py                # Machine learning models
│   ├── visualizations.py           # Visualization utilities
│   ├── basic_analysis.py           # Basic analytics
│   ├── data_types_demo.py          # Data type examples
│   └── fundamentals/               # Advanced modules
│       ├── functions_demo.py
│       ├── numpy_broadcasting.py
│       ├── pandas_series_demo.py
│       └── vectorized_operations.py
│
├── notebooks/                      # Learning notebooks
│   └── setup/                      # Jupyter notebooks for learning
│       ├── pandas_dataframes.ipynb
│       ├── numpy_arrays_from_lists.ipynb
│       └── ...
│
├── data/                           # Data directory
│   ├── raw/                        # Raw air quality data
│   │   └── air_quality_data.csv    # Generated sample data (1825 records)
│   ├── processed/                  # Processed data
│   │   └── processed_data.csv
│   └── outputs/                    # Analysis outputs
│
└── models/                         # Trained ML models
    └── aqi_model.pkl               # Serialized models for 5 cities

🧠 How It Works

1. Data Generation & Preprocessing

raw_data → cleaning → normalization → feature engineering → processed_data
  • Generates realistic air quality data for 5 Indian cities
  • Handles seasonal variations and weekly patterns
  • Creates lag features and moving averages
  • Adds derived features (month, day_of_week, is_weekend)
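
As a rough illustration of the feature engineering step, the sketch below adds calendar, lag, and moving-average columns per city with pandas. The column names (`city`, `date`, `aqi`) and the helper `add_time_features` are assumptions made for this example; the actual logic lives in `src/data_loader.py` and may differ.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, lag, and moving-average features for each city.

    Assumes columns named 'city', 'date', and 'aqi' (hypothetical names).
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["city", "date"])

    # Calendar features derived from the date.
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)

    # Lag features are computed within each city so values never cross cities.
    by_city = out.groupby("city")["aqi"]
    out["aqi_lag_1"] = by_city.shift(1)
    out["aqi_lag_7"] = by_city.shift(7)

    # Shift before averaging so the window only contains past days and the
    # target value never leaks into its own feature.
    out["aqi_ma_7"] = by_city.transform(lambda s: s.shift(1).rolling(7).mean())
    return out
```

The first rows of each city have no history, so their lag and moving-average values are NaN and would typically be dropped before training.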

2. Exploratory Data Analysis (EDA)

processed_data → statistical analysis → visualization → insights
  • Analyzes pollutant distributions
  • Identifies seasonal patterns
  • Computes correlations
  • Creates 8+ visualization types

3. Model Building & Training

features → train/test split → scaling → Random Forest → prediction
  • Separate models per city (optimized for local patterns)
  • Performance Metrics:
    • Bangalore: R² = 0.919, RMSE = 7.77
    • Chennai: R² = 0.902, RMSE = 8.42
    • Delhi: R² = 0.949, RMSE = 9.17
    • Kolkata: R² = 0.957, RMSE = 7.31
    • Mumbai: R² = 0.921, RMSE = 9.44
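
The per-city training loop can be sketched roughly as follows. This is a minimal example under stated assumptions, not the code in `src/ml_models.py`: the column names and the helper `train_per_city` are hypothetical, the 80/20 split is taken chronologically (the README does not say how the split is drawn), and the scaling step is omitted because tree-based models do not depend on feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def train_per_city(df, feature_cols, target_col="aqi", test_fraction=0.2):
    """Fit one Random Forest per city and report hold-out R² and RMSE.

    Assumes 'city' and 'date' columns exist (hypothetical names). The test
    set is the most recent `test_fraction` of each city's rows.
    """
    feature_cols = list(feature_cols)
    models, scores = {}, {}
    for city, group in df.sort_values("date").groupby("city"):
        group = group.dropna(subset=feature_cols + [target_col])
        cut = int(len(group) * (1 - test_fraction))
        train, test = group.iloc[:cut], group.iloc[cut:]

        # Hyperparameters mirror the ones listed in this README.
        model = RandomForestRegressor(
            n_estimators=100,
            max_depth=10,
            min_samples_split=5,
            random_state=42,
            n_jobs=-1,
        )
        model.fit(train[feature_cols], train[target_col])

        pred = model.predict(test[feature_cols])
        rmse = float(np.sqrt(mean_squared_error(test[target_col], pred)))
        scores[city] = {"r2": float(r2_score(test[target_col], pred)), "rmse": rmse}
        models[city] = model
    return models, scores
```

A chronological split is used here because a random split would let the model see days that come after the ones it is tested on, which tends to inflate scores when lag features are involved.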

4. Forecasting & Visualization

trained_model + future_features → predictions → visualization → dashboard
  • Generates up to 14-day forecasts
  • Visualizes predictions with confidence
  • Creates interactive charts
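
One common way to produce multi-day forecasts from a one-step model is to feed each prediction back in as history for the next day. The sketch below shows that idea in plain Python. It is an assumption about the general approach, not a description of how `forecast_next_days` is implemented, and both `recursive_forecast` and `moving_average_model` are hypothetical names.

```python
from typing import Callable, List, Sequence


def moving_average_model(window: Sequence[float]) -> float:
    """Stand-in for a trained model: predict the mean of the last 7 values."""
    recent = window[-7:]
    return sum(recent) / len(recent)


def recursive_forecast(
    history: Sequence[float],
    predict_next: Callable[[Sequence[float]], float],
    days: int = 7,
    max_days: int = 14,
) -> List[float]:
    """Forecast `days` steps ahead by appending each prediction to the history."""
    if not 1 <= days <= max_days:
        raise ValueError(f"days must be between 1 and {max_days}")
    window = list(history)
    forecasts = []
    for _ in range(days):
        next_value = float(predict_next(window))
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes tomorrow's history
    return forecasts
```

For example, `recursive_forecast(last_30_days, moving_average_model, days=7)` returns seven values. Because errors compound as predictions are reused, later days in the horizon are less reliable than earlier ones.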

5. Interactive Dashboard

streamlit + data + models + visualizations → web application
  • 5 main pages with different functionalities
  • Real-time data filtering
  • Health recommendations engine
  • Model performance metrics

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Initialize Project

python setup.py

This will:

  • Generate sample air quality data
  • Train ML models for all cities
  • Create processed data files
  • Save serialized models

3. Run Dashboard

streamlit run app.py

The dashboard will open at: http://localhost:8501


📊 Dashboard Pages

🏠 Dashboard (Home)

  • Key Metrics: Average, Highest, Lowest AQI
  • Visualizations:
    • 30-day AQI trend lines
    • City comparison bar chart
    • Monthly seasonal patterns
    • Pollutant correlation heatmap

🏙️ City Analysis

  • Select a city from dropdown
  • Current pollutant levels (PM2.5, PM10, NO₂, O₃, CO, SO₂)
  • 60-day trend analysis for selected city
  • Pollutant composition pie chart
  • Detailed pollutant table

🔮 Forecasting

  • Custom forecast (1-14 days)
  • Visual comparison with historical data
  • Forecast table with predictions
  • Model performance metrics (R², RMSE, MAE, MSE)

⚠️ Health Guidelines

  • AQI categories with health impacts
  • Personalized recommendations based on current AQI
  • Health tips for different groups:
    • General population
    • At-home recommendations
    • At-risk groups (children, elderly, respiratory patients)

ℹ️ About

  • Project overview and objectives
  • Tech stack details
  • Dataset information
  • Target users and impact

📈 Sample Data

The project includes 1,825 realistic sample records covering:

  • 5 Major Cities: Delhi, Mumbai, Chennai, Bangalore, Kolkata
  • 365 Days of historical data
  • Realistic Patterns:
    • Seasonal variations (higher pollution in winter)
    • Weekly cycles (higher weekday traffic)
    • Random weather-based fluctuations
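
To make the three pattern components concrete, here is a small standard-library sketch that combines a seasonal term, a weekday bump, and random noise. The function name `generate_city_aqi` and every coefficient in it are illustrative assumptions; the generator that `setup.py` runs may use different shapes and magnitudes.

```python
import math
import random
from datetime import date, timedelta


def generate_city_aqi(base_aqi: float, start: date, days: int = 365, seed: int = 42):
    """Return (date, aqi) pairs with seasonal, weekly, and random components."""
    rng = random.Random(seed)  # seeded so the series is reproducible
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        day_of_year = day.timetuple().tm_yday
        # The cosine peaks around 1 January, so winter gets the highest values.
        seasonal = 0.4 * base_aqi * math.cos(2 * math.pi * (day_of_year - 1) / 365)
        # Weekdays (Monday to Friday) get a small traffic-related increase.
        weekday_bump = 0.1 * base_aqi if day.weekday() < 5 else 0.0
        # Gaussian noise stands in for weather-driven fluctuations.
        noise = rng.gauss(0, 0.05 * base_aqi)
        rows.append((day, max(0.0, base_aqi + seasonal + weekday_bump + noise)))
    return rows
```

Calling this once per city with 365 days each yields 5 × 365 = 1,825 rows, which matches the record count quoted above.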

Data Features:

  • AQI (consolidated index)
  • PM2.5, PM10 (particulate matter)
  • NO₂ (nitrogen dioxide)
  • O₃ (ozone)
  • CO (carbon monoxide)
  • SO₂ (sulfur dioxide)
  • Temperature, Humidity

🤖 Machine Learning Models

Model Type: Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,      # 100 decision trees
    max_depth=10,          # Tree depth limit
    min_samples_split=5,   # Minimum samples for split
    random_state=42,       # Reproducibility
    n_jobs=-1              # Use all cores
)

Features Used:

  1. Current pollutant levels (PM2.5, PM10, NO₂, O₃, CO, SO₂)
  2. Weather data (temperature, humidity)
  3. Temporal features (month, is_weekend)
  4. Lag features (AQI from 1, 7, 30 days ago)
  5. Moving averages (7, 30-day moving averages)

Metrics:

  • R² Score: 0.90-0.96 (explains 90-96% of variance)
  • RMSE: 7-10 AQI points across the five city models
  • MAE: Mean Absolute Error tracking
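
For reference, the metrics above can be computed from their definitions in a few lines. This is a plain-Python sketch with a hypothetical helper name (`regression_metrics`); in practice `sklearn.metrics` provides equivalent functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MSE, RMSE, and MAE from paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n   # mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")

    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
    }
```

RMSE and MAE are in AQI points, so they can be read directly against the category widths, while R² is unitless.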

📊 Key Visualizations

1. Trend Charts

  • Line plots with fill-between for visual impact
  • AQI zones (Good, Satisfactory, Poor, etc.)
  • Smoothed city-level series

2. City Comparison

  • Bar charts with color-coded AQI levels
  • Threshold lines for Moderate (100) and Poor (200)
  • Value labels on bars

3. Pollutant Composition

  • Pie charts showing pollutant percentages
  • Color-coded by pollutant type
  • Latest data snapshot

4. Monthly Patterns

  • Line plots with min-max range fills
  • Identifies seasonal peaks/troughs
  • All months annotated

5. Correlation Heatmap

  • Shows relationships between variables
  • Red = positive, Blue = negative correlation
  • Helps identify pollution drivers

6. Forecast Visualization

  • Historical data + forecast overlay
  • Different colors/styles for distinction
  • Confidence visualization

💾 Data Files

Generated Files

data/raw/air_quality_data.csv           # Raw sample data (1825 rows)
data/processed/processed_data.csv       # Cleaned and featured data
models/aqi_model.pkl                    # Trained ML models (pickle)

Data Preprocessing Steps

  1. Forward and backward fill for NaN values
  2. Remove duplicate date-city combinations
  3. Sort by city and date
  4. Create temporal features
  5. Add lag features
  6. Add moving averages
  7. Categorize AQI levels
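
The first three steps might look like the following in pandas. This is a sketch under assumptions: the columns `city` and `date` and the helper `clean_air_quality` are hypothetical, and the steps are reordered so that filling happens after sorting, because forward and backward fill only make sense on rows that are already in date order within each city.

```python
import pandas as pd


def clean_air_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, sort, and fill gaps per city.

    Assumes 'city' and 'date' columns; every other column is treated as a
    measurement to be filled.
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])

    # Keep the first record for each (city, date) pair.
    out = out.drop_duplicates(subset=["city", "date"], keep="first")
    out = out.sort_values(["city", "date"]).reset_index(drop=True)

    # Fill forward, then backward, within each city so values never leak
    # from one city into another.
    value_cols = [c for c in out.columns if c not in ("city", "date")]
    out[value_cols] = out.groupby("city")[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    return out
```

The backward fill only matters for gaps at the very start of a city's series, where there is no earlier value to carry forward.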

🎯 Use Cases

Urban Residents

  • Check daily AQI and plan outdoor activities
  • Get personalized health recommendations
  • Receive 7-day forecasts for planning

Health-Conscious Individuals

  • Track pollutant levels by type
  • Get health risk assessments
  • Access group-specific guidelines

Researchers & Policymakers

  • Analyze seasonal patterns
  • Study city-wise pollution trends
  • Access clean, processed datasets
  • Understand ML prediction accuracy

Students & Data Enthusiasts

  • Learn data science workflows
  • Understand ML model implementation
  • Explore real-world dataset
  • Reference for projects

🧪 Testing & Validation

Model Validation

  • Train/test split: 80/20
  • Cross-validation metrics
  • RMSE < 10 for all cities
  • R² > 0.90 for all models
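
For time-ordered data, cross-validation is usually done with expanding windows rather than shuffled folds. The sketch below uses scikit-learn's `TimeSeriesSplit` with `cross_val_score`. It assumes the rows of `X` and `y` are already in date order, and `time_series_cv_rmse` is a hypothetical helper, not necessarily how this project computes its cross-validation metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def time_series_cv_rmse(X, y, n_splits=5):
    """Return one RMSE per fold, training only on rows that precede each test fold."""
    model = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42,
    )
    scores = cross_val_score(
        model,
        X,
        y,
        cv=TimeSeriesSplit(n_splits=n_splits),
        scoring="neg_root_mean_squared_error",
    )
    # scikit-learn reports errors as negative scores, so flip the sign.
    return -scores
```

Comparing the per-fold values with the single hold-out RMSE gives a sense of how stable the model is across different periods of the year.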

Data Quality

  • No missing values after preprocessing
  • Realistic value ranges
  • Seasonal patterns validated
  • Correlation checks passed

Application Testing

  • All pages load successfully
  • Interactive elements work
  • Forecasts generate correctly
  • Visualizations render properly

📚 Code Examples

Load Data

from src.data_loader import AirQualityDataLoader

loader = AirQualityDataLoader()
df = loader.load_data()
df = loader.preprocess_data(df)
df = loader.categorize_aqi(df)

Train Models

from src.ml_models import AQIPredictor

predictor = AQIPredictor()
predictor.train_city_models(df)
predictor.save_model()

Create Visualization

import matplotlib.pyplot as plt

from src.visualizations import AQIVisualizer

visualizer = AQIVisualizer()
fig = visualizer.plot_aqi_trend(df, city='Delhi', days=30)
plt.show()

Generate Forecast

forecast_df = predictor.forecast_next_days(df, 'Delhi', days=7)
print(forecast_df)

🔮 Future Enhancements

  • Integration with real-time APIs (CPCB, AirVisual)
  • Mobile-friendly UI with responsive design
  • Advanced ML models (LSTM for time-series)
  • Email alerts for hazardous AQI levels
  • Historical data archival (1+ years)
  • User profiles with saved preferences
  • API endpoint for programmatic access
  • Docker containerization for deployment

👥 Target Users

  • Urban residents across Indian cities
  • Health-conscious individuals
  • Researchers & policymakers
  • Students and data enthusiasts
  • Environmental organizations

💡 Impact

This project empowers citizens by:

✅ Making pollution data easy to understand
✅ Encouraging health-aware lifestyle choices
✅ Promoting awareness about environmental conditions
✅ Enabling data-driven decision making
✅ Supporting policy research and development


🧪 Environment Setup Verification

💻 Operating System

Windows 11 (64-bit)

🐍 Python Version

Python 3.13.6

📦 Package Versions

  • pandas: 2.0.3
  • numpy: 1.24.3
  • scikit-learn: 1.3.0
  • matplotlib: 3.7.2
  • streamlit: 1.27.0
  • seaborn: 0.12.2
  • plotly: 5.17.0

✅ Verification Status

✅ Python is installed and configured
✅ All dependencies installed successfully
✅ Sample data generated (1825 records)
✅ ML models trained and saved
✅ Streamlit dashboard is operational
✅ All visualizations render correctly


📖 Documentation

This project includes:

  • ✅ Detailed README (this file)
  • ✅ Inline code comments
  • ✅ Docstrings for all functions
  • ✅ Setup automation script
  • ✅ Example usage in main blocks

📝 License

This is an educational project for learning data science, machine learning, and web application development.


🎓 Learning Outcomes

By exploring this project, you'll learn:

  1. Data Engineering

    • Data loading and preprocessing
    • Feature engineering techniques
    • Handling time-series data
  2. Machine Learning

    • Model training and evaluation
    • Hyperparameter tuning
    • Cross-validation
  3. Data Visualization

    • Creating publication-quality plots
    • Interactive visualizations
    • Dashboard design
  4. Web Development

    • Building interactive web apps
    • Streamlit framework
    • User experience design
  5. Software Engineering

    • Project structure and organization
    • Code modularity and reusability
    • Best practices

🤝 Contributing

This project is open for enhancements. Suggested areas:

  • Real API integration
  • Additional cities
  • More ML models
  • Enhanced UI
  • Performance optimization

📞 Support

For issues or questions:

  1. Check the code documentation
  2. Review example usage in main blocks
  3. Consult the About page in dashboard

Last Updated: April 8, 2026
Project Status: ✅ Complete and Operational

🐍 Python Verification

Command: python --version

Output: Python 3.11.x

Command: python

Test:

print("Hello DS Sprint")

Verification:

  • Python is accessible via terminal
  • Python REPL runs without errors

🐍 Conda Verification

Command: conda --version

Output: conda 24.x.x

Command: conda env list

Output: (base) environment available

Command: conda activate base

Verification:

  • Conda is installed and accessible
  • Environment activates successfully

📓 Jupyter Verification

Command: jupyter notebook

Verification:

  • Jupyter opens successfully in browser
  • New notebook created
  • Python cell executed:

print("Jupyter working")

Output: Jupyter working


✅ Conclusion

Python, Conda, and Jupyter are correctly installed and integrated. The environment is verified and ready for Data Science workflows.

Markdown in Jupyter Notebook

📌 Overview

This project demonstrates how to use Markdown in Jupyter Notebooks to create clear, structured, and readable documentation alongside code.

Markdown helps transform notebooks into professional, easy-to-understand documents by explaining the logic, steps, and results of the analysis.

🎯 Objectives

  • Understand Markdown cells and their purpose
  • Use headings to organize notebook content
  • Create ordered and unordered lists
  • Write inline code and code blocks
  • Combine Markdown and code cells effectively

🛠️ Tools Used

  • Python
  • Jupyter Notebook
  • Markdown

🔄 Workflow

  1. Add Markdown cell for explanation
  2. Add Code cell for execution
  3. Add Markdown cell for interpretation
  4. Repeat for each

💡 Conclusion

Markdown plays a key role in making notebooks understandable. It acts as a bridge between code and human understanding, helping others follow the thought process clearly.

Data Science Project Structure

📌 Overview

This project demonstrates how to create a clean and organized folder structure for Data Science work. A well-structured project helps in maintaining clarity, avoiding confusion, and making collaboration easier.

🎯 Objectives

  • Understand the importance of project organization
  • Create a standard folder structure
  • Separate data, code, and outputs
  • Make the project easy to navigate and reuse

💡 Conclusion

A well-organized project structure is essential for scalable and maintainable Data Science work. It improves efficiency and helps others understand your project easily.

About

Bridging the gap between complex Indian pollution data and actionable health decisions through intuitive, localized, and trend-focused visualizations.
