thilankadw/Telco_Churn_Model_Pipeline---Mini_Project_0---Building_Production-Ready_ML_Systems_by_ZuuCrew

Enhanced MLflow Artifact Tracking for ML Pipelines

This project demonstrates production-ready machine learning pipelines with comprehensive MLflow artifact tracking, focusing on Telco customer churn prediction.

🎯 Project Overview

A complete ML system with enhanced MLflow tracking that provides:

  • Comprehensive Data Lineage: Track data from raw input to final model predictions
  • Rich Artifact Management: Automated logging of datasets, models, visualizations, and metadata
  • Production-Ready Monitoring: Real-time inference tracking and performance monitoring
  • Complete Reproducibility: All artifacts needed to reproduce experiments and results

📁 Project Structure

Telco Churn Model Pipeline/
├── README.md                          # This file - comprehensive project documentation
├── Makefile                           # Build and deployment automation
├── config.yaml                        # Central configuration management
├── requirements.txt                   # Python dependencies
│
├── artifacts/                         # Generated artifacts and models
│   ├── data/                          # Processed datasets
│   │   ├── X_train.csv                # Training features
│   │   ├── X_test.csv                 # Testing features
│   │   ├── Y_train.csv                # Training labels
│   │   └── Y_test.csv                 # Testing labels
│   ├── encode/                        # Feature encoders
│   ├── models/                        # Trained models
│   │   └── churn_prediction_model.pkl # Main trained model
│   ├── scale/                         # Feature scalers
│   └── mlflow_run_artifacts/          # MLflow-specific artifacts
│       └── {run_id}/                  # Run-specific artifacts
│           ├── visualizations_*/      # Data visualizations by stage
│           └── final_csv_files/       # Final dataset metadata
│
├── data/                              # Data storage
│   ├── raw/                           # Original raw data
│   │   └── telcochurndata.csv         # Raw Telco customer churn dataset
│   └── processed/                     # Intermediate processed data
│       ├── missing_val_handled.csv    # Data after missing value handling
│       ├── outliers_handled.csv       # Data after outlier removal
│       ├── encoded.csv                # Data after feature encoding
│       └── scaled.csv                 # Data after feature scaling
│
├── mlruns/                            # MLflow tracking storage
│   ├── 0/                             # Default experiment
│   ├── models/                        # MLflow model registry
│   └── {experiment_id}/               # Experiment-specific runs
│       └── {run_id}/                  # Individual run artifacts
│           ├── artifacts/             # Run artifacts
│           ├── metrics/               # Logged metrics
│           ├── params/                # Logged parameters
│           └── tags/                  # Run tags and metadata
│
├── pipelines/                         # ML pipeline implementations
│   ├── __pycache__/                   # Python cache files
│   ├── data_pipeline.py               # ✨ Enhanced data processing pipeline
│   ├── training_pipeline.py           # ✨ Enhanced model training pipeline
│   └── streaming_inference_pipeline.py # ✨ Enhanced inference pipeline
│
├── src/                               # Core ML modules
│   ├── __pycache__/                   # Python cache files
│   ├── __init__.py                    # Package initialization
│   ├── data_ingestion.py              # Data loading and validation
│   ├── data_spiltter.py               # Train/test splitting strategies
│   ├── feature_binning.py             # Feature binning transformations
│   ├── feature_encoding.py            # Feature encoding strategies
│   ├── feature_scaling.py             # Feature scaling transformations
│   ├── handle_missing_values.py       # Missing value handling strategies
│   ├── model_building.py              # Model architecture definitions
│   ├── model_evaluation.py            # Model evaluation metrics
│   ├── model_inference.py             # Model inference and prediction
│   ├── model_training.py              # Model training orchestration
│   └── outlier_detection.py           # Outlier detection and handling
│
└── utils/                             # Utility modules
    ├── __pycache__/                   # Python cache files
    ├── config.py                      # Configuration management
    └── mlflow_utils.py                # MLflow tracking utilities

🚀 Key Enhancements Implemented

1. Enhanced Data Pipeline (pipelines/data_pipeline.py)

📊 Comprehensive Data Profiling

  • Stage-wise Tracking: Profiles data at each processing stage (raw → missing_val_handled → outliers_handled → encoded → scaled → final)
  • Rich Visualizations: Automatic generation of distribution plots and correlation matrices
  • Dataset Artifacts: Proper MLflow dataset tracking with lineage and versioning

🔍 Data Quality Monitoring

  • Metrics Tracking: Rows, columns, missing values, memory usage at each stage
  • Transformation Logging: Before/after metrics for each transformation step
  • Error Handling: Graceful handling of processing failures with detailed logging
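
The stage-wise metrics described above could be computed along the following lines. This is a minimal sketch, not the project's actual `log_stage_metrics`: the function name `compute_stage_metrics` and the toy DataFrame are hypothetical, and only the `<stage>_rows`, `<stage>_columns`, `<stage>_missing_values`, and `<stage>_memory_mb` naming follows the metrics listed in this README.

```python
import pandas as pd


def compute_stage_metrics(df: pd.DataFrame, stage: str) -> dict:
    """Return row, column, missing-value and memory metrics for one stage."""
    rows, columns = df.shape
    return {
        f"{stage}_rows": rows,
        f"{stage}_columns": columns,
        f"{stage}_missing_values": int(df.isna().sum().sum()),
        # deep=True also counts the contents of object (string) columns
        f"{stage}_memory_mb": float(df.memory_usage(deep=True).sum()) / 1024**2,
    }


# Toy data standing in for the raw Telco dataset (hypothetical values)
raw = pd.DataFrame(
    {"tenure": [1, 24, None], "Contract": ["Monthly", None, "Yearly"]}
)
raw_metrics = compute_stage_metrics(raw, "raw")

# Before/after comparison for a single transformation step
cleaned = raw.dropna()
cleaned_metrics = compute_stage_metrics(cleaned, "missing_val_handled")
rows_removed = raw_metrics["raw_rows"] - cleaned_metrics["missing_val_handled_rows"]
```

Inside an active MLflow run, a dictionary like `raw_metrics` could then be passed to `mlflow.log_metrics`.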

📁 Artifact Management

# Example: Data profiling and visualization (project helper functions)
import mlflow

create_data_visualizations(df, 'raw', run_artifacts_dir)
log_stage_metrics(df, 'raw')

# MLflow dataset tracking: register the raw DataFrame as a run input
raw_dataset = mlflow.data.from_pandas(df, source=data_path, name="raw_churn_data")
mlflow.log_input(raw_dataset, context="raw_data")

2. Enhanced Training Pipeline (pipelines/training_pipeline.py)

🎯 Model Performance Tracking

  • Comprehensive Visualizations: Confusion matrices, ROC curves, feature importance plots
  • Training Metadata: Training time, model size, complexity metrics
  • Performance Analytics: Detailed model performance analysis and comparison
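
Training metadata such as training time and model size could be captured with a small wrapper like the one below. This is a sketch under stated assumptions: `measure_training` and `fit_majority_class` are hypothetical names, the "model" is a stand-in dictionary rather than the project's XGBoost model, and the pickled size is used as a proxy for the size on disk.

```python
import pickle
import time


def measure_training(train_fn, *args, **kwargs):
    """Run train_fn and return (model, metadata) with time and size metrics."""
    start = time.perf_counter()
    model = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # Serialized size is used here as a proxy for the model size on disk
    size_mb = len(pickle.dumps(model)) / 1024**2
    return model, {
        "training_time_seconds": elapsed,
        "model_size_mb": size_mb,
    }


def fit_majority_class(labels):
    """Stand-in 'training' step: remembers the most frequent label."""
    return {"majority_class": max(set(labels), key=labels.count)}


model, metadata = measure_training(fit_majority_class, [0, 0, 1, 0, 1])
```

The keys of `metadata` match the `training_time_seconds` and `model_size_mb` metrics listed in this README, so the dictionary could be logged as-is.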

📈 Model Artifacts

# Example: Model performance visualization
create_model_performance_visualizations(model, X_test, y_test, evaluation_results, 
                                      run_artifacts_dir, 'XGboost')

# Model metadata logging
log_model_metadata(model, 'XGboost', model_params, training_time, run_artifacts_dir)

3. Enhanced Inference Pipeline (pipelines/streaming_inference_pipeline.py)

⚡ Real-time Monitoring

  • Batch Processing: Configurable batch sizes for efficient logging (default: 100 predictions)
  • Performance Tracking: Inference time, prediction distributions, risk categorization
  • Production Monitoring: Real-time model performance metrics

📊 Prediction Analytics

# Example: Inference tracking (method body elided)
class InferenceTracker:
    def track_prediction(self, input_data, prediction_result, inference_time):
        # Tracks individual predictions with metadata
        # Logs batches automatically when batch size is reached
        ...
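
To make the batching behaviour concrete, here is a self-contained sketch of how such a tracker might buffer predictions and write one JSON file per full batch. It is not the project's `InferenceTracker`: the class name `BatchingInferenceTracker`, the simplified method signature, the risk thresholds (0.7 and 0.4), and the counter suffix in the file name are all assumptions made for illustration. The summary keys follow the inference metrics listed in this README.

```python
import json
from datetime import datetime
from pathlib import Path
from tempfile import mkdtemp


class BatchingInferenceTracker:
    """Buffers predictions and writes one JSON file per full batch."""

    def __init__(self, output_dir, batch_size=100):
        self.output_dir = Path(output_dir)
        self.batch_size = batch_size
        self.buffer = []
        self.batches_written = 0

    def track_prediction(self, input_data, churn_probability, inference_time_ms):
        """Record one prediction; return a batch summary when a batch fills."""
        self.buffer.append({
            "input": input_data,
            "churn_probability": churn_probability,
            "inference_time_ms": inference_time_ms,
            "risk": self._risk(churn_probability),
        })
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return None

    @staticmethod
    def _risk(probability):
        # Assumed thresholds; the README does not state the real cut-offs
        if probability >= 0.7:
            return "high"
        return "medium" if probability >= 0.4 else "low"

    def flush(self):
        """Summarize the buffered predictions, write them to disk, reset."""
        if not self.buffer:
            return None
        n = len(self.buffer)
        summary = {
            "batch_size": n,
            "avg_inference_time_ms": sum(p["inference_time_ms"] for p in self.buffer) / n,
            "avg_churn_probability": sum(p["churn_probability"] for p in self.buffer) / n,
        }
        for level in ("high", "medium", "low"):
            summary[f"{level}_risk_predictions"] = sum(
                p["risk"] == level for p in self.buffer
            )
        # The counter suffix avoids name clashes within the same second
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = self.output_dir / f"inference_batch_{stamp}_{self.batches_written}.json"
        path.write_text(json.dumps({"summary": summary, "predictions": self.buffer}))
        self.buffer = []
        self.batches_written += 1
        return summary


tracker = BatchingInferenceTracker(mkdtemp(), batch_size=3)
results = [
    tracker.track_prediction({"tenure": tenure}, probability, ms)
    for tenure, probability, ms in [(1, 0.9, 4.0), (30, 0.5, 6.0), (60, 0.1, 5.0)]
]
summary = results[-1]  # the third prediction fills the batch and triggers a flush
```

In a pipeline with an active MLflow run, the written file could be attached with `mlflow.log_artifact` and the summary sent to `mlflow.log_metrics`; both calls are left out here so the sketch runs without a tracking server.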

🛠️ MLflow Artifacts Generated

Data Pipeline Artifacts

MLflow Run Artifacts:
├── raw_data/                         # Original dataset
├── visualizations/                   # Stage-wise data visualizations
│   ├── raw/                          # Raw data distributions
│   ├── encoded/                      # Post-encoding visualizations
│   └── final/                        # Final processed data plots
├── final_datasets/                   # Train/test CSV files with metadata
│   ├── X_train.csv, X_test.csv       # Feature datasets
│   ├── Y_train.csv, Y_test.csv       # Label datasets
│   └── final_csv_metadata.json       # Comprehensive metadata
└── processed_datasets/               # Final processed datasets

Training Pipeline Artifacts

MLflow Run Artifacts:
├── model_performance/                # Model performance analysis
│   ├── XGboost/                      # Model-specific artifacts
│   │   ├── confusion_matrix_XGboost.png
│   │   ├── roc_curve_XGboost.png
│   │   ├── feature_importance_XGboost.png
│   │   └── prediction_distribution_XGboost.png
├── model_metadata/                   # Model metadata and information
│   └── model_metadata_XGboost.json
├── trained_models/                   # Actual model files
│   └── churn_analysis.joblib
└── training_summary/                 # Complete training summary
    └── training_summary.json

Inference Pipeline Artifacts

MLflow Run Artifacts:
├── inference_batches/                # Prediction batch logs
│   ├── inference_batch_20241219_143022.json
│   └── inference_batch_20241219_143122.json
└── prediction_analytics/             # Inference performance metrics

📊 MLflow Tracking Features

Dataset Tracking

  • MLflow Datasets: Proper dataset versioning and lineage tracking
  • Schema Evolution: Automatic tracking of schema changes
  • Data Lineage: Complete traceability from raw data to final models

Metrics Logged

# Data Pipeline Metrics
- raw_rows, raw_columns, raw_missing_values, raw_memory_mb
- missing_handled_rows_removed, outliers_removed_count
- final_train_samples, final_test_samples, final_features
- train_class_0, train_class_1, test_class_0, test_class_1

# Training Pipeline Metrics  
- training_time_seconds, model_size_mb, model_complexity
- accuracy, precision, recall, f1, roc_auc
- XGboost_training_time_seconds, XGboost_model_size_mb

# Inference Pipeline Metrics
- batch_size, avg_inference_time_ms, avg_churn_probability
- high_risk_predictions, medium_risk_predictions, low_risk_predictions
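
The classification metrics listed above (accuracy, precision, recall, f1) follow the usual confusion-matrix definitions. The sketch below computes them with the standard library only, purely for illustration; the project most likely uses a library implementation instead, and the function name `classification_metrics` and the sample labels are hypothetical. `roc_auc` is omitted because it needs predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-churners wrongly flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churners correctly passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(
    y_true=[1, 0, 1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 0],
)
```

With these sample labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out at 0.75.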

Parameters Logged

# Pipeline Configuration
- final_feature_names, preprocessing_steps, data_pipeline_version
- model_type, training_strategy, sklearn_version
- feature_encoding_applied, feature_scaling_applied

# Model Parameters
- n_estimators, max_depth, random_state
- test_size, missing_value_strategy, outlier_detection_method

🚀 Getting Started

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Or using uv (recommended)
uv pip install -r requirements.txt

Running the Pipelines

1. Data Pipeline

# Run data processing pipeline
python pipelines/data_pipeline.py

# Or using Makefile
make data-pipeline

2. Training Pipeline

# Run model training pipeline
python pipelines/training_pipeline.py

# Or using Makefile  
make train-model

3. Inference Pipeline

# Run streaming inference
python pipelines/streaming_inference_pipeline.py

# Or using Makefile
make inference

MLflow UI

# Start MLflow UI to view experiments and artifacts
mlflow ui

# Access at: http://localhost:5000

📈 Key Benefits

🔍 Enhanced Observability

  • Complete Lineage: Track data and model lineage from raw input to predictions
  • Rich Visualizations: Automatic generation of insightful plots and charts
  • Comprehensive Metrics: Detailed metrics at every pipeline stage

🚀 Production Ready

  • Error Handling: Robust error handling with graceful degradation
  • Monitoring: Real-time inference monitoring and performance tracking
  • Reproducibility: Complete artifact tracking for experiment reproduction

⚡ Developer Experience

  • Automated Tracking: Minimal code changes for maximum tracking benefit
  • Rich Metadata: Comprehensive metadata for all artifacts
  • Easy Debugging: Quick access to intermediate results and visualizations

🔧 Configuration

The system is configured through config.yaml:

mlflow:
  tracking_uri: "file:./mlruns"
  experiment_name: "Telco Customer Churn Prediction"
  model_registry_name: "churn_prediction_v1"
  artifact_path: "model"
  run_name_prefix: "churn_run"
  tags:
    project: "telco_churn_prediction"
    team: "ml_engineering" 
    environment: "development"
    dataset: "telco_customer_churn"
  autolog: true
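
As an illustration of how these settings might be consumed, the sketch below validates the `mlflow` section and derives a run name from `run_name_prefix`. It assumes the YAML has already been parsed into a dictionary; the helper `resolve_mlflow_settings`, the list of required keys, and the prefix-plus-timestamp run-name format are assumptions, not the behaviour of the project's `utils/config.py`.

```python
from datetime import datetime

# Mirrors part of the `mlflow` section of config.yaml, already parsed to a dict
config = {
    "mlflow": {
        "tracking_uri": "file:./mlruns",
        "experiment_name": "Telco Customer Churn Prediction",
        "run_name_prefix": "churn_run",
        "tags": {"project": "telco_churn_prediction", "environment": "development"},
    }
}

# Keys this sketch treats as mandatory (an assumption)
REQUIRED_KEYS = ("tracking_uri", "experiment_name", "run_name_prefix")


def resolve_mlflow_settings(config, now=None):
    """Validate the mlflow section and derive a timestamped run name."""
    section = config.get("mlflow", {})
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError(f"Missing mlflow config keys: {missing}")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        "tracking_uri": section["tracking_uri"],
        "experiment_name": section["experiment_name"],
        "run_name": f"{section['run_name_prefix']}_{stamp}",
        "tags": dict(section.get("tags", {})),
    }


settings = resolve_mlflow_settings(config, now=datetime(2024, 12, 19, 14, 30, 22))
# The result could then be applied with, for example:
#   mlflow.set_tracking_uri(settings["tracking_uri"])
#   mlflow.set_experiment(settings["experiment_name"])
#   mlflow.start_run(run_name=settings["run_name"], tags=settings["tags"])
```

Failing fast on missing keys keeps a misconfigured run from silently logging to the default experiment.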

📊 Performance Optimizations

Code Efficiency

  • 68% Code Reduction: Data pipeline reduced from ~950 lines to ~300 lines
  • Consolidated Functions: Streamlined helper functions for better maintainability
  • Essential Visualizations: Focus on the most valuable plots and metrics

Resource Management

  • Memory Efficient: Handles large datasets and cleans up after processing
  • Batch Processing: Configurable batch sizes for inference tracking
  • Error Recovery: Graceful fallbacks when artifact logging fails
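
The graceful-fallback idea can be sketched as a small wrapper that turns a failed logging call into a warning instead of an exception. The names `log_safely` and `failing_logger` are hypothetical, and the project's own error handling may be structured differently.

```python
import logging

logger = logging.getLogger("pipeline.tracking")


def log_safely(description, log_fn, *args, **kwargs):
    """Call a logging function; on failure, warn and return False instead of raising."""
    try:
        log_fn(*args, **kwargs)
        return True
    except Exception as exc:  # tracking problems should not abort the pipeline
        logger.warning("Skipping %s: %s", description, exc)
        return False


def failing_logger(path):
    """Stand-in for an artifact-logging call that fails (hypothetical)."""
    raise OSError(f"cannot write {path}")


logged = []
ok = log_safely("metrics summary", logged.append, {"rows": 100})
failed = log_safely("ROC curve plot", failing_logger, "roc_curve.png")
```

In the pipelines, `log_fn` could be a call such as `mlflow.log_artifact`, so that a failed upload costs only the artifact and not the whole run.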

🎯 Future Enhancements

  • Data Drift Detection: Monitor for data drift in production
  • Model Registry Management: Automated model stage transitions
  • Advanced Monitoring: Additional performance and quality metrics
  • Integration Testing: Comprehensive pipeline testing framework

📝 Development Notes

This enhanced MLflow tracking system provides:

  • Production-grade logging throughout all modules
  • Comprehensive error handling and input validation
  • Enhanced type safety and documentation
  • Complete artifact traceability for ML operations

The implementation follows clean architecture principles with separation of concerns and comprehensive observability for production ML systems.

