This documentation describes the full model training architecture implemented in the models/ package of the AI MLOps Platform.
The module supports two fully independent training pipelines depending on the environment:
- Development Mode (sklearn) → Local training using Scikit-learn
- Production Mode (spark) → Distributed training using Spark MLlib
| Component | Description |
|---|---|
| Algorithm | Random Forest Classifier |
| Data Preprocessing | StandardScaler normalization + SMOTETomek oversampling |
| Hyperparameter Tuning | GridSearchCV (configurable via YAML) |
| Calibration | Isotonic Calibration (CalibratedClassifierCV) |
| Threshold Selection | Automatic optimal threshold via Precision-Recall curve |
| Evaluation Metrics | Accuracy, Precision, Recall, F1, ROC AUC, Rolling Window F1 |
| Rolling Metrics | 7-day sliding evaluation window for time-series stability |
| Experiment Tracking | MLflow (local filesystem backend) |
| Artifacts Exported | Model signature, confusion matrix, classification report, curves |
| Feature Engineering | Includes rolling 7-day windows and user segmentation |
Raw Features --> StandardScaler --> SMOTETomek --> RandomForest --> Calibration --> Threshold Optimization
| Metric | Value |
|---|---|
| Accuracy | 71.84% |
| Precision | 43.96% |
| Recall | 67.99% |
| F1-score | 53.39% |
| ROC AUC | 77.33% |
| Best Threshold | 0.215 |
✅ These results reflect the latest improvements:
- Porting of advanced behavioral features from Spark
- Full GridSearchCV optimization
- Isotonic model calibration
- Thresholding via F1 maximization
- MLflow artifact tracking (no residual local files)
- Path:
configs/dev/model.yaml - Fully drives hyperparameter search, calibration, feature selection, and MLflow logging.
- Confusion Matrix (
confusion_matrix/confusion_matrix.png) - ROC Curve (
roc_curve/roc_curve.png) - Precision-Recall Curve (
precision_recall_curve/precision_recall_curve.png) - Classification Report (
classification_report/classification_report.parquet) - Rolling Window Metrics (
rolling_metrics/rolling_metrics.parquet) - Full Scikit-learn model (
model/with signature & input example)
| Component | Description |
|---|---|
| Algorithm | Gradient Boosted Trees (GBTClassifier) |
| Pipeline | VectorAssembler + GBTClassifier (Spark ML Pipeline) |
| Imbalance Handling | Class Weight Compensation (calculated dynamically) |
| Hyperparameter Tuning | Spark CrossValidator (ParamGridBuilder) |
| Distributed Training | Fully parallelized Spark execution |
| Evaluation Metrics | Accuracy, F1, Precision, Recall, ROC AUC |
| Experiment Tracking | MLflow (local or remote tracking backend) |
| Model Registration | Full MLflow Model Registry integration |
| Feature Engineering | Rolling 7-day metrics + segmentation (same as dev) |
Raw Features --> VectorAssembler --> GBTClassifier --> CrossValidation --> Model Registration
- Path:
configs/prod/model.yaml - Fully configurable hyperparameter space and Spark resources.
- Confusion Matrix (seaborn heatmap)
- Full Spark MLlib model serialization
- Model Registry version control logic
- Metrics: precision_adj, recall_adj, f1_adj, roc_auc, f1_score, accuracy
| Metric | Value |
|---|---|
| Accuracy | 91.25% |
| Precision | 63.87% |
| Recall | 100.0% |
| F1-score adj | 77.95% |
| ROC AUC | 98.17% |
The train.py script dynamically dispatches to the appropriate training pipeline:
python train.py --env dev
python train.py --env prod --mlflow-ui- If
dev→ Executestrain_sklearn.py - If
prod→ Executestrain_spark.py
Central configuration is always loaded dynamically from environment YAML.
models/
├── train_sklearn.py # Scikit-learn training pipeline (with calibration, tuning, rolling metrics)
├── train_spark.py # Distributed Spark MLlib training pipeline
└── train.py # Unified training dispatcher- ✅ Complete pipeline calibration (isotonic method)
- ✅ Dynamic threshold optimization (F1 maximization)
- ✅ SMOTETomek balancing for class imbalance
- ✅ Full GridSearch hyperparameter tuning (sklearn & Spark)
- ✅ MLflow artifact tracking and conditional model registry
- ✅ Daily rolling window evaluation for time-based stability
- ✅ Fully reproducible YAML-driven configurations
- ✅ Unified feature engineering across Spark and Scikit-learn (rolling stats & segmentation)
Documentation updated: June 2025 — Synchronized Dev & Prod Pipelines ✅