Purpose: Measures shift in the distribution of a continuous variable (like model confidence) between a baseline (training or reference) and current population. Used for data drift / model input shift monitoring.
Formula:

$$PSI = \sum_{i=1}^{n} (Q_i - P_i) \ln\left(\frac{Q_i}{P_i}\right)$$

Where:
- $n$ = number of bins
- $P_i$ = % of baseline (expected) data in bin $i$
- $Q_i$ = % of current (actual) data in bin $i$
Calculation Steps:
- Divide baseline data into $n$ bins (quantiles or fixed width)
- Count the % of baseline data in each bin → $P_i$
- Count the % of current/live data in each bin → $Q_i$
- Apply the formula above
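The steps above can be sketched in a few lines of NumPy. This is a minimal illustration using quantile bins from the baseline and a small epsilon to guard empty bins; the `psi` helper name and the epsilon value are choices for this sketch, not a standard API.

```python
import numpy as np

def psi(baseline, current, n_bins=10):
    """Population Stability Index between two 1-D samples.

    Bin edges come from baseline quantiles; the outer edges are
    widened to +/-inf so out-of-range live values are still counted.
    """
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    p = np.histogram(baseline, bins=edges)[0] / len(baseline)  # P_i
    q = np.histogram(current, bins=edges)[0] / len(current)    # Q_i

    eps = 1e-6  # avoids log(0) / division by zero for empty bins
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
psi(baseline, rng.normal(0, 1, 10_000))  # near 0: no shift
psi(baseline, rng.normal(1, 1, 10_000))  # well above 0.25: clear shift
```

Quantile bins keep every baseline bin equally populated, which makes the score less sensitive to arbitrary bin placement than fixed-width bins.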
Interpretation / Thresholds:
- PSI < 0.1 → No significant change
- 0.1 < PSI < 0.25 → Moderate shift
- PSI > 0.25 → Significant shift (investigate data/model)
Use Cases:
- Monitor model confidence distributions (module/date softmax)
- Detect population drift in features or outputs
Purpose: Similar to PSI, but calculated per feature, i.e., measures input feature distribution drift.
Formula: Same as PSI but applied to feature distributions.
Calculation Steps:
- For each feature, split baseline into n bins
- Compute baseline and live % per bin
- Apply formula
Thresholds & Interpretation:
- CSI < 0.1 → Feature stable
- 0.1–0.25 → Moderate drift
- > 0.25 → Strong drift (may require retraining)
Use Cases:
- Track input features in tabular models or embedding dimensions
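Since CSI is simply PSI applied feature-by-feature, a per-column report is a short loop. The sketch below assumes pandas DataFrames with matching columns and re-declares a compact PSI helper so it stands alone; `csi_report` is an illustrative name, not a library function.

```python
import numpy as np
import pandas as pd

def psi(expected, actual, n_bins=10):
    # PSI with quantile bins from the baseline; epsilon guards empty bins
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

def csi_report(baseline_df, live_df, n_bins=10):
    """CSI = PSI per feature; returns one drift score per column."""
    return pd.Series({col: psi(baseline_df[col].to_numpy(),
                               live_df[col].to_numpy(), n_bins)
                      for col in baseline_df.columns})

rng = np.random.default_rng(1)
base = pd.DataFrame({"age": rng.normal(40, 10, 5_000),
                     "income": rng.normal(50, 5, 5_000)})
live = pd.DataFrame({"age": rng.normal(40, 10, 5_000),     # stable
                     "income": rng.normal(60, 5, 5_000)})  # 2-sigma shift
report = csi_report(base, live)
```

A per-feature Series like this slots directly into the threshold table above: flag any column whose score crosses 0.25.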
Purpose: Detects when the relationship between inputs and outputs changes. Unlike PSI/CSI which track input distribution, concept drift checks if model predictions become unreliable.
Common Methods:
- Compare predicted probability distribution over time vs training
- Use PSI formula on model predictions (softmax probabilities)
- Example: Confidence PSI for module/date predictions
- Track metrics like accuracy, F1-score, AUC over a sliding window of live data
- If metric drops significantly → drift detected
Formula:
For confidence / probability drift: same as PSI applied to predicted class probabilities.
For performance: simple moving-average difference:

$$\Delta = \text{Metric}_{baseline} - \frac{1}{w}\sum_{t=1}^{w} \text{Metric}_{live,\,t}$$

where $w$ is the sliding-window size.
Thresholds:
- Accuracy drop > 5–10% may indicate drift
- PSI of predictions > 0.1–0.25 → moderate to severe drift
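The sliding-window performance check described above can be sketched as a small monitor class. `AccuracyDriftMonitor` and its default window and tolerance are illustrative choices for this sketch, not a library API; it assumes ground-truth labels eventually arrive for live predictions.

```python
from collections import deque
import numpy as np

class AccuracyDriftMonitor:
    """Flags concept drift when accuracy over a sliding window of live,
    labelled examples falls more than `tolerance` below the baseline."""

    def __init__(self, baseline_accuracy, window=500, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.window = deque(maxlen=window)
        self.tolerance = tolerance

    def update(self, y_true, y_pred):
        """Record one labelled prediction; returns True once drift is seen."""
        self.window.append(int(y_true == y_pred))
        return self.drifted()

    def drifted(self):
        if len(self.window) < self.window.maxlen:
            return False  # not enough live labels yet
        return float(np.mean(self.window)) < self.baseline - self.tolerance

# e.g. a model that scored 90% accuracy offline, checked over 100 live labels
monitor = AccuracyDriftMonitor(baseline_accuracy=0.90, window=100, tolerance=0.05)
```

Waiting for a full window before alerting avoids noisy triggers from the first few live labels.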
Purpose: Measures change in feature embeddings (often from neural networks) between baseline and live data.
Method: Cosine similarity is commonly used.
Formula:

$$\text{Drift} = 1 - \frac{\vec{e}_{live} \cdot \vec{e}_{baseline}}{\lVert \vec{e}_{live} \rVert \, \lVert \vec{e}_{baseline} \rVert}$$
Steps:
- Compute the mean embedding of the training set → $\vec{e}_{baseline}$
- For each live query, compute its embedding → $\vec{e}_{live}$
- Calculate cosine similarity and subtract it from 1 to get the drift score
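These steps map directly to a few lines of NumPy; `cosine_drift` is an illustrative name and the vectors below are toy 3-D embeddings.

```python
import numpy as np

def cosine_drift(e_live, e_baseline):
    """Drift score = 1 - cosine similarity of two embedding vectors."""
    cos = np.dot(e_live, e_baseline) / (
        np.linalg.norm(e_live) * np.linalg.norm(e_baseline))
    return float(1.0 - cos)

e_baseline = np.array([1.0, 0.0, 0.0])  # mean training embedding (toy)
cosine_drift(np.array([0.99, 0.05, 0.0]), e_baseline)  # near 0: stable
cosine_drift(np.array([0.0, 1.0, 0.0]), e_baseline)    # 1.0: orthogonal, severe
```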
Interpretation:
- Drift < 0.1 → embeddings stable
- 0.1–0.3 → moderate change
- > 0.3 → significant drift (investigate input or model)
Use Cases:
- NLP embeddings for BERT/TinyBERT outputs
- Image embeddings in CV pipelines
Purpose: Measures uncertainty of model predictions.
Formula:

$$H = -\sum_{i=1}^{C} p_i \log p_i$$

Where $p_i$ = predicted probability for class $i$, and $C$ = number of classes.
Interpretation:
- Low entropy → confident predictions
- High entropy → uncertain predictions → may trigger human review or alert
Thresholds:
- Relative to training distribution; monitor moving average
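As a concrete sketch, the entropy of a single softmax vector in NumPy (natural log; `prediction_entropy` is an illustrative name, and the epsilon clip only guards against zero probabilities):

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of one softmax probability vector (natural log)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-np.sum(probs * np.log(probs)))

prediction_entropy([0.97, 0.01, 0.01, 0.01])  # low: confident prediction
prediction_entropy([0.25, 0.25, 0.25, 0.25])  # maximal for 4 classes: ln(4)
```

Because the maximum depends on the class count ($\ln C$), compare entropy against the training distribution rather than a fixed universal cutoff.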
Purpose: Tracks how often the model outputs low-confidence predictions.
Method:
- Count predictions where max(softmax_probs) < threshold
- Threshold is usually 0.5 (or domain-specific)
Interpretation:
- Sudden rise → model unsure on current population → potential concept drift
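A minimal sketch of the count over a batch, assuming softmax outputs arrive as a 2-D array of shape (samples, classes); `low_confidence_rate` is an illustrative name.

```python
import numpy as np

def low_confidence_rate(softmax_probs, threshold=0.5):
    """Fraction of rows whose top-class probability is below `threshold`."""
    max_probs = np.max(np.asarray(softmax_probs), axis=1)
    return float(np.mean(max_probs < threshold))

batch = np.array([
    [0.90, 0.05, 0.05],   # confident
    [0.40, 0.35, 0.25],   # unsure
    [0.34, 0.33, 0.33],   # unsure
])
low_confidence_rate(batch)  # 2 of 3 rows fall below the 0.5 threshold
```

Tracking this rate as a time series (rather than per-request) is what exposes the "sudden rise" pattern described above.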
This guide covers all commonly used MLOps monitoring metrics for tabular, NLP, CV, and classical ML applications, including formulas, interpretations, and threshold guidance.
These metrics monitor changes in the input feature distributions between training and live data.
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Population Stability Index (PSI) | $\sum_{i=1}^{n} (Q_i - P_i) \ln(Q_i / P_i)$ | Continuous/categorical feature drift | <0.1 stable, 0.1–0.25 moderate, >0.25 severe drift |
| Characteristic Stability Index (CSI) | PSI per feature | Categorical/continuous input features | Same as PSI |
| Kolmogorov–Smirnov (KS) Test | Max distance between empirical CDFs | Continuous feature drift | p-value < 0.05 → significant drift |
| Jensen–Shannon Divergence (JSD) | Symmetrized, smoothed KL divergence | Distribution comparison | 0–1, higher → more divergence |
| Wasserstein Distance (Earth Mover's Distance) | Measures "cost" to move baseline distribution to live | Continuous features | Domain-specific thresholds |
✅ Use in all applications: tabular, NLP embeddings, CV features (pixel distributions, extracted features).
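Two of the tests in the table above ship with SciPy: the two-sample KS test and the Wasserstein (earth mover's) distance. A minimal sketch, assuming SciPy is installed and using the table's p < 0.05 threshold as the drift flag:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 5_000)
live = rng.normal(0.5, 1, 5_000)   # live data with a mean shift

ks_stat, p_value = ks_2samp(baseline, live)   # two-sample KS test
emd = wasserstein_distance(baseline, live)    # earth mover's distance

ks_drift = p_value < 0.05   # significant drift per the table threshold
```

Note that with large samples the KS test flags even tiny shifts as significant, so pairing it with an effect-size measure like the Wasserstein distance is common practice.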
These metrics measure changes in the relationship between inputs and outputs.
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Prediction Distribution Drift (Confidence PSI) | PSI on softmax probabilities per class | Classification models | <0.1 stable, >0.25 alert |
| Accuracy / F1 / AUC Moving Average | Compare baseline metric vs live metric over window | All supervised tasks | Drop > 5–10% → drift |
| KL Divergence of Predictions | $\sum_i P_i \log(P_i / Q_i)$ on prediction distributions | Classification output drift | Higher → model behavior changed |
| Prediction Entropy | $-\sum_i p_i \log p_i$ | Uncertainty monitoring | High → model unsure |
| Low Confidence Count | Count(max_prob < threshold) | Classification confidence | Rising count → alert |
For deep learning embeddings (text, images, tabular):
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Cosine Similarity Drift | $1 - \frac{\vec{e}_{live} \cdot \vec{e}_{baseline}}{\lVert \vec{e}_{live} \rVert \, \lVert \vec{e}_{baseline} \rVert}$ | NLP embeddings, CV embeddings | <0.1 stable, 0.1–0.3 moderate, >0.3 severe |
| Mahalanobis Distance | $\sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)}$ vs. baseline embedding distribution | Detect OOD in embeddings | > threshold → anomaly |
| Euclidean / L2 Distance | $\lVert \vec{e}_{live} - \vec{e}_{baseline} \rVert_2$ | Embedding drift | Domain-specific |
✅ Use Cases:
- NLP: BERT/TinyBERT CLS token embeddings
- CV: CNN feature maps or bottleneck layers
- Tabular: Autoencoder latent representations
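Mahalanobis distance, which appears in both the embedding-drift and OOD tables, needs only the baseline mean and covariance, so it can be sketched with NumPy alone. The ridge term and the `mahalanobis_scores` name are choices for this illustration.

```python
import numpy as np

def mahalanobis_scores(live_emb, baseline_emb):
    """Mahalanobis distance of each live embedding from the baseline
    embedding distribution (its mean and covariance)."""
    mu = baseline_emb.mean(axis=0)
    cov = np.cov(baseline_emb, rowvar=False)
    # small ridge term keeps the covariance matrix invertible
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    diff = live_emb - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(7)
baseline = rng.normal(0, 1, (2_000, 3))        # toy 3-D embeddings
queries = np.array([[0.0, 0.0, 0.0],           # in-distribution
                    [10.0, 10.0, 10.0]])       # obvious outlier
scores = mahalanobis_scores(queries, baseline)
```

Unlike plain L2 distance, this accounts for correlations between embedding dimensions, which is why it is favoured for OOD detection in feature space.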
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Maximum Softmax Probability (MSP) | $\max_i \mathrm{softmax}(z)_i$ | Detect OOD samples | Low → anomalous |
| ODIN Score | Temperature-scaled softmax with small input perturbation | NLP, CV | Low score → OOD |
| Mahalanobis Distance in Feature Space | As above | Detect OOD embeddings | High → OOD |
| Autoencoder Reconstruction Error | $\lVert x - \hat{x} \rVert^2$ | Anomaly detection | High error → OOD |
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Mean Squared Error (MSE) Drift | Compare live MSE with baseline | Regression models | Significant increase → drift |
| Mean Absolute Error (MAE) Drift | Compare live MAE with baseline | Regression models | Significant increase → drift |
| Residual Distribution Drift (KS Test / PSI) | Apply KS/PSI to residuals | Regression | Residual distribution shifts → model misalignment |
| Prediction Interval Coverage | % of true values inside predicted interval | Uncertainty monitoring | Drop → model underestimates uncertainty |
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| SHAP Distribution Drift | PSI / KS test on SHAP values per feature | Detect change in model reasoning | High drift → retrain or investigate |
| Permutation Importance Drift | Compare feature importance baseline vs live | Tabular / CV | Drift → model relying on different features |
| Metric | Formula / Method | Use Case | Threshold / Meaning |
|---|---|---|---|
| Inference Latency | Wall-clock time per request | All models | SLA violation → alert |
| Throughput / Requests per Second | Count / time | System performance | Low → bottleneck |
| Error Rate | Failed inference count / total | Reliability | High → alert |
| Metric | Domain | Formula / Method | Interpretation |
|---|---|---|---|
| PSI | Tabular/NLP/CV | $\sum_i (Q_i - P_i) \ln(Q_i / P_i)$ | Data / prediction distribution shift |
| CSI | Tabular | PSI per feature | Feature drift |
| KL / JSD | Tabular/NLP/CV | Divergence between baseline and live distributions | Distribution change |
| Cosine Drift | NLP/CV | 1 - cos_sim | Embedding drift |
| Mahalanobis / L2 | NLP/CV/Tabular | Distance in feature space | OOD detection |
| Entropy | Classification | $-\sum_i p_i \log p_i$ | Prediction uncertainty |
| Low Confidence | Classification | Count(max_prob < threshold) | Confidence monitoring |
| Residual / Error Drift | Regression | MSE/MAE comparison | Model performance drift |
| SHAP / Feature Drift | Tabular/NLP/CV | PSI / KS on importance | Explainability drift |
| Latency | All | Time per inference | SLA monitoring |
- PSI/CSI → distribution shift
- Embedding / Representation Drift → changes in feature space
- Prediction-based Drift → model concept drift
- Entropy / Confidence → uncertainty monitoring
- Residual / Error Drift → regression model performance
- SHAP / Feature Drift → explainability shift
- System Metrics → operational health