reidsendroff/public-interpretable-tabular-classification

Interpretable Tabular Classification

Binary classification on an anonymized high-dimensional dataset using XGBoost, hybrid categorical encoding, Bayesian hyperparameter optimization, and SHAP-based model interpretability.

Python · XGBoost · scikit-learn · SHAP · Optuna


Overview

This project is a Kaggle-style binary classification competition from Harvard Extension School's CSCI E-82 Advanced Machine Learning course (Fall 2025). The dataset — 4,584 training rows and 1,732 test rows across 669 fully anonymized features — was designed with a deliberate difficulty: the most predictive feature contained category values in the test set that were completely absent from training, affecting 54.3% of all test rows. No domain context was provided.

The project is structured as a six-experiment lab notebook. Each experiment follows a hypothesis-to-result cycle: a modeling idea is stated, implemented, evaluated against a held-out leaderboard, and interpreted. This disciplined structure separates the work from a typical homework submission — experiments that underperformed (ElasticNet, XGBoost without encoding) are fully documented alongside successful ones, showing how the final design decision was earned rather than guessed.

The final model combines a hand-engineered hybrid categorical encoding scheme, gradient boosting (XGBoost), and Bayesian hyperparameter optimization with Optuna, reaching a Kaggle AUC of 0.731 — a 23.5% improvement over the naive baseline. SHAP (SHapley Additive exPlanations) values are used to explain both the global feature importance and the directional effect of individual features on predictions, making the model's behavior interpretable despite the fully anonymized feature space.


What This Project Demonstrates

  • Diagnosing and solving train/test categorical distribution shift via custom encoding strategies
  • Systematic multi-model experimentation with documented hypothesis, result, and interpretation for each approach
  • Bayesian hyperparameter optimization (Optuna) with proper stratified cross-validation
  • SHAP-based model interpretability applied to a high-dimensional, anonymized feature space
  • Boruta-style probe feature selection for unsupervised dimensionality reduction
  • PCA for exploratory data analysis and distribution-shift visualization
  • Handling class imbalance (84.5% majority class) in both RF and XGBoost settings

Methods Used

| Method | Type | Role in Project |
|---|---|---|
| Random Forest | Ensemble / bagging | Baseline classifier; initial feature importance analysis |
| PCA | Dimensionality reduction | EDA; visualizing train/test subject distribution shift |
| Probe feature selection (Boruta-style) | Feature selection | Unsupervised elimination of noise features |
| ElasticNet | Regularized linear model | Experiment 3; linear baseline on reduced feature set |
| Target encoding (smoothed) | Feature engineering | Encode subject with Bayesian shrinkage toward global mean |
| Frequency encoding | Feature engineering | Encode subject by training-set count; maps unseen to 0 |
| Hybrid encoding | Feature engineering | Concatenate target and frequency encodings for subject |
| XGBoost | Gradient boosting | Final model architecture |
| Bayesian optimization (Optuna) | Hyperparameter search | 60-trial TPE search over 9 XGBoost parameters |
| SHAP values | Model interpretability | Global + directional feature importance for final model |

Datasets / Inputs

Source: Harvard Extension School CSCI E-82 course competition (Fall 2025). Proprietary; domain intentionally hidden.

| Split | Rows | Features | Target |
|---|---|---|---|
| Train | 4,584 | 669 | output (binary: 0/1) |
| Test | 1,732 | 669 | Hidden |

Feature structure: 669 anonymized columns — x-series (binary/numeric), y-series (continuous), z-series (continuous, 221 columns), plus categorical columns subject, state, and phase.

Class imbalance: 84.5% positive class (3,873 of 4,584 training rows).

Key distribution shift: subject has 11 levels in training (A–D, F–I, K–M) and 13 in test, which adds E and J. 941 of 1,732 test rows (54.3%) carry an unseen subject category.
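A shift like this can be caught before any modeling with a short pandas check. The sketch below uses toy stand-in frames (the real check would run on train_data.csv and test_data.csv, whose schema is assumed here):

```python
import pandas as pd

# Toy stand-in frames; the real check runs on the course's train/test CSVs.
train = pd.DataFrame({"subject": list("AABBCCDD")})
test = pd.DataFrame({"subject": list("AAEEEEJB")})

seen = set(train["subject"])                      # categories the model can learn
unseen_mask = ~test["subject"].isin(seen)         # test rows with novel categories
unseen_levels = sorted(set(test["subject"]) - seen)

print(unseen_levels)                              # which categories are new in test
print(f"{unseen_mask.mean():.1%} of test rows carry an unseen subject")
```

Running the same three lines against the real data is what surfaces the 54.3% figure quoted above.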

See data/DATA_SOURCES.md for schema, download instructions, and verification commands.


Key Technical Steps

Raw data (4584 train × 669 features)
        |
        v
[1] Feature importance (RF baseline)
        |— Identifies `subject` as top feature and distribution-shift trap
        v
[2] PCA / EDA
        |— Quantifies 73.32% variance in 2 components
        |— Subject E projects perpendicular to all training clusters
        v
[3] Probe feature selection + ElasticNet
        |— 184 features eliminated; linear model insufficient
        v
[4] Hybrid encoding design
        |— Target encoding: smoothed mean per category
        |— Frequency encoding: count-based; unseen → 0
        |— Concatenate both into feature matrix
        v
[5] RF with hybrid encoding + tuning
        |— RandomizedSearchCV; best CV AUC: 0.855; Kaggle: 0.718
        v
[6] XGBoost + hybrid encoding + Bayesian search (60 trials)
        |— Optuna TPE sampler; 9 hyperparameters; 5-fold stratified CV
        |— Best CV AUC: 0.859; Kaggle AUC: 0.731
        v
[7] SHAP interpretability
        |— Global bar chart: subject_target_enc dominates
        |— Beeswarm: high encoded subject values → strong class-1 push

Hybrid Encoding Formula

Target encoding uses Bayesian smoothing (also called additive smoothing or empirical Bayes shrinkage) to avoid leaking raw class means for low-count categories:

encoded(c) = (n_c × mean_c + smoothing × global_mean) / (n_c + smoothing)

where n_c is the count of category c in training, mean_c is the per-category positive rate, global_mean = 0.8449, and smoothing = 100 (tuned via grid search).

Unseen categories (E, J in test) receive the global mean under target encoding and frequency 0 under frequency encoding, providing two independent soft signals instead of a missing or arbitrary value.
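The formula and the unseen-category fallbacks above can be sketched as a single helper. The function name `hybrid_encode` and the toy series are illustrative, not the project's actual code:

```python
import pandas as pd

def hybrid_encode(train_col, target, test_col, smoothing=100.0):
    """Smoothed target encoding plus frequency encoding for one categorical column.

    Unseen test categories fall back to the global mean (target branch)
    and to 0 (frequency branch), mirroring the formula above.
    """
    global_mean = target.mean()
    stats = target.groupby(train_col).agg(["mean", "count"])
    # encoded(c) = (n_c * mean_c + smoothing * global_mean) / (n_c + smoothing)
    enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    freq = train_col.value_counts()
    return pd.DataFrame({
        "target_enc": test_col.map(enc).fillna(global_mean),
        "freq_enc": test_col.map(freq).fillna(0).astype(int),
    })

# Illustrative usage: category "E" never appears in training.
train_subj = pd.Series(["A", "A", "A", "B", "B"])
y = pd.Series([1, 1, 0, 1, 0])                    # global mean = 0.6
out = hybrid_encode(train_subj, y, pd.Series(["A", "E"]))
print(out)
```

For category "A" (n=3, mean=2/3) the smoothed value is (3 × 2/3 + 100 × 0.6) / 103, which sits very close to the global mean — exactly the heavy shrinkage a smoothing of 100 is meant to produce.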

Bayesian Optimization Search Space (Optuna)

| Parameter | Range |
|---|---|
| n_estimators | 50–1000 |
| max_depth | 3–15 |
| learning_rate | 0.001–0.3 (log-uniform) |
| subsample | 0.4–1.0 |
| colsample_bytree | 0.4–1.0 |
| min_child_weight | 1–10 |
| gamma | 0–5 |
| reg_alpha | 0.01–10 (log-uniform) |
| reg_lambda | 0.01–10 (log-uniform) |

Results and Interpretation

Performance Progression

| # | Method | CV AUC | Kaggle AUC |
|---|---|---|---|
| 1 | RF baseline (label encoding) | — | 0.592 |
| 2 | RF (subject dropped entirely) | 0.729 | — |
| 3 | ElasticNet + probe feature selection | 0.698 | 0.549 |
| 4a | RF + hybrid encoding (untuned) | 0.810 | — |
| 4b | RF + hybrid encoding (tuned, smoothing=100) | 0.855 | 0.718 |
| 5 | XGBoost baseline (no encoding) | 0.761 | 0.596 |
| 6 | XGBoost + hybrid encoding + Bayesian (final) | 0.859 | 0.731 |

Total improvement over naive baseline: +23.5% (Kaggle AUC 0.592 → 0.731)

Key Findings

  • The encoding was the decisive intervention. Moving from label encoding to hybrid encoding improved CV AUC by +11.1% (0.729 → 0.810) holding the model architecture constant.
  • Model choice mattered, but less. Switching from RF to XGBoost on the same encoded features improved CV AUC by +5.3% (0.810 → 0.853, XGBoost with hybrid encoding at default hyperparameters).
  • Bayesian optimization provided a smaller but meaningful gain. 60 Optuna trials added +0.006 AUC (0.853 → 0.859 CV AUC), confirming that the data-engineering gains dominated model tuning.
  • ElasticNet underperformed the RF baseline despite probe-based feature selection, confirming the dataset is not linearly separable.
  • SHAP confirmed the encoding's behavior: subject_target_enc has the highest mean absolute SHAP value; high encoded values (high positive-rate subjects) consistently push predictions toward class 1.

Example Visualizations

All plot files are at the repository root. See images/README.md for the full catalog with generation instructions.

| File | Contents |
|---|---|
| feature_importance.png | RF baseline Gini feature importance — subject dominates |
| shap_bar_plot.png | SHAP global feature importance (mean \|SHAP\|) for final XGBoost |
| shap_summary_plot.png | SHAP beeswarm showing directional feature impact per sample |
| optuna_optimization.png | Bayesian search convergence over 60 trials |
| xgboost_comparison.png | Baseline vs. optimized XGBoost CV AUC comparison |
| xgboost_feature_importance.png | XGBoost gain-based feature importance |

Repository Structure

interpretable-tabular-classification/
|
├── RF Attempt.ipynb                        # Experiment 1: RF baseline
├── PCA.ipynb                               # Experiment 2: EDA and PCA
├── ProbeFeatures.ipynb                     # Experiment 3: ElasticNet + probe selection
├── encoding+RF+Tuning.ipynb                # Experiment 4: Hybrid encoding + RF tuning
├── xgboosthw3.ipynb                        # Experiment 5: XGBoost baseline
├── xgboost_hybrid_bayesian.ipynb           # Experiment 6: Final model (XGBoost + Bayesian)
├── SingleFileSubmission.ipynb              # Consolidated end-to-end pipeline
|
├── feature_importance.png                  # RF Gini feature importance
├── shap_bar_plot.png                       # SHAP global importance
├── shap_summary_plot.png                   # SHAP beeswarm
├── optuna_optimization.png                 # Optuna convergence history
├── xgboost_comparison.png                  # Baseline vs. tuned XGBoost
├── xgboost_feature_importance.png          # XGBoost gain importance
├── image.png, image-1.png ... image-8.png  # Experiment-specific plots
|
├── data/
│   ├── sample_solution.csv                 # Submission format reference
│   ├── train_data.csv                      # Training data (obtain from course)
│   ├── test_data.csv                       # Test data (obtain from course)
│   └── DATA_SOURCES.md                     # Data provenance and setup
|
├── notebooks/
│   └── README.md                           # Per-notebook guide
├── images/
│   └── README.md                           # Visualization catalog
|
├── requirements.txt                        # Python dependencies with min versions
├── pyproject.toml                          # uv project config
├── .gitignore
├── PROJECT_SUMMARY.md                      # Resume bullets, interview prep, tech summary
└── README.md                               # This file

How to Run

Prerequisites

  • Python 3.12+
  • train_data.csv and test_data.csv from the CSCI E-82 course portal (see data/DATA_SOURCES.md)

Installation

git clone https://github.com/reidsendroff/interpretable-tabular-classification.git
cd interpretable-tabular-classification
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Place data files:

mkdir -p data
# Copy train_data.csv and test_data.csv into data/
# Note: notebook cells reference HW3Data/ paths — update them to data/ after cloning

Run

jupyter lab

Open notebooks in order (RF Attempt → PCA → ProbeFeatures → encoding+RF+Tuning → xgboosthw3 → xgboost_hybrid_bayesian) or jump directly to xgboost_hybrid_bayesian.ipynb for the final model.

To reproduce the final Kaggle submission end-to-end, use SingleFileSubmission.ipynb.


Skills Demonstrated

Mathematical / Statistical

  • Bayesian shrinkage estimation (smoothed target encoding)
  • Principal component analysis and variance decomposition
  • SHAP (SHapley Additive exPlanations) values for cooperative game-theoretic feature attribution
  • Stratified k-fold cross-validation for AUC estimation
  • ElasticNet regularization (L1 + L2 penalty)
  • Class imbalance correction (scale_pos_weight, class_weight='balanced')
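The class-imbalance correction in the last bullet comes down to one ratio. A minimal sketch using the label counts stated earlier in this README (3,873 positives of 4,584 training rows):

```python
import numpy as np

# Label counts from the README: 3,873 positives of 4,584 training rows.
y = np.array([1] * 3873 + [0] * 711)

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())

# XGBoost convention: scale_pos_weight = (# negatives) / (# positives).
# With an 84.5% positive class this is < 1, down-weighting the majority class.
scale_pos_weight = n_neg / n_pos
print(f"scale_pos_weight = {scale_pos_weight:.4f}")
```

The same ratio is what `class_weight='balanced'` approximates for the scikit-learn models.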

Modeling

  • Gradient boosted trees (XGBoost)
  • Random forest ensembles
  • Linear classifiers (ElasticNet / Logistic regression)
  • Boruta-style probe feature selection

Programming & Tools

  • Python 3.12, pandas, numpy, scikit-learn, XGBoost, SHAP, Optuna
  • Jupyter Lab notebooks
  • Matplotlib, Seaborn for visualization

Workflow

  • Hypothesis-driven experiment design
  • Controlled ablation (encoding vs. no encoding; RF vs. XGBoost)
  • Kaggle-format submission pipeline
  • Git version control

Project Context

Harvard Extension School — CSCI E-82: Advanced Machine Learning (Fall 2025).

Class-only Kaggle competition. Dataset domain intentionally withheld. Evaluation metric: AUC-ROC on a private leaderboard split. Authors: Reid Sendroff and Joshua Harvey.


Why This Matters

The train/test categorical distribution mismatch encountered here is not a contrived academic problem — it is one of the most common failure modes in deployed ML systems. New customer segments, new geographic markets, new product SKUs, or new device types routinely appear in production data without training-set representation. The hybrid encoding approach developed here — combining Bayesian shrinkage toward a global prior with frequency-based signal — is a principled, production-ready solution to this class of problem. SHAP explainability further ensures that model behavior can be audited and communicated, a requirement in any regulated or high-stakes deployment context.


Future Improvements

  • Apply the hybrid encoding to the state feature (currently label-encoded; may also suffer from distribution shift)
  • Add LIME for individual prediction explanations alongside SHAP
  • Evaluate a stacking ensemble (RF + XGBoost + ElasticNet) with a meta-learner
  • Refactor encoding logic into a reusable sklearn transformer (BaseEstimator + TransformerMixin) for pipeline compatibility
  • Replace Google Drive dependency in xgboosthw3.ipynb with local path for reproducibility
  • Add MLflow or a results CSV for cross-notebook experiment tracking
  • Investigate whether state feature has unseen categories in test (not yet analyzed)

Author

Reid Sendroff and Joshua Harvey. Harvard Extension School, CSCI E-82 Advanced Machine Learning, Fall 2025.


Six iterative experiments, one distribution-shift problem, and a hybrid encoding trick that moved the needle more than any model choice — because the data engineering was the hard part.



Appendix: Lab Notebook

The original experiment log is preserved below. All image references link to plot files at the repository root.


HW3

Names: Joshua Harvey and Reid Sendroff

"Lab" Notebook

Experiment 1: Simply fit a Random Forest

File: RF Attempt

The goal of this experiment is simply to see how a "basic" random forest does on the classification problem. Since random forests are fairly resistant to overfitting, this gives us a quick guide to what to expect going forward.

Additionally, we'll be able to use the feature importance plot to start getting a handle on the dataset, since there's a ton of columns.

Performance: 0.59246 AUC

Learnings:

[figure: RF feature importance plot (feature_importance.png)]

Feature importance (top 4):

| feature | importance |
|---|---|
| subject | 0.042882 |
| z205 | 0.013999 |
| z206 | 0.008880 |
| phase | 0.008546 |
| ... | ... |

Subject is very important, but some of its test-set levels are absent from the training set. It's a trap!

As a whole, we beat the "benchmark", but clearly overfit - training AUC of 0.93 vs. a test score of 0.59. Also, we need to figure out the subject issue. That should come next.

Experiment 2: EDA and PCA

File: PCA

The goal of this experiment is to start learning a bit more about the data, now that we have a few insights from the random forest to work with. Specifically, we want to understand how the subject distribution, being the most important variable, changes between train and test sets.

Finally, we want to also learn how the projection of the unseen subjects aligns with known subjects - we'll somehow need to do imputation of unseen classes.

Performance: N/A

Learnings:

[figure: PCA projection of training rows, colored by subject]

There's the problem in a nutshell. PCA captures a reasonable amount (73%) of the variance. However, one of the unseen classes (E) projects perpendicular to the rest of them. We don't have information on that class, and it's also a large portion of the test set.

[figure: PCA projection, train vs. test]

This is the same picture, but for train and test sets.

Finally, the same picture, colored by output: [figure: PCA projection colored by the output label]

So, we've figured out that imputation is going to be critically important to improving the model performance, and that we have that unseen class.
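The PCA projection described above can be reproduced with a few lines of scikit-learn. This sketch uses random stand-in matrices for the numeric columns; the key detail is fitting the scaler and PCA on train only, then projecting test into the same space so the unseen-subject cluster becomes visible:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-ins for the numeric train/test columns.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
X_test = rng.normal(size=(80, 20))

scaler = StandardScaler().fit(X_train)               # fit on train only
pca = PCA(n_components=2).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))   # 2-D coordinates for plotting
Z_test = pca.transform(scaler.transform(X_test))
print(f"explained variance: {pca.explained_variance_ratio_.sum():.2%}")
```

Scatter-plotting Z_train colored by subject and overlaying Z_test is what exposes class E sitting apart from every training cluster.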

Experiment 3: Elastic Net with Probe Features (aka Boruta)

File: Probe Features

The goal here is to use probe features (columns of injected noise) to do unsupervised feature selection. Once we have a smaller feature set, we fit an elastic net on it and see how it performs.

We don't expect it to outperform the baseline RF, but it should not do much worse.

Performance: 0.54912 AUC

We were wrong. This model does not beat the benchmark at all.

The probe feature procedure removed 184 features from the dataset. This feels significant, but really did not do much for our performance. Even after CV and fitting, ~220 of the features kept a non-zero coefficient.

Training set classification report:

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.28 | 0.72 | 0.41 | 711 |
| 1 | 0.93 | 0.67 | 0.78 | 3873 |
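The probe procedure itself can be sketched as follows. This is an illustrative reconstruction, not the notebook's code: shuffled copies of the real columns act as noise probes, and any real feature that cannot beat the strongest probe is dropped:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)

# Probe features: each real column shuffled independently, destroying any signal.
probes = rng.permuted(X, axis=0)
X_aug = np.hstack([X, probes])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[: X.shape[1]]
probe_imp = rf.feature_importances_[X.shape[1]:]

# Keep only features whose importance beats the strongest noise probe.
keep = real_imp > probe_imp.max()
print(f"kept {keep.sum()} of {X.shape[1]} features")
```

Applied to the 669-column dataset, a threshold like this is what eliminated the 184 features mentioned above.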

Experiment 4: Encoding + RF (and tuning)

Using what we've learnt so far, we want to hit a homerun with this experiment.

We know: random forest performs decently well - a quick test dropping the subject column entirely improved performance to 0.63908 AUC on Kaggle. But we also need to handle the unseen categories.

We have two ways to do this:

  • Target encode, and leak information

    • Instead of a categorical feature, use the smoothed mean of the target for each category, with a "smoothing" factor that shrinks low-count and unseen categories toward the global mean
      • This is sourced experience from doing geodemographic smoothing in insurance pricing at the census block level.
    • Formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
  • Simpler: Frequency encode the subject. Missing values get 0.

    • So we still get some information, but largely ignore the missing class entirely.

Performance: 0.68334 AUC

It worked!

[figures: feature importance plots with the hybrid encoding applied]

We appear to have managed to keep lots of the importances, especially in the subject (which is the missing case)

To push this even further, we can tune how much "leakage" we encode in the target encoding:

[figure: performance vs. target-encoding smoothing factor]

This pushed us to:

Performance 0.71756 AUC

However, all of this was with Random Forest. Time to move to XGBoost:

Experiment 5: XGBoost, basic

As a baseline, how does XGBoost perform on the dataset, prior to any encoding?

File: XGBoost

Performance: 0.59567 AUC

Slightly better than Random Forest, but not much. Tuning will help, going forward, as well as layering in the encoding.

Experiment 6: XGBoost, Tuned, with the Encoding

Pulling it all together, with Bayesian optimization to get the best hyperparameters.

Performance: 0.73141 AUC!

Big win. Proper jump in performance with the increased complexity.

Some cool plots from it:

Shapley global feature impacts:

[figure: SHAP summary plot]

Shapley feature importance:

[figure: SHAP feature importance plot]

On the whole, it builds nicely on the earlier learnings!
