Binary classification on an anonymized high-dimensional dataset using XGBoost, hybrid categorical encoding, Bayesian hyperparameter optimization, and SHAP-based model interpretability.
This project is a Kaggle-style binary classification competition from Harvard Extension School's CSCI E-82 Advanced Machine Learning course (Fall 2025). The dataset — 4,584 training rows and 1,732 test rows across 669 fully anonymized features — was designed with a deliberate difficulty: the most predictive feature contained category values in the test set that were completely absent from training, affecting 54.3% of all test rows. No domain context was provided.
The project is structured as a six-experiment lab notebook. Each experiment follows a hypothesis-to-result cycle: a modeling idea is stated, implemented, evaluated against a held-out leaderboard, and interpreted. This disciplined structure separates the work from a typical homework submission — experiments that underperformed (ElasticNet, XGBoost without encoding) are fully documented alongside successful ones, showing how the final design decision was earned rather than guessed.
The final model combines a hand-engineered hybrid categorical encoding scheme, gradient boosting (XGBoost), and Bayesian hyperparameter optimization with Optuna, reaching a Kaggle AUC of 0.731 — a 23.5% improvement over the naive baseline. SHAP (SHapley Additive exPlanations) values are used to explain both the global feature importance and the directional effect of individual features on predictions, making the model's behavior interpretable despite the fully anonymized feature space.
- Diagnosing and solving train/test categorical distribution shift via custom encoding strategies
- Systematic multi-model experimentation with documented hypothesis, result, and interpretation for each approach
- Bayesian hyperparameter optimization (Optuna) with proper stratified cross-validation
- SHAP-based model interpretability applied to a high-dimensional, anonymized feature space
- Boruta-style probe feature selection for unsupervised dimensionality reduction
- PCA for exploratory data analysis and distribution-shift visualization
- Handling class imbalance (84.5% majority class) in both RF and XGBoost settings
| Method | Type | Role in Project |
|---|---|---|
| Random Forest | Ensemble / bagging | Baseline classifier; initial feature importance analysis |
| PCA | Dimensionality reduction | EDA; visualizing train/test subject distribution shift |
| Probe feature selection (Boruta-style) | Feature selection | Unsupervised elimination of noise features |
| ElasticNet | Regularized linear model | Experiment 3; linear baseline on reduced feature set |
| Target encoding (smoothed) | Feature engineering | Encode subject with Bayesian shrinkage toward global mean |
| Frequency encoding | Feature engineering | Encode subject by training-set count; maps unseen to 0 |
| Hybrid encoding | Feature engineering | Concatenate target and frequency encodings for subject |
| XGBoost | Gradient boosting | Final model architecture |
| Bayesian optimization (Optuna) | Hyperparameter search | 60-trial TPE search over 9 XGBoost parameters |
| SHAP values | Model interpretability | Global + directional feature importance for final model |
Source: Harvard Extension School CSCI E-82 course competition (Fall 2025). Proprietary; domain intentionally hidden.
| Split | Rows | Features | Target |
|---|---|---|---|
| Train | 4,584 | 669 | output (binary: 0/1) |
| Test | 1,732 | 669 | Hidden |
Feature structure: 669 anonymized columns — x-series (binary/numeric), y-series (continuous), z-series (continuous, 221 columns), plus categorical columns subject, state, and phase.
Class imbalance: 84.5% positive class (3,873 of 4,584 training rows).
Key distribution shift: subject has 11 levels in training (A–D, F–I, K–M) and 13 in test (adds E and J). 941 of 1,732 test rows (54.3%) carry an unseen subject category.
See data/DATA_SOURCES.md for schema, download instructions, and verification commands.
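The shift can be quantified directly with pandas. A minimal sketch (the toy Series below stand in for the real `subject` columns, which you would load from the CSVs in `data/`):

```python
import pandas as pd

def unseen_report(train_col: pd.Series, test_col: pd.Series):
    """Return (sorted unseen levels, fraction of test rows affected)."""
    unseen = set(test_col.unique()) - set(train_col.unique())
    return sorted(unseen), test_col.isin(unseen).mean()

# Toy demo; on the real data you would pass
# pd.read_csv("data/train_data.csv")["subject"] and the test equivalent.
levels, frac = unseen_report(
    pd.Series(list("AABBC")),
    pd.Series(list("ABEE")),
)
print(levels, frac)  # ['E'] 0.5
```

On the real data this reports E and J as unseen and a 54.3% affected fraction.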
```
Raw data (4,584 train × 669 features)
    |
    v
[1] Feature importance (RF baseline)
    |— identifies subject as the top feature and the distribution-shift trap
    v
[2] PCA / EDA
    |— 2 components capture 73.32% of the variance
    |— subject E projects perpendicular to all training clusters
    v
[3] Probe feature selection + ElasticNet
    |— 184 features eliminated; linear model insufficient
    v
[4] Hybrid encoding design
    |— target encoding: smoothed mean per category
    |— frequency encoding: count-based; unseen → 0
    |— concatenate both into the feature matrix
    v
[5] RF with hybrid encoding + tuning
    |— RandomizedSearchCV; best CV AUC: 0.855; Kaggle: 0.718
    v
[6] XGBoost + hybrid encoding + Bayesian search (60 trials)
    |— Optuna TPE sampler; 9 hyperparameters; 5-fold stratified CV
    |— best CV AUC: 0.859; Kaggle AUC: 0.731
    v
[7] SHAP interpretability
    |— global bar chart: subject_target_enc dominates
    |— beeswarm: high encoded subject values → strong class-1 push
```
Target encoding uses Bayesian smoothing (a form of empirical Bayes shrinkage, also known as m-estimate encoding) to avoid leaking raw class means for low-count categories:

`encoded(c) = (n_c × mean_c + smoothing × global_mean) / (n_c + smoothing)`

where `n_c` is the count of category c in training, `mean_c` is the per-category positive rate, `global_mean` = 0.8449, and `smoothing` = 100 (tuned via grid search).
Unseen categories (E, J in test) receive the global mean under target encoding and frequency 0 under frequency encoding, providing two independent soft signals instead of a missing or arbitrary value.
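The scheme above can be sketched as a single helper. This is an illustrative implementation of the described formula, not the project's exact notebook code; the output column names mirror the `subject_target_enc` naming used elsewhere in this README:

```python
import pandas as pd

def hybrid_encode(train: pd.DataFrame, test: pd.DataFrame, col: str,
                  target: str, smoothing: float = 100.0) -> None:
    """Add two encoded columns to both frames, in place:
      <col>_target_enc — smoothed target encoding; unseen -> global mean
      <col>_freq_enc   — training-set count; unseen -> 0
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["count", "mean"])
    # encoded(c) = (n_c * mean_c + smoothing * global_mean) / (n_c + smoothing)
    target_map = (stats["count"] * stats["mean"] + smoothing * global_mean) \
                 / (stats["count"] + smoothing)
    freq_map = train[col].value_counts()

    for df in (train, test):
        df[f"{col}_target_enc"] = df[col].map(target_map).fillna(global_mean)
        df[f"{col}_freq_enc"] = df[col].map(freq_map).fillna(0)

# Toy demo: category "E" is unseen in training.
train_df = pd.DataFrame({"subject": ["A", "A", "B"], "output": [1, 0, 1]})
test_df = pd.DataFrame({"subject": ["A", "E"]})
hybrid_encode(train_df, test_df, "subject", "output", smoothing=1.0)
print(test_df)
```

The unseen category receives the global mean and frequency 0, exactly the two soft signals described above.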
| Parameter | Range |
|---|---|
| `n_estimators` | 50–1000 |
| `max_depth` | 3–15 |
| `learning_rate` | 0.001–0.3 (log-uniform) |
| `subsample` | 0.4–1.0 |
| `colsample_bytree` | 0.4–1.0 |
| `min_child_weight` | 1–10 |
| `gamma` | 0–5 |
| `reg_alpha` | 0.01–10 (log-uniform) |
| `reg_lambda` | 0.01–10 (log-uniform) |
| # | Method | CV AUC | Kaggle AUC |
|---|---|---|---|
| 1 | RF baseline (label encoding) | — | 0.592 |
| 2 | RF (subject dropped entirely) | 0.729 | — |
| 3 | ElasticNet + probe feature selection | 0.698 | 0.549 |
| 4a | RF + hybrid encoding (untuned) | 0.810 | — |
| 4b | RF + hybrid encoding (tuned, smoothing=100) | 0.855 | 0.718 |
| 5 | XGBoost baseline (no encoding) | 0.761 | 0.596 |
| 6 | XGBoost + hybrid encoding + Bayesian (final) | 0.859 | 0.731 |
Total improvement over naive baseline: +23.5% (Kaggle AUC 0.592 → 0.731)
- The encoding was the decisive intervention. Moving from the subject-dropped RF baseline to hybrid encoding improved CV AUC by +11.1% (0.729 → 0.810), holding the model architecture constant.
- Model choice mattered, but less. Switching from RF to XGBoost on the same encoded features improved CV AUC by +5.3% (0.810 → 0.853 for untuned XGBoost).
- Bayesian optimization provided a smaller but meaningful gain. 60 Optuna trials added +0.6 AUC points (0.853 → 0.859 CV AUC), confirming the data engineering gains dominated model tuning.
- ElasticNet underperformed the RF baseline despite probe-based feature selection, confirming the dataset is not linearly separable.
- SHAP confirmed the encoding's behavior: `subject_target_enc` has the highest mean absolute SHAP value, and high encoded values (high positive-rate subjects) consistently push predictions toward class 1.
All plot files are at the repository root. See images/README.md for the full catalog with generation instructions.
| File | Contents |
|---|---|
| `feature_importance.png` | RF baseline Gini feature importance — subject dominates |
| `shap_bar_plot.png` | SHAP global feature importance (mean \|SHAP\|) for the final XGBoost |
| `shap_summary_plot.png` | SHAP beeswarm showing directional feature impact per sample |
| `optuna_optimization.png` | Bayesian search convergence over 60 trials |
| `xgboost_comparison.png` | Baseline vs. optimized XGBoost CV AUC comparison |
| `xgboost_feature_importance.png` | XGBoost gain-based feature importance |
```
interpretable-tabular-classification/
│
├── RF Attempt.ipynb                  # Experiment 1: RF baseline
├── PCA.ipynb                         # Experiment 2: EDA and PCA
├── ProbeFeatures.ipynb               # Experiment 3: ElasticNet + probe selection
├── encoding+RF+Tuning.ipynb          # Experiment 4: Hybrid encoding + RF tuning
├── xgboosthw3.ipynb                  # Experiment 5: XGBoost baseline
├── xgboost_hybrid_bayesian.ipynb     # Experiment 6: Final model (XGBoost + Bayesian)
├── SingleFileSubmission.ipynb        # Consolidated end-to-end pipeline
│
├── feature_importance.png            # RF Gini feature importance
├── shap_bar_plot.png                 # SHAP global importance
├── shap_summary_plot.png             # SHAP beeswarm
├── optuna_optimization.png           # Optuna convergence history
├── xgboost_comparison.png            # Baseline vs. tuned XGBoost
├── xgboost_feature_importance.png    # XGBoost gain importance
├── image.png, image-1.png ... image-8.png  # Experiment-specific plots
│
├── data/
│   ├── sample_solution.csv           # Submission format reference
│   ├── train_data.csv                # Training data (obtain from course)
│   ├── test_data.csv                 # Test data (obtain from course)
│   └── DATA_SOURCES.md               # Data provenance and setup
│
├── notebooks/
│   └── README.md                     # Per-notebook guide
├── images/
│   └── README.md                     # Visualization catalog
│
├── requirements.txt                  # Python dependencies with min versions
├── pyproject.toml                    # uv project config
├── .gitignore
├── PROJECT_SUMMARY.md                # Resume bullets, interview prep, tech summary
└── README.md                         # This file
```
- Python 3.12+
- `train_data.csv` and `test_data.csv` from the CSCI E-82 course portal (see `data/DATA_SOURCES.md`)
```bash
git clone https://github.com/reidsendroff/interpretable-tabular-classification.git
cd interpretable-tabular-classification
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Place data files:

```bash
mkdir -p data
# Copy train_data.csv and test_data.csv into data/
# Note: notebook cells reference HW3Data/ paths — update them to data/ after cloning
```

Launch Jupyter:

```bash
jupyter lab
```

Open the notebooks in order (RF Attempt → PCA → ProbeFeatures → encoding+RF+Tuning → xgboosthw3 → xgboost_hybrid_bayesian), or jump directly to `xgboost_hybrid_bayesian.ipynb` for the final model.
To reproduce the final Kaggle submission end-to-end, use SingleFileSubmission.ipynb.
- Bayesian shrinkage estimation (smoothed target encoding)
- Principal component analysis and variance decomposition
- SHAP (SHapley Additive exPlanations) values for cooperative game-theoretic feature attribution
- Stratified k-fold cross-validation for AUC estimation
- ElasticNet regularization (L1 + L2 penalty)
- Class imbalance correction (`scale_pos_weight`, `class_weight='balanced'`)
- Gradient boosted trees (XGBoost)
- Random forest ensembles
- Linear classifiers (ElasticNet / Logistic regression)
- Boruta-style probe feature selection
- Python 3.12, pandas, numpy, scikit-learn, XGBoost, SHAP, Optuna
- Jupyter Lab notebooks
- Matplotlib, Seaborn for visualization
- Hypothesis-driven experiment design
- Controlled ablation (encoding vs. no encoding; RF vs. XGBoost)
- Kaggle-format submission pipeline
- Git version control
Harvard Extension School — CSCI E-82: Advanced Machine Learning (Fall 2025).
Class-only Kaggle competition. Dataset domain intentionally withheld. Evaluation metric: AUC-ROC on a private leaderboard split. Authors: Reid Sendroff and Joshua Harvey.
The train/test categorical distribution mismatch encountered here is not a contrived academic problem — it is one of the most common failure modes in deployed ML systems. New customer segments, new geographic markets, new product SKUs, or new device types routinely appear in production data without training-set representation. The hybrid encoding approach developed here — combining Bayesian shrinkage toward a global prior with frequency-based signal — is a principled, production-ready solution to this class of problem. SHAP explainability further ensures that model behavior can be audited and communicated, a requirement in any regulated or high-stakes deployment context.
- Apply the hybrid encoding to the `state` feature (currently label-encoded; may also suffer from distribution shift)
- Add LIME for individual prediction explanations alongside SHAP
- Evaluate a stacking ensemble (RF + XGBoost + ElasticNet) with a meta-learner
- Refactor encoding logic into a reusable `sklearn` transformer (`BaseEstimator` + `TransformerMixin`) for pipeline compatibility
- Replace the Google Drive dependency in `xgboosthw3.ipynb` with a local path for reproducibility
- Add MLflow or a results CSV for cross-notebook experiment tracking
- Investigate whether the `state` feature has unseen categories in test (not yet analyzed)
Reid Sendroff and Joshua Harvey
Harvard Extension School, CSCI E-82 Advanced Machine Learning, Fall 2025
Six iterative experiments, one distribution-shift problem, and a hybrid encoding trick that moved the needle more than any model choice — because the data engineering was the hard part.
The original experiment log is preserved below. All image references link to plot files at the repository root.
File: RF Attempt
The goal of this experiment is simply to see how a "basic" random forest does on the classification problem. Since random forests are fairly resistant to overfitting, this gives a quick sense of what to expect going forward.
Additionally, the feature importance plot helps us start getting a handle on the dataset, since there are hundreds of columns.
Feature importance (top 4):
| feature | importance |
|---|---|
| subject | 0.042882 |
| z205 | 0.013999 |
| z206 | 0.008880 |
| phase | 0.008546 |
| ... | ... |
Subject is very important, but some of its test categories are missing from the training set. It's a trap!
As a whole, we beat the "benchmark" but clearly overfit: training AUC of 0.93 vs. a test score of 0.59. We also need to figure out the subject issue; that should come next.
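The overfitting diagnosis above boils down to comparing training AUC against held-out AUC. A minimal sketch on synthetic imbalanced data (an illustrative stand-in, not the competition data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           weights=[0.15, 0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
# A large train/held-out gap, as in the 0.93 vs 0.59 result, signals overfitting.
print(f"train AUC {auc_train:.3f} vs held-out AUC {auc_test:.3f}")
```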
File: PCA
The goal of this experiment is to start learning a bit more about the data, now that we have a few insights from the random forest to work with. Specifically, we want to understand how the subject distribution, being the most important variable, changes between train and test sets.
Finally, we also want to learn how the projection of the unseen subjects aligns with known subjects, since we'll need some way to handle the unseen classes.
Learnings:
There's the problem in a nutshell. PCA captures a reasonable amount (73%) of the variance. However, one of the unseen classes (E) projects perpendicular to the rest of them. We don't have information on that class, and it's also a large portion of the test set.
This is the same picture, but for train and test sets.
Finally, to look at the same picture, but colored by output:

So, we've figured out that imputation is going to be critically important to improving the model performance, and that we have that unseen class.
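The PCA shift check can be sketched as: fit PCA on the training split only, project both splits, and look for test points that land outside every training cluster. Synthetic stand-in data below (not the competition set):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 30))
X_test = np.vstack([rng.normal(size=(150, 30)),
                    rng.normal(loc=4.0, size=(50, 30))])  # shifted subgroup

# Fit the scaler and PCA on train only, then project both splits.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

print(f"explained variance (2 PCs): {pca.explained_variance_ratio_.sum():.2%}")
# Scatter-plotting Z_train vs Z_test reveals test clusters that fall outside
# every training cluster — as subject E did in this project.
```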
File: Probe Features
The goal here is to use probe features (injected noise columns) for unsupervised feature selection. Once we have a smaller feature set, we fit an ElasticNet on it and see how it performs.
We don't expect it to outperform the baseline RF, but it should not do much worse.
We were wrong. This model does not beat the benchmark at all.
The probe feature procedure removed 184 features from the dataset. That feels significant, but it did little for performance: even after CV and fitting, ~220 of the remaining features kept a non-zero coefficient.
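The probe procedure can be sketched as follows. This is one reasonable variant of the Boruta-style idea (shuffled copies of real columns set the noise floor), on synthetic data; the notebook's exact thresholding may differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Probes: column-wise shuffles of the real features, so they keep each
# column's marginal distribution but carry no signal about y.
probes = rng.permuted(X, axis=0)
X_aug = np.hstack([X, probes])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_

# Keep only real features that beat the strongest probe.
threshold = imp[X.shape[1]:].max()
keep = np.where(imp[:X.shape[1]] > threshold)[0]
print(f"kept {len(keep)} of {X.shape[1]} features")
```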
Training Set Classification Report:
| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.28 | 0.72 | 0.41 | 711 |
| 1 | 0.93 | 0.67 | 0.78 | 3873 |
Using what we've learnt so far, we want to hit a home run with this experiment.
We know random forest performs decently well - a quick test that removed the missing class entirely improved performance to 0.63908 AUC on Kaggle. But we also need to handle the missing classes.
We have two ways to do this:

1. Target encode, and leak information
   - Instead of a categorical feature, use the smoothed per-category mean of the target, with a "smoothing" factor that supplies a value for unseen categories
   - This is sourced from experience doing geodemographic smoothing in insurance pricing at the census-block level
   - Formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
2. Simpler: frequency encode the subject. Missing values get 0.
   - We still get some information, but largely ignore the missing class entirely.
It worked!
We appear to have kept much of the feature importance, especially for subject (the feature with the missing categories).
To push this even further, we can tune how much "leakage" we encode in the target encoding:
This pushed us to:
However, all of this was with Random Forest. Time to move to XGBoost:
As a baseline, how does XGBoost perform on the dataset (prior to any encoding)?
File: XGBoost
Slightly better than Random Forest, but not by much. Tuning will help going forward, as will layering in the encoding.
Pulling it all together, with Bayesian optimization to find the best hyperparameters.
Big win. Proper jump in performance with the increased complexity.
Some cool plots from it:
Shapley global feature impacts:
Shapley feature importance:
On the whole, it builds nicely on the earlier learnings!







