Project Summary — Interpretable Tabular Classification

Concise Summary

A Kaggle-style binary classification competition on an anonymized high-dimensional tabular dataset (4,584 train / 1,732 test rows, 669 features). The primary technical challenge was a deliberate train/test distribution mismatch: the most predictive feature (subject) contained two category values in the test set that were absent from training, affecting 54.3% of all test rows. The winning approach combined a custom hybrid encoding strategy (smoothed target encoding + frequency encoding) with XGBoost and Bayesian hyperparameter optimization (Optuna), reaching a final Kaggle AUC of 0.731. SHAP values provide global and directional feature-level interpretability for the final model.

Resume Bullets

Achieved 0.731 AUC on a private Kaggle leaderboard by engineering a hybrid categorical encoding strategy (smoothed target + frequency encoding) that resolved a 54.3% train/test category-distribution mismatch invisible to baseline models
Reduced model overfitting gap from 0.255 (RF baseline) to 0.147 through systematic experiment progression across Random Forest, ElasticNet, and XGBoost architectures with 60-trial Bayesian hyperparameter optimization via Optuna
Delivered SHAP-based model interpretability (global feature importance + beeswarm plots) on a 669-feature anonymized dataset using Python, scikit-learn, XGBoost, and the SHAP library

Technical Explanation

The dataset presented two compounding challenges: (1) 669 anonymized features with no domain context, and (2) a categorical subject feature that was the single most predictive variable (Gini importance 0.0429, 3x the next feature) but took values never seen in training in 54.3% of test rows. Standard label encoding has no meaningful value for a category it never saw during fitting: it either fails outright or assigns an arbitrary integer, destroying the signal. Dropping the column entirely left substantial predictive power on the table (CV AUC 0.729 without the feature vs. 0.810 with the hybrid encoding).
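A mismatch of this kind can be measured before any modeling. The sketch below is a minimal illustration and not the project's actual code: `unseen_category_report` is a hypothetical helper, and the toy labels are made up for the example.

```python
import pandas as pd


def unseen_category_report(train_col: pd.Series, test_col: pd.Series):
    """Return (sorted unseen values, share of test rows they cover)."""
    # Categories observed at least once in the training split.
    seen = set(train_col.dropna().unique())
    # A test row is "unseen" when it has a value that training never contained.
    is_unseen = ~test_col.isin(seen) & test_col.notna()
    unseen_values = sorted(test_col[is_unseen].unique())
    return unseen_values, float(is_unseen.mean())
```

Called on the training and test columns of a categorical feature, it returns the out-of-vocabulary values together with the fraction of test rows they cover, which is the kind of figure quoted above (54.3%).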

The hybrid encoding scheme encodes subject twice in parallel: once via smoothed target encoding (a Bayesian shrinkage estimator that pulls each category's positive rate toward the global mean, more strongly for rare categories) and once via frequency encoding, which maps each category to its training-set count. Unseen test categories receive the global positive rate (0.8449) under target encoding and a count of 0 under frequency encoding. The first acts as a neutral prior and the second explicitly flags the category as new; together these two soft signals reach CV AUC 0.810, versus 0.729 without the feature.
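The two encodings described above can be sketched in a few lines of pandas. This is a minimal sketch under stated assumptions, not the authors' implementation: the shrinkage formula (category sum plus `smoothing` times the global mean, divided by category count plus `smoothing`), the default `smoothing` weight, the function names, and the `subject_freq_enc` column name are all assumptions made for illustration.

```python
import pandas as pd


def fit_hybrid_encoder(categories: pd.Series, target: pd.Series, smoothing: float = 10.0) -> dict:
    """Learn target-encoding and frequency-encoding maps from training data."""
    global_mean = float(target.mean())
    stats = target.groupby(categories).agg(["sum", "count"])
    # Shrinkage: categories with few rows are pulled toward the global mean.
    smoothed = (stats["sum"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return {
        "global_mean": global_mean,
        "target_map": smoothed.to_dict(),
        "freq_map": stats["count"].to_dict(),
    }


def transform_hybrid(categories: pd.Series, enc: dict, name: str = "subject") -> pd.DataFrame:
    """Encode a categorical column twice; unseen values get soft defaults."""
    # Unseen categories fall back to the global positive rate.
    target_enc = categories.map(enc["target_map"]).fillna(enc["global_mean"])
    # Unseen categories get a training-set count of 0.
    freq_enc = categories.map(enc["freq_map"]).fillna(0).astype(int)
    return pd.DataFrame(
        {f"{name}_target_enc": target_enc, f"{name}_freq_enc": freq_enc},
        index=categories.index,
    )
```

Fitting on the training split only and then transforming both splits keeps test labels out of the encoder. When the encoded column is also used inside cross-validation, refitting the encoder within each training fold is the usual way to avoid leaking validation targets into the feature.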

The final model replaces Random Forest with XGBoost to gain the advantages of gradient boosting on tabular data, and tunes it with an Optuna Bayesian search over 9 hyperparameters across 60 trials. SHAP (SHapley Additive exPlanations) values confirm that subject_target_enc is the dominant driver, with a consistent direction: high encoded values (subjects with a high positive rate) push predictions strongly toward class 1.

Interview Version

In this project, I built a binary classifier for an anonymized tabular dataset as part of a Kaggle-style competition in Harvard's Advanced Machine Learning course. The dataset had 669 features and a deliberate trap: the most important feature, subject, contained two categories in the test set that never appeared in training, affecting over half the test rows. My first step was to run a baseline Random Forest and use feature importance to identify this problem immediately. I then used PCA to visualize how one of the unseen categories — subject E — projected nearly perpendicular to all known subjects in the first two principal components, confirming that imputation would need to be soft. I designed a hybrid encoding that applied smoothed target encoding and frequency encoding simultaneously, preserving signal while gracefully handling the unknown categories. Layering XGBoost and Bayesian hyperparameter search on top pushed my final Kaggle AUC to 0.731, up from 0.592 with the naive baseline. I also computed SHAP values to explain which features the model relied on most and how their values directionally influenced predictions.

Why This Project Stands Out

Most toy classification tutorials use clean, pre-processed benchmark datasets. This project is notable because:

The core challenge was data engineering, not just model selection. The train/test category mismatch is a real-world problem (e.g., new customer segments, new geographic markets) that cannot be solved by choosing a fancier algorithm.
Explainability is built-in, not bolted on. SHAP was used not as an afterthought but as a validation step to confirm the encoding's behavior and identify dominant drivers.
The experiment log is complete. All six experiments are documented with hypothesis, method, result, and interpretation — including two experiments that underperformed (ElasticNet, XGBoost without encoding), showing methodical iteration rather than cherry-picking.
Bayesian optimization was applied correctly. Optuna's TPE sampler was used with stratified cross-validation rather than a simple train/val split, avoiding optimistic bias.

Key Skills Demonstrated

Machine Learning & Statistics

Binary classification under class imbalance
Cross-validation strategy (StratifiedKFold, 5-fold)
Gradient boosting (XGBoost) vs. bagging ensembles (Random Forest)
Regularized linear models (ElasticNet with L1/L2 penalty)
Dimensionality reduction (PCA) for EDA and distribution shift analysis

Feature Engineering

Smoothed target encoding (Bayesian shrinkage)
Frequency encoding
Hybrid encoding for out-of-vocabulary categorical values
Boruta-style probe feature selection

Hyperparameter Optimization

Bayesian optimization with Optuna (TPE sampler)
60-trial search over 9 XGBoost parameters

Model Interpretability

SHAP global feature importance (bar chart)
SHAP beeswarm (directional impact per sample)
Tree-based feature importance (Gini importance for RF, built-in importance for XGBoost)

Engineering & Workflow

Reproducible notebook pipeline
Stratified submission CSV generation
Version-controlled experiment results

Project Outcomes

Final Kaggle AUC: 0.731

Milestone	CV AUC	Kaggle AUC
RF baseline	—	0.592
ElasticNet + probe features	0.698	0.549
RF + hybrid encoding (tuned)	0.855	0.718
XGBoost + hybrid + Bayesian	0.859	0.731

Total AUC improvement over baseline: +0.139 absolute, +23.5% relative (0.592 → 0.731)
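The improvement figure is a relative gain over the baseline Kaggle AUC. A small check of the arithmetic, using a hypothetical helper name `auc_gain`:

```python
def auc_gain(baseline: float, final: float):
    """Return (absolute gain, relative gain in percent) between two AUC scores."""
    absolute = final - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct
```

With the baseline of 0.592 and the final score of 0.731 from the table, the absolute gain is 0.139 AUC points and the relative gain rounds to 23.5%.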

Context

Course: Harvard Extension School, CSCI E-82 — Advanced Machine Learning (Fall 2025)
Format: Kaggle-style private leaderboard competition; hidden dataset domain
Authors: Reid Sendroff and Joshua Harvey
Submission date: October 17, 2025
