Binary classification on an anonymized high-dimensional dataset using XGBoost, hybrid categorical encoding, Bayesian hyperparameter optimization, and SHAP-based model interpretability.
This project is a Kaggle-style binary classification competition from Harvard Extension School's CSCI E-82 Advanced Machine Learning course (Fall 2025). The dataset — 4,584 training rows and 1,732 test rows across 669 fully anonymized features — was designed with a deliberate difficulty: the most predictive feature contained category values in the test set that were completely absent from training, affecting 54.3% of all test rows. No domain context was provided.
The project is structured as a six-experiment lab notebook. Each experiment follows a hypothesis-to-result cycle: a modeling idea is stated, implemented, evaluated against a held-out leaderboard, and interpreted. This disciplined structure separates the work from a typical homework submission — experiments that underperformed (ElasticNet, XGBoost without encoding) are fully documented alongside successful ones, showing how the final design decision was earned rather than guessed.
The final model combines a hand-engineered hybrid categorical encoding scheme, gradient boosting (XGBoost), and Bayesian hyperparameter optimization with Optuna, reaching a Kaggle AUC of 0.731 — a 23.5% improvement over the naive baseline. SHAP (SHapley Additive exPlanations) values are used to explain both the global feature importance and the directional effect of individual features on predictions, making the model's behavior interpretable despite the fully anonymized feature space.
- Diagnosing and solving train/test categorical distribution shift via custom encoding strategies
- Systematic multi-model experimentation with documented hypothesis, result, and interpretation for each approach
- Bayesian hyperparameter optimization (Optuna) with proper stratified cross-validation
- SHAP-based model interpretability applied to a high-dimensional, anonymized feature space
- Boruta-style probe feature selection for unsupervised dimensionality reduction
- PCA for exploratory data analysis and distribution-shift visualization
- Handling class imbalance (84.5% majority class) in both RF and XGBoost settings
| Method | Type | Role in Project |
|---|---|---|
| Random Forest | Ensemble / bagging | Baseline classifier; initial feature importance analysis |
| PCA | Dimensionality reduction | EDA; visualizing train/test subject distribution shift |
| Probe feature selection (Boruta-style) | Feature selection | Unsupervised elimination of noise features |
| ElasticNet | Regularized linear model | Experiment 3; linear baseline on reduced feature set |
| Target encoding (smoothed) | Feature engineering | Encode subject with Bayesian shrinkage toward global mean |
| Frequency encoding | Feature engineering | Encode subject by training-set count; maps unseen to 0 |
| Hybrid encoding | Feature engineering | Concatenate target and frequency encodings for subject |
| XGBoost | Gradient boosting | Final model architecture |
| Bayesian optimization (Optuna) | Hyperparameter search | 60-trial TPE search over 9 XGBoost parameters |
| SHAP values | Model interpretability | Global + directional feature importance for final model |
Source: Harvard Extension School CSCI E-82 course competition (Fall 2025). Proprietary; domain intentionally hidden.
| Split | Rows | Features | Target |
|---|---|---|---|
| Train | 4,584 | 669 | output (binary: 0/1) |
| Test | 1,732 | 669 | Hidden |
Feature structure: 669 anonymized columns — x-series (binary/numeric), y-series (continuous), z-series (continuous, 221 columns), plus categorical columns subject, state, and phase.
Class imbalance: 84.5% positive class (3,873 of 4,584 training rows).
Key distribution shift: subject has 11 levels in training (A–D, F–I, K–M) and 13 in test (adds E and J). 941 of 1,732 test rows (54.3%) carry an unseen subject category.
See data/DATA_SOURCES.md for schema, download instructions, and verification commands.
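The shift can be quantified directly with pandas. A minimal sketch (the toy Series below stand in for the real `subject` columns, which you would load from the CSVs in `data/`):

```python
import pandas as pd

def unseen_report(train_col: pd.Series, test_col: pd.Series):
    """Return (sorted unseen levels, fraction of test rows affected)."""
    unseen = set(test_col.unique()) - set(train_col.unique())
    return sorted(unseen), test_col.isin(unseen).mean()

# Toy demo; on the real data you would pass
# pd.read_csv("data/train_data.csv")["subject"] and the test equivalent.
levels, frac = unseen_report(
    pd.Series(list("AABBC")),
    pd.Series(list("ABEE")),
)
print(levels, frac)  # ['E'] 0.5
```

On the real data this reports E and J as unseen and a 54.3% affected fraction.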
```
Raw data (4,584 train × 669 features)
    |
    v
[1] Feature importance (RF baseline)
    |— identifies subject as the top feature and the distribution-shift trap
    v
[2] PCA / EDA
    |— 2 components capture 73.32% of the variance
    |— subject E projects perpendicular to all training clusters
    v
[3] Probe feature selection + ElasticNet
    |— 184 features eliminated; linear model insufficient
    v
[4] Hybrid encoding design
    |— target encoding: smoothed mean per category
    |— frequency encoding: count-based; unseen → 0
    |— concatenate both into the feature matrix
    v
[5] RF with hybrid encoding + tuning
    |— RandomizedSearchCV; best CV AUC: 0.855; Kaggle: 0.718
    v
[6] XGBoost + hybrid encoding + Bayesian search (60 trials)
    |— Optuna TPE sampler; 9 hyperparameters; 5-fold stratified CV
    |— best CV AUC: 0.859; Kaggle AUC: 0.731
    v
[7] SHAP interpretability
    |— global bar chart: subject_target_enc dominates
    |— beeswarm: high encoded subject values → strong class-1 push
```
Target encoding uses Bayesian smoothing (a form of empirical Bayes shrinkage, also known as m-estimate encoding) to avoid leaking raw class means for low-count categories:

`encoded(c) = (n_c × mean_c + smoothing × global_mean) / (n_c + smoothing)`

where `n_c` is the count of category c in training, `mean_c` is the per-category positive rate, `global_mean` = 0.8449, and `smoothing` = 100 (tuned via grid search).
Unseen categories (E, J in test) receive the global mean under target encoding and frequency 0 under frequency encoding, providing two independent soft signals instead of a missing or arbitrary value.
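The scheme above can be sketched as a single helper. This is an illustrative implementation of the described formula, not the project's exact notebook code; the output column names mirror the `subject_target_enc` naming used elsewhere in this README:

```python
import pandas as pd

def hybrid_encode(train: pd.DataFrame, test: pd.DataFrame, col: str,
                  target: str, smoothing: float = 100.0) -> None:
    """Add two encoded columns to both frames, in place:
      <col>_target_enc — smoothed target encoding; unseen -> global mean
      <col>_freq_enc   — training-set count; unseen -> 0
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["count", "mean"])
    # encoded(c) = (n_c * mean_c + smoothing * global_mean) / (n_c + smoothing)
    target_map = (stats["count"] * stats["mean"] + smoothing * global_mean) \
                 / (stats["count"] + smoothing)
    freq_map = train[col].value_counts()

    for df in (train, test):
        df[f"{col}_target_enc"] = df[col].map(target_map).fillna(global_mean)
        df[f"{col}_freq_enc"] = df[col].map(freq_map).fillna(0)

# Toy demo: category "E" is unseen in training.
train_df = pd.DataFrame({"subject": ["A", "A", "B"], "output": [1, 0, 1]})
test_df = pd.DataFrame({"subject": ["A", "E"]})
hybrid_encode(train_df, test_df, "subject", "output", smoothing=1.0)
print(test_df)
```

The unseen category receives the global mean and frequency 0, exactly the two soft signals described above.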
| Parameter | Range |
|---|---|
| `n_estimators` | 50–1000 |
| `max_depth` | 3–15 |
| `learning_rate` | 0.001–0.3 (log-uniform) |
| `subsample` | 0.4–1.0 |
| `colsample_bytree` | 0.4–1.0 |
| `min_child_weight` | 1–10 |
| `gamma` | 0–5 |
| `reg_alpha` | 0.01–10 (log-uniform) |
| `reg_lambda` | 0.01–10 (log-uniform) |
| # | Method | CV AUC | Kaggle AUC |
|---|---|---|---|
| 1 | RF baseline (label encoding) | — | 0.592 |
| 2 | RF (subject dropped entirely) | 0.729 | — |
| 3 | ElasticNet + probe feature selection | 0.698 | 0.549 |
| 4a | RF + hybrid encoding (untuned) | 0.810 | — |
| 4b | RF + hybrid encoding (tuned, smoothing=100) | 0.855 | 0.718 |
| 5 | XGBoost baseline (no encoding) | 0.761 | 0.596 |
| 6 | XGBoost + hybrid encoding + Bayesian (final) | 0.859 | 0.731 |
Total improvement over naive baseline: +23.5% (Kaggle AUC 0.592 → 0.731)
- The encoding was the decisive intervention. Moving from the subject-dropped RF baseline to hybrid encoding improved CV AUC by +11.1% (0.729 → 0.810), holding the model architecture constant.
- Model choice mattered, but less. Switching from RF to XGBoost on the same encoded features improved CV AUC by +5.3% (0.810 → 0.853 for untuned XGBoost).
- Bayesian optimization provided a smaller but meaningful gain. 60 Optuna trials added +0.6 AUC points (0.853 → 0.859 CV AUC), confirming the data engineering gains dominated model tuning.
- ElasticNet underperformed the RF baseline despite probe-based feature selection, confirming the dataset is not linearly separable.
- SHAP confirmed the encoding's behavior: `subject_target_enc` has the highest mean absolute SHAP value, and high encoded values (high positive-rate subjects) consistently push predictions toward class 1.
All plot files are at the repository root. See images/README.md for the full catalog with generation instructions.
| File | Contents |
|---|---|
| `feature_importance.png` | RF baseline Gini feature importance — subject dominates |
| `shap_bar_plot.png` | SHAP global feature importance (mean \|SHAP\|) for the final XGBoost |
| `shap_summary_plot.png` | SHAP beeswarm showing directional feature impact per sample |
| `optuna_optimization.png` | Bayesian search convergence over 60 trials |
| `xgboost_comparison.png` | Baseline vs. optimized XGBoost CV AUC comparison |
| `xgboost_feature_importance.png` | XGBoost gain-based feature importance |
```
interpretable-tabular-classification/
│
├── RF Attempt.ipynb                  # Experiment 1: RF baseline
├── PCA.ipynb                         # Experiment 2: EDA and PCA
├── ProbeFeatures.ipynb               # Experiment 3: ElasticNet + probe selection
├── encoding+RF+Tuning.ipynb          # Experiment 4: Hybrid encoding + RF tuning
├── xgboosthw3.ipynb                  # Experiment 5: XGBoost baseline
├── xgboost_hybrid_bayesian.ipynb     # Experiment 6: Final model (XGBoost + Bayesian)
├── SingleFileSubmission.ipynb        # Consolidated end-to-end pipeline
│
├── feature_importance.png            # RF Gini feature importance
├── shap_bar_plot.png                 # SHAP global importance
├── shap_summary_plot.png             # SHAP beeswarm
├── optuna_optimization.png           # Optuna convergence history
├── xgboost_comparison.png            # Baseline vs. tuned XGBoost
├── xgboost_feature_importance.png    # XGBoost gain importance
├── image.png, image-1.png ... image-8.png  # Experiment-specific plots
│
├── data/
│   ├── sample_solution.csv           # Submission format reference
│   ├── train_data.csv                # Training data (obtain from course)
│   ├── test_data.csv                 # Test data (obtain from course)
│   └── DATA_SOURCES.md               # Data provenance and setup
│
├── notebooks/
│   └── README.md                     # Per-notebook guide
├── images/
│   └── README.md                     # Visualization catalog
│
├── requirements.txt                  # Python dependencies with min versions
├── pyproject.toml                    # uv project config
├── .gitignore
├── PROJECT_SUMMARY.md                # Resume bullets, interview prep, tech summary
└── README.md                         # This file
```
- Python 3.12+
- `train_data.csv` and `test_data.csv` from the CSCI E-82 course portal (see `data/DATA_SOURCES.md`)
```bash
git clone https://github.com/reidsendroff/interpretable-tabular-classification.git
cd interpretable-tabular-classification
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Place data files:

```bash
mkdir -p data
# Copy train_data.csv and test_data.csv into data/
# Note: notebook cells reference HW3Data/ paths — update them to data/ after cloning
```

Launch Jupyter:

```bash
jupyter lab
```

Open the notebooks in order (RF Attempt → PCA → ProbeFeatures → encoding+RF+Tuning → xgboosthw3 → xgboost_hybrid_bayesian), or jump directly to `xgboost_hybrid_bayesian.ipynb` for the final model.
To reproduce the final Kaggle submission end-to-end, use SingleFileSubmission.ipynb.
- Bayesian shrinkage estimation (smoothed target encoding)
- Principal component analysis and variance decomposition
- SHAP (SHapley Additive exPlanations) values for cooperative game-theoretic feature attribution
- Stratified k-fold cross-validation for AUC estimation
- ElasticNet regularization (L1 + L2 penalty)
- Class imbalance correction (`scale_pos_weight`, `class_weight='balanced'`)
- Gradient boosted trees (XGBoost)
- Random forest ensembles
- Linear classifiers (ElasticNet / Logistic regression)
- Boruta-style probe feature selection
- Python 3.12, pandas, numpy, scikit-learn, XGBoost, SHAP, Optuna
- Jupyter Lab notebooks
- Matplotlib, Seaborn for visualization
- Hypothesis-driven experiment design
- Controlled ablation (encoding vs. no encoding; RF vs. XGBoost)
- Kaggle-format submission pipeline
- Git version control
Harvard Extension School — CSCI E-82: Advanced Machine Learning (Fall 2025).
Class-only Kaggle competition. Dataset domain intentionally withheld. Evaluation metric: AUC-ROC on a private leaderboard split. Authors: Reid Sendroff and Joshua Harvey.
The train/test categorical distribution mismatch encountered here is not a contrived academic problem — it is one of the most common failure modes in deployed ML systems. New customer segments, new geographic markets, new product SKUs, or new device types routinely appear in production data without training-set representation. The hybrid encoding approach developed here — combining Bayesian shrinkage toward a global prior with frequency-based signal — is a principled, production-ready solution to this class of problem. SHAP explainability further ensures that model behavior can be audited and communicated, a requirement in any regulated or high-stakes deployment context.
- Apply the hybrid encoding to the `state` feature (currently label-encoded; may also suffer from distribution shift)
- Add LIME for individual prediction explanations alongside SHAP
- Evaluate a stacking ensemble (RF + XGBoost + ElasticNet) with a meta-learner
- Refactor encoding logic into a reusable `sklearn` transformer (`BaseEstimator` + `TransformerMixin`) for pipeline compatibility
- Replace the Google Drive dependency in `xgboosthw3.ipynb` with a local path for reproducibility
- Add MLflow or a results CSV for cross-notebook experiment tracking
- Investigate whether the `state` feature has unseen categories in test (not yet analyzed)
Reid Sendroff and Joshua Harvey
Harvard Extension School, CSCI E-82 Advanced Machine Learning, Fall 2025
Six iterative experiments, one distribution-shift problem, and a hybrid encoding trick that moved the needle more than any model choice — because the data engineering was the hard part.
The original experiment log is preserved below. All image references link to plot files at the repository root.
File: RF Attempt
The goal of this experiment is simply to see how a "basic" random forest does on the classification problem. Since random forests are fairly resistant to overfitting, this gives a quick sense of what to expect going forward.
Additionally, the feature importance plot helps us start getting a handle on the dataset, since there are hundreds of columns.
Feature importance (top 4):
| feature | importance |
|---|---|
| subject | 0.042882 |
| z205 | 0.013999 |
| z206 | 0.008880 |
| phase | 0.008546 |
| ... | ... |
Subject is very important, but some of its test categories are missing from the training set. It's a trap!
As a whole, we beat the "benchmark" but clearly overfit: training AUC of 0.93 vs. a test score of 0.59. We also need to figure out the subject issue; that should come next.
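The overfitting diagnosis above boils down to comparing training AUC against held-out AUC. A minimal sketch on synthetic imbalanced data (an illustrative stand-in, not the competition data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           weights=[0.15, 0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc_train = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
# A large train/held-out gap, as in the 0.93 vs 0.59 result, signals overfitting.
print(f"train AUC {auc_train:.3f} vs held-out AUC {auc_test:.3f}")
```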
File: PCA
The goal of this experiment is to start learning a bit more about the data, now that we have a few insights from the random forest to work with. Specifically, we want to understand how the subject distribution, being the most important variable, changes between train and test sets.
Finally, we also want to learn how the projection of the unseen subjects aligns with known subjects, since we'll need some way to handle the unseen classes.
Learnings:
There's the problem in a nutshell. PCA captures a reasonable amount (73%) of the variance. However, one of the unseen classes (E) projects perpendicular to the rest of them. We don't have information on that class, and it's also a large portion of the test set.
This is the same picture, but for train and test sets.
Finally, to look at the same picture, but colored by output:

So, we've figured out that imputation is going to be critically important to improving the model performance, and that we have that unseen class.
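The PCA shift check can be sketched as: fit PCA on the training split only, project both splits, and look for test points that land outside every training cluster. Synthetic stand-in data below (not the competition set):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 30))
X_test = np.vstack([rng.normal(size=(150, 30)),
                    rng.normal(loc=4.0, size=(50, 30))])  # shifted subgroup

# Fit the scaler and PCA on train only, then project both splits.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

print(f"explained variance (2 PCs): {pca.explained_variance_ratio_.sum():.2%}")
# Scatter-plotting Z_train vs Z_test reveals test clusters that fall outside
# every training cluster — as subject E did in this project.
```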
File: Probe Features
The goal here is to use probe features (injected noise columns) for unsupervised feature selection. Once we have a smaller feature set, we fit an ElasticNet on it and see how it performs.
We don't expect it to outperform the baseline RF, but it should not do much worse.
We were wrong. This model does not beat the benchmark at all.
The probe feature procedure removed 184 features from the dataset. That feels significant, but it did little for performance: even after CV and fitting, ~220 of the remaining features kept a non-zero coefficient.
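The probe procedure can be sketched as follows. This is one reasonable variant of the Boruta-style idea (shuffled copies of real columns set the noise floor), on synthetic data; the notebook's exact thresholding may differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Probes: column-wise shuffles of the real features, so they keep each
# column's marginal distribution but carry no signal about y.
probes = rng.permuted(X, axis=0)
X_aug = np.hstack([X, probes])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_

# Keep only real features that beat the strongest probe.
threshold = imp[X.shape[1]:].max()
keep = np.where(imp[:X.shape[1]] > threshold)[0]
print(f"kept {len(keep)} of {X.shape[1]} features")
```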
Training Set Classification Report:
| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.28 | 0.72 | 0.41 | 711 |
| 1 | 0.93 | 0.67 | 0.78 | 3873 |
Using what we've learnt so far, we want to hit a home run with this experiment.
We know random forest performs decently well - a quick test that removed the missing class entirely improved performance to 0.63908 AUC on Kaggle. But we also need to handle the missing classes.
We have two ways to do this:

1. Target encode, and leak information
   - Instead of a categorical feature, use the smoothed per-category mean of the target, with a "smoothing" factor that supplies a value for unseen categories
   - This is sourced from experience doing geodemographic smoothing in insurance pricing at the census-block level
   - Formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
2. Simpler: frequency encode the subject. Missing values get 0.
   - We still get some information, but largely ignore the missing class entirely.
It worked!
We appear to have kept much of the feature importance, especially for subject (the feature with the missing categories).
To push this even further, we can tune how much "leakage" we encode in the target encoding:
This pushed us to:
However, all of this was with Random Forest. Time to move to XGBoost:
As a baseline, how does XGBoost perform on the dataset (prior to any encoding)?
File: XGBoost
Slightly better than Random Forest, but not by much. Tuning will help going forward, as will layering in the encoding.
Pulling it all together, with Bayesian optimization to find the best hyperparameters.
Big win. Proper jump in performance with the increased complexity.
Some cool plots from it:
Shapley global feature impacts:
Shapley feature importance:
On the whole, it builds nicely on the earlier learnings!







