A Kaggle-style binary classification competition on an anonymized high-dimensional tabular dataset (4,584 train / 1,732 test rows, 669 features). The primary technical challenge was a deliberate train/test distribution mismatch: the most predictive feature (subject) contained two category values in the test set that were absent from training, affecting 54.3% of all test rows. The winning approach combined a custom hybrid encoding strategy (smoothed target encoding + frequency encoding) with XGBoost and Bayesian hyperparameter optimization (Optuna), reaching a final Kaggle AUC of 0.731. SHAP values provide global and directional feature-level interpretability for the final model.
- Achieved 0.731 AUC on a private Kaggle leaderboard by engineering a hybrid categorical encoding strategy (smoothed target + frequency encoding) that resolved a 54.3% train/test category-distribution mismatch invisible to baseline models
- Reduced model overfitting gap from 0.255 (RF baseline) to 0.147 through systematic experiment progression across Random Forest, ElasticNet, and XGBoost architectures with 60-trial Bayesian hyperparameter optimization via Optuna
- Delivered SHAP-based model interpretability (global feature importance + beeswarm plots) on a 669-feature anonymized dataset using Python, scikit-learn, XGBoost, and the SHAP library
The dataset presented two compounding challenges: (1) 669 anonymized features with no domain context, and (2) a categorical subject feature that was the single most predictive variable (Gini importance 0.0429, 3x the next feature) but contained unseen categories in 54.3% of test rows. Standard label encoding has no meaningful integer for a category it never saw in training, so any value it assigns is arbitrary and the signal is lost; dropping the column entirely left substantial predictive power on the table (CV AUC 0.729 without the feature vs. 0.810 with the hybrid encoding).
The hybrid encoding scheme encodes subject twice in parallel. Smoothed target encoding, a Bayesian shrinkage estimator, blends each category's positive rate toward the global mean; frequency encoding maps each category to its training-set count. Unseen test categories receive the global positive rate (0.8449) under target encoding and a count of 0 under frequency encoding. Together the two soft signals achieve CV AUC 0.810, vs. 0.729 without the feature.
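A minimal sketch of the hybrid encoding described above, using pandas. The function name `hybrid_encode`, the target column `y`, the smoothing weight `m`, and the `subject_freq_enc` column name are illustrative assumptions; the project's actual smoothing formula and weight are not given here, so this uses one common shrinkage form, (category positives + m × global rate) / (category count + m).

```python
import pandas as pd


def hybrid_encode(train, test, col="subject", target="y", m=10.0):
    """Add smoothed-target and frequency columns for `col` to copies of train/test.

    `m` is a hypothetical smoothing weight: larger values pull small
    categories harder toward the global positive rate.
    """
    global_rate = train[target].mean()
    stats = train.groupby(col)[target].agg(["sum", "count"])

    # Shrinkage estimate per category, fitted on training rows only.
    target_map = (stats["sum"] + m * global_rate) / (stats["count"] + m)
    freq_map = stats["count"]

    encoded = []
    for df in (train, test):
        df = df.copy()
        # Categories absent from training fall back to the global rate / zero.
        df[f"{col}_target_enc"] = df[col].map(target_map).fillna(global_rate)
        df[f"{col}_freq_enc"] = df[col].map(freq_map).fillna(0).astype(int)
        encoded.append(df)
    return encoded[0], encoded[1]


# Toy data: "E" and "F" appear only in the test frame.
train = pd.DataFrame({"subject": ["A", "A", "A", "B", "B", "C"],
                      "y":       [1, 1, 0, 1, 1, 0]})
test = pd.DataFrame({"subject": ["A", "E", "F"]})
train_enc, test_enc = hybrid_encode(train, test, m=2.0)
```

In the toy frames, the unseen categories get the training positive rate and a frequency of 0. One caveat about this sketch: encoding training rows with statistics that include their own label can leak the target, so an out-of-fold variant is a common refinement.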
The final model replaces Random Forest with XGBoost, whose gradient-boosted trees tend to perform well on tabular data, and tunes 9 hyperparameters with a 60-trial Optuna Bayesian search. SHAP values confirm that subject_target_enc is the dominant driver with directional consistency: high encoded values (high positive-rate subjects) push predictions strongly toward class 1.
In this project, I built a binary classifier for an anonymized tabular dataset as part of a Kaggle-style competition in Harvard's Advanced Machine Learning course. The dataset had 669 features and a deliberate trap: the most important feature, subject, contained two categories in the test set that never appeared in training, affecting over half the test rows. My first step was to run a baseline Random Forest and use feature importance to identify this problem immediately. I then used PCA to visualize how one of the unseen categories — subject E — projected nearly perpendicular to all known subjects in the first two principal components, confirming that imputation would need to be soft. I designed a hybrid encoding that applied smoothed target encoding and frequency encoding simultaneously, preserving signal while gracefully handling the unknown categories. Layering XGBoost and Bayesian hyperparameter search on top pushed my final Kaggle AUC to 0.731, up from 0.592 with the naive baseline. I also computed SHAP values to explain which features the model relied on most and how their values directionally influenced predictions.
Most toy classification tutorials use clean, pre-processed benchmark datasets. This project is notable because:
- The core challenge was data engineering, not just model selection. The train/test category mismatch is a real-world problem (e.g., new customer segments, new geographic markets) that cannot be solved by choosing a fancier algorithm.
- Explainability is built-in, not bolted on. SHAP was used not as an afterthought but as a validation step to confirm the encoding's behavior and identify dominant drivers.
- The experiment log is complete. All six experiments are documented with hypothesis, method, result, and interpretation — including two experiments that underperformed (ElasticNet, XGBoost without encoding), showing methodical iteration rather than cherry-picking.
- Bayesian optimization was applied correctly. Optuna's TPE sampler was used with stratified cross-validation rather than a simple train/val split, avoiding optimistic bias.
Machine Learning & Statistics
- Binary classification under class imbalance
- Cross-validation strategy (StratifiedKFold, 5-fold)
- Gradient boosting (XGBoost) vs. bagged tree ensembles (Random Forest)
- Regularized linear models (ElasticNet with L1/L2 penalty)
- Dimensionality reduction (PCA) for EDA and distribution shift analysis
Feature Engineering
- Smoothed target encoding (Bayesian shrinkage)
- Frequency encoding
- Hybrid encoding for out-of-vocabulary categorical values
- Boruta-style probe feature selection
Hyperparameter Optimization
- Bayesian optimization with Optuna (TPE sampler)
- 60-trial search over 9 XGBoost parameters
Model Interpretability
- SHAP global feature importance (bar chart)
- SHAP beeswarm (directional impact per sample)
- Gini (impurity-based) feature importance for Random Forest; XGBoost's built-in feature importance
Engineering & Workflow
- Reproducible notebook pipeline
- Stratified submission CSV generation
- Version-controlled experiment results
Final Kaggle AUC: 0.731
| Milestone | CV AUC | Kaggle AUC |
|---|---|---|
| RF baseline | — | 0.592 |
| ElasticNet + probe features | 0.698 | 0.549 |
| RF + hybrid encoding (tuned) | 0.855 | 0.718 |
| XGBoost + hybrid + Bayesian | 0.859 | 0.731 |
Total AUC improvement over baseline: +0.139 absolute, +23.5% relative (0.592 → 0.731)
- Course: Harvard Extension School, CSCI E-82 — Advanced Machine Learning (Fall 2025)
- Format: Kaggle-style private leaderboard competition; hidden dataset domain
- Authors: Reid Sendroff and Joshua Harvey
- Submission date: October 17, 2025