This doc outlines confirmed data status, 2026 best practices, and a concrete roadmap to revive and modernize the repo.
All datasets load successfully via portfolio_utils.data_loader (Kaggle API or local data/):
| Dataset | Loader | Shape / notes |
|---|---|---|
| Corporate Bankruptcy | `load_bankruptcy()` | (6819, 96) |
| Telecom Churn | `load_telecom_churn()` | (7043, 34); columns normalized |
| NJ Transit | `load_nj_transit()` | (98698, 13) |
| Heart Disease | `load_heart()` | (1025, 14) |
| NYC Bus | `load_nyc_bus()` | (6.7M, 17); use `on_bad_lines='skip'` |
| Jane Street | `load_jane_street()` | Optional; requires competition accept |
Run once: `python setup_data.py --no-jane-street`. Smoke test: `python scripts/smoke_test_notebooks.py`.
- Pipelines: Use `sklearn.pipeline.Pipeline` (or `make_pipeline`) so preprocessing (scaler, imputer, selector) is fitted only on train and applied consistently to test. This avoids data leakage and forgotten steps.
- Seeds: Set `np.random.seed`, `random.seed`, and estimator `random_state=` (and in TF: `tf.keras.utils.set_random_seed`) so runs are reproducible.
- Locked envs: Prefer `pyproject.toml` + lockfile (`uv.lock` with `uv`, or `pip-tools` with `requirements.in` → `requirements.txt`) so installs are reproducible.
- Dataset versioning (optional): DVC or a simple `data/README.md` with dataset source URLs and download dates.
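
The pipeline and seeding practices above can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the `set_seed` helper here is a stand-in for the one in `portfolio_utils.ml_utils` (whose implementation is not shown in this doc), `build_pipeline` and `SEED = 42` are hypothetical, and the data is synthetic.

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumed value; any fixed integer works


def set_seed(seed: int = SEED) -> None:
    """Seed Python's and NumPy's global RNGs (hypothetical stand-in helper)."""
    random.seed(seed)
    np.random.seed(seed)


def build_pipeline(seed: int = SEED) -> Pipeline:
    """Imputer -> scaler -> model, fitted together as one estimator."""
    return Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
            ("model", LogisticRegression(max_iter=1000, random_state=seed)),
        ]
    )


set_seed()
# Synthetic stand-in for a real dataset such as Bankruptcy or Churn.
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

pipe = build_pipeline()
pipe.fit(X_train, y_train)  # preprocessing statistics come from train only
test_accuracy = pipe.score(X_test, y_test)  # the fitted steps are reused on test
print(f"Holdout accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, its mean and variance are learned from `X_train` alone, which is the property that prevents leakage.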
- Stratified splits: Use `StratifiedKFold` / `stratify=` in `train_test_split` for classification so train/test class balance is representative.
- Multiple metrics: Report precision, recall, F1, and (for imbalanced data) PR-AUC or ROC-AUC; avoid relying only on accuracy.
- Cross-validation: Use `cross_validate` or `cross_val_predict` with a pipeline (preprocessing + model) so scaling/selection is not fitted on test folds.
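
As a rough sketch of the evaluation practices above, the snippet below runs stratified 5-fold cross-validation on a pipeline and reports several metrics at once. The dataset is synthetic and imbalanced by construction, the model choice is arbitrary, and scikit-learn's `average_precision` scorer is used as the usual approximation of PR-AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (about 10% positives) standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.9, 0.1], random_state=0
)

# The scaler is refitted inside every training fold, never on the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scoring = {
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "roc_auc": "roc_auc",
    "pr_auc": "average_precision",
}
results = cross_validate(pipe, X, y, cv=cv, scoring=scoring)

# Average each metric over the five folds.
summary = {name: float(results[f"test_{name}"].mean()) for name in scoring}
for name, value in summary.items():
    print(f"{name:>9}: {value:.3f}")
```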
- SHAP: Add a short "Model interpretability" section in 1–2 notebooks (e.g. Bankruptcy, Churn) using `shap.TreeExplainer` for tree models or `shap.KernelExplainer` for others; use `shap.summary_plot` for a global view and `shap.force_plot` for one-off explanations. This keeps the narrative on "why does the model say this?".
- Feature importance: For tree models, use built-in `feature_importances_` plus a bar plot; mention that SHAP gives instance-level and more consistent attributions.
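
The built-in importance route can be sketched as below. It covers only `feature_importances_` (SHAP is left out to keep dependencies minimal), and it assumes a random forest on synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, sorted so the strongest features come first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))

# To draw the bar plot (requires matplotlib):
# importances.sort_values().plot.barh(title="Impurity-based importance")
```

These scores are global and normalized to sum to 1, so they rank features but do not explain any single prediction.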
- Pandas 2.x: Use `on_bad_lines='skip'` (or `'warn'`) instead of `error_bad_lines` / `warn_bad_lines`, which were deprecated in pandas 1.3 and removed in 2.0.
- sklearn: Prefer `Pipeline`, `ColumnTransformer` for mixed types, and `cross_validate`; keep `GridSearchCV` / `RandomizedSearchCV` on the pipeline.
- Type hints: Use in `portfolio_utils` and any new modules (e.g. `def load_*(...) -> pd.DataFrame`).
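
A small sketch of the pandas and type-hint points above: a typed loader that passes `on_bad_lines='skip'` to `pd.read_csv`. The function name `load_csv_lenient` and the inline CSV are illustrative assumptions, not the repo's actual loader or data.

```python
import io
from typing import IO, Union

import pandas as pd


def load_csv_lenient(source: Union[str, IO[str]]) -> pd.DataFrame:
    """Read a CSV, dropping rows that have too many fields (hypothetical helper)."""
    return pd.read_csv(source, on_bad_lines="skip")


# The second data row has an extra field and would raise a ParserError by default.
RAW = "route,delay_min\nM15,4\nB46,7,unexpected_extra_field\nQ58,2\n"

df = load_csv_lenient(io.StringIO(RAW))
print(df)
```

Skipping is silent, so `on_bad_lines='warn'` may be the safer choice while exploring a new file.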
- Model serving: FastAPI endpoint for the churn pipeline with `/predict`, `/predict/batch`, `/explain` (SHAP), health check, and model metadata. Dockerized.
- Experiment logging: Log key runs (params + metrics) to a small JSON/CSV or to MLflow for a single "flagship" notebook.
- CI: GitHub Action that runs `scripts/smoke_test_notebooks.py` (and optionally `setup_data.py` with a small subset) so notebooks stay runnable.
- Containers: `Dockerfile` + `docker-compose` for "run everything in one command."
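
For the lightweight experiment-logging option above, one possible shape is an append-only JSON Lines file. The helper names `log_run` and `read_runs`, the file name, and the parameter and metric values are all placeholders for illustration, not real results or repo code.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def log_run(
    log_path: Path, params: dict[str, Any], metrics: dict[str, float]
) -> dict[str, Any]:
    """Append one run (timestamp, params, metrics) as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_runs(log_path: Path) -> list[dict[str, Any]]:
    """Load every logged run back into a list of dicts."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo with placeholder numbers; a temp directory keeps the repo clean.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"model": "baseline", "k_best": 10}, {"f1": 0.5})
log_run(log_file, {"model": "candidate", "k_best": 20}, {"f1": 0.6})

runs = read_runs(log_file)
best = max(runs, key=lambda run: run["metrics"]["f1"])
print(best["params"])
```

Appending one line per run keeps the log diff-friendly and easy to load into pandas later.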
- DuckDB: `portfolio_utils.db_utils` provides a `DuckDBLoader` class for SQL-based EDA and feature engineering alongside pandas. Pre-built SQL queries for NJ Transit delay analysis and churn segmentation demonstrate the warehouse-query-then-model pattern used in production.
- Small "from scratch" project: One notebook using only `pyproject.toml` + `portfolio_utils`: load data → pipeline (preprocess + model) → cross_validate → SHAP summary. Shows clean structure.
- RAG/LLM: Keep the Streamlit RAG demo; optionally add a second app (e.g. simple API with FastAPI) or a notebook that uses the same embedding + retrieval logic.
- Time series: NJ Transit / rail could be extended with a proper time-based split and a simple forecast baseline (e.g. seasonal naive or small LSTM/Transformer) to show awareness of temporal leakage and modern TS tools.
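
For the time-series idea above, a time-based split with a seasonal-naive baseline might look like this. The series is synthetic, and the weekly season length and 14-day horizon are assumptions chosen for illustration.

```python
import numpy as np

SEASON = 7  # assumed weekly seasonality on daily data
HORIZON = 14  # assumed number of held-out days

# Synthetic daily series: a weekly pattern plus noise (not real NJ Transit data).
rng = np.random.default_rng(0)
n_days = 70
pattern = np.array([5.0, 6.0, 6.5, 6.0, 8.0, 3.0, 2.0])
series = np.tile(pattern, n_days // SEASON) + rng.normal(0, 0.5, n_days)

# Time-based split: the most recent days are held out and nothing is shuffled,
# so the model never sees observations from after the forecast origin.
train, test = series[:-HORIZON], series[-HORIZON:]


def seasonal_naive(history: np.ndarray, steps: int, season: int) -> np.ndarray:
    """Forecast by repeating the last full season observed in `history`."""
    last_season = history[-season:]
    reps = int(np.ceil(steps / season))
    return np.tile(last_season, reps)[:steps]


forecast = seasonal_naive(train, HORIZON, SEASON)
mae = float(np.mean(np.abs(forecast - test)))
print(f"Seasonal-naive MAE over {HORIZON} days: {mae:.2f}")
```

Any LSTM or Transformer added later would need to beat this MAE on the same split to justify its complexity.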
| Priority | Task | Status |
|---|---|---|
| High | `sklearn.pipeline.Pipeline` in notebooks (Bankruptcy, Churn, Heart, etc.) | ✓ Done |
| High | `set_seed()` + consistent `random_state` | ✓ Done |
| High | `pyproject.toml` + `uv sync` | ✓ Done |
| High | FastAPI model serving endpoint (`api/`) with Docker | ✓ Done |
| High | pytest suite (`tests/`) for utils, data loaders, API, training | ✓ Done |
| High | Quantified business impact in notebooks (Bankruptcy, Churn) | ✓ Done |
| Medium | SHAP sections in notebooks | ✓ Done |
| Medium | `portfolio_utils.ml_utils` (seeds, pipelines, SHAP) | ✓ Done |
| Medium | DuckDB SQL analytics layer (`portfolio_utils.db_utils`) | ✓ Done |
| Medium | GitHub Action: smoke test on push | |
| Low | DVC or `data/README.md` for dataset versioning | |
- `Modern_Classification_Workflow_Bankruptcy.ipynb`: end-to-end example with `set_seed`, data loading, stratified split, a single pipeline (scaler → SelectKBest → XGBoost), `cross_validate` with multiple metrics, holdout evaluation, SHAP summary plot, and quantified business impact.
- `docs/BEST_PRACTICES.md`: thorough explanation of each practice (reproducibility, pipelines, stratified splits, multiple metrics, SHAP, locked envs) and how to apply them in this repo.
- `api/`: FastAPI model serving (train → serialize → serve → predict → explain) with Dockerfile and docker-compose.
- `tests/`: pytest suite covering ml_utils, data_loader, db_utils, API endpoints, and training pipeline.
- Kaggle API–backed data loaders; all five main datasets load successfully.
- `DATA_DIR` and Colab fallbacks keep notebooks runnable locally and in the cloud.
- Pandas 2–friendly options (e.g. `on_bad_lines='skip'` for NYC Bus).
- `display.max_columns` / `display.max_rows` fixes for pandas OptionError.
- Smoke test script to verify notebooks run without errors.
- README with setup and data verification command.
Use this file as a living checklist: tick items as you implement them and add new 2026 techniques as you adopt them.
Goal: a hiring manager or recruiter opens the repo and thinks "we need to hire Drake Talley."
| Priority | Task | Status |
|---|---|---|
| High | README leads with value prop + Quick links (this repo, nbviewer, resume, LinkedIn) | ✓ Done |
| High | "Projects at a glance" table — one-line impact per project | ✓ Done |
| High | "For recruiters & hiring managers" — 5-min tour + skills map | ✓ Done |
| High | FastAPI model serving + Dockerfile (proves deployment ability) | ✓ Done |
| High | pytest test suite (proves engineering discipline) | ✓ Done |
| High | Quantified business impact in notebooks | ✓ Done |
| High | DuckDB data engineering layer | ✓ Done |
| Medium | Add real resume link and LinkedIn URL in README (replace placeholders) | |
| Medium | Ensure GitHub repo description and topics are set (e.g. data-science, machine-learning, portfolio) | |
| Low | Optional: 2–3 min Loom/walkthrough linked in README | |
| Low | Optional: "What I'd do in the first 90 days" or case study tied to target role/industry | |