Parkinson's Disease Assessment Portal

A comprehensive medical assessment system for Parkinson's disease diagnosis using the PPMI (Parkinson's Progression Markers Initiative) curated dataset. The system combines traditional machine learning, medical transformer models, and RAG-based report generation.

Dataset

Uses the PPMI Curated Data Cut CSV files containing patient data with the following key features:

Category	Features
Demographics	Age, Sex, Education Years, Race, BMI
Family History	Family PD history (categorical + binary)
Motor Symptoms	Tremor, Rigidity, Bradykinesia, Postural Instability
Non-Motor	REM sleep, Epworth Sleepiness, Depression (GDS), Anxiety (STAI)
Cognitive	MoCA, Clock Drawing, Benton JLO

Target classes (COHORT): HC (Healthy Control) · PD (Parkinson's Disease) · SWEDD (Scans Without Evidence of Dopaminergic Deficit) · PRODROMAL

Architecture

├── src/
│   ├── data_preprocessing.py      # Patient-level leak-free data pipeline
│   ├── web_interface.py            # Flask web app
│   ├── rag_system.py               # Medical knowledge base + report generation
│   ├── document_manager.py         # PDF/text document indexing (TF-IDF)
│   ├── feature_mapping.py          # Patient questionnaire ↔ PPMI feature mapping
│   ├── analyze_data.py             # Dataset EDA script
│   ├── train_traditional_models.py # Train LightGBM, XGBoost, SVM
│   ├── train_transformer_models.py # Train PubMedBERT, BioGPT, Clinical-T5
│   ├── train_multimodal.py         # Train multimodal ensemble
│   ├── evaluate_traditional_models.py
│   └── models/
│       ├── traditional_ml.py       # LightGBM, XGBoost, SVM wrappers
│       ├── transformer_models.py   # DistilBERT, BioBERT, PubMedBERT for tabular data
│       ├── medical_transformers.py # PubMedBERT, BioGPT, Clinical-T5 classifiers
│       └── multimodal_ml.py        # Stacking ensemble
├── templates/                      # Flask HTML templates
├── static/                         # CSS, JS assets
├── medical_docs/                   # Medical literature for RAG
├── models/saved/                   # Trained model weights
├── start_server.py                 # Entry point for web app
└── requirements.txt
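
The tree above notes that document_manager.py indexes documents with TF-IDF. As a rough illustration of that idea only, here is a minimal pure-Python sketch of TF-IDF indexing with cosine-similarity search. The function names (tokenize, vectorize, build_index, search) and the smoothed IDF formula are assumptions made for this example; they are not taken from the project code, which may well rely on a library vectorizer instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Build an L2-normalised sparse TF-IDF vector; terms missing from idf are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Return (idf, vectors) for a dict of {doc_id: text}."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = {doc_id: vectorize(tokens, idf) for doc_id, tokens in tokenized.items()}
    return idf, vectors


def search(query, idf, vectors, top_k=2):
    """Rank documents by cosine similarity to the query (vectors are unit length)."""
    q = vectorize(tokenize(query), idf)
    scores = {
        doc_id: sum(w * vec.get(t, 0.0) for t, w in q.items())
        for doc_id, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

In a RAG setting, the top-ranked passages would then be handed to the report generator as supporting context.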

Features

Leak-Free Preprocessing: Patient-level train/test split ensures no patient appears in both sets
Traditional ML: LightGBM, XGBoost, SVM with class weight balancing
Medical Transformers: PubMedBERT (encoder), BioGPT (decoder), Clinical-T5 (encoder-decoder)
Multimodal Ensemble: Stacking ensemble combining all model predictions
RAG-Enhanced Reports: Retrieves medical literature to generate comprehensive diagnostic reports
Web Interface: Patient assessment form with automated report generation and PDF export
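
To make the leak-free claim concrete, the sketch below shows one way a patient-level split can work: patient identifiers are shuffled and partitioned first, and visit rows then follow their patient. This is an illustration and not the code in data_preprocessing.py. The function name patient_level_split, the id_key default of "PATNO", and the 20% test fraction are all assumptions for the example.

```python
import random


def patient_level_split(records, id_key="PATNO", test_fraction=0.2, seed=42):
    """Split visit-level records so each patient lands in exactly one partition.

    `records` is a list of dicts; `id_key` names the patient identifier column
    (assumed here to be "PATNO").
    """
    # Work on unique patients, not rows, so repeated visits cannot straddle the split
    patient_ids = sorted({rec[id_key] for rec in records})
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [rec for rec in records if rec[id_key] not in test_ids]
    test = [rec for rec in records if rec[id_key] in test_ids]
    return train, test
```

A row-level random split would let different visits of the same patient appear on both sides, which tends to inflate test scores. With scikit-learn available, GroupShuffleSplit offers the same behaviour when the patient identifier is passed as the groups argument.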

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate    # Linux/Mac
venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA (recommended for GPU training)
pip install torch --index-url https://download.pytorch.org/whl/cu124

requirements.txt now includes sacremoses, which BioGPT needs for tokenization. If the A4000 preflight fails on sacremoses, rerun pip install -r requirements.txt inside the training venv.

RTX A4000 Setup

For the RTX A4000 training machine, use this flow from the project root.

Ubuntu:

source venv/bin/activate
bash check_a4000_ready.sh
bash train_a4000_models.sh

Windows:

venv\Scripts\activate
python check_a4000_ready.py
train_a4000_models.bat

What the preflight checks:

CUDA-enabled PyTorch import and torch.cuda.is_available()
detected GPU name, CUDA version, and VRAM
BioGPT tokenizer dependency (sacremoses)
required PPMI CSV files
medical_docs/ availability for RAG training
free disk space and output path write access
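
For readers who want to adapt the preflight to another machine, the following is a minimal standard-library sketch of the non-GPU checks in the list above (dependency presence, required files, write access, free disk space). It is not the contents of check_a4000_ready.py. The function names, the result keys, and the min_free_gb threshold are assumptions. The GPU checks are left out so the sketch runs without PyTorch; on a training machine they would call torch.cuda.is_available() as listed above.

```python
import importlib.util
import os
import shutil
import tempfile


def module_available(name):
    """True if a top-level module such as 'sacremoses' can be imported."""
    return importlib.util.find_spec(name) is not None


def is_writable(directory):
    """True if a temporary file can be created inside `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def free_disk_gb(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def run_preflight(output_dir, required_files, required_modules, min_free_gb=1.0):
    """Return {check_name: bool}; a missing path yields False instead of raising."""
    results = {}
    for mod in required_modules:
        results[f"module:{mod}"] = module_available(mod)
    for path in required_files:
        results[f"file:{path}"] = os.path.isfile(path)
    results["output_writable"] = is_writable(output_dir)
    # Only query disk usage when the directory is usable, so this cannot raise
    results["disk_space"] = bool(
        results["output_writable"] and free_disk_gb(output_dir) >= min_free_gb
    )
    return results
```

Returning a dictionary instead of exiting on the first failure lets a wrapper script print every failing check in one pass.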

Helper scripts:

check_a4000_ready.sh / check_a4000_ready.bat run the GPU/data preflight
train_a4000_models.sh / train_a4000_models.bat run preflight, then start training with --gpu-profile rtx-a4000
resume_a4000_training.sh / resume_a4000_training.bat resume the same run if the session is interrupted

The A4000 training recipe now defaults to class-weighted focal loss and keeps the best transformer checkpoint by validation F1, with validation loss used only as a tie-breaker.
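
The two ideas in the paragraph above can be sketched in a few lines of plain Python. Focal loss scales the cross-entropy of each sample by (1 - p_t)^gamma, so confidently correct samples contribute less, and a per-class weight compensates for class imbalance. The inverse-frequency weighting, the tolerance used to detect an F1 tie, and every function name here are assumptions for illustration; the actual recipe lives in the training scripts and may differ. The gamma default of 1.5 mirrors the --focal-gamma value in the commands below.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def balanced_class_weights(labels, n_classes):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (n_classes * c) if c else 0.0 for c in counts]


def focal_loss(logits, target, class_weights, gamma=1.5):
    """Class-weighted focal loss for a single sample."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def is_better(candidate, best, tol=1e-6):
    """Higher validation F1 wins; validation loss is consulted only on a tie."""
    if best is None:
        return True
    if candidate["val_f1"] > best["val_f1"] + tol:
        return True
    if abs(candidate["val_f1"] - best["val_f1"]) <= tol:
        return candidate["val_loss"] < best["val_loss"]
    return False
```

With gamma set to 0 the expression reduces to ordinary class-weighted cross-entropy, which is a convenient sanity check when tuning.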

Recommended direct commands:

python src/train_model_suite.py train --run-name a4000_full --gpu-profile rtx-a4000 --epochs 30 --patience 8 --traditional-trials 6 --transformer-trials 6 --transformer-loss focal --focal-gamma 1.5
python src/train_model_suite.py resume --run-name a4000_full --gpu-profile rtx-a4000 --epochs 30 --patience 8 --traditional-trials 6 --transformer-trials 6 --transformer-loss focal --focal-gamma 1.5
python src/train_model_suite.py status --run-name a4000_full

Usage

Train Models

cd src

# Train traditional ML models
python train_traditional_models.py

# Train transformer models (GPU recommended)
python train_transformer_models.py

# Train multimodal ensemble
python train_multimodal.py

For the full resumable training pipeline with the A4000 profile, run from the project root instead of src/:

bash train_a4000_models.sh

Run Web App

# From project root
python start_server.py
# Access at http://localhost:5000

Evaluate Models

cd src
python evaluate_traditional_models.py

Model Performance

Each model below is evaluated on a held-out test set produced by the patient-level split:

Model	Type
LightGBM	Gradient Boosting
XGBoost	Gradient Boosting
SVM (RBF)	Support Vector Machine
PubMedBERT	Encoder-only Transformer
BioGPT	Decoder-only Transformer
Clinical-T5	Encoder-Decoder Transformer
Multimodal Ensemble	Stacking (all above)
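
This README does not say which metric is reported on the test set. Since checkpoint selection elsewhere uses validation F1, a per-class and macro-averaged F1 over the four COHORT labels is one plausible choice. The sketch below is an assumption in that spirit and is not the logic in evaluate_traditional_models.py; the function names are hypothetical.

```python
COHORT_LABELS = ["HC", "PD", "SWEDD", "PRODROMAL"]


def per_class_f1(y_true, y_pred, labels=COHORT_LABELS):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=COHORT_LABELS):
    """Unweighted mean of per-class F1, so rare cohorts count as much as common ones."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)
```

Macro averaging is sensitive to small classes such as SWEDD, which makes it a stricter summary than plain accuracy on an imbalanced cohort distribution.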

License

This project uses data from the Parkinson's Progression Markers Initiative (PPMI).
