SingleCell_type/README.md at main · birchgit/SingleCell_type

Project Overview

This project compares deep neural network (DNN) and XGBoost models for classifying cell types from single-cell RNA sequencing (scRNA-seq) data. We evaluate model performance on two datasets of different sizes (50K and 60K cells), using three preprocessing strategies:

Raw features
PCA-reduced features
Randomly undersampled (RUS) data

Workflow Summary

Data Selection
Preprocessing & Data Reduction
Data Splitting
Model Training (DNN, XGBoost)
Evaluation & Metrics
Results & Visualization

Project Workflow

1. Data Selection

Selected two large scRNA-seq datasets (50K+ cells each) from the Cellxgene database.
Chose datasets with a diverse range of cell types and class imbalance to simulate realistic classification challenges.

2. Preprocessing and Data Reduction

Normalization: Applied log normalization and scaling.
Filtering: Removed low-quality cells with:
- High mitochondrial gene expression
- Low gene expression
Annotation: Annotated cell types for supervised learning.
PCA (Principal Component Analysis): Reduced the feature space from thousands of genes to a lower-dimensional representation that captures most of the variance.
Random Undersampling (RUS): Reduced the number of majority-class cells to address class imbalance while preserving minority cell types.
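
The undersampling step can be sketched in plain Python as below. This is a minimal illustration, not the code from the notebooks: the function name `random_undersample`, the per-class cap `max_per_class`, and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def random_undersample(labels, max_per_class, seed=0):
    """Return sorted cell indices, keeping at most `max_per_class` per label.

    Classes at or below the cap are kept whole, so minority cell types survive.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        if len(indices) > max_per_class:
            # Majority class: draw a random subset without replacement.
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


# Toy, imbalanced label vector (hypothetical cell types and counts).
labels = ["T cell"] * 1000 + ["B cell"] * 300 + ["NK cell"] * 40
kept = random_undersample(labels, max_per_class=100)
counts = Counter(labels[i] for i in kept)
print(counts)
```

The returned indices can then be used to subset both the expression matrix and the label vector so the two stay aligned.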

Each preprocessing strategy (Raw, PCA, RUS) was used to create distinct input datasets for model evaluation.
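
As a rough illustration of how the PCA variant could be derived from a count matrix, the sketch below log-normalizes synthetic counts and projects them onto the leading principal components with NumPy. The scale factor `target_sum=1e4`, the number of components, and the random data are assumptions for the example; the notebooks may use different tooling and settings.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def pca_reduce(X, n_components):
    """Project X (cells x genes) onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each gene
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = S[:n_components] ** 2 / (S ** 2).sum()
    return scores, explained


# Synthetic stand-in for a cells x genes count matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(200, 50)).astype(float)

X = log_normalize(counts)
scores, explained = pca_reduce(X, n_components=10)
print(scores.shape, float(explained.sum()))
```

`explained` holds the fraction of total variance carried by each retained component, which is one way to choose how many components to keep.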

3. Data Splitting

Split ratio: 80% training / 20% testing.
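
A minimal sketch of an 80/20 split is shown below. It stratifies by cell type so that each class appears in both partitions; stratification is an assumption here, since the README only states the ratio. The helper name `stratified_split` and the toy labels are likewise illustrative.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, holding out `test_fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy label vector (hypothetical cell types and counts).
labels = ["T cell"] * 500 + ["B cell"] * 300 + ["NK cell"] * 200
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_counts)
```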

4. Model Training

Trained two types of models:
- Deep Neural Networks (DNN)
- XGBoost classifiers
Each model was trained on all 3 dataset versions (raw, PCA, RUS).
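
The resulting set of experiments can be enumerated as a small grid. The sketch below assumes that both models were run on both dataset sizes and all three variants; the dictionary keys are illustrative and do not correspond to anything in the notebooks.

```python
from itertools import product

MODELS = ("DNN", "XGBoost")
SIZES = ("50K", "60K")
VARIANTS = ("Raw", "PCA", "RUS")

# One entry per model / dataset size / preprocessing variant combination.
runs = [
    {"model": model, "dataset": size, "variant": variant}
    for model, size, variant in product(MODELS, SIZES, VARIANTS)
]

for run in runs:
    print(f"{run['model']:8s} {run['dataset']} {run['variant']}")
```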

5. Evaluation

Evaluated using:
- Accuracy
- F1 Score
- Confusion Matrix
Benchmarked computational performance for each model-dataset combination.
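
The metrics above can all be computed from a confusion matrix, as in the following dependency-free sketch. Macro averaging of F1 is an assumption (the README does not say which averaging was used), and the labels and predictions are made up for the example.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


classes = ["B cell", "NK cell", "T cell"]
y_true = ["T cell"] * 4 + ["B cell"] * 3 + ["NK cell"] * 3
y_pred = ["T cell", "T cell", "T cell", "B cell",
          "B cell", "B cell", "T cell",
          "NK cell", "NK cell", "T cell"]

cm = confusion_matrix(y_true, y_pred, classes)
acc = accuracy(cm)
f1 = macro_f1(cm)
print(cm, acc, round(f1, 4))
```

With imbalanced cell types, macro F1 weights rare classes as heavily as common ones, so it can diverge noticeably from accuracy.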

6. Results & Visualization

Performance metrics were visualized using matplotlib.
Appendix figures demonstrate model scalability and resource consumption.
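
One way such runtime and memory figures could be collected is sketched below with the standard library. The `benchmark` helper and the `dummy_training` stand-in are hypothetical. Note that `tracemalloc` only tracks memory allocated by Python itself, so native allocations made by libraries such as XGBoost or a deep learning framework would need a different tool.

```python
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run `fn` once and report wall-clock time and peak Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak}


def dummy_training(n_cells):
    """Stand-in workload; a real run would fit a model here."""
    data = [float(i) for i in range(n_cells)]
    return sum(data)


stats = {}
for n_cells in (50_000, 60_000):
    result, stats[n_cells] = benchmark(dummy_training, n_cells)
    print(n_cells, stats[n_cells])
```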

Repository Structure

.
├── notebooks/
│   ├── DNN_50K_Raw.ipynb
│   ├── DNN_50K_PCA.ipynb
│   ├── DNN_50K_RUS.ipynb
│   ├── DNN_60K_Raw.ipynb
│   ├── DNN_60K_PCA.ipynb
│   ├── DNN_60K_RUS.ipynb
│   └── XGBoost.ipynb
├── figures/
│   └── appendix_figure_A_1.png
└── README.md
