Draft
99 commits
9073d65
Adding python scripts
May 10, 2024
2bc2c28
remove useless comment
May 10, 2024
790e1e9
update latest python methods
Mar 1, 2025
cf25fef
added the bash script for running the benchmarking on PCA (not DM)
Mar 1, 2025
305ec54
bash script for generating synthetic labels
Mar 1, 2025
f4eb9b1
revised the bash script for synthetic and real datasets (only python …
Mar 3, 2025
bff34ab
generate synthetic labels using PCA layer for datasets
Mar 3, 2025
0636f2c
generate the synthetic labels on the diffusion map space, default in …
Mar 3, 2025
1892afd
R evaluation methods on dm layer
Mar 3, 2025
c283ffa
Update bm_syn_python_synthetic_pca_final.sh
TracyY123 Mar 3, 2025
1cba6a4
evaluate python methods on the synthetic datasets on diffusion map (d…
Mar 3, 2025
10738f0
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 3, 2025
74bf546
python methods evaluation on real datasets diffusion map (dm=30)
Mar 3, 2025
e09076f
R methods evaluation for datasets on pca
Mar 3, 2025
5da6f83
put all original bash scripts from benchmarkDA in this folder
Mar 3, 2025
59601f6
remove original bash scripts from current path, save them in the orig…
Mar 3, 2025
54fb98f
synchronize changes in local and github for r method running files
Mar 3, 2025
69d14ad
update the renv file to include which packages are used
Mar 3, 2025
9fb7dba
export environment for running the benchmarking
Mar 3, 2025
22ac302
remove previous python codes, which may cause confusion
Mar 3, 2025
0065d31
changes the save paths and Slurmlog in all scripts to be thje same
Mar 3, 2025
4d4af0b
modified and organized R script for performing DA methods, combine di…
Mar 3, 2025
558daea
Combined R scripts for running DA methods on PCA layer for synthetic …
Mar 3, 2025
9f782c5
notebooks for calculating DM, transfer rds and anndata, and pca level…
Mar 4, 2025
8d8d7f8
diffusion map space packages performance evaluation, no batch effect …
Mar 4, 2025
6108c28
small revise on saving path
Mar 4, 2025
811ef50
Update README.md
TracyY123 Mar 4, 2025
e95a9b5
small change on saving the file path
Mar 4, 2025
4c4d474
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 4, 2025
9f307dc
small changes on saving file path
Mar 4, 2025
1fa5b6e
small change on saving file
Mar 4, 2025
572d346
Update README.md
TracyY123 Mar 4, 2025
cf35c61
debug
Mar 4, 2025
39f1f57
Update README.md
TracyY123 Mar 4, 2025
2da63e6
Update README.md
TracyY123 Mar 4, 2025
9bcb520
Update README.md
TracyY123 Mar 4, 2025
1b5dfe0
re-organize the code structure
Mar 5, 2025
467cc6a
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 5, 2025
3c84614
removed python scripts which are not in use
Mar 5, 2025
0d8bb02
remove incorrect environemnt setting script
Mar 5, 2025
77300dd
update bash script after code structure re-organizing
Mar 5, 2025
6aeb525
revised scripts after code reorganization
Mar 5, 2025
960b10e
updated wrong file
Mar 5, 2025
a6c2f1e
remove scripts which are not using
Mar 5, 2025
aec4993
add revised bash script for synthetic datasets on diffusion map
Mar 5, 2025
7389d97
bash script for real datasets, diffusion map
Mar 5, 2025
2748d94
debug on bash script, tried to avoid permission denied error when rea…
Mar 6, 2025
a20a007
debug on script
Mar 6, 2025
89a1550
remove absolute path
Mar 6, 2025
2653216
remove readme not in use
Mar 6, 2025
c0a33d1
delete useless __pychache__/
Mar 6, 2025
10e9c0d
remove useless .DS_store
Mar 6, 2025
7f7bbc9
Update .gitignore
TracyY123 Mar 6, 2025
62e631e
remove useless .DS_Store file
Mar 6, 2025
9211251
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 6, 2025
05a3473
update small synthetic datasets
Mar 6, 2025
24022a4
add small branch rds datasets, anndata sets are not uploaded
Mar 6, 2025
fbd74cc
Create README.md
TracyY123 Mar 6, 2025
8c2c051
Update README.md
TracyY123 Mar 6, 2025
33094d1
Create README.md
TracyY123 Mar 6, 2025
2252fd4
Update README.md
TracyY123 Mar 6, 2025
c1d94c5
upload small cluster datasets in rds format, anndata too large to upload
Mar 6, 2025
81f4cce
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 6, 2025
a800c85
upload small bcl-xl datasets, rds and h5ad both uploaded
Mar 6, 2025
a7e7167
Create readme files for saving the path of large real datasets
Mar 6, 2025
c4fedd4
Update README.md
TracyY123 Mar 6, 2025
6d0b7bc
Update README.md
TracyY123 Mar 6, 2025
bfcbc72
Update README.md
TracyY123 Mar 6, 2025
94dd722
Update README.md
TracyY123 Mar 6, 2025
3afb9bd
Create README.md
TracyY123 Mar 6, 2025
0ec2dab
remove large datasets uploaded to github
Mar 6, 2025
014240b
upload scripts for dataset preprocessing
Mar 7, 2025
69d1d40
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 7, 2025
2612947
modified bash scripts for running jobs
Mar 7, 2025
b76a536
revised pipeline for benchmarking
Mar 7, 2025
9455bb5
Update README.md
TracyY123 Mar 7, 2025
4cfd1a2
Update README.md
TracyY123 Mar 7, 2025
6d7d769
Update README.md
TracyY123 Mar 7, 2025
a18b8ef
Update README.md
TracyY123 Mar 7, 2025
b5db5c9
Update README.md
TracyY123 Mar 7, 2025
8406356
small debug
Mar 7, 2025
8591105
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 7, 2025
87e9874
small debug
Mar 7, 2025
f810e36
small debug
Mar 7, 2025
64cbca3
debug, change the sparse expression matrix to array
Mar 11, 2025
34b40f5
debug for running benchmarking on cytof diffusion map
Mar 11, 2025
1735fe5
simplified testing plan
katosh Mar 11, 2025
ad2d06a
Merge origin/revision_update: fix sparse matrix handling in CyTOF dat…
katosh Mar 11, 2025
0172378
script contains sbatch call
katosh Mar 11, 2025
84f0dbf
debugged pipeline
Mar 13, 2025
42aad88
removed the methods folder, has tested that the pipeline run well
Mar 13, 2025
9deedd7
add batch_sd after using simulation of batch effect on the log space,…
Mar 17, 2025
738f888
Update README.md
TracyY123 Mar 18, 2025
bcc3a2d
upload the updated main.sh
Mar 18, 2025
a272a1d
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 18, 2025
ff64bc0
Update main.sh
TracyY123 Mar 18, 2025
01987f4
added the aging dataset
Mar 31, 2025
33deeb2
Merge branch 'tracy' of github.com:settylab/benchmarkDA_private into …
Mar 31, 2025
4528ec4
add aging dataset to the main.sh
Apr 1, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -3,3 +3,6 @@ data/
.ipynb_checkpoints/
scripts/core
notebooks/
__pycache__/
.DS_Store
.nfs*
3 changes: 0 additions & 3 deletions .gitmodules
@@ -1,3 +0,0 @@
[submodule "methods/mellon"]
path = methods/mellon
url = git@github.com:settylab/benchmarkDA_mellon_sub.git
167 changes: 121 additions & 46 deletions README.md
@@ -1,67 +1,142 @@
# bmDA
# BenchmarkDA: Differential Abundance Testing Benchmarks

The implementation of benchmarking the performance of differential abundance (DA) testing methods including 5 clustering-free methods:
This repository contains a unified framework for benchmarking differential abundance (DA) testing methods in single-cell data.

1. [Testing for differential abundance in mass cytometry data (Cydar)](https://www.nature.com/articles/nmeth.4295)
2. [Detection of differentially abundant cell subpopulations in scRNA-seq data (DAseq)](https://www.pnas.org/doi/abs/10.1073/pnas.2100293118)
3. [Quantifying the effect of experimental perturbations at single-cell resolution (MELD)](https://www.nature.com/articles/s41587-020-00803-5)
4. [Differential abundance testing on single-cell data using k-nearest neighbor graphs (Milo)](https://www.nature.com/articles/s41587-021-01033-z)
5. [Co-varying neighborhood analysis identifies cell populations associated with phenotypes of interest from single-cell transcriptomics (CNA)](https://www.nature.com/articles/s41587-021-01066-4),
## Overview

and a clustering-based method, *Louvain*.
Differential abundance testing aims to identify cell types or states that change in proportion between conditions in single-cell data. This benchmark evaluates multiple methods:

## Dependencies
**Python Methods:**
- MELD
- CNA (Co-varying Neighborhood Analysis)
- Mellon

Running the benchmarking code requires a list of R and Python packages. The R packages needed are:
**R Methods:**
- Milo
- DAseq
- CyDAR

- argparse
- SingleCellExperiment
- scran
- [DAseq](https://github.com/KlugerLab/DAseq)
- [miloR](https://github.com/MarioniLab/miloR)
- tibble
- dplyr
- tidyverse
- igraph
- [cydar](http://bioconductor.org/packages/cydar)
- pdist
- reshape2
The benchmark evaluates these methods using both real datasets and synthetic topologies, with various parameters including:
- PCA and diffusion map (DM) embeddings
- Different cell populations
- Various enrichment levels
- Different batch effect strengths
- Multiple random seeds
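
As a rough illustration of why the number of jobs grows quickly, the parameter axes listed above can be combined into a grid. The sketch below uses made-up values for every axis (the population names, enrichment levels, batch-effect strengths and seeds are all assumptions, not the values used in this repository):

```python
from itertools import product

# Hypothetical values for illustration only; the real grid is defined by the repository.
embeddings = ["pca", "dm"]
populations = ["pop_a", "pop_b"]
enrichments = [0.75, 0.9]
batch_sds = [0.0, 0.5]
seeds = [1, 2, 3]

# One dictionary per benchmark run: the Cartesian product of all axes.
grid = [
    {"embedding": e, "population": p, "enrichment": enr, "batch_sd": b, "seed": s}
    for e, p, enr, b, s in product(embeddings, populations, enrichments, batch_sds, seeds)
]

print(len(grid))  # 2 * 2 * 2 * 2 * 3 = 48 runs per method and dataset
```

Even this small grid gives 48 runs per method and dataset, which is consistent with the need for a job scheduler.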

The Python packages needed are
## Requirements

- [MELD](https://github.com/KrishnaswamyLab/MELD)
- [cna](https://github.com/immunogenomics/cna)
- scanpy
- [graphtools](https://github.com/KrishnaswamyLab/graphtools)
- scikit-learn
- multianndata
Our implementation requires a Slurm job scheduler since we need to run thousands of parallel jobs.

## Data
### Dependencies

- Synthetic datasets and BCR-XL dataset are available under the `data` directory.
- The COVID-19 PBMC dataset is available at https://www.covid19cellatlas.org/#wilk20.
- **Python environment** (for Python methods):
  - Managed with micromamba
  - Two environment file options:
    - `environment_full.yml`: Complete environment with exact versions (recommended for reproducibility)
    - `environment_minimal.yml`: Minimal environment with flexible versioning (for cross-platform compatibility)
  - Original configuration in `differential_abundance_env_list.yml` (Linux-specific, kept for reference)
## Usage
- **R environment** (for R methods):
  - Managed with renv
  - Configuration in `renv.lock`

*Note:* Our implementation can only be used on a cluster with Slurm job scheduler since we need to run thousands of jobs.
## Datasets

The benchmarking scripts are all located in the `bin` directory.
The benchmark uses two types of datasets:

```text
bin
├── bm_parameter.sh
├── bm_runtime.sh
├── bm_syn_real.sh
└── make_bm_data.sh
1. **Synthetic topologies**:
   - Linear
   - Branch
   - Cluster

2. **Real datasets** (must be downloaded separately, e.g., from [Google Drive](https://drive.google.com/drive/folders/15wWFD5FMe0VdzN1pUnaUUpQ17OXkeebH)):
   - bcr-xl
   - covid19-pbmc
   - levine32
   - pancreas

## Running the Benchmark

The entire workflow is orchestrated by the `main.sh` script, which:

1. Sets up required Python and R environments
2. Creates necessary directory structure
3. Preprocesses datasets with PCA and diffusion map embeddings
4. Generates synthetic condition labels with known ground truth
5. Runs all methods across all parameter combinations using Slurm

To run the full benchmark:

Note: the command line for running `main.sh` has changed and now takes a batch-effect mode argument:

```bash
bash main.sh $BatchEffectMode
```

To run a benchmarking job, use the following command:
1. If `BatchEffectMode` is `"orig"`, the benchmark uses the original `batch_sd` parameter value from the benchmarkDA package on the PCA layer, and 0 on the diffusion map space, so no batch effect is added on the diffusion map.
2. If `BatchEffectMode` is `"modified"`, the benchmark uses the `batch_sd` calculated from the log space.

Example:

```sh
bash bm_{the script}.sh
```bash
bash main.sh "modified"
```
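
The two modes can be summarized as a small decision function. This is only a sketch of the rule described above, not code from the repository: `resolve_batch_sd` and its parameters are hypothetical names, and it assumes that the `"modified"` mode applies the log-space value to both embeddings.

```python
def resolve_batch_sd(mode: str, embedding: str, original_sd: float, logspace_sd: float) -> float:
    """Pick the batch-effect strength for one run (illustrative helper, not from the repository)."""
    if embedding not in ("pca", "dm"):
        raise ValueError(f"unknown embedding: {embedding!r}")
    if mode == "orig":
        # Original benchmarkDA value on the PCA layer; no batch effect on the diffusion map.
        return original_sd if embedding == "pca" else 0.0
    if mode == "modified":
        # Assumption: the log-space value is used regardless of the embedding.
        return logspace_sd
    raise ValueError(f"BatchEffectMode must be 'orig' or 'modified', got {mode!r}")


print(resolve_batch_sd("orig", "dm", original_sd=0.5, logspace_sd=0.2))
print(resolve_batch_sd("modified", "pca", original_sd=0.5, logspace_sd=0.2))
```

Rejecting unknown modes early keeps a typo in the argument from silently launching the whole benchmark with the wrong batch-effect setting.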

## Architecture

The benchmark has been refactored for improved reliability and readability:

- **Configuration Files**:
  - `config/dataset_config.py`: Dataset parameters and path templates
  - `config/method_config.py`: Method configurations and command generation

- **Core Components**:
  - `python_method/data_loader.py`: Unified data loading for all methods
  - Python method implementations: `Mellon_bm.py`, `meld_bm.py`, `CNA_bm.py`
  - R method execution: Handled via `scripts/run_DA.r`

- **Execution Framework**:
  - `bin/run_benchmark.py`: Unified script generator for all methods
  - Runtime-generated scripts: Stored in `benchmark_scripts/` directory
  - Slurm job logs: Stored in `SlurmLog/` directory
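
To make the idea of a script generator concrete, the sketch below renders the text of one Slurm batch script. It is not the logic of `bin/run_benchmark.py`: the function name, the job name and the command string are hypothetical, and only the `#SBATCH --job-name`, `--output` and `--error` directives and the `%j` job-ID pattern are standard Slurm features.

```python
def render_slurm_script(job_name: str, command: str, log_dir: str = "SlurmLog") -> str:
    """Return the text of a minimal sbatch script (illustrative, not the repository's generator)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        # %j is replaced by Slurm with the job ID, so each job gets its own log files.
        f"#SBATCH --output={log_dir}/{job_name}.%j.out",
        f"#SBATCH --error={log_dir}/{job_name}.%j.err",
        "",
        command,
        "",
    ]
    return "\n".join(lines)


# Placeholder command: the real arguments of the method scripts are not shown in this README.
script = render_slurm_script("meld_linear_seed1", "python meld_bm.py")
print(script)
```

A generator like this would write one such file per parameter combination into `benchmark_scripts/` and submit it with `sbatch`.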

## Output Structure

Results are organized in the `benchmark/` directory with the following structure:

```
benchmark/
├── dm/                      # Diffusion map results
│   ├── synthetic/
│   │   ├── linear/
│   │   ├── branch/
│   │   └── cluster/
│   └── real/
│       ├── bcr-xl/
│       ├── covid19-pbmc/
│       └── ...
└── pca/                     # PCA results
    ├── synthetic/
    │   ├── linear/
    │   ├── branch/
    │   └── cluster/
    └── real/
        ├── bcr-xl/
        ├── covid19-pbmc/
        └── ...
```

Within each dataset directory, results are further organized by specific parameters:
`[dataset]-[population]-[enrichment]-[seed]-[batch_sd]-[balance]-[embedding]/`
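
The naming pattern can be read as a simple join of the run parameters. The helper below is a hypothetical sketch that composes such a path under the directory layout shown above; the example values (`pop1`, `balanced`, and the number formatting) are assumptions and may not match what the pipeline actually writes.

```python
from pathlib import PurePosixPath


def result_dir(root, embedding, kind, dataset, population, enrichment, seed, batch_sd, balance):
    """Compose a result directory path (illustrative helper, not from the repository)."""
    # Field order follows the pattern: dataset-population-enrichment-seed-batch_sd-balance-embedding
    fields = (dataset, population, enrichment, seed, batch_sd, balance, embedding)
    name = "-".join(str(f) for f in fields)
    # kind is "synthetic" or "real", matching the tree above.
    return PurePosixPath(root) / embedding / kind / dataset / name


path = result_dir("benchmark", "pca", "synthetic", "linear", "pop1", 0.75, 1, 0.5, "balanced")
print(path)  # benchmark/pca/synthetic/linear/linear-pop1-0.75-1-0.5-balanced-pca
```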

## Developers Guide

## Acknowledgement
To extend the benchmark with new methods:

Our implementation is inspired by the repo https://github.com/MarioniLab/milo_analysis_2020.
1. Add method configuration to `config/method_config.py`
2. For Python methods:
   - Implement a wrapper in `python_method/`
   - Use the unified data loading interface
3. For R methods:
   - Add handling to `scripts/run_DA.r`
4. Run the benchmark with your new method
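
As a loose sketch of step 1, a method configuration could be a registry that maps a method name to its language and entry script, from which a command is built. The actual structure of `config/method_config.py` is not shown here, so every name below (`METHODS`, `register_method`, `build_command`, the entry paths and the `--seed` flag) is hypothetical.

```python
# Hypothetical registry; the actual structure of config/method_config.py may differ.
METHODS = {
    "meld": {"language": "python", "entry": "meld_bm.py"},
    "milo": {"language": "r", "entry": "scripts/run_DA.r"},
}


def register_method(name: str, language: str, entry: str) -> None:
    """Add a new method to the registry, refusing duplicates and unknown languages."""
    if name in METHODS:
        raise ValueError(f"method {name!r} is already registered")
    if language not in {"python", "r"}:
        raise ValueError(f"unsupported language: {language!r}")
    METHODS[name] = {"language": language, "entry": entry}


def build_command(name: str, extra_args: list) -> list:
    """Build the command line for one run of a registered method."""
    cfg = METHODS[name]
    launcher = "python" if cfg["language"] == "python" else "Rscript"
    return [launcher, cfg["entry"], *extra_args]


register_method("my_method", "python", "python_method/my_method_bm.py")
print(build_command("my_method", ["--seed", "1"]))
```

Keeping the launcher choice in one place means a new method only needs a registry entry plus its wrapper script.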