settylab · katosh · May 10, 2024 · Mar 1, 2025
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,6 @@ data/
 .ipynb_checkpoints/
 scripts/core
 notebooks/
+__pycache__/
+.DS_Store
+.nfs*
diff --git a/.gitmodules b/.gitmodules
@@ -1,3 +0,0 @@
-[submodule "methods/mellon"]
-	path = methods/mellon
-	url = git@github.com:settylab/benchmarkDA_mellon_sub.git

diff --git a/README.md b/README.md
@@ -1,67 +1,142 @@
-# bmDA
+# BenchmarkDA: Differential Abundance Testing Benchmarks
 
-The implementation of benchmarking the performance of differential abundance (DA) testing methods including 5 clustering-free methods:
+This repository contains a unified framework for benchmarking differential abundance (DA) testing methods in single-cell data.
 
-1. [Testing for differential abundance in mass cytometry data (Cydar)](https://www.nature.com/articles/nmeth.4295)
-2. [Detection of differentially abundant cell subpopulations in scRNA-seq data (DAseq)](https://www.pnas.org/doi/abs/10.1073/pnas.2100293118)
-3. [Quantifying the effect of experimental perturbations at single-cell resolution (MELD)](https://www.nature.com/articles/s41587-020-00803-5)
-4. [Differential abundance testing on single-cell data using k-nearest neighbor graphs (Milo)](https://www.nature.com/articles/s41587-021-01033-z)
-5. [Co-varying neighborhood analysis identifies cell populations associated with phenotypes of interest from single-cell transcriptomics (CNA)](https://www.nature.com/articles/s41587-021-01066-4),
+## Overview
 
-and a clustering-based method, *Louvain*.
+Differential abundance testing aims to identify cell types or states that change in proportion between conditions in single-cell data. This benchmark evaluates multiple methods:
 
-## Dependencies
+**Python Methods:**
+- MELD
+- CNA (Co-varying Neighborhood Analysis)
+- Mellon
 
-To run this benchmarking codes, it needs to install a list of R and Python packages. The R packages needed are:
+**R Methods:**
+- Milo
+- DAseq
+- Cydar
 
-- argparse
-- SingleCellExperiment
-- scran
-- [DAseq](https://github.com/KlugerLab/DAseq)
-- [miloR](https://github.com/MarioniLab/miloR)
-- tibble
-- dplyr
-- tidyverse
-- igraph
-- [cydar](http://bioconductor.org/packages/cydar)
-- pdist
-- reshape2
+The benchmark evaluates these methods using both real datasets and synthetic topologies, with various parameters including:
+- PCA and diffusion map (DM) embeddings
+- Different cell populations
+- Various enrichment levels
+- Different batch effect strengths
+- Multiple random seeds
 
-The Python packages needed are
+## Requirements
 
-- [MELD](https://github.com/KrishnaswamyLab/MELD)
-- [cna](https://github.com/immunogenomics/cna)
-- scanpy
-- [graphtools](https://github.com/KrishnaswamyLab/graphtools)
-- scikit-learn
-- multianndata
+Our implementation requires a Slurm job scheduler since we need to run thousands of parallel jobs.
 
-## Data
+### Dependencies
 
-- Synthetic datasets and BCR-XL dataset are available under the `data` directory.
-- The COVID-19 PBMC dataset is available at https://www.covid19cellatlas.org/#wilk20.
+- **Python environment** (for Python methods):
+  - Managed with micromamba
+  - Two environment file options:
+    - `environment_full.yml`: Complete environment with exact versions (recommended for reproducibility)
+    - `environment_minimal.yml`: Minimal environment with flexible versioning (for cross-platform compatibility)
+  - Original configuration in `differential_abundance_env_list.yml` (Linux-specific, kept for reference)
 
-## Usage
+- **R environment** (for R methods):
+  - Managed with renv
+  - Configuration in `renv.lock`
 
-*Note:* Our implementation can only be used on a cluster with Slurm job scheduler since we need to run thousands of jobs.
+## Datasets
 
-The benchmarking scripts are all located in the `bin` drectory.
+The benchmark uses two types of datasets:
 
-```text
-bin
-├── bm_parameter.sh
-├── bm_runtime.sh
-├── bm_syn_real.sh
-└── make_bm_data.sh
+1. **Synthetic topologies**:
+   - Linear
+   - Branch
+   - Cluster
+
+2. **Real datasets** (must be downloaded separately, e.g., from [Google Drive](https://drive.google.com/drive/folders/15wWFD5FMe0VdzN1pUnaUUpQ17OXkeebH)):
+   - bcr-xl
+   - covid19-pbmc
+   - levine32
+   - pancreas
+
+## Running the Benchmark
+
+The entire workflow is orchestrated by the `main.sh` script, which:
+
+1. Sets up required Python and R environments
+2. Creates necessary directory structure
+3. Preprocesses datasets with PCA and diffusion map embeddings
+4. Generates synthetic condition labels with known ground truth
+5. Runs all methods across all parameter combinations using Slurm
+
+To run the full benchmark:
+
+Note: `main.sh` now takes the batch effect mode as its first argument:
+
+```bash
+bash main.sh "$BatchEffectMode"
 ```
 
-To run a benchmarking job, use the following command:
+1. If `BatchEffectMode` is `"orig"`, the benchmark uses the original `batch_sd` value from the benchmarkDA package on the PCA embedding and sets `batch_sd` to 0 on the diffusion map embedding, so no batch effect is added there.
+2. If `BatchEffectMode` is `"modified"`, the benchmark uses `batch_sd` values calculated from a log-spaced range.
+
+Example:
 
-```sh
-bash bm_{the script}.sh
+```bash
+bash main.sh "modified"
 ```
 
+## Architecture
+
+The benchmark has been refactored for improved reliability and readability:
+
+- **Configuration Files**:
+  - `config/dataset_config.py`: Dataset parameters and path templates
+  - `config/method_config.py`: Method configurations and command generation
+
+- **Core Components**:
+  - `python_method/data_loader.py`: Unified data loading for all methods
+  - Python method implementations: `Mellon_bm.py`, `meld_bm.py`, `CNA_bm.py`
+  - R method execution: Handled via `scripts/run_DA.r`
+
+- **Execution Framework**:
+  - `bin/run_benchmark.py`: Unified script generator for all methods
+  - Runtime-generated scripts: Stored in `benchmark_scripts/` directory
+  - Slurm job logs: Stored in `SlurmLog/` directory
+
+## Output Structure
+
+Results are organized in the `benchmark/` directory with the following structure:
+
+```text
+benchmark/
+├── dm/                   # Diffusion map results
+│   ├── synthetic/
+│   │   ├── linear/
+│   │   ├── branch/
+│   │   └── cluster/
+│   └── real/
+│       ├── bcr-xl/
+│       ├── covid19-pbmc/
+│       └── ...
+└── pca/                  # PCA results
+    ├── synthetic/
+    │   ├── linear/
+    │   ├── branch/
+    │   └── cluster/
+    └── real/
+        ├── bcr-xl/
+        ├── covid19-pbmc/
+        └── ...
+```
+
+Within each dataset directory, results are further organized by specific parameters:
+`[dataset]-[population]-[enrichment]-[seed]-[batch_sd]-[balance]-[embedding]/`
+
+## Developer Guide
 
-## Acknowledgement
+To extend the benchmark with new methods:
 
-Our implementation is inspired by the repo https://github.com/MarioniLab/milo_analysis_2020.
+1. Add method configuration to `config/method_config.py`
+2. For Python methods:
+   - Implement a wrapper in `python_method/`
+   - Use the unified data loading interface
+3. For R methods:
+   - Add handling to `scripts/run_DA.r`
+4. Run the benchmark with your new method
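
The README's Output Structure section describes a directory tree plus a per-run naming template. The sketch below shows one way such a path could be assembled. It is only an illustration under stated assumptions: `result_dir`, the `SYNTHETIC_TOPOLOGIES` set, and every parameter value in the example grid (`"M1"`, `0.75`, `"balanced"`, the seeds) are hypothetical and not taken from the repository.

```python
from itertools import product
from pathlib import Path

# Synthetic topologies named in the README; any other dataset is treated as real.
SYNTHETIC_TOPOLOGIES = {"linear", "branch", "cluster"}


def result_dir(root, dataset, population, enrichment, seed, batch_sd, balance, embedding):
    """Hypothetical helper: benchmark/<embedding>/<synthetic|real>/<dataset>/<run name>."""
    kind = "synthetic" if dataset in SYNTHETIC_TOPOLOGIES else "real"
    # Field order follows the README template:
    # [dataset]-[population]-[enrichment]-[seed]-[batch_sd]-[balance]-[embedding]
    fields = (dataset, population, enrichment, seed, batch_sd, balance, embedding)
    run_name = "-".join(str(field) for field in fields)
    return Path(root) / embedding / kind / dataset / run_name


if __name__ == "__main__":
    # Placeholder grid; the real benchmark's parameter values are not shown in the diff.
    datasets = ["linear", "bcr-xl"]
    embeddings = ["pca", "dm"]
    seeds = [43, 44]
    for dataset, embedding, seed in product(datasets, embeddings, seeds):
        print(result_dir("benchmark", dataset, "M1", 0.75, seed, 0, "balanced", embedding))
```

One thing the template implies: dataset names such as `bcr-xl` and `covid19-pbmc` contain hyphens themselves, so a run name cannot be parsed back into its fields with a naive split on `-`. Code that needs the individual parameters is safer keeping them alongside the path than recovering them from the directory name.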