This is a Snakemake pipeline that automatically preprocesses data (e.g., running PEAR to merge R1 and R2), performs sequence quality control, and runs the CARLIN pipeline. It is especially useful when you have multiple samples from a single sequencing run. It was developed as part of the DARLIN project: L. Li, ..., S.-W. Wang, F. Camargo, Cell (2023).
Note that this pipeline must be used with a customized version of the CARLIN pipeline, which we adapted from the original software to handle the different DARLIN references in the CA, TA, and RA loci.
First, make a conda environment:
kernel_name='snakemake_darlin'
conda create -n $kernel_name python=3.9 --yes
conda activate $kernel_name
conda install -c conda-forge mamba --yes
mamba install -c conda-forge -c bioconda snakemake=7.24.0 --yes
pip install --user ipykernel
pip install jupyterlab umi_tools seaborn papermill biopython
python -m ipykernel install --user --name=$kernel_name
Next, go to the directory where you want to store the code and install all relevant packages:
code_directory='.' # change it to the directory where you want to put the packages
cd $code_directory
git clone https://github.com/ShouWenWang-Lab/snakemake_DARLIN --depth=1
cd snakemake_DARLIN
python setup.py develop
cd ..
mkdir CARLIN_pipeline
cd CARLIN_pipeline
git clone https://github.com/ShouWenWang-Lab/Custom_CARLIN --depth=1
Finally, you need to install PEAR and MATLAB. On an HPC cluster, you often need to install PEAR locally so that no root permission is required. To do so, run ./configure --prefix /local/directory to install it in a directory where you have write access. Below is an example script for installing PEAR locally on an HPC cluster.
cd pear_installation_folder
./configure --prefix ~ # install at the local home directory
make
make install
MATLAB should be available from the command line. In an HPC environment, MATLAB can be loaded with the command:
module load matlab
MATLAB should have the Bioinformatics Toolbox and Image Processing Toolbox add-ons installed.
This pipeline also uses FastQC and MultiQC to visualize sequence quality. The DARLIN preprocessing should run correctly without them, but if you want the QC report, make sure the fastqc and multiqc commands are available in the terminal.
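As a quick sanity check after installation, the sketch below looks for the command-line tools mentioned above on your PATH. The executable names (snakemake, pear, matlab, fastqc, multiqc) are assumptions based on the commands used in this README; adjust them if your system names them differently.

```python
import shutil

# Assumed executable names, taken from the commands used in this README.
REQUIRED_TOOLS = ["snakemake", "pear", "matlab"]
OPTIONAL_TOOLS = ["fastqc", "multiqc"]  # only needed for the QC report


def find_missing(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing_required = find_missing(REQUIRED_TOOLS)
    missing_optional = find_missing(OPTIONAL_TOOLS)
    if missing_required:
        print("Missing required tools:", ", ".join(missing_required))
    if missing_optional:
        print("Missing optional tools (no QC report):", ", ".join(missing_optional))
    if not missing_required and not missing_optional:
        print("All tools found.")
```

On an HPC system, run this after `module load matlab` and after activating the conda environment, so that PATH reflects what the pipeline will see.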
The pipeline assumes that it is being called on a server with SLURM if sbatch=1 in the config file (see below). If not, you can copy-and-paste the generated command and run it locally.
With sbatch=0, it should run properly in a normal Linux system without SLURM. However, it cannot submit jobs and run them in parallel in this case.
Please create three separate folders named CA, TA, and RA. Inside each folder, create a raw_fastq subfolder along with a config.yaml file. Each raw_fastq folder should contain reads that correspond to its parent template (i.e., CA folder’s raw_fastq contains CA reads).
Alternatively, it is also acceptable to mix reads from different templates (i.e., include all reads from CA, TA, and RA together) within a single raw_fastq subfolder. In this case, during downstream analysis, each folder (CA, TA, RA) will extract only the reads that match its respective template. As a result, the fraction of valid_lines reported in the Results.txt file will appear low, because only a subset of the total reads corresponds to any given template.
As indicated in the above example, the config.yaml file should be at the root folder of a project, and the fastq data should be at the folder raw_fastq.
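The folder layout described above can be created with a few lines of Python. This is a minimal sketch: the function name `scaffold_project` is hypothetical, and the config.yaml files it creates are empty placeholders that you still need to fill in.

```python
from pathlib import Path

# One project folder per template, as described above.
TEMPLATES = ("CA", "TA", "RA")


def scaffold_project(root):
    """Create <root>/<template>/raw_fastq and an empty config.yaml for each template.

    Existing folders and files are left untouched. Returns the config.yaml paths.
    """
    root = Path(root)
    configs = []
    for template in TEMPLATES:
        project_dir = root / template
        (project_dir / "raw_fastq").mkdir(parents=True, exist_ok=True)
        config = project_dir / "config.yaml"
        config.touch(exist_ok=True)  # placeholder only; fill in before running
        configs.append(config)
    return configs


if __name__ == "__main__":
    for path in scaffold_project("."):
        print("created", path)
```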
We assume that the data were generated on an Illumina MiSeq machine. Specifically, we assume that each file name starts with a sample ID and that both R1 and R2 files are present:
fq_R1=f"{sample}_L001_R1_001.fastq.gz"
fq_R2=f"{sample}_L001_R2_001.fastq.gz"
Please rename the files if they are not in this format. An example of a config.yaml file is as follows:
project_name : 'Li_112219'
project_ID : '144505366'
SampleList : ['HSC','MPP','MyP'] #Remove 1_S*, it will have few reads, affect the output
cfg_type : 'sc10xV3' # available protocol: BulkRNA_Tigre_14UMI, BulkRNA_Rosa_14UMI, BulkRNA_12UMI, scCamellia,sc10xV3
template : 'cCARLIN' # short_primer_set: {Tigre_2022_v2, Rosa_v2, cCARLIN}, long_primer_set: {Tigre_2022,Rosa,cCARLIN}
read_cutoff_UMI_override : [3,10] # assume to be a list, UMI cutoff is the same as CB cutoff for single-cell protocol
CARLIN_memory_factor : 300 # request memory at X times the size of the pear fastq file.
sbatch : 1 # 1, run sbatch job; 0, run in the interactive mode.
CARLIN_max_run_time : 12 # hour
code_directory should be the same directory where you cloned the code.
SampleList should be the list of samples that you want to analyze.
cfg_type should match the protocol of the experiment. Some of the provided protocols include:
- BulkRNA_Tigre_14UMI: Bulk CARLIN library with Tigre locus, with a UMI of 14bp
- BulkRNA_Rosa_14UMI: Bulk CARLIN library with Rosa locus, with a UMI of 14bp
- BulkRNA_12UMI: Bulk CARLIN library with Col1a1 locus, with a UMI of 12bp
- scCamellia: Single-cell CARLIN library using the scCamellia-seq protocol
- sc10xV3: Single-cell CARLIN library using the 10X v3 protocol
template should match the primer set used. We have templates corresponding to the shorter primers in TC and RC: {Tigre_2022_v2, Rosa_v2}, and to the longer primers: {Tigre_2022, Rosa}. For the Col1a1 locus, we only have a single primer set, corresponding to template cCARLIN.
read_cutoff_UMI_override: minimum number of reads needed to support a UMI (bulk library) or a cell barcode (single-cell library). It should be a list of read cutoffs, e.g., [3,10].
CARLIN_memory_factor: When running on o2, the requested memory should be CARLIN_memory_factor times the fastq file size.
sbatch: when running on o2, whether to run with sbatch jobs (1) or in interactive mode (0).
CARLIN_max_run_time: When running on o2, the maximum run time to request, in the unit of hours.
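Before launching the pipeline, it can help to check that the config and the fastq files agree. The sketch below is not part of the pipeline: `check_project` is a hypothetical helper, the set of keys it treats as required is an assumption, and it takes the config as an already-loaded dictionary so that it needs no YAML library. It uses the file naming convention stated above.

```python
from pathlib import Path

# Assumption: these keys are treated as required for illustration only.
REQUIRED_KEYS = ("SampleList", "cfg_type", "template",
                 "read_cutoff_UMI_override", "sbatch")


def expected_fastq_names(sample):
    """Return the (R1, R2) file names expected for a sample ID."""
    return (f"{sample}_L001_R1_001.fastq.gz",
            f"{sample}_L001_R2_001.fastq.gz")


def check_project(config, project_dir):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing config key: {key}")
    if not isinstance(config.get("read_cutoff_UMI_override", []), list):
        problems.append("read_cutoff_UMI_override should be a list, e.g. [3,10]")
    raw_fastq = Path(project_dir) / "raw_fastq"
    for sample in config.get("SampleList", []):
        for name in expected_fastq_names(sample):
            if not (raw_fastq / name).is_file():
                problems.append(f"missing fastq: {name}")
    return problems


if __name__ == "__main__":
    example = {"SampleList": ["HSC", "MPP", "MyP"], "cfg_type": "sc10xV3",
               "template": "cCARLIN", "read_cutoff_UMI_override": [3, 10],
               "sbatch": 1}
    for problem in check_project(example, "."):
        print(problem)
```

A "missing fastq" message is expected if you have not downloaded the data yet and plan to fetch it through project_name and project_ID.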
If the fastq files have not yet been downloaded to the raw_fastq folder and the data is stored on Illumina BaseSpace, you can provide project_name and project_ID in config.yaml to automatically download the data.
First, check the available fastq data with the terminal command:
bs auth # this needs to be done only once for authentication
bs list project
Next, select the desired project name and ID. In the above config.yaml file, we selected the data from the first entry.
Next, activate the correct environment
conda activate snakemake_darlin # activate the environment
and run the snakemake script in the same directory as the config.yaml file:
snakemake -s $code_directory/snakemake_DARLIN/snakefiles/snakefile_get_data.py --configfile config.yaml --core 1
The following command generates the QC report and processes each sample with the CARLIN pipeline:
snakemake -s $code_directory/snakemake_DARLIN/snakefiles/snakefile_matlab_DARLIN_Part1.py --configfile config.yaml --core 10
Finally, you may run this command to get an html report across all samples:
snakemake -s $code_directory/snakemake_DARLIN/snakefiles/snakefile_matlab_DARLIN_Part2.py --configfile config.yaml --core 5 --ri -R generate_report -R plots
The result will show up in the merge_all folder, as shown in the above image.
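If you run the three steps often, you may want to generate the commands from one place. The sketch below only builds and prints them; it does not run anything. It assumes the repository was cloned to code_directory/snakemake_DARLIN as in the installation steps, and `build_commands` is a hypothetical helper, not part of the package. The flags are copied from the commands above.

```python
import shlex
from pathlib import Path

# (snakefile, cores, extra flags), in the order the steps are run above.
STEPS = [
    ("snakefile_get_data.py", 1, []),
    ("snakefile_matlab_DARLIN_Part1.py", 10, []),
    ("snakefile_matlab_DARLIN_Part2.py", 5,
     ["--ri", "-R", "generate_report", "-R", "plots"]),
]


def build_commands(code_directory, configfile="config.yaml"):
    """Return the three snakemake invocations as argument lists."""
    # Assumption: the snakefiles live under <code_directory>/snakemake_DARLIN/snakefiles.
    snakefile_dir = Path(code_directory) / "snakemake_DARLIN" / "snakefiles"
    commands = []
    for snakefile, cores, extra in STEPS:
        commands.append([
            "snakemake", "-s", str(snakefile_dir / snakefile),
            "--configfile", configfile,
            "--core", str(cores),
            *extra,
        ])
    return commands


if __name__ == "__main__":
    for command in build_commands("."):
        print(shlex.join(command))
```

Run the printed commands from the folder that contains config.yaml, with the snakemake_darlin environment activated.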
To test whether the pipeline has been installed correctly, go to the test folder and run the command
bash test.sh
If everything goes correctly, the expected output for the three test datasets should look like this:

A log file from running this test module is available for download here.
Active changes are being made to the GitHub repository. If you want to incorporate the latest changes, please run
cd $code_directory
cd snakemake_DARLIN
git pull
cd ../CARLIN_pipeline/Custom_CARLIN
git pull
cd ../../MosaicLineage
git pull
L. Li, S. Bowling, S. E. McGeary, Q. Yu, B. Lemke, K. Alcedo, Y. Jia, X. Liu, M. Ferreira, A. M. Klein, S.-W. Wang*, F. D. Camargo*, A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells, Cell (2023). [* corresponding authors]
- MosaicLineage: a (mosaic) collection of Python helper functions for lineage tracing data analysis, developed through the DARLIN project.
- Notebooks to reproduce Figure 4 and Figure 5 in our paper. They also illustrate how to use the MosaicLineage package.
- Raw and intermediate data for these notebooks. To download all raw or processed data, please go to GEO: GSE222486

