Examples of workflows to run tools on the cluster, with implementation notes.
#!/bin/bash
# Motif analysis workflow for Super Enhancer BED file inputs
# directory we're doing the work in # (1.)
project_dir="/ifs/home/kellys04/projects/SmithLab_ChIpSeq_2016-12-31/project_notes/motifs-SuperEnhancers"
# directory that holds our input BED files # (1.)
SE_dir="/results/ChIP-Seq/superenhancer/ROSE/bed/"
output_HOMER="${project_dir}/HOMER"
mkdir -p "$output_HOMER" # (2.)
# (3.)
run_HOMER_motif () {
# function to run HOMER on a BED file, and set up output directories
# http://homer.salk.edu/homer/ngs/peakMotifs.html
local input_bed="$1" # (4.)
local genome="hg19"
local motif_options="-size given"
local sample_ID="$(basename "$input_bed" | cut -d '-' -f1)"
local sample_status="$(basename "$input_bed" | cut -d '-' -f2)"
local outdir="${output_HOMER}/${sample_ID}/${sample_status}"
mkdir -p "$outdir"
local preparsed_dir="${outdir}/preparsed"
mkdir -p "$preparsed_dir"
local logdir="${outdir}/logs"
mkdir -p "$logdir"
local job_name="homer-${sample_ID}"
echo "$sample_ID $sample_status"
echo "$outdir"
# (5.)
qsub -wd $outdir -o :${logdir}/ -e :${logdir}/ -j y -pe threaded 4 -N "$job_name" <<E0F
# (6.)
module load homer/v4.6
set -x
threads=\$NSLOTS
findMotifsGenome.pl $input_bed $genome $outdir/ -p $threads -preparse -preparsedDir ${preparsed_dir}/ -dumpFasta $motif_options # && rm -rf $preparsed_dir
set +x
E0F
}
# we need to find all the input BED files, and pass them to our function to submit a qsub job for HOMER on it
# file paths look like this:
# /results/ChIP-Seq/superenhancer/ROSE/bed/ABC-R-H3K27AC_ROSE_superenhancer.bed
# HOMER
# (7.)
find "$SE_dir" -name "*.bed" -exec readlink -f {} \; | while read item; do
echo "$item"
run_HOMER_motif "$item"
doneNotes:
- Its a good idea to start with the full path to the directories you're working on.
- Use
mkdir -pto create all directories in the given path. - Isolate the code needed to perform your task within a function, to make the code more readable and keep 'file-finding' code from getting tangled up in your 'task-performing' code. Also note that we're using the "old style" bash function syntax here, since it is more portable in case you end up working on an older system.
- Inside the function, use the
localcommand to set variables in a scope local to the function, so they are not globally accessible by the rest of your code. This way, if something goes wrong while your code is running, you don't accidentally useinput_bedthat was set by another iteration of the function. - Command to submit the job to
qsubto run on the compute cluster.-wd: Sets the working directory for the job to run in. By default SGE runs jobs out of a type of temp directory, which can cause issues with some programs. You can set the working directory for the job explicitly with an argument like-wd $outdir, or you can use-cwdto run from the current working directory-o :${logdir}/sets the location for the std out log (and-efor std error). The trailing slash and leading colon are required.-j y: merges the std out and std error logs into a single file (so-e :${logdir}/was not actually required here)-pe threaded 4: The number of threads for the job to use, can also be a range such as4-8-N "$job_name": The name of the job to use, instead of the default value. This is the name that appears inqstat
- We are using a bash "heredoc" to pass the commands for
qsubinstead of using an external script file. This is useful since it keeps all your code in one place, and allows you to dynamically build your commands to be run. Heredoc's do have a few quirks to be aware of if you plan to use them, so be sure to look over the documentation for them.- The code will be running in a new bash session. First we need to load the
modulefor our program, and set other environment variables - Using
set -xis useful for debugging since it prints all commands run by bash to std out, with all variables filled in. These entries are prefixed by+in the console or log file. This way, when you have a lot of dynamically set variables, you can see exactly what is getting run. You turn this feature back off withset +x. - By default, the heredoc will try to fill in any items prefixed with
$as if it were a variable set in the session in which it was created. So,$input_bedwill get filled in with the value that we assigned previously. This feature can be disabled on a per-item basis by escaping with a\like this:\$ $NSLOTSis an environment variable set byqsubwithin the running job, and represents the number of threads ("slots") assigned to the job; we want to assign this value to thethreadsvariable after the heredoc has been submitted as a job, so it is escaped and written asthreads=\$NSLOTS. If you wanted to have a default value forthreadsto use with the code whenNSLOTSisn't set or doesn't exist, you could usethreads=${NSLOTS:=4}, which in this case would also need to be escaped for our heredoc.- As described, all the variables passed to the
findMotifsGenome.plprogram here will be filled in either from the function, or from variables assigned within the heredoc.
- The code will be running in a new bash session. First we need to load the
- Here we use a simple
findcommand to search for all the BED files in the specified directory.- The argument
-exec readlink -f {} \;is used to ensure that the full path is returned for every file found. Not necessarily required in this example since ourSE_dirvariable started with the full path in the first place, but this is a good practice to use when passing around file paths to be processed. - Every line output by
findis passed to awhileloop, which will print the file being processed and then run therun_HOMER_motiffunction on the file. Using-exec run_HOMER_motifis not possible here since-execonly works with programs, commands, and script invocations that are accessible external to the current session; we're trying to keep all our code in one place, so we want to avoid that.
- The argument