All scripts are located in the `snp-calling` directory.
- Input files were downloaded from CyVerse using the `getFiles.sh` script. Large files (those run without barcodes) were split into smaller chunks for faster processing using the `split-fastq.sh` script.
- The genome file (B73.v5) was downloaded from CyVerse and processed using `gatk-prepare-reference.sh` to create all the files needed to run the GATK pipeline.
- The fastq files were mapped to B73.v5 and processed using the `process-fastq.sh` script. Briefly, this script:
  - converts unmapped fastq to bam using `FastqToSam`
  - runs Picard `MarkIlluminaAdapters`
  - converts bam back to fastq using `SamToFastq`
  - maps the fastq files to B73.v5 using `bwa mem` and converts the output to bam using `samtools`
  - merges unmapped reads with mapped reads using `MergeBamAlignment`
  - runs Picard's `MarkDuplicates` to mark optical duplicates
- As a final step of processing, the correct read groups were added to the bam files using `run-add-readgroups.sh`, and the files were indexed.
- GATK was run on 1 Mb intervals. The commands were generated using the script `gatkcmds-round-1.sh` and the intervals file `B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed`, and were run on the cluster by creating slurm job submission scripts using GNU parallel.
- Once the VCF files were generated (2,813 total), they were gathered and filtered to retain only very high quality SNPs, using the script `gatk-process.sh`.
- The bam files were recalibrated against the filtered first-round SNP files using `gatk-bsqr.sh`.
- Using the recalibrated bam files, GATK was run again on 1 Mb intervals. The commands were generated using the script `gatkcmds-round-2.sh` and the intervals file `B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed`, and were run on the cluster by creating slurm job submission scripts using GNU parallel.
- The final files were filtered again using the `gatk-process.sh` script.
- The final files were uploaded to CyVerse.
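
The per-interval command generation described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `gatkcmds-round-1.sh` or `gatkcmds-round-2.sh`: the function name `make_gatk_cmds`, the output file naming, and the choice of `HaplotypeCaller` with only the basic `-R`, `-I`, `-L`, `-O` options are all assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch only: print one GATK command per interval in a BED file.
# Nothing is executed here; the commands are written to stdout.
set -euo pipefail

# make_gatk_cmds BED REF BAM   (hypothetical helper, not from the repository)
make_gatk_cmds() {
  local bed="$1" ref="$2" bam="$3"
  local chrom start end
  # The "|| [[ -n ... ]]" part keeps a last line that lacks a trailing newline.
  while IFS=$'\t' read -r chrom start end _ || [[ -n "${chrom:-}" ]]; do
    # Skip blank lines and comment lines.
    if [[ -z "$chrom" || "$chrom" == \#* ]]; then
      continue
    fi
    # BED coordinates are 0-based and half-open; an interval string passed
    # to -L is 1-based and inclusive, so only the start is shifted by one.
    local first=$((start + 1))
    # Assumed output naming: <chrom>_<first>_<end>.vcf.gz
    echo "gatk HaplotypeCaller -R ${ref} -I ${bam} -L ${chrom}:${first}-${end} -O ${chrom}_${first}_${end}.vcf.gz"
  done < "$bed"
}
```

Writing the output to a file, for example `make_gatk_cmds B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed ref.fasta sample.bam > gatk.cmds` (the reference and bam names here are placeholders), gives one command per line. Such a file could then be run inside a slurm job with `parallel --jobs 8 < gatk.cmds`, where the job count is arbitrary.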