In this case study, we describe applying DeepVariant to a Complete Genomics T7
sample.
Then we assess the quality of the DeepVariant variant calls with hap.py.
To make it faster to run over this case study, we run only on chromosome 20.
For how to prepare environment, the steps are the same as this doc.
We will be using GRCh38 for this case study.
mkdir -p reference
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.faimkdir -p input
HTTPDIR=https://storage.googleapis.com/deepvariant/complete-case-study-testdata
curl ${HTTPDIR}/HG001.complete_t7.E100030471QC960.grch38.chr20.bam > input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam
curl ${HTTPDIR}/HG001.complete_t7.E100030471QC960.grch38.chr20.bam.bai > input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam.baimkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed > benchmark/HG001_GRCh38_1_22_v4.2.1_benchmark.bed
curl ${FTPDIR}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbiHTTPDIR=https://storage.googleapis.com/deepvariant/complete-case-study-testdata
curl ${HTTPDIR}/complete-t7/weights-51-0.995354.ckpt.data-00000-of-00001 > input/weights-51-0.995354.ckpt.data-00000-of-00001
curl ${HTTPDIR}/complete-t7/weights-51-0.995354.ckpt.index > input/weights-51-0.995354.ckpt.index
curl ${HTTPDIR}/complete-t7/example_info.json > input/example_info.json
curl ${HTTPDIR}/complete-t7/model.example_info.json > input/model.example_info.jsonOn a CPU-only machine:
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.10.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG001.complete_t7.E100030471QC960.grch38.chr20.bam \
--output_vcf /output/HG001.output.vcf.gz \
--output_gvcf /output/HG001.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--intermediate_results_dir /output/intermediate_results_dir \
--customized_model /input/weights-51-0.995354.ckptFor running on GPU machines, or using Singularity instead of Docker, see Quick Start.
mkdir -p happy
sudo docker pull jmcdani20/hap.py:v0.3.12
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG001.output.vcf.gz \
-f /benchmark/HG001_GRCh38_1_22_v4.2.1_benchmark.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/happy.output \
--engine=vcfeval \
--pass-only \
-l chr20Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 9974 9946 28 21045 8 10743 2 5 0.997193 0.999223 0.510478 0.998207 NaN NaN 1.630447 2.158659
INDEL PASS 9974 9946 28 21045 8 10743 2 5 0.997193 0.999223 0.510478 0.998207 NaN NaN 1.630447 2.158659
SNP ALL 69175 68876 299 85012 45 16048 8 2 0.995678 0.999347 0.188773 0.997509 2.288757 2.083179 1.730097 1.781651
SNP PASS 69175 68876 299 85012 45 16048 8 2 0.995678 0.999347 0.188773 0.997509 2.288757 2.083179 1.730097 1.781651