In this case study, we describe applying DeepTrio to a
real PacBio WGS trio. Then we assess the quality of the DeepTrio variant calls
with hap.py. In addition we evaluate a Mendelian violation rate for a merged
VCF.
To make it faster to run over this case study, we run only on chromosome 20.
Docker will be used to run DeepTrio and hap.py,
We will be using GRCh38 for this case study.
mkdir -p reference
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.faiWe will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG002, HG003, and HG004 trio.
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbiWe'll use HG002, HG003, HG004 PacBio HiFi WGS reads publicly available from the PrecisionFDA Truth v2 Challenge. These reads have been aligned to the GRCh38_no_alt_analysis reference using pbmm2.
mkdir -p input
HTTPDIR=https://storage.googleapis.com/deepvariant/pacbio-case-study-testdata
curl ${HTTPDIR}/HG002.pfda_challenge.grch38.phased.chr20.bam > input/HG002.pfda_challenge.grch38.phased.chr20.bam
curl ${HTTPDIR}/HG002.pfda_challenge.grch38.phased.chr20.bam.bai > input/HG002.pfda_challenge.grch38.phased.chr20.bam.bai
curl ${HTTPDIR}/HG003.pfda_challenge.grch38.phased.chr20.bam > input/HG003.pfda_challenge.grch38.phased.chr20.bam
curl ${HTTPDIR}/HG003.pfda_challenge.grch38.phased.chr20.bam.bai > input/HG003.pfda_challenge.grch38.phased.chr20.bam.bai
curl ${HTTPDIR}/HG004.pfda_challenge.grch38.phased.chr20.bam > input/HG004.pfda_challenge.grch38.phased.chr20.bam
curl ${HTTPDIR}/HG004.pfda_challenge.grch38.phased.chr20.bam.bai > input/HG004.pfda_challenge.grch38.phased.chr20.bam.baiDeepTrio pipeline consists of 4 steps: make_examples, call_variants,
postprocess_variants and GLnexus merge. It is possible to run the first
three steps with one command using the run_deeptrio script. GLnexus
is run as a separate command.
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.10.0"
sudo apt -y update
sudo apt-get -y install docker.io
sudo docker pull google/deepvariant:deeptrio-"${BIN_VERSION}"
time sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:deeptrio-"${BIN_VERSION}" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
--model_type PACBIO \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads_child /input/HG002.pfda_challenge.grch38.phased.chr20.bam \
--reads_parent1 /input/HG003.pfda_challenge.grch38.phased.chr20.bam \
--reads_parent2 /input/HG004.pfda_challenge.grch38.phased.chr20.bam \
--output_vcf_child /output/HG002.output.vcf.gz \
--output_vcf_parent1 /output/HG003.output.vcf.gz \
--output_vcf_parent2 /output/HG004.output.vcf.gz \
--sample_name_child 'HG002' \
--sample_name_parent1 'HG003' \
--sample_name_parent2 'HG004' \
--num_shards $(nproc) \
--intermediate_results_dir /output/intermediate_results_dir \
--output_gvcf_child /output/HG002.g.vcf.gz \
--output_gvcf_parent1 /output/HG003.g.vcf.gz \
--output_gvcf_parent2 /output/HG004.g.vcf.gz \
--regions chr20By specifying --model_type PACBIO, you'll be using a model that is best suited
for PacBio HiFi Whole Genome Sequencing data.
NOTE: If you want to run each of the steps separately, add --dry_run=true
to the command above to figure out what flags you need in each step. Based on
the different model types, different flags are needed in the make_examples
step.
--intermediate_results_dir flag is optional. By specifying it, the
intermediate outputs of make_examples and call_variants stages can be found
in the directory. After the command, you can find these files in the directory:
call_variants_output_child.tfrecord.gz
call_variants_output_parent1.tfrecord.gz
call_variants_output_parent2.tfrecord.gz
gvcf_child.tfrecord-?????-of-?????.gz
gvcf_parent1.tfrecord-?????-of-?????.gz
gvcf_parent2.tfrecord-?????-of-?????.gz
make_examples_child.tfrecord-?????-of-?????.gz
make_examples_parent1.tfrecord-?????-of-?????.gz
make_examples_parent2.tfrecord-?????-of-?????.gz
For running on GPU machines, or using Singularity instead of Docker, see Quick Start or DeepVariant PacBio case study.
At this step we take all 3 VCFs generated in the previous step and merge them using GLnexus.
sudo docker pull quay.io/mlin/glnexus:v1.2.7
# bcftools and bgzip are now included in our docker images.
# You can also install them separately.
sudo docker run \
-v "${PWD}/output":"/output" \
quay.io/mlin/glnexus:v1.2.7 \
/usr/local/bin/glnexus_cli \
--config DeepVariant_unfiltered \
/output/HG002.g.vcf.gz \
/output/HG003.g.vcf.gz \
/output/HG004.g.vcf.gz \
| sudo docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}" \
bcftools view - \
| sudo docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}" \
bgzip -c > output/HG002_trio_merged.vcf.gzAfter completion of GLnexus command we should have a new merged VCF file in the output directory.
HG002_trio_merged.vcf.gz
sudo docker pull realtimegenomics/rtg-tools
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
realtimegenomics/rtg-tools format \
-o /reference/GRCh38_no_alt_analysis_set.sdf "/reference/GRCh38_no_alt_analysis_set.fasta"
FILE="reference/trio.ped"
cat <<EOM >$FILE
#PED format pedigree
#
#fam-id/ind-id/pat-id/mat-id: 0=unknown
#sex: 1=male; 2=female; 0=unknown
#phenotype: -9=missing, 0=missing; 1=unaffected; 2=affected
#
#fam-id ind-id pat-id mat-id sex phen
1 HG002 HG003 HG004 1 0
1 HG003 0 0 1 0
1 HG004 0 0 2 0
EOM
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/output":"/output" \
realtimegenomics/rtg-tools mendelian \
-i "/output/HG002_trio_merged.vcf.gz" \
-o "/output/HG002_trio_annotated.output.vcf.gz" \
--pedigree=/reference/trio.ped \
-t /reference/GRCh38_no_alt_analysis_set.sdf \
| tee output/deepvariant.input_rtg_output.txtAs a result we should get the following output:
Checking: /output/HG002_trio_merged.vcf.gz
Family: [HG003 + HG004] -> [HG002]
125 non-pass records were skipped
Concordance HG002: F:173626/180690 (96.09%) M:174133/180697 (96.37%) F+M:165693/176930 (93.65%)
Sample HG002 has less than 99.0 concordance with both parents. Check for incorrect pedigree or sample mislabelling.
0/191763 (0.00%) records did not conform to expected call ploidy
185903/191763 (96.94%) records were variant in at least 1 family member and checked for Mendelian constraints
7526/185903 (4.05%) records had indeterminate consistency status due to incomplete calls
12414/185903 (6.68%) records contained a violation of Mendelian constraintsmkdir -p happy
sudo docker pull jmcdani20/hap.py:v0.3.12
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG002.output.vcf.gz \
-f /benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG002.output \
--engine=vcfeval \
--pass-only \
-l chr20
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG003.output.vcf.gz \
-f /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG003.output \
--engine=vcfeval \
--pass-only \
-l chr20
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG004.output.vcf.gz \
-f /benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG004.output \
--engine=vcfeval \
--pass-only \
-l chr20Benchmarking Summary for HG002:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 11256 11228 28 23150 57 11388 16 38 0.997512 0.995154 0.491922 0.996332 NaN NaN 1.561710 2.078579
INDEL PASS 11256 11228 28 23150 57 11388 16 38 0.997512 0.995154 0.491922 0.996332 NaN NaN 1.561710 2.078579
SNP ALL 71333 71311 22 109348 19 37943 8 11 0.999692 0.999734 0.346993 0.999713 2.314904 1.729368 1.715978 1.614759
SNP PASS 71333 71311 22 109348 19 37943 8 11 0.999692 0.999734 0.346993 0.999713 2.314904 1.729368 1.715978 1.614759
Benchmarking Summary for HG003:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10580 48 23643 70 12505 33 35 0.995484 0.993715 0.528909 0.994599 NaN NaN 1.748961 2.309936
INDEL PASS 10628 10580 48 23643 70 12505 33 35 0.995484 0.993715 0.528909 0.994599 NaN NaN 1.748961 2.309936
SNP ALL 70166 70147 19 118404 37 48170 9 13 0.999729 0.999473 0.406827 0.999601 2.296566 1.572775 1.883951 1.716717
SNP PASS 70166 70147 19 118404 37 48170 9 13 0.999729 0.999473 0.406827 0.999601 2.296566 1.572775 1.883951 1.716717
Benchmarking Summary for HG004:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 11000 10958 42 23959 58 12433 25 30 0.996182 0.994968 0.518928 0.995574 NaN NaN 1.792709 2.285876
INDEL PASS 11000 10958 42 23959 58 12433 25 30 0.996182 0.994968 0.518928 0.995574 NaN NaN 1.792709 2.285876
SNP ALL 71659 71621 38 118676 26 46946 8 10 0.999470 0.999638 0.395581 0.999554 2.310073 1.620388 1.878340 1.626939
SNP PASS 71659 71621 38 118676 26 46946 8 10 0.999470 0.999638 0.395581 0.999554 2.310073 1.620388 1.878340 1.626939