Skip to content

Commit 6849f51

Browse files
authored
Add exome mode (#38)
* Add --exome mode - new --exome specific options in ntroot, rules in ntroot_run_pipeline.smk - Updated README.md - Added exome tests to demo * Update help page * Add log message for --exome and --masked * Update README.md * Update README.md * Adjust cutoff parameter setting for exome mode - In case we ever want the option to set --solid, allow this in smk - In practice, cutoff is always set, so will not happen with current ntroot driver script
1 parent 4a13293 commit 6849f51

6 files changed

Lines changed: 197 additions & 21 deletions

File tree

README.md

Lines changed: 22 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -56,28 +56,33 @@ No compilation is required for ntRoot (only the dependencies), so simply add the
5656

5757
## Usage <a name=usage></a>
5858
```
59-
usage: ntroot [-h] [-r REFERENCE] [--reads READS] [--genome GENOME [GENOME ...]] -l L [-k K] [--tile TILE] [--lai] [-t T] [-z Z] [-j J] [-Y Y] [--custom_vcf CUSTOM_VCF]
60-
[--strip_info] [-v] [-V] [-n] [-f]
59+
usage: ntroot [-h] [-r REFERENCE] [--reads READS] [--genome GENOME [GENOME ...]] -l L [--exome] [-k K] [--tile TILE] [--lai] [-t T] [-z Z] [-j J] [--cutoff CUTOFF] [-Y Y]
60+
[--custom_vcf CUSTOM_VCF] [--masked] [--exome_bed EXOME_BED] [--strip_info] [-v] [-V] [-n] [-f]
6161
6262
ntRoot: Ancestry inference from genomic data
6363
64-
optional arguments:
64+
options:
6565
-h, --help show this help message and exit
6666
-r REFERENCE, --reference REFERENCE
6767
Reference genome (FASTA, Multi-FASTA, and/or gzipped compatible)
6868
--reads READS Prefix of input reads file(s) for detecting SNVs. All files in the working directory with the specified prefix will be used. (fastq, fasta, gz, bz, zip)
6969
--genome GENOME [GENOME ...]
7070
Genome assembly file(s) for detecting SNVs compared to --reference
7171
-l L input IVC VCF file with annotated variants (e.g., 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz, clinvar.vcf, etc.)
72+
--exome Input reads for detecting SNVs are from whole exome sequencing. If provided, must also specify either --exome_bed or --masked. --cutoff 2 is implied unless otherwise specified.
7273
-k K k-mer size
7374
--tile TILE Tile size for ancestry fraction inference (bp) [default=5000000]
7475
--lai Output ancestry predictons per tile in a separate output file
7576
-t T Number of threads [default=4]
7677
-z Z Minimum contig length [default=100]
7778
-j J controls size of k-mer subset. When checking subset of k-mers, check every jth k-mer [default=3]
79+
--cutoff CUTOFF Minimum coverage of k-mers in ntEdit Bloom filter. Solid k-mers are used if set to 0 [0]
7880
-Y Y Ratio of number of k-mers in the k subset that should be present to accept an edit (higher=stringent) [default=0.55]
7981
--custom_vcf CUSTOM_VCF
8082
Input VCF for computing ancestry. When specified, ntRoot will skip the ntEdit step, and predict ancestry from the provided VCF.
83+
--masked Exome Mode (--exome) only: Indicates that the reference genome provided with --reference has all NON-targeted exonic regions hard-masked.
84+
--exome_bed EXOME_BED
85+
Exome Mode (--exome) only: BED file of exome targeted regions.
8186
--strip_info When using --custom_vcf, strip the existing INFO field from the input VCF.
8287
-v, --verbose Verbose mode [default=False]
8388
-V, --version show program's version number and exit
@@ -117,6 +122,7 @@ GRCh38.fa.gz
117122
readme
118123
</pre>
119124

125+
**Running ntRoot with whole genome sequencing reads or genome assemblies**
120126

121127
Users will specify:
122128
<pre>
@@ -128,6 +134,19 @@ Example command:
128134
ntroot -k 55 --reference GRCh38.fa.gz --reads ERR3242308_ -t 48 -Y 0.55 -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz
129135
</pre>
130136

137+
**Running ntRoot with whole exome sequencing**
138+
139+
If your input reads are from whole exome sequencing, the regions of your reference genome that are NOT targeted exonic regions should be hard-masked (converted to Ns):
140+
<pre>
141+
ntroot -k 55 --exome --masked --reference masked_GRCh38.fa.gz --reads test_exome -t 48 -Y 0.55 -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz
142+
</pre>
143+
ntRoot can perform the masking automatically if you do not already have a masked reference file. In that case, provide a BED file with all the targeted regions, and ntRoot will use bedtools to mask the reference regions that are NOT targeted regions:
144+
<pre>
145+
ntroot -k 55 --exome --exome_bed exome_seq_regions.bed --reference GRCh38.fa.gz --reads test_exome -t 48 -Y 0.55 -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz
146+
</pre>
147+
148+
**Running ntRoot using a pre-existing VCF file:**
149+
131150
If you would like to infer ancestry from a pre-existing VCF file:
132151
<pre>
133152
ntroot -r GRCh38.fa.gz --custom_vcf third_party.vcf -l 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz

demo/chr20-21.fa.gz

29.2 MB
Binary file not shown.

demo/pop-spec-snp_chr20-21.vcf.gz

14.8 MB
Binary file not shown.

demo/run_ntroot_demo.sh

Lines changed: 35 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,12 @@ echo "Please ensure that ntRoot is on your PATH"
66
set -eux -o pipefail
77

88
echo "Running ntRoot reads demo..."
9-
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/ERR3239334.chr21_1.fq.gz
10-
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/ERR3239334.chr21_2.fq.gz
9+
if [ ! -f ERR3239334.chr21_1.fq.gz ]; then
10+
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/ERR3239334.chr21_1.fq.gz
11+
fi
12+
if [ ! -f ERR3239334.chr21_2.fq.gz ]; then
13+
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/ERR3239334.chr21_2.fq.gz
14+
fi
1115

1216
ntroot --reference chr21.fa --reads ERR3239334.chr21_ -k 55 -l pop-spec-snp_chr21.vcf.gz
1317

@@ -42,6 +46,34 @@ else
4246
echo "ntRoot input VCF test failed.. Please check your installation."
4347
exit 1
4448
fi
45-
49+
50+
echo "Running ntRoot --exome reads demo with --exome_bed..."
51+
52+
if [ ! -f HG00864_ERR050736.chr20-21.fastq.gz ]; then
53+
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/HG00864_ERR050736.chr20-21.fastq.gz
54+
fi
55+
if [ ! -f HG00864_ERR050737.chr20-21.fastq.gz ]; then
56+
wget https://www.bcgsc.ca/downloads/btl/ntroot/reads_demo/HG00864_ERR050737.chr20-21.fastq.gz
57+
fi
58+
59+
ntroot --reference chr20-21.fa.gz --reads HG00864 -k 55 -l pop-spec-snp_chr20-21.vcf.gz --exome --exome_bed exome_targets.bed
60+
prediction=$(cat HG00864_ntedit_k55_exome_variants.vcf_ancestry-predictions_tile5000000.tsv | awk '{print $1}' |head -n 2 |tail -n 1)
61+
if [ ${prediction} == "EAS" ]; then
62+
echo "ntRoot --exome reads test successful!"
63+
else
64+
echo "ntRoot --exome reads test failed.. Please check your installation."
65+
exit 1
66+
fi
67+
68+
echo "Running ntRoot --exome demo with --masked (input reference already masked based on exon target coordinates)"
69+
ntroot --exome --reference masked_chr20-21.fa.gz --reads HG00864 -k 55 -l pop-spec-snp_chr20-21.vcf.gz --masked --force
70+
prediction=$(cat HG00864_ntedit_k55_exome_variants.vcf_ancestry-predictions_tile5000000.tsv | awk '{print $1}' |head -n 2 |tail -n 1)
71+
if [ ${prediction} == "EAS" ]; then
72+
echo "ntRoot --exome masked test successful!"
73+
else
74+
echo "ntRoot --exome masked test failed.. Please check your installation."
75+
exit 1
76+
fi
77+
4678

4779
echo "Done ntRoot tests!"

ntroot

Lines changed: 50 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,9 @@ def set_up_parser():
3434
parser.add_argument("-l",
3535
help="input IVC VCF file with annotated variants (e.g., 1000GP_integrated_snv_v2a_27022019.GRCh38.phased_gt1.vcf.gz, clinvar.vcf, etc.)",
3636
type=str, required=True)
37+
parser.add_argument("--exome", help="Input reads for detecting SNVs are from whole exome sequencing. "
38+
"If provided, must also specify either --exome_bed or --masked. --cutoff 2 is implied unless otherwise specified.",
39+
action="store_true")
3740

3841
parser.add_argument("-k",
3942
help="k-mer size",
@@ -50,6 +53,8 @@ def set_up_parser():
5053
help="controls size of k-mer subset. When checking subset of k-mers, check every jth k-mer "
5154
"[default=3]",
5255
default=3, type=int)
56+
parser.add_argument("--cutoff", help="Minimum coverage of k-mers in ntEdit Bloom filter. "
57+
"Solid k-mers are used if set to 0 [0]", default=0, type=int)
5358
parser.add_argument("-Y",
5459
help="Ratio of number of k-mers in the k subset that should be present to accept "
5560
"an edit (higher=stringent) "
@@ -58,7 +63,14 @@ def set_up_parser():
5863
"When specified, ntRoot will skip the ntEdit step, and "
5964
"predict ancestry from the provided VCF.",
6065
type=str)
61-
parser.add_argument("--strip_info", help="When using --custom_vcf, strip the existing INFO field from the input VCF.",
66+
parser.add_argument("--masked", help="Exome Mode (--exome) only: "
67+
"Indicates that the reference genome provided with --reference "
68+
"has all NON-targeted exonic regions hard-masked. ",
69+
action="store_true")
70+
parser.add_argument("--exome_bed", help="Exome Mode (--exome) only: BED file of exome targeted regions. ",
71+
type=str)
72+
parser.add_argument("--strip_info", help="When using --custom_vcf, "
73+
"strip the existing INFO field from the input VCF.",
6274
action="store_true")
6375
parser.add_argument("-v", "--verbose",
6476
help="Verbose mode [default=False]", action="store_true", default=False)
@@ -80,6 +92,25 @@ def main():
8092
if ((args.reads and args.genome) or (not args.reads and not args.genome)) and not args.custom_vcf:
8193
raise argparse.ArgumentTypeError("Please specify --reads OR --genome")
8294

95+
if args.exome and not (args.masked or args.exome_bed):
96+
raise argparse.ArgumentTypeError("In exome mode, please specify either --masked or --exome_bed")
97+
98+
if args.exome and (args.masked and args.exome_bed):
99+
raise argparse.ArgumentTypeError("In exome mode, please specify either --masked OR --exome_bed")
100+
101+
if not args.exome and (args.masked or args.exome_bed):
102+
raise argparse.ArgumentTypeError("Parameters --masked and --exome_bed are only used only in exome mode")
103+
104+
if args.exome and args.masked:
105+
print("--exome and --masked specified. Assuming reference genome provided with --reference has all"
106+
" NON-targeted exonic regions hard-masked.", flush=True)
107+
108+
if args.exome and args.cutoff < 2:
109+
args.cutoff = 2
110+
print("In exome mode, minimum k-mer coverage cutoff is set to 2 by default "
111+
"unless explicitly set higher. Setting cutoff to 2.",
112+
flush=True)
113+
83114
if not args.draft and not args.reference:
84115
raise argparse.ArgumentTypeError("Please specify the reference genome with --reference")
85116
if args.draft:
@@ -95,7 +126,11 @@ def main():
95126
"Parameter settings:"]
96127

97128
if args.reads:
98-
if args.lai:
129+
if args.exome and args.lai:
130+
smk_rule = "ntroot_reads_exome_lai"
131+
elif args.exome:
132+
smk_rule = "ntroot_reads_exome"
133+
elif args.lai:
99134
smk_rule = "ntroot_reads_lai"
100135
else:
101136
smk_rule = "ntroot_reads"
@@ -127,6 +162,19 @@ def main():
127162
intro_string.append(f"\t--reads {args.reads}")
128163
command += f"reads={args.reads}"
129164

165+
intro_string.append(f"\t--cutoff {args.cutoff}")
166+
command += f" cutoff={args.cutoff}"
167+
168+
if args.exome:
169+
intro_string.append("\t--exome")
170+
command += " exome=True"
171+
if args.masked:
172+
intro_string.append("\t--masked")
173+
command += " masked=True"
174+
elif args.exome_bed:
175+
intro_string.append(f"\t--exome_bed {args.exome_bed}")
176+
command += f" exome_bed={args.exome_bed}"
177+
130178
intro_string.extend([f"\t--reference {args.reference}",
131179
f"\t-t {args.t}"])
132180

ntroot_run_pipeline.smk

Lines changed: 90 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@ import shutil
77
onsuccess:
88
shutil.rmtree(".snakemake", ignore_errors=True)
99

10+
ruleorder: exome_mask_fasta > exome_masked_provided
11+
1012
# Read parameters from config or set default values
1113
reference=config["reference"]
1214
draft_base = os.path.basename(os.path.realpath(reference))
@@ -28,6 +30,7 @@ v = config["verbose"] if "verbose" in config else 0
2830
j = config["j_param"] if "j_param" in config else 3
2931
Y = config["Y_param"] if "Y_param" in config else 0.55
3032
l = config["l_vcf"] if "l_vcf" in config else ""
33+
cutoff = config["cutoff"] if "cutoff" in config else 0
3134

3235
# Ancestry inference parameters
3336
tile_size = config["tile_size"] if "tile_size" in config else 5000000
@@ -37,6 +40,12 @@ input_vcf = config["input_vcf"] if "input_vcf" in config else None
3740
input_vcf_basename = os.path.basename(os.path.realpath(input_vcf)) if input_vcf else "None"
3841
strip_info = config["strip_info"] if "strip_info" in config else None
3942

43+
# Exome parameters
44+
exome = config["exome"] if "exome" in config else False
45+
masked = config["masked"] if "masked" in config else False
46+
exome_bed = config["exome_bed"] if "exome_bed" in config else ""
47+
exome_bed_base = os.path.basename(os.path.realpath(exome_bed)) if exome_bed else "None"
48+
4049
# time command
4150
mac_time_command = "command time -l -o"
4251
linux_time_command = "command time -v -o"
@@ -51,12 +60,18 @@ rule ntroot_genome:
5160
rule ntroot_reads:
5261
input: f"{reads_prefix}_ntedit_k{k}_variants.vcf_ancestry-predictions_tile{tile_size}.tsv"
5362

63+
rule ntroot_reads_exome:
64+
input: f"{reads_prefix}_ntedit_k{k}_exome_variants.vcf_ancestry-predictions_tile{tile_size}.tsv"
65+
5466
rule ntroot_genome_lai:
5567
input: f"{genome_prefix}_ntedit_k{k}_variants.vcf_ancestry-predictions-tile-resolution_tile{tile_size}.tsv"
5668

5769
rule ntroot_reads_lai:
5870
input: f"{reads_prefix}_ntedit_k{k}_variants.vcf_ancestry-predictions-tile-resolution_tile{tile_size}.tsv"
5971

72+
rule ntroot_reads_exome_lai:
73+
input: f"{reads_prefix}_ntedit_k{k}_exome_variants.vcf_ancestry-predictions-tile-resolution_tile{tile_size}.tsv"
74+
6075
rule ntroot_input_vcf:
6176
input: f"{input_vcf_basename}.cross-ref.vcf_ancestry-predictions_tile{tile_size}.tsv"
6277

@@ -73,28 +88,32 @@ rule ntedit_reads:
7388
out_bf = temp(f"{reads_prefix}_k{k}.bf")
7489
params:
7590
benchmark = f"{time_command} ntedit_snv_k{k}.time",
76-
params = f"-k {k} -t {t} -z {z} -j {j} -Y {Y} --solid ",
91+
params = f"-k {k} -t {t} -z {z} -j {j} -Y {Y}",
92+
cutoff = f"--cutoff {cutoff}" if cutoff > 0 else "--solid",
7793
vcf_input = f"-l {l}" if l else ""
7894
shell:
79-
"{params.benchmark} run-ntedit snv --reference {reference} --reads {reads_prefix_full} {params.params} "
80-
"{params.vcf_input}"
95+
"{params.benchmark} run-ntedit snv --reference {input.reference} --reads {reads_prefix_full} {params.params} "
96+
"{params.vcf_input} {params.cutoff}"
8197

82-
rule ntedit_genome:
98+
rule ntedit_exome_reads:
8399
input:
84-
reference = reference,
85-
genomes = genomes
100+
reference = f"masked_{draft_base}"
86101
output:
87-
out_vcf = f"{genome_prefix}_ntedit_k{k}_variants.vcf",
88-
out_fa = temp(f"{genome_prefix}_ntedit_k{k}_edited.fa"),
89-
out_changes = temp(f"{genome_prefix}_ntedit_k{k}_changes.tsv"),
90-
out_bf = temp(f"{genome_prefix}_k{k}.bf")
102+
out_vcf = f"{reads_prefix}_ntedit_k{k}_exome_variants.vcf",
103+
out_fa = temp(f"{reads_prefix}_ntedit_k{k}_edited.fa"),
104+
out_changes = temp(f"{reads_prefix}_ntedit_k{k}_changes.tsv"),
105+
out_bf = temp(f"{reads_prefix}_k{k}.bf")
91106
params:
92107
benchmark = f"{time_command} ntedit_snv_k{k}.time",
93108
params = f"-k {k} -t {t} -z {z} -j {j} -Y {Y}",
94-
vcf_input = f"-l {l}" if l else ""
109+
cutoff = f"--cutoff {cutoff}" if cutoff > 0 else "--solid",
110+
vcf_input = f"-l {l}" if l else "",
111+
out_vcf_tmp = f"{reads_prefix}_ntedit_k{k}_variants.vcf"
95112
shell:
96-
"{params.benchmark} run-ntedit snv --reference {reference} --genome {input.genomes} {params.params} "
97-
" {params.vcf_input}"
113+
"""
114+
{params.benchmark} run-ntedit snv --reference {input.reference} --reads {reads_prefix_full} {params.params} {params.vcf_input} {params.cutoff}
115+
mv {params.out_vcf_tmp} {output.out_vcf}
116+
"""
98117

99118
rule samtools_faidx:
100119
input: reference = reference
@@ -107,6 +126,64 @@ rule samtools_faidx:
107126
else:
108127
shell("{params.benchmark} samtools faidx -o {output.out_fai} {input.reference}")
109128

129+
rule exome_masked_provided:
130+
input:
131+
reference = reference
132+
output:
133+
masked_reference = f"masked_{draft_base}"
134+
shell:
135+
"ln -sf {input.reference} {output.masked_reference}"
136+
137+
rule exome_sort_bed:
138+
input:
139+
bed = exome_bed,
140+
ref_fai = rules.samtools_faidx.output
141+
output:
142+
sorted_bed = temp(f"{exome_bed_base}.sorted.bed")
143+
shell:
144+
"bedtools sort -i {input.bed} -faidx {input.ref_fai} > {output.sorted_bed}"
145+
146+
rule exome_complement_bed:
147+
input:
148+
sorted_bed = f"{exome_bed_base}.sorted.bed",
149+
ref_fai = rules.samtools_faidx.output
150+
output:
151+
complement_bed = temp(f"{exome_bed_base}.sorted.bed.complement.bed")
152+
shell:
153+
"bedtools complement -i {input.sorted_bed} -g {input.ref_fai} > {output.complement_bed}"
154+
155+
rule exome_mask_fasta:
156+
input:
157+
reference = reference,
158+
complement_bed = rules.exome_complement_bed.output
159+
output:
160+
masked_reference = f"masked_{draft_base}"
161+
run:
162+
if input.reference.endswith(".gz"):
163+
shell("bedtools maskfasta -fi <(gunzip -c {input.reference}) -bed {input.complement_bed} -fo tmp_{output.masked_reference}")
164+
shell("gzip -c tmp_{output.masked_reference} > {output.masked_reference}")
165+
shell("rm tmp_{output.masked_reference}")
166+
else:
167+
shell("bedtools maskfasta -fi {input.reference} -bed {input.complement_bed} -fo {output.masked_reference}")
168+
169+
rule ntedit_genome:
170+
input:
171+
reference = reference,
172+
genomes = genomes
173+
output:
174+
out_vcf = f"{genome_prefix}_ntedit_k{k}_variants.vcf",
175+
out_fa = temp(f"{genome_prefix}_ntedit_k{k}_edited.fa"),
176+
out_changes = temp(f"{genome_prefix}_ntedit_k{k}_changes.tsv"),
177+
out_bf = temp(f"{genome_prefix}_k{k}.bf")
178+
params:
179+
benchmark = f"{time_command} ntedit_snv_k{k}.time",
180+
params = f"-k {k} -t {t} -z {z} -j {j} -Y {Y}",
181+
vcf_input = f"-l {l}" if l else ""
182+
shell:
183+
"{params.benchmark} run-ntedit snv --reference {reference} --genome {input.genomes} {params.params} "
184+
" {params.vcf_input}"
185+
186+
110187
rule ancestry_prediction:
111188
input:
112189
vcf = "{vcf}"

0 commit comments

Comments
 (0)