This repository was archived by the owner on May 22, 2026. It is now read-only.

Commit 3bac523

Merge pull request #133 from PolinaBevad/transfer_documentation_from_doc
Updating the documentation and deleting docx
2 parents eb72162 + 9ea1d83 commit 3bac523

5 files changed

Lines changed: 184 additions & 36 deletions

File tree

Readme.md

Lines changed: 183 additions & 34 deletions
@@ -25,6 +25,10 @@ VarDictJava can run in single sample (see Single sample mode section), paired sa
2525
3. Perl (uses /usr/bin/env perl)
2626
4. Internet connection to download dependencies using gradle.
2727

28+
To see the help page for the program, run
29+
```
30+
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -H
31+
```
2832
## Getting started
2933

3034
### Getting source code
@@ -64,9 +68,36 @@ or
6468
./gradlew distTar
6569
```
6670

71+
#### Distribution Package Structure
72+
When the build command completes successfully, the `build/install/VarDict` folder contains the distribution package.
73+
74+
The distribution package has the following structure:
75+
* `bin/` - contains the launch scripts
76+
* `lib/` - has the jar file that contains the compiled project code and the jar files of the third-party libraries that the project uses.
77+
78+
You can move the distribution package (the content of the `build/install/VarDict` folder) to any convenient location.
79+
80+
Generated zip and tar releases will also contain scripts from VarDict Perl repository in `bin/` directory (`teststrandbias.R`,
81+
`testsomatic.R`, `var2vcf_valid.pl`, `var2vcf_paired.pl`).
82+
83+
You can add VarDictJava to your `PATH` by adding this line to `.bashrc`:
84+
```
85+
export PATH=/path/to/VarDict/bin:$PATH
86+
```
87+
After that, you can run VarDict with the `VarDict` command instead of the full path `<path_to_vardict_folder>/build/install/VarDict/bin/VarDict`.
88+
89+
#### Third-Party Libraries
90+
Currently, the project uses the following third-party libraries:
91+
* JRegex (http://jregex.sourceforge.net, BSD license) is a regular expressions library that is used instead of the
92+
standard Java library because its performance is much higher than that of the standard library.
93+
* Commons CLI (http://commons.apache.org/proper/commons-cli, Apache License) – a library for parsing the command line.
94+
* HTSJDK (http://samtools.github.io/htsjdk/) is an implementation of a unified Java library for accessing common file formats, such as SAM and VCF.
95+
* PowerMock and TestNG are the testing frameworks (not included in distribution, used only in tests).
96+
6797
### Single sample mode
6898

69-
To run VarDictJava in single sample mode, use a BAM file specified without the `|` symbol and perform Steps 3 and 4 (see the Program workflow section) using `teststrandbias.R` and `var2vcf_valid.pl.`
99+
To run VarDictJava in single sample mode, use a BAM file specified without the `|` symbol and perform Steps 3 and 4
100+
(see the Program workflow section) using `teststrandbias.R` and `var2vcf_valid.pl`.
70101
The following is an example command to run in single sample mode:
71102

72103
```
@@ -81,22 +112,42 @@ The following is an example command to run VarDictJava for a region (chromosome
81112
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -z -R chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 >vars.vcf
82113
```
83114

84-
In single sample mode, output columns contain a description and statistical info for variants in the single sample. See section Output Columns for list of columns in the output.
115+
In single sample mode, output columns contain a description and statistical info for variants in the single sample.
116+
See the Output Columns section for the list of columns in the output.
85117

86118
### Paired variant calling
87119

88-
To run paired variant calling, use BAM files specified as `BAM1|BAM2` and perform Steps 3 and 4 (see the Program Workflow section) using `testsomatic.R` and `var2vcf_paired.pl`.
120+
To run paired variant calling, use BAM files specified as `BAM1|BAM2` and perform Steps 3 and 4
121+
(see the Program Workflow section) using `testsomatic.R` and `var2vcf_paired.pl`.
89122

90-
In this mode, the number of statistics columns in the output is doubled: one set of columns is for the first sample, the other - for second sample.
123+
In this mode, the number of statistics columns in the output is doubled: one set of columns is
124+
for the first sample, the other for the second sample.
91125

92126
The following is an example command to run in paired mode:
93127

94128
```
95129
AF_THR="0.01" # minimum allele frequency
96130
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N tumor_sample_name -b "/path/to/tumor.bam|/path/to/normal.bam" -z -F -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/testsomatic.R | VarDict/var2vcf_paired.pl -N "tumor_sample_name|normal_sample_name" -f $AF_THR
97131
```
98-
### Running Tests
99132

133+
### Amplicon based calling
134+
This mode is active if the BED file uses 8-column format and the -R option is not specified.
135+
136+
In this mode, only the first list of BAM files is used even if the files are specified as `BAM1|BAM2`, as in paired variant calling.
137+
138+
For each segment, the BED file specifies the list of positions as start and end positions (columns 7 and 8 of
139+
the BED file). The Amplicon based calling mode outputs a record for every position between start and end that has
140+
any variant other than the reference one (all positions with the `-p` option). For any of these positions,
141+
VarDict in amplicon based calling mode outputs the following:
142+
* Same columns as in the single sample mode for the most frequent variant
143+
* Good variants for this position with the prefixes `GOOD1`, `GOOD2`, etc.
144+
* Bad variants for this position with the prefixes `BAD1`, `BAD2`, etc.
145+
146+
For this running mode, the `-a` option (default: `10:0.95`) specifies the criteria of discarding reads that are too
147+
far away from segments. A read is skipped if its start and end are more than 10 positions away from the segment
148+
ends and the overlap fraction between the read and the segment is less than 0.95.
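
The `-a` criterion described above can be sketched as follows. This is a minimal illustration of the sentence as written, not VarDict's actual code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the overlap fraction is assumed to be measured relative to the read length.

```python
def parse_amplicon_option(value="10:0.95"):
    """Split an `-a` value into (max edge distance in bp, min overlap fraction)."""
    distance, fraction = value.split(":")
    return int(distance), float(fraction)


def read_is_skipped(read_start, read_end, seg_start, seg_end, option="10:0.95"):
    """Return True if the read would be discarded under the rule described above."""
    max_dist, min_overlap = parse_amplicon_option(option)

    # Both read ends must be farther than max_dist from the matching segment ends.
    far_from_edges = (abs(read_start - seg_start) > max_dist
                      and abs(read_end - seg_end) > max_dist)

    # Overlap in bases, divided by the read length (assumption: the README
    # does not say which length the fraction is taken against).
    overlap = max(0, min(read_end, seg_end) - max(read_start, seg_start) + 1)
    read_length = read_end - read_start + 1
    low_overlap = overlap / read_length < min_overlap

    return far_from_edges and low_overlap
```

For example, a read spanning 50-120 against a segment 100-200 is far from both segment ends and overlaps the segment on only 21 of its 71 bases, so it would be skipped under this reading.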
149+
150+
### Running Tests
100151
#### Integration testing
101152

102153
The list of integration test cases is stored in files in `testdata/intergationtestcases` directory.
@@ -172,35 +223,92 @@ chr1,200,205,GCCGA
172223
chr2,10,12,AC
173224
```
174225

175-
>Note: VarDict expands given regions by 700bp to left and right (plus given value by `-x` option).
226+
>Note: VarDict expands given regions by 1200bp to left and right (plus given value by `-x` option).
176227
177228
## Program Workflow
178229

230+
#### The main workflow
179231
The VarDictJava program follows the workflow:
180232

181233
1. Get regions of interest from a BED file or the command line.
182234
2. For each segment:
183235
1. Find all variants for this segment in mapped reads:
184236
1. Optionally skip duplicated reads, low mapping-quality reads, and reads having a large number of mismatches.
185-
2. Skip a read if it does not overlap with the segment.
186-
3. Preprocess the CIGAR string for each read.
187-
4. For each position, create a variant. If a variant is already present, adjust its count using the adjCnt function.
188-
1. Realign some of the variants using special ad-hoc approaches.
189-
2. Calculate statistics for the variant, filter out some bad ones, if any.
190-
3. Assign a type to each variant.
191-
4. Output variants in an intermediate internal format (tabular). Columns of the table are described in the Output Columns section.
237+
2. Skip unmapped reads.
238+
3. Skip a read if it does not overlap with the segment.
239+
4. Preprocess the CIGAR string for each read (section CIGAR Preprocessing).
240+
5. For each position, create a variant. If a variant is already present, adjust its count using the adjCnt function.
241+
2. Find structural variants (can be disabled with the `-U` option).
242+
3. Realign insertions, deletions, large insertions, and large deletions using the unaligned parts of reads
243+
(soft-clipped ends). This step is optional and can be disabled using the `-k 0` switch.
244+
4. Calculate statistics for the variant, filter out some bad ones, if any.
245+
5. Assign a type to each variant.
246+
6. Output variants in an intermediate internal format (tabular). Columns of the table are described in the Output Columns section.
192247

193248
**Note**: To perform Steps 1 and 2, use Java VarDict.
194249

195250
3. Perform a statistical test for strand bias using an R script.
196-
**Note**: Use R script for this step.
251+
**Note**: Use the R script `teststrandbias.R` or `testsomatic.R` for this step.
197252
4. Transform the intermediate tabular format to VCF. Output the variants with filtering and statistical data.
198253
**Note**: Use the Perl scripts `var2vcf_valid.pl` or `var2vcf_paired.pl` for this step.
199-
200-
254+
255+
#### CIGAR Preprocessing (Initial Realignment)
256+
Read alignment is specified in a BAM file as a CIGAR string. VarDict modifies this string (and alignment) in the following special cases:
257+
* Soft clipping next to insertion/deletion is replaced with longer soft-clipping.
258+
The same takes place if insertion/deletion is separated from soft clipping by no more than 10 matched bases.
259+
* Short matched sequence and insertion/deletion at the beginning/end are replaced by soft-clipping.
260+
* Two close deletions and insertions are combined into one deletion and one insertion
261+
* Three close deletions are combined in one
262+
* Three close insertions/deletions are combined in one deletion or in insertion+deletion
263+
* Two close deletions are combined into one
264+
* Two close insertions/deletions are combined into one
265+
* Mis-clipping at the start/end is changed to a matched sequence
266+
267+
#### Variants
268+
Simple variants (SNV, simple insertions, and deletions) are constructed in the following way:
269+
* Single-nucleotide variation (SNV). VarDict inserts an SNV into the variants structure for every matched or
270+
mismatched base in the reads. If an SNV is already present in variants, VarDict adjusts its counts and statistics.
271+
* Simple insertion variant. If read alignment shows an insertion at the position, VarDict inserts +BASES
272+
string into the variants structure. If the variant is already present, VarDict adjusts its count and statistics.
273+
* Simple Deletion variant. If read alignment shows a deletion at the position, VarDict inserts -NUMBER
274+
into the variants structure. If the variant is already present, VarDict adjusts its count and statistics.
275+
VarDict also handles complex variants (for example, an insertion that is close to SNV or to deletion)
276+
using specialized ad-hoc methods.
277+
278+
Structural variants are searched for after simple variants. VarDict supports DUP, INV, and DEL structural variants.
279+
280+
#### Variant Description String
281+
The description string encodes a variant for VarDict internal use.
282+
283+
The following table describes Variant description string encoding:
284+
285+
String | Description
286+
----- | ----------
287+
[ATGC] | for SNPs
288+
+[ATGC]+ | for insertions
289+
-[0-9]+ | for deletions
290+
...#[ATGC]+ | for insertion/deletion variants followed by a short matched sequence
291+
...^[ATGC]+ | something followed by an insertion
292+
...^[0-9]+ | something followed by a deletion
293+
...&[ATGC]+ | for insertion/deletion variants followed by a matched sequence
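
As a rough illustration of the encoding in the table above, the sketch below classifies a description string with regular expressions. It is an assumption-labeled example, not VarDict's own parser: the function name and the returned labels are hypothetical, and any string containing one of the `#`, `^`, or `&` markers is simply reported as complex.

```python
import re

# Patterns for the three simple encodings listed in the table.
SIMPLE_PATTERNS = [
    ("snv", re.compile(r"[ATGC]")),
    ("insertion", re.compile(r"\+[ATGC]+")),
    ("deletion", re.compile(r"-[0-9]+")),
]

# Markers that the table uses for compound descriptions.
COMPLEX_MARKERS = ("#", "^", "&")


def classify_description(description):
    """Return a hypothetical label for a variant description string."""
    for label, pattern in SIMPLE_PATTERNS:
        if pattern.fullmatch(description):
            return label
    if any(marker in description for marker in COMPLEX_MARKERS):
        return "complex"
    return "unknown"
```

With this sketch, `classify_description("+ATG")` yields `"insertion"` and `classify_description("-3^AT")` yields `"complex"`.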
294+
295+
#### Variant Filtering
296+
A variant appears in the output if it satisfies the following criteria (in this order):
297+
1. Frequency of the variant exceeds the threshold set by the `-f` option (default = 1%).
298+
2. The minimum number of high-quality reads supporting variant is larger than the threshold set by the `-r` option (default = 2).
299+
3. The mean position of the variant in reads is not less than the value set by the `-P` option (default = 5).
300+
4. The mean base quality (phred score) for the variant is not less than the threshold set by the `-q` option (default = 22.5).
301+
5. Variant frequency is more than 25% or reference allele does not have much better mapping quality than the variant.
302+
6. Deletion variants are not located in the regions where the reference genome is missing.
303+
7. The ratio of high-quality reads to low-quality reads is larger than the threshold specified by `-o` option (default=1.5).
304+
8. Variant frequency exceeds 30%.
305+
9. Mean mapping quality exceeds the threshold set by the `-O` option (default: no filtering)
306+
10. In the case of an MSI region, the variant size is less than 12 nucleotides for the non-monomer MSI or 15 for the monomer MSI.
307+
Variant frequency is more than 10% for the non-monomer MSI and 25% for the monomer MSI.
308+
11. The variant does not have a "2;1" bias.
309+
12. The variant is not an SNV and its refallele or varallele is longer than 3 nucleotides when the variant frequency is less than 20%.
201310

202311
## Program Options
203-
204312
- `-H|-?`
205313
Print help page
206314
- `-h|--header`
@@ -240,7 +348,7 @@ The VarDictJava program follows the workflow:
240348
- `-N string`
241349
The sample name to be used directly. Will overwrite `-n` option
242350
- `-b string`
243-
The indexed BAM file
351+
The indexed BAM file. Multiple BAM files can be specified with the `:` delimiter.
244352
- `-c INT`
245353
The column for chromosome
246354
- `-S INT`
@@ -282,13 +390,17 @@ The VarDictJava program follows the workflow:
282390
- `-V freq`
283391
The lowest frequency in a normal sample allowed for a putative somatic mutations. Defaults to `0.05`
284392
- `-I INT`
285-
The indel size. Default: 50bp
393+
The indel size. Default: 50bp.
394+
Be cautious with the `-I` option, especially in amplicon mode, as amplicon sequencing is not suited
395+
to finding large indels. Increasing the search size might slow down the run, and false positives may appear in low
396+
complexity regions. Increasing it to 200-300 bp is recommended only for hybrid capture sequencing.
286397
- `-M INT`
287398
The minimum matches for a read to be considered. If, after soft-clipping, the matched bp is less than INT, then the
288399
read is discarded. It's meant for PCR based targeted sequencing where there's no insert and the matching is only the primers.
289400
Default: 0, or no filtering
290401
- `-th [threads]`
291-
If this parameter is missing, then the mode is one-thread. If you add the -th parameter, the number of threads equals to the number of processor cores. The parameter -th threads sets the number of threads explicitly.
402+
If this parameter is missing, VarDict runs in a single thread. If you add `-th` without a value, the number of threads
403+
equals the number of processor cores. `-th threads` sets the number of threads explicitly.
292404
- `-VS STRICT | LENIENT | SILENT`
293405
How strict to be when reading a SAM or BAM.
294406
`STRICT` - throw an exception if something looks wrong.
@@ -302,7 +414,7 @@ The VarDictJava program follows the workflow:
302414
Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once using **first** read only.
303415
Default: unique mode disabled, all reads are counted.
304416
- `--chimeric`
305-
Indicate to turn off chimeric reads filtering. Chimeric reads are artifacts from library construction,
417+
Indicate to turn off chimeric reads filtering. Chimeric reads are artifacts from library construction,
306418
where a read can be split into two segments, each will be aligned within 1-2 read length distance,
307419
but in opposite direction. Default: filtering enabled
308420
- `-U|--nosv`
@@ -342,19 +454,56 @@ The VarDictJava program follows the workflow:
342454
18. PStd - flag for read position standard deviation
343455
19. QMean - mean base quality
344456
20. QStd - flag for base quality standard deviation
345-
23. QRATIO - ratio of high quality reads to low-quality reads
346-
24. HIFREQ - variant frequency for high-quality reads
347-
25. EXTRAFR - Adjusted AF for indels due to local realignment
348-
26. SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
349-
27. MSI - MicroSattelite. > 1 indicates MSI
350-
28. MSINT - MicroSattelite unit length in bp
351-
29. NM - average number of mismatches for reads containing the variant
352-
30. HICNT - number of high-quality reads with the variant
353-
31. HICOV - position coverage by high quality reads
354-
21. 5pFlankSeq - neighboring reference sequence to 5' end
355-
22. 3pFlankSeq - neighboring reference sequence to 3' end
356-
23. SEGMENT:CHR_START_END - position description
357-
24. VARTYPE - variant type
457+
21. MAPQ - mapping quality
458+
22. QRATIO - ratio of high quality reads to low-quality reads
459+
23. HIFREQ - variant frequency for high-quality reads
460+
24. EXTRAFR - Adjusted AF for indels due to local realignment
461+
25. SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
462+
26. MSI - Microsatellite. > 1 indicates MSI
463+
27. MSINT - Microsatellite unit length in bp
464+
28. NM - average number of mismatches for reads containing the variant
465+
29. HICNT - number of high-quality reads with the variant
466+
30. HICOV - position coverage by high quality reads
467+
31. 5pFlankSeq - neighboring reference sequence to 5' end
468+
32. 3pFlankSeq - neighboring reference sequence to 3' end
469+
33. SEGMENT:CHR_START_END - position description
470+
34. VARTYPE - variant type
471+
35. DUPRATE - duplication rate in fraction
472+
36. SV splits-pairs-clusters: Splits - No. of split reads supporting SV, Pairs - No. of pairs supporting SV,
473+
Clusters - No. of clusters supporting SV
474+
475+
### Input Files
476+
477+
#### BED File – Regions
478+
VarDict uses 2 types of BED files for specifying regions of interest: 4-column and 8-column.
479+
The 8-column file format is used for targeted DNA deep sequencing analysis (amplicon based calling),
480+
the 4-column file format is used for single sample analysis.
481+
482+
All lines starting with #, browser, and track in a BED file are skipped.
483+
The column delimiter can be specified with the `-d` option (the default value is a tab, `\t`).
484+
485+
The 8-column file format involves the following data:
486+
* Chromosome name
487+
* Region start position
488+
* Region end position
489+
* Gene name
490+
* Score - not used by VarDict
491+
* Strand - not used by VarDict
492+
* Start position – VarDict starts outputting variants from this position
493+
* End position – VarDict stops outputting variants at this position
494+
495+
The 4-column file format involves the following data:
496+
* Chromosome name
497+
* Region start position
498+
* Region end position
499+
* Gene name
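
To make the two layouts concrete, here is a minimal parsing sketch. It is not VarDict's reader: the function name and dictionary keys are hypothetical, it handles only the 4-column and 8-column layouts listed above, and it also skips empty lines, which is an added assumption.

```python
def parse_bed(lines, delimiter="\t"):
    """Parse 4-column or 8-column BED lines into a list of region dicts."""
    regions = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip comment, browser, and track lines (and, by assumption, empty lines).
        if not line or line.startswith(("#", "browser", "track")):
            continue
        fields = line.split(delimiter)
        if len(fields) not in (4, 8):
            raise ValueError(f"expected 4 or 8 columns, got {len(fields)}: {line!r}")
        region = {
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "gene": fields[3],
        }
        if len(fields) == 8:
            # Columns 5 and 6 (score, strand) are not used; 7 and 8 bound the output.
            region["output_start"] = int(fields[6])
            region["output_end"] = int(fields[7])
        regions.append(region)
    return regions
```

A region with `output_start` and `output_end` keys corresponds to the 8-column layout; a region without them corresponds to the 4-column layout.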
500+
501+
#### FASTA File - Reference Genome
502+
The reference genome in FASTA format is read using HTSJDK library.
503+
For every invocation of the toVars function (usually 1 for a region in a BED file)
504+
and for every BAM file, a part of the reference genome is extracted from the FASTA file.
505+
506+
The region extracted from the FASTA file is extended on both sides, and the size of this extension is controlled by the REFEXT variable (option `-Y INT`, default 1200 bp).
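
The extension can be pictured with the short sketch below. It is an illustration under assumptions rather than VarDict's code: the function name is hypothetical, coordinates are 1-based and inclusive, and clamping to the chromosome bounds is assumed.

```python
def extend_region(start, end, refext=1200, chrom_length=None):
    """Widen [start, end] by refext bases on each side, staying on the chromosome."""
    extended_start = max(1, start - refext)      # never go before position 1
    extended_end = end + refext
    if chrom_length is not None:
        extended_end = min(chrom_length, extended_end)  # never run past the end
    return extended_start, extended_end
```

For instance, with the default of 1200 bp, the region 5000-6000 would be read from the reference as 3800-7200.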
358507

359508
# License
360509

VarDictDescription.docx

-40.9 KB
Binary file not shown.
-514 KB
Binary file not shown.

build.gradle

Lines changed: 1 addition & 1 deletion
@@ -65,4 +65,4 @@ test {
6565
useTestNG()
6666
}
6767

68-
applicationDistribution.from('VarDict/teststrandbias.R','VarDict/var2vcf_valid.pl','VarDict/var2vcf_paired.pl').into('bin')
68+
applicationDistribution.from('VarDict/teststrandbias.R','VarDict/testsomatic.R','VarDict/var2vcf_valid.pl','VarDict/var2vcf_paired.pl').into('bin')

src/main/java/com/astrazeneca/vardict/modules/CigarUtils.java

Lines changed: 0 additions & 1 deletion
@@ -215,7 +215,6 @@ public static Tuple.Tuple2<Integer, String> modifyCigar(int indel,
215215
flag = true;
216216
}
217217
} else if (threeDeletionsMatcher.find()) {
218-
// deletions added from 27.04.2018
219218
//length of both matched sequences and insertion
220219
tslen = toInt(threeDeletionsMatcher.group(4)) + toInt(threeDeletionsMatcher.group(6));
221220
//length of deletions and internal matched sequences
