This repository was archived by the owner on May 22, 2026. It is now read-only.

Commit b772179

Merge pull request #278 from PolinaBevad/fix_documentation_bed_fisher_test
Added Fisher exact test option and small fixes in docs.
2 parents db17547 + a0a5dde commit b772179

23 files changed

Lines changed: 2060 additions & 119 deletions

Readme.md

Lines changed: 22 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -99,18 +99,20 @@ standard Java library because its performance is much higher than that of the st
9999

100100
To run VarDictJava in single sample mode, use a BAM file specified without the `|` symbol and perform Steps 3 and 4
101101
(see the Program workflow section) using `teststrandbias.R` and `var2vcf_valid.pl`.
102-
The following is an example command to run in single sample mode:
102+
The following is an example command to run in single sample mode with a BED file.
103+
Set the options `-c`, `-S`, `-E`, `-g` to the numbers of the columns in your BED file that hold the chromosome, start, end
104+
and gene of the region, respectively:
103105

104106
```
105107
AF_THR="0.01" # minimum allele frequency
106-
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N sample_name -b /path/to/my.bam -z -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f $AF_THR
108+
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N sample_name -b /path/to/my.bam -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f $AF_THR > vars.vcf
107109
```
108110

109111
VarDictJava can also be invoked without a BED file if the region is specified in the command line with `-R` option.
110112
The following is an example command to run VarDictJava for a region (chromosome 7, position from 55270300 to 55270348, EGFR gene) with `-R` option:
111113

112114
```
113-
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -z -R chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 >vars.vcf
115+
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -R chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 > vars.vcf
114116
```
115117

116118
In single sample mode, output columns contain a description and statistical info for variants in the single sample.
@@ -124,7 +126,9 @@ To run paired variant calling, use BAM files specified as `BAM1|BAM2` and perfor
124126
In this mode, the number of statistics columns in the output is doubled: one set of columns is
125127
for the first sample, the other for the second sample.
126128

127-
The following is an example command to run in paired mode:
129+
The following is an example command to run in paired mode.
130+
Set the options `-c`, `-S`, `-E`, `-g` to the numbers of the columns in your BED file that hold the chromosome, start,
131+
end and gene of the region, respectively:
128132

129133
```
130134
AF_THR="0.01" # minimum allele frequency
@@ -360,7 +364,7 @@ These are only rough classification. You need to examine the p-value (after test
360364
- `-F bit`
361365
The hexadecimal bit flag used to filter reads. Default: `0x504` (filter unmapped reads, secondary alignments and duplicates). Use `-F 0` to turn it off.
362366
- `-z 0/1`
363-
Indicate whether the BED file contains zero-based coordinates, the same way as the Genome browser IGV does. -z 1 indicates that coordinates in a BED file start from 0. -z 0 indicates that the coordinates start from 1. Default: `1` for a BED file or amplicon BED file. Use `0` to turn it off. When using `-R` option, it is set to `0`
367+
Indicate whether the BED file contains zero-based coordinates, the same way as the Genome browser IGV does. -z 1 indicates that coordinates in a BED file start from 0. -z 0 indicates that the coordinates start from 1. Default: `1` for a BED file or amplicon BED file (0-based). Use `0` to turn it off. When using `-R` option, it is set to `0`
364368
- `-a|--amplicon int:float`
365369
Indicate it is amplicon based calling. Reads that do not map to the amplicon will be skipped. A read pair is considered to belong to the amplicon if the edges are less than int bp to the amplicon, and overlap fraction is at least float. Default: `10:0.95`
366370
- `-k 0/1`
@@ -485,6 +489,11 @@ These are only rough classification. You need to examine the p-value (after test
485489
The variant frequency threshold to determine variant as good in case of non-monomer MSI. Default: 0.1
486490
- `--mfreq`
487491
The variant frequency threshold to determine variant as good in case of monomer MSI. Default: 0.25
492+
- `--fisher`
493+
EXPERIMENTAL FEATURE: calculates the p-value and odds ratio with a Java implementation of the Fisher exact test, so that the R script can be dropped from the VarDict pipeline.
494+
This reduces processing time on big samples, because the R script uses the slow `textConnection` function.
495+
If you use this option, do NOT run `teststrandbias.R` or `testsomatic.R` after VarDictJava; run `var2vcf_valid.pl`
496+
or `var2vcf_paired.pl` after VarDictJava as usual.
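
As a rough illustration of what a two-sided Fisher exact test on strand counts computes, here is a minimal, dependency-free sketch. It is not the VarDictJava implementation: the class and method names (`FisherSketch`, `twoSidedPValue`) are hypothetical, and it evaluates the hypergeometric probabilities directly from log-factorials. The `1e-7` relative tolerance mirrors the one used by R's `fisher.test`.

```java
/** Hypothetical helper, not part of VarDictJava: two-sided Fisher exact test on a 2x2 table. */
final class FisherSketch {

    // log(n!) by summing logs; adequate for a sketch, slow for very deep coverage.
    static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    static double logChoose(int n, int r) {
        return logFactorial(n) - logFactorial(r) - logFactorial(n - r);
    }

    /**
     * Two-sided p-value for the table [[refFwd, refRev], [altFwd, altRev]]:
     * the total probability of all tables with the same margins that are
     * no more likely than the observed one.
     */
    static double twoSidedPValue(int refFwd, int refRev, int altFwd, int altRev) {
        int m = refFwd + refRev;   // reference reads
        int n = altFwd + altRev;   // variant reads
        int k = refFwd + altFwd;   // forward reads
        if (m + n == 0) {
            return 1.0;            // empty table: nothing to test
        }
        int lo = Math.max(0, k - n);
        int hi = Math.min(k, m);
        double logDenominator = logChoose(m + n, k);
        double[] p = new double[hi - lo + 1];
        for (int x = lo; x <= hi; x++) {
            p[x - lo] = Math.exp(logChoose(m, x) + logChoose(n, k - x) - logDenominator);
        }
        double observed = p[refFwd - lo];
        double relErr = 1 + 1e-7;  // tolerance for ties between equally likely tables
        double sum = 0.0;
        for (double v : p) {
            if (v <= observed * relErr) {
                sum += v;
            }
        }
        return Math.min(1.0, sum);
    }
}
```

For example, `FisherSketch.twoSidedPValue(3, 1, 1, 3)` sums the probabilities 1/70, 16/70, 16/70 and 1/70, giving 34/70.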
488497
## Output columns
489498
### Simple mode:
490499
1. Sample - sample name
@@ -599,14 +608,16 @@ Clusters - No. of clusters supporting SV from second sample
599608
### Input Files
600609

601610
#### BED File – Regions
602-
VarDict uses 2 types of BED files for specifying regions of interest: 4-column and 8-column.
603-
The 8-column file format is used for targeted DNA deep sequencing analysis (amplicon based calling),
604-
the 4-column file format - for single sample analysis.
611+
VarDict uses 2 types of BED files for specifying regions of interest: 8-column and all others.
612+
The 8-column file format is used for targeted DNA deep sequencing analysis (amplicon based calling); VarDict will
613+
attempt amplicon analysis whenever a BED file with 8 columns is provided.
614+
Otherwise, you can run single or paired sample analysis by setting the options `-c`, `-S`, `-E`, `-g`
615+
to the column numbers for the chromosome, start, end and gene of the region, respectively.
605616

606617
All lines starting with #, browser, and track in a BED file are skipped.
607618
The column delimiter can be specified with the `-d` option (the default value is a tab, `\t`).
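
The column options and skip rules described above can be sketched as follows. This is an illustrative reader, not VarDictJava's own parser: `BedSketch`, `Region` and `parse` are hypothetical names, the column arguments are 1-based like `-c`, `-S`, `-E`, `-g`, and shifting the start by one when `zeroBased` is set is an assumption based on the `-z` description rather than confirmed behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical BED reader illustrating the -c/-S/-E/-g column options. */
final class BedSketch {

    /** One region of interest; field names are illustrative. */
    static final class Region {
        final String chr;
        final int start;
        final int end;
        final String gene;

        Region(String chr, int start, int end, String gene) {
            this.chr = chr;
            this.start = start;
            this.end = end;
            this.gene = gene;
        }
    }

    /**
     * @param c, s, e, g 1-based column numbers for chromosome, start, end and gene
     * @param zeroBased  assumption: when true, the start is shifted to a 1-based position
     */
    static List<Region> parse(List<String> lines, String delimiter,
                              int c, int s, int e, int g, boolean zeroBased) {
        List<Region> regions = new ArrayList<>();
        String splitter = Pattern.quote(delimiter); // treat the delimiter literally, not as a regex
        for (String line : lines) {
            // Comment, browser and track lines carry no regions.
            if (line.startsWith("#") || line.startsWith("browser") || line.startsWith("track")) {
                continue;
            }
            String[] fields = line.split(splitter);
            int start = Integer.parseInt(fields[s - 1]);
            int end = Integer.parseInt(fields[e - 1]);
            if (zeroBased) {
                start += 1;
            }
            regions.add(new Region(fields[c - 1], start, end, fields[g - 1]));
        }
        return regions;
    }
}
```

With a 4-column file, calling `parse(lines, "\t", 1, 2, 3, 4, true)` corresponds to running with `-c 1 -S 2 -E 3 -g 4` and the default `-z 1`.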
608619

609-
The 8-column file format involves the following data:
620+
The 8-column amplicon BED file format involves the following data:
610621
* Chromosome name
611622
* Region start position
612623
* Region end position
@@ -616,7 +627,8 @@ The 8-column file format involves the following data:
616627
* Start position – VarDict starts outputting variants from this position
617628
* End position – VarDict stops outputting variants at this position
618629

619-
The 4-column file format involves the following data:
630+
For example, the 4-column BED file format involves the following data, and VarDict must be started with `-c 1 -S 2 -E 3 -g 4` to
631+
recognize it:
620632
* Chromosome name
621633
* Region start position
622634
* Region end position

build.gradle

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ plugins {
44
id 'jacoco'
55
}
66

7-
version = '1.7.0'
7+
version = '1.8.0'
88

99
repositories {
1010
mavenCentral()
@@ -23,6 +23,7 @@ afterEvaluate {
2323

2424
dependencies {
2525
compile 'commons-cli:commons-cli:1.2'
26+
compile 'org.apache.commons:commons-math3:3.6.1'
2627
compile 'com.edropple.jregex:jregex:1.2_01'
2728
compile('com.github.samtools:htsjdk:2.8.0') {
2829
transitive = false

src/main/java/com/astrazeneca/vardict/CmdParser.java

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -166,6 +166,8 @@ private Configuration parseCmd(CommandLine cmd) throws ParseException {
166166
config.adaptor.addAll(Arrays.asList(cmd.getOptionValue("adaptor").split(",")));
167167
}
168168

169+
config.fisher = cmd.hasOption("fisher");
170+
169171
if (cmd.hasOption("DP")) {
170172
String defaultPrinter = cmd.getOptionValue("DP", PrinterType.OUT.name());
171173
switch(defaultPrinter) {
@@ -231,6 +233,7 @@ private Options buildOptions() {
231233
options.addOption("UN", false, "Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once using first read only.");
232234
options.addOption("chimeric", false, "Indicate to turn off chimeric reads filtering.");
233235
options.addOption("deldupvar", false, "Turn on deleting of duplicate variants. Variants in this mode are considered and outputted only if start position of variant is inside the region interest.");
236+
options.addOption("fisher", false, "Experimental feature: Replaces the R scripts (teststrandbias.R and testsomatic.R) with a Java implementation of the Fisher exact test.");
234237
options.addOption("U", "nosv", false, "Turn off structural variant calling.");
235238

236239
options.addOption(OptionBuilder.withArgName("bit")

src/main/java/com/astrazeneca/vardict/Configuration.java

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -229,6 +229,11 @@ public class Configuration {
229229
*/
230230
public boolean deleteDuplicateVariants = false;
231231

232+
/**
233+
* Applying Fisher exact test on forward and reverse counts of variant.
234+
*/
235+
public boolean fisher = false;
236+
232237
/**
233238
* The minimum distance between two SV clusters in term of read length
234239
*/

src/main/java/com/astrazeneca/vardict/Utils.java

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -102,6 +102,12 @@ public static double roundHalfEven(String pattern, double value) {
102102
return Double.parseDouble(new DecimalFormat(pattern).format(value));
103103
}
104104

105+
public static String getRoundedValueToPrint(String pattern, double value) {
106+
return value == Math.round(value)
107+
? new DecimalFormat("0").format(value)
108+
: new DecimalFormat(pattern).format(value).replaceAll("0+$", "");
109+
}
110+
105111
/**
106112
* Method creates substring of string begin from specified idx.
107113
* If idx is negative, it returns substring, counted from the right end of string.
Lines changed: 200 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,200 @@
1+
package com.astrazeneca.vardict.data.fishertest;
2+
3+
import org.apache.commons.math3.distribution.HypergeometricDistribution;
4+
5+
import java.text.DecimalFormat;
6+
import java.util.ArrayList;
7+
import java.util.Collections;
8+
import java.util.List;
9+
import java.util.function.Function;
10+
11+
import static com.astrazeneca.vardict.Utils.roundHalfEven;
12+
13+
/**
14+
* EXPERIMENTAL FEATURE.
15+
* <p>
16+
* Implementation of FisherExact Test as it is implemented in R.
17+
* <p>
18+
* R implementation of Fisher Test for oddratio uses conditional MLE (maximum likelihood estimation)
19+
* that wasn't found in standard libraries for Java.
20+
* <p>
21+
* Reason to replace R fisher test with this implementation is a slow R `textConnection` function.
22+
* In other case we have to use temp files to process VarDict result in R faster, and this is not a good option.
23+
*/
24+
25+
public class FisherExact {
26+
private List<Double> logdc;
27+
private int m;
28+
private int n;
29+
private int k;
30+
private int x;
31+
private int lo;
32+
private int hi;
33+
private double PvalueLess;
34+
private double PvalueGreater;
35+
private double PvalueTwoSided;
36+
private List<Integer> support;
37+
38+
// Seems that Java and R have differences with round half even (JDK-8227248 example, it will round value in memory)
39+
public static double RESULT_ROUND_R = 1E5;
40+
41+
public FisherExact(int refFwd, int refRev, int altFwd, int altRev) {
42+
m = refFwd + refRev;
43+
n = altFwd + altRev;
44+
k = refFwd + altFwd;
45+
x = refFwd;
46+
lo = Math.max(0, k - n);
47+
hi = Math.min(k, m);
48+
support = new ArrayList<>();
49+
for (int j = lo; j <= hi; j++) {
50+
support.add(j);
51+
}
52+
logdc = logdcDhyper(m, n, k);
53+
54+
calculatePValue();
55+
}
56+
57+
// Density of the central hypergeometric distribution on its support: store for once as this is needed quite a bit.
58+
private List<Double> logdcDhyper(int m, int n, int k) {
59+
List<Double> logdc = new ArrayList<>();
60+
61+
for (int element : support) {
62+
if (m + n == 0) {
63+
logdc.add(0.0);
64+
continue;
65+
}
66+
// m + n - population size (all reads), m - number of successes (reference), k - sample size (forward)
67+
HypergeometricDistribution dhyper = new HypergeometricDistribution(m + n, m, k);
68+
Double value = dhyper.logProbability(element);
69+
if (value.isNaN()) {
70+
value = 0.0;
71+
}
72+
logdc.add(roundHalfEven("0.0000000", value));
73+
}
74+
return logdc;
75+
}
76+
77+
// Determine the MLE for ncp by solving E(X) = x, where the expectation is with respect to H.
78+
// Note that in general the conditional distribution of x given the marginals is a non-central hypergeometric
79+
// distribution H with non-centrality parameter ncp, the odds ratio.
80+
// The null conditional independence is equivalent to the hypothesis that the odds ratio equals one. `Exact`
81+
// inference can be based on observing that in general, given all marginal totals fixed, the first element of the
82+
// contingency table has a non-central hypergeometric distribution with non-centrality parameter given by odds
83+
// ratio (Fisher, 1935). The alternative for a one-sided test is based on the odds ratio, so alternative =
84+
// 'greater' is a test of the odds ratio being bigger than the null value `or` (1 here).
85+
private Double mle(double x) {
86+
double eps = Math.ulp(1.0);
87+
if (x == lo) return 0.0;
88+
if (x == hi) return Double.POSITIVE_INFINITY;
89+
double mu = mnhyper(1.0);
90+
double root;
91+
if (mu > x) {
92+
Function<Double, Double> f = t -> mnhyper(t) - x;
93+
root = UnirootZeroIn.zeroinC(0, 1, f, Math.pow(eps, 0.25));
94+
} else if (mu < x) {
95+
Function<Double, Double> f = t -> mnhyper(1.0 / t) - x;
96+
root = 1.0 / UnirootZeroIn.zeroinC(eps, 1, f, Math.pow(eps, 0.25));
97+
} else {
98+
root = 1.0;
99+
}
100+
return root;
101+
}
102+
103+
private Double mnhyper(Double ncp) {
104+
if (ncp == 0) return (double) lo;
105+
if (ncp.isInfinite()) return (double) hi;
106+
else {
107+
List<Double> dnhyperResult = dnhyper(ncp);
108+
List<Double> multiply = new ArrayList<>();
109+
for (int i = 0; i < support.size(); i++) {
110+
multiply.add(support.get(i) * dnhyperResult.get(i));
111+
}
112+
double b = multiply.stream().mapToDouble(a -> a).sum();
113+
return b;
114+
}
115+
}
116+
117+
private List<Double> dnhyper(Double ncp) {
118+
List<Double> result = new ArrayList<>();
119+
for (int i = 0; i < support.size(); i++) {
120+
result.add(logdc.get(i) + Math.log(ncp) * support.get(i));
121+
}
122+
double maxResult = Collections.max(result);
123+
List<Double> exponentResult = new ArrayList<>();
124+
125+
for (double el : result) {
126+
exponentResult.add(Math.exp(el - maxResult));
127+
}
128+
result = new ArrayList<>();
129+
double sum = exponentResult.stream().mapToDouble(a -> a).sum();
130+
for (double element : exponentResult) {
131+
result.add(element / sum);
132+
}
133+
return result;
134+
}
135+
136+
public String getOddRatio() {
137+
Double oddRatio = mle(x);
138+
if (oddRatio.isInfinite()) {
139+
return "Inf";
140+
} else if (oddRatio == Math.round(oddRatio)) {
141+
return new DecimalFormat("0").format(oddRatio);
142+
} else {
143+
return String.valueOf(round_as_r(oddRatio));
144+
}
145+
}
146+
147+
public double getPValue() {
148+
return round_as_r(PvalueTwoSided);
149+
}
150+
151+
public List<Double> getLogdc() {
152+
logdc = logdcDhyper(m, n, k);
153+
return logdc;
154+
}
155+
156+
public double getPValueGreater() {
157+
return round_as_r(PvalueGreater);
158+
}
159+
160+
public double getPValueLess() {
161+
return round_as_r(PvalueLess);
162+
}
163+
164+
private double round_as_r(double value) {
165+
value = roundHalfEven("0", value * RESULT_ROUND_R);
166+
value = value/RESULT_ROUND_R;
167+
value = value == 0.0 ? 0 : (value == 1.0 ? 1 : value);
168+
return value;
169+
}
170+
171+
private void calculatePValue() {
172+
PvalueLess = pnhyper(x, false);
173+
PvalueGreater = pnhyper(x, true);
174+
175+
double relErr = 1 + 1E-7;
176+
List<Double> d = dnhyper(1.0);
177+
double sum = 0.0;
178+
for (Double el : d) {
179+
if (el <= d.get(x - lo) * relErr) {
180+
sum += el;
181+
}
182+
}
183+
PvalueTwoSided = sum;
184+
}
185+
186+
private double pnhyper(int q, boolean upper_tail) {
187+
if (m + n == 0) {
188+
return 1.0;
189+
}
190+
if (upper_tail) {
191+
HypergeometricDistribution dhyper = new HypergeometricDistribution(m + n, m, k);
192+
return dhyper.upperCumulativeProbability(q);
193+
} else {
194+
HypergeometricDistribution dhyper = new HypergeometricDistribution(m + n, m, k);
195+
return dhyper.cumulativeProbability(q);
196+
}
197+
}
198+
}
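
The conditional maximum likelihood estimate of the odds ratio solves E(X) = x under the non-central hypergeometric distribution. The sketch below illustrates that idea under stated assumptions and is not the code from this commit: `OddsRatioSketch` and its methods are hypothetical names, it uses plain bisection on the log odds ratio instead of the `UnirootZeroIn.zeroinC` root finder, and it assumes the estimate lies between `exp(-50)` and `exp(50)`.

```java
/** Hypothetical sketch: conditional MLE of the odds ratio for a 2x2 table, via bisection. */
final class OddsRatioSketch {

    static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    static double logChoose(int n, int r) {
        return logFactorial(n) - logFactorial(r) - logFactorial(n - r);
    }

    /** Mean of the non-central hypergeometric distribution with odds ratio exp(logNcp). */
    static double mean(double[] logdc, int lo, double logNcp) {
        double[] w = new double[logdc.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < w.length; i++) {
            w[i] = logdc[i] + logNcp * (lo + i);
            max = Math.max(max, w[i]);
        }
        double sum = 0.0;
        double weighted = 0.0;
        for (int i = 0; i < w.length; i++) {
            double e = Math.exp(w[i] - max); // subtract the max to avoid overflow
            sum += e;
            weighted += (lo + i) * e;
        }
        return weighted / sum;
    }

    static double conditionalMle(int refFwd, int refRev, int altFwd, int altRev) {
        int m = refFwd + refRev;
        int n = altFwd + altRev;
        int k = refFwd + altFwd;
        int x = refFwd;
        int lo = Math.max(0, k - n);
        int hi = Math.min(k, m);
        if (x == lo) return 0.0;                      // boundary: estimate is 0
        if (x == hi) return Double.POSITIVE_INFINITY; // boundary: estimate is infinite
        double[] logdc = new double[hi - lo + 1];     // unnormalised central log-density
        for (int j = lo; j <= hi; j++) {
            logdc[j - lo] = logChoose(m, j) + logChoose(n, k - j);
        }
        // The mean increases with the odds ratio, so bisect on its logarithm.
        double low = -50.0;  // assumed bracket for log(odds ratio)
        double high = 50.0;
        for (int iter = 0; iter < 200; iter++) {
            double mid = (low + high) / 2;
            if (mean(logdc, lo, mid) < x) {
                low = mid;
            } else {
                high = mid;
            }
        }
        return Math.exp((low + high) / 2);
    }
}
```

Bisection is slower than a Brent-style root finder but easier to check by hand; for the table `[[3, 1], [1, 3]]` the estimate is the positive root of t^4 - 36t^2 - 32t - 3 = 0, roughly 6.408.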
199+
200+
