Skip to content

Commit dbef3f5

Browse files
authored
Merge pull request #132 from evancofer/master
Allow specifying a substring to mutate with ISM
2 parents 22ec277 + 1bb78d4 commit dbef3f5

File tree

4 files changed

+192
-12
lines changed

4 files changed

+192
-12
lines changed

docs/source/overview/cli.md

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -303,13 +303,15 @@ You may find that there are more output files than you expect in `output_dir` at
303303
- **Warnings:** Selene may detect that the `ref` base(s) in a variant do not match with the bases specified in the reference sequence FASTA at the `(chrom, pos)`. In this case, Selene will use the `ref` base(s) specified in the VCF file in place of those in the reference genome and output predictions accordingly. These predictions will be distinguished by the row label column `ref_match` value `False`. You may review these variants and determine whether you still want to use those predictions/scores. If you find that most of the variants have `ref_match = False`, it may be that you have specified the wrong reference genome version---please check this before proceeding.
304304

305305
### _In silico_ mutagenesis
306-
An example configuration for _in silico_ mutagenesis when using a single sequence as input:
306+
An example configuration for _in silico_ mutagenesis of the whole sequence (i.e. rather than a subsequence), when using a single sequence as input:
307307
```YAML
308308
in_silico_mutagenesis: {
309309
input_sequence: ATCGATAAAATTCTGGAG...,
310310
save_data: [predictions, diffs],
311311
output_path_prefix: /path/to/output/dir/filename_prefix,
312-
mutate_n_bases: 1
312+
mutate_n_bases: 1,
313+
start_position: 0,
314+
end_position: None
313315
}
314316
```
315317

@@ -318,15 +320,19 @@ in_silico_mutagenesis: {
318320
- `save_data`: A list of the data files to output. Must input 1 or more of the following options: `[abs_diffs, diffs, logits, predictions]`. (Note that the raw prediction values will not be outputted by default---you must specify `predictions` in the list if you want them.)
319321
- `output_path_prefix`: Optional, default is "ism". The path to which the data files are written. We have specified that it should be a filename _prefix_ because we will append additional information depending on what files you would like to output (e.g. `fileprefix_logits.tsv`) If directories in the path do not yet exist, they will automatically be created.
320322
- `mutate_n_bases`: Optional, default is 1. The number of bases to mutate at any time. Standard _in silico_ mutagenesis only mutates a single base at a time, so we encourage users to start by leaving this value at 1. Double/triple mutations will be more difficult to interpret and are something we may work on in the future.
323+
- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. Also, the value of `end_position - start_position` should be at least `mutate_n_bases`.
324+
- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter from `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
321325

322-
An example configuration for _in silico_ mutagenesis when using a FASTA file as input:
326+
An example configuration for _in silico_ mutagenesis of the center 100 bases of a 1000 base sequence read from a FASTA file input:
323327
```YAML
324328
in_silico_mutagenesis: {
325-
input_path: /path/to/sequences1.fa,
329+
input_path: /path/to/sequences1.fa,
326330
save_data: [logits],
327331
output_dir: /path/to/output/predictions/dir,
328332
mutate_n_bases: 1,
329-
use_sequence_name: True
333+
use_sequence_name: True,
334+
start_position: 450,
335+
end_position: 550
330336
}
331337
```
332338

@@ -338,6 +344,8 @@ in_silico_mutagenesis: {
338344
- `use_sequence_name`: Optional, default is `True`.
339345
- If `use_sequence_name`, output files are prefixed by the sequence name/description corresponding to each sequence in the FASTA file. Spaces in the description are replaced with underscores '_'.
340346
- If not `use_sequence_name`, output files are prefixed with the index `i` corresponding to the `i`th sequence in the FASTA file.
347+
- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
348+
- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter passed to `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
341349

342350
## Sampler configurations
343351
Data sampling is used during model training and evaluation. You must specify the sampler in the configuration YAML file alongside the other operation-specific configurations (i.e. `train_model` or `evaluate_model`).

selene_sdk/predict/_in_silico_mutagenesis.py

Lines changed: 50 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,9 @@
77

88
def in_silico_mutagenesis_sequences(sequence,
99
mutate_n_bases=1,
10-
reference_sequence=Genome):
10+
reference_sequence=Genome,
11+
start_position=0,
12+
end_position=None):
1113
"""
1214
Creates a list containing each mutation that occurs from an
1315
*in silico* mutagenesis across the whole sequence.
@@ -26,6 +28,13 @@ def in_silico_mutagenesis_sequences(sequence,
2628
reference_sequence : class, optional
2729
Default is `selene_sdk.sequences.Genome`. The type of sequence
2830
that has been passed in.
31+
start_position : int, optional
32+
Default is 0. The starting position of the subsequence to be
33+
mutated.
34+
end_position : int or None, optional
35+
Default is None. The ending position of the subsequence to be
36+
mutated. If left as `None`, then `len(sequence)` will be
37+
used.
2938
3039
Returns
3140
-------
@@ -39,7 +48,46 @@ def in_silico_mutagenesis_sequences(sequence,
3948
we return a list with length of 3000-4000, depending on the number of
4049
unknown bases in the input sequences.
4150
51+
Raises
52+
------
53+
ValueError
54+
If the value of `start_position` or `end_position` is negative.
55+
ValueError
56+
If there are fewer than `mutate_n_bases` between `start_position`
57+
and `end_position`.
58+
ValueError
59+
If `start_position` is greater or equal to `end_position`.
60+
ValueError
61+
If `start_position` is not less than `len(sequence)`.
62+
ValueError
63+
If `end_position` is greater than `len(sequence)`.
64+
4265
"""
66+
if end_position is None:
67+
end_position = len(sequence)
68+
if start_position >= end_position:
69+
raise ValueError(("Starting positions must be less than the ending "
70+
"positions. Found a starting position of {0} with "
71+
"an ending position of {1}.").format(start_position,
72+
end_position))
73+
if start_position < 0:
74+
raise ValueError("Negative starting positions are not supported.")
75+
if end_position < 0:
76+
raise ValueError("Negative ending positions are not supported.")
77+
if start_position >= len(sequence):
78+
raise ValueError(("Starting positions must be less than the sequence length."
79+
" Found a starting position of {0} with a sequence length "
80+
"of {1}.").format(start_position, len(sequence)))
81+
if end_position > len(sequence):
82+
raise ValueError(("Ending positions must be less than or equal to the sequence "
83+
"length. Found an ending position of {0} with a sequence "
84+
"length of {1}.").format(end_position, len(sequence)))
85+
if (end_position - start_position) < mutate_n_bases:
86+
raise ValueError(("Fewer bases exist in the substring specified by the starting "
87+
"and ending positions than need to be mutated. There are only "
88+
"{0} currently, but {1} bases must be mutated at a "
89+
"time").format(end_position - start_position, mutate_n_bases))
90+
4391
sequence_alts = []
4492
for index, ref in enumerate(sequence):
4593
alts = []
@@ -50,7 +98,7 @@ def in_silico_mutagenesis_sequences(sequence,
5098
sequence_alts.append(alts)
5199
all_mutated_sequences = []
52100
for indices in itertools.combinations(
53-
range(len(sequence)), mutate_n_bases):
101+
range(start_position, end_position), mutate_n_bases):
54102
pos_mutations = []
55103
for i in indices:
56104
pos_mutations.append(sequence_alts[i])

selene_sdk/predict/model_predict.py

Lines changed: 107 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -654,7 +654,9 @@ def in_silico_mutagenesis(self,
654654
save_data,
655655
output_path_prefix="ism",
656656
mutate_n_bases=1,
657-
output_format="tsv"):
657+
output_format="tsv",
658+
start_position=0,
659+
end_position=None):
658660
"""
659661
Applies *in silico* mutagenesis to a sequence.
660662
@@ -674,6 +676,13 @@ def in_silico_mutagenesis(self,
674676
optimized operations for double and triple mutations.
675677
output_format : {'tsv', 'hdf5'}, optional
676678
Default is 'tsv'. The desired output format.
679+
start_position : int, optional
680+
Default is 0. The starting position of the subsequence to be
681+
mutated.
682+
end_position : int or None, optional
683+
Default is None. The ending position of the subsequence to be
684+
mutated. If left as `None`, then `self.sequence_length` will be
685+
used.
677686
678687
Returns
679688
-------
@@ -683,7 +692,46 @@ def in_silico_mutagenesis(self,
683692
file named `*_ref_predictions.h5` will be outputted with the
684693
model prediction for the original input sequence.
685694
695+
Raises
696+
------
697+
ValueError
698+
If the value of `start_position` or `end_position` is negative.
699+
ValueError
700+
If there are fewer than `mutate_n_bases` between `start_position`
701+
and `end_position`.
702+
ValueError
703+
If `start_position` is greater or equal to `end_position`.
704+
ValueError
705+
If `start_position` is not less than `self.sequence_length`.
706+
ValueError
707+
If `end_position` is greater than `self.sequence_length`.
708+
686709
"""
710+
if end_position is None:
711+
end_position = self.sequence_length
712+
if start_position >= end_position:
713+
raise ValueError(("Starting positions must be less than the ending "
714+
"positions. Found a starting position of {0} with "
715+
"an ending position of {1}.").format(start_position,
716+
end_position))
717+
if start_position < 0:
718+
raise ValueError("Negative starting positions are not supported.")
719+
if end_position < 0:
720+
raise ValueError("Negative ending positions are not supported.")
721+
if start_position >= self.sequence_length:
722+
raise ValueError(("Starting positions must be less than the sequence length."
723+
" Found a starting position of {0} with a sequence length "
724+
"of {1}.").format(start_position, self.sequence_length))
725+
if end_position > self.sequence_length:
726+
raise ValueError(("Ending positions must be less than or equal to the sequence "
727+
"length. Found an ending position of {0} with a sequence "
728+
"length of {1}.").format(end_position, self.sequence_length))
729+
if (end_position - start_position) < mutate_n_bases:
730+
raise ValueError(("Fewer bases exist in the substring specified by the starting "
731+
"and ending positions than need to be mutated. There are only "
732+
"{0} currently, but {1} bases must be mutated at a "
733+
"time").format(end_position - start_position, mutate_n_bases))
734+
687735
path_dirs, _ = os.path.split(output_path_prefix)
688736
if path_dirs:
689737
os.makedirs(path_dirs, exist_ok=True)
@@ -704,7 +752,9 @@ def in_silico_mutagenesis(self,
704752
sequence = str.upper(sequence)
705753
mutated_sequences = in_silico_mutagenesis_sequences(
706754
sequence, mutate_n_bases=1,
707-
reference_sequence=self.reference_sequence)
755+
reference_sequence=self.reference_sequence,
756+
start_position=start_position,
757+
end_position=end_position)
708758
reporters = self._initialize_reporters(
709759
save_data,
710760
output_path_prefix,
@@ -744,7 +794,9 @@ def in_silico_mutagenesis_from_file(self,
744794
output_dir,
745795
mutate_n_bases=1,
746796
use_sequence_name=True,
747-
output_format="tsv"):
797+
output_format="tsv",
798+
start_position=0,
799+
end_position=None):
748800
"""
749801
Apply *in silico* mutagenesis to all sequences in a FASTA file.
750802
@@ -776,6 +828,16 @@ def in_silico_mutagenesis_from_file(self,
776828
the FASTA file will have its own set of output files, where
777829
the number of output files depends on the number of `save_data`
778830
predictions/scores specified.
831+
start_position : int, optional
832+
Default is 0. The starting position of the subsequence to be
833+
mutated.
834+
end_position : int or None, optional
835+
Default is None. The ending position of the subsequence to be
836+
mutated. If left as `None`, then `self.sequence_length` will be
837+
used.
838+
839+
840+
779841
780842
Returns
781843
-------
@@ -785,7 +847,46 @@ def in_silico_mutagenesis_from_file(self,
785847
file named `*_ref_predictions.h5` will be outputted with the
786848
model prediction for the original input sequence.
787849
850+
Raises
851+
------
852+
ValueError
853+
If the value of `start_position` or `end_position` is negative.
854+
ValueError
855+
If there are fewer than `mutate_n_bases` between `start_position`
856+
and `end_position`.
857+
ValueError
858+
If `start_position` is greater or equal to `end_position`.
859+
ValueError
860+
If `start_position` is not less than `self.sequence_length`.
861+
ValueError
862+
If `end_position` is greater than `self.sequence_length`.
863+
788864
"""
865+
if end_position is None:
866+
end_position = self.sequence_length
867+
if start_position >= end_position:
868+
raise ValueError(("Starting positions must be less than the ending "
869+
"positions. Found a starting position of {0} with "
870+
"an ending position of {1}.").format(start_position,
871+
end_position))
872+
if start_position < 0:
873+
raise ValueError("Negative starting positions are not supported.")
874+
if end_position < 0:
875+
raise ValueError("Negative ending positions are not supported.")
876+
if start_position >= self.sequence_length:
877+
raise ValueError(("Starting positions must be less than the sequence length."
878+
" Found a starting position of {0} with a sequence length "
879+
"of {1}.").format(start_position, self.sequence_length))
880+
if end_position > self.sequence_length:
881+
raise ValueError(("Ending positions must be less than or equal to the sequence "
882+
"length. Found an ending position of {0} with a sequence "
883+
"length of {1}.").format(end_position, self.sequence_length))
884+
if (end_position - start_position) < mutate_n_bases:
885+
raise ValueError(("Fewer bases exist in the substring specified by the starting "
886+
"and ending positions than need to be mutated. There are only "
887+
"{0} currently, but {1} bases must be mutated at a "
888+
"time").format(end_position - start_position, mutate_n_bases))
889+
789890
os.makedirs(output_dir, exist_ok=True)
790891

791892
fasta_file = pyfaidx.Fasta(input_path)
@@ -803,7 +904,9 @@ def in_silico_mutagenesis_from_file(self,
803904
mutated_sequences = in_silico_mutagenesis_sequences(
804905
cur_sequence,
805906
mutate_n_bases=mutate_n_bases,
806-
reference_sequence=self.reference_sequence)
907+
reference_sequence=self.reference_sequence,
908+
start_position=start_position,
909+
end_position=end_position)
807910
cur_sequence_encoding = self.reference_sequence.sequence_to_encoding(
808911
cur_sequence)
809912
base_encoding = cur_sequence_encoding.reshape(

selene_sdk/predict/tests/test_model_predict.py

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,19 @@ def test_in_silico_muta_sequences_single(self):
2121
expected_lists = [[e] for e in expected]
2222
self.assertListEqual(observed, expected_lists)
2323

24+
def test_in_silico_muta_sequences_single_subset_positions(self):
25+
observed = in_silico_mutagenesis_sequences("ATCCG", start_position=1, end_position=4)
26+
expected = [
27+
(1, 'A'), (1, 'C'), (1, 'G'),
28+
(2, 'A'), (2, 'G'), (2, 'T'),
29+
(3, 'A'), (3, 'G'), (3, 'T')]
30+
31+
expected_lists = [[e] for e in expected]
32+
self.assertListEqual(observed, expected_lists)
33+
2434
def test_in_silico_muta_sequences_double(self):
2535
observed = in_silico_mutagenesis_sequences(
26-
"ATC", mutate_n_bases=2)
36+
"ATC", mutate_n_bases=2, start_position=0, end_position=3)
2737
expected = [
2838
[(0, 'C'), (1, 'A')], [(0, 'G'), (1, 'A')], [(0, 'T'), (1, 'A')],
2939
[(0, 'C'), (1, 'C')], [(0, 'G'), (1, 'C')], [(0, 'T'), (1, 'C')],
@@ -39,5 +49,16 @@ def test_in_silico_muta_sequences_double(self):
3949
]
4050
self.assertCountEqual(observed, expected)
4151

52+
def test_in_silico_muta_sequences_double_subset_positions(self):
53+
observed = in_silico_mutagenesis_sequences(
54+
"ATCG", mutate_n_bases=2, start_position=1, end_position=3)
55+
expected = [
56+
[(1, 'A'), (2, 'A')], [(1, 'C'), (2, 'A')], [(1, 'G'), (2, 'A')],
57+
[(1, 'A'), (2, 'G')], [(1, 'C'), (2, 'G')], [(1, 'G'), (2, 'G')],
58+
[(1, 'A'), (2, 'T')], [(1, 'C'), (2, 'T')], [(1, 'G'), (2, 'T')],
59+
]
60+
self.assertCountEqual(observed, expected)
61+
62+
4263
if __name__ == "__main__":
4364
unittest.main()

0 commit comments

Comments
 (0)