Commit 1bb78d4

Cli doc on selective ISM and better errors
1 parent a6ef969 commit 1bb78d4

3 files changed

Lines changed: 55 additions & 20 deletions

docs/source/overview/cli.md

Lines changed: 13 additions & 5 deletions
@@ -303,13 +303,15 @@ You may find that there are more output files than you expect in `output_dir` at
 - **Warnings:** Selene may detect that the `ref` base(s) in a variant do not match with the bases specified in the reference sequence FASTA at the `(chrom, pos)`. In this case, Selene will use the `ref` base(s) specified in the VCF file in place of those in the reference genome and output predictions accordingly. These predictions will be distinguished by the row label column `ref_match` value `False`. You may review these variants and determine whether you still want to use those predictions/scores. If you find that most of the variants have `ref_match = False`, it may be that you have specified the wrong reference genome version---please check this before proceeding.

 ### _In silico_ mutagenesis
-An example configuration for _in silico_ mutagenesis when using a single sequence as input:
+An example configuration for _in silico_ mutagenesis of the whole sequence (i.e. rather than a subsequence), when using a single sequence as input:
 ```YAML
 in_silico_mutagenesis: {
     input_sequence: ATCGATAAAATTCTGGAG...,
     save_data: [predictions, diffs],
     output_path_prefix: /path/to/output/dir/filename_prefix,
-    mutate_n_bases: 1
+    mutate_n_bases: 1,
+    start_position: 0,
+    end_position: None
 }
 ```

@@ -318,15 +320,19 @@ in_silico_mutagenesis: {
 - `save_data`: A list of the data files to output. Must input 1 or more of the following options: `[abs_diffs, diffs, logits, predictions]`. (Note that the raw prediction values will not be outputted by default---you must specify `predictions` in the list if you want them.)
 - `output_path_prefix`: Optional, default is "ism". The path to which the data files are written. We have specified that it should be a filename _prefix_ because we will append additional information depending on what files you would like to output (e.g. `fileprefix_logits.tsv`) If directories in the path do not yet exist, they will automatically be created.
 - `mutate_n_bases`: Optional, default is 1. The number of bases to mutate at any time. Standard _in silico_ mutagenesis only mutates a single base at a time, so we encourage users to start by leaving this value at 1. Double/triple mutations will be more difficult to interpret and are something we may work on in the future.
+- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. Also, the value of `end_position - start_position` should be at least `mutate_n_bases`.
+- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter from `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.

-An example configuration for _in silico_ mutagenesis when using a FASTA file as input:
+An example configuration for _in silico_ mutagenesis of the center 100 bases of a 1000 base sequence read from a FASTA file input:
 ```YAML
 in_silico_mutagenesis: {
-    input_path: /path/to/sequences1.fa,
+    input_path: /path/to/sequences1.fa,
     save_data: [logits],
     output_dir: /path/to/output/predictions/dir,
     mutate_n_bases: 1,
-    use_sequence_name: True
+    use_sequence_name: True,
+    start_position: 450,
+    end_position: 550
 }
 ```

@@ -338,6 +344,8 @@ in_silico_mutagenesis: {
 - `use_sequence_name`: Optional, default is `True`.
     - If `use_sequence_name`, output files are prefixed by the sequence name/description corresponding to each sequence in the FASTA file. Spaces in the description are replaced with underscores '_'.
     - If not `use_sequence_name`, output files are prefixed with the index `i` corresponding to the `i`th sequence in the FASTA file.
+- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
+- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter passed to `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.

 ## Sampler configurations
 Data sampling is used during model training and evaluation. You must specify the sampler in the configuration YAML file alongside the other operation-specific configurations (i.e. `train_model` or `evaluate_model`).
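
The `start_position: 450` and `end_position: 550` values in the FASTA example above follow from centering a 100-base window in a 1000-base sequence. A minimal sketch of that arithmetic, assuming positions are 0-based and `end_position` is exclusive (which is what the `end_position - start_position` length checks in this commit suggest). `center_window` is a hypothetical helper for illustration and is not part of Selene:

```python
def center_window(sequence_length, window_size):
    """Return (start_position, end_position) for a centered window.

    Hypothetical helper, not part of Selene. Assumes 0-based positions
    with an exclusive end, so the window covers end - start bases.
    """
    if not 0 < window_size <= sequence_length:
        raise ValueError(
            "window_size must be between 1 and sequence_length, "
            "got {0} for a sequence of length {1}.".format(
                window_size, sequence_length))
    start_position = (sequence_length - window_size) // 2
    end_position = start_position + window_size
    return start_position, end_position


start, end = center_window(1000, 100)
print(start, end)  # 450 550
```

When the leftover length is odd, integer division places the extra base on the right-hand side of the window.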

selene_sdk/predict/_in_silico_mutagenesis.py

Lines changed: 14 additions & 4 deletions
@@ -66,17 +66,27 @@ def in_silico_mutagenesis_sequences(sequence,
     if end_position is None:
         end_position = len(sequence)
     if start_position >= end_position:
-        raise ValueError("Starting positions must be less than the ending positions.")
+        raise ValueError(("Starting positions must be less than the ending "
+                          "positions. Found a starting position of {0} with "
+                          "an ending position of {1}.").format(start_position,
+                                                               end_position))
     if start_position < 0:
         raise ValueError("Negative starting positions are not supported.")
     if end_position < 0:
         raise ValueError("Negative ending positions are not supported.")
     if start_position >= len(sequence):
-        raise ValueError("Starting positions must be less than the sequence length.")
+        raise ValueError(("Starting positions must be less than the sequence length."
+                          " Found a starting position of {0} with a sequence length "
+                          "of {1}.").format(start_position, len(sequence)))
     if end_position > len(sequence):
-        raise ValueError("Ending positions must be less than or equal to the sequence length.")
+        raise ValueError(("Ending positions must be less than or equal to the sequence "
+                          "length. Found an ending position of {0} with a sequence "
+                          "length of {1}.").format(end_position, len(sequence)))
     if (end_position - start_position) < mutate_n_bases:
-        raise ValueError("Fewer bases exist in the substring specified by the starting and ending positions than need to be mutated.")
+        raise ValueError(("Fewer bases exist in the substring specified by the starting "
+                          "and ending positions than need to be mutated. There are only "
+                          "{0} currently, but {1} bases must be mutated at a "
+                          "time").format(end_position - start_position, mutate_n_bases))

     sequence_alts = []
     for index, ref in enumerate(sequence):
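
To illustrate what restricting mutagenesis to `[start_position, end_position)` means for `mutate_n_bases = 1`, here is a simplified sketch. It is not Selene's implementation (the body of `in_silico_mutagenesis_sequences` is not shown in this hunk); the function name, the `bases` parameter, and the returned tuple layout are assumptions made for illustration:

```python
def single_base_substitutions(sequence, start_position=0,
                              end_position=None, bases="ACGT"):
    """Enumerate single-base substitutions inside a window.

    Hypothetical illustration only. Returns a list of
    (position, alt_base, mutated_sequence) tuples for every position in
    [start_position, end_position) and every base that differs from the
    reference base at that position.
    """
    if end_position is None:
        end_position = len(sequence)
    mutated = []
    for index in range(start_position, end_position):
        ref = sequence[index]
        for alt in bases:
            if alt == ref:
                continue  # skip the unmutated (reference) base
            mutated.append(
                (index, alt, sequence[:index] + alt + sequence[index + 1:]))
    return mutated


# Two positions in the window, three alternative bases each.
for position, alt, seq in single_base_substitutions("ATCG", 1, 3):
    print(position, alt, seq)
```

For a four-letter alphabet, a window of `n` bases yields `3 * n` mutated sequences, so narrowing the window reduces the number of predictions proportionally.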

selene_sdk/predict/model_predict.py

Lines changed: 28 additions & 11 deletions
@@ -710,18 +710,27 @@ def in_silico_mutagenesis(self,
         if end_position is None:
             end_position = self.sequence_length
         if start_position >= end_position:
-            raise ValueError("Starting positions must be less than the ending positions.")
+            raise ValueError(("Starting positions must be less than the ending "
+                              "positions. Found a starting position of {0} with "
+                              "an ending position of {1}.").format(start_position,
+                                                                   end_position))
         if start_position < 0:
             raise ValueError("Negative starting positions are not supported.")
         if end_position < 0:
             raise ValueError("Negative ending positions are not supported.")
         if start_position >= self.sequence_length:
-            raise ValueError("Starting positions must be less than the sequence length.")
+            raise ValueError(("Starting positions must be less than the sequence length."
+                              " Found a starting position of {0} with a sequence length "
+                              "of {1}.").format(start_position, self.sequence_length))
         if end_position > self.sequence_length:
-            raise ValueError("Ending positions must be less than or equal to the sequence length.")
+            raise ValueError(("Ending positions must be less than or equal to the sequence "
+                              "length. Found an ending position of {0} with a sequence "
+                              "length of {1}.").format(end_position, self.sequence_length))
         if (end_position - start_position) < mutate_n_bases:
-            raise ValueError("Fewer bases exist in the substring specified by the starting and ending positions than need to be mutated.")
-
+            raise ValueError(("Fewer bases exist in the substring specified by the starting "
+                              "and ending positions than need to be mutated. There are only "
+                              "{0} currently, but {1} bases must be mutated at a "
+                              "time").format(end_position - start_position, mutate_n_bases))

         path_dirs, _ = os.path.split(output_path_prefix)
         if path_dirs:
@@ -856,19 +865,27 @@ def in_silico_mutagenesis_from_file(self,
         if end_position is None:
             end_position = self.sequence_length
         if start_position >= end_position:
-            raise ValueError("Starting positions must be less than the ending positions.")
+            raise ValueError(("Starting positions must be less than the ending "
+                              "positions. Found a starting position of {0} with "
+                              "an ending position of {1}.").format(start_position,
+                                                                   end_position))
         if start_position < 0:
             raise ValueError("Negative starting positions are not supported.")
         if end_position < 0:
             raise ValueError("Negative ending positions are not supported.")
         if start_position >= self.sequence_length:
-            raise ValueError("Starting positions must be less than the sequence length.")
+            raise ValueError(("Starting positions must be less than the sequence length."
+                              " Found a starting position of {0} with a sequence length "
+                              "of {1}.").format(start_position, self.sequence_length))
         if end_position > self.sequence_length:
-            raise ValueError("Ending positions must be less than or equal to the sequence length.")
+            raise ValueError(("Ending positions must be less than or equal to the sequence "
+                              "length. Found an ending position of {0} with a sequence "
+                              "length of {1}.").format(end_position, self.sequence_length))
         if (end_position - start_position) < mutate_n_bases:
-            raise ValueError("Fewer bases exist in the substring specified by the starting and ending positions than need to be mutated.")
-
-
+            raise ValueError(("Fewer bases exist in the substring specified by the starting "
+                              "and ending positions than need to be mutated. There are only "
+                              "{0} currently, but {1} bases must be mutated at a "
+                              "time").format(end_position - start_position, mutate_n_bases))

         os.makedirs(output_dir, exist_ok=True)

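The same sequence of position checks appears in all three functions touched by this commit. As a sketch of how they behave together, the checks can be gathered into one function. `validate_ism_positions` is a hypothetical name and is not something this commit adds; the check order and message wording are copied from the diff, with a trailing period added to the last message:

```python
def validate_ism_positions(start_position, end_position,
                           sequence_length, mutate_n_bases=1):
    """Validate an ISM window and return (start_position, end_position).

    Hypothetical consolidation of the checks in this commit. An
    end_position of None falls back to sequence_length.
    """
    if end_position is None:
        end_position = sequence_length
    if start_position >= end_position:
        raise ValueError(("Starting positions must be less than the ending "
                          "positions. Found a starting position of {0} with "
                          "an ending position of {1}.").format(
                              start_position, end_position))
    if start_position < 0:
        raise ValueError("Negative starting positions are not supported.")
    if end_position < 0:
        raise ValueError("Negative ending positions are not supported.")
    if start_position >= sequence_length:
        raise ValueError(("Starting positions must be less than the sequence "
                          "length. Found a starting position of {0} with a "
                          "sequence length of {1}.").format(
                              start_position, sequence_length))
    if end_position > sequence_length:
        raise ValueError(("Ending positions must be less than or equal to "
                          "the sequence length. Found an ending position of "
                          "{0} with a sequence length of {1}.").format(
                              end_position, sequence_length))
    if (end_position - start_position) < mutate_n_bases:
        raise ValueError(("Fewer bases exist in the substring specified by "
                          "the starting and ending positions than need to be "
                          "mutated. There are only {0} currently, but {1} "
                          "bases must be mutated at a time.").format(
                              end_position - start_position, mutate_n_bases))
    return start_position, end_position


try:
    validate_ism_positions(550, 450, sequence_length=1000)
except ValueError as err:
    print(err)
```

Because the `start_position >= end_position` check runs first, a swapped pair such as `(550, 450)` is reported as an ordering problem before any range check is reached.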