docs/source/overview/cli.md
Lines changed: 13 additions & 5 deletions
@@ -303,13 +303,15 @@ You may find that there are more output files than you expect in `output_dir` at
 - **Warnings:** Selene may detect that the `ref` base(s) in a variant do not match with the bases specified in the reference sequence FASTA at the `(chrom, pos)`. In this case, Selene will use the `ref` base(s) specified in the VCF file in place of those in the reference genome and output predictions accordingly. These predictions will be distinguished by the row label column `ref_match` value `False`. You may review these variants and determine whether you still want to use those predictions/scores. If you find that most of the variants have `ref_match = False`, it may be that you have specified the wrong reference genome version---please check this before proceeding.
 
 ### _In silico_ mutagenesis
-An example configuration for _in silico_ mutagenesis when using a single sequence as input:
+An example configuration for _in silico_ mutagenesis of the whole sequence (i.e. rather than a subsequence), when using a single sequence as input:
 - `save_data`: A list of the data files to output. Must input 1 or more of the following options: `[abs_diffs, diffs, logits, predictions]`. (Note that the raw prediction values will not be outputted by default---you must specify `predictions` in the list if you want them.)
 - `output_path_prefix`: Optional, default is "ism". The path to which the data files are written. We have specified that it should be a filename _prefix_ because we will append additional information depending on what files you would like to output (e.g. `fileprefix_logits.tsv`). If directories in the path do not yet exist, they will automatically be created.
 - `mutate_n_bases`: Optional, default is 1. The number of bases to mutate at any time. Standard _in silico_ mutagenesis only mutates a single base at a time, so we encourage users to start by leaving this value at 1. Double/triple mutations will be more difficult to interpret and are something we may work on in the future.
+- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. Also, the value of `end_position - start_position` should be at least `mutate_n_bases`.
+- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter from `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
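
The constraints on `start_position` and `end_position` listed above can be sketched as a small check. This is a minimal illustration under stated assumptions, not Selene's implementation: the function name `resolve_ism_window` is hypothetical, and the only behavior it encodes is what the parameter descriptions say (nonnegative values, `start_position` less than `end_position`, a window at least `mutate_n_bases` wide, and `end_position` falling back to `sequence_length` when left as `None`).

```python
def resolve_ism_window(sequence_length, mutate_n_bases=1,
                       start_position=0, end_position=None):
    """Return the (start, end) window after applying the documented checks.

    Hypothetical helper for illustration only; not part of Selene's API.
    """
    # Documented default: fall back to the model's sequence length.
    if end_position is None:
        end_position = sequence_length
    if start_position < 0 or end_position < 0:
        raise ValueError("start_position and end_position must be nonnegative")
    if start_position >= end_position:
        raise ValueError("start_position must be less than end_position")
    if end_position - start_position < mutate_n_bases:
        raise ValueError(
            "end_position - start_position must be at least mutate_n_bases")
    return start_position, end_position
```

With the defaults, `resolve_ism_window(1000)` returns `(0, 1000)`, which corresponds to mutating the whole sequence; a window narrower than `mutate_n_bases` raises a `ValueError`.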
 
-An example configuration for _in silico_ mutagenesis when using a FASTA file as input:
+An example configuration for _in silico_ mutagenesis of the center 100 bases of a 1000 base sequence read from a FASTA file input:
 ```YAML
 in_silico_mutagenesis: {
-    input_path: /path/to/sequences1.fa,
+    input_path: /path/to/sequences1.fa,
     save_data: [logits],
     output_dir: /path/to/output/predictions/dir,
     mutate_n_bases: 1,
-    use_sequence_name: True
+    use_sequence_name: True,
+    start_position: 450,
+    end_position: 550
 }
 ```
 
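
The values `start_position: 450` and `end_position: 550` in this example are consistent with a half-open window `[450, 550)` centered in a 1000-base sequence. The sketch below shows one way to derive such a window. `center_window` is a hypothetical helper, and treating `end_position` as exclusive is an assumption inferred from the example rather than something the documentation states.

```python
def center_window(sequence_length, window_length):
    """Return (start, end) of a centered window, with end exclusive.

    Hypothetical helper for illustration only; not part of Selene's API.
    """
    if not 0 < window_length <= sequence_length:
        raise ValueError("window_length must be in (0, sequence_length]")
    # Split the bases outside the window evenly between both sides.
    start = (sequence_length - window_length) // 2
    return start, start + window_length


start_position, end_position = center_window(1000, 100)
print(start_position, end_position)  # 450 550
```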
@@ -338,6 +344,8 @@ in_silico_mutagenesis: {
 - `use_sequence_name`: Optional, default is `True`.
     - If `use_sequence_name`, output files are prefixed by the sequence name/description corresponding to each sequence in the FASTA file. Spaces in the description are replaced with underscores '_'.
     - If not `use_sequence_name`, output files are prefixed with the index `i` corresponding to the `i`th sequence in the FASTA file.
+- `start_position`: Optional, default is 0. The starting position of the subsequence that should be mutated. This value should be nonnegative, and less than `end_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
+- `end_position`: Optional, default is `None`. If left as `None`, Selene will use the `sequence_length` parameter passed to `analyze_sequences`. This is the ending position of the subsequence that should be mutated. This value should be nonnegative, and greater than `start_position`. The value of `end_position - start_position` should be at least `mutate_n_bases`.
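
The `use_sequence_name` behavior described above can be illustrated with a short sketch. The function `output_prefix` is hypothetical and mirrors only what the description says (spaces replaced with underscores, otherwise the sequence index). Selene's actual naming code may differ in details such as how other characters are handled or whether the index is zero-based, and the FASTA descriptions below are made up.

```python
def output_prefix(index, description, use_sequence_name=True):
    """Build an output file prefix for the sequence at position `index`.

    Hypothetical helper for illustration only; not part of Selene's API.
    """
    if use_sequence_name:
        # Spaces in the FASTA description become underscores.
        return description.replace(" ", "_")
    # Otherwise fall back to the sequence's index in the FASTA file.
    return str(index)


records = ["chr1 enhancer A", "chr2 promoter"]  # made-up FASTA descriptions
for i, description in enumerate(records):
    print(output_prefix(i, description), output_prefix(i, description, False))
```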
 
 ## Sampler configurations
 Data sampling is used during model training and evaluation. You must specify the sampler in the configuration YAML file alongside the other operation-specific configurations (i.e. `train_model` or `evaluate_model`).