
Commit 6fdaa86

Minor documentation updates ahead of 0.5.0 release (#171)
* add CLI script back
* fix URL on homepage
* bug in master for the dev version and setup.py mismatch
* update links in overview docs
* add RELEASE_NOTES file
* wording change
* test .rst bullet points change
1 parent dba5855 commit 6fdaa86

8 files changed

Lines changed: 46 additions & 26 deletions


README.md

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,8 @@
44

55
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.
66

7+
Please see our [release notes](./RELEASE_NOTES.md) for the latest updates to Selene.
8+
79
## Installation
810

911
We recommend using Selene with Python 3.6 or above.

RELEASE_NOTES.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
# Release notes
2+
3+
This is a document describing new functionality, bug fixes, breaking changes, etc. associated with Selene version releases from v0.5.0 onwards.
4+
5+
## Version 0.5.0
6+
7+
### New functionality
8+
- `sampler.MultiSampler`: `MultiSampler` accepts any Selene sampler for each of the train, validation, and test partitions where previously `MultiFileSampler` only accepted `FileSampler`s. We will deprecate `MultiFileSampler` in our next major release.
9+
- `DataLoader`: Parallel data loading based on PyTorch's `DataLoader` class, which can be used with Selene's `MultiSampler` and `MultiFileSampler` class. (see: `sampler.SamplerDataLoader`, `sampler.H5DataLoader`)
10+
- To support parallelism via multiprocessing, the sampler that `SamplerDataLoader` used needs to be picklable. To enable this, opening file operations are delayed to when any method that needs the file is called. There is no change to the API and setting `init_unpicklable=True` in `__init__` for `Genome` and all `OnlineSampler` classes will fully reproduce the functionality in `selene_sdk<=0.4.8`.
11+
- `sampler.RandomPositionsSampler`: added support for `center_bin_to_predict` taking in a list/tuple of two integers to specify the region from which to query the targets---that is, `center_bin_to_predict` by default (`center_bin_to_predict=<int>`) queries targets based on the center bin size, but can be specified as start and end integers that are not at the center if desired.
12+
- `EvaluateModel`: accepts a list of metrics (by default computing ROC AUC and average precision) with which to evaluate the test dataset.
13+
14+
### Usage
15+
- **Command-line interface (CLI)**: You can now run the CLI directly with `python -m selene_sdk` (if you have cloned the repository, make sure you have locally installed `selene_sdk` via `python setup.py install`, or `selene_sdk` is in the same directory as your script / added to `PYTHONPATH`). Developers can make a copy of the `selene_sdk/cli.py` script and use it the same way that `selene_cli.py` was used in earlier versions of Selene (`python -u cli.py <config-yml> [--lr]`)
16+
17+
### Bug fixes
18+
- `EvaluateModel`: `use_features_ord` allows you to evaluate a trained model on only a subset of chromatin features (targets) predicted by the model. If you are using a `FileSampler` for your test dataset, you now have the option to pass in a subsetted matrix; however, this matrix must be ordered the same way as `features` (the original targets prediction ordering) and not in the same ordering as `use_features_ord`. However, the final model predictions and targets
19+
(`test_predictions.npz` and `test_targets.npz`) will be outputted according to the `use_features_ord` list and ordering.
20+
- `MatFileSampler`: Previously the `MatFileSampler` reset the pointer to the start of the matrix too early (going back to the first sample before we had finished sampling the whole matrix).
21+
- CLI learning rate: Edge cases (e.g. not specifying the learning rate via CLI or config) previously were not handled correctly and did not throw an informative error.
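
The picklability note under "New functionality" (delaying file opening until a method needs the file) can be illustrated with a small, generic sketch. This is not Selene's implementation: `LazyFileReader`, its methods, and `roundtrip_demo` are hypothetical names introduced here, and the code only shows the general pattern of opening a file on first use and dropping the handle when pickling, so that an object can be handed to worker processes.

```python
import os
import pickle
import tempfile


class LazyFileReader:
    """Hypothetical reader that opens its file only on first use."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # nothing is opened in __init__

    def _ensure_open(self):
        # Open the file the first time a method actually needs it.
        if self._handle is None:
            self._handle = open(self.path, "r")
        return self._handle

    def read_all(self):
        handle = self._ensure_open()
        handle.seek(0)
        return handle.read()

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Open file handles cannot be pickled, so drop the handle;
        # the unpickled copy reopens the file lazily when needed.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


def roundtrip_demo():
    """Pickle a reader that already holds an open handle, then use the copy."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("ACGT")
        path = tmp.name
    try:
        reader = LazyFileReader(path)
        reader.read_all()  # the file is now open in this process
        clone = pickle.loads(pickle.dumps(reader))
        opened_before_use = clone._handle is not None
        text = clone.read_all()  # the clone opens its own handle here
        reader.close()
        clone.close()
        return opened_before_use, text
    finally:
        os.remove(path)
```

Calling `roundtrip_demo()` pickles a reader whose file is already open. The copy starts with no handle and reopens the file on its first read, which is the behaviour a worker process needs when it receives a sampler.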

docs/source/index.rst

Lines changed: 4 additions & 4 deletions
@@ -10,7 +10,7 @@ Welcome! This is the documentation for Selene, a PyTorch-based deep learning lib
1010
The Github repository is located `here <https://github.com/FunctionLab/selene>`_.
1111

1212
The documentation here corresponds to the latest version of Selene (i.e. up-to-date with `master`).
13-
You can view the documentation for Selene `version 0.4.8 here`<http://selene.flatironinstitute.org/0.4.8/>`,
13+
You can view the documentation for Selene `version 0.4.8 here <http://selene.flatironinstitute.org/0.4.8/>`_,
1414
and we will add other older versions of the library docs to the website soon.
1515

1616
.. toctree::
@@ -40,6 +40,6 @@ and we will add other older versions of the library docs to the website soon.
4040
Indices and tables
4141
==================
4242

43-
* :ref:`genindex`
44-
* :ref:`modindex`
45-
* :ref:`search`
43+
- :ref:`genindex`
44+
- :ref:`modindex`
45+
- :ref:`search`

docs/source/overview/cli.rst

Lines changed: 6 additions & 6 deletions
@@ -26,12 +26,12 @@ Selene's CLI accepts configuration files in the `YAML <https://docs.ansible.com/
2626
We recommend you start off by using one of the `example configuration files <https://github.com/FunctionLab/selene/tree/master/config_examples>`_ provided in the repository as a template for your own configuration file:
2727

2828

29-
* `Training configuration <https://github.com/FunctionLab/selene/blob/master/config_examples/train.yml>`_
30-
* `Evaluate with test BED file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_bed.yml>`_
31-
* `Evaluate with test matrix file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_mat.yml>`_
32-
* `Get predictions from trained model <https://github.com/FunctionLab/selene/blob/master/config_examples/get_predictions.yml>`_
33-
* `\ *In silico* mutagenesis <https://github.com/FunctionLab/selene/blob/master/config_examples/in_silico_mutagenesis.yml>`_
34-
* `Variant effect prediction <https://github.com/FunctionLab/selene/blob/master/config_examples/variant_effect_prediction.yml>`_
29+
- `Training configuration <https://github.com/FunctionLab/selene/blob/master/config_examples/train.yml>`_
30+
- `Evaluate with test BED file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_bed.yml>`_
31+
- `Evaluate with test matrix file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_mat.yml>`_
32+
- `Get predictions from trained model <https://github.com/FunctionLab/selene/blob/master/config_examples/get_predictions.yml>`_
33+
- `\ *In silico* mutagenesis <https://github.com/FunctionLab/selene/blob/master/config_examples/in_silico_mutagenesis.yml>`_
34+
- `Variant effect prediction <https://github.com/FunctionLab/selene/blob/master/config_examples/variant_effect_prediction.yml>`_
3535

3636
There are also various configuration files associated with the Jupyter notebook `tutorials <https://github.com/FunctionLab/selene/tree/master/tutorials>`_ and `manuscript <https://github.com/FunctionLab/selene/tree/master/manuscript>`_ case studies that you may use as a starting point.
3737

docs/source/overview/overview.rst

Lines changed: 11 additions & 11 deletions
@@ -12,21 +12,21 @@ Sampling
1212

1313
We start with the modules for sampling data because both training and evaluting a model in Selene will require a user to specify the kind of sampler they want to use.
1414

15-
*sequences* submodule (\ `API <http://selene.flatironinstitute.org/sequences.html>`_\ )
16-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
15+
*sequences* submodule (\ `API <../sequences.html>`_\ )
16+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1717

1818
The *sequences* submodule defines the ``Sequence`` type, and includes implementations for several sub-classes.
1919
These sub-classes--\ ``Genome`` and ``Proteome``\ --represent different kinds of biological sequences (e.g. DNA, RNA, amino acid sequences), and implement the ``Sequence`` interface’s methods for reading the reference sequence from files (e.g. FASTA), querying subsequences of the reference sequence, and subsequently converting those queried subsequences into a numeric representation.
2020
Further, each sequence class specifies its own alphabet (e.g., nucleotides, amino acids) to represent query results as strings.
2121

22-
*targets* submodule (\ `API <http://selene.flatironinstitute.org/targets.html>`_\ )
23-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22+
*targets* submodule (\ `API <../targets.html>`_\ )
23+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2424

2525
The *targets* submodule defines the ``Target`` class, which specifies the interface for classes to retrieve labels or “targets” for a given query sequence.
2626
At present, we supply a single implementation of this interface: ``GenomicFeatures``.
2727
This class takes a tabix-indexed file of intervals for each label we want our model to predict, and uses this file to identify the labels for a given sequence drawn from the reference.
2828

29-
*samplers* submodule (\ `API <http://selene.flatironinstitute.org/samplers.html>`_\ )
29+
*samplers* submodule (\ `API <../samplers.html>`_\ )
3030
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3131

3232
The *samplers* submodule provides methods and classes for randomly sampling and partitioning datasets for training and evaluation.
@@ -36,7 +36,7 @@ Further, a file of names must be provided for the features to be predicted.
3636
We provide several implementations adhering to the ``Sampler`` interface: the ``RandomPositionsSampler``\ , ``IntervalsSampler``\ , and ``MultiFileSampler``.
3737

3838
``MultiFileSampler`` draws samples from structured data files for each partition.
39-
There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <http://selene.flatironinstitute.org/samplers.file_samplers.html>`_\ ).
39+
There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <../samplers.file_samplers.html>`_\ ).
4040
It is worth noting that the .bed file used by ``BedFileSampler`` includes the coordinates of each sequence, and the indices corresponding to each feature for which said sequence is a positive example.
4141
We hope that users will request or contribute classes for other file samplers in the future.
4242
``MultiFileSampler`` does not support saving the sampled data to a file, so calling the ``save_dataset_to_file`` method from this class will have no effect.
@@ -47,7 +47,7 @@ These samplers automatically partition said data according to user-specified par
4747
Since ``OnlineSampler``\ ’s samples are randomly generated, we allow the user to save the sampled data to file.
4848
This file can be subsequently loaded with the ``BedFileSampler``. They rely on classes from the *sequences* and *targets* submodules for retrieving each sequence and its targets in the proper matrix format.
4949

50-
Training a model (\ `API <http://selene.flatironinstitute.org/selene.html#trainmodel>`_\ )
50+
Training a model (\ `API <../selene.html#trainmodel>`_\ )
5151
------------------------------------------------------------------------------------------
5252

5353
The ``TrainModel`` class may be used for training and testing of sequence-based models, and provides the core functionality of the CLI’s train command.
@@ -58,14 +58,14 @@ The model’s loss, area under the receiver operating characteristic curve (AUC)
5858
The frequency of logging is provided by the user.
5959
At the end of evaluation, ``TrainModel`` logs the performance metrics for each feature predicted, and produces plots of the precision recall and receiver operating characteristic curves.
6060

61-
Evaluating a model (\ `API <http://selene.flatironinstitute.org/selene.html#evaluatemodel>`_\ )
61+
Evaluating a model (\ `API <../selene.html#evaluatemodel>`_\ )
6262
-----------------------------------------------------------------------------------------------
6363

6464
The ``EvaluateModel`` class is used to test the performance of a trained model.
6565
``EvaluateModel`` uses an instance of ``Sampler`` class or subclass to draw samples from a test set.
6666
After using the provided model to predict labels for said data, ``EvaluateModel`` logs the performance measures (as described in "Training a model") and generates figures and a performance breakdown by feature.
6767

68-
Using a model to make predictions (\ `API <http://selene.flatironinstitute.org/predict.html>`_\ )
68+
Using a model to make predictions (\ `API <../predict.html>`_\ )
6969
-------------------------------------------------------------------------------------------------
7070

7171
Selene’s ``predict`` submodule includes a number of methods and classes for making predictions with sequence-based models.
@@ -74,14 +74,14 @@ It leverages a user-specified trained model to make predictions for sequences se
7474
In each case, the user can specify what ``AnalyzeSequences`` should save: raw predictions, difference scores, absolute difference scores, and/or logit scores.
7575
Note that the aforementioned “scores” can only be computed for *in silico* mutagenesis and variant effect prediction.
7676

77-
Visualizing model predictions (\ `API <http://selene.flatironinstitute.org/interpret.html>`_\ )
77+
Visualizing model predictions (\ `API <../interpret.html>`_\ )
7878
-----------------------------------------------------------------------------------------------
7979

8080
The ``interpret`` submodule of ``selene_sdk`` provides methods for visualizing a sequence-based model’s predictions made with ``AnalyzeSequences``.
8181
For example, ``interpret`` includes methods for processing variant effect predictions made with ``AnalyzeSequences`` and subsequently visualizing them with a heatmap or sequence logo.
8282
The functionality included in the ``interpret`` submodule is not heavily incorporated into the CLI, but is instead intended for incorporation into user code.
8383

84-
The utilities submodule (\ `API <http://selene.flatironinstitute.org/utils.html>`_\ )
84+
The utilities submodule (\ `API <../utils.html>`_\ )
8585
-------------------------------------------------------------------------------------
8686

8787
Unlike the aforementioned submodules designed around individual concepts, the ``utils`` submodule is a catch-all submodule intended to prevent cluttering of the ``selene_sdk`` top-level namespace.

docs/source/tutorials/analyzing_mutations_with_trained_models.nblink

Lines changed: 0 additions & 3 deletions
This file was deleted.

selene_sdk/samplers/dataloader.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
"""
2-
This module provides the `SamplerDataLoader` and `SamplerDataSet` classes,
2+
This module provides the `SamplerDataLoader` and `SamplerDataset` classes,
33
which allow parallel sampling for any Sampler using
44
torch DataLoader mechanism.
55
"""

setup.py

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
2525
cmdclass = {'build_ext': build_ext}
2626

2727
setup(name="selene-sdk",
28-
version="0.4.8",
28+
version="0.5.dev0",
2929
long_description=long_description,
3030
long_description_content_type='text/markdown',
3131
description=("framework for developing sequence-level "
