Skip to content

Commit 927cbe3

Browse files
committed
Added a tutorial for regression with Selene
1 parent d4b598e commit 927cbe3

5 files changed

Lines changed: 282 additions & 1 deletion

File tree

tutorials/README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,8 @@ To get started on training a model very quickly, please see [`quickstart_trainin
88
Additionally, we have two tutorials that show how to apply trained models. Selene provides methods to run variant effect prediction and _in silico_ mutagenesis, along with some visualization methods that we recommend running based on our Jupyter notebook tutorials.
99

1010
- Comprehensive _in silico_ mutagenesis tutorial: [`analyzing_mutations_with_trained_models`](https://github.com/FunctionLab/selene/tree/master/tutorials/analyzing_mutations_with_trained_models)
11-
- Tutorial with both the config file method and the non-config file method of running Selene. Also shows how to run variant effect prediction and visualize the difference scores. Contains an _in silico_ mutagenesis example with known regulatory mutations: [`variants_and_visualizations`](https://github.com/FunctionLab/selene/tree/master/tutorials/variants_and_visualizations)
11+
- Tutorial with both the config file method and the non-config file method of running Selene. Also shows how to run variant effect prediction and visualize the difference scores. Contains an _in silico_ mutagenesis example with known regulatory mutations: [`variants_and_visualizations`](https://github.com/FunctionLab/selene/tree/master/tutorials/variants_and_visualizations)
12+
- Tutorial demonstrating Selene's use to predict mean ribosomal load based on 5' UTR sequences: [`regression_mpra_example`](https://github.com/FunctionLab/selene/tree/master/tutorials/regression_mpra_example)
1213

1314
## Contributing tutorials
1415

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
import io
2+
import gzip
3+
import os
4+
import urllib
5+
import tarfile
6+
7+
import pandas
8+
import scipy.io
9+
import selene_sdk.sequences
10+
11+
12+
def run():
13+
target_column = "rl"
14+
local_file = "sample_et_al.tar"
15+
16+
# Download the data.
17+
urllib.retrieve("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE114002&format=file", local_file)
18+
with tarfile.open(local_file, "r") as archive:
19+
contents = archive.extractfile("GSM3130435_egfp_unmod_1.csv.gz").read()
20+
contents = gzip.decompress(contents).decode("utf-8")
21+
os.remove(local_file)
22+
23+
# Format data.
24+
df = pandas.read_csv(io.StringIO(contents), sep=",", index_col=0)
25+
df = df[["utr", "total_reads", target_column]]
26+
df.sort_values("total_reads", inplace=True, ascending=False)
27+
df.reset_index(inplace=True, drop=True)
28+
29+
# Split into train/validation/test.
30+
df = df.iloc[:280000]
31+
datasets = dict(test=df.iloc[:20000])
32+
df = df.iloc[20000:]
33+
df = df.sample(frac=1.)
34+
datasets["validate"] = df.iloc[:20000]
35+
datasets["train"] = df.iloc[20000:]
36+
x = dict.fromkeys(datasets.keys())
37+
y = dict.fromkeys(datasets.keys())
38+
39+
# Construct features.
40+
for k in datasets.keys():
41+
x[k] = list()
42+
y[k] = list()
43+
for i in range(datasets[k].shape[0]):
44+
x[k].append(selene_sdk.sequences.Genome.sequence_to_encoding(datasets[k]["utr"].iloc[i]).T)
45+
y[k].append(datasets[k][target_column].iloc[i])
46+
x[k] = numpy.stack(x[k])
47+
y[k] = numpy.asarray(y[k]).reshape(-1, 1)
48+
49+
# Scale w/ parameters from training data to prevent leakage.
50+
sdev = numpy.std(y["train"])
51+
mean = numpy.mean(y["train"])
52+
for k in datasets.keys():
53+
y[k] = (y[k] - mean) / sdev
54+
55+
# Write data to file.
56+
for k in datasets.keys():
57+
scipy.io.savemat("{0}.mat".format(k), dict(x=x[k], y=y[k]))
58+
59+
60+
if __name__ == "__main__":
61+
run()
62+
Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,103 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Regression Models in Selene\n",
8+
"\n",
9+
"Selene is a flexible framework, and can be used for tasks beyond simple classification.\n",
10+
"This tutorial serves as an introduction to training regression models with Selene.\n",
11+
"For this tutorial, we will predict mean ribosomal load (MRL) from 50 base pair 5' UTR sequences using models and data from [*Human 5′ UTR design and variant effect prediction from a massively parallel translation assay*](https://doi.org/10.1101/310375) by Sample et al.\n",
12+
"This data was generated from a massively parallel reporter assay (MPRA), which you can read more about in the preprint [on *bioRxiv*](https://doi.org/10.1101/310375).\n",
13+
"\n",
14+
"## Setup\n",
15+
"\n",
16+
"**Architecture:** The model is defined in [utr_model.py](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/utr_model.py), and only superficially differs from the model in [the paper](https://doi.org/10.1101/310375).\n",
17+
"Since this is a real-valued regression problem, it is appropriate that the `criterion` method in `utr_model.py` uses the mean squared error.\n",
18+
"\n",
19+
"\n",
20+
"**Data:** The data from Sample et al is available [on GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114002).\n",
21+
"However, we have included [the `download_data.py` script](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py), to download the data and preprocess it.\n",
22+
"It should produce three files, `train.mat`, `validate.mat`, and `test.mat`.\n",
23+
"They include the data for training, validation, and testing.\n",
24+
"\n",
25+
"**Configuration file:** The configuration file [`regression_train.yml`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_train.yml) is slightly different than the configuration files in the classification tutorials.\n",
26+
"Specifically, `metrics` in `train_model` includes the coefficient of determination (`r2`), since the default metrics (`roc_auc` and `average_precision`) are not appropriate for regression.\n",
27+
"Further, `report_gt_feature_n_positives` in `train_model` has been set to zero to prevent spurious filtering. \n",
28+
"\n",
29+
"## Download the data\n",
30+
"\n",
31+
"To download the data, just run the [`download_data.py`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py) script from the command line:\n",
32+
"```sh\n",
33+
"python download_data.py\n",
34+
"```\n",
35+
"\n",
36+
"## Train and evaluate the data\n",
37+
"\n",
38+
"\n",
39+
"\n"
40+
]
41+
},
42+
{
43+
"cell_type": "code",
44+
"execution_count": null,
45+
"metadata": {},
46+
"outputs": [],
47+
"source": [
48+
"from selene_sdk.utils import load_path\n",
49+
"from selene_sdk.utils import parse_configs_and_run"
50+
]
51+
},
52+
{
53+
"cell_type": "markdown",
54+
"metadata": {},
55+
"source": [
56+
"Before running `load_path` on `regression_train.yml`, please edit the YAML file to include the absolute path of the model file.\n",
57+
"\n",
58+
"Currently, the model is set to train on GPU.\n",
59+
"If you do not have CUDA on your machine, please set `use_cuda` to `False` in the configuration file. \n",
60+
"(This will slow down the process considerably.)"
61+
]
62+
},
63+
{
64+
"cell_type": "code",
65+
"execution_count": null,
66+
"metadata": {},
67+
"outputs": [],
68+
"source": [
69+
"configs = load_path(\"./regression_train.yml\", instantiate=False)"
70+
]
71+
},
72+
{
73+
"cell_type": "code",
74+
"execution_count": null,
75+
"metadata": {},
76+
"outputs": [],
77+
"source": [
78+
"parse_configs_and_run(configs, lr=0.001)"
79+
]
80+
}
81+
],
82+
"metadata": {
83+
"kernelspec": {
84+
"display_name": "Python 3",
85+
"language": "python",
86+
"name": "python3"
87+
},
88+
"language_info": {
89+
"codemirror_mode": {
90+
"name": "ipython",
91+
"version": 3
92+
},
93+
"file_extension": ".py",
94+
"mimetype": "text/x-python",
95+
"name": "python",
96+
"nbconvert_exporter": "python",
97+
"pygments_lexer": "ipython3",
98+
"version": "3.6.3"
99+
}
100+
},
101+
"nbformat": 4,
102+
"nbformat_minor": 2
103+
}
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
---
2+
ops: [train, evaluate]
3+
model: {
4+
file: /absolute/path/to/selene_sdk/tutorials/regression_mpra_example/utr_model.py,
5+
class: UTRModel,
6+
sequence_length: 50,
7+
n_classes_to_predict: 1,
8+
non_strand_specific: {
9+
use_module: False
10+
}
11+
}
12+
sampler: !obj:selene_sdk.samplers.MultiFileSampler {
13+
features: ["MRL"],
14+
train_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
15+
filepath: ./train.mat,
16+
sequence_key: x,
17+
targets_key: y,
18+
shuffle: True
19+
},
20+
validate_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
21+
filepath: ./validate.mat,
22+
sequence_key: x,
23+
targets_key: y,
24+
shuffle: False
25+
},
26+
test_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
27+
filepath: ./test.mat,
28+
sequence_key: x,
29+
targets_key: y,
30+
shuffle: False
31+
}
32+
}
33+
train_model: !obj:selene_sdk.TrainModel {
34+
batch_size: 128,
35+
max_steps: 8124,
36+
report_gt_feature_n_positives: 0,
37+
report_stats_every_n_steps: 2031,
38+
n_validation_samples: 20000,
39+
save_checkpoint_every_n_steps: 2031,
40+
n_test_samples: 20000,
41+
use_cuda: True,
42+
data_parallel: False,
43+
logging_verbosity: 2,
44+
metrics: {
45+
r2: !import:sklearn.metrics.r2_score
46+
}
47+
}
48+
output_dir: ./training_outputs
49+
random_seed: 1337
50+
create_subdirectory: True
51+
...
Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
"""
2+
Model derived from "Human 5′ UTR design and variant effect prediction from a massively parallel translation assay", https://doi.org/10.1101/310375
3+
"""
4+
5+
import torch
6+
import numpy
7+
import torch.nn as nn
8+
9+
10+
class UTRModel(nn.Module):
11+
def __init__(self, sequence_length=50, n_targets=1):
12+
super(UTRModel, self).__init__()
13+
self.sequence_length = sequence_length
14+
kernel_size = 9 # Slight modification from model used in manuscript.
15+
n_filters = 120
16+
nodes = 40
17+
padding = 4 # Note that this will be slightly different from original model's "same" padding in Keras.
18+
self.cnn = nn.Sequential(
19+
nn.Conv1d(4, n_filters, kernel_size=kernel_size, padding=padding),
20+
nn.ReLU(inplace=True),
21+
nn.Conv1d(n_filters, n_filters, kernel_size=kernel_size, padding=padding),
22+
nn.ReLU(inplace=True),
23+
nn.Conv1d(n_filters, n_filters, kernel_size=kernel_size, padding=padding),
24+
nn.ReLU(inplace=True))
25+
with torch.no_grad():
26+
tmp = torch.zeros(1, 4, self.sequence_length)
27+
dnn_input_size = self.cnn.forward(tmp).view(1, -1).shape[1]
28+
del tmp
29+
self.dnn = nn.Sequential(nn.Linear(dnn_input_size, nodes),
30+
nn.ReLU(inplace=True),
31+
nn.Dropout(0.20),
32+
nn.Linear(nodes, n_targets))
33+
# Copy weight initialization from Keras.
34+
def init_weight(x):
35+
if isinstance(x, nn.Linear) or isinstance(x, nn.Conv1d):
36+
if isinstance(x, nn.Linear):
37+
fan_avg = (x.in_features + x.out_features) * 0.5
38+
else:
39+
fan_avg = (x.weight.shape[0] + x.weight.shape[1]) * x.weight.shape[2] * 0.5
40+
limit = numpy.sqrt(3 / fan_avg)
41+
nn.init.uniform_(x.weight, -1 * limit, limit)
42+
x.bias.data.fill_(0)
43+
self.cnn.apply(init_weight)
44+
self.dnn.apply(init_weight)
45+
46+
def forward(self, input):
47+
batch_size = input.shape[0]
48+
ret = self.dnn.forward(self.cnn.forward(input).view(batch_size, -1))
49+
return ret
50+
51+
52+
def criterion():
53+
"""
54+
The loss function to be optimized.
55+
"""
56+
return nn.MSELoss(size_average=True, reduce=True)
57+
58+
59+
def get_optimizer(lr):
60+
"""
61+
The optimizer and parameters.
62+
"""
63+
return (torch.optim.Adam, {"lr": lr, "betas": (0.9, 0.999), "eps": 1e-08})
64+

0 commit comments

Comments
 (0)