Added a tutorial for regression with Selene

evancofer · evancofer · commit 927cbe38ef18 · 2018-12-08T21:52:53.000-05:00
diff --git a/tutorials/README.md b/tutorials/README.md
@@ -8,7 +8,8 @@ To get started on training a model very quickly, please see [`quickstart_trainin
 Additionally, we have two tutorials that show how to apply trained models. Selene provides methods to run variant effect prediction and _in silico_ mutagenesis, along with some visualization methods that we recommend running based on our Jupyter notebook tutorials.
 
 - Comprehensive _in silico_ mutagenesis tutorial: [`analyzing_mutations_with_trained_models`](https://github.com/FunctionLab/selene/tree/master/tutorials/analyzing_mutations_with_trained_models)
-- Tutorial with both the config file method and the non-config file method of running Selene. Also shows how to run variant effect prediction and visualize the difference scores. Contains an _in silico_ mutagenesis example with known regulatory mutations: [`variants_and_visualizations`](https://github.com/FunctionLab/selene/tree/master/tutorials/variants_and_visualizations) 
+- Tutorial with both the config file method and the non-config file method of running Selene. Also shows how to run variant effect prediction and visualize the difference scores. Contains an _in silico_ mutagenesis example with known regulatory mutations: [`variants_and_visualizations`](https://github.com/FunctionLab/selene/tree/master/tutorials/variants_and_visualizations)
+- Tutorial demonstrating how to use Selene for regression by predicting mean ribosomal load from 5' UTR sequences: [`regression_mpra_example`](https://github.com/FunctionLab/selene/tree/master/tutorials/regression_mpra_example)
 
 ## Contributing tutorials
 
diff --git a/tutorials/regression_mpra_example/download_data.py b/tutorials/regression_mpra_example/download_data.py
@@ -0,0 +1,63 @@
+import io
+import gzip
+import os
+import urllib.request
+import tarfile
+
+import numpy
+import pandas
+import scipy.io
+import selene_sdk.sequences
+
+
+def run():
+    target_column = "rl"
+    local_file = "sample_et_al.tar"
+
+    # Download the data.
+    urllib.request.urlretrieve("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE114002&format=file", local_file)
+    with tarfile.open(local_file, "r") as archive:
+        contents = archive.extractfile("GSM3130435_egfp_unmod_1.csv.gz").read()
+        contents = gzip.decompress(contents).decode("utf-8")
+    os.remove(local_file)
+
+    # Format data.
+    df = pandas.read_csv(io.StringIO(contents), sep=",", index_col=0)
+    df = df[["utr", "total_reads", target_column]]
+    df.sort_values("total_reads", inplace=True, ascending=False)
+    df.reset_index(inplace=True, drop=True)
+
+    # Keep the 280,000 most-read UTRs: the top 20,000 form the test set; the rest are shuffled into validation (20,000) and training.
+    df = df.iloc[:280000]
+    datasets = dict(test=df.iloc[:20000])
+    df = df.iloc[20000:]
+    df = df.sample(frac=1.)
+    datasets["validate"] = df.iloc[:20000]
+    datasets["train"] = df.iloc[20000:]
+    x = dict.fromkeys(datasets.keys())
+    y = dict.fromkeys(datasets.keys())
+
+    # Construct features.
+    for k in datasets.keys():
+        x[k] = list()
+        y[k] = list()
+        for i in range(datasets[k].shape[0]):
+            x[k].append(selene_sdk.sequences.Genome.sequence_to_encoding(datasets[k]["utr"].iloc[i]).T)
+            y[k].append(datasets[k][target_column].iloc[i])
+        x[k] = numpy.stack(x[k])
+        y[k] = numpy.asarray(y[k]).reshape(-1, 1)
+
+    # Scale w/ parameters from training data to prevent leakage.
+    sdev = numpy.std(y["train"])
+    mean = numpy.mean(y["train"])
+    for k in datasets.keys():
+        y[k] = (y[k] - mean) / sdev
+
+    # Write data to file.
+    for k in datasets.keys():
+        scipy.io.savemat("{0}.mat".format(k), dict(x=x[k], y=y[k]))
+
+
+if __name__ == "__main__":
+    run()
+
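Note on the scaling step in `download_data.py`: the script standardizes the targets with the mean and standard deviation of the training split only, then applies those same parameters to the validation and test splits. A minimal standard-library sketch of that pattern follows. The helper names (`fit_scaler`, `standardize`, `unstandardize`) and the toy values are illustrative assumptions and are not part of the commit. `statistics.pstdev` is used because `numpy.std` defaults to the population standard deviation.

```python
import statistics


def fit_scaler(train_targets):
    """Return (mean, sdev) computed from the training targets only."""
    mean = statistics.fmean(train_targets)
    # Population standard deviation, matching numpy.std's default (ddof=0).
    sdev = statistics.pstdev(train_targets)
    return mean, sdev


def standardize(targets, mean, sdev):
    """Apply previously fitted parameters to any split."""
    return [(t - mean) / sdev for t in targets]


def unstandardize(scaled, mean, sdev):
    """Map standardized values back to the original units."""
    return [s * sdev + mean for s in scaled]


# Toy targets standing in for the "rl" column of each split.
splits = {
    "train": [2.0, 4.0, 6.0, 8.0],
    "validate": [5.0, 7.0],
    "test": [1.0, 9.0],
}

# Fit on the training split, then reuse the parameters everywhere.
mean, sdev = fit_scaler(splits["train"])
scaled = {name: standardize(values, mean, sdev) for name, values in splits.items()}
```

A model trained on these targets predicts in standardized units, so something like `unstandardize` would be needed to report predictions as MRL values.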
diff --git a/tutorials/regression_mpra_example/regression_mpra_example.ipynb b/tutorials/regression_mpra_example/regression_mpra_example.ipynb
@@ -0,0 +1,103 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Regression Models in Selene\n",
+    "\n",
+    "Selene is a flexible framework that can be used for tasks beyond simple classification.\n",
+    "This tutorial serves as an introduction to training regression models with Selene.\n",
+    "For this tutorial, we will predict mean ribosomal load (MRL) from 50-base-pair 5' UTR sequences using models and data from [*Human 5′ UTR design and variant effect prediction from a massively parallel translation assay*](https://doi.org/10.1101/310375) by Sample et al.\n",
+    "This data was generated from a massively parallel reporter assay (MPRA), which you can read more about in the preprint [on *bioRxiv*](https://doi.org/10.1101/310375).\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "**Architecture:** The model is defined in [utr_model.py](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/utr_model.py), and only superficially differs from the model in [the paper](https://doi.org/10.1101/310375).\n",
+    "Since this is a real-valued regression problem, the `criterion` function in `utr_model.py` returns a mean squared error loss.\n",
+    "\n",
+    "\n",
+    "**Data:** The data from Sample et al. is available [on GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114002).\n",
+    "For convenience, we have included [the `download_data.py` script](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py), which downloads and preprocesses the data.\n",
+    "It should produce three files: `train.mat`, `validate.mat`, and `test.mat`.\n",
+    "These contain the data for training, validation, and testing, respectively.\n",
+    "\n",
+    "**Configuration file:** The configuration file [`regression_train.yml`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_train.yml) differs slightly from the configuration files in the classification tutorials.\n",
+    "Specifically, `metrics` in `train_model` is set to the coefficient of determination (`r2`), since the default metrics (`roc_auc` and `average_precision`) are not appropriate for regression.\n",
+    "Further, `report_gt_feature_n_positives` in `train_model` has been set to zero to prevent spurious filtering.\n",
+    "\n",
+    "## Download the data\n",
+    "\n",
+    "To download the data, just run the [`download_data.py`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py) script from the command line:\n",
+    "```sh\n",
+    "python download_data.py\n",
+    "```\n",
+    "\n",
+    "## Train and evaluate the model\n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from selene_sdk.utils import load_path\n",
+    "from selene_sdk.utils import parse_configs_and_run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before running `load_path` on `regression_train.yml`, please edit the YAML file to include the absolute path of the model file.\n",
+    "\n",
+    "Currently, the model is set to train on GPU.\n",
+    "If you do not have CUDA on your machine, please set `use_cuda` to `False` in the configuration file. \n",
+    "(This will slow down the process considerably.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "configs = load_path(\"./regression_train.yml\", instantiate=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "parse_configs_and_run(configs, lr=0.001)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/tutorials/regression_mpra_example/regression_train.yml b/tutorials/regression_mpra_example/regression_train.yml
@@ -0,0 +1,51 @@
+---
+ops: [train, evaluate]
+model: {
+    file: /absolute/path/to/selene_sdk/tutorials/regression_mpra_example/utr_model.py,
+    class: UTRModel,
+    sequence_length: 50,
+    n_classes_to_predict: 1,
+    non_strand_specific: {
+        use_module: False
+    }
+}
+sampler: !obj:selene_sdk.samplers.MultiFileSampler {
+    features: ["MRL"],
+    train_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
+        filepath: ./train.mat,
+        sequence_key: x,
+        targets_key: y,
+        shuffle: True
+    },
+    validate_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
+        filepath: ./validate.mat,
+        sequence_key: x,
+        targets_key: y,
+        shuffle: False
+    },
+    test_sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
+        filepath: ./test.mat,
+        sequence_key: x,
+        targets_key: y,
+        shuffle: False
+    }
+}
+train_model: !obj:selene_sdk.TrainModel {
+    batch_size: 128,
+    max_steps: 8124,
+    report_gt_feature_n_positives: 0,
+    report_stats_every_n_steps: 2031,
+    n_validation_samples: 20000,
+    save_checkpoint_every_n_steps: 2031,
+    n_test_samples: 20000,
+    use_cuda: True,
+    data_parallel: False,
+    logging_verbosity: 2,
+    metrics: {
+        r2: !import:sklearn.metrics.r2_score
+    }
+}
+output_dir: ./training_outputs
+random_seed: 1337
+create_subdirectory: True
+...
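Note on the `r2` metric configured above: `sklearn.metrics.r2_score` computes the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The plain-Python sketch below shows that formula for a single target. The function name `r2` and the toy values are illustrative assumptions, and the sketch ignores the multi-output and sample-weight options that scikit-learn supports.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for one real-valued target."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between observations and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the observations.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # If every true value is identical, ss_tot is zero and this sketch
    # raises ZeroDivisionError rather than handling the case specially.
    return 1.0 - ss_res / ss_tot


observed = [3.0, -0.5, 2.0, 7.0]

# Predicting every value exactly.
perfect = r2(observed, observed)
# Always predicting the mean of the observations.
baseline = r2(observed, [sum(observed) / len(observed)] * len(observed))
# A reasonable but imperfect set of predictions.
close = r2(observed, [2.5, 0.0, 2.0, 8.0])
```

A score of 1 means perfect predictions, 0 matches a model that always predicts the mean, and negative values are worse than that baseline.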
diff --git a/tutorials/regression_mpra_example/utr_model.py b/tutorials/regression_mpra_example/utr_model.py
@@ -0,0 +1,64 @@
+"""
+Model derived from "Human 5′ UTR design and variant effect prediction from a massively parallel translation assay", https://doi.org/10.1101/310375
+"""
+
+import torch
+import numpy
+import torch.nn as nn
+
+
+class UTRModel(nn.Module):
+    def __init__(self, sequence_length=50, n_targets=1):
+        super(UTRModel, self).__init__()
+        self.sequence_length = sequence_length
+        kernel_size = 9 # Slight modification from model used in manuscript.
+        n_filters = 120
+        nodes = 40
+        padding = 4 # Note that this will be slightly different from original model's "same" padding in Keras.
+        self.cnn = nn.Sequential(
+                nn.Conv1d(4, n_filters, kernel_size=kernel_size, padding=padding),
+                nn.ReLU(inplace=True),
+                nn.Conv1d(n_filters, n_filters, kernel_size=kernel_size, padding=padding),
+                nn.ReLU(inplace=True),
+                nn.Conv1d(n_filters, n_filters, kernel_size=kernel_size, padding=padding),
+                nn.ReLU(inplace=True))
+        with torch.no_grad():
+            tmp = torch.zeros(1, 4, self.sequence_length)
+            dnn_input_size = self.cnn.forward(tmp).view(1, -1).shape[1]
+            del tmp
+        self.dnn = nn.Sequential(nn.Linear(dnn_input_size, nodes),
+                                 nn.ReLU(inplace=True),
+                                 nn.Dropout(0.20),
+                                 nn.Linear(nodes, n_targets))
+        # Copy weight initialization from Keras.
+        def init_weight(x):
+            if isinstance(x, nn.Linear) or isinstance(x, nn.Conv1d):
+                if isinstance(x, nn.Linear):
+                    fan_avg = (x.in_features + x.out_features) * 0.5
+                else:
+                    fan_avg = (x.weight.shape[0] + x.weight.shape[1]) * x.weight.shape[2] * 0.5
+                limit = numpy.sqrt(3 / fan_avg)
+                nn.init.uniform_(x.weight, -1 * limit, limit)
+                x.bias.data.fill_(0)
+        self.cnn.apply(init_weight)
+        self.dnn.apply(init_weight)
+
+    def forward(self, input):
+        batch_size = input.shape[0]
+        ret = self.dnn.forward(self.cnn.forward(input).view(batch_size, -1))
+        return ret
+
+
+def criterion():
+    """
+    The loss function to be optimized.
+    """
+    return nn.MSELoss(reduction="mean")
+
+
+def get_optimizer(lr):
+    """
+    The optimizer and parameters.
+    """
+    return (torch.optim.Adam, {"lr": lr, "betas": (0.9, 0.999), "eps": 1e-08})
+
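Note on `utr_model.py`: two quantities in the model can be checked by hand. With `kernel_size = 9`, `padding = 4`, and the default stride of 1, each `Conv1d` layer preserves the sequence length, so the flattened input to the first `Linear` layer has 120 × 50 = 6000 values. The `init_weight` helper draws weights uniformly from [-limit, limit] with limit = sqrt(3 / fan_avg), which is the same bound as Glorot (Xavier) uniform initialization, sqrt(6 / (fan_in + fan_out)). The sketch below reproduces both calculations in plain Python; the function names are illustrative and not part of the commit.

```python
import math


def conv1d_output_length(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1D convolution, per the formula in the torch.nn.Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform distribution, written as in init_weight."""
    fan_avg = (fan_in + fan_out) * 0.5
    return math.sqrt(3 / fan_avg)


# Three stacked convolutions with kernel size 9 and padding 4.
length = 50
for _ in range(3):
    length = conv1d_output_length(length, kernel_size=9, padding=4)

# 120 filters, each producing `length` positions, flattened for the dense layers.
flattened_size = 120 * length

# First convolution: 4 input channels, 120 filters, kernel size 9.
conv_limit = glorot_uniform_limit(fan_in=4 * 9, fan_out=120 * 9)

# First fully connected layer: flattened_size inputs, 40 hidden nodes.
linear_limit = glorot_uniform_limit(fan_in=flattened_size, fan_out=40)
```

Computing the flattened size by passing a zero tensor through `self.cnn`, as the model does, avoids hard-coding this arithmetic and stays correct if `sequence_length` or the convolution settings change.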