Commit 4b97454

Merge pull request #34 from SteveBronder/gsoc-2020
Google Summer of Code 2020
2 parents 1902692 + 59bab28

5 files changed: +238 −0 lines changed
.gitignore

Lines changed: 2 additions & 0 deletions

```
# For the Mac
*.DS_Store
```
gsoc_proposals/2021/bayesian_benchmarking.md

Lines changed: 46 additions & 0 deletions

## Bayesian Benchmarking

### Abstract

The project aims to develop a suite of models for benchmarking Bayesian computation. In concert with the mentors, the student will gather data and models with a variety of inferential structures for coding and optimization in Stan. The resulting canonical models and posteriors will be submitted for inclusion in the posteriorDB database to serve as reference points against which new approaches to Bayesian computation can be compared.

| **Intensity** | **Priority** | **Involves** | **Mentors** |
| ------------- | ------------ | ------------- | ----------- |
| Easy/Moderate | Low | Stan; Bayesian modeling | Andrew Gelman, [Måns Magnusson](https://github.com/MansMeg), [Mike Lawrence](https://github.com/mike-lawrence) |

### Project Details

The goal of this project is to assemble a set of posteriors that will be useful as benchmarks. We are especially interested in more complex posteriors that will push Bayesian computation further.
#### Use Cases of the Benchmark Set of Models

The purposes of such a set of benchmark models are:

- Testing implementations of inference algorithms with asymptotic bias (such as variational inference)
- Testing implementations of inference algorithms with asymptotically decreasing bias and variance (such as MCMC)
- Efficiency comparisons of inference algorithms with asymptotically decreasing bias and variance
- Exploratory analysis of algorithms
- Testing new algorithms on interesting models

The project will consist of identifying relevant models to include in a set of benchmarks, implementing them in Stan together with data, and then optimizing these Stan programs for performance.
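The distinction drawn above between algorithms with asymptotic bias and algorithms with asymptotically decreasing bias can be illustrated with a small self-contained simulation (Python, purely illustrative; the 0.1 offset standing in for a variational approximation's bias is an invented example):

```python
import random

random.seed(1)

def biased_draws(n):
    # Stand-in for an asymptotically biased approximation, e.g. a
    # variational family whose mean misses the true posterior mean by 0.1.
    return [random.gauss(0.1, 1.0) for _ in range(n)]

def exact_draws(n):
    # Stand-in for an asymptotically exact sampler, e.g. well-tuned MCMC
    # (here idealized as i.i.d. draws from the true N(0, 1) posterior).
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

for n in (100, 10_000, 1_000_000):
    # The biased estimator's error plateaus near 0.1, while the exact
    # sampler's error keeps shrinking as n grows.
    print(n, abs(mean(biased_draws(n))), abs(mean(exact_draws(n))))
```

Separating these two behaviors in practice requires trusted reference posteriors, which is exactly what posteriorDB is meant to provide.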
### Expected Results and Milestones

By the end of the project the student will have implemented a number of models from a diverse set of use cases, optimized for performance.

#### Milestone 1
Identify models that would be relevant to include as benchmark models.

#### Milestone 2
Implement all models in Stan and add them to posteriorDB.

#### Milestone 3
Optimize the models for performance.

### Helpful Prerequisite Experience

Some knowledge of Bayesian statistics and computation.

### What Will You Learn?

By the end of the summer, the student will have experience using probabilistic programming languages for a wide array of posteriors and will understand how these types of models can be optimized.

### What Can You Do To Prepare?

You can read up on [Stan](https://mc-stan.org), [Bayesian workflows](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html), and [posteriorDB](https://github.com/stan-dev/posteriordb).

gsoc_proposals/2021/garch.md

Lines changed: 106 additions & 0 deletions

## GARCH Modeling in brms

### Abstract

The student will implement common Generalized Autoregressive Conditional Heteroskedasticity ([GARCH](https://vlab.stern.nyu.edu/docs/volatility)) style models for users of the [brms](https://cran.r-project.org/web/packages/brms/index.html) R package. brms is a frontend for Stan that lets R users write R formula syntax that is translated and compiled into a Stan model. Currently brms supports [`arma`](https://github.com/paul-buerkner/brms/issues/708) style models on the mean, but does not support GARCH style models on the variance. The student will code up several of these models in Stan, verify with simulation based calibration that the models are well calibrated, write a technical document describing the GARCH model form and how it can be incorporated into brms, and then implement these models directly in brms.

| **Intensity** | **Priority** | **Involves** | **Mentors** |
| ------------- | ------------ | ------------- | ----------- |
| Medium | Low | R, Bayesian Modeling | [Paul-Christian Bürkner](https://github.com/paul-buerkner), [Asael A Matamoros](https://github.com/asael697), [Steve Bronder](https://github.com/SteveBronder) |

### Project Details

An Autoregressive Moving Average (ARMA) model has the form

![\Large Y_t = \epsilon_t + \sum_{i=1}^p\rho_i Y_{t-i} + \sum_{i=1}^q\theta_i \epsilon_{t-i}](https://latex.codecogs.com/svg.latex?Y_t&space;=&space;\epsilon_t&space;+&space;\sum_{i=1}^p\rho_i&space;Y_{t-i}&space;+&space;\sum_{i=1}^q\theta_i&space;\epsilon_{t-i})

where

- `Y_t` is the series at time `t`,
- `p` is the number of autoregressive terms and `\rho_i` are the autoregressive parameters,
- `q` is the number of moving average terms and `\theta_i` are the moving average parameters,
- `\epsilon_t` is the error at time `t`.
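As a concrete reading of the formula, here is a short, purely illustrative Python simulation of an ARMA(1,1) process (the coefficient values 0.6 and 0.3 are arbitrary examples; in brms these would be estimated, not fixed):

```python
import random

random.seed(3)

def simulate_arma(n, rho=0.6, theta=0.3, sigma=1.0):
    """Simulate Y_t = eps_t + rho * Y_{t-1} + theta * eps_{t-1}, an ARMA(1,1)."""
    y, y_prev, eps_prev = [], 0.0, 0.0
    for _ in range(n):
        eps = random.gauss(0.0, sigma)
        y_t = eps + rho * y_prev + theta * eps_prev
        y.append(y_t)
        y_prev, eps_prev = y_t, eps
    return y

series = simulate_arma(2000)
```

With `|rho| < 1` the process is stationary, fluctuating around zero with a constant long-run variance; the GARCH extension below instead lets the variance itself evolve over time.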
Currently, brms users can express autoregressive and moving average components of the mean of their model with syntax such as
```R
# Only autoregressive
y ~ ar(time | group, p = 1)
# Only moving average
y ~ ma(time | group, q = 1)
# Both
y ~ arma(time | group, p = 1, q = 1)
```
We would like to extend this functionality to support Generalized Autoregressive Conditional Heteroskedasticity ([GARCH](https://vlab.stern.nyu.edu/docs/volatility)) style models, which have the form

<img src="https://latex.codecogs.com/svg.latex?\small&space;\begin{align*}&space;y_t&\sim\mu_t&space;&plus;&space;\epsilon_t&space;\\&space;\epsilon_t&space;&\sim&space;\mathcal{N}{\left(0,\sigma_t^2\right)}&space;\\&space;\sigma_t^2&=\omega&space;&plus;&space;\sum_{i=1}^p&space;\beta_i\sigma_{t-i}^2&space;&plus;&space;\sum_{i=1}^q&space;\alpha_i\epsilon_{t-i}^2&space;\\&space;\end{align*}" title="\small \begin{align*} y_t&\sim\mu_t + \epsilon_t \\ \epsilon_t &\sim \mathcal{N}{\left(0,\sigma_t^2\right)} \\ \sigma_t^2&=\omega + \sum_{i=1}^p \beta_i\sigma_{t-i}^2 + \sum_{i=1}^q \alpha_i\epsilon_{t-i}^2 \\ \end{align*}" />

where

- `y_t` is the series with mean `mu_t` and error `epsilon_t` at time `t`,
- `epsilon_t` is normally distributed with mean 0 and variance `sigma_t^2`,
- `sigma_t^2` is modeled with
  - `p` autoregressive parameters `beta`,
  - `q` moving average parameters `alpha`.
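To make the variance recursion concrete, here is a minimal Python sketch (illustrative only; real use would go through brms/Stan, and the coefficient values are arbitrary examples) of simulating a GARCH(1,1) process:

```python
import math
import random

random.seed(7)

def simulate_garch(n, omega=0.1, beta=0.8, alpha=0.1, mu=0.0):
    """Simulate y_t = mu + eps_t with GARCH(1,1) variance
    sigma_t^2 = omega + beta * sigma_{t-1}^2 + alpha * eps_{t-1}^2."""
    # Start at the stationary variance (requires alpha + beta < 1).
    sigma2 = omega / (1.0 - beta - alpha)
    y, eps = [], 0.0
    for _ in range(n):
        sigma2 = omega + beta * sigma2 + alpha * eps ** 2
        eps = random.gauss(0.0, math.sqrt(sigma2))
        y.append(mu + eps)
    return y

series = simulate_garch(5000)
```

The resulting series shows the characteristic volatility clustering: calm stretches and bursts of large moves, even though the unconditional variance (here `omega / (1 - alpha - beta) = 1`) is constant.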
As you can see, ARMA and GARCH are quite similar. Things get more interesting in real-world modeling, where volatility series commonly have very heavy tails. We then build models such as the asymmetric GARCH (AGARCH), in which shocks on one side lead to more conditional heteroskedasticity than shocks on the other:

<img src="https://latex.codecogs.com/svg.latex?\small&space;\sigma_t^2=\omega&space;&plus;&space;\sum_{i=1}^p&space;\beta_i\sigma_{t-i}^2&space;&plus;&space;\sum_{i=1}^q&space;\alpha_i&space;(\epsilon_{t-i}&space;-&space;\gamma_i)^2&space;\\" title="\small \sigma_t^2=\omega + \sum_{i=1}^p \beta_i\sigma_{t-i}^2 + \sum_{i=1}^q \alpha_i (\epsilon_{t-i} - \gamma_i)^2 \\" />

Here, `gamma_i` acts as a shift: when positive, it amplifies the expected volatility following negative errors.
Inside brms it is possible to model [distributional parameters](https://paul-buerkner.github.io/brms/articles/brms_distreg.html), so for `sigma` we would like to allow syntax such as
```R
# Only autoregressive
sigma ~ ar(time | group, p = 1)
# Only moving average
sigma ~ ma(time | group, q = 1)
# Both autoregressive and moving average
sigma ~ garch(time | group, p = 1, q = 1)
# Asymmetric autoregressive
sigma ~ aar(time | group, q = 1)
# Asymmetric autoregressive and moving average
sigma ~ agarch(time | group, p = 1, q = 1)
```
This would be combined with the current ARMA syntax to create models with varying autoregressive components for both the mean and the variance, such as

```R
bf_mod = bf(
  y ~ arma(time | group, p = 1, q = 1),
  sigma ~ agarch(time | group, p = 1, q = 1)
)
```
Challenges on the brms side will include determining which of the other modeling options can be combined with GARCH models; for example, how to combine a GARCH term on `sigma` with other distributional regression terms on `sigma`. Other difficulties will be implementing all of the post-processing, such as posterior predictions. The existing ARMA terms already pave the way for much of this, but we will have to determine whether the current structure is adequate for applying autoregressive terms to distributional parameters other than the mean. There are also smaller questions, such as whether and how GARCH models can be combined with the covariance formulation of ARMA models on the mean, `arma(..., cov = TRUE)`. These are all details to figure out in the process.

### Expected Results and Milestones

The expected result of this project is that brms users will have a simple way to incorporate standard volatility models into their overall model structure.

#### Milestone 1
Take models from the [NYU Stern Volatility Lab](https://vlab.stern.nyu.edu/docs/volatility) and write them in Stan along with their data generating processes. Then use both to perform [Simulation Based Calibration](https://mc-stan.org/docs/2_23/stan-users-guide/simulation-based-calibration.html).
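For orientation, the SBC rank check referenced in Milestone 1 works roughly as follows. This self-contained Python sketch uses an analytic conjugate normal posterior as a stand-in for a fitted Stan model; all settings are arbitrary examples:

```python
import random

random.seed(11)

def sbc_ranks(n_sims=500, n_draws=99, n_obs=10, prior_sd=1.0, noise_sd=1.0):
    """SBC for a conjugate model: theta ~ N(0, prior_sd^2),
    y_i ~ N(theta, noise_sd^2). If the posterior is computed correctly,
    the rank of the true theta among posterior draws is uniform on 0..n_draws."""
    ranks = []
    for _ in range(n_sims):
        theta = random.gauss(0.0, prior_sd)                      # draw from prior
        y = [random.gauss(theta, noise_sd) for _ in range(n_obs)]  # simulate data
        # Exact conjugate posterior N(post_mean, post_var)
        post_var = 1.0 / (1.0 / prior_sd ** 2 + n_obs / noise_sd ** 2)
        post_mean = post_var * (sum(y) / noise_sd ** 2)
        draws = [random.gauss(post_mean, post_var ** 0.5) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))
    return ranks

ranks = sbc_ranks()
```

For a GARCH model, the fitted Stan posterior replaces the analytic one, and any systematic departure of the ranks from uniformity signals a miscalibrated implementation.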
#### Milestone 2
Write a tech spec as an issue in brms suggesting the syntax style, supported models, outstanding issues, any good default priors found for the parameters, etc. (Paul would probably need to lay out the template here.)

#### Milestone 3
Make a prototype with tests, prediction methods, and docs for a simple `garch(p, q)` term on `sigma`.

#### Milestone 4
Add additional GARCH flavors such as GJR-GARCH, EGARCH, APARCH, AGARCH, etc.

### Helpful Prerequisite Experience

- Knowledge of the tools behind developing R packages
- Experience with Stan and Bayesian modeling
- Studies in time series and volatility modeling

### What Will You Learn?

By the end of the project the student will have experience in R package development, in working with a large international open-source development team, with the algorithms behind volatility modeling, and with the workflow for developing Bayesian models.

### What Can You Do To Prepare?

It would be very good to read materials on [time series modeling with Bayesian statistics](https://mc-stan.org/docs/2_20/stan-users-guide/time-series-chapter.html), be familiar with [Simulation Based Calibration](https://mc-stan.org/docs/2_23/stan-users-guide/simulation-based-calibration.html), and try some practice models in the Stan programming language.

gsoc_proposals/2021/lambertw.md

Lines changed: 78 additions & 0 deletions

## Lambert W Transforms

### Abstract

This project will add [Lambert W transforms](https://discourse.mc-stan.org/t/adding-lambert-w-transforms-to-stan/13906) to the Stan language. Lambert W transforms will allow users to center parameters in models whose data exhibit skewness and asymmetric kurtosis.

| **Intensity** | **Priority** | **Involves** | **Mentors** |
| ------------- | ------------ | ------------- | ----------- |
| Hard | Low/Medium | Stan, R/Python, C++ | [Steve Bronder](https://github.com/SteveBronder), [Sean Pinkney](https://github.com/spinkney) |

### Project Details

Lambert W transforms are a method for modeling skewness and asymmetric kurtosis in a distribution. For example, using the [`LambertW`](https://cran.r-project.org/web/packages/LambertW/index.html) R package it is possible to Gaussianize a Cauchy distribution, such that after the Lambert W transform the data is normally distributed.
```R
# Univariate example
library(LambertW)
set.seed(20)
y1 <- rcauchy(n = 100)
plot(density(y1))
out <- Gaussianize(y1, return.tau.mat = TRUE)
plot(density(out$input)) # huh!
x1 <- get_input(y1, c(out$tau.mat[, 1])) # same as out$input
test_normality(out$input) # Gaussianized a Cauchy!
```
![](https://aws1.discourse-cdn.com/standard14/uploads/mc_stan/original/2X/6/6f039d1b23ce1ed4d75036bce63e374b0cb4cfdf.png)

Lambert W transforms can be estimated either by maximum likelihood or by moment matching, and discussion is ongoing as to which will be better for the Stan language. Ideally the student would implement both, compare the outcomes and code complexity, and provide recommendations on when each approach is suitable.

The maximum likelihood approach can be ported directly to the Stan language to use MCMC methods instead of MLE. The issue with this approach is that one must specify the original distribution to which the transformation applies, and each distribution will need its own specialization. The upside, however, is that this approach is the most efficient in terms of parameter accuracy.

The second way of estimating the distribution is by moment matching. In fact, TensorFlow does this with its [`gaussianize`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/scale_to_gaussian) function. TensorFlow bases its code on [Characterizing Tukey h and hh-Distributions through L-Moments and the L-Correlation](https://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=1005&context=epse_pubs) by Headrick and Pant. The TensorFlow code uses [binary search](https://github.com/tensorflow/transform/blob/879f2345dcd6096104ae66027feacb099e228e66/tensorflow_transform/gaussianization.py) to numerically solve for the left and right tail parameters when the distribution is asymmetric.

With the L-moments method it is possible to write a `tukey_symmetric_lpdf` and a `tukey_skew_lpdf` that Stan samples from on the standard normal scale. Along with estimating the location and scale, the symmetric version would take an `h` parameter and the skew version `h_l` and `h_r` parameters. An ambitious student could also implement the multivariate version, something TensorFlow does not currently do, via a `multi_tukey` specification. Each marginal density would have its own skew and/or kurtosis parameters, connected via a correlation matrix; see equation 4.1 in the paper, which uses the Cholesky factor of the correlation matrix.
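To make the transform concrete, here is a self-contained Python sketch (illustrative; the Stan implementation would differ) of one common parameterization of the symmetric Tukey h / Lambert W relationship: the forward map `z = u * exp(h * u^2 / 2)` fattens the tails of a standard normal draw `u`, and the Lambert W function inverts it. The Newton solver below is a stand-in for a library routine such as `scipy.special.lambertw`:

```python
import math

def lambert_w0(x, tol=1e-12):
    """Principal branch of the Lambert W function for x >= 0,
    solved by Newton iteration on w * exp(w) = x."""
    w = math.log1p(x)  # starting guess; log(1 + x) >= W(x) for x >= 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def tukey_h_forward(u, h):
    # Heavy-tail transform: z = u * exp(h * u^2 / 2), with h > 0
    return u * math.exp(h * u * u / 2.0)

def tukey_h_inverse(z, h):
    # Square the forward map: z^2 = u^2 * exp(h * u^2),
    # so h * z^2 = (h * u^2) * exp(h * u^2) and u^2 = W(h * z^2) / h.
    u = math.sqrt(lambert_w0(h * z * z) / h)
    return math.copysign(u, z)

z = tukey_h_forward(1.3, 0.2)
u = tukey_h_inverse(z, 0.2)  # recovers 1.3
```

Evaluating this inverse (and its log Jacobian) inside the log density is the core of a `tukey_symmetric_lpdf`; the skewed version applies different `h_l`, `h_r` on each side.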
### Expected Results and Milestones

By the end of the project users will be able to use Lambert W distributions inside their Stan models.

#### Milestone 1

Implement the MLE version of the Lambert W transforms directly as a function written in the Stan language, along with data generating processes. Then use both to perform [Simulation Based Calibration](https://mc-stan.org/docs/2_23/stan-users-guide/simulation-based-calibration.html) to verify the correctness of the model.

#### Milestone 2

1. Work with Stan developers to make a prototype of the MLE version of the Lambert W transforms in C++.
2. In parallel, begin work on the moment-matching estimator in the Stan language.

#### Milestone 3

1. Add the MLE prototype to the [Stan Math library](https://github.com/stan-dev/math).
2. Compare the moment-matching estimator to the model(s) and SBC implemented in Milestone 1.

#### Extra Credit

1. Implement the moment-matching code in C++.
2. As a stretch goal, add the multivariate extension.

### Helpful Prerequisite Experience

The student should have some familiarity with, or be willing to learn:

- Stan
- C++
- Bayesian modeling

### What Will You Learn?

By the end of the summer, the student will have experience with:

- Bayesian modeling and computing
- Writing performant C++
- Working in a large, international open-source development team
- Algorithm development

### What Can You Do To Prepare?

Students should read the [original paper](https://www.hindawi.com/journals/tswj/2015/909231/) on Lambert W transforms as well as the associated R package, [LambertW](https://cran.r-project.org/web/packages/LambertW/index.html).
Lines changed: 6 additions & 0 deletions

# Stan GSoC 2021 Project Descriptions

Stan is excited to offer the following projects for GSoC 2021:

- [Lambert W Transforms](lambertw.md)
- [Benchmarking Bayesian Models in Stan](bayesian_benchmarking.md)
- [GARCH Models in brms](garch.md)
