Nicholas Michaud edited this page Sep 22, 2017 · 4 revisions

This page describes how cross-validation could work in NIMBLE, and poses an open question about its functionality. Any comments or ideas are very welcome!

Cross-validation Overview

Cross-validation methods generally work by splitting a model's data into two parts: a training portion and a test portion. The model is fit (in our case, likely via MCMC) to the training data only, and values for the held-out test data are simulated (in our case, from their posterior predictive distribution). These simulated values are then compared to the observed, held-out values via some loss function. Averaging the loss over all posterior draws produces an estimate of model fit.
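
As a minimal sketch of the last step, the snippet below averages a loss over posterior predictive draws of the held-out data. Everything in it is an assumption made for illustration: the squared-error loss, the function names, and the toy numbers are hypothetical, and the draws would in practice come from an MCMC run rather than being typed in by hand.

```python
from statistics import mean

def squared_error(simulated, observed):
    """Mean squared error between one simulated draw and the held-out data."""
    return mean((s - o) ** 2 for s, o in zip(simulated, observed))

def average_loss(posterior_draws, observed, loss=squared_error):
    """Average a loss function over all posterior predictive draws."""
    return mean(loss(draw, observed) for draw in posterior_draws)

# Hypothetical held-out observations and three posterior predictive draws.
held_out = [1.0, 2.0, 3.0]
draws = [
    [1.0, 2.0, 3.0],  # matches the data exactly
    [2.0, 3.0, 4.0],  # every value off by 1
    [3.0, 4.0, 5.0],  # every value off by 2
]

cv_estimate = average_loss(draws, held_out)
```

Any other loss with the same two-argument signature could be passed in place of `squared_error`.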

A commonly used type of cross-validation is k-fold cross-validation, in which the data set is partitioned into k roughly equal parts. Each of the k parts is held out in turn as test data, in the manner described above, while the model is fit to the remaining k - 1 parts, and a model fit estimate is obtained for that part. The k model fit estimates are then averaged to provide an overall measure of fit. k-fold validation has the advantage of producing an estimate of fit that takes into account all data points.
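
The k-fold procedure can be sketched as follows, assuming the data points are simply indexed 0 to n - 1 and assigned to folds at random. The names `kfold_indices`, `kfold_estimate`, and the `fit_and_score` callback are hypothetical; the callback stands in for "fit the model to the training indices, then score it on the test indices" and is not an existing NIMBLE function.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every k-th shuffled index gives fold sizes that differ by at most 1.
    return [sorted(idx[f::k]) for f in range(k)]

def kfold_estimate(n, k, fit_and_score, seed=0):
    """Hold out each fold in turn as test data and average the k scores."""
    scores = []
    for test_idx in kfold_indices(n, k, seed):
        held = set(test_idx)
        train_idx = [i for i in range(n) if i not in held]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

folds = kfold_indices(10, 3)

# Placeholder scorer that only reports the test-set size; a real scorer would
# fit the model on `train` and return a loss computed on `test`.
estimate = kfold_estimate(10, 3, lambda train, test: len(test))
```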

Cross-validation in NIMBLE

Employing cross-validation for hierarchical models in a general manner raises the question: what method do we want to use for "leaving out" data? Imagine we wish to perform k-fold CV. In simple models, e.g. a linear model with iid data, using random, equally sized subsets of the data as partitions is a reasonable approach. For more general hierarchical models, randomly choosing data points no longer seems like a good go-to solution.

For example, the hierarchical hidden Markov model (HMM) presented as Model 1 in Turek et al. (2016) is a model in which some number of HMMs are related by a set of top-level parameters. Randomly choosing a subset of data to leave out for this model will almost surely result in some data points being left out from each of the different HMMs. The cross-validation metric in this case would measure how well the model can predict these randomly withheld data points across a range of different HMMs, which may not be a measurement of interest to a researcher. This issue is described in an older blog post by Andrew Gelman, and likely in a number of other places.

More relevant to a researcher could be how well the model can predict a new, currently unseen HMM given the already observed HMMs. This could be estimated by leaving out whole sets of data corresponding to one (or more) of the lower-level HMMs in the model, predicting the course of these left-out HMMs, and comparing the predictions to the observed data.
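
A rough sketch of this group-wise hold-out, assuming the observations are stored as a list of rows with one row per lower-level HMM. The data layout, the function name `leave_group_out_splits`, and the toy values are all hypothetical.

```python
def leave_group_out_splits(y):
    """Yield (held_out_index, train_rows, test_row), holding out one whole
    group (row) at a time instead of scattered individual data points."""
    for i in range(len(y)):
        train_rows = [row for j, row in enumerate(y) if j != i]
        yield i, train_rows, y[i]

# Hypothetical observations: 3 HMMs (rows) observed at 3 time points (columns).
y = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

splits = list(leave_group_out_splits(y))
```

Each split leaves the remaining HMMs fully intact, so the resulting score reflects prediction of an entire unseen HMM rather than of isolated points within partially observed ones.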

This HMM example is intended to show that sensible methods of leaving out data in hierarchical models can be very model-specific, and randomly leaving out data across all of a model's data points is not always sensible. As such, it may make more sense to let users define the subsets of data they wish to leave out. This can be imagined in a few different forms.

Ideas for user-specification of what data to leave out

  1. (The current implementation) Assume that all of the data in a user's hierarchical model is in an M-dimensional array. Further, assume that the i-th dimension of that array is the dimension that defines the hierarchical grouping. This method would perform k-fold CV by partitioning the data along this dimension. All the user would need to provide would be the name of the data array and the number of the dimension that defines the hierarchical groupings.

    For example, in the hierarchical HMM model described above, suppose the observed data is an array y[i, t]. Here, the row i denotes data coming from HMM number i, and column t denotes data at time point t. In this model formulation, the first dimension (the rows i) would be the dimension that defines the hierarchical grouping. k-fold CV, with k equal to the number of rows, could be accomplished by sequentially leaving out each row of data and predicting the values of that row given the other rows.

    The drawbacks to this implementation are the strict requirements for the format of the data (all data is contained in an array, one dimension of that array defines the hierarchical groupings) that may make it unusable for many hierarchical model formulations.

  2. Extensions to the above implementation can be imagined. A user could be allowed to specify combinations of dimensions along which to leave out data (e.g. an array y[i, j, k] where y[i, j, ] is left out as test data for every combination of i and j). Extensions to multiple data variables in the model could also be implemented.

  3. A user could provide a list. Each named element of the list would have the name of a data variable in the model. These named elements would be objects of the same dimensions as the model variables, but instead of containing data, they would contain integers in the range [1, k] for k-fold CV. Each integer would denote the "fold" in which that data point would be left out. This implementation would offer the most flexibility, but would also require users to do the most work before running the function. This implementation would also run into a potential issue in dealing with multivariate nodes -- e.g., if a user specified that the first element of a multivariate node should be held out in fold 1, and the second element of that same multivariate node should be held out in fold 2.
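
To make the first and third ideas concrete, here is a small sketch under several assumptions: fold labels are stored in a dictionary keyed by (0-based) index tuples rather than in an array, folds along the grouping dimension are assigned by cycling through 1..k, and multivariate nodes are described by a hypothetical mapping from node name to the indices they cover. None of these names or structures come from NIMBLE itself.

```python
from itertools import product

def fold_labels_along_dim(shape, dim, k):
    """Label every element of an array of the given shape with a fold in 1..k,
    so that elements sharing an index along `dim` always share a fold."""
    all_indices = product(*(range(s) for s in shape))
    return {idx: idx[dim] % k + 1 for idx in all_indices}

def split_multivariate_nodes(labels, nodes):
    """Return names of nodes whose elements are assigned to more than one fold."""
    return [name for name, elems in nodes.items()
            if len({labels[e] for e in elems}) > 1]

# Idea 1: y[i, t] with 4 HMMs and 3 time points, grouped along the first dimension.
labels = fold_labels_along_dim(shape=(4, 3), dim=0, k=4)

# Hypothetical multivariate node covering the whole first row of y.
nodes = {"y[1, 1:3]": [(0, 0), (0, 1), (0, 2)]}

# Idea 3: a user edits the labels by hand and splits that node across folds.
custom = dict(labels)
custom[(0, 1)] = 2

consistent = split_multivariate_nodes(labels, nodes)
conflicting = split_multivariate_nodes(custom, nodes)
```

A check along the lines of `split_multivariate_nodes` is one possible way to reject user-supplied fold labels that divide a multivariate node before any model fitting starts.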

It seems like the trick will be making the algorithm flexible enough to work for a variety of models, while making the input easy enough for a user to want to try it.
