squidSim_vignette/08-sampling.Rmd at main · squidgroup/squidSim_vignette · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
# Sampling  {#sampling}
```{r, echo=FALSE, message=FALSE,warning=FALSE}
rm(list=ls())
library(dplyr)
library(flair)
library(lme4)
library(MCMCglmm)
library(scales)
library(squidSim)
library(knitr)
set.seed(25)
```
To create different sampling schemes, we can use the sample arguments in the `simulate_population()` function, for example:
```{r, eval=FALSE}
  sample_type = "nested",
  sample_param = cbind(individual=c(10, 15),observation=c(10, 5)),
```

There are three different types of sampling 'nested', 'missing' and 'temporal', each of which are outlined below.

The sample arguments create different datasets for each population that has been simulated, and you can then use the function `get_sample_data()` to extract the sampled data.


## Nested

Nested sampling assumes that you have a nested structure, and allows you to sample different numbers at each hierarchical level. The `param` input is a matrix with (named) columns.  The rows of this matrix represent different sampling sets. This is most easily put together using the `cbind()` (column bind) function, specifying the names. The number of repeat observations for a higher level can be specified using name 'observation' (this doesn't have to exist in the data structure). For example

```{r}
cbind(individual=c(10, 15),observation=c(10, 5))
```
 would represent sampling the data structure above, the first set having 10 individuals each with 10 observations

Note this sampling procedure only produces balanced sampling designs. For unbalanced designs see 'missing data' below.

### Worked example 1
We want to see how the number of repeat measurements on individuals affects power. In order to vary the number of observations of an individual, we could specify:

```{r}
param <- cbind(nest=10,individual=10,observation=c(20, 10, 5, 2))
```

```{r}

pop_data <- simulate_population(
  data_structure = make_structure("nest(10)/individual(20)",repeat_obs=20),

  parameters = list(
    individual = list(
      vcov = 0.1
    ),
    observation= list(
      names = c("environment"),
      beta =c(0.5)
    ),
    residual = list(
      vcov = 0.8
    )
  ),

  sample_type = "nested",
  sample_param = param
)
```

To extract the sampled data we can then use `get_sample_data()` specifying which sample set we want, for example the second set 10 nests, each with 10 individuals with 10 observations:

```{r}
sample_data <- get_sample_data(pop_data, sample_set=2)
length(unique(sample_data$nest))
length(unique(sample_data$individual))
nrow(sample_data)
```

## Missing data
The missing data methods allows generation of unbalanced data. Missing data is generated through creating probabilities of being sampled using logistic regression. Missingness can then either be random, or a function of any of the simulated variables.

This methods allows the different classes of missing data to be generated:

- **M**issing **C**ompletely **A**t **R**andom (MCAR)
    - All observations have an equal probability of being sampled
- **M**issing **A**t **R**andom (MAR)
    - Probability of missingness is dependent on variables correlated with the response variable (i.e. a predictor variable)
- **M**issing **N**ot **A**t **R**andom (MNAR)
    - Probability of missingness is dependent on the response variable itself

<!--
Logistic function - probability of being sampled
	MNAR Y = beta_1 * body size
	MNAR Y = beta_1 * (temp +residual)
	MAR Y = beta_2*temp
-->

### MCAR
Missing completely at random occurs when the probability of missingness is not dependent on anything.  This can be implemented through a logistic regression, where only the intercept is specified:
$$ logit(p) = beta_0 $$

Note this intercept is on the logit scale, so 0 is equivalent to 0.5.

```{r}
pop_data <- simulate_population(
  data_structure = make_structure("individual(100)",repeat_obs=5),

  parameters = list(
    individual = list(
      vcov = 0.1
    ),
    observation= list(
      names = c("environment"),
      beta =c(0.5)
    ),
    residual = list(
      vcov = 0.8
    )
  ),

  sample_type = "missing",
  sample_param = "0"
)

sample_data <- get_sample_data(pop_data)
nrow(sample_data)
```

### MAR
Missing  at random occurs when the probability of missingness is dependent on a predictor variable (or a variables correlated with y). This can be implemented through a logistic regression, where the predictor variable(s) is a predictor(s) of y:
$$ logit(p) = beta_0 + beta_1*environment $$
```{r}
pop_data <- simulate_population(
  data_structure = make_structure("individual(100)",repeat_obs=5),

  parameters = list(
    individual = list(
      vcov = 0.1
    ),
    observation= list(
      names = c("environment"),
      beta =c(0.5)
    ),
    residual = list(
      vcov = 0.8
    )
  ),

  sample_type = "missing",
  sample_param = "0.5*environment"
)

sample_data <- get_sample_data(pop_data)
nrow(sample_data)

```
The predictor variables are scaled (mean 0, variance 1), so the slopes are directly comparable across traits, and intercept represents the mean (on the logit scale).

### MNAR
Missing not at random occurs when the probability of missingness is dependent on the response variable itself variable (i.e. y). This can be implemented through a logistic regression, where the predictor variable is y:
$$ logit(p) = beta_0 + beta_1*y $$
Again y is scaled.

```{r}
pop_data <- simulate_population(
  data_structure = make_structure("individual(100)",repeat_obs=5),

  parameters = list(
    individual = list(
      vcov = 0.8
    ),
    observation= list(
      names = c("environment"),
      beta =c(0.1)
    ),
    residual = list(
      vcov = 0.5
    )
  ),

  sample_type = "missing",
  sample_param = "0.5*y"
)

sample_data <- get_sample_data(pop_data)
nrow(sample_data)

```

Lets try and visualise this. We know there is lots f between individual variation, and we know sampling is based on phenotype, so we would expect an association between number of observations and phenotype:
```{r}
ind_data <- data.frame(
  n=as.vector(table(sample_data$individual)),
  mean=tapply(sample_data$y,sample_data$individual,mean)
  )
boxplot(mean~n,ind_data)

```

<!-- example of variation in sampling between individuals, but not related to phenotype - create an individual level variable with beta=0 (doesn't affect phenotype, and then do sampling based on that) -->


## Temporal Sampling

In the parameters we specify a list, with the temporal variable `time`, the grouping variable with which the temporal sampling occurs `group`, the between group variance (as a proportion) in sampling times `variance` and the within group sample size `n`:
```{r}

pop_data <- simulate_population(
  data_structure = make_structure("day(100) + individual(100)",repeat_obs=1),

  parameters = list(
    individual = list(
      vcov = 0.1
    ),
    day=list(
    	covariate=TRUE,
    	beta=0.4
    ),
    residual = list(
      vcov = 0.8
    )
  ),

  sample_type = "temporal",
  sample_param = list(
    time = c("day"),
    group = c("individual"),
    variance = c(0.1,0.2),
    n=4),

)

sample_data <- get_sample_data(pop_data)


```