Rcode/EX_MultipleImputation_18.Rmd at master · coreysparks/Rcode · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
---
title: "Example 7 Multiple Imputation & Missing Data"
author: "Corey Sparks, PhD"
date: "March 5, 2018"
output: html_document
---

This example will illustrate typical aspects of dealing with missing data. Topics will include: Mean imputation, modal imputation for categorical data, and multiple imputation of complex patterns of missing data.

For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data. [Link](https://www.cdc.gov/brfss/smart/smart_2016.html)


```{r}
library(car)
library(mice)
library(ggplot2)
load(file = "~/Google Drive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")
nams<-names(brfss16m)
head(nams, n=10)
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
set.seed(1234)
newnames<-tolower(gsub(pattern = "_",replacement =  "",x =  nams))
names(brfss16m)<-newnames

samp<-sample(1:dim(brfss16m)[1], size = 50000) #smaller sample for brevity
brfss16m<-brfss16m[samp,]
#Healthy days
brfss16m$healthdays<-recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")

#Healthy mental health days
brfss16m$healthmdays<-recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")

brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")

brfss16m$race_eth<-recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")

#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss16m$inc<-recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor.result = T)
brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')

#employloyment
brfss16m$employ<-recode(brfss16m$employ1,
recodes="1:2='employloyed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employloyed')

#marital status
brfss16m$marst<-recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')

#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))

#BMI, in the brfss16ma the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real bmi's

brfss16m$bmi<-brfss16m$bmi5/100

#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")


```

Now, we can get a general idea of the missingness of these variables by just using `summary(brfss16m)`

```{r}
summary(brfss16m[, c("ins", "smoke",  "bmi", "badhealth", "race_eth",  "educ", "employ", "marst", "inc")])
```

Which shows that, among these recoded variables, `inc` , the income variable, `r table(is.na(brfss16m$inc))[2]` people in the BRFSS, or `r 100* (table(is.na(brfss16m$inc))[2]/length(brfss16m$inc))`% of the sample.

The lowest number of missings is in the bad health variable, which only has `r 100* (table(is.na(brfss16m$badhealth))[2]/length(brfss16m$badhealth))`% missing.

###Mean imputation
Now, i'm going to illustrate mean imputation of a continuous variable, BMI.
```{r}
#I'm going to play with 3 outcomes, bmi, having a regular doctor and income category
summary(brfss16m$bmi)

#what happens when we replace the missings with the mean?
brfss16m$bmi.imp.mean<-ifelse(is.na(brfss16m$bmi)==T, mean(brfss16m$bmi, na.rm=T), brfss16m$bmi)

mean(brfss16m$bmi, na.rm=T)
mean(brfss16m$bmi.imp.mean) #no difference!

median(brfss16m$bmi, na.rm=T)
median(brfss16m$bmi.imp.mean) #slight difference

var(brfss16m$bmi, na.rm=T)
var(brfss16m$bmi.imp.mean) # more noticeable difference!

```

So what we see here, is that imputing with the mean does nothing to central tendency (when measured using the mean, but does affect the median slightly), but it does reduce the variance in the outcome. This is because you're replacing all missing cases with the most likely value (the mean), so you're artificially deflating the variance. That's not good.

We can see this in a histogram:
```{r}
#plot the histogram
hist(brfss16m$bmi.imp.mean)
hist(brfss16m$bmi, add=T ,col=1) #

```

Where you can see the extra values at the mean by the white area over the mode.

###Modal imputation
If we have a categorical variable, an easy way to impute the values is to use modal imputation, or impute cases with the mode, or most common value. It doesn't make sense to use the mean, because what would that mean for a categorical variable?

```{r}
table(brfss16m$employ)
#find the most common value
mcv.employ<-factor(names(which.max(table(brfss16m$employ))), levels=levels(brfss16m$employ))
mcv.employ
#impute the cases
brfss16m$employ.imp<-as.factor(ifelse(is.na(brfss16m$employ)==T, mcv.employ, brfss16m$employ))
levels(brfss16m$employ.imp)<-levels(brfss16m$employ)

prop.table(table(brfss16m$employ))
prop.table(table(brfss16m$employ.imp))

barplot(prop.table(table(brfss16m$employ)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss16m$employ.imp)), main="Imputed Data",ylim=c(0, .6))
```

Which doesn't look like much of a difference because only `r table(is.na(brfss16m$employ))[2]` people were missing. Now let's try modal imputation on income group:

```{r}
table(brfss16m$inc)
#find the most common value
mcv.inc<-factor(names(which.max(table(brfss16m$inc))), levels = levels(brfss16m$inc))
mcv.inc
#impute the cases
brfss16m$inc.imp<-as.factor(ifelse(is.na(brfss16m$inc)==T, mcv.inc, brfss16m$inc))
levels(brfss16m$inc.imp)<-levels(as.factor(brfss16m$inc))

prop.table(table(brfss16m$inc))
prop.table(table(brfss16m$inc.imp))

barplot(prop.table(table(brfss16m$inc)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss16m$inc.imp)), main="Imputed Data", ylim=c(0, .6))
```

Which shows how dramatically we alter the distribution of the variable by imputing at the mode.

###Multiple Imputation
These days, these types of imputation have been far surpassed by more complete methods that are based upon regression methods. These methods are generally referred to as multiple imputation, because we are really interested in imputing multiple variables simultaneously. Instead of reviewing this perspective here, I suggest you have a look at Joe Schafer's [site](http://sites.stat.psu.edu/~jls/mifaq.html) that gives a nice treatment of the subject. Here, I will use the imputation techniques in the `mice` library in R, which you can read about [here](http://www.jstatsoft.org/v45/i03/paper).

I have used these in practice in publications and generally like the framework the library uses. Another popular technique is in the `Amelia` library of [Gary King](http://gking.harvard.edu/amelia), which I haven't used much. If you are serious about doing multiple imputation it would be advised to investigate multiple methodologies.

To begin, I explore the various patterns of missingness in the data. The `md.pattern` function in `mice` does this nicely. Here, each row corresponds to a particular pattern of missingness (1 = observed, 0=missing)
```{r}
#look at the patterns of missingness
md.pattern(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")])
```

Shows that 3688 rows of the data are complete (first row).

The second row shows that 135 people are missing *only* the bmi variable. I say *only* because `r table(is.na(brfss16m$bmi))[2]` people are missing the bmi variable in total. Apparently some folks are missing in combination with other variables.

Sure enough, if we look down, we see that 95 people are missing bmi *AND* income, 17 are missing bmi *AND* income and education, and so on.

The bottom row tells how many total people are missing each variable, in *ANY* combination with other variables.

If you want to see how pairs of variables are missing together, the `md.pairs()` function will show this.

A pair of variables can have exactly four missingness patterns:
Both variables are observed (pattern `rr`), the first variable is observed and the second variable is missing (pattern `rm`), the first variable is missing and the second variable is observed (pattern `mr`), and both are missing (pattern `mm`).
```{r}
md.pairs(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")])
```

###Basic imputation:
We can perform a basic multiple imputation by simply doing: **Note this may take a very long time with big data sets**

```{r}
imp<-mice(data = brfss16m[,c("bmi", "inc", "agec","educ","race_eth")], seed = 22, m = 20)

print(imp)
```

Shows how many imputations were done. It also shows total missingness, which imputation method was used for each variable (because you wouldn't want to use a normal distribution for a categorical variable!!).

It also shows the sequence of how each variable is visited (or imputed, the default is left to right).

We may want to make sure imputed values are plausible by having a look. For instance, are the BMI values outside of the range of the data.

```{r}
head(imp$imp$bmi)
summary(imp$imp$bmi)
summary(brfss16m$bmi)
```

```{r}
head(imp$imp$inc)
summary(imp$imp$inc)
```
Which shows the imputed values for the first 6 cases across the 5 different imputations, as well as the numeric summary of the imputed values. We can see that there is variation across the imputations, because the imputed values are not the same.


We can also do some plotting. For instance if we want to see how the observed and imputed values of bmi look with respect to race, we can do:
```{r, fig.height=7, fig.width=8}
library(lattice)
stripplot(imp,bmi~race_eth|.imp, pch=20)
```

and we see the distribution of the original data (blue dots), the imputed data (red dots) across the levels of race, for each of the five different imputation runs(the number at the top shows which run, and the first plot is the original data).

This plot shows that the bmi values correspond well with the observed data, *so they are probably plausible valuesU*.

If we want to get our new, imputed data, we can use the `complete()` function, which by default extracts the first imputed data set. If we want a different one, we can do `complete(imp, action=3)` for example, to get the third imputed data set.

```{r}
dat.imp<-complete(imp, action = 5)
head(dat.imp, n=10)

#Compare to the original data
head(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")], n=10)
```

While the first few cases don't show much missingness, we can coax some more interesting cases out and compare the original data to the imputed:

```{r}
head(dat.imp[is.na(brfss16m$bmi)==T,], n=10)
head(brfss16m[is.na(brfss16m$bmi)==T,c("bmi", "inc", "agec","educ","race_eth")], n=10)

```


###Analyzing the imputed data
A key element of using imputed data, is that the relationships we want to know about should be maintained after imputation, and presumably, the relationships within each imputed data set will be the same. So if we used each of the (5 in this case) imputed data sets in a model, then we should see similar results across the five different models.

Here I look at a linear model for bmi:
```{r}
#Now, I will see the variability in the 5 different imputations for each outcom
fit.bmi<-with(data=imp ,expr=lm(bmi~inc+agec+educ+race_eth))
fit.bmi
```

###variation in bmi
```{r}

with (data=imp, exp=(sd(bmi)))
```

###Frequency table for income
```{r}
with (data=imp, exp=(prop.table(table(inc))))
```

###Frequency table for race/ethnicty

```{r}
with (data=imp, exp=(prop.table(table(race_eth))))
```

###Frequency table for education

```{r}
with (data=imp, exp=(prop.table(table(educ))))

```


Now we pool the separate models from each imputed data set:
```{r}
est.p<-pool(fit.bmi)
print(est.p)
summary(est.p)
```

We need to pay attention to the `fmi` column and the `lambda` column. These convey information about how much the missingness of each particular variable affects the model coefficients.
```{r}
lam<-data.frame(lam=est.p$lambda, param=names(est.p$lambda))

ggplot(data=lam,aes(x=param, y=lam))+geom_col()+theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

It appears that the education variables and race/ethnicity have large variances to them. This suggests that there may be noticeable variation in the resulting coefficient of the model, depending on which imputed data set we use.

We can also compare to the model fit on the original data, with missings eliminated:
```{r}
library(dplyr)
bnm<-brfss16m%>%
  select(bmi, inc, agec, educ, race_eth)%>%
  filter(complete.cases(.))%>%
  as.data.frame()

summary(lm(bmi~inc+agec+educ+race_eth, bnm))
```

###Compare imputed model to original data
Here, I compare the coefficients from the model where we eliminated all missing data to the one that we fit on the imputed data:
```{r}
fit1<-lm(bmi~inc+agec+educ+race_eth, data=brfss16m)
summary(fit1)

fit.imp<-lm(bmi~inc+agec+educ+race_eth, data=dat.imp)
summary(fit.imp)
```

So for instance, the coefficient for college grad in the imputed model is -1.20, but is -1.23 in the model where the data were limited to complete cases only. The notable patter that emerges in the imputed data is the lack of significance for the income variables. In the analysis that only uses complete  cases, we see a significant income effect on bmi, but not once we impute the missing values. This suggests a significant selection effect for the income variable.

##Flag variables
We can construct a flag variable. This is a useful exercise to see whethere we have missing not at random within the data:

```{r}
fit1<-lm(bmi~agec+educ+race_eth+is.na(inc), data=brfss16m)
summary(fit1)

```

And indeed we see that those with missing incomes have signifcantly lower bmi's.


##Examining the variation in the models for the imputed data
If we wanted to see the ranges of the betas in the five imputed data models, we could do that:
```{r, fig.height=6, fig.width=9}
#get the coefficients from each of the 5 imputations of bmi
coefs<-matrix(unlist(lapply(fit.bmi$analyses, coef)), nrow=5, ncol=17, byrow=T)

#plot the coefficients from each of the different rounds of imputation to see the variability in the
#results
plot(coefs[1,-1], ylim=c(-2, 4), xaxt="n",cex=1.5, pch=20, ylab="beta")
axis(1, at=1:16, labels=names(fit.bmi$analyses[[1]]$coef[-1]))
cols=2:5
for(i in 1:dim(coefs)[1]-1){
  points(coefs[i+1,-1], col=cols[i], cex=1.5, pch=20)
}
title(main="Estimated Betas from Each Imputed Regression Model for BMI Outcome")

```