bayesian-learning-with-R/BruceCampbell_ST540_HW_8.Rmd at master · NCSU-STATS/bayesian-learning-with-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Applied Bayesian Analysis : NCSU ST 540"
subtitle: "Homework 7"
author: "Bruce Campbell"
fontsize: 11pt
output: pdf_document
bibliography: BruceCampbell_ST540_HW_1.bib
---

---
```{r setup, include=FALSE,echo=FALSE}
rm(list = ls())
setwd("d:/brucebcampbell-git/bayesian-learning-with-R")
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(dev = 'pdf')
knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(tidy=TRUE)
knitr::opts_chunk$set(prompt=FALSE)
knitr::opts_chunk$set(fig.height=5)
knitr::opts_chunk$set(fig.width=6)
knitr::opts_chunk$set(warning=FALSE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_knit$set(root.dir = ".")
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=38),tidy=TRUE)
library(latex2exp)
library(pander)
library(ggplot2)
library(GGally)
```

In this assignment we will performing random slopes logistic regression in JAGS using the Gambia
data described in ```http://www4.stat.ncsu.edu/~reich/ABA/code/GLM```
Let $Y_i$ be the binary response for individual $i$ , and let $\nu_i \in {1 \cdots 65}$ denote the village of individual $i$ Let $X_i = 1$ if individual $i$ regularly sleeps under a bed-net and $X_i = 0$ otherwise. Fit the model

$$logit(P(Y_i=1)) = \alpha_{\nu_i} + X_i \beta_{\nu_i}$$
where $\alpha_{\nu_i}$ and $\beta_{\nu_i}$ are the intercept and slope for village $j$
The priors (independent over village and with each other) are
$$\alpha_{\nu_i} \sim Normal(\mu_a,\sigma_a^2)$$
and
$$\beta_{\nu_i}  \sim Normal(\mu_b,\sigma_b^2)$$

We choose uninformative priors for $\mu_a,\sigma_a^2,\mu_b,\sigma_b^2$

Before we perform the analysis we address the question of why might the effect of bed-net vary by village? The effect of bed-net use on malaria status could vary based on many factors.  We list a few below;

- proximality to wet area with mosquito larve
- the amount of time childeren play outside or in areas near mosquito infested areas
- variation in the level of malaria parasites in the mosquito population

### Model definition and MCMC sampling

```{r}
library(rjags)
library(coda)
library(modeest)

DEBUG <- FALSE
if(DEBUG)
{
nSamples <- 10000
n.chains <- 1
} else
{
nSamples <- 20000
n.chains <- 8
}

load("gambia.RData")

X.net <- as.numeric((X$netuse==1) | (X$treated==1))
Y <- pos
n <- length(X.net)

df <- data.frame(cbind(X.net,village,pos))
names(df) <- c("net","village","pos")
numVillages <- length(unique(df$village))
villages <- df$village

model_string.logistic_random_slopes <- "model{
   # Likelihood
   for(i in 1:n){
      Y[i]    ~ dbern(q[i])
      logit(q[i]) <- beta[villages[i],1] + beta[villages[i],2]*X.net[i]
   }

   # Random effects
   for(j in 1:numVillages){
    beta[j,1] ~dnorm(mu1,precision1)
    beta[j,2] ~dnorm(mu2,precision1)
   }

  #Priors - uninformative.
  mu1 ~ dnorm(0, 0.01)
  mu2 ~ dnorm(0, 0.01)
  precision1 ~ dgamma(0.01, 0.01)
  precision2 ~ dgamma(0.01, 0.01)

  for(j in 1:numVillages){
    pred[j] <- beta[j,1] + beta[j,2]
  }
  }"


model.logistic_random_slopes <- jags.model(textConnection(model_string.logistic_random_slopes), data = list(Y=Y,X.net=X.net,n=n,numVillages=numVillages,villages=villages),n.chains = n.chains)## Compiling model graph

update(model.logistic_random_slopes, nSamples, progress.bar="none"); # Burnin
samp.coeff.logistic_random_slopes <- coda.samples(model.logistic_random_slopes, variable.names=c("beta"),n.iter=2*nSamples)
```

###MCMC Chain diagnostics

Chain diagnostics are optionally produced with the deubg flag.  These were run and inspected by hand for chain convergence.  The debug flag is set to false for rendering the final markdown in the report.  For the final version we calculate the gelman scale reduction factors for each parameter.  The scalre reduction factor near 1 means that between and within chain variance are nearly equal.  We report the number of coefficients with a gelman scale reduction factor greater than one.

```{r}

gelman.srf <-gelman.diag(samp.coeff.logistic_random_slopes)

count.coeff.gt <- sum(gelman.srf$psrf>1.1)

pander(data.frame(count.coeff.gt=count.coeff.gt ),caption="Number of coefficients with a a gelman scale reduction factor greater than 1.1" )


if (DEBUG)
  {
  autocorr.plot(samp.coeff.logistic_random_slopes)

  plot(samp.coeff.logistic_random_slopes)

  #Sample again and estimate posterior means and MAP posterior modes.
  samp.coeff.logistic_random_slopes.jags <- jags.samples(model.logistic_random_slopes, variable.names = c("beta"), n.iter = nSamples, progress.bar = "none")
  posterior_means.logistic_random_slopes <- lapply(samp.coeff.logistic_random_slopes.jags, apply, 1, "mean")
  pander(posterior_means.logistic_random_slopes, caption = "posterior means second sample")

  posterior_modes.logistic_random_slopes <- lapply(samp.coeff.logistic_random_slopes.jags, apply, 1, "mlv")
  posterior_modes.logistic_random_slopes$beta[[1]]$M

  if(n.chains>1)
  {
  gelman.plot(samp.coeff.logistic_random_slopes)
  }
}

```

### Calculate the effective samples size of the chains

We calculate the efective sample size (ESS) below and make a plot for the first chain.  We noted sigificant autocorrelation in some of the afc plots for the intercepts and slope parameters.  As a follow up we might run the MCMC chains for a lot longer on more compute enabled hardware. We'd also like to investigate the relationship between coefficients with low ESS and those which are not deemed significant by the 95% credible intervals.

```{r}
chains.ess <- lapply(samp.coeff.logistic_random_slopes,effectiveSize)

first.chain.ess <- chains.ess[1]
plot(unlist(first.chain.ess)/nSamples, main="Effective Sample Size as proportion of chain length")
```

### Calculate posterior means and MAP estimates for interceps and slopes

```{r}
#Allocate matrix for means and modes of the intercepts
intercepts.means.modes <- data.frame(mean = rep(NaN,65),mode=rep(NaN,65))

#First get a chain
chain <- samp.coeff.logistic_random_slopes[[1]]
#Alpha first
for( i in 1:65)
{
  colname <- colnames(chain)[i]

  samples <- chain[,i]

  sample.mode <- mlv(samples)

  sample.mean <- mean(samples)

  intercepts.means.modes[i,1] <- sample.mean

  intercepts.means.modes[i,2] <- sample.mode$M

}

intercept.max.mean <- which.max(intercepts.means.modes$mean)
intercept.max.mode <- which.max(intercepts.means.modes$mode)

pander(data.frame(intercept.max.mean.index=villages[intercept.max.mean], intercept.max.mode.index=villages[intercept.max.mode]), caption="Village index of max mean and mode for intercept posteriors")

#Allocate matrix for means and modes of the slopes
slopes.means.modes <- data.frame(mean = rep(NaN,65),mode=rep(NaN,65))

#First get a chain
chain <- samp.coeff.logistic_random_slopes[[1]]
#Alpha first
for( i in 66:130)
{
  colname <- colnames(chain)[i]

  samples <- chain[,i]

  sample.mode <- mlv(samples)

  sample.mean <- mean(samples)

  slopes.means.modes[i-65,1] <- sample.mean

  slopes.means.modes[i-65,2] <- sample.mode$M

}
rownames(slopes.means.modes) <- c()

slope.max.mean <- which.max(slopes.means.modes$mean)
slope.max.mode <- which.max(slopes.means.modes$mode)

pander(data.frame(slope.max.mean.index=villages[slope.max.mean], intercept.max.mode.index=villages[slope.max.mode]), caption="Village index of max mean and mode for slope posteriors")
```


### Conclusions

(2)MCMC algorithm converge?
(3) Do you see evidence that the slopes and/or intercepts vary by village?
(4) Which village has the largest intercept? Slope? Does this agree with the data in these
villages?

The MCMC algorithm converged for all the parameters. We saw evidence that the slopes and intercepts vary by village. Below we plot the table for the villages max intercept and slope. We see large values for $(net,pos)=(0,0)$ and $(net,pos)=(1,1)$ for these villages.


```{r}
tables <- table(df$net,df$pos,df$village)

pander(tables[,,slope.max.mean], caption = "freq for max posterior mean for slope coefficient")

pander(tables[,,intercept.max.mean], caption = "freq for max posterior mean for intercept coefficient")

```