forked from bbolker/testing_bias_distribution
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathtesting_distrib.rmd
More file actions
506 lines (432 loc) · 19.5 KB
/
testing_distrib.rmd
File metadata and controls
506 lines (432 loc) · 19.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
---
title: "models for testing"
editor_options:
markdown:
wrap: 72
---
```{r pkgs,message=FALSE,echo=FALSE}
library(tidyverse)
library(ggplot2); theme_set(theme_bw())
library(viridis)
library(bbmle)
library(RTMB)
source("testing_funs.R")
options(bitmapType='cairo')
```
## Introduction
How should we think about the connection between numbers of (1) infected
people in the population (2) tests done (3) cases reported?
Most generally/mechanistically, on any day we have some number of people
who are on day $\tau$ of their infectious period. On a given day, an
infectious ($I$) person may be subclinical (asymptomatic or so mildly
symptomatic that they don't report symptoms); mild; severe
(hospitalized); or very severe (ICU/ventilator). The progression among
these stages will be different for different people (although in general
they will progress toward greater severity). The more severe the
symptoms, the more likely someone is to be tested.
Testing will also depend on how they got infected, e.g. imported cases
and close contacts are more likely to be tested. (An asymptomatic,
community-infected person will basically never be tested.)
In a complete agent-based model (or a model based on a complex
compartmental structure), we could keep track of people moving through
all of these categories and assign them probabilities of being tested.
In a sufficiently realistic model we could even take into account
detailed testing criteria and phenomena such as limited testing
resources (so that, e.g., low-risk, mild cases would never be tested).
There are two extreme scenarios that are easy to reason about.
- if testing is random, the proportion of positive tests should be the
same as the prevalence (or incidence - I'm not being precise about
whether we are measuring a flow or a stock, but I don't think it
makes a difference), regardless of either. Increasing the number of
tests when the prevalence is constant leads to a proportional
increase in the number of cases. The proportion of positive tests is
$i$ and the number of cases (i.e., positive tests) is $iT$. (I'm
using $i$, $t$ for proportion of the population infected or tested
and $I$, $T$ as the number infected or tested: $I=iN$, $T=tN$.) If
the prevalence is increasing exponentially with rate $r_1$ and
testing is increasing at rate $r_2$, the number of cases increases
at rate $r_1 + r_2$ (and doubling time $\log(2)/(r1+r2)$).
- if testing is perfectly focused on infected people, then tests are
100% positive until we run out of infected people (i.e. proportion
of pop tested \> prevalence). The proportion of positive tests is
$\min(1,i/t)$ and the number of cases is $\min(T,I)$.
Now let's consider that in general we test people in (approximate) order
of their probability of infection (ignoring tests skipped because of
constraints). We can think about a distribution in the population of
probability of infection (higher for known contacts of infected people,
travellers from high-risk areas, etc.), and a distribution in the
probability of testing (which ideally lines up with the risk). We could
think about having two distributions, but these can't really be
separated (FIXME: explain better!).
## Machinery
Let's use a Beta distribution with mean equal to prevalence $i$ as the
distribution of 'probability infected'. (An alternative that might be
more analytically tractable is the [Kumaraswamy
distribution](https://en.wikipedia.org/wiki/Kumaraswamy_distribution).)
As the dispersion $\gamma=1/(a+b)$ goes to infinity we end up with point
masses $1-i$ on 0 and $i$ on 1; as it goes to 0 we end up with a point
mass on $i$. If we always test 'from the top down' (i.e. calculate the
mean value of the top fraction $t = T/N$ of the population
distribution), these two extreme cases correspond to perfect testing
(cases=$\min(T,I)$) and random testing (cases=$\textrm{Binomial}(i,T)$).
If $B$ is the Beta distribution with mean $i$ and dispersion $\gamma$,
$\Phi_B$ is the CDF, and $Q$ is the inverse CDF (i.e., the quantile
function) then our expected proportion positive from testing a fraction
$t=T/N$ is
$$
\frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{1-\Phi_B(Q(1-t))} =
\frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{t}
$$
i.e. the mean of the upper tail of the infection-probability
distribution.
We can get a little bit farther analytically. Given that
$B(x,a,b) \propto x^{a-1} (1-x)^{b-1}$, the mean is
$a/(a+b)=i \to a \phi =i \to a = i/\phi, b=(a+b)-a = 1/\phi - i/\phi = (1-i)/\phi$.
We can show that $B(x,a,b) \cdot x = a/(a+b) B(x,a+1,b)$, so we should
be able to do the integral directly by computing the appropriate CDF (or
complementary CDF) of the Beta distribution.
$B(x,a,b) =\frac{1}{\mathcal{B}(a,b)} x^{a-1} (1-x)^{b-1}$, where
$\mathcal{B}(a,b)$ is the beta function, which satisfies
$$\mathcal{B}(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\,.$$ It
follows that $$
\begin{aligned}
B(x,a,b)x = \frac{1}{\mathcal{B}(a,b)} x^{a} (1-x)^{b-1} &= \frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)}\times \frac{1}{\mathcal{B}(a+1,b)} x^{a} (1-x)^{b-1}\\
&=\frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)} B(x,a+1,b)\,,
\end{aligned}
$$ which, using the identity $\Gamma(z+1) = z\Gamma(z)$, simplifies to
$$
\begin{aligned}
B(x,a,b)x
&=\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} B(x,a+1,b)\\
&= \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a+b)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} B(x,a+1,b) = \frac{a}{a+b}B(x,a+1,b)\,.
\end{aligned}
$$ Since we set the mean $a/(a+b)$ of the distribution $B(x,a,b)$ to
$i$, we have $B(x,a,b)x = iB(x,a+1,b)$.
Can we do a little bit more here since we're integrating this from
$Q(a,b)$ to 1 ... ?
In order to make the dispersion parameter a little more interpretable we
will transform the parameter $\gamma$ from $(0,\infty)$ to a value
$\phi = 1-\exp(-\gamma) \in (0,1) (i.e. \gamma=-\log(1-\phi))$
(previously: $\phi=-\log(1-\gamma)$ (i.e. $\gamma=1-\exp(-\phi)$)), so
that 0 corresponds to random testing and 1 corresponds to perfectly
focused testing.
If we wanted to get more realistic/detailed we could relax the
assumption of a Beta distribution and instead characterize the
distribution of probability infected as a mixture of types (general
population, symptomatic, travel history, etc.), with the constraint of a
mean of $i$ \ldots
## Graphical example
```{r fig1-calcs, echo = FALSE}
## SZ: gamma = -log(1-phi)
## SZ: a = i/gamma, b = (1-i)/gamma, so if we want a = 1, b=3 with i = 0.25, gamma = 0.25, phi = 0.2211992
incid <- 0.1
testprop <- 0.2
lwr <- qbeta(0.75, 1, 3)
qq <- integrate(\(x) dbeta(x, 1, 3)*x, lwr, 1)$value
```
Suppose the population incidence is 25% and we test 25% of the
population. Our $\phi$ parameter is 0.2211992 (dispersion of the Beta
are 0.25 and shape parameters of the Beta are {1, 3}). The lower bound
of the upper 25% tail is `r round(lwr, 3)`; the integral of $x$ under
that portion of the curve is `r round(qq, 3)`, so the test positivity is
`round(qq/0.25)`.
```{r fig1, echo = FALSE}
par(las=1, yaxs = "i",xaxs = "i")
cc <- curve(dbeta(x, 1, 3), from =0, to = 1, xlab = "prob(infected)", ylab = "prob density")
cc2 <- curve(dbeta(x, 1, 3), from = lwr, to = 1, add = TRUE)
polygon(c(cc2$x, rev(cc2$x)), c(rep(0, length(cc2$x)), rev(cc2$y)), col = "gray")
```
## Correlation
What is the relationship between $\phi$ and the correlation between
testing and infection?
We need
$$
\newcommand{\var}{\textrm{Var}}
\frac{E[(\textrm{tested} - \overline{\textrm{tested}})\cdot -
(\textrm{infected} - \overline{\textrm{infected}})]}{\sqrt{\var(\textrm{tested})} \cdot \var(\textrm{infected})}
$$
The probability of testing is 0 below the threshold and 1 above it; thus
this distribution should be a Bernoulli with weight $1-t$ on zero and
$t$ on 1. Thus the numerator is the value we've already computed for the
numerator of the proportion-positive value
($\int_{1-t}^1 x \cdot \textrm{Beta(x)}\,dx$), *minus* the product of
the testing and infection means ($= ti$, using the usual identity
$E[(A-\bar A)(B-\bar B)] = E[AB] - \bar A \bar B$). The variance of
testing probability is ($(1-t)\cdot 0^2 + t \cdot 1^2 = t$); the
variance of infection probability is the variance of the Beta
distribution.
```{r num_corr, cache=TRUE, echo = FALSE}
## numerical test of 'interesting' case
num_cor <- function(i, t, phi, n = 1e5) {
gamma <- -log(1-phi)
a <- i/gamma; b <- (1-i)/gamma
lwr <- qbeta(t, a, b, lower.tail = FALSE)
x <- rbeta(n, a, b)
y <- as.numeric(x>lwr)
cc <- cor.test(x,y)
data.frame(cor = cc$estimate, lwr=cc$conf.int[1], upr = cc$conf.int[2])
}
phivec <- seq(0.1, 0.95, by = 0.05)
set.seed(101)
numcor <- (purrr::map_dfr(phivec, \(p) num_cor(i=0.1, t=0.01, phi=p))
|> mutate(phi=phivec, .before = 1))
```
```{r phi_corr}
inf_test_corr <- function(i, t, phi) {
gamma <- -log(1-phi)
a <- i/gamma; b <- (1-i)/gamma
lwr <- qbeta(t, a, b, lower.tail = FALSE) ## hope we don't need robust version in covid19testing_funs.R
num <- pbeta(lwr,a+1,b,lower.tail=FALSE)*(a/(a+b))-t*i
var_test <- t*(1-t)
var_inf <- a*b/((a+b)^2*(a+b+1)) ## Wikipedia
num/sqrt(var_test*var_inf)
}
nn <- 101
phivec <- seq(0, 1, length.out = nn)
cordf <- data.frame(i = rep(c(0.25, 0.1), c(nn, 2*nn)),
t = rep(c(0.25, 0.05, 0.01), each = nn),
phi = rep(phivec, 3))
cordf$cor <- with(cordf, inf_test_corr(i,t,phi))
cordf$cat <- with(cordf, sprintf("i = %1.2f, t = %1.2f", i, t))
cordf$cat <- factor(cordf$cat, levels = unique(cordf$cat))
ggplot(cordf, aes(phi, cor, colour = cat)) + geom_line() +
geom_pointrange(data = numcor, aes(y=cor, ymin = lwr, ymax = upr),
colour = "blue", size = 0.1)
```
- ?? Is it true that correlation *decreases* with large $\phi$ when
testing and prevalence are low ... ?? (Seems too large an effect to
be due to numeric problems)
- SZ: current investigation indicates that the "decrease" is caused by
the bimodality of Beta when $\phi$ is large, when $t$ is small, as
$t$ increase, the test positivity decrease rapidly. But I doubt
currently if this is decrease is determined by monotocity (1st order
derivative) of the beta distribution curve or together with concave
(2nd order derivative) or even more complicated behaviour.
- ?? Is there a simple expression for cor as a function of $\phi$ or
*vice versa*? (probably not but ??)
Graphical example of decrease in correlation from $\phi=0.75$ to
$\phi=0.9$ with $t=0.05$, $i=0.2$:
```{r corr_example}
draw_example <- function(i, t, phi, fill = "red", add = FALSE,
alpha = 1, ...) {
gamma <- -log(1-phi)
a <- i/gamma; b <- (1-i)/gamma
par(las=1, yaxs = "i",xaxs = "i")
lwr <- qbeta(t, a, b, lower.tail = FALSE)
cc <- curve(dbeta(x, a, b), from =0, to = 1, xlab = "prob(infected)",
ylab = "prob density", ..., add = add)
cc2 <- curve(dbeta(x, a, b), from = lwr, to = 0.999, add = TRUE)
polygon(c(cc2$x, rev(cc2$x)),
c(rep(0, length(cc2$x)), rev(cc2$y)),
col = adjustcolor(fill, alpha = alpha))
}
draw_example(i=0.2, t=0.05, phi = 0.75, ylim = c(0, 7))
draw_example(i=0.2, t=0.05, phi = 0.9, ylim = c(0, 7), add = TRUE, col = "blue",
fill = "blue", alpha = 0.5)
```
Could the change in the *variance* of the Beta distribution be driving
these effects? (See if covariance is a monotonic function of $\phi$?)
## Numerical examples
```{r plots0}
par(las=1,bty="l")
curve(prop_pos_test(i=0.01,t=0.001,x),from=0.001,to=0.999,
xlab=expression("testing focus"~(phi)),
ylab="proportion of positive tests",
main=c("1% prevalence, 0.1% testing"),
ylim=c(0,1))
abline(h=0.01,lty=2)
```
What happens as we test more people than are actually infected in the
population?
```{r plots1}
par(las=1,bty="l")
curve(prop_pos_test(t=x,i=0.01,phi=0.5),
from=0.001, to=0.4,log="x",xlab="proportion tested",
ylab="prop positive tests",
main=expression(list("prevalence=1%",phi==0.5)))
abline(v=0.01,lty=2)
```
```{r plots2}
dd <- (expand.grid(time=1:25,phi=c(0.001,0.2,0.5,0.8,0.999))
%>% as_tibble()
%>% mutate(inc=0.001*exp(log(2)/3*time),
tests=1e-4*exp(log(2)/4*time),
pos_prop=prop_pos_test(inc,tests,phi),
cases=pos_prop*tests)
## issues with pivot_longer? use gather() instead
## "Error: `spec` must have `.name` and `.value` columns"
## FIXME: figure out what's going on here?
%>% gather("var","value",inc,tests,pos_prop,cases)
)
print(ggplot(dd,aes(time,value,col=phi,group=phi))
+ geom_line()
+ facet_wrap(~var,scale="free")
+ scale_y_log10()
+ scale_colour_viridis_c()
)
```
**FIXME:** direct labels? drop incidence and testing, include as
reference-slope lines?
## Estimation
When is $\phi$ identifiable?
- can we make a combined model with hospitalizations/deaths?
- if we assume that prevalence is increasing exponentially (and we
know the rates of testing) can we estimate $\phi$?
- what other comparisons are possible?
- there is *some* information in the model because change in $i$ over
time is constrained by the dynamical parameters ...
- can we set a sensible prior on $\phi$ and integrate over the
possibilities?
## To do
### Simulations
- The first thing to do is the minimal identifiability test: if we
simulate data directly using `prop_pos_test`, then add some noise
(e.g. binomial testing outcomes), and then we try to take the
observed testing data (total tests and positive tests) and get an
MLE of both phi and the parameters of the exponential growth of
prevalence (initial value and growth rate), can we do it? If $t$ is
the fraction of the population tested and $N$ is the population size
and $f(\tau)=\textrm{prop\_pos}(i(\tau),t(\tau),\phi)$ is the
fraction positive, then we want to simulate
$c(\tau) \sim \textrm{Binomial}(f,tN)$, the number of confirmed
positive cases at time $\tau$. Not worrying about
sensitivity/specificity yet (tests are perfect).
MLE: want to estimate $i_0$, $r$ (growth rate of prevalence), and
$\phi$. We know $t(\tau)$ and $c(\tau)$ (no time lags yet!)
```{r testsim}
## t (test vector) and c (confirmation vector) are defined; tau is a time vector
set.seed(101)
## scalar parameters
true_pars <- c(I_0=100, r=log(2)/3, phi=0.5,
T_0=100, r_t=log(2)/4, N=1e6)
true_pars["i_0"] <- true_pars["I_0"]/true_pars["N"]
## vectors of observed stuff
dd <- data.frame(tau=1:20)
dd$T <- round(true_pars["T_0"]*exp(true_pars["r_t"]*dd$tau))
dd$t <- dd$T/true_pars["N"]
## simulating confirmation vector
dd$c <- with(c(as.list(true_pars), dd),
rbinom(length(tau),
size=T, ## number of tests at time tau
prob=prop_pos_test(i_0*exp(r*tau), t, phi)
)
)
mle_out<- mle2(c~dbinom(prob=prop_pos_test_new(plogis(logit_i_0)*exp(r*tau), t, exp(log_phi),debug=FALSE, phiscale = "unconstrained"),
size=t*N),
start=list(logit_i_0=qlogis(true_pars["i_0"]),
r=true_pars["r"], log_phi=log(true_pars["phi"]) ),
data= list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"]),
control=list(maxit=1000)
)
mle_est0 <- coef(mle_out)
```
## RTMB version
```{r rtmb-fit}
tmbdat <- list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"])
pars <- list(logit_i_0=qlogis(true_pars["i_0"]),
r=true_pars["r"], plogis_phi=plogis(true_pars["phi"]) )
nllfun <- function(pars) {
getAll(pars, tmbdat)
prob <- prop_pos_test1(plogis(logit_i_0)*exp(r*tau),
t, exp(plogis_phi))
-sum(dbinom(c, size = t*N, prob = prob, log = TRUE))
}
ff <- MakeADFun(nllfun, pars, silent = TRUE)
ff$fn()
fit <- with(ff, nlminb(pars, fn, gr))
## run tmbprofile on a selected set of parameters, combine results,
## scale to delta-deviance (signed? sqrt?)
## is returning CI as an attribute really the best way to do this?
## allow tracing by variable? parallel computation?
tmbprof2 <- function(obj, pars, conf.level = 0.95, ...) {
pfun <- function(p) {
tt <- TMB::tmbprofile(obj, p, ..., trace = FALSE)
ci <- confint(tt, level = conf.level)
tt$value <- sqrt(2*(tt$value - min(tt$value, na.rm=TRUE)))
names(tt) <- c(".focal", "value")
list(df = data.frame(var = p, tt), ci = data.frame(var = p, ci))
}
res <- lapply(pars, pfun)
res_df <- do.call(rbind, lapply(res, \(x) x$df)) |>
transform(var = factor(var, levels = pars))
res_ci <- do.call(rbind, lapply(res, \(x) x$ci))
rownames(res_ci) <- NULL
attr(res_df, "ci") <- res_ci
res_df
}
```
```{r rtmb-prof, cache = TRUE}
prof <- tmbprof2(ff, c("logit_i_0", "r", "plogis_phi"), ystep = 0.01)
```
```{r rtmb-ci}
true_vec <- unlist(sapply(pars, unname))
citab <- (attr(prof, "ci")
|> mutate(est = fit$par[.data$var],
true = true_vec[.data$var],
.before = lower)
|> mutate(across(upper, ~ replace_na(., Inf)))
)
```
```{r rtmb-prof-plot}
ggplot(prof, aes(.focal, value)) +
geom_line() + facet_wrap(~var, scale = "free_x") +
geom_hline(yintercept = 1.96, lty = 2) +
geom_vline(data = citab, aes(xintercept = true), lty = 2, colour = "red") +
theme(panel.spacing = grid::unit(0, "lines"))
```
```{r}
knitr::kable(citab)
```
```{r ci-plot}
ggplot(citab, aes(est, var)) +
geom_pointrange(aes(xmin=lower, xmax = upper)) +
geom_point(aes(x=true), colour = "red", size = 4) +
facet_wrap(~var, scale = "free", ncol = 1) +
labs(y = "", x = "")
```
(run multiple sims for RMSE/coverage/bias/etc.? compare against naive
estimates, uncorrecting for testing bias?
- The next simulation is for understanding better what we're really
modeling here. The simulation should be a more mechanistic
description of the infection and testing process: not sure how this
actually worked. There must be some infection process (e.g. an SIR
model ...) and some process by which people get assigned a
probability of being chosen for testing, which we can match up with
the mathematical description above. The testing process is also
mechanistic; after testing people get removed from the testable
pool.
How do we model a testing policy? How do we make the testing policy
match up with our distribution?
Give every individual two (correlated) numbers which correspond to
infection risk and testing risk? Increase testing risk at onset of
symptoms?
Test asymptomatic (susceptible) people at one rate and symptomatic
(infectious) people at another rate?
A fraction of the incidence moves into a "testing pool"; people in the
testing pool are tested at a constant rate. Every person gets a number
sampled from our beta distribution when they become infected, which
governs their chance of being selected
### Kumaraswamy distribution
Is the inverse CDF for the Kumaraswamy distribution tractable? Seems
that way:
$$
\begin{split}
q & =1-(1-x^a)^b \\
(1-q)^{1/b} & = 1-x^a \\
x = (1- (1-q)^{1/b})^{1/a}
\end{split}
$$
(this could be important if we want to embed this machinery in
Stan/JAGS/etc. where `pbeta()`/`qbeta()` may not exist and/or be
sufficiently robust ...)
Unfortunately the tradeoff is that the expression for the mean is
complicated:
$$
\frac{b\Gamma\left(1+ \frac{1}{a}\right) \Gamma(b)}{\Gamma\left(1+\frac{1}{a}+b\right)}
$$
so reparameterizing in terms of mean/dispersion might be hard (and the
integral trick with the Beta also might not work \ldots)
### brain dump/misc
How does this relate to other ways that people are thinking about
testing vs incidence? e.g. Noam Ross's graphs in
<https://docs.google.com/spreadsheets/d/1ecQ0t1Sn2maR2b9sUacA3RSIUTHFlT-V1XiNnOFiBpw/edit#gid=2019730603>