---
title: Getting Started with Bayesian Statistics
subtitle: using Stan and Python
author: Abdullah Mahmood
date: last-modified
format:
    html:
        theme: cosmo
        css: quarto-style/style.css
        highlight-style: atom-one
        mainfont: Palatino
        fontcolor: black
        monobackgroundcolor: white
        monofont: Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace
        fontsize: 13pt
        linestretch: 1.4
        number-sections: true
        number-depth: 2
        toc: true
        toc-location: right
        code-fold: false
        code-copy: true
        cap-location: bottom
        format-links: false
        embed-resources: true
        anchor-sections: true
        html-math-method:
            method: mathjax
            url: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
editor: source
jupyter:
    kernelspec:
        display_name: main
        language: python
        name: main
bibliography: quarto-style/references.bib
---

# Preface {.unnumbered}

Welcome to this introduction to Bayesian statistics using Stan in
Python. The preface explains what we expect you to know before
starting and provides the Python boilerplate we will use throughout.

## Prerequisites {.unnumbered}

We will assume the reader will be able to follow text that includes
basic notions from

* differential and integral calculus in multiple dimensions,
* matrix arithmetic (but not linear algebra),
* probability theory, including probability density and mass
functions, cumulative distribution functions, expectations, events,
and the basic rules of probability theory, and
* Python numerical programming with NumPy.

By basics, we really do mean basics. You won't need to do any
calculus; we will just use it to express what Stan computes.
Similarly, we will use matrix notation to express models, and we will
avoid advanced topics in linear algebra that are at the heart of some
of Stan's internals.

We include several appendices as both mathematical background and
summary of notation, with rigorous definitions of the concepts used
in this introduction.  Those who are more mathematically inclined may
wish to start with the appendices.

## Source code and license {.unnumbered}

All of the source markdown, YAML, LaTeX, and BibTeX files are
available in

* Source Repository: [https://github.com/abdullahau/bayesian-analysis](https://github.com/abdullahau/bayesian-analysis)

Everything is open source, with licenses:

* *Code*: BSD-3-Clause license
* *Text*: CC-BY 4.0

## Python, CmdStanPy, NumPy, pandas, and plotnine {.unnumbered}

For a scripting language, we use [Python 3](https://www.python.org/downloads/).

To access Stan, we use the Python package
[CmdStanPy](https://mc-stan.org/cmdstanpy/installation.html).

For numerical and statistical computation in Python, we use
[NumPy](https://numpy.org/).

For plotting, we use the Python package
[plotnine](https://plotnine.readthedocs.io/).  plotnine is a Python
reimplementation of [ggplot2](https://ggplot2.tidyverse.org/), which is
itself an implementation of the grammar of graphics [@wilkinson2005].

We use [pandas](https://pandas.pydata.org/) for representing wide-form
data frames, primarily because it is the required input for plotnine.


## Python boilerplate  {.unnumbered}

We include the following Python boilerplate to import and configure
packages we will use throughout this tutorial.

```{python}
# PROJECT SETUP
# set DRAFT = False for final output; DRAFT = True is faster
DRAFT = True

import itertools
import logging
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", module=r"plotnine\..*")

from cmdstanpy import CmdStanModel
import bridgestan as bs
import cmdstanpy as csp
csp.utils.get_logger().setLevel(logging.ERROR)

from utils import *
import numpy as np
import statistics as stat
import pandas as pd
import plotnine as pn
import patchworklib as pw
import time

class StopWatch:
    def __init__(self):
        self.start()
    def start(self):
        self.start_time = time.time()
    def elapsed(self):
        return time.time() - self.start_time
timer = StopWatch()
```


# Introduction

These notes are intended to introduce several technical topics to
practitioners: Bayesian statistics and probabilistic modeling, Markov
chain Monte Carlo methods for Bayesian inference, and the Stan
probabilistic programming language.


## Bayesian statistics

The general problem addressed by statistical inference is that of
reasoning from a limited number of noisy observations. For example, we
might want to perform inference about a population after measuring a
subset of its members, or we might want to predict future events after
observing past events.

There are many approaches to applied statistics. These notes focus
on Bayesian statistics, a form of statistical modeling and inference
that is grounded in probability theory. In the Bayesian
approach to statistics, we characterize our knowledge of the world in
terms of probabilities (e.g., there is a 24.3% chance of rain after
lunch today, the probability that the next baby born in the United
States is male is 51%).

Bayesian inference is always carried out with respect to a
mathematical model of a stochastic data generating process. If the
model is well-specified in the sense of matching the true data
generating process, then Bayesian statistical inference can be shown
to have several desirable properties, such as calibration and
resistance to overfitting.

[Appendix D. Bayesian statistics](#d.-bayesian-statistics) provides a
short, but precise introduction to Bayesian inference, following
[Appendix A. Set theory](#a.-set-theory) and [Appendix B. Probability
theory](#b.-probability-theory), which provide background. The
appendices establish a rigorous basis for the notation and provide
more formal definitions of exactly what Stan is computing.

If you're looking for a gentle introduction to Bayesian statistics, I
highly recommend *Statistical Rethinking* [@mcelreath2023]. For a
more advanced introduction, try *Bayesian Data Analysis*
[@gelman2013], which is available from the authors as a [free
pdf](http://www.stat.columbia.edu/~gelman/book/).

## Markov chain Monte Carlo methods

Bayesian inference for parameter estimation, prediction, or event
probability estimation is based on posterior expectations.  A
posterior expectation is a high dimensional integral over the space of
parameters.  Stan adopts the standard approach to solving general
high-dimensional integrals, which is the [Monte Carlo
method](https://en.wikipedia.org/wiki/Monte_Carlo_method). Monte Carlo
methods use random sampling (hence the name) to solve high-dimensional
integrals (which are not themselves random quantities).
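
As a minimal illustration of the Monte Carlo idea, the sketch below uses
only Python's standard library, not Stan. The target expectation, the
helper name `mc_expectation`, and the sample sizes are all choices made
here for illustration. It estimates $\mathbb{E}[X^2] = 1$ for
$X \sim \textrm{normal}(0, 1)$ by averaging $x^2$ over independent
random draws.

```python
import random

def mc_expectation(f, draw, M, seed):
    """Estimate E[f(X)] by averaging f over M independent draws of X."""
    rng = random.Random(seed)  # seeded PRNG for reproducibility
    return sum(f(draw(rng)) for _ in range(M)) / M

# E[X^2] for X ~ normal(0, 1) is exactly 1 (the variance of X).
for M in [100, 10_000, 100_000]:
    est = mc_expectation(lambda x: x**2,
                         lambda rng: rng.gauss(0.0, 1.0),
                         M, seed=123)
    print(f"M = {M:>9,d};  estimate of E[X^2] = {est:.4f}")
```

The integral being estimated is a fixed number; only the estimate
varies with the seed and the number of draws.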

We cannot use standard Monte Carlo methods for most problems of
interest in Bayesian statistics because we cannot generate a sample of
independent draws from the posterior density of interest.[^1]
The exception is simple models in the exponential family with
conjugate priors [@diaconis1979].  So instead, we have to resort to
Markov chain Monte Carlo (MCMC) methods [@brooks2011], which create
samples with correlation structure among the draws making up the
sample.

[^1]: The statistical sampling literature often overloads "sample" to mean both a sample and a draw.  We will try to stick to the notation where a sample consists of a sequence of one or more draws.

Alternatives to MCMC include rejection sampling [@gilks1992],
sequential Monte Carlo [@doucet2001], approximate Bayesian computation
(ABC) [@marin2012], variational inference [@blei2017], and nested
Laplace approximation [@rue2009], among others.

Among MCMC methods, Stan adopts Hamiltonian Monte Carlo (HMC)
[@neal2011], which is currently the most efficient and scalable MCMC
method for smooth target densities.  Popular alternatives include
random-walk Metropolis-Hastings [@chib1995] and Gibbs sampling
[@casella1992], both of which are simpler, but much less efficient
than HMC for all but the easiest of problems.

## Stan and probabilistic programming

Stan is what is known as a *domain specific language* (DSL), meaning
it was written for a particular application. Specifically, Stan is
a *probabilistic programming language* (PPL) designed for coding
statistical models. Although Stan is used most widely for Bayesian
inference, it can also perform standard frequentist inference (e.g.,
maximum likelihood, bootstrap, etc.), though we do not touch on those
capabilities in this introduction.

A Stan program declares data variables and parameters, along with
a differentiable posterior log density (of the parameters given the
data) up to a constant.  Although this is the only requirement, most
Stan programs define the joint log density of the parameters and
variables, which can be shown to be equal to the log posterior plus a
constant.

Stan programs are probabilistic programs in the sense that their data
and parameters can represent *random variables*.  A random variable is
something that takes on different values with some probability,
although mathematically it is a deterministic function and gets its
randomness from an underlying probability measure that determines how
probable the values of a random variable are.  An example of a random
variable is the outcome of a coin flip.  The variable takes on the
value heads or tails, but we don't know which.  Somewhat confusingly,
statistics often operates counterfactually, where we have actually
observed the outcome of a coin flip but persist in treating it as if
it were random and could have resulted in a different value.

In practice, Stan parameters, transformed parameters, and posterior
predictive quantities are all unobserved random variables.  Such
variables are inferred from the model and the values of other
variables.  Stan typically produces different values for unobserved
random variables each time it is run.  Stan programs can be made
deterministic in order to provide reproducible results by fixing a
random seed.

Stan programs are translated to C++ and we estimate quantities of
interest using either MCMC or approximate methods like variational
inference or Laplace approximation.  Users provide data to Stan
defining constants (like the number of observations) and providing the
values of observed random variables.  Stan provides output consisting
of a sample, each draw of which provides a possible value for all of
the unobserved random variables in the model.  Stan is available
through the open-source analytics languages Python, R, and Julia, and
is compatible with these languages' built-in Bayesian analysis tools as
well as Stan's own tools, some of which we will cover in this
introduction.


# Pragmatic Bayesian statistics

There have been several schools of Bayesian statisticians, and
@lin2022 provides an excellent overview with primary references, while
@little2006 provides a more in-depth summary and a comparison with frequentist
philosophy.  The two most prominent schools are the *subjective
Bayesians* and the *objective Bayesians*. As suggested by the names,
these two paradigms have diametrically opposed philosophical
approaches.  While both use proper priors in the sense of being
probability distributions, the "subjective" approach tries to capture
actual prior "beliefs," whereas the "objective" approach tries to
minimize the use of prior information.  Both these groups trust their
posterior inferences based on their chosen philosophical approach to
priors.

We are going to follow a more pragmatic approach to Bayesian
statistics that views model building as more of an engineering
discipline than a philosophical exercise. This perspective is laid out
in detail in @gelman2013 and refined in @gelman2020workflow. The
pragmatic approach feels more like modern machine learning than
statistics, with its emphasis on predictive calibration [@dawid1982;
@gneiting2007].  Roughly speaking, a probabilistic inference is
calibrated if it has the right coverage for future data.  For example,
I might predict a 70% chance of rain on 100 different days.  I would
like to see roughly 70 of those days be rainy for my predictions to be
well calibrated.  Calibration is itself a frequentist notion, but we
do not follow standard frequentist practice in that we are willing to
modify our modeling assumptions once we have investigated their
behavior on data [@gelman2013].
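
The rain example can be checked by simulation. The sketch below is
illustrative only and uses Python's standard library. It assumes a
forecaster whose stated 70% probability is exactly right and counts the
rainy days in repeated sets of 100 forecasts. The function name and the
number of repetitions are our own choices.

```python
import random

def rainy_day_counts(p_forecast, n_days, n_reps, seed):
    """Simulate n_reps sets of n_days forecasts in which rain truly occurs
    with probability p_forecast; return the rainy-day count for each set."""
    rng = random.Random(seed)
    return [sum(rng.random() < p_forecast for _ in range(n_days))
            for _ in range(n_reps)]

counts = rainy_day_counts(p_forecast=0.7, n_days=100, n_reps=1_000, seed=123)
print("mean rainy days per 100 forecasts:", sum(counts) / len(counts))
print("range of rainy days:", min(counts), "to", max(counts))
```

Even a perfectly calibrated forecaster will rarely see exactly 70 rainy
days. The count varies around 70 with a standard deviation of about
$\sqrt{100 \cdot 0.7 \cdot 0.3} \approx 4.6$ days.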

The fundamental distinguishing feature of frequentist statistics is
that probabilities are understood as long-run frequencies of
repeatable processes. This prohibits placing probability distributions
over parameters, because there is no long-term repeatable process in
the world generating new parameters. For example, the gravitational
constant has a single value and is not the value of a potentially
repeatable trial like a coin flip (other than in a philosophical,
possible worlds sense).

In our pragmatic approach to Bayesian statistics, we treat probability
as fundamentally _epistemic_ rather than _doxastic_, meaning it is
about human _knowledge_, not about human _belief_. This is a subtle,
but important distinction. Although frequentists sometimes worry that
Bayesians are "putting their thumb on the scale" by including their
prior knowledge in a model rather than "letting the data speak for
itself," this is an instance of the pot calling the kettle black. The
biggest "subjective" decision in model building is shared between
Bayesian and frequentist approaches, namely the likelihood assumed to
model the data-generating process.  In practice, we often sidestep the
concerns of subjectivity by using weakly informative priors that
indicate the scale, but not the particular value, of a prior.  And we
furthermore run sensitivity analyses to test the effect of our prior
assumptions.

@laplace1814 begins his book on probability by stating the general
epistemic position on probability in terms of an entity that knows all
(aka Laplace's demon).

> We may regard the present state of the universe as the effect of its
past and the cause of its future. An intellect which at a certain
moment would know all forces that set nature in motion, and all
positions of all items of which nature is composed, if this intellect
were also vast enough to submit these data to analysis, it would
embrace in a single formula the movements of the greatest bodies of
the universe and those of the tiniest atom; for such an intellect
nothing would be uncertain and the future just like the past would be
present before its eyes.

John Stuart @mill1882 is more explicit in laying out the epistemic
view of probability as follows.

> We must remember that the probability of an event is not a quality
of the event itself, but a mere name for the degree of ground which
we, or some one else, have for expecting it. $\ldots$ Every event is
in itself certain, not probable; if we knew all, we should either know
positively that it will happen, or positively that it will not. But
its probability to us means the degree of expectation of its
occurrence, which we are warranted in entertaining by our present
evidence.


# Stan examples: forward simulation and Monte Carlo

By *forward simulation*, we mean running a simulation of a scientific
process forward from the parameter values to simulated data. For
example, consider a simple clinical trial with $N$ subjects and a
probability $\theta \in (0, 1)$ of a positive outcome. Given $\theta$
and $N$, we can simulate the number of patients $y \in 0{:}N$ with a
successful outcome according to a binomial distribution (which we
define below).

The *inverse problem* is that of estimating the probability of success
$\theta,$ given an observation of $y$ successes out of $N$ subjects.
For example, we might have $N = 100$ subjects in the trial, $y = 32$
of whom had a positive outcome from the trial. A simple estimate of
$\theta$ in this case would be 0.32.  We return to estimation and
uncertainty quantification in later sections.

Let's say we have $N = 100$ subjects in our clinical trial and the
success rate is $\theta = 0.3$. We can simulate a result $y$ from the
clinical trial by randomly generating the number of subjects with a
successful outcome. Although this could be done by simulating the
binary outcome for each patient, it wouldn't be an efficient way to
sample from a binomial distribution.
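
For concreteness, here is the per-patient simulation mentioned above,
sketched in plain Python rather than Stan. The function name
`simulate_trial` is ours, and the approach is shown only to make the
definition tangible. It costs one uniform draw per patient, which is why
dedicated binomial samplers are preferred for large $N$.

```python
import random

def simulate_trial(N, theta, rng):
    """Return the number of successes among N patients, each of whom
    independently succeeds with probability theta."""
    return sum(rng.random() < theta for _ in range(N))

rng = random.Random(123)
ys = [simulate_trial(N=100, theta=0.3, rng=rng) for _ in range(10)]
print("simulated y values:", *ys)
```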

In statistical sampling notation, we write
$$
Y \sim \textrm{binomial}(N, \theta)
$$
to indicate that there are $N \in \mathbb{N}$ patients with
probability $\theta \in (0, 1)$ of a successful outcome, with $Y \in 0{:}N$
representing the number of successful outcomes out of $N$ patients.

The probability mass function for $Y$, written $p_Y$, is
defined for $N \in \mathbb{N}$, $\theta \in (0, 1)$, and $y \in 0{:}N$ by
\begin{align}
p_Y(y \mid N, \theta)
&= \textrm{binomial}(y \mid N, \theta)
\\[6pt]
&=
\binom{N}{y} \cdot \theta^y \cdot (1 - \theta)^{N - y}.
\end{align}
Unless necessary for disambiguation, we will drop the random variable
subscripts on probability density or mass functions like $p_Y$ going forward, writing
simply $p(y \mid N, \theta)$ and allowing context to disambiguate.
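
The formula translates directly into code. The following is a minimal
sketch in plain Python (the helper name `binomial_pmf` is ours, not part
of Stan or CmdStanPy) that evaluates the probability mass function and
checks that it sums to one over $y \in 0{:}N$.

```python
import math

def binomial_pmf(y, N, theta):
    """binomial(y | N, theta) = C(N, y) * theta^y * (1 - theta)^(N - y)."""
    return math.comb(N, y) * theta**y * (1 - theta)**(N - y)

N, theta = 10, 0.3
pmf = [binomial_pmf(y, N, theta) for y in range(N + 1)]
for y, p in enumerate(pmf):
    print(f"p({y:2d} | N = {N}, theta = {theta}) = {p:.4f}")
print("total probability:", round(sum(pmf), 12))
```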

## A first Stan program

Let's say we wanted to generate random instantiations of $Y$ for given
values of $N$ and $\theta$.  For example, we can set $\theta = 0.3$
for a 30% chance of a successful outcome ("success" is the generic
name in statistics for a "positive" outcome).  We can then set $N =
10$ in order to simulate results for 10 patients.  Then given $\theta
= 0.3$ and $N = 10,$ we can generate a value of $Y$ between 0 and 10
for the number of patients out of 10 with a successful outcome.
We can do this using the following Stan program, which we will unpack
line by line after its listing.

```{.stan include="../stan/binomial-rng.stan" filename="stan/binomial-rng.stan"}
```

The first thing to notice is that a Stan program is organized into
blocks.  Here we have two blocks, a _data block_ containing declarations
of variables that must be input as data, and a _generated quantities
block_, which not only declares variables, but assigns a value to
them.  In the case of this Stan program, the generated quantity
variable `y` is assigned the result of taking a single draw from a
$\textrm{binomial}(N, \theta)$ distribution, which Stan provides
through the `binomial_rng` function.

The second thing to notice about a Stan program is that the variables
are all declared with types. Stan uses _static typing_, which means
that unlike Python or R, a variable's type is declared in the program
before it is used rather than determined at run time based on what is
assigned to it. Once declared, a variable's type never changes. Stan
also uses _strong typing_, meaning that unlike C or C++, there is no
way to get around the type restrictions and access memory directly.

The program declares three variables, `N` and `y` of type `int`
(integer values in $\mathbb{Z}$), and `theta` of type `real` (real
values in $\mathbb{R}$). On actual computers, our integers will have
fixed upper and lower bounds and our real numbers are subject to all
the vagaries of numerical floating point calculations.  Stan uses
double-precision (64-bit) floating point and follows the [IEEE
754 standard](https://standards.ieee.org/ieee/754/6210/) other than
in a few highly-optimized calculations that lose a few bits of precision.

A type may also have constraints.  Because `N` is a count, it must be
greater than or equal to zero, which we indicate with the bound
`lower=0`. Similarly, the variable `y` is the number of successful
outcomes out of `N` patients, so it must take on a value between 0 and `N`
(inclusive); that is represented with the constraint `lower=0,
upper=N`. Finally, the variable `theta` is real and declared to fall
in the interval $[0, 1]$ with the constraint `lower=0, upper=1`.
Technically, our bounds are open for real values, but in practice, we
might wind up with 0 or 1 values due to underflow or rounding errors
in floating point arithmetic.

At run time, the compiled Stan program must be given values for `N`
and `theta`. On each iteration, it then samples a value of
`y` using its built-in pseudorandom number generator. In code, we
first define a dictionary for our data (variables `N` and `theta`),
then construct an instance of `CmdStanModel` for our model from the
path to its program, and finally sample from the model using the
`sample` method of `CmdStanModel`.

```{python}
N = 100
theta = 0.3
data = {'N': N, 'theta': theta}
model = csp.CmdStanModel(stan_file = '../stan/binomial-rng.stan')
sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_sampling = 10, iter_warmup = 0,
                      show_progress = False, show_console = False)
```

The [Python boilerplate](#python-boilerplate) above set the log level
to `ERROR` for the `cmdstanpy` package in order to get rid of the
warnings and informational messages that would otherwise provide updates
on a running Stan program.  Changing the log level to `WARNING` will
retain warnings, changing to `INFO` retains ongoing updates as a
program runs, and changing to `DEBUG` provides low-level info on the
algorithm as it runs.
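
As a sketch of how to change the level, the snippet below uses Python's
standard `logging` module directly. It assumes that CmdStanPy's logger is
registered under the name `"cmdstanpy"`, which is the logger we expect
`csp.utils.get_logger()` to return. If that assumption does not hold for
your version, call `csp.utils.get_logger().setLevel(...)` as in the
boilerplate instead.

```python
import logging

# Assumption: CmdStanPy logs through the standard logger named "cmdstanpy".
logger = logging.getLogger("cmdstanpy")

logger.setLevel(logging.WARNING)   # keep warnings, drop progress messages
# logger.setLevel(logging.INFO)    # also show updates while a program runs
# logger.setLevel(logging.DEBUG)   # most verbose: low-level details

print(logging.getLevelName(logger.level))
```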

The constructor for `CmdStanModel` is used to construct a model from a
Stan program found in the specified file. We highly recommend using a
standalone file for Stan programs to make them easy to share, to allow
both quotes for printing and apostrophes for transposition, and to
make it easy to find the lines referenced by number in error messages.
Under the hood, this first runs Stan's transpiler to convert the Stan
program to a C++ class. Then it compiles the C++ program, which will
take on the order of twenty seconds due to the heavy use of
optimization and template metaprograms.

In the Python interface, we create an object of the class
`CmdStanModel` from the Stan program found in the file
`../stan/binomial-rng.stan` and assign it to the Python variable
`model`.  The constructor for `CmdStanModel` translates the specified
Stan program to a C++ class and compiles the C++ class.  We then call
the `sample()` method on this `model` object in Python to generate
a sample consisting of the specified number of draws.
The `sample()` method takes the arguments

* `data`: the data read in the data block of the Stan program,
* `seed`: the seed for the pseudorandom number generator, for reproducibility,
* `chains`: the number of simulation runs (`parallel_chains`
indicates how many to run in parallel),
* `iter_sampling`: number of draws (i.e., sample size) to return per chain,
* `iter_warmup`: number of warmup iterations to tune parameters of the
sampling algorithm (not needed here, so set to 0),
* `show_progress`: if `True`, print progress updates, and
* `show_console`: if `True`, stream CmdStan's console output to the Python console.

The result of calling `sample()` on the model instance is assigned to
the Python variable `sample`.  It will contain the 10 draws we
requested with argument `iter_sampling = 10`.


When `model.sample(...)` is called, CmdStanPy writes the data given in
the Python argument `data` to a file and runs CmdStan as a standalone
C++ program in a background process.  This program reads in
that data file to construct a C++ object representing the statistical
model. Since our Stan program only has a generated quantities block,
the C++ class's only remaining task is to generate the requested
number of draws. For each of the `iter_sampling` draws, Stan runs a
pseudorandom number generator to generate a value from the specified
binomial distribution.

Random number generation is determined by the `seed` value specified
in the call. For more details on how pseudorandom number generation is
performed, see the (free online) book by @devroye1986. We describe the
operational semantics of Stan in more detail in the section on [Stan's
execution model](#stans-execution-model) below.

Once sampling has completed, we can extract the sample consisting of
10 draws for the scalar variable `y` as an array and then print their
values along with the values of the data variables.

```{python}
y = sample.stan_variable('y')
print("N = ", N, ";  theta = ", theta, ";  y(0:10) =", *y.astype(int))
```

Let's put that in a loop and see what it looks like by taking the
number of patients `N` equal to 10, 100, 1000, and 10,000 in turn.

```{python}
for N in [10, 100, 1_000, 10_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False,
                          show_console = False)
    y = sample.stan_variable('y')
    print("N =", N)
    print("  y: ", *y.astype(int))
    print("  est. theta: ", *(y / N))
```

On the first line for $N = 10$ trials, our simple frequency-based
estimates range from 0.2 to 0.5. By the time we have 10,000 trials,
the frequency-based estimates only vary between 0.292 and 0.309. We
know from the central limit theorem that the spread of estimates is
expected to shrink at a rate of $\mathcal{O}(1 / \sqrt{N})$ for $N$
draws (this result is only asymptotic in $N$, but is very close for
large-ish $N$ in practice).
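
To make the rate concrete, the standard deviation of the frequency-based
estimate $y / N$ under a binomial model is
$\sqrt{\theta \cdot (1 - \theta) / N}$, a standard result. The short
sketch below (plain Python, with a helper name of our choosing)
evaluates it for the trial sizes used above.

```python
import math

def std_error(theta, N):
    """Standard deviation of y / N when y ~ binomial(N, theta)."""
    return math.sqrt(theta * (1 - theta) / N)

theta = 0.3
for N in [10, 100, 1_000, 10_000]:
    print(f"N = {N:>6,d};  sd of estimate = {std_error(theta, N):.4f}")
```

Each hundredfold increase in $N$ cuts the standard deviation by a factor
of ten, for example from about 0.046 at $N = 100$ to about 0.0046 at
$N = 10{,}000$.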

It is hard to get an impression of the true uncertainty from a small
set of results like this.  To get a better handle on uncertainty, we
will simulate 100,000 $y$ values (number of successful outcomes) for
each value of the number of patients $N$ and plot histograms. The
following histogram plots the distribution of frequency-based
estimates based on 10, 100, and 1000 patients, each of which we run
for 100,000 simulations.

```{python}
np.random.seed(123)
ts = []
ps = []
theta = 0.3
M = 100 if DRAFT else 100_000
for N in [10, 100, 1_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
       iter_sampling = M, iter_warmup = 0, show_progress = False,
       show_console = False)
    y = sample.stan_variable('y')
    theta_hat = y / N
    ps.extend(theta_hat)
    ts.extend(itertools.repeat(N, M))
xlabel = 'estimated Pr[success]'
df = pd.DataFrame({xlabel: ps, 'trials': ts})
print(pn.ggplot(df, pn.aes(x = xlabel))
  + pn.geom_histogram(binwidth=0.01, color='white')
  + pn.facet_grid('. ~ trials')
  + pn.scales.scale_x_continuous(limits = [0, 1], breaks = [0, 1/4, 1/2, 3/4, 1],
      labels = ["0", "1/4", "1/2", "3/4", "1"], expand=[0, 0])
  + pn.scales.scale_y_continuous(expand=[0, 0, 0.05, 0])
  + pn.theme(aspect_ratio = 1, panel_spacing = 0.15,
             strip_text = pn.element_text(size = 6),
             strip_background = pn.element_rect(height=0.08,
                                                fill = "lightgray")))
```

Although the histograms have different heights and the first one is
spiky, the key consideration here is that they all have the same
area, representing the 100,000 simulated values of $y$. The trial size
of 10 only has 11 possible values, 0.0, 0.1, ..., 1.0, so the
histogram (technically a bar chart here) just shows the counts of
those outcomes. Here, $y = 3$ is the most prevalent result, with
corresponding estimate for $\theta$ of $y / 10 = 0.3$. The trial size
of 100 looks roughly normal, as it should as a binomial with trials $N
= 100$. By the time we get to $N = 1000$ trials, the draws for $y$
concentrate near 300, or near the value of $0.3$ for $\theta$. As $N$
grows, the central limit theorem tells us to expect the width of
these histograms to shrink at a rate of $\mathcal{O}(1 / \sqrt{N})$.

## Pseudorandom numbers {#prng-seed}

As a probabilistic programming language, Stan relies on random number
generation. Because Stan runs on traditional digital computers (i.e.,
von Neumann machines), it cannot truly generate random numbers.
Instead, it does the next best thing and uses a *pseudorandom number
generator* (PRNG). Given a random number generation *seed*, a PRNG
deterministically generates a sequence of numbers that have many of
the properties of truly random
numbers. The (free online) book by @devroye1986 is the definitive
reference for pseudorandom number generation for statistical
distributions and contains a general introduction to PRNGs.

We can see how random number generators work in Stan by running our
sampling method with seeds 123, 19876, and 123.
```{python}
for seed in [123, 19876, 123]:
    sample = model.sample(data = data, seed = seed, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False, show_console = False)
    print(f"{seed = };  sample = {sample.stan_variable('y').astype(int)}")
```
The two runs with seed 123 produce the same results. The code to
extract values of `y` is clunky because the `stan_variable()` method
always returns a floating point type and we have converted it to an
integer-valued NumPy array.

Generally, we want our Stan programs to *replicate* similar results
with different random seeds. Substantially different results from
different seeds are a red flag that something is wrong with the
combination of model and data.
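
The same seeding behavior can be seen outside of Stan. The following sketch uses Python's standard `random` module, not Stan's PRNG, so the numbers will not match Stan's output. The function `draw_binomial` and its default values for `N` and `theta` are hypothetical choices for illustration.

```python
import random

def draw_binomial(seed, N=100, theta=0.3, size=10):
    """Draw `size` binomial(N, theta) variates from a PRNG seeded with `seed`."""
    rng = random.Random(seed)
    return [sum(rng.random() < theta for _ in range(N)) for _ in range(size)]

runs = {}
for label, seed in [("first", 123), ("second", 19876), ("third", 123)]:
    runs[label] = draw_binomial(seed)
    print(f"seed = {seed}; draws = {runs[label]}")
```

The first and third runs share a seed and therefore produce identical draws, whereas the second run differs in its individual values while following the same distribution.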


## Monte Carlo integration

Bayesian computation relies on averaging over our uncertainty in
estimating parameters.  In general, it involves computing
expectations, which are weighted averages with weights given by
densities.  In this section, we will introduce Monte Carlo methods for
calculating a simple integral corresponding to the expectation of a
discrete indicator variable. We'll use the textbook example of
throwing darts at a board randomly and using the random locations to
estimate the mathematical constant $\pi$.

We start with a two-unit square centered at the origin. Then we will
generate points uniformly at random in this square.  For each point
$(x, y)$, we will calculate whether it falls inside the unit circle
inscribed within the square, which is true if the distance to the
origin is less than 1,
$$
\sqrt{x^2 + y^2} < 1,
$$
which simplifies by squaring both sides to
$$
x^2 + y^2 < 1.
$$
The proportion of such points gives the proportion of the square's
area taken up by the circle.  Because the square is $2 \times 2$, it
has an area of 4, so the circle has an area of 4 times the proportion
of points falling inside the circle (i.e., in the open unit disc).

Here's the Stan code.
```{.stan include="../stan/monte-carlo-pi.stan" filename="stan/monte-carlo-pi.stan"}
```
The program declares variables `x` and `y` and constrains them to fall
in the interval $(-1, 1)$ (floating-point rounding may produce the
boundary values -1 and 1) and assigns them uniform random values.  The indicator variable
`inside` is set to 1 if the Euclidean length of the vector
$\begin{bmatrix}x & y\end{bmatrix}$ is less than 1 (i.e., it falls
within an inscribed unit circle) and is set to 0 otherwise.
The variable `pi` is then set to four times the indicator value.
As we see below, it is the sample mean of these values that is of
interest.

First, we compile and then sample from the model, taking a sample size
of `M = 10_000` draws.  Then we plot the draws.

```{python}
M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/monte-carlo-pi.stan')
sample = model.sample(chains = 1, iter_warmup = 0, iter_sampling = M,
                      show_progress = False, show_console = False,
                      seed = 123)
x_draws = sample.stan_variable('x')
y_draws = sample.stan_variable('y')
inside_draws = [int(i) for i in sample.stan_variable('inside')]
pi_draws = sample.stan_variable('pi')
inside_named_draws = np.array(["out", "in"])[inside_draws]
df = pd.DataFrame({'x': x_draws, 'y': y_draws,
                   'inside': inside_named_draws})
print(
  pn.ggplot(df, pn.aes(x = 'x', y = 'y',
                group='inside', color='inside'))
  + pn.geom_point(size = 0.1)
  + pn.labs(x = 'x', y = 'y')
  + pn.coord_fixed(ratio = 1)
)
```

Next, we take the sample mean of the inside-the-circle indicator,
which produces an estimate of the probability of a point being
inside the circle.  Multiplying the indicator by 4 gives a variable
whose expectation is $\pi$,
\begin{align}
\mathbb{E}[4 \cdot \textrm{I}(\sqrt{X^2 + Y^2} < 1)]
&= \int_{-1}^1 \int_{-1}^1
4 \cdot \textrm{I}(x^2 + y^2 < 1) \cdot p(x, y) \, \textrm{d}x \, \textrm{d}y
\\[4pt]
&= \int_{-1}^1 \int_{-1}^1
4 \cdot \textrm{I}(x^2 + y^2 < 1) \cdot \textrm{uniform}(x \mid -1, 1)
\cdot \textrm{uniform}(y \mid -1, 1) \, \textrm{d}x \, \textrm{d}y
\\[4pt]
&= \int_{-1}^1 \int_{-1}^1
4 \cdot \textrm{I}(x^2 + y^2 < 1) \cdot \frac{1}{2} \cdot \frac{1}{2}
\, \textrm{d}x \, \textrm{d}y
\\[4pt]
&= \int_{-1}^1 \int_{-1}^1
\textrm{I}(x^2 + y^2 < 1) \, \textrm{d}x \, \textrm{d}y
\\[4pt]
&= \pi,
\end{align}
where $\textrm{I}()$ is the indicator, which returns 1 if its argument
is true and 0 otherwise.  The final integral is the area of the unit
disc.  The posterior mean of the variable `inside`
is the probability that a random point in the 2-unit square is inside
the inscribed unit circle. The posterior mean for `pi` is thus our
estimate for $\pi$.

```{python}
Pr_is_inside = np.mean(inside_draws)
pi_hat = np.mean(pi_draws)
print(f"Pr[(X, Y) is inside circle] = {Pr_is_inside:.3f};")
print(f"estimate for pi = {pi_hat:.3f}")
```

The true value of $\pi$ to 3 digits of accuracy is $3.142$, so we are
close, but not exact, as is the nature of Monte Carlo methods.  If we
increase the number of draws, our error will go down.  Theoretically,
with enough draws, we can get any desired precision; in practice, we
don't have that long to wait and have to make do with only a few
digits of accuracy in our Monte Carlo estimates. This is usually not a
problem because statistical uncertainty still dominates our numerical
imprecision in most applications; we discuss this important point later when
contrasting estimation uncertainty and sampling uncertainty in the
[section on posterior predictive inference](#uncertainty-types) and when considering
practical guidance on [how long to run Stan](#practical-guidelines).
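
To make the rate of improvement concrete, note that each draw of four times the indicator has standard deviation $4 \sqrt{\frac{\pi}{4} \left(1 - \frac{\pi}{4}\right)} \approx 1.64$, so the standard error of the estimate from $M$ independent draws is roughly $1.64 / \sqrt{M}$. The following sketch repeats the dart-throwing estimate in plain Python rather than Stan. The function `estimate_pi` and the sample sizes are illustrative assumptions.

```python
import math
import random

rng = random.Random(123)

def estimate_pi(M):
    """Estimate pi from M uniform draws in the square (-1, 1) x (-1, 1)."""
    inside = 0
    for _ in range(M):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        inside += x**2 + y**2 < 1      # True counts as 1
    return 4 * inside / M

# Standard deviation of one draw of 4 * inside, with inside ~ bernoulli(pi / 4).
sd_draw = 4 * math.sqrt((math.pi / 4) * (1 - math.pi / 4))

estimates = {}
for M in [100, 1_000, 10_000, 100_000]:
    estimates[M] = estimate_pi(M)
    std_err = sd_draw / math.sqrt(M)
    print(f"M = {M:7d};  estimate = {estimates[M]:.4f};  "
          f"error = {estimates[M] - math.pi:+.4f};  std error = {std_err:.4f}")
```

Because the standard error shrinks only as $1 / \sqrt{M}$, each additional decimal digit of accuracy requires about 100 times as many draws.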

### Random points are far away in high dimensions

Suppose we wanted to sample points uniformly in the unit disc.  One
thing we could do is sample points in the enclosing $2 \times 2$
square until we draw one that is in the unit disc.  In two dimensions,
this is fairly efficient, with about 79% (i.e., $\pi / 4$) of the
points falling in the disc.  But what will happen in
higher dimensions?  Let's write some Stan code and see.

```{.stan include="../stan/unit-hypersphere.stan" filename="stan/unit-hypersphere.stan"}
```
In this case, we take a natural number `D` as input for the
dimensionality of the hypercube.  We have introduced a block
for transformed data, and declared a variable `one_D` to be a size `D`
array of 1 values.  The transformed data block is executed once
as data is read in and its values are constant outside of the
transformed data block.  In the generated quantities block, we use the
array of 1 values to assign `y` to an array of values, each of which
is independently generated from a $\textrm{uniform}(-1, 1)$
distribution.  This means `y` will be uniformly distributed within the
hypercube $[-1, 1]^D$.  The integer `inside` will be set to 1 if
`y` falls within the unit hypersphere centered at the origin (i.e.,
inscribed within the hypercube) and to 0 otherwise.

By construction, the distance from the origin to the side of the
hypercube and hence the radius of the inscribed hypersphere remain
constant at 1. In contrast, the distance from the origin to a corner
is $\sqrt{D}$ in $D$ dimensions (i.e., $\sqrt{1^2 + 1^2 + \cdots + 1^2}$
for $D$ terms).  Now let's see what happens to the proportion of
volume in the inscribed hypersphere as the dimensionality grows.

```{python}
M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/unit-hypersphere.stan')
in_probs = np.repeat(1.0, 13)
for D in range(1, 13):
    sample = model.sample(chains = 1, iter_warmup = 0, iter_sampling = M,
                          data = {'D' : D}, show_progress = False,
                          show_console = False, seed = 123)
    inside_draws = sample.stan_variable('inside')
    in_probs[D] = np.sum(inside_draws) / M

print(pn.ggplot(pd.DataFrame({'D':np.arange(1, 13), 'prob in hypersphere':in_probs[1:]}),
                 pn.aes(x = 'D', y='prob in hypersphere'))
       + pn.geom_line()
       + pn.geom_point(size=1)
       + pn.scale_x_continuous(breaks = [0, 2, 4, 6, 8, 10, 12]))
```
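
The simulated proportions can be compared to the exact answer. The volume of the unit ball in $D$ dimensions is $\pi^{D/2} / \Gamma(D/2 + 1)$ and the volume of the hypercube $[-1, 1]^D$ is $2^D$, so the probability of landing inside is their ratio. The following sketch evaluates this ratio with Python's standard library. The name `prob_in_hypersphere` is illustrative.

```python
import math

def prob_in_hypersphere(D):
    """Volume of the unit D-ball divided by the volume of [-1, 1]^D."""
    ball_volume = math.pi ** (D / 2) / math.gamma(D / 2 + 1)
    cube_volume = 2.0 ** D
    return ball_volume / cube_volume

exact_probs = {D: prob_in_hypersphere(D) for D in range(1, 13)}
for D, p in exact_probs.items():
    print(f"D = {D:2d};  Pr[inside] = {p:.6f}")
```

The probability is $\pi / 4$ in two dimensions, as before, but falls to about 0.25% by ten dimensions. Almost all of the hypercube's volume lies in its corners, far from the origin, and rejection sampling from the hypercube becomes hopelessly inefficient.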


## Markov chain Monte Carlo methods

In the previous sections, we generated a sample of draws by taking a
sequence of independent draws. We then just averaged results to get
plug-in estimates for expectations.

With modern applied Bayesian models, it is almost never possible to
generate independent draws from distributions of interest, so we
cannot apply simple Monte Carlo methods.  There are some restricted
cases involving special model forms for which we can take independent
draws, but this only works for very simple models [@diaconis1979].
Before the MCMC revolution of the 1990s, Bayesian inference was
largely restricted to these simple models.  Even well into the MCMC
revolution in the 1990s and 2000s, researchers still used simpler
model forms to improve computation in the probabilistic programming
language BUGS [@lunn2012] and even Stan programs are often influenced
by computational concerns [@stan2023users].

The introduction of automatic differentiation opened up the
possibility of coding the more efficient and scalable Hamiltonian
Monte Carlo method in Stan, which is a form of Markov chain Monte
Carlo.  This has greatly expanded the class of models that can be fit
in reasonable time in practice.

In Markov chain Monte Carlo methods, we base each draw on the previous
draw.  A sequence of random variables each of which depends only on
the previous variable generated is called a *Markov chain*.  That is,
a sequence of random variables $Y_1, Y_2, \ldots$ makes up a Markov
chain if
$$
p_{Y_{n+1} \mid Y_1, \ldots, Y_n}(y_{n + 1} \mid y_1, \ldots, y_n)
=
p_{Y_{n+1} \mid Y_n}(y_{n+1} \mid y_n).
$$
This is saying that $Y_{n + 1}$ is conditionally independent of
$Y_1, \ldots, Y_{n - 1}$ given $Y_n.$

We can illustrate with a simple example of three Markov chains, all of
which have a stationary distribution of $\textrm{bernoulli}(0.5).$
Technically, this means that as $n \rightarrow \infty,$ $p_{Y_n}$
approaches the density of a $\textrm{bernoulli}(0.5)$ distribution.
As a result, the long-term average of all chains will also approach
0.5, because that's the expected value of a $\textrm{bernoulli}(0.5)$
variable.  We will introduce a parameter $\rho \in (0, 1)$ and take
the probabilities of element $Y_{n+1}$ depending on the previous
element $Y_{n}$ to be
\begin{align}
\Pr[Y_{n + 1} = 1 \mid Y_n = 1] &= \rho \\
\Pr[Y_{n + 1} = 1 \mid Y_n = 0] &= 1 - \rho.
\end{align}
The first line says that if the last number we generated is 1, the
probability of the next element being 1 is $\rho$.  The second line
says that if the last number we generated is 0, the probability of the
next element being 1 is $1 - \rho$, and thus the probability of the
next element being 0 is $\rho$.  That is, there's a probability of
$\rho$ of generating the same element as the last element.
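
The claim about the stationary distribution can be checked numerically by pushing a starting distribution through the transition probabilities many times. The following is a minimal sketch in plain Python, where `p_same` is an illustrative name for the probability of repeating the previous value and the starting distribution is an arbitrary choice.

```python
def step(dist, p_same):
    """Advance the distribution (Pr[Y = 0], Pr[Y = 1]) by one transition."""
    p0, p1 = dist
    return (p_same * p0 + (1 - p_same) * p1,
            (1 - p_same) * p0 + p_same * p1)

final = {}
for p_same in [0.05, 0.5, 0.95]:
    dist = (1.0, 0.0)            # start in state 0 with certainty
    for n in range(500):
        dist = step(dist, p_same)
    final[p_same] = dist
    print(f"p_same = {p_same:.2f};  Pr[Y = 0] = {dist[0]:.6f};  Pr[Y = 1] = {dist[1]:.6f}")
```

All three settings end up at equal probabilities for 0 and 1, even though the chains started at a distribution far from it.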

Here is a Stan program that generates the first $M$ entries of a
Markov chain over outputs 0 and 1, with probability $\rho \in (0, 1)$
of generating the same output again.

```{.stan include="../stan/markov-autocorrelation.stan" filename="stan/markov-autocorrelation.stan"}
```
The assignment to `y[m]` in this program is equivalent to this longer
form.
```stan
    if (y[m - 1] == 1) {
      y[m] = bernoulli_rng(rho);
    } else {
      y[m] = bernoulli_rng(1 - rho);
    }
```

The more concise form exploits two features of Stan.  First, boolean
expressions are coded as `1` (true) or `0` (false) in Stan, like in
C++ and Python.  Because `y[m - 1]` is constrained to take on values 0
or 1, we know that `y[m - 1] == 1` is equivalent to `y[m - 1]`.
Second, it uses the _ternary operator_, also like in C++.  The
expression `cond ? e1 : e2` involves three arguments, separated by a
question mark (`?`) and a colon (`:`); its value is the value of `e1`
if `cond` is true and the value of `e2` otherwise.  Unlike an ordinary
function, the ternary operator only evaluates `e1` if the condition is
true and only evaluates `e2` if the condition is false.
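
For comparison, Python writes the same conditional as `e1 if cond else e2`. The following sketch simulates the chain in plain Python. It is not a translation of the included Stan file: the function `simulate_chain` is hypothetical, and drawing the initial state from $\textrm{bernoulli}(0.5)$ is an assumption.

```python
import random

def simulate_chain(M, rho, seed=123):
    """Simulate M steps of a 0/1 Markov chain that repeats with probability rho."""
    rng = random.Random(seed)
    y = [int(rng.random() < 0.5)]          # assumed initial state: bernoulli(0.5)
    for m in range(1, M):
        # Python's conditional expression plays the role of Stan's ternary operator.
        p_one = rho if y[m - 1] else 1 - rho
        y.append(int(rng.random() < p_one))
    return y

chain = simulate_chain(M=20, rho=0.95)
print(chain)
```

As in Stan, the integer `y[m - 1]` is used directly as the condition, because Python treats 0 as false and 1 as true.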

We can run this model from Python for three values of $\rho$ and
print the first 100 values simulated for each Markov chain $y$.

```{python}
model = csp.CmdStanModel(
               stan_file = '../stan/markov-autocorrelation.stan')
M = 100 if DRAFT else 1000
rhos = []
iterations = []
draws = []
estimates = []
for rho in [0.05, 0.5, 0.95]:
    data = {'M': M, 'rho': rho}
    sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_warmup = 0, iter_sampling = 1,
                      show_progress = False, show_console = False)
    y_sim = sample.stan_variable('y')
    cum_sum = np.cumsum(y_sim)
    its = np.arange(1, M + 1)
    ests = cum_sum / its
    draws.extend(y_sim[0])
    iterations.extend(its)
    estimates.extend(ests)
    rhos.extend(itertools.repeat(str(rho), M))
df = pd.DataFrame({'draw': draws, 'iteration': iterations,
                   'estimate': estimates, 'rho': rhos})
rho05 = np.array(df.query('rho == "0.05"').head(100)['draw'], dtype = 'int')
rho50 = np.array(df.query('rho == "0.5"').head(100)['draw'], dtype = 'int')
rho95 = np.array(df.query('rho == "0.95"').head(100)['draw'], dtype = 'int')
print("Markov chain draw with probability rho of repeating last value:\n")
print("rho = 0.05:", rho05, "\n")
print("rho = 0.50:", rho50, "\n")
print("rho = 0.95:", rho95, "\n")
```

With a 0.05 probability of staying in the same state, the Markov chain
exhibits strong anti-correlation in its draws, which tend to bounce
back and forth between 0 and 1 almost every iteration.  In contrast,
the 0.95 probability of staying in the same state means the draws have
long sequences of 0s and 1s.  The 0.5 probability produces independent
draws from the Markov chain and we see short runs of 0s and 1s.

Next, we will show a running average of the 0 and 1 draws for 1000
iterations for the three chains.
```{python}
print(
    pn.ggplot(df, pn.aes(x='iteration', y='estimate',
                  group='rho', color='rho'))
    + pn.geom_hline(yintercept = 0.5, color = 'black')
    + pn.geom_line()
    + pn.labs(x = "iteration", y = "estimate")
)
```
The black horizontal line at 0.5 shows the true answer.  It is clear
that the anti-correlated chain (red) converges much more quickly to
the true answer of 0.5 and more stably than the independent chain
(green), which in turn converges much more quickly than the correlated
chain (blue).
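
The differences in convergence speed can be quantified. For this two-state chain, the lag-$k$ autocorrelation is $(2\rho - 1)^k$, so the mean of $M$ draws has asymptotically the same variance as the mean of $M \cdot (1 - \rho) / \rho$ independent draws, a quantity known as the effective sample size. The following sketch checks this by repeated simulation in plain Python rather than Stan. The function `chain_mean`, the number of replications, and the $\textrm{bernoulli}(0.5)$ initial state are assumptions.

```python
import math
import random
import statistics

def chain_mean(M, rho, rng):
    """Mean of M draws from the 0/1 chain that repeats with probability rho."""
    y = int(rng.random() < 0.5)            # assumed initial state: bernoulli(0.5)
    total = y
    for _ in range(M - 1):
        p_one = rho if y else 1 - rho
        y = int(rng.random() < p_one)
        total += y
    return total / M

M = 1000        # chain length, as in the plot
reps = 400      # independent replications per rho (illustrative)
rng = random.Random(123)
sd_by_rho = {}
for rho in [0.05, 0.5, 0.95]:
    means = [chain_mean(M, rho, rng) for _ in range(reps)]
    sd_sim = statistics.pstdev(means)
    ess = M * (1 - rho) / rho               # asymptotic effective sample size
    sd_theory = math.sqrt(0.25 / ess)       # 0.25 is the variance of bernoulli(0.5)
    sd_by_rho[rho] = (sd_sim, sd_theory)
    print(f"rho = {rho:.2f};  ESS = {ess:8.1f};  "
          f"simulated sd = {sd_sim:.4f};  theoretical sd = {sd_theory:.4f}")
```

With $\rho = 0.05$, the 1000 draws are worth about 19,000 independent draws, whereas with $\rho = 0.95$ they are worth only about 53.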


# Hints for Python programmers

Python follows general programming language idioms like in C++ and
Java, whereas Stan follows linear algebra idioms like in R and MATLAB.

## Indexing from 1

Unlike Python, Stan uses the standard mathematical indexing for
matrices, which is from 1.  If we declare a `vector[3]`, then the valid
indexes are 1, 2, and 3.  If `v` is a vector variable, then `v[0]` is
an indexing error.  It will throw an exception, which Stan catches by
logging a warning and rejecting the resulting MCMC proposal.

## Inclusive ranges

Unlike Python, Stan uses inclusive ranges, so that `1:3` represents
the sequence 1, 2, 3.  The main disadvantage of inclusive notation is
that the length of `L:U` is `U - L + 1`.

Putting these together, Python loops over a container with `N`
elements look as follows.
```python
v = np.random.rand(N)
for n in range(0, N):
    ... process v[n] ...
```
The Python version visits elements `v[0]`, `v[1]`, ..., `v[N - 1]`.

Loops in Stan are as follows.
```stan
vector[N] v;
for (n in 1:N) {
   ... process v[n] ...
}
```
The Stan version visits elements `v[1]`, `v[2]`, ..., `v[N]`.
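
When translating between the two languages, it may help to see the correspondence spelled out. The following sketch uses illustrative values for `L`, `U`, and `v`.

```python
L, U = 3, 7

# Stan's inclusive range L:U corresponds to Python's range(L, U + 1).
stan_range = list(range(L, U + 1))
print(stan_range, len(stan_range) == U - L + 1)

# Stan's v[n] for n in 1:N corresponds to Python's v[n - 1].
v = [10.0, 20.0, 30.0]
N = len(v)
visited = [v[n - 1] for n in range(1, N + 1)]
print(visited)
```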


## Strong typing

Stan variables are strongly typed.  This means every variable is
assigned a type when it is declared and that type never changes.  Only
expressions compatible with that type may be assigned, which means
they must either have the same type or have a type that can be
*promoted* to it.  For example, we can assign an integer to a real
variable, but not vice versa.

```stan
int n = 5;
real y = n;  // OK: promote integer to real
int m = y;   // ILLEGAL! cannot demote real to integer
```

Similarly, we can assign a real or integer value to a complex
variable.  We also have what is known as *covariant* typing, which
means if we can assign a value of type `U` to a variable of type `V`,
then we can assign an array of `U` to an array of `V`.
Similarly, we can assign real vectors or matrices to their complex
counterparts.

Top-level block variables and local variables must also declare their
sizes and thereafter only allow assignment of properly sized values.
Function argument types do not declare sizes so that functions can be