bayesian-learning-with-R/Introductions.Rmd at master · NCSU-STATS/bayesian-learning-with-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Bayseian Methods"
subtitle: "Introductory Material"
author: "Bruce Campbell"
latex_engine: xelatex
fontsize: 12pt
output: pdf_document
---

---
```{r setup, include=FALSE,echo=FALSE}
rm(list = ls())
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(dev = 'pdf')
knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(tidy=TRUE)
knitr::opts_chunk$set(prompt=FALSE)
knitr::opts_chunk$set(fig.height=5)
knitr::opts_chunk$set(fig.width=7)
knitr::opts_chunk$set(warning=FALSE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_knit$set(root.dir = ".")
library(latex2exp)
library(pander)
library(ggplot2)
library(GGally)
```

## Stan - Rstan

These are the other stan packages
bayesplot ggplot-based plotting library for graphing parameter estimates, MCMC diagnostics, and posterior predictive checks.
shinystan Interactive visual summaries and advanced posterior analysis of MCMC output
loo Out-of-sample predictive performance estimates and model comparison.
rstanarm R formula interface for Bayesian applied regression modeling.
rstantools Tools for developers of R packages interfacing with Stan.

require returns (invisibly) a logical indicating whether the required package is available.
We use that mechanism to install the Stan packages if they are not already available.

```{r}
if(!require(rstan)){
    install.packages("rstan")
    library(rstan)
}

if(!require(bayesplot)){
    install.packages("bayesplot")
    library(bayesplot)
}

if(!require(shinystan)){
    install.packages("shinystan")
    library(shinystan)
}

if(!require(rstanarm)){
    install.packages("rstanarm")
    library(rstan)
}


```

# What is in a high dimensional point cloud?

### Abstract

This case study illustrates the so-called "curse of dimensionality,"
starting from scratch and using simple examples based on simulation.
For example, generating points uniformly at random in a unit hypercube,
it is easy to see how the average distance between random points
increases with dimensionality.  Similarly, such points tend to
increasingly fall outside the hypersphere inscribed within the unit
hypercube, showing that in higher dimensions, almost all of the volume
is in the corners of the hypercube.

Similarly, generating standard normal variates (zero mean, unit
standard deviation) leads to draws that concentrate in a thin shell of
increasing distance from the mean as dimensionality increases.  The
squared distance of a standard normal draw from the mode follows a
standard chi-square distribution with degrees of freedom equal to the
dimensionality.  This allows the the precise bounds of the thin shell
to be calculated using tail statistics, showing just how unlikely it
is to get near the mode with a random draw from a standard
multivariate normal.

These properties of the standard normal distribution provides a segue
into discussing the important information-theoretic concept of the
typical set for a probability distribution, which is roughly defined
as the central log density band into which almost all random draws
from that distribution will fall.  The rather surprising result is
that for a normal distribution in high dimensions, the typical set
does *not* include the volume around the mode.  The density is highest
around the mode, but the volume is low, and the probability mass is
determined as the product of the density and the volume (or more
precisely, the integral of the density over the volume).

The standard normal log density is just negative squared Euclidean
distance, which provides a straightforward demonstration of why
the maximum likelihood estimate for a normal regression is identical
to the least squares solution.


## 1. Euclidean Length and Distance

### Euclidean Length

Consider a vector $y = (y_1, \ldots, y_N)$ with $N$ elements (we call
such vectors $N$-vectors).  The Euclidean length of a vector $y$ is
written as $|\!| y |\!|$, and defined by a generalizing the Pythagorean
theorem,

\[
|\!| y |\!|
\ = \ \sqrt{y_1^2 + y_2^2 + \cdots + y_N^2}.
\]

In words, the length of a vector is the square root of the sum of the
squares of its elements.

If we take $y = (y_1, \ldots, y_N)$ to be a row vector, then we see
that the dot product of a vector with itself is its squared length,
so that

\[
|\!| y |\!| = \sqrt{y \, y^{\top}}.
\]

#### Calculating Vector Length in R

The following function computes vector length in R.

```{r comment=NA}
euclidean_length <- function(u) sqrt(sum(u * u));
```

In R, the operator `*` operates elementwise rather than as vector
multiplication.  To test the function on a simple case, we can verify
the first example of the Pythagorean theorem everyone learns in
school, namely $|\!| (3, 4) |\!| = 5$.

```{r comment=NA}
euclidean_length(c(3, 4));
```

In R, the function `length()` returns the number of elements in a
vector rather than the vector's Euclidean length.

```{r comment=NA}
length(c(3, 4));
```

### Euclidean Distance

The Euclidean distance between two $N$-vectors, $x = (x_1, \ldots,
x_N)$ and $y = (y_1, \ldots, y_N)$, written $\mathrm{d}(x,y)$, is the
Euclidean length of the vector $x - y$ connecting them,

\[
\mathrm{d}(x, y)
\ = \
|\!| x - y |\!|
\ = \
\sqrt{(x_1 - y_1)^2 + \cdots + (x_N - y_N)^2}.
\]


## 2.  All of the Volume is in the Corners

Suppose we have a square and inscribe a circle in it, or that we have
a cube and a sphere inscribed in it.  When we extend this construction
to higher dimensions, we get hyperspheres inscribed in hypercubes.
This section illustrates the curious fact that as the dimensionality
grows, most of the points in the hypercube lie outside the inscribed
hypersphere.

Suppose we have an $N$-dimensional hypercube, with unit-length sides
centered around the origin $\mathbf{0} = (0, \ldots, 0)$.  The
hypercube will have $2^N$ corners at the points $\left( \pm
\frac{1}{2}, \ldots, \pm \frac{1}{2} \right)$.  Because its sides are
length 1, it will have also have unit volume, because $1^N = 1$.

If $N=1$, the hypercube is a line from $-\frac{1}{2}$ to
$\frac{1}{2}$ of unit length (i.e., length 1).  If $N=2$, the
hypercube is a square of unit area with opposite corners at $\left(
-\frac{1}{2}, -\frac{1}{2} \right)$ and $\left( \frac{1}{2},
\frac{1}{2} \right)$.  With $N=3$, we have a cube of unit volume, with
opposite corners at $\left( -\frac{1}{2}, -\frac{1}{2}, -\frac{1}{2}
\right)$ and $\left( \frac{1}{2}, \frac{1}{2}, \frac{1}{2} \right)$,
and unit volume.  And so on up the dimensions.

Now consider the biggest hypersphere you can inscribe in the
hypercube.  It will be centered at the origin and have a radius of
$\frac{1}{2}$ so that it extends to the sides of the hypercube.  A
point $y$ is within this hypersphere if the distance to the origin is
less than the radius, or in symbols, if $|\!|y|\!| < \frac{1}{2}$.
Topologically speaking, we have defined what is known as an open ball,
i.e., the set of points within a hypersphere excluding the limit
points at distance exactly $\frac{1}{2}$ (we could've worked with
closed balls which include the limit points making up the surface of
the ball because this surface (a hypersphere) has zero volume).

In one dimension, the hypersphere is just the line from $-\frac{1}{2}$
to $\frac{1}{2}$ and contains the entire hypercube.  In two
dimensions, the hypersphere is a circle of radius $\frac{1}{2}$
centered at the origin and extending to the center of all four sides.
In three dimensions, it's a sphere that just fits in the cube,
extending to the middle of all six sides.


#### The Monte Carlo Method for Integration

We know the volume of the unit hypercube is one, but what is the
volume of the ball in the inscribed hypersphere?  You may have learned
how to define an integral to calculate the answer for a ball of radius
$r$ in two dimensions as $\pi r^2$, and may even recall that in three
dimensions it's $\frac{4}{3}\pi r^3$.  But what about higher
dimensions?

We could evaluate harder and harder multiple integrals, but we'll
instead use the need to solve these integrals as an opportunity to
introduce the fundamental machinery of using sampling to calculate
integrals.  Such methods are called "Monte Carlo methods" because they
involve random quantities and there is a famous casino in Monte Carlo.
They are at the very heart of modern statistical computing (Bayesian
and otherwise).

Monte Carlo integration allows us to calculate the value of a definite
integral by averaging draws from a random number generator.
(Technically, the random number generators we have on computers, like
used in R, are pseudorandom number generators in the sense that
they're underlyingingly deterministic; for the sake of argument, we
assume they are random enough for our purposes in much the way we
assume the functions we're dealing with are smooth enough).

To use Monte Carlo methods to calculate the volume within a
hypersphere inscribed in a hypercube, we need only generate draws at
random in the hypercube and count the number of draws that fall in the
hypersphere---our answer is the the proportion of draws that fall in
the hypersphere.  That's because we deliberately constructed a
hypercube of unit volume; in general, we need to multiply by the
volume of the set over which we are generating uniform random draws.

As the number of draws increases, the estimated volume converges to
the true volume.  Because the draws are i.i.d., it follows from the
central limit theorem that the error goes down at a rate of
$\mathcal{O}\left(  1 / \sqrt{n} \right)$.  That means each
additional decimal place of accuracy requires multiplying the sample
size by one hundred.  We can get rough answers with Monte Carlo
methods, but many decimal places of accuracy requires a prohibitive
number of simulation draws.

We can draw the points at random from the hypercube by drawing
each component independently according to

\[
y_n \sim  \mathsf{Uniform}\left(-\frac{1}{2}, \frac{1}{2}\right).
\]
Then we count the proportion of draws that lie within the hypersphere.
Recall that a point $y$ lies in the hypersphere if $|\!| y |\!| < \frac{1}{2}$.

#### Using Monte Carlo methods to compute $\pi$

We'll first look at the case where $N = 2$, just to make sure we get
the right answer.  We know the area inside the inscribed circle is
$\pi r^2$, so with $r = \frac{1}{2}$, that's $\frac{\pi}{4}$.  Let's
see if we get the right result.

```{r comment=NA}
N <- 2;
M <- 1e6;
y <- matrix(runif(M * N, -0.5, 0.5), M, N);
p <- sum(sqrt(y[ , 1]^2 + y[ , 2]^2) < 0.5) / M;
print(4 * p, digits=3);
print(pi, digits=3);
```

Now, let's generalize and calculate the volume of the hypersphere
inscribed in the unit hypercube (which has unit volume by
construction) for increasing dimensions.

```{r comment=NA}
M <- 1e5;
N_MAX = 10;
Pr_inside <- rep(NA, N_MAX);
for (N in 1:N_MAX) {
  y <- matrix(runif(M * N, -0.5, 0.5), M, N);
  inside <- 0;
  for (m in 1:M) {
    if (euclidean_length(y[m,]) < 0.5) {
      inside <- inside + 1;
    }
  }
  Pr_inside[N] <- inside / M;
}
df = data.frame(dims = 1:N_MAX, volume = Pr_inside);
print(df, digits=1);
```

Although we actually calculate the probability that a point drawn at
random is inside the hyperphere inscribed in the unit hypercube, this
quantity gives the volume inside the inscribed hypersphere.

Here's the result as a plot.

```{r}
library(ggplot2);
plot_corners <-
  ggplot(df, aes(x = dims, y = Pr_inside)) +
  scale_x_continuous(breaks=c(1, 3, 5, 7, 9)) +
  geom_line(colour="gray") +
  geom_point() +
  ylab("volume of inscribed hyperball") +
  xlab("dimensions") +
  ggtitle("Volume of Hyperball Inscribed in Unit Hypercube")
plot_corners;
```

### All points are far away from each other

Another way of looking at this is that in higher dimensions, the
points on average become further and further away from the center of
the hypercube.  They also become further and further away from each
other (see the exercises). The volume of the hypersphere is the proportion of
points in the hypercube that are within distance $\frac{1}{2}$ from
the center of the hypercube;  with inreasing dimension, vanishingly
few points are within distance $\frac{1}{2}$ of the center of the
hypercube.

As Bernhard Sch??lkopf said on Twitter, "a high-dimensional space is a lonely place."  What he
meant by this is that not only are the points increasingly far from
the mean on average, they are also increasingly far from each other.
As Michael Betancourt noted in a comment on the pull request for this
case study, "Take two points that are translated from each other by 1
unit in each direction ??? the distance between them grows as $\sqrt{N}$
so as the dimension grows the points appear further from each other,
with more and more volume filling up the space between them."


## 3. Typical Sets

Roughly speaking, the typical set is where draws from a given
distribution tend to lie.  That is, it covers most of the mass of the
distribution.  This particular choice of volume not only covers most
of the mass of a distribution, its members reflect the properties of
draws from that distribution.  What we will see in this section is
that the typical set is usually nowhere near the mode of the
distribution, even for a multivariate normal (where the mode is the
same as the mean).

#### A Discrete Example of Typicality

Typicality is easiest to illustrate with a binomial example.  Consider
a binary trial with an eighty percent chance of success (i.e., draws
from a $\mathsf{Bernoulli}(0.80)$ distribution).  Now consider
repeating such a trial one hundred times, drawing $y_1, \ldots,
y_{100}$ independently according to

\[
y_n \sim \mathsf{Bernoulli}(0.8).
\]

Two questions frame the discussion:

1. What is the most likely value for $y_1, \ldots, y_{100}$?

2. How many $y_n$ do you expect to be 1?

Let's answer the second question first.  If you have an 80% chance of
success and make 100 attemps, the expected number of successes is just
80 (in general, it's the probability of success times the number of
attempts).

Then why is the most likely outcome 100 straight successes?  The
probability of 100 straight successes is $0.8^{100}$ or about $2
\times 10^{-10}$.

In contrast, the probability of any given sequence with 80 successes
is only $0.8^{80} \, 0.2^{20}$, or about $2 \times 10^{-22}$.  The
chance of 100 straight successes is a whopping $10^{12}$ times more
probable than any specific sequence with 80 successes and 20 failures!

On the other hand, there are a lot of sequences with 80 successes and
20 failures---a total of $\binom{100}{80}$ of them to be exact (around
$10^{20}$).  So even though any given sequence of 80 success and 20
failures is relatively improbable, there are so many such sequences
that their overall mass is higher than that of the unique sequence
with 100 success, because $10^{20} \times 2 \times 10^{-22}$ is a lot
bigger than $1 \times 2 \times 10^{-10}$.  Same goes for sequences
with 81 successes; each such sequence is more probably than any
sequence with 80 successes, but $\binom{100}{81}$ is smaller than
$\binom{100}{80}$.

The binomial distribution is the distribution of counts, so that if
$y_1, \ldots, y_N$ is such that each $y_n \sim
\mathsf{Bernoulli}(\theta)$, then

\[
(y_1 + \cdots + y_N) \sim \mathsf{Binomial}(N, \theta).
\]

The binomial aggregates the multiple trials by multiplying through by
the possible ways in which $y$ successes can arise in $N$ trials,
namely

\[
\mathsf{Binomial}(y \mid N, \theta)
= \binom{N}{y} \, \theta^y \, (1 - \theta)^{(N - y)},
\]

where the binomial coefficient that normalizes the distribution is
defined as the number of binary sequences of length $N$ that contain
exactly $y$ elements equal to 1,

\[
\binom{N}{y} = \frac{N!}{y! \, (N - y)!}.
\]

In words, $\mathsf{Binomial}(y \mid N, \theta)$ is the probability of
$y$ successes in $N$ independent Bernoulli trials if each trial has
a chance of success $\theta$.

To make sure we're right in expecting around 80 succeses, we can
simulate a million binomial outcomes from 100 trials with an 80%
chance of success as


```{r}
z <- rbinom(1e6, 100, 0.8);
```

with a summary

```{r}
summary(z);
```

and 99% interval

```{r}
quantile(z, probs=c(0.005, 0.995));
```

The most probable outcome as a sequence of Bernoulli trials, namely
$(1, 1, \ldots, 1)$, has a very improbable number of successes, namely
$N$. A much more typical number of successes is 80.  In fact, 100
isn't even in the 99% interval of 100 trials with an 80% success rate.
We can see that analytically using the quantile (inverse cumulative
distribution function).  Suppose we have a random variable $Y$ with
mass or density function $p_Y(y)$.  Then the cumulative distribution
function (CDF) is defined by

\[
F_Y(u)
\ = \
\mbox{Pr}[Y \leq u].
\]

For discrete probability (mass) functions, this works out to

\[
F_Y(u) \ = \ \sum_{y \, = \, -\infty}^u p_Y(y),
\]

and for continuous probability (density) functions,

\[
F_Y(u) = \int_{-\infty}^u p_Y(y) \, \mbox{d}y.
\]

What we are going to need is the inverse CDF, $F_Y^{-1}$, or quantile
function, which maps a quantile $\alpha \in (0, 1)$ to the value $y$
such that $\mbox{Pr}[Y \leq y] = \alpha$ (this needs to be rounded in
the discrete case to deal with the the fact that the inverse CDF is
only technically defined for a countable number of quantiles).

Luckily, this is all built into R, and we can calculate the quantiles
that bound the central 99.9999% interval,


```{r}
qbinom(c(0.0000005, 0.9999995), 100, 0.8)
```

This tells us that 99.9999% of the draws (i.e, 999,999 out of
1,000,000 draws) from $\mathsf{Binomial}(100, 0.8)$ lie in the range
$(59, 97)$.  This demonstrates just how atypical the most probable
sequences are.

We next reproduce a graphical illustration from David J. ;C. MacKay's
*Information Theory, Inference, and Learning Algorithms* (Cambridge,
2003, section 4.4) of how the typical set of the binomial arises as a
product of

1. the Bernoulli trial probability of a sequence with a given number
 of successes, and

2.  the number of sequences with that many successes.

  Suppose we have a binary sequence of $N$ elements, $z = z_1, \ldots,
z_N$, with $z_n \in \{ 0, 1 \}$ and let

\[
y = \sum_{n=1}^N z_n
\]

be the total number of successes.  The repeated Bernoulli trial
probability of $z$ with a chance of success $\theta \in [0, 1]$ is
given by the probability mass function

\[
p(z \mid \theta)
\ = \
\prod_{n=1}^N \mathsf{Bernoulli}(z_n \mid \theta)
\ = \
\theta^y \, (1 - \theta)^{(N - y)}.
\]

The number of ways in which a binary sequence with a total of $y$
successes out of $N$ trials can arise is the total number of ways
a subset of $y$ elements may be chosen out of a set of $N$ elements,
which is given by the binomial coefficient,

\[
\binom{N}{y} = \frac{N!}{y! \, (N - y)!}.
\]

The following plots show how the binomial probability mass function
arises as a product of the binomial coefficient and the probability of
a single sequence with a given number of successess.  The left column
contains plots on the linear scale and the right column plots on the
log scale.  The top plots are of the binomial coefficient,
$\binom{N}{y}$.  Below that is the plot the probability of a single
sequence with $y$ successes, namely $\theta^y \, (1 - \theta)^{N-y}$.
Below both of these plots is their product, the binomial probability
mass function $\mathsf{Binomial}(y \mid N, \theta)$.  There are three
sequences of plots, for $N = 25$, $N = 100$, and $N=400$.

Just to get a sense of scale, there are $2^{25}$ (order $10^7$) binary
sequences of length 25, $2^{100}$ (order $10^{30}$) binary sequences
of length 100, and $2^{400}$ (order $10^{120}$) binary sequences of
length 400.


```{r}
choose_df <- function(Ys, theta=0.2) {
  N <- max(Ys);
  Ns <- rep(N, length(Ys));
  Cs <- choose(N, Ys);
  Ls <- theta^Ys * (1 - theta)^(N - Ys);
  Ps <- Cs * Ls;
  data.frame(list(y = Ys, N = Ns, combos = Cs, L = Ls, P = Ps));
}

choose_plot <- function(df, logy = FALSE) {
  p <-
    ggplot(df, aes(x = y, y = combos)) +
    geom_line(size=0.2, color="darkgray") +
    geom_point(size=0.4) +
    scale_x_continuous() +
    xlab("y");
  if (logy) {
    p <- p + scale_y_log10() +
      ylab("log (N choose y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank())
      ggtitle("sequence log permutations ...");
  } else {
    p <- p + scale_y_continuous() +
      ylab("(N choose y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank())
      ggtitle("sequence permutations ...");
  }
  return(p);
}
seq_plot <- function(df, logy = FALSE) {
  p <-
    ggplot(df, aes(x = y, y = L)) +
    geom_line(size=0.2, color="darkgray") +
    geom_point(size=0.4) +
    scale_x_continuous() +
    xlab("y");
    if (logy) {
      p <- p + scale_y_log10() +
           ylab("log theta^y * (1 - theta)^(N - y)") +
           theme(axis.title.y = element_text(size=8),
                 axis.text.y=element_blank(),
                 axis.ticks.y=element_blank())
           ggtitle("plus sequence log probability ...");
    } else {
      p <- p + scale_y_continuous() +
           ylab("theta^y * (1 - theta)^(N - y)") +
           theme(axis.title.y = element_text(size=8),
                 axis.text.y=element_blank(),
                 axis.ticks.y=element_blank())
           ggtitle("times sequence probability ...");
  }
  return(p);
}
joint_plot <- function(df, logy = FALSE) {
  p <- ggplot(df, aes(x = y, y = P)) +
    geom_line(size = 0.25, color = "darkgray") +
    geom_point(size = 0.25) +
    scale_x_continuous() +
    xlab("y");
  if (logy) {
    p <- p + scale_y_log10() +
           ylab("log binom(y | N, theta)") +
          theme(axis.title.y = element_text(size=8),
                axis.text.y=element_blank(),
                axis.ticks.y=element_blank())
           ggtitle("equals count log probability");
  } else {
    p <- p + scale_y_continuous() +
          ylab("binom(y | N, theta)") +
          theme(axis.title.y = element_text(size=8),
                axis.text.y=element_blank(),
                axis.ticks.y=element_blank())
          ggtitle("equals count probability");
  }
  return(p);
}
library("grid")
library("gridExtra");
plot_all <- function(df) {
  cp <- choose_plot(df, logy = FALSE);
  pp <- seq_plot(df, logy = FALSE);
  jp <- joint_plot(df, logy = FALSE);
  lcp <- choose_plot(df, logy = TRUE);
  lpp <- seq_plot(df, logy = TRUE);
  ljp <- joint_plot(df, logy = TRUE);
  grid.newpage();
  grid.arrange(ggplotGrob(cp), ggplotGrob(lcp),
               ggplotGrob(pp), ggplotGrob(lpp),
               ggplotGrob(jp), ggplotGrob(ljp),
               ncol = 2);
}

df25 <- choose_df(0:25);
plot_all(df25);

df100 <- choose_df(0:100)
plot_all(df100);

df400 <- choose_df(0:400);
plot_all(df400);
```

Although it appears that way, the points are not getting heavier,
there are just more of them in each subsequent plot.

#### A Continuous Example of Typicality

All of the computations of interest in Bayesian statistics are
formulated as posterior expectations.  For example, Bayesian parameter
estimates minimizing expected square error are expectations of the
value of a parameter $\theta$ in the posterior (conditioned on data
$y$),

\[
\hat{\theta} = \mathbb{E}[\theta \mid y] = \int_{\Theta} \theta \,
p(\theta \mid y) \, \mathrm{d}\theta,
\]

where we integrate over the legal values of $\theta \in \Theta$.

Posterior event probabilities are expectations of the indicator
function $I[\theta \in E]$ that indicates whether a parameter value
$\theta$ is in the event $E$,

\[
\mbox{Pr}[E \mid y]
\ = \
\mathbb{E}[\, I[\theta \in E] \mid y \, ]
\ = \
\int_{\Theta} I[\theta \in E] \ p(\theta \mid y) \, \mathrm{d}\theta.
\]

Rather than integrating over the support $\Theta$ of $\theta$ in the
posterior, it suffices to integrate over the typical set (the typical
set is not the only such set; see the exercises for a discussion of
the highest probability set).  That is, if $\Theta$ is the support for
a density $p(\theta)$, then the typical set $T \subseteq \Theta$ has
the property that for "well-behaved" functions,

\[
\mathbb{E}[f(\theta) \mid y]
\ = \
\int_{\Theta} f(\theta) \, p(\theta \mid y) \, \mathrm{d}\theta
\ \approx \
\int_{T}  f(\theta) \, p(\theta \mid y) \, \mathrm{d}\theta
\]

That is, we can restrict attention to the typical set when computing
expectations.  The fact that typical sets exist is why we are able to
compute posterior expectations using Monte Carlo methods---the draws
from the posterior distribution we make with Monte Carlo methods will
almost all fall in the typical set and we know that will be
sufficient.

We saw in the first plot in section 4 (draws are nowhere near the
mode) that if we have $D$ i.i.d. draws from a standard normal, that the
typical set was well bounded away from the posterior mode, which is at
the origin.  There's no built-in R function to compute the quantiles
directly, but the next section shows that the distance from the mode
reduces to a well-known distribution.

#### Definition of the Typical Set

The formal definition of typicality is for sequences $x_1, \ldots,
x_N$ of i.i.d. draws with probability function $p(x)$ (density
function for continuous variables, mass function for discrete
variables).  Such a sequence is said to be "typical" if its average
probability is near the (differential) entropy of the distribution.  A
sequence $(x_1, \ldots, x_N)$ is in the typical set $A^N_{\epsilon}$
for a distribution with probability function $p(x)$ if

\[
\left| \frac{1}{N} \sum_{n=1}^N \log p(x_n)  - \mathrm{H}[ X ] \right|
\leq \epsilon,
\]

where $\mathrm{H}[X]$ is the discrete entropy if $X$ is discrete and the
differential entropy if $X$ is continuous.  It can be established that
such a set covers most such sequences, in that

\[
\mathrm{Pr}\left[
(X_1, \ldots, X_N) \in A_\epsilon^N
\right] > 1 - \epsilon.
\]

assuming the $X_n$ are independent and each has probability function
$p(x)$.


## 4. Vectors of Random Unit Normals

We are now going to generate a random vector $y = (y_1, \ldots, y_D)
\in \mathbb{R}^D$ by generating each dimension independently as

\[
y_d \sim \mathsf{Normal}(0, 1).
\]

The density over vectors is then defined by

\[
p(y)
\ = \
\prod_{d=1}^D p(y_d)
\ = \
\prod_{d=1}^D \mathsf{Normal}(y_d \mid 0, 1).
\]

On the log scale, that's

\[
\log p(y)
\ = \
\sum_{d=1}^D -\frac{1}{2} \, y_d^2 + \mathrm{const}
\ = \
-\frac{1}{2} \, {|\!| y |\!|}^2 + \mathrm{const}.
\]

Equivalently, we could generate the vector $y$ all at once from a
multivariate normal with unit covariance matrix,

\[
y \sim \mathsf{MultiNormal}(\mathbf{0}, \mathbf{I}),
\]

where $\mathbf{0}$ is a $D$-vector of zero values, and $\mathbf{I}$ is
the $D \times D$ unit covariance matrix with unit values on the diagonal
and zero values off diagonal (i.e., unit scale and no correlation).

To generate a random vector $y$ in R, we can use the `rnorm()`
function, which generates univariate random draws from a normal
distribution.

```{r comment=NA}
D <- 10;
u <- rnorm(D, 0, 1);
print(u, digits=2);
```

It is equally straightforward to compute the Euclidean length of `u`
given the function we defined above:

```{r comment=NA}
print(euclidean_length(u), digits=3);
```

What is the distribution of the lengths of vectors generated this way
as a function of the dimensionality?  We answer the question
analytically in the next section, but for now, we'll get a handle on
what's going on through simulation as the dimensionality $D$ grows.
The following code draws from normal distributions of dimensionality 1
to 256, then plots them with 99% intervals using ggplot.

```{r comment=NA}
log_sum_exp <- function(u) max(u) + log(sum(exp(u - max(u))));
N <- 1e4;
dim <- c(1, 2, 4, 5, 8, 11, 16, 22, 32, 45, 64, 90, 128, 181, 256);
D <- length(dim);
lower <- rep(NA, D);
middle <- rep(NA, D);
upper <- rep(NA, D);
lower_ll <- rep(NA, D);
middle_ll <- rep(NA, D);
upper_ll <- rep(NA, D);

mean_ll <- rep(NA, D);
max_ll <- rep(NA, D);
for (k in 1:D) {
  d <- dim[k];
  y <- rep(NA, N);
  for (n in 1:N) y[n] <- euclidean_length(rnorm(d, 0, 1));

  qs <- quantile(y, probs=c(0.005, 0.500, 0.995));
  lower[k] <- qs[[1]];
  middle[k] <- qs[[2]];
  upper[k] <- qs[[3]];

  ll <- rep(NA, N);
  for (n in 1:N) ll[n] <- sum(dnorm(rnorm(d, 0, 1), 0, 1, log=TRUE));

  qs_ll <- quantile(ll, probs=c(0.005, 0.500, 0.995));
  lower_ll[k] <- qs_ll[[1]];
  middle_ll[k] <- qs_ll[[2]];
  upper_ll[k] <- qs_ll[[3]];

  mean_ll[k] <- log_sum_exp(ll) - log(N);
  max_ll[k] <- sum(dnorm(rep(0, d), 0, 1, log=TRUE));
}
df <- data.frame(list(dim = dim, lb = lower, mid = middle, ub = upper,
                      lb_ll = lower_ll, mid_ll = middle_ll,
                      ub_ll = upper_ll, mean_ll = mean_ll, max_ll = max_ll));
print(df, digits = 1);
```

Then we can plot the 99% intervals of the draws using ggplot.

```{r}
library(ggplot2);

plot1 <-
  ggplot(df, aes(dim)) +
  geom_ribbon(aes(ymin = lb, ymax = ub), fill="lightyellow") +
  geom_line(aes(y = mid)) +
  geom_point(aes(y = mid)) +
  scale_x_log10(breaks=2^(0:D)) +
  ylab("Euclidean distance from origin (mode)") +
  xlab("number of dimensions") +
  ggtitle("Draws are Nowhere Near the Mode\n(median draw with 99% intervals)");
plot1;
```

Even in 16 dimensions, the 99% intervals are far away from zero, which
is the mode (highest density point) of the 16-dimensional unit normal
distribution.

##### Concentration of measure

Not only are the intervals moving away from the mode, they are
concentrating into a narrow band in higher dimensions.  This result
follows here from the central limit theorem in this example and in
more generality from concentration of measure (Terry Tao blogged a https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/
nice introduction).

How does the log density of random draws compare to the log density at
the mode (i.e., the location at which density is maximized)?  The
following plot of the median log density and 99% intervals along with
the density at the mode illustrates.

```{r}
plot2 <-
  ggplot(df, aes(dim)) +
  geom_ribbon(aes(ymin = lb_ll, ymax = ub_ll), fill="lightyellow") +
  geom_line(aes(y = mid_ll)) +
  geom_point(aes(y = mid_ll)) +
  geom_line(aes(y = max_ll), color="red") +
  geom_point(aes(y = max_ll), color="red") +
  scale_x_log10(breaks=c(2^(0:D))) +
  scale_y_continuous() +
  ylab("log density") +
  xlab("number of dimensions") +
  ggtitle("Draws have Much Lower Density than the Mode
(median and 99% intervals of random draws; mode in red)");
plot2;
```

#### No individual is average

Somewhat counterintuitively, although easy to see in retrospect given
the above graphs, the average member of a population is an outlier.
How could an item with every feature being average be unusual?
Precisely because it is unusual to have so many features that close to
average.

In comments on an earlier draft, Michael Betancourt mentioned that
physicists like to use an average of a population as a representative,
calling such a representative an "Asimov data set" (the name derives
from Isaac Asimov's 1955 short story
[Franchise](https://en.wikipedia.org/wiki/Franchise_(short_story)), in
which a single elector is chosen to represent a population).  Because
of the atypicality of the average member of the population, this
technique is ripe for misuse.  As we saw above, the average member of
a population might be located at the mode, whereas the average
distance of a population member to the mode is much greater.

## 5. Squared Distance of Normal Draws is Chi-Square

Suppose $y = (y_1, \ldots, y_D) \in \mathbb{R}^D$ and that for each $d
\in 1{:}D$,

\[
y_d \sim \mathsf{Normal}(0, 1),
\]

is drawn independently.  The distribution of the sum of the squared
elements of $y$ is well known to have a chi-square distribution,

\[
(y_1^2 + \cdots + y_D^2) \sim \mathsf{ChiSquare}(D).
\]

In other words, the squared length of a unit normal draw has a
chi-square distribution,

\[
{|\!| y |\!|}^2 \sim \mathsf{ChiSquare}(D).
\]

This means we could have drawn the plot out analytically rather than
using Monte Carlo methods by using the inverse cumulative distribution
function for the chi-square distibution (see the exercises).


## 6.  Maximum Likelihood and Least Squares

Gauss recognized the intimate relation between (squared Euclidean)
distance and the normal distribution.  The connection led him to
develop the maximum likelihood principle and show that the maximum
likelihood estimate of the mean of a normal distribution was the
average, and that such an estimate also minimized the sum of square
distances to the data points.

Suppose we observe $y = (y_1, \ldots, y_N) \in \mathbb{R}^N$ and we
assume they come from a normal distribution with unit scale $\sigma =
1$ and unknown location $\mu$.  Then the log density is

\[
\log p(y \mid \mu, 1)
\ \propto \
\sum_{n=1}^N \log \mathsf{Normal}(y_n \mid \mu, 1)
\ \propto \
-\frac{1}{2} \sum_{n=1}^N \left( y_n - \mu \right)^2
\ \propto \
-\frac{1}{2} \sum_{n=1}^N \mathrm{d}(y_n, \mu)^2.
\]

The maximum likelihood estimate for the location $\mu$ is just the
value that maximizes the likelihood function,

\[
\mu^*
\ = \
\mathrm{arg \, max}_{\mu} \ p(y \mid \mu)
\ = \
\mathrm{arg\, max}_{\mu} \ \log p(y \mid \mu)
\ = \
\mathrm{arg \, max}_{\mu} - \frac{1}{2} \, \sum_{n=1}^N \mathrm{d}(y_n, \mu)^2
\ = \
\mathrm{arg \, min}_{\mu} \sum_{n=1}^N \mathrm{d}(y_n, \mu)^2.
\]

The last step drops the negative constant multiplier, thus converting
maximization to minimization and showing that the estimate that
maximizes the likelihood is also the estimate that minimizes the sum
of the square distances from the observations to the estimate of
$\mu$.

The estimate that minimizes squares is just the average of the observations,

\[
\mu^*
\ = \
\frac{1}{N} \sum_{n=1}^N y_n.
\]

The [Gauss-Markov
theorem](https://en.wikipedia.org/wiki/Gauss???Markov_theorem) further