Commit 590e7b8

Tom's July 14 edits of two lectures including the likelihood ratio process lecture
1 parent 8823ba7 commit 590e7b8

2 files changed: 48 additions & 25 deletions

lectures/likelihood_ratio_process.md (45 additions & 22 deletions)
@@ -31,15 +31,15 @@ kernelspec:

 This lecture describes likelihood ratio processes and some of their uses.

-We'll use a setting described in {doc}`this lecture <exchangeable>`.
+We'll study the same setting that is also used in {doc}`this lecture on exchangeability <exchangeable>`.

 Among things that we'll learn are

-* A peculiar property of likelihood ratio processes
 * How a likelihood ratio process is a key ingredient in frequentist hypothesis testing
 * How a **receiver operating characteristic curve** summarizes information about a false alarm probability and power in frequentist hypothesis testing
-* How a Bayesian statistician combines frequentist probabilities of type I and type II errors to form posterior probabilities of erroneous model selection or missclassification of individuals
+* How a statistician can combine frequentist probabilities of type I and type II errors to form posterior probabilities of erroneous model selection or misclassification of individuals
 * How during World War II the United States Navy devised a decision rule that Captain Garret L. Schyler challenged, a topic to be studied in {doc}`this lecture <wald_friedman>`
+* A peculiar property of likelihood ratio processes

@@ -724,11 +724,11 @@ N, T = l_arr_h.shape
 plt.plot(range(T), np.sum(l_seq_h > 10000, axis=0) / N)
 ```

-## Bayesian Classification and Hypothesis Testing
+## Hypothesis Testing and Classification

-We now describe how a Bayesian statistician can combine frequentist probabilities of type I and type II errors in order to
+We now describe how a statistician can combine frequentist probabilities of type I and type II errors in order to

-* compute a posterior probability of selecting a wrong model
+* compute an anticipated frequency of selecting a wrong model based on a sample of length $T$
 * compute an anticipated error rate in a classification problem

 We consider a situation in which nature generates data by mixing known densities $f$ and $g$ with known mixing
@@ -738,17 +738,27 @@
 $$
 h (w) = \pi_{-1} f(w) + (1-\pi_{-1}) g(w)
 $$

-We'll often set $\pi_{-1} = .5$.
+We assume that the statistician knows the densities $f$ and $g$ and also the mixing parameter $\pi_{-1}$.
+
+Below, we'll set $\pi_{-1} = .5$, although much of the analysis would go through with other settings of $\pi_{-1} \in (0,1)$.

 We assume that $f$ and $g$ both put positive probabilities on the same intervals of possible realizations of the random variable $W$.

-In the simulation below, we define $f$ as the probability density function of a $\text{Beta}(1, 1)$ distribution and $g$ as the probability density function of a $\text{Beta}(3, 1.2)$ distribution as we did above.
+In the simulations below, we specify that $f$ is a $\text{Beta}(1, 1)$ distribution and that $g$ is a $\text{Beta}(3, 1.2)$ distribution,
+just as we did often earlier in this lecture.
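
To fix ideas, here is a minimal sketch of how the mixture density $h$ can be evaluated under these Beta specifications; it assumes `scipy.stats.beta` and is an illustration rather than the lecture's own code, which follows below.

```python
# Evaluate the mixture h(w) = π_{-1} f(w) + (1 - π_{-1}) g(w)
# under the Beta specifications used in this lecture.
import numpy as np
from scipy.stats import beta

π_minus_1 = 0.5
f = beta(1, 1).pdf     # f is Beta(1, 1), the uniform density on [0, 1]
g = beta(3, 1.2).pdf   # g is Beta(3, 1.2)

def h(w):
    return π_minus_1 * f(w) + (1 - π_minus_1) * g(w)

print(h(np.linspace(0.1, 0.9, 5)))   # mixture density on a small grid
```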

+We consider two alternative timing protocols.

-We consider two alternative timing protocols.
+* Timing protocol 1 is for the model selection problem
+* Timing protocol 2 is for the individual classification problem

 **Protocol 1:** Nature flips a coin once at time $t=-1$ and with probability $\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$
 of IID draws from $f$ and with probability $1-\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$
-of IID draws from $g$
+of IID draws from $g$.
+
+Let's write some Python code that implements timing protocol 1.

 ```{code-cell} ipython3
 def protocol_1(π_minus_1, T, N=1000):
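    # The function body is elided from this diff.  As a sketch -- an
    # assumption, not the committed code -- a protocol-1 simulator flips
    # one coin per sequence and then draws all T observations from the
    # chosen density, e.g.:
    #
    #   from_f = np.random.rand(N) < π_minus_1           # True -> model f
    #   draws = np.where(from_f[:, None],
    #                    np.random.beta(1, 1, (N, T)),    # f = Beta(1, 1)
    #                    np.random.beta(3, 1.2, (N, T)))  # g = Beta(3, 1.2)
    #   return draws, from_f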
@@ -774,6 +784,8 @@ def protocol_1(π_minus_1, T, N=1000):

 **Protocol 2.** At each time $t \geq 0$, nature flips a coin and with probability $\pi_{-1}$ draws $w_t$ from $f$ and with probability $1-\pi_{-1}$ draws $w_t$ from $g$.

+Here is Python code that we'll use to implement timing protocol 2.
+
 ```{code-cell} ipython3
 def protocol_2(π_minus_1, T, N=1000):
     """
@@ -826,11 +838,11 @@ def compute_likelihood_ratios(sequences):
     return l_ratios, L_cumulative
 ```

-## Bayesian Model Selection
+## Model Selection Mistake Probability

 We first study a problem that assumes timing protocol 1.

-Consider a decision maker who wants to know whether model $f$ or model $g$ governs the data.
+Consider a decision maker who wants to know whether model $f$ or model $g$ governs a data set of $T$ observations.

 The decision maker has observed a sequence $\{w_t\}_{t=1}^T$.
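
The decision rule studied below selects model $f$ when the terminal likelihood ratio satisfies $L_T \geq 1$ and selects model $g$ otherwise. Here is a minimal sketch of that rule, assuming the $(N, T)$ array `L_cumulative` returned by `compute_likelihood_ratios` above; the helper name `select_model` is ours, not the lecture's:

```python
# Model selection rule: pick f when the terminal ratio L_T >= 1, else g.
# Assumes L_cumulative has shape (N, T), one row per simulated sequence.
import numpy as np

def select_model(L_cumulative):
    L_T = L_cumulative[:, -1]             # terminal likelihood ratio L_T
    return np.where(L_T >= 1, "f", "g")   # one decision per sequence
```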

@@ -849,7 +861,7 @@
 p_g = {\rm Prob}\left(L_T \geq 1 \Big|g \right) = \beta_T.
 $$

-We can form a Bayesian prior probability that the likelihood ratio selects the wrong model by assigning a prior probability of $\pi_{-1} = .5$ that it selects the wrong model and then averaging $p_f$ and $p_g$ to form the Bayesian posterior probability of a detection error equal to
+We can construct a probability that the likelihood ratio selects the wrong model by assigning a Bayesian prior probability of $\pi_{-1} = .5$ that nature selects model $f$ and then averaging $p_f$ and $p_g$ to form the Bayesian posterior probability of a detection error equal to

 $$
 p(\textrm{wrong decision}) = {1 \over 2} (\alpha_T + \beta_T) .
 $$
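
For example, with hypothetical values $\alpha_T = .2$ and $\beta_T = .1$, this rule would select the wrong model with probability ${1 \over 2}(.2 + .1) = .15$.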
@@ -906,9 +918,12 @@ plt.show()
 print(f"At T={T_max}:")
 print(f"α_{T_max} = {α_T[-1]:.4f}")
 print(f"β_{T_max} = {β_T[-1]:.4f}")
-print(f"Bayesian error probability = {error_prob[-1]:.4f}")
+print(f"Model selection error probability = {error_prob[-1]:.4f}")
 ```

+Notice how the model selection error probability approaches zero as $T$ grows.
+
 ## Classification

 We now consider a problem that assumes timing protocol 2.
@@ -932,7 +947,7 @@ $$ (eq:classerrorprob)

 where $\tilde \alpha_t = {\rm Prob}(l_t < 1 \mid f)$ and $\tilde \beta_t = {\rm Prob}(l_t \geq 1 \mid g)$.

-Now let's simulate protocol 2 and compute the error probabilities
+Now let's simulate protocol 2 and compute the classification error probability.
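
The classifier applies the same threshold one observation at a time: assign $w_t$ to model $f$ when $l_t \geq 1$ and to model $g$ otherwise. A minimal sketch, assuming the $(N, T)$ array `l_ratios` of one-period ratios returned by `compute_likelihood_ratios`; the helper name `classify` is ours:

```python
# Per-observation classification rule: f if l_t >= 1, else g.
import numpy as np

def classify(l_ratios):
    return np.where(l_ratios >= 1, "f", "g")   # one decision per w_t
```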

 ```{code-cell} ipython3
 sequences_p2, true_sources_p2 = protocol_2(
@@ -1003,11 +1018,17 @@ plt.tight_layout()
 plt.show()
 ```

-On the left of the decision boundary, $f$ is more likely than $g$ with $l_t < 1$.
+To the left of the green vertical line $f > g$, so $l_t \geq 1$; therefore a $w_t$ that falls to the left of the green line is classified as a type $f$ individual.
+
+* The shaded orange area equals $\beta$ -- the probability of classifying someone as a type $f$ individual when it is really a type $g$ individual.
+
+To the right of the green vertical line $g > f$, so $l_t < 1$; therefore a $w_t$ that falls to the right of the green line is classified as a type $g$ individual.

-On the right of the decision boundary, $g$ is more likely than $f$ with $l_t \geq 1$.
+* The shaded blue area equals $\alpha$ -- the probability of classifying someone as a type $g$ individual when it is really a type $f$ individual.

-Let's see how it performs in the simulated data
+Let's see how the classification algorithm performs in simulated data.

 ```{code-cell} ipython3
 accuracy = np.empty(T_max)
@@ -1029,7 +1050,7 @@ plt.ylim(0.5, 1.0)
 plt.show()
 ```

-Let's also compare the two protocols by showing how the error probabilities evolve differently
+Let's watch decisions made under the two protocols as more and more observations accrue.

 ```{code-cell} ipython3
 fig, ax = plt.subplots(figsize=(7, 6))
@@ -1045,11 +1066,13 @@ plt.show()

 From the figure above, we can see:

-- For both protocols, the error probability starts at the same level subject to randomness.
+- For both protocols, the error probability starts at the same level, subject to a little randomness.
+
+- For protocol 1, the error probability decreases as the sample size increases because we are just making **one** decision -- i.e., selecting whether $f$ or $g$ governs **all** individuals. More data provides better evidence.

-- For protocol 1, the error probability decreases as we collect more data because we're trying to determine which single model generated the entire sequence. More data provides stronger evidence.
+- For protocol 2, the error probability remains constant because we are making **many** decisions -- one classification decision for each observation.

-- For protocol 2, the error probability remains constant because each observation is classified independently. The accuracy depends only on the likelihood that the two models generates the single observation.
+**Remark:** Think about how laws of large numbers are applied to compute error probabilities for the model selection problem and the classification problem.

 ## Sequels

lectures/prob_matrix.md (3 additions & 3 deletions)
@@ -464,11 +464,11 @@
 An associated conditional distribution is

 $$
-\textrm{Prob}\{Y=i\vert X=j\} = \frac{\rho_{ij}}{ \sum_{i}\rho_{ij}}
+\textrm{Prob}\{Y=j\vert X=i\} = \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}}
 = \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}}
 $$

-We can define a transition probability matrix
+We can define a transition probability matrix $P$ with $i,j$ component

 $$
 p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}}
 $$
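
A small numerical sketch of this construction; the $2 \times 2$ joint matrix below is a made-up example, not one from the lecture:

```python
# Build the transition matrix P from a joint distribution ρ by
# normalizing each row by its row sum: p_ij = ρ_ij / Σ_j ρ_ij.
import numpy as np

ρ = np.array([[0.3, 0.2],
              [0.1, 0.4]])               # ρ_ij = Prob{X=i, Y=j}

P = ρ / ρ.sum(axis=1, keepdims=True)
print(P)                                 # rows are conditional distributions
print(P.sum(axis=1))                     # each row sums to 1
```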
@@ -490,7 +490,7 @@ The first row is the probability that $Y=j, j=0,1$ conditional on $X=0$.

 The second row is the probability that $Y=j, j=0,1$ conditional on $X=1$.

 Note that
-- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column).
+- $\sum_{j} p_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix $P$ is a probability distribution (not so for each column).
