lectures/likelihood_ratio_process.md
Lines changed: 45 additions & 22 deletions
@@ -31,15 +31,15 @@ kernelspec:
31
31
32
32
This lecture describes likelihood ratio processes and some of their uses.
33
33
34
-
We'll use a setting described in {doc}`this lecture <exchangeable>`.
34
+
We'll study the same setting that is also used in {doc}`this lecture on exchangeability <exchangeable>`.
35
35
36
36
Among things that we'll learn are
37
37
38
-
* A peculiar property of likelihood ratio processes
39
38
* How a likelihood ratio process is a key ingredient in frequentist hypothesis testing
40
39
* How a **receiver operator characteristic curve** summarizes information about a false alarm probability and power in frequentist hypothesis testing
41
-
* How a Bayesian statistician combines frequentist probabilities of type I and type II errors to form posterior probabilities of erroneous model selection or missclassification of individuals
40
+
* How a statistician can combine frequentist probabilities of type I and type II errors to form posterior probabilities of erroneous model selection or misclassification of individuals
42
41
* How during World War II the United States Navy devised a decision rule that Captain Garret L. Schyler challenged, a topic to be studied in {doc}`this lecture <wald_friedman>`
42
+
* A peculiar property of likelihood ratio processes
We now describe how a Bayesian statistician can combine frequentist probabilities of type I and type II errors in order to
729
+
We now describe how a statistician can combine frequentist probabilities of type I and type II errors in order to
730
730
731
-
* compute a posterior probability of selecting a wrong model
731
+
* compute an anticipated frequency of selecting a wrong model based on a sample length $T$
732
732
* compute an anticipated error rate in a classification problem
733
733
734
734
We consider a situation in which nature generates data by mixing known densities $f$ and $g$ with known mixing
@@ -738,17 +738,27 @@ $$
738
738
h (w) = \pi_{-1} f(w) + (1-\pi_{-1}) g(w)
739
739
$$
740
740
741
-
We'll often set $\pi_{-1} = .5$.
741
+
We assume that the statistician knows the densities $f$ and $g$ and also the mixing parameter $\pi_{-1}$.
742
+
743
+
Below, we'll set $\pi_{-1} = .5$, although much of the analysis would follow through with other settings of $\pi_{-1} \in (0,1)$.
742
744
743
745
We assume that $f$ and $g$ both put positive probabilities on the same intervals of possible realizations of the random variable $W$.
744
746
745
-
In the simulation below, we define $f$ as the probability density function of a $\text{Beta}(1, 1)$ distribution and $g$ as the probability density function of a $\text{Beta}(3, 1.2)$ distribution as we did above.
747
+
748
+
749
+
In the simulations below, we specify that $f$ is a $\text{Beta}(1, 1)$ distribution and that $g$ is a $\text{Beta}(3, 1.2)$ distribution,
750
+
just as we did often earlier in this lecture.
751
+
752
+
We consider two alternative timing protocols.
746
753
747
-
We consider two alternative timing protocols.
754
+
* Timing protocol 1 is for the model selection problem
755
+
* Timing protocol 2 is for the individual classification problem
748
756
749
757
**Protocol 1:** Nature flips a coin once at time $t=-1$ and with probability $\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$
750
758
of IID draws from $f$ and with probability $1-\pi_{-1}$ generates a sequence $\{w_t\}_{t=1}^T$
751
-
of IID draws from $g$
759
+
of IID draws from $g$.
760
+
761
+
Let's write some Python code that implements timing protocol 1.
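
The diff does not show the lecture's own implementation, so what follows is only a minimal sketch of protocol 1 using the standard library. The name `sketch_protocol_1`, its signature, and its return values are hypothetical; the Beta parameters follow the choices of $f$ and $g$ stated above.

```python
import random


def sketch_protocol_1(pi_minus_1, T, rng, f_params=(1.0, 1.0), g_params=(3.0, 1.2)):
    """Hypothetical sketch: one coin flip at t = -1 picks f or g for the whole sequence."""
    # With probability pi_minus_1 nature selects f; otherwise it selects g.
    from_f = rng.random() < pi_minus_1
    a, b = f_params if from_f else g_params
    # All T draws are IID from the single selected density.
    sequence = [rng.betavariate(a, b) for _ in range(T)]
    return sequence, ("f" if from_f else "g")


rng = random.Random(0)
seq, source = sketch_protocol_1(0.5, T=5, rng=rng)
print(source, [round(w, 3) for w in seq])
```

Because the coin is flipped only once, every element of `seq` shares the same source label.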
**Protocol 2.** At each time $t \geq 0$, nature flips a coin and with probability $\pi_{-1}$ draws $w_t$ from $f$ and with probability $1-\pi_{-1}$ draws $w_t$ from $g$.
776
786
787
+
Here is Python code that we'll use to implement timing protocol 2.
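
As a rough illustration only (the lecture's own function is not visible in this diff), protocol 2 might be sketched as follows. The name `sketch_protocol_2` and its return format are assumptions; the only change relative to protocol 1 is that the coin is flipped anew for every observation.

```python
import random


def sketch_protocol_2(pi_minus_1, T, rng, f_params=(1.0, 1.0), g_params=(3.0, 1.2)):
    """Hypothetical sketch: a fresh coin flip decides the source of each draw."""
    draws, sources = [], []
    for _ in range(T):
        # Each observation independently comes from f with probability pi_minus_1.
        from_f = rng.random() < pi_minus_1
        a, b = f_params if from_f else g_params
        draws.append(rng.betavariate(a, b))
        sources.append("f" if from_f else "g")
    return draws, sources


rng = random.Random(0)
draws, sources = sketch_protocol_2(0.5, T=2000, rng=rng)
share_f = sources.count("f") / len(sources)
print(f"share of draws from f: {share_f:.3f}")
```

With $\pi_{-1} = .5$, roughly half of the draws should come from each density.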
We can form a Bayesian prior probability that the likelihood ratio selects the wrong model by assigning a prior probability of $\pi_{-1} = .5$ that it selects the wrong model and then averaging $p_f$ and $p_g$ to form the Bayesian posterior probability of a detection error equal to
864
+
We can construct a probability that the likelihood ratio selects the wrong model by assigning a Bayesian prior probability of $\pi_{-1} = .5$ that nature selects model $f$ and then averaging $p_f$ and $p_g$ to form the Bayesian posterior probability of a detection error equal to
print(f"Bayesian error probability = {error_prob[-1]:.4f}")
921
+
print(f"Model selection error probability = {error_prob[-1]:.4f}")
910
922
```
911
923
924
+
925
+
Notice how the model selection error probability approaches zero as $T$ grows.
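
The decline in the error probability can be checked with a small Monte Carlo sketch. It rests on assumptions not shown in this diff: the decision rule selects $f$ when the log likelihood ratio $\sum_t \log \left( f(w_t)/g(w_t) \right)$ is nonnegative and $g$ otherwise, $p_f$ is the frequency of selecting $g$ when $f$ is true, and $p_g$ is the frequency of selecting $f$ when $g$ is true. All function names are hypothetical.

```python
import math
import random

F_PARAMS = (1.0, 1.0)   # f = Beta(1, 1), as in the text
G_PARAMS = (3.0, 1.2)   # g = Beta(3, 1.2), as in the text


def log_beta_pdf(w, a, b):
    """Log density of Beta(a, b) at w, with w clamped away from 0 and 1."""
    w = min(max(w, 1e-12), 1.0 - 1e-12)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(w) + (b - 1.0) * math.log(1.0 - w)


def log_likelihood_ratio(seq):
    """Sum of log f(w) - log g(w) over the sequence."""
    return sum(log_beta_pdf(w, *F_PARAMS) - log_beta_pdf(w, *G_PARAMS) for w in seq)


def selection_error(T, n_sims, rng, pi=0.5):
    """Estimate pi * p_f + (1 - pi) * p_g for sample length T (assumed rule: pick f if log L_T >= 0)."""
    wrong_when_f = 0
    wrong_when_g = 0
    for _ in range(n_sims):
        seq_f = [rng.betavariate(*F_PARAMS) for _ in range(T)]
        seq_g = [rng.betavariate(*G_PARAMS) for _ in range(T)]
        wrong_when_f += log_likelihood_ratio(seq_f) < 0.0    # g selected, f true
        wrong_when_g += log_likelihood_ratio(seq_g) >= 0.0   # f selected, g true
    p_f = wrong_when_f / n_sims
    p_g = wrong_when_g / n_sims
    return pi * p_f + (1.0 - pi) * p_g


rng = random.Random(0)
errors = {T: selection_error(T, n_sims=2000, rng=rng) for T in (1, 5, 20, 50)}
for T, err in errors.items():
    print(f"T = {T:2d}: estimated selection error = {err:.3f}")
```

The estimates are noisy for any fixed number of simulations, but the error at $T = 50$ should be far below the error at $T = 1$.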
926
+
912
927
## Classification
913
928
914
929
We now consider a problem that assumes timing protocol 2.
@@ -932,7 +947,7 @@ $$ (eq:classerrorprob)
932
947
933
948
where $\tilde \alpha_t = {\rm Prob}(l_t < 1 \mid f)$ and $\tilde \beta_t = {\rm Prob}(l_t \geq 1 \mid g)$.
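
To make $\tilde \alpha_t$ and $\tilde \beta_t$ concrete, here is a hedged sketch that estimates them by simulation for a single observation, taking $l_t = f(w_t)/g(w_t)$ with the Beta densities specified earlier. The helper names are hypothetical, and weighting the two errors by $\pi_{-1}$ and $1 - \pi_{-1}$ in the last step is an assumption, since the equation itself is not shown in this diff.

```python
import math
import random


def beta_pdf(w, a, b):
    """Density of Beta(a, b) at w, with w clamped away from 0 and 1."""
    w = min(max(w, 1e-12), 1.0 - 1e-12)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(w) + (b - 1.0) * math.log(1.0 - w))


def likelihood_ratio(w):
    """l = f(w) / g(w) with f = Beta(1, 1) and g = Beta(3, 1.2)."""
    return beta_pdf(w, 1.0, 1.0) / beta_pdf(w, 3.0, 1.2)


rng = random.Random(1)
n = 20000

# alpha_tilde = Prob(l < 1 | f): a draw from f is classified as type g.
alpha_tilde = sum(likelihood_ratio(rng.betavariate(1.0, 1.0)) < 1.0 for _ in range(n)) / n

# beta_tilde = Prob(l >= 1 | g): a draw from g is classified as type f.
beta_tilde = sum(likelihood_ratio(rng.betavariate(3.0, 1.2)) >= 1.0 for _ in range(n)) / n

pi_minus_1 = 0.5
# Assumed weighting of the two error types by the mixing probabilities.
error_rate = pi_minus_1 * alpha_tilde + (1.0 - pi_minus_1) * beta_tilde
print(f"alpha_tilde = {alpha_tilde:.3f}, beta_tilde = {beta_tilde:.3f}, error = {error_rate:.3f}")
```

Because each individual is classified from one observation only, these estimates do not depend on how many individuals are classified.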
934
949
935
-
Now let's simulate protocol 2 and compute the error probabilities
950
+
Now let's simulate protocol 2 and compute the classification error probability.
936
951
937
952
```{code-cell} ipython3
938
953
sequences_p2, true_sources_p2 = protocol_2(
@@ -1003,11 +1018,17 @@ plt.tight_layout()
1003
1018
plt.show()
1004
1019
```
1005
1020
1006
-
On the left of the decision boundary, $f$ is more likely than $g$ with $l_t < 1$.
1021
+
To the left of the green vertical line $f > g$, so $l_t > 1$; therefore a $w_t$ that falls to the left of the green line is classified as a type $f$ individual.
1022
+
1023
+
* The shaded orange area equals $\beta$ -- the probability of classifying someone as a type $g$ individual when it is really a type $f$ individual.
1024
+
1025
+
To the right of the green vertical line $g > f$, so $l_t < 1$; therefore a $w_t$ that falls to the right of the green line is classified as a type $g$ individual.
1007
1026
1008
-
On the right of the decision boundary, $g$ is more likely than $f$ with $l_t \geq 1$.
1027
+
* The shaded blue area equals $\alpha$ -- the probability of classifying someone as a type $f$ when it is really a type $g$ individual.
1009
1028
1010
-
Let's see how it performs in the simulated data
1029
+
1030
+
1031
+
Let's see how the classification algorithm performs in simulated data.
1011
1032
1012
1033
```{code-cell} ipython3
1013
1034
accuracy = np.empty(T_max)
@@ -1029,7 +1050,7 @@ plt.ylim(0.5, 1.0)
1029
1050
plt.show()
1030
1051
```
1031
1052
1032
-
Let's also compare the two protocols by showing how the error probabilities evolve differently
1053
+
Let's watch decisions made by the two protocols as more and more observations accrue.
1033
1054
1034
1055
```{code-cell} ipython3
1035
1056
fig, ax = plt.subplots(figsize=(7, 6))
@@ -1045,11 +1066,13 @@ plt.show()
1045
1066
1046
1067
From the figure above, we can see:
1047
1068
1048
-
- For both protocols, the error probability starts at the same level subject to randomness.
1069
+
- For both protocols, the error probability starts at the same level, subject to a little randomness.
1070
+
1071
+
- For protocol 1, the error probability decreases as the sample size increases because we are just making **one** decision -- i.e., selecting whether $f$ or $g$ governs **all** individuals. More data provides better evidence.
1049
1072
1050
-
- For protocol 1, the error probability decreases as we collect more data because we're trying to determine which single model generated the entire sequence. More data provides stronger evidence.
1073
+
- For protocol 2, the error probability remains constant because we are making **many** decisions -- one classification decision for each observation.
1051
1074
1052
-
- For protocol 2, the error probability remains constant because each observation is classified independently. The accuracy depends only on the likelihood that the two models generates the single observation.
1075
+
**Remark:** Think about how laws of large numbers are applied to compute error probabilities for the model selection problem and the classification problem.
@@ -490,7 +490,7 @@ The first row is the probability that $Y=j, j=0,1$ conditional on $X=0$.
490
490
The second row is the probability that $Y=j, j=0,1$ conditional on $X=1$.
491
491
492
492
Note that
493
-
- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column).
493
+
- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix $P$ is a probability distribution (not so for each column).