| title | Probability |
|---|---|
| subtitle | Statistical Inference |
| author | Brian Caffo, Jeff Leek, Roger Peng |
| job | Johns Hopkins Bloomberg School of Public Health |
- The sample space, $\Omega$, is the collection of possible outcomes of an experiment
  - Example: die roll $\Omega = \{1,2,3,4,5,6\}$
- An event, say $E$, is a subset of $\Omega$
  - Example: die roll is even $E = \{2,4,6\}$
- An elementary or simple event is a particular result of an experiment
  - Example: die roll is a four, $\omega = 4$
- $\emptyset$ is called the null event or the empty set


Normal set operations have particular interpretations in this setting

- $\omega \in E$ implies that $E$ occurs when $\omega$ occurs
- $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs
- $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$
- $E \cap F$ is the event that both $E$ and $F$ occur
- $E \cup F$ is the event that at least one of $E$ or $F$ occurs
- $E \cap F=\emptyset$ means that $E$ and $F$ are mutually exclusive, or cannot both occur
- $E^c$ or $\bar E$ is the event that $E$ does not occur

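
The set interpretations above map onto R's base set functions. The following is a minimal sketch for the die roll; the second event `evt_F` (roll is at least four) and all variable names are assumptions introduced only for illustration.

```r
# Sample space and events for a single die roll
omega_space <- 1:6
evt_E <- c(2, 4, 6)    # die roll is even (the example from the text)
evt_F <- c(4, 5, 6)    # hypothetical second event: roll is at least four

omega <- 4             # an elementary event
omega %in% evt_E       # is omega in E, i.e. does E occur when omega occurs?

E_and_F <- intersect(evt_E, evt_F)     # both E and F occur
E_or_F  <- union(evt_E, evt_F)         # at least one of E or F occurs
E_comp  <- setdiff(omega_space, evt_E) # E does not occur

# E and its complement are mutually exclusive: their intersection is empty
length(intersect(evt_E, E_comp)) == 0
```

Representing events as plain vectors keeps the correspondence with the notation direct: `intersect` plays the role of $\cap$, `union` of $\cup$, and `setdiff` against the sample space gives the complement.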

A probability measure, $P$, is a function on events that satisfies the following

1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$
2. $P(\Omega) = 1$
3. If $E_1$ and $E_2$ are mutually exclusive events, $P(E_1 \cup E_2) = P(E_1) + P(E_2)$.

Part 3 of the definition implies finite additivity
$$
P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i)
$$
where the $A_i$ are mutually exclusive.

Some consequences of the definition:

- $P(\emptyset) = 0$
- $P(E) = 1 - P(E^c)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- if $A \subset B$ then $P(A) \leq P(B)$
- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$
- $P(A \cap B^c) = P(A) - P(A \cap B)$
- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$
- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$

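
Several of these identities can be checked numerically on a small example. The sketch below assumes a fair die, so that the probability of an event is its size divided by six; the helper `P` and the events `A` and `B` are hypothetical names chosen for illustration.

```r
# Fair die (assumption): every outcome has probability 1/6
omega_space <- 1:6
P <- function(event) length(event) / length(omega_space)

A <- c(2, 4, 6)   # roll is even
B <- c(4, 5, 6)   # hypothetical event: roll is at least four
A_comp <- setdiff(omega_space, A)
B_comp <- setdiff(omega_space, B)

# P(E) = 1 - P(E^c)
check_complement <- isTRUE(all.equal(P(A), 1 - P(A_comp)))
# P(A u B) = P(A) + P(B) - P(A n B)
check_union <- isTRUE(all.equal(P(union(A, B)),
                                P(A) + P(B) - P(intersect(A, B))))
# P(A u B) = 1 - P(A^c n B^c)
check_demorgan <- isTRUE(all.equal(P(union(A, B)),
                                   1 - P(intersect(A_comp, B_comp))))
# P(A n B^c) = P(A) - P(A n B)
check_diff <- isTRUE(all.equal(P(intersect(A, B_comp)),
                               P(A) - P(intersect(A, B))))
# max(P(A), P(B)) <= P(A u B) <= P(A) + P(B)
check_bounds <- max(P(A), P(B)) <= P(union(A, B)) &&
  P(union(A, B)) <= P(A) + P(B)

c(check_complement, check_union, check_demorgan, check_diff, check_bounds)
```

A numerical check on one example is not a proof, but it is a quick way to catch a misremembered sign or complement.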
The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problem of these sorts?
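
Answer: No, the events are not mutually exclusive. To elaborate let:
$$
\begin{eqnarray*}
A_1 & = & \{\mbox{Person has sleep apnea}\} \\
A_2 & = & \{\mbox{Person has restless leg syndrome}\}
\end{eqnarray*}
$$
Then
$$
\begin{eqnarray*}
P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\
& = & 0.13 - \mbox{Probability of having both}
\end{eqnarray*}
$$
Likely, some fraction of the population has both.
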
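
Although the overlap is not reported, the probability of having at least one of the two conditions can still be bounded. The sketch below treats both figures as applying to the same population (an assumption, since the text quotes them for different populations); `p_union` is a hypothetical helper.

```r
p_apnea <- 0.03   # P(A_1), figure quoted in the text
p_rls   <- 0.10   # P(A_2), figure quoted in the text

# P(A_1 u A_2) is at least the larger single probability and at most
# the sum; the sum is attained only when the overlap has probability zero
lower <- max(p_apnea, p_rls)
upper <- p_apnea + p_rls
c(lower = lower, upper = upper)

# Hypothetical helper: union probability for a given overlap P(A_1 n A_2)
p_union <- function(p_both) p_apnea + p_rls - p_both
p_union(0)   # the 13% figure, valid only if nobody has both conditions
```

Any positive overlap moves the answer below 13%, which is why the two percentages cannot simply be added.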
- A random variable is a numerical outcome of an experiment.
- The random variables that we study will come in two varieties, discrete or continuous.
- Discrete random variables are random variables that take on only a countable number of possibilities.
  - $P(X = k)$
- Continuous random variables can take any value on the real line or some subset of the real line.
  - $P(X \in A)$

Examples of variables that can be thought of as random variables:

- The $(0-1)$ outcome of the flip of a coin
- The outcome from the roll of a die
- The BMI of a subject four years after a baseline measurement
- The hypertension status of a subject randomly drawn from a population

A probability mass function (pmf) evaluated at a value corresponds to the probability that a random variable takes that value. To be a valid pmf, a function $p$ must satisfy

1. $p(x) \geq 0$ for all $x$
2. $\sum_{x} p(x) = 1$

The sum is taken over all of the possible values for $x$.
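
The two pmf conditions are easy to check in code. As a minimal sketch, assume a fair die, so that each face has probability 1/6; the fairness and the variable names are assumptions for illustration.

```r
# pmf of a fair die (assumption): p(x) = 1/6 for x = 1, ..., 6
x <- 1:6
p <- rep(1 / 6, length(x))

# Condition 1: p(x) >= 0 for all x
nonnegative <- all(p >= 0)
# Condition 2: the probabilities sum to one (all.equal tolerates rounding)
sums_to_one <- isTRUE(all.equal(sum(p), 1))

c(nonnegative = nonnegative, sums_to_one = sums_to_one)

# P(X = k) for a single value, here k = 4
p[x == 4]
```

Comparing the sum with `all.equal` rather than `==` avoids a spurious failure from floating point rounding.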
Let

A probability density function (pdf) is a function associated with a continuous random variable.

Areas under pdfs correspond to probabilities for that random variable.

To be a valid pdf, a function $f$ must satisfy

1. $f(x) \geq 0$ for all $x$
2. The area under $f(x)$ is one.

Suppose that the proportion of help calls that get addressed in a random day by a help line is given by
$$
f(x) = \left\{\begin{array}{ll}
2 x & \mbox{ for } 1 > x > 0 \\
0 & \mbox{ otherwise}
\end{array} \right.
$$

Is this a mathematically valid density?
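
    x <- c(-0.5, 0, 1, 1, 1.5)
    y <- c(0, 0, 2, 0, 0)
    plot(x, y, lwd = 3, frame = FALSE, type = "l")

What is the probability that 75% or fewer of calls get addressed?

    # Area of the triangle with base 0.75 and height f(0.75) = 1.5
    1.5 * 0.75/2
    ## [1] 0.5625

    # f(x) = 2x on (0, 1) is the Beta(2, 1) density
    pbeta(0.75, 2, 1)
    ## [1] 0.5625
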
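
Both questions about this density can also be answered by numerical integration. The sketch below uses base R's `integrate`; the function name `f` mirrors the notation in the text, and the other names are assumptions.

```r
# Density from the help line example: f(x) = 2x on (0, 1), 0 otherwise
f <- function(x) ifelse(x > 0 & x < 1, 2 * x, 0)

# Validity check, part 1: the density is never negative (spot check on a grid)
grid <- seq(-0.5, 1.5, by = 0.01)
nonnegative <- all(f(grid) >= 0)

# Validity check, part 2: the total area under f is one
total_area <- integrate(f, lower = 0, upper = 1)$value

# P(X <= 0.75): area under f from 0 to 0.75
p_075 <- integrate(f, lower = 0, upper = 0.75)$value

c(total_area = total_area, p_075 = p_075)
```

The integration limits are restricted to (0, 1) because the density is zero elsewhere, so nothing outside that interval contributes area.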
- The cumulative distribution function (CDF) of a random variable $X$ is defined as the function $$ F(x) = P(X \leq x) $$
- This definition applies regardless of whether $X$ is discrete or continuous.
- The survival function of a random variable $X$ is defined as $$ S(x) = P(X > x) $$
- Notice that $S(x) = 1 - F(x)$
- For continuous random variables, the PDF is the derivative of the CDF

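What are the survival function and CDF from the density considered before?

For $1 \geq x \geq 0$
$$
F(x) = P(X \leq x) = \frac{1}{2} \mbox{Base} \times \mbox{Height} = \frac{1}{2} (x) \times (2 x) = x^2
$$
$$
S(x) = 1 - x^2
$$

    pbeta(c(0.4, 0.5, 0.6), 2, 1)
    ## [1] 0.16 0.25 0.36
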
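
For the density $f(x) = 2x$ on $(0, 1)$, the area of the triangle to the left of $x$ gives $F(x) = x^2$, and so $S(x) = 1 - x^2$. A short sketch comparing these closed forms with R's Beta(2, 1) functions follows; the variable names are assumptions.

```r
x <- c(0.4, 0.5, 0.6)

cdf_manual  <- x^2        # F(x) = x^2 for 0 <= x <= 1
surv_manual <- 1 - x^2    # S(x) = 1 - F(x)

# pbeta returns P(X <= x) by default and P(X > x) with lower.tail = FALSE
cdf_beta  <- pbeta(x, 2, 1)
surv_beta <- pbeta(x, 2, 1, lower.tail = FALSE)

all.equal(cdf_manual, cdf_beta)
all.equal(surv_manual, surv_beta)
```

Using `lower.tail = FALSE` asks R for the survival function directly instead of computing `1 - pbeta(...)` by hand.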
- The $\alpha^{th}$ quantile of a distribution with distribution function $F$ is the point $x_\alpha$ so that $$ F(x_\alpha) = \alpha $$
- A percentile is simply a quantile with $\alpha$ expressed as a percent
- The median is the $50^{th}$ percentile
- We want to solve $0.5 = F(x) = x^2$
- Resulting in the solution

      sqrt(0.5)
      ## [1] 0.7071

- Therefore, the median proportion of calls answered on a random day is about 0.7071.
- R can approximate quantiles for you for common distributions

      qbeta(0.5, 2, 1)
      ## [1] 0.7071

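
The same approach works for any quantile, not only the median. Assuming the CDF $F(x) = x^2$ on $[0, 1]$ from this example, solving $x^2 = \alpha$ gives $x_\alpha = \sqrt{\alpha}$; the helper `q_manual` below is a hypothetical name for that solution.

```r
# Quantile function for F(x) = x^2 on [0, 1]: solve x^2 = alpha
q_manual <- function(alpha) sqrt(alpha)

alpha <- c(0.25, 0.5, 0.75)
manual_q <- q_manual(alpha)
beta_q   <- qbeta(alpha, 2, 1)   # R's quantile function for Beta(2, 1)

rbind(manual_q, beta_q)

# Plugging a quantile back into the CDF should recover alpha
round_trip <- pbeta(beta_q, 2, 1)
all.equal(round_trip, alpha)
```

The round trip through `qbeta` and `pbeta` illustrates the defining property $F(x_\alpha) = \alpha$.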
- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?"
- We're referring to population quantities. Therefore, the median being discussed is the population median.
- A probability model connects the data to the population using assumptions.
- Therefore, the median we're discussing is the estimand; the sample median will be the estimator.
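
The distinction between estimand and estimator can be illustrated by simulation. The sketch below assumes the data really do follow the Beta(2, 1) model from the help line example; the seed and sample size are arbitrary choices for illustration.

```r
set.seed(1)             # arbitrary seed, only for reproducibility
n <- 10000              # hypothetical sample size
calls <- rbeta(n, 2, 1) # simulated data under the assumed model

pop_median    <- qbeta(0.5, 2, 1)  # estimand: population median, sqrt(0.5)
sample_median <- median(calls)     # estimator: computed from the data

c(population = pop_median, sample = sample_median)
```

The population median is a fixed property of the model, while the sample median changes from one simulated data set to the next and should get closer to the population value as `n` grows.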

