| title | PMF vs. PDF |
|---|---|
| sidebar_label | PMF & PDF |
| description | A deep dive into Probability Mass Functions (PMF) for discrete data and Probability Density Functions (PDF) for continuous data. |
| tags | |

To work with data in Machine Learning, we need a mathematical way to describe how likely different values are to occur. Depending on whether our data is Discrete (countable) or Continuous (measurable), we use either a PMF or a PDF.
The PMF is used for discrete random variables. It gives the probability that a discrete random variable is exactly equal to some value.
- Direct Probability: $P(X = x) = f(x)$. The "height" of the bar is the actual probability.
- Summation: All individual probabilities must sum to 1. $$ \sum_{i} P(X = x_i) = 1 $$
- Range: $0 \le P(X = x) \le 1$.
Example: If you roll a fair die, the PMF is $P(X = x) = \frac{1}{6}$ for each $x \in \{1, 2, 3, 4, 5, 6\}$.
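As a minimal sketch, the PMF properties listed above can be checked for a fair six-sided die, where every face is equally likely. The names `die_pmf` and `pmf` are illustrative, and exact fractions are used only to avoid floating-point rounding noise.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def pmf(x):
    """Return P(X = x); values outside the support have probability 0."""
    return die_pmf.get(x, Fraction(0))


# Direct probability: the "height" of the bar at x = 3.
p_three = pmf(3)

# Summation: the probabilities over the whole support add up to 1.
total = sum(die_pmf.values())

# Range: every probability lies in [0, 1].
in_range = all(0 <= p <= 1 for p in die_pmf.values())

# Probability over a range is a sum of heights: P(2 <= X <= 4).
p_two_to_four = sum(pmf(x) for x in range(2, 5))

print(p_three, total, in_range, p_two_to_four)  # prints: 1/6 1 True 1/2
```

Asking for a value outside the support, such as `pmf(7)`, returns 0, which matches the idea that a discrete variable only carries probability at its listed values.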
The PDF is used for continuous random variables. Unlike the PMF, the "height" of a PDF curve does not represent probability; it represents density.
In a continuous world (like height or time), the probability of a variable being exactly equal to any single specific number is 0, because a single point has no width and therefore no area under the curve.
Instead, we find the probability over an interval by calculating the area under the curve.
- Area is Probability: The probability that $X$ falls between $a$ and $b$ is the integral of the PDF: $$ P(a \le X \le b) = \int_{a}^{b} f(x) \, dx $$
- Total Area: The total area under the entire curve must equal 1. $$ \int_{-\infty}^{\infty} f(x) \, dx = 1 $$
- Density vs. Probability: $f(x)$ can be greater than 1, as long as the total area remains 1.
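The following sketch illustrates these three points numerically. It assumes a normal density with a deliberately small standard deviation (`sigma = 0.1`, an arbitrary choice for illustration) and uses a hand-written trapezoidal rule in place of an exact integral. The helper names `normal_pdf` and `integrate` are not from any library.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


sigma = 0.1  # illustrative choice: a narrow, tall curve


def f(x):
    return normal_pdf(x, mu=0.0, sigma=sigma)


# Density vs. probability: the height at the peak is well above 1.
peak_density = f(0.0)

# Total area: integrating over +/- 10 sigma captures essentially all the mass.
total_area = integrate(f, -1.0, 1.0)

# Area is probability: P(-0.1 <= X <= 0.1), i.e. within one sigma of the mean.
p_interval = integrate(f, -0.1, 0.1)

print(peak_density, total_area, p_interval)
```

Even though the curve peaks near 4, the area under it stays at approximately 1, and the one-sigma interval holds roughly 68% of that area.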
graph LR
Data[Data Type] --> Disc[Discrete]
Data --> Cont[Continuous]
Disc --> PMF["PMF: $$P(X=x)$$"]
Cont --> PDF["PDF: $$f(x)$$"]
PMF --> P_Sum["$$\sum P(x) = 1$$"]
PDF --> P_Int["$$\int f(x)dx = 1$$"]
PMF --> P_Val["Height = Probability"]
PDF --> P_Area["Area = Probability"]
| Feature | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Variable Type | Countable (Integers) | Measurable (Real Numbers) |
| Probability at a point | $P(X = x) = f(x)$ | $P(X = x) = 0$ |
| Probability over range | Sum of heights | Area under the curve (Integral) |
| Visualization | Bar chart / Stem plot | Smooth curve |
The Cumulative Distribution Function (CDF) is the "running total" of probability. It tells you the probability that a variable is less than or equal to a given value $x$: $F(x) = P(X \le x)$.
- For PMF: It is a step function (it jumps at every discrete value).
- For PDF: It is a continuous, non-decreasing curve (S-shaped when the density is bell-shaped).
graph LR
PDF["PDF (Density) <br/> $$f(x)$$"] -- " Integrate: <br/> $$\int_{-\infty}^{x} f(t) dt$$ " --> CDF["CDF (Cumulative) <br/> $$F(x)$$"]
CDF -- " Differentiate: <br/> $$\frac{d}{dx} F(x)$$ " --> PDF
style PDF fill:#fdf,stroke:#333,color:#333
style CDF fill:#def,stroke:#333,color:#333
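A small sketch of both directions in the diagram, under two assumptions chosen for illustration: the discrete case is a fair die, and the continuous case is a standard normal distribution, whose CDF has a closed form through `math.erf`. The derivative is approximated with a central difference rather than computed symbolically.

```python
import math
from itertools import accumulate

# Discrete case: the CDF of a fair die is a running sum of the PMF (a step function).
faces = [1, 2, 3, 4, 5, 6]
pmf_values = [1 / 6] * 6
die_cdf = dict(zip(faces, accumulate(pmf_values)))


# Continuous case: standard normal PDF and its CDF via the error function.
def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Differentiating the CDF (central difference) recovers the PDF at that point.
x, h = 0.7, 1e-5
slope = (normal_cdf(x + h) - normal_cdf(x - h)) / (2.0 * h)

print(die_cdf[3], die_cdf[6])
print(slope, normal_pdf(x))
```

The two numbers on the last line should agree to several decimal places, which is the "differentiate" arrow in practice.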
- Likelihood Functions: When training models (like Logistic Regression), we maximize the Likelihood. For discrete labels, this uses the PMF; for continuous targets, it uses the PDF.
- Anomaly Detection: We often flag a data point as an outlier if its PDF value (density) is below a certain threshold.
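- Generative Models: VAEs and GANs attempt to learn the underlying distribution of a dataset (VAEs through an approximate density, GANs only implicitly through a sampler) so they can generate new points that resemble high-density regions of the data (creating realistic images or text).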
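The first two uses above can be sketched in a few lines. Everything here is made up for illustration: the labels, predicted probabilities, sensor-style readings, and especially the `threshold` value, which in practice would be tuned on held-out data. The Gaussian fit is one simple choice of density model, not the only one.

```python
import math
import statistics

# Likelihood with a PMF: Bernoulli log-likelihood of binary labels
# given a model's predicted probabilities (illustrative numbers).
labels = [1, 0, 1, 1]
predicted = [0.9, 0.2, 0.7, 0.6]


def bernoulli_pmf(y, p):
    """P(Y = y) for y in {0, 1} when P(Y = 1) = p."""
    return p if y == 1 else 1.0 - p


log_likelihood = sum(
    math.log(bernoulli_pmf(y, p)) for y, p in zip(labels, predicted)
)

# Anomaly detection with a PDF: fit a Gaussian to typical readings,
# then flag points whose density falls below a hand-picked threshold.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu = statistics.mean(readings)
sigma = statistics.stdev(readings)


def normal_pdf(x):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


threshold = 0.01  # hypothetical cutoff
candidates = [10.05, 12.0]
flags = {x: normal_pdf(x) < threshold for x in candidates}

print(log_likelihood)
print(flags)
```

Training a classifier amounts to adjusting the model so that `log_likelihood` increases, while the anomaly check only compares density values and never converts them to probabilities.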
Now that you understand how we describe probability at a point or over an area, it's time to meet the most important distribution in all of data science.
