| title | Descriptive Statistics |
|---|---|
| sidebar_label | Descriptive Statistics |
| description | Mastering measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation, range) to summarize and understand data distributions. |
| tags | |

Descriptive statistics allow us to summarize large volumes of raw data into a few meaningful numbers. In Machine Learning, we use these to understand the "center" and the "spread" of our features, which is essential for data cleaning and feature scaling.
Measures of central tendency tell us where the "middle" of the data lies.
Mean: The sum of all values divided by the total number of values. It is highly sensitive to outliers, because a single extreme value shifts the whole sum. $$ \mu = \frac{\sum x_i}{N} $$
Median: The middle value when the data is sorted (with an even number of values, the average of the two middle values). It is robust to outliers, making it better for skewed distributions (like house prices or salaries).
Mode: The value that appears most frequently. Useful for categorical data (e.g., finding the most common car color).
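To make the contrast between the three measures concrete, here is a minimal sketch using Python's built-in `statistics` module. The salary figures and car colors are made-up illustration data, not taken from any real dataset.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [40, 42, 45, 45, 48, 50, 400]

mean_salary = statistics.mean(salaries)      # pulled far upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted data: 45
mode_salary = statistics.mode(salaries)      # most frequent value: 45

# The mode also works on categorical data, where a mean is not defined.
car_colors = ["red", "blue", "red", "white", "red", "blue"]
most_common_color = statistics.mode(car_colors)  # "red"

print(mean_salary, median_salary, mode_salary, most_common_color)
```

The single outlier drags the mean well above every typical salary, while the median and mode stay at 45.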
Knowing the center isn't enough; we need to know how "spread out" the data is.
Range: The difference between the maximum and minimum values. Simple, but very sensitive to extreme outliers, since it depends on only two data points.
Variance: The average of the squared differences from the mean. It measures how far, on average, the values lie from the mean, expressed in squared units. $$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$
Standard deviation: The square root of the variance. It is the most common measure of spread because it is in the same units as the original data.
- Low $\sigma$: Data points are close to the mean.
- High $\sigma$: Data points are spread out over a wide range.
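The measures of spread can be checked on a small example. The sketch below uses the population formulas (dividing by $N$), matching the variance formula in this section; the data values are arbitrary illustration data. Note that `statistics.variance` and `statistics.stdev` would instead divide by $N - 1$ (the sample versions).

```python
import statistics

# Arbitrary illustration data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)      # 9 - 2 = 7
variance = statistics.pvariance(data)   # population variance (divides by N): 4
std_dev = statistics.pstdev(data)       # square root of the variance: 2.0

# Same variance computed directly from the formula, for comparison.
mu = sum(data) / len(data)
manual_variance = sum((x - mu) ** 2 for x in data) / len(data)

print(data_range, variance, std_dev, manual_variance)
```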
Beyond center and spread, we look at the symmetry and "peakedness" of the data.
Skewness: Measures the asymmetry of the distribution.
- Positive (Right) Skew: Long tail on the right side.
- Negative (Left) Skew: Long tail on the left side.
Kurtosis: Measures how heavy ("fat") or light ("thin") the tails of the distribution are compared to a normal distribution. High kurtosis indicates that extreme values (outliers) occur more often than a normal distribution would predict.
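Both shape measures can be sketched as standardized moments. The helper functions below (`skewness` and `excess_kurtosis`) are hypothetical names written for this illustration, using the population (biased) definitions; library implementations may apply bias corrections and return slightly different values. Kurtosis is reported as "excess" kurtosis, so a normal distribution scores 0.

```python
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Population excess kurtosis: the average fourth-power z-score, minus 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 4 for x in values) / len(values) - 3.0

symmetric = [1, 2, 3, 4, 5]            # no tail on either side
right_skewed = [1, 1, 2, 2, 3, 20]     # long tail on the right
left_skewed = [-x for x in right_skewed]  # mirror image: long tail on the left

print(skewness(symmetric))       # approximately 0
print(skewness(right_skewed))    # positive
print(skewness(left_skewed))     # negative
print(excess_kurtosis(symmetric))  # negative: lighter tails than a normal distribution
```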
- Handling Outliers: If the Mean and Median are far apart, the distribution is likely skewed or contains outliers that could distort your model's training.
- Missing Value Imputation: When filling in missing data, we often choose the Mean (for normal data), Median (for skewed data), or Mode (for categorical data).
- Feature Scaling: Techniques like Z-Score Normalization (Standardization) directly use the Mean and Standard Deviation to rescale features: $$ z = \frac{x - \mu}{\sigma} $$
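The imputation and scaling steps above can be sketched in a few lines. The function names `impute_median` and `standardize` are hypothetical helpers for this example, `None` is assumed to mark a missing value, and the feature values are made up. The standard deviation here is the population version, consistent with the formulas in this section.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [x for x in values if x is not None]
    fill = statistics.median(observed)
    return [fill if x is None else x for x in values]

def standardize(values):
    """Z-score standardization: z = (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]

# Made-up feature column with one missing entry.
raw = [2, 4, 4, None, 4, 5, 5, 7, 9]

filled = impute_median(raw)   # the None becomes 4.5, the median of the observed values
z_scores = standardize(filled)

# After standardization the feature has mean ~0 and standard deviation ~1.
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

In a real pipeline, the mean, median, and standard deviation would typically be computed on the training split only and then reused on validation and test data.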
Visualizing these numbers is often more intuitive than reading a table. Next, we’ll explore the most important probability distribution in all of science and ML.
