tutorial/ai-ml/machine-learning/statistics/descriptive-statistics.mdx at 5365b8456dd7e4a1abd63ca4f2ba9297a9e32775 · codeharborhub/tutorial

title: Descriptive Statistics
sidebar_label: Descriptive Statistics
description: Mastering measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation, range) to summarize and understand data distributions.

1. Measures of Central Tendency

These measures tell us where the "middle" of the data lies.

A. Mean (Average)

The sum of all values divided by the total number of values. Because every value contributes to the sum, it is highly sensitive to outliers.

$$ \mu = \frac{\sum x_i}{N} $$

B. Median

The middle value when the data is sorted. It is robust to outliers, making it better for skewed distributions (like house prices or salaries).

C. Mode

The value that appears most frequently. Useful for categorical data (e.g., finding the most common car color).
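To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The salary and color values are made up for illustration; the last salary is a deliberate outlier, so the mean lands far above the median.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [40, 42, 45, 45, 50, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values
mode_salary = statistics.mode(salaries)      # most frequent value

print("mean:", mean_salary)
print("median:", median_salary)
print("mode:", mode_salary)

# The mode also works on categorical data (hypothetical car colors).
colors = ["red", "blue", "red", "white", "red", "blue"]
most_common_color = statistics.mode(colors)
print("most common color:", most_common_color)
```

Here the median and mode both stay at 45, while the single value of 400 drags the mean above 100.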

2. Measures of Dispersion (Spread)

Knowing the center isn't enough; we need to know how "spread out" the data is.

A. Range

The difference between the maximum and minimum values. Simple, but very sensitive to extreme outliers.

B. Variance ($\sigma^2$)

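The average of the squared differences from the mean. Squaring keeps positive and negative deviations from cancelling out and gives large deviations more weight. The formula below is the population variance; when estimating from a sample, divide by $N - 1$ instead of $N$.

$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$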
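The variance formula can be followed step by step in code. This is a small sketch on made-up numbers that computes the population variance by hand and, for comparison, the sample version that divides by $N - 1$; the `statistics` functions are used only as a cross-check.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

# Step 1: the mean.
mu = sum(data) / len(data)

# Step 2: squared difference of each value from the mean.
squared_diffs = [(x - mu) ** 2 for x in data]

# Step 3: average the squared differences.
population_variance = sum(squared_diffs) / len(data)    # divide by N
sample_variance = sum(squared_diffs) / (len(data) - 1)  # divide by N - 1

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The sample variance is always slightly larger than the population variance on the same data, and the gap shrinks as $N$ grows.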

C. Standard Deviation ($\sigma$)

The square root of the variance. It is the most common measure of spread because it is in the same units as the original data.

Low $\sigma$: Data points are close to the mean.
High $\sigma$: Data points are spread out over a wide range.
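As a rough illustration of low versus high $\sigma$, the sketch below compares two made-up datasets that share the same mean but differ in spread, and also computes the range. The values and variable names are assumptions chosen for the example.

```python
import statistics

# Two hypothetical datasets with the same mean (5) but different spread.
tight = [4, 5, 5, 5, 5, 5, 5, 6]
wide = [0, 1, 3, 5, 5, 7, 9, 10]

# Range: maximum minus minimum.
tight_range = max(tight) - min(tight)
wide_range = max(wide) - min(wide)

# Population standard deviation: square root of the population variance.
tight_sigma = statistics.pstdev(tight)
wide_sigma = statistics.pstdev(wide)

print("tight: range =", tight_range, "sigma =", tight_sigma)
print("wide:  range =", wide_range, "sigma =", wide_sigma)
```

Both numbers are in the same units as the data, which is what makes the standard deviation easier to interpret than the variance.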

3. Measures of Shape

Beyond center and spread, we look at the symmetry and tail weight (often loosely called "peakedness") of the data.

A. Skewness

Measures the asymmetry of the distribution.

Positive (Right) Skew: Long tail on the right side.
Negative (Left) Skew: Long tail on the left side.
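The sign of the skewness can be checked numerically. The sketch below uses one common definition, the moment-based (Fisher-Pearson) coefficient $m_3 / m_2^{3/2}$, where $m_k$ is the average of $(x_i - \mu)^k$. The helper name `skewness` and the datasets are illustrative assumptions, not part of the original tutorial.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 20]         # long tail on the right
left_skewed = [-20, -3, -3, -3, -2, -2, -1]   # long tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right skew:", skewness(right_skewed))  # positive
print("left skew:", skewness(left_skewed))    # negative
print("symmetric:", skewness(symmetric))      # zero
```

Libraries may apply a small-sample bias correction, so their results can differ slightly from this plain population formula.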

B. Kurtosis

Measures how "fat" or "thin" the tails of the distribution are compared to a normal distribution. High kurtosis means heavy tails, so extreme values (outliers) occur more often than a normal distribution would predict.
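One way to put a number on tail weight is excess kurtosis, $m_4 / m_2^2 - 3$, where subtracting 3 makes a normal distribution score 0. The following is a minimal sketch under that definition; the helper name `excess_kurtosis` and both datasets are assumptions made for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Evenly spread values: thinner tails than a normal distribution.
light_tails = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mostly identical values plus two extreme ones: heavier tails.
heavy_tails = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]

print("light tails:", excess_kurtosis(light_tails))  # negative
print("heavy tails:", excess_kurtosis(heavy_tails))  # positive
```

A positive result suggests heavier tails than a normal distribution and a negative result suggests lighter ones, though on small samples the estimate is noisy.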

4. Why this matters for ML

Handling Outliers: If the mean and median are far apart, the distribution is likely skewed or contains outliers that could distort your model's training.
Missing Value Imputation: When filling in missing data, we often choose the mean (for roughly symmetric data), the median (for skewed data), or the mode (for categorical data).
Feature Scaling: Techniques like Z-Score Normalization (Standardization) directly use the Mean and Standard Deviation to rescale features: $$ z = \frac{x - \mu}{\sigma} $$
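The imputation and scaling ideas above can be sketched in a few lines of plain Python. The `ages` column is made up, `None` stands in for a missing value, and the median is chosen for imputation because the value 95 skews the column; none of these choices come from the original tutorial.

```python
import statistics

# Hypothetical feature column; None marks a missing value.
ages = [22, 25, None, 31, 29, None, 95]

# Impute with the median of the observed values (robust to the outlier 95).
observed = [a for a in ages if a is not None]
fill_value = statistics.median(observed)
imputed = [fill_value if a is None else a for a in ages]

# Z-score standardization: z = (x - mu) / sigma.
mu = statistics.mean(imputed)
sigma = statistics.pstdev(imputed)
z_scores = [(x - mu) / sigma for x in imputed]

print("imputed:", imputed)
print("z-scores:", [round(z, 2) for z in z_scores])
```

After standardization the feature has mean 0 and standard deviation 1. In a real pipeline, `mu` and `sigma` would normally be computed on the training split only and then reused on validation and test data.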

Visualizing these numbers is often more intuitive than reading a table. Next, we’ll explore the most important probability distribution in all of science and ML.
