- Overview
- Linear Algebra
- Probability Fundamentals
- Random Variables
- Important Distributions
- Statistical Inference
- Parameter Estimation
- Hypothesis Testing
- Linear Regression
- Multivariate Normal Distribution
- Key Concepts for Machine Learning
- References
This document covers essential mathematical and statistical foundations for understanding machine learning algorithms. Topics include linear algebra, probability theory, random variables, statistical distributions, and basic inference methods.
Linear algebra is fundamental to machine learning, as most ML algorithms operate on vectors and matrices.
Vector: An ordered array of numbers representing a point or direction in space.
Column vector: $$ \mathbf{x}=\begin{bmatrix}x_1\x_2\\vdots\x_n\end{bmatrix} $$
Operations:
- Addition: $\mathbf{x}+\mathbf{y}=\begin{bmatrix}x_1+y_1\x_2+y_2\\vdots\x_n+y_n\end{bmatrix}$
- Scalar multiplication: $c\mathbf{x}=\begin{bmatrix}cx_1\cx_2\\vdots\cx_n\end{bmatrix}$
-
Dot product:
$\mathbf{x}^T\mathbf{y}=\sum_{i=1}^nx_iy_i$ - Norm: $|\mathbf{x}|2=\sqrt{\sum{i=1}^nx_i^2}$ (Euclidean norm)
Matrix: A 2D array of numbers with m rows and n columns.
Operations:
- Addition: $(\mathbf{A}+\mathbf{B}){ij}=a{ij}+b_{ij}$
- Scalar multiplication: $(c\mathbf{A}){ij}=ca{ij}$
- Matrix multiplication: $(\mathbf{AB}){ik}=\sum{j=1}^na_{ij}b_{jk}$
- Transpose: $(\mathbf{A}^T){ij}=a{ji}$
Properties:
- Matrix multiplication is associative:
$(\mathbf{AB})\mathbf{C}=\mathbf{A}(\mathbf{BC})$ - Matrix multiplication is distributive:
$\mathbf{A}(\mathbf{B}+\mathbf{C})=\mathbf{AB}+\mathbf{AC}$ - Matrix multiplication is not commutative:
$\mathbf{AB}\neq\mathbf{BA}$ (in general) $(\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T$
Identity matrix (I): $$ \mathbf{I}=\begin{bmatrix} 1 & 0 & \cdots & 0\ 0 & 1 & \cdots & 0\ \vdots & \vdots & \ddots & \vdots\ 0 & 0 & \cdots & 1 \end{bmatrix} $$
Property:
Diagonal matrix: All off-diagonal elements are zero $$ \mathbf{D}=\begin{bmatrix} d_1 & 0 & \cdots & 0\ 0 & d_2 & \cdots & 0\ \vdots & \vdots & \ddots & \vdots\ 0 & 0 & \cdots & d_n \end{bmatrix} $$
Symmetric matrix:
Orthogonal matrix:
For square matrix
Properties:
$(\mathbf{A}^{-1})^{-1}=\mathbf{A}$ $(\mathbf{AB})^{-1}=\mathbf{B}^{-1}\mathbf{A}^{-1}$ $(\mathbf{A}^T)^{-1}=(\mathbf{A}^{-1})^T$
A matrix is invertible (non-singular) if its determinant is non-zero.
For 2×2 matrix: $$ \det\begin{bmatrix}a & b\c & d\end{bmatrix}=ad-bc $$
Properties:
$\det(\mathbf{AB})=\det(\mathbf{A})\det(\mathbf{B})$ $\det(\mathbf{A}^T)=\det(\mathbf{A})$ $\det(\mathbf{A}^{-1})=\frac{1}{\det(\mathbf{A})}$
The trace of a square matrix is the sum of diagonal elements: $$ \text{tr}(\mathbf{A})=\sum_{i=1}^na_{ii} $$
Properties:
$\text{tr}(\mathbf{A}+\mathbf{B})=\text{tr}(\mathbf{A})+\text{tr}(\mathbf{B})$ $\text{tr}(c\mathbf{A})=c\cdot\text{tr}(\mathbf{A})$ $\text{tr}(\mathbf{AB})=\text{tr}(\mathbf{BA})$
For square matrix
Then
Characteristic equation: $$ \det(\mathbf{A}-\lambda\mathbf{I})=0 $$
Properties:
- Sum of eigenvalues = trace of matrix
- Product of eigenvalues = determinant of matrix
- Symmetric matrices have real eigenvalues and orthogonal eigenvectors
Eigendecomposition (for square matrix): $$ \mathbf{A}=\mathbf{Q}\Lambda\mathbf{Q}^{-1} $$
where
For symmetric matrix:
Any matrix
where:
-
$\mathbf{U}$ (m×m): left singular vectors (orthogonal) -
$\Sigma$ (m×n): diagonal matrix with singular values$\sigma_1\geq\sigma_2\geq\cdots\geq0$ -
$\mathbf{V}$ (n×n): right singular vectors (orthogonal)
Applications in ML:
- Principal Component Analysis (PCA)
- Matrix approximation
- Dimensionality reduction
- Recommender systems
The rank of a matrix is the maximum number of linearly independent rows (or columns).
Properties:
-
$\text{rank}(\mathbf{A})\leq\min(m,n)$ for m×n matrix $\text{rank}(\mathbf{AB})\leq\min(\text{rank}(\mathbf{A}),\text{rank}(\mathbf{B}))$ - Full rank: rank equals the smallest dimension
Linear independence: Vectors
Span: The set of all possible linear combinations of a set of vectors.
Basis: A set of linearly independent vectors that span a vector space.
Dimension: The number of vectors in a basis.
Gradient of scalar function with respect to vector: $$ \nabla_{\mathbf{x}}f(\mathbf{x})=\begin{bmatrix}\frac{\partial f}{\partial x_1}\\frac{\partial f}{\partial x_2}\\vdots\\frac{\partial f}{\partial x_n}\end{bmatrix} $$
Useful derivatives:
$\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x})=\mathbf{a}$ $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x})=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$ $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{x})=2\mathbf{x}$
Hessian (matrix of second derivatives): $$ \mathbf{H}_{ij}=\frac{\partial^2f}{\partial x_i\partial x_j} $$
Vector norms:
-
$L^1$ norm: $|\mathbf{x}|1=\sum{i=1}^n|x_i|$ -
$L^2$ norm (Euclidean): $|\mathbf{x}|2=\sqrt{\sum{i=1}^nx_i^2}$ -
$L^\infty$ norm:$|\mathbf{x}|_\infty=\max_i|x_i|$
Matrix norms:
- Frobenius norm: $|\mathbf{A}|F=\sqrt{\sum{i=1}^m\sum_{j=1}^na_{ij}^2}$
A symmetric matrix
Properties:
- All eigenvalues are positive
- Invertible
- Important for optimization (convex functions)
Positive semi-definite:
Linear system:
Least squares solution (when system is overdetermined): $$ \hat{\mathbf{x}}=\arg\min_{\mathbf{x}}|\mathbf{Ax}-\mathbf{b}|_2^2 $$
Normal equations: $$ \mathbf{A}^T\mathbf{Ax}=\mathbf{A}^T\mathbf{b} $$
Solution: $$ \hat{\mathbf{x}}=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} $$
The matrix
- Sample space: set of all possible outcomes of an experiment
- Event: any subset of the sample space
-
Set operations:
- Intersection:
$E \cap F$ - Union:
$E \cup F$ - Complement:
$E^c$ - Mutually exclusive:
$E \cap F=\phi$
- Intersection:
- General formula:
- Bayes formula:
Two events E and F are independent if: $$ P(EF)=P(E)P(F) $$
- Random variable: quantity of interest determined by the result of an experiment
- Discrete random variables: countable set of possible values
- Continuous random variables: continuum of possible values
-
Cumulative distribution function:
$F(x)=P(X\le x)$ -
Probability mass function (discrete):
$p(a)=P(X=a)$ -
Probability density function (continuous):
$F(a)=\int_{-\infty}^af(x)dx$
Expectation:
- Discrete:
$E[X]=\sum_{i=1}^n x_ip(x_i)$ - Continuous:
$E[X]=\int xf(x)dx$ - Linearity:
$E[X+Y]=E[X]+E[Y]$
Variance: $$ Var(X)=E[(X-\mu)^2]=E[X^2]-E[X]^2 $$
Covariance: $$ Cov(X, Y)=E[(X-\mu_x)(Y-\mu_y)]=E[XY]-E[X]E[Y] $$
Correlation: $$ Corr(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} $$
Properties:
$Var(X+Y)=Var(X)+Var(Y)+2Cov(X, Y)$ - Independent random variables:
$Cov(X,Y)=0$
Binary outcome (success/failure): $$ P(x=1)=p, \quad P(x=0)=1-p $$
Number of successes in n independent Bernoulli trials: $$ P(X=k)=C_n^kp^k(1-p)^{n-k} $$
Events occurring in a fixed interval: $$ P(X=k)=e^{-\lambda}\frac{\lambda^k}{k!} $$
Standard Normal:
Time between events in a Poisson process: $$ f(x)=\lambda e^{-\lambda x}, \ \text{if}\ x\ge0 $$
Memoryless property:
For i.i.d. random variables
For i.i.d. random variables with mean
The sample mean of a large sample (typically n ≥ 30) is approximately normally distributed, regardless of the underlying distribution.
-
Sample mean:
$\bar X=\frac{X_1+...+X_n}{n}$ -
Sample variance:
$s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2$ -
Sample standard deviation:
$s=\sqrt{s^2}$
Properties:
$E[\bar X]=\mu$ $Var(\bar X)=\frac{\sigma^2}{n}$
Given i.i.d. sample
Likelihood function: $$ L(\theta;x_1,...,x_n)=\prod_{i=1}^nf(x_i;\theta) $$
MLE: $$ \hat \theta_{MLE}=\arg\max_{\theta}\log L(\theta;X_1,...X_n) $$
Properties:
- Consistent:
$\hat \theta_n\stackrel{p}{\longrightarrow}\theta_0$ - Asymptotically normal:
$\sqrt n(\hat \theta_n-\theta_0)\stackrel{d}{\longrightarrow}N(0,I(\theta_0)^{-1})$
Examples:
- Normal distribution:
$\hat\mu = \bar X$ ,$\hat\sigma^2=\frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2$ - Uniform distribution:
$\hat \theta=\max(X_1, X_2, ..., X_n)$
A
Common confidence intervals:
-
Normal mean (known variance):
$\bar X \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ -
Normal mean (unknown variance):
$\bar X \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$
- Null hypothesis (H₀): default hypothesis to be tested
- Alternative hypothesis (H₁): competing hypothesis
- Type I error: rejecting H₀ when it is true (significance level α)
- Type II error: accepting H₀ when it is false (power = 1 - β)
- p-value: probability of observing the data (or more extreme) under H₀
Decision rule:
- If p-value < α: reject H₀
- If p-value ≥ α: fail to reject H₀
Mean of a normal population (known variance): $$ z=\frac{\bar X-\mu_0}{\sigma/\sqrt n} $$
Mean of a normal population (unknown variance): $$ t=\frac{\bar X-\mu_0}{s/\sqrt n} \sim t_{n-1} $$
Model:
Least squares estimates: $$ \hat\beta_1=\frac{\sum(x_i-\bar x)(Y_i-\bar Y)}{\sum(x_i-\bar x)^2} $$
Coefficient of determination: $$ R^2=1-\frac{SS_{residual}}{SS_{total}} $$
Matrix form:
Least squares solution: $$ \hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} $$
Properties:
- Unbiased:
$E[\hat\beta]=\beta$ - Variance:
$Var(\hat\beta)=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$
Density function: $$ f(\mathbf{x};\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right) $$
MLE estimates: $$ \hat\mu=\bar{\mathbf{X}}n=\frac{1}{n}\sum{i=1}^n\mathbf{X}_i $$
Central Limit Theorem: $$ \sqrt n(\bar X_n-\mu)\stackrel{D}{\longrightarrow}N(0,\Sigma) $$
Model error can be decomposed as: $$ E[(f(\mathbf{x})-\hat{f}(\mathbf{x}))^2]=\text{Bias}^2[\hat{f}(\mathbf{x})]+\text{Var}[\hat{f}(\mathbf{x})]+\sigma^2 $$
- Bias: error from incorrect assumptions
- Variance: error from sensitivity to training data
- Irreducible error: inherent noise in data
L1 Regularization (Lasso): $$ \min_\beta \left|\mathbf{y}-\mathbf{X}\beta\right|_2^2 + \lambda|\beta|_1 $$
L2 Regularization (Ridge): $$ \min_\beta \left|\mathbf{y}-\mathbf{X}\beta\right|_2^2 + \lambda|\beta|_2^2 $$
Entropy: $$ H(X)=-\sum_x p(x)\log p(x) $$
Cross-Entropy: $$ H(p,q)=-\sum_x p(x)\log q(x) $$
KL Divergence: $$ D_{KL}(p||q)=\sum_x p(x)\log\frac{p(x)}{q(x)} $$
Course Materials:
- STAT GR5701 Probability and Statistics for Data Science - Columbia University
- STAT GR5703 Statistical Inference and Modeling - Columbia University

