6.390 Notation and Style Guide

Consistent notation across all course materials: notes, exams, labs, homework, and recitations. If in doubt, the course notes, Shen's lecture slides, and most recent exams are the canonical reference.

Mathematical Notation Conventions

Training Data

Data points indexed with superscript in parentheses: x^{(i)}, y^{(i)}
Dataset: \mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n
n = number of training examples, d = input dimension
Feature components use subscripts: x_1, x_2, x_j

Vectors and Matrices

Vectors: bold lowercase via \mathbf{} -- e.g., \mathbf{x}
Matrices: plain uppercase -- e.g., X, W, A
Transpose: X^T or \theta^T
Column vectors by default; \theta^T x for dot products
Dimensions stated explicitly: X \in \mathbb{R}^{n \times d}, \theta \in \mathbb{R}^d

Parameters and Weights

Linear model: \theta (weights), \theta_0 (intercept/bias)
Neural network layers: W^{(l)} (weight matrix), W_0^{(l)} or b^{(l)} (bias), with superscript in parens for layer index
ML estimate: \theta_{\text{ml}}
ERM estimate: \theta_{\text{erm}}
Model variant labels: \theta^{\mathrm{multi}}, \theta^{\mathrm{bin}}

Functions

Hypothesis: h(x; \theta, \theta_0) -- semicolon separates input from parameters
Loss: \mathcal{L}(g, y) for general loss; specific variants \mathcal{L}_{\text{nll}}, \mathcal{L}_{\text{SE}}
Objective: J(\theta) or J(\theta, \theta_0)
Sigmoid: \sigma(z) = \frac{1}{1 + e^{-z}}
Softmax: \operatorname{softmax}
ReLU: \text{ReLU}(z) = \max(0, z) -- capitalize "ReLU" in text
Feature transform: \phi(x) for transformed feature vector

Neural Networks

Pre-activation: Z^{(l)} = (W^{(l)})^T A^{(l-1)} + W_0^{(l)}
Post-activation: A^{(l)} = f^{(l)}(Z^{(l)})
Input: A^{(0)} = x
Activation functions: f^{(l)}(\cdot) per-layer

Gradients

Gradient operator: \nabla_\theta J(\theta)
Partial derivatives: \frac{\partial J}{\partial \theta}
Gradient descent update: \theta^{(t)} = \theta^{(t-1)} - \eta \nabla_\theta J(\theta^{(t-1)})

Attention Mechanism

Queries/keys/values: q_i = W_q^T x_i, k_j = W_k^T x_j, v_j = W_v^T x_j
Attention scores: s_{ij} = q_i \cdot k_j
Softmax'd attention scores: \alpha_{ij} = \frac{e^{s_{ij}}}{\sum_l e^{s_{il}}} (explicit softmax formula) -- do NOT call these "attention weights"
Output: z_i = \sum_j \alpha_{ij} v_j

CNNs

Filter/kernel: lowercase f or F, shown as small matrices
Output size: (\text{input} - \text{filter} + 2\text{padding}) / \text{stride} + 1
Receptive field: set notation \{(i,j), \ldots\}

MDPs and Reinforcement Learning

States: lowercase s with subscripts (s_0, s_1, s_2) or tuples ((p, m))
Actions: lowercase a; named actions in \text{}: \text{Forward}, \text{cash out}
State space: \mathcal{S}, action space: \mathcal{A}
Reward function: \mathrm{R}(s, a) -- upright R
Transition function: \mathrm{T}(s, a, s') -- upright T
Discount factor: \gamma \in (0, 1)
Horizon: h
Optimal value function: \mathrm{V}^*_h(s) (finite horizon), \mathrm{V}^*_\infty(s) or \mathrm{V}^*(s) (infinite horizon) -- upright V
Optimal Q-function: \mathrm{Q}^*_h(s, a) (finite horizon), \mathrm{Q}^*_\infty(s, a) (infinite horizon) -- upright Q
Policy value function: \mathrm{V}^\pi(s)
Policy: \pi or \pi(s); greedy policy: \pi^*(s) = \arg\max_a \mathrm{Q}^*(s, a)
All MDP functions (\mathrm{R}, \mathrm{T}, \mathrm{V}, \mathrm{Q}) use upright (\mathrm{}) letters
Value iteration (Bellman equation): \mathrm{V}^*_{h+1}(s) = \max_a [\mathrm{R}(s,a) + \gamma \sum_{s'} \mathrm{T}(s,a,s') \mathrm{V}^*_h(s')]
Q-value Bellman equation: \mathrm{Q}^*_{h+1}(s,a) = \mathrm{R}(s,a) + \gamma \sum_{s'} \mathrm{T}(s,a,s') \max_{a'} \mathrm{Q}^*_h(s',a')
V-Q relationship: \mathrm{V}^*(s) = \max_a \mathrm{Q}^*(s, a)
Q-learning update: \mathrm{Q}(s,a) \leftarrow \mathrm{Q}(s,a) + \alpha [r + \gamma \max_{a'} \mathrm{Q}(s',a') - \mathrm{Q}(s,a)]
Learning rate (Q-learning): \alpha
Epsilon-greedy: \varepsilon-greedy
Deep RL / DQN: neural network Q-function Q_\theta(s,a), policy network \pi_\theta
DQN target: y = r + \gamma \max_{a'} Q_\theta(s',a')
DQN loss (Bellman error): \mathcal{L}(\theta) = \mathbb{E}[(y - Q_\theta(s,a))^2]

Other Conventions

Learning rate: \eta
Regularization: \lambda
Norms: \lVert \cdot \rVert (e.g., \lVert x \rVert); avoid raw ||
Absolute value: \left| \cdot \right| (or \lvert \cdot \rvert for proper spacing)
Sets: calligraphic \mathcal{} -- \mathcal{D} (dataset), \mathcal{M} (model class), \mathcal{H} (hypothesis class)
Real numbers: \mathbb{R}
Iteration/time index: superscript in parens \theta^{(t)}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6.390 Notation and Style Guide

Mathematical Notation Conventions

Training Data

Vectors and Matrices

Parameters and Weights

Functions

Neural Networks

Gradients

Attention Mechanism

CNNs

MDPs and Reinforcement Learning

Other Conventions

FilesExpand file tree

NOTATION.md

Latest commit

History

NOTATION.md

File metadata and controls

6.390 Notation and Style Guide

Mathematical Notation Conventions

Training Data

Vectors and Matrices

Parameters and Weights

Functions

Neural Networks

Gradients

Attention Mechanism

CNNs

MDPs and Reinforcement Learning

Other Conventions