
Commit 5fb1c19

apartsin and claude committed
WAVE 7 meta/infrastructure sweep: headers, broken links, element ordering, takeaways
Structural architect, self-containment verifier, controller, and publication QA agents across all 11 parts + appendices. Key fixes:

- 70 files: header structure migrated to canonical book-title-bar pattern (Parts 7-9)
- 70 files: footer standardized with copyright and last-modified date (Parts 7-9)
- ~360 broken cross-reference links fixed (zero-padded section numbers, wrong directory/module names) across Parts 7-11 and appendices
- 29 missing key takeaways sections added (Parts 1, 3-4, 10-11, appendices)
- 12 element ordering violations fixed (takeaways/research-frontier/whats-next)
- 37 files: CSS class corrections (key-takeaway to takeaways, quiz to self-check)
- 23 files: generic whats-next text replaced with specific section descriptions
- 3 missing big-picture callouts added (Part 1)
- h3 numbering corrections in RAG chapter (section 20.6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b81d984 commit 5fb1c19

174 files changed

Lines changed: 1555 additions & 1004 deletions


appendices/appendix-a-mathematical-foundations/index.html

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ <h1>Mathematical Foundations</h1>
 <p>This appendix is most useful for readers who studied these subjects previously but need a refresher targeted to LLMs, and for practitioners who want to understand the "why" behind formulas they already use. If these topics feel entirely new, supplement with a linear algebra or probability textbook before proceeding.</p>
 </div>
 
-<p>The mathematical foundations here underpin everything in <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/index.html">Chapter 4 (Transformer Architecture)</a>, which is the primary destination for applying this material. Optimization and gradient concepts connect directly to <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. Information-theoretic concepts such as cross-entropy and KL divergence reappear in <a class="cross-ref" href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/index.html">Chapter 6 (Pretraining and Scaling Laws)</a> and throughout evaluation in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-metrics/index.html">Chapter 29 (Evaluation)</a>.</p>
+<p>The mathematical foundations here underpin everything in <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/index.html">Chapter 4 (Transformer Architecture)</a>, which is the primary destination for applying this material. Optimization and gradient concepts connect directly to <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. Information-theoretic concepts such as cross-entropy and KL divergence reappear in <a class="cross-ref" href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/index.html">Chapter 6 (Pretraining and Scaling Laws)</a> and throughout evaluation in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-observability/index.html">Chapter 29 (Evaluation)</a>.</p>
 
 <div class="callout note">
 <div class="callout-title">Prerequisites</div>

appendices/appendix-a-mathematical-foundations/section-a.2.html

Lines changed: 4 additions & 4 deletions
@@ -52,7 +52,7 @@ <h3>Probability Distributions</h3>
 # tensor([0.4466, 0.1642, 0.0996, 0.0222, 0.0667])
 # All probabilities sum to 1.0</code></pre>
 <div class="code-caption"><strong>Code Fragment A.2.1:</strong> Converting raw logits to a probability distribution with <a href="https://pytorch.org/" target="_blank" rel="noopener">PyTorch</a>'s softmax. The output sums to 1.0, with higher logits receiving proportionally larger probabilities.</div>
-<p>The <strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a></strong> function is the bridge between raw model scores (logits) and probabilities:</p>
+<p>The <strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a></strong> function is the bridge between raw model scores (logits) and probabilities:</p>
 
 <div class="math-block">
 $$\operatorname{softmax}(z_i) = \exp(z_i) / \Sigma _j \exp(z_j)$$
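The conversion this hunk describes is easy to reproduce. A minimal sketch with illustrative logits (the actual values feeding Code Fragment A.2.1 are hidden above the hunk):

```python
import torch

# Illustrative logits -- not the hidden values from Code Fragment A.2.1.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])
probs = torch.softmax(logits, dim=-1)

print(probs)        # approximately tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])
print(probs.sum())  # tensor(1.) -- a probability distribution always sums to 1
```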
@@ -105,17 +105,17 @@ <h3>Common Distributions</h3>
 <tr>
 <td><strong>Bernoulli</strong></td>
 <td>Discrete</td>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Dropout</a> masks (each neuron kept with probability p)</td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Dropout</a> masks (each neuron kept with probability p)</td>
 </tr>
 </table>
 </div>
 
 <h3>Expected Value and Variance</h3>
 
-<p>The <strong>expected value</strong> (mean) of a distribution tells you the average outcome: <span class="math">$E[X] = \Sigma x_i \cdot P(x_i)$</span>. The <strong>variance</strong> measures spread: <span class="math">$Var(X) = E[(X - E[X])^2]$</span>. <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Layer normalization</a> in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.</p>
+<p>The <strong>expected value</strong> (mean) of a distribution tells you the average outcome: <span class="math">$E[X] = \Sigma x_i \cdot P(x_i)$</span>. The <strong>variance</strong> measures spread: <span class="math">$Var(X) = E[(X - E[X])^2]$</span>. <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Layer normalization</a> in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.</p>
 
 <div class="callout practical-example">
-<div class="callout-title">Practical Example: <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-05.2.html">Temperature</a> Sampling</div>
+<div class="callout-title">Practical Example: <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-5.2.html">Temperature</a> Sampling</div>
 <p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
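The temperature callout in this hunk translates directly to code. A hedged sketch (the helper name is ours, not the book's):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    """Hypothetical helper: sample one token ID from softmax(logits / T)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])
print(torch.softmax(logits / 0.5, dim=-1))   # T < 1: sharper, mass piles onto the top logit
print(torch.softmax(logits / 2.0, dim=-1))   # T > 1: flatter, closer to uniform
print(sample_with_temperature(logits, 0.7))  # usually 0, the index of the highest logit
```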

appendices/appendix-a-mathematical-foundations/section-a.3.html

Lines changed: 5 additions & 5 deletions
@@ -49,7 +49,7 @@ <h3>Derivatives and Gradients</h3>
 
 <h3>Gradient Descent</h3>
 
-<p>The weight update rule for <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">gradient descent</a> is remarkably simple:</p>
+<p>The weight update rule for <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">gradient descent</a> is remarkably simple:</p>
 
 <div class="math-block">
 $$w_{new} = w_{old} - \eta \cdot \nabla L$$
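The update rule in this hunk, applied once by hand to a toy loss (a sketch, assuming plain SGD with no momentum):

```python
import torch

# Toy quadratic loss L(w) = (w - 3)^2, minimized at w = 3.
w = torch.tensor(0.0, requires_grad=True)
eta = 0.1  # the learning rate in the update rule above

loss = (w - 3.0) ** 2
loss.backward()          # dL/dw = 2(w - 3) = -6 at w = 0
with torch.no_grad():
    w -= eta * w.grad    # w_new = w_old - eta * grad(L) = 0 - 0.1 * (-6)
print(w.item())          # 0.6 -- one step toward the minimum
```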
@@ -84,11 +84,11 @@ <h3>The Chain Rule and Backpropagation</h3>
 
 </div>
 
-<p>Each factor in this chain corresponds to one layer. <strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Backpropagation</a></strong> is simply the efficient, automated application of the chain rule, working backwards from the loss to compute gradients for every weight in the network.</p>
+<p>Each factor in this chain corresponds to one layer. <strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Backpropagation</a></strong> is simply the efficient, automated application of the chain rule, working backwards from the loss to compute gradients for every weight in the network.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Why Deep Networks Can Be Difficult to Train</div>
-<p>If the chain rule multiplies many factors less than 1, the gradient shrinks exponentially as it flows backward through layers (the <strong>vanishing gradient</strong> problem). If factors exceed 1, gradients explode. Techniques like <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">residual connections</a> (adding the input of a layer to its output), <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">layer normalization</a>, and careful initialization all exist to keep this product near 1. The transformer architecture uses residual connections around every sub-layer, which is one reason it can be trained to hundreds of layers.</p>
+<p>If the chain rule multiplies many factors less than 1, the gradient shrinks exponentially as it flows backward through layers (the <strong>vanishing gradient</strong> problem). If factors exceed 1, gradients explode. Techniques like <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">residual connections</a> (adding the input of a layer to its output), <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">layer normalization</a>, and careful initialization all exist to keep this product near 1. The transformer architecture uses residual connections around every sub-layer, which is one reason it can be trained to hundreds of layers.</p>
 </div>
 
 <h3>Common Activation Functions and Their Derivatives</h3>
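The chain-rule claim in this hunk can be checked in a few lines: autograd's gradient matches the factor-by-factor product computed by hand (a sketch with an arbitrary one-layer composition):

```python
import torch

# y = sigmoid(w * x); by the chain rule, dy/dw = sigmoid'(w * x) * x.
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)

torch.sigmoid(w * x).backward()

with torch.no_grad():
    s = torch.sigmoid(w * x)
    manual = s * (1 - s) * x   # the two chain-rule factors, multiplied by hand

print(w.grad.item(), manual.item())  # both ~0.3932: backprop is the chain rule, automated
```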
@@ -103,7 +103,7 @@ <h3>Common Activation Functions and Their Derivatives</h3>
 <th scope="col">Used In</th>
 </tr>
 <tr>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">ReLU</a></td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">ReLU</a></td>
 <td>max(0, x)</td>
 <td>0 if x &lt; 0, else 1</td>
 <td>Early transformers, CNNs</td>
@@ -121,7 +121,7 @@ <h3>Common Activation Functions and Their Derivatives</h3>
 <td>LLaMA, modern LLMs</td>
 </tr>
 <tr>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Sigmoid</a></td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Sigmoid</a></td>
 <td>1 / (1 + exp(-x))</td>
 <td>&sigma;(x)(1 - &sigma;(x))</td>
 <td>Gating mechanisms</td>
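The derivative column of this table can be spot-checked against autograd (a sketch; note that PyTorch takes ReLU's subgradient at exactly 0 to be 0):

```python
import torch

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)

# Sigmoid: the autograd gradient should equal sigma(x) * (1 - sigma(x)).
torch.sigmoid(x).sum().backward()
with torch.no_grad():
    closed_form = torch.sigmoid(x) * (1 - torch.sigmoid(x))
print(torch.allclose(x.grad, closed_form))  # True

# ReLU: derivative is 0 for x < 0 and 1 for x > 0, as in the table.
x.grad = None
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 0., 0., 1., 1., 1.])
```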

appendices/appendix-a-mathematical-foundations/section-a.4.html

Lines changed: 2 additions & 2 deletions
@@ -68,14 +68,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.4.1:</strong> Computing entropy for a confident distribution (0.57 bits) versus a uniform distribution (2.0 bits) using <a href="https://numpy.org/" target="_blank" rel="noopener">NumPy</a>. The gap illustrates how entropy quantifies uncertainty: the more spread out the probabilities, the higher the entropy.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
+<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and <a class="cross-ref" href="../appendix-b-ml-essentials/section-b.4.html">Perplexity</a></div>

appendices/appendix-a-mathematical-foundations/section-a.5.html

Lines changed: 4 additions & 4 deletions
@@ -34,12 +34,12 @@ <h1>A.5 Connecting the Pieces</h1>
 
 <ol>
 <li><strong>Embedding lookup</strong> converts token IDs to vectors (linear algebra).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-03-sequence-models-attention/section-03.3.html">Self-attention</a></strong> computes dot products between query and key vectors, applies <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a> to get attention weights (probability), and takes a weighted sum of value vectors (linear algebra).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-03-sequence-models-attention/section-3.3.html">Self-attention</a></strong> computes dot products between query and key vectors, applies <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> to get attention weights (probability), and takes a weighted sum of value vectors (linear algebra).</li>
 <li><strong>Feed-forward layers</strong> apply matrix multiplications followed by activation functions (linear algebra, calculus).</li>
 <li><strong>Output projection and softmax</strong> produce a probability distribution over the vocabulary (probability).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Cross-entropy</a> loss</strong> compares the predicted distribution to the true next token (information theory).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Backpropagation</a></strong> computes gradients of the loss with respect to every weight (calculus, chain rule).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Gradient descent</a></strong> updates the weights to reduce the loss (calculus, optimization).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a> loss</strong> compares the predicted distribution to the true next token (information theory).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Backpropagation</a></strong> computes gradients of the loss with respect to every weight (calculus, chain rule).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Gradient descent</a></strong> updates the weights to reduce the loss (calculus, optimization).</li>
 </ol>
 
 <div class="callout big-picture">

appendices/appendix-b-ml-essentials/index.html

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ <h1>Machine Learning Essentials</h1>
 <p>This reference is most valuable for practitioners who understand the material but benefit from a single page of formulas and tables rather than prose. Graduate students entering LLM work from adjacent fields (vision, speech, classical NLP) will find it a useful orientation to the conventions used throughout the book.</p>
 </div>
 
-<p>This appendix maps directly onto <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. The evaluation metrics in Section B.4 also connect to the model assessment framework in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-metrics/index.html">Chapter 29 (Evaluation)</a>. Loss functions and optimization methods reappear throughout fine-tuning in <a class="cross-ref" href="../../part-4-training-adapting/module-14-fine-tuning-fundamentals/index.html">Chapter 14</a>.</p>
+<p>This appendix maps directly onto <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. The evaluation metrics in Section B.4 also connect to the model assessment framework in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-observability/index.html">Chapter 29 (Evaluation)</a>. Loss functions and optimization methods reappear throughout fine-tuning in <a class="cross-ref" href="../../part-4-training-adapting/module-14-fine-tuning-fundamentals/index.html">Chapter 14</a>.</p>
 
 <div class="callout note">
 <div class="callout-title">Prerequisites</div>

appendices/appendix-b-ml-essentials/section-b.1.html

Lines changed: 4 additions & 4 deletions
@@ -22,7 +22,7 @@ <h1>B.1 Learning Paradigms</h1>
 
 <div class="callout note">
 <div class="callout-title">Covered in Detail</div>
-<p>For a comprehensive treatment of supervised learning, classification, and regression, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Section 0.1: ML Basics: Features, Optimization &amp; Generalization</a>. For a full introduction to reinforcement learning, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.4.html">Section 0.4: Reinforcement Learning Foundations</a>.</p>
+<p>For a comprehensive treatment of supervised learning, classification, and regression, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Section 0.1: ML Basics: Features, Optimization &amp; Generalization</a>. For a full introduction to reinforcement learning, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.4.html">Section 0.4: Reinforcement Learning Foundations</a>.</p>
 </div>
 
 <p>This page provides a concise at-a-glance summary of the three learning paradigms. Use it as a quick reminder; refer to the main chapters for explanations, examples, and code.</p>
@@ -40,19 +40,19 @@ <h1>B.1 Learning Paradigms</h1>
 <td><strong>Supervised</strong></td>
 <td>Labeled input-output pairs</td>
 <td>Supervised fine-tuning (SFT)</td>
-<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Section 0.1</a></td>
+<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Section 0.1</a></td>
 </tr>
 <tr>
 <td><strong>Self-supervised</strong></td>
 <td>Labels derived from data structure (next-token prediction)</td>
 <td>Pretraining</td>
-<td><a href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/section-06.1.html">Section 6.1</a></td>
+<td><a href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/section-6.1.html">Section 6.1</a></td>
 </tr>
 <tr>
 <td><strong>Reinforcement</strong></td>
 <td>Reward signal from environment or human feedback</td>
 <td>RLHF / DPO alignment</td>
-<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.4.html">Section 0.4</a>, <a href="../../part-4-training-adapting/module-17-alignment-rlhf-dpo/section-17.1.html">Section 16.1</a></td>
+<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.4.html">Section 0.4</a>, <a href="../../part-4-training-adapting/module-17-alignment-rlhf-dpo/section-17.1.html">Section 16.1</a></td>
 </tr>
 </table>
 </div>
