
Commit 5fb1c19

apartsin and claude committed
WAVE 7 meta/infrastructure sweep: headers, broken links, element ordering, takeaways
Structural architect, self-containment verifier, controller, and publication QA agents across all 11 parts + appendices. Key fixes:

- 70 files: header structure migrated to canonical book-title-bar pattern (Parts 7-9)
- 70 files: footer standardized with copyright and last-modified date (Parts 7-9)
- ~360 broken cross-reference links fixed (zero-padded section numbers, wrong directory/module names) across Parts 7-11 and appendices
- 29 missing key takeaways sections added (Parts 1, 3-4, 10-11, appendices)
- 12 element ordering violations fixed (takeaways/research-frontier/whats-next)
- 37 files: CSS class corrections (key-takeaway to takeaways, quiz to self-check)
- 23 files: generic whats-next text replaced with specific section descriptions
- 3 missing big-picture callouts added (Part 1)
- h3 numbering corrections in RAG chapter (section 20.6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b81d984 commit 5fb1c19

174 files changed

Lines changed: 1555 additions & 1004 deletions


appendices/appendix-a-mathematical-foundations/index.html

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ <h1>Mathematical Foundations</h1>
 <p>This appendix is most useful for readers who studied these subjects previously but need a refresher targeted to LLMs, and for practitioners who want to understand the "why" behind formulas they already use. If these topics feel entirely new, supplement with a linear algebra or probability textbook before proceeding.</p>
 </div>
 
-<p>The mathematical foundations here underpin everything in <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/index.html">Chapter 4 (Transformer Architecture)</a>, which is the primary destination for applying this material. Optimization and gradient concepts connect directly to <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. Information-theoretic concepts such as cross-entropy and KL divergence reappear in <a class="cross-ref" href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/index.html">Chapter 6 (Pretraining and Scaling Laws)</a> and throughout evaluation in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-metrics/index.html">Chapter 29 (Evaluation)</a>.</p>
+<p>The mathematical foundations here underpin everything in <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/index.html">Chapter 4 (Transformer Architecture)</a>, which is the primary destination for applying this material. Optimization and gradient concepts connect directly to <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. Information-theoretic concepts such as cross-entropy and KL divergence reappear in <a class="cross-ref" href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/index.html">Chapter 6 (Pretraining and Scaling Laws)</a> and throughout evaluation in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-observability/index.html">Chapter 29 (Evaluation)</a>.</p>
 
 <div class="callout note">
 <div class="callout-title">Prerequisites</div>

appendices/appendix-a-mathematical-foundations/section-a.2.html

Lines changed: 4 additions & 4 deletions
@@ -52,7 +52,7 @@ <h3>Probability Distributions</h3>
 # tensor([0.4466, 0.1642, 0.0996, 0.0222, 0.0667])
 # All probabilities sum to 1.0</code></pre>
 <div class="code-caption"><strong>Code Fragment A.2.1:</strong> Converting raw logits to a probability distribution with <a href="https://pytorch.org/" target="_blank" rel="noopener">PyTorch</a>'s softmax. The output sums to 1.0, with higher logits receiving proportionally larger probabilities.</div>
-<p>The <strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a></strong> function is the bridge between raw model scores (logits) and probabilities:</p>
+<p>The <strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a></strong> function is the bridge between raw model scores (logits) and probabilities:</p>
 
 <div class="math-block">
 $$\operatorname{softmax}(z_i) = \exp(z_i) / \Sigma _j \exp(z_j)$$
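The conversion this hunk describes is easy to reproduce. A minimal sketch with illustrative logits (the actual values feeding Code Fragment A.2.1 are hidden above the hunk):

```python
import torch

# Illustrative logits -- not the hidden values from Code Fragment A.2.1.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])
probs = torch.softmax(logits, dim=-1)

print(probs)        # approximately tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])
print(probs.sum())  # tensor(1.) -- a probability distribution always sums to 1
```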
@@ -105,17 +105,17 @@ <h3>Common Distributions</h3>
 <tr>
 <td><strong>Bernoulli</strong></td>
 <td>Discrete</td>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Dropout</a> masks (each neuron kept with probability p)</td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Dropout</a> masks (each neuron kept with probability p)</td>
 </tr>
 </table>
 </div>
 
 <h3>Expected Value and Variance</h3>
 
-<p>The <strong>expected value</strong> (mean) of a distribution tells you the average outcome: <span class="math">$E[X] = \Sigma x_i \cdot P(x_i)$</span>. The <strong>variance</strong> measures spread: <span class="math">$Var(X) = E[(X - E[X])^2]$</span>. <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Layer normalization</a> in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.</p>
+<p>The <strong>expected value</strong> (mean) of a distribution tells you the average outcome: <span class="math">$E[X] = \Sigma x_i \cdot P(x_i)$</span>. The <strong>variance</strong> measures spread: <span class="math">$Var(X) = E[(X - E[X])^2]$</span>. <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Layer normalization</a> in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.</p>
 
 <div class="callout practical-example">
-<div class="callout-title">Practical Example: <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-05.2.html">Temperature</a> Sampling</div>
+<div class="callout-title">Practical Example: <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-5.2.html">Temperature</a> Sampling</div>
 <p>When generating text, the <strong>temperature</strong> parameter reshapes the probability distribution. Given logits <span class="math">$z$</span>, we compute <span class="math">$\operatorname{softmax}(z / T)$</span>. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At <span class="math">$T \rightarrow 0$</span>, the model always picks the highest-probability token (greedy decoding). At <span class="math">$T \rightarrow \infty$</span>, all tokens become equally likely.</p>
 </div>
 
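The temperature callout in this hunk translates directly to code. A hedged sketch (the helper name is ours, not the book's):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    """Hypothetical helper: sample one token ID from softmax(logits / T)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])
print(torch.softmax(logits / 0.5, dim=-1))   # T < 1: sharper, mass piles onto the top logit
print(torch.softmax(logits / 2.0, dim=-1))   # T > 1: flatter, closer to uniform
print(sample_with_temperature(logits, 0.7))  # usually 0, the index of the highest logit
```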

appendices/appendix-a-mathematical-foundations/section-a.3.html

Lines changed: 5 additions & 5 deletions
@@ -49,7 +49,7 @@ <h3>Derivatives and Gradients</h3>
 
 <h3>Gradient Descent</h3>
 
-<p>The weight update rule for <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">gradient descent</a> is remarkably simple:</p>
+<p>The weight update rule for <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">gradient descent</a> is remarkably simple:</p>
 
 <div class="math-block">
 $$w_{new} = w_{old} - \eta \cdot \nabla L$$
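The update rule in this hunk, applied once by hand to a toy loss (a sketch, assuming plain SGD with no momentum):

```python
import torch

# Toy quadratic loss L(w) = (w - 3)^2, minimized at w = 3.
w = torch.tensor(0.0, requires_grad=True)
eta = 0.1  # the learning rate in the update rule above

loss = (w - 3.0) ** 2
loss.backward()          # dL/dw = 2(w - 3) = -6 at w = 0
with torch.no_grad():
    w -= eta * w.grad    # w_new = w_old - eta * grad(L) = 0 - 0.1 * (-6)
print(w.item())          # 0.6 -- one step toward the minimum
```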
@@ -84,11 +84,11 @@ <h3>The Chain Rule and Backpropagation</h3>
 
 </div>
 
-<p>Each factor in this chain corresponds to one layer. <strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Backpropagation</a></strong> is simply the efficient, automated application of the chain rule, working backwards from the loss to compute gradients for every weight in the network.</p>
+<p>Each factor in this chain corresponds to one layer. <strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Backpropagation</a></strong> is simply the efficient, automated application of the chain rule, working backwards from the loss to compute gradients for every weight in the network.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Why Deep Networks Can Be Difficult to Train</div>
-<p>If the chain rule multiplies many factors less than 1, the gradient shrinks exponentially as it flows backward through layers (the <strong>vanishing gradient</strong> problem). If factors exceed 1, gradients explode. Techniques like <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">residual connections</a> (adding the input of a layer to its output), <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">layer normalization</a>, and careful initialization all exist to keep this product near 1. The transformer architecture uses residual connections around every sub-layer, which is one reason it can be trained to hundreds of layers.</p>
+<p>If the chain rule multiplies many factors less than 1, the gradient shrinks exponentially as it flows backward through layers (the <strong>vanishing gradient</strong> problem). If factors exceed 1, gradients explode. Techniques like <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">residual connections</a> (adding the input of a layer to its output), <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">layer normalization</a>, and careful initialization all exist to keep this product near 1. The transformer architecture uses residual connections around every sub-layer, which is one reason it can be trained to hundreds of layers.</p>
 </div>
 
 <h3>Common Activation Functions and Their Derivatives</h3>
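The chain-rule claim in this hunk can be checked in a few lines: autograd's gradient matches the factor-by-factor product computed by hand (a sketch with an arbitrary one-layer composition):

```python
import torch

# y = sigmoid(w * x); by the chain rule, dy/dw = sigmoid'(w * x) * x.
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)

torch.sigmoid(w * x).backward()

with torch.no_grad():
    s = torch.sigmoid(w * x)
    manual = s * (1 - s) * x   # the two chain-rule factors, multiplied by hand

print(w.grad.item(), manual.item())  # both ~0.3932: backprop is the chain rule, automated
```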
@@ -103,7 +103,7 @@ <h3>Common Activation Functions and Their Derivatives</h3>
 <th scope="col">Used In</th>
 </tr>
 <tr>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">ReLU</a></td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">ReLU</a></td>
 <td>max(0, x)</td>
 <td>0 if x &lt; 0, else 1</td>
 <td>Early transformers, CNNs</td>
@@ -121,7 +121,7 @@ <h3>Common Activation Functions and Their Derivatives</h3>
 <td>LLaMA, modern LLMs</td>
 </tr>
 <tr>
-<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Sigmoid</a></td>
+<td><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Sigmoid</a></td>
 <td>1 / (1 + exp(-x))</td>
 <td>&sigma;(x)(1 - &sigma;(x))</td>
 <td>Gating mechanisms</td>
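The derivative column of this table can be spot-checked against autograd (a sketch; note that PyTorch takes ReLU's subgradient at exactly 0 to be 0):

```python
import torch

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)

# Sigmoid: the autograd gradient should equal sigma(x) * (1 - sigma(x)).
torch.sigmoid(x).sum().backward()
with torch.no_grad():
    closed_form = torch.sigmoid(x) * (1 - torch.sigmoid(x))
print(torch.allclose(x.grad, closed_form))  # True

# ReLU: derivative is 0 for x < 0 and 1 for x > 0, as in the table.
x.grad = None
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 0., 0., 1., 1., 1.])
```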

appendices/appendix-a-mathematical-foundations/section-a.4.html

Lines changed: 2 additions & 2 deletions
@@ -68,14 +68,14 @@ <h3>Entropy</h3>
 <div class="code-caption"><strong>Code Fragment A.4.1:</strong> Computing entropy for a confident distribution (0.57 bits) versus a uniform distribution (2.0 bits) using <a href="https://numpy.org/" target="_blank" rel="noopener">NumPy</a>. The gap illustrates how entropy quantifies uncertainty: the more spread out the probabilities, the higher the entropy.</div>
 <h3>Cross-Entropy</h3>
 
-<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
+<p><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a></strong> measures how well a predicted distribution <span class="math">$Q$</span> matches a true distribution <span class="math">$P$</span>:</p>
 
 <div class="math-block">
 $$H(P, Q) = - \Sigma P(x) \cdot \log Q(x)$$
 
 </div>
 
-<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
+<p>This is the standard loss function for training language models. The "true distribution" <span class="math">$P$</span> is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution <span class="math">$Q$</span> is the model's <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.</p>
 
 <div class="callout key-insight">
 <div class="callout-title">Key Insight: Cross-Entropy and <a class="cross-ref" href="../appendix-b-ml-essentials/section-b.4.html">Perplexity</a></div>

appendices/appendix-a-mathematical-foundations/section-a.5.html

Lines changed: 4 additions & 4 deletions
@@ -34,12 +34,12 @@ <h1>A.5 Connecting the Pieces</h1>
 
 <ol>
 <li><strong>Embedding lookup</strong> converts token IDs to vectors (linear algebra).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-03-sequence-models-attention/section-03.3.html">Self-attention</a></strong> computes dot products between query and key vectors, applies <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">softmax</a> to get attention weights (probability), and takes a weighted sum of value vectors (linear algebra).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-03-sequence-models-attention/section-3.3.html">Self-attention</a></strong> computes dot products between query and key vectors, applies <a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">softmax</a> to get attention weights (probability), and takes a weighted sum of value vectors (linear algebra).</li>
 <li><strong>Feed-forward layers</strong> apply matrix multiplications followed by activation functions (linear algebra, calculus).</li>
 <li><strong>Output projection and softmax</strong> produce a probability distribution over the vocabulary (probability).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-04.1.html">Cross-entropy</a> loss</strong> compares the predicted distribution to the true next token (information theory).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.2.html">Backpropagation</a></strong> computes gradients of the loss with respect to every weight (calculus, chain rule).</li>
-<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Gradient descent</a></strong> updates the weights to reduce the loss (calculus, optimization).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-04-transformer-architecture/section-4.1.html">Cross-entropy</a> loss</strong> compares the predicted distribution to the true next token (information theory).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">Backpropagation</a></strong> computes gradients of the loss with respect to every weight (calculus, chain rule).</li>
+<li><strong><a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Gradient descent</a></strong> updates the weights to reduce the loss (calculus, optimization).</li>
 </ol>
 
 <div class="callout big-picture">

appendices/appendix-b-ml-essentials/index.html

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ <h1>Machine Learning Essentials</h1>
 <p>This reference is most valuable for practitioners who understand the material but benefit from a single page of formulas and tables rather than prose. Graduate students entering LLM work from adjacent fields (vision, speech, classical NLP) will find it a useful orientation to the conventions used throughout the book.</p>
 </div>
 
-<p>This appendix maps directly onto <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. The evaluation metrics in Section B.4 also connect to the model assessment framework in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-metrics/index.html">Chapter 29 (Evaluation)</a>. Loss functions and optimization methods reappear throughout fine-tuning in <a class="cross-ref" href="../../part-4-training-adapting/module-14-fine-tuning-fundamentals/index.html">Chapter 14</a>.</p>
+<p>This appendix maps directly onto <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/index.html">Chapter 0 (ML and PyTorch Foundations)</a>. The evaluation metrics in Section B.4 also connect to the model assessment framework in <a class="cross-ref" href="../../part-8-evaluation-production/module-29-evaluation-observability/index.html">Chapter 29 (Evaluation)</a>. Loss functions and optimization methods reappear throughout fine-tuning in <a class="cross-ref" href="../../part-4-training-adapting/module-14-fine-tuning-fundamentals/index.html">Chapter 14</a>.</p>
 
 <div class="callout note">
 <div class="callout-title">Prerequisites</div>

appendices/appendix-b-ml-essentials/section-b.1.html

Lines changed: 4 additions & 4 deletions
@@ -22,7 +22,7 @@ <h1>B.1 Learning Paradigms</h1>
 
 <div class="callout note">
 <div class="callout-title">Covered in Detail</div>
-<p>For a comprehensive treatment of supervised learning, classification, and regression, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Section 0.1: ML Basics: Features, Optimization &amp; Generalization</a>. For a full introduction to reinforcement learning, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.4.html">Section 0.4: Reinforcement Learning Foundations</a>.</p>
+<p>For a comprehensive treatment of supervised learning, classification, and regression, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Section 0.1: ML Basics: Features, Optimization &amp; Generalization</a>. For a full introduction to reinforcement learning, see <a class="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.4.html">Section 0.4: Reinforcement Learning Foundations</a>.</p>
 </div>
 
 <p>This page provides a concise at-a-glance summary of the three learning paradigms. Use it as a quick reminder; refer to the main chapters for explanations, examples, and code.</p>
@@ -40,19 +40,19 @@ <h1>B.1 Learning Paradigms</h1>
 <td><strong>Supervised</strong></td>
 <td>Labeled input-output pairs</td>
 <td>Supervised fine-tuning (SFT)</td>
-<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.1.html">Section 0.1</a></td>
+<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html">Section 0.1</a></td>
 </tr>
 <tr>
 <td><strong>Self-supervised</strong></td>
 <td>Labels derived from data structure (next-token prediction)</td>
 <td>Pretraining</td>
-<td><a href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/section-06.1.html">Section 6.1</a></td>
+<td><a href="../../part-2-understanding-llms/module-06-pretraining-scaling-laws/section-6.1.html">Section 6.1</a></td>
 </tr>
 <tr>
 <td><strong>Reinforcement</strong></td>
 <td>Reward signal from environment or human feedback</td>
 <td>RLHF / DPO alignment</td>
-<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-00.4.html">Section 0.4</a>, <a href="../../part-4-training-adapting/module-17-alignment-rlhf-dpo/section-17.1.html">Section 16.1</a></td>
+<td><a href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.4.html">Section 0.4</a>, <a href="../../part-4-training-adapting/module-17-alignment-rlhf-dpo/section-17.1.html">Section 16.1</a></td>
 </tr>
 </table>
 </div>
