
Commit 1bd7ae2

apartsinclaude committed
Rename algorithm captions from Code Fragment to Pseudocode across 18 files
Algorithm callout boxes should use "Pseudocode X.Y.Z:" prefix instead of "Code Fragment X.Y.Z:" since they contain pseudocode, not runnable code. Also fix section 23.1 algorithm caption manually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f56469d commit 1bd7ae2

18 files changed

Lines changed: 24 additions & 24 deletions


part-1-foundations/module-02-tokenization-subword-models/section-2.2.html

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ <h3>The BPE Merge Algorithm</h3>
 trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
 tokenizer.train(files=["corpus.txt"], trainer=trainer)
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 2.2.12:</strong> Pseudocode for the BPE training algorithm. Starting from individual characters, the algorithm repeatedly merges the most frequent adjacent pair until the target vocabulary size is reached, recording each merge in an ordered table used at inference time.</div>
+<div class="code-caption"><strong>Pseudocode 2.2.12:</strong> Pseudocode for the BPE training algorithm. Starting from individual characters, the algorithm repeatedly merges the most frequent adjacent pair until the target vocabulary size is reached, recording each merge in an ordered table used at inference time.</div>
 </div>

 <!-- DIAGRAM 1: BPE merge tree -->
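
For readers who want to see the merge loop this caption describes end to end, here is a minimal, self-contained sketch of BPE training in plain Python. The helper structure and the toy word list are illustrative assumptions, not the book's tokenizers-library example shown above.

from collections import Counter

def train_bpe(corpus_words, vocab_size):
    # Start with each word as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus_words)
    vocab = {c for w in words for c in w}
    merges = []  # ordered merge table, replayed in the same order at inference time
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the winning merge everywhere it occurs.
        new_words = Counter()
        for w, freq in words.items():
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges

# Toy corpus: after a few merges, frequent substrings like "er" become single tokens.
vocab, merges = train_bpe(["lower", "lowest", "newer", "wider"], vocab_size=20)
print(merges[:5])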

part-1-foundations/module-04-transformer-architecture/section-4.4.html

Lines changed: 1 addition & 1 deletion
@@ -383,7 +383,7 @@ <h3>Online Softmax</h3>
 )
 return output
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 4.4.6:</strong> The FlashAttention tiling algorithm in pseudocode. By processing Q, K, and V in SRAM-sized blocks and rescaling partial softmax accumulators on the fly, it computes exact attention while reducing HBM reads from quadratic to linear in sequence length.</div>
+<div class="code-caption"><strong>Pseudocode 4.4.6:</strong> The FlashAttention tiling algorithm in pseudocode. By processing Q, K, and V in SRAM-sized blocks and rescaling partial softmax accumulators on the fly, it computes exact attention while reducing HBM reads from quadratic to linear in sequence length.</div>
 </div>

 <p>
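
The caption's key mechanism, rescaling partial softmax accumulators as new key/value blocks arrive, can be checked numerically in a few lines. This is a NumPy sketch for a single query vector, with an assumed block size and random test data; it illustrates the online-softmax bookkeeping only, not the full tiled kernel.

import numpy as np

def online_softmax_attention_row(q, K, V, block_kv=64):
    # Process K/V in blocks, keeping a running max (m), denominator (l), and value accumulator.
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block_kv):
        Kb, Vb = K[start:start + block_kv], V[start:start + block_kv]
        s = Kb @ q / np.sqrt(d)              # attention scores for this block
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)            # rescale what was accumulated under the old max
        p = np.exp(s - m_new)                # unnormalized probabilities for this block
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l                           # exact softmax(q K^T / sqrt(d)) @ V for this row

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(256, 16)), rng.normal(size=(256, 8))
s = K @ q / np.sqrt(16)
exact = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_softmax_attention_row(q, K, V), exact)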

part-1-foundations/module-05-decoding-text-generation/section-5.1.html

Lines changed: 1 addition & 1 deletion
@@ -387,7 +387,7 @@ <h3>Beam Search Step by Step</h3>
 print("Beam-5:", tokenizer.decode(beam_out[0]))</code></pre>
 <div class="code-output">Greedy: The future of artificial intelligence is not just about the ability to create machines that can think and act like humans.
 Beam-5: The future of artificial intelligence is a topic that has been discussed for decades, and it is one that has been</div>
-<div class="code-caption"><strong>Code Fragment 5.1.2:</strong> Each beam: (sequence_tensor, cumulative_log_prob).</div>
+<div class="code-caption"><strong>Pseudocode 5.1.2:</strong> Each beam: (sequence_tensor, cumulative_log_prob).</div>
 </div>

 <div class="callout tip">

part-10-frontiers/module-34-emerging-architectures/section-34.3.html

Lines changed: 1 addition & 1 deletion
@@ -305,7 +305,7 @@ <h3>2.2 Mamba: Selective State Spaces</h3>
 g. y<sub>t</sub> = C<sub>t</sub> &middot; h <span class="algo-line-comment">// output from state</span>
 <span class="algo-line-keyword">return</span> y
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 34.3.5:</strong> The Mamba selective scan algorithm, showing how input-dependent parameters (B, C, and the step size delta) are computed at each timestep and used to update a compressed hidden state. This input-dependent gating is what distinguishes selective SSMs from their linear, time-invariant predecessors.</div>
+<div class="code-caption"><strong>Pseudocode 34.3.5:</strong> The Mamba selective scan algorithm, showing how input-dependent parameters (B, C, and the step size delta) are computed at each timestep and used to update a compressed hidden state. This input-dependent gating is what distinguishes selective SSMs from their linear, time-invariant predecessors.</div>
 </div>

 <p>
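
A single-channel toy version of the scan described in the caption can be written directly from the recurrence: delta, B, and C are recomputed from the input at every step and gate how much of x_t enters the hidden state. The diagonal A, the softplus for delta, and the scalar input channel are simplifying assumptions for illustration, not Mamba's actual parameterization.

import numpy as np

def selective_scan(x, A, w_B, w_C, w_delta):
    # x: (seq_len,) scalar input channel; A: (d_state,) diagonal, negative for stability.
    h = np.zeros_like(A)                          # compressed hidden state
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))   # input-dependent step size (softplus)
        B_t = w_B * x_t                           # input-dependent input projection
        C_t = w_C * x_t                           # input-dependent output projection
        A_bar = np.exp(delta * A)                 # discretized state transition
        h = A_bar * h + delta * B_t * x_t         # state update, gated by the current input
        ys.append(float(C_t @ h))                 # read the output from the state
    return np.array(ys)

y = selective_scan(np.sin(np.linspace(0, 3, 32)),
                   A=-np.ones(4), w_B=np.ones(4), w_C=np.ones(4), w_delta=1.0)
print(y.shape)  # (32,)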

part-10-frontiers/module-35-ai-society/section-35.1.html

Lines changed: 1 addition & 1 deletion
@@ -122,7 +122,7 @@ <h3>AI Safety via Debate</h3>
 <span class="algo-line-comment">// because the opponent can expose any false claim</span>
 7. <span class="algo-line-keyword">return</span> (verdict, confidence)
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 35.1.1:</strong> The AI Safety via Debate algorithm, where two adversarial models argue opposing sides and a human judge evaluates the transcript. The Nash equilibrium property ensures that truthful argumentation is the dominant strategy, because any false claim can be exposed by the opponent.</div>
+<div class="code-caption"><strong>Pseudocode 35.1.1:</strong> The AI Safety via Debate algorithm, where two adversarial models argue opposing sides and a human judge evaluates the transcript. The Nash equilibrium property ensures that truthful argumentation is the dominant strategy, because any false claim can be exposed by the opponent.</div>
 </div>

 <h3>Recursive Reward Modeling</h3>
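
The debate protocol in the caption reduces to a short loop once the models and the judge are treated as opaque callables. debater_a, debater_b, and judge below are placeholders for model or human calls, not a real API.

def run_debate(question, debater_a, debater_b, judge, n_rounds=4):
    transcript = [f"Question: {question}"]
    for r in range(n_rounds):
        # Debaters alternate; each sees the full transcript, so any false claim can be attacked.
        speaker = debater_a if r % 2 == 0 else debater_b
        transcript.append(speaker("\n".join(transcript)))
    verdict, confidence = judge("\n".join(transcript))  # judge evaluates the whole exchange
    return verdict, confidence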

part-2-understanding-llms/module-07-modern-llm-landscape/section-7.3.html

Lines changed: 1 addition & 1 deletion
@@ -423,7 +423,7 @@ <h3>4.1 The Mechanics</h3>
 # Very hard: tree search with large model
 return mcts_search(hard_model, reward_model, problem,
 n_iterations=200)</code></pre>
-<div class="code-caption"><strong>Code Fragment 7.3.3:</strong> Best-of-N sampling with reward model scoring.</div>
+<div class="code-caption"><strong>Pseudocode 7.3.3:</strong> Best-of-N sampling with reward model scoring.</div>
 </div>

 <p>Code Fragment 7.3.3 below puts this into practice.</p>
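
The caption summarizes Pseudocode 7.3.3 as Best-of-N sampling with reward-model scoring; a minimal form of that idea, with generate and reward_model as placeholder callables rather than a specific library interface, looks like this.

def best_of_n(prompt, generate, reward_model, n=8):
    candidates = [generate(prompt) for _ in range(n)]        # sample N candidate answers
    scores = [reward_model(prompt, c) for c in candidates]   # score each with the reward model
    best = max(range(n), key=lambda i: scores[i])            # keep the highest-scoring candidate
    return candidates[best], scores[best]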

part-2-understanding-llms/module-08-reasoning-test-time-compute/section-8.3.html

Lines changed: 2 additions & 2 deletions
@@ -102,7 +102,7 @@ <h3>1.1 The RLVR Training Loop</h3>
 5. Add KL penalty: L_total = mean(L_i) - beta * KL(pi || pi_ref)
 6. Update pi by gradient ascent on L_total
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 8.3.4:</strong> The RLVR training loop generates solutions, scores them with an automatic verifier, and updates the policy using the reward signal. Because verification is fully automatic, this loop scales to millions of training examples without human annotators.</div>
+<div class="code-caption"><strong>Pseudocode 8.3.4:</strong> The RLVR training loop generates solutions, scores them with an automatic verifier, and updates the policy using the reward signal. Because verification is fully automatic, this loop scales to millions of training examples without human annotators.</div>
 </div>

 <p>The power of RLVR lies in the verifier: because correctness checking is automatic, the system can generate millions of training signals without human annotators. This enables RL training at a scale that would be impractical with human feedback.</p>
@@ -121,7 +121,7 @@ <h3>2.1 How GRPO Works</h3>

 <div class="callout algorithm">
 <div class="callout-title">Algorithm: Group Relative Policy Optimization (GRPO)</div>
-<div class="code-caption"><strong>Code Fragment 8.3.3:</strong> GRPO computes advantages by normalizing rewards within a group of sampled solutions, eliminating the need for a separate critic model. This halves GPU memory compared to PPO while preserving stable policy updates through ratio clipping and a KL penalty.</div>
+<div class="code-caption"><strong>Pseudocode 8.3.3:</strong> GRPO computes advantages by normalizing rewards within a group of sampled solutions, eliminating the need for a separate critic model. This halves GPU memory compared to PPO while preserving stable policy updates through ratio clipping and a KL penalty.</div>
 </div>

 <p>The key insight is in step 3: by normalizing rewards within each group, GRPO converts absolute rewards into relative comparisons. A solution that scores 1.0 in a group where all others also score 1.0 receives zero advantage (nothing to learn from), while the same score in a group of mostly failures receives high positive advantage. This eliminates the need for a separate critic model to estimate expected reward, halving the GPU memory requirement.</p>
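
The group normalization described in that paragraph is a one-liner worth seeing in code: within each group of sampled solutions, rewards are shifted and scaled so that a lone success stands out and a uniform group yields zero advantage everywhere. This is an illustrative sketch of that step only, not a full GRPO implementation.

import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize rewards within one group of G sampled solutions to the same prompt.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))  # lone success -> large positive advantage
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # uniform group -> all-zero advantages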

part-4-training-adapting/module-17-alignment-rlhf-dpo/section-17.1.html

Lines changed: 1 addition & 1 deletion
@@ -316,7 +316,7 @@ <h3>2.2 Stage 2: Reward Model Training</h3>
 Update V to minimize value prediction error
 3. <b>return</b> pi* (the aligned policy)
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 17.1.3:</strong> This pseudocode outlines the PPO-based RLHF training loop, which iterates over prompts to sample responses from the current policy pi_theta, scores them with reward model R, and updates the policy using clipped surrogate loss with a KL penalty weighted by beta against the reference policy pi_ref.</div>
+<div class="code-caption"><strong>Pseudocode 17.1.3:</strong> This pseudocode outlines the PPO-based RLHF training loop, which iterates over prompts to sample responses from the current policy pi_theta, scores them with reward model R, and updates the policy using clipped surrogate loss with a KL penalty weighted by beta against the reference policy pi_ref.</div>
 </div>

 <p>The KL penalty in step 2b is critical: without it, the policy can "game" the reward model by producing outputs that score highly but are incoherent or repetitive (a phenomenon called reward hacking). The KL term anchors the policy near the SFT distribution, preserving the model's language capabilities while steering its behavior. The following code demonstrates the PPO training loop with TRL.</p>
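
The per-token shape of that KL anchor is easy to sketch: each token of the sampled response is penalized in proportion to how far the policy's log-probability drifts from the frozen reference model, and the scalar reward-model score lands on the final token. The function below is an illustrative sketch under those common conventions, not the TRL code that the section goes on to show.

import numpy as np

def kl_penalized_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    # policy_logprobs / ref_logprobs: per-token log-probs of the sampled response tokens.
    kl_per_token = np.asarray(policy_logprobs) - np.asarray(ref_logprobs)
    rewards = -beta * kl_per_token          # penalize drift away from the reference at every token
    rewards[-1] += rm_score                 # reward-model score is credited at the final token
    return rewards

print(kl_penalized_rewards(0.8, [-1.2, -0.5, -2.0], [-1.0, -0.6, -1.5]))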

part-5-retrieval-conversation/module-20-rag/section-20.1.html

Lines changed: 1 addition & 1 deletion
@@ -111,7 +111,7 @@ <h3>1.1 The Core RAG Loop</h3>
 4. <b>Generate:</b> response = G(prompt) // LLM generates grounded answer
 5. <b>return</b> response (with source citations from docs)
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 20.1.1:</strong> This pseudocode describes the core RAG pipeline: encode query q with <a class="cross-ref" href="../module-19-embeddings-vector-db/section-19.1.html">embedding model</a> E, retrieve <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-05.2.html">top-k</a> passages from knowledge base KB by <a class="cross-ref" href="../module-19-embeddings-vector-db/section-19.1.html">cosine similarity</a>, concatenate them into a context string, and pass the augmented prompt to generator G. The output is a grounded response conditioned on retrieved evidence.</div>
+<div class="code-caption"><strong>Pseudocode 20.1.1:</strong> This pseudocode describes the core RAG pipeline: encode query q with <a class="cross-ref" href="../module-19-embeddings-vector-db/section-19.1.html">embedding model</a> E, retrieve <a class="cross-ref" href="../../part-1-foundations/module-05-decoding-text-generation/section-05.2.html">top-k</a> passages from knowledge base KB by <a class="cross-ref" href="../module-19-embeddings-vector-db/section-19.1.html">cosine similarity</a>, concatenate them into a context string, and pass the augmented prompt to generator G. The output is a grounded response conditioned on retrieved evidence.</div>
 </div>

 <p>
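
Collapsing the caption's five steps into code gives a compact retrieve-then-generate loop. embed, generate, and the in-memory arrays below are placeholders for a real embedding model, LLM, and vector store; the sketch shows the data flow, not a production pipeline.

import numpy as np

def rag_answer(query, kb_texts, kb_vectors, embed, generate, k=3):
    q = embed(query)                                          # 1. encode the query
    # 2. retrieve: cosine similarity between the query and every passage vector
    sims = kb_vectors @ q / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    context = "\n\n".join(kb_texts[i] for i in top)           # 3. augment the prompt with evidence
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt), [kb_texts[i] for i in top]       # 4-5. generate and return with sources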

part-6-agentic-ai/module-22-ai-agents/section-22.1.html

Lines changed: 1 addition & 1 deletion
@@ -346,7 +346,7 @@ <h2>3. The ReAct Framework</h2>
 e. Append (Thought, Action, Observation) to context
 3. <b>return</b> "Max steps reached without resolution"
 </code></pre>
-<div class="code-caption"><strong>Code Fragment 22.1.2:</strong> This pseudocode formalizes the ReAct agent loop: given a user task T, tool set, and LLM M, the agent iterates through Thought, Action, and Observation steps up to max_steps S. The loop terminates when the LLM emits a final_answer action or the step budget is exhausted, returning the accumulated trajectory.</div>
+<div class="code-caption"><strong>Pseudocode 22.1.2:</strong> This pseudocode formalizes the ReAct agent loop: given a user task T, tool set, and LLM M, the agent iterates through Thought, Action, and Observation steps up to max_steps S. The loop terminates when the LLM emits a final_answer action or the step budget is exhausted, returning the accumulated trajectory.</div>
 </div>

 <p>The key insight is that the explicit reasoning in step 2a (the "Thought") dramatically improves decision quality compared to acting without thinking or thinking without acting. Each thought provides a chain-of-reasoning that is also valuable for debugging when the agent makes mistakes.</p>
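
The trajectory that the caption describes can be driven by a loop of a dozen lines once parsing is simplified. Here llm is assumed to return a small dict with thought, action, and args keys, and tools maps action names to callables; both are illustrative stand-ins rather than a real agent framework.

def react_agent(task, llm, tools, max_steps=8):
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(context))               # assumed to return {"thought", "action", "args"}
        if step["action"] == "final_answer":
            return step["args"]                      # terminate when the model emits a final answer
        observation = tools[step["action"]](step["args"])   # execute the chosen tool
        context.append(f"Thought: {step['thought']}\n"
                       f"Action: {step['action']}({step['args']})\n"
                       f"Observation: {observation}")
    return "Max steps reached without resolution"    # step budget exhausted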
