ContextLab
diff --git a/notes/2026-02-12-week7-diffusion-rewrite.md b/notes/2026-02-12-week7-diffusion-rewrite.md
Lines changed: 34 additions & 0 deletions
diff --git a/slides/week7/lecture22.html b/slides/week7/lecture22.html
Lines changed: 40 additions & 13 deletions
diff --git a/slides/week7/lecture22.md b/slides/week7/lecture22.md
Lines changed: 42 additions & 5 deletions
diff --git a/slides/week7/lecture22.pdf b/slides/week7/lecture22.pdf
102 KB
diff --git a/slides/week7/lecture23.html b/slides/week7/lecture23.html
Lines changed: 28 additions & 12 deletions
@@ -58,6 +58,40 @@ Commit: `721d55a` — pushed to main.
 6. Run tests
 7. Commit and push
 
+### Session 2026-02-15 (continued, session 3):
+- Updated lecture 21 with multiple user-requested edits:
+  - MMLU/GSM8K hyperlinks verified and added (LLaDA slide)
+  - RLHF/DPO/KV-caching "Definitions" box added (Current limitations slide)
+  - Take-home messages slide populated (third bullet about multimodal models)
+  - New "Controlling generation" slide added (prompting, infilling, prefix completion — 3 flow diagrams)
+- Committed submodule changes (final-project-llm-course at f97babc)
+- Cleaned up arrow SVG references (slides/week7/images/ deleted)
+- All 1500+ tests pass
+
+### Session 2026-02-16 (Lectures 22 & 23 style update):
+- **Lecture 22** style updates:
+  - Added "Remember..." definition-box for CLIP on text conditioning slide
+  - Added "Remember..." definition-box for ODE/SDE on flow matching slide
+  - Added FID Wikipedia hyperlink on DiT slide
+  - Added He et al. (2016) hyperlink on adaLN-Zero slide
+  - Updated Liu et al. rectified flow to note ICLR 2023
+  - Added "Take-home messages" slide (Think about it... note-box)
+  - All 6 references verified accurate (via librarian agent)
+  - Fixed overflow: removed VAE/U-Net Remember box from pixel problem slide (content already explained on next slide), split flow matching into 2 slides (definition + comparison table), added scale classes
+- **Lecture 23** style updates + reference fixes:
+  - Fixed Sahoo et al. venue: "arXiv" → "NeurIPS 2024"
+  - Fixed survey citation: "Fei et al." → "Yang et al. (2024, ACM Computing Surveys)"
+  - Added deepfake stat citations (Sensity AI 2019, Sumsub 2023)
+  - Fixed Thomson Reuters v. Ross: "Settled 2024" → "Ruled 2025"
+  - Added "Remember..." definition-box for CLIP + T5 on text-to-image slide
+  - Added mel-spectrogram Wikipedia link and vocoder inline definition on audio slide
+  - Added Stable Diffusion hyperlink to Rombach et al.
+  - Added C2PA and LAION-5B hyperlinks
+  - Added "Take-home messages" slide
+- Both lectures recompiled to HTML + PDF
+- Visual verification via Playwright: all key slides render without overflow
+- Screenshots cleaned up, HTTP server killed
+
 ## Key Constraints
 - Assignment announcements ONLY in lecture21 (not 22/23)
 - Companion notebook referenced ONLY in lecture23
 
@@ -29,6 +29,8 @@ Winter 2026
 
 ---
 
+<!-- _class: scale-85 -->
+
 # The pixel problem
 
 <div class="warning-box" data-title="Why running diffusion in pixel space is expensive">
@@ -112,6 +114,12 @@ To generate images from text prompts, latent diffusion adds **cross-attention**
 
 </div>
 
+<div class="definition-box" data-title="Remember...">
+
+- **[CLIP](https://arxiv.org/abs/2103.00020)** (Contrastive Language-Image Pre-training; [Radford et al., 2021](https://arxiv.org/abs/2103.00020)): A model trained on 400M image-text pairs to learn a **shared embedding space** where images and their captions are nearby. Used throughout diffusion systems as the text encoder that "understands" prompts.
+
+</div>
+
 <div class="example-box" data-title="How 'a cat wearing a hat' becomes an image">
 
 The word "cat" activates high attention weights in the spatial region where the cat is being generated. The word "hat" activates attention weights near the top of the cat region. This spatial-linguistic binding is learned entirely from image-caption pairs during training.
@@ -177,7 +185,7 @@ The [Diffusion Transformer (DiT)](https://arxiv.org/abs/2212.09748) replaces the
 
 <div class="important-box" data-title="Why replace U-Net?">
 
-Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art FID of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.
+Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art [FID](https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance) of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.
 
 </div>
 
@@ -197,7 +205,7 @@ DiT conditions on timestep and class label using **adaptive Layer Normalization
 
 <div class="note-box" data-title="Why 'Zero'?">
 
-Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning (He et al., 2016).
+Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning ([He et al., 2016](https://arxiv.org/abs/1512.03385)).
 
 </div>
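Review note: the zero-gate claim in the "Why 'Zero'?" box can be checked with a small sketch. It assumes a scalar gate `alpha` and a hypothetical `toy_sublayer` standing in for the attention/MLP branch; the real DiT block is more involved, so this only illustrates the identity-at-initialization idea.

```python
def gated_residual_block(x, alpha, sublayer):
    """Return x + alpha * sublayer(x), elementwise over a list of floats."""
    update = sublayer(x)
    return [xi + alpha * ui for xi, ui in zip(x, update)]


def toy_sublayer(x):
    # Hypothetical stand-in for the attention/MLP branch; any function works.
    return [3.0 * xi + 1.0 for xi in x]


x = [0.5, -1.25, 2.0]

# With alpha = 0 the branch contributes nothing, so the block is the identity.
at_init = gated_residual_block(x, alpha=0.0, sublayer=toy_sublayer)

# Once training moves alpha away from 0, the branch starts to change x.
after_training = gated_residual_block(x, alpha=0.1, sublayer=toy_sublayer)
```

Here `at_init` equals `x` regardless of what `toy_sublayer` computes, which is why deep stacks of such blocks start out as a pass-through.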
 
@@ -215,7 +223,18 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
 
 </div>
 
-<div class="note-box" data-title="Key differences from DDPM">
+<div class="definition-box" data-title="Remember...">
+
+- **[ODE](https://en.wikipedia.org/wiki/Ordinary_differential_equation)** (Ordinary Differential Equation): An equation describing how a quantity changes over time via a deterministic rule — given the current state, the next state is fully determined
+- **[SDE](https://en.wikipedia.org/wiki/Stochastic_differential_equation)** (Stochastic Differential Equation): Like an ODE but with a random noise term — the path from noise to data has some randomness at each step
+
+</div>
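Review note: the ODE/SDE distinction defined above could be illustrated with a minimal sketch like the following. The drift function, noise level `sigma`, and step count are hypothetical choices for illustration, not anything taken from the lecture.

```python
import math
import random


def euler_step(x, t, dt, drift):
    """One deterministic ODE step: the next state depends only on (x, t)."""
    return x + drift(x, t) * dt


def euler_maruyama_step(x, t, dt, drift, sigma, rng):
    """One SDE step: the same drift plus Gaussian noise scaled by sqrt(dt)."""
    noise = rng.gauss(0.0, 1.0)
    return x + drift(x, t) * dt + sigma * math.sqrt(dt) * noise


def toy_drift(x, t):
    # Hypothetical drift that pulls x toward zero.
    return -x


def integrate(step, x0, n_steps, **kwargs):
    """Apply `step` n_steps times over the interval [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = step(x, i * dt, dt, **kwargs)
    return x


# Two ODE runs from the same start give the same answer.
ode_a = integrate(euler_step, 1.0, 100, drift=toy_drift)
ode_b = integrate(euler_step, 1.0, 100, drift=toy_drift)

# Two SDE runs with different random seeds follow different paths.
sde_a = integrate(euler_maruyama_step, 1.0, 100,
                  drift=toy_drift, sigma=0.5, rng=random.Random(0))
sde_b = integrate(euler_maruyama_step, 1.0, 100,
                  drift=toy_drift, sigma=0.5, rng=random.Random(1))
```

The only difference between the two step functions is the noise term, which matches the definitions in the box: same rule, with or without randomness at each step.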
+
+---
+
+# Flow matching vs DDPM
+
+<div class="note-box" data-title="Key differences">
 
 | | DDPM | Flow matching |
 |---|---|---|
@@ -226,13 +245,19 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
 
 </div>
 
+<div class="tip-box" data-title="The intuition">
+
+Flow matching asks: "What's the simplest path from noise to data?" Instead of designing a complex noise schedule and learning to reverse it, we define a straight interpolation and learn the velocity field that moves along it. The math is simpler, training tends to be more stable, and generation typically needs fewer sampling steps.
+
+</div>
+
 ---
 
 # Rectified flow
 
 <div class="definition-box" data-title="Straight paths from noise to data">
 
-**Rectified flow** ([Liu et al., 2023](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
+**Rectified flow** ([Liu et al., 2023, ICLR](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
 
 $$\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$$
 
@@ -247,7 +272,7 @@ Straight paths are the shortest paths between noise and data. Since they don't c
 </div>
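Review note: a small sketch of the straight-path idea, using the slide's interpolation $\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$. The velocity target `eps - x0` is simply the time derivative of that formula; the toy data point and the use of the exact velocity (rather than a learned $v_\theta$) are assumptions for illustration.

```python
import random


def interpolate(x0, eps, t):
    """x_t = (1 - t) * x0 + t * eps, matching the slide's formula."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def target_velocity(x0, eps):
    """d(x_t)/dt for the straight path: constant in t, equal to eps - x0."""
    return [b - a for a, b in zip(x0, eps)]


rng = random.Random(0)
x0 = [0.2, -0.7, 1.5]                     # toy "data" point (hypothetical)
eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise sample
v = target_velocity(x0, eps)

# Start at pure noise (t = 1) and take ONE Euler step of size 1 back to t = 0.
x_start = interpolate(x0, eps, 1.0)
x_recovered = [x - 1.0 * vi for x, vi in zip(x_start, v)]
```

Because the velocity is constant along a straight line, a single Euler step with the exact velocity lands back on `x0` (up to floating-point error). A trained network only approximates this velocity, so real samplers still use several steps.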
 
 ---
-<!-- _class: scale-90 -->
+<!-- _class: scale-85 -->
 
 # Stable Diffusion 3: putting it all together
 
@@ -312,6 +337,18 @@ The field progresses by **composing** innovations, not replacing them. Each exte
 
 </div>
 
+---
+
+# Take-home messages
+
+<div class="note-box" data-title="Think about it...">
+
+- The key bottleneck in high-resolution generation wasn't the diffusion process itself — it was **where** you run it. Compressing to latent space (via a VAE) made consumer-GPU generation possible.
+- Classifier-free guidance shows that **controlling** generation is as important as generation itself — and the trick is surprisingly simple: learn what the conditional and unconditional outputs look like, then amplify the difference.
+- The field evolves by **composing** innovations (latent space + CFG + DiT + flow matching), not replacing them. Each addresses one specific limitation.
+
+</div>
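Review note: the "amplify the difference" description of classifier-free guidance can be sketched as below. The function name, the toy prediction vectors, and the convention that a scale of 1 returns the plain conditional prediction are assumptions; other parameterizations of the guidance weight are also in use.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: uncond + w * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.40, 0.25]   # hypothetical unconditional prediction
eps_cond = [0.30, -0.10, 0.05]     # hypothetical prompt-conditioned prediction

# Scale 1 reproduces the conditional prediction (no extra guidance).
plain = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=1.0)

# A larger scale pushes further along the (cond - uncond) direction.
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
```

Under this convention a scale of 0 ignores the prompt entirely, and scales above 1 exaggerate whatever the prompt changes about the prediction.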
+
 ---
 <!-- _class: scale-85 -->
 
 
@@ -755,6 +755,12 @@ <h1 id="text-to-image-the-big-picture">Text-to-image: the big picture</h1>
 </tbody>
 </table>
 </div>
+<div class="definition-box" data-title="Remember...">
+<ul>
+<li><strong><a href="https://arxiv.org/abs/2103.00020">CLIP</a></strong>: Contrastive Language-Image Pre-training (<a href="https://arxiv.org/abs/2103.00020">Radford et al., 2021</a>) — learns a shared embedding space for images and text, used as the text encoder in DALL-E 2 and Stable Diffusion (see Lecture 22)</li>
+<li><strong><a href="https://arxiv.org/abs/1910.10683">T5</a></strong>: A text-to-text transformer (Google, 2020). Imagen uses the frozen encoder of the largest variant, T5-XXL (about 4.6B encoder parameters), as its text encoder</li>
+</ul>
+</div>
 <div class="important-box" data-title="The key question">
 <p>Which component matters more — the language understanding (text encoder) or the image generation (diffusion model)? Imagen's surprising finding: <strong>scaling the text encoder helps more than scaling the diffusion model</strong>.</p>
 </div>
@@ -833,7 +839,7 @@ <h1 id="imagen">Imagen</h1>
 </foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="7" data-class="scale-78" data-theme="cdl-theme" lang="C" class="scale-78" style="--class:scale-78;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="stable-diffusion">Stable Diffusion</h1>
 <div class="note-box" data-title="Open-source democratization">
-<p>Stable Diffusion (Rombach et al., 2022) is the open-source implementation of latent diffusion (Lecture 22):</p>
+<p><a href="https://arxiv.org/abs/2112.10752">Stable Diffusion</a> (<a href="https://arxiv.org/abs/2112.10752">Rombach et al., 2022</a>) is the open-source implementation of latent diffusion (Lecture 22):</p>
 <table>
 <thead>
 <tr>
@@ -852,7 +858,7 @@ <h1 id="stable-diffusion">Stable Diffusion</h1>
 </tr>
 <tr>
 <td>Training data</td>
-<td>LAION-5B (5 billion image-text pairs)</td>
+<td><a href="https://laion.ai/blog/laion-5b/">LAION-5B</a> (5 billion image-text pairs)</td>
 </tr>
 <tr>
 <td>Parameters</td>
@@ -899,9 +905,9 @@ <h1 id="text-to-audio">Text-to-audio</h1>
 <div class="definition-box" data-title="Diffusion in the spectral domain">
 <p>Audio generation applies diffusion to <strong>spectrograms</strong> (time-frequency representations of sound):</p>
 <ol>
-<li>Convert audio to a mel-spectrogram</li>
+<li>Convert audio to a <a href="https://en.wikipedia.org/wiki/Mel-frequency_cepstrum">mel-spectrogram</a> (a visual representation of sound frequencies over time, weighted to match human hearing)</li>
 <li>Run diffusion in spectrogram space (or a latent compression of it)</li>
-<li>Convert the generated spectrogram back to audio using a vocoder</li>
+<li>Convert the generated spectrogram back to audio using a <strong>vocoder</strong> (typically a neural network that reconstructs audio waveforms from spectrograms)</li>
 </ol>
 </div>
 <div class="note-box" data-title="Notable systems">
@@ -941,7 +947,7 @@ <h1 id="text-to-audio">Text-to-audio</h1>
 </section>
 </foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="10" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="discrete-diffusion-for-text">Discrete diffusion for text</h1>
-<div class="definition-box" data-title="Sahoo et al. (2024): Masked Diffusion Language Models (MDLM)">
+<div class="definition-box" data-title="Sahoo et al. (2024, NeurIPS): Masked Diffusion Language Models (MDLM)">
 <p>Standard diffusion adds Gaussian noise to continuous data. For discrete data like text, <a href="https://arxiv.org/abs/2406.07524">MDLM</a> replaces &quot;adding noise&quot; with <strong>masking tokens</strong>:</p>
 <ul>
 <li><strong>Forward process</strong>: Randomly replace tokens with [MASK], increasing the masking rate over time</li>
@@ -1017,7 +1023,7 @@ <h1 id="ethics-deepfakes-and-consent">Ethics: deepfakes and consent</h1>
 </ul>
 </div>
 <div class="important-box" data-title="Scale of the problem">
-<p>A 2023 report found that <strong>96% of deepfake videos online are non-consensual intimate imagery</strong>, and the number of deepfake videos doubled every 6 months from 2018 to 2023. The democratization of generation tools has outpaced legal and technical protections.</p>
+<p>A <a href="https://sensity.ai/blog/deepfake-detection/mapping-the-deepfake-landscape/">2019 Sensity AI (Deeptrace) report</a> found that <strong>96% of deepfake videos online were non-consensual intimate imagery</strong>, and that the number of deepfake videos had doubled in just 9 months. By 2023, <a href="https://sumsub.com/blog/deepfake-statistics/">Sumsub reported</a> a 10× increase in detected deepfakes year-over-year. The democratization of generation tools has far outpaced legal and technical protections.</p>
 </div>
 </section>
 </foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="14" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
@@ -1064,7 +1070,7 @@ <h1 id="ethics-copyright-and-training-data">Ethics: copyright and training data<
 </tr>
 <tr>
 <td>Thomson Reuters v. Ross</td>
-<td>Settled 2024</td>
+<td>Ruled 2025</td>
 <td>Training on proprietary legal database</td>
 </tr>
 </tbody>
@@ -1079,7 +1085,7 @@ <h1 id="ethics-regulation-and-provenance">Ethics: regulation and provenance</h1>
 <div class="definition-box" data-title="Emerging regulatory frameworks">
 <ul>
 <li><strong>EU AI Act (2024)</strong>: Requires labeling of AI-generated content, transparency about training data, risk classification for generative systems</li>
-<li><strong>C2PA (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — &quot;nutrition labels&quot; for digital content</li>
+<li><strong><a href="https://c2pa.org/">C2PA</a> (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — &quot;nutrition labels&quot; for digital content</li>
 <li><strong>US Executive Order (Oct 2023)</strong>: Requires watermarking of AI-generated content from government contractors</li>
 <li><strong>China's deep synthesis regulations (2023)</strong>: Mandatory labeling and registration of deepfake services</li>
 </ul>
@@ -1110,17 +1116,27 @@ <h1 id="discussion">Discussion</h1>
 </ol>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-class="scale-85" data-theme="cdl-theme" lang="C" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+<h1 id="take-home-messages">Take-home messages</h1>
+<div class="note-box" data-title="Think about it...">
+<ul>
+<li>The same diffusion framework scales across modalities — images (DALL-E 2, Stable Diffusion), video (Sora), audio, and text (MDLM) — suggesting <strong>iterative refinement from noise</strong> is a general-purpose generation principle.</li>
+<li>Imagen's key finding — that <strong>scaling the text encoder matters more than scaling the image generator</strong> — reveals that understanding the prompt is the bottleneck, not producing pixels. Language models are central even in vision.</li>
+<li>The power of open-source: Stable Diffusion's release enabled an explosion of community innovation (ControlNet, LoRA, inpainting) that no closed model could match — but also democratized the tools for deepfakes and misuse.</li>
+</ul>
+</div>
+</section>
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-class="scale-85" data-theme="cdl-theme" lang="C" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="further-reading">Further reading</h1>
 <div class="note-box" data-title="Further reading">
 <p><a href="https://arxiv.org/abs/2204.06125"><strong>Ramesh et al. (2022, <em>arXiv</em>)</strong></a> &quot;Hierarchical Text-Conditional Image Generation with CLIP Latents&quot; — DALL-E 2: CLIP prior + diffusion decoder.</p>
 <p><a href="https://arxiv.org/abs/2205.11487"><strong>Saharia et al. (2022, <em>NeurIPS</em>)</strong></a> &quot;Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding&quot; — Imagen: scaling text encoders matters most.</p>
 <p><a href="https://openai.com/research/video-generation-models-as-world-simulators"><strong>OpenAI (2024)</strong></a> &quot;Video Generation Models as World Simulators&quot; — Sora: spacetime patches and emergent physics.</p>
-<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>arXiv</em>)</strong></a> &quot;Simple and Effective Masked Diffusion Language Models&quot; — MDLM: bridging BERT and diffusion for text.</p>
-<p><a href="https://arxiv.org/abs/2409.00587"><strong>Fei et al. (2024, <em>arXiv</em>)</strong></a> &quot;A Comprehensive Survey on Diffusion Models and Their Applications&quot; — Broad overview of diffusion across modalities.</p>
+<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>NeurIPS</em>)</strong></a> &quot;Simple and Effective Masked Diffusion Language Models&quot; — MDLM: bridging BERT and diffusion for text.</p>
+<p><a href="https://arxiv.org/abs/2409.00587"><strong>Yang et al. (2024, <em>ACM Computing Surveys</em>)</strong></a> &quot;Diffusion Models: A Comprehensive Survey of Methods and Applications&quot; — Broad overview of diffusion across modalities.</p>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="20" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="questions">Questions?</h1>
 <div class="emoji-figure">
   <div class="emoji-col">