
Commit a3fd2f4

Update lectures 22-23: add definition boxes, fix references, improve layout
Lecture 22: add Remember boxes (CLIP, ODE/SDE), hyperlinks (FID, He et al.), take-home messages slide, split flow matching slide to fix overflow, add scale classes for dense slides. All 6 references verified accurate. Lecture 23: fix Sahoo et al. venue (NeurIPS 2024), fix survey citation (Yang et al.), add deepfake stat citations (Sensity AI, Sumsub), fix Thomson Reuters v. Ross (Ruled 2025), add Remember box (CLIP/T5), add mel-spectrogram/C2PA/LAION-5B links, add take-home messages slide.
1 parent 7801e82 commit a3fd2f4

7 files changed

Lines changed: 173 additions & 40 deletions

File tree

notes/2026-02-12-week7-diffusion-rewrite.md

Lines changed: 34 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -58,6 +58,40 @@ Commit: `721d55a` — pushed to main.
5858
6. Run tests
5959
7. Commit and push
6060

61+
### Session 2026-02-15 (continued, session 3):
62+
- Updated lecture 21 with multiple user-requested edits:
63+
- MMLU/GSM8K hyperlinks verified and added (LLaDA slide)
64+
- RLHF/DPO/KV-caching "Definitions" box added (Current limitations slide)
65+
- Take-home messages slide populated (third bullet about multimodal models)
66+
- New "Controlling generation" slide added (prompting, infilling, prefix completion — 3 flow diagrams)
67+
- Committed submodule changes (final-project-llm-course at f97babc)
68+
- Cleaned up arrow SVG references (slides/week7/images/ deleted)
69+
- All 1500+ tests pass
70+
71+
### Session 2026-02-16 (Lectures 22 & 23 style update):
72+
- **Lecture 22** style updates:
73+
- Added "Remember..." definition-box for CLIP on text conditioning slide
74+
- Added "Remember..." definition-box for ODE/SDE on flow matching slide
75+
- Added FID Wikipedia hyperlink on DiT slide
76+
- Added He et al. (2016) hyperlink on adaLN-Zero slide
77+
- Updated Liu et al. rectified flow to note ICLR 2023
78+
- Added "Take-home messages" slide (Think about it... note-box)
79+
- All 6 references verified accurate (via librarian agent)
80+
- Fixed overflow: removed VAE/U-Net Remember box from pixel problem slide (content already explained on next slide), split flow matching into 2 slides (definition + comparison table), added scale classes
81+
- **Lecture 23** style updates + reference fixes:
82+
- Fixed Sahoo et al. venue: "arXiv" → "NeurIPS 2024"
83+
- Fixed survey citation: "Fei et al." → "Yang et al. (2024, ACM Computing Surveys)"
84+
- Added deepfake stat citations (Sensity AI 2019, Sumsub 2023)
85+
- Fixed Thomson Reuters v. Ross: "Settled 2024" → "Ruled 2025"
86+
- Added "Remember..." definition-box for CLIP + T5 on text-to-image slide
87+
- Added mel-spectrogram Wikipedia link and vocoder inline definition on audio slide
88+
- Added Stable Diffusion hyperlink to Rombach et al.
89+
- Added C2PA and LAION-5B hyperlinks
90+
- Added "Take-home messages" slide
91+
- Both lectures recompiled to HTML + PDF
92+
- Visual verification via Playwright: all key slides render without overflow
93+
- Screenshots cleaned up, HTTP server killed
94+
6195
## Key Constraints
6296
- Assignment announcements ONLY in lecture21 (not 22/23)
6397
- Companion notebook referenced ONLY in lecture23

slides/week7/lecture22.html

Lines changed: 40 additions & 13 deletions
Large diffs are not rendered by default.

slides/week7/lecture22.md

Lines changed: 42 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,8 @@ Winter 2026
2929

3030
---
3131

32+
<!-- _class: scale-85 -->
33+
3234
# The pixel problem
3335

3436
<div class="warning-box" data-title="Why running diffusion in pixel space is expensive">
@@ -112,6 +114,12 @@ To generate images from text prompts, latent diffusion adds **cross-attention**
112114

113115
</div>
114116

117+
<div class="definition-box" data-title="Remember...">
118+
119+
- **[CLIP](https://arxiv.org/abs/2103.00020)** (Contrastive Language-Image Pre-training; [Radford et al., 2021](https://arxiv.org/abs/2103.00020)): A model trained on 400M image-text pairs to learn a **shared embedding space** where images and their captions are nearby. Used throughout diffusion systems as the text encoder that "understands" prompts.
120+
121+
</div>
122+
115123
<div class="example-box" data-title="How 'a cat wearing a hat' becomes an image">
116124

117125
The word "cat" activates high attention weights in the spatial region where the cat is being generated. The word "hat" activates attention weights near the top of the cat region. This spatial-linguistic binding is learned entirely from image-caption pairs during training.
@@ -177,7 +185,7 @@ The [Diffusion Transformer (DiT)](https://arxiv.org/abs/2212.09748) replaces the
177185

178186
<div class="important-box" data-title="Why replace U-Net?">
179187

180-
Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art FID of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.
188+
Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art [FID](https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance) of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.
181189

182190
</div>
183191

@@ -197,7 +205,7 @@ DiT conditions on timestep and class label using **adaptive Layer Normalization
197205

198206
<div class="note-box" data-title="Why 'Zero'?">
199207

200-
Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning (He et al., 2016).
208+
Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning ([He et al., 2016](https://arxiv.org/abs/1512.03385)).
201209

202210
</div>
203211
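The zero-initialization argument in the "Why 'Zero'?" note can be checked with a toy gated residual block. This is a minimal sketch, not DiT's actual implementation: `toy_sublayer`, `gated_block`, and the scalar gate `alpha` are hypothetical names, and the sublayer is an arbitrary stand-in for attention or an MLP.

```python
def toy_sublayer(x):
    """Hypothetical stand-in for an attention or MLP sublayer."""
    return [3.0 * v + 1.0 for v in x]


def gated_block(x, alpha):
    """Residual block with a scalar gate: x + alpha * sublayer(x)."""
    return [xi + alpha * fi for xi, fi in zip(x, toy_sublayer(x))]


x = [0.5, -1.0, 2.0]
at_init = gated_block(x, alpha=0.0)        # gate closed: output equals input
partly_open = gated_block(x, alpha=0.1)    # gate open: sublayer starts to contribute
```

With `alpha = 0.0` the sublayer's output is multiplied away, so the block passes its input through unchanged no matter what the sublayer computes. That is the sense in which the network "starts by doing nothing".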

@@ -215,7 +223,18 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
215223

216224
</div>
217225

218-
<div class="note-box" data-title="Key differences from DDPM">
226+
<div class="definition-box" data-title="Remember...">
227+
228+
- **[ODE](https://en.wikipedia.org/wiki/Ordinary_differential_equation)** (Ordinary Differential Equation): An equation describing how a quantity changes over time via a deterministic rule — given the current state, the next state is fully determined
229+
- **[SDE](https://en.wikipedia.org/wiki/Stochastic_differential_equation)** (Stochastic Differential Equation): Like an ODE but with a random noise term — the path from noise to data has some randomness at each step
230+
231+
</div>
232+
233+
---
234+
235+
# Flow matching vs DDPM
236+
237+
<div class="note-box" data-title="Key differences">
219238

220239
| | DDPM | Flow matching |
221240
|---|---|---|
@@ -226,13 +245,19 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
226245

227246
</div>
228247

248+
<div class="tip-box" data-title="The intuition">
249+
250+
Flow matching asks: "What's the simplest path from noise to data?" Instead of designing a complex noise schedule and learning to reverse it, we define a straight interpolation and learn the velocity field that moves along it. The training objective reduces to a plain regression on velocities, training tends to be more stable, and straighter paths can be sampled in fewer steps.
251+
252+
</div>
253+
229254
---
230255

231256
# Rectified flow
232257

233258
<div class="definition-box" data-title="Straight paths from noise to data">
234259

235-
**Rectified flow** ([Liu et al., 2023](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
260+
**Rectified flow** ([Liu et al., 2023, ICLR](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
236261

237262
$$\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$$
238263
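The interpolation above can be sketched in a few lines. This is an illustrative example under stated assumptions, not code from the lecture: `interpolate` and `velocity_target` are hypothetical helper names, vectors are plain Python lists, and the regression target `eps - x0` is simply the time derivative of the straight-line path as written.

```python
def interpolate(x0, eps, t):
    """Point on the straight path: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def velocity_target(x0, eps):
    """d(x_t)/dt along the straight path; constant in t."""
    return [b - a for a, b in zip(x0, eps)]


x0 = [1.0, -2.0]    # a toy "data" point
eps = [0.5, 0.0]    # a toy "noise" sample

start = interpolate(x0, eps, 0.0)     # equals x0 (pure data)
end = interpolate(x0, eps, 1.0)       # equals eps (pure noise)
velocity = velocity_target(x0, eps)   # same at every t along the line
```

Because the path is a straight line, the velocity does not depend on `t`. A network trained to regress onto this target sees the same answer everywhere along a given data-noise pair.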

@@ -247,7 +272,7 @@ Straight paths are the shortest paths between noise and data. Since they don't c
247272
</div>
248273

249274
---
250-
<!-- _class: scale-90 -->
275+
<!-- _class: scale-85 -->
251276

252277
# Stable Diffusion 3: putting it all together
253278

@@ -312,6 +337,18 @@ The field progresses by **composing** innovations, not replacing them. Each exte
312337

313338
</div>
314339

340+
---
341+
342+
# Take-home messages
343+
344+
<div class="note-box" data-title="Think about it...">
345+
346+
- The key bottleneck in high-resolution generation wasn't the diffusion process itself — it was **where** you run it. Compressing to latent space (via a VAE) made consumer-GPU generation possible.
347+
- Classifier-free guidance shows that **controlling** generation is as important as generation itself — and the trick is surprisingly simple: learn what the conditional and unconditional outputs look like, then amplify the difference.
348+
- The field evolves by **composing** innovations (latent space + CFG + DiT + flow matching), not replacing them. Each addresses one specific limitation.
349+
350+
</div>
351+
315352
---
316353
<!-- _class: scale-85 -->
317354
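The "amplify the difference" idea in the classifier-free guidance take-home message can be written as a one-line combination rule. This is a minimal sketch of one common formulation. The lecture's own notation is not visible in this diff, so `cfg_combine`, `eps_uncond`, `eps_cond`, and `guidance_scale` are assumed names, and the predictions are toy lists rather than model outputs.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # toy unconditional noise prediction
eps_cond = [1.0, 3.0]     # toy text-conditioned noise prediction

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)   # ignores the prompt
plain_cond = cfg_combine(eps_uncond, eps_cond, 1.0)    # ordinary conditional output
amplified = cfg_combine(eps_uncond, eps_cond, 2.0)     # extrapolates past the conditional output
```

A scale of 1 recovers the conditional prediction. Scales above 1 push further along the direction in which conditioning changed the output.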

slides/week7/lecture22.pdf

102 KB
Binary file not shown.

slides/week7/lecture23.html

Lines changed: 28 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -755,6 +755,12 @@ <h1 id="text-to-image-the-big-picture">Text-to-image: the big picture</h1>
755755
</tbody>
756756
</table>
757757
</div>
758+
<div class="definition-box" data-title="Remember...">
759+
<ul>
760+
<li><strong><a href="https://arxiv.org/abs/2103.00020">CLIP</a></strong>: Contrastive Language-Image Pre-training (<a href="https://arxiv.org/abs/2103.00020">Radford et al., 2021</a>) — learns a shared embedding space for images and text, used as the text encoder in DALL-E 2 and Stable Diffusion (see Lecture 22)</li>
761+
<li><strong><a href="https://arxiv.org/abs/1910.10683">T5</a></strong>: A text-to-text transformer (Google, 2020) — Imagen uses the largest variant (T5-XXL, 4.6B parameters) as its text encoder</li>
762+
</ul>
763+
</div>
758764
<div class="important-box" data-title="The key question">
759765
<p>Which component matters more — the language understanding (text encoder) or the image generation (diffusion model)? Imagen's surprising finding: <strong>scaling the text encoder helps more than scaling the diffusion model</strong>.</p>
760766
</div>
@@ -833,7 +839,7 @@ <h1 id="imagen">Imagen</h1>
833839
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="7" data-class="scale-78" data-theme="cdl-theme" lang="C" class="scale-78" style="--class:scale-78;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
834840
<h1 id="stable-diffusion">Stable Diffusion</h1>
835841
<div class="note-box" data-title="Open-source democratization">
836-
<p>Stable Diffusion (Rombach et al., 2022) is the open-source implementation of latent diffusion (Lecture 22):</p>
842+
<p><a href="https://arxiv.org/abs/2112.10752">Stable Diffusion</a> (<a href="https://arxiv.org/abs/2112.10752">Rombach et al., 2022</a>) is the open-source implementation of latent diffusion (Lecture 22):</p>
837843
<table>
838844
<thead>
839845
<tr>
@@ -852,7 +858,7 @@ <h1 id="stable-diffusion">Stable Diffusion</h1>
852858
</tr>
853859
<tr>
854860
<td>Training data</td>
855-
<td>LAION-5B (5 billion image-text pairs)</td>
861+
<td><a href="https://laion.ai/blog/laion-5b/">LAION-5B</a> (5 billion image-text pairs)</td>
856862
</tr>
857863
<tr>
858864
<td>Parameters</td>
@@ -899,9 +905,9 @@ <h1 id="text-to-audio">Text-to-audio</h1>
899905
<div class="definition-box" data-title="Diffusion in the spectral domain">
900906
<p>Audio generation applies diffusion to <strong>spectrograms</strong> (time-frequency representations of sound):</p>
901907
<ol>
902-
<li>Convert audio to a mel-spectrogram</li>
908+
<li>Convert audio to a <a href="https://en.wikipedia.org/wiki/Mel-frequency_cepstrum">mel-spectrogram</a> (a time-frequency representation of sound whose frequency axis is warped to the mel scale, which approximates human pitch perception)</li>
903909
<li>Run diffusion in spectrogram space (or a latent compression of it)</li>
904-
<li>Convert the generated spectrogram back to audio using a vocoder</li>
910+
<li>Convert the generated spectrogram back to audio using a <strong>vocoder</strong> (a neural network that reconstructs audio waveforms from spectrograms)</li>
905911
</ol>
906912
</div>
907913
<div class="note-box" data-title="Notable systems">
@@ -941,7 +947,7 @@ <h1 id="text-to-audio">Text-to-audio</h1>
941947
</section>
942948
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="10" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
943949
<h1 id="discrete-diffusion-for-text">Discrete diffusion for text</h1>
944-
<div class="definition-box" data-title="Sahoo et al. (2024): Masked Diffusion Language Models (MDLM)">
950+
<div class="definition-box" data-title="Sahoo et al. (2024, NeurIPS): Masked Diffusion Language Models (MDLM)">
945951
<p>Standard diffusion adds Gaussian noise to continuous data. For discrete data like text, <a href="https://arxiv.org/abs/2406.07524">MDLM</a> replaces &quot;adding noise&quot; with <strong>masking tokens</strong>:</p>
946952
<ul>
947953
<li><strong>Forward process</strong>: Randomly replace tokens with [MASK], increasing the masking rate over time</li>
@@ -1017,7 +1023,7 @@ <h1 id="ethics-deepfakes-and-consent">Ethics: deepfakes and consent</h1>
10171023
</ul>
10181024
</div>
10191025
<div class="important-box" data-title="Scale of the problem">
1020-
<p>A 2023 report found that <strong>96% of deepfake videos online are non-consensual intimate imagery</strong>, and the number of deepfake videos doubled every 6 months from 2018 to 2023. The democratization of generation tools has outpaced legal and technical protections.</p>
1026+
<p>A <a href="https://sensity.ai/blog/deepfake-detection/mapping-the-deepfake-landscape/">2019 Sensity AI (Deeptrace) report</a> found that <strong>96% of deepfake videos online are non-consensual intimate imagery</strong>, and the number of deepfake videos doubled in just 9 months. By 2023, <a href="https://sumsub.com/blog/deepfake-statistics/">Sumsub reported</a> a 10× increase in detected deepfakes year-over-year. The democratization of generation tools has far outpaced legal and technical protections.</p>
10211027
</div>
10221028
</section>
10231029
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="14" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
@@ -1064,7 +1070,7 @@ <h1 id="ethics-copyright-and-training-data">Ethics: copyright and training data<
10641070
</tr>
10651071
<tr>
10661072
<td>Thomson Reuters v. Ross</td>
1067-
<td>Settled 2024</td>
1073+
<td>Ruled 2025</td>
10681074
<td>Training on proprietary legal database</td>
10691075
</tr>
10701076
</tbody>
@@ -1079,7 +1085,7 @@ <h1 id="ethics-regulation-and-provenance">Ethics: regulation and provenance</h1>
10791085
<div class="definition-box" data-title="Emerging regulatory frameworks">
10801086
<ul>
10811087
<li><strong>EU AI Act (2024)</strong>: Requires labeling of AI-generated content, transparency about training data, risk classification for generative systems</li>
1082-
<li><strong>C2PA (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — &quot;nutrition labels&quot; for digital content</li>
1088+
<li><strong><a href="https://c2pa.org/">C2PA</a> (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — &quot;nutrition labels&quot; for digital content</li>
10831089
<li><strong>US Executive Order (Oct 2023)</strong>: Requires watermarking of AI-generated content from government contractors</li>
10841090
<li><strong>China's deep synthesis regulations (2023)</strong>: Mandatory labeling and registration of deepfake services</li>
10851091
</ul>
@@ -1110,17 +1116,27 @@ <h1 id="discussion">Discussion</h1>
11101116
</ol>
11111117
</div>
11121118
</section>
1113-
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-class="scale-85" data-theme="cdl-theme" lang="C" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
1119+
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
1120+
<h1 id="take-home-messages">Take-home messages</h1>
1121+
<div class="note-box" data-title="Think about it...">
1122+
<ul>
1123+
<li>The same diffusion framework scales across modalities — images (DALL-E 2, Stable Diffusion), video (Sora), audio, and text (MDLM) — suggesting <strong>iterative refinement from noise</strong> is a general-purpose generation principle.</li>
1124+
<li>Imagen's key finding — that <strong>scaling the text encoder matters more than scaling the image generator</strong> — reveals that understanding the prompt is the bottleneck, not producing pixels. Language models are central even in vision.</li>
1125+
<li>The power of open-source: Stable Diffusion's release enabled an explosion of community innovation (ControlNet, LoRA, inpainting) that no closed model could match — but also democratized the tools for deepfakes and misuse.</li>
1126+
</ul>
1127+
</div>
1128+
</section>
1129+
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-class="scale-85" data-theme="cdl-theme" lang="C" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
11141130
<h1 id="further-reading">Further reading</h1>
11151131
<div class="note-box" data-title="Further reading">
11161132
<p><a href="https://arxiv.org/abs/2204.06125"><strong>Ramesh et al. (2022, <em>arXiv</em>)</strong></a> &quot;Hierarchical Text-Conditional Image Generation with CLIP Latents&quot; — DALL-E 2: CLIP prior + diffusion decoder.</p>
11171133
<p><a href="https://arxiv.org/abs/2205.11487"><strong>Saharia et al. (2022, <em>NeurIPS</em>)</strong></a> &quot;Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding&quot; — Imagen: scaling text encoders matters most.</p>
11181134
<p><a href="https://openai.com/research/video-generation-models-as-world-simulators"><strong>OpenAI (2024)</strong></a> &quot;Video Generation Models as World Simulators&quot; — Sora: spacetime patches and emergent physics.</p>
1119-
<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>arXiv</em>)</strong></a> &quot;Simple and Effective Masked Diffusion Language Models&quot; — MDLM: bridging BERT and diffusion for text.</p>
1120-
<p><a href="https://arxiv.org/abs/2409.00587"><strong>Fei et al. (2024, <em>arXiv</em>)</strong></a> &quot;A Comprehensive Survey on Diffusion Models and Their Applications&quot; — Broad overview of diffusion across modalities.</p>
1135+
<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>NeurIPS</em>)</strong></a> &quot;Simple and Effective Masked Diffusion Language Models&quot; — MDLM: bridging BERT and diffusion for text.</p>
1136+
<p><a href="https://arxiv.org/abs/2409.00587"><strong>Yang et al. (2024, <em>ACM Computing Surveys</em>)</strong></a> &quot;Diffusion Models: A Comprehensive Survey of Methods and Applications&quot; — Broad overview of diffusion across modalities.</p>
11211137
</div>
11221138
</section>
1123-
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
1139+
</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="20" data-theme="cdl-theme" lang="C" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
11241140
<h1 id="questions">Questions?</h1>
11251141
<div class="emoji-figure">
11261142
<div class="emoji-col">
