+- **[CLIP](https://arxiv.org/abs/2103.00020)** (Contrastive Language-Image Pre-training; [Radford et al., 2021](https://arxiv.org/abs/2103.00020)): A model trained on 400M image-text pairs to learn a **shared embedding space** where images and their captions are nearby. Used throughout diffusion systems as the text encoder that "understands" prompts.
+
+</div>
+
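The "shared embedding space" in the CLIP glossary entry can be illustrated with a small sketch. The three-dimensional vectors and the helper names (`normalize`, `cosine_similarity`, `best_caption`) are made up for illustration; real CLIP embeddings come from trained encoders and have hundreds of dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def best_caption(image_emb, caption_embs):
    """Return the index of the caption closest to the image, plus all scores."""
    scores = [cosine_similarity(image_emb, c) for c in caption_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical embeddings: the image sits close to the first caption.
image_emb = [0.9, 0.1, 0.0]
caption_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
best_index, scores = best_caption(image_emb, caption_embs)
```

Matching pairs score near 1 and mismatched pairs score near 0, which is the sense in which an image and its caption are "nearby".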
<div class="example-box" data-title="How 'a cat wearing a hat' becomes an image">

The word "cat" activates high attention weights in the spatial region where the cat is being generated. The word "hat" activates attention weights near the top of the cat region. This spatial-linguistic binding is learned entirely from image-caption pairs during training.
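A toy sketch of the attention weights described above, assuming plain scaled dot-product cross-attention in which image locations supply the queries and prompt tokens supply the keys. The two-dimensional vectors are hand-picked for illustration; in a trained model they are learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(queries, keys):
    """One row per spatial location; each row is a distribution over tokens."""
    d = len(keys[0])
    weights = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(logits))
    return weights


tokens = ["cat", "hat"]
keys = [[4.0, 0.0], [0.0, 4.0]]   # hypothetical key vectors for "cat" and "hat"
queries = [
    [0.0, 4.0],                   # hypothetical location near the top of the image
    [4.0, 0.0],                   # hypothetical location in the middle of the image
]
attn = cross_attention_weights(queries, keys)
```

Here the top location puts almost all of its weight on "hat" and the middle location on "cat", mirroring the binding between words and regions.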
@@ -177,7 +185,7 @@ The [Diffusion Transformer (DiT)](https://arxiv.org/abs/2212.09748) replaces the
-Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art FID of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.
+Transformers scale better than U-Nets. DiT-XL/2 (675M parameters) achieves a new state-of-the-art [FID](https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance) of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows **clean scaling behavior** — larger models consistently produce better results, with no architectural bottlenecks.

</div>
@@ -197,7 +205,7 @@ DiT conditions on timestep and class label using **adaptive Layer Normalization
<div class="note-box" data-title="Why 'Zero'?">

-Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning (He et al., 2016).
+Initializing the gating parameter $\alpha = 0$ means each Transformer block initially acts as an **identity function**. This makes training stable even for very deep models — the network starts by doing nothing and gradually learns to denoise. This is the same principle behind residual learning ([He et al., 2016](https://arxiv.org/abs/1512.03385)).

</div>
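The identity-at-initialization behaviour can be checked with a minimal sketch. It assumes the block has the gated residual form `x + alpha * sublayer(x)` and treats `alpha` as a plain scalar; `toy_sublayer` is a hypothetical stand-in for the attention or MLP branch.

```python
def gated_residual_block(x, alpha, sublayer):
    """Residual block whose update is scaled by the gate alpha."""
    update = sublayer(x)
    return [xi + alpha * ui for xi, ui in zip(x, update)]


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or MLP branch."""
    return [3.0 * xi - 1.0 for xi in x]


x = [0.5, -2.0, 1.25]
at_init = gated_residual_block(x, 0.0, toy_sublayer)         # alpha = 0: identity
after_training = gated_residual_block(x, 0.1, toy_sublayer)  # alpha has moved away from 0
```

With `alpha = 0.0` the output equals the input no matter what the sublayer computes, so stacking many such blocks still starts out as the identity map.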
@@ -215,7 +223,18 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
</div>

-<div class="note-box" data-title="Key differences from DDPM">
+- **[ODE](https://en.wikipedia.org/wiki/Ordinary_differential_equation)** (Ordinary Differential Equation): An equation describing how a quantity changes over time via a deterministic rule — given the current state, the next state is fully determined
+- **[SDE](https://en.wikipedia.org/wiki/Stochastic_differential_equation)** (Stochastic Differential Equation): Like an ODE but with a random noise term — the path from noise to data has some randomness at each step
+
+</div>
+
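The difference between the two glossary entries shows up directly in a numerical integrator. The sketch below uses a forward Euler step for the ODE and an Euler-Maruyama step for the SDE; the drift `-x`, the noise scale `sigma`, and the step size are arbitrary choices for illustration, not values from the lecture.

```python
import math
import random


def euler_ode(x, drift, dt, steps):
    """Deterministic update: the next state depends only on the current state."""
    for _ in range(steps):
        x = x + drift(x) * dt
    return x


def euler_maruyama_sde(x, drift, sigma, dt, steps, rng):
    """Same drift, plus a Gaussian noise term at every step."""
    for _ in range(steps):
        x = x + drift(x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


def drift(x):
    """Arbitrary example drift that pulls the state toward zero."""
    return -x


ode_a = euler_ode(1.0, drift, 0.01, 100)
ode_b = euler_ode(1.0, drift, 0.01, 100)
sde_a = euler_maruyama_sde(1.0, drift, 0.5, 0.01, 100, random.Random(0))
sde_b = euler_maruyama_sde(1.0, drift, 0.5, 0.01, 100, random.Random(1))
```

Two ODE runs from the same start always agree, while two SDE runs with different random seeds end in different places.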
+---
+
+# Flow matching vs DDPM
+
+<div class="note-box" data-title="Key differences">

| | DDPM | Flow matching |
|---|---|---|
@@ -226,13 +245,19 @@ where $v_\theta$ is a neural network that predicts the **velocity** (direction a
</div>

+<div class="tip-box" data-title="The intuition">
+
+Flow matching asks: "What's the simplest path from noise to data?" Instead of designing a complex noise schedule and learning to reverse it, we define a straight interpolation and learn the velocity field that moves along it. The math is simpler, the training is more stable, and generation is faster.
+
+</div>
+
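A minimal sketch of the "straight interpolation plus velocity" idea. It assumes the convention that `t = 0` is noise and `t = 1` is data (some papers use the opposite direction), and it replaces the neural network with an oracle that already knows the target velocity; all function names are hypothetical.

```python
def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Along a straight line the velocity is constant: data minus noise."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


def euler_sample(noise, velocity_fn, steps):
    """Generate by following the velocity field from t=0 to t=1."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0]
data = [2.0, 3.0]


def oracle(x, t):
    """Stand-in for a trained network that predicts the velocity perfectly."""
    return target_velocity(noise, data)


midpoint = interpolate(noise, data, 0.5)
generated = euler_sample(noise, oracle, steps=4)
```

Because the path is straight, a handful of Euler steps is enough to land on the data point in this toy setting, which is the intuition behind faster generation.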
---

# Rectified flow

<div class="definition-box" data-title="Straight paths from noise to data">

-**Rectified flow** ([Liu et al., 2023](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
+**Rectified flow** ([Liu et al., 2023, ICLR](https://arxiv.org/abs/2209.03003)) uses the simplest possible interpolation — a straight line between the data point and a noise sample:
@@ -247,7 +272,7 @@ Straight paths are the shortest paths between noise and data. Since they don't c
</div>

---
-<!-- _class: scale-90 -->
+<!-- _class: scale-85 -->

# Stable Diffusion 3: putting it all together
@@ -312,6 +337,18 @@ The field progresses by **composing** innovations, not replacing them. Each exte
</div>

+---
+
+# Take-home messages
+
+<div class="note-box" data-title="Think about it...">
+
+- The key bottleneck in high-resolution generation wasn't the diffusion process itself — it was **where** you run it. Compressing to latent space (via a VAE) made consumer-GPU generation possible.
+- Classifier-free guidance shows that **controlling** generation is as important as generation itself — and the trick is surprisingly simple: learn what the conditional and unconditional outputs look like, then amplify the difference.
+- The field evolves by **composing** innovations (latent space + CFG + DiT + flow matching), not replacing them. Each addresses one specific limitation.
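The "amplify the difference" step in classifier-free guidance can be sketched in a few lines. This uses one common parameterization, `uncond + scale * (cond - uncond)`; other write-ups define the scale slightly differently, and the two-element lists stand in for full noise or velocity predictions.

```python
def classifier_free_guidance(uncond, cond, guidance_scale):
    """Move from the unconditional prediction along the conditional direction."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # hypothetical prediction with the prompt dropped
cond = [1.0, 1.0]     # hypothetical prediction with the prompt
plain = classifier_free_guidance(uncond, cond, 1.0)    # scale 1: the conditional output
guided = classifier_free_guidance(uncond, cond, 2.0)   # scale 2: difference doubled
```

In this parameterization a scale above 1 pushes the result further in the direction the prompt suggests, and a scale of 0 ignores the prompt entirely.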
<li><strong><a href="https://arxiv.org/abs/2103.00020">CLIP</a></strong>: Contrastive Language-Image Pre-training (<a href="https://arxiv.org/abs/2103.00020">Radford et al., 2021</a>) — learns a shared embedding space for images and text, used as the text encoder in DALL-E 2 and Stable Diffusion (see Lecture 22)</li>
+<li><strong><a href="https://arxiv.org/abs/1910.10683">T5</a></strong>: A text-to-text transformer (Google, 2020) — Imagen uses the largest variant (T5-XXL, 4.6B parameters) as its text encoder</li>
<p>Which component matters more — the language understanding (text encoder) or the image generation (diffusion model)? Imagen's surprising finding: <strong>scaling the text encoder helps more than scaling the diffusion model</strong>.</p>
-<p>Stable Diffusion (Rombach et al., 2022) is the open-source implementation of latent diffusion (Lecture 22):</p>
+<p><a href="https://arxiv.org/abs/2112.10752">Stable Diffusion</a> (<a href="https://arxiv.org/abs/2112.10752">Rombach et al., 2022</a>) is the open-source implementation of latent diffusion (Lecture 22):</p>
<div class="definition-box" data-title="Diffusion in the spectral domain">
<p>Audio generation applies diffusion to <strong>spectrograms</strong> (time-frequency representations of sound):</p>
<ol>
-<li>Convert audio to a mel-spectrogram</li>
+<li>Convert audio to a <a href="https://en.wikipedia.org/wiki/Mel-frequency_cepstrum">mel-spectrogram</a> (a visual representation of sound frequencies over time, weighted to match human hearing)</li>
<li>Run diffusion in spectrogram space (or a latent compression of it)</li>
-<li>Convert the generated spectrogram back to audio using a vocoder</li>
+<li>Convert the generated spectrogram back to audio using a <strong>vocoder</strong> (a neural network that reconstructs audio waveforms from spectrograms)</li>
<h1 id="discrete-diffusion-for-text">Discrete diffusion for text</h1>
-<div class="definition-box" data-title="Sahoo et al. (2024): Masked Diffusion Language Models (MDLM)">
+<div class="definition-box" data-title="Sahoo et al. (2024, NeurIPS): Masked Diffusion Language Models (MDLM)">
<p>Standard diffusion adds Gaussian noise to continuous data. For discrete data like text, <a href="https://arxiv.org/abs/2406.07524">MDLM</a> replaces "adding noise" with <strong>masking tokens</strong>:</p>
<ul>
<li><strong>Forward process</strong>: Randomly replace tokens with [MASK], increasing the masking rate over time</li>
@@ -1017,7 +1023,7 @@ <h1 id="ethics-deepfakes-and-consent">Ethics: deepfakes and consent</h1>
</ul>
</div>
<div class="important-box" data-title="Scale of the problem">
-<p>A 2023 report found that <strong>96% of deepfake videos online are non-consensual intimate imagery</strong>, and the number of deepfake videos doubled every 6 months from 2018 to 2023. The democratization of generation tools has outpaced legal and technical protections.</p>
+<p>A <a href="https://sensity.ai/blog/deepfake-detection/mapping-the-deepfake-landscape/">2019 Sensity AI (Deeptrace) report</a> found that <strong>96% of deepfake videos online are non-consensual intimate imagery</strong>, and the number of deepfake videos doubled in just 9 months. By 2023, <a href="https://sumsub.com/blog/deepfake-statistics/">Sumsub reported</a> a 10× increase in detected deepfakes year-over-year. The democratization of generation tools has far outpaced legal and technical protections.</p>
<li><strong>EU AI Act (2024)</strong>: Requires labeling of AI-generated content, transparency about training data, risk classification for generative systems</li>
-<li><strong>C2PA (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — "nutrition labels" for digital content</li>
+<li><strong><a href="https://c2pa.org/">C2PA</a> (Coalition for Content Provenance and Authenticity)</strong>: Technical standard for embedding provenance metadata in images and videos — "nutrition labels" for digital content</li>
<li><strong>US Executive Order (Oct 2023)</strong>: Requires watermarking of AI-generated content from government contractors</li>
<li><strong>China's deep synthesis regulations (2023)</strong>: Mandatory labeling and registration of deepfake services</li>
<div class="note-box" data-title="Think about it...">
+<ul>
+<li>The same diffusion framework scales across modalities — images (DALL-E 2, Stable Diffusion), video (Sora), audio, and text (MDLM) — suggesting <strong>iterative refinement from noise</strong> is a general-purpose generation principle.</li>
+<li>Imagen's key finding — that <strong>scaling the text encoder matters more than scaling the image generator</strong> — reveals that understanding the prompt is the bottleneck, not producing pixels. Language models are central even in vision.</li>
+<li>The power of open-source: Stable Diffusion's release enabled an explosion of community innovation (ControlNet, LoRA, inpainting) that no closed model could match — but also democratized the tools for deepfakes and misuse.</li>
<p><a href="https://arxiv.org/abs/2204.06125"><strong>Ramesh et al. (2022, <em>arXiv</em>)</strong></a> "Hierarchical Text-Conditional Image Generation with CLIP Latents" — DALL-E 2: CLIP prior + diffusion decoder.</p>
<p><a href="https://arxiv.org/abs/2205.11487"><strong>Saharia et al. (2022, <em>NeurIPS</em>)</strong></a> "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" — Imagen: scaling text encoders matters most.</p>
<p><a href="https://openai.com/research/video-generation-models-as-world-simulators"><strong>OpenAI (2024)</strong></a> "Video Generation Models as World Simulators" — Sora: spacetime patches and emergent physics.</p>
-<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>arXiv</em>)</strong></a> "Simple and Effective Masked Diffusion Language Models" — MDLM: bridging BERT and diffusion for text.</p>
-<p><a href="https://arxiv.org/abs/2409.00587"><strong>Fei et al. (2024, <em>arXiv</em>)</strong></a> "A Comprehensive Survey on Diffusion Models and Their Applications" — Broad overview of diffusion across modalities.</p>
+<p><a href="https://arxiv.org/abs/2406.07524"><strong>Sahoo et al. (2024, <em>NeurIPS</em>)</strong></a> "Simple and Effective Masked Diffusion Language Models" — MDLM: bridging BERT and diffusion for text.</p>
+<p><a href="https://arxiv.org/abs/2409.00587"><strong>Yang et al. (2024, <em>ACM Computing Surveys</em>)</strong></a> "Diffusion Models: A Comprehensive Survey of Methods and Applications" — Broad overview of diffusion across modalities.</p>