Commit 537cc51

committed
deploy: 902634a
1 parent fef4566 commit 537cc51

4 files changed

Lines changed: 21 additions & 21 deletions

File tree

_sources/intro.md

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ At appropriate times throughout the course, you will select from a list of proje
9090

9191
For projects you will put together a Jupyter notebook that demonstrates your project. The notebook should have code and demonstrate the task but also be written in an expository way that other students could, in principle, read and learn from. It is submitted in the same way as the regular course assignments.
9292

93-
Each project notebook must be submitted via Gradescope for grading.
93+
Each project notebook must be submitted via Gradescope for grading. No late submissions are allowed for the projects. If you do not submit on Gradescope by the deadline, you will receive a zero for that project. There are no exceptions to this policy.
9494

9595
## <span style="color:Red">Grading</span>
9696
* Class attendance and participation: 5%

_sources/lectures/Attention.html

Lines changed: 18 additions & 18 deletions
@@ -1099,8 +1099,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
10991099
<section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
11001100
<h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
11011101
<p>Let's create some numbers so we can get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Suppose we pass this sequence to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
1102-
<div class="amsmath math notranslate nohighlight" id="equation-4879249e-730c-4c72-bd88-f861bae2041e">
1103-
<span class="eqno">(1)<a class="headerlink" href="#equation-4879249e-730c-4c72-bd88-f861bae2041e" title="Permalink to this equation">#</a></span>\[\begin{equation}
1102+
<div class="amsmath math notranslate nohighlight" id="equation-cca887ed-7663-48bc-8764-1bc3dc98cd1b">
1103+
<span class="eqno">(1)<a class="headerlink" href="#equation-cca887ed-7663-48bc-8764-1bc3dc98cd1b" title="Permalink to this equation">#</a></span>\[\begin{equation}
11041104
\begin{bmatrix}
11051105
7 &amp; -8 &amp; 6 \\
11061106
-3 &amp; 2 &amp; 4 \\
@@ -1113,8 +1113,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11131113
\text{Softmax}(\vec{x}) = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
11141114
\]</div>
11151115
<p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
1116-
<div class="amsmath math notranslate nohighlight" id="equation-469b756d-c77d-47bf-8a82-87e2bc96499a">
1117-
<span class="eqno">(2)<a class="headerlink" href="#equation-469b756d-c77d-47bf-8a82-87e2bc96499a" title="Permalink to this equation">#</a></span>\[\begin{equation}
1116+
<div class="amsmath math notranslate nohighlight" id="equation-b044bbb8-5fe8-4257-90bc-f3ea0de4203b">
1117+
<span class="eqno">(2)<a class="headerlink" href="#equation-b044bbb8-5fe8-4257-90bc-f3ea0de4203b" title="Permalink to this equation">#</a></span>\[\begin{equation}
11181118
\text{Softmax}
11191119
\begin{bmatrix}
11201120
7 &amp; -8 &amp; 6 \\
@@ -1133,17 +1133,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11331133
\end{bmatrix}
11341134
\end{equation}\]</div>
11351135
<p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill in the indices that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
1136-
<div class="amsmath math notranslate nohighlight" id="equation-aa311aff-3c34-4d6d-b620-620f697cb836">
1137-
<span class="eqno">(3)<a class="headerlink" href="#equation-aa311aff-3c34-4d6d-b620-620f697cb836" title="Permalink to this equation">#</a></span>\[\begin{equation}
1136+
<div class="amsmath math notranslate nohighlight" id="equation-28997d0e-2785-494d-8a85-5af60c43aded">
1137+
<span class="eqno">(3)<a class="headerlink" href="#equation-28997d0e-2785-494d-8a85-5af60c43aded" title="Permalink to this equation">#</a></span>\[\begin{equation}
11381138
\begin{bmatrix}
11391139
7 &amp; -8 &amp; -\infty \\
11401140
-3 &amp; 2 &amp; -\infty \\
11411141
1 &amp; 6 &amp; -\infty \\
11421142
\end{bmatrix}
11431143
\end{equation}\]</div>
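The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration using lists instead of tensors, not the lecture's own implementation. The helper names `softmax` and `apply_padding_mask` are hypothetical, and the last entry of the third score row (-2) is an assumption borrowed from the causal example later in the lecture; it is masked out here, so its value does not affect the result.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # math.exp(-inf) is 0.0, so masked entries contribute nothing to the sum
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def apply_padding_mask(scores, pad_mask):
    # pad_mask[j] is True for a real token and False for padding (hypothetical convention);
    # every column that belongs to a padding token is replaced with -inf
    return [[s if keep else NEG_INF for s, keep in zip(row, pad_mask)]
            for row in scores]

# Raw QK^T scores; the final -2.0 is an assumed value (see the note above)
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]
pad_mask = [True, True, False]  # third token is <pad>

masked = apply_padding_mask(scores, pad_mask)
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Every row gives zero weight to the padded column while the remaining weights still sum to 1. A production implementation would normally subtract the row maximum before exponentiating for numerical stability; that step is omitted here to keep the sketch close to the formula.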
11441144
<p>Taking the softmax of the rows of this matrix then gives:</p>
1145-
<div class="amsmath math notranslate nohighlight" id="equation-606dee84-60c1-4f1f-827c-4913a7a02b0e">
1146-
<span class="eqno">(4)<a class="headerlink" href="#equation-606dee84-60c1-4f1f-827c-4913a7a02b0e" title="Permalink to this equation">#</a></span>\[\begin{equation}
1145+
<div class="amsmath math notranslate nohighlight" id="equation-e6b399df-b453-4cea-bb55-99c48c0d87cc">
1146+
<span class="eqno">(4)<a class="headerlink" href="#equation-e6b399df-b453-4cea-bb55-99c48c0d87cc" title="Permalink to this equation">#</a></span>\[\begin{equation}
11471147
\text{Softmax}
11481148
\begin{bmatrix}
11491149
7 &amp; -8 &amp; -\infty \\
@@ -1185,8 +1185,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
11851185
<p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
11861186
<p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
11871187
<p>It is clear that our mask is missing a dimension, and we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
1188-
<div class="amsmath math notranslate nohighlight" id="equation-befe9eff-9127-4ab8-a1bf-a8fbb58c4325">
1189-
<span class="eqno">(5)<a class="headerlink" href="#equation-befe9eff-9127-4ab8-a1bf-a8fbb58c4325" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
1188+
<div class="amsmath math notranslate nohighlight" id="equation-c7c9a3ab-17ac-4a5a-b0a2-52a5eac6f7ec">
1189+
<span class="eqno">(5)<a class="headerlink" href="#equation-c7c9a3ab-17ac-4a5a-b0a2-52a5eac6f7ec" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
11901190
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
11911191
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
11921192
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
@@ -1446,8 +1446,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
14461446
<section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
14471447
<h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
14481448
<p>Let's pretend the raw output of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, is below:</p>
1449-
<div class="amsmath math notranslate nohighlight" id="equation-4b7ce628-0c08-401f-8c37-779deeaffc59">
1450-
<span class="eqno">(6)<a class="headerlink" href="#equation-4b7ce628-0c08-401f-8c37-779deeaffc59" title="Permalink to this equation">#</a></span>\[\begin{equation}
1449+
<div class="amsmath math notranslate nohighlight" id="equation-953898fc-df4c-4707-8b13-c5739e6979e5">
1450+
<span class="eqno">(6)<a class="headerlink" href="#equation-953898fc-df4c-4707-8b13-c5739e6979e5" title="Permalink to this equation">#</a></span>\[\begin{equation}
14511451
\begin{bmatrix}
14521452
7 &amp; -8 &amp; 6 \\
14531453
-3 &amp; 2 &amp; 4 \\
@@ -1458,8 +1458,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14581458
<div class="math notranslate nohighlight">
14591459
\[\text{Softmax}(\vec{x}) = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
14601460
<p>Then, we can compute the softmax for each row of the matrix above:</p>
1461-
<div class="amsmath math notranslate nohighlight" id="equation-f53bf636-67a1-4047-bd08-90546434449a">
1462-
<span class="eqno">(7)<a class="headerlink" href="#equation-f53bf636-67a1-4047-bd08-90546434449a" title="Permalink to this equation">#</a></span>\[\begin{equation}
1461+
<div class="amsmath math notranslate nohighlight" id="equation-5247fabc-a18a-4cb0-9988-428195c4ccdc">
1462+
<span class="eqno">(7)<a class="headerlink" href="#equation-5247fabc-a18a-4cb0-9988-428195c4ccdc" title="Permalink to this equation">#</a></span>\[\begin{equation}
14631463
\text{Softmax}
14641464
\begin{bmatrix}
14651465
7 &amp; -8 &amp; 6 \\
@@ -1498,17 +1498,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14981498
\text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]
14991499
\]</div>
15001500
<p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (every entry above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
1501-
<div class="amsmath math notranslate nohighlight" id="equation-ae838e54-9d6a-443f-bfe3-ab3c5193121e">
1502-
<span class="eqno">(8)<a class="headerlink" href="#equation-ae838e54-9d6a-443f-bfe3-ab3c5193121e" title="Permalink to this equation">#</a></span>\[\begin{equation}
1501+
<div class="amsmath math notranslate nohighlight" id="equation-bca53567-ce82-4f24-b424-013b88906702">
1502+
<span class="eqno">(8)<a class="headerlink" href="#equation-bca53567-ce82-4f24-b424-013b88906702" title="Permalink to this equation">#</a></span>\[\begin{equation}
15031503
\begin{bmatrix}
15041504
7 &amp; -\infty &amp; -\infty \\
15051505
-3 &amp; 2 &amp; -\infty \\
15061506
1 &amp; 6 &amp; -2 \\
15071507
\end{bmatrix}
15081508
\end{equation}\]</div>
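As a rough sketch of the causal masking above, the following plain-Python snippet replaces every entry above the diagonal with negative infinity and then applies a row-wise softmax. The helper names `softmax` and `causal_mask` are hypothetical and the lists stand in for tensors; the score values are the ones from the example matrix.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # math.exp(-inf) is 0.0, so masked (future) positions get zero weight
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    # Keep entry (i, j) only when j <= i; positions with j > i lie in the future
    return [[s if j <= i else NEG_INF for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

# Raw QK^T scores from the example
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]

masked = causal_mask(scores)
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

The first row can only attend to itself, the second row reproduces the weights of roughly 0.0067 and 0.9933 computed by hand, and the last row is left unmasked because no later tokens exist.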
15091509
<p>Taking the softmax of the rows of this matrix then gives:</p>
1510-
<div class="amsmath math notranslate nohighlight" id="equation-a4e382a4-b1a3-46c1-a4d3-623292fdd7db">
1511-
<span class="eqno">(9)<a class="headerlink" href="#equation-a4e382a4-b1a3-46c1-a4d3-623292fdd7db" title="Permalink to this equation">#</a></span>\[\begin{equation}
1510+
<div class="amsmath math notranslate nohighlight" id="equation-154be58d-b84c-4939-9189-567569ad4ecd">
1511+
<span class="eqno">(9)<a class="headerlink" href="#equation-154be58d-b84c-4939-9189-567569ad4ecd" title="Permalink to this equation">#</a></span>\[\begin{equation}
15121512
\text{Softmax}
15131513
\begin{bmatrix}
15141514
7 &amp; -\infty &amp; -\infty \\

intro.html

Lines changed: 1 addition & 1 deletion
@@ -729,7 +729,7 @@ <h3>Homework Assignments<a class="headerlink" href="#homework-assignments" title
729729
<h3>Projects<a class="headerlink" href="#projects" title="Permalink to this heading">#</a></h3>
730730
<p>At appropriate times throughout the course, you will select from a list of projects that involve demonstrating and extending your work in class by doing something cool and interesting in data analysis. You must work alone on this (i.e. without collaboration).</p>
731731
<p>For projects you will put together a Jupyter notebook that demonstrates your project. The notebook should have code and demonstrate the task but also be written in an expository way that other students could, in principle, read and learn from. It is submitted in the same way as the regular course assignments.</p>
732-
<p>Each project notebook must be submitted via Gradescope for grading.</p>
732+
<p>Each project notebook must be submitted via Gradescope for grading. No late submissions are allowed for the projects. If you do not submit on Gradescope by the deadline, you will receive a zero for that project. There are no exceptions to this policy.</p>
733733
</section>
734734
</section>
735735
<section id="span-style-color-red-grading-span">

searchindex.js

Lines changed: 1 addition & 1 deletion
