Commit 70fdc14

committed
deploy: 69326c2
1 parent ba582ab commit 70fdc14

1 file changed

Lines changed: 18 additions & 18 deletions

_sources/lectures/Attention.html

@@ -1118,8 +1118,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
11181118
<section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
11191119
<h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
11201120
<p>Let's create some numbers so we can get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Let's also pretend we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
1121-
<div class="amsmath math notranslate nohighlight" id="equation-af658284-bb2e-42be-9574-773a0f45e295">
1122-
<span class="eqno">(1)<a class="headerlink" href="#equation-af658284-bb2e-42be-9574-773a0f45e295" title="Permalink to this equation">#</a></span>\[\begin{equation}
1121+
<div class="amsmath math notranslate nohighlight" id="equation-29af6266-4108-4c02-86cb-ead3cbd35522">
1122+
<span class="eqno">(1)<a class="headerlink" href="#equation-29af6266-4108-4c02-86cb-ead3cbd35522" title="Permalink to this equation">#</a></span>\[\begin{equation}
11231123
\begin{bmatrix}
11241124
7 &amp; -8 &amp; 6 \\
11251125
-3 &amp; 2 &amp; 4 \\
@@ -1132,8 +1132,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11321132
\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
11331133
\]</div>
11341134
<p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
1135-
<div class="amsmath math notranslate nohighlight" id="equation-d0cce888-4694-4dc7-bd92-45db67eadaf9">
1136-
<span class="eqno">(2)<a class="headerlink" href="#equation-d0cce888-4694-4dc7-bd92-45db67eadaf9" title="Permalink to this equation">#</a></span>\[\begin{equation}
1135+
<div class="amsmath math notranslate nohighlight" id="equation-36ba346e-375c-498e-959e-f92cd06715a2">
1136+
<span class="eqno">(2)<a class="headerlink" href="#equation-36ba346e-375c-498e-959e-f92cd06715a2" title="Permalink to this equation">#</a></span>\[\begin{equation}
11371137
\text{Softmax}
11381138
\begin{bmatrix}
11391139
7 &amp; -8 &amp; 6 \\
@@ -1152,17 +1152,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11521152
\end{bmatrix}
11531153
\end{equation}\]</div>
11541154
<p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill in the indexes that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
1155-
<div class="amsmath math notranslate nohighlight" id="equation-c1f5341d-f539-42aa-be75-4b4468761db0">
1156-
<span class="eqno">(3)<a class="headerlink" href="#equation-c1f5341d-f539-42aa-be75-4b4468761db0" title="Permalink to this equation">#</a></span>\[\begin{equation}
1155+
<div class="amsmath math notranslate nohighlight" id="equation-1a31c76b-1a14-42c0-bd13-23bfda712876">
1156+
<span class="eqno">(3)<a class="headerlink" href="#equation-1a31c76b-1a14-42c0-bd13-23bfda712876" title="Permalink to this equation">#</a></span>\[\begin{equation}
11571157
\begin{bmatrix}
11581158
7 &amp; -8 &amp; -\infty \\
11591159
-3 &amp; 2 &amp; -\infty \\
11601160
1 &amp; 6 &amp; -\infty \\
11611161
\end{bmatrix}
11621162
\end{equation}\]</div>
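<p>As a rough check of the masking step above, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the lecture's PyTorch code: the helper <code>softmax</code> and the list <code>keep</code> are hypothetical names, and the third row of the score matrix is taken to be [1, 6, -2] as in the masked matrices shown in this section.</p>

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Softmax of one row of scores; math.exp(-inf) evaluates to 0.0."""
    m = max(row)  # subtract the row maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Raw QK^T scores for the tokens X = [10, 2, <pad>].
scores = [
    [7.0, -8.0, 6.0],
    [-3.0, 2.0, 4.0],
    [1.0, 6.0, -2.0],
]

# Hypothetical padding mask: True marks a real token, False marks <pad>.
keep = [True, True, False]

# Fill every column that belongs to a padding token with -inf.
masked = [[s if k else NEG_INF for s, k in zip(row, keep)] for row in scores]

# Row-wise softmax: the padded column gets weight 0, the rest sums to 1.
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

<p>Each printed row should put zero weight on the padded column while still summing to 1, which is the property the masked softmax relies on.</p>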
11631163
<p>Taking the softmax of the rows of this matrix then gives:</p>
1164-
<div class="amsmath math notranslate nohighlight" id="equation-0e99714e-ef81-4fe7-af42-9338fb6d621e">
1165-
<span class="eqno">(4)<a class="headerlink" href="#equation-0e99714e-ef81-4fe7-af42-9338fb6d621e" title="Permalink to this equation">#</a></span>\[\begin{equation}
1164+
<div class="amsmath math notranslate nohighlight" id="equation-5c2bee6c-0937-408b-ac31-0a2e740e0846">
1165+
<span class="eqno">(4)<a class="headerlink" href="#equation-5c2bee6c-0937-408b-ac31-0a2e740e0846" title="Permalink to this equation">#</a></span>\[\begin{equation}
11661166
\text{Softmax}
11671167
\begin{bmatrix}
11681168
7 &amp; -8 &amp; -\infty \\
@@ -1204,8 +1204,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
12041204
<p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
12051205
<p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
12061206
<p>It is clear that our mask is missing a dimension, and we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
1207-
<div class="amsmath math notranslate nohighlight" id="equation-959385c7-aae4-422c-aff2-cbfe974d1abd">
1208-
<span class="eqno">(5)<a class="headerlink" href="#equation-959385c7-aae4-422c-aff2-cbfe974d1abd" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
1207+
<div class="amsmath math notranslate nohighlight" id="equation-6b93ab1a-d82c-4cc5-9269-d3b4bdd7f95c">
1208+
<span class="eqno">(5)<a class="headerlink" href="#equation-6b93ab1a-d82c-4cc5-9269-d3b4bdd7f95c" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
12091209
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
12101210
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
12111211
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
@@ -1465,8 +1465,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
14651465
<section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
14661466
<h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
14671467
<p>Let's pretend the raw output of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, is below:</p>
1468-
<div class="amsmath math notranslate nohighlight" id="equation-c6f9c264-4c4d-4c8e-9793-e776077f8174">
1469-
<span class="eqno">(6)<a class="headerlink" href="#equation-c6f9c264-4c4d-4c8e-9793-e776077f8174" title="Permalink to this equation">#</a></span>\[\begin{equation}
1468+
<div class="amsmath math notranslate nohighlight" id="equation-24d53089-fd13-4a28-8e24-49f4499e732d">
1469+
<span class="eqno">(6)<a class="headerlink" href="#equation-24d53089-fd13-4a28-8e24-49f4499e732d" title="Permalink to this equation">#</a></span>\[\begin{equation}
14701470
\begin{bmatrix}
14711471
7 &amp; -8 &amp; 6 \\
14721472
-3 &amp; 2 &amp; 4 \\
@@ -1477,8 +1477,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14771477
<div class="math notranslate nohighlight">
14781478
\[\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
14791479
<p>Then, we can compute the softmax for each row of the matrix above:</p>
1480-
<div class="amsmath math notranslate nohighlight" id="equation-47072d1f-7b1f-4fad-8bfc-96accbb56dd8">
1481-
<span class="eqno">(7)<a class="headerlink" href="#equation-47072d1f-7b1f-4fad-8bfc-96accbb56dd8" title="Permalink to this equation">#</a></span>\[\begin{equation}
1480+
<div class="amsmath math notranslate nohighlight" id="equation-6100e13c-f490-45b8-b54d-788026b85cac">
1481+
<span class="eqno">(7)<a class="headerlink" href="#equation-6100e13c-f490-45b8-b54d-788026b85cac" title="Permalink to this equation">#</a></span>\[\begin{equation}
14821482
\text{Softmax}
14831483
\begin{bmatrix}
14841484
7 &amp; -8 &amp; 6 \\
@@ -1517,17 +1517,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
15171517
\text{Softmax}(x_2) = \left[\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}\right] = [0.0067, 0.9933, 0.0000]
15181518
\]</div>
15191519
<p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (the entries above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
1520-
<div class="amsmath math notranslate nohighlight" id="equation-39aaefdb-ed1c-454e-9c9e-42a98dd8ba8b">
1521-
<span class="eqno">(8)<a class="headerlink" href="#equation-39aaefdb-ed1c-454e-9c9e-42a98dd8ba8b" title="Permalink to this equation">#</a></span>\[\begin{equation}
1520+
<div class="amsmath math notranslate nohighlight" id="equation-c664611a-e8fc-4d2a-8267-68e950101828">
1521+
<span class="eqno">(8)<a class="headerlink" href="#equation-c664611a-e8fc-4d2a-8267-68e950101828" title="Permalink to this equation">#</a></span>\[\begin{equation}
15221522
\begin{bmatrix}
15231523
7 &amp; -\infty &amp; -\infty \\
15241524
-3 &amp; 2 &amp; -\infty \\
15251525
1 &amp; 6 &amp; -2 \\
15261526
\end{bmatrix}
15271527
\end{equation}\]</div>
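<p>The causal masking step above can be sketched in a few lines of pure Python. This is a hedged illustration rather than the lecture's PyTorch implementation: <code>softmax</code> and <code>causal_mask</code> are hypothetical helper names introduced only for this example.</p>

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Softmax of one row of scores; math.exp(-inf) evaluates to 0.0."""
    m = max(row)  # subtract the row maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Replace every entry above the diagonal (column j > row i) with -inf."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        for i in range(n)
    ]

# Raw QK^T scores from the example above.
scores = [
    [7.0, -8.0, 6.0],
    [-3.0, 2.0, 4.0],
    [1.0, 6.0, -2.0],
]

causal_scores = causal_mask(scores)
causal_weights = [softmax(row) for row in causal_scores]

for row in causal_weights:
    print([round(w, 4) for w in row])
```

<p>The first row can only attend to itself, the second row splits its weight between the first two positions, and only the last row sees all three.</p>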
15281528
<p>Taking the softmax of the rows of this matrix then gives:</p>
1529-
<div class="amsmath math notranslate nohighlight" id="equation-6102b396-e63f-46d9-8622-2c42630f12fb">
1530-
<span class="eqno">(9)<a class="headerlink" href="#equation-6102b396-e63f-46d9-8622-2c42630f12fb" title="Permalink to this equation">#</a></span>\[\begin{equation}
1529+
<div class="amsmath math notranslate nohighlight" id="equation-3135a916-a7b4-459d-b572-6f55b547370b">
1530+
<span class="eqno">(9)<a class="headerlink" href="#equation-3135a916-a7b4-459d-b572-6f55b547370b" title="Permalink to this equation">#</a></span>\[\begin{equation}
15311531
\text{Softmax}
15321532
\begin{bmatrix}
15331533
7 &amp; -\infty &amp; -\infty \\
