Commit 6b1f493

committed
deploy: 3e5fb87
1 parent e6100b2 commit 6b1f493

1 file changed

Lines changed: 18 additions & 18 deletions

_sources/lectures/Attention.html

@@ -1100,8 +1100,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
11001100
<section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
11011101
<h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
11021102
<p>Let's create some numbers to get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Suppose we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
1103-
<div class="amsmath math notranslate nohighlight" id="equation-a230ab2d-044f-4351-929f-ea68f6ce11ea">
1104-
<span class="eqno">(1)<a class="headerlink" href="#equation-a230ab2d-044f-4351-929f-ea68f6ce11ea" title="Permalink to this equation">#</a></span>\[\begin{equation}
1103+
<div class="amsmath math notranslate nohighlight" id="equation-d5241a6d-9f72-48fc-8054-0a38b162b56d">
1104+
<span class="eqno">(1)<a class="headerlink" href="#equation-d5241a6d-9f72-48fc-8054-0a38b162b56d" title="Permalink to this equation">#</a></span>\[\begin{equation}
11051105
\begin{bmatrix}
11061106
7 &amp; -8 &amp; 6 \\
11071107
-3 &amp; 2 &amp; 4 \\
@@ -1114,8 +1114,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11141114
\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
11151115
\]</div>
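As a rough illustration of the formula above, here is a minimal pure-Python sketch of a row-wise softmax, applied to the two rows of the example matrix visible in this hunk. The helper name `softmax` is our own and is not taken from the lecture code; subtracting the row maximum is a common numerical-stability step that does not change the result.

```python
import math

def softmax(row):
    """Softmax of a list of floats: exp(x_i) / sum_j exp(x_j)."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# First two rows of the raw QK^T example.
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0]]
weights = [softmax(row) for row in scores]

for row in weights:
    print([round(w, 4) for w in row])
```

Each output row sums to 1, so it can be read as a probability vector over the tokens being attended to.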
11161116
<p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
1117-
<div class="amsmath math notranslate nohighlight" id="equation-2d164eb8-340f-4ce3-bd4c-e2f2d7925d74">
1118-
<span class="eqno">(2)<a class="headerlink" href="#equation-2d164eb8-340f-4ce3-bd4c-e2f2d7925d74" title="Permalink to this equation">#</a></span>\[\begin{equation}
1117+
<div class="amsmath math notranslate nohighlight" id="equation-48203bcc-acc5-41e4-8587-c0a6c10607fb">
1118+
<span class="eqno">(2)<a class="headerlink" href="#equation-48203bcc-acc5-41e4-8587-c0a6c10607fb" title="Permalink to this equation">#</a></span>\[\begin{equation}
11191119
\text{Softmax}
11201120
\begin{bmatrix}
11211121
7 &amp; -8 &amp; 6 \\
@@ -1134,17 +1134,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11341134
\end{bmatrix}
11351135
\end{equation}\]</div>
11361136
<p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill in the indexes that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
1137-
<div class="amsmath math notranslate nohighlight" id="equation-0253ffa2-35b1-488c-bcf4-d9752f37cee9">
1138-
<span class="eqno">(3)<a class="headerlink" href="#equation-0253ffa2-35b1-488c-bcf4-d9752f37cee9" title="Permalink to this equation">#</a></span>\[\begin{equation}
1137+
<div class="amsmath math notranslate nohighlight" id="equation-579f0937-3f63-4f69-a64b-235de2e41a33">
1138+
<span class="eqno">(3)<a class="headerlink" href="#equation-579f0937-3f63-4f69-a64b-235de2e41a33" title="Permalink to this equation">#</a></span>\[\begin{equation}
11391139
\begin{bmatrix}
11401140
7 &amp; -8 &amp; -\infty \\
11411141
-3 &amp; 2 &amp; -\infty \\
11421142
1 &amp; 6 &amp; -\infty \\
11431143
\end{bmatrix}
11441144
\end{equation}\]</div>
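The masking step above can be sketched in plain Python (rather than PyTorch, to keep it dependency-free). The names `mask_padding` and `keep` are hypothetical helpers for illustration, and the last entry of the third score row is a placeholder, since that column is overwritten anyway.

```python
import math

def softmax(row):
    """Row softmax; math.exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mask_padding(scores, keep):
    """Replace every column j where keep[j] is False with -inf."""
    return [[s if k else float("-inf") for s, k in zip(row, keep)]
            for row in scores]

# Raw scores; the third column is masked, so its values do not matter
# (the -2.0 in the last row is a placeholder, not taken from this section).
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]
keep = [True, True, False]  # False marks the <pad> token

masked = mask_padding(scores, keep)
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

The padded column receives exactly zero weight in every row, while the remaining weights in each row still sum to 1.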
11451145
<p>Taking the softmax of the rows of this matrix then gives:</p>
1146-
<div class="amsmath math notranslate nohighlight" id="equation-4bfa0c8b-4c04-4693-ae6d-2206c57b9264">
1147-
<span class="eqno">(4)<a class="headerlink" href="#equation-4bfa0c8b-4c04-4693-ae6d-2206c57b9264" title="Permalink to this equation">#</a></span>\[\begin{equation}
1146+
<div class="amsmath math notranslate nohighlight" id="equation-c2a4a3f2-727f-4823-9678-0d2e45b64251">
1147+
<span class="eqno">(4)<a class="headerlink" href="#equation-c2a4a3f2-727f-4823-9678-0d2e45b64251" title="Permalink to this equation">#</a></span>\[\begin{equation}
11481148
\text{Softmax}
11491149
\begin{bmatrix}
11501150
7 &amp; -8 &amp; -\infty \\
@@ -1186,8 +1186,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
11861186
<p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
11871187
<p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
11881188
<p>It is clear that our mask is missing a dimension, and we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
1189-
<div class="amsmath math notranslate nohighlight" id="equation-ed607000-2e59-4f14-a321-36f1bd479965">
1190-
<span class="eqno">(5)<a class="headerlink" href="#equation-ed607000-2e59-4f14-a321-36f1bd479965" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
1189+
<div class="amsmath math notranslate nohighlight" id="equation-a60b1a61-35b1-4942-a0de-f3d82ca3bce0">
1190+
<span class="eqno">(5)<a class="headerlink" href="#equation-a60b1a61-35b1-4942-a0de-f3d82ca3bce0" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
11911191
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
11921192
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
11931193
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
@@ -1447,8 +1447,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
14471447
<section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
14481448
<h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
14491449
<p>Let's pretend the raw output of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, is below:</p>
1450-
<div class="amsmath math notranslate nohighlight" id="equation-b19cc58c-07a4-493d-9726-af9cfb39ab99">
1451-
<span class="eqno">(6)<a class="headerlink" href="#equation-b19cc58c-07a4-493d-9726-af9cfb39ab99" title="Permalink to this equation">#</a></span>\[\begin{equation}
1450+
<div class="amsmath math notranslate nohighlight" id="equation-8ad62b77-687e-4290-8db7-aaa61c8b985f">
1451+
<span class="eqno">(6)<a class="headerlink" href="#equation-8ad62b77-687e-4290-8db7-aaa61c8b985f" title="Permalink to this equation">#</a></span>\[\begin{equation}
14521452
\begin{bmatrix}
14531453
7 &amp; -8 &amp; 6 \\
14541454
-3 &amp; 2 &amp; 4 \\
@@ -1459,8 +1459,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14591459
<div class="math notranslate nohighlight">
14601460
\[\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
14611461
<p>Then, we can compute the softmax for each row of the matrix above:</p>
1462-
<div class="amsmath math notranslate nohighlight" id="equation-3b9abc65-eeb7-427f-a4aa-5d72ba8cf814">
1463-
<span class="eqno">(7)<a class="headerlink" href="#equation-3b9abc65-eeb7-427f-a4aa-5d72ba8cf814" title="Permalink to this equation">#</a></span>\[\begin{equation}
1462+
<div class="amsmath math notranslate nohighlight" id="equation-9ba80b99-5751-4c89-aeb1-1417f3ffc534">
1463+
<span class="eqno">(7)<a class="headerlink" href="#equation-9ba80b99-5751-4c89-aeb1-1417f3ffc534" title="Permalink to this equation">#</a></span>\[\begin{equation}
14641464
\text{Softmax}
14651465
\begin{bmatrix}
14661466
7 &amp; -8 &amp; 6 \\
@@ -1499,17 +1499,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14991499
\text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]
15001500
\]</div>
15011501
<p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (every entry above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
1502-
<div class="amsmath math notranslate nohighlight" id="equation-760a21b3-e8a1-4aec-b493-d61ffeb78f22">
1503-
<span class="eqno">(8)<a class="headerlink" href="#equation-760a21b3-e8a1-4aec-b493-d61ffeb78f22" title="Permalink to this equation">#</a></span>\[\begin{equation}
1502+
<div class="amsmath math notranslate nohighlight" id="equation-aab30ec1-747d-4c44-9c19-e3a142972185">
1503+
<span class="eqno">(8)<a class="headerlink" href="#equation-aab30ec1-747d-4c44-9c19-e3a142972185" title="Permalink to this equation">#</a></span>\[\begin{equation}
15041504
\begin{bmatrix}
15051505
7 &amp; -\infty &amp; -\infty \\
15061506
-3 &amp; 2 &amp; -\infty \\
15071507
1 &amp; 6 &amp; -2 \\
15081508
\end{bmatrix}
15091509
\end{equation}\]</div>
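As a small sanity check of the causal masking described above, the following pure-Python sketch fills every entry above the diagonal with negative infinity and then applies a row-wise softmax. The helper names `causal_mask` and `softmax` are our own for illustration; a PyTorch version would more likely build the mask with tensor operations.

```python
import math

def softmax(row):
    """Row softmax; math.exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Set entries above the diagonal (column j > row i) to -inf."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

# Raw QK^T scores from the example.
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]

masked = causal_mask(scores)
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

The first row can attend only to itself, the second row matches the hand computation for \(x_2\), and the last row is left unmasked because it has no future tokens.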
15101510
<p>Taking the softmax of the rows of this matrix then gives:</p>
1511-
<div class="amsmath math notranslate nohighlight" id="equation-629fd4eb-96a4-4077-a5ff-4765234df0f9">
1512-
<span class="eqno">(9)<a class="headerlink" href="#equation-629fd4eb-96a4-4077-a5ff-4765234df0f9" title="Permalink to this equation">#</a></span>\[\begin{equation}
1511+
<div class="amsmath math notranslate nohighlight" id="equation-6e0ab1b6-62e9-4b8d-8a64-7e80ef4002c4">
1512+
<span class="eqno">(9)<a class="headerlink" href="#equation-6e0ab1b6-62e9-4b8d-8a64-7e80ef4002c4" title="Permalink to this equation">#</a></span>\[\begin{equation}
15131513
\text{Softmax}
15141514
\begin{bmatrix}
15151515
7 &amp; -\infty &amp; -\infty \\