Commit ba582ab

committed
deploy: f59a71c
1 parent 7e673e3 commit ba582ab

1 file changed

Lines changed: 18 additions & 18 deletions

_sources/lectures/Attention.html

@@ -1118,8 +1118,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
11181118
<section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
11191119
<h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
11201120
<p>Let's create some numbers so we can get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Let's also pretend that we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
1121-
<div class="amsmath math notranslate nohighlight" id="equation-8f648b50-3036-4331-8fd9-2ee0e5c609ca">
1122-
<span class="eqno">(1)<a class="headerlink" href="#equation-8f648b50-3036-4331-8fd9-2ee0e5c609ca" title="Permalink to this equation">#</a></span>\[\begin{equation}
1121+
<div class="amsmath math notranslate nohighlight" id="equation-af658284-bb2e-42be-9574-773a0f45e295">
1122+
<span class="eqno">(1)<a class="headerlink" href="#equation-af658284-bb2e-42be-9574-773a0f45e295" title="Permalink to this equation">#</a></span>\[\begin{equation}
11231123
\begin{bmatrix}
11241124
7 &amp; -8 &amp; 6 \\
11251125
-3 &amp; 2 &amp; 4 \\
@@ -1132,8 +1132,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11321132
\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
11331133
\]</div>
11341134
<p>If we ignore padding for now, we can compute the softmax of each row of the matrix above:</p>
1135-
<div class="amsmath math notranslate nohighlight" id="equation-00df6f30-e080-4980-b108-241820ec8aa2">
1136-
<span class="eqno">(2)<a class="headerlink" href="#equation-00df6f30-e080-4980-b108-241820ec8aa2" title="Permalink to this equation">#</a></span>\[\begin{equation}
1135+
<div class="amsmath math notranslate nohighlight" id="equation-d0cce888-4694-4dc7-bd92-45db67eadaf9">
1136+
<span class="eqno">(2)<a class="headerlink" href="#equation-d0cce888-4694-4dc7-bd92-45db67eadaf9" title="Permalink to this equation">#</a></span>\[\begin{equation}
11371137
\text{Softmax}
11381138
\begin{bmatrix}
11391139
7 &amp; -8 &amp; 6 \\
@@ -1152,17 +1152,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
11521152
\end{bmatrix}
11531153
\end{equation}\]</div>
11541154
<p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill in the entries that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. Since only the last token in our sequence is a padding token, the attention scores before the softmax should be written as:</p>
1155-
<div class="amsmath math notranslate nohighlight" id="equation-a4035f3e-b6fc-43ce-8f63-e4c4159a8c8e">
1156-
<span class="eqno">(3)<a class="headerlink" href="#equation-a4035f3e-b6fc-43ce-8f63-e4c4159a8c8e" title="Permalink to this equation">#</a></span>\[\begin{equation}
1155+
<div class="amsmath math notranslate nohighlight" id="equation-c1f5341d-f539-42aa-be75-4b4468761db0">
1156+
<span class="eqno">(3)<a class="headerlink" href="#equation-c1f5341d-f539-42aa-be75-4b4468761db0" title="Permalink to this equation">#</a></span>\[\begin{equation}
11571157
\begin{bmatrix}
11581158
7 &amp; -8 &amp; -\infty \\
11591159
-3 &amp; 2 &amp; -\infty \\
11601160
1 &amp; 6 &amp; -\infty \\
11611161
\end{bmatrix}
11621162
\end{equation}\]</div>
11631163
<p>Taking the softmax of the rows of this matrix then gives:</p>
1164-
<div class="amsmath math notranslate nohighlight" id="equation-51db25c6-3f1c-4956-a1cb-0d04bab31437">
1165-
<span class="eqno">(4)<a class="headerlink" href="#equation-51db25c6-3f1c-4956-a1cb-0d04bab31437" title="Permalink to this equation">#</a></span>\[\begin{equation}
1164+
<div class="amsmath math notranslate nohighlight" id="equation-0e99714e-ef81-4fe7-af42-9338fb6d621e">
1165+
<span class="eqno">(4)<a class="headerlink" href="#equation-0e99714e-ef81-4fe7-af42-9338fb6d621e" title="Permalink to this equation">#</a></span>\[\begin{equation}
11661166
\text{Softmax}
11671167
\begin{bmatrix}
11681168
7 &amp; -8 &amp; -\infty \\
@@ -1204,8 +1204,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
12041204
<p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
12051205
<p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
12061206
<p>It is clear that our mask is missing a dimension, and we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
1207-
<div class="amsmath math notranslate nohighlight" id="equation-162d3418-cf6b-48ac-8c1e-761fe464d12b">
1208-
<span class="eqno">(5)<a class="headerlink" href="#equation-162d3418-cf6b-48ac-8c1e-761fe464d12b" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
1207+
<div class="amsmath math notranslate nohighlight" id="equation-959385c7-aae4-422c-aff2-cbfe974d1abd">
1208+
<span class="eqno">(5)<a class="headerlink" href="#equation-959385c7-aae4-422c-aff2-cbfe974d1abd" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
12091209
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
12101210
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
12111211
\textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
@@ -1465,8 +1465,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
14651465
<section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
14661466
<h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
14671467
<p>Let's pretend the raw output of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, is below:</p>
1468-
<div class="amsmath math notranslate nohighlight" id="equation-641dc9d8-5ff9-4b3e-b5bf-73f1ccaf4699">
1469-
<span class="eqno">(6)<a class="headerlink" href="#equation-641dc9d8-5ff9-4b3e-b5bf-73f1ccaf4699" title="Permalink to this equation">#</a></span>\[\begin{equation}
1468+
<div class="amsmath math notranslate nohighlight" id="equation-c6f9c264-4c4d-4c8e-9793-e776077f8174">
1469+
<span class="eqno">(6)<a class="headerlink" href="#equation-c6f9c264-4c4d-4c8e-9793-e776077f8174" title="Permalink to this equation">#</a></span>\[\begin{equation}
14701470
\begin{bmatrix}
14711471
7 &amp; -8 &amp; 6 \\
14721472
-3 &amp; 2 &amp; 4 \\
@@ -1477,8 +1477,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
14771477
<div class="math notranslate nohighlight">
14781478
\[\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
14791479
<p>Then, we can compute the softmax of each row of the matrix above:</p>
1480-
<div class="amsmath math notranslate nohighlight" id="equation-d0831650-376e-4d4d-a4cd-51c3d91038e1">
1481-
<span class="eqno">(7)<a class="headerlink" href="#equation-d0831650-376e-4d4d-a4cd-51c3d91038e1" title="Permalink to this equation">#</a></span>\[\begin{equation}
1480+
<div class="amsmath math notranslate nohighlight" id="equation-47072d1f-7b1f-4fad-8bfc-96accbb56dd8">
1481+
<span class="eqno">(7)<a class="headerlink" href="#equation-47072d1f-7b1f-4fad-8bfc-96accbb56dd8" title="Permalink to this equation">#</a></span>\[\begin{equation}
14821482
\text{Softmax}
14831483
\begin{bmatrix}
14841484
7 &amp; -8 &amp; 6 \\
@@ -1517,17 +1517,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
15171517
\text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]
15181518
\]</div>
15191519
<p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (the entries above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
1520-
<div class="amsmath math notranslate nohighlight" id="equation-1d15e2fc-8007-405e-82ae-c108093fdec2">
1521-
<span class="eqno">(8)<a class="headerlink" href="#equation-1d15e2fc-8007-405e-82ae-c108093fdec2" title="Permalink to this equation">#</a></span>\[\begin{equation}
1520+
<div class="amsmath math notranslate nohighlight" id="equation-39aaefdb-ed1c-454e-9c9e-42a98dd8ba8b">
1521+
<span class="eqno">(8)<a class="headerlink" href="#equation-39aaefdb-ed1c-454e-9c9e-42a98dd8ba8b" title="Permalink to this equation">#</a></span>\[\begin{equation}
15221522
\begin{bmatrix}
15231523
7 &amp; -\infty &amp; -\infty \\
15241524
-3 &amp; 2 &amp; -\infty \\
15251525
1 &amp; 6 &amp; -2 \\
15261526
\end{bmatrix}
15271527
\end{equation}\]</div>
15281528
<p>Taking the softmax of the rows of this matrix then gives:</p>
1529-
<div class="amsmath math notranslate nohighlight" id="equation-ec2e29d3-b551-4765-aaad-a25ddba8679a">
1530-
<span class="eqno">(9)<a class="headerlink" href="#equation-ec2e29d3-b551-4765-aaad-a25ddba8679a" title="Permalink to this equation">#</a></span>\[\begin{equation}
1529+
<div class="amsmath math notranslate nohighlight" id="equation-6102b396-e63f-46d9-8622-2c42630f12fb">
1530+
<span class="eqno">(9)<a class="headerlink" href="#equation-6102b396-e63f-46d9-8622-2c42630f12fb" title="Permalink to this equation">#</a></span>\[\begin{equation}
15311531
\text{Softmax}
15321532
\begin{bmatrix}
15331533
7 &amp; -\infty &amp; -\infty \\
