<h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
<p>Let's create some numbers to get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{<pad>}]\)</span>, so the third token is a padding token. Suppose we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-3f9a884c-a2d1-4790-9ccc-22eb777faba2">
-<span class="eqno">(1)<a class="headerlink" href="#equation-3f9a884c-a2d1-4790-9ccc-22eb777faba2" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-1458f32b-d442-494e-b855-8a83597bfa2b">
+<span class="eqno">(1)<a class="headerlink" href="#equation-1458f32b-d442-494e-b855-8a83597bfa2b" title="Permalink to this equation">#</a></span>\[\begin{equation}
<p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-63d1a546-6cca-4be4-98c1-750360b7e9cd">
-<span class="eqno">(2)<a class="headerlink" href="#equation-63d1a546-6cca-4be4-98c1-750360b7e9cd" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-04dc3062-b225-4c13-a734-7707da519306">
+<span class="eqno">(2)<a class="headerlink" href="#equation-04dc3062-b225-4c13-a734-7707da519306" title="Permalink to this equation">#</a></span>\[\begin{equation}
<p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill the indexes that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-62962437-0153-4c8a-b58c-c6195acc9c0d">
-<span class="eqno">(3)<a class="headerlink" href="#equation-62962437-0153-4c8a-b58c-c6195acc9c0d" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-416a944c-c56b-44d7-b702-1271bd6c6dbf">
+<span class="eqno">(3)<a class="headerlink" href="#equation-416a944c-c56b-44d7-b702-1271bd6c6dbf" title="Permalink to this equation">#</a></span>\[\begin{equation}
\begin{bmatrix}
7 & -8 & -\infty \\
-3 & 2 & -\infty \\
1 & 6 & -\infty \\
\end{bmatrix}
\end{equation}\]</div>
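<p>As a quick numerical check of the masked matrix above, here is a minimal sketch in plain Python (standard library only, not the PyTorch code used elsewhere in this tutorial). The helper name <code>softmax</code> and the variable names are illustrative assumptions. It shows that a score of negative infinity receives an attention weight of exactly 0, while the remaining weights in each row still sum to 1.</p>

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Row-wise softmax; entries equal to -inf end up with weight 0.

    Assumes at least one finite entry per row (a fully masked row
    would produce NaN and needs separate handling).
    """
    m = max(row)                            # subtract the max for stability
    exps = [math.exp(x - m) for x in row]   # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Scores from the matrix above: the padding column is already set to -inf.
masked_scores = [
    [7.0, -8.0, NEG_INF],
    [-3.0, 2.0, NEG_INF],
    [1.0, 6.0, NEG_INF],
]

attn = [softmax(row) for row in masked_scores]

for row in attn:
    print([round(w, 4) for w in row])
```

<p>In PyTorch, the same effect is usually obtained by filling the masked positions with negative infinity before calling the softmax over the last dimension.</p>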
<p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-bfcfffc9-068d-40aa-8324-1b4354a71176">
-<span class="eqno">(4)<a class="headerlink" href="#equation-bfcfffc9-068d-40aa-8324-1b4354a71176" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-e61b8123-fc93-4e7c-8a95-e5ae2912a354">
+<span class="eqno">(4)<a class="headerlink" href="#equation-e61b8123-fc93-4e7c-8a95-e5ae2912a354" title="Permalink to this equation">#</a></span>\[\begin{equation}
\text{Softmax}
\begin{bmatrix}
7 & -8 & -\infty \\
@@ -1205,8 +1205,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
<p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
<p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
<p>It is clear that our mask is missing a dimension, so we need to repeat it. Take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-fb5058d7-5f44-4108-8ecf-bd4891fce586">
-<span class="eqno">(5)<a class="headerlink" href="#equation-fb5058d7-5f44-4108-8ecf-bd4891fce586" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
+<div class="amsmath math notranslate nohighlight" id="equation-82bd5929-c99b-4a71-aff0-6b735a56c041">
+<span class="eqno">(5)<a class="headerlink" href="#equation-82bd5929-c99b-4a71-aff0-6b735a56c041" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
\textrm{True} & \textrm{True} & \textrm{True} & \textrm{False} \\
\textrm{True} & \textrm{True} & \textrm{True} & \textrm{False} \\
\textrm{True} & \textrm{True} & \textrm{True} & \textrm{False} \\
<h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
<p>Let's pretend the raw outputs of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, are below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-0b415d81-7bbb-4d27-a1f3-f24f92d4c167">
-<span class="eqno">(6)<a class="headerlink" href="#equation-0b415d81-7bbb-4d27-a1f3-f24f92d4c167" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-91439069-0acb-44c5-a9e0-fd7727dfd0a2">
+<span class="eqno">(6)<a class="headerlink" href="#equation-91439069-0acb-44c5-a9e0-fd7727dfd0a2" title="Permalink to this equation">#</a></span>\[\begin{equation}
\begin{bmatrix}
7 & -8 & 6 \\
-3 & 2 & 4 \\
@@ -1478,8 +1478,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
<p>Then, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-97377beb-98f5-43d9-a2e7-723db80e3767">
-<span class="eqno">(7)<a class="headerlink" href="#equation-97377beb-98f5-43d9-a2e7-723db80e3767" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-b14ca7a2-8ccf-4483-b582-46827047f7cd">
+<span class="eqno">(7)<a class="headerlink" href="#equation-b14ca7a2-8ccf-4483-b582-46827047f7cd" title="Permalink to this equation">#</a></span>\[\begin{equation}
\text{Softmax}
\begin{bmatrix}
7 & -8 & 6 \\
@@ -1518,17 +1518,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
<p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (the entries above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-a974de21-62ff-4380-a7df-e472c0b3221d">
-<span class="eqno">(8)<a class="headerlink" href="#equation-a974de21-62ff-4380-a7df-e472c0b3221d" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-8dd4d9f2-0417-4b76-a90a-a48bb6d688a5">
+<span class="eqno">(8)<a class="headerlink" href="#equation-8dd4d9f2-0417-4b76-a90a-a48bb6d688a5" title="Permalink to this equation">#</a></span>\[\begin{equation}
\begin{bmatrix}
7 & -\infty & -\infty \\
-3 & 2 & -\infty \\
1 & 6 & -2 \\
\end{bmatrix}
\end{equation}\]</div>
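<p>The upper-triangle substitution above can be sketched in plain Python (standard library only). The function names <code>causal_mask</code> and <code>softmax</code> are hypothetical helpers for illustration, not the tutorial's PyTorch implementation. The raw scores are the ones used in this section, with the third row read off the unmasked last row of the matrix above.</p>

```python
import math

NEG_INF = float("-inf")


def causal_mask(scores):
    """Return a copy of a square score matrix with every entry above the
    diagonal (column j > row i) replaced by -inf."""
    return [
        [s if j <= i else NEG_INF for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


def softmax(row):
    """Row-wise softmax; -inf entries receive a weight of exactly 0."""
    m = max(row)                            # the diagonal is never masked, so m is finite
    exps = [math.exp(x - m) for x in row]   # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [
    [7.0, -8.0, 6.0],
    [-3.0, 2.0, 4.0],
    [1.0, 6.0, -2.0],
]

masked = causal_mask(raw_scores)
attn = [softmax(row) for row in masked]

for row in attn:
    print([round(w, 4) for w in row])
```

<p>Each row <code>i</code> of <code>attn</code> puts zero weight on positions after <code>i</code>, so token <code>i</code> only attends to itself and earlier tokens, and every row still sums to 1.</p>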
<p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-17d685b6-4194-4b0f-a9a7-741290375345">
-<span class="eqno">(9)<a class="headerlink" href="#equation-17d685b6-4194-4b0f-a9a7-741290375345" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-c8bec7f8-10a9-4e30-bc4a-682e98f92055">
+<span class="eqno">(9)<a class="headerlink" href="#equation-c8bec7f8-10a9-4e30-bc4a-682e98f92055" title="Permalink to this equation">#</a></span>\[\begin{equation}
<h2><span style="color:Orange">Overview</span><a class="headerlink" href="#span-style-color-orange-overview-span" title="Permalink to this heading">#</a></h2>
<p>Fluid mechanics governs many vital processes in the modern world, and modeling fluid dynamics has become increasingly important in many engineering and physics challenges. However, the problems are also becoming increasingly complex: disciplines like multiphase flow (pollutant or disease dispersion), hypersonics (spacecraft atmospheric reentry), and fluid-surface interaction (bio-inspired motion) are all very important yet very computationally expensive to model numerically or experimentally.</p>