deploy: 902634a

msneubauer · msneubauer · commit 537cc513aeb0 · 2025-04-17T02:39:05.000Z
diff --git a/_sources/intro.md b/_sources/intro.md
@@ -90,7 +90,7 @@ At appropriate times throughout the course, you will select from a list of proje
 
 For projects you will put together a Jupyter notebook that demonstrates your project. The notebook should have code and demonstrate the task but also be written in an expository way that other students could, in principle, read and learn from. It is submitted in the same way as the regular course assignments.
 
-Each project notebook must be submitted via Gradescope for grading.
+Each project notebook must be submitted via Gradescope for grading. Late submissions are not allowed for projects. If you do not submit on Gradescope by the deadline, you will receive a grade of zero for that project. There are no exceptions to this policy.
 
 ## <span style="color:Red">Grading</span>
 * Class attendance and participation: 5%
diff --git a/_sources/lectures/Attention.html b/_sources/lectures/Attention.html
@@ -1099,8 +1099,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
 <section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
 <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
 <p>Let's create some numbers so we can get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Let's also pretend that we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-4879249e-730c-4c72-bd88-f861bae2041e">
-<span class="eqno">(1)<a class="headerlink" href="#equation-4879249e-730c-4c72-bd88-f861bae2041e" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-cca887ed-7663-48bc-8764-1bc3dc98cd1b">
+<span class="eqno">(1)<a class="headerlink" href="#equation-cca887ed-7663-48bc-8764-1bc3dc98cd1b" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
   -3       &amp; 2   &amp; 4   \\
@@ -1113,8 +1113,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
 \text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
 \]</div>
 <p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-469b756d-c77d-47bf-8a82-87e2bc96499a">
-<span class="eqno">(2)<a class="headerlink" href="#equation-469b756d-c77d-47bf-8a82-87e2bc96499a" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-b044bbb8-5fe8-4257-90bc-f3ea0de4203b">
+<span class="eqno">(2)<a class="headerlink" href="#equation-b044bbb8-5fe8-4257-90bc-f3ea0de4203b" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
@@ -1133,17 +1133,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
 \end{bmatrix}
 \end{equation}\]</div>
 <p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill the entries that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-aa311aff-3c34-4d6d-b620-620f697cb836">
-<span class="eqno">(3)<a class="headerlink" href="#equation-aa311aff-3c34-4d6d-b620-620f697cb836" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-28997d0e-2785-494d-8a85-5af60c43aded">
+<span class="eqno">(3)<a class="headerlink" href="#equation-28997d0e-2785-494d-8a85-5af60c43aded" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; -\infty  \\
   -3       &amp; 2   &amp; -\infty   \\
   1       &amp; 6  &amp; -\infty  \\
 \end{bmatrix}
 \end{equation}\]</div>
 <p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-606dee84-60c1-4f1f-827c-4913a7a02b0e">
-<span class="eqno">(4)<a class="headerlink" href="#equation-606dee84-60c1-4f1f-827c-4913a7a02b0e" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-e6b399df-b453-4cea-bb55-99c48c0d87cc">
+<span class="eqno">(4)<a class="headerlink" href="#equation-e6b399df-b453-4cea-bb55-99c48c0d87cc" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
  7       &amp; -8   &amp; -\infty  \\
@@ -1185,8 +1185,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
 <p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
 <p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
 <p>It is clear that our mask is missing a dimension, and we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-befe9eff-9127-4ab8-a1bf-a8fbb58c4325">
-<span class="eqno">(5)<a class="headerlink" href="#equation-befe9eff-9127-4ab8-a1bf-a8fbb58c4325" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
+<div class="amsmath math notranslate nohighlight" id="equation-c7c9a3ab-17ac-4a5a-b0a2-52a5eac6f7ec">
+<span class="eqno">(5)<a class="headerlink" href="#equation-c7c9a3ab-17ac-4a5a-b0a2-52a5eac6f7ec" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
@@ -1446,8 +1446,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
 <section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
 <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
 <p>Let's pretend the raw output of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, is below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-4b7ce628-0c08-401f-8c37-779deeaffc59">
-<span class="eqno">(6)<a class="headerlink" href="#equation-4b7ce628-0c08-401f-8c37-779deeaffc59" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-953898fc-df4c-4707-8b13-c5739e6979e5">
+<span class="eqno">(6)<a class="headerlink" href="#equation-953898fc-df4c-4707-8b13-c5739e6979e5" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
   -3       &amp; 2   &amp; 4   \\
@@ -1458,8 +1458,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
 <div class="math notranslate nohighlight">
 \[\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
 <p>Then, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-f53bf636-67a1-4047-bd08-90546434449a">
-<span class="eqno">(7)<a class="headerlink" href="#equation-f53bf636-67a1-4047-bd08-90546434449a" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-5247fabc-a18a-4cb0-9988-428195c4ccdc">
+<span class="eqno">(7)<a class="headerlink" href="#equation-5247fabc-a18a-4cb0-9988-428195c4ccdc" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
@@ -1498,17 +1498,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
 \text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]
 \]</div>
 <p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look forward to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the entries above the diagonal of <span class="math notranslate nohighlight">\(QK^T\)</span> with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-ae838e54-9d6a-443f-bfe3-ab3c5193121e">
-<span class="eqno">(8)<a class="headerlink" href="#equation-ae838e54-9d6a-443f-bfe3-ab3c5193121e" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-bca53567-ce82-4f24-b424-013b88906702">
+<span class="eqno">(8)<a class="headerlink" href="#equation-bca53567-ce82-4f24-b424-013b88906702" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -\infty   &amp; -\infty  \\
   -3       &amp; 2   &amp; -\infty   \\
   1       &amp; 6  &amp; -2   \\
 \end{bmatrix}
 \end{equation}\]</div>
 <p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-a4e382a4-b1a3-46c1-a4d3-623292fdd7db">
-<span class="eqno">(9)<a class="headerlink" href="#equation-a4e382a4-b1a3-46c1-a4d3-623292fdd7db" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-154be58d-b84c-4939-9189-567569ad4ecd">
+<span class="eqno">(9)<a class="headerlink" href="#equation-154be58d-b84c-4939-9189-567569ad4ecd" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -\infty   &amp; -\infty  \\
diff --git a/intro.html b/intro.html
@@ -729,7 +729,7 @@ <h3>Homework Assignments<a class="headerlink" href="#homework-assignments" title
 <h3>Projects<a class="headerlink" href="#projects" title="Permalink to this heading">#</a></h3>
 <p>At appropriate times throughout the course, you will select from a list of projects that involve demonstrating and extending your work in class by doing something cool and interesting in data analysis. You must work alone on this (i.e. without collaboration).</p>
 <p>For projects you will put together a Jupyter notebook that demonstrates your project. The notebook should have code and demonstrate the task but also be written in an expository way that other students could, in principle, read and learn from. It is submitted in the same way as the regular course assignments.</p>
-<p>Each project notebook must be submitted via Gradescope for grading.</p>
+<p>Each project notebook must be submitted via Gradescope for grading. Late submissions are not allowed for projects. If you do not submit on Gradescope by the deadline, you will receive a grade of zero for that project. There are no exceptions to this policy.</p>
 </section>
 </section>
 <section id="span-style-color-red-grading-span">
diff --git a/searchindex.js b/searchindex.js