illinois-mlp
diff --git a/‎_sources/_sources/lectures/UnsupervisedLearningAnomalyDetection.ipynb‎
Lines changed: 44 additions & 2 deletions b/‎_sources/_sources/lectures/UnsupervisedLearningAnomalyDetection.ipynb‎
diff --git a/‎_sources/lectures/Attention.html‎
Lines changed: 18 additions & 18 deletions b/‎_sources/lectures/Attention.html‎
@@ -181,6 +181,48 @@
         "While our Time Series data is univariate (we have only 1 feature), the code should work for multivariate datasets (multiple features) with little or no modification. Feel free to try it!"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### <span style=\"color:LightGreen\">Brief LSTM Review</span>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "A <span style=\"color:Violet\">Long Short-Term Memory</span> (LSTM) is a type of Recurrent Neural Network (RNN) designed to handle long-term dependencies in sequential data, such as text, time series, and speech. LSTMs are known for their ability to mitigate the vanishing gradient problem that plagues standard RNNs, allowing them to learn and remember information over longer sequences of data.\n",
+        "\n",
+        "Key Features of LSTMs:\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Memory Cell</span>___: LSTMs introduce a memory cell that acts as a \"memory\" for the network, allowing it to store and retrieve information over time.\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Gates</span>___: LSTMs use \"gates\" (input, forget, and output gates) to control the flow of information into, out of, and within the memory cell.\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Vanishing Gradient Problem</span>___: The additive update of the memory cell gives gradients a more direct path back through time, which mitigates the vanishing gradient problem and makes LSTMs more effective at learning long-term relationships in sequential data. Exploding gradients can still occur and are usually handled separately, for example with gradient clipping.\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Sequence Learning</span>___: LSTMs are particularly well-suited for tasks that involve processing sequential data, such as natural language processing (language modeling, machine translation), speech recognition, and time series forecasting.\n",
+        "\n",
+        "How LSTMs Work:\n",
+        "\n",
+        "1. ___<span style=\"color:Violet\">Input</span>___: The LSTM receives an input sequence, where each input represents a time step. \n",
+        "\n",
+        "2. ___<span style=\"color:Violet\">Gates</span>___: The gates regulate the flow of information into the memory cell and the output from the cell. \n",
+        "\n",
+        "3. ___<span style=\"color:Violet\">Memory Cell</span>___: The memory cell stores and updates its internal state based on the input and the previous state. \n",
+        "\n",
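+        "4. ___<span style=\"color:Violet\">Output</span>___: The LSTM produces an output (the hidden state) at each time step by passing the current cell state through the output gate, which is computed from the current input and the previous hidden state.\n",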
+        "\n",
+        "Advantages of LSTMs:\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Long-term dependencies</span>___: LSTMs are capable of learning long-term dependencies in sequential data.\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Vanishing gradient problem</span>___: LSTMs mitigate the vanishing gradient problem, making them more effective for processing long sequences.\n",
+        "\n",
+        "* ___<span style=\"color:Violet\">Wide range of applications</span>___: LSTMs have been successfully applied to many sequence learning tasks."
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
@@ -1670,7 +1712,7 @@
       "provenance": []
     },
     "kernelspec": {
-      "display_name": "venv",
+      "display_name": "Python 3",
       "language": "python",
       "name": "python3"
     },
@@ -1684,7 +1726,7 @@
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
-      "version": "3.12.7"
+      "version": "3.13.1"
     }
   },
   "nbformat": 4,
 
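The "How LSTMs Work" steps in the added review cell can be sketched as a single-unit LSTM step in plain Python. This is a minimal illustration and not the notebook's model: the scalar weights in `params` are hypothetical values chosen for the example, and a real layer would use weight matrices and a deep learning library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM (scalars only, for illustration)."""
    def gate(name, activation):
        # Each gate sees the current input and the previous hidden state.
        w, u, b = params[name]
        return activation(w * x_t + u * h_prev + b)

    f_t = gate("forget", sigmoid)        # how much of the old cell state to keep
    i_t = gate("input", sigmoid)         # how much new information to write
    o_t = gate("output", sigmoid)        # how much of the cell state to expose
    g_t = gate("candidate", math.tanh)   # candidate value to write

    c_t = f_t * c_prev + i_t * g_t       # additive memory cell update
    h_t = o_t * math.tanh(c_t)           # output (hidden state) at this step
    return h_t, c_t

# Hypothetical (input weight, recurrent weight, bias) per gate.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.8, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
outputs = []
for x in [0.5, -1.0, 2.0]:               # a short univariate sequence
    h, c = lstm_step(x, h, c, params)
    outputs.append(h)
print(outputs)
```

Because the cell state is updated by addition (`f_t * c_prev + i_t * g_t`), a forget gate near 1 and an input gate near 0 carry `c_prev` forward almost unchanged, which is how the memory cell retains information over many steps.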
@@ -1099,8 +1099,8 @@ <h3><span style="color:LightGreen">Sequence Padding and Attention Masking</span>
 <section id="span-style-color-orange-computing-the-reweighted-padded-attention-mask-span">
 <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</span><a class="headerlink" href="#span-style-color-orange-computing-the-reweighted-padded-attention-mask-span" title="Permalink to this heading">#</a></h2>
 <p>Let's create some numbers to get a better idea of how this works. Let the tokens be <span class="math notranslate nohighlight">\(X = [10, 2, \text{&lt;pad&gt;}]\)</span>, so the third token is a padding token. Suppose we pass this to our model and compute the attention scores <span class="math notranslate nohighlight">\(QK^T\)</span>. The raw output before the softmax is below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-8a35a337-a51d-4693-9249-265321965daf">
-<span class="eqno">(1)<a class="headerlink" href="#equation-8a35a337-a51d-4693-9249-265321965daf" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-c70790f6-1a83-47ed-acde-dc45b63490b5">
+<span class="eqno">(1)<a class="headerlink" href="#equation-c70790f6-1a83-47ed-acde-dc45b63490b5" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
   -3       &amp; 2   &amp; 4   \\
@@ -1113,8 +1113,8 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
 \text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}
 \]</div>
 <p>If we ignore padding for now, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-756eb811-31e8-485a-b729-a4247b7c9dfe">
-<span class="eqno">(2)<a class="headerlink" href="#equation-756eb811-31e8-485a-b729-a4247b7c9dfe" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-2e2338aa-104a-4755-9040-eba4460729fd">
+<span class="eqno">(2)<a class="headerlink" href="#equation-2e2338aa-104a-4755-9040-eba4460729fd" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
@@ -1133,17 +1133,17 @@ <h2><span style="color:Orange">Computing the Reweighted Padded Attention Mask</s
 \end{bmatrix}
 \end{equation}\]</div>
 <p>But what we need is to mask out all the entries in this matrix related to padding. Just like we did in <a class="reference external" href="https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/GPT">GPT</a>, we will fill the entries that we want to mask with <span class="math notranslate nohighlight">\(-\infty\)</span>. If only the last token in our sequence is a padding token, then the attention scores before the softmax should be written as:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-5709591d-6e65-4893-b935-aaf48acf26e3">
-<span class="eqno">(3)<a class="headerlink" href="#equation-5709591d-6e65-4893-b935-aaf48acf26e3" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-4c716b8f-af62-4d56-97fe-b00066113dda">
+<span class="eqno">(3)<a class="headerlink" href="#equation-4c716b8f-af62-4d56-97fe-b00066113dda" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; -\infty  \\
   -3       &amp; 2   &amp; -\infty   \\
   1       &amp; 6  &amp; -\infty  \\
 \end{bmatrix}
 \end{equation}\]</div>
 <p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-93194f43-7a1e-400a-b6aa-733b2d292957">
-<span class="eqno">(4)<a class="headerlink" href="#equation-93194f43-7a1e-400a-b6aa-733b2d292957" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-ad16f16b-eced-4c4d-9cac-87c514955daf">
+<span class="eqno">(4)<a class="headerlink" href="#equation-ad16f16b-eced-4c4d-9cac-87c514955daf" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
  7       &amp; -8   &amp; -\infty  \\
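The padded-mask computation in this hunk can be checked numerically. The sketch below is plain Python rather than the lecture's PyTorch code. The score matrix is the example from the text, and `key_is_real` is a hypothetical name for the padding mask (`True` for real tokens, `False` for `<pad>`).

```python
import math

def softmax(row):
    m = max(row)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float("-inf")

# Raw QK^T scores from the example above.
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]

# Hypothetical mask name: the third token is padding.
key_is_real = [True, True, False]

# Fill the padded column with -inf, then take the softmax of each row.
masked = [[s if keep else NEG_INF for s, keep in zip(row, key_is_real)]
          for row in scores]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Every row keeps summing to 1 while the padded column gets a weight of exactly 0, so no query attends to the padding token.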
@@ -1185,8 +1185,8 @@ <h3><span style="color:LightGreen">Repeating to Match Attention Matrix Shape</sp
 <p><code class="docutils literal notranslate"><span class="pre">attn.shape</span></code> - (Batch x seq_len x seq_len)</p>
 <p><code class="docutils literal notranslate"><span class="pre">mask.shape</span></code> - (Batch x seq_len)</p>
 <p>It is clear that our mask is missing a dimension, so we need to repeat it. Let's take sequence_1, for instance, which has a mask of [True, True, True, False]. Because the sequence length here is 4, let's repeat this row 4 times:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-6aafbce1-68b0-4a7c-ab18-ad24ddc78de2">
-<span class="eqno">(5)<a class="headerlink" href="#equation-6aafbce1-68b0-4a7c-ab18-ad24ddc78de2" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
+<div class="amsmath math notranslate nohighlight" id="equation-9abc9429-fbf9-4e1e-9a38-f772a6250c1f">
+<span class="eqno">(5)<a class="headerlink" href="#equation-9abc9429-fbf9-4e1e-9a38-f772a6250c1f" title="Permalink to this equation">#</a></span>\[\begin{bmatrix}
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
 \textrm{True} &amp; \textrm{True} &amp; \textrm{True} &amp; \textrm{False} \\
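The row repetition described in this hunk can be sketched with plain Python lists, turning a (Batch x seq_len) mask into a (Batch x seq_len x seq_len) one. The first mask is sequence_1 from the text. The second is a hypothetical sequence added only to show the batch dimension. A tensor library would usually achieve the same effect with broadcasting instead of an explicit copy.

```python
# (Batch x seq_len) padding mask: True marks a real token, False marks padding.
mask = [
    [True, True, True, False],   # sequence_1 from the text
    [True, True, False, False],  # hypothetical second sequence
]

seq_len = len(mask[0])

# Repeat each sequence's mask row seq_len times so that every query position
# (row of the attention matrix) sees the same key mask.
expanded = [[list(row) for _ in range(seq_len)] for row in mask]

print(len(expanded), len(expanded[0]), len(expanded[0][0]))  # Batch, seq_len, seq_len
for row in expanded[0]:
    print(row)
```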
@@ -1446,8 +1446,8 @@ <h3><span style="color:LightGreen">Enforcing Causality</span><a class="headerlin
 <section id="span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span">
 <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mask</span><a class="headerlink" href="#span-style-color-lightgreen-computing-the-reweighted-causal-attention-mask-span" title="Permalink to this heading">#</a></h3>
 <p>Let's pretend the raw outputs of <span class="math notranslate nohighlight">\(QK^T\)</span>, before the softmax, are below:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-bcf05244-fb23-4fbf-ae04-7ee0c29b85f7">
-<span class="eqno">(6)<a class="headerlink" href="#equation-bcf05244-fb23-4fbf-ae04-7ee0c29b85f7" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-a3bcd113-24b6-4be0-8074-bfa67eef5331">
+<span class="eqno">(6)<a class="headerlink" href="#equation-a3bcd113-24b6-4be0-8074-bfa67eef5331" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
   -3       &amp; 2   &amp; 4   \\
@@ -1458,8 +1458,8 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
 <div class="math notranslate nohighlight">
 \[\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}\]</div>
 <p>Then, we can compute the softmax for each row of the matrix above:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-78810f9a-25e8-4d4e-82d7-529671c15bd8">
-<span class="eqno">(7)<a class="headerlink" href="#equation-78810f9a-25e8-4d4e-82d7-529671c15bd8" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-658af593-8f9f-4139-be56-fb460ca2b1eb">
+<span class="eqno">(7)<a class="headerlink" href="#equation-658af593-8f9f-4139-be56-fb460ca2b1eb" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -8   &amp; 6  \\
@@ -1498,17 +1498,17 @@ <h3><span style="color:LightGreen">Computing the Reweighted Causal Attention Mas
 \text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]
 \]</div>
 <p>So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector <span class="math notranslate nohighlight">\(x_2\)</span>, we cannot look ahead to the future value vector <span class="math notranslate nohighlight">\(v_3\)</span>, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of <span class="math notranslate nohighlight">\(QK^T\)</span> (the entries above the diagonal) with <span class="math notranslate nohighlight">\(-\infty\)</span>. This would look like:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-7f0464ed-96a0-4b1f-a906-6faf2d1e6d86">
-<span class="eqno">(8)<a class="headerlink" href="#equation-7f0464ed-96a0-4b1f-a906-6faf2d1e6d86" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-5c744a47-0d2a-4619-9242-21ec929d3bb8">
+<span class="eqno">(8)<a class="headerlink" href="#equation-5c744a47-0d2a-4619-9242-21ec929d3bb8" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \begin{bmatrix}
   7       &amp; -\infty   &amp; -\infty  \\
   -3       &amp; 2   &amp; -\infty   \\
   1       &amp; 6  &amp; -2   \\
 \end{bmatrix}
 \end{equation}\]</div>
 <p>Taking the softmax of the rows of this matrix then gives:</p>
-<div class="amsmath math notranslate nohighlight" id="equation-4b059b5b-dc0d-4cdf-b66c-995c1317ec75">
-<span class="eqno">(9)<a class="headerlink" href="#equation-4b059b5b-dc0d-4cdf-b66c-995c1317ec75" title="Permalink to this equation">#</a></span>\[\begin{equation}
+<div class="amsmath math notranslate nohighlight" id="equation-9ab9add6-4fdb-430f-8eab-90022d1c4d48">
+<span class="eqno">(9)<a class="headerlink" href="#equation-9ab9add6-4fdb-430f-8eab-90022d1c4d48" title="Permalink to this equation">#</a></span>\[\begin{equation}
 \text{Softmax}
 \begin{bmatrix}
   7       &amp; -\infty   &amp; -\infty  \\
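The causal masking described in this hunk can be reproduced in plain Python. This is a minimal sketch using the example scores from the text, not the lecture's PyTorch implementation. Entries above the diagonal are replaced with negative infinity before the row-wise softmax.

```python
import math

def softmax(row):
    m = max(row)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float("-inf")

# Raw QK^T scores from the example above.
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]

n = len(scores)

# Causal mask: query i may only attend to keys j <= i.
masked = [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
          for i in range(n)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

The second row comes out as `[0.0067, 0.9933, 0.0]`, matching the hand computation for `x_2` above, and the first row puts all of its weight on the first token because that token has nothing earlier to attend to.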