diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..fbe4a5c --- /dev/null +++ b/.gitignore @@ -0,0 +1,3 @@ +*.DS_Store + +*.ipynb_checkpoints/ diff --git a/French_Large_language_models.ipynb b/French_Large_language_models.ipynb deleted file mode 100644 index f3e302d..0000000 --- a/French_Large_language_models.ipynb +++ /dev/null @@ -1,6236 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m2s4kN_QPQVe" - }, - "source": [ - "# LLMs pour tous\n", - "\n", - "\n", - "\n", - "\"Open\n", - "\n", - "© Deep Learning Indaba 2024. Apache License 2.0.\n", - "\n", - " Add your name to the author list\n", - "**Authors: Jabez Magomere, Harry Mayne, Khalil Mrini, Nabra Rizvi, Doudou Ba**\n", - "\n", - "**Introduction:**\n", - "\n", - "Bienvenue à \"LLMs pour Tous\"—votre porte d'entrée vers le monde fascinant des Modèles de Langage de Grande Taille (LLMs) ! Pour commencer, voici un fait amusant : toute cette introduction a été générée par ChatGPT, l'un des nombreux LLMs puissants que vous allez découvrir. 🤖✨\n", - "\n", - "Dans ce tutoriel, vous allez plonger au cœur des principes fondamentaux des transformateurs, la technologie de pointe derrière des modèles comme GPT. Vous aurez également l'occasion de vous exercer à entraîner votre propre Modèle de Langage ! Préparez-vous à explorer comment ces systèmes d'IA impressionnants créent des textes aussi réalistes et captivants. Partons ensemble pour ce voyage passionnant et déverrouillons les secrets des LLMs ! 
🚀📚\n", - "\n", - "**Sujets :**\n", - "\n", - "Contenu : [Introduction à Hugging Face, Mécanisme d'Attention, Architecture du Transformeur, Entraîner votre propre LLM depuis zéro, Ajustement fin d'un LLM pour la Classification de Texte]\n", - "\n", - "Niveau : Débutant, Intermédiaire, Avancé\n", - "\n", - "**Objectifs d'apprentissage :**\n", - "\n", - "* Comprendre l'idée derrière [l'Attention](https://arxiv.org/abs/1706.03762) et pourquoi elle est utilisée.\n", - "* Présenter et décrire les blocs de construction fondamentaux de l'[Architecture du Transformeur](https://arxiv.org/abs/1706.03762) ainsi qu'une intuition sur la conception de cette architecture.\n", - "* Construire et entraîner un simple LLM inspiré de Shakespeare.\n", - "\n", - "**Prérequis :**\n", - "\n", - "* Connaissances introductives en Apprentissage Profond.\n", - "* Connaissances introductives en NLP.\n", - "* Connaissances introductives des modèles séquence à séquence.\n", - "* Compréhension de base en algèbre linéaire.\n", - "\n", - "**Plan :**\n", - "\n", - ">[LLMs pour tous](#scrollTo=m2s4kN_QPQVe)\n", - "\n", - ">>[Installations, Importations et Fonctions Utilitaires](#scrollTo=6EqhIg1odqg0)\n", - "\n", - ">>[Commençons avec une Démo Hugging Face ! Débutant](#scrollTo=4zu5cg-YG4XU)\n", - "\n", - ">>>[Hugging Face](#scrollTo=AwjIIipOG4fz)\n", - "\n", - ">>>[C'est l'heure de la démo ! ⏰⚡ Charger un modèle Hugging Face et exécuter un échantillon](#scrollTo=eq46TV_0G4f0)\n", - "\n", - ">>[1. 
Attention](#scrollTo=-ZUp8i37dFbU)\n", - "\n", - ">>>[Intuition - Débutant](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Comprendre l'Attention en termes simples](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Mécanismes d'attention séquence à séquence - Intermédiaire](#scrollTo=aQfqM1EJyDXI)\n", - "\n", - ">>>[De l'auto-attention à l'attention multi-têtes - Intermédiaire](#scrollTo=J-MU6rrny8Nj)\n", - "\n", - ">>>>[Auto-attention](#scrollTo=0AFUEFZGzCTv)\n", - "\n", - ">>>>>[Requêtes, clés et valeurs](#scrollTo=pwOIMtdZzdTf)\n", - "\n", - ">>>>>[Attention masquée](#scrollTo=D7B-AgO80gIt)\n", - "\n", - ">>>>>[Attention multi-têtes](#scrollTo=OWDubQwCs4zG)\n", - "\n", - ">>[2. Construire votre propre LLM](#scrollTo=e9NW58_3hAg2)\n", - "\n", - ">>>[2.1 Vue d'ensemble générale Débutant](#scrollTo=bA_2coZvhAg3)\n", - "\n", - ">>>[2.2 Tokenisation + Encodage positionnel Débutant](#scrollTo=fbTsk0MdhAhC)\n", - "\n", - ">>>>[2.2.1 Tokenisation](#scrollTo=DehUpfym_RF8)\n", - "\n", - ">>>>[2.2.2 Encodages positionnels](#scrollTo=639s7Zuk_RF9)\n", - "\n", - ">>>>>[Fonctions sinus et cosinus](#scrollTo=rklY-aL-_RF9)\n", - "\n", - ">>>[2.3 Bloc de Transformeur Intermédiaire](#scrollTo=SdNPg0pnhAhG)\n", - "\n", - ">>>>[2.3.1 Réseau de Neurones à Propagation Avant (FFN) / Perceptron Multicouche (MLP) Débutant](#scrollTo=kTURbfr__RF-)\n", - "\n", - ">>>>[2.3.2 Bloc Ajouter et Normaliser Débutant](#scrollTo=Sts5Vr4i_RF-)\n", - "\n", - ">>>[2.4 Construire le Décodeur du Transformeur / LLM Intermédiaire](#scrollTo=91dXd29b_RF_)\n", - "\n", - ">>>[2.5 Entraîner votre LLM](#scrollTo=wmt3tp38G90A)\n", - "\n", - ">>>>[2.5.1 Objectif de l'entraînement Intermédiaire](#scrollTo=agLIpsoh_RGA)\n", - "\n", - ">>>>[2.5.2 Modèles d'entraînement Avancé](#scrollTo=4CSfvGj__RGA)\n", - "\n", - ">>>>[2.5.3 Inspection du LLM entraîné Débutant](#scrollTo=pGv9c2AFmF4V)\n", - "\n", - ">>[Conclusion](#scrollTo=fV3YG7QOZD-B)\n", - "\n", - ">[Retour d'expérience](#scrollTo=o1ndpYE50BpG)\n", - "\n", - "**Avant de 
commencer :**\n", - "\n", - "Pour ce TP, vous aurez besoin d'utiliser un GPU pour accélérer l'entraînement. Pour ce faire, allez dans le menu \"Exécution\" de Colab, sélectionnez \"Modifier le type d'exécution\", puis dans le menu popup, choisissez \"GPU\" dans la boîte \"Accélérateur matériel\".\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "952qogb79nnY" - }, - "source": [ - "**Niveau d'expérience suggéré dans ce sujet :**\n", - "\n", - "| Niveau | Expérience |\n", - "| --- | --- |\n", - "`Débutant` | C'est la première fois que je suis introduit à ce travail. |\n", - "`Intermédiaire` | J'ai suivi quelques cours de base/introductions sur ce sujet. |\n", - "`Avancé` | Je travaille quotidiennement dans ce domaine/sujet. |\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "YBdDHcI_ArCR", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "a7786281-b512-41c2-8c1d-03817f6d1b71" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "En fonction de votre expérience, nous vous recommandons de ne pas essayer de faire toutes les tâches de codage mais plutôt de passer à chaque section et de vous assurer d'interagir avec le LLM affiné avec LoRA présenté dans la dernière section ainsi qu'avec le LLM préentraîné pour obtenir une compréhension pratique du comportement de ces modèles.\n", - "Note : ceci est juste une ligne directrice, n'hésitez pas à explorer le colab comme bon vous semble si vous vous sentez à l'aise !\n" - ] - } - ], - "source": [ - "# @title **Chemins à suivre :** Quel est votre niveau d'expérience dans les sujets présentés dans ce notebook ? 
(Exécutez la cellule)\n", - "experience = \"débutant\" #@param [\"débutant\", \"intermédiaire\", \"avancé\"]\n", - "sections_to_follow=\"\"\n", - "\n", - "\n", - "if experience == \"débutant\": sections_to_follow = \"\"\"nous vous recommandons de ne pas essayer de faire toutes les tâches de codage mais plutôt de passer à chaque section et de vous assurer d'interagir avec le LLM affiné avec LoRA présenté dans la dernière section ainsi qu'avec le LLM préentraîné pour obtenir une compréhension pratique du comportement de ces modèles\"\"\"\n", - "\n", - "elif experience == \"intermédiaire\": sections_to_follow = \"\"\"nous vous recommandons de parcourir chaque section de ce notebook et d'essayer les tâches de codage étiquetées comme débutant ou intermédiaire. Si vous êtes bloqué sur le code, demandez de l'aide à un tuteur ou passez à autre chose pour mieux utiliser le temps de la pratique\"\"\"\n", - "\n", - "elif experience == \"avancé\": sections_to_follow = \"\"\"nous vous recommandons de parcourir chaque section et d'essayer chaque tâche de codage jusqu'à ce que vous réussissiez à la faire fonctionner\"\"\"\n", - "\n", - "\n", - "print(f\"En fonction de votre expérience, {sections_to_follow}.\\nNote : ceci est juste une ligne directrice, n'hésitez pas à explorer le colab comme bon vous semble si vous vous sentez à l'aise !\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6EqhIg1odqg0" - }, - "source": [ - "## Installations, Importations et Fonctions Utilitaires\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4boGA9rYdt9l", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "31e5f620-2420-4668-bd2b-076f053deb42" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.42.4)\n", - "Collecting datasets\n", - " Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 
kB)\n", - "un GPU est connecté.\n" - ] - }, - { - "output_type": "stream", - "name": "stderr", - "text": [ - "[nltk_data] Downloading package word2vec_sample to /root/nltk_data...\n", - "[nltk_data] Unzipping models/word2vec_sample.zip.\n" - ] - } - ], - "source": [ - "## Installer et importer tout ce qui est requis. Capture cache la sortie de la cellule.\n", - "# @title Installer et importer les packages requis. 
(Exécutez la cellule)\n", - "\n", - "!pip install transformers datasets\n", - "!pip install seaborn umap-learn\n", - "!pip install livelossplot\n", - "!pip install -q datasets\n", - "!pip install -q transformers[torch]\n", - "!pip install accelerate -U\n", - "!pip install -q peft\n", - "\n", - "# Utilitaires Python\n", - "!pip install -q ipdb # débogage.\n", - "!pip install -q colorama # impression de couleurs :).\n", - "\n", - "import os\n", - "import math\n", - "import urllib.request\n", - "\n", - "# https://stackoverflow.com/questions/68340858/in-google-colab-is-there-a-programing-way-to-check-which-runtime-like-gpu-or-tpu\n", - "# .get() évite une KeyError lorsque COLAB_GPU n'est pas défini (hors Colab).\n", - "if os.environ.get(\"COLAB_GPU\") and int(os.environ[\"COLAB_GPU\"]) > 0:\n", - "    print(\"un GPU est connecté.\")\n", - "elif \"COLAB_TPU_ADDR\" in os.environ and os.environ[\"COLAB_TPU_ADDR\"]:\n", - "    print(\"Un TPU est connecté.\")\n", - "    import jax.tools.colab_tpu\n", - "\n", - "    jax.tools.colab_tpu.setup_tpu()\n", - "else:\n", - "    print(\"Seul le CPU est connecté.\")\n", - "\n", - "# https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html#gpu-memory-allocation\n", - "# Empêcher JAX de préallouer la mémoire GPU au démarrage.\n", - "os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = \"false\"\n", - "\n", - "import chex\n", - "import flax\n", - "import flax.linen as nn\n", - "import jax\n", - "import jax.numpy as jnp\n", - "from jax import grad, jit, vmap\n", - "import optax\n", - "\n", - "import transformers\n", - "from transformers import pipeline, AutoTokenizer, AutoModel\n", - "import datasets\n", - "import peft\n", - "\n", - "from PIL import Image\n", - "from livelossplot import PlotLosses\n", - "\n", - "# Utilitaires.\n", - "import colorama\n", - "\n", - "import torch\n", - "import torchvision\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import seaborn as sns\n", - "\n", - "import itertools\n", - "import random\n", - "\n", - "# télécharger des images utilisées dans le notebook\n", - 
"urllib.request.urlretrieve(\n", - "    \"https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGNhdHxlbnwwfHwwfHw%3D&w=1000&q=80\",\n", - "    \"cat.png\",\n", - ")\n", - "\n", - "import copy\n", - "\n", - "import gensim\n", - "from nltk.data import find\n", - "import nltk\n", - "\n", - "nltk.download(\"word2vec_sample\")\n", - "\n", - "import huggingface_hub\n", - "import ipywidgets as widgets\n", - "from IPython.display import display\n", - "%config InlineBackend.figure_format = 'svg'\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-9X10jhocGaS" - }, - "outputs": [], - "source": [ - "# @title Fonctions Utilitaires pour le Tracé. (Exécutez la Cellule)\n", - "def plot_position_encodings(P, max_tokens, d_model):\n", - "    \"\"\"Fonction qui prend en entrée une matrice d'encodage positionnel et la trace.\"\"\"\n", - "\n", - "    plt.figure(figsize=(20, np.min([8, max_tokens])))\n", - "    im = plt.imshow(P, aspect=\"auto\", cmap=\"Blues_r\")\n", - "    # La barre de couleurs reprend la palette de l'image ; \"blue\" n'est pas un nom de palette valide.\n", - "    plt.colorbar(im)\n", - "\n", - "    if d_model <= 64:\n", - "        plt.xticks(range(d_model))\n", - "    if max_tokens <= 32:\n", - "        plt.yticks(range(max_tokens))\n", - "    plt.xlabel(\"Indice d'intégration\")\n", - "    plt.ylabel(\"Indice de position\")\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_image_patches(patches):\n", - "    \"\"\"Fonction qui prend en entrée une liste de patchs d'image et les trace.\"\"\"\n", - "    axes = []\n", - "    fig = plt.figure(figsize=(25, 25))\n", - "    for a in range(patches.shape[1]):\n", - "        axes.append(fig.add_subplot(1, patches.shape[1], a + 1))\n", - "        plt.imshow(patches[0][a])\n", - "    fig.tight_layout()\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_projected_embeddings(embeddings, labels):\n", - "    \"\"\"Fonction qui prend en entrée une liste d'intégrations, les projette dans un espace 2D et les trace en utilisant UMAP.\"\"\"\n", - "    import umap\n", - "    import seaborn as sns\n", - "\n", 
- "    projected_embeddings = umap.UMAP().fit_transform(embeddings)\n", - "\n", - "    plt.figure(figsize=(15, 8))\n", - "    plt.title(\"Intégrations de texte projetées\")\n", - "    sns.scatterplot(\n", - "        x=projected_embeddings[:, 0], y=projected_embeddings[:, 1], hue=labels\n", - "    )\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_attention_weight_matrix(weight_matrix, x_ticks, y_ticks):\n", - "    \"\"\"Fonction qui prend en entrée une matrice de poids et la trace avec des graduations personnalisées.\"\"\"\n", - "    plt.figure(figsize=(15, 7))\n", - "    sns.heatmap(weight_matrix, cmap=\"Blues\")\n", - "    plt.xticks(np.arange(weight_matrix.shape[1]) + 0.5, x_ticks)\n", - "    plt.yticks(np.arange(weight_matrix.shape[0]) + 0.5, y_ticks)\n", - "    plt.title(\"Matrice d'attention\")\n", - "    plt.xlabel(\"Score d'attention\")\n", - "    plt.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kMkaKekB_pR4" - }, - "outputs": [], - "source": [ - "# @title Fonctions Utilitaires pour le Traitement de Texte. 
(Exécutez la Cellule)\n", - "\n", - "def get_word2vec_embedding(words):\n", - "    \"\"\"\n", - "    Fonction qui prend une liste de mots et retourne leurs intégrations (embeddings),\n", - "    calculées par un encodeur word2vec préentraîné, ainsi que la liste des mots trouvés.\n", - "    \"\"\"\n", - "    word2vec_sample = str(find(\"models/word2vec_sample/pruned.word2vec.txt\"))\n", - "    model = gensim.models.KeyedVectors.load_word2vec_format(\n", - "        word2vec_sample, binary=False\n", - "    )\n", - "\n", - "    output = []\n", - "    words_pass = []\n", - "    for word in words:\n", - "        try:\n", - "            # get_vector remplace word_vec, qui est déprécié dans gensim\n", - "            output.append(jnp.array(model.get_vector(word)))\n", - "            words_pass.append(word)\n", - "        except KeyError:\n", - "            # mot absent du vocabulaire word2vec : on l'ignore\n", - "            pass\n", - "\n", - "    embeddings = jnp.array(output)\n", - "    del model  # libérer la mémoire occupée par le modèle\n", - "    return embeddings, words_pass\n", - "\n", - "\n", - "def remove_punctuation(text):\n", - "    \"\"\"Fonction qui prend une chaîne de caractères et supprime toute la ponctuation.\"\"\"\n", - "    import re\n", - "\n", - "    text = re.sub(r\"[^\\w\\s]\", \"\", text)\n", - "    return text\n", - "\n", - "def print_sample(prompt: str, sample: str):\n", - "    \"\"\"Fonction qui prend un prompt et une réponse de modèle et\n", - "    les affiche dans des couleurs différentes pour les distinguer.\"\"\"\n", - "    print(colorama.Fore.MAGENTA + prompt, end=\"\")\n", - "    print(colorama.Fore.BLUE + sample)\n", - "    print(colorama.Fore.RESET)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4zu5cg-YG4XU" - }, - "source": [ - "## Commençons avec une Démo Hugging Face ! Débutant\n", - "\n", - "Nous sommes ravis de vous avoir à bord ! 🎉 Avant de plonger dans la partie pratique de notre voyage, faisons un petit détour dans le monde fascinant de [Hugging Face](https://huggingface.co/), une plateforme open-source incroyable pour construire et déployer des modèles de langage à la pointe de la technologie.
🌐\n", - "\n", - "Comme avant-goût de ce que nous allons créer aujourd'hui, nous allons commencer par charger un *petit* modèle de langage (*en comparaison avec les modèles d'aujourd'hui) et lui donner une instruction simple. Cela vous donnera un aperçu de la manière d'interagir avec ces bibliothèques puissantes. 💡 Préparez-vous à débloquer le potentiel des modèles de langage avec juste quelques lignes de code !\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AwjIIipOG4fz" - }, - "source": [ - "### Hugging Face\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N2DSHiuhG4f0" - }, - "source": [ - "\n", - "\n", - "\n", - "[Hugging Face](https://huggingface.co/) est une startup fondée en 2016 et, selon leurs propres mots : \"ils ont pour mission de démocratiser le machine learning de qualité, un commit à la fois.\" Actuellement, ils sont une véritable mine d'or pour les outils permettant de travailler avec les Modèles de Langage de Grande Taille (LLMs).\n", - "\n", - "Ils ont développé divers packages open-source et permettent aux utilisateurs d'interagir facilement avec un large corpus de modèles transformeurs préentraînés (dans toutes les modalités) et de datasets pour entraîner ou ajuster finement ces transformeurs préentraînés. Leur logiciel est largement utilisé dans l'industrie et la recherche. 
Pour plus de détails sur eux et leur utilisation, référez-vous à [l'exercice pratique sur l'attention et les transformeurs de 2022](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb#scrollTo=qFBw8kRx-4Mk).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3xdt9PQ6G4f0" - }, - "source": [ - "Dans ce colab, nous affichons les prompts en rose et les échantillons générés par un modèle en bleu comme dans l'exemple ci-dessous :\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L-8C9SJCG4f0", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "074ce871-051b-45d3-e334-b372e61a187c" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mMon faux prompt\u001b[34m est génial !\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "print_sample(prompt='Mon faux prompt', sample=' est génial !')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eq46TV_0G4f0" - }, - "source": [ - "### C'est l'heure de la démo ! ⏰⚡ Charger un modèle Hugging Face et exécuter un échantillon\n", - "\n", - "Plongeons dans la simplicité de charger et d'interagir avec un modèle de Hugging Face !\n", - "\n", - "Pour ce tutoriel, nous avons préconfiguré deux options de modèles :\n", - "\n", - "- **`gpt-neo-125M`** : Un modèle plus petit avec 125 millions de paramètres. Il est plus rapide et utilise moins de mémoire—parfait pour commencer ! 
Nous vous recommandons d'essayer celui-ci en premier.\n", - "- **`gpt2-medium`** : Un modèle plus grand avec 355 millions de paramètres pour une utilisation plus avancée.\n", - "\n", - "Si vous souhaitez changer de modèle, il vous suffit de redémarrer le noyau Colab et de mettre à jour le nom du modèle dans la cellule ci-dessous.\n", - "\n", - "**Remarque** : Les étapes que nous allons montrer fonctionnent non seulement pour ces modèles, mais aussi pour [tous les modèles](https://huggingface.co/models?pipeline_tag=text-generation) sur Hugging Face qui prennent en charge les pipelines de génération de texte.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QVV28V-TG4f1", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 561, - "referenced_widgets": [ - "316ba561018544789643764625282f15", - "504231a374294cb19fd8e0b9b89ec7e8", - "7c863229d1e24de6822714b5ca26b79c", - "f6fdd9f590b54ac88880b19cf8188065", - "39e4060377a24cb2bc0076077938373c", - "ee2f118a071d494abae2e98eda6f1ed6", - "b555d0c16d0f4d969091617b20a84a76", - "bdedbd9e96d54ac4aa3c7fb35a6f76f9", - "4d8571051f134363977878a50dd03319", - "3f8189acfa7f48b98c7707e89061cc29", - "285c44f874904f3b8ef246b81faf2f2c", - "0942aeb966a749abb611511e81973b82", - "5ca27309759c490fb4ccfef3816b0770", - "f8ddc53b692a48c4a68679ea3af026f6", - "59601d4889eb41c99ab6fb842d77b74e", - "39bc3697123b4df8b91f07979be4f67e", - "8c3d13a57fb84984b75e210c7d7227d0", - "ee51b3dfda4b43e693f9ca1fbfa5ab48", - "d02b455c0f544b39a36fa1138319ffda", - "ebe9f06209dc44588c34b923d635e217", - "f26169bf7f324140be5166e3c6bf55ec", - "e5472083b49a4e4f854565089a3e96de", - "a8fe0138076743c0992cff08634ca2e5", - "8806049050f245058d1e58ff7bb653b2", - "7b4fc56aa430480fad6a989af26fb675", - "69b82aa785ac4e8fa5a3bb09c8c68008", - "7980af54051c450e96d3179a5890f089", - "2c7b228ec8e6459fbfefcf5d9ab63dd9", - "bee5c02c3f3d411fbbe1fd876d331483", - "5e879a40d6304c00acb0d18740f8c751", - "5fc7ca54266a4ffab507d369ce58208d", - 
"caeeee54e30b43c1997cab8d6761f840", - "c73c160b140247d695ce61e973af9c2c", - "f6aebc9a9c804a37af70db7ad38bd960", - "883970442d624ce8b375705d2067c2ec", - "aaf60a6c52144cc49cc58724b1861a70", - "1ddfcc9eaea049e9956759b4dbbfe7af", - "3e1a39ba92ae4f04b369fed753420305", - "029bfefe634b4c57b358cf42f7dbef4e", - "241a186183b040df81474f1766afa588", - "3f1082fb3fe54f1bb67f61019e060ac9", - "3f5a6372cf29479393fb7986ec2f20b8", - "405e3a80340f4be68684bc7a1edf9854", - "6e4de36b7eaa4395b20c945acadfbc8d", - "8ccd010264ab42499b510f647dfdb223", - "e16938f8f6f044d9a3e5c29bc80acbfd", - "c8467aa01ba24c30ba8fff4f6bdafc55", - "1b270e97e9674a7ca6cf27080576f836", - "f601815ac1d5407f908ba75d09f377ee", - "60287ad958d347d0bd996427b80e2887", - "d76fa1d824424fa688902a84e8b6a77c", - "a13529b977c045ecae240fb095c218b2", - "b8fb12904b2d41f78288c503391489fe", - "a66e86a718494fc0a88a4d214389bc44", - "6ea3ab754b1044b3b3a0320a9caed1a4", - "67919bf3c5a44627bc83cc80fe12a240", - "4ea62680b80444db951646b80fcd4c95", - "f5408df7b5c24396b6f964892593164f", - "5cc97ff3b54545afaa3f83152f61fad5", - "e64410977eb84d929623935464b6b076", - "ea5d720fe84f4046b7a4f28b246e12c6", - "28fcfb20c5354967ad3a82fef7a59956", - "ebce223b50ba48cd90070111f1d36fa8", - "1c7d003576a24216bd248a54d2f558bf", - "9e9f4e2b0ba64e509262306b9588102c", - "36904956f38d41d7b201e0e2c5c53f83", - "304773d4d8444473a3adb55c37d12ac8", - "4ab40baf831c416f9e57af3c0a54bbfc", - "03c314daeb7c4938a7f1bc90c321e2d6", - "0202e22b66e34ed7b8c92eabfc6e6756", - "be6e9667d48b4c4796acd0ee503ee2c4", - "fdcc2d118ee54e72ae22949291b9fca8", - "5a077af432024aa1af479aa060151def", - "141c082cc81447159582c9cd414ded81", - "895f142dc9b146e782e5c42c1396dc45", - "e552d2d2e3ff4274bca7153ed60d2051", - "22abcd28d2b44706bafda4350243e28e", - "0b93bc073a9341c69e9d33a99feefa99", - "a6a3fa52ca68433fb545c61d40db660c", - "83a23a023ef24bc5b1c38eac0dcb49cb", - "0d69fd544d7e4b4b9f11a8675fb550b7", - "a5dfeffb71c94fdabcd6ad1bd26bc9f4", - "e0b3e4532ab5451689128fdeb5263667", - 
"f9f2c5448107489a854333ff50200cd3", - "30b8b5253bed447785ef4e9e5efcc533", - "a3af18369e0d410097ff234172bdd08d", - "4e0e3ebe2495481e83c0792a4ae7f2fc", - "6c48cb5fabc44ad085edf0fb01309977" - ] - }, - "outputId": "b1f17447-feb5-42ba-d466-49a4e1e83e69" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "config.json: 0%| | 0.00/1.01k [00:00 str:\n", - " # Convertir le texte du prompt en tokens que le modèle peut traiter\n", - " inputs = tokenizer(prompt, return_tensors=\"pt\")\n", - "\n", - " # Extraire les tokens (IDs d'entrée) et le masque d'attention (pour se concentrer sur les parties importantes) des entrées\n", - " input_ids = inputs[\"input_ids\"]\n", - " attention_mask = inputs[\"attention_mask\"]\n", - "\n", - " # Déplacer les tokens et le masque d'attention vers le même appareil que le modèle (comme un GPU si disponible)\n", - " input_ids = input_ids.to(model.device)\n", - " attention_mask = attention_mask.to(model.device)\n", - "\n", - " # Configurer comment nous voulons que le modèle génère du texte\n", - " generation_config = transformers.GenerationConfig(\n", - " do_sample=True, # Permettre au modèle d'ajouter un peu d'aléatoire à sa génération de texte\n", - " temperature=temperature, # Ajuster à quel point la sortie est aléatoire ; plus bas signifie plus focalisé\n", - " top_p=top_p, # Prendre en 
compte les mots les plus probables qui constituent les 90 % de possibilités les plus élevées\n", - " pad_token_id=tokenizer.pad_token_id, # Utiliser l'ID du token qui représente le remplissage (espace supplémentaire)\n", - " top_k=0, # Nous ne limitons pas aux top-k mots, donc nous définissons cela à 0\n", - " )\n", - "\n", - " # Si une graine est fournie, la définir pour que les résultats soient répétables (même sortie à chaque fois)\n", - " if seed is not None:\n", - " torch.manual_seed(seed)\n", - "\n", - " # Générer du texte en utilisant le modèle avec les paramètres que nous avons définis\n", - " generation_output = model.generate(\n", - " input_ids=input_ids, # Fournir les tokens d'entrée au modèle\n", - " attention_mask=attention_mask, # Fournir le masque d'attention pour aider le modèle à se concentrer\n", - " return_dict_in_generate=True, # Demander au modèle de retourner des informations détaillées\n", - " output_scores=True, # Inclure les scores (niveaux de confiance) pour les tokens générés\n", - " max_new_tokens=max_new_tokens, # Définir le nombre maximum de tokens à générer\n", - " generation_config=generation_config, # Appliquer nos paramètres de génération de texte personnalisés\n", - " )\n", - "\n", - " # S'assurer qu'une seule séquence (sortie) est générée, pour simplifier les choses\n", - " assert len(generation_output.sequences) == 1\n", - "\n", - " # Obtenir la séquence générée de tokens\n", - " output_sequence = generation_output.sequences[0]\n", - "\n", - " # Convertir les tokens générés en texte lisible\n", - " output_string = tokenizer.decode(output_sequence)\n", - "\n", - " # Extraire uniquement le nouveau texte (en supprimant le prompt original) et le nettoyer\n", - " response = output_string.split(prompt)[1].rstrip()\n", - "\n", - " # Afficher le prompt et la réponse générée\n", - " print_sample(prompt, response)\n", - "\n", - " # Retourner la réponse textuelle générée\n", - " return response\n" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "id": "Yme6VzW4G4f1", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "630d05f2-9736-4c95-c33c-8a7dbc9a7d5c" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mWhat is love?\u001b[34m\n", - "\n", - "Love is a great way to express your feelings.\n", - "\n", - "Love can be an important part of any relationship, but it can also be an important part of a marriage.\n", - "\n", - "Love can be the key to a marriage.\n", - "\n", - "Love can be the key to a relationship.\n", - "\n", - "Love can be the\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "_ = run_sample(model, tokenizer, prompt=\"What is love?\", seed=2)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V7vnUawyG4f1" - }, - "source": [ - "Assez incroyable, n'est-ce pas ? 🤩 Bien que cela ait pu être époustouflant en 2021, il y a de fortes chances que la plupart d'entre vous aient déjà interagi avec des modèles de langage de grande taille d'une manière ou d'une autre. Aujourd'hui, nous allons aller plus loin en entraînant notre propre **LLM inspiré de Shakespeare**. Cela nous donnera une compréhension pratique du fonctionnement de ces modèles de langage sous le capot.\n", - "\n", - "Mais avant de nous lancer dans l'entraînement, construisons d'abord une compréhension solide de ce que sont les **Modèles de Langage de Grande Taille** et des concepts clés en **Apprentissage Automatique** qui rendent cette technologie révolutionnaire possible. Au cœur des LLMs de pointe (SoTA) d'aujourd'hui se trouvent le **Mécanisme d'Attention** et l'**Architecture du Transformeur**. Nous explorerons ces concepts essentiels dans les sections à venir de ce tutoriel. 🚀💡\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-ZUp8i37dFbU" - }, - "source": [ - "## **1. 
Attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "acgW1ofF_RFz" - }, - "source": [ - "Le mécanisme d'attention est inspiré par la manière dont les humains regardent une image ou lisent une phrase.\n", - "\n", - "Prenons l'image du chien en vêtements humains ci-dessous (image et exemple tirés de [cet article](https://lilianweng.github.io/posts/2018-06-24-attention/)). Lorsque nous prêtons *attention* aux blocs de pixels rouges, nous constatons que le bloc jaune des oreilles pointues est quelque chose que nous attendions (corrélé), mais que les blocs gris des vêtements humains sont inattendus pour nous (non corrélés). Ce jugement est *basé sur ce que nous avons vu dans le passé* en regardant des photos de chiens, et plus précisément de Shiba Inu.\n", - "\n", - "[Image : chien en vêtements humains, avec des blocs de pixels rouges, jaunes et gris]\n", - "\n", - "Supposons que nous voulions identifier la race du chien dans cette image. Lorsque nous regardons les blocs de pixels rouges, nous avons tendance à prêter plus d'*attention* aux pixels pertinents qui leur sont similaires ou liés, par exemple ceux de la boîte jaune. Nous ignorons presque complètement la neige en arrière-plan et les vêtements humains pour cette tâche.\n", - "\n", - "À l'inverse, lorsque nous regardons l'arrière-plan pour tenter d'identifier ce qu'il contient, nous ignorons inconsciemment les pixels du chien car ils sont sans importance pour cette tâche.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "usLBF2g0x5gH" - }, - "source": [ - "La même chose se produit lorsque nous lisons.
Pour comprendre toute la phrase, nous apprenons à corréler certains mots et à leur *prêter attention* en fonction du contexte de la phrase entière.\n", - "\n", - "[Image : deux phrases d'exemple contenant le mot \"apple\"]\n", - "\n", - "Par exemple, dans la première phrase de l'image ci-dessus, en regardant le mot \"coding\", nous prêtons plus d'attention aux mots \"Apple\" et \"computer\" car nous savons que lorsque nous parlons de codage, \"Apple\" fait en fait référence à l'entreprise. Cependant, dans la deuxième phrase, nous réalisons que nous ne devrions pas considérer \"apple\" en regardant \"code\" car, compte tenu du contexte du reste de la phrase, nous savons que cette pomme fait référence au fruit et non à un ordinateur.\n", - "\n", - "Nous pouvons construire de meilleurs modèles en développant des mécanismes qui imitent l'attention. Cela permettra à nos modèles d'apprendre de meilleures représentations de nos données d'entrée en contextualisant ce qu'ils savent sur certaines parties de l'entrée en fonction d'autres parties. Dans les sections suivantes, nous explorerons les mécanismes qui nous permettent d'entraîner des modèles d'apprentissage profond à prêter attention aux données d'entrée dans le contexte d'autres données d'entrée.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ygdi884ugGcu" - }, - "source": [ - "### Intuition - Débutant\n", - "\n", - "Imaginez l'attention comme un mécanisme qui permet à un réseau neuronal de se concentrer davantage sur certaines parties des données. Le réseau peut ainsi mieux cerner le problème sur lequel il travaille et mettre à jour ses représentations en conséquence.\n", - "\n", - "### Comprendre l'Attention en termes simples\n", - "\n", - "Une façon de mettre en œuvre l'attention dans les réseaux neuronaux est de représenter chaque mot (ou même des parties d'un mot) comme un vecteur.\n", - "\n", - "Alors, qu'est-ce qu'un vecteur ?
Un vecteur est simplement un tableau de nombres (appelés nombres réels) qui peut avoir différentes longueurs. Pensez-y comme à une liste de valeurs qui décrivent certaines propriétés d'un mot. Ces vecteurs nous permettent de mesurer à quel point deux mots sont similaires. Une façon courante de mesurer cette similarité est de calculer ce qu'on appelle le **produit scalaire**.\n", - "\n", - "Le résultat de ce calcul de similarité est ce que nous appelons **l'attention.** Cette valeur d'attention aide le modèle à décider dans quelle mesure un mot doit influencer la représentation d'un autre mot.\n", - "\n", - "En termes plus simples, si deux mots ont des représentations vectorielles similaires, cela signifie qu'ils sont probablement liés ou importants l'un pour l'autre. En raison de cette relation, ils affectent les représentations de chacun dans le réseau neuronal, permettant au modèle de mieux comprendre le contexte. 🎯\n", - "\n", - "Pour illustrer comment le produit scalaire peut créer des poids d'attention significatifs, nous utiliserons des embeddings [word2vec](https://jalammar.github.io/illustrated-word2vec/) pré-entraînés. Ces embeddings word2vec sont générés par un réseau neuronal qui a appris à créer des embeddings similaires pour des mots ayant des significations similaires.\n", - "\n", - "En calculant la matrice des produits scalaires entre tous les vecteurs, nous obtenons une matrice d'attention. 
Cela indiquera quels mots sont corrélés et devraient donc \"se prêter attention\" mutuellement.\n", - "\n", - "[1] Vous pouvez trouver plus de détails sur la façon dont cela est fait pour les LLMs dans la session \"Construire votre propre LLM\".\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OvBYShCFk6WC" - }, - "source": [ - "**Tâche de code** Intermédiaire : Complétez la fonction d'attention par produit scalaire ci-dessous.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yrbITGPnk7Ce" - }, - "outputs": [], - "source": [ - "def dot_product_attention(hidden_states, previous_state):\n", - " \"\"\"\n", - " Calculer le produit scalaire entre les états cachés et les états précédents.\n", - "\n", - " Args:\n", - " hidden_states: Un tenseur de forme [T_hidden, dm]\n", - " previous_state: Un tenseur de forme [T_previous, dm]\n", - " \"\"\"\n", - "\n", - " scores = ... # Complétez ici\n", - " w_n = ... # Complétez ici\n", - "\n", - " # multiplier les poids par les états cachés pour obtenir le vecteur de contexte\n", - " c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - " return w_n, c_t\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QARgTrNZlIqH", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "4e27787d-b19f-40a8-aab9-9854be8f3e41" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Il semble que la fonction n'est pas encore complètement implémentée. 
Essayez de la modifier.\n" - ] - } - ], - "source": [ - "# @title Exécutez-moi pour tester votre code\n", - "\n", - "key = jax.random.PRNGKey(42)\n", - "x = jax.random.normal(key, [2, 2])\n", - "\n", - "try:\n", - " w_n, c_t = dot_product_attention(x, x)\n", - "\n", - " w_n_correct = jnp.array([[0.9567678, 0.04323225], [0.00121029, 0.99878967]])\n", - " c_t_correct = jnp.array([[0.11144122, 0.95290256], [-1.5571996, -1.5321486]])\n", - " assert jnp.allclose(w_n_correct, w_n), \"w_n n'est pas calculé correctement\"\n", - " assert jnp.allclose(c_t_correct, c_t), \"c_t n'est pas calculé correctement\"\n", - "\n", - " print(\"Cela semble correct. Regardez la réponse ci-dessous pour comparer les méthodes.\")\n", - "except Exception:\n", - " print(\"Il semble que la fonction n'est pas encore complètement implémentée. Essayez de la modifier.\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Qa6PyKYnkzUJ" - }, - "outputs": [], - "source": [ - "# Lors du changement de ces mots, notez que si le mot n'est pas dans le corpus\n", - "# d'entraînement original, il ne sera pas affiché dans le graphique de la matrice des poids.\n", - "# @title Réponse à la tâche de code (Essayez de ne pas regarder avant d'avoir bien essayé !)\n", - "def dot_product_attention(hidden_states, previous_state):\n", - " # Calculer les scores d'attention :\n", - " # Multiplier la matrice des états précédents par la transposée de la matrice des états cachés.\n", - " # Cela nous donne une matrice de scores qui montre à quel point chaque élément de l'état précédent\n", - " # doit prêter attention à chaque élément des états cachés.\n", - " # Le résultat est une matrice de forme [T_previous, T_hidden], où :\n", - " # T_previous est le nombre d'éléments dans l'état précédent,\n", - " # T_hidden est le nombre d'éléments dans les états cachés.\n", - " scores = jnp.matmul(previous_state, hidden_states.T)\n", - "\n", - " # Appliquer la fonction softmax aux scores pour les convertir en probabilités.\n", - " # Cela
normalise les scores pour que leur somme soit égale à 1 sur chaque ligne (dernier axe),\n", - " # ce qui nous permet de les interpréter comme le degré d'attention à accorder à chaque état caché.\n", - " w_n = jax.nn.softmax(scores)\n", - "\n", - " # Calculer le vecteur de contexte (c_t) :\n", - " # Multiplier les poids d'attention (w_n) par les états cachés.\n", - " # Cela combine les états cachés en fonction du degré d'attention que chacun mérite,\n", - " # ce qui donne un nouveau vecteur qui représente la somme pondérée des états cachés.\n", - " # La forme résultante est [T_previous, dm], où :\n", - " # T_previous est le nombre d'éléments dans l'état précédent,\n", - " # dm est la dimension des états cachés.\n", - " c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - " # Retourner les poids d'attention et le vecteur de contexte.\n", - " return w_n, c_t\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QlHL3e_QhLfq", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 654 - }, - "outputId": "0c256368-ef3b-442d-ce83-6af37cf456ac" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - ":17: DeprecationWarning: Call to deprecated `word_vec` (Use get_vector instead).\n", - " output.append(jnp.array(model.word_vec(word)))\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "[Figure : carte de chaleur \"Matrice d'attention\" des poids d'attention entre les mots, générée avec Matplotlib v3.7.1]" - }, - "metadata": {} - } - ], - "source": [ - "mots =
[\"king\", \"queen\", \"royalty\", \"food\", \"apple\", \"pear\", \"computers\"]\n", - "embeddings_mots, mots = get_word2vec_embedding(mots)\n", - "poids, _ = dot_product_attention(embeddings_mots, embeddings_mots)\n", - "plot_attention_weight_matrix(poids, mots, mots)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tItZU09YlhEZ" - }, - "source": [ - "En regardant la matrice, nous pouvons voir quels mots ont des significations similaires. Les mots du groupe \"royauté\" (king, queen, royalty) ont des scores d'attention plus élevés entre eux qu'avec les mots du groupe \"nourriture\" (food, apple, pear), qui se prêtent eux aussi attention mutuellement. Nous voyons également que \"computers\" a des scores d'attention très bas avec tous les autres mots, ce qui montre qu'il n'est fortement lié ni aux mots du groupe \"royauté\" ni à ceux du groupe \"nourriture\".\n", - "\n", - "**Tâche de groupe :**\n", - " - Jouez avec les sélections de mots ci-dessus. Voyez si vous pouvez trouver des combinaisons de mots dont les valeurs d'attention semblent contre-intuitives. Pensez à des explications possibles.
Quel sens d'un mot les scores d'attention ont-ils capturé ?\n", - " - Demandez à votre ami s'il a trouvé des exemples.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "S3iB8hf0hJdX" - }, - "source": [ - "**Remarque** : Le produit scalaire n'est qu'une des façons de mettre en œuvre la fonction de score pour les mécanismes d'attention ; une liste plus étendue figure dans cet [article de blog](https://lilianweng.github.io/posts/2018-06-24-attention/#summary) de Lilian Weng.\n", - "\n", - "Plus de ressources :\n", - "\n", - "[Un modèle de base encodeur-décodeur pour la traduction automatique](https://www.youtube.com/watch?v=gHk2IWivt_8&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=1)\n", - "\n", - "[Entraînement et perte pour les modèles encodeur-décodeur](https://www.youtube.com/watch?v=aBZUTuT1Izs&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=2)\n", - "\n", - "[Attention de base](https://www.youtube.com/watch?v=BSSoEtv5jvQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=6)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aQfqM1EJyDXI" - }, - "source": [ - "### Mécanismes d'attention séquence à séquence - Intermédiaire\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "68QBeG-4yDZ9" - }, - "source": [ - "Les premiers mécanismes d'attention ont été utilisés dans les modèles séquence à séquence. Ces modèles étaient généralement constitués d'un encodeur et d'un décodeur RNN. La séquence d'entrée était traitée séquentiellement par un premier RNN, qui l'encodait dans un seul vecteur de contexte ; ce vecteur était ensuite transmis à un second RNN qui générait une nouvelle séquence. Voici un exemple de cela ([source](https://lilianweng.github.io/posts/2018-06-24-attention/)).\n", - "\n", - "[Image : modèle encodeur-décodeur RNN avec un unique vecteur de contexte]\n", - "\n", - "Étant donné qu'il n'y a qu'un seul vecteur de contexte, il est difficile pour l'encodeur de représenter de longues séquences et une partie de l'information est généralement perdue.
Le mécanisme d'attention introduit dans [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) a été proposé pour résoudre ce problème.\n", - "\n", - "Ici, au lieu de se fier à un seul vecteur de contexte statique, qui n'est de plus utilisé qu'une seule fois dans le processus de décodage, nous fournissons des informations sur toute la séquence d'entrée à chaque étape de décodage en utilisant un vecteur de contexte dynamique. Ce faisant, le décodeur peut accéder à une plus grande \"banque\" de mémoire et prêter attention aux informations nécessaires de l'entrée en fonction de l'état courant du RNN décodeur (noté $s_{t-1}$ dans les étapes ci-dessous). Cela est illustré ci-dessous.\n", - "\n", - "[Image : décodeur RNN avec attention et vecteur de contexte dynamique]\n", - "\n", - "En apprentissage profond, l'attention peut être interprétée comme un vecteur \"d'importance\". Pour prédire ou inférer un élément, tel qu'un pixel dans une image ou un mot dans une phrase, nous estimons à quel point il est corrélé avec d'autres éléments, c'est-à-dire à quel point il leur \"prête attention\", en utilisant le vecteur de poids d'attention. Ces poids d'attention sont ensuite utilisés pour calculer une somme pondérée des autres éléments, qui représente la cible [(source)](https://lilianweng.github.io/posts/2018-06-24-attention/).\n", - "\n", - "Cela consiste généralement en trois étapes pour chaque étape de décodage $t$ :\n", - "\n", - "1. Calculer le score (importance) pour chaque $h_n$, étant donné $s_{t-1}$, et utiliser la fonction softmax pour transformer cela en un vecteur d'attention, $w_{n}$.\n", - " - $\\text{score} = a(s_{t-1}, h_{n})$, où $a$ peut être n'importe quelle fonction différentiable, telle que le produit scalaire.\n", - " - $w_{n} = \\frac{\\exp \\left\\{a\\left(s_{t-1}, h_{n}\\right)\\right\\}}{\\sum_{j=1}^{N} \\exp \\left\\{a\\left(s_{t-1}, h_{j}\\right)\\right\\}}$, où nous utilisons la fonction softmax pour transformer les scores bruts en poids d'attention relatifs.\n", - "2.
Build the context vector $c_t$ as the sum of the encoder hidden states weighted by the attention weights.\n", - " - $c_t=\\sum_{n=1}^{N} w_n h_{n}$\n", - "3. Compute the new decoder state $s_t$ by combining the previous decoder state $s_{t-1}$ with the context vector $c_t$ through a function $f$.\n", - "\n", - " - $s_t = f\\left ( c_t, s_{t-1} \\right)$\n", - "\n", - " In Bahdanau et al., 2015, $f$ is the decoder's recurrent unit, which also receives the previously generated token, and the score $a(s_{t-1}, h_{n})$ is computed by a small feedforward network trained jointly with the rest of the model (additive attention). The dot product is a simpler alternative score function.\n", - " \n", - "Next, let's build this attention scheme as it is used in the transformer architecture. We have already computed a simple dot-product attention, where the score was given by $a(s_{t-1}, h_n)=s_{t-1} h_n^\\top$, and we will reuse the same idea.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "J-MU6rrny8Nj" - }, - "source": [ - "### From self-attention to multi-head attention - Intermediate\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BRuLtxNey_EQ" - }, - "source": [ - "Self-attention and multi-head attention (MHA) are fundamental building blocks of the transformer architecture. In this section we explain the intuition behind these concepts and how they are implemented. Later, in the **Transformers** section, you will learn how these attention mechanisms are used to build a sequence-to-sequence model that relies entirely on attention.\n", - "\n", - "As we go along, we will represent sentences by splitting them into individual words and encoding each word with the word2vec model discussed earlier. 
In the Transformers section, we will look in more detail at how input sequences are turned into a series of vectors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Oe1lR5_oynOR" - }, - "outputs": [], - "source": [ - "def embed_sentence(sentence):\n", - " \"\"\"\n", - " Encode a sentence with word2vec; for example use cases only.\n", - " \"\"\"\n", - " # clean the sentence (not needed with a proper LLM tokenizer)\n", - " sentence = remove_punctuation(sentence)\n", - "\n", - " # extract the individual words\n", - " mots = sentence.split()\n", - "\n", - " # get the word2vec embedding of each word; the helper also returns the words it embedded\n", - " séquence_vecteur_mots, mots = get_word2vec_embedding(mots)\n", - "\n", - " # return with an extra leading dimension (useful for batching later)\n", - " return jnp.expand_dims(séquence_vecteur_mots, axis=0), mots\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0AFUEFZGzCTv" - }, - "source": [ - "#### Self-attention\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LF2V3KI-za9l" - }, - "source": [ - "Self-attention is an attention mechanism in which every vector of an input sequence attends to the whole sequence. To see why self-attention matters, consider the following sentence (example taken from [source](https://jalammar.github.io/illustrated-transformer/)):\n", - "\n", - "`\"The animal didn't cross the street because it was too tired.\"`\n", - "\n", - "A simple question about this sentence is what the word \"it\" refers to. Although this seems easy to a human reader, it can be hard for an algorithm to learn. 
This is where self-attention comes in: it can learn attention weights for the word \"it\" that give a large weight to the word \"animal\".\n", - "\n", - "Self-attention also lets the model disambiguate words that share a single embedding, such as \"apple\", which can refer to a company or a fruit depending on the context. This plays a role similar to the hidden state of an RNN, but, as you will see, self-attention lets the model attend to the whole sequence in parallel, which makes longer sequences practical.\n", - "\n", - "Self-attention is built from three concepts:\n", - "\n", - "- Queries, keys and values\n", - "- Scaled dot-product attention\n", - "- Masks\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pwOIMtdZzdTf" - }, - "source": [ - "##### **Queries, keys and values**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mEf7QWIWzdo1" - }, - "source": [ - "In general, every attention mechanism can be written in terms of `key-value` pairs and `queries`, which are used to compute the attention matrix and the new context vector.\n", - "\n", - "As an intuition, a `query` vector describes the information we are looking for, and each `key` vector advertises the information held at its position. The `query` vectors are compared with the `key` vectors to obtain attention scores, where a higher score indicates that a `key` holds relevant information. These attention scores are then used to decide how much of each `value` (every value is associated with a `key`) we take into account. 
Or, as [Lena Voita](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html) puts it:\n", - "\n", - "- Query: asking for information\n", - "- Key: saying that it has some information\n", - "- Value: giving the information\n", - "\n", - "In transformer architectures, we use learnable weight matrices, written $W_Q, W_K, W_V$, to project each sequence vector to its own $q$, $k$ and $v$ vectors.\n", - "\n", - "[Figure: projecting each sequence vector to q, k and v vectors]\n", - "\n", - "You will notice that the $q, k, v$ vectors are smaller than the input vectors. We come back to this later; for now, note that this is a design choice for transformers and not a requirement for attention to work.\n", - "\n", - "This process can also be parallelised: the input sequence can be represented as a matrix $X$ with one sequence vector per row, which is transformed into the query, key and value matrices $Q$, $K$ and $V$ respectively:\n", - "\n", - "$Q=XW_Q \\ K=XW_K \\ V=XW_V$\n", - "\n", - "The code below creates three linear layers that project the input data to the $Q, K, V$ matrices; the output size can be adjusted.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Xc8zjK6eziIV" - }, - "outputs": [], - "source": [ - "class SequenceToQKV(nn.Module):\n", - " taille_sortie: int\n", - "\n", - " @nn.compact\n", - " def __call__(self, X):\n", - "\n", - " # define the weight initialisation method\n", - " initialisateur = nn.initializers.variance_scaling(scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\")\n", - "\n", - " # initialise three linear layers for the Q, K and V projections.\n", - " # note: this could also be a single layer; how do you think you would do that?\n", - " couche_q = nn.Dense(self.taille_sortie, kernel_init=initialisateur)\n", - " couche_k = 
nn.Dense(self.taille_sortie, kernel_init=initialisateur)\n", - " couche_v = nn.Dense(self.taille_sortie, kernel_init=initialisateur)\n", - "\n", - " # project the input and return the three matrices\n", - " Q = couche_q(X)\n", - " K = couche_k(X)\n", - " V = couche_v(X)\n", - "\n", - " return Q, K, V\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OhGZHFsHz_Qp" - }, - "source": [ - "##### **Scaled dot-product attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DxycHDUW0BVE" - }, - "source": [ - "Now that we have our `query`, `key` and `value` matrices, it is time to compute the attention matrix. Remember that every attention mechanism first computes a score for each vector of the sequence and then uses these scores to build a new context vector. In self-attention, the scoring is done with scaled dot-product attention, and the normalised scores are then used as weights to sum the value vectors into the context vector.\n", - "\n", - "$\\operatorname{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right) V$\n", - "\n", - "where the attention weights are given by $\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right)$ and are then multiplied by $V$ to obtain the context vectors.\n", - "\n", - "This is similar to the dot-product attention of the previous section, except that we apply the mechanism to the sequence itself. For each element $i$ of the sequence, we compute the attention weights between its query $q_i$ and all the keys in $K$. We then scale each value vector in $V$ by its weight and sum the weighted vectors to form a new representation of element $i$. 
In doing so, we essentially suppress the irrelevant vectors and emphasise the important vectors of the sequence when our attention is on $q_i$.\n", - "\n", - "$QK^\\top$ is scaled by $\\sqrt{d_k}$, the square root of the dimension of the key vectors, to keep gradients stable during training: without scaling, the dot products grow with $d_k$ and push the softmax into regions where its gradients are very small.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "i_UYNzrS0Hga" - }, - "outputs": [], - "source": [ - "def scaled_dot_product_attention(query, key, value):\n", - " \"\"\"\n", - " Compute scaled dot-product attention from the Q, K and V matrices\n", - " \"\"\"\n", - " d_k = key.shape[-1]\n", - "\n", - " # raw scores (logits): dot product of every query with every key\n", - " logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "\n", - " # scale the raw scores and apply softmax to get the attention weights\n", - " logits_mis_à_l_échelle = logits / jnp.sqrt(d_k)\n", - " poids_attention = jax.nn.softmax(logits_mis_à_l_échelle, axis=-1)\n", - "\n", - " # multiply the weights by the value matrix to get the output\n", - " sortie = jnp.matmul(poids_attention, value)\n", - "\n", - " return sortie, poids_attention\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cuNaEjIm0PhV" - }, - "source": [ - "Let's now see scaled dot-product attention in action. We will take a sentence, encode each word with word2vec, and look at the resulting self-attention weights.\n", - "\n", - "We will not use the linear projection layers here, since they would first need to be trained. 
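A brief aside on the $\sqrt{d_k}$ scaling used in scaled dot-product attention. The sketch below is plain Python rather than the notebook's JAX, and the helper name `dot_product_variances` is made up for illustration. It assumes that query and key components are independent with zero mean and unit variance, which is an idealisation. Under that assumption the raw dot product has a variance close to $d_k$, while the scaled one stays close to 1.

```python
import math
import random


def dot(u, v):
    """Dot product of two equally long lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def dot_product_variances(d_k, n_samples=2000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) over random q, k.

    Hypothetical helper: q and k are drawn with independent standard
    normal components, which is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(dot(q, k))
    scaled = [x / math.sqrt(d_k) for x in raw]
    return variance(raw), variance(scaled)


for d_k in (4, 64, 256):
    raw_var, scaled_var = dot_product_variances(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:8.2f}  var(q.k/sqrt(d_k))={scaled_var:5.2f}")
```

The printed raw variance should grow roughly in proportion to `d_k`, while the scaled variance stays near 1 for every size. Keeping the logits in that range is what stops the softmax from saturating as the key dimension grows.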
Instead of trained projections, we keep things simple and use $X=Q=K=V$.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "3Oy2sWzR0Ok5", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 928 - }, - "outputId": "7ae3094a-cf46-4926-a9c5-ca690b9ec397" - }, - "outputs": [], - "source": [ - "# define a sentence\n", - "phrase = \"Je bois du coca, mais je mange du steak\"\n", - "\n", - "# encode the sentence and create the Q, K and V matrices\n", - "embeddings_mots, mots = embed_sentence(phrase)\n", - "Q = K = V = embeddings_mots\n", - "\n", - "# compute the attention weights\n", - "sorties, poids_attention = scaled_dot_product_attention(Q, K, V)\n", - "\n", - "# plot the attention weights between the words.\n", - "# use the words returned by embed_sentence as tick labels: re-splitting the sentence\n", - "# can give more labels than rows in the attention matrix, which makes matplotlib raise a ValueError\n", - "plot_attention_weight_matrix(poids_attention[0], mots, mots)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NG1Kxljr0Vzw" - }, - "source": [ - "Keep in mind that we have not trained any attention parameters yet. Even so, with word2vec vectors as our sequence, the plot should already show that scaled dot-product attention tends to attend to \"mange\" (eat) when \"steak\" is the query, and that the query \"bois\" (drink) focuses more on \"coca\" and \"mange\".\n", - "\n", - "More resources:\n", - "\n", - "[Attention with Q,K,V](https://www.youtube.com/watch?v=k-5QMalS8bQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=7)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7B-AgO80gIt" - }, - "source": [ - "##### **Masked attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tdRoKsu70gGW" - }, - "source": [ - "There are cases where applying self-attention over the whole sequence is not appropriate. 
These include:\n", - "\n", - "- Sequences of unequal length batched together.\n", - " - When a batch of sequences is sent through a network, self-attention expects every sequence to have the same length. We handle this by padding the shorter sequences. Ideally, these padding tokens should not be taken into account when computing attention.\n", - "- Training a decoder model.\n", - " - When training decoder models such as GPT-3, the decoder has access to the whole target sequence (because training is done in parallel). To stop the model from cheating by looking at future tokens, we mask the future positions so that earlier positions cannot attend to them.\n", - "\n", - "By applying a mask to the scores computed between queries and keys, we can remove the influence of unwanted positions. A position is masked by setting the score between the query and that position's key to a VERY large negative value. 
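To make the masking step concrete, here is a minimal sketch in plain Python (not the notebook's JAX code). The names `masked_softmax`, `keep` and `fill` are hypothetical, and the convention assumed here is that `True` marks a position that may be attended to. Masked scores are replaced by a very large negative number before the softmax is applied.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, keep, fill=-1e9):
    """Softmax over `scores`, ignoring positions where `keep` is False.

    Hypothetical helper: masked positions have their score replaced by
    `fill`, a very large negative value, before the softmax.
    """
    masked = [s if k else fill for s, k in zip(scores, keep)]
    return softmax(masked)


# raw scores of one query against four keys; the last two keys are masked,
# even though the last one has the highest raw score
scores = [2.0, 1.0, 0.5, 3.0]
keep = [True, True, False, False]

weights = masked_softmax(scores, keep)
print(weights)
```

The two masked positions receive a weight of zero and the remaining weights still sum to 1, so the masked value vectors drop out of the weighted sum.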
The softmax then pushes the attention weight of a masked position very close to zero, so its value vector is effectively ignored and does not influence the final representation.\n", - "\n", - "Putting everything together, masked scaled dot-product attention looks like this:\n", - "\n", - "[Figure: masked scaled dot-product attention]\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5Syx8_5E0eM9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 438 - }, - "outputId": "a8ead7c1-0c9f-4219-a878-c1e3aae1ab47" - }, - "outputs": [], - "source": [ - "# example: build a mask for a sequence of 32 tokens\n", - "# the mask makes sure each position only attends to itself and to earlier positions in the input (causal mask)\n", - "# we will use it later to put large negative values into the raw scores of the masked positions\n", - "masque = jnp.tril(jnp.ones((32, 32)))\n", - "\n", - "# plot\n", - "sns.heatmap(masque, cmap=\"Blues\")\n", - "plt.title(\"Example of a mask that can be applied\");\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pfwTJrQ20gDw" - }, - "source": [ - "Let's now adapt our scaled dot-product attention function to implement masked attention."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PVHpyNs_0ePh" - }, - "outputs": [], - "source": [ - "def masked_scaled_dot_product_attention(query, key, value, mask=None):\n", - " \"\"\"\n", - " Compute scaled dot-product attention with an optional mask applied\n", - " \"\"\"\n", - " d_k = key.shape[-1]\n", - "\n", - " # raw scores (logits): dot product of every query with every key\n", - " logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "\n", - " # apply the mask to the logits (same convention as the causal mask above: nonzero = may attend, 0 = masked)\n", - " if mask is not None:\n", - " logits = jnp.where(mask != 0, logits, -1e9) # -1e9 is a very large negative value that cancels the influence of masked positions\n", - "\n", - " # scale the raw scores and apply softmax to get the attention weights\n", - " scaled_logits = logits / jnp.sqrt(d_k)\n", - " attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", - "\n", - " # multiply the weights by the value matrix to get the output\n", - " output = jnp.matmul(attention_weights, value)\n", - "\n", - " return output, attention_weights\n" - ] - }, - { - "cell_type": "markdown", - "source": [ - "##### **Multi-head attention**\n" - ], - "metadata": { - "id": "OWDubQwCs4zG" - } - }, - { - "cell_type": "markdown", - "source": [ - "The attention mechanism covered so far lets the model focus on different positions of the input. In practice, the transformer architecture uses a subtle variation of this mechanism called multi-head attention (MHA).\n", - "\n", - "The difference is small: rather than computing attention once, MHA runs scaled dot-product attention several times in parallel. 
According to the paper *Attention is All You Need*, \"multi-head attention allows the model to **jointly attend** to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.\"\n", - "\n", - "Multi-head attention can be seen as a strategy similar to stacking convolution kernels in a CNN layer. Stacking lets the kernels focus on and learn different features and rules, and the same reasoning explains why several attention heads work well.\n", - "\n", - "The figure below shows how basic multi-head attention works. The scaled dot-product attention discussed earlier is simply repeated $N$ times ($N=2$ in this figure), with 3 learnable matrices (for queries, keys and values) per head, so $3N$ in total. The outputs of the different heads are then concatenated and passed through a linear projection, which produces the final representation.\n", - "\n", - "In practice, multi-head attention generally performs better than single-head attention.\n", - "\n", - "\"drawing\"\n" - ], - "metadata": { - "id": "nHkyjyErsYae" - } - }, - { - "cell_type": "markdown", - "source": [ - "Let's see how to implement multi-head attention. Put simply, multi-head attention runs the attention process several times in parallel, using a different copy of the Q, K and V matrices for each \"head\". This helps the model focus on different parts of the input at the same time.
If you would like to know more, see [this blog post by Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention) for a detailed explanation.\n" - ], - "metadata": { - "id": "vtuqNCln9EWW" - } - }, - { - "cell_type": "code", - "source": [ - "class AttentionMultiTêtes(nn.Module):\n", - "    nombre_têtes: int  # Number of attention heads\n", - "    d_m: int  # Dimension of the model embeddings\n", - "\n", - "    def setup(self):\n", - "        # Initialise the sequence-to-QKV transformation module\n", - "        self.sequence_to_qkv = SequenceToQKV(self.d_m)\n", - "\n", - "        # Define the initialiser for the weights of the output linear layer\n", - "        initialiseur = nn.initializers.variance_scaling(\n", - "            scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\"\n", - "        )\n", - "\n", - "        # Initialise the output projection layer Wo (used after attention)\n", - "        self.Wo = nn.Dense(self.d_m, kernel_init=initialiseur)\n", - "\n", - "    def __call__(self, X=None, Q=None, K=None, V=None, masque=None, retourner_poids=False):\n", - "        # If Q, K or V is not provided, use the input X to generate them\n", - "        # (explicit `is None` checks avoid comparing arrays with None)\n", - "        if Q is None or K is None or V is None:\n", - "            assert X is not None, \"X must be provided if Q, K or V is not provided\"\n", - "\n", - "            # Generate the Q, K and V matrices from the input X\n", - "            Q, K, V = self.sequence_to_qkv(X)\n", - "\n", - "        # Extract the batch size (B), the sequence length (T) and the embedding size (d_m)\n", - "        B, T, d_m = K.shape\n", - "\n", - "        # Compute the embedding size of each attention head (d_m / nombre_têtes)\n", - "        taille_tête = d_m // self.nombre_têtes\n", - "\n", - "        # Reshape Q, K, V so that the heads get their own dimension\n", - "        # B, T, d_m -> B, T, nombre_têtes, taille_tête -> B, nombre_têtes, T, taille_tête\n", - "        q_têtes = Q.reshape(B, T, self.nombre_têtes, taille_tête).swapaxes(1, 2)\n", - "        k_têtes = K.reshape(B, T, self.nombre_têtes,
taille_tête).swapaxes(1, 2)\n", - "        v_têtes = V.reshape(B, T, self.nombre_têtes, taille_tête).swapaxes(1, 2)\n", - "\n", - "        # Apply scaled dot-product attention to each head\n", - "        attention, poids_attention = scaled_dot_product_attention(\n", - "            q_têtes, k_têtes, v_têtes, masque\n", - "        )\n", - "\n", - "        # Reshape the attention output back to its original dimensions\n", - "        # (B, nombre_têtes, T, taille_tête) -> (B, T, nombre_têtes, taille_tête) -> (B, T, d_m)\n", - "        attention = attention.swapaxes(1, 2).reshape(B, T, d_m)\n", - "\n", - "        # Apply the output linear transformation Wo to the attention output\n", - "        X_nouveau = self.Wo(attention)\n", - "\n", - "        # If retourner_poids is True, return both the transformed output and the attention weights\n", - "        if retourner_poids:\n", - "            return X_nouveau, poids_attention\n", - "        else:\n", - "            # Otherwise, return only the transformed output\n", - "            return X_nouveau\n" - ], - "metadata": { - "id": "BY2xXLMQ9CB6" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e9NW58_3hAg2" - }, - "source": [ - "## **2.
Build your own LLM**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bA_2coZvhAg3" - }, - "source": [ - "### 2.1 High-level overview Beginner\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BflycqAw_RF8" - }, - "source": [ - "The Transformer architecture was first introduced in the paper [Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al.\n", - "\n", - "As the title of the paper suggests, this architecture consists almost entirely of attention mechanisms, together with feed-forward and linear layers, as shown in the diagram below.\n", - "\n", - "\n", - "\n", - "Transformers and their variants are at the heart of Large Language Models, and it is no exaggeration to say that almost all existing language models are based on Transformer architectures.\n", - "\n", - "As you can see in the diagram, the original Transformer architecture has two parts: one that receives the inputs, usually called the encoder, and one that receives the outputs (that is, the targets), called the decoder. This is because the transformer was designed for machine translation.\n", - "\n", - "The encoder receives an input sentence in one language and processes it through several stacked `encoder blocks`. This creates a final representation that contains the information needed for the decoding task. This output is then fed into stacked `decoder blocks`, which produce new outputs autoregressively.\n", - "\n", - "The encoder consists of $N$ identical blocks, which process a sequence of token vectors one block after another. These blocks have 3 parts:\n", - "\n", - "1. A multi-head attention block.
These blocks are the backbone of the transformer architecture. They process the data to generate a representation for each token, making sure that the information needed for the task at hand is represented in the vectors. They are exactly the MHA blocks we covered earlier in the attention section.\n", - "2. An MLP (multilayer perceptron, that is, a neural network with several layers), applied to each input token separately and identically.\n", - "3. A residual connection that adds the input tokens to the attended representations, and a residual connection between the input of the MLP and its outputs. For both connections, the result is normalised with layernorm. In some implementations, these normalisation steps are applied to the inputs rather than to the outputs. Just like a ResNet, transformers are designed to be very deep models, so these add and norm blocks are essential for smooth gradient flow.\n", - "\n", - "Similarly, the decoder consists of $N$ identical blocks, with a few variations inside each block. Concretely, the parts are:\n", - "\n", - "1. A masked multi-head attention block. This is an MHA block that performs _self-attention_ on the output sequence, but the computation is restricted to the tokens already seen. In other words, future tokens are blocked when making predictions.\n", - "2. A multi-head attention block. This block receives the output of the last encoder block (the transformed tokens) and uses it as key-value pairs, while using the output of the first MHA block as the query. In doing so, the model attends to the parts of the input it needs to perform the sequence task. This MHA block therefore performs _cross-attention_ over the encoder outputs.\n", - "3. An MLP identical to the one in the encoder.\n", - "4.
A residual connection identical to the one in the encoder.\n", - "\n", - "Starting from this original architecture, several variations have been proposed, some focusing only on the encoder and others only on the **decoder**. Large language models (LLMs) such as GPT-2, GPT-3 and Turing-NLG come from decoder-only architectures. These architectures look like this:\n", - "\n", - "\"drawing\"\n", - "\n", - "with the cross-attention block missing, because no encoder output is available. To build a language model, we will therefore focus on the decoder-only architecture illustrated above.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fbTsk0MdhAhC" - }, - "source": [ - "### 2.2 Tokenisation + Positional encoding Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DehUpfym_RF8" - }, - "source": [ - "#### 2.2.1 Tokenisation\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uBiFpVBu_RF9" - }, - "source": [ - "Transformers cannot process raw text strings. To process text, it is first split into tokens. The tokens are then indexed, and each token is assigned an embedding of size $d_{model}$. These embeddings can be learned during training or come from a vocabulary of pretrained embeddings. This new sequence of token embeddings is then passed to the transformer architecture. The idea is visualised below.\n", - "\n", - "\\\\\n", - "\n", - "\"drawing\"\n", - "\n", - "These token IDs are typically what a model predicts when it generates text, fills in missing words, and so on.\n", - "\n", - "This process of splitting text into tokens and assigning an ID to each token is called [tokenisation](https://huggingface.co/docs/transformers/tokenizer_summary).
There are several ways to tokenise text, and some methods are trained directly on the data. When using pretrained transformers, it is crucial to use the same tokeniser that was used to train the model. The link above gives in-depth descriptions of many widely used techniques.\n", - "\n", - "Below, we show how the tokeniser of the [BERT](https://arxiv.org/abs/1810.04805) model tokenises a sentence. We use [Hugging Face](https://huggingface.co/) for this part.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hJBMvlUA_RF9", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "706323b0-4d1e-4df3-ee10-99e95dc408cb" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Token IDs: [101, 2001, 185, 7625, 5484, 12890, 1587, 14529, 1821, 11700, 11656, 102]\n" - ] - } - ], - "source": [ - "import transformers\n", - "from transformers import pipeline, AutoTokenizer, AutoModel\n", - "\n", - "bert_tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n", - "entrée_encodée = bert_tokenizer(\"La pratique est tellement amusante\")\n", - "print(f\"Token IDs: {entrée_encodée['input_ids']}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GYbtZTVP_RF9" - }, - "source": [ - "Here we can see that the tokeniser returns the ID of each token, as illustrated in the figure. But if we count the IDs, we find that there are more of them than there are words in the sentence.
Let's print the tokens associated with each ID.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yPZjiLis_RF9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 144 - }, - "outputId": "f1466d7d-98e8-42ff-d4a7-bbba96540ada" - }, - "outputs": [], - "source": [ - "print(f\"Tokens: {bert_tokenizer.decode(entrée_encodée['input_ids'])}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "k3K8UFlR_RF9" - }, - "source": [ - "We can see that the tokeniser adds new tokens, `[CLS]` and `[SEP]`, at the start and at the end of the sequence. This is a BERT-specific requirement for training and inference. Adding special tokens is a very common practice. With these special tokens, we can tell a model when a sentence starts or ends, or when a new part of the input begins. This can be useful for different tasks.\n", - "\n", - "For example, some transformers are pretrained with what is called masked prediction.
To do this, random tokens in a sequence are replaced with the `[MASK]` token, and the model is trained to predict the correct ID of each token that was replaced.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "djMP4Ijz_RF9" - }, - "source": [ - "**Drawback of using raw tokens**:\n", - "\n", - "One drawback of using raw tokens is that they give no indication of the position of a word in the sequence. This is clear when we consider sentences such as \"I am happy\" and \"Am I happy\": the two have distinct meanings, and the model has to capture the word order to understand the intended message accurately.\n", - "\n", - "To address this, when the inputs are converted into vectors, position vectors are introduced and added to these vectors to indicate the **position** of each word.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "639s7Zuk_RF9" - }, - "source": [ - "#### 2.2.2 Positional encodings\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "s-hBFVYo_RF9" - }, - "source": [ - "In most domains where a transformer can be used, the tokens have an underlying order, whether it is the order of the words in a sentence, the location from which patches are taken in an image, or even the steps taken in an RL environment. This order matters in every case; just imagine interpreting the sentence \"I have to read this book.\" as \"I have this book to read.\". Both sentences contain exactly the same words, but their meanings differ depending on the order.\n", - "\n", - "Since the encoder and decoder blocks process all tokens in parallel, the order of the tokens is lost in these computations.
To address this, the order of the sequence has to be injected directly into the tokens. This can be done by adding *positional encodings* to the tokens at the start of the encoder and decoder blocks (although some of the more recent techniques add positional information inside the attention blocks). An example of how positional encodings modify the tokens is shown below.\n", - "\n", - "\\\\\n", - "\n", - "\"drawing\"\n", - "\n", - "Ideally, these encodings should have the following properties ([source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)):\n", - "* Each time step should have a unique value.\n", - "* The distance between time steps should stay constant.\n", - "* The encoding should generalise to sequences longer than those seen during training.\n", - "* The encoding must be deterministic.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rklY-aL-_RF9" - }, - "source": [ - "##### **Sine and cosine functions**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GLcfkMku_RF9" - }, - "source": [ - "In *Attention is All you Need*, the authors used a method that can satisfy all of these requirements.
It combines sine and cosine waves of different frequencies, one per embedding dimension. The formula for the positional encoding at position $D$ is shown below, where $i$ is the index within the embedding and $d_m$ is the size of the token embedding.\n", - "\n", - "\\\\\n", - "\n", - "$P_{D}= \\begin{cases}\\sin \\left(\\frac{D}{10000^{i/d_{m}}}\\right), & \\text { if } i \\bmod 2=0 \\\\ \\cos \\left(\\frac{D}{10000^{(i-1)/d_{m}}}\\right), & \\text { otherwise } \\end{cases}$\n", - "\n", - "\\\n", - "\n", - "Assuming our model has $d_m=8$, the positional embedding looks like this:\n", - "\n", - "\\\n", - "$P_{D}=\\left[\\begin{array}{c}\\sin \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{6/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{6/8}}\\right)\\end{array}\\right]$\n", - "\n", - "\\\\\n", - "\n", - "Let's start by writing a function that returns these encodings, to understand why this works.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zT5t5D30_RF9" - }, - "outputs": [], - "source": [ - "def return_frequency_pe_matrix(longueur_sequence_tokens, taille_embedding_tokens):\n", - "\n", - "    assert taille_embedding_tokens % 2 == 0, \"the token embedding size must be divisible by two\"\n", - "\n", - "    P = jnp.zeros((longueur_sequence_tokens, taille_embedding_tokens))\n", - "    positions = jnp.arange(0, longueur_sequence_tokens)[:, jnp.newaxis]\n", - "\n", - "    # even indices i = 0, 2, 4, ...; each (sin, cos) pair shares the frequency 1 / 10000^(i / d_m)\n", - "    i = jnp.arange(0, taille_embedding_tokens, 2)\n", - "    pas_frequence = jnp.exp(i * (-math.log(10000.0) / taille_embedding_tokens))\n", - "    frequences = positions * pas_frequence\n", - "\n", - "    # sine on even dimensions, cosine on odd dimensions\n", - "    P = P.at[:, 0::2].set(jnp.sin(frequences))\n", - "    P =
P.at[:, 1::2].set(jnp.cos(frequences))\n", - "\n", - "    return P\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "CYW-VDOL_RF-", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 677 - }, - "outputId": "a5f2600d-6eea-498b-c58d-24782d456eee" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-24T17:23:13.343593\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - } - ], - "source": [ - "longueur_sequence_tokens = 50 # Nombre de tokens que le modèle devra traiter\n", - "dimension_embedding_tokens = 10000 # Dimensions des embeddings des tokens (et de l'encodage positionnel), s'assurer qu'il est divisible par deux\n", - "P = return_frequency_pe_matrix(longueur_sequence_tokens, dimension_embedding_tokens)\n", - "plot_position_encodings(P, longueur_sequence_tokens, dimension_embedding_tokens)\n" - ] - }, - { - "cell_type": 
"markdown", - "metadata": { - "id": "1mjHEDPO_RF-" - }, - "source": [ - "En regardant le graphique ci-dessus, nous pouvons voir que pour chaque indice de position, il y a un motif unique qui se forme, où chaque indice de position aura toujours le même encodage.\n", - "\n", - "**Tâche de groupe** :\n", - "\n", - "- Discutez avec votre ami pourquoi nous voyons ce motif spécifique lorsque `longueur_sequence_tokens` est 1000, et `dimension_embedding_tokens` est 768.\n", - "- Vous pouvez essayer de jouer avec des valeurs plus petites pour `longueur_sequence_tokens` et `dimension_embedding_tokens` pour obtenir une meilleure intuition pour la discussion ci-dessus.\n", - "- Demandez à votre ami pourquoi ils pensent que la constante 10000 est utilisée dans les fonctions ci-dessus.\n", - "- Réglez `longueur_sequence_tokens` sur 50 et `dimension_embedding_tokens` sur quelque chose de grand, comme 10000. Que remarquez-vous ? Un grand embedding de token est-il toujours nécessaire ?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SdNPg0pnhAhG" - }, - "source": [ - "### 2.3 Bloc Transformer Intermédiaire\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M4vSolF2_RF-" - }, - "source": [ - "Tout comme un MLP (un réseau de neurones simple qui traite les données d'entrée à travers plusieurs couches) ou un CNN (un type de réseau de neurones qui excelle dans la reconnaissance de motifs dans les images en utilisant des couches de convolution), les transformers sont constitués d'une pile de blocs transformer. 
In this section, we will build each of the components needed to create one of these transformer blocks.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kTURbfr__RF-" - }, - "source": [ - "#### 2.3.1 Feed-forward network (FFN) / Multilayer perceptron (MLP) Beginner\n", - "\n", - "\"drawing\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LTtFi9AZ_RF-" - }, - "source": [ - "These blocks are simply a 2-layer MLP (multilayer perceptron), which uses the ReLU activation function in the original model. The GeLU activation has also become very popular, and we will use it throughout this practical. The formula below is the feed-forward network (FFN) of the original model: the input `x` is transformed by two linear layers with weights `W1` and `W2` and bias terms `b1` and `b2`, and the `max` function is the ReLU activation. With GeLU, `max(0, ·)` is simply replaced by the GeLU function.\n", - "\n", - "$$\n", - "\\operatorname{FFN}(x)=\\max \\left(0, x W_{1}+b_{1}\\right) W_{2}+b_{2}\n", - "$$\n", - "\n", - "This block can be interpreted as processing what the MHA block produced, then projecting these new token representations into a space that the next block can use more effectively. The first layer is usually very wide, in the range of 2 to 8 times the size of the token representations. The reason is that it is easier to parallelise the computation of a single wider layer during training than to parallelise a feed-forward block with several layers.
This way, more capacity can be added while keeping training and inference efficient.\n", - "\n", - "**Code task:** Write a Flax module that implements the feed-forward block.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zsho1CnW_RF-", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 108 - }, - "outputId": "cd15974e-e9df-4de2-85a0-504d18ae47a5" - }, - "outputs": [], - "source": [ - "class FeedForwardBlock(nn.Module):\n", - "    \"\"\"\n", - "    A 2-layer MLP that widens and then narrows the input.\n", - "\n", - "    Args:\n", - "        widening_factor [optional, default=4]: The hidden layer size will be d_model * widening_factor.\n", - "    \"\"\"\n", - "\n", - "    widening_factor: int = 4\n", - "    init_scale: float = 0.25\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, x):\n", - "        '''\n", - "        Args:\n", - "            x: [B, T, d_m]\n", - "\n", - "        Return:\n", - "            x: [B, T, d_m]\n", - "        '''\n", - "        d_m = x.shape[-1]\n", - "        layer1_size = self.widening_factor * d_m\n", - "\n", - "        initializer = nn.initializers.variance_scaling(\n", - "            scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - "        )\n", - "        layer1 = None  # TODO: define the first dense layer\n", - "        layer2 = None  # TODO: define the second dense layer\n", - "\n", - "        x = jax.nn.gelu(layer1(x))\n", - "        x = layer2(x)\n", - "        return x\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-qj0nfhH_RF-" - }, - "outputs": [], - "source": [ - "# @title Answer to the code task (try not to look until you have given it a good try!)\n", - "\n", - "class FeedForwardBlock(nn.Module):\n", - "    \"\"\"An MLP
(multilayer perceptron) with 2 layers that first widens the input and then narrows it again.\"\"\"\n", - "\n", - "    # widening_factor controls how much the input dimension is widened in the first layer.\n", - "    widening_factor: int = 4\n", - "\n", - "    # init_scale controls the scale factor used for weight initialisation.\n", - "    init_scale: float = 0.25\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, x):\n", - "        # Get the size of the last dimension of the input (the embedding size).\n", - "        d_m = x.shape[-1]\n", - "\n", - "        # Compute the size of the first layer: embedding size times the widening factor.\n", - "        layer1_size = self.widening_factor * d_m\n", - "\n", - "        # Initialise the weights of both layers with a variance-scaling initialiser.\n", - "        initializer = nn.initializers.variance_scaling(\n", - "            scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - "        )\n", - "\n", - "        # Define the first dense layer, which widens the input.\n", - "        layer1 = nn.Dense(layer1_size, kernel_init=initializer)\n", - "\n", - "        # Define the second dense layer, which narrows the size back to the original dimension.\n", - "        layer2 = nn.Dense(d_m, kernel_init=initializer)\n", - "\n", - "        # Apply the first dense layer followed by a GELU activation.\n", - "        x = jax.nn.gelu(layer1(x))\n", - "\n", - "        # Apply the second dense layer to project the data back to its original dimension.\n", - "        x = layer2(x)\n", - "\n", - "        # Return the final output.\n", - "        return x\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sts5Vr4i_RF-" - }, - "source": [ - "#### 2.3.2 Add and Norm block Beginner\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TWUpf8wt_RF-" - }, - "source": [ - "To allow transformers to
become deeper, residual connections are very important, because they let gradients flow better through the network. For normalisation, `layer norm` is used. This normalisation is applied independently to each token vector in the batch. Normalising the vectors has been found to improve the convergence and stability of transformers.\n", - "\n", - "Layer normalisation (`layer norm`) has two learnable parameters, `scale` and `bias`, which rescale the normalised value. For each input token in a batch, we compute the mean $\\mu_{i}$ and the variance $\\sigma_i^2$ over its features. We then normalise the token with:\n", - "\n", - "$$\n", - "\\hat{x}_i = \\frac{x_i - \\mu_{i}}{\\sqrt{\\sigma_i^2 + ϵ}}\n", - "$$\n", - "\n", - "Then $\\hat{x}$ is rescaled using the learned `scale`, $γ$, and `bias`, $β$, with:\n", - "\n", - "$$\n", - "y_i = γ\\hat{x}_i + β = LN_{γ,β}(x_i)\n", - "$$\n", - "\n", - "Our add and norm block can therefore be written as $LN(x+f(x))$, where $f(x)$ is either an MLP block or an MHA block.\n", - "\n", - "**Code task:** Implement a Flax module that performs the add and norm block. It should take both the processed and the unprocessed tokens as input.
Hint: `nn.LayerNorm`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "F5bLb5Ly_RF_" - }, - "outputs": [], - "source": [ - "class AddNorm(nn.Module):\n", - "    \"\"\"A block that implements the add and norm block\"\"\"\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, x, processed_x):\n", - "        '''\n", - "        Args:\n", - "            x: Sequence of tokens before being fed into the MHA or FF blocks, with shape [B, T, d_m]\n", - "            processed_x: Sequence after processing by the MHA or FF blocks, with shape [B, T, d_m]\n", - "\n", - "        Return:\n", - "            add_norm_x: Transformed tokens with shape [B, T, d_m]\n", - "        '''\n", - "\n", - "        added = x + processed_x\n", - "        normalised = nn.LayerNorm()\n", - "        return normalised(added)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HXSi7BXZ_RF_" - }, - "outputs": [], - "source": [ - "class AddNorm(nn.Module):\n", - "    \"\"\"A block that implements the 'Add and Norm' operation used in transformers.\"\"\"\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, x, processed_x):\n", - "        # Step 1: add the original input (x) to the processed input (processed_x).\n", - "        added = x + processed_x\n", - "\n", - "        # Step 2: apply layer normalisation to the result of the addition.\n", - "        # - LayerNorm helps stabilise and improve training by normalising the output.\n", - "        # - reduction_axes=-1 means the normalisation is applied over the last dimension (typically the embedding dimension).\n", - "        # - use_scale=True and use_bias=True let the layer learn scale and bias parameters for finer adjustment.\n", - "        normalised = nn.LayerNorm(reduction_axes=-1, use_scale=True, use_bias=True)\n", - "\n", - "        # Return the normalised result.\n", - "        return normalised(added)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "91dXd29b_RF_" - }, - "source":
[ - "### 2.4 Building the Transformer Decoder / LLM Intermediate\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sl0UAyvM_RF_" - }, - "source": [ - "\"drawing\"\n", - "\n", - "Most of the building blocks are now in place. We have built the positional encoding block, the MHA block, the feed-forward block and the add & norm block.\n", - "\n", - "All that remains is to pass the inputs through each decoder block and to apply the masked MHA block found inside the decoder blocks.\n", - "\n", - "**Code task:** Write a Flax module that implements the decoder block: `X = norm(X + MHA(X))` followed by `X = norm(X + FFN(X))`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wVmSFKZK_RF_", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 108 - }, - "outputId": "5f76c82a-a818-4cd2-dc79-5a75e1ced987" - }, - "outputs": [], - "source": [ - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of heads to use in the MHA block.\n", - " d_m: Size of the token embeddings.\n", - " widening_factor: The hidden layer size will be d_m * widening_factor.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - " self.add_norm1 = AddNorm()\n", - " self.add_norm2 = AddNorm()\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Args:\n", - " X: Batch of tokens entering the decoder, with shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask to apply, with shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: Whether to return the attention weights.\n", - " \"\"\"\n", - "\n", - " attention, attention_weights_1 = # FINISH ME\n", - "\n", - " X = # FINISH ME\n", - "\n", - " projection = # FINISH ME\n", - " X = # FINISH ME\n", - "\n", - " return (X, attention_weights_1) if return_att_weight else X\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "stNZVVv3_RF_" - }, - "outputs": [], - "source": [ - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of attention heads in the Multi-Head Attention (MHA) block.\n", - " d_m: The size of the token embeddings.\n", - " widening_factor: The factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialise the Multi-Head Attention (MHA) block\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - "\n", - " # Initialise the AddNorm blocks for the residual connections and normalisation\n", - " self.add_norm1 = AddNorm() # First AddNorm block, after the MHA\n", - " self.add_norm2 = AddNorm() # Second AddNorm block, after the MLP\n", - "\n", - " # Initialise the FeedForwardBlock (MLP) that processes the data after attention\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Forward pass through the decoder block.\n", - "\n", - " Args:\n", - " X: Batch of input tokens fed into the decoder, shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask controlling which positions attention is allowed to look at, shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: If True, return the attention weights along with the output.\n", - "\n", - " Returns:\n", - " If return_att_weight is True, returns a tuple (X, attention_weights_1).\n", - " Otherwise, returns the processed token representations X.\n", - " \"\"\"\n", - "\n", - " # Apply Multi-Head Attention to the input tokens (X) with optional masking\n", - " attention, attention_weights_1 = self.mha(X, mask=mask, return_weights=True)\n", - "\n", - " # Apply the first AddNorm block (adds the original input X and normalises)\n", - " X = self.add_norm1(X, attention)\n", - "\n", - " # Pass the result through the FeedForwardBlock (MLP) for further processing\n", - " projection = self.MLP(X)\n", - "\n", - " # Apply the second AddNorm block (adds the MLP input from the previous step and normalises)\n", - " X = self.add_norm2(X, projection)\n", - "\n", - " # Return the final output X and, optionally, the attention weights\n", - " return (X, attention_weights_1) if return_att_weight else X\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8SXXVWd7_RF_" - }, - "source": [ - "Next, we put everything together: we add the positional encodings, stack several transformer blocks and add our prediction layer.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4XBG24Qs_RF_" - }, - "outputs": [], - "source": [ - "class LLM(nn.Module):\n", - " \"\"\"\n", - " Transformer model made up of several layers of decoder blocks.\n", - "\n", - " Args:\n", - " num_heads: Number of attention heads in each Multi-Head Attention (MHA) block.\n", - " num_layers: Number of decoder blocks in the model.\n", - " d_m: 
Dimensionality of the token embeddings.\n", - " vocab_size: Size of the vocabulary (number of unique tokens).\n", - " widening_factor: Factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - " num_heads: int\n", - " num_layers: int\n", - " d_m: int\n", - " vocab_size: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialise a list of decoder blocks, one for each layer of the model\n", - " self.blocks = [\n", - " DecoderBlock(self.num_heads, self.d_m, self.widening_factor)\n", - " for _ in range(self.num_layers)\n", - " ]\n", - "\n", - " # Initialise an embedding layer to convert token IDs into token embeddings\n", - " self.embedding = nn.Embed(num_embeddings=self.vocab_size, features=self.d_m)\n", - "\n", - " # Initialise a dense layer to predict the next token in the sequence\n", - " self.pred_layer = nn.Dense(self.vocab_size)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weights=False):\n", - " \"\"\"\n", - " Forward pass through the LLM.\n", - "\n", - " Args:\n", - " X: Batch of input token IDs, shape [B, T_decoder], where B is the batch size and T_decoder is the sequence length.\n", - " mask [optional, default=None]: Mask controlling which positions attention can attend to, shape [T_decoder, T_decoder].\n", - " return_att_weights [optional, default=False]: If True, also return the attention weights.\n", - "\n", - " Returns:\n", - " logits: Log-probabilities over the vocabulary for each position (the output of log_softmax), shape [B, T_decoder, vocab_size].\n", - " If return_att_weights is True, the attention weights are returned as well.\n", - " \"\"\"\n", - "\n", - " # Convert the token IDs into embeddings (shape [B, T_decoder, d_m])\n", - " X = self.embedding(X)\n", - "\n", - " # Get the sequence length of the input\n", - " sequence_len = X.shape[-2]\n", - "\n", - " # Generate positional encodings and add them to the embeddings of 
tokens\n", - " positions = return_frequency_pe_matrix(sequence_len, self.d_m)\n", - " X = X + positions\n", - "\n", - " # Initialise a list to store the attention weights if needed\n", - " if return_att_weights:\n", - " att_weights = []\n", - "\n", - " # Pass the embeddings through each decoder block in sequence\n", - " for block in self.blocks:\n", - " out = block(X, mask, return_att_weights)\n", - " if return_att_weights:\n", - " # When returning attention weights, unpack the output\n", - " X = out[0]\n", - " att_weights.append(out[1])\n", - " else:\n", - " # Otherwise, simply update the input for the next block\n", - " X = out\n", - "\n", - " # Apply a dense layer followed by a log softmax; the result holds log-probabilities over the vocabulary (kept in the variable `logits`)\n", - " logits = nn.log_softmax(self.pred_layer(X))\n", - "\n", - " # Return the logits and, optionally, the attention weights\n", - " return logits if not return_att_weights else (logits, jnp.array(att_weights).swapaxes(0, 1))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sClFLLkU_RF_" - }, - "source": [ - "If everything is implemented correctly, the cell below should run without errors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "82CWEa5m_RGA", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 436 - }, - "outputId": "ddc3523e-a432-4d9e-fc46-54aa4c1296df" - }, - "outputs": [], - "source": [ - "B, T, d_m, N, vocab_size = 18, 32, 16, 8, 25670\n", - "\n", - "llm = LLM(num_heads=1, num_layers=1, d_m=d_m, vocab_size=vocab_size, widening_factor=4)\n", - "mask = jnp.tril(np.ones((T, T)))\n", - "\n", - "# initialise the module and get a dummy output\n", - "key = jax.random.PRNGKey(42)\n", - "X = jax.random.randint(key, [B, T], 0, vocab_size)\n", - "params = llm.init(key, X, mask=mask)\n", - "\n", - "# extract the decoder output\n", - "logits, decoder_att_weights = llm.apply(\n", - " params,\n", - " X,\n", - " mask=mask,\n", - " return_att_weights=True,\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gve7ssD__RGA" - }, - "source": [ - "As a final sanity check, we can confirm that our attention weights behave correctly. In the heatmap plotted by the cell below, the decoder's attention weights should only attend to previous tokens, as expected.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "H4NpywYv_RGA", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 661 - }, - "outputId": "b182f095-2cc4-49b1-d5a1-31e25a01241d" - }, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", - "plt.suptitle(\"LLM attention weights\")\n", - "sns.heatmap(decoder_att_weights[0, 0, 0, ...], ax=ax, cmap=\"Blues\")\n", - "fig.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wmt3tp38G90A" - }, - "source": [ - "### 2.5 Training your LLM\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "agLIpsoh_RGA" - }, - "source": [ - "#### 2.5.1 Training objective Intermediate\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QOSv1-3B_RGA" - }, - "source": [ - "A sentence is nothing more than a chain of words. 
An LLM aims to predict the next word given the current context, that is, the words that came before it.\n", - "\n", - "Here is the basic idea:\n", - "\n", - "To compute the probability of a complete sentence \"word1, word2, ..., last word\" appearing in a given context $c$, we break the sentence down into individual words and consider the probability of each word given the words that precede it. These individual probabilities are then multiplied together:\n", - "\n", - "$$\\text{Probability of the sentence} = \\text{Probability of word1} \\times \\text{Probability of word2} \\times \\ldots \\times \\text{Probability of the last word}$$\n", - "\n", - "This is similar to building a narrative piece by piece, based on the story so far.\n", - "\n", - "Mathematically, this is expressed as the likelihood (probability) of a sequence of words $y_1, y_2, ..., y_n$ in a given context $c$, obtained by multiplying the probabilities of each word $y_t$ given its predecessors ($y_{<t}$):\n", - "\n", - "$$P(y_1, y_2, \\ldots, y_n \\mid c) = \\prod_{t=1}^{n} P(y_t \\mid y_{<t}, c)$$\n", - "\n", - "Advanced\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zIQ_aJGW_RGA" - }, - "source": [ - "In the next section, we define everything needed to train the model with the objective described above. Much of this is simply the work required to run training with Flax.\n", - "\n", - "Below, we put together the dataset we will train on, which is Karpathy's Shakespeare dataset. 
Understanding this code is not essential, so either just run the cell to load the data, or read through the code if you would like to understand it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "guMHAaSo_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "aa710025-7c04-4008-fb7e-ec97291b552e" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "--2024-08-24 17:48:37-- https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 1115394 (1.1M) [text/plain]\n", - "Saving to: ‘input.txt’\n", - "\n", - "input.txt 100%[===================>] 1.06M 5.56MB/s in 0.2s \n", - "\n", - "2024-08-24 17:48:38 (5.56 MB/s) - ‘input.txt’ saved [1115394/1115394]\n", - "\n" - ] - } - ], - "source": [ - "# @title Create the Shakespeare dataset and iterator (optional, but run the cell)\n", - "\n", - "# Workaround to avoid errors when downloading tinyshakespeare.\n", - "import locale\n", - "locale.getpreferredencoding = lambda: \"UTF-8\"\n", - "\n", - "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt\n", - "\n", - "class WordBasedAsciiDatasetForLLM:\n", - " \"\"\"In-memory dataset of a single ASCII file for a language-model-style model.\"\"\"\n", - "\n", - " def __init__(self, path: str, batch_size: int, sequence_length: int):\n", - " \"\"\"Load a single ASCII file into memory.\"\"\"\n", - " self._batch_size = batch_size\n", - "\n", - " with open(path, \"r\") as f:\n", - " corpus = f.read()\n", - "\n", - " # Tokenise by splitting the text into words\n", - " words = corpus.split()\n", - " self.vocab_size = len(set(words)) # Number of unique words\n", - "\n", - " # Create a mapping from words to unique IDs\n", - " self.word_to_id = {word: i for i, word in enumerate(set(words))}\n", - "\n", - " # Store the inverse mapping from IDs to words\n", - " self.id_to_word = {i: word for word, i in self.word_to_id.items()}\n", - "\n", - " # Convert the words of the corpus into their corresponding IDs\n", - " corpus = np.array([self.word_to_id[word] for word in words]).astype(np.int32)\n", - "\n", - " crop_len = sequence_length + 1\n", - " num_batches, ragged = divmod(corpus.size, batch_size * crop_len)\n", - " if ragged:\n", - " corpus = corpus[:-ragged]\n", - " corpus = corpus.reshape([-1, crop_len])\n", - "\n", - " if num_batches < 10:\n", - " raise ValueError(\n", - " f\"Only {num_batches} batches; consider a shorter sequence \"\n", - " \"or a smaller batch.\"\n", - " )\n", - "\n", - " self._ds = WordBasedAsciiDatasetForLLM._infinite_shuffle(\n", - " corpus, batch_size * 10\n", - " )\n", - "\n", - " def __iter__(self):\n", - " return self\n", - "\n", - " def __next__(self):\n", - " \"\"\"Yield the next mini-batch.\"\"\"\n", - " batch = [next(self._ds) for _ in range(self._batch_size)]\n", - " batch = np.stack(batch)\n", - " # Create the observation/target pairs for language modelling.\n", - " return dict(\n", - " input=batch[:, :-1], target=batch[:, 1:]\n", - " )\n", - "\n", - " def ids_to_words(self, ids):\n", - " \"\"\"Convert a sequence of word IDs into words.\"\"\"\n", - " return [self.id_to_word[id] for id in ids]\n", - "\n", - " @staticmethod\n", - " def _infinite_shuffle(iterable, buffer_size):\n", - " \"\"\"Repeat and shuffle the data of the iterable indefinitely.\"\"\"\n", - " ds = itertools.cycle(iterable)\n", - " buf = [next(ds) for _ in range(buffer_size)]\n", - " random.shuffle(buf)\n", - " while True:\n", - " item = next(ds)\n", - " idx = random.randint(0, 
buffer_size - 1) # Inclusive.\n", - " result, buf[idx] = buf[idx], item\n", - " yield result\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_WBIFg51oQl0" - }, - "source": [ - "Let's now look at how our data is structured for training." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "WvH3XPM5_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "6e4cd0a5-08f6-41c1-c429-a7eef24f2ddf" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "---------- Input -----------\n", - "TEXT: First Citizen: Before we proceed any further, hear me speak. All: Speak, speak. First Citizen: You are all resolved rather to die than to famish? All: Resolved. resolved. First Citizen: First, you\n", - "IDs: [15154 9394 8902 24151 21811 7967 23158 16740 22265 13689 21532 4438\n", - " 13689 15154 9394 16378 23260 2949 2173 2710 7533 449 1376 7533\n", - " 10471 21532 24310 889 15154 9394 1588 7773]\n", - "---------- Target ----------\n", - "TEXT: Citizen: Before we proceed any further, hear me speak. All: Speak, speak. First Citizen: You are all resolved rather to die than to famish? All: Resolved. resolved. First Citizen: First, you know\n", - "IDs: [ 9394 8902 24151 21811 7967 23158 16740 22265 13689 21532 4438 13689\n", - " 15154 9394 16378 23260 2949 2173 2710 7533 449 1376 7533 10471\n", - " 21532 24310 889 15154 9394 1588 7773 15038]\n", - "---------- Input -----------\n", - "TEXT: wondrous malicious, Or be accused of folly. I shall tell you A pretty tale: it may be you have heard it; But, since it serves my purpose, I will venture To stale\n", - "IDs: [15463 3873 20770 10814 20992 8815 9306 12371 19032 6613 7773 4195\n", - " 20875 8065 22747 10636 10814 7773 15178 17181 17531 818 20792 22747\n", - " 7716 23157 19093 12371 13521 13812 16314 6576]\n", - "---------- Target ----------\n", - "TEXT: malicious, Or be accused of folly. I shall tell you A pretty tale: it may be you have heard it; But, since it serves my purpose, I will venture To stale 't\n", - "IDs: [ 3873 20770 10814 20992 8815 9306 12371 19032 6613 7773 4195 20875\n", - " 8065 22747 10636 10814 7773 15178 17181 17531 818 20792 22747 7716\n", - " 23157 19093 12371 13521 13812 16314 6576 5309]\n", - "\n", - " Total vocabulary size: 25670\n" - ] - } - ], - "source": [ - "# Sample a batch and inspect the data\n", - "batch_size = 2\n", - "seq_length = 32\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "\n", - "batch = next(train_dataset)\n", - "\n", - "for obs, target in zip(batch[\"input\"], batch[\"target\"]):\n", - " print(\"-\" * 10, \"Input\", \"-\" * 11)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(obs)))\n", - " print(\"IDs:\", obs) # word IDs, not ASCII codes\n", - " print(\"-\" * 10, \"Target\", \"-\" * 10)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(target)))\n", - " print(\"IDs:\", target)\n", - "\n", - "print(f\"\\n Total vocabulary size: {train_dataset.vocab_size}\")\n", - "\n", - "VOCAB_SIZE = train_dataset.vocab_size\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w9vzee53_RGB" - }, - "source": [ - "Next, let's train our LLM and see how well it does at producing Shakespearean text. 
Tout d'abord, nous allons définir ce qui se passe à chaque étape d'entraînement.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PGuYBCkekgDw" - }, - "outputs": [], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(3, 4))\n", - "def train_step(params, optimizer_state, batch, apply_fn, update_fn):\n", - " \"\"\"\n", - " Effectuer une étape d'entraînement.\n", - "\n", - " Args:\n", - " params: Les paramètres actuels du modèle.\n", - " optimizer_state: L'état actuel de l'optimiseur.\n", - " batch: Un dictionnaire contenant les données d'entrée et les étiquettes cibles pour le batch.\n", - " apply_fn: La fonction utilisée pour appliquer le modèle aux entrées.\n", - " update_fn: La fonction utilisée pour mettre à jour les paramètres du modèle en fonction des gradients.\n", - "\n", - " Returns:\n", - " Paramètres mis à jour, état de l'optimiseur mis à jour, et la perte calculée pour le batch.\n", - " \"\"\"\n", - "\n", - " def loss_fn(params):\n", - " # Obtenez la longueur de la séquence (T) à partir des données d'entrée.\n", - " T = batch['input'].shape[1]\n", - "\n", - " # Appliquez le modèle aux données d'entrée, en utilisant un masque triangulaire inférieur pour imposer la causalité.\n", - " # jnp.tril(np.ones((T, T))) crée une matrice triangulaire inférieure de uns.\n", - " logits = apply_fn(params, batch['input'], jnp.tril(np.ones((T, T))))\n", - "\n", - " # Calculez la perte entre les logits prédits et les étiquettes cibles.\n", - " loss = sequence_loss_fn(logits, batch['target'])\n", - "\n", - " return loss\n", - "\n", - " # Calculez la perte et ses gradients par rapport aux paramètres.\n", - " loss, gradients = jax.value_and_grad(loss_fn)(params)\n", - "\n", - " # Mettez à jour l'état de l'optimiseur et calculez les mises à jour des paramètres en fonction des gradients.\n", - " updates, optimizer_state = update_fn(gradients, optimizer_state)\n", - "\n", - " # Appliquez les 
mises à jour aux paramètres.\n", - " params = optax.apply_updates(params, updates)\n", - "\n", - " # Retournez les paramètres mis à jour, l'état de l'optimiseur, et la perte pour le batch.\n", - " return params, optimizer_state, loss\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rtKWzKIAkfYU" - }, - "source": [ - "Nous allons maintenant initialiser notre optimiseur et notre modèle. N'hésitez pas à expérimenter avec les hyperparamètres pendant la pratique.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8o3q-BZX_RGB", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 436 - }, - "outputId": "12096f0b-f85a-4eb5-98f6-41f4b6288830" - }, - "outputs": [ - { - "output_type": "error", - "ename": "NameError", - "evalue": "name 'MultiHeadAttention' is not defined", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0;31m# Initialiser les paramètres du modèle en utilisant le premier lot de données d'entrée et le masque\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0mparams\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mllm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrng\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'input'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;31m# Configurer l'optimiseur en utilisant l'algorithme d'optimisation Adam avec le taux d'apprentissage 
spécifié\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - " \u001b[0;31m[... skipping hidden 9 frame]\u001b[0m\n", - "\u001b[0;32m\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, X, mask, return_att_weights)\u001b[0m\n\u001b[1;32m 59\u001b[0m \u001b[0;31m# Passer les embeddings à travers chaque bloc de décodeur en séquence\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 60\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mblock\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mblocks\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 61\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mblock\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_att_weights\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 62\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mreturn_att_weights\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 63\u001b[0m \u001b[0;31m# Si on retourne les poids d'attention, décompacter la sortie\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - " \u001b[0;31m[... 
skipping hidden 5 frame]\u001b[0m\n", - "\u001b[0;32m\u001b[0m in \u001b[0;36msetup\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0msetup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;31m# Initialiser le bloc Multi-Head Attention (MHA)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 17\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmha\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mMultiHeadAttention\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0md_m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 18\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 19\u001b[0m \u001b[0;31m# Initialiser les blocs AddNorm pour les connexions résiduelles et la normalisation\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mNameError\u001b[0m: name 'MultiHeadAttention' is not defined" - ] - } - ], - "source": [ - "# Define all hyperparameters\n", - "d_model = 128  # Dimension of the token embeddings (d_m)\n", - "num_heads = 4  # Number of attention heads in the multi-head attention\n", - "num_layers = 1  # Number of decoder blocks in the model\n", - "widening_factor = 2  # Factor by which the MLP hidden layer is widened\n", - "LR = 2e-3  # Learning rate for the optimizer\n", - "batch_size = 32  # Number of samples per training batch\n", - "seq_length = 64  # Length of each input sequence (number of tokens)\n", - "\n", - "# Prepare the training data\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "vocab_size = train_dataset.vocab_size  # Get the vocabulary size from the dataset\n",
- "batch = next(train_dataset)  # Get the first batch of input data\n", - "\n", - "# Set the PRNG key used for model initialization\n", - "rng = jax.random.PRNGKey(42)\n", - "\n", - "# Initialize the LLM with the specified hyperparameters\n", - "llm = LLM(num_heads=num_heads, num_layers=num_layers, d_m=d_model, vocab_size=vocab_size, widening_factor=widening_factor)\n", - "\n", - "# Create a causal mask so that each position attends only to itself and to earlier tokens\n", - "mask = jnp.tril(np.ones((batch['input'].shape[1], batch['input'].shape[1])))\n", - "\n", - "# Initialize the model parameters using the first input batch and the mask\n", - "params = llm.init(rng, batch['input'], mask)\n", - "\n", - "# Set up the Adam optimizer with the specified learning rate\n", - "optimizer = optax.adam(LR, b1=0.9, b2=0.99)\n", - "optimizer_state = optimizer.init(params)  # Initialize the optimizer state from the model parameters\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3bPEFakxmvsM" - }, - "source": [ - "Now we train! 
This will take a few minutes.\n", - "While it trains, have you greeted your neighbor yet?\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "oUAS6tie_RGB", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 216 - }, - "outputId": "19fab26c-4c24-460e-bd38-2650f922b43d" - }, - "outputs": [ - { - "output_type": "error", - "ename": "NameError", - "evalue": "name 'params' is not defined", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mbatch\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain_dataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m params, optimizer_state, loss = train_step(\n\u001b[0;32m---> 12\u001b[0;31m params, optimizer_state, batch, llm.apply, optimizer.update)\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0mlosses\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mloss\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mstep\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mLOG_EVERY\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mNameError\u001b[0m: name 'params' is not defined" - ] - } - ], - "source": [ - "plotlosses = PlotLosses()\n", - "\n", - "MAX_STEPS = 3500\n", - "LOG_EVERY = 32\n", - "losses = []\n", - "VOCAB_SIZE = 25670\n", - "\n", - "# Training loop\n", - "for step in range(MAX_STEPS):\n", - "    batch = next(train_dataset)\n", - "    params, optimizer_state, loss = train_step(\n", - "        params, optimizer_state, batch, llm.apply, optimizer.update)\n",
- "    losses.append(loss)\n", - "    if step % LOG_EVERY == 0:\n", - "        # Plot the mean loss over the steps since the last log, then reset the buffer\n", - "        loss_ = jnp.array(losses).mean()\n", - "        plotlosses.update(\n", - "            {\n", - "                \"loss\": loss_,\n", - "            }\n", - "        )\n", - "        plotlosses.send()\n", - "        losses = []\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pGv9c2AFmF4V" - }, - "source": [ - "#### 2.5.3 Inspecting the Trained LLM Beginner\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Pfq61gim_RGB" - }, - "source": [ - "**Reminder:** remember to run all the code presented so far in this section before running the cells below!\n", - "\n", - "Let's now generate some text and see how our model performed. DO NOT STOP THE CELL ONCE IT IS RUNNING; THIS WILL CRASH THE SESSION.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5lt8HTS__RGC", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 180 - }, - "outputId": "456f38dc-8da0-4148-bf6c-b4c59bddc369" - }, - "outputs": [ - { - "output_type": "error", - "ename": "NameError", - "evalue": "name 'params' is not defined", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[0mword_2_id\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtrain_dataset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mword_to_id\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 30\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 31\u001b[0;31m \u001b[0mgenerated_shakespeare\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgenerate_random_shakespeare\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mllm\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mid_2_word\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword_2_id\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mNameError\u001b[0m: name 'params' is not defined" - ] - } - ], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(2, ))\n", - "def generate_prediction(params, input, apply_fn):\n", - "    logits = apply_fn(params, input)\n", - "    argmax_out = jnp.argmax(logits, axis=-1)\n", - "    return argmax_out[0][-1].astype(int)\n", - "\n", - "def generate_random_shakespeare(llm, params, id_2_word, word_2_id):\n", - "    '''\n", - "    Generate text from the model with greedy decoding\n", - "    '''\n", - "\n", - "    prompt = \"Love\"\n", - "    print(prompt, end=\"\")\n", - "    tokens = prompt.split()\n", - "\n", - "    # Predict the next token and append it to the sequence\n", - "    for i in range(15):\n", - "        input = jnp.array([[word_2_id[t] for t in tokens]]).astype(int)\n", - "        prediction = generate_prediction(params, input, llm.apply)\n", - "        prediction = id_2_word[int(prediction)]\n", - "        tokens.append(prediction)\n", - "        print(\" \"+prediction, end=\"\")\n", - "\n", - "    return \" \".join(tokens)\n", - "\n", - "id_2_word = train_dataset.id_to_word\n", - "word_2_id = train_dataset.word_to_id\n", - "\n", - "generated_shakespeare = generate_random_shakespeare(llm, params, id_2_word, word_2_id)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wOwNuMRf_RGC" - }, - "source": [
- "Finally, we implemented all of the above by taking the token ID with the highest predicted probability at each step. This is called greedy decoding, because we only ever take the single most likely token. This worked well in this case, but there are situations where the greedy approach can degrade performance, particularly when we want to generate realistic text.\n", - "\n", - "There are other ways to decode from the model; a well-known algorithm is beam search. We provide resources below for those who want to learn more about it.\n", - "\n", - "[Greedy Decoding](https://www.youtube.com/watch?v=DW5C3eqAFQM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=4)\n", - "\n", - "[Beam Search](https://www.youtube.com/watch?v=uG3xoYNo3HM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=5)\n" - ] - }, - { - "cell_type": "markdown", - "source": [ - "### **Fine-Tuning an LLM for Anti-Autistic Hate Speech**\n" - ], - "metadata": { - "id": "JR3SWQ7arJEa" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fV3YG7QOZD-B" - }, - "source": [ - "## **Conclusion**\n", - "\n", - "**Summary:**\n", - "\n", - "You have now learned the basics of how an LLM works, from the fundamental principles to fine-tuning a GPT architecture with LoRA. These are powerful tools that apply to many tasks, but like any other deep learning model they are simply models, and should be used with the appropriate problem and data.\n", - "\n",
- "**Next steps:**\n", - "\n", - "Follow the links provided throughout this practical, and read about the Llama 2 and Falcon architectures to see how the latest techniques are used.\n", - "\n", - "**References:** for additional references, see the links mentioned in the specific sections of this colab.\n", - "\n", - "* [\"Attention is all you need\" paper](https://arxiv.org/abs/1706.03762)\n", - "* [Additional videos on transformers](https://www.youtube.com/playlist?list=PLmZlBIcArwhOPR2s-FIR7WoqNaBML233s)\n", - "* [LoRA paper](https://arxiv.org/abs/2106.09685)\n", - "* [RLHF](https://huggingface.co/blog/rlhf) (how ChatGPT was trained)\n", - "* [Extending the context length](https://kaiokendev.github.io/context)\n", - "\n", - "For other practicals from the Deep Learning Indaba, see the [indaba-pracs-2023 repository](https://github.com/deep-learning-indaba/indaba-pracs-2023).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o1ndpYE50BpG" - }, - "source": [ - "# Feedback and Suggestions\n", - "\n", - "Please provide feedback that we can use to improve our practicals in the future.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "OIZvkhfRz9Jz" - }, - "outputs": [], - "source": [ - "# @title Generate Feedback Form. 
(Run Cell)\n", - "from IPython.display import HTML\n", - "\n", - "HTML(\n", - " \"\"\"\n", - "\n", - "\tLoading...\n", - "\n", - "\"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oglV4kHMWnIN" - }, - "source": [ - "" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "gpuType": "T4", - "provenance": [], - "include_colab_link": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.8.5" - }, - "vscode": { - "interpreter": { - "hash": "145833166d986a8417df3c7acb65d917d84b716b5a452e57fcacdc66f1a168c9" - } - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "316ba561018544789643764625282f15": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_504231a374294cb19fd8e0b9b89ec7e8", - "IPY_MODEL_7c863229d1e24de6822714b5ca26b79c", - "IPY_MODEL_f6fdd9f590b54ac88880b19cf8188065" - ], - "layout": "IPY_MODEL_39e4060377a24cb2bc0076077938373c" - } - }, - "504231a374294cb19fd8e0b9b89ec7e8": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ee2f118a071d494abae2e98eda6f1ed6", - "placeholder": "​", - "style": 
"IPY_MODEL_b555d0c16d0f4d969091617b20a84a76", - "value": "config.json: 100%" - } - }, - "7c863229d1e24de6822714b5ca26b79c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_bdedbd9e96d54ac4aa3c7fb35a6f76f9", - "max": 1007, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_4d8571051f134363977878a50dd03319", - "value": 1007 - } - }, - "f6fdd9f590b54ac88880b19cf8188065": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3f8189acfa7f48b98c7707e89061cc29", - "placeholder": "​", - "style": "IPY_MODEL_285c44f874904f3b8ef246b81faf2f2c", - "value": " 1.01k/1.01k [00:00<00:00, 19.7kB/s]" - } - }, - "39e4060377a24cb2bc0076077938373c": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ee2f118a071d494abae2e98eda6f1ed6": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"b555d0c16d0f4d969091617b20a84a76": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "bdedbd9e96d54ac4aa3c7fb35a6f76f9": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "4d8571051f134363977878a50dd03319": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "3f8189acfa7f48b98c7707e89061cc29": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "285c44f874904f3b8ef246b81faf2f2c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "0942aeb966a749abb611511e81973b82": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_5ca27309759c490fb4ccfef3816b0770", - "IPY_MODEL_f8ddc53b692a48c4a68679ea3af026f6", - "IPY_MODEL_59601d4889eb41c99ab6fb842d77b74e" - ], - "layout": "IPY_MODEL_39bc3697123b4df8b91f07979be4f67e" - } - }, - "5ca27309759c490fb4ccfef3816b0770": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_8c3d13a57fb84984b75e210c7d7227d0", - "placeholder": "​", - "style": "IPY_MODEL_ee51b3dfda4b43e693f9ca1fbfa5ab48", - "value": "model.safetensors: 100%" - } - }, - "f8ddc53b692a48c4a68679ea3af026f6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_d02b455c0f544b39a36fa1138319ffda", - "max": 525979192, - "min": 0, - "orientation": "horizontal", - "style": 
"IPY_MODEL_ebe9f06209dc44588c34b923d635e217", - "value": 525979192 - } - }, - "59601d4889eb41c99ab6fb842d77b74e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f26169bf7f324140be5166e3c6bf55ec", - "placeholder": "​", - "style": "IPY_MODEL_e5472083b49a4e4f854565089a3e96de", - "value": " 526M/526M [00:08<00:00, 60.1MB/s]" - } - }, - "39bc3697123b4df8b91f07979be4f67e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, 
- "8c3d13a57fb84984b75e210c7d7227d0": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ee51b3dfda4b43e693f9ca1fbfa5ab48": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d02b455c0f544b39a36fa1138319ffda": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ebe9f06209dc44588c34b923d635e217": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "f26169bf7f324140be5166e3c6bf55ec": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - 
"flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e5472083b49a4e4f854565089a3e96de": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "a8fe0138076743c0992cff08634ca2e5": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_8806049050f245058d1e58ff7bb653b2", - "IPY_MODEL_7b4fc56aa430480fad6a989af26fb675", - "IPY_MODEL_69b82aa785ac4e8fa5a3bb09c8c68008" - ], - "layout": "IPY_MODEL_7980af54051c450e96d3179a5890f089" - } - }, - "8806049050f245058d1e58ff7bb653b2": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - 
"state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2c7b228ec8e6459fbfefcf5d9ab63dd9", - "placeholder": "​", - "style": "IPY_MODEL_bee5c02c3f3d411fbbe1fd876d331483", - "value": "generation_config.json: 100%" - } - }, - "7b4fc56aa430480fad6a989af26fb675": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_5e879a40d6304c00acb0d18740f8c751", - "max": 119, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_5fc7ca54266a4ffab507d369ce58208d", - "value": 119 - } - }, - "69b82aa785ac4e8fa5a3bb09c8c68008": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_caeeee54e30b43c1997cab8d6761f840", - "placeholder": "​", - "style": "IPY_MODEL_c73c160b140247d695ce61e973af9c2c", - "value": " 119/119 [00:00<00:00, 6.78kB/s]" - } - }, - "7980af54051c450e96d3179a5890f089": { - "model_module": 
"@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2c7b228ec8e6459fbfefcf5d9ab63dd9": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - 
"grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "bee5c02c3f3d411fbbe1fd876d331483": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "5e879a40d6304c00acb0d18740f8c751": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - 
"overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "5fc7ca54266a4ffab507d369ce58208d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "caeeee54e30b43c1997cab8d6761f840": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "c73c160b140247d695ce61e973af9c2c": { - "model_module": "@jupyter-widgets/controls", - "model_name": 
"DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f6aebc9a9c804a37af70db7ad38bd960": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_883970442d624ce8b375705d2067c2ec", - "IPY_MODEL_aaf60a6c52144cc49cc58724b1861a70", - "IPY_MODEL_1ddfcc9eaea049e9956759b4dbbfe7af" - ], - "layout": "IPY_MODEL_3e1a39ba92ae4f04b369fed753420305" - } - }, - "883970442d624ce8b375705d2067c2ec": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_029bfefe634b4c57b358cf42f7dbef4e", - "placeholder": "​", - "style": "IPY_MODEL_241a186183b040df81474f1766afa588", - "value": "tokenizer_config.json: 100%" - } - }, - "aaf60a6c52144cc49cc58724b1861a70": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - 
"_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3f1082fb3fe54f1bb67f61019e060ac9", - "max": 727, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_3f5a6372cf29479393fb7986ec2f20b8", - "value": 727 - } - }, - "1ddfcc9eaea049e9956759b4dbbfe7af": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_405e3a80340f4be68684bc7a1edf9854", - "placeholder": "​", - "style": "IPY_MODEL_6e4de36b7eaa4395b20c945acadfbc8d", - "value": " 727/727 [00:00<00:00, 50.2kB/s]" - } - }, - "3e1a39ba92ae4f04b369fed753420305": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - 
"justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "029bfefe634b4c57b358cf42f7dbef4e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "241a186183b040df81474f1766afa588": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - 
"_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "3f1082fb3fe54f1bb67f61019e060ac9": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "3f5a6372cf29479393fb7986ec2f20b8": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "405e3a80340f4be68684bc7a1edf9854": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - 
"_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6e4de36b7eaa4395b20c945acadfbc8d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8ccd010264ab42499b510f647dfdb223": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", 
- "children": [ - "IPY_MODEL_e16938f8f6f044d9a3e5c29bc80acbfd", - "IPY_MODEL_c8467aa01ba24c30ba8fff4f6bdafc55", - "IPY_MODEL_1b270e97e9674a7ca6cf27080576f836" - ], - "layout": "IPY_MODEL_f601815ac1d5407f908ba75d09f377ee" - } - }, - "e16938f8f6f044d9a3e5c29bc80acbfd": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_60287ad958d347d0bd996427b80e2887", - "placeholder": "​", - "style": "IPY_MODEL_d76fa1d824424fa688902a84e8b6a77c", - "value": "vocab.json: 100%" - } - }, - "c8467aa01ba24c30ba8fff4f6bdafc55": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a13529b977c045ecae240fb095c218b2", - "max": 898669, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_b8fb12904b2d41f78288c503391489fe", - "value": 898669 - } - }, - "1b270e97e9674a7ca6cf27080576f836": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - 
"_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a66e86a718494fc0a88a4d214389bc44", - "placeholder": "​", - "style": "IPY_MODEL_6ea3ab754b1044b3b3a0320a9caed1a4", - "value": " 899k/899k [00:00<00:00, 1.29MB/s]" - } - }, - "f601815ac1d5407f908ba75d09f377ee": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "60287ad958d347d0bd996427b80e2887": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": 
null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d76fa1d824424fa688902a84e8b6a77c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "a13529b977c045ecae240fb095c218b2": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - 
"grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b8fb12904b2d41f78288c503391489fe": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "a66e86a718494fc0a88a4d214389bc44": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - 
"min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6ea3ab754b1044b3b3a0320a9caed1a4": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "67919bf3c5a44627bc83cc80fe12a240": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_4ea62680b80444db951646b80fcd4c95", - "IPY_MODEL_f5408df7b5c24396b6f964892593164f", - "IPY_MODEL_5cc97ff3b54545afaa3f83152f61fad5" - ], - "layout": "IPY_MODEL_e64410977eb84d929623935464b6b076" - } - }, - "4ea62680b80444db951646b80fcd4c95": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ea5d720fe84f4046b7a4f28b246e12c6", - "placeholder": "​", - "style": 
"IPY_MODEL_28fcfb20c5354967ad3a82fef7a59956", - "value": "merges.txt: 100%" - } - }, - "f5408df7b5c24396b6f964892593164f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ebce223b50ba48cd90070111f1d36fa8", - "max": 456318, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_1c7d003576a24216bd248a54d2f558bf", - "value": 456318 - } - }, - "5cc97ff3b54545afaa3f83152f61fad5": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9e9f4e2b0ba64e509262306b9588102c", - "placeholder": "​", - "style": "IPY_MODEL_36904956f38d41d7b201e0e2c5c53f83", - "value": " 456k/456k [00:00<00:00, 1.33MB/s]" - } - }, - "e64410977eb84d929623935464b6b076": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ea5d720fe84f4046b7a4f28b246e12c6": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"28fcfb20c5354967ad3a82fef7a59956": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "ebce223b50ba48cd90070111f1d36fa8": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "1c7d003576a24216bd248a54d2f558bf": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "9e9f4e2b0ba64e509262306b9588102c": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "36904956f38d41d7b201e0e2c5c53f83": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "304773d4d8444473a3adb55c37d12ac8": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_4ab40baf831c416f9e57af3c0a54bbfc", - "IPY_MODEL_03c314daeb7c4938a7f1bc90c321e2d6", - "IPY_MODEL_0202e22b66e34ed7b8c92eabfc6e6756" - ], - "layout": "IPY_MODEL_be6e9667d48b4c4796acd0ee503ee2c4" - } - }, - "4ab40baf831c416f9e57af3c0a54bbfc": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_fdcc2d118ee54e72ae22949291b9fca8", - "placeholder": "​", - "style": "IPY_MODEL_5a077af432024aa1af479aa060151def", - "value": "tokenizer.json: 100%" - } - }, - "03c314daeb7c4938a7f1bc90c321e2d6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_141c082cc81447159582c9cd414ded81", - "max": 2107652, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_895f142dc9b146e782e5c42c1396dc45", - 
"value": 2107652 - } - }, - "0202e22b66e34ed7b8c92eabfc6e6756": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e552d2d2e3ff4274bca7153ed60d2051", - "placeholder": "​", - "style": "IPY_MODEL_22abcd28d2b44706bafda4350243e28e", - "value": " 2.11M/2.11M [00:00<00:00, 11.6MB/s]" - } - }, - "be6e9667d48b4c4796acd0ee503ee2c4": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "fdcc2d118ee54e72ae22949291b9fca8": { - 
"model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "5a077af432024aa1af479aa060151def": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "141c082cc81447159582c9cd414ded81": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "895f142dc9b146e782e5c42c1396dc45": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "e552d2d2e3ff4274bca7153ed60d2051": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, 
- "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "22abcd28d2b44706bafda4350243e28e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "0b93bc073a9341c69e9d33a99feefa99": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_a6a3fa52ca68433fb545c61d40db660c", - "IPY_MODEL_83a23a023ef24bc5b1c38eac0dcb49cb", - "IPY_MODEL_0d69fd544d7e4b4b9f11a8675fb550b7" - ], - "layout": "IPY_MODEL_a5dfeffb71c94fdabcd6ad1bd26bc9f4" - } - }, - "a6a3fa52ca68433fb545c61d40db660c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - 
"_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e0b3e4532ab5451689128fdeb5263667", - "placeholder": "​", - "style": "IPY_MODEL_f9f2c5448107489a854333ff50200cd3", - "value": "special_tokens_map.json: 100%" - } - }, - "83a23a023ef24bc5b1c38eac0dcb49cb": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_30b8b5253bed447785ef4e9e5efcc533", - "max": 357, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_a3af18369e0d410097ff234172bdd08d", - "value": 357 - } - }, - "0d69fd544d7e4b4b9f11a8675fb550b7": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_4e0e3ebe2495481e83c0792a4ae7f2fc", - "placeholder": "​", - "style": "IPY_MODEL_6c48cb5fabc44ad085edf0fb01309977", - "value": " 357/357 [00:00<00:00, 28.5kB/s]" - } - }, - "a5dfeffb71c94fdabcd6ad1bd26bc9f4": { - "model_module": "@jupyter-widgets/base", - "model_name": 
"LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e0b3e4532ab5451689128fdeb5263667": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - 
"justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "f9f2c5448107489a854333ff50200cd3": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "30b8b5253bed447785ef4e9e5efcc533": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - 
"overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "a3af18369e0d410097ff234172bdd08d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "4e0e3ebe2495481e83c0792a4ae7f2fc": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6c48cb5fabc44ad085edf0fb01309977": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": 
"1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file diff --git a/README.md b/README.md index ebbee89..dbd12ab 100644 --- a/README.md +++ b/README.md @@ -1 +1,62 @@ -# indaba-pracs-2024 \ No newline at end of file +# Deep Learning Indaba Practicals 2024 + +This year we offer all of our practicals in both English and French! Scroll down to find the French versions of the practicals, under the English versions for each day. + +*Cette année, nous proposons tous nos travaux pratiques en anglais et en français ! Faites défiler la page vers le bas pour trouver les versions françaises des travaux pratiques, sous les versions anglaises de chaque jour.* + +## Day 1 (foundations 1) – English + +| Topic 💥 | Description 📘 | +|:----|----| +| [Introduction to ML [using JAX]](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Intro_to_ML_using_JAX/Introduction_to_ML_using_JAX.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Intro_to_ML_using_JAX/Introduction_to_ML_using_JAX.ipynb)

| In this tutorial, we will learn about some of the high-level concepts behind machine learning (ML) and the basics of JAX, a numerical computing library that we will use for our practicals. Finally, we will learn about the fundamentals of supervised learning, from linear regression, all the way to neural networks, learning the fundamentals of optimisation along the way. | +| Introduction to Probabilistic Thinking and Programming

[Part 1](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part1.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part1.ipynb)

[Part 2](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part2.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part2.ipynb)

| Probabilistic thinking and working with probability distributions are very powerful tools for any machine learning practitioner. This practical introduces a powerful approach to solving real-world problems called **probabilistic programming**, and builds a helpful foundation for reasoning about probabilistic models and events. | +| [Graph Neural Networks](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks.ipynb)

[Extended Version](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_Extended.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_Extended.ipynb)

| In this tutorial, we will be learning about Graph Neural Networks (GNNs), a topic which has exploded in popularity in both research and industry. We will start with a refresher on graph theory, then dive into how GNNs work from a high level. Next we will cover some popular GNN implementations and see how they work in practice. | + + +## Jour 1 (connaissances 1) – Français + +| Sujet 💥 | Description 📘 | +|:----|----| +| [Introduction au Machine Learning [en utilisant JAX]](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Intro_to_ML_using_JAX/Introduction_to_ML_using_JAX_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Intro_to_ML_using_JAX/Introduction_to_ML_using_JAX_French.ipynb)

| Dans ce tutoriel, nous allons découvrir certains des concepts de haut niveau derrière l'apprentissage automatique (en anglais : machine learning, ML) et les bases de JAX, une bibliothèque de calcul numérique que nous utiliserons pour nos travaux pratiques. Enfin, nous aborderons les fondamentaux de l'apprentissage supervisé (supervised learning), de la régression linéaire (linear regression) jusqu'aux réseaux de neurones (neural networks), en apprenant les principes de l'optimisation en cours de route. | +| Introduction à la pensée et à la programmation probabilistes

[Part 1](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part1_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part1_French.ipynb)

[Part 2](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part2_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Probabilistic_Thinking_and_Programming/Probabilistic_Thinking_and_Programming_Part2_French.ipynb)

| La pensée probabiliste et l'utilisation de distributions de probabilité sont des outils très puissants pour tout praticien du machine learning. Ce guide pratique introduit une approche puissante pour résoudre des problèmes du monde réel appelée **programmation probabiliste**, et construit une base solide pour raisonner sur les modèles et les événements probabilistes. | +| [Réseaux de neurones en graphes](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_French.ipynb)

[Version étendue](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_Extended_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Graph_Neural_Networks/Graph_Neural_Networks_Extended_French.ipynb)

| Ce tutoriel porte sur les réseaux de neurones en graphes (Graph Neural Networks en anglais, ou tout simplement GNNs), un sujet qui a explosé en popularité tant dans la recherche que dans l'industrie. Nous commencerons par une révision de la théorie des graphes, puis nous plongerons dans le fonctionnement des GNNs à un niveau général. Ensuite, nous couvrirons quelques implémentations populaires de GNNs et verrons comment elles fonctionnent en pratique. | + + +## Day 2 (foundations 2) – English + +| Topic 💥 | Description 📘 | +|:----|----| +| [Responsible AI](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Responsible_AI/Responsible_AI.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Responsible_AI/Responsible_AI.ipynb)

| This notebook provides a hands-on exploration of responsible AI in two parts: reviewing ProPublica's analysis of the COMPAS risk assessment tool and examining biases using the Fairlearn toolkit. The first part focuses on ProPublica's investigation of COMPAS, and specifically on how its recidivism scores vary by race and sex. This involves data import, preprocessing, exploratory analysis, and logistic regression modeling to reproduce and interpret ProPublica's findings. The second part transitions to detecting and mitigating biases using Fairlearn, a library designed to assess and improve fairness in machine learning models. By engaging with both theoretical and practical aspects of responsible AI, this notebook aims to enhance understanding of bias in AI systems and the tools available to address it. | +| LLM Foundations | Coming soon! | +| [Diffusion Models: Building your own Stable Diffusion](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Diffusion_Models/Diffusion_Models.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Diffusion_Models/Diffusion_Models.ipynb)

| Denoising Diffusion Models are a variant of generative modelling that serve as the backbone in recent advances in image synthesis - including Dall-E, Stable Diffusion, and Midjourney. These models utilise an iterative denoising process during generation to produce high-quality samples. In this practical, we will explore the fundamentals of diffusion models, the intuition behind them, and how they work in practice. By the end of the practical, we will have covered all the steps required to train one of these models from scratch! | +| From Zero to 2048: Building RL environment with JAX

[Beginner Level](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Beginner_Level).ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Beginner_Level).ipynb)

[Intermediate Level](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Intermediate_Level).ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Intermediate_Level).ipynb)

| In this practical, we will explore building a JAX environment for the game "2048". In Reinforcement Learning (RL), the roles of an Agent and an Environment are crucial, as the environment is essential for testing and training RL algorithms. Meanwhile, JAX has become a key tool for advancing RL algorithm implementation, enabling more efficient architectures and the creation of distributed systems that can be trained in minutes on local GPU machines. However, to achieve this efficiency, the environment needs to be "jaxified". The importance of adapting environments for JAX is highlighted by the growing focus on JAX environment repositories like Jumanji, Gymnax, and JaxMARL. | + + +## Jour 2 (connaissances 2) – Français + +| Sujet 💥 | Description 📘 | +|:----|----| +| [IA Responsable](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Responsible_AI/Responsible_AI_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Responsible_AI/Responsible_AI_French.ipynb)

| Ce bloc-notes propose une exploration pratique de l'IA responsable en deux parties : analyser l'analyse de ProPublica de l'outil d'évaluation des risques COMPAS et examiner les biais à l'aide de la boîte à outils Fairlearn. La première partie se concentre sur l'enquête de ProPublica sur COMPAS, en particulier sur la manière dont ses scores de récidive varient selon la race et le sexe. Cela implique l'importation de données, le prétraitement, l'analyse exploratoire et la modélisation par régression logistique pour reproduire et interpréter les résultats de ProPublica. La deuxième partie passe à la détection et à l'atténuation des biais à l'aide de Fairlearn, une bibliothèque conçue pour évaluer et améliorer l'équité des modèles de Machine Learning. En abordant les aspects à la fois théoriques et pratiques de l'IA responsable, ce bloc-notes vise à améliorer la compréhension des biais dans les systèmes d'IA et des outils disponibles pour y remédier. | +| Introduction aux LLMs | Bientôt disponible ! | +| [Diffusion Models : Construire votre propre Stable Diffusion](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Diffusion_Models/Diffusion_Models_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Diffusion_Models/Diffusion_Models_French.ipynb)

| Les modèles de diffusion à débruitage (Denoising Diffusion Models) sont une variante de la modélisation générative qui constitue l'épine dorsale des progrès récents en matière de synthèse d'images - notamment Dall-E, Stable Diffusion et Midjourney. Ces modèles utilisent un processus de débruitage itératif pendant la génération afin de produire des échantillons de haute qualité. Dans ce TP, nous explorerons les principes fondamentaux des modèles de diffusion, l'intuition qui les sous-tend et leur fonctionnement en pratique. À la fin du TP, nous aurons couvert toutes les étapes nécessaires à l'apprentissage d'un de ces modèles à partir de zéro ! | +| De zéro à 2048 : Construire un environnement RL avec JAX

[Beginner Level](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Beginner_Level)_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Beginner_Level)_French.ipynb)

[Intermediate Level](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Intermediate_Level)_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/RL_2048/From_Zero_to_2048_Building_RL_environment_with_JAX_(Intermediate_Level)_French.ipynb)

| Dans cette pratique, nous allons explorer la construction d'un environnement JAX pour le jeu « 2048 ». Dans l'apprentissage par renforcement (RL), les rôles d'un agent et d'un environnement sont cruciaux, car l'environnement est essentiel pour tester et entraîner les algorithmes RL. D'autre part, JAX est devenu un outil clé pour faire progresser la mise en œuvre des algorithmes d'apprentissage par renforcement, permettant des architectures plus efficaces et la création de systèmes distribués qui peuvent être entraînés en quelques minutes sur des machines GPU locales. Cependant, pour atteindre cette efficacité, l'environnement doit être « jaxifié ». L'importance de l'adaptation des environnements pour JAX est soulignée par l'intérêt croissant pour les dépôts (repositories) d'environnements jax tels que Jumanji, Gymnax, et JaxMARL. | + + +## Day 3 (applications) – English + +| Topic 💥 | Description 📘 | +|:----|----| +| [AI for Biology](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/AI_for_Biology/AI_for_Biology.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/AI_for_Biology/AI_for_Biology.ipynb)

|In this practical, we will learn about some of the major application areas of AI in the biosciences, go over the role of DNA and how DNA language models are trained, extract and explore DNA embeddings using a pre-trained state-of-the-art DNA language model, and dive into a hands-on problem on modelling DNA sequences and their properties. | +| [Fine-tuning and resource-efficient LLMs for NLP](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Low_Resource_LLMs/Indaba_2024_Low_Resource_LLM.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Low_Resource_LLMs/Indaba_2024_Low_Resource_LLM.ipynb)

| Low-resource NLP (Natural Language Processing) refers to the study and development of NLP models and systems for languages, tasks, or domains that have limited data and resources available. These can include languages with fewer digital text corpora, limited computational tools, or less-developed linguistic research. In this practical, we will explore data scarcity and compute resource limitations in low-resource NLP, and introduce some ways to address these challenges with parameter-efficient finetuning of LLMs. | +| [From Centralised to Decentralised Training: An Intro to Federated Learning](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Federated_Learning/Federated_Learning.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Federated_Learning/Federated_Learning.ipynb)

|Federated Learning (FL) is a growing research area with a number of existing applications and numerous research papers presented at conferences, in journals and at workshops each year. While this is great for research and industry experts who have a thriving FL community, the rapid innovations in FL implicitly create a barrier to entry for those wanting to understand FL from a first-principles perspective. Despite FL being a relatively simple concept at its core, it can be implemented in numerous ways and includes a great deal of domain-specific jargon that can leave newcomers feeling overwhelmed or alienated initially. Our practical is aimed at bridging this gap. We want to provide a beginner-focused introduction to FL, stripping away any fancy bells and whistles such that focus is placed on the foundational concepts. We want you to leave the practical with a good enough understanding such that you can explain Federated Learning to someone else, and such that you can intuit when Federated Learning could be useful in future scenarios you might encounter. | +| [Recommender Systems](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/Recommender_Systems.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/Recommender_Systems.ipynb)

[Building Recommender Systems using GNNs (Part 2)](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/GNNs_for_Recommendations.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/GNNs_for_Recommendations.ipynb)

| Recommender Systems are probably one of the most ubiquitous types of machine learning models that we encounter in our online life. They influence what we see in our social media feeds, the products we buy, the music we listen to, the food we eat, and the movies we watch. In this practical, we take you through some of the techniques popularly used in industry to recommend the content you see online by building our very own movie-recommender system. | + + +## Jour 3 (applications) – Français + +| Sujet 💥 | Description 📘 | +|:----|----| +| [IA pour la Biologie](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/AI_for_Biology/AI_for_Biology_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/AI_for_Biology/AI_for_Biology_French.ipynb)

|Dans cette pratique, nous allons : apprendre à connaître certains des principaux domaines d'application de l'IA dans les biosciences, passer en revue le rôle de l'ADN et la manière dont les modèles de langage ADN sont entraînés, extraire et explorer les plongements ADN à l'aide d'un modèle de langage ADN pré-entraîné à la pointe de la technologie, et nous plonger dans un problème pratique de modélisation des séquences d'ADN et de leurs propriétés. | +| [Finetuning et LLM économe en ressources pour le NLP](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Low_Resource_LLMs/Indaba_2024_Low_Resource_LLM_FR.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Low_Resource_LLMs/Indaba_2024_Low_Resource_LLM_FR.ipynb)

| Le traitement du langage naturel (NLP) à faibles ressources fait référence à l'étude et au développement de modèles et de systèmes de traitement du langage naturel (NLP) pour les langues, les tâches ou les domaines pour lesquels les données et les ressources disponibles sont limitées. Il peut s'agir de langues avec moins de corpus de textes numériques, d'outils informatiques limités ou de recherches linguistiques moins développées. Dans cette étude pratique, nous explorerons la rareté des données et les limitations des ressources informatiques dans le traitement du langage naturel à faibles ressources, et nous présenterons quelques moyens de relever ces défis grâce à un réglage fin des LLM efficace en termes de paramètres. | +| [De l'entraînement centralisé à l'entraînement décentralisé : une introduction au Federated Learning](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Federated_Learning/Federated_Learning_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Federated_Learning/Federated_Learning_French.ipynb)

|Le Federated Learning (FL) est un domaine de recherche en pleine croissance avec un certain nombre d'applications existantes et de nombreux articles de recherche présentés chaque année lors de conférences, dans des revues et des ateliers. Bien que cela soit bénéfique pour les experts de la recherche et de l'industrie qui disposent d'une communauté FL florissante, les innovations rapides dans le domaine du FL créent implicitement une barrière à l'entrée pour ceux qui souhaitent entrer dans cet espace et comprendre le FL à partir des premiers principes. Bien que le FL soit un concept relativement simple à la base, il peut être mis en œuvre de nombreuses façons et comprend un grand nombre de jargons spécifiques au domaine qui peuvent laisser les nouveaux arrivants se sentir dépassés ou aliénés au départ. Notre approche pratique vise à combler cette lacune. Nous voulons fournir une introduction au FL axée sur les débutants, en supprimant toutes les fioritures afin que l'accent soit mis sur les concepts fondamentaux. Nous voulons que vous quittiez cette formation pratique avec une compréhension suffisante pour pouvoir expliquer le Federated Learning à quelqu'un d'autre et pour pouvoir deviner quand le Federated Learning pourrait être utile dans les futurs scénarios que vous pourriez rencontrer. | +| [Systèmes de recommandation](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/Recommender_Systems_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/Recommender_Systems_French.ipynb)

[Construire des systèmes de recommandation à l'aide de GNN (Partie 2)](https://github.com/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/GNNs_for_Recommendations_French.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Recommender_Systems/GNNs_for_Recommendations_French.ipynb)

| Les systèmes de recommandation sont probablement l'un des types de modèles de ML les plus omniprésents que nous rencontrons dans notre vie en ligne. Ils influencent ce que nous voyons dans nos flux de médias sociaux, les produits que nous achetons, la musique que nous écoutons, la nourriture que nous mangeons et les films que nous regardons. Dans ce TP, nous vous présentons quelques-unes des techniques couramment utilisées dans l'industrie qui recommandent le contenu que vous voyez en ligne en construisant notre propre système de recommandation de films. | diff --git a/foundations_of_llms_practical.ipynb b/foundations_of_llms_practical.ipynb deleted file mode 100644 index 9ff4ded..0000000 --- a/foundations_of_llms_practical.ipynb +++ /dev/null @@ -1,7680 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m2s4kN_QPQVe" - }, - "source": [ - "# LLMs for everyone\n", - "\n", - "\n", - "\n", - "\"Open\n", - "\n", - "© Deep Learning Indaba 2024. Apache License 2.0.\n", - "\n", - "**Authors: Jabez Magomere, Harry Mayne, Khalil Mrini, Nabra Rizvi, Doudou Ba**\n", - "\n", - "**Introduction:**\n", - "\n", - "Welcome to **\"LLMs for Everyone\"**—your gateway to the fascinating world of Large Language Models (LLMs)! To kick things off, here’s a fun fact: this entire introduction was generated by ChatGPT, one of the many powerful LLMs you'll be learning about. 🤖✨\n", - "\n", - "In this tutorial, you'll dive into the core principles of transformers, the cutting-edge technology behind models like GPT. You’ll also get hands-on experience training your very own Language Model! Get ready to explore how these impressive AI systems create such realistic and engaging text. Let’s embark on this exciting journey together and unlock the secrets of LLMs! 
🚀📚\n", - "\n", - "**Topics:**\n", - "\n", - "Content: [Hugging Face Introduction, Attention Mechanism, Transformer Architecture, Training your own LLM from scratch, Finetuning an LLM for Text Classification]\n", - "\n", - "Level: Beginner, Intermediate, Advanced\n", - "\n", - "**Aims/Learning Objectives:**\n", - "\n", - "* Understand the idea behind [Attention](https://arxiv.org/abs/1706.03762) and why it is used.\n", - "* Present and describe the fundamental building blocks of the [Transformer Architecture](https://arxiv.org/abs/1706.03762) along with an intuition on such an architecture design.\n", - "* Build and train a simple Shakespeare-inspired LLM.\n", - "\n", - "**Prerequisites:**\n", - "\n", - "* Basic knowledge of Deep Learning.\n", - "* Familiarity with Natural Language Processing (NLP).\n", - "* Understanding of sequence-to-sequence models.\n", - "* Basic understanding of Linear Algebra.\n", - "\n", - "**Outline:**\n", - "\n", - ">[LLMs for everyone](#scrollTo=m2s4kN_QPQVe)\n", - "\n", - ">>[Installations, Imports and Helper Functions](#scrollTo=6EqhIg1odqg0)\n", - "\n", - ">>[Let's kick things off with a Hugging Face Demo! Beginner](#scrollTo=4zu5cg-YG4XU)\n", - "\n", - ">>>[Hugging Face](#scrollTo=AwjIIipOG4fz)\n", - "\n", - ">>>[Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample](#scrollTo=eq46TV_0G4f0)\n", - "\n", - ">>[1. 
Attention](#scrollTo=-ZUp8i37dFbU)\n", - "\n", - ">>>[Intuition - Beginner](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Understanding Attention in Simple Terms](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Sequence to sequence attenion mechanisms - Intermediate](#scrollTo=aQfqM1EJyDXI)\n", - "\n", - ">>>[Self-attention to Multihead Attention - Intermediate](#scrollTo=J-MU6rrny8Nj)\n", - "\n", - ">>>>[Self-attention](#scrollTo=0AFUEFZGzCTv)\n", - "\n", - ">>>>>[Queries, keys and values](#scrollTo=pwOIMtdZzdTf)\n", - "\n", - ">>>>>[Masked attention](#scrollTo=D7B-AgO80gIt)\n", - "\n", - ">>>>>[Multi-head attention](#scrollTo=OWDubQwCs4zG)\n", - "\n", - ">>[2. Building your own LLM](#scrollTo=e9NW58_3hAg2)\n", - "\n", - ">>>[2.1 High-level overvierw Beginner](#scrollTo=bA_2coZvhAg3)\n", - "\n", - ">>>[2.2 Tokenization + Positional encoding Beginner](#scrollTo=fbTsk0MdhAhC)\n", - "\n", - ">>>>[2.2.1 Tokenization](#scrollTo=DehUpfym_RF8)\n", - "\n", - ">>>>[2.2.2 Positional encodings](#scrollTo=639s7Zuk_RF9)\n", - "\n", - ">>>>>[Sine and cosine functions](#scrollTo=rklY-aL-_RF9)\n", - "\n", - ">>>[2.3 Transformer block Intermediate](#scrollTo=SdNPg0pnhAhG)\n", - "\n", - ">>>>[2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner](#scrollTo=kTURbfr__RF-)\n", - "\n", - ">>>>[2.3.2 Add and Norm block Beginner](#scrollTo=Sts5Vr4i_RF-)\n", - "\n", - ">>>[2.4 Building the Transformer Decoder / LLM Intermediate](#scrollTo=91dXd29b_RF_)\n", - "\n", - ">>>[2.5 Training your LLM](#scrollTo=wmt3tp38G90A)\n", - "\n", - ">>>>[2.5.1 Training objective Intermediate](#scrollTo=agLIpsoh_RGA)\n", - "\n", - ">>>>[2.5.2 Training models Advanced](#scrollTo=4CSfvGj__RGA)\n", - "\n", - ">>>>[2.5.3 Inspecting the trained LLM Beginner](#scrollTo=pGv9c2AFmF4V)\n", - "\n", - ">>[Conclusion](#scrollTo=fV3YG7QOZD-B)\n", - "\n", - ">[Feedback](#scrollTo=o1ndpYE50BpG)\n", - "\n", - "\n", - "\n", - "\n", - "**Before you start:**\n", - "\n", - "For this practical, you will need to use a 
GPU to speed up training. To do this, go to the \"Runtime\" menu in Colab, select \"Change runtime type\" and then in the popup menu, choose \"GPU\" in the \"Hardware accelerator\" box." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "952qogb79nnY" - }, - "source": [ - "**Suggested experience level in this topic:**\n", - "\n", - "| Level | Experience |\n", - "| --- | --- |\n", - "`Beginner` | It is my first time being introduced to this work. |\n", - "`Intermediate` | I have done some basic courses/intros on this topic. |\n", - "`Advanced` | I work in this area/topic daily. |" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "cellView": "form", - "id": "YBdDHcI_ArCR", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "1e963d65-c3e9-42b3-9737-c6a8836e3cbc" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Based on your experience, we recommend you to not attempt to do every coding task but instead, skip through to every section and ensure you interact with the LoRA finetuned LLM presented in the last section as well as with the pretrained LLM to get a practical understanding of how these models behave.\n", - "Note: this is just a guideline, feel free to explore the colab as you'd like if you feel comfort able!\n" - ] - } - ], - "source": [ - "# @title **Paths to follow:** What is your level of experience in the topics presented in this notebook? 
(Run Cell)\n", - "experience = \"beginner\" #@param [\"beginner\", \"intermediate\", \"advanced\"]\n", - "sections_to_follow=\"\"\n", - "\n", - "\n", - "if experience == \"beginner\": sections_to_follow = \"\"\"we recommend you to not attempt to do every coding task but instead, skip through to every section and ensure you interact with the LoRA finetuned LLM presented in the last section as well as with the pretrained LLM to get a practical understanding of how these models behave\"\"\"\n", - "\n", - "elif experience == \"intermediate\": sections_to_follow = \"\"\"we recommend you go through every section in this notebook and try the coding tasks tagged as beginner or intermediate. If you get stuck on the code ask a tutor for help or move on to better use the time of the practical\"\"\"\n", - "\n", - "elif experience == \"advanced\": sections_to_follow = \"\"\"we recommend you go through every section and try every coding task until you get it to work\"\"\"\n", - "\n", - "\n", - "print(f\"Based on your experience, {sections_to_follow}.\\nNote: this is just a guideline, feel free to explore the colab as you'd like if you feel comfort able!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6EqhIg1odqg0" - }, - "source": [ - "## Installations, Imports and Helper Functions" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "4boGA9rYdt9l", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "9275ff48-ed7e-4e6e-8038-7918c521acc3" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.42.4)\n", - "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.21.0)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.15.4)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in 
/usr/local/lib/python3.10/dist-packages (from transformers) (0.23.5)\n", - "Requirement already satisfied: numpy<2.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)\n", - "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.5.15)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)\n", - "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.4)\n", - "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)\n", - "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.5)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.1.4)\n", - "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.5.0)\n", - "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2024.6.1,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets) (2024.6.1)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.5)\n", - "Requirement already satisfied: 
aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.0)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", - "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers) (4.12.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.8)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.7.4)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) 
(2024.1)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", - "Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.13.1)\n", - "Requirement already satisfied: umap-learn in /usr/local/lib/python3.10/dist-packages (0.5.6)\n", - "Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.10/dist-packages (from seaborn) (1.26.4)\n", - "Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.10/dist-packages (from seaborn) (2.1.4)\n", - "Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.10/dist-packages (from seaborn) (3.7.1)\n", - "Requirement already satisfied: scipy>=1.3.1 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.13.1)\n", - "Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.3.2)\n", - "Requirement already satisfied: numba>=0.51.2 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.60.0)\n", - "Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.5.13)\n", - "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from umap-learn) (4.66.5)\n", - "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.1)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n", - "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.53.1)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)\n", - "Requirement already satisfied: packaging>=20.0 in 
/usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)\n", - "Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (9.4.0)\n", - "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.4)\n", - "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)\n", - "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.2->umap-learn) (0.43.0)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.1)\n", - "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.10/dist-packages (from pynndescent>=0.5->umap-learn) (1.4.2)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.22->umap-learn) (3.5.0)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)\n", - "Requirement already satisfied: livelossplot in /usr/local/lib/python3.10/dist-packages (0.5.5)\n", - "Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from livelossplot) (3.7.1)\n", - "Requirement already satisfied: bokeh in /usr/local/lib/python3.10/dist-packages (from livelossplot) (3.4.3)\n", - "Requirement already satisfied: Jinja2>=2.9 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (3.1.4)\n", - "Requirement already satisfied: contourpy>=1.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (1.2.1)\n", - 
"Requirement already satisfied: numpy>=1.16 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (1.26.4)\n", - "Requirement already satisfied: packaging>=16.8 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (24.1)\n", - "Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (2.1.4)\n", - "Requirement already satisfied: pillow>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (9.4.0)\n", - "Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (6.0.2)\n", - "Requirement already satisfied: tornado>=6.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (6.3.3)\n", - "Requirement already satisfied: xyzservices>=2021.09.1 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (2024.6.0)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (0.12.1)\n", - "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (4.53.1)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (1.4.5)\n", - "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (3.1.4)\n", - "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (2.8.2)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from Jinja2>=2.9->bokeh->livelossplot) (2.1.5)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->bokeh->livelossplot) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from 
pandas>=1.2->bokeh->livelossplot) (2024.1)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->livelossplot) (1.16.0)\n", - "Requirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (0.33.0)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.26.4)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (24.1)\n", - "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate) (5.9.5)\n", - "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate) (6.0.2)\n", - "Requirement already satisfied: torch>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (2.4.0+cu121)\n", - "Requirement already satisfied: huggingface-hub>=0.21.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.23.5)\n", - "Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.4.4)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (3.15.4)\n", - "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (2024.6.1)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (2.32.3)\n", - "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (4.66.5)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (4.12.2)\n", - "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from 
torch>=1.10.0->accelerate) (1.13.2)\n", - "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.3)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.1.4)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.10.0->accelerate) (2.1.5)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (3.8)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (2024.7.4)\n", - "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.10.0->accelerate) (1.3.0)\n", - "A GPU is connected.\n" - ] - }, - { - "output_type": "stream", - "name": "stderr", - "text": [ - "[nltk_data] Downloading package word2vec_sample to /root/nltk_data...\n", - "[nltk_data] Package word2vec_sample is already up-to-date!\n" - ] - } - ], - "source": [ - "# Install necessary libraries for deep learning, NLP, and plotting\n", - "!pip install transformers datasets # Transformers and datasets libraries for NLP tasks\n", - "!pip install seaborn umap-learn # Seaborn for plotting, UMAP for dimensionality reduction\n", - "!pip install livelossplot # LiveLossPlot for tracking model training progress\n", - "!pip install -q transformers[torch] # Transformers with PyTorch backend\n", - "!pip install -q peft # Parameter-Efficient Fine-Tuning 
library\n", - "!pip install accelerate -U # Accelerate library for performance\n", - "\n", - "# Install utilities for debugging and console output formatting\n", - "!pip install -q ipdb # Interactive Python Debugger\n", - "!pip install -q colorama # Colored terminal text output\n", - "\n", - "# Import system and math utilities\n", - "import os\n", - "import math\n", - "import urllib.request\n", - "\n", - "# Check for connected accelerators (GPU or TPU) and set up accordingly\n", - "if os.environ.get(\"COLAB_GPU\") and int(os.environ[\"COLAB_GPU\"]) > 0:\n", - " print(\"A GPU is connected.\")\n", - "elif \"COLAB_TPU_ADDR\" in os.environ and os.environ[\"COLAB_TPU_ADDR\"]:\n", - " print(\"A TPU is connected.\")\n", - " import jax.tools.colab_tpu\n", - " jax.tools.colab_tpu.setup_tpu()\n", - "else:\n", - " print(\"Only CPU accelerator is connected.\")\n", - "\n", - "# Avoid GPU memory allocation to be done by JAX\n", - "os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = \"false\"\n", - "\n", - "# Import libraries for JAX-based deep learning\n", - "import chex\n", - "import flax\n", - "import flax.linen as nn\n", - "import jax\n", - "import jax.numpy as jnp\n", - "from jax import grad, jit, vmap\n", - "import optax\n", - "\n", - "# Import NLP and model-related libraries\n", - "import transformers\n", - "from transformers import pipeline, AutoTokenizer, AutoModel\n", - "import datasets\n", - "import peft\n", - "\n", - "# Import image processing and plotting libraries\n", - "from PIL import Image\n", - "from livelossplot import PlotLosses\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import seaborn as sns\n", - "\n", - "# Import additional utilities for working with text and models\n", - "import torch\n", - "import torchvision\n", - "import itertools\n", - "import random\n", - "import copy\n", - "\n", - "# Download an example image to use in the notebook\n", - "urllib.request.urlretrieve(\n", - " 
\"https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGNhdHxlbnwwfHwwfHw%3D&w=1000&q=80\",\n", - " \"cat.png\",\n", - ")\n", - "\n", - "# Import libraries for NLP preprocessing and working with pre-trained models\n", - "import gensim\n", - "from nltk.data import find\n", - "import nltk\n", - "nltk.download(\"word2vec_sample\")\n", - "\n", - "# Import Hugging Face tools and IPython widgets\n", - "import huggingface_hub\n", - "import ipywidgets as widgets\n", - "from IPython.display import display\n", - "import colorama\n", - "\n", - "# Set Matplotlib to output SVG format for better quality plots\n", - "%config InlineBackend.figure_format = 'svg'" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "-9X10jhocGaS" - }, - "outputs": [], - "source": [ - "# @title Helper Plotting Functions. (Run Cell)\n", - "\n", - "def plot_position_encodings(P, max_tokens, d_model):\n", - " \"\"\"\n", - " Plots the position encodings matrix.\n", - "\n", - " Args:\n", - " P: Position encoding matrix (2D array).\n", - " max_tokens: Maximum number of tokens (rows) to plot.\n", - " d_model: Dimensionality of the model (columns) to plot.\n", - " \"\"\"\n", - "\n", - " # Set up the plot size based on the number of tokens and model dimensions\n", - " plt.figure(figsize=(20, np.min([8, max_tokens])))\n", - "\n", - " # Plot the position encoding matrix with a color map for better visualization\n", - " im = plt.imshow(P, aspect=\"auto\", cmap=\"Blues_r\")\n", - "\n", - " # Add a color bar to indicate the encoding values\n", - " plt.colorbar(im, cmap=\"blue\")\n", - "\n", - " # Show embedding indices as ticks if the dimensionality is small\n", - " if d_model <= 64:\n", - " plt.xticks(range(d_model))\n", - "\n", - " # Show position indices as ticks if the number of tokens is small\n", - " if max_tokens <= 32:\n", - " plt.yticks(range(max_tokens))\n", - "\n", - " # Label the axes\n", - " 
plt.xlabel(\"Embedding index\")\n", - " plt.ylabel(\"Position index\")\n", - "\n", - " # Display the plot\n", - " plt.show()\n", - "\n", - "\n", - "def plot_image_patches(patches):\n", - " \"\"\"\n", - " Function that takes in a list of patches and plots them.\n", - "\n", - " Args:\n", - " patches: A list or array of image patches to plot.\n", - " \"\"\"\n", - "\n", - " # Set up the figure for plotting patches\n", - " fig = plt.figure(figsize=(25, 25))\n", - "\n", - " # Create a subplot for each patch and display it\n", - " axes = []\n", - " for a in range(patches.shape[1]):\n", - " axes.append(fig.add_subplot(1, patches.shape[1], a + 1))\n", - " plt.imshow(patches[0][a])\n", - "\n", - " # Adjust layout to prevent overlap and display the plot\n", - " fig.tight_layout()\n", - " plt.show()\n", - "\n", - "\n", - "def plot_projected_embeddings(embeddings, labels):\n", - " \"\"\"\n", - " Projects high-dimensional embeddings onto 2D space and plots them.\n", - "\n", - " Args:\n", - " embeddings: High-dimensional embedding vectors to project.\n", - " labels: Labels corresponding to each embedding for coloring in the plot.\n", - " \"\"\"\n", - "\n", - " # Import UMAP and Seaborn for dimensionality reduction and plotting\n", - " import umap\n", - " import seaborn as sns\n", - "\n", - " # Reduce the dimensionality of the embeddings to 2D using UMAP\n", - " projected_embeddings = umap.UMAP().fit_transform(embeddings)\n", - "\n", - " # Plot the 2D projections with labels using Seaborn for better aesthetics\n", - " plt.figure(figsize=(15, 8))\n", - " plt.title(\"Projected text embeddings\")\n", - " sns.scatterplot(\n", - " x=projected_embeddings[:, 0], y=projected_embeddings[:, 1], hue=labels\n", - " )\n", - "\n", - " # Display the plot\n", - " plt.show()\n", - "\n", - "\n", - "def plot_attention_weight_matrix(weight_matrix, x_ticks, y_ticks):\n", - " \"\"\"\n", - " Plots an attention weight matrix with custom axis ticks.\n", - "\n", - " Args:\n", - " weight_matrix: The 
attention weight matrix to plot.\n", - " x_ticks: Labels for the x-axis (typically the query tokens).\n", - " y_ticks: Labels for the y-axis (typically the key tokens).\n", - " \"\"\"\n", - "\n", - " # Set up the plot size\n", - " plt.figure(figsize=(15, 7))\n", - "\n", - " # Plot the attention weight matrix as a heatmap\n", - " ax = sns.heatmap(weight_matrix, cmap=\"Blues\")\n", - "\n", - " # Set custom ticks on the x and y axes\n", - " plt.xticks(np.arange(weight_matrix.shape[1]) + 0.5, x_ticks)\n", - " plt.yticks(np.arange(weight_matrix.shape[0]) + 0.5, y_ticks)\n", - "\n", - " # Label the plot\n", - " plt.title(\"Attention matrix\")\n", - " plt.xlabel(\"Attention score\")\n", - "\n", - " # Display the plot\n", - " plt.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "kMkaKekB_pR4" - }, - "outputs": [], - "source": [ - "# @title Helper Text Processing Functions. (Run Cell)\n", - "\n", - "def get_word2vec_embedding(words):\n", - " \"\"\"\n", - " Function that takes in a list of words and returns a list of their embeddings,\n", - " based on a pretrained word2vec encoder. Words that are not in the word2vec\n", - " vocabulary are skipped, so the returned word list may be shorter than the input.\n", - " \"\"\"\n", - " word2vec_sample = str(find(\"models/word2vec_sample/pruned.word2vec.txt\"))\n", - " model = gensim.models.KeyedVectors.load_word2vec_format(\n", - " word2vec_sample, binary=False\n", - " )\n", - "\n", - " output = []\n", - " words_pass = []\n", - " for word in words:\n", - " try:\n", - " # `get_vector` replaces the deprecated `word_vec`\n", - " output.append(jnp.array(model.get_vector(word)))\n", - " words_pass.append(word)\n", - " except KeyError:\n", - " # The word is not in the word2vec vocabulary, so skip it\n", - " pass\n", - "\n", - " embeddings = jnp.array(output)\n", - " del model # free up space again\n", - " return embeddings, words_pass\n", - "\n", - "\n", - "def remove_punctuation(text):\n", - " \"\"\"Function that takes in a string and removes all punctuation.\"\"\"\n", - " import re\n", - "\n", - " text = re.sub(r\"[^\\w\\s]\", \"\", text)\n", - " return text\n", - "\n", - "def print_sample(prompt: str, sample: str):\n", - " \"\"\"Function that 
takes in a prompt instruction and model response and\n", - " prints them out in different colors to show a distinction\"\"\"\n", - " print(colorama.Fore.MAGENTA + prompt, end=\"\")\n", - " print(colorama.Fore.BLUE + sample)\n", - " print(colorama.Fore.RESET)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4zu5cg-YG4XU" - }, - "source": [ - "## Let's kick things off with a Hugging Face Demo! Beginner\n", - "\n", - "We're thrilled to have you on board! 🎉 Before we dive into the hands-on part of our journey, let's take a quick detour into the fascinating world of [Hugging Face](https://huggingface.co/), an incredible open-source platform for building and deploying cutting-edge language models. 🌐\n", - "\n", - "As a sneak peek into what we'll be creating today, we'll start by loading a *small* large language model (small, that is, in comparison to today's models) and prompting it with a simple instruction. This will give you a feel for how to interact with these powerful libraries. 💡 Get ready to unlock the potential of language models with just a few lines of code!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AwjIIipOG4fz" - }, - "source": [ - "### Hugging Face\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N2DSHiuhG4f0" - }, - "source": [ - "\n", - "\n", - "\n", - "[Hugging Face](https://huggingface.co/) is a startup founded in 2016 and, in their own words, they \"are on a mission to democratize good machine learning, one commit at a time.\" Today they offer a treasure trove of tools for working on and with Large Language Models (LLMs).\n", - "\n", - "They have developed various open-source packages and allow users to easily interact with a large corpus of pretrained transformer models (across all modalities) and datasets to train or fine-tune pre-trained transformers. Their software is used widely in industry and research. 
For more details on them and usage, refer to [the 2022 attention and transformer practical](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb#scrollTo=qFBw8kRx-4Mk).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3xdt9PQ6G4f0" - }, - "source": [ - "In this colab we print prompts in pink and samples generated from a model in blue like in the example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "L-8C9SJCG4f0", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "e2f784b7-4189-4d16-aa68-33e34a60761c" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mMy fake prompt\u001b[34m is awesome!\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "print_sample(prompt='My fake prompt', sample=' is awesome!')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eq46TV_0G4f0" - }, - "source": [ - "### Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample\n", - "\n", - "Let's dive into how simple it is to load and interact with a model from Hugging Face!\n", - "\n", - "For this tutorial, we've pre-configured two model options:\n", - "\n", - "- **`gpt-neo-125M`**: A smaller model with 125 million parameters. It's faster and uses less memory—perfect for getting started! 
We recommend trying this one first.\n", - "- **`gpt2-medium`**: A larger model with 355 million parameters for more advanced use.\n", - "\n", - "If you want to switch models, just restart the Colab kernel and update the model name in the cell below.\n", - "\n", - "**Note**: The steps we're about to show work not only for these models but also for [all models](https://huggingface.co/models?pipeline_tag=text-generation) on Hugging Face that support text generation pipelines." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "QVV28V-TG4f1", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 571, - "referenced_widgets": [ - "d8f46e6226af431d9b7c6ecfa1c2769a", - "93a441fdf82141af81e85b2d5aec49b7", - "4eb2c2d0758f4061a3d0fc398018de28", - "fefd764b0cf2425c97cdab4506a81b9c", - "692c620c5e104d33849c3da34268f5cb", - "ecec6f9f968940078a4cf7a9105a3717", - "e352febd89fe462e945d40bcf421f7fd", - "f5a917167c914fdfaa86f1f71176eda2", - "4edc048323484d24b8c59bb8a2f1fab1", - "fe54b7356cf049d5b449b3fd69d64221", - "740307d65ab447658d3945abd47b3318", - "70969f782d9d48cebefa3ee64e3a04f5", - "111b2a654a424460b4917b5dad5fd69e", - "e35b59e079ab498dbe595cde7c984438", - "f2edac25c1e74b14a4ade39fe86dd040", - "4983de4c57c349af8cb8b6be78f64030", - "74926ca6f40e44c0887297ac44cbd577", - "467dfa92fdee4b9d850bd1eac5c502c6", - "52e8f51e283847d79bda1ee4977fd53f", - "aa3a63b55fe74989b92cfe8504695309", - "a78f96142d83418685734ffb9ec85ddd", - "d642dc3cabc64e66a8bf8853527f7165", - "8ca5d9a0316d4d85b88d45f5993bc21c", - "0dc54052e03245c3b149ed1fcb22b038", - "973cf8a73e3a478c888928030194793d", - "1668490995c74105ab6f11ab33ecaefb", - "3ec771810c9d472fa75278a76decf956", - "f505b2bbabae43219a02639d33501a32", - "f0a69ae2f0064d80825b0091932e7813", - "65e1cac3c45442b4a68711d410ef37c3", - "5602e00600884fdfbbe145694bcc0f90", - "f02e4e7118d64a2bb764909705049d18", - "c4710e641a1c421f96222ffb65bcedfe", - "98ca063c1b1548ce8de87647dcf23507", - 
"ec94052f637b4de3a172b3d5d9d32355", - "267fa0085440473c8762bfe21b5e9106", - "708f1e90e25c49c0bbfac3bc5c233e24", - "14a6f314f3ff4ad3b1f05933c50fc830", - "c69550aa99a34bada8ef67f61115d760", - "d9d79a644a1d42a38b70097c7f77dbad", - "af73f467b5bd4c38882bc27e3b3b5732", - "71eeff60e0f7464ca42483c4fe3c7bca", - "a546a75097e149f4a226453951d987da", - "3d5b4513d6ee4a73a392e4c16896e14c", - "d334e7a205704133b8549562700bca53", - "2e0e5880d12b4a37b673dcb3e455f47e", - "3c7d5cb0f9de4132a2b26428b72cb10a", - "49de67c36f9b4b66ba234bc606ffa0c4", - "ff8540ebd47e4dc39197eb6b19945d32", - "622fd5127db84c7388b9258ee7c4fdc1", - "7a191a40c1e245748de1f54f449afe37", - "3263b46210b544b989301e9edfc23473", - "2a64a5672e934b739e6280e2ab278da9", - "98577aad05654ff5aa46ca95121ac640", - "03251cffadcb429b9dbe87402bb8a4bb", - "f35d92a421d84260976fc4fba3d4527c", - "d96a8053de2f4aeabb8cf68474f6725f", - "441741ea09ac4ba29ac2dcd9f8cc0ade", - "445c478b29e844e99c45eb4e7a093e65", - "9bc0b47676434ca9a6c8c3f8af50d9aa", - "2555e6fbde6e45c2ba5f466701a1fb57", - "73d4afacfe3741bc90dfd70e55d763e3", - "2815ed72ee56478b81379eabd3cdc004", - "7ee8d64ff1344f7f8e6816e8ced5c5d7", - "71ce19a082fd4f28a1c5f4f29853f32d", - "9e1d5d7e79c9494598ca56079525dadc", - "991ab38f5ab142a2a053da131fca08e8", - "b14dc0d2bba84dbbb17b657ef5555132", - "51a7cfd306fc45498f8f84b579e6f05f", - "3e134af4ddc041aab1e0f8769b77e232", - "6b33bff41d3d438f9c81af109273f41e", - "ff568bf3a34b46edb2e50fb910608de3", - "9e1cb31c569f40118ae949cd8643e392", - "eeeca8ac591242edafe63e100421534e", - "68b138160a984dd686787cad890ff13c", - "90ef3af60627467d979237b57a8f411d", - "e93153b83b034468aad8516c644e8d55", - "8b87d73ab967490481a27011d5b53236", - "739e3997f7544b73aff099c31a3d4bed", - "730b750139364343aaee6bd28fde7f53", - "b41c51b1111943468cb50695e5ce1f84", - "6f3f01955c0847a19b2ca4e84a06d149", - "9084385cba634c01a9e746643e40e32c", - "9a694d9d63524748a8dd7f2ae59525d5", - "6541ce75ed414eac9428fc9e0a53d128", - "83436ad970f044c9abc1e79a7b7a749d", - 
"61cba98f8dae431ea588538cdc5a2e07", - "cbace4a6e6bb476bb9d45d0516e5f8de" - ] - }, - "outputId": "d2a38816-1879-4f98-e342-480790a82873" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "config.json: 0%| | 0.00/1.01k [00:00 str:\n", - " # This function generates text based on a given prompt using a language model,\n", - " # with options to control randomness, the number of tokens generated, and reproducibility.\n", - "\n", - " # Convert the prompt text into tokens that the model can process\n", - " inputs = tokenizer(prompt, return_tensors=\"pt\")\n", - "\n", - " # Extract the tokens (input IDs) and attention mask (to focus on important parts) from the inputs\n", - " input_ids = inputs[\"input_ids\"]\n", - " attention_mask = inputs[\"attention_mask\"]\n", - "\n", - " # Move the tokens and attention mask to the same device as the model (like a GPU if available)\n", - " input_ids = input_ids.to(model.device)\n", - " attention_mask = attention_mask.to(model.device)\n", - "\n", - " # Set up how we want the model to generate text\n", - " generation_config = transformers.GenerationConfig(\n", - " do_sample=True, # Allow the model to add some randomness to its text generation\n", - " temperature=temperature, # Adjust how random the output is; lower means more focused\n", - " top_p=top_p, # Consider the most likely 
words that make up the top 90% of possibilities\n", - " pad_token_id=tokenizer.pad_token_id, # Use the token ID that represents padding (extra space)\n", - " top_k=0, # We're not limiting to the top-k words, so we set this to 0\n", - " )\n", - "\n", - " # If a seed is provided, set it so that the results are repeatable (same output each time)\n", - " if seed is not None:\n", - " torch.manual_seed(seed)\n", - "\n", - " # Generate text using the model with the settings we defined\n", - " generation_output = model.generate(\n", - " input_ids=input_ids, # Provide the input tokens to the model\n", - " attention_mask=attention_mask, # Provide the attention mask to help the model focus\n", - " return_dict_in_generate=True, # Ask the model to return detailed information\n", - " output_scores=True, # Include the scores (confidence levels) for the generated tokens\n", - " max_new_tokens=max_new_tokens, # Set the maximum number of tokens to generate\n", - " generation_config=generation_config, # Apply our custom text generation settings\n", - " )\n", - "\n", - " # Make sure only one sequence (output) is generated, to keep things simple\n", - " assert len(generation_output.sequences) == 1\n", - "\n", - " # Get the generated sequence of tokens\n", - " output_sequence = generation_output.sequences[0]\n", - "\n", - " # Convert the generated tokens back into readable text\n", - " output_string = tokenizer.decode(output_sequence)\n", - "\n", - " # Print the prompt and the generated response\n", - " print_sample(prompt, output_string)\n", - "\n", - " # Return the generated text response\n", - " return output_string" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "id": "Yme6VzW4G4f1", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "2d99988d-a455-4929-f0fd-946ba4b69776" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mWhat is love?\u001b[34mWhat is love?\n", - "\n", - "Love is a 
term used to describe the way in which one person feels about others. It is a process that involves the emotional and physical interaction of the person, the relationship, and the relationship itself.\n", - "\n", - "Love is the ability to feel the love of another person.\n", - "\n", - "Love is the ability to feel\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "_ = run_sample(model, tokenizer, prompt=\"What is love?\", temperature = 0.5, seed=2)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V7vnUawyG4f1" - }, - "source": [ - "Pretty amazing, right? 🤩 Try playing around with the **prompt**, **temperature** and **seed** values above and see what different outputs you get. What do you notice when you increase the temperature? While this might have been mind-blowing back in 2021, by now, most of you have likely interacted with large language models in some way. Today, we're going to take things a step further by training our own **Shakespeare-inspired LLM**. This will give us a hands-on understanding of how these language models work under the hood.\n", - "\n", - "But before we jump into training, let’s first build a solid understanding of what **Large Language Models** are and the key **Machine Learning** concepts that make this groundbreaking technology possible. At the heart of today’s state-of-the-art (SoTA) LLMs are the **Attention Mechanism** and the **Transformer Architecture**. We’ll explore these essential concepts in the upcoming sections of this tutorial. 🚀💡\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-ZUp8i37dFbU" - }, - "source": [ - "## **1. Attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "acgW1ofF_RFz" - }, - "source": [ - "The attention mechanism is inspired by how humans would look at an image or read a sentence.\n", - "\n", - "Let us take the image of the dog in human clothes below (image and example [source](https://lilianweng.github.io/posts/2018-06-24-attention/)). 
When paying *attention* to the red blocks of pixels, we will say that the yellow block of pointy ears is something we expected (correlated) but that the grey blocks of human clothes are unexpected for us (uncorrelated). This is *based on what we have seen in the past* when looking at pictures of dogs, specifically one of a Shiba Inu.\n", - "\n", - "\"drawing\"\n", - "\n", - "Assume we want to identify the dog breed in this image. When we look at the red blocks of pixels, we tend to pay more *attention* to the pixels that are similar or relevant to them, which could be the ones in the yellow box. We almost completely ignore the snow in the background and the human clothing for this task.\n", - "\n", - "Alternatively, when we begin looking at the background in an attempt to identify what is in it, we subconsciously ignore the dog pixels because they are irrelevant to the current task." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "usLBF2g0x5gH" - }, - "source": [ - "The same thing happens when we read. In order to understand the entire sentence, we learn to correlate and *attend to* certain words based on the context of the entire sentence.\n", - "\n", - "\"drawing\"\n", - "\n", - "For instance, in the first sentence in the image above, when looking at the word \"coding\", we pay more attention to the words \"Apple\" and \"computer\" because we know that when we speak about coding, \"Apple\" is actually referring to the company. However, in the second sentence, we realise we should not consider \"apple\" when looking at \"code\" because, given the context of the rest of the sentence, we know that this apple is referring to an actual apple and not a computer.\n", - "\n", - "We can build better models by developing mechanisms that mimic attention. Doing so will enable our models to learn better representations of our input data by contextualising what they know about some parts of the input based on other parts. 
In the following sections, we will explore the mechanisms that enable us to train deep learning models to attend to input data in the context of other input data." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ygdi884ugGcu" - }, - "source": [ - "### Intuition - Beginner\n", - "\n", - "Imagine attention as a mechanism that allows a neural network to focus more on certain parts of data. By doing this, the network can enhance its grasp of the problem it's working on, updating its understanding or representations accordingly.\n", - "\n", - "### Understanding Attention in Simple Terms\n", - "\n", - "One way to implement attention in neural networks is by representing each word (or even parts of a word) as a vector.\n", - "\n", - "So, what’s a vector? A vector is simply an array of numbers (called real-valued numbers) that can have different lengths. Think of it like a list of values that describe certain properties of a word. These vectors allow us to measure how similar two words are to each other. One common way to measure this similarity is by calculating something called the **dot product**.\n", - "\n", - "The result of this similarity calculation is what we refer to as **attention.** This attention value helps the model decide how much one word should influence the representation of another word.\n", - "\n", - "In simpler terms, if two words have similar vector representations, it means they’re likely related or important to each other. Because of this relationship, they affect each other’s representations inside the neural network, allowing the model to understand the context better. 🎯\n", - "\n", - "To illustrate how the dot product can create meaningful attention weights, we'll use pre-trained [word2vec](https://jalammar.github.io/illustrated-word2vec/) embeddings. 
These word2vec embeddings are generated by a neural network that learned to create similar embeddings for words with similar meanings.\n", - "\n", - "By calculating the matrix of dot products between all vectors, we get an attention matrix. This will indicate which words are correlated and therefore should \"attend\" to each other.\n", - "\n", - "[1] You can find more details about how this is done for LLMs in the \"Building Your Own LLM\" session." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OvBYShCFk6WC" - }, - "source": [ - "**Code task** Intermediate: Complete the dot product attention function below." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "id": "yrbITGPnk7Ce" - }, - "outputs": [], - "source": [ - "def dot_product_attention(hidden_states, previous_state):\n", - " \"\"\"\n", - " Calculate the dot product between the hidden states and previous states.\n", - "\n", - " Args:\n", - " hidden_states: A tensor with shape [T_hidden, dm]\n", - " previous_state: A tensor with shape [T_previous, dm]\n", - " \"\"\"\n", - "\n", - " # Hint: To calculate the attention scores, think about how you can use the `previous_state` vector\n", - " # and the `hidden_states` matrix. You want to find out how much each element in `previous_state`\n", - " # should \"pay attention\" to each element in `hidden_states`. Remember that in matrix multiplication,\n", - " # you can find the relationship between two sets of vectors by multiplying one by the transpose of the other.\n", - " # Hint: Use `jnp.matmul` to perform the matrix multiplication between `previous_state` and the\n", - " # transpose of `hidden_states` (`hidden_states.T`).\n", - " scores = ... # FINISH ME\n", - "\n", - " # Hint: Now that you have the scores, you need to convert them into probabilities.\n", - " # A softmax function is typically used in attention mechanisms to turn raw scores into probabilities\n", - " # that sum to 1. 
This will help in determining how much focus should be placed on each hidden state.\n", - " # Hint: Use `jax.nn.softmax` to apply the softmax function to `scores`.\n", - " w_n = ... # FINISH ME\n", - "\n", - " # Multiply the weights by the hidden states to get the context vector\n", - " # Hint: Use `jnp.matmul` again to multiply the attention weights `w_n` by `hidden_states`\n", - " # to get the context vector.\n", - " c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - " return w_n, c_t" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": { - "id": "QARgTrNZlIqH", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "807606b7-9ead-4cca-b8a2-5c54e1f47bec" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "It looks like the function isn't fully implemented yet. Try modifying it.\n" - ] - } - ], - "source": [ - "# @title Run me to test your code\n", - "\n", - "key = jax.random.PRNGKey(42)\n", - "x = jax.random.normal(key, [2, 2])\n", - "\n", - "try:\n", - " w_n, c_t = dot_product_attention(x, x)\n", - "\n", - " w_n_correct = jnp.array([[0.9567678, 0.04323225], [0.00121029, 0.99878967]])\n", - " c_t_correct = jnp.array([[0.11144122, 0.95290256], [-1.5571996, -1.5321486]])\n", - " assert jnp.allclose(w_n_correct, w_n), \"w_n is not calculated correctly\"\n", - " assert jnp.allclose(c_t_correct, c_t), \"c_t is not calculated correctly\"\n", - "\n", - " print(\"It seems correct. Look at the answer below to compare methods.\")\n", - "except:\n", - " print(\"It looks like the function isn't fully implemented yet. 
Try modifying it.\")" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "Qa6PyKYnkzUJ" - }, - "outputs": [], - "source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!)\n", - "# Note for the plotting cell below: when changing the words there, keep in mind that a word\n", - "# missing from the original training corpus will not be shown in the weight matrix plot.\n", - "def dot_product_attention(hidden_states, previous_state):\n", - " # Calculate the attention scores:\n", - " # Multiply the previous state matrix by the transpose of the hidden states matrix.\n", - " # This gives us a matrix of scores that show how much attention each element in the previous state\n", - " # should pay to each element in the hidden states.\n", - " # The result is a matrix of shape [T_previous, T_hidden], where:\n", - " # T_previous is the number of elements in the previous state,\n", - " # T_hidden is the number of elements in the hidden states.\n", - " scores = jnp.matmul(previous_state, hidden_states.T)\n", - "\n", - " # Apply the softmax function to the scores to convert them into probabilities.\n", - " # By default, softmax normalizes over the last axis, so each row sums to 1,\n", - " # allowing us to interpret a row as how much attention should be given to each hidden state.\n", - " w_n = jax.nn.softmax(scores)\n", - "\n", - " # Calculate the context vector (c_t):\n", - " # Multiply the attention weights (w_n) by the hidden states.\n", - " # This combines the hidden states based on how much attention each one deserves,\n", - " # resulting in a new vector that represents the weighted sum of the hidden states.\n", - " # The resulting shape is [T_previous, dm], where:\n", - " # T_previous is the number of elements in the previous state,\n", - " # dm is the dimension of the hidden states.\n", - " c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - " # Return the attention weights and the context vector.\n", - " return w_n, c_t\n" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": { - 
"id": "QlHL3e_QhLfq", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 653 - }, - "outputId": "b3b0ab21-a262-49e3-ae21-2ab5a4ad3f37" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - ":17: DeprecationWarning: Call to deprecated `word_vec` (Use get_vector instead).\n", - " output.append(jnp.array(model.word_vec(word)))\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T08:53:15.411377\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - } - ], - "source": [ - "words = [\"king\", \"queen\", 
\"royalty\", \"food\", \"apple\", \"pear\", \"computers\"]\n", - "word_embeddings, words = get_word2vec_embedding(words)\n", - "weights, _ = dot_product_attention(word_embeddings, word_embeddings)\n", - "plot_attention_weight_matrix(weights, words, words)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tItZU09YlhEZ" - }, - "source": [ - "Looking at the matrix, we can see which words have similar meanings. The \"royal\" group of words have higher attention scores with each other than the \"food\" words, which all attend to one another. We also see that \"computers\" have very low attention scores for all of them, which shows that they are neither very related to \"royal\" or \"food\" words. \n", - "\n", - "**Group task:**\n", - " - Play with the word selections above. See if you can find word combinations whose attention values seem counter-intuitive. Think of possible explanations. Which sense of a word did the attention scores capture?\n", - " - Ask your friend if they found examples." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "S3iB8hf0hJdX" - }, - "source": [ - "**Note**: The dot product is only one of the ways to implement the scoring function for attention mechanisms; a more extensive list is given in this [blog](https://lilianweng.github.io/posts/2018-06-24-attention/#summary) post by Dr Lilian Weng.\n", - "\n", - "More resources:\n", - "\n", - "[A basic encoder-decoder model for machine translation](https://www.youtube.com/watch?v=gHk2IWivt_8&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=1)\n", - "\n", - "[Training and loss for encoder-decoder models](https://www.youtube.com/watch?v=aBZUTuT1Izs&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=2)\n", - "\n", - "[Basic attention](https://www.youtube.com/watch?v=BSSoEtv5jvQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=6)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aQfqM1EJyDXI" - }, - "source": [ - "### Sequence to sequence attention mechanisms - Intermediate\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "68QBeG-4yDZ9" - }, - "source": [ - "The first attention mechanisms were used in sequence-to-sequence models. These models were usually RNN encoder and decoder structures. The input sequence was processed sequentially by an RNN, which encoded the sequence in a single context vector; this vector was then fed into another RNN that generated a new sequence. Below is an example of this ([source](https://lilianweng.github.io/posts/2018-06-24-attention/)).\n", - "\n", - "\n", - "\"drawing\"\n", - "\n", - "Since there is only one context vector, it is challenging for the encoder to represent long sequences and information typically gets lost. 
The attention mechanism introduced in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) was proposed to solve this.\n", - "\n", - "Here, instead of relying on one static context vector, which is also only used once in the decoding process, let us provide information on the entire input sequence at every decoding step using a dynamic context vector. By doing this, the decoder can access a larger \"bank\" of memory and attend to the input's required information based on the current decoder RNN output state, $s_t$. This is shown below.\n", - "\n", - "\"drawing\"\n", - "\n", - "In deep learning, attention can be interpreted as a vector of \"importance.\" To predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate how strongly it is correlated with, or \"attends to,\" other elements using the attention vector/weights. These attention weights are then used to generate a new weighted sum of the remaining elements, which represents the target [(source)](https://lilianweng.github.io/posts/2018-06-24-attention/).\n", - "\n", - "\n", - "This usually consists of three steps for each decoding step $t$:\n", - "\n", - "1. Calculate the score (importance) for each encoder hidden state $h_n$, given $s_{t-1}$, and use the softmax function to transform the scores into attention weights, $w_{n}$.\n", - "    - $\\text{score} = a(s_{t-1}, h_{n})$, where $a$ can be any differentiable function, such as the dot product.\n", - "    - $w_{n} = \\frac{\\exp \\left\\{a\\left(s_{t-1}, h_{n}\\right)\\right\\}}{\\sum_{j=1}^{N} \\exp \\left\\{a\\left(s_{t-1}, h_{j}\\right)\\right\\}}$, where we use the softmax function to transform the raw scores to relative attention weights.\n", - "2. Generate the final context vector, $c_t$, by summing the products of the attention weights and the encoder hidden states.\n", - "    - $c_t=\\sum_{n=1}^{N} w_n h_{n}$\n", - "3. 
Generate the next decoder state, $s_t$, by combining the previous decoder state, $s_{t-1}$, with the context vector, $c_t$, via some function, $f$.\n", - "\n", - "    - $s_t = f\\left ( c_t, s_{t-1} \\right)$\n", - "\n", - "    In Bahdanau et al., 2015, $f$ was a gated recurrent unit and the scoring function $a(s_{t-1}, h_{n})$ was a small learned feedforward network (additive attention); the dot product is a simpler alternative for $a$.\n", - " \n", - "Next, let us build up the attention scheme used in the transformer architecture. We've already calculated simple dot product attention, where the score was given by $a(s_{t-1}, h_n)=s_{t-1} h_n^\\top$, and we're going to use the same idea again." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "J-MU6rrny8Nj" - }, - "source": [ - "### Self-attention to Multihead Attention - Intermediate\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BRuLtxNey_EQ" - }, - "source": [ - "Self-attention and multi-head attention (MHA) are fundamental components of the transformer architecture. In this section, we'll thoroughly explain the intuition behind these concepts and their implementation. Later, in the **Transformers** section, you'll learn how these attention mechanisms are used to create a sequence-to-sequence model that relies entirely on attention.\n", - "\n", - "As we move forward, we'll represent sentences by breaking them down into individual words and encoding each word using the word2vec model discussed earlier. In the Transformers section, we'll explore in more detail how input sequences are transformed into a series of vectors."
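
Before moving on, the decoding steps listed in the sequence-to-sequence section above (score, softmax, weighted sum) can be sketched in plain Python. This is only an illustrative sketch: the helper names (`softmax`, `dot`, `attention_context`) and the toy vectors are assumptions made for this example, the dot product is used as the scoring function $a$, and the notebook itself works with JAX arrays rather than Python lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(s_prev, encoder_states):
    """One attention step: score every h_n against s_{t-1}, normalise, then sum."""
    # step 1a: raw scores a(s_{t-1}, h_n), here the dot product
    scores = [dot(s_prev, h) for h in encoder_states]
    # step 1b: softmax turns the raw scores into attention weights w_n
    weights = softmax(scores)
    # step 2: context vector c_t = sum_n w_n * h_n
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights


# hypothetical toy values: three encoder states and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [2.0, 0.0]

context, weights = attention_context(s_prev, encoder_states)
print("weights:", weights)
print("context:", context)
```

The first and third encoder states have the same dot product with `s_prev`, so they receive equal weight, while the second (orthogonal to `s_prev`) receives the smallest weight. Step 3, combining `context` with the decoder state through $f$, is left out because it depends on the chosen recurrent unit.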
- ] - }, - { - "cell_type": "code", - "source": [ - "def embed_sentence(sentence):\n", - "    \"\"\"\n", - "    Embed a sentence using word2vec; for example use cases only.\n", - "    \"\"\"\n", - "    # clean sentence (not necessary if using a proper LLM tokenizer)\n", - "    sentence = remove_punctuation(sentence)\n", - "\n", - "    # extract individual words\n", - "    words = sentence.split()\n", - "\n", - "    # get the word2vec embedding for each word in the sentence\n", - "    word_vector_sequence, words = get_word2vec_embedding(words)\n", - "\n", - "    # return with extra dimension (useful for creating batches later)\n", - "    return jnp.expand_dims(word_vector_sequence, axis=0), words" - ], - "metadata": { - "id": "J2z6-NckgNT-" - }, - "execution_count": 39, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0AFUEFZGzCTv" - }, - "source": [ - "#### Self-attention" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LF2V3KI-za9l" - }, - "source": [ - "Self-attention is an attention mechanism where each vector of a given input sequence attends to the entire sequence. To gain an intuition for why self-attention is important, let us think about the following sentence (example taken from [source](https://jalammar.github.io/illustrated-transformer/)):\n", - "\n", - "`\"The animal didn't cross the street because it was too tired.\"`\n", - "\n", - "A simple question about this sentence is: what does the word \"it\" refer to? Even though the question looks simple, it can be tough for an algorithm to learn the answer. This is where self-attention comes in, as it can learn an attention matrix for the word \"it\" where a large weight is assigned to the word \"animal\".\n", - "\n", - "Self-attention also allows the model to learn how to interpret words that share the same embedding, such as \"apple\", which can be a company or a food, depending on the context. 
The resulting context-dependent representation is very similar to the hidden state found within an RNN, but, as you will see, self-attention allows the model to attend over the entire sequence in parallel, which makes longer sequences practical.\n", - "\n", - "Self-attention consists of three concepts:\n", - "\n", - "- Queries, keys and values\n", - "- Scaled dot product attention\n", - "- Masks" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pwOIMtdZzdTf" - }, - "source": [ - "##### **Queries, keys and values**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mEf7QWIWzdo1" - }, - "source": [ - "Typically all attention mechanisms can be written in terms of `key-value` pairs and `queries` to calculate the attention matrix and new context vector.\n", - "\n", - "To gain intuition, one can interpret the `query` vector as containing the information we are interested in obtaining and the `key` vectors as having some information. The `query` vectors are compared to the `key` vectors to get attention scores, where a higher attention score indicates that a `key` had relevant information. These attention scores are then used to determine which `values` (which are paired with the `keys`) we should attend to. Or as [Lena Voita](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html) puts it:\n", - "\n", - "- Query: asking for information\n", - "- Key: saying that it has some information\n", - "- Value: giving the information\n", - "\n", - "In transformer architectures, we use learnable weight matrices, represented as $W_Q,W_K,W_V$, to project each sequence vector to unique $q$, $k$, and $v$ vectors.\n", - "\n", - "\"drawing\"\n", - "\n", - "You will notice that the vectors $q,k,v$ are smaller in size than the input vectors. 
This will be covered at a later stage, but just know that it is a design choice for transformers and not a requirement for attention to work.\n", - "\n", - "This process can also be parallelised, as the input sequence can be represented as a matrix $X$ (one row per token), which can be transformed into query, key, and value matrices $Q$, $K$, and $V$ respectively:\n", - "\n", - "$Q=XW_Q \\\\ K=XW_K \\\\ V=XW_V$\n", - "\n", - "Below we show the code that creates three linear layers, which project the input data to the $Q,K,V$ matrices, where the output size can be adjusted." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": { - "id": "Xc8zjK6eziIV" - }, - "outputs": [], - "source": [ - "class SequenceToQKV(nn.Module):\n", - "    output_size: int\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, X):\n", - "\n", - "        # define the method for weight initialisation\n", - "        initializer = nn.initializers.variance_scaling(scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\")\n", - "\n", - "        # initialise three linear layers to do the QKV transformations.\n", - "        # note: this can also be one layer, how do you think you would do it?\n", - "        q_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "        k_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "        v_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "\n", - "        # transform and return the matrices\n", - "        Q = q_layer(X)\n", - "        K = k_layer(X)\n", - "        V = v_layer(X)\n", - "\n", - "        return Q, K, V" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OhGZHFsHz_Qp" - }, - "source": [ - "##### **Scaled dot product attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DxycHDUW0BVE" - }, - "source": [ - "Now that we have our `query`, `key` and `value` matrices, it is time to calculate the attention matrix. 
Remember that in all attention mechanisms we must first find a score for each vector in the sequence and then use these scores to create a new context vector. In self-attention, scoring is done using scaled dot product attention, and then the normalised scores are used as weights to sum the value vectors and create the context vector.\n", - "\n", - "$\\operatorname{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right) V$\n", - "\n", - "where the attention weights are calculated by $\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right)$ and the weights are then multiplied by $V$ to get the context vectors.\n", - "\n", - "\n", - "What happens here is similar to what we did in the dot product attention in the previous section, just applying the mechanism to the sequence itself. For each element in the sequence, we calculate the attention weights between its query $q_i$ and all keys in $K$. We then multiply each value vector in $V$ by its weight and finally sum all weighted vectors $v_{weighted}$ together to form a new representation for $q_i$. 
By doing this, we are essentially drowning out irrelevant vectors and bringing up important vectors in the sequence when our focus is on $q_i$.\n", - "\n", - "$QK^\\top$ is divided by the square root of the key dimension, $\\sqrt{d_k}$. Without this scaling, the dot products grow with $d_k$ and push the softmax into regions with very small gradients, so the scaling ensures more stable gradients during training.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": { - "id": "i_UYNzrS0Hga" - }, - "outputs": [], - "source": [ - "def scaled_dot_product_attention(query, key, value):\n", - "    \"\"\"\n", - "    Formula to return scaled dot product attention given QKV matrices\n", - "    \"\"\"\n", - "    d_k = key.shape[-1]\n", - "\n", - "    # get the raw scores (logits) by taking the dot product of the queries and keys\n", - "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "\n", - "    # scale the raw scores and apply the softmax function to get the attention scores/weights\n", - "    scaled_logits = logits / jnp.sqrt(d_k)\n", - "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", - "\n", - "    # multiply the weights by the value matrix to get the output\n", - "    output = jnp.matmul(attention_weights, value)\n", - "\n", - "    return output, attention_weights" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cuNaEjIm0PhV" - }, - "source": [ - "Let's now see scaled dot product attention in action. We will take a sentence, embed each word using word2vec, and see what the final self-attention weights look like.\n", - "\n", - "We will not use the linear projection layers, since we would need to train them first. Instead, we are going to make things simple and use $X=Q=V=K$."
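
As a rough numerical illustration of the scaling by $\sqrt{d_k}$ discussed above, the sketch below compares the softmax of raw and scaled dot products for random vectors. It is a stdlib-only sketch under stated assumptions: the dimension `d_k = 256`, the number of keys, and the random seed are arbitrary choices for this example, and the helper names are not part of the notebook.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


random.seed(0)  # arbitrary seed, only for repeatability
d_k = 256       # assumed key dimension for this illustration

# one query and five keys with unit-variance entries
query = [random.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(5)]

raw_scores = [dot(query, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("largest weight without scaling:", max(unscaled_weights))
print("largest weight with scaling:   ", max(scaled_weights))
```

With unit-variance entries the raw dot products have a standard deviation of roughly $\sqrt{d_k}$, so the unscaled softmax tends to put almost all of its mass on one key. Dividing by $\sqrt{d_k}$ can only flatten the distribution, which is the regime where softmax gradients are better behaved.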
- ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": { - "id": "3Oy2sWzR0Ok5", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 653 - }, - "outputId": "633491e8-daef-48e6-be3b-b2603e3f750a" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - ":17: DeprecationWarning: Call to deprecated `word_vec` (Use get_vector instead).\n", - " output.append(jnp.array(model.word_vec(word)))\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "image/svg+xml": "[SVG output stripped: attention weight matrix plot; Matplotlib v3.7.1, 2024-08-30T08:59:47.470126]\n" - }, - "metadata": {} - } - ], - "source": [ - "# define a sentence\n", - "sentence = \"I drink coke, but eat steak\"\n", - "\n", - "# embed and create QKV matrices\n", - "word_embeddings, words = embed_sentence(sentence)\n", - "Q = K = V = word_embeddings\n", - "\n", - "# calculate weights and plot\n", - "outputs, attention_weights = scaled_dot_product_attention(Q, K, V)\n", - "\n", - "# plot the words and the attention weights between them\n", - "words = remove_punctuation(sentence).split()\n", - "plot_attention_weight_matrix(attention_weights[0], 
words, words)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NG1Kxljr0Vzw" - }, - "source": [ - "Keep in mind that we have not trained any attention weights yet. Even so, using the word2vec vectors as our sequence, we can see that scaled dot product attention is already capable of attending to \"eat\" when \"steak\" is our query, and that the query \"drink\" attends more to \"coke\" and \"eat\".\n", - "\n", - "More resources:\n", - "\n", - "[Attention with Q,K,V](https://www.youtube.com/watch?v=k-5QMalS8bQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=7)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7B-AgO80gIt" - }, - "source": [ - "##### **Masked attention**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tdRoKsu70gGW" - }, - "source": [ - "There are cases where applying self-attention over the entire sequence is not desirable. These can include:\n", - "\n", - "- Uneven length sequences batched together.\n", - "    - When sending a batch of sequences through a network, self-attention expects each sequence to be the same length. One handles this by padding the shorter sequences. When calculating attention, ideally, these padding tokens should not be taken into consideration.\n", - "- Training a decoder model.\n", - "    - When training decoder models, such as GPT-3, the decoder has access to the entire target sequence (as training is done in parallel). In order to prevent the model from cheating by looking at future tokens, we have to mask the future sequence data so that earlier positions cannot attend to it.\n", - "\n", - "By applying a mask to the final score calculated between queries and keys, we can remove the influence of the unwanted sequence vectors. 
**The vectors are masked by making the score between the query and their respective keys a VERY large negative value.** This results in the softmax function pushing the attention weight very close to zero, so the corresponding value vector contributes (almost) nothing to the weighted sum and does not influence the final representation.\n", - "\n", - "\n", - "Putting everything together, masked scaled dot product attention visually looks like this:\n", - "\n", - "\"drawing\".\n" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": { - "id": "5Syx8_5E0eM9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 438 - }, - "outputId": "2922f653-cb02-43cd-9abf-5565d2cd0baf" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "image/svg+xml": "[SVG output stripped: heatmap titled 'Example of mask that can be applied'; Matplotlib v3.7.1, 2024-08-30T09:01:11.433530]\n" - }, - "metadata": {} - } - ], - "source": [ - "# example of building a mask for tokens of size 32\n", - "# the mask makes sure that positions only attend to previous positions in the input (causal mask)\n", - "# we will use this later to insert -inf values into the raw scores\n", - "mask = jnp.tril(jnp.ones((32, 32)))\n", - "\n", - "# plot\n", - "sns.heatmap(mask, cmap=\"Blues\")\n", - "plt.title(\"Example of mask that can be applied\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pfwTJrQ20gDw" - }, - "source": [ - "Let's now adapt our scaled dot product attention function to implement masked attention."
- ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": { - "id": "PVHpyNs_0ePh" - }, - "outputs": [], - "source": [ - "def scaled_dot_product_attention(query, key, value, mask=None):\n", - "    \"\"\"\n", - "    Scaled dot product attention with an optional mask (e.g. a causal mask, where positions may only attend to previous positions)\n", - "    \"\"\"\n", - "    d_k = key.shape[-1]\n", - "    T_k = key.shape[-2]\n", - "    T_q = query.shape[-2]\n", - "\n", - "    # get scaled logits using dot product as before\n", - "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "    scaled_logits = logits / jnp.sqrt(d_k)\n", - "\n", - "    # apply the optional mask: positions where the mask is 0 are set to -inf\n", - "    if mask is not None:\n", - "        scaled_logits = jnp.where(mask[:T_q, :T_k], scaled_logits, -jnp.inf)\n", - "\n", - "    # calculate the attention weights via softmax\n", - "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", - "\n", - "    # take the weighted sum of the values to get the output\n", - "    output = jnp.matmul(attention_weights, value)\n", - "\n", - "    return output, attention_weights" - ] - }, - { - "cell_type": "markdown", - "source": [ - "##### **Multi-head attention**" - ], - "metadata": { - "id": "OWDubQwCs4zG" - } - }, - { - "cell_type": "markdown", - "source": [ - "The attention mechanism we've covered so far successfully allows the model to focus on different positions in the input. In practice, the transformer architecture uses a subtle variation of this mechanism, called multi-head attention (MHA).\n", - "\n", - "The distinction is minimal: rather than computing the attention only once, the MHA mechanism runs the scaled dot-product attention multiple times in parallel. According to the paper, *Attention is All You Need*, \"multi-head attention allows the model to **jointly attend** to information from different representation subspaces at different positions. 
With a single attention head, averaging inhibits this.\"\n", - "\n", - "Multi-head attention can be viewed as a similar strategy to stacking convolution kernels in a CNN layer. This allows the kernels to focus on and learn different features and rules, which is why multiple heads of attention also work.\n", - "\n", - "The figure below shows how basic MHA works. The scaled dot product attention discussed earlier is just repeated $N$ times ($N=2$ in this figure), with three learnable projection matrices per head ($3N$ in total). The outputs from the different heads are then concatenated, and the result is fed through a linear projection, which produces the final representation.\n", - "\n", - "In practice, MHA significantly outperforms single-head attention.\n", - "\n", - "\"drawing\"\n" - ], - "metadata": { - "id": "nHkyjyErsYae" - } - }, - { - "cell_type": "markdown", - "source": [ - "Let's take a look at how to implement multi-head attention. In simple terms, multi-head attention is like running the attention process multiple times in parallel, using a different set of Q, K, and V projections for each \"head.\" This helps the model focus on different parts of the input at the same time. If you're interested in learning more, check out [this blog by Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention) for a detailed explanation."
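
The splitting into heads and the concatenation afterwards are the only new bookkeeping in MHA, and they can be illustrated on plain Python lists. This is a toy sketch under stated assumptions: `split_heads` and `merge_heads` are hypothetical helper names for this example, a single unbatched sequence is used, and the actual implementation works on batched JAX arrays with `reshape` and `swapaxes`.

```python
def split_heads(x, num_heads):
    """Split every token vector of length d_m into num_heads contiguous chunks.

    x is a list of T token vectors; the result is indexed [head][token][feature].
    """
    d_m = len(x[0])
    assert d_m % num_heads == 0, "d_m must be divisible by num_heads"
    head_size = d_m // num_heads
    return [
        [token[h * head_size:(h + 1) * head_size] for token in x]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head chunks of every token."""
    num_tokens = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(num_tokens)
    ]


# toy sequence: T = 3 tokens with d_m = 8 features each
x = [[float(t * 8 + i) for i in range(8)] for t in range(3)]

heads = split_heads(x, num_heads=2)
print("heads:", len(heads), "tokens per head:", len(heads[0]), "head_size:", len(heads[0][0]))

# in MHA, attention would be applied to each head separately at this point
restored = merge_heads(heads)
print("round trip matches input:", restored == x)
```

Each head sees all tokens but only `d_m // num_heads` of the features, so the total amount of computation stays close to that of a single head operating on the full dimension.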
- ], - "metadata": { - "id": "vtuqNCln9EWW" - } - }, - { - "cell_type": "code", - "source": [ - "class MultiHeadAttention(nn.Module):\n", - "    num_heads: int  # Number of attention heads\n", - "    d_m: int  # Dimension of the model's embeddings\n", - "\n", - "    def setup(self):\n", - "        # Initialize the sequence-to-QKV transformation module\n", - "        self.sequence_to_qkv = SequenceToQKV(self.d_m)\n", - "\n", - "        # Define the initializer for the output linear layer weights\n", - "        initializer = nn.initializers.variance_scaling(\n", - "            scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\"\n", - "        )\n", - "\n", - "        # Initialize the output projection layer Wo (used after attention)\n", - "        self.Wo = nn.Dense(self.d_m, kernel_init=initializer)\n", - "\n", - "    def __call__(self, X=None, Q=None, K=None, V=None, mask=None, return_weights=False):\n", - "        # If any of Q, K, or V is not provided, use the input X to generate them\n", - "        if Q is None or K is None or V is None:\n", - "            assert X is not None, \"X has to be provided if any of Q, K, or V is not provided\"\n", - "\n", - "            # Generate Q, K, and V matrices from the input X\n", - "            Q, K, V = self.sequence_to_qkv(X)\n", - "\n", - "        # Extract the batch size (B), sequence length (T), and embedding size (d_m)\n", - "        B, T, d_m = K.shape\n", - "\n", - "        # Calculate the size of each attention head's embedding (d_m / num_heads)\n", - "        head_size = d_m // self.num_heads\n", - "\n", - "        # Reshape Q, K, V to have separate dimensions for the heads\n", - "        # B, T, d_m -> B, T, num_heads, head_size -> B, num_heads, T, head_size\n", - "        q_heads = Q.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "        k_heads = K.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "        v_heads = V.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "\n", - "        # Apply scaled dot-product attention to each head\n", - "        attention, attention_weights = scaled_dot_product_attention(\n", - "            q_heads, k_heads, v_heads, mask\n", - "        )\n", - "\n", - "        # 
Reshape the attention output back to its original dimensions\n", - "        # (B, num_heads, T, head_size) -> (B, T, num_heads, head_size) -> (B, T, d_m)\n", - "        attention = attention.swapaxes(1, 2).reshape(B, T, d_m)\n", - "\n", - "        # Apply the output linear transformation Wo to the attention output\n", - "        X_new = self.Wo(attention)\n", - "\n", - "        # If return_weights is True, return both the transformed output and attention weights\n", - "        if return_weights:\n", - "            return X_new, attention_weights\n", - "        else:\n", - "            # Otherwise, return just the transformed output\n", - "            return X_new" - ], - "metadata": { - "id": "BY2xXLMQ9CB6" - }, - "execution_count": 47, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e9NW58_3hAg2" - }, - "source": [ - "## **2. Building your own LLM** " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bA_2coZvhAg3" - }, - "source": [ - "### 2.1 High-level overview Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BflycqAw_RF8" - }, - "source": [ - "The Transformer architecture was famously introduced in the paper entitled [Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al.\n", - "\n", - "As the title of the paper suggests, such an architecture consists of basically only attention mechanisms along with feed-forward layers and linear layers, as shown in the diagram below.\n", - "\n", - "\n", - "\n", - "The Transformer and its variations are at the core of Large Language Models, and it's not an exaggeration to say that almost all language models out there are Transformer-based architectures.\n", - "\n", - "As you can see in the diagram, the original Transformer architecture consists of two parts: one that receives inputs, usually called the encoder, and another that receives outputs (i.e. targets), called the decoder. 
This is because the transformer was designed for machine translation.\n", - "\n", - "The encoder will receive an input sentence in one language and process it through multiple stacked `encoder blocks`. This creates a final representation, which contains helpful information necessary for the decoding task. This output is then fed into stacked `decoder blocks` that produce new outputs in an autoregressive manner.\n", - "\n", - "The encoder consists of $N$ identical blocks, which process a sequence of token vectors sequentially. These blocks consist of 3 parts:\n", - "\n", - "1. A multi-head attention block. These are the transformer architecture's backbone. They process the data to generate representations for each token, ensuring that the necessary information for the task at hand is represented in the vectors. This is exactly the MHA block we covered in the attention section previously.\n", - "2. An MLP (Multi-Layer Perceptron, i.e. a neural network with multiple layers), applied to each input token separately and identically.\n", - "3. Residual connections: one adds the input tokens to the attended representations, and another adds the input of the MLP to its output. For both these connections, the result is normalized using layernorm. In certain implementations, these normalization steps are applied to the inputs rather than the outputs. Just like a ResNet, transformers are designed to be very deep models, so these add and norm blocks are essential for a smooth gradient flow.\n", - "\n", - "Similarly, the decoder consists of $N$ identical blocks, but the parts inside each block differ from those of the encoder. Concretely, the different parts are:\n", - "\n", - "1. A masked multi-head attention block. This is an MHA block that performs _self-attention_ on the output sequence, but the computation is restricted to the tokens that have already been seen. In other words, future tokens are blocked when making predictions.\n", - "2. 
A multi-head attention block. This block receives the output of the final encoder block, the transformed tokens, and uses that as the key-value pairs, while using the output of the first MHA block as the query. In doing this, the model attends over the input required to perform the sequence task. This MHA block thus performs _cross-attention_ by looking at the encoder outputs.\n", - "3. An MLP, the same as in the encoder.\n", - "4. Residual connections, the same as in the encoder.\n", - "\n", - "Starting from this original architecture, several variations have been proposed, some focusing on the encoder only and others on the **decoder only**. Large language models (LLMs) such as GPT-2, GPT-3 and Turing-NLG were born out of decoder-only architectures. These architectures look like:\n", - "\n", - "\"drawing\"\n", - "\n", - "with the cross-attention block missing, as no encoder output is available. So, to build a language model, we will focus on the decoder-only architecture as seen above.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fbTsk0MdhAhC" - }, - "source": [ - "### 2.2 Tokenization + Positional encoding Beginner\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DehUpfym_RF8" - }, - "source": [ - "#### 2.2.1 Tokenization" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uBiFpVBu_RF9" - }, - "source": [ - "\n", - "Transformers cannot handle raw strings of text. To process text, it is first split up into tokens. The tokens are then indexed and each token is assigned an embedding of size $d_{model}$. These embeddings can be learned during training or can come from a pretrained vocabulary of embeddings. This new sequence of token embeddings is then fed into the transformer architecture. 
This idea is visualised below.\n", - "\n", - "\\\\\n", - "\n", - "\"drawing\"\n", - "\n", - "\n", - "These token IDs are typically predicted when a model generates text, fills in missing words, etc.\n", - "\n", - "This process of splitting up text into tokens and assigning an ID to each token is called [tokenisation](https://huggingface.co/docs/transformers/tokenizer_summary). There are various ways to tokenise text, with some methods being trained directly from the data. When using pre-trained transformers, it is crucial to use the same tokeniser that was used to train the model. The previous link has in-depth descriptions of many widely known techniques.\n", - "\n", - "Below we show how the [BERT](https://arxiv.org/abs/1810.04805) model's tokeniser tokenises a sentence. We use [Hugging Face](https://huggingface.co/) for this part.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": { - "id": "hJBMvlUA_RF9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 162, - "referenced_widgets": [ - "fae0de63c22e4ed8afdc9b3554df9dbc", - "cce88c6717f14dbc8d601a5c306380ac", - "5f825f83172f4f3083c63b4f6c6df441", - "88985dba38b64de2b81a0b2da28e801b", - "534bd01df3c84807bd6a7b15d7279847", - "d8eab87c7cc940388dece8670e466a4e", - "285d1ffdac3149a9b2a17c108d35cf15", - "f18d22e0eeac4ca7953d0f87da7fac4a", - "dce8aa0a4a14409e84bf723f84ff5be6", - "e15e6eb22be4477d8ef216343b7371de", - "801ac3fc2ff34d42a65c0898edf5d07d", - "6b23d26a6c6044029025fab0dfbd3555", - "cb48b8b12a584c599917844e958cd69e", - "d91590c0670345b7a572e7f437353be3", - "663af63865a34844b6dc4ccea7df2ed9", - "913668a324b7433f9235fcdb4bc2a644", - "6ed2b83044fa4b4ab0586719ef9edc96", - "87e9e502cd3d48be83cda4999cac92ee", - "e345de9ae283492887f53c83214538b4", - "b40ea73a453845d3ac21caf4b112165a", - "04f2eb37159e4cf79a2b61e1c402d2a6", - "525a68ecc63b445db2a2eba949679625", - "9aa27129beed4d4ebcb728ea46db7294", - "f8d48f5c510e43b586f7cbadf0ac383e", - "e7a1b65310404ae1a4a685cddce1d727", - 
"58e3bb783e934b81b2457e88fff1c3c6", - "7490d00a7342421eb38a403386e6df64", - "0303f59b0de44e8ea1cb3d1a32147589", - "454d8927472d417081373a30fcc0f919", - "3618690067c7433a8aa81c8ebea5d1a3", - "63bbb5e415c44e15bbc5c88ec9759d4b", - "5878f8b4e8b14f1d9ed602585b6634d0", - "52f1ae9c77b04ce1ad0d9b1a7e8270ab", - "d54d176daad2416ea06ae3e1e1660592", - "589b2d7b489946dbaf2265d20542fd66", - "49da7f1ff2f74d208a0a440430f8845f", - "3952eb993b7d477eaf132061ad355194", - "4804b336a4ba4e4286b352768c4789a8", - "bfd29ef47cfd4db6aa5bb12f05af5779", - "437c0e32e3d84087a8d0b68dbc31f0db", - "31ad90c612e14196873b401e8862ea38", - "bf1eab7f180f45e5b378ceafca1746f4", - "bbca117899a64397b26a512911ba8868", - "fa63f4d8e2934594a6a05c51a70a607e" - ] - }, - "outputId": "77ec71d2-8c67-4a3f-8b2a-ce338e26660e" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "tokenizer_config.json: 0%| | 0.00/49.0 [00:00\n", - "\n", - "Ideally, these encodings should have these characteristics ([source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)):\n", - "* Each time-step should have a unique value\n", - "* The distance between time steps should stay constant.\n", - "* The encoding should be able to generalise to longer sequences than seen during training.\n", - "* The encoding must be deterministic." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rklY-aL-_RF9" - }, - "source": [ - "##### **Sine and cosine functions**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GLcfkMku_RF9" - }, - "source": [ - "\n", - "In Attention is All you Need, the authors used a method that can satisfy all these requirements. 
This involves combining sine and cosine waves at different frequencies, with the formula for a position encoding at position $D$ shown below, where $i$ is the embedding index and $d_m$ is the token embedding size.\n", - "\n", - "\\\\\n", - "\n", - "$P_{D}= \\begin{cases}\\sin \\left(\\frac{D}{10000^{i/d_{m}}}\\right), & \\text { if } i \\bmod 2=0 \\\\ \\cos \\left(\\frac{D}{10000^{(i-1)/d_{m}}}\\right), & \\text { otherwise } \\end{cases}$\n", - "\n", - "\\\n", - "\n", - "Assuming our model has $d_m=8$, the position embedding will look like this:\n", - "\n", - "\\\n", - "$P_{D}=\\left[\\begin{array}{c}\\sin \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{6/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{6/8}}\\right)\\end{array}\\right]$\n", - "\n", - "\\\\\n", - "\n", - "Let's first create a function that returns these encodings, to understand why this works."
- ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": { - "id": "zT5t5D30_RF9" - }, - "outputs": [], - "source": [ - "def return_frequency_pe_matrix(token_sequence_length, token_embedding):\n", - "\n", - " assert token_embedding % 2 == 0, \"token_embedding should be divisible by two\"\n", - "\n", - " P = jnp.zeros((token_sequence_length, token_embedding))\n", - " positions = jnp.arange(0, token_sequence_length)[:, jnp.newaxis]\n", - "\n", - " i = jnp.arange(0, token_embedding, 2)\n", - " frequency_steps = jnp.exp(i * (-math.log(10000.0) / token_embedding))\n", - " frequencies = positions * frequency_steps\n", - "\n", - " P = P.at[:, 0::2].set(jnp.sin(frequencies))\n", - " P = P.at[:, 1::2].set(jnp.cos(frequencies))\n", - "\n", - " return P" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": { - "id": "CYW-VDOL_RF-", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 697 - }, - "outputId": "f41f3e50-a098-4184-9c01-dab5fe30bf8e" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "[SVG output omitted: positional-encoding plot, Matplotlib v3.7.1, 2024-08-30T09:02:57]" - }, - "metadata": {} - } - ], - "source": [ - "token_sequence_length = 50 # Number of tokens the model will need to process\n", - "token_embedding = 10000 # token embedding (and positional encoding) dimensions, ensure it is divisible by two\n", - "P = return_frequency_pe_matrix(token_sequence_length, token_embedding)\n", - "plot_position_encodings(P, token_sequence_length, token_embedding)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1mjHEDPO_RF-" - }, - "source": [ - "Looking at the graph 
above, we can see that for each position index, a unique pattern emerges, where each position index consistently has the same encoding.\n", - "\n", - "### **Group Activity**:\n", - "\n", - "- Take a moment with your friend to explore why this specific pattern appears when `token_sequence_length` is set to 1000, and `token_embedding` is 768.\n", - "- Experiment with smaller values for `token_sequence_length` and `token_embedding` to build a deeper understanding and enhance your discussion.\n", - "- Curious about the constant 10000? Ask your friend why they think it’s used in the functions above.\n", - "- Now, try setting `token_sequence_length` to 50 and `token_embedding` to a much larger value, like 10000. What do you observe? Do we always need a large token embedding?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SdNPg0pnhAhG" - }, - "source": [ - "### 2.3 Transformer block Intermediate" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M4vSolF2_RF-" - }, - "source": [ - "Just like an MLP (a simple neural network that processes input data through multiple layers) or a CNN (a type of neural network that excels at recognizing patterns in images by using convolution layers), transformers are made up of a stack of transformer blocks. In this section, we'll build each of the components needed to create one of these transformer blocks." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kTURbfr__RF-" - }, - "source": [ - "\n", - "#### 2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner\n", - "\n", - "\n", - "\"drawing\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LTtFi9AZ_RF-" - }, - "source": [ - "In the original model, these blocks consist of a simple 2-layer MLP (Multi-Layer Perceptron) that uses ReLU activation. However, GeLU (Gaussian Error Linear Unit) has become very popular, and we will be using it throughout this practical. 
The formula below gives the feedforward neural network (FFN) in its original ReLU form. The input `x` passes through two linear layers with weights `W1` and `W2` and bias terms `b1` and `b2`, with the activation applied between them. The ReLU activation appears here as the `max` function; in this practical it is replaced by the GeLU activation function.\n", - "\n", - "$$\n", - "\\operatorname{FFN}(x)=\\max \\left(0, x W_{1}+b_{1}\\right) W_{2}+b_{2}\n", - "$$\n", - "\n", - "One can interpret this block as processing what the MHA block has produced and then projecting these new token representations to a space that the next block can use more effectively. Usually, the first layer is very wide, in the range of 2-8 times the size of the token representations. This is done because it is easier to parallelize computations for a single wider layer during training than to parallelize a feedforward block with multiple layers. It adds capacity while keeping training and inference efficient.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the feed forward block."
- ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": { - "id": "zsho1CnW_RF-" - }, - "outputs": [], - "source": [ - "class FeedForwardBlock(nn.Module):\n", - " \"\"\"\n", - " A 2-layer MLP which widens then narrows the input.\n", - "\n", - " Args:\n", - " widening_factor [optional, default=4]: The size of the hidden layer will be d_model * widening_factor.\n", - " \"\"\"\n", - "\n", - " widening_factor: int = 4\n", - " init_scale: float = 0.25\n", - "\n", - " @nn.compact\n", - " def __call__(self, x):\n", - " '''\n", - " Args:\n", - " x: [B, T, d_m]\n", - "\n", - " Return:\n", - " x: [B, T, d_m]\n", - " '''\n", - " d_m = x.shape[-1]\n", - " layer1_size = self.widening_factor * d_m\n", - "\n", - " initializer = nn.initializers.variance_scaling(\n", - " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - " )\n", - "\n", - " # Hint: Layer 1 is a dense layer (fully connected layer) that increases the size of the input by the widening factor.\n", - " # Use nn.Dense to create this layer with layer1_size as the output size.\n", - " layer1 = # FINISH ME\n", - "\n", - " # Hint: Layer 2 is another dense layer that reduces the size back to the original dimension d_m.\n", - " # Use nn.Dense with d_m as the output size to create this layer.\n", - " layer2 = # FINISH ME\n", - "\n", - " x = jax.nn.gelu(layer1(x)) # Apply the GeLU activation function to the output of layer 1\n", - " x = layer2(x) # Pass the result through layer 2\n", - " return x" - ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": { - "id": "-qj0nfhH_RF-" - }, - "outputs": [], - "source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class FeedForwardBlock(nn.Module):\n", - " \"\"\"A 2-layer MLP (Multi-Layer Perceptron) that first expands the input size and then reduces it back.\"\"\"\n", - "\n", - " # widening_factor controls how much the input dimension is expanded in the first 
layer.\n", - " widening_factor: int = 4\n", - "\n", - " # init_scale controls the scaling factor for weight initialization.\n", - " init_scale: float = 0.25\n", - "\n", - " @nn.compact\n", - " def __call__(self, x):\n", - " # Get the size of the last dimension of the input (embedding size).\n", - " d_m = x.shape[-1]\n", - "\n", - " # Calculate the size of the first layer by multiplying the embedding size by the widening factor.\n", - " layer1_size = self.widening_factor * d_m\n", - "\n", - " # Initialize the weights for both layers using a variance scaling initializer.\n", - " initializer = nn.initializers.variance_scaling(\n", - " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - " )\n", - "\n", - " # Define the first dense layer, which expands the input size.\n", - " layer1 = nn.Dense(layer1_size, kernel_init=initializer)\n", - "\n", - " # Define the second dense layer, which reduces the size back to the original dimension.\n", - " layer2 = nn.Dense(d_m, kernel_init=initializer)\n", - "\n", - " # Apply the first dense layer followed by a GELU activation function.\n", - " x = jax.nn.gelu(layer1(x))\n", - "\n", - " # Apply the second dense layer to project the data back to its original dimension.\n", - " x = layer2(x)\n", - "\n", - " # Return the final output.\n", - " return x" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sts5Vr4i_RF-" - }, - "source": [ - "#### 2.3.2 Add and Norm block Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TWUpf8wt_RF-" - }, - "source": [ - "In order to get transformers to go deeper, the residual connections are very important to allow an easier flow of gradients through the network. For normalisation, `layer norm` is used. This normalises each token vector independently in the batch. 
It is found that normalising the vectors improves the convergence and stability of transformers.\n", - "\n", - "There are two learnable parameters in layernorm, `scale` and `bias`, which rescale the normalised value. Thus, for each input token in a batch, we calculate the mean, $\\mu_{i}$ and variance $\\sigma_i^2$. We then normalise the token with:\n", - "\n", - "$\\hat{x}_i = \\frac{x_i-\\mu_{i}}{\\sqrt{\\sigma_i^2 + ϵ}}$.\n", - "\n", - "Then $\\hat{x}$ is rescaled using the learned `scale`, $γ$, and `bias` $β$, with:\n", - "\n", - "$y_i = γ\\hat{x}_i + β = LN_{γ,β}(x_i)$.\n", - "\n", - "So our add norm block can be represented as $LN(x+f(x))$, where $f(x)$ is either an MLP or MHA block.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the add norm block. It should take as input the processed and unprocessed tokens. Hint: `nn.LayerNorm`" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": { - "id": "F5bLb5Ly_RF_" - }, - "outputs": [], - "source": [ - "class AddNorm(nn.Module):\n", - " \"\"\"A block that implements the add and norm block\"\"\"\n", - "\n", - " @nn.compact\n", - " def __call__(self, x, processed_x):\n", - " '''\n", - " Args:\n", - " x: Sequence of tokens before feeding into MHA or FF blocks, with shape [B, T, d_m]\n", - " processed_x: Sequence of tokens after being processed by MHA or FF blocks, with shape [B, T, d_m]\n", - "\n", - " Return:\n", - " add_norm_x: Transformed tokens with shape [B, T, d_m]\n", - " '''\n", - " # Hint: Step 1 involves adding the original input `x` to the processed input `processed_x`.\n", - " added = # FINISH ME\n", - "\n", - " # Hint: Step 2 requires applying layer normalization to the result of the addition.\n", - " # Use `nn.LayerNorm`, and set `reduction_axes=-1` to apply normalization across the last dimension.\n", - " normalised = #FINISH ME\n", - " return normalised(added)" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": { - "id": "HXSi7BXZ_RF_" - }, - "outputs": [], - 
"source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class AddNorm(nn.Module):\n", - " \"\"\"A block that implements the 'Add and Norm' operation used in transformers.\"\"\"\n", - "\n", - " @nn.compact\n", - " def __call__(self, x, processed_x):\n", - " # Step 1: Add the original input (x) to the processed input (processed_x).\n", - " added = x + processed_x\n", - "\n", - " # Step 2: Apply layer normalization to the result of the addition.\n", - " # - LayerNorm helps to stabilize and improve the training process by normalizing the output.\n", - " # - reduction_axes=-1 indicates that normalization is applied across the last dimension (typically the embedding dimension).\n", - " # - use_scale=True and use_bias=True allow the layer to learn scaling and bias parameters for further fine-tuning.\n", - " normalised = nn.LayerNorm(reduction_axes=-1, use_scale=True, use_bias=True)\n", - "\n", - " # Return the normalized result.\n", - " return normalised(added)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "91dXd29b_RF_" - }, - "source": [ - "### 2.4 Building the Transformer Decoder / LLM Intermediate" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sl0UAyvM_RF_" - }, - "source": [ - "\"drawing\"\n", - "\n", - "Most of the groundwork has happened. 
We have built the positional encoding block, the MHA block, the feed-forward block and the add&norm block.\n", - "\n", - "The only part still needed is passing inputs to each decoder block and applying the masked MHA block found in the decoder blocks.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the decoder block: `X = AddNorm(X, MHA(X))` followed by `X = AddNorm(X, FFN(X))`" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": { - "id": "wVmSFKZK_RF_" - }, - "outputs": [], - "source": [ - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of heads to be used in the MHA block.\n", - " d_m: Token embedding size\n", - " widening_factor: The size of the hidden layer will be d_m * widening_factor.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - " self.add_norm1 = AddNorm()\n", - " self.add_norm2 = AddNorm()\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Args:\n", - " X: Batch of tokens being fed into the decoder, with shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask to be applied, with shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: Whether to return the attention weights.\n", - " \"\"\"\n", - "\n", - " attention, attention_weights_1 = # FINISH ME\n", - "\n", - " X = # FINISH ME\n", - "\n", - " projection = # FINISH ME\n", - " X = # FINISH ME\n", - "\n", - " return (X, attention_weights_1) if return_att_weight else X" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": { - "id": "stNZVVv3_RF_" - }, - "outputs": [], - "source": 
[ - "#@title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of attention heads in the Multi-Head Attention (MHA) block.\n", - " d_m: The size of the token embeddings.\n", - " widening_factor: The factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialize the Multi-Head Attention (MHA) block\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - "\n", - " # Initialize the AddNorm blocks for residual connections and normalization\n", - " self.add_norm1 = AddNorm() # First AddNorm block after MHA\n", - " self.add_norm2 = AddNorm() # Second AddNorm block after the MLP\n", - "\n", - " # Initialize the FeedForwardBlock (MLP) which processes the data after attention\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Forward pass through the DecoderBlock.\n", - "\n", - " Args:\n", - " X: Batch of input tokens fed into the decoder, shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask to control which positions the attention is allowed to consider, shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: If True, returns the attention weights along with the output.\n", - "\n", - " Returns:\n", - " If return_att_weight is True, returns a tuple (X, attention_weights_1).\n", - " Otherwise, returns the processed token representations X.\n", - " \"\"\"\n", - "\n", - " # Apply Multi-Head Attention to the input tokens (X) with optional masking\n", - " attention, attention_weights_1 = self.mha(X, mask=mask, return_weights=True)\n", - "\n", - " # Apply the first AddNorm block (adds the 
original input X and normalizes)\n", - " X = self.add_norm1(X, attention)\n", - "\n", - " # Pass the result through the FeedForwardBlock (MLP) to further process the data\n", - " projection = self.MLP(X)\n", - "\n", - " # Apply the second AddNorm block (adds the input from the previous step and normalizes)\n", - " X = self.add_norm2(X, projection)\n", - "\n", - " # Return the final output X, and optionally the attention weights\n", - " return (X, attention_weights_1) if return_att_weight else X\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8SXXVWd7_RF_" - }, - "source": [ - "Next, we just put everything together, adding in the positional encodings as well as stacking multiple transformer blocks and adding our prediction layer." - ] - }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": { - "id": "4XBG24Qs_RF_" - }, - "outputs": [], - "source": [ - "class LLM(nn.Module):\n", - " \"\"\"\n", - " Transformer model consisting of several layers of decoder blocks.\n", - "\n", - " Args:\n", - " num_heads: Number of attention heads in each Multi-Head Attention (MHA) block.\n", - " num_layers: Number of decoder blocks in the model.\n", - " d_m: Dimensionality of the token embeddings.\n", - " vocab_size: Size of the vocabulary (number of unique tokens).\n", - " widening_factor: Factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - " num_heads: int\n", - " num_layers: int\n", - " d_m: int\n", - " vocab_size: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialize a list of decoder blocks, one for each layer in the model\n", - " self.blocks = [\n", - " DecoderBlock(self.num_heads, self.d_m, self.widening_factor)\n", - " for _ in range(self.num_layers)\n", - " ]\n", - "\n", - " # Initialize an embedding layer to convert token IDs into token embeddings\n", - " self.embedding = nn.Embed(num_embeddings=self.vocab_size, features=self.d_m)\n", - "\n", - " # Initialize a dense layer 
for predicting the next token in the sequence\n", - " self.pred_layer = nn.Dense(self.vocab_size)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weights=False):\n", - " \"\"\"\n", - " Forward pass through the LLM model.\n", - "\n", - " Args:\n", - " X: Batch of input token IDs, shape [B, T_decoder] where B is batch size and T_decoder is sequence length.\n", - " mask [optional, default=None]: Mask to control which positions the attention can focus on, shape [T_decoder, T_decoder].\n", - " return_att_weights [optional, default=False]: Whether to return the attention weights.\n", - "\n", - " Returns:\n", - " logits: The predicted log-probabilities over the vocabulary at each position, shape [B, T_decoder, vocab_size].\n", - " If return_att_weights is True, also returns the attention weights.\n", - " \"\"\"\n", - "\n", - " # Convert token IDs to embeddings (shape [B, T_decoder, d_m])\n", - " X = self.embedding(X)\n", - "\n", - " # Get the sequence length of the input\n", - " sequence_len = X.shape[-2]\n", - "\n", - " # Generate positional encodings and add them to the token embeddings\n", - " positions = return_frequency_pe_matrix(sequence_len, self.d_m)\n", - " X = X + positions\n", - "\n", - " # Initialize a list to store attention weights if needed\n", - " if return_att_weights:\n", - " att_weights = []\n", - "\n", - " # Pass the embeddings through each decoder block in sequence\n", - " for block in self.blocks:\n", - " out = block(X, mask, return_att_weights)\n", - " if return_att_weights:\n", - " # If returning attention weights, unpack the output\n", - " X = out[0]\n", - " att_weights.append(out[1])\n", - " else:\n", - " # Otherwise, just update the input for the next block\n", - " X = out\n", - "\n", - " # Apply a dense layer followed by a log softmax; despite the name, `logits` holds log-probabilities over the vocabulary\n", - " logits = nn.log_softmax(self.pred_layer(X))\n", - "\n", - " # Return the logits, and optionally, the attention weights\n", - " return logits if not return_att_weights else (logits, 
jnp.array(att_weights).swapaxes(0, 1))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sClFLLkU_RF_" - }, - "source": [ - "If everything is implemented correctly, the code below should run without errors." - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": { - "id": "82CWEa5m_RGA" - }, - "outputs": [], - "source": [ - "B, T, d_m, N, vocab_size = 18, 32, 16, 8, 25670\n", - "\n", - "llm = LLM(num_heads=1, num_layers=1, d_m=d_m, vocab_size=vocab_size, widening_factor=4)\n", - "mask = jnp.tril(jnp.ones((T, T)))\n", - "\n", - "# initialise module and get dummy output\n", - "key = jax.random.PRNGKey(42)\n", - "X = jax.random.randint(key, [B, T], 0, vocab_size)\n", - "params = llm.init(key, X, mask=mask)\n", - "\n", - "# extract output from decoder\n", - "logits, decoder_att_weights = llm.apply(\n", - " params,\n", - " X,\n", - " mask=mask,\n", - " return_att_weights=True,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gve7ssD__RGA" - }, - "source": [ - "As a final sanity check, we can confirm that our attention weights are working correctly. As shown in the figure below, the decoder's attention weights focus only on the current and previous tokens, as expected." - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": { - "id": "H4NpywYv_RGA", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 480 - }, - "outputId": "2d859add-d15b-47c5-c1b1-6faa6ec63138" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "[SVG output omitted: LLM attention weights plot, Matplotlib v3.7.1, 2024-08-30T09:16:07]" - }, - "metadata": {} - } - ], - "source": [ - "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", - "plt.suptitle(\"LLM attention weights\")\n", - 
"sns.heatmap(decoder_att_weights[0, 0, 0, ...], ax=ax, cmap=\"Blues\")\n", - "fig.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wmt3tp38G90A" - }, - "source": [ - "### 2.5 Training your LLM" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "agLIpsoh_RGA" - }, - "source": [ - "#### 2.5.1 Training objective Intermediate\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QOSv1-3B_RGA" - }, - "source": [ - "A sentence is nothing but a string of words. A LLM aims to predict the next word by considering the current context, namely the words that have come before.\n", - "\n", - "Here's the basic idea:\n", - "\n", - "To calculate the probability of a full sentence \"word1, word2, ..., last word\" appearing in a given context $c$, the procedure is to break down the sentence into individual words and consider the probability of each word given the words that precede it. These individual probabilities are then multiplied together:\n", - "\n", - "$$\\text{Probability of sentence} = \\text{Probability of word1} \\times \\text{Probability of word2} \\times \\ldots \\times \\text{Probability of last word}$$\n", - "\n", - "This method is akin to building up a narrative one piece at a time based on the preceding storyline.\n", - "\n", - "Mathematically, this is expressed as the likelihood (probability) of a sequence of words $y_1, y_2, ..., y_n$ in a given context $c$, which is achieved by multiplying the probabilities of each word $y_t$ calculated given the predecessors ($y_{Advanced" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zIQ_aJGW_RGA" - }, - "source": [ - "In the next section, we define all the processes required to train the model using the objective described above. A lot of this is now the work required to do training using FLAX.\n", - "\n", - "Below we gather the dataset and we shall be training on, which is Karpathy's shakespeare dataset. 
It's not essential to understand this code, so either just run the cell to load the data or read through the code if you are curious.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": { - "id": "guMHAaSo_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "f88a064b-a5b2-44f7-8143-9048ff11d7ba" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "--2024-08-30 09:18:33-- https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 1115394 (1.1M) [text/plain]\n", - "Saving to: ‘input.txt’\n", - "\n", - "\rinput.txt 0%[ ] 0 --.-KB/s \rinput.txt 100%[===================>] 1.06M --.-KB/s in 0.04s \n", - "\n", - "2024-08-30 09:18:33 (26.3 MB/s) - ‘input.txt’ saved [1115394/1115394]\n", - "\n" - ] - } - ], - "source": [ - "# @title Create Shakespeare dataset and iterator (optional, but run the cell)\n", - "\n", - "# Trick to avoid errors when downloading tinyshakespeare.\n", - "import locale\n", - "locale.getpreferredencoding = lambda: \"UTF-8\"\n", - "\n", - "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt\n", - "\n", - "class WordBasedAsciiDatasetForLLM:\n", - " \"\"\"In-memory dataset of a single-file ASCII dataset for language-like model.\"\"\"\n", - "\n", - " def __init__(self, path: str, batch_size: int, sequence_length: int):\n", - " \"\"\"Load a single-file ASCII dataset in memory.\"\"\"\n", - " self._batch_size = batch_size\n", - "\n", - " with open(path, \"r\") as f:\n", - " corpus = f.read()\n", - "\n", - " # Tokenize by splitting the text into words\n", - " words = 
corpus.split()\n", - " self.vocab_size = len(set(words)) # Number of unique words\n", - "\n", - " # Create a mapping from words to unique IDs\n", - " self.word_to_id = {word: i for i, word in enumerate(set(words))}\n", - "\n", - " # Store the inverse mapping from IDs to words\n", - " self.id_to_word = {i: word for word, i in self.word_to_id.items()}\n", - "\n", - " # Convert the words in the corpus to their corresponding IDs\n", - " corpus = np.array([self.word_to_id[word] for word in words]).astype(np.int32)\n", - "\n", - " crop_len = sequence_length + 1\n", - " num_batches, ragged = divmod(corpus.size, batch_size * crop_len)\n", - " if ragged:\n", - " corpus = corpus[:-ragged]\n", - " corpus = corpus.reshape([-1, crop_len])\n", - "\n", - " if num_batches < 10:\n", - " raise ValueError(\n", - " f\"Only {num_batches} batches; consider a shorter \"\n", - " \"sequence or a smaller batch.\"\n", - " )\n", - "\n", - " self._ds = WordBasedAsciiDatasetForLLM._infinite_shuffle(\n", - " corpus, batch_size * 10\n", - " )\n", - "\n", - " def __iter__(self):\n", - " return self\n", - "\n", - " def __next__(self):\n", - " \"\"\"Yield next mini-batch.\"\"\"\n", - " batch = [next(self._ds) for _ in range(self._batch_size)]\n", - " batch = np.stack(batch)\n", - " # Create the language modeling observation/target pairs.\n", - " return dict(\n", - " input=batch[:, :-1], target=batch[:, 1:]\n", - " )\n", - "\n", - " def ids_to_words(self, ids):\n", - " \"\"\"Convert a sequence of word IDs to words.\"\"\"\n", - " return [self.id_to_word[id] for id in ids]\n", - "\n", - " @staticmethod\n", - " def _infinite_shuffle(iterable, buffer_size):\n", - " \"\"\"Infinitely repeat and shuffle data from iterable.\"\"\"\n", - " ds = itertools.cycle(iterable)\n", - " buf = [next(ds) for _ in range(buffer_size)]\n", - " random.shuffle(buf)\n", - " while True:\n", - " item = next(ds)\n", - " idx = random.randint(0, buffer_size - 1) # Inclusive.\n", - " result, buf[idx] = buf[idx], item\n", - " yield 
result\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_WBIFg51oQl0" - }, - "source": [ - "Let's now look at how our data is structured for training." - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": { - "id": "WvH3XPM5_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "6bd960ae-887a-49b9-90f9-cfaed8fe508f" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "---------- Input -----------\n", - "TEXT: Nay, but speak not maliciously. First Citizen: I say unto you, what he hath done famously, he did it to that end: though soft-conscienced men can be content to say it was\n", - "ASCII: [ 7689 21226 4486 20296 15854 4336 8376 13235 2368 2564 7379 3893\n", - " 4074 6041 7028 7627 4074 6754 9269 23295 11807 785 4841 16254\n", - " 12875 15794 4364 1885 23295 2368 9269 5]\n", - "---------- Target ----------\n", - "TEXT: but speak not maliciously. First Citizen: I say unto you, what he hath done famously, he did it to that end: though soft-conscienced men can be content to say it was for\n", - "ASCII: [21226 4486 20296 15854 4336 8376 13235 2368 2564 7379 3893 4074\n", - " 6041 7028 7627 4074 6754 9269 23295 11807 785 4841 16254 12875\n", - " 15794 4364 1885 23295 2368 9269 5 1215]\n", - "---------- Input -----------\n", - "TEXT: talking on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if\n", - "ASCII: [12366 22188 10530 9269 4364 21595 1348 2770 16167 8376 6969 13327\n", - " 21731 23093 4336 8376 3656 12541 21911 8526 21028 14023 905 4469\n", - " 6613 533 10566 17134 9859 19044 1725 18619]\n", - "---------- Target ----------\n", - "TEXT: on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor citizens, the patricians good. 
What authority surfeits on would relieve us: if they\n", - "ASCII: [22188 10530 9269 4364 21595 1348 2770 16167 8376 6969 13327 21731\n", - " 23093 4336 8376 3656 12541 21911 8526 21028 14023 905 4469 6613\n", - " 533 10566 17134 9859 19044 1725 18619 8816]\n", - "\n", - " Total vocabulary size: 25670\n" - ] - } - ], - "source": [ - "# sample and look at the data\n", - "batch_size = 2\n", - "seq_length = 32\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "\n", - "batch = next(train_dataset)\n", - "\n", - "for obs, target in zip(batch[\"input\"], batch[\"target\"]):\n", - " print(\"-\" * 10, \"Input\", \"-\" * 11)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(obs)))\n", - " print(\"ASCII:\", obs)\n", - " print(\"-\" * 10, \"Target\", \"-\" * 10)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(target)))\n", - " print(\"ASCII:\", target)\n", - "\n", - "print(f\"\\n Total vocabulary size: {train_dataset.vocab_size}\")\n", - "\n", - "VOCAB_SIZE = train_dataset.vocab_size" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w9vzee53_RGB" - }, - "source": [ - "Next, let us train our LLM and see how it performs in producing Shakespearian text. First, we will define what happens for every training step." 
- ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": { - "id": "PGuYBCkekgDw" - }, - "outputs": [], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(3, 4))\n", - "def train_step(params, optimizer_state, batch, apply_fn, update_fn):\n", - " \"\"\"\n", - " Perform a single training step.\n", - "\n", - " Args:\n", - " params: The current parameters of the model.\n", - " optimizer_state: The current state of the optimizer.\n", - " batch: A dictionary containing the input data and target labels for the batch.\n", - " apply_fn: The function used to apply the model to the inputs.\n", - " update_fn: The function used to update the model parameters based on the gradients.\n", - "\n", - " Returns:\n", - " Updated parameters, updated optimizer state, and the computed loss for the batch.\n", - " \"\"\"\n", - "\n", - " def loss_fn(params):\n", - " # Get the sequence length (T) from the input data.\n", - " T = batch['input'].shape[1]\n", - "\n", - " # Apply the model to the input data, using a lower triangular mask to enforce causality.\n", - " # jnp.tril(np.ones((T, T))) creates a lower triangular matrix of ones.\n", - " logits = apply_fn(params, batch['input'], jnp.tril(np.ones((T, T))))\n", - "\n", - " # Calculate the loss between the predicted logits and the target labels.\n", - " loss = sequence_loss_fn(logits, batch['target'])\n", - "\n", - " return loss\n", - "\n", - " # Compute the loss and its gradients with respect to the parameters.\n", - " loss, gradients = jax.value_and_grad(loss_fn)(params)\n", - "\n", - " # Update the optimizer state and calculate the parameter updates based on the gradients.\n", - " updates, optimizer_state = update_fn(gradients, optimizer_state)\n", - "\n", - " # Apply the updates to the parameters.\n", - " params = optax.apply_updates(params, updates)\n", - "\n", - " # Return the updated parameters, optimizer state, and the loss for the batch.\n", - " return params, 
optimizer_state, loss" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rtKWzKIAkfYU" - }, - "source": [ - "Next we initialise our optimizer and model. Feel free to play with the hyperparameters during the practical." - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": { - "id": "8o3q-BZX_RGB" - }, - "outputs": [], - "source": [ - "# Define all hyperparameters\n", - "d_model = 128 # Dimension of token embeddings (d_m)\n", - "num_heads = 4 # Number of attention heads in Multi-Head Attention\n", - "num_layers = 1 # Number of decoder blocks in the model\n", - "widening_factor = 2 # Factor to widen the hidden layer size in the MLP\n", - "LR = 2e-3 # Learning rate for the optimizer\n", - "batch_size = 32 # Number of samples per training batch\n", - "seq_length = 64 # Length of each input sequence (number of tokens)\n", - "\n", - "# Set up the training data\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "vocab_size = train_dataset.vocab_size # Get the size of the vocabulary from the dataset\n", - "batch = next(train_dataset) # Get the first batch of input data\n", - "\n", - "# Set the random number generator key for model initialization\n", - "rng = jax.random.PRNGKey(42)\n", - "\n", - "# Initialize the LLM model with the specified hyperparameters\n", - "llm = LLM(num_heads=num_heads, num_layers=num_layers, d_m=d_model, vocab_size=vocab_size, widening_factor=widening_factor)\n", - "\n", - "# Create a causal mask to ensure that the model only attends to previous tokens\n", - "mask = jnp.tril(np.ones((batch['input'].shape[1], batch['input'].shape[1])))\n", - "\n", - "# Initialize the model parameters using the first batch of input data and the mask\n", - "params = llm.init(rng, batch['input'], mask)\n", - "\n", - "# Set up the optimizer using the Adam optimization algorithm with the specified learning rate\n", - "optimizer = optax.adam(LR, b1=0.9, b2=0.99)\n", - "optimizer_state = 
optimizer.init(params) # Initialize the optimizer state with the model parameters" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3bPEFakxmvsM" - }, - "source": [ - "Now we train! This will take a few minutes. While it trains, have you greeted your neighbour yet?" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": { - "id": "oUAS6tie_RGB", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 812 - }, - "outputId": "0e4980c4-043d-4f9e-eae2-28c74c3a8817" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T09:21:07.272979\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - }, - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Loss\n", - "\tloss \t (min: 0.180, max: 10.654, cur: 0.194)\n" - ] - } - ], - "source": [ - "plotlosses = PlotLosses()\n", - "\n", - "MAX_STEPS = 3500\n", - "LOG_EVERY = 32\n", - "losses = []\n", - "VOCAB_SIZE = 25670\n", - "\n", - "# Training loop\n", - "for step in range(MAX_STEPS):\n", - " batch = next(train_dataset)\n", - " params, optimizer_state, loss = train_step(\n", - " params, optimizer_state, batch, llm.apply, optimizer.update)\n", - " losses.append(loss)\n", - " if step % LOG_EVERY == 0:\n", - " loss_ = jnp.array(losses).mean()\n", - " plotlosses.update(\n", - " {\n", - " \"loss\": loss_,\n", - " }\n", - " )\n", - " plotlosses.send()\n", - " losses = []" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pGv9c2AFmF4V" - }, - "source": [ - "#### 2.5.3 Inspecting the trained LLM Beginner\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Pfq61gim_RGB" - }, - "source": [ 
- "**Reminder:** run all the code presented so far in this section before running the cells below!\n", - "\n", - "Let's generate some text now and see how our model did. DO NOT STOP THE CELL ONCE IT IS RUNNING; THIS WILL CRASH THE SESSION." - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": { - "id": "5lt8HTS__RGC", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "8af3b23e-05df-4a52-f260-5823d29b64de" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Love the teaching of the maid: That's your device. LUCENTIO: It is: may it be the" - ] - } - ], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(2, ))\n", - "def generate_prediction(params, input, apply_fn):\n", - " logits = apply_fn(params, input)\n", - " argmax_out = jnp.argmax(logits, axis=-1)\n", - " return argmax_out[0][-1].astype(int)\n", - "\n", - "def generate_random_shakespeare(llm, params, id_2_word, word_2_id):\n", - " '''\n", - " Get the model output\n", - " '''\n", - "\n", - " prompt = \"Love\"\n", - " print(prompt, end=\"\")\n", - " tokens = prompt.split()\n", - "\n", - " # predict and append\n", - " for i in range(15):\n", - " input = jnp.array([[word_2_id[t] for t in tokens]]).astype(int)\n", - " prediction = generate_prediction(params, input, llm.apply)\n", - " prediction = id_2_word[int(prediction)]\n", - " tokens.append(prediction)\n", - " print(\" \"+prediction, end=\"\")\n", - "\n", - " return \" \".join(tokens)\n", - "\n", - "id_2_word = train_dataset.id_to_word\n", - "word_2_id = train_dataset.word_to_id\n", - "\n", - "generated_shakespeare = generate_random_shakespeare(llm, params, id_2_word, word_2_id)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wOwNuMRf_RGC" - }, - "source": [ - "Finally, note that the generation above always picks the token ID with the highest predicted probability. 
This is greedy decoding: at each step we keep only the most likely token. It worked well in this use case, but greedy decoding can degrade output quality, particularly when we want to generate realistic text.\n", - "\n", - "Other decoding methods exist, a well-known one being beam search. We provide resources below for anyone interested in learning more about this.\n", - "\n", - "[Greedy Decoding](https://www.youtube.com/watch?v=DW5C3eqAFQM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=4)\n", - "\n", - "[Beam Search](https://www.youtube.com/watch?v=uG3xoYNo3HM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=5)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fV3YG7QOZD-B" - }, - "source": [ - "## **Conclusion**\n", - "**Summary:**\n", - "\n", - "You've now mastered the essentials of how a Large Language Model (LLM) works, from the fundamentals of attention mechanisms to training your own LLM! These powerful tools have the potential to transform a wide range of tasks. However, like any deep learning model, their magic lies in applying them to the right problems with the right data.\n", - "\n", - "Ready to take your skills to the next level? Dive into fine-tuning your own LLMs and unleash even more potential! I highly recommend exploring last year's practical on Parameter Efficient Fine-Tuning Methods for a comprehensive overview of advanced techniques. The journey doesn't stop here; there's so much more to discover! [LLMs for Everyone 2023](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", - "\n", - "The world of LLMs is yours to explore, so go ahead and create something amazing! 
🌟🚀\n", - "\n", - "---\n", - "\n", - "**Next Steps:**\n", - "[**Efficiently Finetuning LLMs with Hugging Face**](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", - "\n", - "\n", - "**References:** for further reading, see the links given throughout\n", - "the individual sections of this colab.\n", - "\n", - "* [Attention is all you need paper](https://arxiv.org/abs/1706.03762)\n", - "* [Additional videos on transformers](https://www.youtube.com/playlist?list=PLmZlBIcArwhOPR2s-FIR7WoqNaBML233s)\n", - "* [LoRA paper](https://arxiv.org/abs/2106.09685)\n", - "* [RLHF](https://huggingface.co/blog/rlhf) (how ChatGPT was trained)\n", - "* [Extending context length](https://kaiokendev.github.io/context)\n", - "\n", - "\n", - "For other practicals from the Deep Learning Indaba, please visit [here](https://github.com/deep-learning-indaba/indaba-pracs-2023)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o1ndpYE50BpG" - }, - "source": [ - "# Feedback\n", - "\n", - "Please provide feedback that we can use to improve our practicals in the future." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "OIZvkhfRz9Jz", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "outputId": "46a7ad13-b174-453d-ed32-9f932626a259" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "\n", - "\n", - "\tLoading...\n", - "\n" - ] - }, - "metadata": {}, - "execution_count": 1 - } - ], - "source": [ - "# @title Generate Feedback Form. 
(Run Cell)\n", - "from IPython.display import HTML\n", - "\n", - "HTML(\n", - " \"\"\"\n", - "\n", - "\tLoading...\n", - "\n", - "\"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oglV4kHMWnIN" - }, - "source": [ - "" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "gpuType": "T4", - "provenance": [], - "include_colab_link": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.8.5" - }, - "vscode": { - "interpreter": { - "hash": "145833166d986a8417df3c7acb65d917d84b716b5a452e57fcacdc66f1a168c9" - } - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "d8f46e6226af431d9b7c6ecfa1c2769a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_93a441fdf82141af81e85b2d5aec49b7", - "IPY_MODEL_4eb2c2d0758f4061a3d0fc398018de28", - "IPY_MODEL_fefd764b0cf2425c97cdab4506a81b9c" - ], - "layout": "IPY_MODEL_692c620c5e104d33849c3da34268f5cb" - } - }, - "93a441fdf82141af81e85b2d5aec49b7": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ecec6f9f968940078a4cf7a9105a3717", - "placeholder": "​", - "style": 
"IPY_MODEL_e352febd89fe462e945d40bcf421f7fd", - "value": "config.json: 100%" - } - }, - "4eb2c2d0758f4061a3d0fc398018de28": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f5a917167c914fdfaa86f1f71176eda2", - "max": 1007, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_4edc048323484d24b8c59bb8a2f1fab1", - "value": 1007 - } - }, - "fefd764b0cf2425c97cdab4506a81b9c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_fe54b7356cf049d5b449b3fd69d64221", - "placeholder": "​", - "style": "IPY_MODEL_740307d65ab447658d3945abd47b3318", - "value": " 1.01k/1.01k [00:00<00:00, 14.7kB/s]" - } - }, - "692c620c5e104d33849c3da34268f5cb": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ecec6f9f968940078a4cf7a9105a3717": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"e352febd89fe462e945d40bcf421f7fd": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f5a917167c914fdfaa86f1f71176eda2": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "4edc048323484d24b8c59bb8a2f1fab1": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "fe54b7356cf049d5b449b3fd69d64221": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "740307d65ab447658d3945abd47b3318": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "70969f782d9d48cebefa3ee64e3a04f5": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_111b2a654a424460b4917b5dad5fd69e", - "IPY_MODEL_e35b59e079ab498dbe595cde7c984438", - "IPY_MODEL_f2edac25c1e74b14a4ade39fe86dd040" - ], - "layout": "IPY_MODEL_4983de4c57c349af8cb8b6be78f64030" - } - }, - "111b2a654a424460b4917b5dad5fd69e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_74926ca6f40e44c0887297ac44cbd577", - "placeholder": "​", - "style": "IPY_MODEL_467dfa92fdee4b9d850bd1eac5c502c6", - "value": "model.safetensors: 100%" - } - }, - "e35b59e079ab498dbe595cde7c984438": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_52e8f51e283847d79bda1ee4977fd53f", - "max": 525979192, - "min": 0, - "orientation": "horizontal", - "style": 
"IPY_MODEL_aa3a63b55fe74989b92cfe8504695309", - "value": 525979192 - } - }, - "f2edac25c1e74b14a4ade39fe86dd040": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a78f96142d83418685734ffb9ec85ddd", - "placeholder": "​", - "style": "IPY_MODEL_d642dc3cabc64e66a8bf8853527f7165", - "value": " 526M/526M [00:06<00:00, 79.6MB/s]" - } - }, - "4983de4c57c349af8cb8b6be78f64030": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, 
- "74926ca6f40e44c0887297ac44cbd577": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "467dfa92fdee4b9d850bd1eac5c502c6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "52e8f51e283847d79bda1ee4977fd53f": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "aa3a63b55fe74989b92cfe8504695309": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "a78f96142d83418685734ffb9ec85ddd": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - 
"flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d642dc3cabc64e66a8bf8853527f7165": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8ca5d9a0316d4d85b88d45f5993bc21c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_0dc54052e03245c3b149ed1fcb22b038", - "IPY_MODEL_973cf8a73e3a478c888928030194793d", - "IPY_MODEL_1668490995c74105ab6f11ab33ecaefb" - ], - "layout": "IPY_MODEL_3ec771810c9d472fa75278a76decf956" - } - }, - "0dc54052e03245c3b149ed1fcb22b038": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - 
"state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f505b2bbabae43219a02639d33501a32", - "placeholder": "​", - "style": "IPY_MODEL_f0a69ae2f0064d80825b0091932e7813", - "value": "generation_config.json: 100%" - } - }, - "973cf8a73e3a478c888928030194793d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_65e1cac3c45442b4a68711d410ef37c3", - "max": 119, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_5602e00600884fdfbbe145694bcc0f90", - "value": 119 - } - }, - "1668490995c74105ab6f11ab33ecaefb": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f02e4e7118d64a2bb764909705049d18", - "placeholder": "​", - "style": "IPY_MODEL_c4710e641a1c421f96222ffb65bcedfe", - "value": " 119/119 [00:00<00:00, 1.29kB/s]" - } - }, - "3ec771810c9d472fa75278a76decf956": { - "model_module": 
"@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "f505b2bbabae43219a02639d33501a32": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - 
"grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "f0a69ae2f0064d80825b0091932e7813": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "65e1cac3c45442b4a68711d410ef37c3": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - 
"overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "5602e00600884fdfbbe145694bcc0f90": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "f02e4e7118d64a2bb764909705049d18": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "c4710e641a1c421f96222ffb65bcedfe": { - "model_module": "@jupyter-widgets/controls", - "model_name": 
"DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "98ca063c1b1548ce8de87647dcf23507": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_ec94052f637b4de3a172b3d5d9d32355", - "IPY_MODEL_267fa0085440473c8762bfe21b5e9106", - "IPY_MODEL_708f1e90e25c49c0bbfac3bc5c233e24" - ], - "layout": "IPY_MODEL_14a6f314f3ff4ad3b1f05933c50fc830" - } - }, - "ec94052f637b4de3a172b3d5d9d32355": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_c69550aa99a34bada8ef67f61115d760", - "placeholder": "​", - "style": "IPY_MODEL_d9d79a644a1d42a38b70097c7f77dbad", - "value": "tokenizer_config.json: 100%" - } - }, - "267fa0085440473c8762bfe21b5e9106": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - 
"_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_af73f467b5bd4c38882bc27e3b3b5732", - "max": 727, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_71eeff60e0f7464ca42483c4fe3c7bca", - "value": 727 - } - }, - "708f1e90e25c49c0bbfac3bc5c233e24": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a546a75097e149f4a226453951d987da", - "placeholder": "​", - "style": "IPY_MODEL_3d5b4513d6ee4a73a392e4c16896e14c", - "value": " 727/727 [00:00<00:00, 10.4kB/s]" - } - }, - "14a6f314f3ff4ad3b1f05933c50fc830": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - 
"justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "c69550aa99a34bada8ef67f61115d760": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d9d79a644a1d42a38b70097c7f77dbad": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - 
"_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "af73f467b5bd4c38882bc27e3b3b5732": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "71eeff60e0f7464ca42483c4fe3c7bca": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "a546a75097e149f4a226453951d987da": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - 
"_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "3d5b4513d6ee4a73a392e4c16896e14c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d334e7a205704133b8549562700bca53": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", 
- "children": [ - "IPY_MODEL_2e0e5880d12b4a37b673dcb3e455f47e", - "IPY_MODEL_3c7d5cb0f9de4132a2b26428b72cb10a", - "IPY_MODEL_49de67c36f9b4b66ba234bc606ffa0c4" - ], - "layout": "IPY_MODEL_ff8540ebd47e4dc39197eb6b19945d32" - } - }, - "2e0e5880d12b4a37b673dcb3e455f47e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_622fd5127db84c7388b9258ee7c4fdc1", - "placeholder": "​", - "style": "IPY_MODEL_7a191a40c1e245748de1f54f449afe37", - "value": "vocab.json: 100%" - } - }, - "3c7d5cb0f9de4132a2b26428b72cb10a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3263b46210b544b989301e9edfc23473", - "max": 898669, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_2a64a5672e934b739e6280e2ab278da9", - "value": 898669 - } - }, - "49de67c36f9b4b66ba234bc606ffa0c4": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - 
"_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_98577aad05654ff5aa46ca95121ac640", - "placeholder": "​", - "style": "IPY_MODEL_03251cffadcb429b9dbe87402bb8a4bb", - "value": " 899k/899k [00:00<00:00, 3.43MB/s]" - } - }, - "ff8540ebd47e4dc39197eb6b19945d32": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "622fd5127db84c7388b9258ee7c4fdc1": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": 
null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7a191a40c1e245748de1f54f449afe37": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "3263b46210b544b989301e9edfc23473": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - 
"grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2a64a5672e934b739e6280e2ab278da9": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "98577aad05654ff5aa46ca95121ac640": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - 
"min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "03251cffadcb429b9dbe87402bb8a4bb": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f35d92a421d84260976fc4fba3d4527c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_d96a8053de2f4aeabb8cf68474f6725f", - "IPY_MODEL_441741ea09ac4ba29ac2dcd9f8cc0ade", - "IPY_MODEL_445c478b29e844e99c45eb4e7a093e65" - ], - "layout": "IPY_MODEL_9bc0b47676434ca9a6c8c3f8af50d9aa" - } - }, - "d96a8053de2f4aeabb8cf68474f6725f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2555e6fbde6e45c2ba5f466701a1fb57", - "placeholder": "​", - "style": 
"IPY_MODEL_73d4afacfe3741bc90dfd70e55d763e3", - "value": "merges.txt: 100%" - } - }, - "441741ea09ac4ba29ac2dcd9f8cc0ade": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2815ed72ee56478b81379eabd3cdc004", - "max": 456318, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_7ee8d64ff1344f7f8e6816e8ced5c5d7", - "value": 456318 - } - }, - "445c478b29e844e99c45eb4e7a093e65": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_71ce19a082fd4f28a1c5f4f29853f32d", - "placeholder": "​", - "style": "IPY_MODEL_9e1d5d7e79c9494598ca56079525dadc", - "value": " 456k/456k [00:00<00:00, 9.82MB/s]" - } - }, - "9bc0b47676434ca9a6c8c3f8af50d9aa": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2555e6fbde6e45c2ba5f466701a1fb57": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"73d4afacfe3741bc90dfd70e55d763e3": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "2815ed72ee56478b81379eabd3cdc004": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7ee8d64ff1344f7f8e6816e8ced5c5d7": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "71ce19a082fd4f28a1c5f4f29853f32d": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9e1d5d7e79c9494598ca56079525dadc": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "991ab38f5ab142a2a053da131fca08e8": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_b14dc0d2bba84dbbb17b657ef5555132", - "IPY_MODEL_51a7cfd306fc45498f8f84b579e6f05f", - "IPY_MODEL_3e134af4ddc041aab1e0f8769b77e232" - ], - "layout": "IPY_MODEL_6b33bff41d3d438f9c81af109273f41e" - } - }, - "b14dc0d2bba84dbbb17b657ef5555132": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ff568bf3a34b46edb2e50fb910608de3", - "placeholder": "​", - "style": "IPY_MODEL_9e1cb31c569f40118ae949cd8643e392", - "value": "tokenizer.json: 100%" - } - }, - "51a7cfd306fc45498f8f84b579e6f05f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_eeeca8ac591242edafe63e100421534e", - "max": 2107652, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_68b138160a984dd686787cad890ff13c", - 
"value": 2107652 - } - }, - "3e134af4ddc041aab1e0f8769b77e232": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_90ef3af60627467d979237b57a8f411d", - "placeholder": "​", - "style": "IPY_MODEL_e93153b83b034468aad8516c644e8d55", - "value": " 2.11M/2.11M [00:00<00:00, 10.8MB/s]" - } - }, - "6b33bff41d3d438f9c81af109273f41e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ff568bf3a34b46edb2e50fb910608de3": { - 
"model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9e1cb31c569f40118ae949cd8643e392": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "eeeca8ac591242edafe63e100421534e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "68b138160a984dd686787cad890ff13c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "90ef3af60627467d979237b57a8f411d": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, 
- "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e93153b83b034468aad8516c644e8d55": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8b87d73ab967490481a27011d5b53236": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_739e3997f7544b73aff099c31a3d4bed", - "IPY_MODEL_730b750139364343aaee6bd28fde7f53", - "IPY_MODEL_b41c51b1111943468cb50695e5ce1f84" - ], - "layout": "IPY_MODEL_6f3f01955c0847a19b2ca4e84a06d149" - } - }, - "739e3997f7544b73aff099c31a3d4bed": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - 
"_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9084385cba634c01a9e746643e40e32c", - "placeholder": "​", - "style": "IPY_MODEL_9a694d9d63524748a8dd7f2ae59525d5", - "value": "special_tokens_map.json: 100%" - } - }, - "730b750139364343aaee6bd28fde7f53": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_6541ce75ed414eac9428fc9e0a53d128", - "max": 357, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_83436ad970f044c9abc1e79a7b7a749d", - "value": 357 - } - }, - "b41c51b1111943468cb50695e5ce1f84": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_61cba98f8dae431ea588538cdc5a2e07", - "placeholder": "​", - "style": "IPY_MODEL_cbace4a6e6bb476bb9d45d0516e5f8de", - "value": " 357/357 [00:00<00:00, 5.70kB/s]" - } - }, - "6f3f01955c0847a19b2ca4e84a06d149": { - "model_module": "@jupyter-widgets/base", - "model_name": 
"LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9084385cba634c01a9e746643e40e32c": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - 
"justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9a694d9d63524748a8dd7f2ae59525d5": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "6541ce75ed414eac9428fc9e0a53d128": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - 
"overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "83436ad970f044c9abc1e79a7b7a749d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "61cba98f8dae431ea588538cdc5a2e07": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "cbace4a6e6bb476bb9d45d0516e5f8de": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": 
"1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "fae0de63c22e4ed8afdc9b3554df9dbc": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_cce88c6717f14dbc8d601a5c306380ac", - "IPY_MODEL_5f825f83172f4f3083c63b4f6c6df441", - "IPY_MODEL_88985dba38b64de2b81a0b2da28e801b" - ], - "layout": "IPY_MODEL_534bd01df3c84807bd6a7b15d7279847" - } - }, - "cce88c6717f14dbc8d601a5c306380ac": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_d8eab87c7cc940388dece8670e466a4e", - "placeholder": "​", - "style": "IPY_MODEL_285d1ffdac3149a9b2a17c108d35cf15", - "value": "tokenizer_config.json: 100%" - } - }, - "5f825f83172f4f3083c63b4f6c6df441": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f18d22e0eeac4ca7953d0f87da7fac4a", - "max": 49, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_dce8aa0a4a14409e84bf723f84ff5be6", - "value": 49 - } - }, - "88985dba38b64de2b81a0b2da28e801b": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e15e6eb22be4477d8ef216343b7371de", - "placeholder": "​", - "style": "IPY_MODEL_801ac3fc2ff34d42a65c0898edf5d07d", - "value": " 49.0/49.0 [00:00<00:00, 2.48kB/s]" - } - }, - "534bd01df3c84807bd6a7b15d7279847": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - 
"margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d8eab87c7cc940388dece8670e466a4e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "285d1ffdac3149a9b2a17c108d35cf15": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": 
"StyleView", - "description_width": "" - } - }, - "f18d22e0eeac4ca7953d0f87da7fac4a": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "dce8aa0a4a14409e84bf723f84ff5be6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "e15e6eb22be4477d8ef216343b7371de": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - 
"_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "801ac3fc2ff34d42a65c0898edf5d07d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "6b23d26a6c6044029025fab0dfbd3555": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - 
"IPY_MODEL_cb48b8b12a584c599917844e958cd69e", - "IPY_MODEL_d91590c0670345b7a572e7f437353be3", - "IPY_MODEL_663af63865a34844b6dc4ccea7df2ed9" - ], - "layout": "IPY_MODEL_913668a324b7433f9235fcdb4bc2a644" - } - }, - "cb48b8b12a584c599917844e958cd69e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_6ed2b83044fa4b4ab0586719ef9edc96", - "placeholder": "​", - "style": "IPY_MODEL_87e9e502cd3d48be83cda4999cac92ee", - "value": "config.json: 100%" - } - }, - "d91590c0670345b7a572e7f437353be3": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e345de9ae283492887f53c83214538b4", - "max": 570, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_b40ea73a453845d3ac21caf4b112165a", - "value": 570 - } - }, - "663af63865a34844b6dc4ccea7df2ed9": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": 
"1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_04f2eb37159e4cf79a2b61e1c402d2a6", - "placeholder": "​", - "style": "IPY_MODEL_525a68ecc63b445db2a2eba949679625", - "value": " 570/570 [00:00<00:00, 24.6kB/s]" - } - }, - "913668a324b7433f9235fcdb4bc2a644": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6ed2b83044fa4b4ab0586719ef9edc96": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, 
- "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "87e9e502cd3d48be83cda4999cac92ee": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "e345de9ae283492887f53c83214538b4": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, 
- "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b40ea73a453845d3ac21caf4b112165a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "04f2eb37159e4cf79a2b61e1c402d2a6": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": 
null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "525a68ecc63b445db2a2eba949679625": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "9aa27129beed4d4ebcb728ea46db7294": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_f8d48f5c510e43b586f7cbadf0ac383e", - "IPY_MODEL_e7a1b65310404ae1a4a685cddce1d727", - "IPY_MODEL_58e3bb783e934b81b2457e88fff1c3c6" - ], - "layout": "IPY_MODEL_7490d00a7342421eb38a403386e6df64" - } - }, - "f8d48f5c510e43b586f7cbadf0ac383e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_0303f59b0de44e8ea1cb3d1a32147589", - "placeholder": "​", - "style": "IPY_MODEL_454d8927472d417081373a30fcc0f919", - "value": 
"vocab.txt: 100%" - } - }, - "e7a1b65310404ae1a4a685cddce1d727": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3618690067c7433a8aa81c8ebea5d1a3", - "max": 213450, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_63bbb5e415c44e15bbc5c88ec9759d4b", - "value": 213450 - } - }, - "58e3bb783e934b81b2457e88fff1c3c6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_5878f8b4e8b14f1d9ed602585b6634d0", - "placeholder": "​", - "style": "IPY_MODEL_52f1ae9c77b04ce1ad0d9b1a7e8270ab", - "value": " 213k/213k [00:00<00:00, 2.89MB/s]" - } - }, - "7490d00a7342421eb38a403386e6df64": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - 
"flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "0303f59b0de44e8ea1cb3d1a32147589": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "454d8927472d417081373a30fcc0f919": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "3618690067c7433a8aa81c8ebea5d1a3": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "63bbb5e415c44e15bbc5c88ec9759d4b": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "5878f8b4e8b14f1d9ed602585b6634d0": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "52f1ae9c77b04ce1ad0d9b1a7e8270ab": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d54d176daad2416ea06ae3e1e1660592": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": 
"1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_589b2d7b489946dbaf2265d20542fd66", - "IPY_MODEL_49da7f1ff2f74d208a0a440430f8845f", - "IPY_MODEL_3952eb993b7d477eaf132061ad355194" - ], - "layout": "IPY_MODEL_4804b336a4ba4e4286b352768c4789a8" - } - }, - "589b2d7b489946dbaf2265d20542fd66": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_bfd29ef47cfd4db6aa5bb12f05af5779", - "placeholder": "​", - "style": "IPY_MODEL_437c0e32e3d84087a8d0b68dbc31f0db", - "value": "tokenizer.json: 100%" - } - }, - "49da7f1ff2f74d208a0a440430f8845f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_31ad90c612e14196873b401e8862ea38", - "max": 435797, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_bf1eab7f180f45e5b378ceafca1746f4", - "value": 435797 - } - }, - "3952eb993b7d477eaf132061ad355194": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_bbca117899a64397b26a512911ba8868", - "placeholder": "​", - "style": "IPY_MODEL_fa63f4d8e2934594a6a05c51a70a607e", - "value": " 436k/436k [00:00<00:00, 14.4MB/s]" - } - }, - "4804b336a4ba4e4286b352768c4789a8": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "bfd29ef47cfd4db6aa5bb12f05af5779": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - 
"model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "437c0e32e3d84087a8d0b68dbc31f0db": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "31ad90c612e14196873b401e8862ea38": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", 
- "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "bf1eab7f180f45e5b378ceafca1746f4": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "bbca117899a64397b26a512911ba8868": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - 
"grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "fa63f4d8e2934594a6a05c51a70a607e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file diff --git a/practicals/Foundations_of_LLMs/foundations_of_llms_practical.ipynb b/practicals/Foundations_of_LLMs/foundations_of_llms_practical.ipynb new file mode 100644 index 0000000..20e998c --- /dev/null +++ b/practicals/Foundations_of_LLMs/foundations_of_llms_practical.ipynb @@ -0,0 +1,2815 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "m2s4kN_QPQVe" + }, + "source": [ + "# LLMs for everyone\n", + "\n", + "\n", + "\n", + "\"Open\n", + "\n", + "© Deep Learning Indaba 2024. Apache License 2.0.\n", + "\n", + "**Authors: Jabez Magomere, Harry Mayne, Khalil Mrini, Nabra Rizvi, Doudou Ba**\n", + "\n", + "**Reviewers: Seid Muhie Yimam, Foutse Yuehgoh**\n", + "\n", + "**Introduction:**\n", + "\n", + "Welcome to **\"LLMs for Everyone\"**—your gateway to the fascinating world of Large Language Models (LLMs)! 
To kick things off, here’s a fun fact: this entire introduction was generated by ChatGPT, one of the many powerful LLMs you'll be learning about. 🤖✨\n", + "\n", + "In this tutorial, you'll dive into the core principles of transformers, the cutting-edge technology behind models like GPT. You’ll also get hands-on experience training your very own Language Model! Get ready to explore how these impressive AI systems create such realistic and engaging text. Let’s embark on this exciting journey together and unlock the secrets of LLMs! 🚀📚\n", + "\n", + "**Topics:**\n", + "\n", + "Content: [Hugging Face Introduction, Attention Mechanism, Transformer Architecture, Training your own LLM from scratch, Finetuning an LLM for Text Classification]\n", + "\n", + "Level: Beginner, Intermediate, Advanced\n", + "\n", + "**Aims/Learning Objectives:**\n", + "\n", + "* Understand the idea behind [Attention](https://arxiv.org/abs/1706.03762) and why it is used.\n", + "* Present and describe the fundamental building blocks of the [Transformer Architecture](https://arxiv.org/abs/1706.03762) along with an intuition on such an architecture design.\n", + "* Build and train a simple Shakespeare-inspired LLM.\n", + "\n", + "**Prerequisites:**\n", + "\n", + "* Basic knowledge of Deep Learning.\n", + "* Familiarity with Natural Language Processing (NLP).\n", + "* Understanding of sequence-to-sequence models.\n", + "* Basic understanding of Linear Algebra.\n", + "\n", + "**Outline:**\n", + "\n", + ">[LLMs for everyone](#scrollTo=m2s4kN_QPQVe)\n", + "\n", + ">>[Installations, Imports and Helper Functions](#scrollTo=6EqhIg1odqg0)\n", + "\n", + ">>[Let's kick things off with a Hugging Face Demo! Beginner](#scrollTo=4zu5cg-YG4XU)\n", + "\n", + ">>>[Hugging Face](#scrollTo=AwjIIipOG4fz)\n", + "\n", + ">>>[Time for a Demo! 
⏰⚡ Loading a Hugging Face Model and Running a Sample](#scrollTo=eq46TV_0G4f0)\n", + "\n", + ">>[1. Attention](#scrollTo=-ZUp8i37dFbU)\n", + "\n", + ">>>[Intuition - Beginner](#scrollTo=ygdi884ugGcu)\n", + "\n", + ">>>[Understanding Attention in Simple Terms](#scrollTo=ygdi884ugGcu)\n", + "\n", + ">>>[Sequence to sequence attention mechanisms - Intermediate](#scrollTo=aQfqM1EJyDXI)\n", + "\n", + ">>>[Self-attention to Multihead Attention - Intermediate](#scrollTo=J-MU6rrny8Nj)\n", + "\n", + ">>>>[Self-attention](#scrollTo=0AFUEFZGzCTv)\n", + "\n", + ">>>>>[Queries, keys and values](#scrollTo=pwOIMtdZzdTf)\n", + "\n", + ">>>>>[Scaled dot product attention](#scrollTo=OhGZHFsHz_Qp)\n", + "\n", + ">>>>>[Masked attention](#scrollTo=D7B-AgO80gIt)\n", + "\n", + ">>>>>[Multi-head attention](#scrollTo=OWDubQwCs4zG)\n", + "\n", + ">>[2. 
Building your own LLM](#scrollTo=e9NW58_3hAg2)\n", + "\n", + ">>>[2.1 High-level overview Beginner](#scrollTo=bA_2coZvhAg3)\n", + "\n", + ">>>[2.2 Tokenization + Positional encoding Beginner](#scrollTo=fbTsk0MdhAhC)\n", + "\n", + ">>>>[2.2.1 Tokenization](#scrollTo=DehUpfym_RF8)\n", + "\n", + ">>>>[2.2.2 Positional encodings](#scrollTo=639s7Zuk_RF9)\n", + "\n", + ">>>>>[Sine and cosine functions](#scrollTo=rklY-aL-_RF9)\n", + "\n", + ">>>[Group Activity:](#scrollTo=1mjHEDPO_RF-)\n", + "\n", + ">>>[2.3 Transformer block Intermediate](#scrollTo=SdNPg0pnhAhG)\n", + "\n", + ">>>>[2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner](#scrollTo=kTURbfr__RF-)\n", + "\n", + ">>>>[2.3.2 Add and Norm block Beginner](#scrollTo=Sts5Vr4i_RF-)\n", + "\n", + ">>>[2.4 Building the Transformer Decoder / LLM Intermediate](#scrollTo=91dXd29b_RF_)\n", + "\n", + ">>>[2.5 Training your LLM](#scrollTo=wmt3tp38G90A)\n", + "\n", + ">>>>[2.5.1 Training objective Intermediate](#scrollTo=agLIpsoh_RGA)\n", + "\n", + ">>>>[2.5.2 Training models Advanced](#scrollTo=4CSfvGj__RGA)\n", + "\n", + ">>>>[2.5.3 Inspecting the trained LLM Beginner](#scrollTo=pGv9c2AFmF4V)\n", + "\n", + ">>[Conclusion](#scrollTo=fV3YG7QOZD-B)\n", + "\n", + ">[Feedback](#scrollTo=o1ndpYE50BpG)\n", + "\n", + "**Before you start:**\n", + "\n", + "For this practical, you will need to use a GPU to speed up training. To do this, go to the \"Runtime\" menu in Colab, select \"Change runtime type\" and then in the popup menu, choose \"GPU\" in the \"Hardware accelerator\" box." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "952qogb79nnY" + }, + "source": [ + "**Suggested experience level in this topic:**\n", + "\n", + "| Level | Experience |\n", + "| --- | --- |\n", + "| `Beginner` | It is my first time being introduced to this work. |\n", + "| `Intermediate` | I have done some basic courses/intros on this topic. |\n", + "| `Advanced` | I work in this area/topic daily. 
|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "YBdDHcI_ArCR" + }, + "outputs": [], + "source": [ + "# @title **Paths to follow:** What is your level of experience in the topics presented in this notebook? (Run Cell)\n", + "experience = \"beginner\" #@param [\"beginner\", \"intermediate\", \"advanced\"]\n", + "sections_to_follow=\"\"\n", + "\n", + "\n", + "if experience == \"beginner\": sections_to_follow = \"\"\"we recommend that you do not attempt every coding task; instead, skim through every section and make sure you interact with the LoRA finetuned LLM presented in the last section as well as with the pretrained LLM to get a practical understanding of how these models behave\"\"\"\n", + "\n", + "elif experience == \"intermediate\": sections_to_follow = \"\"\"we recommend you go through every section in this notebook and try the coding tasks tagged as beginner or intermediate. If you get stuck on the code, ask a tutor for help or move on to make better use of the time in the practical\"\"\"\n", + "\n", + "elif experience == \"advanced\": sections_to_follow = \"\"\"we recommend you go through every section and try every coding task until you get it to work\"\"\"\n", + "\n", + "\n", + "print(f\"Based on your experience, {sections_to_follow}.\\nNote: this is just a guideline, feel free to explore the colab as you'd like if you feel comfortable!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6EqhIg1odqg0" + }, + "source": [ + "## Installations, Imports and Helper Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4boGA9rYdt9l" + }, + "outputs": [], + "source": [ + "# Install necessary libraries for deep learning, NLP, and plotting\n", + "!pip install transformers datasets # Transformers and datasets libraries for NLP tasks\n", + "!pip install seaborn umap-learn # Seaborn for plotting, UMAP for dimensionality reduction\n", + "!pip 
install livelossplot # LiveLossPlot for tracking model training progress\n", + "!pip install -q transformers[torch] # Transformers with PyTorch backend\n", + "!pip install -q peft # Parameter-Efficient Fine-Tuning library\n", + "!pip install accelerate -U # Accelerate library for performance\n", + "\n", + "# Install utilities for debugging and console output formatting\n", + "!pip install -q ipdb # Interactive Python Debugger\n", + "!pip install -q colorama # Colored terminal text output\n", + "\n", + "# Import system and math utilities\n", + "import os\n", + "import math\n", + "import urllib.request\n", + "\n", + "# Check for connected accelerators (GPU or TPU) and set up accordingly\n", + "if os.environ.get(\"COLAB_GPU\") and int(os.environ[\"COLAB_GPU\"]) > 0:\n", + " print(\"A GPU is connected.\")\n", + "elif \"COLAB_TPU_ADDR\" in os.environ and os.environ[\"COLAB_TPU_ADDR\"]:\n", + " print(\"A TPU is connected.\")\n", + " import jax.tools.colab_tpu\n", + " jax.tools.colab_tpu.setup_tpu()\n", + "else:\n", + " print(\"Only CPU accelerator is connected.\")\n", + "\n", + "# Avoid GPU memory allocation to be done by JAX\n", + "os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = \"false\"\n", + "\n", + "# Import libraries for JAX-based deep learning\n", + "import chex\n", + "import flax\n", + "import flax.linen as nn\n", + "import jax\n", + "import jax.numpy as jnp\n", + "from jax import grad, jit, vmap\n", + "import optax\n", + "\n", + "# Import NLP and model-related libraries\n", + "import transformers\n", + "from transformers import pipeline, AutoTokenizer, AutoModel\n", + "import datasets\n", + "import peft\n", + "\n", + "# Import image processing and plotting libraries\n", + "from PIL import Image\n", + "from livelossplot import PlotLosses\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import seaborn as sns\n", + "\n", + "# Import additional utilities for working with text and models\n", + "import torch\n", + "import torchvision\n", + "import 
itertools\n", + "import random\n", + "import copy\n", + "\n", + "# Download an example image to use in the notebook\n", + "urllib.request.urlretrieve(\n", + " \"https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGNhdHxlbnwwfHwwfHw%3D&w=1000&q=80\",\n", + " \"cat.png\",\n", + ")\n", + "\n", + "# Import libraries for NLP preprocessing and working with pre-trained models\n", + "import gensim\n", + "from nltk.data import find\n", + "import nltk\n", + "nltk.download(\"word2vec_sample\")\n", + "\n", + "# Import Hugging Face tools and IPython widgets\n", + "import huggingface_hub\n", + "import ipywidgets as widgets\n", + "from IPython.display import display\n", + "import colorama\n", + "\n", + "# Set Matplotlib to output SVG format for better quality plots\n", + "%config InlineBackend.figure_format = 'svg'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-9X10jhocGaS" + }, + "outputs": [], + "source": [ + "# @title Helper Plotting Functions. 
(Run Cell)\n", + "\n", + "def plot_position_encodings(P, max_tokens, d_model):\n", + " \"\"\"\n", + " Plots the position encodings matrix.\n", + "\n", + " Args:\n", + " P: Position encoding matrix (2D array).\n", + " max_tokens: Maximum number of tokens (rows) to plot.\n", + " d_model: Dimensionality of the model (columns) to plot.\n", + " \"\"\"\n", + "\n", + " # Set up the plot size based on the number of tokens and model dimensions\n", + " plt.figure(figsize=(20, np.min([8, max_tokens])))\n", + "\n", + " # Plot the position encoding matrix with a color map for better visualization\n", + " im = plt.imshow(P, aspect=\"auto\", cmap=\"Blues_r\")\n", + "\n", + " # Add a color bar to indicate the encoding values\n", + " plt.colorbar(im, cmap=\"blue\")\n", + "\n", + " # Show embedding indices as ticks if the dimensionality is small\n", + " if d_model <= 64:\n", + " plt.xticks(range(d_model))\n", + "\n", + " # Show position indices as ticks if the number of tokens is small\n", + " if max_tokens <= 32:\n", + " plt.yticks(range(max_tokens))\n", + "\n", + " # Label the axes\n", + " plt.xlabel(\"Embedding index\")\n", + " plt.ylabel(\"Position index\")\n", + "\n", + " # Display the plot\n", + " plt.show()\n", + "\n", + "\n", + "def plot_image_patches(patches):\n", + " \"\"\"\n", + " Function that takes in a list of patches and plots them.\n", + "\n", + " Args:\n", + " patches: A list or array of image patches to plot.\n", + " \"\"\"\n", + "\n", + " # Set up the figure for plotting patches\n", + " fig = plt.figure(figsize=(25, 25))\n", + "\n", + " # Create a subplot for each patch and display it\n", + " axes = []\n", + " for a in range(patches.shape[1]):\n", + " axes.append(fig.add_subplot(1, patches.shape[1], a + 1))\n", + " plt.imshow(patches[0][a])\n", + "\n", + " # Adjust layout to prevent overlap and display the plot\n", + " fig.tight_layout()\n", + " plt.show()\n", + "\n", + "\n", + "def plot_projected_embeddings(embeddings, labels):\n", + " \"\"\"\n", + " Projects 
high-dimensional embeddings onto 2D space and plots them.\n", + "\n", + " Args:\n", + " embeddings: High-dimensional embedding vectors to project.\n", + " labels: Labels corresponding to each embedding for coloring in the plot.\n", + " \"\"\"\n", + "\n", + " # Import UMAP and Seaborn for dimensionality reduction and plotting\n", + " import umap\n", + " import seaborn as sns\n", + "\n", + " # Reduce the dimensionality of the embeddings to 2D using UMAP\n", + " projected_embeddings = umap.UMAP().fit_transform(embeddings)\n", + "\n", + " # Plot the 2D projections with labels using Seaborn for better aesthetics\n", + " plt.figure(figsize=(15, 8))\n", + " plt.title(\"Projected text embeddings\")\n", + " sns.scatterplot(\n", + " x=projected_embeddings[:, 0], y=projected_embeddings[:, 1], hue=labels\n", + " )\n", + "\n", + " # Display the plot\n", + " plt.show()\n", + "\n", + "\n", + "def plot_attention_weight_matrix(weight_matrix, x_ticks, y_ticks):\n", + " \"\"\"\n", + " Plots an attention weight matrix with custom axis ticks.\n", + "\n", + " Args:\n", + " weight_matrix: The attention weight matrix to plot.\n", + " x_ticks: Labels for the x-axis (typically the query tokens).\n", + " y_ticks: Labels for the y-axis (typically the key tokens).\n", + " \"\"\"\n", + "\n", + " # Set up the plot size\n", + " plt.figure(figsize=(15, 7))\n", + "\n", + " # Plot the attention weight matrix as a heatmap\n", + " ax = sns.heatmap(weight_matrix, cmap=\"Blues\")\n", + "\n", + " # Set custom ticks on the x and y axes\n", + " plt.xticks(np.arange(weight_matrix.shape[1]) + 0.5, x_ticks)\n", + " plt.yticks(np.arange(weight_matrix.shape[0]) + 0.5, y_ticks)\n", + "\n", + " # Label the plot\n", + " plt.title(\"Attention matrix\")\n", + " plt.xlabel(\"Attention score\")\n", + "\n", + " # Display the plot\n", + " plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kMkaKekB_pR4" + }, + "outputs": [], + "source": [ + "# @title Helper Text 
Processing Functions. (Run Cell)\n", + "\n", + "def get_word2vec_embedding(words):\n", + " \"\"\"\n", + " Function that takes in a list of words and returns their embeddings, based on a\n", + " pretrained word2vec encoder, together with the words found in its vocabulary.\n", + " \"\"\"\n", + " word2vec_sample = str(find(\"models/word2vec_sample/pruned.word2vec.txt\"))\n", + " model = gensim.models.KeyedVectors.load_word2vec_format(\n", + " word2vec_sample, binary=False\n", + " )\n", + "\n", + " output = []\n", + " words_pass = []\n", + " for word in words:\n", + " try:\n", + " output.append(jnp.array(model[word]))\n", + " words_pass.append(word)\n", + " except KeyError: # skip words that are not in the word2vec vocabulary\n", + " pass\n", + "\n", + " embeddings = jnp.array(output)\n", + " del model # free up space again\n", + " return embeddings, words_pass\n", + "\n", + "\n", + "def remove_punctuation(text):\n", + " \"\"\"Function that takes in a string and removes all punctuation.\"\"\"\n", + " import re\n", + "\n", + " text = re.sub(r\"[^\\w\\s]\", \"\", text)\n", + " return text\n", + "\n", + "def print_sample(prompt: str, sample: str):\n", + " \"\"\"Function that takes in a prompt instruction and model response and\n", + " prints them out in different colors to show a distinction\"\"\"\n", + " print(colorama.Fore.MAGENTA + prompt, end=\"\")\n", + " print(colorama.Fore.BLUE + sample)\n", + " print(colorama.Fore.RESET)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4zu5cg-YG4XU" + }, + "source": [ + "## Let's kick things off with a Hugging Face Demo! Beginner\n", + "\n", + "We're thrilled to have you on board! 🎉 Before we dive into the hands-on part of our journey, let's take a quick detour into the fascinating world of [Hugging Face](https://huggingface.co/)—an incredible open-source platform for building and deploying cutting-edge language models. 
🌐\n", + "\n", + "As a sneak peek into what we'll be creating today, we'll start by loading a *small* large language model (small in comparison to today's models) and prompting it with a simple instruction. This will give you a feel for how to interact with these powerful libraries. 💡 Get ready to unlock the potential of language models with just a few lines of code!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AwjIIipOG4fz" + }, + "source": [ + "### Hugging Face\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N2DSHiuhG4f0" + }, + "source": [ + "\n", + "\n", + "\n", + "[Hugging Face](https://huggingface.co/) is a startup founded in 2016 that is, in their own words, \"on a mission to democratize good machine learning, one commit at a time.\" Today they offer a treasure trove of tools for working on and with Large Language Models (LLMs).\n", + "\n", + "They have developed various open-source packages and allow users to easily interact with a large corpus of pretrained transformer models (across all modalities) and datasets to train or fine-tune pre-trained transformers. Their software is used widely in industry and research. For more details on them and usage, refer to [the 2022 attention and transformer practical](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb#scrollTo=qFBw8kRx-4Mk).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3xdt9PQ6G4f0" + }, + "source": [ + "In this colab we print prompts in pink and samples generated from a model in blue like in the example below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L-8C9SJCG4f0" + }, + "outputs": [], + "source": [ + "print_sample(prompt='My fake prompt', sample=' is awesome!')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eq46TV_0G4f0" + }, + "source": [ + "### Time for a Demo! 
⏰⚡ Loading a Hugging Face Model and Running a Sample\n", + "\n", + "Let's dive into how simple it is to load and interact with a model from Hugging Face!\n", + "\n", + "For this tutorial, we've pre-configured two model options:\n", + "\n", + "- **`gpt-neo-125M`**: A smaller model with 125 million parameters. It's faster and uses less memory—perfect for getting started! We recommend trying this one first.\n", + "- **`gpt2-medium`**: A larger model with 355 million parameters for more advanced use.\n", + "\n", + "If you want to switch models, just restart the Colab kernel and update the model name in the cell below.\n", + "\n", + "**Note**: The steps we're about to show work not only for these models but also for [all models](https://huggingface.co/models?pipeline_tag=text-generation) on Hugging Face that support text generation pipelines." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QVV28V-TG4f1" + }, + "outputs": [], + "source": [ + "# Set the model name to \"EleutherAI/gpt-neo-125M\" (this can be changed via the dropdown options)\n", + "model_name = \"EleutherAI/gpt-neo-125M\" # @param [\"gpt2-medium\", \"EleutherAI/gpt-neo-125M\"]\n", + "\n", + "# Define the prompt for the text generation model\n", + "test_prompt = 'What is love?' 
# @param {type: \"string\"}\n", + "\n", + "# Create a text generation pipeline using the specified model\n", + "generator = transformers.pipeline('text-generation', model=model_name)\n", + "\n", + "# Generate text based on the provided prompt\n", + "# 'do_sample=True' enables sampling to introduce randomness in generation, and 'min_length=30' ensures at least 30 tokens are generated\n", + "model_output = generator(test_prompt, do_sample=True, min_length=30)\n", + "\n", + "# Print the generated text sample, removing the original prompt from the output\n", + "print_sample(test_prompt, model_output[0]['generated_text'].split(test_prompt)[1].rstrip())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V5IEKl4iG4f1" + }, + "source": [ + "**💡 Tip:** Try running the code above with different prompts or with the same prompt more than once!\n", + "\n", + "**🤔 Discussion:** Why do you think the generated text changes every time, even with the same prompt? Write your response in the input field below and discuss with your neighbour." + ] + }, + { + "cell_type": "code", + "source": [ + "# Define the prompt for the text generation model\n", + "discussion_point = '' # @param {type: \"string\"}" + ], + "metadata": { + "id": "OQD5pIYCciI2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TfV0Qk6yG4f1" + }, + "source": [ + "Let's create our own `generator` function to make it easier to load different model weights and configure how text generation is done. Simply run the cells below to get started! 😀\n", + "\n", + "For now, don’t worry too much about understanding the details of the tokenizer. Just think of it as a step to convert the input into a format that the language model can understand. 
We’ll dive deeper into tokenization later in the notebook.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jxs5bO_sG4f1" + }, + "outputs": [], + "source": [ + "# Check if the model name contains 'gpt2' and load the appropriate tokenizer and model\n", + "if 'gpt2' in model_name:\n", + " # Load the GPT-2 tokenizer and model\n", + " tokenizer = transformers.GPT2Tokenizer.from_pretrained(model_name)\n", + " model = transformers.GPT2LMHeadModel.from_pretrained(model_name)\n", + "# If the model name is 'EleutherAI/gpt-neo-125M', load the corresponding tokenizer and model\n", + "elif model_name == \"EleutherAI/gpt-neo-125M\":\n", + " # Load the AutoTokenizer and AutoModel for the specified GPT-Neo model\n", + " tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)\n", + " model = transformers.AutoModelForCausalLM.from_pretrained(model_name)\n", + "# Raise an error if the model name is not supported\n", + "else:\n", + " raise NotImplementedError\n", + "\n", + "# If a GPU is available, move the model to the GPU for faster processing\n", + "if torch.cuda.is_available():\n", + " model = model.to(\"cuda\")\n", + "\n", + "# Set the padding token ID to be the same as the end-of-sequence token ID\n", + "tokenizer.pad_token_id = tokenizer.eos_token_id" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vsZEwoZJG4f1" + }, + "outputs": [], + "source": [ + "def run_sample(\n", + " model, # The language model we’ll use to generate text\n", + " tokenizer, # The tokenizer that converts text into a format the model understands\n", + " prompt: str, # The text prompt we'll give to the model to start the text generation\n", + " seed: int | None = None, # Optional: A number to make the results predictable each time\n", + " temperature: float = 0.6, # Controls how random the model’s output is; lower values make it more focused\n", + " top_p: float = 0.9, # Controls how much of the most likely words 
are considered; higher values consider more options\n", + " max_new_tokens: int = 64, # The maximum number of words or tokens the model will add to the prompt\n", + ") -> str:\n", + " # This function generates text based on a given prompt using a language model,\n", + " # with options to control randomness, the number of tokens generated, and reproducibility.\n", + "\n", + " # Convert the prompt text into tokens that the model can process\n", + " inputs = tokenizer(prompt, return_tensors=\"pt\")\n", + "\n", + " # Extract the tokens (input IDs) and attention mask (to focus on important parts) from the inputs\n", + " input_ids = inputs[\"input_ids\"]\n", + " attention_mask = inputs[\"attention_mask\"]\n", + "\n", + " # Move the tokens and attention mask to the same device as the model (like a GPU if available)\n", + " input_ids = input_ids.to(model.device)\n", + " attention_mask = attention_mask.to(model.device)\n", + "\n", + " # Set up how we want the model to generate text\n", + " generation_config = transformers.GenerationConfig(\n", + " do_sample=True, # Allow the model to add some randomness to its text generation\n", + " temperature=temperature, # Adjust how random the output is; lower means more focused\n", + " top_p=top_p, # Consider the most likely words that make up the top 90% of possibilities\n", + " pad_token_id=tokenizer.pad_token_id, # Use the token ID that represents padding (extra space)\n", + " top_k=0, # We're not limiting to the top-k words, so we set this to 0\n", + " )\n", + "\n", + " # If a seed is provided, set it so that the results are repeatable (same output each time)\n", + " if seed is not None:\n", + " torch.manual_seed(seed)\n", + "\n", + " # Generate text using the model with the settings we defined\n", + " generation_output = model.generate(\n", + " input_ids=input_ids, # Provide the input tokens to the model\n", + " attention_mask=attention_mask, # Provide the attention mask to help the model focus\n", + " 
return_dict_in_generate=True, # Ask the model to return detailed information\n", + " output_scores=True, # Include the scores (confidence levels) for the generated tokens\n", + " max_new_tokens=max_new_tokens, # Set the maximum number of tokens to generate\n", + " generation_config=generation_config, # Apply our custom text generation settings\n", + " )\n", + "\n", + " # Make sure only one sequence (output) is generated, to keep things simple\n", + " assert len(generation_output.sequences) == 1\n", + "\n", + " # Get the generated sequence of tokens\n", + " output_sequence = generation_output.sequences[0]\n", + "\n", + " # Convert the generated tokens back into readable text\n", + " output_string = tokenizer.decode(output_sequence)\n", + "\n", + " # Print the prompt and the generated response\n", + " print_sample(prompt, output_string)\n", + "\n", + " # Return the generated text response\n", + " return output_string" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Yme6VzW4G4f1" + }, + "outputs": [], + "source": [ + "_ = run_sample(model, tokenizer, prompt=\"What is love?\", temperature = 0.5, seed=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V7vnUawyG4f1" + }, + "source": [ + "Pretty amazing, right? 🤩 Try playing around with the **prompt**, **temperature** and **seed** values above and see what different outputs you get. What do you notice when you increase the temperature? While this might have been mind-blowing back in 2021, by now, most of you have likely interacted with large language models in some way. Today, we're going to take things a step further by training our own **Shakespeare-inspired LLM**. 
This will give us a hands-on understanding of how these language models work under the hood.\n", + "\n", + "But before we jump into training, let’s first build a solid understanding of what **Large Language Models** are and the key **Machine Learning** concepts that make this groundbreaking technology possible. At the heart of today’s state-of-the-art (SoTA) LLMs are the **Attention Mechanism** and the **Transformer Architecture**. We’ll explore these essential concepts in the upcoming sections of this tutorial. 🚀💡\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-ZUp8i37dFbU" + }, + "source": [ + "## **1. Attention**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "acgW1ofF_RFz" + }, + "source": [ + "The attention mechanism is inspired by how humans would look at an image or read a sentence.\n", + "\n", + "Let us take the image of the dog in human clothes below (image and example [source](https://lilianweng.github.io/posts/2018-06-24-attention/)). When paying *attention* to the red blocks of pixels, we will say that the yellow block of pointy ears is something we expected (correlated) but that the grey blocks of human clothes are unexpected for us (uncorrelated). This is *based on what we have seen in the past* when looking at pictures of dogs, specifically one of a Shiba Inu.\n", + "\n", + "\"drawing\"\n", + "\n", + "Assume we want to identify the dog breed in this image. When we look at the red blocks of pixels, we tend to pay more *attention* to relevant pixels that are more similar or relevant to them, which could be the ones in the yellow box. We almost completely ignore the snow in the background and the human clothing for this task.\n", + "\n", + "Alternatively, when we begin looking at the background in an attempt to identify what is in it, we subconsciously ignore the dog pixels because they are irrelevant to the current task." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "usLBF2g0x5gH" + }, + "source": [ + "The same thing happens when we read. In order to understand the entire sentence, we will learn to correlate and *attend to* certain words based on the context of the entire sentence.\n", + "\n", + "\"drawing\"\n", + "\n", + "For instance, in the first sentence in the image above, when looking at the word \"coding\", we pay more attention to the words \"Apple\" and \"computer\" because we know that when we speak about coding, \"Apple\" is actually referring to the company. However, in the second sentence, we realise we should not consider \"apple\" when looking at \"code\" because given the context of the rest of the sentence, we know that this apple is referring to an actual apple and not a computer.\n", + "\n", + "We can build better models by developing mechanisms that mimic attention. Doing so enables our models to learn better representations of our input data by contextualising what they know about some parts of the input based on other parts. In the following sections, we will explore the mechanisms that enable us to train deep learning models to attend to input data in the context of other input data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ygdi884ugGcu" + }, + "source": [ + "### Intuition - Beginner\n", + "\n", + "Imagine attention as a mechanism that allows a neural network to focus more on certain parts of data. By doing this, the network can enhance its grasp of the problem it's working on, updating its understanding or representations accordingly.\n", + "\n", + "### Understanding Attention in Simple Terms\n", + "\n", + "One way to implement attention in neural networks is by representing each word (or even parts of a word) as a vector.\n", + "\n", + "So, what’s a vector? A vector is simply an array of real-valued numbers, and vectors can have different lengths. 
Think of it like a list of values that describe certain properties of a word. These vectors allow us to measure how similar two words are to each other. One common way to measure this similarity is by calculating something called the **dot product**.\n", + "\n", + "The result of this similarity calculation is what we refer to as **attention.** This attention value helps the model decide how much one word should influence the representation of another word.\n", + "\n", + "In simpler terms, if two words have similar vector representations, it means they’re likely related or important to each other. Because of this relationship, they affect each other’s representations inside the neural network, allowing the model to understand the context better. 🎯\n", + "\n", + "To illustrate how the dot product can create meaningful attention weights, we'll use pre-trained [word2vec](https://jalammar.github.io/illustrated-word2vec/) embeddings. These word2vec embeddings are generated by a neural network that learned to create similar embeddings for words with similar meanings.\n", + "\n", + "By calculating the matrix of dot products between all vectors, we get an attention matrix. This will indicate which words are correlated and therefore should \"attend\" to each other.\n", + "\n", + "[1] You can find more details about how this is done for LLMs in the \"Building Your Own LLM\" session." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OvBYShCFk6WC" + }, + "source": [ + "**Code task** Intermediate: Complete the dot product attention function below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yrbITGPnk7Ce" + }, + "outputs": [], + "source": [ + "def dot_product_attention(hidden_states, previous_state):\n", + "    \"\"\"\n", + "    Calculate the dot product between the hidden states and previous states.\n", + "\n", + "    Args:\n", + "        hidden_states: A tensor with shape [T_hidden, dm]\n", + "        previous_state: A tensor with shape [T_previous, dm]\n", + "    \"\"\"\n", + "\n", + "    # Hint: To calculate the attention scores, think about how you can use the `previous_state` vector\n", + "    # and the `hidden_states` matrix. You want to find out how much each element in `previous_state`\n", + "    # should \"pay attention\" to each element in `hidden_states`. Remember that in matrix multiplication,\n", + "    # you can find the relationship between two sets of vectors by multiplying one by the transpose of the other.\n", + "    # Hint: Use `jnp.matmul` to perform the matrix multiplication between `previous_state` and the\n", + "    # transpose of `hidden_states` (`hidden_states.T`).\n", + "    scores = ... # FINISH ME\n", + "\n", + "    # Hint: Now that you have the scores, you need to convert them into probabilities.\n", + "    # A softmax function is typically used in attention mechanisms to turn raw scores into probabilities\n", + "    # that sum to 1. This will help in determining how much focus should be placed on each hidden state.\n", + "    # Hint: Use `jax.nn.softmax` to apply the softmax function to `scores`.\n", + "    w_n = ... 
# FINISH ME\n", + "\n", + "    # Multiply the weights by the hidden states to get the context vector\n", + "    # Hint: Use `jnp.matmul` again to multiply the attention weights `w_n` by `hidden_states`\n", + "    # to get the context vector.\n", + "    c_t = jnp.matmul(w_n, hidden_states)\n", + "\n", + "    return w_n, c_t" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QARgTrNZlIqH" + }, + "outputs": [], + "source": [ + "# @title Run me to test your code\n", + "\n", + "key = jax.random.PRNGKey(42)\n", + "x = jax.random.normal(key, [2, 2])\n", + "\n", + "try:\n", + "    w_n, c_t = dot_product_attention(x, x)\n", + "\n", + "    w_n_correct = jnp.array([[0.9567678, 0.04323225], [0.00121029, 0.99878967]])\n", + "    c_t_correct = jnp.array([[0.11144122, 0.95290256], [-1.5571996, -1.5321486]])\n", + "    assert jnp.allclose(w_n_correct, w_n), \"w_n is not calculated correctly\"\n", + "    assert jnp.allclose(c_t_correct, c_t), \"c_t is not calculated correctly\"\n", + "\n", + "    print(\"It seems correct. Look at the answer below to compare methods.\")\n", + "except Exception:\n", + "    print(\"It looks like the function isn't fully implemented yet. 
Try modifying it.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Qa6PyKYnkzUJ" + }, + "outputs": [], + "source": [ + "# when changing these words, note that if the word is not in the original\n", + "# training corpus it will not be shown in the weight matrix plot.\n", + "# @title Answer to code task (Try not to peek until you've given it a good try!)\n", + "def dot_product_attention(hidden_states, previous_state):\n", + "    # Calculate the attention scores:\n", + "    # Multiply the previous state vector by the transpose of the hidden states matrix.\n", + "    # This gives us a matrix of scores that show how much attention each element in the previous state\n", + "    # should pay to each element in the hidden states.\n", + "    # The result is a matrix of shape [T_previous, T_hidden], where:\n", + "    # T_previous is the number of elements in the previous state,\n", + "    # T_hidden is the number of elements in the hidden states.\n", + "    scores = jnp.matmul(previous_state, hidden_states.T)\n", + "\n", + "    # Apply the softmax function to the scores to convert them into probabilities.\n", + "    # This normalizes the scores so that they sum up to 1 for each element of the previous state,\n", + "    # allowing us to interpret them as how much attention should be given to each hidden state.\n", + "    w_n = jax.nn.softmax(scores)\n", + "\n", + "    # Calculate the context vector (c_t):\n", + "    # Multiply the attention weights (w_n) by the hidden states.\n", + "    # This combines the hidden states based on how much attention each one deserves,\n", + "    # resulting in a new vector that represents the weighted sum of the hidden states.\n", + "    # The resulting shape is [T_previous, dm], where:\n", + "    # T_previous is the number of elements in the previous state,\n", + "    # dm is the dimension of the hidden states.\n", + "    c_t = jnp.matmul(w_n, hidden_states)\n", + "\n", + "    # Return the attention weights and the context vector.\n", + "    return w_n, c_t\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { 
+ "id": "QlHL3e_QhLfq" + }, + "outputs": [], + "source": [ + "words = [\"king\", \"queen\", \"royalty\", \"food\", \"apple\", \"pear\", \"computers\"]\n", + "word_embeddings, words = get_word2vec_embedding(words)\n", + "weights, _ = dot_product_attention(word_embeddings, word_embeddings)\n", + "plot_attention_weight_matrix(weights, words, words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tItZU09YlhEZ" + }, + "source": [ + "Looking at the matrix, we can see which words have similar meanings. The \"royal\" words have higher attention scores with each other than with the \"food\" words, which in turn mostly attend to one another. We also see that \"computers\" has very low attention scores with all of the other words, which shows that it is not closely related to either the \"royal\" or the \"food\" words. \n", + "\n", + "**Group task:**\n", + " - Play with the word selections above. See if you can find word combinations whose attention values seem counter-intuitive. Think of possible explanations. Which sense of a word did the attention scores capture?\n", + " - Ask your friend if they found examples."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S3iB8hf0hJdX" + }, + "source": [ + "**Note**: The dot product is only one of the ways to implement the scoring function for attention mechanisms; there is a more extensive list in this [blog](https://lilianweng.github.io/posts/2018-06-24-attention/#summary) post by Dr Lilian Weng.\n", + "\n", + "More resources:\n", + "\n", + "[A basic encoder-decoder model for machine translation](https://www.youtube.com/watch?v=gHk2IWivt_8&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=1)\n", + "\n", + "[Training and loss for encoder-decoder models](https://www.youtube.com/watch?v=aBZUTuT1Izs&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=2)\n", + "\n", + "[Basic attention](https://www.youtube.com/watch?v=BSSoEtv5jvQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=6)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aQfqM1EJyDXI" + }, + "source": [ + "### Sequence to sequence attention mechanisms - Intermediate\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "68QBeG-4yDZ9" + }, + "source": [ + "The first attention mechanisms were used in sequence-to-sequence models. These models were usually RNN encoder and decoder structures. The input sequence was processed sequentially by an RNN, which encoded the sequence in a single context vector; this vector was then fed into another RNN that generated a new sequence. Below is an example of this ([source](https://lilianweng.github.io/posts/2018-06-24-attention/)).\n", + "\n", + "\n", + "\"drawing\"\n", + "\n", + "Since there is only one context vector, it is challenging for the encoder to represent long sequences and information typically gets lost. 
The attention mechanism introduced in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) was proposed to solve this.\n", + "\n", + "Here, instead of relying on one static context vector, which is also only used once in the decoding process, let us provide information on the entire input sequence at every decoding step using a dynamic context vector. By doing this, the decoder can access a larger \"bank\" of memory and attend to the input's required information based on the current decoder RNN output state, $s_t$. This is shown below.\n", + "\n", + "\"drawing\"\n", + "\n", + "In deep learning, attention can be interpreted as a vector of \"importance.\" To predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate how strongly it is correlated with, or \"attends to,\" other elements using the attention vector/weights. These attention weights are then used to generate a new weighted sum of the remaining elements, which represents the target [(source)](https://lilianweng.github.io/posts/2018-06-24-attention/).\n", + "\n", + "\n", + "This, usually, consists of three steps for each decoding step $t$:\n", + "\n", + "1. Calculate the score (importance) for each $h_n$, given $s_{t-1}$ and use the softmax function to transform this into an attention vector, $w_{n}$.\n", + " - $\\text{score} = a(s_{t−1}, h_{n})$, where $a$ can be any differentiable function, such as the dot product.\n", + " - $w_{n} = \\frac{\\exp \\left\\{a\\left(s_{t-1}, h_{n}\\right)\\right\\}}{\\sum_{j=1}^{N} \\exp \\left\\{a\\left(s_{t-1}, h_{j}\\right)\\right\\}}$, where we use the softmax function to transform the raw scores to relative attention weights.\n", + "2. Generate the final context vector, $c_t$, by summing the products of the attention weights and the encoder context vectors.\n", + " - $c_t=\\sum_{n=1}^{N} w_n h_{n}$\n", + "3. 
Generate the subsequent decoder state $s_{t+1}$ by combining the current decoder state, $s_t$, with the context vector, $c_t$, via some function, $f$.\n", + "\n", + " - $s_{t+1} = f\\left ( c_t, s_t \\right)$\n", + "\n", + " In Bahdanau et al., 2015, $f$ was a gated recurrent unit (which also receives the previously generated token), and the scoring function $a(s_{t−1}, h_{n})$ was a small feed-forward network trained jointly with the rest of the model (additive attention). A simpler and widely used alternative for $a$ is the dot product.\n", + " \n", + "Next, let us build up this attention schema, as used in the transformer architecture. We've already calculated simple dot product attention, where the score was given by $a(s_{t-1}, h_n)=s_{t-1} h_n^\\top$, and we're going to use the same idea again." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J-MU6rrny8Nj" + }, + "source": [ + "### Self-attention to Multihead Attention - Intermediate\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BRuLtxNey_EQ" + }, + "source": [ + "Self-attention and multi-head attention (MHA) are fundamental components of the transformer architecture. In this section, we'll thoroughly explain the intuition behind these concepts and their implementation. Later, in the **Transformers** section, you'll learn how these attention mechanisms are used to create a sequence-to-sequence model that relies entirely on attention.\n", + "\n", + "As we move forward, we'll represent sentences by breaking them down into individual words and encoding each word using the word2vec model discussed earlier. In the Transformers section, we'll explore in more detail how input sequences are transformed into a series of vectors."
+ ] + }, + { + "cell_type": "code", + "source": [ + "def embed_sentence(sentence):\n", + "    \"\"\"\n", + "    Embed a sentence using word2vec; for example use cases only.\n", + "    \"\"\"\n", + "    # clean sentence (not necessary if using a proper LLM tokenizer)\n", + "    sentence = remove_punctuation(sentence)\n", + "\n", + "    # extract individual words\n", + "    words = sentence.split()\n", + "\n", + "    # get the word2vec embedding for each word in the sentence\n", + "    word_vector_sequence, words = get_word2vec_embedding(words)\n", + "\n", + "    # return with extra dimension (useful for creating batches later)\n", + "    return jnp.expand_dims(word_vector_sequence, axis=0), words" + ], + "metadata": { + "id": "J2z6-NckgNT-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0AFUEFZGzCTv" + }, + "source": [ + "#### Self-attention" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LF2V3KI-za9l" + }, + "source": [ + "Self-attention is an attention mechanism where each vector of a given input sequence attends to the entire sequence. To gain an intuition for why self-attention is important, let us think about the following sentence (example taken from [source](https://jalammar.github.io/illustrated-transformer/)):\n", + "\n", + "`\"The animal didn't cross the street because it was too tired.\"`\n", + "\n", + "A simple question about this sentence is: what does the word \"it\" refer to? Even though it might look simple, it can be tough for an algorithm to learn this. This is where self-attention comes in, as it can learn an attention matrix for the word \"it\" where a large weight is assigned to the word \"animal\".\n", + "\n", + "Self-attention also allows the model to learn how to interpret words with the same embeddings, such as apple, which can be a company or food, depending on the context. 
This is very similar to the hidden state found within an RNN, but this process, as you will see, allows the model to attend over the entire sequence in parallel, allowing longer sequences to be utilised.\n", + "\n", + "Self-attention consists of three concepts:\n", + "\n", + "- Queries, keys and values\n", + "- Scaled dot product attention\n", + "- Masks" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pwOIMtdZzdTf" + }, + "source": [ + "##### **Queries, keys and values**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mEf7QWIWzdo1" + }, + "source": [ + "Typically all attention mechanisms can be written in terms of `key-value` pairs and `queries` to calculate the attention matrix and new context vector.\n", + "\n", + "To gain intuition, one can interpret the `query` vector as containing the information we are interested in obtaining and the `key` vectors as having some information. The `query` vectors are compared to the `key` vectors to get attention scores, where a higher attention score indicates a `key` had relevant information. These attention scores are then used to determine which `values` (which are paired with the `keys`) we should attend to. Or as [Lena Voita](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html) puts it:\n", + "\n", + "- Query: asking for information\n", + "- Key: saying that it has some information\n", + "- Value: giving the information\n", + "\n", + "In transformer architectures, we use learnable weights matrices, represented as $W_Q,W_K,W_V$, to project each sequence vector to unique $q$, $k$, and $v$ vectors.\n", + "\n", + "\"drawing\"\n", + "\n", + "You will notice that the vectors $q,k,v$ are smaller in size than the input vectors. 
This will be covered at a later stage, but just know that it is a design choice for transformers and not a requirement for the mechanism to work.\n", + "\n", + "This process can also be parallelised, as the input sequence can be represented as a matrix $X$, which can be transformed into query, key, and value matrices $Q$, $K$, and $V$ respectively:\n", + "\n", + "$Q=W_QX \\\\ K=W_KX \\\\ V=W_VX$\n", + "\n", + "Below we show the code that creates three linear layers, which project the input data to the $Q,K,V$ matrices, where the output size can be adjusted." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Xc8zjK6eziIV" + }, + "outputs": [], + "source": [ + "class SequenceToQKV(nn.Module):\n", + "    output_size: int\n", + "\n", + "    @nn.compact\n", + "    def __call__(self, X):\n", + "\n", + "        # define the method for weight initialisation\n", + "        initializer = nn.initializers.variance_scaling(scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\")\n", + "\n", + "        # initialise three linear layers to do the QKV transformations.\n", + "        # note: this can also be one layer, how do you think you would do it?\n", + "        q_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", + "        k_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", + "        v_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", + "\n", + "        # transform and return the matrices\n", + "        Q = q_layer(X)\n", + "        K = k_layer(X)\n", + "        V = v_layer(X)\n", + "\n", + "        return Q, K, V" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OhGZHFsHz_Qp" + }, + "source": [ + "##### **Scaled dot product attention**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DxycHDUW0BVE" + }, + "source": [ + "Now that we have our `query`, `key` and `value` matrices, it is time to calculate the attention matrix. 
Remember, in all attention mechanisms; we must first find a score for each vector in the sequence and then use these scores to create a new context vector. In self-attention scoring is done using scaled dot product attention, and then the normalised scores are used as weights to sum the value vectors and create the context vector.\n", + "\n", + "$\\operatorname{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right) V$\n", + "\n", + "where the attention scores are calculated by $\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right)$ and the scores are then multiplied by $V$ to get the context vector.\n", + "\n", + "\n", + "What happens here is similar to what we did in the dot product attention in the previous section, just applying the mechanism to the sequence itself. For each element in the sequence, we calculate the attention weight matrix between $q_i$ and $K$. We then multiply $V$ by each weight and finally sum all weighted vectors $v_{weighted}$ together to form a new representation for $q_i$. 
By doing this, we are essentially drowning out irrelevant vectors and bringing up important vectors in the sequence when our focus is on $q_i$.\n", + "\n", + "$QK^\\top$ is scaled by the square root of the dimension of the vectors, $\\sqrt{d_k}$, to ensure more stable gradients during training.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "i_UYNzrS0Hga" + }, + "outputs": [], + "source": [ + "def scaled_dot_product_attention(query, key, value):\n", + "    \"\"\"\n", + "    Formula to return scaled dot product attention given QKV matrices\n", + "    \"\"\"\n", + "    d_k = key.shape[-1]\n", + "\n", + "    # get the raw scores (logits) by taking the dot product of the queries and keys\n", + "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", + "\n", + "    # scale the raw scores and apply the softmax function to get the attention scores/weights\n", + "    scaled_logits = logits / jnp.sqrt(d_k)\n", + "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", + "\n", + "    # multiply the weights by the value matrix to get the output\n", + "    output = jnp.matmul(attention_weights, value)\n", + "\n", + "    return output, attention_weights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cuNaEjIm0PhV" + }, + "source": [ + "Let's now see scaled dot product attention in action. We will take a sentence, embed each word using word2vec, and see what the final self-attention weights look like.\n", + "\n", + "We will not use the linear projection layers here, as we would need to train them first. Instead, we are going to make things simple and use $X=Q=V=K$."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3Oy2sWzR0Ok5" + }, + "outputs": [], + "source": [ + "# define a sentence\n", + "sentence = \"I drink coke, but eat steak\"\n", + "\n", + "# embed and create QKV matrices\n", + "word_embeddings, words = embed_sentence(sentence)\n", + "Q = K = V = word_embeddings\n", + "\n", + "# calculate weights and plot\n", + "outputs, attention_weights = scaled_dot_product_attention(Q, K, V)\n", + "\n", + "# plot the words and the attention weights between them\n", + "words = remove_punctuation(sentence).split()\n", + "plot_attention_weight_matrix(attention_weights[0], words, words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NG1Kxljr0Vzw" + }, + "source": [ + "Keep in mind that we have not trained our attention matrix yet. However, we can see that by utilising the word2vec vectors as our sequence, we can see how scaled dot product attention already is capable of attending to \"eat\" when \"steak\" is our query and that the query \"drink\" attends more to \"coke\" and \"eat\".\n", + "\n", + "More resources:\n", + "\n", + "[Attention with Q,K,V](https://www.youtube.com/watch?v=k-5QMalS8bQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=7)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D7B-AgO80gIt" + }, + "source": [ + "##### **Masked attention**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tdRoKsu70gGW" + }, + "source": [ + "There are cases where applying self-attention over the entire sequence is not practical. These can include:\n", + "\n", + "- Uneven length sequences batched together.\n", + " - When sending a batch of sequences through a network, the self-attention expects each sequence to be the same length. One handles this by padding the sequence. 
When calculating attention, ideally, these padding tokens should not be taken into consideration.\n", + "- Training a decoder model.\n", + " - When training decoder models, such as GPT-3, the decoder has access to the entire target sequence when training (as training is done in parallel). In order to prevent the model from cheating by looking at future tokens, we have to mask the future sequence data so that earlier data cannot attend to it.\n", + "\n", + "By applying a mask to the final score calculated between queries and keys, we can mitigate the influence of the unwanted sequence vectors. **The vectors are masked by making the score between the query and their respective keys a VERY large negative value.** This results in the softmax function pushing the attention weight very close to zero, so the corresponding value vector contributes almost nothing to the weighted sum and does not influence the final representation.\n", + "\n", + "\n", + "Putting everything together, masked scaled dot product attention visually looks like this:\n", + "\n", + "\"drawing\".\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5Syx8_5E0eM9" + }, + "outputs": [], + "source": [ + "# example of building a mask for a sequence of 32 tokens\n", + "# the mask makes sure that positions only attend to the current and previous positions in the input (causal mask)\n", + "# we will use this later to insert -inf values into the raw scores\n", + "mask = jnp.tril(jnp.ones((32, 32)))\n", + "\n", + "# plot\n", + "sns.heatmap(mask, cmap=\"Blues\")\n", + "plt.title(\"Example of mask that can be applied\");" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pfwTJrQ20gDw" + }, + "source": [ + "Let's now adapt our scaled dot product attention function to implement masked attention."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PVHpyNs_0ePh" + }, + "outputs": [], + "source": [ + "def scaled_dot_product_attention(query, key, value, mask=None):\n", + "    \"\"\"\n", + "    Scaled dot product attention with an optional mask (e.g. a causal mask that only allows attending to the current and previous positions)\n", + "    \"\"\"\n", + "    d_k = key.shape[-1]\n", + "    T_k = key.shape[-2]\n", + "    T_q = query.shape[-2]\n", + "\n", + "    # get scaled logits using dot product as before\n", + "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", + "    scaled_logits = logits / jnp.sqrt(d_k)\n", + "\n", + "    # apply the optional mask: scores at positions where the mask is 0 are set to -inf\n", + "    if mask is not None:\n", + "        scaled_logits = jnp.where(mask[:T_q, :T_k], scaled_logits, -jnp.inf)\n", + "\n", + "    # calculate the attention weights via softmax\n", + "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", + "\n", + "    # take the weighted sum of the values to get the output\n", + "    output = jnp.matmul(attention_weights, value)\n", + "\n", + "    return output, attention_weights" + ] + }, + { + "cell_type": "markdown", + "source": [ + "##### **Multi-head attention**" + ], + "metadata": { + "id": "OWDubQwCs4zG" + } + }, + { + "cell_type": "markdown", + "source": [ + "The attention mechanism we've covered so far successfully allows the model to focus on different positions in the input. In practice, the transformer architecture uses a subtle variation of this mechanism, called multi-head attention (MHA).\n", + "\n", + "The distinction is minimal; rather than only computing the attention once, the MHA mechanism runs through the scaled dot-product attention multiple times in parallel. According to the paper, *Attention is All You Need*, \"multi-head attention allows the model to **jointly attend** to information from different representation subspaces at different positions. 
With a single attention head, averaging inhibits this.\"\n", + "\n", + "Multi-head attention can be viewed as a similar strategy to stacking convolution kernels in a CNN layer. This allows the kernels to focus on and learn different features and rules, which is why multiple heads of attention also work.\n", + "\n", + "The figure below shows how basic MHA works. The scaled dot product attention discussed earlier is just repeated $N$ times ($N=2$ in this figure), with three learnable projection matrices per head ($3N$ in total). The outputs from the different heads are then concatenated and fed through a linear projection, which produces the final representation.\n", + "\n", + "In practice, MHA significantly out-performs single-head attention.\n", + "\n", + "\"drawing\"\n" + ], + "metadata": { + "id": "nHkyjyErsYae" + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's take a look at how to implement multi-head attention. In simple terms, multi-head attention is like running the attention process multiple times in parallel, using different copies of the Q, K, and V matrices for each \"head.\" This helps the model focus on different parts of the input at the same time. If you're interested in learning more, check out [this blog by Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention) for a detailed explanation."
+ ], + "metadata": { + "id": "vtuqNCln9EWW" + } + }, + { + "cell_type": "code", + "source": [ + "class MultiHeadAttention(nn.Module):\n", + " num_heads: int # Number of attention heads\n", + " d_m: int # Dimension of the model's embeddings\n", + "\n", + " def setup(self):\n", + " # Initialize the sequence-to-QKV transformation module\n", + " self.sequence_to_qkv = SequenceToQKV(self.d_m)\n", + "\n", + " # Define the initializer for the output linear layer weights\n", + " initializer = nn.initializers.variance_scaling(\n", + " scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\"\n", + " )\n", + "\n", + " # Initialize the output projection layer Wo (used after attention)\n", + " self.Wo = nn.Dense(self.d_m, kernel_init=initializer)\n", + "\n", + " def __call__(self, X=None, Q=None, K=None, V=None, mask=None, return_weights=False):\n", + " # If Q, K, or V are not provided, use the input X to generate them\n", + " if None in [Q, K, V]:\n", + " assert not X is None, \"X has to be provided if either Q, K, or V are not provided\"\n", + "\n", + " # Generate Q, K, and V matrices from the input X\n", + " Q, K, V = self.sequence_to_qkv(X)\n", + "\n", + " # Extract the batch size (B), sequence length (T), and embedding size (d_m)\n", + " B, T, d_m = K.shape\n", + "\n", + " # Calculate the size of each attention head's embedding (d_m / num_heads)\n", + " head_size = d_m // self.num_heads\n", + "\n", + " # Reshape Q, K, V to have separate dimensions for the heads\n", + " # B, T, d_m -> B, T, num_heads, head_size -> B, num_heads, T, head_size\n", + " q_heads = Q.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", + " k_heads = K.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", + " v_heads = V.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", + "\n", + " # Apply scaled dot-product attention to each head\n", + " attention, attention_weights = scaled_dot_product_attention(\n", + " q_heads, k_heads, v_heads, mask\n", + " )\n", + "\n", + " # 
Reshape the attention output back to its original dimensions\n", + "        # (B, num_heads, T, head_size) -> (B, T, num_heads, head_size) -> (B, T, d_m)\n", + "        attention = attention.swapaxes(1, 2).reshape(B, T, d_m)\n", + "\n", + "        # Apply the output linear transformation Wo to the attention output\n", + "        X_new = self.Wo(attention)\n", + "\n", + "        # If return_weights is True, return both the transformed output and attention weights\n", + "        if return_weights:\n", + "            return X_new, attention_weights\n", + "        else:\n", + "            # Otherwise, return just the transformed output\n", + "            return X_new" + ], + "metadata": { + "id": "BY2xXLMQ9CB6" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e9NW58_3hAg2" + }, + "source": [ + "## **2. Building your own LLM** " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bA_2coZvhAg3" + }, + "source": [ + "### 2.1 High-level overview Beginner" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BflycqAw_RF8" + }, + "source": [ + "The Transformer Architecture was famously introduced in the paper entitled [Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al.\n", + "\n", + "As the title of the paper suggests, the architecture consists almost entirely of attention mechanisms, combined with feed-forward layers and linear layers, as shown in the diagram below.\n", + "\n", + "\n", + "\n", + "The Transformer and its variants are at the core of Large Language Models, and it is no exaggeration to say that almost all language models today are Transformer-based architectures.\n", + "\n", + "As you can see in the diagram, the original Transformer architecture consists of two parts: one that receives the inputs, usually called the encoder, and another that receives the outputs (i.e. targets), called the decoder. 
This is because the transformer was designed for machine translation.\n", + "\n", + "The encoder will receive an input sentence in one language and process it through multiple stacked `encoder blocks`. This creates a final representation, which contains helpful information necessary for the decoding task. This output is then fed into stacked `decoder blocks` that produce new outputs in an autoregressive manner.\n", + "\n", + "The encoder consists of $N$ identical blocks, which process the sequence of token vectors one block after another. These blocks consist of 3 parts:\n", + "\n", + "1. A multi-head attention block. These are the transformer architecture's backbone. They process the data to generate representations for each token, ensuring that the necessary information for the task at hand is represented in the vectors. This is exactly the MHA we covered in the attention section previously.\n", + "2. An MLP (Multi-Layer Perceptron, i.e. a neural network with multiple layers), which is applied to each input token separately and identically.\n", + "3. Residual connections: one adds the input tokens to the attended representations, and another connects the input to the MLP with its outputs. For both these connections, the result is normalized using layernorm. In certain implementations, these normalization steps are applied to the inputs rather than the outputs. Just like a ResNet, transformers are designed to be very deep models, so these add and norm blocks are essential for a smooth gradient flow. \n", + "\n", + "Similarly, the decoder consists of $N$ identical blocks; however, there is some variation within these blocks. Concretely, the different parts are:\n", + "\n", + "1. A masked multi-head attention block. This is an MHA block that performs _self-attention_ on the output sequence; however, this computation is restricted to the tokens that have already been seen. In other words, future tokens are blocked when making predictions.\n", + "2. 
A multi-head attention block. This block receives the output of the final encoder block, the transformed tokens, and uses that as the key-value pairs, while using the output of the first MHA block as the query. In doing this, the model attends over the input required to perform the sequence task. This MHA block thus performs _cross-attention_ by looking at the encoder inputs.\n", + "3. An MLP, the same as in the encoder.\n", + "4. Residual connections, the same as in the encoder.\n", + "\n", + "Building on this original architecture, several variations have emerged, some focusing on the encoder only and others on the **decoder only**. Large language models (LLMs) such as GPT-2, GPT-3 and Turing-NLG were born out of decoder-only architectures. These architectures look like:\n", + "\n", + "\"drawing\"\n", + "\n", + "with the cross-attention block missing, as no encoder output is available. So to build a language model, we will focus on the decoder-only architecture as seen above.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fbTsk0MdhAhC" + }, + "source": [ + "### 2.2 Tokenization + Positional encoding Beginner\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DehUpfym_RF8" + }, + "source": [ + "#### 2.2.1 Tokenization" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uBiFpVBu_RF9" + }, + "source": [ + "\n", + "Transformers cannot handle raw strings of text. So to process text, the text is first split up into tokens. The tokens are then indexed and each token is assigned an embedding of size $d_{model}$. These embeddings can be learned during training or can come from a pretrained vocabulary of embeddings. This new sequence of token embeddings is then fed into the transformer architecture. 
This idea is visualised below.\n", + "\n", + "\\\\\n", + "\n", + "\"drawing\"\n", + "\n", + "\n", + "These token IDs are typically predicted when a model generates text, fills in missing words, etc.\n", + "\n", + "This process of splitting up text into tokens and assigning an ID to each token is called [tokenisation](https://huggingface.co/docs/transformers/tokenizer_summary). There are various ways to tokenise text, with some methods being trained directly from the data. When using pre-trained transformers, it is crucial to use the same tokeniser that was used to train the model. The previous link has in-depth descriptions of many widely known techniques.\n", + "\n", + "Below we show how the [BERT](https://arxiv.org/abs/1810.04805) model's tokeniser tokenises a sentence. We use [Hugging Face](https://huggingface.co/) for this part.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hJBMvlUA_RF9" + }, + "outputs": [], + "source": [ + "import transformers\n", + "from transformers import pipeline, AutoTokenizer, AutoModel\n", + "\n", + "bert_tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n", + "encoded_input = bert_tokenizer(\"The practical is so much fun\")\n", + "print(f\"Token IDs: {encoded_input['input_ids']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GYbtZTVP_RF9" + }, + "source": [ + "Here we can see that the tokeniser returns the IDs for each token, as shown in the figure. But counting the number of IDs, we see that it is larger than the number of words in the sentence. 
Let's print the tokens associated with each ID.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yPZjiLis_RF9" + }, + "outputs": [], + "source": [ + "print(f\"Tokens: {bert_tokenizer.decode(encoded_input['input_ids'])}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k3K8UFlR_RF9" + }, + "source": [ + "We can see the tokeniser attaches new tokens, `[CLS]` and `[SEP]`, to the start and end of the sequence. This is a BERT-specific requirement for training and inference. Adding special tokens is a very common thing to do. Using special tokens, we can tell a model when a sentence starts or ends or when a new part of the input starts. This can be helpful when performing different tasks.\n", + "\n", + "For instance, some transformers are pretrained with what is known as masked prediction. For this, random tokens in a sequence are replaced by the `[MASK]` token, and the model is trained to predict the original token ID at each masked position." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "djMP4Ijz_RF9" + }, + "source": [ + "**Drawback of using raw tokens**:\n", + "\n", + "One drawback of using raw tokens is that they lack any indication of the word's position in the sequence. 
This is evident when considering sentences like \"I am happy\" and \"Am I happy\" - these two phrases have distinct meanings, and the model needs to grasp the word order to understand the intended message accurately.\n", + "\n", + "To address this, when converting the inputs into vectors, position vectors are introduced and added to these vectors to indicate the **position** of each word.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "639s7Zuk_RF9" + }, + "source": [ + "#### 2.2.2 Positional encodings" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s-hBFVYo_RF9" + }, + "source": [ + "In most domains where a transformer can be utilised, there is an underlying order to the tokens produced, be it the order of words in a sentence, the location from which patches are taken in an image or even the steps taken in an RL environment. This order is very important in all cases; just imagine you interpret the sentence \"I have to read this book.\" as \"I have this book to read.\". Both sentences contain the exact same words, yet they have completely different meanings based on the order.\n", + "\n", + "As both the encoder and the decoder blocks process all tokens in parallel, the order of tokens is lost in these calculations. To cope with this, the sequence order has to be injected into the tokens directly. This can be done by adding *positional encodings* to the tokens at the start of the encoder and decoder blocks (though some of the latest techniques add positional information in the attention blocks). 
An example of how positional encodings alter the tokens is shown below.\n", + "\n", + "\n", + "\\\\\n", + "\n", + "\"drawing\"\n", + "\n", + "Ideally, these encodings should have these characteristics ([source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)):\n", + "* Each time step should have a unique encoding.\n", + "* The distance between any two time steps should be consistent across sequences of different lengths.\n", + "* The encoding should be able to generalise to longer sequences than seen during training.\n", + "* The encoding must be deterministic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rklY-aL-_RF9" + }, + "source": [ + "##### **Sine and cosine functions**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GLcfkMku_RF9" + }, + "source": [ + "\n", + "In Attention Is All You Need, the authors used a method that can satisfy all these requirements. This involves interleaving sine and cosine waves of different frequencies along the embedding dimension. The formula for entry $i$ of the position encoding at position $D$ is shown below, where $i$ is the embedding index and $d_m$ is the token embedding size.\n", + "\n", + "\\\\\n", + "\n", + "$P_{D,i}= \\begin{cases}\\sin \\left(\\frac{D}{10000^{i/d_{m}}}\\right), & \\text { if } i \\bmod 2=0 \\\\ \\cos \\left(\\frac{D}{10000^{(i-1)/d_{m}}}\\right), & \\text { otherwise } \\end{cases}$\n", + "\n", + "\\\n", + "\n", + "Assuming our model has $d_m=8$, the position encoding will look like this:\n", + "\n", + "\\\n", + "$P_{D}=\\left[\\begin{array}{c}\\sin \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{6/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{6/8}}\\right)\\end{array}\\right]$\n", + "\n", + "\\\\\n", + "\n", + "Let's first create a function that can return 
these encodings to understand why this will work." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zT5t5D30_RF9" + }, + "outputs": [], + "source": [ + "def return_frequency_pe_matrix(token_sequence_length, token_embedding):\n", + "\n", + " assert token_embedding % 2 == 0, \"token_embedding should be divisible by two\"\n", + "\n", + " P = jnp.zeros((token_sequence_length, token_embedding))\n", + " positions = jnp.arange(0, token_sequence_length)[:, jnp.newaxis]\n", + "\n", + " i = jnp.arange(0, token_embedding, 2)\n", + " frequency_steps = jnp.exp(i * (-math.log(10000.0) / token_embedding))\n", + " frequencies = positions * frequency_steps\n", + "\n", + " P = P.at[:, 0::2].set(jnp.sin(frequencies))\n", + " P = P.at[:, 1::2].set(jnp.cos(frequencies))\n", + "\n", + " return P" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CYW-VDOL_RF-" + }, + "outputs": [], + "source": [ + "token_sequence_length = 50 # Number of tokens the model will need to process\n", + "token_embedding = 10000 # token embedding (and positional encoding) dimensions, ensure it is divisible by two\n", + "P = return_frequency_pe_matrix(token_sequence_length, token_embedding)\n", + "plot_position_encodings(P, token_sequence_length, token_embedding)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1mjHEDPO_RF-" + }, + "source": [ + "Looking at the graph above, we can see that for each position index, a unique pattern emerges, where each position index consistently has the same encoding.\n", + "\n", + "### **Group Activity**:\n", + "\n", + "- Take a moment with your friend to explore why this specific pattern appears when `token_sequence_length` is set to 1000, and `token_embedding` is 768.\n", + "- Experiment with smaller values for `token_sequence_length` and `token_embedding` to build a deeper understanding and enhance your discussion.\n", + "- Curious about the constant 10000? 
Ask your friend why they think it’s used in the functions above.\n", + "- Now, try setting `token_sequence_length` to 50 and `token_embedding` to a much larger value, like 10000. What do you observe? Do we always need a large token embedding?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SdNPg0pnhAhG" + }, + "source": [ + "### 2.3 Transformer block Intermediate" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M4vSolF2_RF-" + }, + "source": [ + "Just like an MLP (a simple neural network that processes input data through multiple layers) or a CNN (a type of neural network that excels at recognizing patterns in images by using convolution layers), transformers are made up of a stack of transformer blocks. In this section, we'll build each of the components needed to create one of these transformer blocks." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kTURbfr__RF-" + }, + "source": [ + "\n", + "#### 2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner\n", + "\n", + "\n", + "\"drawing\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LTtFi9AZ_RF-" + }, + "source": [ + "In the original model, these blocks consist of a simple 2-layer MLP (Multi-Layer Perceptron) that uses ReLU activation. However, GeLU (Gaussian Error Linear Unit) has become very popular, and we will be using it throughout this practical. The formula below represents the feedforward neural network (FFN) with GeLU activation. In this network, the input `x` is first passed through two linear layers with weights `W1` and `W2`, followed by bias terms `b1` and `b2`. 
The ReLU activation function of the original model, often written as $\\max(0, \\cdot)$, is replaced by the GeLU activation function in this case.\n", + "\n", + "$$\n", + "\\operatorname{FFN}(x)=\\operatorname{GeLU}\\left(x W_{1}+b_{1}\\right) W_{2}+b_{2}\n", + "$$\n", + "\n", + "One can interpret this block as processing what the MHA block has produced and then projecting these new token representations to a space that the next block can use more effectively. Usually, the first layer is very wide, in the range of 2-8 times the size of the token representations. This is done because it is easier to parallelize computations for a single wider layer during training than to parallelize a feedforward block with multiple layers. Thus the model gains capacity while training and inference stay efficient.\n", + "\n", + "**Code task:** Code up a Flax Module that implements the feed forward block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zsho1CnW_RF-" + }, + "outputs": [], + "source": [ + "class FeedForwardBlock(nn.Module):\n", + " \"\"\"\n", + " A 2-layer MLP which widens then narrows the input.\n", + "\n", + " Args:\n", + " widening_factor [optional, default=4]: The size of the hidden layer will be d_model * widening_factor.\n", + " \"\"\"\n", + "\n", + " widening_factor: int = 4\n", + " init_scale: float = 0.25\n", + "\n", + " @nn.compact\n", + " def __call__(self, x):\n", + " '''\n", + " Args:\n", + " x: [B, T, d_m]\n", + "\n", + " Return:\n", + " x: [B, T, d_m]\n", + " '''\n", + " d_m = x.shape[-1]\n", + " layer1_size = self.widening_factor * d_m\n", + "\n", + " initializer = nn.initializers.variance_scaling(\n", + " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", + " )\n", + "\n", + " # Hint: Layer 1 is a dense layer (fully connected layer) that increases the size of the input by the widening factor.\n", + " # Use nn.Dense to create this layer with layer1_size as the output size.\n", + " layer1 = # FINISH ME\n", + "\n", + " # 
Hint: Layer 2 is another dense layer that reduces the size back to the original dimension d_m.\n", + " # Use nn.Dense with d_m as the output size to create this layer.\n", + " layer2 = # FINISH ME\n", + "\n", + " x = jax.nn.gelu(layer1(x)) # Apply the GeLU activation function to the output of layer 1\n", + " x = layer2(x) # Pass the result through layer 2\n", + " return x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-qj0nfhH_RF-" + }, + "outputs": [], + "source": [ + "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", + "\n", + "class FeedForwardBlock(nn.Module):\n", + " \"\"\"A 2-layer MLP (Multi-Layer Perceptron) that first expands the input size and then reduces it back.\"\"\"\n", + "\n", + " # widening_factor controls how much the input dimension is expanded in the first layer.\n", + " widening_factor: int = 4\n", + "\n", + " # init_scale controls the scaling factor for weight initialization.\n", + " init_scale: float = 0.25\n", + "\n", + " @nn.compact\n", + " def __call__(self, x):\n", + " # Get the size of the last dimension of the input (embedding size).\n", + " d_m = x.shape[-1]\n", + "\n", + " # Calculate the size of the first layer by multiplying the embedding size by the widening factor.\n", + " layer1_size = self.widening_factor * d_m\n", + "\n", + " # Initialize the weights for both layers using a variance scaling initializer.\n", + " initializer = nn.initializers.variance_scaling(\n", + " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", + " )\n", + "\n", + " # Define the first dense layer, which expands the input size.\n", + " layer1 = nn.Dense(layer1_size, kernel_init=initializer)\n", + "\n", + " # Define the second dense layer, which reduces the size back to the original dimension.\n", + " layer2 = nn.Dense(d_m, kernel_init=initializer)\n", + "\n", + " # Apply the first dense layer followed by a GELU activation function.\n", + " x = 
jax.nn.gelu(layer1(x))\n", + "\n", + " # Apply the second dense layer to project the data back to its original dimension.\n", + " x = layer2(x)\n", + "\n", + " # Return the final output.\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sts5Vr4i_RF-" + }, + "source": [ + "#### 2.3.2 Add and Norm block Beginner" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TWUpf8wt_RF-" + }, + "source": [ + "To allow transformers to go deeper, residual connections are very important, as they give gradients an easier path through the network. For normalisation, `layer norm` is used. This normalises each token vector in the batch independently. Normalising the vectors has been found to improve the convergence and stability of transformers.\n", + "\n", + "There are two learnable parameters in layernorm, `scale` and `bias`, which rescale the normalised value. Thus, for each input token in a batch, we calculate the mean $\\mu_{i}$ and variance $\\sigma_i^2$. We then normalise the token with:\n", + "\n", + "$\\hat{x}_i = \\frac{x_i-\\mu_{i}}{\\sqrt{\\sigma_i^2 + ϵ}}$.\n", + "\n", + "Then $\\hat{x}_i$ is rescaled using the learned `scale`, $γ$, and `bias`, $β$, with:\n", + "\n", + "$y_i = γ\\hat{x}_i + β = LN_{γ,β}(x_i)$.\n", + "\n", + "So our add norm block can be represented as $LN(x+f(x))$, where $f(x)$ is either an MLP or MHA block.\n", + "\n", + "**Code task:** Code up a Flax Module that implements the add norm block. It should take as input the processed and unprocessed tokens. 
Hint: `nn.LayerNorm`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F5bLb5Ly_RF_" + }, + "outputs": [], + "source": [ + "class AddNorm(nn.Module):\n", + " \"\"\"A block that implements the add and norm block\"\"\"\n", + "\n", + " @nn.compact\n", + " def __call__(self, x, processed_x):\n", + " '''\n", + " Args:\n", + " x: Sequence of tokens before feeding into MHA or FF blocks, with shape [B, T, d_m]\n", + " processed_x: Sequence of tokens after being processed by MHA or FF blocks, with shape [B, T, d_m]\n", + "\n", + " Return:\n", + " add_norm_x: Transformed tokens with shape [B, T, d_m]\n", + " '''\n", + " # Hint: Step 1 involves adding the original input `x` to the processed input `processed_x`.\n", + " added = # FINISH ME\n", + "\n", + " # Hint: Step 2 requires applying layer normalization to the result of the addition.\n", + " # Use `nn.LayerNorm`, and set `reduction_axes=-1` to apply normalization across the last dimension.\n", + " normalised = #FINISH ME\n", + " return normalised(added)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HXSi7BXZ_RF_" + }, + "outputs": [], + "source": [ + "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", + "\n", + "class AddNorm(nn.Module):\n", + " \"\"\"A block that implements the 'Add and Norm' operation used in transformers.\"\"\"\n", + "\n", + " @nn.compact\n", + " def __call__(self, x, processed_x):\n", + " # Step 1: Add the original input (x) to the processed input (processed_x).\n", + " added = x + processed_x\n", + "\n", + " # Step 2: Apply layer normalization to the result of the addition.\n", + " # - LayerNorm helps to stabilize and improve the training process by normalizing the output.\n", + " # - reduction_axes=-1 indicates that normalization is applied across the last dimension (typically the embedding dimension).\n", + " # - use_scale=True and use_bias=True allow the layer to learn scaling and bias parameters for 
further fine-tuning.\n", + " normalised = nn.LayerNorm(reduction_axes=-1, use_scale=True, use_bias=True)\n", + "\n", + " # Return the normalized result.\n", + " return normalised(added)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "91dXd29b_RF_" + }, + "source": [ + "### 2.4 Building the Transformer Decoder / LLM Intermediate" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sl0UAyvM_RF_" + }, + "source": [ + "\"drawing\"\n", + "\n", + "Most of the groundwork is done. We have built the positional encoding block, the MHA block, the feed-forward block and the add & norm block.\n", + "\n", + "The only part left is passing inputs to each decoder block and applying the masked MHA block found in the decoder blocks.\n", + "\n", + "**Code task:** Code up a Flax Module that implements the decoder block: masked MHA followed by add & norm, then the FFN followed by add & norm." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wVmSFKZK_RF_" + }, + "outputs": [], + "source": [ + "class DecoderBlock(nn.Module):\n", + " \"\"\"\n", + " Transformer decoder block.\n", + "\n", + " Args:\n", + " num_heads: The number of heads to be used in the MHA block.\n", + " d_m: Token embedding size\n", + " widening_factor: The size of the hidden layer will be d_m * widening_factor.\n", + " \"\"\"\n", + "\n", + " num_heads: int\n", + " d_m: int\n", + " widening_factor: int = 4\n", + "\n", + " def setup(self):\n", + " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", + " self.add_norm1 = AddNorm()\n", + " self.add_norm2 = AddNorm()\n", + " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", + "\n", + " def __call__(self, X, mask=None, return_att_weight=True):\n", + " \"\"\"\n", + " Args:\n", + " X: Batch of tokens being fed into the decoder, with shape [B, T_decoder, d_m]\n", + " mask [optional, default=None]: Mask to be 
applied, with shape [T_decoder, T_decoder].\n", + " return_att_weight [optional, default=True]: Whether to return the attention weights.\n", + " \"\"\"\n", + "\n", + " attention, attention_weights_1 = # FINISH ME\n", + "\n", + " X = # FINISH ME\n", + "\n", + " projection = # FINISH ME\n", + " X = # FINISH ME\n", + "\n", + " return (X, attention_weights_1) if return_att_weight else X" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "stNZVVv3_RF_" + }, + "outputs": [], + "source": [ + "#@title Answer to code task (Try not to peek until you've given it a good try!')\n", + "\n", + "class DecoderBlock(nn.Module):\n", + " \"\"\"\n", + " Transformer decoder block.\n", + "\n", + " Args:\n", + " num_heads: The number of attention heads in the Multi-Head Attention (MHA) block.\n", + " d_m: The size of the token embeddings.\n", + " widening_factor: The factor by which the hidden layer size is expanded in the MLP.\n", + " \"\"\"\n", + "\n", + " num_heads: int\n", + " d_m: int\n", + " widening_factor: int = 4\n", + "\n", + " def setup(self):\n", + " # Initialize the Multi-Head Attention (MHA) block\n", + " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", + "\n", + " # Initialize the AddNorm blocks for residual connections and normalization\n", + " self.add_norm1 = AddNorm() # First AddNorm block after MHA\n", + " self.add_norm2 = AddNorm() # Second AddNorm block after the MLP\n", + "\n", + " # Initialize the FeedForwardBlock (MLP) which processes the data after attention\n", + " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", + "\n", + " def __call__(self, X, mask=None, return_att_weight=True):\n", + " \"\"\"\n", + " Forward pass through the DecoderBlock.\n", + "\n", + " Args:\n", + " X: Batch of input tokens fed into the decoder, shape [B, T_decoder, d_m]\n", + " mask [optional, default=None]: Mask to control which positions the attention is allowed to consider, shape [T_decoder, T_decoder].\n", + " 
return_att_weight [optional, default=True]: If True, returns the attention weights along with the output.\n", + "\n", + " Returns:\n", + " If return_att_weight is True, returns a tuple (X, attention_weights_1).\n", + " Otherwise, returns the processed token representations X.\n", + " \"\"\"\n", + "\n", + " # Apply Multi-Head Attention to the input tokens (X) with optional masking\n", + " attention, attention_weights_1 = self.mha(X, mask=mask, return_weights=True)\n", + "\n", + " # Apply the first AddNorm block (adds the original input X and normalizes)\n", + " X = self.add_norm1(X, attention)\n", + "\n", + " # Pass the result through the FeedForwardBlock (MLP) to further process the data\n", + " projection = self.MLP(X)\n", + "\n", + " # Apply the second AddNorm block (adds the input from the previous step and normalizes)\n", + " X = self.add_norm2(X, projection)\n", + "\n", + " # Return the final output X, and optionally the attention weights\n", + " return (X, attention_weights_1) if return_att_weight else X\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8SXXVWd7_RF_" + }, + "source": [ + "Next, we just put everything together, adding in the positional encodings as well as stacking multiple transformer blocks and adding our prediction layer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4XBG24Qs_RF_" + }, + "outputs": [], + "source": [ + "class LLM(nn.Module):\n", + " \"\"\"\n", + " Transformer model consisting of several layers of decoder blocks.\n", + "\n", + " Args:\n", + " num_heads: Number of attention heads in each Multi-Head Attention (MHA) block.\n", + " num_layers: Number of decoder blocks in the model.\n", + " d_m: Dimensionality of the token embeddings.\n", + " vocab_size: Size of the vocabulary (number of unique tokens).\n", + " widening_factor: Factor by which the hidden layer size is expanded in the MLP.\n", + " \"\"\"\n", + " num_heads: int\n", + " num_layers: int\n", + " d_m: int\n", + " vocab_size: int\n", + " widening_factor: int = 4\n", + "\n", + " def setup(self):\n", + " # Initialize a list of decoder blocks, one for each layer in the model\n", + " self.blocks = [\n", + " DecoderBlock(self.num_heads, self.d_m, self.widening_factor)\n", + " for _ in range(self.num_layers)\n", + " ]\n", + "\n", + " # Initialize an embedding layer to convert token IDs into token embeddings\n", + " self.embedding = nn.Embed(num_embeddings=self.vocab_size, features=self.d_m)\n", + "\n", + " # Initialize a dense layer for predicting the next token in the sequence\n", + " self.pred_layer = nn.Dense(self.vocab_size)\n", + "\n", + " def __call__(self, X, mask=None, return_att_weights=False):\n", + " \"\"\"\n", + " Forward pass through the LLM model.\n", + "\n", + " Args:\n", + " X: Batch of input token IDs, shape [B, T_decoder] where B is batch size and T_decoder is sequence length.\n", + " mask [optional, default=None]: Mask to control which positions the attention can focus on, shape [T_decoder, T_decoder].\n", + " return_att_weights [optional, default=False]: Whether to return the attention weights.\n", + "\n", + " Returns:\n", + " logits: The predicted probabilities for each token in the vocabulary.\n", + " If return_att_weights is True, also returns the 
attention weights.\n", + " \"\"\"\n", + "\n", + " # Convert token IDs to embeddings (shape [B, T_decoder, d_m])\n", + " X = self.embedding(X)\n", + "\n", + " # Get the sequence length of the input\n", + " sequence_len = X.shape[-2]\n", + "\n", + " # Generate positional encodings and add them to the token embeddings\n", + " positions = return_frequency_pe_matrix(sequence_len, self.d_m)\n", + " X = X + positions\n", + "\n", + " # Initialize a list to store attention weights if needed\n", + " if return_att_weights:\n", + " att_weights = []\n", + "\n", + " # Pass the embeddings through each decoder block in sequence\n", + " for block in self.blocks:\n", + " out = block(X, mask, return_att_weights)\n", + " if return_att_weights:\n", + " # If returning attention weights, unpack the output\n", + " X = out[0]\n", + " att_weights.append(out[1])\n", + " else:\n", + " # Otherwise, just update the input for the next block\n", + " X = out\n", + "\n", + " # Apply a dense layer followed by a log softmax to get logits (predicted token probabilities)\n", + " logits = nn.log_softmax(self.pred_layer(X))\n", + "\n", + " # Return the logits, and optionally, the attention weights\n", + " return logits if not return_att_weights else (logits, jnp.array(att_weights).swapaxes(0, 1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sClFLLkU_RF_" + }, + "source": [ + "If everything is correct, then if we run the code below, everything should run without any issues." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "82CWEa5m_RGA" + }, + "outputs": [], + "source": [ + "B, T, d_m, N, vocab_size = 18, 32, 16, 8, 25670\n", + "\n", + "llm = LLM(num_heads=1, num_layers=1, d_m=d_m, vocab_size=vocab_size, widening_factor=4)\n", + "mask = jnp.tril(np.ones((T, T)))\n", + "\n", + "# initialise module and get dummy output\n", + "key = jax.random.PRNGKey(42)\n", + "X = jax.random.randint(key, [B, T], 0, vocab_size)\n", + "params = llm.init(key, X, mask=mask)\n", + "\n", + "# extract output from decoder\n", + "logits, decoder_att_weights = llm.apply(\n", + " params,\n", + " X,\n", + " mask=mask,\n", + " return_att_weights=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gve7ssD__RGA" + }, + "source": [ + "As a final sanity check, we can confirm that our attention weights are working correctly. As shown in the figure below, the decoder's attention weights only focus on previous tokens, as expected." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H4NpywYv_RGA" + }, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", + "plt.suptitle(\"LLM attention weights\")\n", + "sns.heatmap(decoder_att_weights[0, 0, 0, ...], ax=ax, cmap=\"Blues\")\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wmt3tp38G90A" + }, + "source": [ + "### 2.5 Training your LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agLIpsoh_RGA" + }, + "source": [ + "#### 2.5.1 Training objective Intermediate\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QOSv1-3B_RGA" + }, + "source": [ + "A sentence is nothing but a string of words. 
An LLM aims to predict the next word by considering the current context, namely the words that have come before.\n", + "\n", + "Here's the basic idea:\n", + "\n", + "To calculate the probability of a full sentence \"word1, word2, ..., last word\" appearing in a given context $c$, the procedure is to break down the sentence into individual words and consider the probability of each word given the words that precede it. These individual probabilities are then multiplied together:\n", + "\n", + "$$\\text{Probability of sentence} = \\text{Probability of word1} \\times \\text{Probability of word2} \\times \\ldots \\times \\text{Probability of last word}$$\n", + "\n", + "This method is akin to building up a narrative one piece at a time based on the preceding storyline.\n", + "\n", + "Mathematically, this is expressed as the likelihood (probability) of a sequence of words $y_1, y_2, ..., y_n$ in a given context $c$, which is achieved by multiplying the probabilities of each word $y_t$ calculated given the predecessors ($y_{Advanced" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zIQ_aJGW_RGA" + }, + "source": [ + "In the next section, we define all the processes required to train the model using the objective described above. Much of this is the boilerplate required to train with Flax.\n", + "\n", + "Below we gather the dataset we shall be training on, which is Karpathy's tiny Shakespeare dataset. 
It's not so important to understand this code, so either just run the cell to load the data, or view the code if you want to understand it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "guMHAaSo_RGB" + }, + "outputs": [], + "source": [ + "# @title Create Shakespeare dataset and iterator (optional, but run the cell)\n", + "\n", + "# Trick to avoid errors when downloading tinyshakespeare.\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"\n", + "\n", + "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt\n", + "\n", + "class WordBasedAsciiDatasetForLLM:\n", + " \"\"\"In-memory dataset of a single-file ASCII dataset for language-like model.\"\"\"\n", + "\n", + " def __init__(self, path: str, batch_size: int, sequence_length: int):\n", + " \"\"\"Load a single-file ASCII dataset in memory.\"\"\"\n", + " self._batch_size = batch_size\n", + "\n", + " with open(path, \"r\") as f:\n", + " corpus = f.read()\n", + "\n", + " # Tokenize by splitting the text into words\n", + " words = corpus.split()\n", + " self.vocab_size = len(set(words)) # Number of unique words\n", + "\n", + " # Create a mapping from words to unique IDs\n", + " self.word_to_id = {word: i for i, word in enumerate(set(words))}\n", + "\n", + " # Store the inverse mapping from IDs to words\n", + " self.id_to_word = {i: word for word, i in self.word_to_id.items()}\n", + "\n", + " # Convert the words in the corpus to their corresponding IDs\n", + " corpus = np.array([self.word_to_id[word] for word in words]).astype(np.int32)\n", + "\n", + " crop_len = sequence_length + 1\n", + " num_batches, ragged = divmod(corpus.size, batch_size * crop_len)\n", + " if ragged:\n", + " corpus = corpus[:-ragged]\n", + " corpus = corpus.reshape([-1, crop_len])\n", + "\n", + " if num_batches < 10:\n", + " raise ValueError(\n", + " f\"Only {num_batches} batches; consider a shorter \"\n", + " \"sequence or a 
smaller batch.\"\n", + " )\n", + "\n", + " self._ds = WordBasedAsciiDatasetForLLM._infinite_shuffle(\n", + " corpus, batch_size * 10\n", + " )\n", + "\n", + " def __iter__(self):\n", + " return self\n", + "\n", + " def __next__(self):\n", + " \"\"\"Yield next mini-batch.\"\"\"\n", + " batch = [next(self._ds) for _ in range(self._batch_size)]\n", + " batch = np.stack(batch)\n", + " # Create the language modeling observation/target pairs.\n", + " return dict(\n", + " input=batch[:, :-1], target=batch[:, 1:]\n", + " )\n", + "\n", + " def ids_to_words(self, ids):\n", + " \"\"\"Convert a sequence of word IDs to words.\"\"\"\n", + " return [self.id_to_word[id] for id in ids]\n", + "\n", + " @staticmethod\n", + " def _infinite_shuffle(iterable, buffer_size):\n", + " \"\"\"Infinitely repeat and shuffle data from iterable.\"\"\"\n", + " ds = itertools.cycle(iterable)\n", + " buf = [next(ds) for _ in range(buffer_size)]\n", + " random.shuffle(buf)\n", + " while True:\n", + " item = next(ds)\n", + " idx = random.randint(0, buffer_size - 1) # Inclusive.\n", + " result, buf[idx] = buf[idx], item\n", + " yield result\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_WBIFg51oQl0" + }, + "source": [ + "Let's now look at how our data is structured for training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WvH3XPM5_RGB" + }, + "outputs": [], + "source": [ + "# sample and look at the data\n", + "batch_size = 2\n", + "seq_length = 32\n", + "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", + "\n", + "batch = next(train_dataset)\n", + "\n", + "for obs, target in zip(batch[\"input\"], batch[\"target\"]):\n", + " print(\"-\" * 10, \"Input\", \"-\" * 11)\n", + " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(obs)))\n", + " print(\"IDs:\", obs)\n", + " print(\"-\" * 10, \"Target\", \"-\" * 10)\n", + " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(target)))\n", + " print(\"IDs:\", 
target)\n", + "\n", + "print(f\"\\n Total vocabulary size: {train_dataset.vocab_size}\")\n", + "\n", + "VOCAB_SIZE = train_dataset.vocab_size" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w9vzee53_RGB" + }, + "source": [ + "Next, let us train our LLM and see how it performs in producing Shakespearian text. First, we will define what happens for every training step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PGuYBCkekgDw" + }, + "outputs": [], + "source": [ + "import functools\n", + "\n", + "@functools.partial(jax.jit, static_argnums=(3, 4))\n", + "def train_step(params, optimizer_state, batch, apply_fn, update_fn):\n", + " \"\"\"\n", + " Perform a single training step.\n", + "\n", + " Args:\n", + " params: The current parameters of the model.\n", + " optimizer_state: The current state of the optimizer.\n", + " batch: A dictionary containing the input data and target labels for the batch.\n", + " apply_fn: The function used to apply the model to the inputs.\n", + " update_fn: The function used to update the model parameters based on the gradients.\n", + "\n", + " Returns:\n", + " Updated parameters, updated optimizer state, and the computed loss for the batch.\n", + " \"\"\"\n", + "\n", + " def loss_fn(params):\n", + " # Get the sequence length (T) from the input data.\n", + " T = batch['input'].shape[1]\n", + "\n", + " # Apply the model to the input data, using a lower triangular mask to enforce causality.\n", + " # jnp.tril(np.ones((T, T))) creates a lower triangular matrix of ones.\n", + " logits = apply_fn(params, batch['input'], jnp.tril(np.ones((T, T))))\n", + "\n", + " # Calculate the loss between the predicted logits and the target labels.\n", + " loss = sequence_loss_fn(logits, batch['target'])\n", + "\n", + " return loss\n", + "\n", + " # Compute the loss and its gradients with respect to the parameters.\n", + " loss, gradients = jax.value_and_grad(loss_fn)(params)\n", + "\n", + " # Update 
the optimizer state and calculate the parameter updates based on the gradients.\n", + " updates, optimizer_state = update_fn(gradients, optimizer_state)\n", + "\n", + " # Apply the updates to the parameters.\n", + " params = optax.apply_updates(params, updates)\n", + "\n", + " # Return the updated parameters, optimizer state, and the loss for the batch.\n", + " return params, optimizer_state, loss" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rtKWzKIAkfYU" + }, + "source": [ + "Next we initialise our optimizer and model. Feel free to play with the hyperparameters during the practical." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8o3q-BZX_RGB" + }, + "outputs": [], + "source": [ + "# Define all hyperparameters\n", + "d_model = 128 # Dimension of token embeddings (d_m)\n", + "num_heads = 4 # Number of attention heads in Multi-Head Attention\n", + "num_layers = 1 # Number of decoder blocks in the model\n", + "widening_factor = 2 # Factor to widen the hidden layer size in the MLP\n", + "LR = 2e-3 # Learning rate for the optimizer\n", + "batch_size = 32 # Number of samples per training batch\n", + "seq_length = 64 # Length of each input sequence (number of tokens)\n", + "\n", + "# Set up the training data\n", + "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", + "vocab_size = train_dataset.vocab_size # Get the size of the vocabulary from the dataset\n", + "batch = next(train_dataset) # Get the first batch of input data\n", + "\n", + "# Set the random number generator key for model initialization\n", + "rng = jax.random.PRNGKey(42)\n", + "\n", + "# Initialize the LLM model with the specified hyperparameters\n", + "llm = LLM(num_heads=num_heads, num_layers=num_layers, d_m=d_model, vocab_size=vocab_size, widening_factor=widening_factor)\n", + "\n", + "# Create a causal mask to ensure that the model only attends to previous tokens\n", + "mask = 
jnp.tril(np.ones((batch['input'].shape[1], batch['input'].shape[1])))\n", + "\n", + "# Initialize the model parameters using the first batch of input data and the mask\n", + "params = llm.init(rng, batch['input'], mask)\n", + "\n", + "# Set up the optimizer using the Adam optimization algorithm with the specified learning rate\n", + "optimizer = optax.adam(LR, b1=0.9, b2=0.99)\n", + "optimizer_state = optimizer.init(params) # Initialize the optimizer state with the model parameters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3bPEFakxmvsM" + }, + "source": [ + "Now we train! This will take a few minutes. While it trains, have you greeted your neighbour yet?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oUAS6tie_RGB" + }, + "outputs": [], + "source": [ + "plotlosses = PlotLosses()\n", + "\n", + "MAX_STEPS = 3500\n", + "LOG_EVERY = 32\n", + "losses = []\n", + "VOCAB_SIZE = train_dataset.vocab_size # Derived from the dataset rather than hard-coded\n", + "\n", + "# Training loop\n", + "for step in range(MAX_STEPS):\n", + " batch = next(train_dataset)\n", + " params, optimizer_state, loss = train_step(\n", + " params, optimizer_state, batch, llm.apply, optimizer.update)\n", + " losses.append(loss)\n", + " if step % LOG_EVERY == 0:\n", + " loss_ = jnp.array(losses).mean()\n", + " plotlosses.update(\n", + " {\n", + " \"loss\": loss_,\n", + " }\n", + " )\n", + " plotlosses.send()\n", + " losses = []" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pGv9c2AFmF4V" + }, + "source": [ + "#### 2.5.3 Inspecting the trained LLM Beginner\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pfq61gim_RGB" + }, + "source": [ + "**Reminder:** run all code presented so far in this section before running the cells below!\n", + "\n", + "Let's generate some text now and see how our model did. DO NOT STOP THE CELL ONCE IT IS RUNNING, AS THIS WILL CRASH THE SESSION." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5lt8HTS__RGC" + }, + "outputs": [], + "source": [ + "import functools\n", + "\n", + "@functools.partial(jax.jit, static_argnums=(2, ))\n", + "def generate_prediction(params, input, apply_fn):\n", + " logits = apply_fn(params, input)\n", + " argmax_out = jnp.argmax(logits, axis=-1)\n", + " return argmax_out[0][-1].astype(int)\n", + "\n", + "def generate_random_shakespeare(llm, params, id_2_word, word_2_id):\n", + " '''\n", + " Greedily generate 15 words that continue a fixed prompt.\n", + " '''\n", + "\n", + " prompt = \"Love\"\n", + " print(prompt, end=\"\")\n", + " tokens = prompt.split()\n", + "\n", + " # predict and append\n", + " for i in range(15):\n", + " input = jnp.array([[word_2_id[t] for t in tokens]]).astype(int)\n", + " prediction = generate_prediction(params, input, llm.apply)\n", + " prediction = id_2_word[int(prediction)]\n", + " tokens.append(prediction)\n", + " print(\" \"+prediction, end=\"\")\n", + "\n", + " return \" \".join(tokens)\n", + "\n", + "id_2_word = train_dataset.id_to_word\n", + "word_2_id = train_dataset.word_to_id\n", + "\n", + "generated_shakespeare = generate_random_shakespeare(llm, params, id_2_word, word_2_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wOwNuMRf_RGC" + }, + "source": [ + "Note that the generation above always picks the token ID with the highest predicted probability. This is greedy decoding, since we only ever take the most likely token. It works well in this use case, but there are cases where the greedy approach degrades performance, particularly when we are interested in generating realistic text.\n", + "\n", + "Other methods exist for decoding from the model, a well-known one being beam search. 
We provide resources below for anyone interested in learning more about this.\n", + "\n", + "[Greedy Decoding](https://www.youtube.com/watch?v=DW5C3eqAFQM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=4)\n", + "\n", + "[Beam Search](https://www.youtube.com/watch?v=uG3xoYNo3HM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=5)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fV3YG7QOZD-B" + }, + "source": [ + "## **Conclusion**\n", + "**Summary:**\n", + "\n", + "You've now mastered the essentials of how a Large Language Model (LLM) works, from the fundamentals of attention mechanisms to training your own LLM! These powerful tools have the potential to transform a wide range of tasks. However, like any deep learning model, their magic lies in applying them to the right problems with the right data.\n", + "\n", + "Ready to take your skills to the next level? Dive into fine-tuning your own LLMs and unleash even more potential! I highly recommend exploring last year's practical on Parameter Efficient Fine-Tuning Methods for a comprehensive overview of advanced techniques. The journey doesn't stop here—there's so much more to discover! [LLMs for Everyone 2023](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", + "\n", + "The world of LLMs is yours to explore—go ahead and create something amazing! 
🌟🚀\n", + "\n", + "---\n", + "\n", + "**Next Steps:**\n", + "[**Efficiently Finetuning LLMs with Hugging Face**](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", + "\n", + "\n", + "**References:** for further reading, check the links given throughout\n", + "the individual sections of this colab.\n", + "\n", + "* [Attention is all you need paper](https://arxiv.org/abs/1706.03762)\n", + "* [Additional videos on transformers](https://www.youtube.com/playlist?list=PLmZlBIcArwhOPR2s-FIR7WoqNaBML233s)\n", + "* [LoRA paper](https://arxiv.org/abs/2106.09685)\n", + "* [RLHF](https://huggingface.co/blog/rlhf) (how ChatGPT was trained)\n", + "* [Extending context length](https://kaiokendev.github.io/context)\n", + "\n", + "\n", + "For other practicals from the Deep Learning Indaba, please visit [here](https://github.com/deep-learning-indaba/indaba-pracs-2023)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o1ndpYE50BpG" + }, + "source": [ + "# Feedback\n", + "\n", + "Please provide feedback that we can use to improve our practicals in the future." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "OIZvkhfRz9Jz" + }, + "outputs": [], + "source": [ + "# @title Generate Feedback Form. 
(Run Cell)\n", + "from IPython.display import HTML\n", + "\n", + "HTML(\n", + " \"\"\"\n", + "\n", + "\tLoading...\n", + "\n", + "\"\"\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oglV4kHMWnIN" + }, + "source": [ + "" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.8.5" + }, + "vscode": { + "interpreter": { + "hash": "145833166d986a8417df3c7acb65d917d84b716b5a452e57fcacdc66f1a168c9" + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/practicals/Foundations_of_LLMs/foundations_of_llms_practical_en.ipynb b/practicals/Foundations_of_LLMs/foundations_of_llms_practical_en.ipynb deleted file mode 100644 index 9ff4ded..0000000 --- a/practicals/Foundations_of_LLMs/foundations_of_llms_practical_en.ipynb +++ /dev/null @@ -1,7680 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m2s4kN_QPQVe" - }, - "source": [ - "# LLMs for everyone\n", - "\n", - "\n", - "\n", - "\"Open\n", - "\n", - "© Deep Learning Indaba 2024. Apache License 2.0.\n", - "\n", - "**Authors: Jabez Magomere, Harry Mayne, Khalil Mrini, Nabra Rizvi, Doudou Ba**\n", - "\n", - "**Introduction:**\n", - "\n", - "Welcome to **\"LLMs for Everyone\"**—your gateway to the fascinating world of Large Language Models (LLMs)! To kick things off, here’s a fun fact: this entire introduction was generated by ChatGPT, one of the many powerful LLMs you'll be learning about. 🤖✨\n", - "\n", - "In this tutorial, you'll dive into the core principles of transformers, the cutting-edge technology behind models like GPT. 
You’ll also get hands-on experience training your very own Language Model! Get ready to explore how these impressive AI systems create such realistic and engaging text. Let’s embark on this exciting journey together and unlock the secrets of LLMs! 🚀📚\n", - "\n", - "**Topics:**\n", - "\n", - "Content: [Hugging Face Introduction, Attention Mechanism, Transformer Architecture, Training your own LLM from scratch, Finetuning an LLM for Text Classification]\n", - "\n", - "Level: Beginner, Intermediate, Advanced\n", - "\n", - "**Aims/Learning Objectives:**\n", - "\n", - "* Understand the idea behind [Attention](https://arxiv.org/abs/1706.03762) and why it is used.\n", - "* Present and describe the fundamental building blocks of the [Transformer Architecture](https://arxiv.org/abs/1706.03762) along with an intuition on such an architecture design.\n", - "* Build and train a simple Shakespeare-inspired LLM.\n", - "\n", - "**Prerequisites:**\n", - "\n", - "* Basic knowledge of Deep Learning.\n", - "* Familiarity with Natural Language Processing (NLP).\n", - "* Understanding of sequence-to-sequence models.\n", - "* Basic understanding of Linear Algebra.\n", - "\n", - "**Outline:**\n", - "\n", - ">[LLMs for everyone](#scrollTo=m2s4kN_QPQVe)\n", - "\n", - ">>[Installations, Imports and Helper Functions](#scrollTo=6EqhIg1odqg0)\n", - "\n", - ">>[Let's kick things off with a Hugging Face Demo! Beginner](#scrollTo=4zu5cg-YG4XU)\n", - "\n", - ">>>[Hugging Face](#scrollTo=AwjIIipOG4fz)\n", - "\n", - ">>>[Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample](#scrollTo=eq46TV_0G4f0)\n", - "\n", - ">>[1. 
Attention](#scrollTo=-ZUp8i37dFbU)\n", - "\n", - ">>>[Intuition - Beginner](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Understanding Attention in Simple Terms](#scrollTo=ygdi884ugGcu)\n", - "\n", - ">>>[Sequence to sequence attenion mechanisms - Intermediate](#scrollTo=aQfqM1EJyDXI)\n", - "\n", - ">>>[Self-attention to Multihead Attention - Intermediate](#scrollTo=J-MU6rrny8Nj)\n", - "\n", - ">>>>[Self-attention](#scrollTo=0AFUEFZGzCTv)\n", - "\n", - ">>>>>[Queries, keys and values](#scrollTo=pwOIMtdZzdTf)\n", - "\n", - ">>>>>[Masked attention](#scrollTo=D7B-AgO80gIt)\n", - "\n", - ">>>>>[Multi-head attention](#scrollTo=OWDubQwCs4zG)\n", - "\n", - ">>[2. Building your own LLM](#scrollTo=e9NW58_3hAg2)\n", - "\n", - ">>>[2.1 High-level overvierw Beginner](#scrollTo=bA_2coZvhAg3)\n", - "\n", - ">>>[2.2 Tokenization + Positional encoding Beginner](#scrollTo=fbTsk0MdhAhC)\n", - "\n", - ">>>>[2.2.1 Tokenization](#scrollTo=DehUpfym_RF8)\n", - "\n", - ">>>>[2.2.2 Positional encodings](#scrollTo=639s7Zuk_RF9)\n", - "\n", - ">>>>>[Sine and cosine functions](#scrollTo=rklY-aL-_RF9)\n", - "\n", - ">>>[2.3 Transformer block Intermediate](#scrollTo=SdNPg0pnhAhG)\n", - "\n", - ">>>>[2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner](#scrollTo=kTURbfr__RF-)\n", - "\n", - ">>>>[2.3.2 Add and Norm block Beginner](#scrollTo=Sts5Vr4i_RF-)\n", - "\n", - ">>>[2.4 Building the Transformer Decoder / LLM Intermediate](#scrollTo=91dXd29b_RF_)\n", - "\n", - ">>>[2.5 Training your LLM](#scrollTo=wmt3tp38G90A)\n", - "\n", - ">>>>[2.5.1 Training objective Intermediate](#scrollTo=agLIpsoh_RGA)\n", - "\n", - ">>>>[2.5.2 Training models Advanced](#scrollTo=4CSfvGj__RGA)\n", - "\n", - ">>>>[2.5.3 Inspecting the trained LLM Beginner](#scrollTo=pGv9c2AFmF4V)\n", - "\n", - ">>[Conclusion](#scrollTo=fV3YG7QOZD-B)\n", - "\n", - ">[Feedback](#scrollTo=o1ndpYE50BpG)\n", - "\n", - "\n", - "\n", - "\n", - "**Before you start:**\n", - "\n", - "For this practical, you will need to use a 
GPU to speed up training. To do this, go to the \"Runtime\" menu in Colab, select \"Change runtime type\" and then in the popup menu, choose \"GPU\" in the \"Hardware accelerator\" box." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "952qogb79nnY" - }, - "source": [ - "**Suggested experience level in this topic:**\n", - "\n", - "| Level | Experience |\n", - "| --- | --- |\n", - "`Beginner` | It is my first time being introduced to this work. |\n", - "`Intermediate` | I have done some basic courses/intros on this topic. |\n", - "`Advanced` | I work in this area/topic daily. |" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "cellView": "form", - "id": "YBdDHcI_ArCR", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "1e963d65-c3e9-42b3-9737-c6a8836e3cbc" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Based on your experience, we recommend you to not attempt to do every coding task but instead, skip through to every section and ensure you interact with the LoRA finetuned LLM presented in the last section as well as with the pretrained LLM to get a practical understanding of how these models behave.\n", - "Note: this is just a guideline, feel free to explore the colab as you'd like if you feel comfort able!\n" - ] - } - ], - "source": [ - "# @title **Paths to follow:** What is your level of experience in the topics presented in this notebook? 
(Run Cell)\n", - "experience = \"beginner\" #@param [\"beginner\", \"intermediate\", \"advanced\"]\n", - "sections_to_follow=\"\"\n", - "\n", - "\n", - "if experience == \"beginner\": sections_to_follow = \"\"\"we recommend you to not attempt to do every coding task but instead, skip through to every section and ensure you interact with the LoRA finetuned LLM presented in the last section as well as with the pretrained LLM to get a practical understanding of how these models behave\"\"\"\n", - "\n", - "elif experience == \"intermediate\": sections_to_follow = \"\"\"we recommend you go through every section in this notebook and try the coding tasks tagged as beginner or intermediate. If you get stuck on the code ask a tutor for help or move on to better use the time of the practical\"\"\"\n", - "\n", - "elif experience == \"advanced\": sections_to_follow = \"\"\"we recommend you go through every section and try every coding task until you get it to work\"\"\"\n", - "\n", - "\n", - "print(f\"Based on your experience, {sections_to_follow}.\\nNote: this is just a guideline, feel free to explore the colab as you'd like if you feel comfort able!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6EqhIg1odqg0" - }, - "source": [ - "## Installations, Imports and Helper Functions" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "4boGA9rYdt9l", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "9275ff48-ed7e-4e6e-8038-7918c521acc3" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.42.4)\n", - "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.21.0)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.15.4)\n", - "Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in 
/usr/local/lib/python3.10/dist-packages (from transformers) (0.23.5)\n", - "Requirement already satisfied: numpy<2.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)\n", - "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.5.15)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)\n", - "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.4)\n", - "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)\n", - "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.5)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.1.4)\n", - "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.5.0)\n", - "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2024.6.1,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets) (2024.6.1)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.10.5)\n", - "Requirement already satisfied: 
aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.0)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", - "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers) (4.12.2)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.8)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.7.4)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) 
(2024.1)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", - "Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.13.1)\n", - "Requirement already satisfied: umap-learn in /usr/local/lib/python3.10/dist-packages (0.5.6)\n", - "Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.10/dist-packages (from seaborn) (1.26.4)\n", - "Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.10/dist-packages (from seaborn) (2.1.4)\n", - "Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.10/dist-packages (from seaborn) (3.7.1)\n", - "Requirement already satisfied: scipy>=1.3.1 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.13.1)\n", - "Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.3.2)\n", - "Requirement already satisfied: numba>=0.51.2 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.60.0)\n", - "Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.5.13)\n", - "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from umap-learn) (4.66.5)\n", - "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.1)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n", - "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.53.1)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)\n", - "Requirement already satisfied: packaging>=20.0 in 
/usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)\n", - "Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (9.4.0)\n", - "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.4)\n", - "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)\n", - "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.2->umap-learn) (0.43.0)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->seaborn) (2024.1)\n", - "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.10/dist-packages (from pynndescent>=0.5->umap-learn) (1.4.2)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.22->umap-learn) (3.5.0)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)\n", - "Requirement already satisfied: livelossplot in /usr/local/lib/python3.10/dist-packages (0.5.5)\n", - "Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from livelossplot) (3.7.1)\n", - "Requirement already satisfied: bokeh in /usr/local/lib/python3.10/dist-packages (from livelossplot) (3.4.3)\n", - "Requirement already satisfied: Jinja2>=2.9 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (3.1.4)\n", - "Requirement already satisfied: contourpy>=1.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (1.2.1)\n", - 
"Requirement already satisfied: numpy>=1.16 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (1.26.4)\n", - "Requirement already satisfied: packaging>=16.8 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (24.1)\n", - "Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (2.1.4)\n", - "Requirement already satisfied: pillow>=7.1.0 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (9.4.0)\n", - "Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (6.0.2)\n", - "Requirement already satisfied: tornado>=6.2 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (6.3.3)\n", - "Requirement already satisfied: xyzservices>=2021.09.1 in /usr/local/lib/python3.10/dist-packages (from bokeh->livelossplot) (2024.6.0)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (0.12.1)\n", - "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (4.53.1)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (1.4.5)\n", - "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (3.1.4)\n", - "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->livelossplot) (2.8.2)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from Jinja2>=2.9->bokeh->livelossplot) (2.1.5)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.2->bokeh->livelossplot) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from 
pandas>=1.2->bokeh->livelossplot) (2024.1)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->livelossplot) (1.16.0)\n", - "Requirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (0.33.0)\n", - "Requirement already satisfied: numpy<2.0.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.26.4)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (24.1)\n", - "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate) (5.9.5)\n", - "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate) (6.0.2)\n", - "Requirement already satisfied: torch>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (2.4.0+cu121)\n", - "Requirement already satisfied: huggingface-hub>=0.21.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.23.5)\n", - "Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.4.4)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (3.15.4)\n", - "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (2024.6.1)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (2.32.3)\n", - "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (4.66.5)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.0->accelerate) (4.12.2)\n", - "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from 
torch>=1.10.0->accelerate) (1.13.2)\n", - "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.3)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.1.4)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.10.0->accelerate) (2.1.5)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (3.8)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.21.0->accelerate) (2024.7.4)\n", - "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.10.0->accelerate) (1.3.0)\n", - "A GPU is connected.\n" - ] - }, - { - "output_type": "stream", - "name": "stderr", - "text": [ - "[nltk_data] Downloading package word2vec_sample to /root/nltk_data...\n", - "[nltk_data] Package word2vec_sample is already up-to-date!\n" - ] - } - ], - "source": [ - "# Install necessary libraries for deep learning, NLP, and plotting\n", - "!pip install transformers datasets # Transformers and datasets libraries for NLP tasks\n", - "!pip install seaborn umap-learn # Seaborn for plotting, UMAP for dimensionality reduction\n", - "!pip install livelossplot # LiveLossPlot for tracking model training progress\n", - "!pip install -q transformers[torch] # Transformers with PyTorch backend\n", - "!pip install -q peft # Parameter-Efficient Fine-Tuning 
library\n", - "!pip install accelerate -U # Accelerate library for performance\n", - "\n", - "# Install utilities for debugging and console output formatting\n", - "!pip install -q ipdb # Interactive Python Debugger\n", - "!pip install -q colorama # Colored terminal text output\n", - "\n", - "# Import system and math utilities\n", - "import os\n", - "import math\n", - "import urllib.request\n", - "\n", - "# Check for connected accelerators (GPU or TPU) and set up accordingly\n", - "if os.environ.get(\"COLAB_GPU\") and int(os.environ[\"COLAB_GPU\"]) > 0:\n", - " print(\"A GPU is connected.\")\n", - "elif \"COLAB_TPU_ADDR\" in os.environ and os.environ[\"COLAB_TPU_ADDR\"]:\n", - " print(\"A TPU is connected.\")\n", - " import jax.tools.colab_tpu\n", - " jax.tools.colab_tpu.setup_tpu()\n", - "else:\n", - " print(\"Only CPU accelerator is connected.\")\n", - "\n", - "# Avoid GPU memory allocation to be done by JAX\n", - "os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = \"false\"\n", - "\n", - "# Import libraries for JAX-based deep learning\n", - "import chex\n", - "import flax\n", - "import flax.linen as nn\n", - "import jax\n", - "import jax.numpy as jnp\n", - "from jax import grad, jit, vmap\n", - "import optax\n", - "\n", - "# Import NLP and model-related libraries\n", - "import transformers\n", - "from transformers import pipeline, AutoTokenizer, AutoModel\n", - "import datasets\n", - "import peft\n", - "\n", - "# Import image processing and plotting libraries\n", - "from PIL import Image\n", - "from livelossplot import PlotLosses\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import seaborn as sns\n", - "\n", - "# Import additional utilities for working with text and models\n", - "import torch\n", - "import torchvision\n", - "import itertools\n", - "import random\n", - "import copy\n", - "\n", - "# Download an example image to use in the notebook\n", - "urllib.request.urlretrieve(\n", - " 
\"https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGNhdHxlbnwwfHwwfHw%3D&w=1000&q=80\",\n", - " \"cat.png\",\n", - ")\n", - "\n", - "# Import libraries for NLP preprocessing and working with pre-trained models\n", - "import gensim\n", - "from nltk.data import find\n", - "import nltk\n", - "nltk.download(\"word2vec_sample\")\n", - "\n", - "# Import Hugging Face tools and IPython widgets\n", - "import huggingface_hub\n", - "import ipywidgets as widgets\n", - "from IPython.display import display\n", - "import colorama\n", - "\n", - "# Set Matplotlib to output SVG format for better quality plots\n", - "%config InlineBackend.figure_format = 'svg'" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "-9X10jhocGaS" - }, - "outputs": [], - "source": [ - "# @title Helper Plotting Functions. (Run Cell)\n", - "\n", - "def plot_position_encodings(P, max_tokens, d_model):\n", - " \"\"\"\n", - " Plots the position encodings matrix.\n", - "\n", - " Args:\n", - " P: Position encoding matrix (2D array).\n", - " max_tokens: Maximum number of tokens (rows) to plot.\n", - " d_model: Dimensionality of the model (columns) to plot.\n", - " \"\"\"\n", - "\n", - " # Set up the plot size based on the number of tokens and model dimensions\n", - " plt.figure(figsize=(20, np.min([8, max_tokens])))\n", - "\n", - " # Plot the position encoding matrix with a color map for better visualization\n", - " im = plt.imshow(P, aspect=\"auto\", cmap=\"Blues_r\")\n", - "\n", - " # Add a color bar to indicate the encoding values\n", - " plt.colorbar(im, cmap=\"blue\")\n", - "\n", - " # Show embedding indices as ticks if the dimensionality is small\n", - " if d_model <= 64:\n", - " plt.xticks(range(d_model))\n", - "\n", - " # Show position indices as ticks if the number of tokens is small\n", - " if max_tokens <= 32:\n", - " plt.yticks(range(max_tokens))\n", - "\n", - " # Label the axes\n", - " 
plt.xlabel(\"Embedding index\")\n", - "    plt.ylabel(\"Position index\")\n", - "\n", - "    # Display the plot\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_image_patches(patches):\n", - "    \"\"\"\n", - "    Function that takes in a list of patches and plots them.\n", - "\n", - "    Args:\n", - "        patches: A list or array of image patches to plot.\n", - "    \"\"\"\n", - "\n", - "    # Set up the figure for plotting patches\n", - "    fig = plt.figure(figsize=(25, 25))\n", - "\n", - "    # Create a subplot for each patch and display it\n", - "    axes = []\n", - "    for a in range(patches.shape[1]):\n", - "        axes.append(fig.add_subplot(1, patches.shape[1], a + 1))\n", - "        plt.imshow(patches[0][a])\n", - "\n", - "    # Adjust layout to prevent overlap and display the plot\n", - "    fig.tight_layout()\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_projected_embeddings(embeddings, labels):\n", - "    \"\"\"\n", - "    Projects high-dimensional embeddings onto 2D space and plots them.\n", - "\n", - "    Args:\n", - "        embeddings: High-dimensional embedding vectors to project.\n", - "        labels: Labels corresponding to each embedding for coloring in the plot.\n", - "    \"\"\"\n", - "\n", - "    # Import UMAP and Seaborn for dimensionality reduction and plotting\n", - "    import umap\n", - "    import seaborn as sns\n", - "\n", - "    # Reduce the dimensionality of the embeddings to 2D using UMAP\n", - "    projected_embeddings = umap.UMAP().fit_transform(embeddings)\n", - "\n", - "    # Plot the 2D projections with labels using Seaborn for better aesthetics\n", - "    plt.figure(figsize=(15, 8))\n", - "    plt.title(\"Projected text embeddings\")\n", - "    sns.scatterplot(\n", - "        x=projected_embeddings[:, 0], y=projected_embeddings[:, 1], hue=labels\n", - "    )\n", - "\n", - "    # Display the plot\n", - "    plt.show()\n", - "\n", - "\n", - "def plot_attention_weight_matrix(weight_matrix, x_ticks, y_ticks):\n", - "    \"\"\"\n", - "    Plots an attention weight matrix with custom axis ticks.\n", - "\n", - "    Args:\n", - "        weight_matrix: The 
attention weight matrix to plot.\n", - "        x_ticks: Labels for the x-axis (typically the query tokens).\n", - "        y_ticks: Labels for the y-axis (typically the key tokens).\n", - "    \"\"\"\n", - "\n", - "    # Set up the plot size\n", - "    plt.figure(figsize=(15, 7))\n", - "\n", - "    # Plot the attention weight matrix as a heatmap\n", - "    ax = sns.heatmap(weight_matrix, cmap=\"Blues\")\n", - "\n", - "    # Set custom ticks on the x and y axes\n", - "    plt.xticks(np.arange(weight_matrix.shape[1]) + 0.5, x_ticks)\n", - "    plt.yticks(np.arange(weight_matrix.shape[0]) + 0.5, y_ticks)\n", - "\n", - "    # Label the plot\n", - "    plt.title(\"Attention matrix\")\n", - "    plt.xlabel(\"Attention score\")\n", - "\n", - "    # Display the plot\n", - "    plt.show()\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "kMkaKekB_pR4" - }, - "outputs": [], - "source": [ - "# @title Helper Text Processing Functions. (Run Cell)\n", - "\n", - "def get_word2vec_embedding(words):\n", - "    \"\"\"\n", - "    Function that takes in a list of words and returns their embeddings,\n", - "    based on a pretrained word2vec encoder, together with the list of\n", - "    words that were found in the encoder's vocabulary.\n", - "    \"\"\"\n", - "    word2vec_sample = str(find(\"models/word2vec_sample/pruned.word2vec.txt\"))\n", - "    model = gensim.models.KeyedVectors.load_word2vec_format(\n", - "        word2vec_sample, binary=False\n", - "    )\n", - "\n", - "    output = []\n", - "    words_pass = []\n", - "    for word in words:\n", - "        try:\n", - "            # get_vector replaces the deprecated word_vec\n", - "            output.append(jnp.array(model.get_vector(word)))\n", - "            words_pass.append(word)\n", - "        except KeyError:\n", - "            # Skip words that are not in the word2vec vocabulary\n", - "            pass\n", - "\n", - "    embeddings = jnp.array(output)\n", - "    del model  # free up space again\n", - "    return embeddings, words_pass\n", - "\n", - "\n", - "def remove_punctuation(text):\n", - "    \"\"\"Function that takes in a string and removes all punctuation.\"\"\"\n", - "    import re\n", - "\n", - "    text = re.sub(r\"[^\\w\\s]\", \"\", text)\n", - "    return text\n", - "\n", - "def print_sample(prompt: str, sample: str):\n", - "    \"\"\"Function that 
takes in a prompt instruction and model response and\n", - "    prints them out in different colors to show a distinction\"\"\"\n", - "    print(colorama.Fore.MAGENTA + prompt, end=\"\")\n", - "    print(colorama.Fore.BLUE + sample)\n", - "    print(colorama.Fore.RESET)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4zu5cg-YG4XU" - }, - "source": [ - "## Let's kick things off with a Hugging Face Demo! Beginner\n", - "\n", - "We're thrilled to have you on board! 🎉 Before we dive into the hands-on part of our journey, let's take a quick detour into the fascinating world of [Hugging Face](https://huggingface.co/), an incredible open-source platform for building and deploying cutting-edge language models. 🌐\n", - "\n", - "As a sneak peek into what we'll be creating today, we'll start by loading a *small* large language model (small in comparison to today's models) and prompting it with a simple instruction. This will give you a feel for how to interact with these powerful libraries. 💡 Get ready to unlock the potential of language models with just a few lines of code!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AwjIIipOG4fz" - }, - "source": [ - "### Hugging Face\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N2DSHiuhG4f0" - }, - "source": [ - "\n", - "\n", - "\n", - "[Hugging Face](https://huggingface.co/) is a startup founded in 2016. In their own words, they \"are on a mission to democratize good machine learning, one commit at a time.\" Today they offer a treasure trove of tools for working on and with Large Language Models (LLMs).\n", - "\n", - "They have developed various open-source packages that let users easily interact with a large corpus of pretrained transformer models (across all modalities) and with datasets for training or fine-tuning those models. Their software is used widely in industry and research. 
For more details on them and usage, refer to [the 2022 attention and transformer practical](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb#scrollTo=qFBw8kRx-4Mk).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3xdt9PQ6G4f0" - }, - "source": [ - "In this colab we print prompts in pink and samples generated from a model in blue like in the example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "L-8C9SJCG4f0", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "e2f784b7-4189-4d16-aa68-33e34a60761c" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mMy fake prompt\u001b[34m is awesome!\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "print_sample(prompt='My fake prompt', sample=' is awesome!')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eq46TV_0G4f0" - }, - "source": [ - "### Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample\n", - "\n", - "Let's dive into how simple it is to load and interact with a model from Hugging Face!\n", - "\n", - "For this tutorial, we've pre-configured two model options:\n", - "\n", - "- **`gpt-neo-125M`**: A smaller model with 125 million parameters. It's faster and uses less memory—perfect for getting started! 
We recommend trying this one first.\n", - "- **`gpt2-medium`**: A larger model with 355 million parameters for more advanced use.\n", - "\n", - "If you want to switch models, just restart the Colab kernel and update the model name in the cell below.\n", - "\n", - "**Note**: The steps we're about to show work not only for these models but also for [all models](https://huggingface.co/models?pipeline_tag=text-generation) on Hugging Face that support text generation pipelines.\n", - "|" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "id": "QVV28V-TG4f1", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 571, - "referenced_widgets": [ - "d8f46e6226af431d9b7c6ecfa1c2769a", - "93a441fdf82141af81e85b2d5aec49b7", - "4eb2c2d0758f4061a3d0fc398018de28", - "fefd764b0cf2425c97cdab4506a81b9c", - "692c620c5e104d33849c3da34268f5cb", - "ecec6f9f968940078a4cf7a9105a3717", - "e352febd89fe462e945d40bcf421f7fd", - "f5a917167c914fdfaa86f1f71176eda2", - "4edc048323484d24b8c59bb8a2f1fab1", - "fe54b7356cf049d5b449b3fd69d64221", - "740307d65ab447658d3945abd47b3318", - "70969f782d9d48cebefa3ee64e3a04f5", - "111b2a654a424460b4917b5dad5fd69e", - "e35b59e079ab498dbe595cde7c984438", - "f2edac25c1e74b14a4ade39fe86dd040", - "4983de4c57c349af8cb8b6be78f64030", - "74926ca6f40e44c0887297ac44cbd577", - "467dfa92fdee4b9d850bd1eac5c502c6", - "52e8f51e283847d79bda1ee4977fd53f", - "aa3a63b55fe74989b92cfe8504695309", - "a78f96142d83418685734ffb9ec85ddd", - "d642dc3cabc64e66a8bf8853527f7165", - "8ca5d9a0316d4d85b88d45f5993bc21c", - "0dc54052e03245c3b149ed1fcb22b038", - "973cf8a73e3a478c888928030194793d", - "1668490995c74105ab6f11ab33ecaefb", - "3ec771810c9d472fa75278a76decf956", - "f505b2bbabae43219a02639d33501a32", - "f0a69ae2f0064d80825b0091932e7813", - "65e1cac3c45442b4a68711d410ef37c3", - "5602e00600884fdfbbe145694bcc0f90", - "f02e4e7118d64a2bb764909705049d18", - "c4710e641a1c421f96222ffb65bcedfe", - "98ca063c1b1548ce8de87647dcf23507", - 
"ec94052f637b4de3a172b3d5d9d32355", - "267fa0085440473c8762bfe21b5e9106", - "708f1e90e25c49c0bbfac3bc5c233e24", - "14a6f314f3ff4ad3b1f05933c50fc830", - "c69550aa99a34bada8ef67f61115d760", - "d9d79a644a1d42a38b70097c7f77dbad", - "af73f467b5bd4c38882bc27e3b3b5732", - "71eeff60e0f7464ca42483c4fe3c7bca", - "a546a75097e149f4a226453951d987da", - "3d5b4513d6ee4a73a392e4c16896e14c", - "d334e7a205704133b8549562700bca53", - "2e0e5880d12b4a37b673dcb3e455f47e", - "3c7d5cb0f9de4132a2b26428b72cb10a", - "49de67c36f9b4b66ba234bc606ffa0c4", - "ff8540ebd47e4dc39197eb6b19945d32", - "622fd5127db84c7388b9258ee7c4fdc1", - "7a191a40c1e245748de1f54f449afe37", - "3263b46210b544b989301e9edfc23473", - "2a64a5672e934b739e6280e2ab278da9", - "98577aad05654ff5aa46ca95121ac640", - "03251cffadcb429b9dbe87402bb8a4bb", - "f35d92a421d84260976fc4fba3d4527c", - "d96a8053de2f4aeabb8cf68474f6725f", - "441741ea09ac4ba29ac2dcd9f8cc0ade", - "445c478b29e844e99c45eb4e7a093e65", - "9bc0b47676434ca9a6c8c3f8af50d9aa", - "2555e6fbde6e45c2ba5f466701a1fb57", - "73d4afacfe3741bc90dfd70e55d763e3", - "2815ed72ee56478b81379eabd3cdc004", - "7ee8d64ff1344f7f8e6816e8ced5c5d7", - "71ce19a082fd4f28a1c5f4f29853f32d", - "9e1d5d7e79c9494598ca56079525dadc", - "991ab38f5ab142a2a053da131fca08e8", - "b14dc0d2bba84dbbb17b657ef5555132", - "51a7cfd306fc45498f8f84b579e6f05f", - "3e134af4ddc041aab1e0f8769b77e232", - "6b33bff41d3d438f9c81af109273f41e", - "ff568bf3a34b46edb2e50fb910608de3", - "9e1cb31c569f40118ae949cd8643e392", - "eeeca8ac591242edafe63e100421534e", - "68b138160a984dd686787cad890ff13c", - "90ef3af60627467d979237b57a8f411d", - "e93153b83b034468aad8516c644e8d55", - "8b87d73ab967490481a27011d5b53236", - "739e3997f7544b73aff099c31a3d4bed", - "730b750139364343aaee6bd28fde7f53", - "b41c51b1111943468cb50695e5ce1f84", - "6f3f01955c0847a19b2ca4e84a06d149", - "9084385cba634c01a9e746643e40e32c", - "9a694d9d63524748a8dd7f2ae59525d5", - "6541ce75ed414eac9428fc9e0a53d128", - "83436ad970f044c9abc1e79a7b7a749d", - 
"61cba98f8dae431ea588538cdc5a2e07", - "cbace4a6e6bb476bb9d45d0516e5f8de" - ] - }, - "outputId": "d2a38816-1879-4f98-e342-480790a82873" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "config.json: 0%| | 0.00/1.01k [00:00 str:\n", - " # This function generates text based on a given prompt using a language model,\n", - " # with options to control randomness, the number of tokens generated, and reproducibility.\n", - "\n", - " # Convert the prompt text into tokens that the model can process\n", - " inputs = tokenizer(prompt, return_tensors=\"pt\")\n", - "\n", - " # Extract the tokens (input IDs) and attention mask (to focus on important parts) from the inputs\n", - " input_ids = inputs[\"input_ids\"]\n", - " attention_mask = inputs[\"attention_mask\"]\n", - "\n", - " # Move the tokens and attention mask to the same device as the model (like a GPU if available)\n", - " input_ids = input_ids.to(model.device)\n", - " attention_mask = attention_mask.to(model.device)\n", - "\n", - " # Set up how we want the model to generate text\n", - " generation_config = transformers.GenerationConfig(\n", - " do_sample=True, # Allow the model to add some randomness to its text generation\n", - " temperature=temperature, # Adjust how random the output is; lower means more focused\n", - " top_p=top_p, # Consider the most likely 
words that make up the top 90% of possibilities\n", - " pad_token_id=tokenizer.pad_token_id, # Use the token ID that represents padding (extra space)\n", - " top_k=0, # We're not limiting to the top-k words, so we set this to 0\n", - " )\n", - "\n", - " # If a seed is provided, set it so that the results are repeatable (same output each time)\n", - " if seed is not None:\n", - " torch.manual_seed(seed)\n", - "\n", - " # Generate text using the model with the settings we defined\n", - " generation_output = model.generate(\n", - " input_ids=input_ids, # Provide the input tokens to the model\n", - " attention_mask=attention_mask, # Provide the attention mask to help the model focus\n", - " return_dict_in_generate=True, # Ask the model to return detailed information\n", - " output_scores=True, # Include the scores (confidence levels) for the generated tokens\n", - " max_new_tokens=max_new_tokens, # Set the maximum number of tokens to generate\n", - " generation_config=generation_config, # Apply our custom text generation settings\n", - " )\n", - "\n", - " # Make sure only one sequence (output) is generated, to keep things simple\n", - " assert len(generation_output.sequences) == 1\n", - "\n", - " # Get the generated sequence of tokens\n", - " output_sequence = generation_output.sequences[0]\n", - "\n", - " # Convert the generated tokens back into readable text\n", - " output_string = tokenizer.decode(output_sequence)\n", - "\n", - " # Print the prompt and the generated response\n", - " print_sample(prompt, output_string)\n", - "\n", - " # Return the generated text response\n", - " return output_string" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "id": "Yme6VzW4G4f1", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "2d99988d-a455-4929-f0fd-946ba4b69776" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "\u001b[35mWhat is love?\u001b[34mWhat is love?\n", - "\n", - "Love is a 
term used to describe the way in which one person feels about others. It is a process that involves the emotional and physical interaction of the person, the relationship, and the relationship itself.\n", - "\n", - "Love is the ability to feel the love of another person.\n", - "\n", - "Love is the ability to feel\n", - "\u001b[39m\n" - ] - } - ], - "source": [ - "_ = run_sample(model, tokenizer, prompt=\"What is love?\", temperature = 0.5, seed=2)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V7vnUawyG4f1" - }, - "source": [ - "Pretty amazing, right? 🤩 Try playing around with the **prompt**, **temperature** and **seed** values above and see what different outputs you get. What do you notice when you increase the temperature? While this might have been mind-blowing back in 2021, by now, most of you have likely interacted with large language models in some way. Today, we're going to take things a step further by training our own **Shakespeare-inspired LLM**. This will give us a hands-on understanding of how these language models work under the hood.\n", - "\n", - "But before we jump into training, let’s first build a solid understanding of what **Large Language Models** are and the key **Machine Learning** concepts that make this groundbreaking technology possible. At the heart of today’s state-of-the-art (SoTA) LLMs are the **Attention Mechanism** and the **Transformer Architecture**. We’ll explore these essential concepts in the upcoming sections of this tutorial. 🚀💡\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-ZUp8i37dFbU" - }, - "source": [ - "## **1. Attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "acgW1ofF_RFz" - }, - "source": [ - "The attention mechanism is inspired by how humans would look at an image or read a sentence.\n", - "\n", - "Let us take the image of the dog in human clothes below (image and example [source](https://lilianweng.github.io/posts/2018-06-24-attention/)). 
When paying *attention* to the red blocks of pixels, we will say that the yellow block of pointy ears is something we expected (correlated) but that the grey blocks of human clothes are unexpected for us (uncorrelated). This is *based on what we have seen in the past* when looking at pictures of dogs, specifically one of a Shiba Inu.\n", - "\n", - "[Image: dog in human clothes with highlighted blocks of pixels]\n", - "\n", - "Assume we want to identify the dog breed in this image. When we look at the red blocks of pixels, we tend to pay more *attention* to the pixels that are most similar or relevant to them, which could be the ones in the yellow box. We almost completely ignore the snow in the background and the human clothing for this task.\n", - "\n", - "Alternatively, when we begin looking at the background in an attempt to identify what is in it, we subconsciously ignore the dog pixels because they are irrelevant to the current task." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "usLBF2g0x5gH" - }, - "source": [ - "The same thing happens when we read. To understand a sentence, we learn to correlate and *attend to* certain words based on the context of the whole sentence.\n", - "\n", - "[Image: example sentences containing the word \"apple\"]\n", - "\n", - "For instance, in the first sentence in the image above, when looking at the word \"coding\", we pay more attention to the words \"Apple\" and \"computer\" because we know that when we speak about coding, \"Apple\" is actually referring to the company. However, in the second sentence, we realise we should not consider \"apple\" when looking at \"code\" because, given the context of the rest of the sentence, we know that this apple is referring to an actual apple and not the company.\n", - "\n", - "We can build better models by developing mechanisms that mimic attention. Such mechanisms enable our models to learn better representations of the input data by contextualising what they know about some parts of the input based on other parts. 
In the following sections, we will explore the mechanisms that enable us to train deep learning models to attend to input data in the context of other input data." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ygdi884ugGcu" - }, - "source": [ - "### Intuition - Beginner\n", - "\n", - "Imagine attention as a mechanism that allows a neural network to focus more on certain parts of data. By doing this, the network can enhance its grasp of the problem it's working on, updating its understanding or representations accordingly.\n", - "\n", - "### Understanding Attention in Simple Terms\n", - "\n", - "One way to implement attention in neural networks is by representing each word (or even parts of a word) as a vector.\n", - "\n", - "So, what’s a vector? A vector is simply an array of real-valued numbers, and vectors can have different lengths. Think of it like a list of values that describe certain properties of a word. These vectors allow us to measure how similar two words are to each other. One common way to measure this similarity is the **dot product**: multiply the two vectors element by element and sum the results.\n", - "\n", - "The result of this similarity calculation is what we refer to as **attention**. This attention value helps the model decide how much one word should influence the representation of another word.\n", - "\n", - "In simpler terms, if two words have similar vector representations, it means they’re likely related or important to each other. Because of this relationship, they affect each other’s representations inside the neural network, allowing the model to understand the context better. 🎯\n", - "\n", - "To illustrate how the dot product can create meaningful attention weights, we'll use pre-trained [word2vec](https://jalammar.github.io/illustrated-word2vec/) embeddings. 
These word2vec embeddings are generated by a neural network that learned to create similar embeddings for words with similar meanings.\n", - "\n", - "By calculating the matrix of dot products between all vectors, we get an attention matrix. This will indicate which words are correlated and therefore should \"attend\" to each other.\n", - "\n", - "[1] You can find more details about how this is done for LLMs in the \"Building Your Own LLM\" session." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OvBYShCFk6WC" - }, - "source": [ - "**Code task** Intermediate: Complete the dot product attention function below." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "id": "yrbITGPnk7Ce" - }, - "outputs": [], - "source": [ - "def dot_product_attention(hidden_states, previous_state):\n", - "    \"\"\"\n", - "    Compute dot-product attention weights and the resulting context vectors.\n", - "\n", - "    Args:\n", - "        hidden_states: A tensor with shape [T_hidden, dm]\n", - "        previous_state: A tensor with shape [T_previous, dm]\n", - "    \"\"\"\n", - "\n", - "    # Hint: To calculate the attention scores, think about how you can use the `previous_state` matrix\n", - "    # and the `hidden_states` matrix. You want to find out how much each element in `previous_state`\n", - "    # should \"pay attention\" to each element in `hidden_states`. Remember that in matrix multiplication,\n", - "    # you can find the relationship between two sets of vectors by multiplying one by the transpose of the other.\n", - "    # Hint: Use `jnp.matmul` to perform the matrix multiplication between `previous_state` and the\n", - "    # transpose of `hidden_states` (`hidden_states.T`).\n", - "    scores = ...  # FINISH ME\n", - "\n", - "    # Hint: Now that you have the scores, you need to convert them into probabilities.\n", - "    # A softmax function is typically used in attention mechanisms to turn raw scores into probabilities\n", - "    # that sum to 1. 
This will help in determining how much focus should be placed on each hidden state.\n", - "    # Hint: Use `jax.nn.softmax` to apply the softmax function to `scores`.\n", - "    w_n = ...  # FINISH ME\n", - "\n", - "    # Multiply the weights by the hidden states to get the context vector\n", - "    # Hint: Use `jnp.matmul` again to multiply the attention weights `w_n` by `hidden_states`\n", - "    # to get the context vector.\n", - "    c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - "    return w_n, c_t" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": { - "id": "QARgTrNZlIqH", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "807606b7-9ead-4cca-b8a2-5c54e1f47bec" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "It looks like the function isn't fully implemented yet. Try modifying it.\n" - ] - } - ], - "source": [ - "# @title Run me to test your code\n", - "\n", - "key = jax.random.PRNGKey(42)\n", - "x = jax.random.normal(key, [2, 2])\n", - "\n", - "try:\n", - "    w_n, c_t = dot_product_attention(x, x)\n", - "\n", - "    w_n_correct = jnp.array([[0.9567678, 0.04323225], [0.00121029, 0.99878967]])\n", - "    c_t_correct = jnp.array([[0.11144122, 0.95290256], [-1.5571996, -1.5321486]])\n", - "    assert jnp.allclose(w_n_correct, w_n), \"w_n is not calculated correctly\"\n", - "    assert jnp.allclose(c_t_correct, c_t), \"c_t is not calculated correctly\"\n", - "\n", - "    print(\"It seems correct. Look at the answer below to compare methods.\")\n", - "except Exception:\n", - "    print(\"It looks like the function isn't fully implemented yet. 
Try modifying it.\")" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "Qa6PyKYnkzUJ" - }, - "outputs": [], - "source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!)\n", - "# when changing these words, note that if the word is not in the original\n", - "# training corpus it will not be shown in the weight matrix plot.\n", - "def dot_product_attention(hidden_states, previous_state):\n", - "    # Calculate the attention scores:\n", - "    # Multiply the previous state matrix by the transpose of the hidden states matrix.\n", - "    # This gives us a matrix of scores that show how much attention each element in the previous state\n", - "    # should pay to each element in the hidden states.\n", - "    # The result is a matrix of shape [T_previous, T_hidden], where:\n", - "    # T_previous is the number of elements in the previous state,\n", - "    # T_hidden is the number of elements in the hidden states.\n", - "    scores = jnp.matmul(previous_state, hidden_states.T)\n", - "\n", - "    # Apply the softmax function to the scores to convert them into probabilities.\n", - "    # This normalizes the scores so that each row sums to 1 (softmax runs over the last axis),\n", - "    # allowing us to interpret them as how much attention should be given to each hidden state.\n", - "    w_n = jax.nn.softmax(scores)\n", - "\n", - "    # Calculate the context vector (c_t):\n", - "    # Multiply the attention weights (w_n) by the hidden states.\n", - "    # This combines the hidden states based on how much attention each one deserves,\n", - "    # resulting in a new vector that represents the weighted sum of the hidden states.\n", - "    # The resulting shape is [T_previous, dm], where:\n", - "    # T_previous is the number of elements in the previous state,\n", - "    # dm is the dimension of the hidden states.\n", - "    c_t = jnp.matmul(w_n, hidden_states)\n", - "\n", - "    # Return the attention weights and the context vector.\n", - "    return w_n, c_t\n" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": { - 
"id": "QlHL3e_QhLfq", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 653 - }, - "outputId": "b3b0ab21-a262-49e3-ae21-2ab5a4ad3f37" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - ":17: DeprecationWarning: Call to deprecated `word_vec` (Use get_vector instead).\n", - " output.append(jnp.array(model.word_vec(word)))\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T08:53:15.411377\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - } - ], - "source": [ - "words = [\"king\", \"queen\", 
\"royalty\", \"food\", \"apple\", \"pear\", \"computers\"]\n", - "word_embeddings, words = get_word2vec_embedding(words)\n", - "weights, _ = dot_product_attention(word_embeddings, word_embeddings)\n", - "plot_attention_weight_matrix(weights, words, words)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tItZU09YlhEZ" - }, - "source": [ - "Looking at the matrix, we can see which words have similar meanings. The \"royal\" words have higher attention scores with each other than with the \"food\" words, and the \"food\" words likewise attend mostly to one another. We also see that \"computers\" has very low attention scores with all of the other words, which suggests that it is related to neither the \"royal\" nor the \"food\" group.\n", - "\n", - "**Group task:**\n", - " - Play with the word selections above. See if you can find word combinations whose attention values seem counter-intuitive. Think of possible explanations. Which sense of a word did the attention scores capture?\n", - " - Ask your friend whether they found any examples." 
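To make the computation behind the weight matrix above concrete, here is a minimal sketch of dot-product attention on hand-written 2-d vectors. It uses plain Python lists rather than JAX so that it runs without any dependencies; the toy vectors and the helper names (`dot`, `softmax`, `toy_dot_product_attention`) are illustrative assumptions and are not part of the notebook.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_dot_product_attention(hidden_states, previous_state):
    """Mirror the three steps of dot-product attention for a single query vector."""
    # 1. score each hidden state against the query with a dot product
    scores = [dot(previous_state, h) for h in hidden_states]
    # 2. normalise the scores into attention weights that sum to 1
    weights = softmax(scores)
    # 3. context vector = weighted sum of the hidden states
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return weights, context


# hypothetical toy "embeddings": the first two point in similar directions
hidden_states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, context = toy_dot_product_attention(hidden_states, previous_state=[1.0, 0.0])
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The query points in the same direction as the first hidden state, so that state receives the largest weight and the orthogonal third state the smallest, which is the same effect the word heatmap shows for related and unrelated words.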
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "S3iB8hf0hJdX" - }, - "source": [ - "**Note**: The dot product is only one of the ways to implement the scoring function for attention mechanisms; there is a more extensive list in this [blog](https://lilianweng.github.io/posts/2018-06-24-attention/#summary) post by Dr Lilian Weng.\n", - "\n", - "More resources:\n", - "\n", - "[A basic encoder-decoder model for machine translation](https://www.youtube.com/watch?v=gHk2IWivt_8&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=1)\n", - "\n", - "[Training and loss for encoder-decoder models](https://www.youtube.com/watch?v=aBZUTuT1Izs&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=2)\n", - "\n", - "[Basic attention](https://www.youtube.com/watch?v=BSSoEtv5jvQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=6)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aQfqM1EJyDXI" - }, - "source": [ - "### Sequence to sequence attention mechanisms - Intermediate\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "68QBeG-4yDZ9" - }, - "source": [ - "The first attention mechanisms were used in sequence-to-sequence models. These models were usually built from an RNN encoder and an RNN decoder. The input sequence was processed sequentially by an RNN, encoding the sequence in a single context vector, which was then fed into another RNN that generated a new sequence. Below is an example of this ([source](https://lilianweng.github.io/posts/2018-06-24-attention/)).\n", - "\n", - "\n", - "\"drawing\"\n", - "\n", - "Since there is only one context vector, it is challenging for the encoder to represent long sequences and information typically gets lost. 
The attention mechanism introduced in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) was proposed to solve this.\n", - "\n", - "Here, instead of relying on one static context vector, which is also only used once in the decoding process, we provide information about the entire input sequence at every decoding step using a dynamic context vector. By doing this, the decoder can access a larger \"bank\" of memory and attend to the relevant parts of the input based on the decoder's most recent state, $s_{t-1}$. This is shown below.\n", - "\n", - "\"drawing\"\n", - "\n", - "In deep learning, attention can be interpreted as a vector of \"importance.\" To predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate how strongly it is correlated with, or \"attends to,\" other elements using the attention vector/weights. These attention weights are then used to generate a new weighted sum of the remaining elements, which represents the target [(source)](https://lilianweng.github.io/posts/2018-06-24-attention/).\n", - "\n", - "\n", - "This usually consists of three steps for each decoding step $t$:\n", - "\n", - "1. Calculate the score (importance) for each encoder hidden state $h_n$, given the previous decoder state $s_{t-1}$, and use the softmax function to transform the scores into attention weights, $w_{n}$.\n", - "   - $\\text{score} = a(s_{t-1}, h_{n})$, where $a$ can be any differentiable function, such as the dot product.\n", - "   - $w_{n} = \\frac{\\exp \\left\\{a\\left(s_{t-1}, h_{n}\\right)\\right\\}}{\\sum_{j=1}^{N} \\exp \\left\\{a\\left(s_{t-1}, h_{j}\\right)\\right\\}}$, where we use the softmax function to transform the raw scores to relative attention weights.\n", - "2. Generate the final context vector, $c_t$, by summing the products of the attention weights and the encoder hidden states.\n", - "   - $c_t=\\sum_{n=1}^{N} w_n h_{n}$\n", - "3. 
Generate the new decoder state, $s_t$, by combining the previous decoder state, $s_{t-1}$, with the context vector, $c_t$, via some function, $f$.\n", - "\n", - "   - $s_t = f\\left( c_t, s_{t-1} \\right)$\n", - "\n", - "   In Bahdanau et al., 2015, the score function $a$ was a small learned feedforward network (additive attention) and $f$ was a gated recurrent unit. The simpler combination of a dot-product score with a learned feedforward layer over the concatenation of the context vector and the decoder state follows Luong et al., 2015.\n", - " \n", - "Next, let us build up this attention schema, as used in the transformer architecture. We've already calculated simple dot product attention, where the score was given by $a(s_{t-1}, h_n)=s_{t-1} h_n^\\top$, and we're going to use the same idea again." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "J-MU6rrny8Nj" - }, - "source": [ - "### Self-attention to Multihead Attention - Intermediate\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BRuLtxNey_EQ" - }, - "source": [ - "Self-attention and multi-head attention (MHA) are fundamental components of the transformer architecture. In this section, we'll explain the intuition behind these concepts and their implementation. Later, in the **Transformers** section, you'll learn how these attention mechanisms are used to create a sequence-to-sequence model that relies entirely on attention.\n", - "\n", - "As we move forward, we'll represent sentences by breaking them down into individual words and encoding each word using the word2vec model discussed earlier. In the Transformers section, we'll explore in more detail how input sequences are transformed into a series of vectors." 
- ] - }, - { - "cell_type": "code", - "source": [ - "def embed_sentence(sentence):\n", - "    \"\"\"\n", - "    Embed a sentence using word2vec (for illustration only).\n", - "    \"\"\"\n", - "    # clean sentence (not necessary if using a proper LLM tokenizer)\n", - "    sentence = remove_punctuation(sentence)\n", - "\n", - "    # extract individual words\n", - "    words = sentence.split()\n", - "\n", - "    # get the word2vec embedding for each word in the sentence\n", - "    word_vector_sequence, words = get_word2vec_embedding(words)\n", - "\n", - "    # return with extra dimension (useful for creating batches later)\n", - "    return jnp.expand_dims(word_vector_sequence, axis=0), words" - ], - "metadata": { - "id": "J2z6-NckgNT-" - }, - "execution_count": 39, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0AFUEFZGzCTv" - }, - "source": [ - "#### Self-attention" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LF2V3KI-za9l" - }, - "source": [ - "Self-attention is an attention mechanism where each vector of a given input sequence attends to the entire sequence. To gain an intuition for why self-attention is important, let us think about the following sentence (example taken from [source](https://jalammar.github.io/illustrated-transformer/)):\n", - "\n", - "`\"The animal didn't cross the street because it was too tired.\"`\n", - "\n", - "A simple question about this sentence is: what does the word \"it\" refer to? Although the answer is obvious to a human reader, it can be tough for an algorithm to learn. This is where self-attention comes in, as it can learn attention weights for the word \"it\" that assign a large weight to the word \"animal\".\n", - "\n", - "Self-attention also allows the model to learn how to interpret words that share a single embedding but have several meanings, such as \"apple\", which can be a company or a food depending on the context. 
This plays a role similar to the hidden state of an RNN, but, as you will see, self-attention lets the model attend over the entire sequence in parallel, which makes longer sequences practical.\n", - "\n", - "Self-attention builds on three concepts:\n", - "\n", - "- Queries, keys and values\n", - "- Scaled dot product attention\n", - "- Masks" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pwOIMtdZzdTf" - }, - "source": [ - "##### **Queries, keys and values**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mEf7QWIWzdo1" - }, - "source": [ - "Typically, all attention mechanisms can be written in terms of `queries` and `key-value` pairs, which are used to calculate the attention matrix and the new context vector.\n", - "\n", - "To gain intuition, one can interpret the `query` vector as describing the information we are interested in obtaining and the `key` vectors as describing the information each position holds. The `query` vectors are compared to the `key` vectors to get attention scores, where a higher attention score indicates that a `key` has relevant information. These attention scores are then used to determine which `values` (which are paired with the `keys`) we should attend to. Or as [Lena Voita](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html) puts it:\n", - "\n", - "- Query: asking for information\n", - "- Key: saying that it has some information\n", - "- Value: giving the information\n", - "\n", - "In transformer architectures, we use learnable weight matrices, represented as $W_Q,W_K,W_V$, to project each sequence vector to its own $q$, $k$, and $v$ vectors.\n", - "\n", - "\"drawing\"\n", - "\n", - "You will notice that the vectors $q,k,v$ are smaller in size than the input vectors. 
This will be covered at a later stage, but just know that it is a design choice for transformers and not a requirement for attention to work.\n", - "\n", - "This process can also be parallelised, as the input sequence can be represented as a matrix $X$ with one row per token, which can be transformed into query, key, and value matrices $Q$, $K$, and $V$ respectively:\n", - "\n", - "$Q=XW_Q \\\\ K=XW_K \\\\ V=XW_V$\n", - "\n", - "Below we show the code that creates three linear layers, which project the input data to the $Q,K,V$ matrices, where the output size can be adjusted." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": { - "id": "Xc8zjK6eziIV" - }, - "outputs": [], - "source": [ - "class SequenceToQKV(nn.Module):\n", - "    output_size: int\n", - "\n", - "    @nn.compact\n", - "    def __call__(self, X):\n", - "\n", - "        # define the method for weight initialisation\n", - "        initializer = nn.initializers.variance_scaling(scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\")\n", - "\n", - "        # initialise three linear layers to do the QKV transformations.\n", - "        # note: this can also be one layer, how do you think you would do it?\n", - "        q_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "        k_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "        v_layer = nn.Dense(self.output_size, kernel_init=initializer)\n", - "\n", - "        # transform and return the matrices\n", - "        Q = q_layer(X)\n", - "        K = k_layer(X)\n", - "        V = v_layer(X)\n", - "\n", - "        return Q, K, V" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OhGZHFsHz_Qp" - }, - "source": [ - "##### **Scaled dot product attention**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DxycHDUW0BVE" - }, - "source": [ - "Now that we have our `query`, `key` and `value` matrices, it is time to calculate the attention matrix. 
Remember that in all attention mechanisms we must first find a score for each vector in the sequence and then use these scores to create a new context vector. In self-attention, scoring is done using the scaled dot product, and the normalised scores are then used as weights to sum the value vectors and create the context vector.\n", - "\n", - "$\\operatorname{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right) V$\n", - "\n", - "where the attention weights are calculated by $\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}\\right)$ and are then multiplied by $V$ to get the context vector.\n", - "\n", - "\n", - "What happens here is similar to what we did in the dot product attention in the previous section, just applying the mechanism to the sequence itself. For each element in the sequence, we calculate the attention weights between $q_i$ and $K$. We then multiply each value vector in $V$ by its weight and finally sum all weighted vectors $v_{weighted}$ together to form a new representation for position $i$. 
By doing this, we are essentially drowning out irrelevant vectors and bringing up important vectors in the sequence when our focus is on $q_i$.\n", - "\n", - "$QK^\\top$ is divided by $\\sqrt{d_k}$, the square root of the key dimension, so that large dot products do not push the softmax into a saturated region with very small gradients. This gives more stable gradients during training.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": { - "id": "i_UYNzrS0Hga" - }, - "outputs": [], - "source": [ - "def scaled_dot_product_attention(query, key, value):\n", - "    \"\"\"\n", - "    Return the scaled dot product attention output and weights given the Q, K and V matrices\n", - "    \"\"\"\n", - "    d_k = key.shape[-1]\n", - "\n", - "    # get the raw scores (logits) by taking the dot product of the queries and keys\n", - "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "\n", - "    # scale the raw scores and apply the softmax function to get the attention weights\n", - "    scaled_logits = logits / jnp.sqrt(d_k)\n", - "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", - "\n", - "    # multiply the weights by the value matrix to get the output\n", - "    output = jnp.matmul(attention_weights, value)\n", - "\n", - "    return output, attention_weights" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cuNaEjIm0PhV" - }, - "source": [ - "Let's now see scaled dot product attention in action. We will take a sentence, embed each word using word2vec, and see what the final self-attention weights look like.\n", - "\n", - "We will not use the linear projection layers, since we would need to train them first. Instead, we are going to make things simple and use $X=Q=V=K$." 
- ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": { - "id": "3Oy2sWzR0Ok5", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 653 - }, - "outputId": "633491e8-daef-48e6-be3b-b2603e3f750a" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - ":17: DeprecationWarning: Call to deprecated `word_vec` (Use get_vector instead).\n", - " output.append(jnp.array(model.word_vec(word)))\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T08:59:47.470126\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - } - ], - "source": [ - "# define a sentence\n", - "sentence = \"I drink coke, but eat steak\"\n", - "\n", - "# embed and create QKV matrices\n", - "word_embeddings, words = embed_sentence(sentence)\n", - "Q = K = V = word_embeddings\n", - "\n", - "# calculate weights and plot\n", - "outputs, attention_weights = scaled_dot_product_attention(Q, K, V)\n", - "\n", - "# plot the words and the attention weights between them\n", - "words = remove_punctuation(sentence).split()\n", - "plot_attention_weight_matrix(attention_weights[0], 
words, words)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NG1Kxljr0Vzw" - }, - "source": [ - "Keep in mind that we have not trained any attention parameters yet. Even so, using the word2vec vectors as our sequence, scaled dot product attention already attends to \"eat\" when \"steak\" is the query, and the query \"drink\" attends more to \"coke\" and \"eat\".\n", - "\n", - "More resources:\n", - "\n", - "[Attention with Q,K,V](https://www.youtube.com/watch?v=k-5QMalS8bQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=7)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7B-AgO80gIt" - }, - "source": [ - "##### **Masked attention**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tdRoKsu70gGW" - }, - "source": [ - "There are cases where applying self-attention over the entire sequence is not desirable. These include:\n", - "\n", - "- Sequences of uneven length batched together.\n", - "  - When sending a batch of sequences through a network, self-attention expects every sequence to have the same length. This is handled by padding the shorter sequences. When calculating attention, these padding tokens should ideally not be taken into consideration.\n", - "- Training a decoder model.\n", - "  - When training decoder models, such as GPT-3, the decoder has access to the entire target sequence (as training is done in parallel). To prevent the model from cheating by looking at future tokens, we have to mask the future positions so that earlier positions cannot attend to them.\n", - "\n", - "By applying a mask to the scores calculated between queries and keys, we can remove the influence of the unwanted sequence vectors. 
**A position is masked by setting the score between the query and the corresponding key to a VERY large negative value.** The softmax function then pushes that attention weight to (almost) zero, so the corresponding value vector contributes nothing to the weighted sum and does not influence the final representation.\n", - "\n", - "\n", - "Putting everything together, masked scaled dot product attention visually looks like this:\n", - "\n", - "\"drawing\".\n" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": { - "id": "5Syx8_5E0eM9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 438 - }, - "outputId": "2922f653-cb02-43cd-9abf-5565d2cd0baf" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T09:01:11.433530\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n" - }, - "metadata": {} - } - ], - "source": [ - "# example of building a mask for a sequence of 32 tokens\n", - "# the mask makes sure that each position only attends to itself and previous positions in the input (causal mask)\n", - "# we will use this later to insert -inf values into the raw scores\n", - "mask = jnp.tril(jnp.ones((32, 32)))\n", - "\n", - "# plot\n", - "sns.heatmap(mask, cmap=\"Blues\")\n", - "plt.title(\"Example of mask that can be applied\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pfwTJrQ20gDw" - }, - "source": [ - "Let's now adapt our scaled dot product attention function to implement masked attention." 
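Before adapting the full function, here is a minimal plain-Python sketch of the masking rule described above: scores at masked positions are replaced with negative infinity, so the softmax assigns them exactly zero weight. The score values and the helper name `masked_softmax` are made up for illustration, and plain lists are used instead of JAX arrays so the sketch runs without dependencies.

```python
import math


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where mask_row is 0."""
    # replace masked scores with -inf so that exp(-inf) == 0
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)  # assumes at least one position is kept
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# causal (lower-triangular) mask for a sequence of 4 tokens:
# position i may attend to positions 0..i only
T = 4
causal_mask = [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

# hypothetical raw attention scores, identical for every query for simplicity
scores = [[0.5, 1.0, -0.3, 2.0] for _ in range(T)]

attention_weights = [
    masked_softmax(row, mask_row) for row, mask_row in zip(scores, causal_mask)
]
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Every row still sums to 1, but all entries above the diagonal are zero, so no position receives information from a later one.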
- ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": { - "id": "PVHpyNs_0ePh" - }, - "outputs": [], - "source": [ - "def scaled_dot_product_attention(query, key, value, mask=None):\n", - "    \"\"\"\n", - "    Scaled dot product attention with an optional mask (e.g. a causal mask that only allows attending to current and previous positions)\n", - "    \"\"\"\n", - "    d_k = key.shape[-1]\n", - "    T_k = key.shape[-2]\n", - "    T_q = query.shape[-2]\n", - "\n", - "    # get scaled logits using dot product as before\n", - "    logits = jnp.matmul(query, jnp.swapaxes(key, -2, -1))\n", - "    scaled_logits = logits / jnp.sqrt(d_k)\n", - "\n", - "    # apply the optional mask: scores where the mask is 0 are set to -inf\n", - "    if mask is not None:\n", - "        scaled_logits = jnp.where(mask[:T_q, :T_k], scaled_logits, -jnp.inf)\n", - "\n", - "    # calculate the attention weights via softmax\n", - "    attention_weights = jax.nn.softmax(scaled_logits, axis=-1)\n", - "\n", - "    # take the weighted sum of the values to get the output\n", - "    output = jnp.matmul(attention_weights, value)\n", - "\n", - "    return output, attention_weights" - ] - }, - { - "cell_type": "markdown", - "source": [ - "##### **Multi-head attention**" - ], - "metadata": { - "id": "OWDubQwCs4zG" - } - }, - { - "cell_type": "markdown", - "source": [ - "The attention mechanism we've covered so far successfully allows the model to focus on different positions in the input. In practice, the transformer architecture uses a subtle variation of this mechanism, called multi-head attention (MHA).\n", - "\n", - "The distinction is minimal; rather than only computing the attention once, the MHA mechanism runs through the scaled dot-product attention multiple times in parallel. According to the paper, *Attention is All You Need*, \"multi-head attention allows the model to **jointly attend** to information from different representation subspaces at different positions. 
With a single attention head, averaging inhibits this.\"\n", - "\n", - "Multi-head attention can be viewed as a strategy similar to stacking convolution kernels in a CNN layer. Different kernels can focus on and learn different features and rules, and different attention heads can specialise in the same way.\n", - "\n", - "The figure below shows how basic MHA works. The scaled dot product attention discussed earlier is just repeated $N$ times ($N=2$ in this figure), with three learnable matrices ($W_Q$, $W_K$, $W_V$) per head, so $3N$ in total. The outputs from the different heads are then concatenated and fed through a linear projection, which produces the final representation.\n", - "\n", - "In practice, MHA typically outperforms single-head attention.\n", - "\n", - "\"drawing\"\n" - ], - "metadata": { - "id": "nHkyjyErsYae" - } - }, - { - "cell_type": "markdown", - "source": [ - "Let's take a look at how to implement multi-head attention. In simple terms, multi-head attention is like running the attention process multiple times in parallel, using a different set of Q, K, and V projections for each \"head.\" This helps the model focus on different parts of the input at the same time. If you're interested in learning more, check out [this blog by Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention) for a detailed explanation." 
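Before the full Flax implementation, the head-splitting step can be sketched in plain Python: each token's $d_m$-dimensional vector is cut into `num_heads` contiguous chunks of size `d_m // num_heads`, attention runs on each chunk independently, and the per-head outputs are concatenated back together. The helper names `split_heads` and `merge_heads` and the toy numbers are assumptions made for this illustration only.

```python
def split_heads(sequence, num_heads):
    """Split [T, d_m] token vectors into num_heads lists of shape [T, head_size]."""
    d_m = len(sequence[0])
    assert d_m % num_heads == 0, "d_m must be divisible by num_heads"
    head_size = d_m // num_heads
    return [
        [token[h * head_size:(h + 1) * head_size] for token in sequence]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head vectors of each token."""
    num_tokens = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(num_tokens)
    ]


# a toy sequence of T=2 tokens with d_m=4
sequence = [[1, 2, 3, 4], [5, 6, 7, 8]]
heads = split_heads(sequence, num_heads=2)
print(heads)
print(merge_heads(heads))
```

This mirrors the `reshape(B, T, num_heads, head_size).swapaxes(1, 2)` pattern used in the class, minus the batch dimension; merging corresponds to the final `swapaxes(1, 2).reshape(B, T, d_m)`.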
- ], - "metadata": { - "id": "vtuqNCln9EWW" - } - }, - { - "cell_type": "code", - "source": [ - "class MultiHeadAttention(nn.Module):\n", - "    num_heads: int # Number of attention heads\n", - "    d_m: int # Dimension of the model's embeddings\n", - "\n", - "    def setup(self):\n", - "        # Initialize the sequence-to-QKV transformation module\n", - "        self.sequence_to_qkv = SequenceToQKV(self.d_m)\n", - "\n", - "        # Define the initializer for the output linear layer weights\n", - "        initializer = nn.initializers.variance_scaling(\n", - "            scale=0.5, mode=\"fan_in\", distribution=\"truncated_normal\"\n", - "        )\n", - "\n", - "        # Initialize the output projection layer Wo (used after attention)\n", - "        self.Wo = nn.Dense(self.d_m, kernel_init=initializer)\n", - "\n", - "    def __call__(self, X=None, Q=None, K=None, V=None, mask=None, return_weights=False):\n", - "        # If any of Q, K, or V is not provided, use the input X to generate them\n", - "        if Q is None or K is None or V is None:\n", - "            assert X is not None, \"X has to be provided if any of Q, K, or V is not provided\"\n", - "\n", - "            # Generate Q, K, and V matrices from the input X\n", - "            Q, K, V = self.sequence_to_qkv(X)\n", - "\n", - "        # Extract the batch size (B), sequence length (T), and embedding size (d_m)\n", - "        B, T, d_m = K.shape\n", - "\n", - "        # Calculate the size of each attention head's embedding (d_m / num_heads)\n", - "        head_size = d_m // self.num_heads\n", - "\n", - "        # Reshape Q, K, V to have separate dimensions for the heads\n", - "        # B, T, d_m -> B, T, num_heads, head_size -> B, num_heads, T, head_size\n", - "        q_heads = Q.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "        k_heads = K.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "        v_heads = V.reshape(B, T, self.num_heads, head_size).swapaxes(1, 2)\n", - "\n", - "        # Apply scaled dot-product attention to each head\n", - "        attention, attention_weights = scaled_dot_product_attention(\n", - "            q_heads, k_heads, v_heads, mask\n", - "        )\n", - "\n", - "        # 
Reshape the attention output back to its original dimensions\n", - " # (B, num_heads, T, head_size) -> (B, T, num_heads, head_size) -> (B, T, d_m)\n", - " attention = attention.swapaxes(1, 2).reshape(B, T, d_m)\n", - "\n", - " # Apply the output linear transformation Wo to the attention output\n", - " X_new = self.Wo(attention)\n", - "\n", - " # If return_weights is True, return both the transformed output and attention weights\n", - " if return_weights:\n", - " return X_new, attention_weights\n", - " else:\n", - " # Otherwise, return just the transformed output\n", - " return X_new" - ], - "metadata": { - "id": "BY2xXLMQ9CB6" - }, - "execution_count": 47, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e9NW58_3hAg2" - }, - "source": [ - "## **2. Building your own LLM** " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bA_2coZvhAg3" - }, - "source": [ - "### 2.1 High-level overview Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BflycqAw_RF8" - }, - "source": [ - "The Transformer architecture was famously introduced in the paper entitled [Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al.\n", - "\n", - "As the title of the paper suggests, such an architecture consists almost entirely of attention mechanisms, along with feed-forward layers and linear layers, as shown in the diagram below.\n", - "\n", - "\n", - "\n", - "Transformers and their variants are at the core of Large Language Models, and it's not an exaggeration to say that almost all language models out there are Transformer-based architectures.\n", - "\n", - "As you can see in the diagram, the original Transformer architecture consists of two parts: one that receives the inputs, usually called the encoder, and another that receives the outputs (i.e. targets), called the decoder. 
This is because the transformer was designed for machine translation.\n", - "\n", - "The encoder will receive an input sentence in one language and process it through multiple stacked `encoder blocks`. This creates a final representation, which contains helpful information necessary for the decoding task. This output is then fed into stacked `decoder blocks` that produce new outputs in an autoregressive manner.\n", - "\n", - "The encoder consists of $N$ identical blocks, which process a sequence of token vectors sequentially. These blocks consist of 3 parts:\n", - "\n", - "1. A multi-head attention block. These are the transformer architecture's backbone. They process the data to generate representations for each token, ensuring that the necessary information for the task at hand is represented in the vectors. These are exactly the MHA blocks we covered in the attention section previously.\n", - "2. An MLP (Multi-Layer Perceptron, i.e. a neural network with multiple layers) that is applied to each input token separately and identically.\n", - "3. Residual connections: one adds the input tokens to the attended representations, and another adds the input of the MLP to its output. For both these connections, the result is normalized using layernorm. In certain implementations, these normalization steps are applied to the inputs rather than the outputs. Just like ResNets, transformers are designed to be very deep models; thus, these add and norm blocks are essential for a smooth gradient flow. \n", - "\n", - "Similarly, the decoder consists of $N$ identical blocks; however, each of these blocks has more parts. Concretely, the different parts are:\n", - "\n", - "1. A masked multi-head attention block. This is an MHA block that performs _self-attention_ on the output sequence; however, this computation is restricted to the tokens that have already been seen. In other words, future tokens are blocked when making predictions.\n", - "2. 
A multi-head attention block. This block receives the output of the final encoder block, the transformed tokens, and uses that as the key-value pairs, while using the output of the first MHA block as the query. In doing this, the model attends over the input required to perform the sequence task. This MHA block thus performs _cross-attention_ by looking at the encoder outputs.\n", - "3. An MLP, same as the encoder.\n", - "4. Residual connections, same as the encoder.\n", - "\n", - "Building on this original architecture, there have been several variations, some using the encoder only and others the **decoder only**. Large language models (LLMs) such as GPT-2, GPT-3 and Turing-NLG were born out of decoder-only architectures. These architectures look like:\n", - "\n", - "\"drawing\"\n", - "\n", - "with the cross-attention block missing, as no encoder output is available. So to build a language model, we will focus on the decoder-only architecture as seen above.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fbTsk0MdhAhC" - }, - "source": [ - "### 2.2 Tokenization + Positional encoding Beginner\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DehUpfym_RF8" - }, - "source": [ - "#### 2.2.1 Tokenization" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uBiFpVBu_RF9" - }, - "source": [ - "\n", - "Transformers cannot handle raw strings of text. So to process text, the text is first split up into tokens. The tokens are then indexed and each token is assigned an embedding of size $d_{model}$. These embeddings can be learned during training or can come from a pretrained vocabulary of embeddings. This new sequence of token embeddings is then fed into the transformer architecture. 
This idea is visualised below.\n", - "\n", - "\\\\\n", - "\n", - "\"drawing\"\n", - "\n", - "\n", - "These token IDs are typically predicted when a model generates text, fills in missing words, etc.\n", - "\n", - "This process of splitting up text into tokens and assigning an ID to each token is called [tokenisation](https://huggingface.co/docs/transformers/tokenizer_summary). There are various ways to tokenise text, with some methods being trained directly from the data. When using pre-trained transformers, it is crucial to use the same tokeniser that was used to train the model. The previous link has in-depth descriptions of many widely known techniques.\n", - "\n", - "Below we show how the [BERT](https://arxiv.org/abs/1810.04805) model's tokeniser tokenises a sentence. We use [Hugging Face](https://huggingface.co/) for this part.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": { - "id": "hJBMvlUA_RF9", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 162, - "referenced_widgets": [ - "fae0de63c22e4ed8afdc9b3554df9dbc", - "cce88c6717f14dbc8d601a5c306380ac", - "5f825f83172f4f3083c63b4f6c6df441", - "88985dba38b64de2b81a0b2da28e801b", - "534bd01df3c84807bd6a7b15d7279847", - "d8eab87c7cc940388dece8670e466a4e", - "285d1ffdac3149a9b2a17c108d35cf15", - "f18d22e0eeac4ca7953d0f87da7fac4a", - "dce8aa0a4a14409e84bf723f84ff5be6", - "e15e6eb22be4477d8ef216343b7371de", - "801ac3fc2ff34d42a65c0898edf5d07d", - "6b23d26a6c6044029025fab0dfbd3555", - "cb48b8b12a584c599917844e958cd69e", - "d91590c0670345b7a572e7f437353be3", - "663af63865a34844b6dc4ccea7df2ed9", - "913668a324b7433f9235fcdb4bc2a644", - "6ed2b83044fa4b4ab0586719ef9edc96", - "87e9e502cd3d48be83cda4999cac92ee", - "e345de9ae283492887f53c83214538b4", - "b40ea73a453845d3ac21caf4b112165a", - "04f2eb37159e4cf79a2b61e1c402d2a6", - "525a68ecc63b445db2a2eba949679625", - "9aa27129beed4d4ebcb728ea46db7294", - "f8d48f5c510e43b586f7cbadf0ac383e", - "e7a1b65310404ae1a4a685cddce1d727", - 
"58e3bb783e934b81b2457e88fff1c3c6", - "7490d00a7342421eb38a403386e6df64", - "0303f59b0de44e8ea1cb3d1a32147589", - "454d8927472d417081373a30fcc0f919", - "3618690067c7433a8aa81c8ebea5d1a3", - "63bbb5e415c44e15bbc5c88ec9759d4b", - "5878f8b4e8b14f1d9ed602585b6634d0", - "52f1ae9c77b04ce1ad0d9b1a7e8270ab", - "d54d176daad2416ea06ae3e1e1660592", - "589b2d7b489946dbaf2265d20542fd66", - "49da7f1ff2f74d208a0a440430f8845f", - "3952eb993b7d477eaf132061ad355194", - "4804b336a4ba4e4286b352768c4789a8", - "bfd29ef47cfd4db6aa5bb12f05af5779", - "437c0e32e3d84087a8d0b68dbc31f0db", - "31ad90c612e14196873b401e8862ea38", - "bf1eab7f180f45e5b378ceafca1746f4", - "bbca117899a64397b26a512911ba8868", - "fa63f4d8e2934594a6a05c51a70a607e" - ] - }, - "outputId": "77ec71d2-8c67-4a3f-8b2a-ce338e26660e" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "tokenizer_config.json: 0%| | 0.00/49.0 [00:00\n", - "\n", - "Ideally, these encodings should have these characteristics ([source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)):\n", - "* Each time-step should have a unique value\n", - "* The distance between time steps should stay constant.\n", - "* The encoding should be able to generalise to longer sequences than seen during training.\n", - "* The encoding must be deterministic." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rklY-aL-_RF9" - }, - "source": [ - "##### **Sine and cosine functions**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GLcfkMku_RF9" - }, - "source": [ - "\n", - "In Attention is All you Need, the authors used a method that can satisfy all these requirements. 
This involves interleaving sine and cosine waves at different frequencies. The formula for the $i$-th component of the position encoding at position $D$ is shown below, where $i$ is the embedding index and $d_m$ is the token embedding size.\n", - "\n", - "\\\\\n", - "\n", - "$P_{D,i}= \\begin{cases}\\sin \\left(\\frac{D}{10000^{i/d_{m}}}\\right), & \\text { if } i \\bmod 2=0 \\\\ \\cos \\left(\\frac{D}{10000^{(i-1)/d_{m}}}\\right), & \\text { otherwise } \\end{cases}$\n", - "\n", - "\\\n", - "\n", - "Assuming our model has $d_m=8$, the position embedding will look like this:\n", - "\n", - "\\\n", - "$P_{D}=\\left[\\begin{array}{c}\\sin \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{0/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{2/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{4/8}}\\right)\\\\ \\sin \\left(\\frac{D}{10000^{6/8}}\\right)\\\\ \\cos \\left(\\frac{D}{10000^{6/8}}\\right)\\end{array}\\right]$\n", - "\n", - "\\\\\n", - "\n", - "Let's first create a function that can return these encodings to understand why this will work." 
- ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": { - "id": "zT5t5D30_RF9" - }, - "outputs": [], - "source": [ - "def return_frequency_pe_matrix(token_sequence_length, token_embedding):\n", - "\n", - " assert token_embedding % 2 == 0, \"token_embedding should be divisible by two\"\n", - "\n", - " P = jnp.zeros((token_sequence_length, token_embedding))\n", - " positions = jnp.arange(0, token_sequence_length)[:, jnp.newaxis]\n", - "\n", - " i = jnp.arange(0, token_embedding, 2)\n", - " frequency_steps = jnp.exp(i * (-math.log(10000.0) / token_embedding))\n", - " frequencies = positions * frequency_steps\n", - "\n", - " P = P.at[:, 0::2].set(jnp.sin(frequencies))\n", - " P = P.at[:, 1::2].set(jnp.cos(frequencies))\n", - "\n", - " return P" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": { - "id": "CYW-VDOL_RF-", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 697 - }, - "outputId": "f41f3e50-a098-4184-9c01-dab5fe30bf8e" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "[SVG body not recoverable: Matplotlib v3.7.1 figure of the positional encodings, generated 2024-08-30T09:02:57]" - }, - "metadata": {} - } - ], - "source": [ - "token_sequence_length = 50 # Number of tokens the model will need to process\n", - "token_embedding = 10000 # token embedding (and positional encoding) dimensions, ensure it is divisible by two\n", - "P = return_frequency_pe_matrix(token_sequence_length, token_embedding)\n", - "plot_position_encodings(P, token_sequence_length, token_embedding)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1mjHEDPO_RF-" - }, - "source": [ - "Looking at the graph
above, we can see that for each position index, a unique pattern emerges, where each position index consistently has the same encoding.\n", - "\n", - "### **Group Activity**:\n", - "\n", - "- Take a moment with your friend to explore why this specific pattern appears when `token_sequence_length` is set to 1000, and `token_embedding` is 768.\n", - "- Experiment with smaller values for `token_sequence_length` and `token_embedding` to build a deeper understanding and enhance your discussion.\n", - "- Curious about the constant 10000? Ask your friend why they think it’s used in the functions above.\n", - "- Now, try setting `token_sequence_length` to 50 and `token_embedding` to a much larger value, like 10000. What do you observe? Do we always need a large token embedding?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SdNPg0pnhAhG" - }, - "source": [ - "### 2.3 Transformer block Intermediate" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M4vSolF2_RF-" - }, - "source": [ - "Just like an MLP (a simple neural network that processes input data through multiple layers) or a CNN (a type of neural network that excels at recognizing patterns in images by using convolution layers), transformers are made up of a stack of transformer blocks. In this section, we'll build each of the components needed to create one of these transformer blocks." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kTURbfr__RF-" - }, - "source": [ - "\n", - "#### 2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner\n", - "\n", - "\n", - "\"drawing\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LTtFi9AZ_RF-" - }, - "source": [ - "In the original model, these blocks consist of a simple 2-layer MLP (Multi-Layer Perceptron) that uses ReLU activation. However, GeLU (Gaussian Error Linear Unit) has become very popular, and we will be using it throughout this practical. 
The formula below shows the feedforward neural network (FFN) from the original model, with the ReLU activation written as the `max` function. The input `x` is first passed through a linear layer with weights `W1` and bias `b1`, then through the activation, and finally through a second linear layer with weights `W2` and bias `b2`. In our implementation, the ReLU activation, i.e. the `max(0, ·)` in the formula, is replaced by the GeLU activation function.\n", - "\n", - "$$\n", - "\\operatorname{FFN}(x)=\\max \\left(0, x W_{1}+b_{1}\\right) W_{2}+b_{2}\n", - "$$\n", - "\n", - "One can interpret this block as processing what the MHA block has produced and then projecting these new token representations to a space that the next block can use more optimally. Usually, the first layer is very wide, in the range of 2-8 times the size of the token representations. A single wide layer is used because it is easier to parallelize its computations during training than to parallelize a feedforward block with many layers. This adds capacity to the model while keeping training and inference efficient.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the feed forward block." 
- ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": { - "id": "zsho1CnW_RF-" - }, - "outputs": [], - "source": [ - "class FeedForwardBlock(nn.Module):\n", - " \"\"\"\n", - " A 2-layer MLP which widens then narrows the input.\n", - "\n", - " Args:\n", - " widening_factor [optional, default=4]: The size of the hidden layer will be d_model * widening_factor.\n", - " \"\"\"\n", - "\n", - " widening_factor: int = 4\n", - " init_scale: float = 0.25\n", - "\n", - " @nn.compact\n", - " def __call__(self, x):\n", - " '''\n", - " Args:\n", - " x: [B, T, d_m]\n", - "\n", - " Return:\n", - " x: [B, T, d_m]\n", - " '''\n", - " d_m = x.shape[-1]\n", - " layer1_size = self.widening_factor * d_m\n", - "\n", - " initializer = nn.initializers.variance_scaling(\n", - " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - " )\n", - "\n", - " # Hint: Layer 1 is a dense layer (fully connected layer) that increases the size of the input by the widening factor.\n", - " # Use nn.Dense to create this layer with layer1_size as the output size.\n", - " layer1 = # FINISH ME\n", - "\n", - " # Hint: Layer 2 is another dense layer that reduces the size back to the original dimension d_m.\n", - " # Use nn.Dense with d_m as the output size to create this layer.\n", - " layer2 = # FINISH ME\n", - "\n", - " x = jax.nn.gelu(layer1(x)) # Apply the GeLU activation function to the output of layer 1\n", - " x = layer2(x) # Pass the result through layer 2\n", - " return x" - ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": { - "id": "-qj0nfhH_RF-" - }, - "outputs": [], - "source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class FeedForwardBlock(nn.Module):\n", - " \"\"\"A 2-layer MLP (Multi-Layer Perceptron) that first expands the input size and then reduces it back.\"\"\"\n", - "\n", - " # widening_factor controls how much the input dimension is expanded in the first 
layer.\n", - " widening_factor: int = 4\n", - "\n", - " # init_scale controls the scaling factor for weight initialization.\n", - " init_scale: float = 0.25\n", - "\n", - " @nn.compact\n", - " def __call__(self, x):\n", - " # Get the size of the last dimension of the input (embedding size).\n", - " d_m = x.shape[-1]\n", - "\n", - " # Calculate the size of the first layer by multiplying the embedding size by the widening factor.\n", - " layer1_size = self.widening_factor * d_m\n", - "\n", - " # Initialize the weights for both layers using a variance scaling initializer.\n", - " initializer = nn.initializers.variance_scaling(\n", - " scale=self.init_scale, mode='fan_in', distribution='truncated_normal',\n", - " )\n", - "\n", - " # Define the first dense layer, which expands the input size.\n", - " layer1 = nn.Dense(layer1_size, kernel_init=initializer)\n", - "\n", - " # Define the second dense layer, which reduces the size back to the original dimension.\n", - " layer2 = nn.Dense(d_m, kernel_init=initializer)\n", - "\n", - " # Apply the first dense layer followed by a GELU activation function.\n", - " x = jax.nn.gelu(layer1(x))\n", - "\n", - " # Apply the second dense layer to project the data back to its original dimension.\n", - " x = layer2(x)\n", - "\n", - " # Return the final output.\n", - " return x" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sts5Vr4i_RF-" - }, - "source": [ - "#### 2.3.2 Add and Norm block Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TWUpf8wt_RF-" - }, - "source": [ - "In order to get transformers to go deeper, the residual connections are very important to allow an easier flow of gradients through the network. For normalisation, `layer norm` is used. This normalises each token vector independently in the batch. 
It has been found that normalising the vectors improves the convergence and stability of transformers.\n", - "\n", - "There are two learnable parameters in layernorm, `scale` and `bias`, which rescale the normalised value. Thus, for each input token in a batch, we calculate the mean $\\mu_{i}$ and variance $\\sigma_i^2$. We then normalise the token with:\n", - "\n", - "$\\hat{x}_i = \\frac{x_i-\\mu_{i}}{\\sqrt{\\sigma_i^2 + ϵ}}$.\n", - "\n", - "Then $\\hat{x}$ is rescaled using the learned `scale`, $γ$, and `bias`, $β$, with:\n", - "\n", - "$y_i = γ\\hat{x}_i + β = LN_{γ,β}(x_i)$.\n", - "\n", - "So our add norm block can be represented as $LN(x+f(x))$, where $f(x)$ is either an MLP or an MHA block.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the add norm block. It should take as input the processed and unprocessed tokens. Hint: `nn.LayerNorm`" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": { - "id": "F5bLb5Ly_RF_" - }, - "outputs": [], - "source": [ - "class AddNorm(nn.Module):\n", - " \"\"\"A block that implements the add and norm block\"\"\"\n", - "\n", - " @nn.compact\n", - " def __call__(self, x, processed_x):\n", - " '''\n", - " Args:\n", - " x: Sequence of tokens before feeding into MHA or FF blocks, with shape [B, T, d_m]\n", - " processed_x: Sequence of tokens after being processed by MHA or FF blocks, with shape [B, T, d_m]\n", - "\n", - " Return:\n", - " add_norm_x: Transformed tokens with shape [B, T, d_m]\n", - " '''\n", - " # Hint: Step 1 involves adding the original input `x` to the processed input `processed_x`.\n", - " added = # FINISH ME\n", - "\n", - " # Hint: Step 2 requires applying layer normalization to the result of the addition.\n", - " # Use `nn.LayerNorm`, and set `reduction_axes=-1` to apply normalization across the last dimension.\n", - " normalised = #FINISH ME\n", - " return normalised(added)" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": { - "id": "HXSi7BXZ_RF_" - }, - "outputs": [], - 
"source": [ - "# @title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class AddNorm(nn.Module):\n", - " \"\"\"A block that implements the 'Add and Norm' operation used in transformers.\"\"\"\n", - "\n", - " @nn.compact\n", - " def __call__(self, x, processed_x):\n", - " # Step 1: Add the original input (x) to the processed input (processed_x).\n", - " added = x + processed_x\n", - "\n", - " # Step 2: Apply layer normalization to the result of the addition.\n", - " # - LayerNorm helps to stabilize and improve the training process by normalizing the output.\n", - " # - reduction_axes=-1 indicates that normalization is applied across the last dimension (typically the embedding dimension).\n", - " # - use_scale=True and use_bias=True allow the layer to learn scaling and bias parameters for further fine-tuning.\n", - " normalised = nn.LayerNorm(reduction_axes=-1, use_scale=True, use_bias=True)\n", - "\n", - " # Return the normalized result.\n", - " return normalised(added)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "91dXd29b_RF_" - }, - "source": [ - "### 2.4 Building the Transformer Decoder / LLM Intermediate" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sl0UAyvM_RF_" - }, - "source": [ - "\"drawing\"\n", - "\n", - "Most of the groundwork has happened. 
We have built the positional encoding block, the MHA block, the feed-forward block and the add&norm block.\n", - "\n", - "The only remaining part is passing inputs to each decoder block and applying the masked MHA block found in the decoder blocks.\n", - "\n", - "**Code task:** Code up a Flax Module that implements the decoder block: first $X' = LN(X + MHA(X))$ with masked attention, then $LN(X' + FFN(X'))$." - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": { - "id": "wVmSFKZK_RF_" - }, - "outputs": [], - "source": [ - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of heads to be used in the MHA block.\n", - " d_m: Token embedding size\n", - " widening_factor: The size of the hidden layer will be d_m * widening_factor.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - " self.add_norm1 = AddNorm()\n", - " self.add_norm2 = AddNorm()\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Args:\n", - " X: Batch of tokens being fed into the decoder, with shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask to be applied, with shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: Whether to return the attention weights.\n", - " \"\"\"\n", - "\n", - " attention, attention_weights_1 = # FINISH ME\n", - "\n", - " X = # FINISH ME\n", - "\n", - " projection = # FINISH ME\n", - " X = # FINISH ME\n", - "\n", - " return (X, attention_weights_1) if return_att_weight else X" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": { - "id": "stNZVVv3_RF_" - }, - "outputs": [], - "source": 
[ - "#@title Answer to code task (Try not to peek until you've given it a good try!')\n", - "\n", - "class DecoderBlock(nn.Module):\n", - " \"\"\"\n", - " Transformer decoder block.\n", - "\n", - " Args:\n", - " num_heads: The number of attention heads in the Multi-Head Attention (MHA) block.\n", - " d_m: The size of the token embeddings.\n", - " widening_factor: The factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - "\n", - " num_heads: int\n", - " d_m: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialize the Multi-Head Attention (MHA) block\n", - " self.mha = MultiHeadAttention(self.num_heads, self.d_m)\n", - "\n", - " # Initialize the AddNorm blocks for residual connections and normalization\n", - " self.add_norm1 = AddNorm() # First AddNorm block after MHA\n", - " self.add_norm2 = AddNorm() # Second AddNorm block after the MLP\n", - "\n", - " # Initialize the FeedForwardBlock (MLP) which processes the data after attention\n", - " self.MLP = FeedForwardBlock(widening_factor=self.widening_factor)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weight=True):\n", - " \"\"\"\n", - " Forward pass through the DecoderBlock.\n", - "\n", - " Args:\n", - " X: Batch of input tokens fed into the decoder, shape [B, T_decoder, d_m]\n", - " mask [optional, default=None]: Mask to control which positions the attention is allowed to consider, shape [T_decoder, T_decoder].\n", - " return_att_weight [optional, default=True]: If True, returns the attention weights along with the output.\n", - "\n", - " Returns:\n", - " If return_att_weight is True, returns a tuple (X, attention_weights_1).\n", - " Otherwise, returns the processed token representations X.\n", - " \"\"\"\n", - "\n", - " # Apply Multi-Head Attention to the input tokens (X) with optional masking\n", - " attention, attention_weights_1 = self.mha(X, mask=mask, return_weights=True)\n", - "\n", - " # Apply the first AddNorm block (adds the 
original input X and normalizes)\n", - " X = self.add_norm1(X, attention)\n", - "\n", - " # Pass the result through the FeedForwardBlock (MLP) to further process the data\n", - " projection = self.MLP(X)\n", - "\n", - " # Apply the second AddNorm block (adds the input from the previous step and normalizes)\n", - " X = self.add_norm2(X, projection)\n", - "\n", - " # Return the final output X, and optionally the attention weights\n", - " return (X, attention_weights_1) if return_att_weight else X\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8SXXVWd7_RF_" - }, - "source": [ - "Next, we just put everything together, adding in the positional encodings as well as stacking multiple transformer blocks and adding our prediction layer." - ] - }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": { - "id": "4XBG24Qs_RF_" - }, - "outputs": [], - "source": [ - "class LLM(nn.Module):\n", - " \"\"\"\n", - " Transformer model consisting of several layers of decoder blocks.\n", - "\n", - " Args:\n", - " num_heads: Number of attention heads in each Multi-Head Attention (MHA) block.\n", - " num_layers: Number of decoder blocks in the model.\n", - " d_m: Dimensionality of the token embeddings.\n", - " vocab_size: Size of the vocabulary (number of unique tokens).\n", - " widening_factor: Factor by which the hidden layer size is expanded in the MLP.\n", - " \"\"\"\n", - " num_heads: int\n", - " num_layers: int\n", - " d_m: int\n", - " vocab_size: int\n", - " widening_factor: int = 4\n", - "\n", - " def setup(self):\n", - " # Initialize a list of decoder blocks, one for each layer in the model\n", - " self.blocks = [\n", - " DecoderBlock(self.num_heads, self.d_m, self.widening_factor)\n", - " for _ in range(self.num_layers)\n", - " ]\n", - "\n", - " # Initialize an embedding layer to convert token IDs into token embeddings\n", - " self.embedding = nn.Embed(num_embeddings=self.vocab_size, features=self.d_m)\n", - "\n", - " # Initialize a dense layer 
for predicting the next token in the sequence\n", - " self.pred_layer = nn.Dense(self.vocab_size)\n", - "\n", - " def __call__(self, X, mask=None, return_att_weights=False):\n", - " \"\"\"\n", - " Forward pass through the LLM model.\n", - "\n", - " Args:\n", - " X: Batch of input token IDs, shape [B, T_decoder] where B is batch size and T_decoder is sequence length.\n", - " mask [optional, default=None]: Mask to control which positions the attention can focus on, shape [T_decoder, T_decoder].\n", - " return_att_weights [optional, default=False]: Whether to return the attention weights.\n", - "\n", - " Returns:\n", - " logits: Log-probabilities over the vocabulary for each position (log-softmax of the prediction layer output), shape [B, T_decoder, vocab_size].\n", - " If return_att_weights is True, also returns the attention weights.\n", - " \"\"\"\n", - "\n", - " # Convert token IDs to embeddings (shape [B, T_decoder, d_m])\n", - " X = self.embedding(X)\n", - "\n", - " # Get the sequence length of the input\n", - " sequence_len = X.shape[-2]\n", - "\n", - " # Generate positional encodings and add them to the token embeddings\n", - " positions = return_frequency_pe_matrix(sequence_len, self.d_m)\n", - " X = X + positions\n", - "\n", - " # Initialize a list to store attention weights if needed\n", - " if return_att_weights:\n", - " att_weights = []\n", - "\n", - " # Pass the embeddings through each decoder block in sequence\n", - " for block in self.blocks:\n", - " out = block(X, mask, return_att_weights)\n", - " if return_att_weights:\n", - " # If returning attention weights, unpack the output\n", - " X = out[0]\n", - " att_weights.append(out[1])\n", - " else:\n", - " # Otherwise, just update the input for the next block\n", - " X = out\n", - "\n", - " # Apply a dense layer followed by a log softmax; despite its name, `logits` holds log-probabilities over the vocabulary\n", - " logits = nn.log_softmax(self.pred_layer(X))\n", - "\n", - " # Return the log-probabilities, and optionally, the attention weights\n", - " return logits if not return_att_weights else (logits, 
jnp.array(att_weights).swapaxes(0, 1))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sClFLLkU_RF_" - }, - "source": [ - "If everything is implemented correctly, the code below should run without errors." - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": { - "id": "82CWEa5m_RGA" - }, - "outputs": [], - "source": [ - "B, T, d_m, N, vocab_size = 18, 32, 16, 8, 25670\n", - "\n", - "llm = LLM(num_heads=1, num_layers=1, d_m=d_m, vocab_size=vocab_size, widening_factor=4)\n", - "mask = jnp.tril(np.ones((T, T)))\n", - "\n", - "# initialise module and get dummy output\n", - "key = jax.random.PRNGKey(42)\n", - "X = jax.random.randint(key, [B, T], 0, vocab_size)\n", - "params = llm.init(key, X, mask=mask)\n", - "\n", - "# extract output from decoder\n", - "logits, decoder_att_weights = llm.apply(\n", - " params,\n", - " X,\n", - " mask=mask,\n", - " return_att_weights=True,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gve7ssD__RGA" - }, - "source": [ - "As a final sanity check, we can confirm that our attention weights are working correctly. As shown in the figure below, each position in the decoder attends only to itself and to previous tokens, as expected from the lower-triangular mask." - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": { - "id": "H4NpywYv_RGA", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 480 - }, - "outputId": "2d859add-d15b-47c5-c1b1-6faa6ec63138" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T09:16:07.849621\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n\n" - }, - "metadata": {} - } - ], - "source": [ - "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", - "plt.suptitle(\"LLM attention weights\")\n", - 
"sns.heatmap(decoder_att_weights[0, 0, 0, ...], ax=ax, cmap=\"Blues\")\n", - "fig.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wmt3tp38G90A" - }, - "source": [ - "### 2.5 Training your LLM" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "agLIpsoh_RGA" - }, - "source": [ - "#### 2.5.1 Training objective Intermediate\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QOSv1-3B_RGA" - }, - "source": [ - "A sentence is nothing but a string of words. A LLM aims to predict the next word by considering the current context, namely the words that have come before.\n", - "\n", - "Here's the basic idea:\n", - "\n", - "To calculate the probability of a full sentence \"word1, word2, ..., last word\" appearing in a given context $c$, the procedure is to break down the sentence into individual words and consider the probability of each word given the words that precede it. These individual probabilities are then multiplied together:\n", - "\n", - "$$\\text{Probability of sentence} = \\text{Probability of word1} \\times \\text{Probability of word2} \\times \\ldots \\times \\text{Probability of last word}$$\n", - "\n", - "This method is akin to building up a narrative one piece at a time based on the preceding storyline.\n", - "\n", - "Mathematically, this is expressed as the likelihood (probability) of a sequence of words $y_1, y_2, ..., y_n$ in a given context $c$, which is achieved by multiplying the probabilities of each word $y_t$ calculated given the predecessors ($y_{Advanced" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zIQ_aJGW_RGA" - }, - "source": [ - "In the next section, we define all the processes required to train the model using the objective described above. A lot of this is now the work required to do training using FLAX.\n", - "\n", - "Below we gather the dataset and we shall be training on, which is Karpathy's shakespeare dataset. 
It is not essential to understand this code, so either just run the cell to load the data or read through the code if you are curious.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": { - "id": "guMHAaSo_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "f88a064b-a5b2-44f7-8143-9048ff11d7ba" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "--2024-08-30 09:18:33-- https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 1115394 (1.1M) [text/plain]\n", - "Saving to: ‘input.txt’\n", - "\n", - "\rinput.txt 0%[ ] 0 --.-KB/s \rinput.txt 100%[===================>] 1.06M --.-KB/s in 0.04s \n", - "\n", - "2024-08-30 09:18:33 (26.3 MB/s) - ‘input.txt’ saved [1115394/1115394]\n", - "\n" - ] - } - ], - "source": [ - "# @title Create Shakespeare dataset and iterator (optional, but run the cell)\n", - "\n", - "# Trick to avoid errors when downloading tinyshakespeare.\n", - "import locale\n", - "locale.getpreferredencoding = lambda: \"UTF-8\"\n", - "\n", - "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt\n", - "\n", - "class WordBasedAsciiDatasetForLLM:\n", - " \"\"\"In-memory, word-level dataset built from a single ASCII text file, for training a language model.\"\"\"\n", - "\n", - " def __init__(self, path: str, batch_size: int, sequence_length: int):\n", - " \"\"\"Load a single-file ASCII dataset in memory.\"\"\"\n", - " self._batch_size = batch_size\n", - "\n", - " with open(path, \"r\") as f:\n", - " corpus = f.read()\n", - "\n", - " # Tokenize by splitting the text into words (on whitespace)\n", - " words = 
corpus.split()\n", - " self.vocab_size = len(set(words)) # Number of unique words\n", - "\n", - " # Create a mapping from words to unique IDs\n", - " self.word_to_id = {word: i for i, word in enumerate(set(words))}\n", - "\n", - " # Store the inverse mapping from IDs to words\n", - " self.id_to_word = {i: word for word, i in self.word_to_id.items()}\n", - "\n", - " # Convert the words in the corpus to their corresponding IDs\n", - " corpus = np.array([self.word_to_id[word] for word in words]).astype(np.int32)\n", - "\n", - " crop_len = sequence_length + 1\n", - " num_batches, ragged = divmod(corpus.size, batch_size * crop_len)\n", - " if ragged:\n", - " corpus = corpus[:-ragged]\n", - " corpus = corpus.reshape([-1, crop_len])\n", - "\n", - " if num_batches < 10:\n", - " raise ValueError(\n", - " f\"Only {num_batches} batches; consider a shorter \"\n", - " \"sequence or a smaller batch.\"\n", - " )\n", - "\n", - " self._ds = WordBasedAsciiDatasetForLLM._infinite_shuffle(\n", - " corpus, batch_size * 10\n", - " )\n", - "\n", - " def __iter__(self):\n", - " return self\n", - "\n", - " def __next__(self):\n", - " \"\"\"Yield next mini-batch.\"\"\"\n", - " batch = [next(self._ds) for _ in range(self._batch_size)]\n", - " batch = np.stack(batch)\n", - " # Create the language modeling observation/target pairs.\n", - " return dict(\n", - " input=batch[:, :-1], target=batch[:, 1:]\n", - " )\n", - "\n", - " def ids_to_words(self, ids):\n", - " \"\"\"Convert a sequence of word IDs to words.\"\"\"\n", - " return [self.id_to_word[id] for id in ids]\n", - "\n", - " @staticmethod\n", - " def _infinite_shuffle(iterable, buffer_size):\n", - " \"\"\"Infinitely repeat and shuffle data from iterable.\"\"\"\n", - " ds = itertools.cycle(iterable)\n", - " buf = [next(ds) for _ in range(buffer_size)]\n", - " random.shuffle(buf)\n", - " while True:\n", - " item = next(ds)\n", - " idx = random.randint(0, buffer_size - 1) # Inclusive.\n", - " result, buf[idx] = buf[idx], item\n", - " yield 
result\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_WBIFg51oQl0" - }, - "source": [ - "Let's now look at how our data is structured for training." - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": { - "id": "WvH3XPM5_RGB", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "6bd960ae-887a-49b9-90f9-cfaed8fe508f" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "---------- Input -----------\n", - "TEXT: Nay, but speak not maliciously. First Citizen: I say unto you, what he hath done famously, he did it to that end: though soft-conscienced men can be content to say it was\n", - "ASCII: [ 7689 21226 4486 20296 15854 4336 8376 13235 2368 2564 7379 3893\n", - " 4074 6041 7028 7627 4074 6754 9269 23295 11807 785 4841 16254\n", - " 12875 15794 4364 1885 23295 2368 9269 5]\n", - "---------- Target ----------\n", - "TEXT: but speak not maliciously. First Citizen: I say unto you, what he hath done famously, he did it to that end: though soft-conscienced men can be content to say it was for\n", - "ASCII: [21226 4486 20296 15854 4336 8376 13235 2368 2564 7379 3893 4074\n", - " 6041 7028 7627 4074 6754 9269 23295 11807 785 4841 16254 12875\n", - " 15794 4364 1885 23295 2368 9269 5 1215]\n", - "---------- Input -----------\n", - "TEXT: talking on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if\n", - "ASCII: [12366 22188 10530 9269 4364 21595 1348 2770 16167 8376 6969 13327\n", - " 21731 23093 4336 8376 3656 12541 21911 8526 21028 14023 905 4469\n", - " 6613 533 10566 17134 9859 19044 1725 18619]\n", - "---------- Target ----------\n", - "TEXT: on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor citizens, the patricians good. 
What authority surfeits on would relieve us: if they\n", - "ASCII: [22188 10530 9269 4364 21595 1348 2770 16167 8376 6969 13327 21731\n", - " 23093 4336 8376 3656 12541 21911 8526 21028 14023 905 4469 6613\n", - " 533 10566 17134 9859 19044 1725 18619 8816]\n", - "\n", - " Total vocabulary size: 25670\n" - ] - } - ], - "source": [ - "# sample and look at the data\n", - "batch_size = 2\n", - "seq_length = 32\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "\n", - "batch = next(train_dataset)\n", - "\n", - "for obs, target in zip(batch[\"input\"], batch[\"target\"]):\n", - " print(\"-\" * 10, \"Input\", \"-\" * 11)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(obs)))\n", - " print(\"ASCII:\", obs)\n", - " print(\"-\" * 10, \"Target\", \"-\" * 10)\n", - " print(\"TEXT:\", ' '.join(train_dataset.ids_to_words(target)))\n", - " print(\"ASCII:\", target)\n", - "\n", - "print(f\"\\n Total vocabulary size: {train_dataset.vocab_size}\")\n", - "\n", - "VOCAB_SIZE = train_dataset.vocab_size" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w9vzee53_RGB" - }, - "source": [ - "Next, let us train our LLM and see how it performs in producing Shakespearian text. First, we will define what happens for every training step." 
- ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": { - "id": "PGuYBCkekgDw" - }, - "outputs": [], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(3, 4))\n", - "def train_step(params, optimizer_state, batch, apply_fn, update_fn):\n", - " \"\"\"\n", - " Perform a single training step.\n", - "\n", - " Args:\n", - " params: The current parameters of the model.\n", - " optimizer_state: The current state of the optimizer.\n", - " batch: A dictionary containing the input data and target labels for the batch.\n", - " apply_fn: The function used to apply the model to the inputs.\n", - " update_fn: The function used to update the model parameters based on the gradients.\n", - "\n", - " Returns:\n", - " Updated parameters, updated optimizer state, and the computed loss for the batch.\n", - " \"\"\"\n", - "\n", - " def loss_fn(params):\n", - " # Get the sequence length (T) from the input data.\n", - " T = batch['input'].shape[1]\n", - "\n", - " # Apply the model to the input data, using a lower triangular mask to enforce causality.\n", - " # jnp.tril(np.ones((T, T))) creates a lower triangular matrix of ones.\n", - " logits = apply_fn(params, batch['input'], jnp.tril(np.ones((T, T))))\n", - "\n", - " # Calculate the loss between the predicted logits and the target labels.\n", - " loss = sequence_loss_fn(logits, batch['target'])\n", - "\n", - " return loss\n", - "\n", - " # Compute the loss and its gradients with respect to the parameters.\n", - " loss, gradients = jax.value_and_grad(loss_fn)(params)\n", - "\n", - " # Update the optimizer state and calculate the parameter updates based on the gradients.\n", - " updates, optimizer_state = update_fn(gradients, optimizer_state)\n", - "\n", - " # Apply the updates to the parameters.\n", - " params = optax.apply_updates(params, updates)\n", - "\n", - " # Return the updated parameters, optimizer state, and the loss for the batch.\n", - " return params, 
optimizer_state, loss" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rtKWzKIAkfYU" - }, - "source": [ - "Next we initialise our optimizer and model. Feel free to play with the hyperparameters during the practical." - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": { - "id": "8o3q-BZX_RGB" - }, - "outputs": [], - "source": [ - "# Define all hyperparameters\n", - "d_model = 128 # Dimension of token embeddings (d_m)\n", - "num_heads = 4 # Number of attention heads in Multi-Head Attention\n", - "num_layers = 1 # Number of decoder blocks in the model\n", - "widening_factor = 2 # Factor to widen the hidden layer size in the MLP\n", - "LR = 2e-3 # Learning rate for the optimizer\n", - "batch_size = 32 # Number of samples per training batch\n", - "seq_length = 64 # Length of each input sequence (number of tokens)\n", - "\n", - "# Set up the training data\n", - "train_dataset = WordBasedAsciiDatasetForLLM(\"input.txt\", batch_size, seq_length)\n", - "vocab_size = train_dataset.vocab_size # Get the size of the vocabulary from the dataset\n", - "batch = next(train_dataset) # Get the first batch of input data\n", - "\n", - "# Set the random number generator key for model initialization\n", - "rng = jax.random.PRNGKey(42)\n", - "\n", - "# Initialize the LLM model with the specified hyperparameters\n", - "llm = LLM(num_heads=num_heads, num_layers=num_layers, d_m=d_model, vocab_size=vocab_size, widening_factor=widening_factor)\n", - "\n", - "# Create a causal mask to ensure that the model only attends to previous tokens\n", - "mask = jnp.tril(np.ones((batch['input'].shape[1], batch['input'].shape[1])))\n", - "\n", - "# Initialize the model parameters using the first batch of input data and the mask\n", - "params = llm.init(rng, batch['input'], mask)\n", - "\n", - "# Set up the optimizer using the Adam optimization algorithm with the specified learning rate\n", - "optimizer = optax.adam(LR, b1=0.9, b2=0.99)\n", - "optimizer_state = 
optimizer.init(params) # Initialize the optimizer state with the model parameters" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3bPEFakxmvsM" - }, - "source": [ - "Now we train! This will take a few minutes. While it trains, have you greeted your neighbour yet?" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": { - "id": "oUAS6tie_RGB", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 812 - }, - "outputId": "0e4980c4-043d-4f9e-eae2-28c74c3a8817" - }, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "
" - ], - "image/svg+xml": "\n\n\n \n \n \n \n 2024-08-30T09:21:07.272979\n image/svg+xml\n \n \n Matplotlib v3.7.1, https://matplotlib.org/\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n" - }, - "metadata": {} - }, - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Loss\n", - "\tloss \t (min: 0.180, max: 10.654, cur: 0.194)\n" - ] - } - ], - "source": [ - "plotlosses = PlotLosses()\n", - "\n", - "MAX_STEPS = 3500\n", - "LOG_EVERY = 32\n", - "losses = []\n", - "VOCAB_SIZE = 25670\n", - "\n", - "# Training loop\n", - "for step in range(MAX_STEPS):\n", - " batch = next(train_dataset)\n", - " params, optimizer_state, loss = train_step(\n", - " params, optimizer_state, batch, llm.apply, optimizer.update)\n", - " losses.append(loss)\n", - " if step % LOG_EVERY == 0:\n", - " loss_ = jnp.array(losses).mean()\n", - " plotlosses.update(\n", - " {\n", - " \"loss\": loss_,\n", - " }\n", - " )\n", - " plotlosses.send()\n", - " losses = []" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pGv9c2AFmF4V" - }, - "source": [ - "#### 2.5.3 Inspecting the trained LLM Beginner\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Pfq61gim_RGB" - }, - "source": [ 
- "**Reminder:** run all the code presented so far in this section before running the cells below!\n", - "\n", - "Let's generate some text now and see how our model did. DO NOT STOP THE CELL ONCE IT IS RUNNING; THIS WILL CRASH THE SESSION." - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": { - "id": "5lt8HTS__RGC", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "8af3b23e-05df-4a52-f260-5823d29b64de" - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Love the teaching of the maid: That's your device. LUCENTIO: It is: may it be the" - ] - } - ], - "source": [ - "import functools\n", - "\n", - "@functools.partial(jax.jit, static_argnums=(2, ))\n", - "def generate_prediction(params, input, apply_fn):\n", - " logits = apply_fn(params, input)\n", - " argmax_out = jnp.argmax(logits, axis=-1)\n", - " return argmax_out[0][-1].astype(int)\n", - "\n", - "def generate_random_shakespeare(llm, params, id_2_word, word_2_id):\n", - " '''\n", - " Greedily generate 15 words after the prompt \"Love\", printing each word as it is produced.\n", - " '''\n", - "\n", - " prompt = \"Love\"\n", - " print(prompt, end=\"\")\n", - " tokens = prompt.split()\n", - "\n", - " # Greedy decoding: predict the next word, append it to the context, and repeat\n", - " for i in range(15):\n", - " input = jnp.array([[word_2_id[t] for t in tokens]]).astype(int)\n", - " prediction = generate_prediction(params, input, llm.apply)\n", - " prediction = id_2_word[int(prediction)]\n", - " tokens.append(prediction)\n", - " print(\" \"+prediction, end=\"\")\n", - "\n", - " return \" \".join(tokens)\n", - "\n", - "id_2_word = train_dataset.id_to_word\n", - "word_2_id = train_dataset.word_to_id\n", - "\n", - "generated_shakespeare = generate_random_shakespeare(llm, params, id_2_word, word_2_id)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wOwNuMRf_RGC" - }, - "source": [ - "Finally, note that the generation above selects, at each step, the token ID with the highest predicted probability. 
This is called greedy decoding, because we always take the single most likely token. It worked well in this use case, but greedy decoding can degrade output quality, particularly when we want to generate realistic, varied text.\n", - "\n", - "Other methods exist for decoding from the model, a well-known one being beam search. We provide resources below for anyone interested in learning more.\n", - "\n", - "[Greedy Decoding](https://www.youtube.com/watch?v=DW5C3eqAFQM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=4)\n", - "\n", - "[Beam Search](https://www.youtube.com/watch?v=uG3xoYNo3HM&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=5)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fV3YG7QOZD-B" - }, - "source": [ - "## **Conclusion**\n", - "**Summary:**\n", - "\n", - "You've now mastered the essentials of how a Large Language Model (LLM) works, from the fundamentals of attention mechanisms to training your own LLM! These powerful tools have the potential to transform a wide range of tasks. However, like any deep learning model, their magic lies in applying them to the right problems with the right data.\n", - "\n", - "Ready to take your skills to the next level? Dive into fine-tuning your own LLMs and unleash even more potential! I highly recommend exploring last year's practical on Parameter Efficient Fine-Tuning Methods for a comprehensive overview of advanced techniques. The journey doesn't stop here; there's so much more to discover! [LLMs for Everyone 2023](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", - "\n", - "The world of LLMs is yours to explore, so go ahead and create something amazing! 
🌟🚀\n", - "\n", - "---\n", - "\n", - "**Next Steps:**\n", - "[**Efficiently Finetuning LLMs with Hugging Face**](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2023/blob/main/practicals/large_language_models.ipynb)\n", - "\n", - "\n", - "**References:** for further reading, check the links provided throughout\n", - "the individual sections of this colab.\n", - "\n", - "* [Attention is all you need paper](https://arxiv.org/abs/1706.03762)\n", - "* [Additional videos on transformers](https://www.youtube.com/playlist?list=PLmZlBIcArwhOPR2s-FIR7WoqNaBML233s)\n", - "* [LoRA paper](https://arxiv.org/abs/2106.09685)\n", - "* [RLHF](https://huggingface.co/blog/rlhf) (how ChatGPT was trained)\n", - "* [Extending context length](https://kaiokendev.github.io/context)\n", - "\n", - "\n", - "For other practicals from the Deep Learning Indaba, please visit [this repository](https://github.com/deep-learning-indaba/indaba-pracs-2023)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o1ndpYE50BpG" - }, - "source": [ - "# Feedback\n", - "\n", - "Please provide feedback that we can use to improve our practicals in the future." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true, - "id": "OIZvkhfRz9Jz", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "outputId": "46a7ad13-b174-453d-ed32-9f932626a259" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "\n", - "\n", - "\tLoading...\n", - "\n" - ] - }, - "metadata": {}, - "execution_count": 1 - } - ], - "source": [ - "# @title Generate Feedback Form. 
(Run Cell)\n", - "from IPython.display import HTML\n", - "\n", - "HTML(\n", - " \"\"\"\n", - "\n", - "\tLoading...\n", - "\n", - "\"\"\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oglV4kHMWnIN" - }, - "source": [ - "" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "gpuType": "T4", - "provenance": [], - "include_colab_link": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.8.5" - }, - "vscode": { - "interpreter": { - "hash": "145833166d986a8417df3c7acb65d917d84b716b5a452e57fcacdc66f1a168c9" - } - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "d8f46e6226af431d9b7c6ecfa1c2769a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_93a441fdf82141af81e85b2d5aec49b7", - "IPY_MODEL_4eb2c2d0758f4061a3d0fc398018de28", - "IPY_MODEL_fefd764b0cf2425c97cdab4506a81b9c" - ], - "layout": "IPY_MODEL_692c620c5e104d33849c3da34268f5cb" - } - }, - "93a441fdf82141af81e85b2d5aec49b7": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ecec6f9f968940078a4cf7a9105a3717", - "placeholder": "​", - "style": 
"IPY_MODEL_e352febd89fe462e945d40bcf421f7fd", - "value": "config.json: 100%" - } - }, - "4eb2c2d0758f4061a3d0fc398018de28": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f5a917167c914fdfaa86f1f71176eda2", - "max": 1007, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_4edc048323484d24b8c59bb8a2f1fab1", - "value": 1007 - } - }, - "fefd764b0cf2425c97cdab4506a81b9c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_fe54b7356cf049d5b449b3fd69d64221", - "placeholder": "​", - "style": "IPY_MODEL_740307d65ab447658d3945abd47b3318", - "value": " 1.01k/1.01k [00:00<00:00, 14.7kB/s]" - } - }, - "692c620c5e104d33849c3da34268f5cb": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ecec6f9f968940078a4cf7a9105a3717": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"e352febd89fe462e945d40bcf421f7fd": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f5a917167c914fdfaa86f1f71176eda2": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "4edc048323484d24b8c59bb8a2f1fab1": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "fe54b7356cf049d5b449b3fd69d64221": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "740307d65ab447658d3945abd47b3318": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "70969f782d9d48cebefa3ee64e3a04f5": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_111b2a654a424460b4917b5dad5fd69e", - "IPY_MODEL_e35b59e079ab498dbe595cde7c984438", - "IPY_MODEL_f2edac25c1e74b14a4ade39fe86dd040" - ], - "layout": "IPY_MODEL_4983de4c57c349af8cb8b6be78f64030" - } - }, - "111b2a654a424460b4917b5dad5fd69e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_74926ca6f40e44c0887297ac44cbd577", - "placeholder": "​", - "style": "IPY_MODEL_467dfa92fdee4b9d850bd1eac5c502c6", - "value": "model.safetensors: 100%" - } - }, - "e35b59e079ab498dbe595cde7c984438": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_52e8f51e283847d79bda1ee4977fd53f", - "max": 525979192, - "min": 0, - "orientation": "horizontal", - "style": 
"IPY_MODEL_aa3a63b55fe74989b92cfe8504695309", - "value": 525979192 - } - }, - "f2edac25c1e74b14a4ade39fe86dd040": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a78f96142d83418685734ffb9ec85ddd", - "placeholder": "​", - "style": "IPY_MODEL_d642dc3cabc64e66a8bf8853527f7165", - "value": " 526M/526M [00:06<00:00, 79.6MB/s]" - } - }, - "4983de4c57c349af8cb8b6be78f64030": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, 
- "74926ca6f40e44c0887297ac44cbd577": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "467dfa92fdee4b9d850bd1eac5c502c6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "52e8f51e283847d79bda1ee4977fd53f": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "aa3a63b55fe74989b92cfe8504695309": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "a78f96142d83418685734ffb9ec85ddd": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - 
"flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d642dc3cabc64e66a8bf8853527f7165": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8ca5d9a0316d4d85b88d45f5993bc21c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_0dc54052e03245c3b149ed1fcb22b038", - "IPY_MODEL_973cf8a73e3a478c888928030194793d", - "IPY_MODEL_1668490995c74105ab6f11ab33ecaefb" - ], - "layout": "IPY_MODEL_3ec771810c9d472fa75278a76decf956" - } - }, - "0dc54052e03245c3b149ed1fcb22b038": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - 
"state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f505b2bbabae43219a02639d33501a32", - "placeholder": "​", - "style": "IPY_MODEL_f0a69ae2f0064d80825b0091932e7813", - "value": "generation_config.json: 100%" - } - }, - "973cf8a73e3a478c888928030194793d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_65e1cac3c45442b4a68711d410ef37c3", - "max": 119, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_5602e00600884fdfbbe145694bcc0f90", - "value": 119 - } - }, - "1668490995c74105ab6f11ab33ecaefb": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f02e4e7118d64a2bb764909705049d18", - "placeholder": "​", - "style": "IPY_MODEL_c4710e641a1c421f96222ffb65bcedfe", - "value": " 119/119 [00:00<00:00, 1.29kB/s]" - } - }, - "3ec771810c9d472fa75278a76decf956": { - "model_module": 
"@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "f505b2bbabae43219a02639d33501a32": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - 
"grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "f0a69ae2f0064d80825b0091932e7813": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "65e1cac3c45442b4a68711d410ef37c3": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - 
"overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "5602e00600884fdfbbe145694bcc0f90": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "f02e4e7118d64a2bb764909705049d18": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "c4710e641a1c421f96222ffb65bcedfe": { - "model_module": "@jupyter-widgets/controls", - "model_name": 
"DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "98ca063c1b1548ce8de87647dcf23507": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_ec94052f637b4de3a172b3d5d9d32355", - "IPY_MODEL_267fa0085440473c8762bfe21b5e9106", - "IPY_MODEL_708f1e90e25c49c0bbfac3bc5c233e24" - ], - "layout": "IPY_MODEL_14a6f314f3ff4ad3b1f05933c50fc830" - } - }, - "ec94052f637b4de3a172b3d5d9d32355": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_c69550aa99a34bada8ef67f61115d760", - "placeholder": "​", - "style": "IPY_MODEL_d9d79a644a1d42a38b70097c7f77dbad", - "value": "tokenizer_config.json: 100%" - } - }, - "267fa0085440473c8762bfe21b5e9106": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - 
"_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_af73f467b5bd4c38882bc27e3b3b5732", - "max": 727, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_71eeff60e0f7464ca42483c4fe3c7bca", - "value": 727 - } - }, - "708f1e90e25c49c0bbfac3bc5c233e24": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a546a75097e149f4a226453951d987da", - "placeholder": "​", - "style": "IPY_MODEL_3d5b4513d6ee4a73a392e4c16896e14c", - "value": " 727/727 [00:00<00:00, 10.4kB/s]" - } - }, - "14a6f314f3ff4ad3b1f05933c50fc830": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - 
"justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "c69550aa99a34bada8ef67f61115d760": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d9d79a644a1d42a38b70097c7f77dbad": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - 
"_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "af73f467b5bd4c38882bc27e3b3b5732": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "71eeff60e0f7464ca42483c4fe3c7bca": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "a546a75097e149f4a226453951d987da": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - 
"_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "3d5b4513d6ee4a73a392e4c16896e14c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d334e7a205704133b8549562700bca53": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", 
- "children": [ - "IPY_MODEL_2e0e5880d12b4a37b673dcb3e455f47e", - "IPY_MODEL_3c7d5cb0f9de4132a2b26428b72cb10a", - "IPY_MODEL_49de67c36f9b4b66ba234bc606ffa0c4" - ], - "layout": "IPY_MODEL_ff8540ebd47e4dc39197eb6b19945d32" - } - }, - "2e0e5880d12b4a37b673dcb3e455f47e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_622fd5127db84c7388b9258ee7c4fdc1", - "placeholder": "​", - "style": "IPY_MODEL_7a191a40c1e245748de1f54f449afe37", - "value": "vocab.json: 100%" - } - }, - "3c7d5cb0f9de4132a2b26428b72cb10a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3263b46210b544b989301e9edfc23473", - "max": 898669, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_2a64a5672e934b739e6280e2ab278da9", - "value": 898669 - } - }, - "49de67c36f9b4b66ba234bc606ffa0c4": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - 
"_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_98577aad05654ff5aa46ca95121ac640", - "placeholder": "​", - "style": "IPY_MODEL_03251cffadcb429b9dbe87402bb8a4bb", - "value": " 899k/899k [00:00<00:00, 3.43MB/s]" - } - }, - "ff8540ebd47e4dc39197eb6b19945d32": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "622fd5127db84c7388b9258ee7c4fdc1": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": 
null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7a191a40c1e245748de1f54f449afe37": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "3263b46210b544b989301e9edfc23473": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - 
"grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2a64a5672e934b739e6280e2ab278da9": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "98577aad05654ff5aa46ca95121ac640": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - 
"min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "03251cffadcb429b9dbe87402bb8a4bb": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f35d92a421d84260976fc4fba3d4527c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_d96a8053de2f4aeabb8cf68474f6725f", - "IPY_MODEL_441741ea09ac4ba29ac2dcd9f8cc0ade", - "IPY_MODEL_445c478b29e844e99c45eb4e7a093e65" - ], - "layout": "IPY_MODEL_9bc0b47676434ca9a6c8c3f8af50d9aa" - } - }, - "d96a8053de2f4aeabb8cf68474f6725f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2555e6fbde6e45c2ba5f466701a1fb57", - "placeholder": "​", - "style": 
"IPY_MODEL_73d4afacfe3741bc90dfd70e55d763e3", - "value": "merges.txt: 100%" - } - }, - "441741ea09ac4ba29ac2dcd9f8cc0ade": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2815ed72ee56478b81379eabd3cdc004", - "max": 456318, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_7ee8d64ff1344f7f8e6816e8ced5c5d7", - "value": 456318 - } - }, - "445c478b29e844e99c45eb4e7a093e65": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_71ce19a082fd4f28a1c5f4f29853f32d", - "placeholder": "​", - "style": "IPY_MODEL_9e1d5d7e79c9494598ca56079525dadc", - "value": " 456k/456k [00:00<00:00, 9.82MB/s]" - } - }, - "9bc0b47676434ca9a6c8c3f8af50d9aa": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - 
"bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2555e6fbde6e45c2ba5f466701a1fb57": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - 
"73d4afacfe3741bc90dfd70e55d763e3": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "2815ed72ee56478b81379eabd3cdc004": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7ee8d64ff1344f7f8e6816e8ced5c5d7": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": 
"ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "71ce19a082fd4f28a1c5f4f29853f32d": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9e1d5d7e79c9494598ca56079525dadc": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "991ab38f5ab142a2a053da131fca08e8": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_b14dc0d2bba84dbbb17b657ef5555132", - "IPY_MODEL_51a7cfd306fc45498f8f84b579e6f05f", - "IPY_MODEL_3e134af4ddc041aab1e0f8769b77e232" - ], - "layout": "IPY_MODEL_6b33bff41d3d438f9c81af109273f41e" - } - }, - "b14dc0d2bba84dbbb17b657ef5555132": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_ff568bf3a34b46edb2e50fb910608de3", - "placeholder": "​", - "style": "IPY_MODEL_9e1cb31c569f40118ae949cd8643e392", - "value": "tokenizer.json: 100%" - } - }, - "51a7cfd306fc45498f8f84b579e6f05f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_eeeca8ac591242edafe63e100421534e", - "max": 2107652, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_68b138160a984dd686787cad890ff13c", - 
"value": 2107652 - } - }, - "3e134af4ddc041aab1e0f8769b77e232": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_90ef3af60627467d979237b57a8f411d", - "placeholder": "​", - "style": "IPY_MODEL_e93153b83b034468aad8516c644e8d55", - "value": " 2.11M/2.11M [00:00<00:00, 10.8MB/s]" - } - }, - "6b33bff41d3d438f9c81af109273f41e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ff568bf3a34b46edb2e50fb910608de3": { - 
"model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9e1cb31c569f40118ae949cd8643e392": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "eeeca8ac591242edafe63e100421534e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "68b138160a984dd686787cad890ff13c": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "90ef3af60627467d979237b57a8f411d": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, 
- "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e93153b83b034468aad8516c644e8d55": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8b87d73ab967490481a27011d5b53236": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_739e3997f7544b73aff099c31a3d4bed", - "IPY_MODEL_730b750139364343aaee6bd28fde7f53", - "IPY_MODEL_b41c51b1111943468cb50695e5ce1f84" - ], - "layout": "IPY_MODEL_6f3f01955c0847a19b2ca4e84a06d149" - } - }, - "739e3997f7544b73aff099c31a3d4bed": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - 
"_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9084385cba634c01a9e746643e40e32c", - "placeholder": "​", - "style": "IPY_MODEL_9a694d9d63524748a8dd7f2ae59525d5", - "value": "special_tokens_map.json: 100%" - } - }, - "730b750139364343aaee6bd28fde7f53": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_6541ce75ed414eac9428fc9e0a53d128", - "max": 357, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_83436ad970f044c9abc1e79a7b7a749d", - "value": 357 - } - }, - "b41c51b1111943468cb50695e5ce1f84": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_61cba98f8dae431ea588538cdc5a2e07", - "placeholder": "​", - "style": "IPY_MODEL_cbace4a6e6bb476bb9d45d0516e5f8de", - "value": " 357/357 [00:00<00:00, 5.70kB/s]" - } - }, - "6f3f01955c0847a19b2ca4e84a06d149": { - "model_module": "@jupyter-widgets/base", - "model_name": 
"LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9084385cba634c01a9e746643e40e32c": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - 
"justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9a694d9d63524748a8dd7f2ae59525d5": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "6541ce75ed414eac9428fc9e0a53d128": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - 
"overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "83436ad970f044c9abc1e79a7b7a749d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "61cba98f8dae431ea588538cdc5a2e07": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "cbace4a6e6bb476bb9d45d0516e5f8de": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": 
"1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "fae0de63c22e4ed8afdc9b3554df9dbc": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_cce88c6717f14dbc8d601a5c306380ac", - "IPY_MODEL_5f825f83172f4f3083c63b4f6c6df441", - "IPY_MODEL_88985dba38b64de2b81a0b2da28e801b" - ], - "layout": "IPY_MODEL_534bd01df3c84807bd6a7b15d7279847" - } - }, - "cce88c6717f14dbc8d601a5c306380ac": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_d8eab87c7cc940388dece8670e466a4e", - "placeholder": "​", - "style": "IPY_MODEL_285d1ffdac3149a9b2a17c108d35cf15", - "value": "tokenizer_config.json: 100%" - } - }, - "5f825f83172f4f3083c63b4f6c6df441": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f18d22e0eeac4ca7953d0f87da7fac4a", - "max": 49, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_dce8aa0a4a14409e84bf723f84ff5be6", - "value": 49 - } - }, - "88985dba38b64de2b81a0b2da28e801b": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e15e6eb22be4477d8ef216343b7371de", - "placeholder": "​", - "style": "IPY_MODEL_801ac3fc2ff34d42a65c0898edf5d07d", - "value": " 49.0/49.0 [00:00<00:00, 2.48kB/s]" - } - }, - "534bd01df3c84807bd6a7b15d7279847": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - 
"margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d8eab87c7cc940388dece8670e466a4e": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "285d1ffdac3149a9b2a17c108d35cf15": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": 
"StyleView", - "description_width": "" - } - }, - "f18d22e0eeac4ca7953d0f87da7fac4a": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "dce8aa0a4a14409e84bf723f84ff5be6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "e15e6eb22be4477d8ef216343b7371de": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - 
"_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "801ac3fc2ff34d42a65c0898edf5d07d": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "6b23d26a6c6044029025fab0dfbd3555": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - 
"IPY_MODEL_cb48b8b12a584c599917844e958cd69e", - "IPY_MODEL_d91590c0670345b7a572e7f437353be3", - "IPY_MODEL_663af63865a34844b6dc4ccea7df2ed9" - ], - "layout": "IPY_MODEL_913668a324b7433f9235fcdb4bc2a644" - } - }, - "cb48b8b12a584c599917844e958cd69e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_6ed2b83044fa4b4ab0586719ef9edc96", - "placeholder": "​", - "style": "IPY_MODEL_87e9e502cd3d48be83cda4999cac92ee", - "value": "config.json: 100%" - } - }, - "d91590c0670345b7a572e7f437353be3": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_e345de9ae283492887f53c83214538b4", - "max": 570, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_b40ea73a453845d3ac21caf4b112165a", - "value": 570 - } - }, - "663af63865a34844b6dc4ccea7df2ed9": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": 
"1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_04f2eb37159e4cf79a2b61e1c402d2a6", - "placeholder": "​", - "style": "IPY_MODEL_525a68ecc63b445db2a2eba949679625", - "value": " 570/570 [00:00<00:00, 24.6kB/s]" - } - }, - "913668a324b7433f9235fcdb4bc2a644": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6ed2b83044fa4b4ab0586719ef9edc96": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, 
- "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "87e9e502cd3d48be83cda4999cac92ee": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "e345de9ae283492887f53c83214538b4": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, 
- "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b40ea73a453845d3ac21caf4b112165a": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "04f2eb37159e4cf79a2b61e1c402d2a6": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": 
null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "525a68ecc63b445db2a2eba949679625": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "9aa27129beed4d4ebcb728ea46db7294": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_f8d48f5c510e43b586f7cbadf0ac383e", - "IPY_MODEL_e7a1b65310404ae1a4a685cddce1d727", - "IPY_MODEL_58e3bb783e934b81b2457e88fff1c3c6" - ], - "layout": "IPY_MODEL_7490d00a7342421eb38a403386e6df64" - } - }, - "f8d48f5c510e43b586f7cbadf0ac383e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_0303f59b0de44e8ea1cb3d1a32147589", - "placeholder": "​", - "style": "IPY_MODEL_454d8927472d417081373a30fcc0f919", - "value": 
"vocab.txt: 100%" - } - }, - "e7a1b65310404ae1a4a685cddce1d727": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_3618690067c7433a8aa81c8ebea5d1a3", - "max": 213450, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_63bbb5e415c44e15bbc5c88ec9759d4b", - "value": 213450 - } - }, - "58e3bb783e934b81b2457e88fff1c3c6": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_5878f8b4e8b14f1d9ed602585b6634d0", - "placeholder": "​", - "style": "IPY_MODEL_52f1ae9c77b04ce1ad0d9b1a7e8270ab", - "value": " 213k/213k [00:00<00:00, 2.89MB/s]" - } - }, - "7490d00a7342421eb38a403386e6df64": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - 
"flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "0303f59b0de44e8ea1cb3d1a32147589": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "454d8927472d417081373a30fcc0f919": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "3618690067c7433a8aa81c8ebea5d1a3": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "63bbb5e415c44e15bbc5c88ec9759d4b": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "5878f8b4e8b14f1d9ed602585b6634d0": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "52f1ae9c77b04ce1ad0d9b1a7e8270ab": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d54d176daad2416ea06ae3e1e1660592": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HBoxModel", - "model_module_version": 
"1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_589b2d7b489946dbaf2265d20542fd66", - "IPY_MODEL_49da7f1ff2f74d208a0a440430f8845f", - "IPY_MODEL_3952eb993b7d477eaf132061ad355194" - ], - "layout": "IPY_MODEL_4804b336a4ba4e4286b352768c4789a8" - } - }, - "589b2d7b489946dbaf2265d20542fd66": { - "model_module": "@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_bfd29ef47cfd4db6aa5bb12f05af5779", - "placeholder": "​", - "style": "IPY_MODEL_437c0e32e3d84087a8d0b68dbc31f0db", - "value": "tokenizer.json: 100%" - } - }, - "49da7f1ff2f74d208a0a440430f8845f": { - "model_module": "@jupyter-widgets/controls", - "model_name": "FloatProgressModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_31ad90c612e14196873b401e8862ea38", - "max": 435797, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_bf1eab7f180f45e5b378ceafca1746f4", - "value": 435797 - } - }, - "3952eb993b7d477eaf132061ad355194": { - "model_module": 
"@jupyter-widgets/controls", - "model_name": "HTMLModel", - "model_module_version": "1.5.0", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_bbca117899a64397b26a512911ba8868", - "placeholder": "​", - "style": "IPY_MODEL_fa63f4d8e2934594a6a05c51a70a607e", - "value": " 436k/436k [00:00<00:00, 14.4MB/s]" - } - }, - "4804b336a4ba4e4286b352768c4789a8": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "bfd29ef47cfd4db6aa5bb12f05af5779": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - 
"model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "437c0e32e3d84087a8d0b68dbc31f0db": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "31ad90c612e14196873b401e8862ea38": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", 
- "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "bf1eab7f180f45e5b378ceafca1746f4": { - "model_module": "@jupyter-widgets/controls", - "model_name": "ProgressStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "bbca117899a64397b26a512911ba8868": { - "model_module": "@jupyter-widgets/base", - "model_name": "LayoutModel", - "model_module_version": "1.2.0", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - 
"grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "fa63f4d8e2934594a6a05c51a70a607e": { - "model_module": "@jupyter-widgets/controls", - "model_name": "DescriptionStyleModel", - "model_module_version": "1.5.0", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file