|
1 | 1 | # Language Modelling Exercise |
2 | 2 |
|
3 | 3 | This exercise will allow you to explore language modelling. We focus on the key concept of multi-head attention. |
4 | | -Navigate to the `src/attention_model.py`-file and implement multi-head attention [1] |
5 | 4 |
|
6 | | -``` math |
7 | | -\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}})\mathbf{V} |
8 | | -``` |
| 5 | +1. Navigate to the `src/attention_model.py`-file and implement multi-head attention [1] |
9 | 6 |
|
10 | | -To make attention useful in a language modelling scenario we cannot use future information. A model without access to upcoming future inputs or words is known as causal. |
11 | | -Since our attention matrix is multiplied from the left we must mask out the upper triangle |
12 | | -excluding the main diagonal for causality. |
| 7 | + ``` math |
| 8 | + \text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}})\mathbf{V} |
| 9 | + ``` |
13 | 10 |
|
14 | | -Keep in mind that $\mathbf{Q} \in \mathbb{R}^{b,h,o,d_k}$, $\mathbf{K} \in \mathbb{R}^{b,h,o,d_k}$ and $\mathbf{V} \in \mathbb{R}^{b,h,o,d_v}$, with $b$ the batch size, $h$ the number of heads, $o$ the desired output dimension, $d_k$ the key dimension and finally $d_v$ as value dimension. Your code must rely on broadcasting to process the matrix operations correctly. The notation follows [1]. |
| 11 | + To make attention useful in a language modelling scenario we cannot use future information. A model without access to upcoming future inputs or words is known as causal. |
| 12 | + Since our attention matrix is multiplied from the left we must mask out the upper triangle |
| 13 | + excluding the main diagonal for causality. |
15 | 14 |
|
16 | | -Furthermore write a function to convert the network output of vector encodings back into a string by completing the `convert` function in `src/util.py`. |
| 15 | + Keep in mind that $\mathbf{Q} \in \mathbb{R}^{b,h,o,d_k}$, $\mathbf{K} \in \mathbb{R}^{b,h,o,d_k}$ and $\mathbf{V} \in \mathbb{R}^{b,h,o,d_v}$, with $b$ the batch size, $h$ the number of heads, $o$ the desired output dimension, $d_k$ the key dimension and finally $d_v$ as value dimension. Your code must rely on broadcasting to process the matrix operations correctly. The notation follows [1]. |
17 | 16 |
|
| 17 | +2. Furthermore write a function to convert the network output of vector encodings back into a string by completing the `convert` function in `src/util.py`. |
| 18 | +
|
| 19 | +3. Once you have implemented and tested your version of attention run `sbatch scripts/train.slurm` to train your model on Bender. Once converged you can generate poetry via `sbatch scripts/generate.slurm`. |
| 20 | +Run `src/model_chat.py` to talk to your model. |
18 | 21 |
|
19 | 22 | [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: |
20 | 23 | Attention is All you Need. NIPS 2017: 5998-6008 |
21 | | - |
22 | | -Once you have implemented and tested your version of attention run `sbatch scripts/train.slurm` to train your model on Bender. Once converged you can generate poetry via `sbatch scripts/generate.slurm`. |
23 | | -Run `src/model_chat.py` to talk to your model. |
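
The masked attention described in the README can be sketched as follows. This is a minimal illustration, not the reference solution for `src/attention_model.py`: it assumes NumPy, whereas the exercise may use a different array framework, and the function names `softmax` and `causal_attention` are hypothetical. It follows the stated shapes, with $\mathbf{Q}, \mathbf{K} \in \mathbb{R}^{b,h,o,d_k}$ and $\mathbf{V} \in \mathbb{R}^{b,h,o,d_v}$, and relies on broadcasting over the batch and head axes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def causal_attention(q, k, v):
    """Hypothetical helper. q, k: (b, h, o, d_k); v: (b, h, o, d_v)."""
    d_k = q.shape[-1]
    o = q.shape[-2]
    # (b, h, o, d_k) @ (b, h, d_k, o) -> (b, h, o, o)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    # True strictly above the main diagonal, i.e. at future positions.
    future = np.triu(np.ones((o, o), dtype=bool), k=1)
    # The (o, o) mask broadcasts over the batch and head axes.
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    # (b, h, o, o) @ (b, h, o, d_v) -> (b, h, o, d_v)
    return weights @ v
```

Because the diagonal is never masked, every row keeps at least one finite score, so the softmax stays well defined. A quick sanity check is that the first output position can only attend to itself and therefore equals the first value vector.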
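
The `convert` step in `src/util.py` could look roughly like the sketch below. The README does not specify the encoding, so this assumes each position of the network output is a vector of scores over a character vocabulary and that the highest score wins. The `vocabulary` parameter and the signature are assumptions for illustration, not the interface the exercise expects.

```python
def convert(encodings, vocabulary):
    """Turn per-position score vectors into a string.

    encodings: sequence of score vectors, one per output position (assumed).
    vocabulary: sequence mapping an index to its character (hypothetical).
    """
    chars = []
    for scores in encodings:
        scores = list(scores)
        # Index of the largest score, i.e. a plain argmax.
        best = max(range(len(scores)), key=scores.__getitem__)
        chars.append(vocabulary[best])
    return "".join(chars)
```

For example, with the vocabulary `"abc"`, the scores `[[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]` map to `"ba"`. An array from the model can be passed in after calling its `.tolist()` method.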