
Commit e44af2e

WIP: More stuff
1 parent 95fccbb commit e44af2e

File tree

1 file changed

+250
-21
lines changed


25_library_pytorch_language_modeling.ipynb

Lines changed: 250 additions & 21 deletions
@@ -5,7 +5,7 @@
55
"id": "0",
66
"metadata": {},
77
"source": [
8-
"# PyTorch LLM Tutorial"
8+
"# Language Modeling With PyTorch"
99
]
1010
},
1111
{
@@ -97,8 +97,7 @@
9797
"id": "8",
9898
"metadata": {},
9999
"source": [
100-
"## Bigram language model\n",
101-
"\n"
100+
"## Bigram language model"
102101
]
103102
},
104103
{
@@ -188,7 +187,10 @@
188187
"\n",
189188
"Since we are dealing with 26 characters, plus `<S>` and `<E>`, we need a total of 28x28 = 784 entries.\n",
190189
"Bracketed tokens are customary in NLP to represent special tokens, but here we are only interested in knowing when a sentence starts or ends.\n",
191-
"We can replace `<S>` with `.` and `<E>` with `.` and have a total of 27x27 = 729 entries."
190+
"\n",
191+
"But there's also a problem with keeping two special tokens for the start and end of a sentence: we can't have `('<S>', '<S>')` or `('<E>', '<E>')`, or other combinations like `('a', '<S>')` or `('<E>', 'b')`.\n",
192+
"These would be invalid bigrams.\n",
193+
"To solve this, we can replace `<S>` with `.` and `<E>` with `.` and have a total of 27x27 = 729 entries."
192194
]
193195
},
194196
{
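As a standalone illustration of the `.`-delimited scheme (plain Python, independent of the notebook's variables), the bigrams for a word can be listed with a two-character sliding window:

```python
# Wrap the word with the single '.' token used for both start and end
word = "emma"
chs = ['.'] + list(word) + ['.']

# Adjacent-pair sliding window via zip
bigrams = [ch1 + ch2 for ch1, ch2 in zip(chs, chs[1:])]
print(bigrams)  # ['.e', 'em', 'mm', 'ma', 'a.']
```

A word of length $n$ always yields $n + 1$ bigrams under this scheme.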
@@ -370,8 +372,9 @@
370372
"outputs": [],
371373
"source": [
372374
"g = torch.Generator().manual_seed(2147483647)\n",
375+
"NUM_WORDS = 20\n",
373376
"\n",
374-
"for _ in range(10):\n",
377+
"for _ in range(NUM_WORDS):\n",
375378
" \n",
376379
" out = []\n",
377380
" ix = 0\n",
@@ -399,7 +402,8 @@
399402
"id": "30",
400403
"metadata": {},
401404
"source": [
402-
"The result is not very good.\n",
405+
"The results are quite terrible, although they're reasonable given the simplicity of the model and the patterns we're trying to capture.\n",
406+
"\n",
403407
"The core problem is that a bigram model looks only the the frequency of a pair of tokens, but it has zero information of what's most likely to come before or after those two tokens.\n",
404408
"You can imagine that the obvious next step is a **trigram** model, which looks at the frequency of a triplet of tokens.\n",
405409
"\n",
@@ -436,7 +440,10 @@
436440
"metadata": {},
437441
"source": [
438442
"Here `dim=1` tells PyTorch to sum over the columns (the second index), while `keepdim=True` tells it to keep the first dimension (the first index) as a singleton (a `1`) dimension.\n",
439-
"Without `keepdim=True`, the result would have shape `(27,)`, and performing the division would produce the wrong result because of how [brodcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) works."
443+
"Without `keepdim=True`, the result would have shape `(27,)`, and performing the division would produce the wrong result because of how [brodcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) works.\n",
444+
"\n",
445+
"> Try to experiment with the `keepdim` parameter and see what happens if you remove it.\n",
446+
"> Can you explain why the predictions become complete garbage?"
440447
]
441448
},
442449
{
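A minimal standalone sketch of the shape difference (using a 3x3 matrix as a stand-in for the notebook's 27x27 count matrix `N`):

```python
import torch

# A 3x3 matrix of counts (stand-in for the 27x27 matrix N)
counts = torch.tensor([[1., 2., 3.],
                       [4., 5., 6.],
                       [7., 8., 9.]])

row_sums_kept = counts.sum(dim=1, keepdim=True)  # shape (3, 1)
row_sums_flat = counts.sum(dim=1)                # shape (3,)

# With keepdim=True, each row is divided by its own sum: rows sum to 1
P_good = counts / row_sums_kept
print(P_good.sum(dim=1))  # tensor([1., 1., 1.])

# Without keepdim, the (3,) vector broadcasts along the rows instead,
# so each COLUMN is divided by the wrong sum and rows no longer sum to 1
P_bad = counts / row_sums_flat
print(P_bad.sum(dim=1))   # not all ones
```

This is exactly the failure mode the exercise above asks about: the normalization silently divides by the wrong totals instead of raising an error.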
@@ -473,32 +480,254 @@
473480
" print(''.join(out))"
474481
]
475482
},
483+
{
484+
"cell_type": "markdown",
485+
"id": "34",
486+
"metadata": {},
487+
"source": [
488+
"### Evaluating the quality of the model"
489+
]
490+
},
491+
{
492+
"cell_type": "markdown",
493+
"id": "35",
494+
"metadata": {},
495+
"source": [
496+
"We have built a bigram language model by counting letter combination frequencies, then normalizing and sampling with that probability base.\n",
497+
"\n",
498+
"We trained the model, we sampled from the model (iteratively, character-wise). But its still bad at coming up with names.\n",
499+
"\n",
500+
"But how bad? We know that the model's \"knowledge\" is represented by `P`, but how can we boil down the model's quality in one value?\n",
501+
"\n",
502+
"First, let's look at the bigrams we created from the dataset: the bigrams to `emma` are `.e, em, mm, ma, a.`.\n",
503+
"**What probability does the model assign to each of those bigrams?**"
504+
]
505+
},
476506
{
477507
"cell_type": "code",
478508
"execution_count": null,
479-
"id": "34",
509+
"id": "36",
510+
"metadata": {},
511+
"outputs": [],
512+
"source": [
513+
"for w in words[:1]:\n",
514+
" chs = ['.'] + list(w) + ['.']\n",
515+
" for ch1, ch2 in zip(chs, chs[1:]): # Neat way for two char 'sliding-window'\n",
516+
" ix1 = stoi[ch1]\n",
517+
" ix2 = stoi[ch2]\n",
518+
" prob = P[ix1, ix2]\n",
519+
" print(f'{ch1}{ch2}: {prob:.2%}')"
520+
]
521+
},
522+
{
523+
"cell_type": "markdown",
524+
"id": "37",
525+
"metadata": {},
526+
"source": [
527+
"Anything above or below $\\frac{1}{27} \\approx 3.7\\%$ means we deviate from the mean, that is, a completely uniform distribution of bigrams. \n",
528+
"And that means we learned something from the bigram statistics.\n",
529+
"\n",
530+
"How can we summarize these probabilities into a quality indicating measurement?\n",
531+
"We may compute the product of all probabilities — a number called the **likelihood**.\n",
532+
"But since all these probabilities are small numbers, the product is also a small number, and it is hard to compare likelihoods.\n",
533+
"Solution: *The log-likelihood, the **sum** of $\\log(P)$ over all the individual token probabilities* ($\\log$ is applied for convenience).\n",
534+
"\n",
535+
"> The higher the log-likelihood, the better the model, because the more capable it is of predicting the next character in a sequence from the dataset."
536+
]
537+
},
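To see why the raw likelihood is unwieldy, here is a tiny standalone illustration (plain Python, with made-up per-bigram probabilities, not the notebook's `P`):

```python
import math

# Hypothetical per-bigram probabilities for a short word
probs = [0.19, 0.04, 0.27, 0.08, 0.11]

# Likelihood: product of probabilities -- already tiny for just 5 tokens
likelihood = math.prod(probs)
print(likelihood)  # ~1.8e-05

# Log-likelihood: sum of logs -- same information, comfortable scale
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)

# The two agree: exp(sum of logs) equals the product
assert math.isclose(math.exp(log_likelihood), likelihood)
```

For real datasets with hundreds of thousands of bigrams the product underflows to zero in floating point, while the sum of logs stays perfectly representable.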
538+
{
539+
"cell_type": "code",
540+
"execution_count": null,
541+
"id": "38",
542+
"metadata": {},
543+
"outputs": [],
544+
"source": [
545+
"# Initialize variables\n",
546+
"log_likelihood = 0.0\n",
547+
"n = 0 # character pair count\n",
548+
"\n",
549+
"for word in words:\n",
550+
" # Add start/end tokens and convert to character list\n",
551+
" chars = ['.'] + list(word) + ['.']\n",
552+
" \n",
553+
" # Calculate log probabilities in a more compact way\n",
554+
" for ch1, ch2 in zip(chars, chars[1:]):\n",
555+
" prob = P[stoi[ch1], stoi[ch2]]\n",
556+
" log_likelihood += torch.log(prob)\n",
557+
" n += 1\n",
558+
"\n",
559+
"print(f'{log_likelihood=}')\n",
560+
"nll = -log_likelihood\n",
561+
"print(f'{nll=}') # Negative log likelihood\n",
562+
"print(f'Average NLL: {nll/n:.4f}') # More descriptive output"
563+
]
564+
},
565+
{
566+
"cell_type": "markdown",
567+
"id": "39",
568+
"metadata": {},
569+
"source": [
570+
"We calculated a negative log-likelihood, because this follows the convention of setting the goal to minimize the **loss function**, the function that drives the optimization (i.e., training) process.\n",
571+
"The lower the loss/negative log-likelihood, the better the model.\n",
572+
"\n",
573+
"We got $2.45$ for the model. The lower, the better.\n",
574+
"We need to find the parameters that reduce this value.\n",
575+
"\n",
576+
"**Goal:** Maximize likelihood of the trained data w. r. t. model parameters in `P`\n",
577+
"- This is equivalent to maximizing the log-likelihood (as $\\log$ is monotonic)\n",
578+
"- This is equivalent to minimizing the *negative* log-likelihood\n",
579+
"- And this is equivalent to minimizing the average negative log-likelihood (the quality-measurement, as shown by $2.45$ above)"
580+
]
581+
},
582+
{
583+
"cell_type": "markdown",
584+
"id": "40",
585+
"metadata": {},
586+
"source": [
587+
"There's an immediate problem, though: if we have a word containing a bigram that **never** appears in our training data, the model will assign a probability of $0$ to it, which will make the log-likelihood $-\\infty$."
588+
]
589+
},
590+
{
591+
"cell_type": "code",
592+
"execution_count": null,
593+
"id": "41",
594+
"metadata": {},
595+
"outputs": [],
596+
"source": [
597+
"# Initialize variables\n",
598+
"log_likelihood = 0.0\n",
599+
"n = 0 # character pair count\n",
600+
"\n",
601+
"for word in [\"edobq\"]:\n",
602+
" # Add start/end tokens and convert to character list\n",
603+
" chars = ['.'] + list(word) + ['.']\n",
604+
" \n",
605+
" # Calculate log probabilities in a more compact way\n",
606+
" for ch1, ch2 in zip(chars, chars[1:]):\n",
607+
" prob = P[stoi[ch1], stoi[ch2]]\n",
608+
" log_likelihood += torch.log(prob)\n",
609+
" n += 1\n",
610+
"\n",
611+
"print(f'{log_likelihood=}')\n",
612+
"nll = -log_likelihood\n",
613+
"print(f'{nll=}') # Negative log likelihood\n",
614+
"print(f'Average NLL: {nll/n:.4f}') # More descriptive output"
615+
]
616+
},
617+
{
618+
"cell_type": "markdown",
619+
"id": "42",
620+
"metadata": {},
621+
"source": [
622+
"A negative infinite log-likelihood is definitely not good because our optimizer will never find a \"stable\" solution.\n",
623+
"\n",
624+
"One simple fix is to assign a small but non-zero probability to every bigram: this is called **model smoothing**.\n",
625+
"The easiest way is to ensure that no bigram *never* appears: we can achieve this by adding a constant to our 2D matrix `N`."
626+
]
627+
},
628+
{
629+
"cell_type": "code",
630+
"execution_count": null,
631+
"id": "43",
480632
"metadata": {},
481633
"outputs": [],
482-
"source": []
634+
"source": [
635+
"PS = (N + 1).float() # The higher the number, the more smoothing we apply\n",
636+
"PS /= PS.sum(dim=1, keepdim=True)\n",
637+
"\n",
638+
"# Initialize variables\n",
639+
"log_likelihood = 0.0\n",
640+
"n = 0 # character pair count\n",
641+
"\n",
642+
"for word in [\"edobq\"]:\n",
643+
" # Add start/end tokens and convert to character list\n",
644+
" chars = ['.'] + list(word) + ['.']\n",
645+
" \n",
646+
" # Calculate log probabilities in a more compact way\n",
647+
" for ch1, ch2 in zip(chars, chars[1:]):\n",
648+
" prob = PS[stoi[ch1], stoi[ch2]] # Use the smoothed probabilities\n",
649+
" log_likelihood += torch.log(prob)\n",
650+
" n += 1\n",
651+
"\n",
652+
"print(f'{log_likelihood=}')\n",
653+
"nll = -log_likelihood\n",
654+
"print(f'{nll=}') # Negative log likelihood\n",
655+
"print(f'Average NLL: {nll/n:.4f}') # More descriptive output"
656+
]
657+
},
658+
{
659+
"cell_type": "markdown",
660+
"id": "44",
661+
"metadata": {},
662+
"source": [
663+
"## A neural network approach"
664+
]
665+
},
666+
{
667+
"cell_type": "markdown",
668+
"id": "45",
669+
"metadata": {},
670+
"source": [
671+
"We will cast the problem of character estimation into the framework of neural networks.\n",
672+
"The problem remains the same, the approach changes, and the outcome should look similar.\n",
673+
"\n",
674+
"Our neural network **receives a single character** and **outputs the probability distribution over the next possible characters** ($27$ in this case).\n",
675+
"\n",
676+
"It's going to make guesses on the most likely character to follow.\n",
677+
"We can still measure the performance through the *same* loss function, the negative log-likelihood.\n",
678+
"\n",
679+
"From the training data, we also know the character that actually comes next in each training example.\n",
680+
"We'll use this information to fine-tune (i.e., train or update the parameters of) the neural network to make better guesses: this is a textbook example of **supervised learning**."
681+
]
682+
},
683+
{
684+
"cell_type": "markdown",
685+
"id": "46",
686+
"metadata": {},
687+
"source": [
688+
"### The training set"
689+
]
690+
},
691+
{
692+
"cell_type": "code",
693+
"execution_count": null,
694+
"id": "47",
695+
"metadata": {},
696+
"outputs": [],
697+
"source": [
698+
"#Create training set of all bigrams\n",
699+
"xs, ys = [], [] # Input and output character indices\n",
700+
"\n",
701+
"for w in words:\n",
702+
" chs = ['.'] + list(w) + ['.']\n",
703+
" for ch1, ch2 in zip(chs, chs[1:]):\n",
704+
" ix1 = stoi[ch1]\n",
705+
" ix2 = stoi[ch2]\n",
706+
" xs.append(ix1)\n",
707+
" ys.append(ix2)\n",
708+
"\n",
709+
"# Convert lists to tensors\n",
710+
"xs = torch.tensor(xs)\n",
711+
"ys = torch.tensor(ys)"
712+
]
713+
},
714+
{
715+
"cell_type": "code",
716+
"execution_count": null,
717+
"id": "48",
718+
"metadata": {},
719+
"outputs": [],
720+
"source": [
721+
"for i in range(5):\n",
722+
" print(f'For character #{i} \"{itos[xs[i].item()]}\" in xs, we expect the model to predict \"{itos[ys[i].item()]}\"')"
723+
]
483724
}
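Looking ahead (this is an assumption about the next step, not part of this commit): integer indices like `xs` are typically one-hot encoded before being fed to a linear layer, e.g. with `torch.nn.functional.one_hot`. A minimal sketch with toy indices:

```python
import torch
import torch.nn.functional as F

# Toy input indices (stand-ins for the notebook's xs); 27 = alphabet + '.'
xs = torch.tensor([0, 5, 13])

# one_hot turns each integer into a 27-dim vector with a single 1 at that index
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc.shape)       # torch.Size([3, 27])
print(xenc.sum(dim=1))  # each row contains exactly one 1
```

The cast to `.float()` matters: `one_hot` returns integer tensors, but matrix multiplication against a weight matrix expects floating-point inputs.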
484725
],
485726
"metadata": {
486727
"kernelspec": {
487728
"display_name": "Python 3 (ipykernel)",
488729
"language": "python",
489730
"name": "python3"
490-
},
491-
"language_info": {
492-
"codemirror_mode": {
493-
"name": "ipython",
494-
"version": 3
495-
},
496-
"file_extension": ".py",
497-
"mimetype": "text/x-python",
498-
"name": "python",
499-
"nbconvert_exporter": "python",
500-
"pygments_lexer": "ipython3",
501-
"version": "3.12.10"
502731
}
503732
},
504733
"nbformat": 4,
