|
5 | 5 | "id": "0", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# PyTorch LLM Tutorial" |
| 8 | + "# Language Modeling With PyTorch" |
9 | 9 | ] |
10 | 10 | }, |
11 | 11 | { |
|
97 | 97 | "id": "8", |
98 | 98 | "metadata": {}, |
99 | 99 | "source": [ |
100 | | - "## Bigram language model\n", |
101 | | - "\n" |
| 100 | + "## Bigram language model" |
102 | 101 | ] |
103 | 102 | }, |
104 | 103 | { |
|
188 | 187 | "\n", |
189 | 188 | "Since we are dealing with 26 characters, plus `<S>` and `<E>`, we need a total of 28x28 = 784 entries.\n", |
190 | 189 | "Bracketed tokens are customary in NLP to represent special tokens, but here we are only interested in knowing when a sentence starts or ends.\n", |
191 | | - "We can replace `<S>` with `.` and `<E>` with `.` and have a total of 27x27 = 729 entries." |
| 190 | + "\n", |
| 191 | + "But there's also a problem with keeping two special tokens for the start and end of a sentence: we can't have `('<S>', '<S>')` or `('<E>', '<E>')`, or other combinations like `('a', '<S>')` or `('<E>', 'b')`.\n", |
| 192 | + "These would be invalid bigrams.\n", |
| 193 | + "To solve this, we can replace `<S>` with `.` and `<E>` with `.` and have a total of 27x27 = 729 entries." |
192 | 194 | ] |
193 | 195 | }, |
194 | 196 | { |
|
370 | 372 | "outputs": [], |
371 | 373 | "source": [ |
372 | 374 | "g = torch.Generator().manual_seed(2147483647)\n", |
| 375 | + "NUM_WORDS = 20\n", |
373 | 376 | "\n", |
374 | | - "for _ in range(10):\n", |
| 377 | + "for _ in range(NUM_WORDS):\n", |
375 | 378 | " \n", |
376 | 379 | " out = []\n", |
377 | 380 | " ix = 0\n", |
|
399 | 402 | "id": "30", |
400 | 403 | "metadata": {}, |
401 | 404 | "source": [ |
402 | | - "The result is not very good.\n", |
| 405 | + "The results are quite terrible, although they're reasonable given the simplicity of the model and the patterns we're trying to capture.\n", |
| 406 | + "\n", |
403 | 407 | "The core problem is that a bigram model looks only at the frequency of a pair of tokens, but it has zero information about what's most likely to come before or after those two tokens.\n", |
404 | 408 | "You can imagine that the obvious next step is a **trigram** model, which looks at the frequency of a triplet of tokens.\n", |
405 | 409 | "\n", |
|
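As a hedged sketch of the trigram idea mentioned above (a standalone toy example: the tiny `words` list and the `stoi` construction here are stand-ins for the notebook's dataset, not its actual variables), counting triplet frequencies looks much like the bigram count, just with a 3D tensor:

```python
import torch

# Toy dataset and vocabulary; in the notebook these come from names.txt
words = ["emma", "olivia", "ava"]
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0
V = len(stoi)

# 3D count tensor: N3[i, j, k] counts how often character k follows the pair (i, j)
N3 = torch.zeros((V, V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        N3[stoi[ch1], stoi[ch2], stoi[ch3]] += 1

print(N3.sum().item())  # one count per trigram in the dataset
```

The price of the extra context is the tensor size: 27 characters would already need 27x27x27 = 19,683 entries, and the counts become much sparser.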
436 | 440 | "metadata": {}, |
437 | 441 | "source": [ |
438 | 442 | "Here `dim=1` tells PyTorch to sum over the columns (the second index), while `keepdim=True` tells it to keep the first dimension (the first index) as a singleton (a `1`) dimension.\n", |
439 | | - "Without `keepdim=True`, the result would have shape `(27,)`, and performing the division would produce the wrong result because of how [brodcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) works." |
| 443 | + "Without `keepdim=True`, the result would have shape `(27,)`, and performing the division would produce the wrong result because of how [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) works.\n", |
| 444 | + "\n", |
| 445 | + "> Try to experiment with the `keepdim` parameter and see what happens if you remove it.\n", |
| 446 | + "> Can you explain why the predictions become complete garbage?" |
440 | 447 | ] |
441 | 448 | }, |
442 | 449 | { |
|
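To see the broadcasting pitfall discussed above in isolation, here is a small standalone example (a 2x2 toy matrix, not the notebook's `N`):

```python
import torch

N = torch.tensor([[1., 3.],
                  [1., 1.]])

# keepdim=True: row sums keep shape (2, 1), so each row is divided by its own sum
P_good = N / N.sum(dim=1, keepdim=True)

# keepdim=False: sums have shape (2,), which broadcasts as a ROW vector,
# so each column is divided by the wrong sum
P_bad = N / N.sum(dim=1)

print(P_good.sum(dim=1))  # each row sums to 1
print(P_bad.sum(dim=1))   # rows do not sum to 1
```

The silent part is that both versions run without error; only the `keepdim=True` variant produces valid per-row probability distributions.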
473 | 480 | " print(''.join(out))" |
474 | 481 | ] |
475 | 482 | }, |
| 483 | + { |
| 484 | + "cell_type": "markdown", |
| 485 | + "id": "34", |
| 486 | + "metadata": {}, |
| 487 | + "source": [ |
| 488 | + "### Evaluating the quality of the model" |
| 489 | + ] |
| 490 | + }, |
| 491 | + { |
| 492 | + "cell_type": "markdown", |
| 493 | + "id": "35", |
| 494 | + "metadata": {}, |
| 495 | + "source": [ |
| 496 | + "We have built a bigram language model by counting letter combination frequencies, then normalizing and sampling with that probability base.\n", |
| 497 | + "\n", |
| 498 | + "We trained the model and sampled from it (iteratively, character by character). But it's still bad at coming up with names.\n", |
| 499 | + "\n", |
| 500 | + "But how bad? We know that the model's \"knowledge\" is represented by `P`, but how can we boil the model's quality down to a single value?\n", |
| 501 | + "\n", |
| 502 | + "First, let's look at the bigrams we created from the dataset: the bigrams for `emma` are `.e, em, mm, ma, a.`.\n", |
| 503 | + "**What probability does the model assign to each of those bigrams?**" |
| 504 | + ] |
| 505 | + }, |
476 | 506 | { |
477 | 507 | "cell_type": "code", |
478 | 508 | "execution_count": null, |
479 | | - "id": "34", |
| 509 | + "id": "36", |
| 510 | + "metadata": {}, |
| 511 | + "outputs": [], |
| 512 | + "source": [ |
| 513 | + "for w in words[:1]:\n", |
| 514 | + " chs = ['.'] + list(w) + ['.']\n", |
| 515 | + "    for ch1, ch2 in zip(chs, chs[1:]): # Sliding window over consecutive character pairs\n", |
| 516 | + " ix1 = stoi[ch1]\n", |
| 517 | + " ix2 = stoi[ch2]\n", |
| 518 | + " prob = P[ix1, ix2]\n", |
| 519 | + " print(f'{ch1}{ch2}: {prob:.2%}')" |
| 520 | + ] |
| 521 | + }, |
| 522 | + { |
| 523 | + "cell_type": "markdown", |
| 524 | + "id": "37", |
| 525 | + "metadata": {}, |
| 526 | + "source": [ |
| 527 | + "Any probability above or below $\\frac{1}{27} \\approx 3.7\\%$ deviates from a completely uniform distribution over bigrams.\n", |
| 528 | + "That deviation means the model learned something from the bigram statistics.\n", |
| 529 | + "\n", |
| 530 | + "How can we summarize these probabilities into a single measure of model quality?\n", |
| 531 | + "We may compute the product of all probabilities — a number called the **likelihood**.\n", |
| 532 | + "But since all these probabilities are small numbers, the product is also a small number, and it is hard to compare likelihoods.\n", |
| 533 | + "Solution: *The log-likelihood, the **sum** of $\\log(P)$ over all the individual token probabilities* ($\\log$ is applied for convenience).\n", |
| 534 | + "\n", |
| 535 | + "> The higher the log-likelihood, the better the model, because the more capable it is of predicting the next character in a sequence from the dataset." |
| 536 | + ] |
| 537 | + }, |
| 538 | + { |
| 539 | + "cell_type": "code", |
| 540 | + "execution_count": null, |
| 541 | + "id": "38", |
| 542 | + "metadata": {}, |
| 543 | + "outputs": [], |
| 544 | + "source": [ |
| 545 | + "# Initialize variables\n", |
| 546 | + "log_likelihood = 0.0\n", |
| 547 | + "n = 0 # character pair count\n", |
| 548 | + "\n", |
| 549 | + "for word in words:\n", |
| 550 | + " # Add start/end tokens and convert to character list\n", |
| 551 | + " chars = ['.'] + list(word) + ['.']\n", |
| 552 | + " \n", |
| 553 | + " # Calculate log probabilities in a more compact way\n", |
| 554 | + " for ch1, ch2 in zip(chars, chars[1:]):\n", |
| 555 | + " prob = P[stoi[ch1], stoi[ch2]]\n", |
| 556 | + " log_likelihood += torch.log(prob)\n", |
| 557 | + " n += 1\n", |
| 558 | + "\n", |
| 559 | + "print(f'{log_likelihood=}')\n", |
| 560 | + "nll = -log_likelihood\n", |
| 561 | + "print(f'{nll=}') # Negative log likelihood\n", |
| 562 | + "print(f'Average NLL: {nll/n:.4f}') # More descriptive output" |
| 563 | + ] |
| 564 | + }, |
| 565 | + { |
| 566 | + "cell_type": "markdown", |
| 567 | + "id": "39", |
| 568 | + "metadata": {}, |
| 569 | + "source": [ |
| 570 | + "We calculated the negative log-likelihood because, by convention, the goal is to minimize the **loss function**, the function that drives the optimization (i.e., training) process.\n", |
| 571 | + "The lower the loss/negative log-likelihood, the better the model.\n", |
| 572 | + "\n", |
| 573 | + "We got $2.45$ for our model.\n", |
| 574 | + "We need to find the parameters that reduce this value.\n", |
| 575 | + "\n", |
| 576 | + "**Goal:** Maximize the likelihood of the training data w.r.t. the model parameters in `P`\n", |
| 577 | + "- This is equivalent to maximizing the log-likelihood (as $\\log$ is monotonic)\n", |
| 578 | + "- This is equivalent to minimizing the *negative* log-likelihood\n", |
| 579 | + "- And this is equivalent to minimizing the average negative log-likelihood (the quality measure shown as $2.45$ above)" |
| 580 | + ] |
| 581 | + }, |
| 582 | + { |
| 583 | + "cell_type": "markdown", |
| 584 | + "id": "40", |
| 585 | + "metadata": {}, |
| 586 | + "source": [ |
| 587 | + "There's an immediate problem, though: if we have a word containing a bigram that **never** appears in our training data, the model will assign a probability of $0$ to it, which will make the log-likelihood $-\\infty$." |
| 588 | + ] |
| 589 | + }, |
| 590 | + { |
| 591 | + "cell_type": "code", |
| 592 | + "execution_count": null, |
| 593 | + "id": "41", |
| 594 | + "metadata": {}, |
| 595 | + "outputs": [], |
| 596 | + "source": [ |
| 597 | + "# Initialize variables\n", |
| 598 | + "log_likelihood = 0.0\n", |
| 599 | + "n = 0 # character pair count\n", |
| 600 | + "\n", |
| 601 | + "for word in [\"edobq\"]:\n", |
| 602 | + " # Add start/end tokens and convert to character list\n", |
| 603 | + " chars = ['.'] + list(word) + ['.']\n", |
| 604 | + " \n", |
| 605 | + " # Calculate log probabilities in a more compact way\n", |
| 606 | + " for ch1, ch2 in zip(chars, chars[1:]):\n", |
| 607 | + " prob = P[stoi[ch1], stoi[ch2]]\n", |
| 608 | + " log_likelihood += torch.log(prob)\n", |
| 609 | + " n += 1\n", |
| 610 | + "\n", |
| 611 | + "print(f'{log_likelihood=}')\n", |
| 612 | + "nll = -log_likelihood\n", |
| 613 | + "print(f'{nll=}') # Negative log likelihood\n", |
| 614 | + "print(f'Average NLL: {nll/n:.4f}') # More descriptive output" |
| 615 | + ] |
| 616 | + }, |
| 617 | + { |
| 618 | + "cell_type": "markdown", |
| 619 | + "id": "42", |
| 620 | + "metadata": {}, |
| 621 | + "source": [ |
| 622 | + "An infinitely negative log-likelihood is definitely not good because our optimizer will never find a \"stable\" solution.\n", |
| 623 | + "\n", |
| 624 | + "One simple fix is to assign a small but non-zero probability to every bigram: this is called **model smoothing**.\n", |
| 625 | + "The easiest way is to ensure that every bigram appears at least once, which we can achieve by adding a constant to our 2D count matrix `N`." |
| 626 | + ] |
| 627 | + }, |
| 628 | + { |
| 629 | + "cell_type": "code", |
| 630 | + "execution_count": null, |
| 631 | + "id": "43", |
480 | 632 | "metadata": {}, |
481 | 633 | "outputs": [], |
482 | | - "source": [] |
| 634 | + "source": [ |
| 635 | + "PS = (N + 1).float() # The larger the added constant, the more smoothing we apply\n", |
| 636 | + "PS /= PS.sum(dim=1, keepdim=True)\n", |
| 637 | + "\n", |
| 638 | + "# Initialize variables\n", |
| 639 | + "log_likelihood = 0.0\n", |
| 640 | + "n = 0 # character pair count\n", |
| 641 | + "\n", |
| 642 | + "for word in [\"edobq\"]:\n", |
| 643 | + " # Add start/end tokens and convert to character list\n", |
| 644 | + " chars = ['.'] + list(word) + ['.']\n", |
| 645 | + " \n", |
| 646 | + " # Calculate log probabilities in a more compact way\n", |
| 647 | + " for ch1, ch2 in zip(chars, chars[1:]):\n", |
| 648 | + " prob = PS[stoi[ch1], stoi[ch2]] # Use the smoothed probabilities\n", |
| 649 | + " log_likelihood += torch.log(prob)\n", |
| 650 | + " n += 1\n", |
| 651 | + "\n", |
| 652 | + "print(f'{log_likelihood=}')\n", |
| 653 | + "nll = -log_likelihood\n", |
| 654 | + "print(f'{nll=}') # Negative log likelihood\n", |
| 655 | + "print(f'Average NLL: {nll/n:.4f}') # More descriptive output" |
| 656 | + ] |
| 657 | + }, |
| 658 | + { |
| 659 | + "cell_type": "markdown", |
| 660 | + "id": "44", |
| 661 | + "metadata": {}, |
| 662 | + "source": [ |
| 663 | + "## A neural network approach" |
| 664 | + ] |
| 665 | + }, |
| 666 | + { |
| 667 | + "cell_type": "markdown", |
| 668 | + "id": "45", |
| 669 | + "metadata": {}, |
| 670 | + "source": [ |
| 671 | + "We will cast the problem of character estimation into the framework of neural networks.\n", |
| 672 | + "The problem remains the same, the approach changes, and the outcome should look similar.\n", |
| 673 | + "\n", |
| 674 | + "Our neural network **receives a single character** and **outputs the probability distribution over the next possible characters** ($27$ in this case).\n", |
| 675 | + "\n", |
| 676 | + "It's going to make guesses on the most likely character to follow.\n", |
| 677 | + "We can still measure the performance through the *same* loss function, the negative log-likelihood.\n", |
| 678 | + "\n", |
| 679 | + "From the training data, we also know the character that actually comes next in each training example.\n", |
| 680 | + "We'll use this information to fine-tune (i.e., train or update the parameters of) the neural network to make better guesses: this is a textbook example of **supervised learning**." |
| 681 | + ] |
| 682 | + }, |
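The network described above can be sketched as a single linear layer over a one-hot input, followed by a softmax. This is a minimal illustration, not the notebook's trained model: the weight matrix `W` here is randomly initialized, so its predictions are meaningless until trained.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # one linear layer, no bias: 27 inputs -> 27 outputs

# Input: a single character index, one-hot encoded (0 = '.' start token)
ix = torch.tensor([0])
xenc = F.one_hot(ix, num_classes=27).float()

# Forward pass: logits -> softmax yields a probability distribution over next characters
logits = xenc @ W
probs = F.softmax(logits, dim=1)

print(probs.shape)  # (1, 27): one distribution over the 27 possible next characters
```

Because the input is one-hot, `xenc @ W` simply selects row `ix` of `W`, which mirrors how the count-based model indexed into `P`.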
| 683 | + { |
| 684 | + "cell_type": "markdown", |
| 685 | + "id": "46", |
| 686 | + "metadata": {}, |
| 687 | + "source": [ |
| 688 | + "### The training set" |
| 689 | + ] |
| 690 | + }, |
| 691 | + { |
| 692 | + "cell_type": "code", |
| 693 | + "execution_count": null, |
| 694 | + "id": "47", |
| 695 | + "metadata": {}, |
| 696 | + "outputs": [], |
| 697 | + "source": [ |
| 698 | + "# Create training set of all bigrams\n", |
| 699 | + "xs, ys = [], [] # Input and output character indices\n", |
| 700 | + "\n", |
| 701 | + "for w in words:\n", |
| 702 | + " chs = ['.'] + list(w) + ['.']\n", |
| 703 | + " for ch1, ch2 in zip(chs, chs[1:]):\n", |
| 704 | + " ix1 = stoi[ch1]\n", |
| 705 | + " ix2 = stoi[ch2]\n", |
| 706 | + " xs.append(ix1)\n", |
| 707 | + " ys.append(ix2)\n", |
| 708 | + "\n", |
| 709 | + "# Convert lists to tensors\n", |
| 710 | + "xs = torch.tensor(xs)\n", |
| 711 | + "ys = torch.tensor(ys)" |
| 712 | + ] |
| 713 | + }, |
| 714 | + { |
| 715 | + "cell_type": "code", |
| 716 | + "execution_count": null, |
| 717 | + "id": "48", |
| 718 | + "metadata": {}, |
| 719 | + "outputs": [], |
| 720 | + "source": [ |
| 721 | + "for i in range(5):\n", |
| 722 | + " print(f'For character #{i} \"{itos[xs[i].item()]}\" in xs, we expect the model to predict \"{itos[ys[i].item()]}\"')" |
| 723 | + ] |
483 | 724 | } |
484 | 725 | ], |
485 | 726 | "metadata": { |
486 | 727 | "kernelspec": { |
487 | 728 | "display_name": "Python 3 (ipykernel)", |
488 | 729 | "language": "python", |
489 | 730 | "name": "python3" |
490 | | - }, |
491 | | - "language_info": { |
492 | | - "codemirror_mode": { |
493 | | - "name": "ipython", |
494 | | - "version": 3 |
495 | | - }, |
496 | | - "file_extension": ".py", |
497 | | - "mimetype": "text/x-python", |
498 | | - "name": "python", |
499 | | - "nbconvert_exporter": "python", |
500 | | - "pygments_lexer": "ipython3", |
501 | | - "version": "3.12.10" |
502 | 731 | } |
503 | 732 | }, |
504 | 733 | "nbformat": 4, |
|