Parry-97/llm-from-scratch

🤖 Building an LLM from Scratch

A step-by-step implementation of a GPT-like Large Language Model following Sebastian Raschka's "Build a Large Language Model (From Scratch)"

📖 About This Project

This repository documents my journey through "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I'm implementing each concept from the book in PyTorch, building a GPT-like language model from the ground up to truly understand how modern LLMs work.

📚 Book Progress: Chapter 6 of 7

Currently implementing: "Fine-tuning for classification"

✅ Completed Chapters

  • Chapter 1: Understanding large language models
  • Chapter 2: Working with text data
  • Chapter 3: Coding attention mechanisms
  • Chapter 4: Implementing a GPT model from scratch to generate text
  • Chapter 5: Pretraining on unlabeled data

🔜 Upcoming Chapters

  • Chapter 7: Fine-tuning to follow instructions

🎯 Learning Objectives

By following along with the book and this implementation, I'm learning:

  • Fundamentals of LLMs: How transformers revolutionized NLP and the architecture behind GPT models
  • Text Processing: Tokenization strategies, vocabulary building, and data preparation for neural networks
  • Attention Mechanisms: The mathematics and intuition behind self-attention and multi-head attention
  • Model Architecture: How to build a complete GPT model with embeddings, transformer blocks, and generation capabilities
  • Training Strategies: Pretraining objectives, loss functions, and optimization techniques
  • Fine-tuning: Adapting pretrained models for specific tasks (in progress)

🏗️ Current Implementation Status

✨ Implemented Components

📝 Text Data Processing (Chapter 2)

  • SimpleTokenizerV1 (src/llm_from_scratch/tokenizer/simple_tokenizer.py): Custom regex-based tokenizer
  • Text splitting and preprocessing utilities
  • Vocabulary management and encoding/decoding
  • Dataset preparation for training
  • Text download utilities for fetching training data
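
The regex-based approach listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's `SimpleTokenizerV1`: the split pattern and the helper names (`split_text`, `build_vocab`, `encode`, `decode`) are assumptions chosen for the example.

```python
import re

# Assumed pattern: split on punctuation, double dashes, and whitespace,
# keeping each delimiter as its own piece.
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'


def split_text(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(SPLIT_PATTERN, text)
    return [p.strip() for p in pieces if p.strip()]


def build_vocab(tokens):
    """Map each unique token to an integer ID, in sorted order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


def encode(text, vocab):
    """Convert text to token IDs; raises KeyError for unknown tokens."""
    return [vocab[token] for token in split_text(text)]


def decode(ids, vocab):
    """Convert token IDs back to text."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    text = " ".join(id_to_token[i] for i in ids)
    # Remove the space that the join placed in front of punctuation.
    return re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)


tokens = split_text("Hello, world! Hello again.")
vocab = build_vocab(tokens)
ids = encode("Hello, world!", vocab)
print(tokens)
print(ids, "->", decode(ids, vocab))
```

Because the vocabulary is built only from the training text, any unseen word raises a `KeyError` in this sketch; handling that case (for example with an unknown-word token) is left out here.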

🎯 Attention Mechanisms (Chapter 3)

🤖 GPT Model Architecture (Chapter 4 - Completed)

🔤 Text Generation (Chapter 4 - Implemented)

📦 Pretraining on Unlabeled Data (Chapter 5 - Completed)

  • Pretraining Utils (src/llm_from_scratch/pretraining/utils.py): Helper functions for training
  • Objective: next-token prediction on unlabeled corpora (language modeling)
  • Data pipeline: tokenize with tiktoken (cl100k_base), create sequences of length context_length with next-token targets
  • Batching: (batch_size, context_length) input IDs with shifted targets
  • Loss: CrossEntropyLoss over vocabulary logits on shifted targets
  • Optimizer: AdamW; regularization via dropout; gradient clipping
  • Training loop: learning-rate warmup, cosine decay (planned), checkpointing and evaluation via perplexity (planned)
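
The data pipeline and loss described above can be illustrated without PyTorch so the arithmetic stays visible. This is a sketch under assumptions: `make_pairs` and `cross_entropy` are hypothetical helper names, the token IDs are made up, and the repository's own code uses PyTorch tensors and `CrossEntropyLoss` instead.

```python
import math


def make_pairs(token_ids, context_length, stride):
    """Slide a window over the token stream; targets are inputs shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


def cross_entropy(logits, target):
    """Negative log-probability of the target ID under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


token_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
pairs = make_pairs(token_ids, context_length=4, stride=4)
print(pairs)

# A model that knows nothing assigns equal logits to every vocabulary entry,
# so its loss is log(vocab_size) and its perplexity equals vocab_size.
vocab_size = 8
uniform_loss = cross_entropy([0.0] * vocab_size, target=3)
print(uniform_loss, math.exp(uniform_loss))
```

Perplexity is the exponential of the mean cross-entropy, which is why the uniform case gives a useful sanity check: an untrained model should start near a perplexity equal to the vocabulary size.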

💬 Fine-tuning for Classification (Chapter 6 - Current Focus)

📁 Project Structure

llm-from-scratch/
├── src/
│   └── llm_from_scratch/
│       ├── __init__.py
│       ├── attention/                          # Chapter 3: Attention implementations
│       │   ├── __init__.py
│       │   ├── simple_attention.py             # Simplified attention for learning
│       │   ├── self_attention.py               # Self-attention basics
│       │   ├── causal_attention.py             # Masked attention for autoregression
│       │   ├── simple_causal_attention.py      # Simple causal attention variant
│       │   ├── trainable_attention.py          # Attention with learnable parameters
│       │   ├── multi_head_attention.py         # Multi-head attention mechanism
│       │   ├── multi_head_attention_wrapper.py # MHA wrapper utilities
│       │   └── batched_multiplication.py       # Batched tensor operations
│       ├── gpt_architecture/                   # Chapter 4: GPT model components
│       │   ├── __init__.py
│       │   ├── dummy_gpt_model.py              # Main GPT model class
│       │   ├── transformer.py                  # Transformer block
│       │   ├── feed_forward.py                 # FFN layer
│       │   ├── layer_normalization.py          # LayerNorm implementation
│       │   ├── gelu.py                         # GELU activation
│       │   └── text_generation.py              # Greedy decoding utilities
│       ├── tokenizer/                          # Chapter 2: Text processing
│       │   ├── __init__.py
│       │   ├── simple_tokenizer.py             # Tokenizer implementation
│       │   ├── gpt_dataset.py                  # Dataset utilities
│       │   ├── sampling.py                     # Generation sampling methods
│       │   └── text_download.py                # Text data downloading
│       └── pretraining/                        # Chapter 5: Pretraining components
│           ├── __init__.py
│           └── utils.py                        # Training utilities
├── tests/                                      # Test files and scripts
│   ├── test_text_generation.py                 # Text generation example
│   ├── test_embeddings.py                      # Embeddings testing
│   ├── test_transformer_import.py              # Import verification
│   ├── dummy_gpt_use.py                        # GPT model usage example
│   ├── loss_calculation.py                     # Loss computation tests
│   ├── text_splitting.py                       # Text processing tests
│   └── the-verdict.txt                         # Sample text data
├── docs/                                       # Documentation and notes
│   ├── ffn_importance.md
│   ├── gpt_output.md
│   ├── input_output_dimensions.md
│   ├── llm-optimization-insights.md
│   ├── positional_embedding.md
│   ├── python_project_best_practices.md
│   ├── pytorch_batched_matmul_guide.md
│   ├── self_attention_explained.md
│   ├── self_attention_weights.md
│   └── trainable_weight_matrices.md
├── main.py                                     # Main entry point
├── pyproject.toml                              # Project configuration
├── uv.lock                                     # Dependency lock file
└── README.md                                   # This file

🚀 Installation

Prerequisites

  • Python 3.11+
  • Git
  • (Optional) CUDA-capable GPU for faster computation

Setup Instructions

This project uses uv for fast, reliable dependency management.

  1. Clone the repository:
git clone <your-repo-url>
cd llm-from-scratch
  2. Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create and activate a virtual environment:
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  4. Install dependencies:
uv sync

PyTorch GPU Support

The project uses PyTorch 2.4.0 (CPU version by default). For GPU support:

  1. Visit PyTorch Get Started
  2. Select your configuration and install:
uv pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
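
After installing a CUDA build, a quick check like the one below can confirm whether PyTorch actually sees a GPU. The `pick_device` helper is a hypothetical name for this example, not part of the repository; it falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
import importlib.util


def pick_device():
    """Return "cuda" when an installed PyTorch build can see a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Models and tensors can then be moved with `.to(device)` so the same script runs on either kind of machine.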

💻 Usage Examples

Basic Tokenization (Chapter 2)

from llm_from_scratch.tokenizer.simple_tokenizer import SimpleTokenizerV1

# Create tokenizer with vocabulary
vocab = {
    "Hello": 0, ",": 1, " ": 2, "world": 3, "!": 4,
    "LLM": 5, "from": 6, "scratch": 7
}
tokenizer = SimpleTokenizerV1(vocab)

# Encode and decode text
text = "Hello, world!"
token_ids = tokenizer.encode(text)
print(f"Tokens: {token_ids}")
print(f"Decoded: {tokenizer.decode(token_ids)}")

Attention Mechanism (Chapter 3)

import torch
from llm_from_scratch.attention.multi_head_attention import MultiHeadAttention

# Setup multi-head attention
batch_size, seq_len, d_model = 2, 10, 768
mha = MultiHeadAttention(
    d_in=d_model,
    d_out=d_model,
    context_length=seq_len,
    dropout=0.1,
    num_heads=12
)

# Process input
x = torch.randn(batch_size, seq_len, d_model)
output = mha(x)
print(f"Output shape: {output.shape}")  # [2, 10, 768]

GPT Model Forward Pass (Chapter 4)

import torch
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel

# Model configuration
config = {
    "vocab_size": 5000,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

# Initialize model
model = DummyGPTModel(config)

# Forward pass
input_ids = torch.randint(0, config["vocab_size"], (2, 10))
with torch.no_grad():
    logits = model(input_ids)
print(f"Logits shape: {logits.shape}")  # [2, 10, 5000]

Text Generation Quickstart (Chapter 4)

Example using the greedy generation loop:

import torch
from tiktoken import get_encoding
from llm_from_scratch.gpt_architecture.dummy_gpt_model import DummyGPTModel
from llm_from_scratch.gpt_architecture.text_generation import generate_text

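# Tokenizer and model configuration
tokenizer = get_encoding("cl100k_base")
config = {
    # Size the embedding table from the tokenizer: cl100k_base has about
    # 100k tokens, so GPT-2's 50257 would leave some token IDs out of range.
    "vocab_size": tokenizer.n_vocab,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Weights are randomly initialized here, so the generated text is not meaningful.
model = DummyGPTModel(config).eval()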

start = "Hello, I am"
encoded = tokenizer.encode(start)
idx = torch.tensor(encoded).unsqueeze(0)

out = generate_text(
    model=model,
    idx=idx,
    max_new_tokens=6,
    context_size=config["context_length"],
)
print(tokenizer.decode(out.squeeze(0).tolist()))

Run the example script directly:

uv run python tests/test_text_generation.py
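
The idea behind a greedy generation loop can be sketched in plain Python. This is not the repository's `generate_text`; `generate_greedy` and `toy_model` are hypothetical stand-ins that show the loop structure (crop the context, score the next token, append the argmax).

```python
def generate_greedy(next_token_logits, ids, max_new_tokens, context_size):
    """Append the highest-scoring token max_new_tokens times."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        window = ids[-context_size:]        # crop to the supported context
        logits = next_token_logits(window)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
    return ids


def toy_model(window, vocab_size=5):
    """Stand-in for a real model: always favors (last ID + 1) % vocab_size."""
    favored = (window[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


print(generate_greedy(toy_model, [0], max_new_tokens=6, context_size=3))
# prints [0, 1, 2, 3, 4, 0, 1]
```

Greedy decoding is deterministic: the same prompt always produces the same continuation, which makes it convenient for tests but repetitive for open-ended text.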

🔬 Technical Implementation Details

Current Architecture (Chapter 4)

Input Text
    ↓
[Tokenization]
    ↓
Token IDs → Token Embeddings + Positional Embeddings
    ↓
[Transformer Block] × N_LAYERS
    ├── LayerNorm
    ├── Multi-Head Attention (with causal mask)
    ├── Residual Add
    ├── LayerNorm
    ├── Feed-Forward Network
    └── Residual Add
    ↓
[Final Layer Norm]
    ↓
[Output Projection] → Logits
    ↓
[Sampling/Generation] → Generated Text
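
One way to read a single transformer block, assuming the pre-norm ordering this README lists among its design decisions (LayerNorm applied before each sub-layer, followed by a residual add), is the sketch below. It works on one plain Python vector, omits the learned scale and shift of a real LayerNorm, and uses a hypothetical `zero_sublayer` in place of attention and the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, attention, feed_forward):
    """Apply norm -> sub-layer -> residual add, once for each sub-layer."""
    x = [a + b for a, b in zip(x, attention(layer_norm(x)))]
    x = [a + b for a, b in zip(x, feed_forward(layer_norm(x)))]
    return x


def zero_sublayer(v):
    """Stand-in sub-layer that contributes nothing, leaving only the residual path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0, 4.0]
print(pre_norm_block(x, zero_sublayer, zero_sublayer))  # input passes through unchanged
print(layer_norm(x))
```

With sub-layers that output zeros, the block returns its input unchanged, which shows the role of the residual connections: each sub-layer only adds a correction on top of the running representation.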

Key Design Decisions

  • Tokenizer: Simple regex-based splitting in the Chapter 2 tokenizer; the generation and pretraining examples use tiktoken's BPE encodings
  • Attention: Scaled dot-product with causal masking for autoregression
  • Positional Encoding: Learned embeddings (not sinusoidal)
  • Activation: GELU in feed-forward networks
  • Normalization: Pre-norm architecture (LayerNorm before sub-layers)
  • Model Size: Configurable, default similar to GPT-2 small (768 dim, 12 heads, 12 layers)
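
To make the attention entry above concrete, here is a small plain-Python sketch of scaled dot-product attention weights with a causal mask. The function name `causal_attention_weights` and the input numbers are made up for illustration; the repository's classes use PyTorch tensors and learned query/key/value projections.

```python
import math


def causal_attention_weights(queries, keys):
    """Row i is softmax(q_i . k_j / sqrt(d)) over positions j <= i; later positions get 0."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Only positions 0..i are visible to position i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights


# Three positions with 2-dimensional queries and keys.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(x, x):
    print([round(w, 3) for w in row])
```

Skipping future positions before the softmax gives the same weights as filling their scores with negative infinity, which is the usual tensor formulation: each row still sums to 1 and nothing above the diagonal receives weight.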

🛠️ Technologies Used

  • Python 3.11+: Core language
  • PyTorch 2.4.0: Deep learning framework
  • NumPy 2.3.2+: Numerical operations
  • tiktoken 0.11.0+: OpenAI's BPE tokenizer (for comparison)
  • uv: Fast Python package management

Development Tools

  • pytest: Testing framework
  • IPython: Interactive development
  • matplotlib: Visualizations

📚 References & Resources

Primary Reference

"Build a Large Language Model (From Scratch)" by Sebastian Raschka

Additional Resources

🚧 Roadmap

Immediate Next Steps (Chapter 6: Fine-tuning)

  • Implement classification head
  • Implement fine-tuning loop for classification
  • Build data loading pipeline for spam classification
  • Implement training metrics and logging for classification
  • Add checkpointing and resumability for fine-tuning
  • Provide a fine-tuning entry point (e.g., train_finetuning.py) and docs

Backlog

  • Temperature-based sampling for generation
  • Top-k and top-p (nucleus) sampling
  • Interactive text generation demo
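
As a rough picture of what the first two backlog items involve, the sketch below combines temperature scaling with top-k filtering in plain Python. It is not code from this repository: `sample_next_token` is a hypothetical helper, the logits are made up, and top-p (nucleus) sampling is not covered.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from softmax(logits / temperature), optionally among the top_k logits."""
    candidates = list(range(len(logits)))
    if top_k is not None:
        # Keep only the top_k highest-scoring token IDs.
        candidates = sorted(candidates, key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([sample_next_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(10)])
```

Temperatures below 1 sharpen the distribution toward the highest logit, temperatures above 1 flatten it, and `top_k=1` reduces to greedy decoding.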

Upcoming Chapters

  • Chapter 7: Instruction following capabilities
  • Chapter 7: RLHF concepts

Future Enhancements

  • Add comprehensive test coverage
  • Create Jupyter notebooks for each chapter
  • Build web interface with Gradio
  • Add model checkpointing
  • Performance profiling and optimization
  • Docker containerization

🤝 Contributing

This is a personal learning project following the book's progression. However, I welcome:

  • Bug reports and fixes
  • Clarifications and documentation improvements
  • Discussions about the concepts
  • Suggestions for better implementations

📄 License

This project is for educational/starter purposes. No explicit license.

🙏 Acknowledgments

  • Sebastian Raschka for writing this excellent book and making LLMs accessible
  • The PyTorch team for the amazing framework
  • The open-source community for inspiration and resources

"The best way to understand something is to build it from scratch"
🧠 Currently learning at Chapter 6/7 of the book 📚
