How to use the GPT model - inputs, outputs, and basic usage.
This document explains how to use the GPT model: what goes in, what comes out, and how to call it. For how the model works internally, see the links at the end.
Main file: src/model/gpt.py
from src.model.gpt import GPTModel
from src.config import ModelConfig
# Create configuration
config = ModelConfig(
    vocab_size=50257,           # Vocabulary size (GPT-2 tokenizer)
    context_length=128,         # Maximum sequence length
    embedding_dimension=256,    # Embedding dimension
    number_of_heads=4,          # Attention heads
    number_of_layers=4,         # Transformer layers
    dropout_rate=0.1,           # Dropout probability
    use_attention_bias=False    # No bias in attention projections (GPT-2 style)
)
# Create model
model = GPTModel(config)

Location: src/config.py (ModelConfig class), src/model/gpt.py (GPTModel class)
File: src/config.py, lines 9-36
| Parameter | Type | Description | Typical Values |
|---|---|---|---|
| vocab_size | int | Number of tokens in vocabulary | 50257 (GPT-2) |
| context_length | int | Maximum sequence length | 128, 256, 512, 1024 |
| embedding_dimension | int | Size of token embeddings | 128, 256, 512, 768 |
| number_of_heads | int | Attention heads (must divide embedding_dimension) | 2, 4, 8, 12 |
| number_of_layers | int | Number of transformer blocks | 2, 4, 6, 12 |
| dropout_rate | float | Dropout probability | 0.0-0.2 |
| use_attention_bias | bool | Use bias in attention layer projections | False (GPT-2 style) |
Constraint: embedding_dimension must be divisible by number_of_heads.
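The divisibility constraint exists because the embedding is split evenly across the attention heads. The helper below is a minimal sketch of that check; `head_dimension` is a hypothetical name used for illustration and is not taken from src/config.py, whose actual validation may differ.

```python
def head_dimension(embedding_dimension: int, number_of_heads: int) -> int:
    """Return the per-head size, or raise if the split is uneven.

    Hypothetical helper for illustration; not part of src/config.py.
    """
    if embedding_dimension % number_of_heads != 0:
        raise ValueError(
            f"embedding_dimension ({embedding_dimension}) must be divisible "
            f"by number_of_heads ({number_of_heads})"
        )
    return embedding_dimension // number_of_heads


# With the example config above: 256 dimensions split over 4 heads
per_head = head_dimension(256, 4)  # 64
```

A combination such as `embedding_dimension=256, number_of_heads=3` would raise here, because 256 does not split into three equal parts.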
Location: src/model/gpt.py, GPTModel.forward() method, lines 73-91
# Input
input_ids = torch.tensor([[464, 2361, 373]]) # Shape: [batch_size, sequence_length]
# Example: [1, 3] = batch of 1, sequence of 3 tokens
# Forward pass
logits = model(input_ids)
# Output
# Shape: [batch_size, sequence_length, vocab_size]
# Example: [1, 3, 50257] = batch of 1, 3 positions, 50257 vocabulary scores

What goes in:
- input_ids: Tensor of token IDs
  - Shape: [batch_size, sequence_length]
  - Values: Integers from 0 to vocab_size - 1
  - Example: [[15496, 995]] might represent "Hello world"
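The input requirements above can be checked before calling the model. The function below is a sketch only: `validate_input_ids` is a hypothetical helper, not something provided by src/model/gpt.py, and it assumes token IDs are stored as 64-bit integers (the default when building a tensor from Python ints).

```python
import torch


def validate_input_ids(input_ids: torch.Tensor, vocab_size: int) -> None:
    """Raise ValueError if input_ids is not a valid batch of token IDs.

    Hypothetical helper for illustration; not part of the repository.
    """
    if input_ids.dim() != 2:
        raise ValueError(
            f"expected shape [batch_size, sequence_length], got {tuple(input_ids.shape)}"
        )
    if input_ids.dtype != torch.long:
        raise ValueError(f"expected integer token IDs (torch.long), got {input_ids.dtype}")
    if input_ids.numel() > 0 and (input_ids.min() < 0 or input_ids.max() >= vocab_size):
        raise ValueError(f"token IDs must be in the range 0 to {vocab_size - 1}")


# Passes silently: 2D, integer, and within range for the GPT-2 vocabulary
validate_input_ids(torch.tensor([[15496, 995]]), vocab_size=50257)
```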
What comes out:
- logits: Raw scores for next token prediction
  - Shape: [batch_size, sequence_length, vocab_size]
  - Values: Float scores (not probabilities)
  - Each position has scores for all vocabulary tokens
Think of logits as the model saying: “Here is how much I like every possible next token.”
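To make the difference between scores and probabilities concrete, here is a plain-Python sketch of softmax on a made-up 4-token vocabulary. The logit values are invented for illustration, and the `softmax` function is written from scratch rather than taken from the repository.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a toy 4-token vocabulary (illustrative values only)
toy_logits = [2.0, 0.5, -1.0, 0.5]
toy_probs = softmax(toy_logits)

# The token with the highest logit also has the highest probability
best_token = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
```

Logits can be negative and do not sum to anything in particular; after softmax every value lies between 0 and 1 and the values sum to 1, while the ranking of tokens stays the same.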
Example:
input_ids = torch.tensor([[464, 2361, 373]]) # "The cat sat"
logits = model(input_ids) # Shape: [1, 3, 50257]
# logits[0, 0, :] = scores for next token after position 0 ("The")
# logits[0, 1, :] = scores for next token after position 1 ("The cat")
# logits[0, 2, :] = scores for next token after position 2 ("The cat sat")

The model outputs logits (raw scores), not probabilities. To get probabilities:
import torch.nn.functional as F
logits = model(input_ids) # [batch, seq_len, vocab_size]
# Get probabilities for last position
last_logits = logits[:, -1, :] # [batch, vocab_size]
probs = F.softmax(last_logits, dim=-1) # [batch, vocab_size]
# Now probs[0, 1234] = probability of token 1234 being next

model.train() # Enable dropout (LayerNorm behaves the same in both modes)
When to use:
- During training
- When you want dropout active

model.eval() # Disable dropout
When to use:
- During inference/generation
- When evaluating on validation set
- When you want deterministic outputs
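The effect of the two modes can be seen on a single dropout layer in isolation. This is a sketch using a standalone `nn.Dropout`, not the GPT model itself; the probability of 0.5 is chosen only to make the effect obvious.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.0
layer.train()
train_out = layer(x)

# Evaluation mode: dropout is a no-op, so the output equals the input
layer.eval()
eval_out = layer(x)
```

Calling `model.train()` or `model.eval()` on the full model switches every dropout layer inside it in the same way.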
The model and input must be on the same device:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move model to device
model = model.to(device)
# Input must be on same device
input_ids = input_ids.to(device)
logits = model(input_ids)

from src.model.gpt import GPTModel
from src.config import ModelConfig
import torch
# Create config
config = ModelConfig(
    vocab_size=50257,
    context_length=128,
    embedding_dimension=256,
    number_of_heads=4,
    number_of_layers=4,
    dropout_rate=0.1
)
# Create model
model = GPTModel(config)
# Use model
input_ids = torch.randint(0, 50257, (1, 10)) # Random tokens for testing
model.eval()
with torch.no_grad():
    logits = model(input_ids) # [1, 10, 50257]

import torch
from src.model.gpt import GPTModel
from src.config import ModelConfig
# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
# Recreate config
config = ModelConfig(**checkpoint['config'])
# Create and load model
model = GPTModel(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: {total_params * 4 / 1024 / 1024:.2f} MB (FP32)")

# Access config
config = model.config
# Convert to dict
config_dict = config.to_dict()
# Access individual values
embedding_dim = config.embedding_dimension
context_len = config.context_length

The model consists of:
- Embeddings (src/model/gpt.py, lines 59-62)
  - Token embeddings: Convert token IDs to vectors
  - Position embeddings: Encode token positions
- Transformer Blocks (src/model/gpt.py, lines 65-67)
  - Multiple layers that process the sequence
  - Each block contains attention and feed-forward networks
- Output Layer (src/model/gpt.py, lines 70-71)
  - Final normalization
  - Linear projection to vocabulary size
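The three stages above can be traced as tensor shapes. The following is a toy stand-in built from standard PyTorch layers, not the code in src/model/gpt.py: the variable names and the small sizes are assumptions for illustration, and the transformer blocks are left out because they do not change the shape.

```python
import torch
import torch.nn as nn

# Toy sizes (assumed for illustration; the real values come from ModelConfig)
vocab_size, context_length, dim = 100, 16, 32

token_embedding = nn.Embedding(vocab_size, dim)         # token ID -> vector
position_embedding = nn.Embedding(context_length, dim)  # position -> vector
final_norm = nn.LayerNorm(dim)
output_projection = nn.Linear(dim, vocab_size, bias=False)

input_ids = torch.randint(0, vocab_size, (2, 5))  # [batch=2, seq=5]
positions = torch.arange(input_ids.shape[1])      # [0, 1, 2, 3, 4]

# 1. Embeddings: position vectors are broadcast across the batch
x = token_embedding(input_ids) + position_embedding(positions)  # [2, 5, 32]

# 2. Transformer blocks would run here; each keeps the shape [2, 5, 32]

# 3. Output layer: normalize, then project to one score per vocabulary token
logits = output_projection(final_norm(x))  # [2, 5, 100]
```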
File structure:
- src/model/gpt.py - Main model class
- src/model/attention.py - Attention mechanism
- src/model/blocks.py - LayerNorm, FeedForward, GELU
- src/config.py - Configuration class
The model processes at most context_length tokens per sequence:
config = ModelConfig(context_length=128)
model = GPTModel(config)
# This works
input_ids = torch.randint(0, 50257, (1, 128)) # Exactly context_length
logits = model(input_ids)
# This runs, but only the first 128 tokens are processed
input_ids = torch.randint(0, 50257, (1, 200)) # Longer than context_length
logits = model(input_ids) # Only processes first 128 tokens

Token IDs must be in valid range:
# Valid: 0 to vocab_size - 1
input_ids = torch.tensor([[0, 100, 50256]]) # OK for vocab_size=50257
# Invalid: out of range
input_ids = torch.tensor([[50257]]) # Error! Must be < vocab_size

The model processes batches:
# Single sequence
input_ids = torch.tensor([[464, 2361, 373]]) # [1, 3]
# Batch of sequences
input_ids = torch.tensor([
    [464, 2361, 373],    # Sequence 1
    [15496, 995, 0],     # Sequence 2
    [1234, 5678, 9012]   # Sequence 3
]) # [3, 3] = batch of 3, each length 3
logits = model(input_ids) # [3, 3, 50257]

Note: A batch tensor is rectangular, so sequences of different lengths must be padded to a common length first. The model processes padding tokens as-is, like any other token.
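One way to build such a padded batch is PyTorch's `pad_sequence`. This is a sketch under assumptions: `PAD_ID = 0` is a hypothetical choice of padding token (the repository may use a different one), and `last_index` is an illustrative name for the position of each sequence's last real token.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # Hypothetical padding token ID; an assumption, not from the repository

sequences = [
    torch.tensor([464, 2361, 373]),  # length 3
    torch.tensor([15496, 995]),      # length 2, needs one padding token
]

# Right-pad every sequence to the longest length in the batch
input_ids = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)  # [2, 3]

# Index of the last real (non-padding) token in each sequence
lengths = torch.tensor([len(s) for s in sequences])
last_index = lengths - 1
```

Because the model does not treat padding specially, the logits at padded positions are not meaningful. For next-token prediction on a padded batch, read the logits at `last_index` for each row, for example `logits[torch.arange(len(sequences)), last_index]`, rather than at position -1.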
Logits are raw scores before softmax:
logits = model(input_ids) # [1, 5, 50257]
# For position 2 (after the first 3 tokens):
position_2_logits = logits[0, 2, :] # [50257] scores
# Higher score = more likely (but not a probability yet)
top_token = torch.argmax(position_2_logits) # Most likely token ID
top_score = position_2_logits[top_token] # Its score

For next token prediction, use the last position:
logits = model(input_ids) # [1, seq_len, vocab_size]
# Get logits for next token (after entire sequence)
next_token_logits = logits[:, -1, :] # [1, vocab_size]
# Convert to probabilities
probs = F.softmax(next_token_logits, dim=-1)
# Sample or get most likely
most_likely = torch.argmax(probs, dim=-1) # Greedy
sampled = torch.multinomial(probs, num_samples=1) # Random sample

Error: RuntimeError: shape mismatch
Cause: Input is not a 2D tensor of shape [batch_size, sequence_length]
Solution:
# Wrong: 1D tensor
input_ids = torch.tensor([464, 2361, 373]) # Shape: [3]
# Correct: 2D tensor with batch dimension
input_ids = torch.tensor([[464, 2361, 373]]) # Shape: [1, 3]

Error: RuntimeError: Expected all tensors to be on the same device
Cause: Model and input are on different devices
Solution:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
input_ids = input_ids.to(device)
logits = model(input_ids)

Error: Out of memory
Cause: Sequence too long or batch too large
Solution:
- Reduce context_length in config
- Use shorter sequences
- Reduce batch size
- Use smaller model (fewer layers, smaller embedding_dim)
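A rough calculation shows why these changes help. The sketch below estimates the size of one layer's attention score tensor, assuming a standard attention implementation that materializes the full [batch, heads, seq, seq] matrix in FP32; whether src/model/attention.py does so is an assumption, and `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(batch_size, number_of_heads, sequence_length, bytes_per_value=4):
    """Bytes needed for one layer's [batch, heads, seq, seq] attention scores.

    Hypothetical helper; assumes the full score matrix is stored in FP32.
    """
    return batch_size * number_of_heads * sequence_length * sequence_length * bytes_per_value


full = attention_scores_bytes(batch_size=8, number_of_heads=4, sequence_length=128)
half_seq = attention_scores_bytes(batch_size=8, number_of_heads=4, sequence_length=64)
half_batch = attention_scores_bytes(batch_size=4, number_of_heads=4, sequence_length=128)
```

Under this assumption, halving the sequence length cuts this tensor to a quarter of its size, while halving the batch size cuts it in half, so shortening sequences is usually the more effective first step.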
config = ModelConfig(vocab_size=50257, context_length=128, ...)
model = GPTModel(config)

logits = model(input_ids) # [batch, seq_len, vocab_size]

model.eval()
with torch.no_grad():
    logits = model(input_ids)

model = model.to(device)
input_ids = input_ids.to(device)

If you want to understand how the model works internally, here are detailed explanations:
- File: src/model/gpt.py
- Key classes: GPTModel, TransformerBlock
- What to learn: How embeddings, transformer blocks, and output layer work together
- See: The model implementation code and comments

- File: src/model/attention.py
- Key class: MultiHeadAttention
- What to learn: How queries, keys, values work, causal masking, multi-head attention
- See: Attention implementation with Q/K/V projections and attention computation

- File: src/model/blocks.py
- Key classes: LayerNorm, FeedForward, GELU
- What to learn: Layer normalization, feed-forward networks, activation functions
- See: Individual component implementations

- File: src/config.py
- Key class: ModelConfig
- What to learn: How configuration is structured and validated
- See: Configuration dataclass and validation logic

- File: src/training/trainer.py
- What to learn: How the model is used during training
- See: Training Implementation

- File: src/generation/generate.py
- What to learn: How logits are converted to generated text
- See: Using the Model

- File: examples/analyze_attention.py
- What to learn: How to visualize what the model is "looking at"
- See: Understanding Attention Analysis

- Training: See Training Implementation to learn how to train the model
- Usage: See Using the Model to learn how to generate text
- Challenges: See Pitfalls and Challenges for common issues
- Quick Reference: See Quick Reference for code snippets