09 – How LLMs Work Internally

Why read this? You don't need to understand transformer internals to use LLMs, but knowing the basics makes error messages, parameters, and prompt techniques much easier to reason about.

🧠 The Big Picture

An LLM is fundamentally a next-token prediction machine:

Input:  "The capital of France is"
Output: probability distribution over all possible next tokens

  "Paris"   → 97.3%
  "located" → 1.2%
  "a"       → 0.5%
  ...

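The model picks "Paris" (the most likely token, or a sample from the distribution) → the input becomes "The capital of France is Paris"
This repeats until an end-of-sequence token ([END]) is generated or a length limit is reached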

Everything else — answering questions, writing code, explaining concepts — emerges from doing this extremely well, at extreme scale.
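
The predict, append, repeat loop above can be sketched in a few lines. This is a toy illustration rather than a real model: `toy_next_token_probs` is a hypothetical stand-in that returns hard-coded probabilities, and greedy decoding (always taking the top token) is assumed for simplicity.

```python
# Toy autoregressive loop: predict, append, repeat.
# toy_next_token_probs is a hypothetical stand-in for a real model.

def toy_next_token_probs(tokens):
    """Return a made-up probability distribution over next tokens."""
    if tokens[-1] == "is":
        return {"Paris": 0.973, "located": 0.012, "a": 0.005, "[END]": 0.010}
    # After anything else, this toy model just ends the sequence.
    return {"[END]": 1.0}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        # Greedy decoding: take the single most probable token.
        next_token = max(probs, key=probs.get)
        if next_token == "[END]":
            break
        tokens.append(next_token)
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris']
```

A real model replaces the lookup with a forward pass over the whole sequence, and most applications sample from the distribution instead of always taking the maximum.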

🔤 Step 1: Tokenization

Before any processing, text is split into tokens. A token is not a word — it's a subword unit:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Transformers are the backbone of modern AI"
tokens = tokenizer.tokenize(text)
# ['Transform', 'ers', 'Ġare', 'Ġthe', 'Ġbackbone', 'Ġof', 'Ġmodern', 'ĠAI']
# 'Ġ' marks a token that starts with a space
# 8 tokens for 7 words

ids = tokenizer.encode(text)
# [8291, 364, 389, 262, 27169, 286, 3660, 9552]
# Each token → an integer (its ID in the vocabulary)

Why subword tokenization?

"unhappiness" → ["un", "happi", "ness"] — handles unseen words by composing known pieces
Rare words don't each need their own vocabulary entry
A fixed vocabulary (about 50,000 entries for GPT-2) can still encode essentially any text
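
The idea of composing unseen words from known pieces can be sketched with a toy splitter. This is a simplified greedy longest-match scheme (closer to WordPiece than to the byte-level BPE merges GPT-2 actually uses), and `TOY_VOCAB` is a hypothetical vocabulary made up for illustration.

```python
# Toy greedy longest-match subword splitter (a simplification, not real BPE).
# TOY_VOCAB is a hypothetical vocabulary for illustration only.
TOY_VOCAB = {"un", "happi", "ness", "happy", "kind"}

def split_into_subwords(word, vocab=TOY_VOCAB):
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(split_into_subwords("unhappiness"))  # ['un', 'happi', 'ness']
print(split_into_subwords("unkindness"))   # ['un', 'kind', 'ness']
```

Neither word is in the vocabulary as a whole, yet both are covered by five known pieces.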

Token cost matters: API pricing is per token. 1 token ≈ 4 characters ≈ 0.75 words in English. Non-English text often uses more tokens per word.
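
As a rough illustration of the rule of thumb, the sketch below estimates token count and cost from character length. `PRICE_PER_MILLION_TOKENS` is a made-up placeholder, not a real price, and the 4-characters-per-token ratio is only a ballpark for English.

```python
import math

# Rough token and cost estimate using the ~4 characters per token rule.
# PRICE_PER_MILLION_TOKENS is a made-up placeholder, not a real price.
CHARS_PER_TOKEN = 4
PRICE_PER_MILLION_TOKENS = 1.00  # hypothetical, in dollars

def estimate_tokens(text):
    # Round up: a partial chunk of characters still needs a token.
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_cost(text, price_per_million=PRICE_PER_MILLION_TOKENS):
    return estimate_tokens(text) * price_per_million / 1_000_000

text = "Transformers are the backbone of modern AI"
print(estimate_tokens(text))  # 11
```

The estimate here is 11 tokens, while the GPT-2 tokenizer example above produced 8 for the same sentence, so for billing-sensitive work it is safer to count with the model's actual tokenizer.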

🏗️ Step 2: The Transformer Architecture

Token IDs [8291, 364, 389, ...]
    │
    ▼
Token Embeddings (lookup table → each ID becomes a 768-dim vector)
    │
    ▼
Position Encoding (adds "where in the sequence" information)
    │
    ▼
┌─────────────────────────────────────────┐
│  Transformer Block (repeated N times)   │
│                                         │
│  ┌─────────────────────────────────┐   │
│  │   Multi-Head Self-Attention     │   │  ← "Which other tokens should I
│  │   (each token attends to itself │   │     pay attention to?"
│  │    and earlier tokens)          │   │
│  └─────────────────────────────────┘   │
│           │ (residual connection)       │
│  ┌─────────────────────────────────┐   │
│  │   Feed-Forward Network          │   │  ← "Process this attended info"
│  │   (2 linear layers + activation)│   │
│  └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
    │
    ▼
Final Linear Layer + Softmax
    │
    ▼
Probability over vocabulary (which token comes next?)

GPT-2 (small): 12 layers, 768-dim → ~124M parameters (originally reported as 117M)
GPT-3: 96 layers, 12288-dim → 175B parameters
DeepSeek-R1: 61 layers, mixture-of-experts → 671B parameters total (about 37B active per token)
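
The link between layer count, model dimension, and parameter count can be checked with a back-of-the-envelope formula. The sketch below assumes a dense GPT-style block with roughly 12 × d_model² weights (4 × d² for attention, 8 × d² for a feed-forward network with a 4× hidden size) and ignores biases and layer norms; the function name is illustrative.

```python
# Back-of-the-envelope parameter count for a dense GPT-style model.
# Assumes ~12 * d_model^2 weights per block (4*d^2 attention + 8*d^2
# feed-forward with a 4x hidden size); ignores biases and layer norms.

def estimate_params(n_layers, d_model, vocab_size=50_257, max_positions=1024):
    per_block = 12 * d_model ** 2
    token_embeddings = vocab_size * d_model
    position_embeddings = max_positions * d_model
    return n_layers * per_block + token_embeddings + position_embeddings

gpt2_small = estimate_params(n_layers=12, d_model=768)
gpt3 = estimate_params(n_layers=96, d_model=12288, max_positions=2048)

print(f"GPT-2 small: ~{gpt2_small / 1e6:.0f}M")  # ~124M
print(f"GPT-3:       ~{gpt3 / 1e9:.0f}B")        # ~175B
```

Both estimates land close to the sizes listed above. The formula does not apply to mixture-of-experts models such as DeepSeek-R1, where only a fraction of the weights is used for each token.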

🎯 Attention Mechanism — The Key Innovation

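Self-attention is what makes transformers powerful. For each token, it answers: "which other tokens in the sequence are most relevant to understanding me?" In GPT-style (decoder-only) models a token can only attend to itself and earlier tokens; the example below ignores that restriction to keep the intuition simple.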

Sentence: "The bank can guarantee deposits will be covered"

For the token "bank" (financial institution):
  - High attention to: "deposits", "covered", "guarantee"
  - Low attention to: "The", "can", "will", "be"

For the same word "bank" in "I sat by the river bank":
  - High attention to: "river", "sat", "by"
  - Low attention to: "I", "the"

Same word, different context → different understanding
This is how transformers handle ambiguity.

Multi-Head Attention: Multiple attention mechanisms run in parallel. One head might focus on syntactic relationships, another on semantic meaning, another on coreference (tracking what "it" refers to).
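
A single attention head can be sketched in plain Python as scaled dot-product attention with a causal mask. This is a minimal illustration under simplifying assumptions: the input vectors are hypothetical, and they are used directly as queries, keys, and values, whereas a real model first applies learned linear projections.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Token i may only look at positions 0..i (itself and earlier tokens).
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention distribution over the tokens it can see; every row sums to 1. Multi-head attention runs several such heads in parallel on different projections and concatenates the results.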

📏 Context Window

The context window is the maximum number of tokens the model can "see" at once. Everything outside it is invisible.

Context window = 8192 tokens (example)

[VISIBLE TO MODEL]
System prompt (200 tokens)
Conversation history (5000 tokens)
Current user message (100 tokens)
Retrieved RAG context (2000 tokens)
─────────────────────────────────────
Total: 7300 tokens (within window ✓)

If you add 1000 more tokens: 8300 → over the limit. Depending on the application, the oldest tokens are dropped or the API rejects the request.

Why this matters:

Very long conversations lose early context
RAG chunks must fit within the remaining space after system + conversation
Longer prompts cost more, since every token sent in a request is billed and processed
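
One common way to stay inside the window is to drop the oldest conversation turns first. The sketch below assumes that policy and assumes token counts are supplied by the caller; `fit_history` and the per-turn sizes are hypothetical.

```python
# Sketch: keep a prompt inside a fixed context window by dropping the oldest
# conversation turns first. Token counts are supplied by the caller.
CONTEXT_WINDOW = 8192

def fit_history(history_token_counts, fixed_tokens, window=CONTEXT_WINDOW):
    """Return (turns kept, total tokens used).

    history_token_counts: token count per turn, oldest first.
    fixed_tokens: system prompt + current message + retrieved context.
    """
    budget = window - fixed_tokens
    kept = 0
    used = 0
    # Walk backwards from the newest turn and stop when the budget runs out.
    for count in reversed(history_token_counts):
        if used + count > budget:
            break
        used += count
        kept += 1
    return kept, fixed_tokens + used

fixed = 200 + 100 + 2000                        # system + user message + RAG context
history = [1000, 1000, 1000, 1000, 1000, 1000]  # six turns, oldest first
kept, total = fit_history(history, fixed)
print(kept, total)  # 5 7300
```

With six 1000-token turns, only the five most recent fit, which reproduces the 7300-token total from the example above. Other policies, such as summarizing old turns, are equally valid.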

🧠 Why LLMs "Hallucinate"

Hallucination = the model generating plausible-sounding but false information. It happens because:

Training objective mismatch: The model was trained to predict fluent text, not to only say things it's certain about.
No knowledge flag: The model has no built-in, dependable "I don't know" signal. It predicts a probable next token either way.
Pattern completion: If the pattern strongly suggests an answer (even if wrong), the model follows it.

Prompt: "The first person to walk on Mars was..."
The model was trained on lots of text with this pattern:
"The first person to [milestone] was [name]"
It will complete this even though no one has walked on Mars yet.

Fixes:

RAG: ground answers in retrieved documents
System prompt: "Only answer based on the provided context. Say 'I don't know' if the answer isn't there."
Low temperature: reduces randomness in sampling, which can reduce (but not eliminate) fabrication
Evaluation: systematically detect hallucinations (see file 06)
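
The effect of temperature can be seen directly in the softmax that turns model scores into probabilities. The sketch below uses hypothetical logits for three candidate tokens; dividing the logits by the temperature before the softmax is the standard formulation.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the usual softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, so sampling becomes more predictable. If the top-scoring token is itself wrong, a low temperature will not fix that.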

⚡ Model Sizes and What They Can Do

Size	Parameters	Typical use	Example
Tiny	<4B	On-device, real-time	Phi-3 mini (3.8B)
Small	4–10B	Fast inference, limited reasoning	Llama 3.1 8B, Mistral 7B
Medium	10–30B	Good balance, most tasks	Qwen 14B
Large	30–70B	Strong reasoning	Llama 3.1 70B
XL	>70B	Best quality, expensive	Llama 3.1 405B

Rule of thumb: Use the smallest model that accomplishes the task. Smaller = faster + cheaper.
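
One concrete reason smaller is cheaper is memory: the weights alone take parameters × bytes per parameter. The sketch below is a weights-only estimate that ignores activations, the KV cache, and runtime overhead; the function name and dtype table are illustrative.

```python
# Weights-only memory estimate: parameters x bytes per parameter.
# Ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params, dtype="fp16"):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    fp16 = weight_memory_gb(n_params)
    int4 = weight_memory_gb(n_params, "int4")
    print(f"{name}: {fp16:.1f} GB in fp16, {int4:.1f} GB in int4")
```

An 8B model needs about 16 GB for fp16 weights, while a 70B model needs about 140 GB, which is why the larger tiers typically require multiple GPUs or aggressive quantization.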

🔢 Key Numbers to Remember

Thing	Value
1 token	~4 chars / ~0.75 English words
GPT-4 Turbo / GPT-4o context window	128K tokens
DeepSeek-R1 context	128K tokens
1 page of text	~500 tokens
1 book (300 pages)	~150K tokens
Minimum for fine-tuning	~100 examples
Semantic embedding size (MiniLM)	384 dimensions
Temperature starting range	0.5–0.7 (task-dependent)
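
These numbers combine into quick sanity checks. The sketch below uses only the rules of thumb from the table to ask whether a 300-page book fits in a 128K window; `reserved_tokens` is a hypothetical allowance for the system prompt and the reply.

```python
TOKENS_PER_PAGE = 500       # rule of thumb from the table above
CONTEXT_WINDOW = 128_000    # 128K-token window from the table above

def pages_that_fit(window=CONTEXT_WINDOW, reserved_tokens=0):
    """How many pages fit after reserving tokens for prompts and the reply."""
    return max(0, (window - reserved_tokens) // TOKENS_PER_PAGE)

book_tokens = 300 * TOKENS_PER_PAGE
print(book_tokens, book_tokens <= CONTEXT_WINDOW)  # 150000 False
print(pages_that_fit(reserved_tokens=4_000))       # 248
```

By these estimates a full book does not fit in one request, so it has to be chunked, summarized, or retrieved from selectively.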

This file is mostly theory. The understanding here should make debugging sessions faster.

Next: 10_production.md — deploying responsibly.
