A guided tour of a complete GPT-2 implementation using the MAX framework.
Each section walks through the code in gpt2_arch/gpt2.py and explains what
it does and why — from model configuration through serving with max serve.
- Transformer architecture: Every component of GPT-2, explained through working code
- MAX Python API: How MAX's
experimental.nnbuilds and compiles neural networks - Inference patterns: Weight loading, lazy initialization, model compilation, and autoregressive generation
- Pixi package manager
- Basic understanding of neural networks
- You'll need to meet the MAX system requirements
git clone https://github.com/modular/max-llm-book
cd max-llm-book
pixi installServe GPT-2 via an OpenAI-compatible HTTP endpoint:
pixi run serveThen query it:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt2","prompt":"In the beginning","max_tokens":30,"temperature":0}'Explore each GPT-2 component interactively — real tensor shapes, activation visualizations, and live text generation from pretrained weights:
pixi run notebookThis opens JupyterLab with notebooks/tutorial.ipynb. Sections 1–8 run
immediately with random weights; sections 9–12 download the pretrained GPT-2
checkpoint (~500 MB) from Hugging Face and compile the model for inference.
pixi run bookOr read it online at llm.modular.com.
The tutorial walks through gpt2_arch/ section by section:
| Section | Topic | What you'll learn |
|---|---|---|
| — | Run the model | Serve GPT-2 with pixi run serve before diving into code |
| 1 | Model configuration | Architecture hyperparameters and Hugging Face compatibility |
| 2 | Feed-forward network | Two-layer MLP with GELU activation |
| 3 | Causal masking | Preventing attention to future tokens |
| 4 | Multi-head attention | Parallel attention across 12 heads |
| 5 | Layer normalization | Pre-norm pattern for stable activations |
| 6 | Transformer block | Residual connections and component wiring |
| 7 | Stack transformer blocks | Embeddings and the 12-layer model body |
| 8 | Language model head | Projecting hidden states to vocabulary logits |
| 9 | Weight adaptation | Reconciling the HuggingFace checkpoint with MAX's weight layout |
| 10 | KV cache configuration | Exposing attention dimensions for cache pre-allocation |
| 11 | Pipeline model | Load, compile, and execute the model inside max serve |
| 12 | Architecture registration | Declare the package to max serve and wire all pieces together |
max-llm-book/
├── book/ # mdBook tutorial documentation
│ └── src/
│ ├── introduction.md
│ ├── serve_first.md
│ ├── step_01.md ... step_12.md
│ └── SUMMARY.md
├── gpt2_arch/ # GPT-2 model + custom architecture package for `max serve`
│ ├── gpt2.py # Model definition (GPT2Config through MaxGPT2LMHeadModel)
│ ├── model.py # PipelineModel wrapper used by max serve
│ ├── weight_adapters.py# HuggingFace → MAX weight conversion
│ ├── model_config.py # KV cache dimension configuration
│ └── arch.py # Architecture registration entry point
├── notebooks/ # Interactive Jupyter notebook companion
│ └── tutorial.ipynb
├── tests/ # Tests for gpt2_arch/
├── pixi.toml # Project dependencies and tasks
└── README.md # This file
- MAX documentation: docs.modular.com
- Hugging Face GPT-2: huggingface.co/gpt2
- Attention Is All You Need: arxiv.org/abs/1706.03762
- Language Models are Unsupervised Multitask Learners (GPT-2 paper): openai.com
Found an issue or want to improve the tutorial? Contributions welcome:
- File issues for bugs or unclear explanations
- Suggest improvements to code examples or visualizations
- Open a pull request with fixes or additions