The most atomic way to train and run inference for a GPT — in pure Rust.
A faithful Rust port of Andrej Karpathy's microgpt.py, the dependency-free Python implementation of a GPT model. This file is the complete algorithm. Everything else is just efficiency.
This is a single-file, from-scratch implementation of a GPT language model in Rust, including:
- 🧠 Custom Autograd Engine — arena-based computation graph with reverse-mode automatic differentiation (sketched right after this list)
- 🔢 Character-level Tokenizer — maps characters to token IDs with a special BOS (Beginning of Sequence) token (see the tokenizer sketch below)
- 🏗️ GPT-2 Architecture — multi-head self-attention, MLP blocks, RMSNorm, residual connections, and KV-cache
- ⚡ Adam Optimizer — with bias correction and linear learning rate decay (see the sketch after the hyperparameter table)
- 🎲 Temperature-controlled Inference — autoregressive text generation with weighted sampling (see the sketch after the sample output)
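To make the autograd bullet concrete, here is a minimal sketch of an arena-based tape under assumed names (`Tape`, `Node`, `ValId` are illustrative; the actual code in src/main.rs differs in detail). Every value is a plain index into one flat `Vec`, and each node records its parents plus the local derivatives needed for the chain rule:

```rust
// Minimal sketch of an arena-based autograd tape (hypothetical names).

#[derive(Clone, Copy)]
struct ValId(usize); // a plain index handle instead of Rc<RefCell<...>>

struct Node {
    data: f64,
    grad: f64,
    parents: [Option<ValId>; 2],
    local_grads: [f64; 2], // d(output)/d(parent), captured at forward time
}

struct Tape {
    nodes: Vec<Node>, // contiguous storage: cache-friendly, no reference counting
}

impl Tape {
    fn with_capacity(cap: usize) -> Self {
        Tape { nodes: Vec::with_capacity(cap) } // pre-allocated tape capacity
    }

    fn push(&mut self, data: f64, parents: [Option<ValId>; 2], local_grads: [f64; 2]) -> ValId {
        self.nodes.push(Node { data, grad: 0.0, parents, local_grads });
        ValId(self.nodes.len() - 1)
    }

    fn leaf(&mut self, data: f64) -> ValId {
        self.push(data, [None, None], [0.0, 0.0])
    }

    fn add(&mut self, a: ValId, b: ValId) -> ValId {
        let d = self.nodes[a.0].data + self.nodes[b.0].data;
        self.push(d, [Some(a), Some(b)], [1.0, 1.0]) // d(a+b)/da = d(a+b)/db = 1
    }

    fn mul(&mut self, a: ValId, b: ValId) -> ValId {
        let (da, db) = (self.nodes[a.0].data, self.nodes[b.0].data);
        self.push(da * db, [Some(a), Some(b)], [db, da]) // d(ab)/da = b, d(ab)/db = a
    }

    // Reverse-mode sweep. In this sketch, nodes are appended in evaluation
    // order, which is already topological, so a reverse scan propagates
    // gradients correctly. (The real implementation builds an explicit
    // iterative topological sort; see the sketch in the performance notes.)
    fn backward(&mut self, root: ValId) {
        self.nodes[root.0].grad = 1.0;
        for i in (0..=root.0).rev() {
            let g = self.nodes[i].grad;
            for k in 0..2 {
                if let Some(p) = self.nodes[i].parents[k] {
                    let delta = self.nodes[i].local_grads[k] * g;
                    self.nodes[p.0].grad += delta;
                }
            }
        }
    }
}
```

Usage: `let y = t.mul(a, b); t.backward(y);` leaves dy/da (the value of b) in `t.nodes[a.0].grad`.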
The model trains on a dataset of ~32K names and learns to generate new, plausible-sounding names from scratch.
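Character-level tokenization needs only a handful of lines. A hedged sketch (`Tokenizer`, `encode`, and `decode` are illustrative names, not necessarily the ones in src/main.rs): the vocabulary is every distinct character in the corpus, sorted, with id 0 reserved for BOS. That is consistent with the vocab size of 27 in the sample output further down (26 lowercase letters plus BOS):

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of a character-level tokenizer with a BOS token.
struct Tokenizer {
    chars: Vec<char>, // sorted; token id = index + 1, id 0 is BOS
}

const BOS: usize = 0;

impl Tokenizer {
    /// Build the vocabulary from every distinct character in the corpus.
    fn new(corpus: &str) -> Self {
        // BTreeSet deduplicates and iterates in sorted order.
        let set: BTreeSet<char> = corpus.chars().filter(|c| !c.is_whitespace()).collect();
        Tokenizer { chars: set.into_iter().collect() }
    }

    fn vocab_size(&self) -> usize {
        self.chars.len() + 1 // +1 for the BOS token
    }

    /// Encode one document (one name), prefixed with BOS.
    /// e.g. with a-z: encode("emma") == [0, 5, 13, 13, 1]
    fn encode(&self, doc: &str) -> Vec<usize> {
        let mut ids = vec![BOS];
        ids.extend(doc.chars().map(|c| {
            self.chars.binary_search(&c).expect("char not in vocab") + 1
        }));
        ids
    }

    fn decode(&self, ids: &[usize]) -> String {
        ids.iter()
            .filter(|&&id| id != BOS)
            .map(|&id| self.chars[id - 1])
            .collect()
    }
}
```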
GPT-2 (simplified) with:
├── Token Embedding (vocab_size × n_embd)
├── Position Embedding (block_size × n_embd)
├── RMSNorm (initial)
├── Transformer Block × n_layer
│   ├── Multi-Head Attention
│   │   ├── RMSNorm
│   │   ├── Q, K, V projections
│   │   ├── Scaled dot-product attention
│   │   ├── Output projection
│   │   └── Residual connection
│   └── MLP
│       ├── RMSNorm
│       ├── FC1 (n_embd → 4×n_embd) + ReLU
│       ├── FC2 (4×n_embd → n_embd)
│       └── Residual connection
└── LM Head (vocab_size × n_embd)
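RMSNorm appears at three points in the tree above. Unlike LayerNorm, it neither subtracts the mean nor adds a bias; it only divides by the root mean square, usually followed by a learned per-dimension gain. A plain-f64 sketch of the formula (the real version is built from autograd ops so the normalization itself is differentiated):

```rust
/// RMSNorm over one activation vector: x_i / sqrt(mean(x^2) + eps).
/// Plain-f64 sketch; a learned gain may be applied afterwards.
fn rmsnorm(x: &[f64], eps: f64) -> Vec<f64> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f64>() / x.len() as f64;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * scale).collect()
}
```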
| Parameter | Value |
|---|---|
| n_layer | 1 |
| n_embd | 16 |
| block_size | 16 |
| n_head | 4 |
| head_dim | 4 |
| num_steps | 1000 |
| learning_rate | 0.01 |
| temperature | 0.5 |
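The Adam bullet from the feature list, spelled out with the two table entries that feed it (learning_rate as the base step size, num_steps as the decay horizon). A sketch under assumed hyperparameters: BETA1, BETA2, and EPS here are the textbook Adam defaults, not necessarily the constants in src/main.rs, and the decay schedule shown is one common reading of "linear learning rate decay":

```rust
/// One Adam step with bias correction and linear learning-rate decay.
/// Sketch with assumed constants; see src/main.rs for the real values.
fn adam_step(
    params: &mut [f64],
    grads: &[f64],
    m: &mut [f64],    // first-moment (mean of gradients) estimate
    v: &mut [f64],    // second-moment (mean of squared gradients) estimate
    step: usize,      // 1-based optimizer step
    num_steps: usize, // 1000 in the table above
    base_lr: f64,     // 0.01 in the table above
) {
    const BETA1: f64 = 0.9;   // assumed textbook default
    const BETA2: f64 = 0.999; // assumed textbook default
    const EPS: f64 = 1e-8;    // assumed textbook default

    // Linear decay: from base_lr at step 1 down towards 0 at num_steps.
    let lr = base_lr * (1.0 - (step as f64 - 1.0) / num_steps as f64);

    let t = step as f64;
    for i in 0..params.len() {
        m[i] = BETA1 * m[i] + (1.0 - BETA1) * grads[i];
        v[i] = BETA2 * v[i] + (1.0 - BETA2) * grads[i] * grads[i];
        // Bias correction: m and v start at zero, so early estimates
        // are scaled up to be unbiased.
        let m_hat = m[i] / (1.0 - BETA1.powf(t));
        let v_hat = v[i] / (1.0 - BETA2.powf(t));
        params[i] -= lr * m_hat / (v_hat.sqrt() + EPS);
    }
}
```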
- Rust (stable toolchain)
# Debug build (fast compile, slower execution)
cargo run
# Release build (slower compile, much faster execution — recommended!)
cargo run --release

The program will:
- Download the names dataset (~200KB) on first run
- Train for 1000 steps, printing the loss
- Generate 20 hallucinated names
num docs: 32033
vocab size: 27
num params: 7451
step 1000 / 1000 | loss 2.1234
--- inference (new, hallucinated names) ---
sample 1: mara
sample 2: joline
sample 3: kaden
...
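Temperature divides every logit before the softmax, so the 0.5 in the table doubles every logit gap, which exactly squares the odds ratio between any two tokens and biases generation towards safer, more name-like continuations (values above 1 would flatten the distribution instead). A self-contained sketch of temperature-controlled weighted sampling, using the rand crate's `Rng::gen`; the port's actual RNG plumbing may differ:

```rust
use rand::Rng;

/// Sample the next token id from raw logits at a given temperature.
/// Sketch: softmax(logits / temperature), then a weighted draw.
fn sample(logits: &[f64], temperature: f64, rng: &mut impl Rng) -> usize {
    // Scale, then exponentiate; subtracting the max keeps exp() stable.
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let weights: Vec<f64> = scaled.iter().map(|l| (l - max).exp()).collect();
    let total: f64 = weights.iter().sum();

    // Weighted draw by inverse CDF: walk the weights until u is used up.
    let mut u = rng.gen::<f64>() * total;
    for (i, w) in weights.iter().enumerate() {
        u -= w;
        if u <= 0.0 {
            return i;
        }
    }
    weights.len() - 1 // floating-point edge case: fall back to the last id
}
```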
The Rust version uses an arena-based autograd tape instead of Python's reference-counted Value objects, providing:
- No Rc/RefCell overhead — all values are stored contiguously in a Vec
- Cache-friendly memory layout — sequential access patterns
- Iterative topological sort — no recursion depth limits (sketched below)
- Pre-allocated tape capacity — minimizes heap allocations during training
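The iterative topological sort is the stack-based replacement for the recursive traversal the Python version uses: on deep graphs recursion would overflow the call stack, while an explicit Vec cannot. A sketch, reusing the hypothetical Tape/ValId layout from the autograd sketch near the top of this README:

```rust
/// Iterative post-order DFS over the graph rooted at `root`, producing a
/// topological order with an explicit stack instead of recursion.
/// Reuses the hypothetical Tape/ValId sketch; the real code differs.
fn topo_order(tape: &Tape, root: ValId) -> Vec<usize> {
    let mut order = Vec::new();
    let mut visited = vec![false; tape.nodes.len()];
    // Each entry: (node index, have its parents been pushed yet?)
    let mut stack = vec![(root.0, false)];
    while let Some((i, expanded)) = stack.pop() {
        if expanded {
            order.push(i); // all inputs emitted first: post-order
        } else if !visited[i] {
            visited[i] = true;
            stack.push((i, true)); // revisit after the parents
            for p in tape.nodes[i].parents.iter().flatten() {
                stack.push((p.0, false));
            }
        }
    }
    order // the backward pass walks this in reverse, root first
}
```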
For maximum performance, always use cargo run --release, which enables LTO and a single codegen unit.
| Aspect | Python | Rust |
|---|---|---|
| Autograd | Value with Rc-like semantics | Arena-based Tape with index handles |
| Memory | GC-managed | Pre-allocated contiguous vectors |
| Backward | Recursive topo sort | Iterative DFS (stack-safe) |
| RNG | random.gauss | rand_distr::Normal |
| HTTP | urllib | curl / powershell (no deps) |
microGPT/
├── Cargo.toml        # Rust project manifest
├── src/
│   └── main.rs       # The complete algorithm (single file)
├── microgpt.py       # Original Python version by @karpathy
├── README.md
├── LICENSE
└── .gitignore
- Original Python implementation by Andrej Karpathy — microgpt.py
- Rust port — this repository
MIT License — see LICENSE for details.