A decoder-only Transformer language model built entirely from scratch in PyTorch. It implements the modern architecture components used by LLaMA-3 and Mistral (Grouped Query Attention, RoPE, RMSNorm, SwiGLU FFN) plus a full training pipeline with mixed precision, gradient accumulation, and checkpoint resume.
| Component | Implementation | Reference |
|---|---|---|
| Tokenizer | BPE (Byte-Pair Encoding) | GPT-2/LLaMA |
| Normalization | RMSNorm (pre-norm) | LLaMA |
| Position Encoding | Rotary Embeddings (RoPE) | Su et al., 2021 |
| Attention | Grouped Query Attention + KV-cache | Ainslie et al., 2023 |
| Feed-Forward | SwiGLU | Shazeer, 2020 |
| LR Schedule | Warmup-Stable-Decay (WSD) | LLaMA-3, Falcon |
| Precision | fp16 / bf16 (auto-detected) | — |
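To make the table concrete, here is a minimal sketch of two of these components, RMSNorm and SwiGLU, in the form they are usually written. The project's own modules live in neollm/model and may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias, one learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its RMS, then apply the learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```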
Available model presets:

| Preset | Params | Layers | Dim | Heads | VRAM | Config |
|---|---|---|---|---|---|---|
| nano | 22M | 6 | 512 | 8 | ~2 GB | --preset nano |
| small | 125M | 12 | 768 | 12 | ~4 GB | --config configs/small.yaml |
| medium | 370M | 24 | 1024 | 16 | ~10 GB | --config configs/medium.yaml |
Requirements:
- Python 3.10+
- PyTorch 2.3+ with CUDA (recommended) or CPU
- 4 GB+ GPU VRAM for small model, 2 GB for nano
git clone https://github.com/yourusername/NeoLLM.git
cd NeoLLM
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux / macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
For CUDA-enabled PyTorch (recommended), install it before requirements.txt:
pip install torch --index-url https://download.pytorch.org/whl/cu126
Verify the installation:
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
Next, train the tokenizer: this builds a 32,000-token BPE vocabulary on OpenWebText. Run once.
python scripts/train_tokenizer.py --samples 200000 --vocab-size 32000
Output: data/tokenizer.json
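To sanity-check the result, you can load the JSON file directly. This assumes it is a standard Hugging Face tokenizers file (the project's own loader is in neollm/data):

```python
from tokenizers import Tokenizer  # pip install tokenizers

tok = Tokenizer.from_file("data/tokenizer.json")
ids = tok.encode("Hello, NeoLLM!").ids
print(ids)                                   # token IDs
print(tok.decode(ids))                       # should round-trip the text
print("vocab size:", tok.get_vocab_size())   # expected: 32000
```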
The data preparation script downloads and tokenizes documents into fast binary shards. Run it once (or rerun with a higher --max-docs to get more data).
# Recommended starting point (~500MB, good for testing)
python scripts/prepare_data.py --max-docs 50000
# For serious training (~5GB, much better model quality)
python scripts/prepare_data.py --max-docs 500000
Output: data/processed/train_????.bin
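The shard format is defined in neollm/data; a common convention (assumed here, so verify against the code) is a flat binary array of uint16 token IDs, which you can inspect with a NumPy memory map:

```python
import numpy as np

# Assumes flat uint16 token IDs; the shard name below is just the first file the script writes.
shard = np.memmap("data/processed/train_0000.bin", dtype=np.uint16, mode="r")
print("tokens in shard:", len(shard))
print("first 20 token IDs:", shard[:20])
```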
Now start training. First, a quick smoke test:
python scripts/train.py --preset nano --max-steps 500
Use this to verify your setup works before committing to a long run.
# Full training run (small 125M or medium 370M)
python scripts/train.py --config configs/small.yaml
python scripts/train.py --config configs/medium.yaml
Training auto-resumes from the latest checkpoint. Just restart the command to continue.
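For reference, the Warmup-Stable-Decay schedule listed in the component table has a simple shape. The sketch below is illustrative only; the phase boundaries and the linear decay are assumptions, and the real schedule lives in neollm/training.

```python
def wsd_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
           max_steps: int = 200_000, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, long flat plateau, short decay at the end."""
    decay_start = int(max_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup from 0 to the peak LR
    if step < decay_start:
        return peak_lr                         # stable phase at the peak LR
    # final decay phase down to zero
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - decay_start))
```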
Open a second terminal while training runs:
tensorboard --logdir runs/
Then open http://localhost:6006 in your browser. Live graphs for loss, perplexity, learning rate, and tokens/sec.
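The perplexity curve carries the same information as the loss curve, since perplexity is just the exponential of the cross-entropy loss:

```python
import math

print(math.exp(3.0))  # a loss of 3.0 nats corresponds to a perplexity of ~20
```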
# Chat with the small model
python scripts/chat.py --checkpoint checkpoints/small/latest.pt
# Chat with nano (after a quick test run)
python scripts/chat.py --checkpoint checkpoints/nano/latest.pt
# Single prompt (non-interactive)
python scripts/chat.py --checkpoint checkpoints/small/latest.pt \
--prompt "The future of artificial intelligence" --max-tokens 200from neollm.inference.engine import InferenceEngine
engine = InferenceEngine.from_checkpoint(
checkpoint_path="checkpoints/small/latest.pt",
tokenizer_path="data/tokenizer.json",
)
response = engine.generate(
prompt="Once upon a time,",
max_new_tokens=100,
temperature=0.8,
top_p=0.9,
)
print(response)

The project ships with a ready-to-use Kaggle notebook (kaggle_neollm_small.ipynb). Kaggle provides free GPU sessions of up to 12 hours, with 30 GPU-hours per week.
- Upload this repository as a Kaggle Dataset
- Upload kaggle_neollm_small.ipynb as a new notebook
- Set Accelerator → GPU T4 and enable Internet
- Click Run All
- Download checkpoints/small/latest.pt from the Output tab when done
Re-upload the checkpoint next session — training resumes automatically from where it left off.
Checkpoint layout:

checkpoints/
├── nano/
│ ├── step_000100.pt
│ └── latest.pt ← always the most recent
├── small/
│ ├── step_002000.pt
│ ├── step_004000.pt
│ └── latest.pt
└── medium/
└── latest.pt
runs/ ← TensorBoard logs
├── nano/
└── small/
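Checkpoints are regular PyTorch .pt files, so you can inspect one with torch.load. The keys you will see depend on the Trainer; the snippet below makes no assumptions about them:

```python
import torch

ckpt = torch.load("checkpoints/small/latest.pt", map_location="cpu")
# Typically a dict holding model weights plus optimizer/step state for resuming.
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```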
All model and training settings are stored in YAML files under configs/. Key options:
# configs/small.yaml (excerpt)
train:
batch_size: 4 # increase if you have more VRAM
grad_accum_steps: 8 # effective batch = batch_size × grad_accum_steps
lr: 3.0e-4
max_steps: 200000
fp16: true # use bf16: true on Ampere+ GPUs (RTX 3000+)
gradient_checkpointing: false # set true to reduce VRAM at ~20% speed cost
save_every: 2000
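The grad_accum_steps option corresponds to the standard gradient-accumulation pattern: with batch_size 4 and grad_accum_steps 8, the optimizer sees an effective batch of 32 sequences. A minimal sketch of the idea (illustrative only, not the project's Trainer loop):

```python
import torch
import torch.nn.functional as F

def train_with_accumulation(model, optimizer, loader, grad_accum_steps: int = 8):
    """Accumulate gradients over several micro-batches before each optimizer step."""
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(loader):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        # Scale so the accumulated gradient averages over the effective batch.
        (loss / grad_accum_steps).backward()
        if (i + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```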
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | Batch too large | Reduce batch_size or set gradient_checkpointing: true |
| FileNotFoundError: tokenizer.json | Step 1 skipped | Run train_tokenizer.py first |
| No shard files found | Step 2 skipped | Run prepare_data.py first |
| Loss not decreasing | Wrong LR or data issue | Run --preset nano first to verify setup |
| No module named torch | Wrong Python | Activate your virtual environment |
Run the test suite with:
python -m pytest tests/ -v

Project layout:

NeoLLM/
├── neollm/
│ ├── config/ # ModelConfig, TrainConfig, DataConfig dataclasses
│ ├── model/ # RMSNorm, RoPE, Attention, FFN, Transformer, NeoLLM
│ ├── data/ # BPE tokenizer, dataset sharding, DataLoader
│ ├── training/ # Trainer, AdamW optimizer, WSD LR schedule, ModelEMA
│ ├── inference/ # Sampler (greedy/top-k/top-p), InferenceEngine
│ └── utils/ # Logging, metrics
├── scripts/
│ ├── train_tokenizer.py
│ ├── prepare_data.py
│ ├── train.py
│ └── chat.py
├── configs/
│ ├── nano.yaml # 22M params
│ ├── small.yaml # 125M params
│ └── medium.yaml # 370M params
├── tests/
├── kaggle_neollm_small.ipynb
├── inference_example.py
├── requirements.txt
└── pyproject.toml
Released under the MIT License.