A decoder-only Transformer language model built entirely from scratch in PyTorch. It implements the modern architecture components used by LLaMA-3 and Mistral (Grouped Query Attention, RoPE, RMSNorm, SwiGLU FFN) plus a full training pipeline with mixed precision, gradient accumulation, and checkpoint resume.
| Component | Implementation | Reference |
|---|---|---|
| Tokenizer | BPE (Byte-Pair Encoding) | GPT-2/LLaMA |
| Normalization | RMSNorm (pre-norm) | LLaMA |
| Position Encoding | Rotary Embeddings (RoPE) | Su et al., 2021 |
| Attention | Grouped Query Attention + KV-cache | Ainslie et al., 2023 |
| Feed-Forward | SwiGLU | Shazeer, 2020 |
| LR Schedule | Warmup-Stable-Decay (WSD) | LLaMA-3, Falcon |
| Precision | fp16 / bf16 (auto-detected) | — |
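To make the table concrete, here is a minimal sketch of two of these components, RMSNorm and SwiGLU, in the form they are usually written. The project's own modules live in neollm/model and may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias, one learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its RMS, then apply the learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```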
Available model presets:

| Preset | Params | Layers | Dim | Heads | VRAM | Config |
|---|---|---|---|---|---|---|
| nano | 22M | 6 | 512 | 8 | ~2 GB | --preset nano |
| small | 125M | 12 | 768 | 12 | ~4 GB | --config configs/small.yaml |
| medium | 370M | 24 | 1024 | 16 | ~10 GB | --config configs/medium.yaml |
Requirements:
- Python 3.10+
- PyTorch 2.3+ with CUDA (recommended) or CPU
- 4 GB+ GPU VRAM for small model, 2 GB for nano
git clone https://github.com/yourusername/NeoLLM.git
cd NeoLLM
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux / macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
For CUDA-enabled PyTorch (recommended), install it before requirements.txt:
pip install torch --index-url https://download.pytorch.org/whl/cu126
Verify the installation:
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
Next, train the tokenizer: this builds a 32,000-token BPE vocabulary on OpenWebText. Run once.
python scripts/train_tokenizer.py --samples 200000 --vocab-size 32000
Output: data/tokenizer.json
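To sanity-check the result, you can load the JSON file directly. This assumes it is a standard Hugging Face tokenizers file (the project's own loader is in neollm/data):

```python
from tokenizers import Tokenizer  # pip install tokenizers

tok = Tokenizer.from_file("data/tokenizer.json")
ids = tok.encode("Hello, NeoLLM!").ids
print(ids)                                   # token IDs
print(tok.decode(ids))                       # should round-trip the text
print("vocab size:", tok.get_vocab_size())   # expected: 32000
```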
The data preparation script downloads and tokenizes documents into fast binary shards. Run it once (or rerun with a higher --max-docs to get more data).
# Recommended starting point (~500MB, good for testing)
python scripts/prepare_data.py --max-docs 50000
# For serious training (~5GB, much better model quality)
python scripts/prepare_data.py --max-docs 500000
Output: data/processed/train_????.bin
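The shard format is defined in neollm/data; a common convention (assumed here, so verify against the code) is a flat binary array of uint16 token IDs, which you can inspect with a NumPy memory map:

```python
import numpy as np

# Assumes flat uint16 token IDs; the shard name below is just the first file the script writes.
shard = np.memmap("data/processed/train_0000.bin", dtype=np.uint16, mode="r")
print("tokens in shard:", len(shard))
print("first 20 token IDs:", shard[:20])
```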
Now start training. First, a quick smoke test:
python scripts/train.py --preset nano --max-steps 500
Use this to verify your setup works before committing to a long run.
# Full training run (small 125M or medium 370M)
python scripts/train.py --config configs/small.yaml
python scripts/train.py --config configs/medium.yaml
Training auto-resumes from the latest checkpoint. Just restart the command to continue.
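For reference, the Warmup-Stable-Decay schedule listed in the component table has a simple shape. The sketch below is illustrative only; the phase boundaries and the linear decay are assumptions, and the real schedule lives in neollm/training.

```python
def wsd_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
           max_steps: int = 200_000, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, long flat plateau, short decay at the end."""
    decay_start = int(max_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup from 0 to the peak LR
    if step < decay_start:
        return peak_lr                         # stable phase at the peak LR
    # final decay phase down to zero
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - decay_start))
```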
Open a second terminal while training runs:
tensorboard --logdir runs/
Then open http://localhost:6006 in your browser. Live graphs for loss, perplexity, learning rate, and tokens/sec.
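The perplexity curve carries the same information as the loss curve, since perplexity is just the exponential of the cross-entropy loss:

```python
import math

print(math.exp(3.0))  # a loss of 3.0 nats corresponds to a perplexity of ~20
```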
# Chat with the small model
python scripts/chat.py --checkpoint checkpoints/small/latest.pt
# Chat with nano (after a quick test run)
python scripts/chat.py --checkpoint checkpoints/nano/latest.pt
# Single prompt (non-interactive)
python scripts/chat.py --checkpoint checkpoints/small/latest.pt \
--prompt "The future of artificial intelligence" --max-tokens 200from neollm.inference.engine import InferenceEngine
engine = InferenceEngine.from_checkpoint(
checkpoint_path="checkpoints/small/latest.pt",
tokenizer_path="data/tokenizer.json",
)
response = engine.generate(
prompt="Once upon a time,",
max_new_tokens=100,
temperature=0.8,
top_p=0.9,
)
print(response)

The project ships with a ready-to-use Kaggle notebook (kaggle_neollm_small.ipynb). Kaggle provides free GPU sessions of up to 12 hours, with 30 GPU-hours per week.
- Upload this repository as a Kaggle Dataset
- Upload kaggle_neollm_small.ipynb as a new notebook
- Set Accelerator → GPU T4 and enable Internet
- Click Run All
- Download checkpoints/small/latest.pt from the Output tab when done
Re-upload the checkpoint next session — training resumes automatically from where it left off.
Checkpoint layout:

checkpoints/
├── nano/
│ ├── step_000100.pt
│ └── latest.pt ← always the most recent
├── small/
│ ├── step_002000.pt
│ ├── step_004000.pt
│ └── latest.pt
└── medium/
└── latest.pt
runs/ ← TensorBoard logs
├── nano/
└── small/
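Checkpoints are regular PyTorch .pt files, so you can inspect one with torch.load. The keys you will see depend on the Trainer; the snippet below makes no assumptions about them:

```python
import torch

ckpt = torch.load("checkpoints/small/latest.pt", map_location="cpu")
# Typically a dict holding model weights plus optimizer/step state for resuming.
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```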
All model and training settings are stored in YAML files under configs/. Key options:
# configs/small.yaml (excerpt)
train:
batch_size: 4 # increase if you have more VRAM
grad_accum_steps: 8 # effective batch = batch_size × grad_accum_steps
lr: 3.0e-4
max_steps: 200000
fp16: true # use bf16: true on Ampere+ GPUs (RTX 3000+)
gradient_checkpointing: false # set true to reduce VRAM at ~20% speed cost
save_every: 2000
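The grad_accum_steps option corresponds to the standard gradient-accumulation pattern: with batch_size 4 and grad_accum_steps 8, the optimizer sees an effective batch of 32 sequences. A minimal sketch of the idea (illustrative only, not the project's Trainer loop):

```python
import torch
import torch.nn.functional as F

def train_with_accumulation(model, optimizer, loader, grad_accum_steps: int = 8):
    """Accumulate gradients over several micro-batches before each optimizer step."""
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(loader):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        # Scale so the accumulated gradient averages over the effective batch.
        (loss / grad_accum_steps).backward()
        if (i + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```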
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | Batch too large | Reduce batch_size or set gradient_checkpointing: true |
| FileNotFoundError: tokenizer.json | Step 1 skipped | Run train_tokenizer.py first |
| No shard files found | Step 2 skipped | Run prepare_data.py first |
| Loss not decreasing | Wrong LR or data issue | Run --preset nano first to verify setup |
| No module named torch | Wrong Python | Activate your virtual environment |
Run the test suite with:
python -m pytest tests/ -v

Project layout:

NeoLLM/
├── neollm/
│ ├── config/ # ModelConfig, TrainConfig, DataConfig dataclasses
│ ├── model/ # RMSNorm, RoPE, Attention, FFN, Transformer, NeoLLM
│ ├── data/ # BPE tokenizer, dataset sharding, DataLoader
│ ├── training/ # Trainer, AdamW optimizer, WSD LR schedule, ModelEMA
│ ├── inference/ # Sampler (greedy/top-k/top-p), InferenceEngine
│ └── utils/ # Logging, metrics
├── scripts/
│ ├── train_tokenizer.py
│ ├── prepare_data.py
│ ├── train.py
│ └── chat.py
├── configs/
│ ├── nano.yaml # 22M params
│ ├── small.yaml # 125M params
│ └── medium.yaml # 370M params
├── tests/
├── kaggle_neollm_small.ipynb
├── inference_example.py
├── requirements.txt
└── pyproject.toml
Released under the MIT License.