Qwen3 Inference from Scratch

This repository records my implementation of a Qwen3-style LLM inference stack, built up from low-level tensor operations. The goal is to understand how modern decoder-only LLM inference works by building the core components directly rather than only calling high-level model APIs.

The project is based on the excellent skyzh/tiny-llm course, with my own step-by-step implementation, notes, tests, and future experiments.

What This Project Covers

Implemented so far:

Basic matrix APIs used by the model path
Scaled dot-product attention
Multi-head attention
Rotary positional encoding, including Qwen3 non-traditional RoPE
Grouped Query Attention
Causal attention masking for both training-style and KV-cache-style shapes
Qwen3 attention block with Q/K RMSNorm and RoPE
RMSNorm with float32 accumulation
Numerically stable SiLU
Qwen3 SwiGLU MLP
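
To illustrate the "non-traditional" RoPE item in the list above, here is a minimal pure-Python sketch, not the code in this repository. It assumes "non-traditional" means the half-split pairing, where element i is rotated together with element i + d/2 instead of with its neighbour. The function name `rope_half_split` and the default `base` value are illustrative assumptions; the real base should come from the model config.

```python
import math


def rope_half_split(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate one head vector using the half-split pairing (i, i + d/2).

    `base` is an illustrative default, not a value taken from this repo.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Each pair gets its own frequency; lower i rotates faster.
        theta = position * base ** (-2.0 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = b * cos_t + a * sin_t
    return out


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 2 tokens apart, so the scores should agree.
score_a = dot(rope_half_split(q, 5), rope_half_split(k, 3))
score_b = dot(rope_half_split(q, 12), rope_half_split(k, 10))
```

Because each pair is a plain 2D rotation, the Q/K dot product depends only on the distance between the two positions, which is the property attention relies on.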

Planned next:

Full Qwen3 transformer block
Token embedding and LM head path
Loading quantized Qwen3 MLX weights into the custom model
Text generation and sampling
KV cache, paged attention, continuous batching, and serving-oriented optimizations

Why This Project

I am using this project to learn the internals of LLM inference:

How Q/K/V projections become attention heads
Why GQA reduces KV memory and bandwidth
How RoPE injects token position into Q/K vectors
Why causal masks differ between full prefill and cached decoding
Where low precision is safe and where float32 accumulation is useful
How Qwen3-style blocks combine RMSNorm, attention, residuals, and SwiGLU MLPs
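
The causal-mask question in the list above can be made concrete with a small sketch. This is pure Python for readability rather than the tensor code used in the repository, and the helper name `causal_mask` is hypothetical. It assumes the queries are always the last `num_queries` tokens of a sequence with `num_keys` keys, which covers both full prefill and decoding with a KV cache.

```python
NEG_INF = float("-inf")


def causal_mask(num_queries: int, num_keys: int) -> list[list[float]]:
    """Additive mask of shape (num_queries, num_keys).

    Query row i is assumed to sit at absolute position
    (num_keys - num_queries + i), so it may attend to keys 0..that position.
    """
    offset = num_keys - num_queries
    return [
        [0.0 if j <= i + offset else NEG_INF for j in range(num_keys)]
        for i in range(num_queries)
    ]


# Full prefill: 3 queries over 3 keys gives the familiar lower triangle.
prefill = causal_mask(3, 3)

# Cached decoding: 1 new query over 4 keys (3 cached + itself) masks nothing.
decode = causal_mask(1, 4)

# Chunked case: 2 new queries over 4 keys; only the first row hides one key.
chunk = causal_mask(2, 4)
```

The mask is square only when no keys are cached; with a cache, the triangle is shifted right by the number of cached tokens.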

This is intended as a learning-oriented implementation that can grow into a compact inference engine.
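
As a back-of-the-envelope illustration of the GQA point above, the sketch below compares KV-cache size when every query head has its own K/V head against the case where several query heads share one. The configuration numbers are hypothetical and chosen only for the arithmetic; `kv_cache_bytes` is an illustrative helper, not a function from this repository.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, for illustration only.
NUM_LAYERS, NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN = 28, 16, 8, 128, 4096

# MHA-style: as many K/V heads as query heads.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM, SEQ_LEN)
# GQA: query heads share a smaller set of K/V heads.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA-style cache: {mha_bytes / 2**20:.0f} MiB")
print(f"GQA cache:       {gqa_bytes / 2**20:.0f} MiB")
print(f"reduction:       {mha_bytes / gqa_bytes:.1f}x")
```

The saving scales with the ratio of query heads to K/V heads, and the same ratio applies to the bytes read from the cache at every decode step.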

Current Test Status

The completed Week 1 tasks pass locally:

pdm run test --week 1 --day 1
pdm run test --week 1 --day 2
pdm run test --week 1 --day 3
pdm run test --week 1 --day 4

Recent local results:

Week 1 Day 3: 60 passed
Week 1 Day 4: 22 passed

Setup

Install dependencies:

pdm install

Check the environment:

pdm run check-installation

Run tests for a specific chapter:

pdm run test --week 1 --day 4

Repository Layout

src/tiny_llm/
  attention.py              attention, GQA, causal mask
  positional_encoding.py    RoPE
  layer_norm.py             RMSNorm
  basics.py                 linear, softmax, SiLU
  qwen3_week1.py            Qwen3 attention, MLP, transformer/model path

Reference implementations and tests from the course are kept in:

src/tiny_llm_ref/
tests_refsol/
book/

Project notes written during implementation are organized in:

project-notes/
  resume/          resume-ready summaries and measurable results
  benchmarks/      benchmark reports and performance interpretation
  optimization/    future optimization backlog and bottleneck analysis

Attribution

This project is built while following tiny-llm - LLM Serving in a Week by skyzh. The original course repository is skyzh/tiny-llm.

My work in this repository focuses on implementing, understanding, testing, and documenting the inference components step by step.
