KazeBox33/CS336-learn

CS336-learn

This repository tracks my self-study of Stanford CS336: Language Modeling from Scratch.

The focus is Assignment 1 (A1), which covers the full path from raw text to training a small Transformer LM.

Goals

  • Understand, from an engineering perspective, the end-to-end language modeling pipeline: raw corpus -> tokenizer -> serialized token IDs -> Transformer LM -> training -> evaluation -> generation.
  • Implement core components by hand instead of treating them as black boxes or calling only high-level framework APIs.
  • Practice debugging, testing, profiling, and performance-aware implementation.

Current Progress

Completed

  • Set up the CS336 Assignment 1 workspace under assignment1-basics/.
  • Implemented byte-level BPE tokenizer training with special-token-aware pre-tokenization: the text is split on special tokens, then pre-tokenized with the GPT-2 regex.
  • Added multiprocessing pre-tokenization and incremental pair-count updates for faster BPE training.
  • Trained a TinyStories BPE tokenizer with a vocabulary size of 10,000 that keeps the <|endoftext|> special token.
  • Implemented a BPE Tokenizer with encode, decode, special-token preservation, and a streaming encode_iterable.
  • Passed the tokenizer and BPE training tests.
  • Implemented the core Transformer LM components: Linear, Embedding, RMSNorm, SwiGLU, RoPE, causal multi-head self-attention, Transformer block, and Transformer LM.
  • Implemented a numerically stable softmax and a core cross_entropy loss.
  • Passed the relevant model architecture and loss tests.
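
As a rough illustration of the BPE training loop with incremental pair-count updates listed above, here is a minimal pure-Python sketch. It is not the code in this repository: the names train_bpe_sketch and merge_word are hypothetical, the tie-breaking rule (prefer the lexicographically greater pair) is an assumption about the assignment's convention, and pre-tokenization, special tokens, and multiprocessing are left out.

```python
from collections import Counter


def merge_word(word, pair, merged):
    """Replace every occurrence of `pair` in `word` with the token `merged`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe_sketch(word_counts, num_merges):
    """Toy BPE trainer. `word_counts` maps a tuple of byte tokens to a frequency."""
    words = dict(word_counts)
    # Initial pair counts, weighted by how often each word occurs.
    pair_counts = Counter()
    for word, freq in words.items():
        for pair in zip(word, word[1:]):
            pair_counts[pair] += freq

    merges = []
    for _ in range(num_merges):
        if not pair_counts:
            break
        # Most frequent pair; ties go to the greater pair (assumed rule).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = best[0] + best[1]

        updated = {}
        for word, freq in words.items():
            if best not in zip(word, word[1:]):
                updated[word] = updated.get(word, 0) + freq
                continue
            # Incremental update: only words containing the pair are recounted.
            for pair in zip(word, word[1:]):
                pair_counts[pair] -= freq
            new_word = merge_word(word, best, merged)
            for pair in zip(new_word, new_word[1:]):
                pair_counts[pair] += freq
            updated[new_word] = updated.get(new_word, 0) + freq
        words = updated
        # Drop pairs that no longer occur anywhere.
        for pair in [p for p, c in pair_counts.items() if c <= 0]:
            del pair_counts[pair]
    return merges


toy_corpus = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"w", b"i", b"d", b"e", b"s", b"t"): 3,
    (b"n", b"e", b"w", b"e", b"s", b"t"): 6,
}
toy_merges = train_bpe_sketch(toy_corpus, num_merges=3)
```

The point of the incremental update is that each merge touches only the words that contain the merged pair, instead of recounting every pair in the corpus on every iteration.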

Validation:

uv run pytest tests/test_train_bpe.py -q
uv run pytest tests/test_tokenizer.py -q
uv run pytest tests/test_model.py::test_transformer_lm tests/test_model.py::test_transformer_lm_truncated_input -q
uv run pytest tests/test_nn_utils.py::test_cross_entropy -q
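
For reference, the idea behind a numerically stable softmax and cross-entropy can be sketched in plain Python. This is a minimal illustration on Python lists rather than tensors, not the tested implementation; the function names and the single-example signature are assumptions. The trick in both functions is to subtract the maximum logit before exponentiating so that exp never overflows.

```python
import math


def stable_softmax(logits):
    """Softmax over a list of floats, shifted by the max logit for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of `target`, computed via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    # -log softmax(logits)[target] = logsumexp(logits) - logits[target]
    return log_sum_exp - logits[target]


# Large logits would overflow a naive exp(1000.0); the shifted version is fine.
probs = stable_softmax([1000.0, 1000.0])
loss = cross_entropy([1000.0, 0.0], target=0)
```

Computing the loss from the log-sum-exp directly, instead of taking the log of the softmax output, also avoids log(0) when a probability underflows.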

In Progress

  • Working through the training components in Part 4 of A1.
  • Current focus: implementing a custom AdamW optimizer while understanding PyTorch optimizer state dictionaries, parameter groups, and first- and second-moment estimates.
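
To make the moment estimates concrete, here is a minimal sketch of one AdamW update on a flat list of floats. It follows the update order given in the PyTorch AdamW documentation (decoupled weight decay first, then the bias-corrected Adam step); the assignment handout may arrange the bias correction differently. The function name adamw_step and the plain-dict state are assumptions for illustration; a real implementation would subclass torch.optim.Optimizer and keep per-parameter state.

```python
import math


def adamw_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Apply one in-place AdamW update to a flat list of float parameters."""
    beta1, beta2 = betas
    n = len(params)
    # Create the optimizer state lazily on the first call.
    if not state:
        state.update({"t": 0, "m": [0.0] * n, "v": [0.0] * n})
    state["t"] += 1
    t, m, v = state["t"], state["m"], state["v"]

    for i in range(n):
        g = grads[i]
        # Decoupled weight decay: shrink the parameter directly,
        # rather than adding weight_decay * param to the gradient.
        params[i] -= lr * weight_decay * params[i]
        # Exponential moving averages of the gradient and its square.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Minimize f(x) = x^2 (gradient 2x) for a few steps, without weight decay.
x = [1.0]
opt_state = {}
for _ in range(50):
    adamw_step(x, [2 * x[0]], opt_state, lr=0.01, weight_decay=0.0)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's scale.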

Next

  • Finish AdamW.step() and connect the optimizer adapter.
  • Implement the learning-rate schedule, gradient clipping, data loading, checkpointing, and the training loop.
  • Return to tokenizer experiments when needed: compression ratio, throughput, an OpenWebText tokenizer, and token ID serialization.
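
Two of the upcoming pieces, the learning-rate schedule and gradient clipping, are small enough to sketch ahead of time. The sketch below assumes a cosine schedule with linear warmup and clipping by global L2 norm, which are common choices but not confirmed here as the final design. All function and parameter names are hypothetical, and gradients are a flat list of floats rather than tensors.

```python
import math


def cosine_lr(step, max_lr, min_lr, warmup_steps, cosine_steps):
    """Linear warmup to max_lr, cosine decay to min_lr, then constant min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= cosine_steps:
        return min_lr
    progress = (step - warmup_steps) / (cosine_steps - warmup_steps)
    return min_lr + 0.5 * (1 + math.cos(math.pi * progress)) * (max_lr - min_lr)


def clip_gradients(grads, max_norm, eps=1e-6):
    """Rescale `grads` so their global L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        return [g * scale for g in grads], total_norm
    return list(grads), total_norm


# Learning rate at a few points of a 100-step schedule with 10 warmup steps.
schedule = [cosine_lr(s, 1.0, 0.1, 10, 100) for s in (0, 5, 10, 55, 100, 500)]
clipped, norm_before = clip_gradients([3.0, 4.0], max_norm=1.0)
```

Clipping by the global norm rescales all gradients by the same factor, so the update direction is preserved and only its length is capped.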

Notes

  • This repository is for self-directed learning and implementation practice.
  • Large local datasets and generated outputs, including training outputs, are ignored by git.
