This page covers KempnerForge's data pipeline: the dataset types, the sampler that partitions work across data-parallel ranks, the stateful dataloader that makes checkpoint resume possible, and the mixing/annealing hooks used for curriculum training.
Based on the config, the entry point `scripts/train.py` picks one of four paths: pre-tokenized mmap, HuggingFace eager, HuggingFace streaming, or multi-source mixture.
| Config path | Dataset class | Sampler | Notes |
|---|---|---|---|
| `data.dataset_path = "..."` | `MemoryMappedDataset` | `DistributedSampler` | Pre-tokenized `.npy` / `.bin` shards. Fastest path. |
| `data.hf_dataset_name = "..."` (+ `hf_streaming = false`) | `HuggingFaceDataset` | `DistributedSampler` | Eager download, tokenize once, pack into RAM. |
| `data.hf_dataset_name = "..."` + `hf_streaming = true` | `StreamingHuggingFaceDataset` | (none; `IterableDataset`) | On-the-fly tokenization from a streaming Hub dataset. |
| `data.datasets = [...]` (non-empty) | `MixtureDataset` over sub-datasets | `MixtureSampler` | Weighted mixing across multiple sources; phase transitions rebalance weights. |
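To make the dispatch concrete, here is a minimal sketch of the selection logic. The helper name and the precedence order (mixture first, then streaming, eager HF, and mmap) are assumptions for illustration; consult `scripts/train.py` for the authoritative logic:

```python
from kempnerforge.config.data import DataConfig

def select_data_path(cfg: DataConfig) -> str:
    """Name the pipeline path a config resolves to (illustrative only)."""
    if cfg.datasets:  # non-empty list of sources -> weighted mixture
        return "MixtureDataset + MixtureSampler"
    if cfg.hf_dataset_name and cfg.hf_streaming:
        return "StreamingHuggingFaceDataset (IterableDataset, no sampler)"
    if cfg.hf_dataset_name:
        return "HuggingFaceDataset + DistributedSampler"
    if cfg.dataset_path:
        return "MemoryMappedDataset + DistributedSampler"
    raise ValueError("DataConfig does not select any data source")
```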
All of the map-style datasets and samplers implement `state_dict()` /
`load_state_dict()`. `StatefulDataLoader` wraps them and tracks the
batch position inside an epoch so training can resume exactly after a
checkpoint; see Stateful dataloader.
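As a concrete illustration of the resume path, here is a hedged sketch of how the dataloader state could be folded into a checkpoint alongside model and optimizer state. Note that per the Checkpointing note at the bottom of this page, the shipped training loop does not yet wire this in; the function names and checkpoint keys below are assumptions:

```python
import torch

def save_checkpoint(path, model, optimizer, dataloader):
    # StatefulDataLoader.state_dict() captures the wrapped sampler state
    # plus the batch position inside the current epoch.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "dataloader": dataloader.state_dict(),
        },
        path,
    )

def load_checkpoint(path, model, optimizer, dataloader):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    dataloader.load_state_dict(ckpt["dataloader"])  # resume mid-epoch
```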
Everything here is driven by `DataConfig` (`kempnerforge/config/data.py`):

```toml
[data]
dataset_path = ""              # pre-tokenized shard directory
file_pattern = "*.npy"         # glob for shard files (also "*.bin")
tokenizer_path = ""            # HF tokenizer name or local path
num_workers = 4
pin_memory = true
prefetch_factor = 2
pack_sequences = false         # cross-doc label masking via EOS

# HuggingFace path
hf_dataset_name = ""           # default: None (str | None)
hf_dataset_config = ""         # default: None (str | None)
hf_dataset_split = "train"
hf_dataset_text_field = "text"
hf_streaming = false

# Multi-dataset mixing
datasets = []                  # list of DatasetSource tables
mix_temperature = 1.0          # weight scaling (1.0 = as-is, >1 = more uniform)

# Phase scheduling (step-triggered weight + LR transitions)
phases = []
anneal_start_step = 0          # shortcut for a common 2-phase pattern
anneal_weights = {}
```

See Configuration § `[data]` — DataConfig for the per-field reference.
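For intuition on `mix_temperature`, the comment above describes standard temperature scaling of the source weights (raise each to 1/T, then renormalize). Whether `MixtureSampler` implements exactly this formula is an assumption, but the sketch matches the documented behavior at T = 1 and T > 1:

```python
def temperature_scale(weights: list[float], temperature: float) -> list[float]:
    """Rescale mixture weights: T = 1 leaves them as-is, T > 1 flattens them."""
    scaled = [w ** (1.0 / temperature) for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]

print(temperature_scale([0.8, 0.2], 1.0))  # [0.8, 0.2]   (unchanged)
print(temperature_scale([0.8, 0.2], 2.0))  # ~[0.67, 0.33] (more uniform)
```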
```{toctree}
:maxdepth: 1

memory-mapped
huggingface
mixing-and-annealing
stateful-dataloader
sampler
```
- Checkpointing § Train state — dataloader state infrastructure exists but the shipped training loop doesn't wire it into the save; resume restarts the epoch.
- Training § Training loop — where the dataloader is consumed step by step.
- Configuration § `[data]` — DataConfig — every knob on this page.