The `configs/train/` directory ships 17 training presets. Most are named `<model>_<gpu-count>_<parallelism>.toml`, so the filename doubles as a recipe. The `configs/model/` directory ships 4 model-only presets used by `scripts/convert_checkpoint.py` for DCP ↔ HuggingFace conversion.
| File | Purpose | Model | GPUs | Parallelism | Notes |
|---|---|---|---|---|---|
| `debug.toml` | Tiny model, fast smoke test | 256-dim, 4-layer | 1+ | FSDP (`dp_shard=-1`) | 100 steps, `compile_model=false`, `act_ckpt="none"` |
| `hf_wikitext.toml` | End-to-end HuggingFace streaming | 512-dim, 8-layer | 1+ | FSDP (`dp_shard=-1`) | GPT-2 tokenizer, wikitext-103-raw-v1, 500 steps |
| `7b.toml` | General-purpose Llama-3 7B | 7B | any | FSDP (`dp_shard=-1`) | 100K steps, `compile_model=true`, full AC |
| `7b_32gpu_fsdp.toml` | Baseline multi-node 7B | 7B | 32 | pure FSDP | 2M tokens/step, simplest multi-node recipe |
| `7b_12gpu_tp4.toml` | 7B with intra-node TP | 7B | 12 | TP=4 × FSDP=3 | TP within node (NVLink), FSDP across (IB) |
| `7b_16gpu_adamw.toml` | Long preemptible 7B run | 7B | 16 | FSDP (`dp_shard=-1`) | 100K steps, 210B tokens, ckpt every 500 steps |
| `7b_16gpu_fp8.toml` | FP8 compute + FSDP2 float8 AG | 7B | 16 | FSDP (`dp_shard=-1`) | `mixed_precision="fp8"`, E4M3 fwd / E5M2 bwd |
| `7b_16gpu_muon.toml` | Muon optimizer + z-loss + chunked CE | 7B | 16 | FSDP (`dp_shard=-1`) | Tests Muon, chunked cross-entropy, z-loss together |
| `13b_24gpu_validation.toml` | Full-stack validation run | 13B | 24 | TP=4 × FSDP=6 | WandB, profiling, eval, HF tokenizer, 1000 steps |
| `13b_32gpu_tp4_pp2.toml` | 13B with pipeline parallel | 13B | 32 | TP=4 × PP=2 × FSDP=4 | 40 layers → 20 per PP stage, 1f1b schedule (sketched below) |
| `29b_32gpu_tp4_pp2.toml` | Custom 29B sized for H200 140GB | 29B | 32 | TP=4 × PP=2 × FSDP=4 | dim=6144, 56 layers, saturates 120 GB/GPU |
| `70b_32gpu_tp4.toml` | 70B without PP bubble | 70B | 32 | TP=4 × FSDP=8 | No PP — fits via FSDP sharding alone |
| `70b_32gpu_tp4_pp4.toml` | 70B when memory is tight | 70B | 32 | TP=4 × PP=4 × FSDP=2 | 80 layers → 20 per PP stage, less FSDP sharding |
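To make the multi-dimension recipes concrete, here is a hedged sketch of how the parallelism block behind something like `13b_32gpu_tp4_pp2.toml` might look. The degrees (TP=4, PP=2, FSDP=4 on 32 GPUs) and the 1f1b schedule come from the table; the `[parallelism]` section name and the `pipeline_parallel_schedule` key are assumptions for illustration.

```toml
# Sketch only; the section name and the schedule key are assumed, not copied from the repo.
[parallelism]
tp = 4            # tensor parallel, kept within a node (NVLink)
pp = 2            # pipeline parallel: 40 layers -> 20 per stage
dp_shard = 4      # FSDP across the rest; dp_shard = -1 would resolve to 32 / (4 * 2) = 4 as well
# pipeline_parallel_schedule = "1f1b"   # assumed key name for the 1f1b schedule
```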
| File | Purpose | Model | GPUs | Parallelism | MoE settings |
|---|---|---|---|---|---|
| `debug_moe.toml` | Tiny MoE smoke test | 256-dim, 4-layer | 1+ | FSDP (`dp_shard=-1`) | 4 experts, top-2, `moe_frequency=2` |
| `moe_8gpu_stress.toml` | Saturate 2 nodes with MoE | ~4B total / 1.8B active | 8 | TP=4 × FSDP=2 | 8 experts, top-2, `moe_frequency=1` |
| `moe_24gpu.toml` | 24-GPU MoE stress test | ~7B total / 1.8B active | 24 | TP=4 × FSDP=6 | 8 experts, top-2, `grad_accum=32` for full saturation |
| `moe_ep_32gpu.toml` | MoE + Expert Parallel | ~4B total / 1.8B active | 32 | TP=4 × EP=2 × FSDP=4 | 8 experts, top-2, all-to-all across IB |
See the MoE Expert Parallel benchmark for the numbers the last config produced.
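The MoE-specific knobs above boil down to a handful of TOML fields. A hedged sketch for a config like `debug_moe.toml` follows: only `moe_frequency`, `compile_model`, and the values (4 experts, top-2) come from the table and the notes below; the section names and the `num_experts` / `top_k` key names are assumptions.

```toml
# Sketch only; section names and num_experts / top_k are assumed key names.
[model]
moe_frequency = 2      # every second layer is an MoE layer (per the table)
num_experts = 4        # assumed key name for "4 experts"
top_k = 2              # assumed key name for "top-2" routing

[training]
compile_model = false  # data-dependent routing shapes break torch.compile (see the notes below)
```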
These don't include training fields — they're loaded by `scripts/convert_checkpoint.py` to describe the architecture when round-tripping checkpoints to HuggingFace.
| File | Architecture |
|---|---|
| `llama_7b.toml` | dim=4096, 32 layers, 32 heads, 8 kv-heads, ffn_dim_multiplier=1.3 |
| `llama_13b.toml` | dim=5120, 40 layers, 40 heads, 8 kv-heads, ffn_dim_multiplier=1.3 |
| `llama_70b.toml` | dim=8192, 80 layers, 64 heads, 8 kv-heads, ffn_hidden_dim=28672 |
| `moe_small.toml` | dim=2048, 24 layers, 16 heads, 4 kv-heads, 8 experts top-2 |
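For contrast with the training presets, a model-only preset is just the architecture description. A hedged sketch of `llama_7b.toml` based on the row above: `dim` and `ffn_dim_multiplier` appear in the table, while the layer/head key names are assumptions.

```toml
# Sketch only; n_layers / n_heads / n_kv_heads are assumed key names.
dim = 4096
n_layers = 32             # "32 layers" in the table
n_heads = 32              # "32 heads"
n_kv_heads = 8            # "8 kv-heads"
ffn_dim_multiplier = 1.3
# No training, parallelism, or dataset fields: the file only describes the architecture.
```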
- `dp_shard = -1` in a config means "fill the remaining mesh dimension with FSDP" — the loader resolves this to `world_size / (dp_replicate·tp·pp·cp·ep)`. Most single-dimension configs (7B, FP8, Muon, debug) use `-1` so the same config works on any GPU count (worked example below).
- `compile_model = false` is set on every MoE config. Routing produces data-dependent shapes that break `torch.compile`'s graph — `JobConfig.validate(world_size)` logs a warning (not an error) if you combine them.
- Paths in these configs (`dataset_path = "/path/to/..."`) are placeholders — replace them with a real tokenized shard directory before running. The `hf_wikitext.toml` config is the only one that runs end-to-end without path edits (it streams from the HF Hub).
- Short `max_steps` (20–100) in the multi-node configs is a benchmark-sizing default, not a training budget. Override with `--train.max_steps=…` for real runs.
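As a worked example of the `-1` resolution, take `7b_12gpu_tp4.toml` from the table (TP=4 × FSDP=3 on 12 GPUs). The section names below are assumptions; the arithmetic follows the formula above.

```toml
[parallelism]                  # assumed section name
tp = 4
dp_shard = -1                  # 12 GPUs / (tp=4, all other mesh dims at 1) -> 3 FSDP shards

[data]                         # assumed section name
dataset_path = "/path/to/..."  # placeholder; point this at a real tokenized shard directory
```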
- Parallelism recipes — same data, indexed by (model, GPU count) rather than by filename.
- Benchmarks — measured throughput for the configs that were benchmarked end-to-end.
- Config sections — the fields every TOML key maps to.