# Available configs

The `configs/train/` directory ships 17 training presets. Most are named `<model>_<gpu-count>_<parallelism>.toml`, so the filename doubles as a recipe. The `configs/model/` directory ships 4 model-only presets used by `scripts/convert_checkpoint.py` for DCP ↔ HuggingFace conversion.

## Dense training configs

| File | Purpose | Model | GPUs | Parallelism | Notes |
| --- | --- | --- | --- | --- | --- |
| `debug.toml` | Tiny model, fast smoke test | 256-dim, 4-layer | 1+ | FSDP (`dp_shard=-1`) | 100 steps, `compile_model=false`, `act_ckpt="none"` |
| `hf_wikitext.toml` | End-to-end HuggingFace streaming | 512-dim, 8-layer | 1+ | FSDP (`dp_shard=-1`) | GPT-2 tokenizer, wikitext-103-raw-v1, 500 steps |
| `7b.toml` | General-purpose Llama-3 7B | 7B | any | FSDP (`dp_shard=-1`) | 100K steps, `compile_model=true`, full AC |
| `7b_32gpu_fsdp.toml` | Baseline multi-node 7B | 7B | 32 | pure FSDP | 2M tokens/step, simplest multi-node recipe |
| `7b_12gpu_tp4.toml` | 7B with intra-node TP | 7B | 12 | TP=4 × FSDP=3 | TP within node (NVLink), FSDP across nodes (IB) |
| `7b_16gpu_adamw.toml` | Long preemptible 7B run | 7B | 16 | FSDP (`dp_shard=-1`) | 100K steps, 210B tokens, checkpoint every 500 steps |
| `7b_16gpu_fp8.toml` | FP8 compute + FSDP2 float8 all-gather | 7B | 16 | FSDP (`dp_shard=-1`) | `mixed_precision="fp8"`, E4M3 fwd / E5M2 bwd |
| `7b_16gpu_muon.toml` | Muon optimizer + z-loss + chunked CE | 7B | 16 | FSDP (`dp_shard=-1`) | Tests Muon, chunked cross-entropy, z-loss together |
| `13b_24gpu_validation.toml` | Full-stack validation run | 13B | 24 | TP=4 × FSDP=6 | WandB, profiling, eval, HF tokenizer, 1000 steps |
| `13b_32gpu_tp4_pp2.toml` | 13B with pipeline parallelism | 13B | 32 | TP=4 × PP=2 × FSDP=4 | 40 layers → 20 per PP stage, 1F1B schedule |
| `29b_32gpu_tp4_pp2.toml` | Custom 29B sized for H200 140GB | 29B | 32 | TP=4 × PP=2 × FSDP=4 | dim=6144, 56 layers, saturates 120 GB/GPU |
| `70b_32gpu_tp4.toml` | 70B without a PP bubble | 70B | 32 | TP=4 × FSDP=8 | No PP; fits via FSDP sharding alone |
| `70b_32gpu_tp4_pp4.toml` | 70B when memory is tight | 70B | 32 | TP=4 × PP=4 × FSDP=2 | 80 layers → 20 per PP stage, less FSDP sharding |
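
To make the Parallelism column concrete, here is a minimal sketch of what the mesh block of `13b_32gpu_tp4_pp2.toml` might look like. The `[parallelism]` section name is an assumption for illustration; the key names mirror the `dp_replicate`/`tp`/`pp`/`cp`/`ep` terms used in the `dp_shard` formula under Conventions below.

```toml
# Hypothetical sketch of the mesh block for 13b_32gpu_tp4_pp2.toml.
# [parallelism] is an assumed section name; degrees come from the table above.
[parallelism]
tp = 4        # tensor parallel within a node (NVLink)
pp = 2        # two pipeline stages: 40 layers -> 20 per stage
dp_shard = 4  # FSDP across the remaining ranks: 32 / (4 * 2) = 4
```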

## MoE training configs

| File | Purpose | Model | GPUs | Parallelism | MoE |
| --- | --- | --- | --- | --- | --- |
| `debug_moe.toml` | Tiny MoE smoke test | 256-dim, 4-layer | 1+ | FSDP (`dp_shard=-1`) | 4 experts, top-2, `moe_frequency=2` |
| `moe_8gpu_stress.toml` | Saturate 2 nodes with MoE | ~4B total / 1.8B active | 8 | TP=4 × FSDP=2 | 8 experts, top-2, `moe_frequency=1` |
| `moe_24gpu.toml` | 24-GPU MoE stress test | ~7B total / 1.8B active | 24 | TP=4 × FSDP=6 | 8 experts, top-2, `grad_accum=32` for full saturation |
| `moe_ep_32gpu.toml` | MoE + Expert Parallel | ~4B total / 1.8B active | 32 | TP=4 × EP=2 × FSDP=4 | 8 experts, top-2, all-to-all across IB |

See the MoE Expert Parallel benchmark for the numbers the last config produced.
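
As a hedged sketch, the MoE-specific knobs in `debug_moe.toml` might look like the following. Only `moe_frequency` and `compile_model` are named on this page; the `[moe]` and `[training]` section names and the `num_experts`/`top_k` keys are illustrative assumptions.

```toml
# Hypothetical sketch of the MoE block in debug_moe.toml
# ("4 experts, top-2, moe_frequency=2" per the table above).
[moe]
num_experts = 4     # experts per MoE layer (assumed key name)
top_k = 2           # each token routes to its top-2 experts (assumed key name)
moe_frequency = 2   # every 2nd layer is an MoE layer

[training]
compile_model = false  # routing's data-dependent shapes break torch.compile
```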

## Model-only configs

These don't include training fields; they're loaded by `scripts/convert_checkpoint.py` to describe the architecture when round-tripping checkpoints to HuggingFace.

| File | Architecture |
| --- | --- |
| `llama_7b.toml` | dim=4096, 32 layers, 32 heads, 8 kv-heads, ffn_dim_multiplier=1.3 |
| `llama_13b.toml` | dim=5120, 40 layers, 40 heads, 8 kv-heads, ffn_dim_multiplier=1.3 |
| `llama_70b.toml` | dim=8192, 80 layers, 64 heads, 8 kv-heads, ffn_hidden_dim=28672 |
| `moe_small.toml` | dim=2048, 24 layers, 16 heads, 4 kv-heads, 8 experts top-2 |
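
For reference, a model-only preset such as `llama_7b.toml` likely contains little more than the architecture fields from the table. The values below come straight from that table; the Llama-style key names (`dim`, `n_layers`, ...) and the `[model]` section name are assumptions.

```toml
# Hypothetical contents of configs/model/llama_7b.toml.
# Values are from the table above; key names are assumed Llama conventions.
[model]
dim = 4096
n_layers = 32
n_heads = 32
n_kv_heads = 8
ffn_dim_multiplier = 1.3
# No training/optimizer sections: convert_checkpoint.py only needs the architecture.
```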

## Conventions

- `dp_shard = -1` in a config means "fill the remaining mesh dimension with FSDP": the loader resolves it to `world_size / (dp_replicate · tp · pp · cp · ep)`. Most single-dimension configs (7B, FP8, Muon, debug) use `-1` so the same config works on any GPU count (see the sketch after this list).
- `compile_model = false` is set on every MoE config. Routing produces data-dependent shapes that break `torch.compile`'s graph; `JobConfig.validate(world_size)` logs a warning (not an error) if you combine them.
- Paths in these configs (`dataset_path = "/path/to/..."`) are placeholders; replace them with a real tokenized shard directory before running. `hf_wikitext.toml` is the only config that runs end-to-end without path edits (it streams from the HF Hub).
- Short `max_steps` values (20–100) in the multi-node configs are a benchmark-sizing default, not a training budget. Override with `--train.max_steps=…` for real runs.
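
To illustrate the first convention, here is how `dp_shard = -1` would resolve under the formula above. The `[parallelism]` section name is an assumption; `dp_shard` and the resolution rule come from this page.

```toml
# Hypothetical excerpt showing dp_shard = -1 resolution, using the formula
# dp_shard = world_size / (dp_replicate * tp * pp * cp * ep).
[parallelism]
dp_shard = -1   # with all other degrees at 1:
                #   8 GPUs  -> dp_shard = 8  / 1 = 8
                #   32 GPUs -> dp_shard = 32 / 1 = 32
```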

## See also

- Parallelism recipes: the same data, indexed by (model, GPU count) rather than by filename.
- Benchmarks: measured throughput for the configs that were benchmarked end-to-end.
- Config sections: the fields every TOML key maps to.