Training a frontier-scale Transformer is itself a substantial systems undertaking. Modern pretraining runs combine data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism, often coordinated through libraries such as PyTorch FSDP, Megatron, and DeepSpeed ZeRO. Practitioners must balance compute and memory carefully, choosing micro-batch sizes that maximize accelerator utilization without exceeding device memory, designing checkpointing schemes that survive node failures over runs that can last for months, and overlapping communication with computation to hide network latency. Activation checkpointing trades extra computation for reduced memory pressure, while mixed-precision training with bfloat16 or FP8 formats shrinks memory bandwidth requirements and unlocks newer hardware features.
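The micro-batch/memory tradeoff above can be sketched with back-of-envelope arithmetic. The following is an illustrative, simplified model (not code from FSDP, Megatron, or DeepSpeed): it counts only one activation tensor per layer, uses made-up example dimensions, and assumes full checkpointing keeps a single layer's activations live at a time while the rest are recomputed during the backward pass.

```python
def activation_bytes(layers, seq_len, hidden, micro_batch,
                     bytes_per_elem=2, checkpoint=False):
    """Rough activation footprint of a Transformer stack.

    Without checkpointing, every layer's activations stay resident for
    the backward pass. With full activation checkpointing, roughly one
    layer's activations are live at a time; the others are recomputed,
    trading extra FLOPs for memory. Real layers store several tensors
    (attention scores, MLP intermediates), so this undercounts.
    """
    per_layer = seq_len * hidden * micro_batch * bytes_per_elem
    live_layers = 1 if checkpoint else layers
    return per_layer * live_layers


def largest_micro_batch(budget_bytes, layers, seq_len, hidden, checkpoint):
    """Largest micro-batch whose activations fit in the given budget."""
    mb = 1
    while activation_bytes(layers, seq_len, hidden, mb + 1,
                           checkpoint=checkpoint) <= budget_bytes:
        mb += 1
    return mb


if __name__ == "__main__":
    budget = 8 * 1024**3              # pretend 8 GiB remains for activations
    layers, seq_len, hidden = 32, 4096, 4096   # hypothetical model shape
    plain = largest_micro_batch(budget, layers, seq_len, hidden, False)
    ckpt = largest_micro_batch(budget, layers, seq_len, hidden, True)
    print(plain, ckpt)
```

Under these toy numbers, checkpointing raises the feasible micro-batch from 8 to 256, which is why the technique is standard at frontier scale despite the roughly one-third extra forward compute it incurs.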