The parallelism families — FSDP2, tensor parallelism, expert
parallelism, pipeline parallelism, FP8 — and the DeviceMesh that
composes them.
:maxdepth: 1
device-mesh
fsdp2
tensor-parallelism
expert-parallelism
pipeline-parallelism
fp8
| Page | Covers |
|---|---|
| Device mesh | Mesh construction, dimension order, sub-mesh extraction (get_dp_mesh / get_tp_mesh / get_pp_mesh), example meshes from shipped configs |
| FSDP2 | Composable fully_shard(), per-block + top-level wrap, mixed-precision policy, reshard_after_forward, EP-MoE per-sub-module wrapping, activation checkpointing modes |
| Tensor parallelism | ColwiseParallel / RowwiseParallel / SequenceParallel plan, forward / post-hooks on embeddings + attention + MLP, meta-device init, head divisibility |
| Expert parallelism | All-to-all dispatch + combine, unused-expert grad kludge, EP + TP + FSDP composition, gradient scaling |
| Pipeline parallelism | Layer assignment, PipelineStageModule, 1F1B / GPipe / interleaved schedules, per-stage DCP checkpointing |
| FP8 | torchao.float8 conversion, filter rules (no experts / router), enable_fsdp_float8_all_gather, TP incompatibility |
- Architecture § Parallelism order — the five-step apply sequence and the reasoning behind it.
- Configuration § DistributedConfig
— the dataclass with
dp_replicate,dp_shard,tp,pp,cp,ep,pp_schedule. - Configuration § Validation rules — arithmetic checks, head divisibility, FP8 + TP, MoE + PP, tie-embeddings + PP.
- Reference § Parallelism recipes — which combinations work at which scales.
- Reference § Benchmarks — measured MFU and the MoE per-sub-module FSDP fix.
- NCCL health / liveness — the all-reduce heartbeat and NaN detection are in Resilience.
- Checkpointing under PP / TP / FSDP — mesh-scoped DCP groups and resharding are in Checkpointing.
- Context parallelism —
cpis declared in config and checked in validation but has noapply_context_parallelyet; see Device mesh § Context parallelism.