Skip to content

Latest commit

 

History

History
52 lines (43 loc) · 2.52 KB

File metadata and controls

52 lines (43 loc) · 2.52 KB

Distributed

The parallelism families — FSDP2, tensor parallelism, expert parallelism, pipeline parallelism, FP8 — and the DeviceMesh that composes them.

:maxdepth: 1

device-mesh
fsdp2
tensor-parallelism
expert-parallelism
pipeline-parallelism
fp8

What lives where

Page Covers
Device mesh Mesh construction, dimension order, sub-mesh extraction (get_dp_mesh / get_tp_mesh / get_pp_mesh), example meshes from shipped configs
FSDP2 Composable fully_shard(), per-block + top-level wrap, mixed-precision policy, reshard_after_forward, EP-MoE per-sub-module wrapping, activation checkpointing modes
Tensor parallelism ColwiseParallel / RowwiseParallel / SequenceParallel plan, forward / post-hooks on embeddings + attention + MLP, meta-device init, head divisibility
Expert parallelism All-to-all dispatch + combine, unused-expert grad kludge, EP + TP + FSDP composition, gradient scaling
Pipeline parallelism Layer assignment, PipelineStageModule, 1F1B / GPipe / interleaved schedules, per-stage DCP checkpointing
FP8 torchao.float8 conversion, filter rules (no experts / router), enable_fsdp_float8_all_gather, TP incompatibility

Cross-cutting references

Not covered here

  • NCCL health / liveness — the all-reduce heartbeat and NaN detection are in Resilience.
  • Checkpointing under PP / TP / FSDP — mesh-scoped DCP groups and resharding are in Checkpointing.
  • Context parallelismcp is declared in config and checked in validation but has no apply_context_parallel yet; see Device mesh § Context parallelism.