Pass full_param_layout into DDP (Megatron-LM #3812) #3483
Open
Labels
- area:training: Training loop, callbacks, and runtime integration
- feature: New capabilities, enhancements, or enablement work
- mlm-sync: Requires API/behavior sync with upstream Megatron-LM changes
- tracking: Tracking issue for an ongoing project with smaller steps
Summary
Megatron-LM PR NVIDIA/Megatron-LM#3812 refactors `DistributedDataParallel` to accept a `full_param_layout` argument describing how parameters and gradients are mapped in `_ParamAndGradBuffer`. Distributed optimizers compute this mapping via a static `compute_full_param_layout` method.

MBridge will need to pass `full_param_layout` into DDP to fully support this change.

Urgency
Not pressing. When `full_param_layout` is not passed, DDP currently falls back to the existing behavior (`_compute_default_per_buffer_param_layout`). However, Deepak plans to remove that fallback code in a future cleanup pass, at which point MBridge must supply `full_param_layout` or DDP initialization will break.

What needs to happen
- Pick up the `full_param_layout` / `PerBufferParamLayout` / `BufferKey` dataclasses introduced in `param_layout.py`.
- Call `DistributedOptimizer.compute_full_param_layout()` in MBridge's training initialization and pass the result to DDP.
- Update the code that constructs `DistributedDataParallel` to forward the layout.

Context
From Slack discussion with @deepakn94 (2026-04-22). Should be addressed before the fallback path is removed upstream.
cc @deepakn94
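
For illustration, here is a minimal, self-contained sketch of the pattern described under "What needs to happen": compute the layout once, then hand it to the DDP wrapper instead of relying on the fallback. Everything in it is a stand-in. The class names echo the ones mentioned in this issue, but the fields, the `ToyDDP` class, the `allow_fallback` flag, and all signatures are assumptions made for the example and are not Megatron-LM's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# NOTE: hypothetical stand-ins only. The real dataclasses live in Megatron-LM's
# param_layout.py and their fields/signatures may differ from these.


@dataclass(frozen=True)
class BufferKey:
    dtype: str  # assumed field: one buffer per parameter dtype


@dataclass
class PerBufferParamLayout:
    # assumed field: parameter name -> (start, end) offsets inside the buffer
    param_ranges: Dict[str, Tuple[int, int]]


FullParamLayout = Dict[BufferKey, PerBufferParamLayout]
ParamSpec = Tuple[str, str, int]  # (name, dtype, number of elements)


def compute_full_param_layout(params: List[ParamSpec]) -> FullParamLayout:
    """Toy version of the optimizer's static method: pack params contiguously per dtype."""
    layout: FullParamLayout = {}
    next_offset: Dict[BufferKey, int] = {}
    for name, dtype, numel in params:
        key = BufferKey(dtype)
        start = next_offset.get(key, 0)
        per_buffer = layout.setdefault(key, PerBufferParamLayout({}))
        per_buffer.param_ranges[name] = (start, start + numel)
        next_offset[key] = start + numel
    return layout


class ToyDDP:
    """Toy wrapper that mimics 'use the given layout, else fall back'."""

    def __init__(
        self,
        params: List[ParamSpec],
        full_param_layout: Optional[FullParamLayout] = None,
        allow_fallback: bool = True,  # hypothetical flag modelling the planned removal
    ) -> None:
        self.used_fallback = full_param_layout is None
        if full_param_layout is None:
            if not allow_fallback:
                raise ValueError("full_param_layout is required")
            # Stands in for the default-layout fallback path.
            full_param_layout = compute_full_param_layout(params)
        self.full_param_layout = full_param_layout


# Usage: compute the layout during training initialization and forward it.
params = [("w", "bf16", 4), ("b", "bf16", 2), ("scale", "fp32", 1)]
layout = compute_full_param_layout(params)
ddp = ToyDDP(params, full_param_layout=layout, allow_fallback=False)
```

With `allow_fallback=False`, constructing `ToyDDP(params)` without a layout raises, which models what this issue expects to happen in MBridge once the upstream fallback is removed.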