
Commit ec2ba9b

Fix save_megatron_model deadlock with vLLM inference workers
Pass fully_parallel_save=False to save_megatron_model during HF-to-Megatron conversion.

When NeMo RL runs with both Megatron training workers and vLLM inference workers, they share a single torch.distributed world. The default fully_parallel_save=True activates FullyParallelSaveStrategyWrapper, which calls all_gather_object on the DP sub-group, a group that also contains the vLLM ranks. Since only training workers call save_megatron_model, the vLLM ranks never enter these collectives, causing a permanent deadlock.

Depends on: NVIDIA-NeMo/Megatron-Bridge#XXXX (exposes the fully_parallel_save parameter)
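The failure mode generalizes: any collective issued on a process group hangs when some group members never reach the call. Below is a minimal, self-contained sketch of that hang, not NeMo RL or Megatron-Bridge code; the gloo backend, two spawned processes, and the short timeout are assumptions chosen so the demo surfaces the deadlock as an error instead of blocking forever.

import datetime
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Short timeout so the demo terminates rather than hanging indefinitely.
    dist.init_process_group(
        "gloo",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=5),
    )
    if rank == 0:
        # "Training worker": gathers across the whole world, but rank 1 never
        # enters the collective, so this call blocks until the timeout fires.
        objs = [None] * world_size
        try:
            dist.all_gather_object(objs, {"shard": rank})
        except RuntimeError as exc:
            print(f"rank 0 stuck in all_gather_object, as in the bug: {exc}")
    else:
        # "vLLM inference worker": busy with unrelated work; it never calls
        # save_megatron_model, so it never joins the collective.
        time.sleep(10)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

In production there is no short timeout: the default is on the order of minutes or is disabled, so the training workers simply hang inside the save.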
1 parent fe3c4fc

1 file changed

nemo_rl/models/megatron/community_import.py

Lines changed: 4 additions & 1 deletion
@@ -107,7 +107,10 @@ def import_model_from_hf_name(
     config.num_layers_in_last_pipeline_stage = orig_num_layers_in_last_pipeline_stage
     config.pipeline_dtype = orig_pipeline_dtype
 
-    bridge.save_megatron_model(megatron_model, output_path)
+    # fully_parallel_save=False: the torch.distributed world may include non-training
+    # ranks (e.g., vLLM inference workers) that don't participate in this save.
+    # FullyParallelSaveStrategyWrapper would deadlock on DP sub-group collectives.
+    bridge.save_megatron_model(megatron_model, output_path, fully_parallel_save=False)
 
     # resetting mcore state
     import megatron.core.rerun_state_machine
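For contrast, a hedged sketch of the participation rule the fix works around: torch.distributed.new_group must be executed by every rank in the world, but collectives on the resulting sub-group only require its members. A parallel save strategy deadlocks exactly when its group, like the DP sub-group above, contains ranks that never enter the collective. All names below are illustrative, not Megatron-Bridge internals.

import os

import torch.distributed as dist
import torch.multiprocessing as mp

TRAINING_RANKS = [0, 1]  # rank 2 plays a vLLM inference worker


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank must call new_group, including the inference rank.
    training_group = dist.new_group(ranks=TRAINING_RANKS)

    if rank in TRAINING_RANKS:
        # Only members enter collectives on the sub-group, so the inference
        # rank never needs to participate and nothing deadlocks.
        shards = [None] * len(TRAINING_RANKS)
        dist.all_gather_object(shards, {"rank": rank}, group=training_group)
        print(f"rank {rank} gathered: {shards}")

    dist.barrier()  # keep the world alive until all ranks are done
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(3,), nprocs=3)

Restricting the save collectives to such a training-only group would be an alternative fix; disabling fully_parallel_save instead trades away parallel-save throughput for a change confined to the call site.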
