
Commit ec2ba9b

Fix save_megatron_model deadlock with vLLM inference workers
Pass fully_parallel_save=False to save_megatron_model during HF-to-Megatron conversion.

When NeMo RL runs with both Megatron training workers and vLLM inference workers, they share a single torch.distributed world. The default fully_parallel_save=True activates FullyParallelSaveStrategyWrapper, which calls all_gather_object on the DP sub-group, a group that also contains the vLLM ranks. Since only training workers call save_megatron_model, the vLLM ranks never enter these collectives, causing a permanent deadlock.

Depends on: NVIDIA-NeMo/Megatron-Bridge#XXXX (exposes the fully_parallel_save parameter)
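The failure mode generalizes: any collective issued on a process group hangs when some group members never reach the call. Below is a minimal, self-contained sketch of that hang, not NeMo RL or Megatron-Bridge code; the gloo backend, two spawned processes, and the short timeout are assumptions chosen so the demo surfaces the deadlock as an error instead of blocking forever.

import datetime
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Short timeout so the demo terminates rather than hanging indefinitely.
    dist.init_process_group(
        "gloo",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=5),
    )
    if rank == 0:
        # "Training worker": gathers across the whole world, but rank 1 never
        # enters the collective, so this call blocks until the timeout fires.
        objs = [None] * world_size
        try:
            dist.all_gather_object(objs, {"shard": rank})
        except RuntimeError as exc:
            print(f"rank 0 stuck in all_gather_object, as in the bug: {exc}")
    else:
        # "vLLM inference worker": busy with unrelated work; it never calls
        # save_megatron_model, so it never joins the collective.
        time.sleep(10)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

In production there is no short timeout: the default is on the order of minutes or is disabled, so the training workers simply hang inside the save.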
1 parent fe3c4fc

1 file changed

nemo_rl/models/megatron/community_import.py

Lines changed: 4 additions & 1 deletion
@@ -107,7 +107,10 @@ def import_model_from_hf_name(
     config.num_layers_in_last_pipeline_stage = orig_num_layers_in_last_pipeline_stage
     config.pipeline_dtype = orig_pipeline_dtype
 
-    bridge.save_megatron_model(megatron_model, output_path)
+    # fully_parallel_save=False: the torch.distributed world may include non-training
+    # ranks (e.g., vLLM inference workers) that don't participate in this save.
+    # FullyParallelSaveStrategyWrapper would deadlock on DP sub-group collectives.
+    bridge.save_megatron_model(megatron_model, output_path, fully_parallel_save=False)
 
     # resetting mcore state
     import megatron.core.rerun_state_machine
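For contrast, a hedged sketch of the participation rule the fix works around: torch.distributed.new_group must be executed by every rank in the world, but collectives on the resulting sub-group only require its members. A parallel save strategy deadlocks exactly when its group, like the DP sub-group above, contains ranks that never enter the collective. All names below are illustrative, not Megatron-Bridge internals.

import os

import torch.distributed as dist
import torch.multiprocessing as mp

TRAINING_RANKS = [0, 1]  # rank 2 plays a vLLM inference worker


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank must call new_group, including the inference rank.
    training_group = dist.new_group(ranks=TRAINING_RANKS)

    if rank in TRAINING_RANKS:
        # Only members enter collectives on the sub-group, so the inference
        # rank never needs to participate and nothing deadlocks.
        shards = [None] * len(TRAINING_RANKS)
        dist.all_gather_object(shards, {"rank": rank}, group=training_group)
        print(f"rank {rank} gathered: {shards}")

    dist.barrier()  # keep the world alive until all ranks are done
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(3,), nprocs=3)

Restricting the save collectives to such a training-only group would be an alternative fix; disabling fully_parallel_save instead trades away parallel-save throughput for a change confined to the call site.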
