Commit ec2ba9b
Fix save_megatron_model deadlock with vLLM inference workers
Pass fully_parallel_save=False to save_megatron_model during HF-to-Megatron
conversion. When NeMo RL runs with both Megatron training workers and vLLM
inference workers, they share a single torch.distributed world. The default
fully_parallel_save=True activates FullyParallelSaveStrategyWrapper, which
uses all_gather_object on the DP sub-group — a group that includes vLLM
ranks. Since only training workers call save_megatron_model, vLLM ranks
never participate in these collectives, causing a permanent deadlock.
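The failure mode above is generic to any torch.distributed collective: it only completes once every rank in the group enters it, so a group that spans both training and inference ranks hangs whenever only the training ranks call in. The snippet below is a minimal stdlib simulation of that semantics (it is not NeMo RL or Megatron code; the group sizes and rank sets are illustrative), using a `threading.Barrier` to stand in for the all_gather:

```python
import threading

def simulate_collective(group_ranks, calling_ranks, timeout=0.5):
    """Simulate a blocking collective over `group_ranks`.

    The collective completes only if every rank in the group reaches the
    barrier. Ranks in the group that never call in (here: the vLLM ranks)
    leave the barrier short, which we detect via timeout instead of the
    permanent hang a real all_gather_object would produce.
    Returns True if the collective completed, False on simulated deadlock.
    """
    barrier = threading.Barrier(len(group_ranks))
    results = {}

    def worker(rank):
        try:
            barrier.wait(timeout=timeout)
            results[rank] = True
        except threading.BrokenBarrierError:
            results[rank] = False

    threads = [threading.Thread(target=worker, args=(r,))
               for r in calling_ranks if r in group_ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Success requires every group member to have participated and passed.
    return len(results) == len(group_ranks) and all(results.values())

# DP group that spans both training ranks (0, 1) and vLLM ranks (2, 3):
dp_group = {0, 1, 2, 3}

# Only training workers call save_megatron_model -> collective never completes:
print(simulate_collective(dp_group, calling_ranks={0, 1}))        # False

# If every group member participated, the collective would complete:
print(simulate_collective(dp_group, calling_ranks={0, 1, 2, 3}))  # True
```

This is why the fix disables the fully-parallel save strategy rather than trying to make vLLM ranks join the collective: with `fully_parallel_save=False`, the save path never issues collectives on the mixed DP group in the first place.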
Depends on: NVIDIA-NeMo/Megatron-Bridge#XXXX (exposes the fully_parallel_save parameter)

Parent: fe3c4fc
1 file changed: 4 additions & 1 deletion