Fix #1040: [Feature] Fixed bugs in Archon LoRA Backend #1139
Conversation
Ranks with no gradients (e.g. frozen non-LoRA params) previously returned 0.0 immediately, skipping the all_reduce. Ranks that do have gradients then hang waiting for the collective to complete. Move device init before the empty-grads check and make zero-grad ranks still participate in the all_reduce with a zero-valued tensor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
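For orientation, here is a minimal hedged micro-sketch of the failure mode the commit describes (`reduce_grad_norm` is a hypothetical helper with illustrative names, not the actual `get_grad_norm_fp32` source): a rank that returns before the collective strands every rank that reaches it.

```python
import torch
import torch.distributed as dist


def reduce_grad_norm(local_sq_sum, group):
    # Before the fix: an early `return 0.0` for ranks with no gradients meant
    # they never called all_reduce, so peers blocked forever on that call.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.tensor(local_sq_sum or 0.0, dtype=torch.float, device=device)
    if group is not None:
        # After the fix: every rank arrives here, contributing zero if frozen.
        dist.all_reduce(t, op=dist.ReduceOp.SUM, group=group)
    return float(t.item()) ** 0.5
```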
Code Review
This pull request updates the get_grad_norm_fp32 function to ensure that ranks with empty gradients still participate in distributed all_reduce operations, preventing potential deadlocks during training with frozen parameters. The review feedback suggests refactoring the function to unify synchronization logic and reduce code duplication. Additionally, it is recommended to use consistent 'is not None' checks for process groups to improve readability and adhere to PEP 8 standards.
```python
if not grads_for_norm:
    # Still participate in all_reduce with zero contribution so that
    # ranks with grads don't hang waiting for this rank (e.g. LoRA frozen ranks).
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```
The logic for participating in the all_reduce collective and calculating the final norm for empty gradient lists is largely duplicated from the rest of the function (specifically the logic found in lines 118-174). While this correctly addresses the distributed deadlock, it introduces a maintainability risk. Any future changes to the distributed synchronization logic (e.g., adding support for new process groups like expert parallel groups) would now need to be updated in three separate locations within this function.
Consider refactoring the function to unify the synchronization and post-processing steps. For instance, you could initialize total_norm_cuda to zero when grads_for_norm is empty and then use conditional blocks to skip only the local gradient accumulation parts, allowing the existing all_reduce and root-calculation logic to handle the final result.
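One possible shape of that refactor, as a hedged sketch (not the reviewer's exact code; the `_sketch` function name and its signature are illustrative): zero-initialize the accumulator so that empty-grad ranks flow through the same synchronization and root-calculation path as everyone else.

```python
import torch
import torch.distributed as dist


def get_grad_norm_fp32_sketch(grads_for_norm, norm_type=2.0,
                              data_parallel_group=None, model_parallel_group=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Zero-initialize so empty-grad ranks take the same path below.
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)

    # Local accumulation is the only step conditional on having gradients.
    if grads_for_norm:
        if norm_type == torch.inf:
            total_norm_cuda = max(g.abs().max() for g in grads_for_norm).float().to(device)
        else:
            total_norm_cuda = torch.stack(
                [g.norm(norm_type) ** norm_type for g in grads_for_norm]
            ).sum().float().to(device)

    # Single synchronization path: every rank executes it, empty grads or not.
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)

    # Single root-calculation path.
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```

A shape like this also collapses the duplicated sync sites into one and uses `is not None` consistently, which ties in with the next comment below.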
```python
if data_parallel_group:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
if model_parallel_group is not None:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
```
There is an inconsistency in how the existence of process groups is checked. data_parallel_group is checked using truthiness, while model_parallel_group is checked using an explicit is not None comparison. Per PEP 8, it is generally preferred to use is not None when checking if an optional argument was provided, to avoid potential issues with objects that might define a custom __bool__ method. Using a consistent check also improves readability.
Suggested change:

```diff
-if data_parallel_group:
+if data_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
 if model_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
```
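To illustrate the pitfall the comment describes, a small generic Python example (not torch-specific; `FlakyGroup` is a made-up class): a truthiness check silently skips an object whose `__bool__` returns `False`, even though the caller explicitly provided it.

```python
class FlakyGroup:
    def __bool__(self):
        return False  # e.g. a wrapper that reports "empty" membership


group = FlakyGroup()

if group:                # False -> the collective would be silently skipped
    print("truthiness: participating")
if group is not None:    # True  -> correctly detects the provided group
    print("is not None: participating")
```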
References
- PEP 8 recommends using `is not None` for comparisons to singletons like None, rather than relying on truthiness (https://peps.python.org/pep-0008/).
Closes #1040
Description
Fixes a distributed deadlock in `get_grad_norm_fp32` (`areal/engine/fsdp_utils/grad.py`) where ranks with no gradients (e.g., LoRA frozen ranks in the Archon LoRA backend) would return early before participating in the `all_reduce` collective, causing ranks with real gradients to hang indefinitely waiting for the collective to complete.

Root cause: The original early return on empty `grads_for_norm` bypassed the `dist.all_reduce` calls entirely. In a distributed setting, all ranks must call collective operations together.

Fix: When `grads_for_norm` is empty, the rank now contributes a zero tensor to `all_reduce` (using `ReduceOp.MAX` for the inf-norm, `ReduceOp.SUM` otherwise) across both `data_parallel_group` and `model_parallel_group` before returning `0.0`. This keeps all ranks synchronized without affecting the computed norm on ranks that do have gradients.

Related Issue
Fixes #1040
Type of Change
Checklist
- Pre-commit hooks pass (`pre-commit run --all-files`)
- Documentation builds (`./docs/build_all.sh`)
- Branch is up to date with `main`
- Reviewed with the `/review-pr` command
- Created with `/create-pr`

Breaking Change Details (if applicable):
N/A
Additional Context
Two tests updated/added in `tests/test_fsdp_grad.py`:

- `test_empty_grads_returns_zero`: patches `torch.distributed.all_reduce` to assert it is called exactly twice (once for `dp_group`, once for `mp_group`) even when the grad list is empty, and that the result is still `0.0` (see the sketch below).
- `test_empty_grads_participates_in_allreduce_no_groups` (new): verifies that passing `None` for both process groups with an empty grad list returns `0.0` without hanging or erroring.

End-to-end validation was performed on Qwen 1.5B distill with the DAPO-Math-17k dataset as described in issue #1040 (see also #1015 (comment)).
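A hedged sketch of how the first test's patching might look (pytest-style; the exact signature of `get_grad_norm_fp32` and the real contents of `tests/test_fsdp_grad.py` are assumptions here, and the sketch assumes the function can run on CPU when CUDA is unavailable):

```python
from unittest import mock

from areal.engine.fsdp_utils.grad import get_grad_norm_fp32


def test_empty_grads_returns_zero_sketch():
    dp_group, mp_group = object(), object()  # stand-in process groups
    with mock.patch("torch.distributed.all_reduce") as all_reduce:
        result = get_grad_norm_fp32(
            [],  # empty grad list, as on a fully frozen rank
            norm_type=2.0,
            data_parallel_group=dp_group,
            model_parallel_group=mp_group,
        )
    # One call per process group, even with no gradients on this rank.
    assert all_reduce.call_count == 2
    assert result == 0.0
```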
This PR was created with AI assistance (Claude). The changes were reviewed by quality gates and a critic model before submission.