
Fix #1040: [Feature] Fixed bugs in Archon LoRA Backend #1139

Merged
garrett4wade merged 1 commit into inclusionAI:main from JiwaniZakir:fix/1040-feature-fixed-bugs-in-archon-lora-backe
Apr 8, 2026

Conversation

@JiwaniZakir
Contributor

Closes #1040

Description

Fixes a distributed deadlock in get_grad_norm_fp32 (areal/engine/fsdp_utils/grad.py) where ranks with no gradients (e.g., LoRA frozen ranks in the Archon LoRA backend) would return early before participating in the all_reduce collective, causing ranks with real gradients to hang indefinitely waiting for the collective to complete.

Root cause: The original early-return on empty grads_for_norm bypassed the dist.all_reduce calls entirely. In a distributed setting, all ranks must call collective operations together.

Fix: When grads_for_norm is empty, the rank now contributes a zero tensor to all_reduce (using ReduceOp.MAX for inf-norm, ReduceOp.SUM otherwise) across both data_parallel_group and model_parallel_group before returning 0.0. This keeps all ranks synchronized without affecting the computed norm on ranks that do have gradients.
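For illustration, a minimal sketch of that pattern is below. The helper name _sync_empty_grad_norm is hypothetical and the surrounding function is elided; the actual change is shown in the review thread further down.

import torch
import torch.distributed as dist

def _sync_empty_grad_norm(norm_type, device, data_parallel_group=None, model_parallel_group=None):
    # A rank with no local gradients still joins both collectives with a zero
    # contribution, so peers holding real gradients are never left waiting.
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm  # 0.0 when every rank in the group has an empty grad list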

Related Issue

Fixes #1040

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Two tests updated/added in tests/test_fsdp_grad.py:

  • test_empty_grads_returns_zero: patched torch.distributed.all_reduce to assert it is called exactly twice (once for dp_group, once for mp_group) even when the grad list is empty, and that the result is still 0.0 (a sketch of this assertion pattern follows the list).
  • test_empty_grads_participates_in_allreduce_no_groups (new): verifies that passing None for both process groups with an empty grad list returns 0.0 without hanging or erroring.
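For reference, a sketch of the first assertion is below. The exact call signature of get_grad_norm_fp32 is an assumption here (argument names are taken from the snippet in the review thread); the real tests live in tests/test_fsdp_grad.py.

from unittest.mock import patch

from areal.engine.fsdp_utils.grad import get_grad_norm_fp32

def test_empty_grads_calls_all_reduce_for_both_groups():
    dp_group, mp_group = object(), object()  # stand-ins; all_reduce is patched out
    with patch("torch.distributed.all_reduce") as mock_all_reduce:
        norm = get_grad_norm_fp32(
            grads_for_norm=[],                # empty grad list on this rank
            data_parallel_group=dp_group,
            model_parallel_group=mp_group,
        )
    assert mock_all_reduce.call_count == 2    # once for dp_group, once for mp_group
    assert norm == 0.0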

End-to-end validation was performed on Qwen 1.5B distill with the DAPO-Math-17k dataset as described in issue #1040 (see also #1015 (comment)).


Need help? Check the Contributing Guide or ask in
GitHub Discussions!


This PR was created with AI assistance (Claude). The changes were reviewed by quality gates and a critic model before submission.

Ranks with no gradients (e.g. frozen non-LoRA params) previously
returned 0.0 immediately, skipping the all_reduce. Ranks that do
have gradients then hang waiting for the collective to complete.

Move device init before the empty-grads check and make zero-grad
ranks still participate in all_reduce with a zero-valued tensor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the get_grad_norm_fp32 function to ensure that ranks with empty gradients still participate in distributed all_reduce operations, preventing potential deadlocks during training with frozen parameters. The review feedback suggests refactoring the function to unify synchronization logic and reduce code duplication. Additionally, it is recommended to use consistent 'is not None' checks for process groups to improve readability and adhere to PEP 8 standards.

Comment on lines +104 to +116
if not grads_for_norm:
    # Still participate in all_reduce with zero contribution so that
    # ranks with grads don't hang waiting for this rank (e.g. LoRA frozen ranks).
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm


Severity: medium

The logic for participating in the all_reduce collective and calculating the final norm for empty gradient lists is largely duplicated from the rest of the function (specifically the logic found in lines 118-174). While this correctly addresses the distributed deadlock, it introduces a maintainability risk. Any future changes to the distributed synchronization logic (e.g., adding support for new process groups like expert parallel groups) would now need to be updated in three separate locations within this function.

Consider refactoring the function to unify the synchronization and post-processing steps. For instance, you could initialize total_norm_cuda to zero when grads_for_norm is empty and then use conditional blocks to skip only the local gradient accumulation parts, allowing the existing all_reduce and root-calculation logic to handle the final result.
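For illustration, one possible shape of that unification is sketched below (a sketch only, not the actual function; the local norm accumulation logic is elided):

total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)
reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM

if grads_for_norm:
    # ... accumulate the local (inf- or p-) norm into total_norm_cuda as before ...
    pass

# Single synchronization path shared by empty and non-empty ranks.
if data_parallel_group is not None:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
if model_parallel_group is not None:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)

total_norm = float(total_norm_cuda.item())
if norm_type != torch.inf and total_norm > 0:
    total_norm = total_norm ** (1.0 / norm_type)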

Comment on lines +109 to +112
if data_parallel_group:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
if model_parallel_group is not None:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)


Severity: medium

There is an inconsistency in how the existence of process groups is checked. data_parallel_group is checked using truthiness, while model_parallel_group is checked using an explicit is not None comparison. Per PEP 8, it is generally preferred to use is not None when checking if an optional argument was provided, to avoid potential issues with objects that might define a custom __bool__ method. Using a consistent check also improves readability.

Suggested change

-if data_parallel_group:
+if data_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
 if model_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
References
  1. PEP 8 recommends using 'is not None' for comparisons to singletons like None, rather than relying on truthiness. (link)
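A small, self-contained illustration of the pitfall the reference describes (the Group class is hypothetical, not from the codebase):

class Group:
    def __bool__(self):
        return False  # an object that is "falsy" even though it was provided

g = Group()
if g:                  # truthiness check: this branch is skipped
    print("never runs for this object")
if g is not None:      # explicit check: runs whenever the argument was passed
    print("runs as expected")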


@garrett4wade (Collaborator) left a comment


LGTM

garrett4wade merged commit 595a3c4 into inclusionAI:main on Apr 8, 2026
5 of 6 checks passed
SathyaGnanakumar pushed a commit to danielkiely/AReaL that referenced this pull request Apr 29, 2026
…nAI#1139)

