Fix #1040: [Feature] Fixed bugs in Archon LoRA Backend #1139
Conversation
Ranks with no gradients (e.g. frozen non-LoRA params) previously returned 0.0 immediately, skipping the all_reduce. Ranks that do have gradients then hang waiting for the collective to complete. Move device init before the empty-grads check and make zero-grad ranks still participate in the all_reduce with a zero-valued tensor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
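For orientation, here is a minimal hedged micro-sketch of the failure mode the commit describes (`reduce_grad_norm` is a hypothetical helper with illustrative names, not the actual `get_grad_norm_fp32` source): a rank that returns before the collective strands every rank that reaches it.

```python
import torch
import torch.distributed as dist


def reduce_grad_norm(local_sq_sum, group):
    # Before the fix: an early `return 0.0` for ranks with no gradients meant
    # they never called all_reduce, so peers blocked forever on that call.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.tensor(local_sq_sum or 0.0, dtype=torch.float, device=device)
    if group is not None:
        # After the fix: every rank arrives here, contributing zero if frozen.
        dist.all_reduce(t, op=dist.ReduceOp.SUM, group=group)
    return float(t.item()) ** 0.5
```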
Code Review
This pull request updates the get_grad_norm_fp32 function to ensure that ranks with empty gradients still participate in distributed all_reduce operations, preventing potential deadlocks during training with frozen parameters. The review feedback suggests refactoring the function to unify synchronization logic and reduce code duplication. Additionally, it is recommended to use consistent 'is not None' checks for process groups to improve readability and adhere to PEP 8 standards.
```python
if not grads_for_norm:
    # Still participate in all_reduce with zero contribution so that
    # ranks with grads don't hang waiting for this rank (e.g. LoRA frozen ranks).
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```
The logic for participating in the all_reduce collective and calculating the final norm for empty gradient lists is largely duplicated from the rest of the function (specifically the logic found in lines 118-174). While this correctly addresses the distributed deadlock, it introduces a maintainability risk. Any future changes to the distributed synchronization logic (e.g., adding support for new process groups like expert parallel groups) would now need to be updated in three separate locations within this function.
Consider refactoring the function to unify the synchronization and post-processing steps. For instance, you could initialize total_norm_cuda to zero when grads_for_norm is empty and then use conditional blocks to skip only the local gradient accumulation parts, allowing the existing all_reduce and root-calculation logic to handle the final result.
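One possible shape of that refactor, as a hedged sketch (not the reviewer's exact code; the `_sketch` function name and its signature are illustrative): zero-initialize the accumulator so that empty-grad ranks flow through the same synchronization and root-calculation path as everyone else.

```python
import torch
import torch.distributed as dist


def get_grad_norm_fp32_sketch(grads_for_norm, norm_type=2.0,
                              data_parallel_group=None, model_parallel_group=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Zero-initialize so empty-grad ranks take the same path below.
    total_norm_cuda = torch.tensor(0.0, dtype=torch.float, device=device)

    # Local accumulation is the only step conditional on having gradients.
    if grads_for_norm:
        if norm_type == torch.inf:
            total_norm_cuda = max(g.abs().max() for g in grads_for_norm).float().to(device)
        else:
            total_norm_cuda = torch.stack(
                [g.norm(norm_type) ** norm_type for g in grads_for_norm]
            ).sum().float().to(device)

    # Single synchronization path: every rank executes it, empty grads or not.
    reduce_op = dist.ReduceOp.MAX if norm_type == torch.inf else dist.ReduceOp.SUM
    if data_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
    if model_parallel_group is not None:
        dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)

    # Single root-calculation path.
    total_norm = float(total_norm_cuda.item())
    if norm_type != torch.inf and total_norm > 0:
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```

A shape like this also collapses the duplicated sync sites into one and uses `is not None` consistently, which ties in with the next comment below.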
```python
if data_parallel_group:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
if model_parallel_group is not None:
    dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
```
There is an inconsistency in how the existence of process groups is checked. data_parallel_group is checked using truthiness, while model_parallel_group is checked using an explicit is not None comparison. Per PEP 8, it is generally preferred to use is not None when checking if an optional argument was provided, to avoid potential issues with objects that might define a custom __bool__ method. Using a consistent check also improves readability.
Suggested change:

```diff
-if data_parallel_group:
+if data_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=data_parallel_group)
 if model_parallel_group is not None:
     dist.all_reduce(total_norm_cuda, op=reduce_op, group=model_parallel_group)
```
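To illustrate the pitfall the comment describes, a small generic Python example (not torch-specific; `FlakyGroup` is a made-up class): a truthiness check silently skips an object whose `__bool__` returns `False`, even though the caller explicitly provided it.

```python
class FlakyGroup:
    def __bool__(self):
        return False  # e.g. a wrapper that reports "empty" membership


group = FlakyGroup()

if group:                # False -> the collective would be silently skipped
    print("truthiness: participating")
if group is not None:    # True  -> correctly detects the provided group
    print("is not None: participating")
```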
References
- PEP 8 recommends using `is not None` for comparisons to singletons like None, rather than relying on truthiness (https://peps.python.org/pep-0008/).
Closes #1040
Description
Fixes a distributed deadlock in `get_grad_norm_fp32` (`areal/engine/fsdp_utils/grad.py`) where ranks with no gradients (e.g., LoRA frozen ranks in the Archon LoRA backend) would return early before participating in the `all_reduce` collective, causing ranks with real gradients to hang indefinitely waiting for the collective to complete.

Root cause: The original early return on empty `grads_for_norm` bypassed the `dist.all_reduce` calls entirely. In a distributed setting, all ranks must call collective operations together.

Fix: When `grads_for_norm` is empty, the rank now contributes a zero tensor to `all_reduce` (using `ReduceOp.MAX` for the inf-norm, `ReduceOp.SUM` otherwise) across both `data_parallel_group` and `model_parallel_group` before returning `0.0`. This keeps all ranks synchronized without affecting the computed norm on ranks that do have gradients.

Related Issue
Fixes #1040
Type of Change
Checklist
- Pre-commit hooks pass (`pre-commit run --all-files`)
- Documentation builds (`./docs/build_all.sh`)
- Branch is up to date with `main`
- Reviewed with the `/review-pr` command
- Created with `/create-pr`

Breaking Change Details (if applicable):
N/A
Additional Context
Two tests updated/added in `tests/test_fsdp_grad.py`:

- `test_empty_grads_returns_zero`: patches `torch.distributed.all_reduce` to assert it is called exactly twice (once for `dp_group`, once for `mp_group`) even when the grad list is empty, and that the result is still `0.0` (see the sketch below).
- `test_empty_grads_participates_in_allreduce_no_groups` (new): verifies that passing `None` for both process groups with an empty grad list returns `0.0` without hanging or erroring.

End-to-end validation was performed on Qwen 1.5B distill with the DAPO-Math-17k dataset as described in issue #1040 (see also #1015 (comment)).
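A hedged sketch of how the first test's patching might look (pytest-style; the exact signature of `get_grad_norm_fp32` and the real contents of `tests/test_fsdp_grad.py` are assumptions here, and the sketch assumes the function can run on CPU when CUDA is unavailable):

```python
from unittest import mock

from areal.engine.fsdp_utils.grad import get_grad_norm_fp32


def test_empty_grads_returns_zero_sketch():
    dp_group, mp_group = object(), object()  # stand-in process groups
    with mock.patch("torch.distributed.all_reduce") as all_reduce:
        result = get_grad_norm_fp32(
            [],  # empty grad list, as on a fully frozen rank
            norm_type=2.0,
            data_parallel_group=dp_group,
            model_parallel_group=mp_group,
        )
    # One call per process group, even with no gradients on this rank.
    assert all_reduce.call_count == 2
    assert result == 0.0
```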
This PR was created with AI assistance (Claude). The changes were reviewed by quality gates and a critic model before submission.