
feat: Support LoRA incremental weight synchronization on disk for FSDP and SGLang #1233

Open
TaoZex wants to merge 25 commits into inclusionAI:main from TaoZex:lora_incre

Conversation

@TaoZex (Collaborator) commented Apr 23, 2026

Description

Implement disk-based LoRA delta weight synchronization for FSDP training engine and SGLang inference backend. When lora_delta_sync is enabled, the first weight sync transmits the full base model via /update_weights_from_disk, and subsequent syncs only transmit LoRA adapter weights via /load_lora_adapter, significantly reducing communication overhead.

Key changes:

  • Disk-based delta sync path: Base model weights are saved as HuggingFace safetensors and loaded by SGLang via /update_weights_from_disk; adapter weights are saved as PEFT format and loaded via /load_lora_adapter.
  • Dispatch logic: When lora_delta_sync=True, the delta sync path is used regardless of weight_update_mode (disk or xccl), eliminating the need for an NCCL-based distributed weight update (see the sketch after this list).
  • Entropy regularization: Add entropy_coeff parameter to PPOActorConfig and grpo_loss_fn, supporting entropy bonus in the loss function.
  • Shared filesystem support: Add delta_sync_dir config field for multi-node setups, defaulting to ~/.cache/areal/ for single-node.
  • Documentation: Update both Chinese and English LoRA reference docs to reflect disk-based sync flow, new parameters, and multi-node filesystem guidance.
  • Tests: Add TestDeltaSyncDispatchLogic, test_entropy_coeff_default, test_entropy_coeff_can_set unit tests.
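
A minimal sketch of this dispatch flow, assuming a PEFT-wrapped model and an SGLang server URL. Only the /update_weights_from_disk and /load_lora_adapter endpoint names come from this PR; the function name, payload keys, and directory handling below are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch only: function name, payload keys, and defaults are assumptions.
import os

import requests


def sync_weights_to_sglang(peft_model, step, server_url, delta_sync_dir=None):
    """Disk-based LoRA delta sync: full base model once, adapter-only afterwards."""
    sync_dir = delta_sync_dir or os.path.expanduser("~/.cache/areal/")  # single-node default
    os.makedirs(sync_dir, exist_ok=True)

    if step == 1:
        # First sync: save the full base model as HuggingFace safetensors and
        # have SGLang reload it from disk.
        base_dir = os.path.join(sync_dir, "base_model")
        peft_model.get_base_model().save_pretrained(base_dir, safe_serialization=True)
        resp = requests.post(
            f"{server_url}/update_weights_from_disk",
            json={"model_path": base_dir},  # payload key is an assumption
        )
    else:
        # Subsequent syncs: save only the PEFT adapter and hot-load it.
        adapter_dir = os.path.join(sync_dir, f"adapter_step_{step}")
        peft_model.save_pretrained(adapter_dir)  # PEFT save writes adapter weights only
        resp = requests.post(
            f"{server_url}/load_lora_adapter",
            json={"lora_name": f"step_{step}", "lora_path": adapter_dir},  # assumed keys
        )
    resp.raise_for_status()
```

On multi-node setups, delta_sync_dir must point to storage reachable by both the trainer and the SGLang server, which is what the new delta_sync_dir config field provides.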

Related Issue

Fixes #(issue)

Type of Change

  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context

Files changed:

| File | Change |
| --- | --- |
| areal/api/cli_args.py | Add entropy_coeff and delta_sync_dir fields; update lora_delta_sync help text |
| areal/engine/fsdp_engine.py | Disk-based delta sync (_update_weights_delta_sync_disk, _save_base_model_for_delta_sync) |
| areal/engine/sglang_remote.py | Update docstrings for disk-based sync |
| areal/trainer/ppo/actor.py | Add entropy_coeff parameter and entropy bonus logic |
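
The entropy bonus in areal/trainer/ppo/actor.py amounts to subtracting a scaled mean token entropy from the policy loss. A minimal sketch, assuming token-level logits and a loss mask; the function and argument names are illustrative, not the PR's exact signatures.

```python
import torch


def add_entropy_bonus(policy_loss, logits, loss_mask, entropy_coeff=0.0):
    """Subtract a scaled entropy term from the policy loss (illustrative sketch)."""
    if entropy_coeff == 0.0:
        return policy_loss

    # Token-level entropy of the policy: H = -sum_v p(v) * log p(v).
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

    # Average over valid (non-padded) tokens only.
    mean_entropy = (entropy * loss_mask).sum() / loss_mask.sum().clamp(min=1)

    # A positive entropy_coeff rewards higher entropy, encouraging exploration.
    return policy_loss - entropy_coeff * mean_entropy
```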

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces "LoRA Delta Sync," an incremental weight update mechanism for FSDP and SGLang that reduces communication overhead by transmitting only adapter weights after an initial full sync. It also adds entropy regularization to the PPO actor. Feedback focuses on potential memory issues when collecting full model parameters on a single rank, the need for shared storage in multi-node setups for adapter files, and minor code cleanups regarding imports and logic simplification.

@TaoZex (Collaborator, Author) commented Apr 23, 2026

Test Content

Tested the code and collected the results below.

  1. tests/test_lora_delta_sync.py [test output screenshot]
  2. tests/test_lora_delta_sync_e2e.py (invokes tests/torchrun/run_lora_delta_sync.py) [test output screenshot]

@TaoZex (Collaborator, Author) commented Apr 23, 2026

Before this optimization, both the base model and adapter parameters had to be synchronized at every weight update. After the optimization, from Step > 1 onward only the adapter parameters are transmitted.

1. Task Reward

With the incremental disk weight update, LoRA training remains stable and the reward grows continuously. [reward curve screenshot]

2. Weight Synchronization Data Volume

  • Base model synchronization: 2944.40M
  • Adapter model synchronization: 35.21M

Adapter-only synchronization is 35.21 / (35.21 + 2944.40) ≈ 1.18% of the original volume, i.e. the overall parameter transmission volume is reduced by 98.82%.

3. Weight Synchronization Latency

  • Step 1 (Base + Adapter): 6.74s
  • Step > 1 (Adapter only): 0.35s (average shown in the chart)

The latency drops from 6.74s to 0.35s, a 94.8% reduction in weight update time.

@TaoZex TaoZex marked this pull request as ready for review April 23, 2026 10:21
@TaoZex (Collaborator, Author) commented Apr 23, 2026

@rchardx This idea is inspired by incremental weight updates. If you're interested, would you mind reviewing it or sharing your suggestions when you have time? Looking forward to your reply. Thanks~

@garrett4wade (Collaborator) left a comment

Hi @TaoZex, LoRA updates are expected to go through the "disk" update mode, which already implements the /load_lora_adapter path. The critical issue we should fix is that the FSDP engine always saves the full parameters rather than only the LoRA weights.
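
For context on the issue above, a minimal sketch of extracting only the LoRA tensors instead of the full state dict; the helper function and the "lora_" name filter are assumptions about one way to address it, not AReaL's code.

```python
from peft import PeftModel, get_peft_model_state_dict


def extract_lora_weights(model, full_state_dict=None):
    """Return only the LoRA adapter tensors rather than all parameters (illustrative)."""
    if isinstance(model, PeftModel):
        # PEFT already knows which tensors belong to the active adapter.
        return get_peft_model_state_dict(model, state_dict=full_state_dict)
    # Fallback: LoRA tensors carry "lora_" in their parameter names.
    state_dict = full_state_dict if full_state_dict is not None else model.state_dict()
    return {k: v for k, v in state_dict.items() if "lora_" in k}
```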

