feat: add delta-compressed collective refit by HollowMan6 · Pull Request #2444 · NVIDIA-NeMo/RL

HollowMan6 · 2026-05-08T18:14:50Z

What does this PR do ?

Adds optional delta-compressed weight transfer for non-colocated vLLM collective refit.

This introduces a delta-aware packed weight transfer protocol that can send either full weights or additive deltas, with support for dense, sparse_indices, and sparse_bitmask delta encodings. The trainer source rank keeps a pinned CPU baseline of the last successfully synced HF-format weights, computes deltas against that baseline, and periodically sends full syncs based on full_sync_interval.

The feature is disabled by default and only applies to non-colocated vLLM refit. Colocated CUDA IPC, vLLM FP8 weights, and ModelOpt quantized vLLM paths are rejected.

Note that the receiver does not keep a baseline. It receives actual additive deltas and applies them directly to the current vLLM model weights. The pinned CPU baseline exists only on the trainer source side so the next current - baseline can be computed.

flowchart LR
  subgraph Trainer["Policy trainer ranks"]
    I["state_dict iterator<br/>DTensor.full_tensor / HF export"] --> C["_next_chunk"]
    C --> R{"group.rank == src?"}

    R -- "no" --> NR["consume iterator for collective correctness<br/>receive broadcast metadata/payload only"]
    R -- "yes" --> T{"DeltaCompressionTracker?"}

    T -- "no" --> FULL["FULL update<br/>dense tensors"]
    T -- "yes" --> P["prepare_chunk"]

    P --> S{"full sync due?"}
    S -- "yes" --> FULL
    S -- "no" --> D["delta = current - pinned CPU baseline"]
    D --> ENC{"transport"}
    ENC -- "dense" --> DENSE["dense delta payload"]
    ENC -- "sparse_indices / sparse_bitmask" --> SPARSE["sparse encode<br/>indices or bitmask + values + metadata"]
    SPARSE --> BUCKET["sparse bucket consolidation"]

    FULL --> PACK["pack_named_tensors<br/>uint8 payload + header"]
    DENSE --> PACK
    BUCKET --> PACK
    PACK --> NCCL["NCCL broadcast<br/>shared full/delta protocol"]
    P --> BASE["async baseline snapshot<br/>pinned CPU via D2H stream"]
  end

  subgraph VLLM["Non-colocated vLLM generation workers"]
    NCCL --> RECV["packed_weight_transfer_consumer"]
    RECV --> KIND{"header.kind"}
    KIND -- "FULL" --> LOADFULL["vLLM load_weights<br/>overwrite model weights"]
    KIND -- "DELTA dense" --> LOADDELTA["batch decoded deltas"]
    KIND -- "DELTA sparse" --> DECODE["decode sparse to dense delta tensors"]
    DECODE --> LOADDELTA
    LOADDELTA --> ADDCTX["additive_weight_load_context"]
    ADDCTX --> APPLY["vLLM load_weights<br/>copy/fill/setitem becomes add_"]
  end

sequenceDiagram
  participant Src as Trainer src rank
  participant Peer as Trainer non-src ranks
  participant Comm as NCCL group
  participant V as vLLM worker
  participant Base as Pinned CPU baseline

  Src->>Src: gather/export next chunk
  Peer->>Peer: gather/export same chunk for trainer-side collectives
  Src->>Base: wait old baseline event, read baseline
  Src->>Src: prepare full or delta chunk
  Src->>Base: async snapshot current weights for next refit
  Src->>Src: sparse encode and bucket if useful
  Src->>Comm: broadcast header + packed uint8 payload
  Peer->>Comm: participate in broadcast receive
  V->>Comm: receive header + payload

  alt FULL update
    V->>V: unpack dense tensors
    V->>V: load_weights overwrite
  else DELTA update
    V->>V: unpack or sparse-decode delta tensors
    V->>V: batch delta loads
    V->>V: load_weights under additive context
  end

  Src->>Src: on_sync_succeeded increments committed sync count

Issues

N/A

Usage

Enable under the vLLM generation config:

policy:
  generation:
    backend: vllm
    colocated:
      enabled: false
    delta_compression:
      enabled: true
      dtype: ${policy.precision}
      transport: sparse_indices  # dense, sparse_indices, or sparse_bitmask
      full_sync_interval: 20

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Adds DeltaCompressionTracker and delta-aware packed transfer utilities.
Uses only the source trainer rank for pinned CPU baseline tracking.
Reuses the full weight transfer API shape: producers can send full or delta chunks.
Applies deltas in vLLM through the existing weight loaders under an additive load context.
Wires DTensor v1, DTensor v2, and Megatron policy workers through a shared dispatch_packed_weight_transfer(...) helper.
Adds shared dtype and transfer type aliases/constants.
Adds config examples for GRPO and distillation.

copy-pr-bot · 2026-05-08T18:14:54Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot

Pull request overview

Adds an optional delta-compressed weight transfer protocol for non-colocated vLLM collective refit, enabling the trainer source rank to send full weights or additive deltas (dense / sparse_indices / sparse_bitmask) and apply deltas additively through existing vLLM weight loaders.

Changes:

Introduces a delta-aware packed weight transfer protocol (full/delta/done) with sparse delta encodings and a trainer-side DeltaCompressionTracker baseline.
Integrates the new transfer path into DTensor v1/v2 and Megatron policy workers via a shared dispatch_packed_weight_transfer(...) helper.
Updates vLLM collective refit to optionally consume the new full/delta protocol and adds unit tests + example configs.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`tests/unit/utils/test_weight_transfer.py`	Adds unit coverage for delta tracker behavior, sparse transports, additive load context, and producer/consumer roundtrips.
`nemo_rl/utils/weight_transfer.py`	Implements delta tracking, sparse encodings, packed full/delta broadcast protocol, and additive load context.
`nemo_rl/utils/weight_transfer_types.py`	Defines shared literal types/constants for delta compression and transfer kinds.
`nemo_rl/utils/torch_dtypes.py`	Centralizes dtype string→`torch.dtype` mappings (canonical + aliases).
`nemo_rl/models/policy/workers/megatron_policy_worker.py`	Switches collective weight broadcast to the delta-aware dispatcher when enabled.
`nemo_rl/models/policy/workers/dtensor_policy_worker.py`	Switches collective weight broadcast to the delta-aware dispatcher when enabled.
`nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py`	Switches collective weight broadcast to the delta-aware dispatcher when enabled.
`nemo_rl/models/generation/vllm/vllm_worker.py`	Determines whether to use delta transfer and forwards that flag to the vLLM worker extension.
`nemo_rl/models/generation/vllm/vllm_worker_async.py`	Forwards the delta-transfer enablement flag in the async `prepare_refit_info` path.
`nemo_rl/models/generation/vllm/vllm_backend.py`	Adds delta-aware collective consumer path and additive-delta loading through existing loaders.
`nemo_rl/models/generation/vllm/config.py`	Extends vLLM generation config typing with `delta_compression` settings.
`nemo_rl/models/automodel/setup.py`	Reuses canonical dtype mapping from `torch_dtypes` instead of duplicating it.
`examples/configs/grpo_math_1B.yaml`	Documents/introduces the new `delta_compression` config block (disabled by default).
`examples/configs/distillation_math.yaml`	Documents/introduces the new `delta_compression` config block (disabled by default).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ZhiyuLi-Nvidia · 2026-05-08T18:34:26Z

Awesome @HollowMan6

I found delta weight transfer has its own weight transfer function, which seems a duplicated one compared with the full weight transfer.

It is out of the scope of this PR, but is there any block to have delt and full weight transfer shared the same communication function while have their independent protocol to pack, unpack the model weights?

HollowMan6 · 2026-05-08T23:00:25Z

Thank you @ZhiyuLi-Nvidia for pointing this out, I just did some refactoring according to your suggestion, and it looks fine.

copy-pr-bot · 2026-06-20T08:18:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot · 2026-06-20T15:42:13Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

claude · 2026-06-21T21:12:05Z

        advantages = advantages.clamp(min=clip_low)
    if clip_high is not None:
        advantages = advantages.clamp(max=clip_high)
    return advantages


LGTM — this is a well-structured, well-tested PR. The module decomposition is clean, test coverage is thorough across the new utilities, and the lazy initialization fallback in refit_policy_generation ensures the sync/async paths both work correctly even when the early-init optimization in _start_initial_policy_generation_weight_sync doesn't fire.

HollowMan6 · 2026-06-23T21:41:14Z

/ok to test 6c18684

ZhiyuLi-Nvidia · 2026-06-24T20:49:59Z

Thank you for adding vllm http backend. Just curious how's that compared against zmq in perf / resilience?

Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 · 2026-06-30T19:33:17Z

/ok to test 9a8d02b

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Copilot AI review requested due to automatic review settings May 8, 2026 18:14

HollowMan6 requested review from a team as code owners May 8, 2026 18:14

Copilot started reviewing on behalf of HollowMan6 May 8, 2026 18:15 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

Comment thread nemo_rl/utils/weight_transfer.py Outdated

HollowMan6 force-pushed the delta_weight_transfer branch 6 times, most recently from fa0cb08 to a3b3c70 Compare May 12, 2026 22:21

HollowMan6 force-pushed the delta_weight_transfer branch 3 times, most recently from 4022b81 to dd37c84 Compare May 21, 2026 22:41

HollowMan6 force-pushed the delta_weight_transfer branch 3 times, most recently from 2bab874 to 6071cd5 Compare May 27, 2026 20:45

ZhiyuLi-Nvidia reviewed May 27, 2026

View reviewed changes

Comment thread nemo_rl/utils/weight_transfer.py Outdated

HollowMan6 force-pushed the delta_weight_transfer branch from 9c50ebf to 43719b7 Compare May 28, 2026 02:55

HollowMan6 requested a review from a team as a code owner May 28, 2026 19:57

HollowMan6 force-pushed the delta_weight_transfer branch 2 times, most recently from 0b16704 to 3218f2a Compare May 30, 2026 20:43

HollowMan6 requested a review from a team as a code owner May 30, 2026 20:43

HollowMan6 force-pushed the delta_weight_transfer branch from b944c74 to 19f77c9 Compare June 9, 2026 00:50

copy-pr-bot Bot temporarily deployed to public June 9, 2026 02:03 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 02:04 Inactive

copy-pr-bot Bot temporarily deployed to test June 9, 2026 02:05 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 02:06 Inactive

NVIDIA-NeMo deleted a comment from github-actions Bot Jun 9, 2026

HollowMan6 force-pushed the delta_weight_transfer branch from cbbc2b6 to 6ab0904 Compare June 9, 2026 03:09

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:09 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:10 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:13 Inactive

HollowMan6 force-pushed the delta_weight_transfer branch from 6ab0904 to 8794e01 Compare June 9, 2026 03:19

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:40 Inactive

copy-pr-bot Bot temporarily deployed to test June 9, 2026 03:42 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 03:43 Inactive

HollowMan6 added the Feature label Jun 9, 2026

claude Bot reviewed Jun 21, 2026

View reviewed changes

anwithk mentioned this pull request Jun 23, 2026

NeMo RL Software Architecture Update: PR Tracking #2905

Open

29 tasks

ZhiyuLi-Nvidia reviewed Jun 24, 2026

View reviewed changes

Comment thread nemo_rl/models/generation/vllm/config.py Outdated

feat: add S3-based delta-compressed collective refit

9a8d02b

Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 added 2 commits July 2, 2026 00:02

implement locality-aware FIFO batching for vllm receiver

e5be26a

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Add zeromq implementation

75df059

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add delta-compressed collective refit#2444

feat: add delta-compressed collective refit#2444
HollowMan6 wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
HollowMan6:delta_weight_transfer

HollowMan6 commented May 8, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

ZhiyuLi-Nvidia commented May 8, 2026

Uh oh!

HollowMan6 commented May 8, 2026

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

claude Bot Jun 21, 2026

Uh oh!

HollowMan6 commented Jun 23, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 24, 2026

Uh oh!

Uh oh!

HollowMan6 commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

HollowMan6 commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented May 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

ZhiyuLi-Nvidia commented May 8, 2026

Uh oh!

HollowMan6 commented May 8, 2026

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

claude Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

HollowMan6 commented Jun 23, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 24, 2026

Uh oh!

Uh oh!

HollowMan6 commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HollowMan6 commented May 8, 2026 •

edited

Loading