feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports by saumishr · Pull Request #2466 · NVIDIA-NeMo/RL

saumishr · 2026-05-11T23:48:18Z

What does this PR do ?

Introduces the WeightSynchronizer abstraction: a dedicated ABC that decouples weight transfer logic from PolicyInterface and GenerationInterface. This is a purely additive change; no existing code is modified or wired up. It lays the foundation for later MRs that will refactor refit_policy_generation() out of grpo.py and slim down the policy/generation interfaces.

Issues

List issues that this PR closes (syntax):

Usage

from nemo_rl.weight_sync import create_weight_synchronizer
# Factory selects the right transport based on topology + backend
weight_sync = create_weight_synchronizer(
    policy=policy,
    generation=generation,
    generation_backend="vllm",   # or "sglang", "megatron"
    colocated=True,              # colocated → IPC/HTTP; non-colocated → NCCL
)
weight_sync.init_communicator()
# ... after training step ...
weight_sync.mark_stale()
weight_sync.sync_weights()

Changes

nemo_rl/weight_sync/interfaces.py — WeightSynchronizer ABC with 5 abstract methods: sync_weights, is_stale, mark_stale, init_communicator, shutdown
nemo_rl/weight_sync/ipc_weight_synchronizer.py — Colocated vLLM/Megatron transport (ZMQ IPC), mirrors existing refit_policy_generation() IPC path
nemo_rl/weight_sync/http_weight_synchronizer.py — Colocated SGLang transport (HTTP streaming), mirrors existing HTTP path
nemo_rl/weight_sync/collective_weight_synchronizer.py — Non-colocated transport (NCCL collectives), mirrors existing collective path
nemo_rl/weight_sync/factory.py — create_weight_synchronizer() factory that selects the implementation based on colocated flag and generation_backend
nemo_rl/models/generation/constants.py — Centralized string constants for generation backend names (VLLM_BACKEND, SGLANG_BACKEND, MEGATRON_BACKEND)
tests/unit/weight_sync/test_weight_synchronizer.py — 20 unit tests covering ABC contract, all three implementations, and factory logic

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Future MRs will wire the synchronizer into the orchestrator and refactor the policy/generation interfaces

Summary by CodeRabbit

Release Notes

New Features
- Added weight synchronization system to keep policy and generation models in sync across colocated and distributed GPU clusters.
- Support for multiple backends (SGLang, vLLM, Megatron) with optimized synchronization for each deployment topology.
Tests
- Added unit tests for weight synchronizer implementations and factory selection logic.

copy-pr-bot · 2026-05-11T23:48:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

terrykong

Review Summary

Nice abstraction! The WeightSynchronizer ABC and its three transports correctly mirror the existing refit_policy_generation() branches in grpo.py. The factory logic, test suite (20/20 pass), and overall structure are solid. A few doc/code consistency issues to address before this becomes the canonical module.

Linter: PASS (after auto-fixes for import order and unused import)

Informational (not posted inline):

All concrete classes type policy/generation as Any — consider using ColocatablePolicyInterface/GenerationInterface (under TYPE_CHECKING if needed) to get type-checking benefits from the new abstraction.
constants.py defines VLLM_BACKEND/MEGATRON_BACKEND but only SGLANG_BACKEND is used in the new code. Existing raw string literals in grpo.py/distillation.py remain. Consider adopting them in a follow-up MR.
Import order (3 files) and unused PropertyMock import — all auto-fixed by ruff. Please run pre-commit run --all-files before merge.
Test coverage gaps worth adding in follow-up: env-var override path, Timer context, HTTP init_communicator, HTTP/Collective mark_stale, shutdown() no-op, factory kwarg forwarding.

Generated by Claude Code

terrykong · 2026-05-12T06:45:20Z

/claude review

coderabbitai · 2026-05-13T02:37:29Z

Walkthrough

This PR introduces a comprehensive weight synchronization abstraction for coordinating policy-to-generation weight transfers across different deployment topologies. It defines a core WeightSynchronizer interface and three concrete implementations (HTTP for SGLang, IPC for vLLM/Megatron, NCCL for non-colocated clusters), plus a factory dispatcher and extensive test coverage.

Changes

Weight Synchronization Infrastructure

Layer / File(s)	Summary
Generation backend constants `nemo_rl/models/generation/constants.py`	Defines `VLLM_BACKEND`, `SGLANG_BACKEND`, and `MEGATRON_BACKEND` string constants for standardized backend name comparisons.
WeightSynchronizer interface `nemo_rl/weight_sync/interfaces.py`	Abstract base class defining the weight synchronization contract: `sync_weights()` for transferring weights, `is_stale` property and `mark_stale()` for staleness tracking, and lifecycle hooks `init_communicator()` and `shutdown()`.
HTTP synchronizer for colocated SGLang `nemo_rl/weight_sync/http_weight_synchronizer.py`	Implements HTTP-based weight streaming for colocated policy and SGLang generation, including SGLang KV cache invalidation and preparation phases.
IPC synchronizer for colocated vLLM/Megatron `nemo_rl/weight_sync/ipc_weight_synchronizer.py`	Implements ZMQ IPC and CUDA IPC handle-based weight transfer for colocated vLLM/Megatron deployments with dynamic staging buffer sizing via environment ratio or fixed GB parameter.
Collective synchronizer for non-colocated NCCL `nemo_rl/weight_sync/collective_weight_synchronizer.py`	Implements NCCL collective-based synchronization for separate policy and generation Ray GPU clusters, coordinating concurrent broadcasts and updates with combined world size initialization.
Factory dispatcher and public API `nemo_rl/weight_sync/factory.py`, `nemo_rl/weight_sync/__init__.py`	`create_weight_synchronizer()` factory routes to appropriate implementation based on deployment topology (colocated vs. non-colocated) and backend; package exports `WeightSynchronizer` and factory function as public API.
Comprehensive unit tests `tests/unit/weight_sync/test_weight_synchronizer.py`, `tests/unit/weight_sync/__init__.py`	Tests validate abstract base behavior, full sync lifecycles for all three implementations, staleness and KV cache handling, buffer sizing logic, and factory routing across backend and topology combinations.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main changes: introducing a WeightSynchronizer abstraction with three concrete implementations (IPC, HTTP, NCCL) as transport mechanisms.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Test Results For Major Changes	✅ Passed	PR includes 23 unit tests with detailed documentation. Additive change with no existing code modifications, so no regression or performance concerns apply.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_rl/weight_sync/factory.py`:
- Around line 39-40: Fail fast on invalid refit_buffer_size_gb by validating the
parameter before constructing any IPC synchronizer: in the factory function that
accepts refit_buffer_size_gb and returns a WeightSynchronizer, add a check that
if refit_buffer_size_gb is not None and refit_buffer_size_gb <= 0 then raise a
ValueError (with a clear message) before proceeding to create the IPC
synchronizer; apply the same validation at the other IPC-construction site
referenced (the block around lines 76-85) so no non-positive buffer sizes are
allowed to propagate into IPC transfer logic.

In `@nemo_rl/weight_sync/http_weight_synchronizer.py`:
- Around line 61-87: The weight transfer block must ensure restore/rebuild steps
always run even if transfer fails: wrap the transfer and related calls (the with
timer_context: block that calls _generation.get_sglang_url_to_gpu_uuids(),
_generation.invalidate_kv_cache(), and
_policy.stream_weights_via_http()/ray.get()) in a try/finally so that
_policy.offload_after_refit(),
_generation.prepare_for_generation(tags=["kv_cache"]), and setting self._stale =
False execute in the finally; keep the initial _policy.offload_before_refit()
and initial _generation.prepare_for_generation(tags=["weights"]) before the try,
and ensure any exceptions are re-raised after the finally if needed.

In `@nemo_rl/weight_sync/ipc_weight_synchronizer.py`:
- Around line 119-123: Validate and sanitize refit buffer inputs: if
_refit_buffer_size_gb is provided but <= 0, raise or fallback to computing from
memory ratio; when reading NRl_REFIT_BUFFER_MEMORY_RATIO parse it safely (catch
ValueError) and ensure the ratio is a float in (0,1], defaulting to 0.3 if
malformed or out of range; compute bytes using _policy.get_free_memory_bytes() *
ratio, cast to int, and ensure the returned size is non-negative and at least a
small minimum (or raise) to avoid invalid transfer buffer sizes. Reference
symbols: _refit_buffer_size_gb, NRl_REFIT_BUFFER_MEMORY_RATIO (env key), and
_policy.get_free_memory_bytes().
- Around line 72-102: Wrap the weight-transfer block in a try/finally so
restoration always runs: call self._policy.offload_before_refit() and the
transfer/update code as before, but ensure self._policy.offload_after_refit(),
self._generation.prepare_for_generation(tags=["kv_cache"]), and self._stale =
False execute in a finally block so they run even if ray.get(...) or the update
checks raise; preserve re-raising the original exception after finally so
failures are still propagated. Reference methods:
self._policy.offload_before_refit, self._policy.offload_after_refit,
self._generation.prepare_for_generation, stream_weights_via_ipc_zmq,
update_weights_via_ipc_zmq, and the _stale flag.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 98a4db78-25a2-4a1d-bda4-76b06607640e

📥 Commits

Reviewing files that changed from the base of the PR and between 4641794 and 3132fbf.

📒 Files selected for processing (9)

nemo_rl/models/generation/constants.py
nemo_rl/weight_sync/__init__.py
nemo_rl/weight_sync/collective_weight_synchronizer.py
nemo_rl/weight_sync/factory.py
nemo_rl/weight_sync/http_weight_synchronizer.py
nemo_rl/weight_sync/interfaces.py
nemo_rl/weight_sync/ipc_weight_synchronizer.py
tests/unit/weight_sync/__init__.py
tests/unit/weight_sync/test_weight_synchronizer.py

terrykong · 2026-05-21T04:45:43Z

/ok to test 66232a6

terrykong · 2026-05-22T04:33:46Z

/ok to test aadf353

…ansports Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com> Co-authored-by: Saurabh Mishra <sauramishra@nvidia.com> [NemoRL]: fix: Address review feedback on WeightSynchronizer ABC Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com> fix: Address CodeRabbitAI review feedback on WeightSynchronizer - Validate refit_buffer_size_gb > 0 in factory and _compute_buffer_size - Safely parse NRL_REFIT_BUFFER_MEMORY_RATIO env var with error handling - Wrap IPC and HTTP sync_weights transfer in try/finally so phase restoration (offload_after_refit, kv_cache rebuild) always runs - Add 7 new tests covering validation and phase-restoration behavior Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

saumishr · 2026-05-26T17:03:41Z

/ok to test 5f41ac0

…ansports (#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

…ansports (NVIDIA-NeMo#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

saumishr force-pushed the modularity/weight-sync-abc branch 2 times, most recently from 2484a45 to 29fed73 Compare May 11, 2026 23:51

saumishr changed the title ~~[NemoRL] Introduce WeightSynchronizer ABC and transport backends for pluggable weight sync~~ feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports May 11, 2026

saumishr mentioned this pull request May 12, 2026

feat: Wire WeightSynchronizer into algorithm layer, replacing inline refit logic #2467

Open

4 tasks

terrykong reviewed May 12, 2026

View reviewed changes

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread tests/unit/weight_sync/test_weight_synchronizer.py Outdated

saumishr force-pushed the modularity/weight-sync-abc branch from a7743be to 3132fbf Compare May 12, 2026 18:38

saumishr marked this pull request as ready for review May 13, 2026 02:33

saumishr requested review from a team as code owners May 13, 2026 02:33

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread nemo_rl/weight_sync/factory.py

Comment thread nemo_rl/weight_sync/http_weight_synchronizer.py Outdated

Comment thread nemo_rl/weight_sync/ipc_weight_synchronizer.py Outdated

Comment thread nemo_rl/weight_sync/ipc_weight_synchronizer.py Outdated

HollowMan6 mentioned this pull request May 19, 2026

feat: add delta-compressed collective refit #2444

Open

4 tasks

saumishr force-pushed the modularity/weight-sync-abc branch from 3132fbf to db87da7 Compare May 19, 2026 22:39

terrykong previously approved these changes May 21, 2026

View reviewed changes

terrykong added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 21, 2026

terrykong enabled auto-merge (squash) May 21, 2026 04:45

copy-pr-bot Bot temporarily deployed to public May 21, 2026 04:45 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 04:46 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 21, 2026 04:46 Error

copy-pr-bot Bot temporarily deployed to nemo-ci May 21, 2026 04:46 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 04:46 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 04:50 Inactive

auto-merge was automatically disabled May 21, 2026 04:54
Head branch was pushed to by a user without write access

saumishr dismissed terrykong’s stale review via 6234e11 May 21, 2026 04:54

saumishr force-pushed the modularity/weight-sync-abc branch from 66232a6 to 6234e11 Compare May 21, 2026 04:54

saumishr dismissed terrykong’s stale review via aadf353 May 21, 2026 23:48

saumishr force-pushed the modularity/weight-sync-abc branch from 2c76900 to aadf353 Compare May 21, 2026 23:48

terrykong enabled auto-merge (squash) May 22, 2026 04:32

terrykong previously approved these changes May 22, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public May 22, 2026 04:34 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 22, 2026 04:34 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 04:34 Inactive

copy-pr-bot Bot temporarily deployed to public May 22, 2026 04:38 Inactive

auto-merge was automatically disabled May 26, 2026 15:36
Head branch was pushed to by a user without write access

saumishr dismissed terrykong’s stale review via b9fcf8b May 26, 2026 15:36

saumishr force-pushed the modularity/weight-sync-abc branch from aadf353 to b9fcf8b Compare May 26, 2026 15:36

saumishr force-pushed the modularity/weight-sync-abc branch from b9fcf8b to 5f41ac0 Compare May 26, 2026 17:02

copy-pr-bot Bot temporarily deployed to public May 26, 2026 17:03 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 26, 2026 17:04 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 17:04 Inactive

terrykong enabled auto-merge (squash) May 26, 2026 17:04

copy-pr-bot Bot temporarily deployed to public May 26, 2026 17:04 Inactive

terrykong approved these changes May 26, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public May 26, 2026 17:08 Inactive

terrykong merged commit bb7e845 into NVIDIA-NeMo:main May 26, 2026
36 checks passed

yfw pushed a commit that referenced this pull request May 27, 2026

feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL tr…

428ebc5

…ansports (#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

terrykong mentioned this pull request May 28, 2026

feat: Update Megatron Inference #2355

Merged

4 tasks

vigneshwaran pushed a commit to vigneshwaran/RL that referenced this pull request May 28, 2026

feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL tr…

5c149d6

…ansports (NVIDIA-NeMo#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

anwithk mentioned this pull request Jun 23, 2026

NeMo RL Software Architecture Update: PR Tracking #2905

Open

29 tasks

Uh oh!

Conversation

saumishr commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Changes

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented May 11, 2026

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Review Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

terrykong commented May 12, 2026

Uh oh!

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

terrykong commented May 21, 2026

Uh oh!

terrykong commented May 22, 2026

Uh oh!

saumishr commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

saumishr commented May 11, 2026 •

edited

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading