Skip to content

feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports#2466

Merged
terrykong merged 1 commit into
NVIDIA-NeMo:mainfrom
saumishr:modularity/weight-sync-abc
May 26, 2026
Merged

feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports#2466
terrykong merged 1 commit into
NVIDIA-NeMo:mainfrom
saumishr:modularity/weight-sync-abc

Conversation

@saumishr

@saumishr saumishr commented May 11, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Introduces the WeightSynchronizer abstraction: a dedicated ABC that decouples weight transfer logic from PolicyInterface and GenerationInterface. This is a purely additive change; no existing code is modified or wired up. It lays the foundation for later MRs that will refactor refit_policy_generation() out of grpo.py and slim down the policy/generation interfaces.

Issues

List issues that this PR closes (syntax):

Usage

from nemo_rl.weight_sync import create_weight_synchronizer
# Factory selects the right transport based on topology + backend
weight_sync = create_weight_synchronizer(
    policy=policy,
    generation=generation,
    generation_backend="vllm",   # or "sglang", "megatron"
    colocated=True,              # colocated → IPC/HTTP; non-colocated → NCCL
)
weight_sync.init_communicator()
# ... after training step ...
weight_sync.mark_stale()
weight_sync.sync_weights()

Changes

  • nemo_rl/weight_sync/interfaces.py — WeightSynchronizer ABC with 5 abstract methods: sync_weights, is_stale, mark_stale, init_communicator, shutdown

  • nemo_rl/weight_sync/ipc_weight_synchronizer.py — Colocated vLLM/Megatron transport (ZMQ IPC), mirrors existing refit_policy_generation() IPC path

  • nemo_rl/weight_sync/http_weight_synchronizer.py — Colocated SGLang transport (HTTP streaming), mirrors existing HTTP path

  • nemo_rl/weight_sync/collective_weight_synchronizer.py — Non-colocated transport (NCCL collectives), mirrors existing collective path

  • nemo_rl/weight_sync/factory.py — create_weight_synchronizer() factory that selects the implementation based on colocated flag and generation_backend

  • nemo_rl/models/generation/constants.py — Centralized string constants for generation backend names (VLLM_BACKEND, SGLANG_BACKEND, MEGATRON_BACKEND)

  • tests/unit/weight_sync/test_weight_synchronizer.py — 20 unit tests covering ABC contract, all three implementations, and factory logic

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • Future MRs will wire the synchronizer into the orchestrator and refactor the policy/generation interfaces

Summary by CodeRabbit

Release Notes

  • New Features

    • Added weight synchronization system to keep policy and generation models in sync across colocated and distributed GPU clusters.
    • Support for multiple backends (SGLang, vLLM, Megatron) with optimized synchronization for each deployment topology.
  • Tests

    • Added unit tests for weight synchronizer implementations and factory selection logic.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@saumishr saumishr force-pushed the modularity/weight-sync-abc branch 2 times, most recently from 2484a45 to 29fed73 Compare May 11, 2026 23:51
@saumishr saumishr changed the title [NemoRL] Introduce WeightSynchronizer ABC and transport backends for pluggable weight sync feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports May 11, 2026

@terrykong terrykong left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Nice abstraction! The WeightSynchronizer ABC and its three transports correctly mirror the existing refit_policy_generation() branches in grpo.py. The factory logic, test suite (20/20 pass), and overall structure are solid. A few doc/code consistency issues to address before this becomes the canonical module.

Linter: PASS (after auto-fixes for import order and unused import)

Informational (not posted inline):

  • All concrete classes type policy/generation as Any — consider using ColocatablePolicyInterface/GenerationInterface (under TYPE_CHECKING if needed) to get type-checking benefits from the new abstraction.
  • constants.py defines VLLM_BACKEND/MEGATRON_BACKEND but only SGLANG_BACKEND is used in the new code. Existing raw string literals in grpo.py/distillation.py remain. Consider adopting them in a follow-up MR.
  • Import order (3 files) and unused PropertyMock import — all auto-fixed by ruff. Please run pre-commit run --all-files before merge.
  • Test coverage gaps worth adding in follow-up: env-var override path, Timer context, HTTP init_communicator, HTTP/Collective mark_stale, shutdown() no-op, factory kwarg forwarding.

Generated by Claude Code

Comment thread nemo_rl/weight_sync/interfaces.py
Comment thread nemo_rl/weight_sync/factory.py Outdated
Comment thread nemo_rl/weight_sync/http_weight_synchronizer.py Outdated
Comment thread nemo_rl/weight_sync/collective_weight_synchronizer.py Outdated
Comment thread nemo_rl/weight_sync/interfaces.py
Comment thread tests/unit/weight_sync/test_weight_synchronizer.py
Comment thread nemo_rl/weight_sync/collective_weight_synchronizer.py
Comment thread nemo_rl/weight_sync/collective_weight_synchronizer.py
Comment thread nemo_rl/weight_sync/interfaces.py Outdated
@terrykong

Copy link
Copy Markdown
Collaborator

/claude review

Comment thread tests/unit/weight_sync/test_weight_synchronizer.py Outdated
@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from a7743be to 3132fbf Compare May 12, 2026 18:38
@saumishr saumishr marked this pull request as ready for review May 13, 2026 02:33
@saumishr saumishr requested review from a team as code owners May 13, 2026 02:33
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR introduces a comprehensive weight synchronization abstraction for coordinating policy-to-generation weight transfers across different deployment topologies. It defines a core WeightSynchronizer interface and three concrete implementations (HTTP for SGLang, IPC for vLLM/Megatron, NCCL for non-colocated clusters), plus a factory dispatcher and extensive test coverage.

Changes

Weight Synchronization Infrastructure

Layer / File(s) Summary
Generation backend constants
nemo_rl/models/generation/constants.py
Defines VLLM_BACKEND, SGLANG_BACKEND, and MEGATRON_BACKEND string constants for standardized backend name comparisons.
WeightSynchronizer interface
nemo_rl/weight_sync/interfaces.py
Abstract base class defining the weight synchronization contract: sync_weights() for transferring weights, is_stale property and mark_stale() for staleness tracking, and lifecycle hooks init_communicator() and shutdown().
HTTP synchronizer for colocated SGLang
nemo_rl/weight_sync/http_weight_synchronizer.py
Implements HTTP-based weight streaming for colocated policy and SGLang generation, including SGLang KV cache invalidation and preparation phases.
IPC synchronizer for colocated vLLM/Megatron
nemo_rl/weight_sync/ipc_weight_synchronizer.py
Implements ZMQ IPC and CUDA IPC handle-based weight transfer for colocated vLLM/Megatron deployments with dynamic staging buffer sizing via environment ratio or fixed GB parameter.
Collective synchronizer for non-colocated NCCL
nemo_rl/weight_sync/collective_weight_synchronizer.py
Implements NCCL collective-based synchronization for separate policy and generation Ray GPU clusters, coordinating concurrent broadcasts and updates with combined world size initialization.
Factory dispatcher and public API
nemo_rl/weight_sync/factory.py, nemo_rl/weight_sync/__init__.py
create_weight_synchronizer() factory routes to appropriate implementation based on deployment topology (colocated vs. non-colocated) and backend; package exports WeightSynchronizer and factory function as public API.
Comprehensive unit tests
tests/unit/weight_sync/test_weight_synchronizer.py, tests/unit/weight_sync/__init__.py
Tests validate abstract base behavior, full sync lifecycles for all three implementations, staleness and KV cache handling, buffer sizing logic, and factory routing across backend and topology combinations.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: introducing a WeightSynchronizer abstraction with three concrete implementations (IPC, HTTP, NCCL) as transport mechanisms.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Test Results For Major Changes ✅ Passed PR includes 23 unit tests with detailed documentation. Additive change with no existing code modifications, so no regression or performance concerns apply.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_rl/weight_sync/factory.py`:
- Around line 39-40: Fail fast on invalid refit_buffer_size_gb by validating the
parameter before constructing any IPC synchronizer: in the factory function that
accepts refit_buffer_size_gb and returns a WeightSynchronizer, add a check that
if refit_buffer_size_gb is not None and refit_buffer_size_gb <= 0 then raise a
ValueError (with a clear message) before proceeding to create the IPC
synchronizer; apply the same validation at the other IPC-construction site
referenced (the block around lines 76-85) so no non-positive buffer sizes are
allowed to propagate into IPC transfer logic.

In `@nemo_rl/weight_sync/http_weight_synchronizer.py`:
- Around line 61-87: The weight transfer block must ensure restore/rebuild steps
always run even if transfer fails: wrap the transfer and related calls (the with
timer_context: block that calls _generation.get_sglang_url_to_gpu_uuids(),
_generation.invalidate_kv_cache(), and
_policy.stream_weights_via_http()/ray.get()) in a try/finally so that
_policy.offload_after_refit(),
_generation.prepare_for_generation(tags=["kv_cache"]), and setting self._stale =
False execute in the finally; keep the initial _policy.offload_before_refit()
and initial _generation.prepare_for_generation(tags=["weights"]) before the try,
and ensure any exceptions are re-raised after the finally if needed.

In `@nemo_rl/weight_sync/ipc_weight_synchronizer.py`:
- Around line 119-123: Validate and sanitize refit buffer inputs: if
_refit_buffer_size_gb is provided but <= 0, raise or fallback to computing from
memory ratio; when reading NRl_REFIT_BUFFER_MEMORY_RATIO parse it safely (catch
ValueError) and ensure the ratio is a float in (0,1], defaulting to 0.3 if
malformed or out of range; compute bytes using _policy.get_free_memory_bytes() *
ratio, cast to int, and ensure the returned size is non-negative and at least a
small minimum (or raise) to avoid invalid transfer buffer sizes. Reference
symbols: _refit_buffer_size_gb, NRl_REFIT_BUFFER_MEMORY_RATIO (env key), and
_policy.get_free_memory_bytes().
- Around line 72-102: Wrap the weight-transfer block in a try/finally so
restoration always runs: call self._policy.offload_before_refit() and the
transfer/update code as before, but ensure self._policy.offload_after_refit(),
self._generation.prepare_for_generation(tags=["kv_cache"]), and self._stale =
False execute in a finally block so they run even if ray.get(...) or the update
checks raise; preserve re-raising the original exception after finally so
failures are still propagated. Reference methods:
self._policy.offload_before_refit, self._policy.offload_after_refit,
self._generation.prepare_for_generation, stream_weights_via_ipc_zmq,
update_weights_via_ipc_zmq, and the _stale flag.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 98a4db78-25a2-4a1d-bda4-76b06607640e

📥 Commits

Reviewing files that changed from the base of the PR and between 4641794 and 3132fbf.

📒 Files selected for processing (9)
  • nemo_rl/models/generation/constants.py
  • nemo_rl/weight_sync/__init__.py
  • nemo_rl/weight_sync/collective_weight_synchronizer.py
  • nemo_rl/weight_sync/factory.py
  • nemo_rl/weight_sync/http_weight_synchronizer.py
  • nemo_rl/weight_sync/interfaces.py
  • nemo_rl/weight_sync/ipc_weight_synchronizer.py
  • tests/unit/weight_sync/__init__.py
  • tests/unit/weight_sync/test_weight_synchronizer.py

Comment thread nemo_rl/weight_sync/factory.py
Comment thread nemo_rl/weight_sync/http_weight_synchronizer.py Outdated
Comment thread nemo_rl/weight_sync/ipc_weight_synchronizer.py Outdated
Comment thread nemo_rl/weight_sync/ipc_weight_synchronizer.py Outdated
@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from 3132fbf to db87da7 Compare May 19, 2026 22:39
terrykong
terrykong previously approved these changes May 21, 2026
@terrykong terrykong added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 21, 2026
@terrykong terrykong enabled auto-merge (squash) May 21, 2026 04:45
@terrykong

Copy link
Copy Markdown
Collaborator

/ok to test 66232a6

auto-merge was automatically disabled May 21, 2026 04:54

Head branch was pushed to by a user without write access

@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from 66232a6 to 6234e11 Compare May 21, 2026 04:54
@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from 2c76900 to aadf353 Compare May 21, 2026 23:48
@terrykong terrykong enabled auto-merge (squash) May 22, 2026 04:32
terrykong
terrykong previously approved these changes May 22, 2026
@terrykong

Copy link
Copy Markdown
Collaborator

/ok to test aadf353

auto-merge was automatically disabled May 26, 2026 15:36

Head branch was pushed to by a user without write access

@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from aadf353 to b9fcf8b Compare May 26, 2026 15:36
…ansports

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
Co-authored-by: Saurabh Mishra <sauramishra@nvidia.com>

[NemoRL]: fix: Address review feedback on WeightSynchronizer ABC

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>

fix: Address CodeRabbitAI review feedback on WeightSynchronizer

- Validate refit_buffer_size_gb > 0 in factory and _compute_buffer_size
- Safely parse NRL_REFIT_BUFFER_MEMORY_RATIO env var with error handling
- Wrap IPC and HTTP sync_weights transfer in try/finally so phase
  restoration (offload_after_refit, kv_cache rebuild) always runs
- Add 7 new tests covering validation and phase-restoration behavior

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
@saumishr saumishr force-pushed the modularity/weight-sync-abc branch from b9fcf8b to 5f41ac0 Compare May 26, 2026 17:02
@saumishr

Copy link
Copy Markdown
Contributor Author

/ok to test 5f41ac0

@terrykong terrykong merged commit bb7e845 into NVIDIA-NeMo:main May 26, 2026
36 checks passed
yfw pushed a commit that referenced this pull request May 27, 2026
…ansports (#2466)

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
@terrykong terrykong mentioned this pull request May 28, 2026
4 tasks
vigneshwaran pushed a commit to vigneshwaran/RL that referenced this pull request May 28, 2026
…ansports (NVIDIA-NeMo#2466)

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants