feat: [NemoRL] Introduce WeightSynchronizer ABC with IPC/HTTP/NCCL transports#2466
Conversation
2484a45 to
29fed73
Compare
terrykong
left a comment
There was a problem hiding this comment.
Review Summary
Nice abstraction! The WeightSynchronizer ABC and its three transports correctly mirror the existing refit_policy_generation() branches in grpo.py. The factory logic, test suite (20/20 pass), and overall structure are solid. A few doc/code consistency issues to address before this becomes the canonical module.
Linter: PASS (after auto-fixes for import order and unused import)
Informational (not posted inline):
- All concrete classes type
policy/generationasAny— consider usingColocatablePolicyInterface/GenerationInterface(underTYPE_CHECKINGif needed) to get type-checking benefits from the new abstraction. constants.pydefinesVLLM_BACKEND/MEGATRON_BACKENDbut onlySGLANG_BACKENDis used in the new code. Existing raw string literals in grpo.py/distillation.py remain. Consider adopting them in a follow-up MR.- Import order (3 files) and unused
PropertyMockimport — all auto-fixed by ruff. Please runpre-commit run --all-filesbefore merge. - Test coverage gaps worth adding in follow-up: env-var override path, Timer context, HTTP
init_communicator, HTTP/Collectivemark_stale,shutdown()no-op, factory kwarg forwarding.
Generated by Claude Code
|
/claude review |
a7743be to
3132fbf
Compare
WalkthroughThis PR introduces a comprehensive weight synchronization abstraction for coordinating policy-to-generation weight transfers across different deployment topologies. It defines a core ChangesWeight Synchronization Infrastructure
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_rl/weight_sync/factory.py`:
- Around line 39-40: Fail fast on invalid refit_buffer_size_gb by validating the
parameter before constructing any IPC synchronizer: in the factory function that
accepts refit_buffer_size_gb and returns a WeightSynchronizer, add a check that
if refit_buffer_size_gb is not None and refit_buffer_size_gb <= 0 then raise a
ValueError (with a clear message) before proceeding to create the IPC
synchronizer; apply the same validation at the other IPC-construction site
referenced (the block around lines 76-85) so no non-positive buffer sizes are
allowed to propagate into IPC transfer logic.
In `@nemo_rl/weight_sync/http_weight_synchronizer.py`:
- Around line 61-87: The weight transfer block must ensure restore/rebuild steps
always run even if transfer fails: wrap the transfer and related calls (the with
timer_context: block that calls _generation.get_sglang_url_to_gpu_uuids(),
_generation.invalidate_kv_cache(), and
_policy.stream_weights_via_http()/ray.get()) in a try/finally so that
_policy.offload_after_refit(),
_generation.prepare_for_generation(tags=["kv_cache"]), and setting self._stale =
False execute in the finally; keep the initial _policy.offload_before_refit()
and initial _generation.prepare_for_generation(tags=["weights"]) before the try,
and ensure any exceptions are re-raised after the finally if needed.
In `@nemo_rl/weight_sync/ipc_weight_synchronizer.py`:
- Around line 119-123: Validate and sanitize refit buffer inputs: if
_refit_buffer_size_gb is provided but <= 0, raise or fallback to computing from
memory ratio; when reading NRl_REFIT_BUFFER_MEMORY_RATIO parse it safely (catch
ValueError) and ensure the ratio is a float in (0,1], defaulting to 0.3 if
malformed or out of range; compute bytes using _policy.get_free_memory_bytes() *
ratio, cast to int, and ensure the returned size is non-negative and at least a
small minimum (or raise) to avoid invalid transfer buffer sizes. Reference
symbols: _refit_buffer_size_gb, NRl_REFIT_BUFFER_MEMORY_RATIO (env key), and
_policy.get_free_memory_bytes().
- Around line 72-102: Wrap the weight-transfer block in a try/finally so
restoration always runs: call self._policy.offload_before_refit() and the
transfer/update code as before, but ensure self._policy.offload_after_refit(),
self._generation.prepare_for_generation(tags=["kv_cache"]), and self._stale =
False execute in a finally block so they run even if ray.get(...) or the update
checks raise; preserve re-raising the original exception after finally so
failures are still propagated. Reference methods:
self._policy.offload_before_refit, self._policy.offload_after_refit,
self._generation.prepare_for_generation, stream_weights_via_ipc_zmq,
update_weights_via_ipc_zmq, and the _stale flag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 98a4db78-25a2-4a1d-bda4-76b06607640e
📒 Files selected for processing (9)
nemo_rl/models/generation/constants.pynemo_rl/weight_sync/__init__.pynemo_rl/weight_sync/collective_weight_synchronizer.pynemo_rl/weight_sync/factory.pynemo_rl/weight_sync/http_weight_synchronizer.pynemo_rl/weight_sync/interfaces.pynemo_rl/weight_sync/ipc_weight_synchronizer.pytests/unit/weight_sync/__init__.pytests/unit/weight_sync/test_weight_synchronizer.py
3132fbf to
db87da7
Compare
|
/ok to test 66232a6 |
Head branch was pushed to by a user without write access
66232a6 to
6234e11
Compare
2c76900 to
aadf353
Compare
|
/ok to test aadf353 |
Head branch was pushed to by a user without write access
aadf353 to
b9fcf8b
Compare
…ansports Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com> Co-authored-by: Saurabh Mishra <sauramishra@nvidia.com> [NemoRL]: fix: Address review feedback on WeightSynchronizer ABC Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com> fix: Address CodeRabbitAI review feedback on WeightSynchronizer - Validate refit_buffer_size_gb > 0 in factory and _compute_buffer_size - Safely parse NRL_REFIT_BUFFER_MEMORY_RATIO env var with error handling - Wrap IPC and HTTP sync_weights transfer in try/finally so phase restoration (offload_after_refit, kv_cache rebuild) always runs - Add 7 new tests covering validation and phase-restoration behavior Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
b9fcf8b to
5f41ac0
Compare
|
/ok to test 5f41ac0 |
…ansports (#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
…ansports (NVIDIA-NeMo#2466) Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
What does this PR do ?
Introduces the WeightSynchronizer abstraction: a dedicated ABC that decouples weight transfer logic from PolicyInterface and GenerationInterface. This is a purely additive change; no existing code is modified or wired up. It lays the foundation for later MRs that will refactor refit_policy_generation() out of grpo.py and slim down the policy/generation interfaces.
Issues
List issues that this PR closes (syntax):
Usage
Changes
nemo_rl/weight_sync/interfaces.py — WeightSynchronizer ABC with 5 abstract methods: sync_weights, is_stale, mark_stale, init_communicator, shutdown
nemo_rl/weight_sync/ipc_weight_synchronizer.py — Colocated vLLM/Megatron transport (ZMQ IPC), mirrors existing refit_policy_generation() IPC path
nemo_rl/weight_sync/http_weight_synchronizer.py — Colocated SGLang transport (HTTP streaming), mirrors existing HTTP path
nemo_rl/weight_sync/collective_weight_synchronizer.py — Non-colocated transport (NCCL collectives), mirrors existing collective path
nemo_rl/weight_sync/factory.py — create_weight_synchronizer() factory that selects the implementation based on colocated flag and generation_backend
nemo_rl/models/generation/constants.py — Centralized string constants for generation backend names (VLLM_BACKEND, SGLANG_BACKEND, MEGATRON_BACKEND)
tests/unit/weight_sync/test_weight_synchronizer.py — 20 unit tests covering ABC contract, all three implementations, and factory logic
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Tests