Skip to content

fix(rm): raise clear error when CP is used with DTensor RM training#2483

Merged
terrykong merged 3 commits into
mainfrom
terryk/rl-749-rm-cp-error-message
May 15, 2026
Merged

fix(rm): raise clear error when CP is used with DTensor RM training#2483
terrykong merged 3 commits into
mainfrom
terryk/rl-749-rm-cp-error-message

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

Summary

  • Adds early validation in rm.setup() that raises a clear ValueError when context_parallel_size > 1 is set for reward model training on the DTensor backend
  • Adds unit tests for the validation (both the rejection and the pass-through cases)
  • All validation and tests are tagged with TODO(https://github.com/NVIDIA-NeMo/RL/issues/2482) for easy cleanup when CP support is implemented

Context

Related to #2482 — this PR does not implement CP support for RM training, it only surfaces a clear error message instead of letting users hit cryptic runtime failures (Unknown parallel style: local_rowwise or NotImplementedError: Operator aten.log_sigmoid_forward.default does not have a sharding strategy registered).

Test plan

  • uv run --group test pytest tests/unit/algorithms/test_rm.py -v passes
  • test_context_parallel_rejected_for_dtensor_rm verifies the ValueError fires when CP > 1
  • test_context_parallel_allowed_when_one verifies CP=1 passes the validation

@terrykong terrykong requested review from a team as code owners May 13, 2026 07:23
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@terrykong terrykong added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 13, 2026
@terrykong terrykong requested a review from yuki-97 May 13, 2026 07:24
@terrykong terrykong changed the title fix(rm): raise clear error when context parallelism is used with DTensor RM training fix(rm): raise clear error when CP is used with DTensor RM training May 13, 2026
@terrykong terrykong enabled auto-merge (squash) May 13, 2026 07:25
@terrykong
Copy link
Copy Markdown
Collaborator Author

/ok to test 01993fb

Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue: test_context_parallel_allowed_when_one uses a broken negative lookahead regex that will not catch regressions.

Comment thread tests/unit/algorithms/test_rm.py Outdated
…sor RM training

Context parallelism (context_parallel_size > 1) is not supported for
reward model training on the DTensor backend because the log_sigmoid
operator lacks a DTensor sharding strategy for CP meshes. Instead of
letting users hit cryptic runtime errors, raise a clear ValueError
during setup with a link to the tracking issue.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong force-pushed the terryk/rl-749-rm-cp-error-message branch from 4157ff0 to d318709 Compare May 15, 2026 05:28
@terrykong
Copy link
Copy Markdown
Collaborator Author

/ok to test d318709

The NeMo Gym docs URL returns 404, causing sphinx-build CI to fail.
Add the URL pattern to linkcheck_ignore since the external docs site
is not under our control.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner May 15, 2026 06:23
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label May 15, 2026
@terrykong
Copy link
Copy Markdown
Collaborator Author

/ok to test 9c93d8d

Replace the blanket linkcheck_ignore for all NeMo Gym docs with
pinned v0.2.1 URLs so linkcheck still validates them.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong
Copy link
Copy Markdown
Collaborator Author

/ok to test 4c24d9e

@terrykong terrykong merged commit f87cb1b into main May 15, 2026
47 of 53 checks passed
@terrykong terrykong deleted the terryk/rl-749-rm-cp-error-message branch May 15, 2026 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants