fix: use prompt token length for advantage group extraction and fix token mask by yfw · Pull Request #2176 · NVIDIA-NeMo/RL

yfw · 2026-03-30T23:17:17Z

This PR fixes two multi-turn GRPO training issues:

The previous role-based extraction (_extract_prompt_only_messages) broke on multi-turn prompts containing assistant messages in the conversation history — it would strip them, corrupting the prompt IDs used for advantage estimation.
Replace with extract_initial_prompt_messages() which uses the length field to identify the original prompt boundary. Applied to both sync and async GRPO paths.
GRPO token loss masks previously unmasked every message with role == "assistant". In multi-turn data, assistant messages can be part of the prompt history, not generated rollout output, so those tokens should not contribute to the policy loss. This PR updates masking so only assistant messages produced by generation, identified by existing generation_logprobs, are trainable. Missing generation_logprobs are still filled with zeros for downstream tensorization.

Closes #1960 and #1956

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-03-30T23:17:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuki-97 · 2026-03-31T08:39:27Z

/ok to test 628a248

macandro96 · 2026-05-21T23:28:17Z

/ok to test 7d96ad3

copy-pr-bot · 2026-05-21T23:28:20Z

/ok to test 7d96ad3

@macandro96, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

macandro96 · 2026-05-21T23:29:41Z

/ok to test d5961e5

The previous role-based extraction (`_extract_prompt_only_messages`) broke on multi-turn prompts containing assistant messages in the conversation history — it would strip them, corrupting the prompt IDs used for advantage estimation. Replace with `extract_initial_prompt_messages()` which uses the `length` field to identify the original prompt boundary. Applied to both sync and async GRPO paths. Closes #1960 Co-Authored-By: Jiaqi Zeng <jiaqiz@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>

Signed-off-by: Anish Mahishi <amahishi@cw-dfw-cs-001-vscode-02.cm.cluster>

yuki-97 · 2026-05-22T14:50:44Z

+            role = cast(str, message["role"])
+            token_ids = cast(torch.Tensor, message["token_ids"])
+
+            if role == "assistant" and "generation_logprobs" in message:


we didn't check generation_logprobs before, is there a reason we need to check it now?

Suggested change

if role == "assistant" and "generation_logprobs" in message:

if role == "assistant":

Yes - I think we want to set token_mask = 1 for assistant part of messages where generation logprobs are available. If its not available, it means - that assistant text was part of input prompt for a multi-turn conversation and should be excluded while computing gradients.

This was a separate commit for super. I combined it into this PR as it was related.

yuki-97 · 2026-05-22T14:50:59Z

-                        prompt_only_message_logs,
-                        pad_value_dict={"token_ids": tokenizer.pad_token_id},
+
+                    prompt_batched_flat, prompt_input_lengths = (


nit: looks prompt_input_lengths is never used.

Suggested change

prompt_batched_flat, prompt_input_lengths = (

prompt_batched_flat, _ = (

yuki-97 · 2026-05-22T14:51:18Z

-                    prompt_batched_flat, _ = batched_message_log_to_flat_message(
-                        prompt_only_message_logs,
-                        pad_value_dict={"token_ids": tokenizer.pad_token_id},
+                    prompt_batched_flat, prompt_input_lengths = (


nit: same as above

Suggested change

prompt_batched_flat, prompt_input_lengths = (

prompt_batched_flat, _ = (

yuki-97 · 2026-05-22T14:54:20Z

@yfw @HeyyyyyyG could you help to take a review as well?

yfw requested a review from a team as a code owner March 30, 2026 23:17

yfw added the super-v3 label Mar 30, 2026

yfw requested a review from a team as a code owner March 30, 2026 23:17

yuki-97 previously approved these changes Mar 31, 2026

View reviewed changes

yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Mar 31, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci March 31, 2026 08:39 Inactive

macandro96 dismissed yuki-97’s stale review via 7d96ad3 May 21, 2026 23:07

macandro96 force-pushed the yifu/fix-prompt-extraction-multi-turn branch from 628a248 to 7d96ad3 Compare May 21, 2026 23:07

macandro96 requested a review from yuki-97 May 21, 2026 23:18

macandro96 force-pushed the yifu/fix-prompt-extraction-multi-turn branch from 7d96ad3 to d5961e5 Compare May 21, 2026 23:20

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:29 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:30 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 21, 2026 23:30 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:30 Inactive

copy-pr-bot Bot temporarily deployed to public May 21, 2026 23:34 Inactive

macandro96 changed the title ~~fix: use prompt token length for advantage group extraction~~ fix: use prompt token length for advantage group extraction and fix token mask May 21, 2026

yfw and others added 2 commits May 22, 2026 10:30

fix: only train on generated assistant turns

20adf67

Signed-off-by: Anish Mahishi <amahishi@cw-dfw-cs-001-vscode-02.cm.cluster>

macandro96 force-pushed the yifu/fix-prompt-extraction-multi-turn branch from d5961e5 to 20adf67 Compare May 22, 2026 14:30

yuki-97 reviewed May 22, 2026

View reviewed changes

yuki-97 requested a review from HeyyyyyyG May 22, 2026 14:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use prompt token length for advantage group extraction and fix token mask#2176

fix: use prompt token length for advantage group extraction and fix token mask#2176
yfw wants to merge 2 commits into
mainfrom
yifu/fix-prompt-extraction-multi-turn

yfw commented Mar 30, 2026 •

edited by macandro96

Loading

Uh oh!

copy-pr-bot Bot commented Mar 30, 2026

Uh oh!

yuki-97 commented Mar 31, 2026

Uh oh!

macandro96 commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

macandro96 commented May 21, 2026

Uh oh!

yuki-97 May 22, 2026

Uh oh!

macandro96 May 22, 2026

Uh oh!

macandro96 May 22, 2026

Uh oh!

yuki-97 May 22, 2026

Uh oh!

yuki-97 May 22, 2026

Uh oh!

yuki-97 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	if role == "assistant" and "generation_logprobs" in message:
	if role == "assistant":

	prompt_batched_flat, prompt_input_lengths = (
	prompt_batched_flat, _ = (

Conversation

yfw commented Mar 30, 2026 • edited by macandro96 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented Mar 30, 2026

Uh oh!

yuki-97 commented Mar 31, 2026

Uh oh!

macandro96 commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

macandro96 commented May 21, 2026

Uh oh!

yuki-97 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

macandro96 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

macandro96 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

yuki-97 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

yuki-97 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

yuki-97 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yfw commented Mar 30, 2026 •

edited by macandro96

Loading