
feat: support structured reward outputs and grouped reward aggregation #1200

Open

Wangxiaoxiaoa wants to merge 1 commit into inclusionAI:main from Wangxiaoxiaoa:xiao/pr-reward-structured

Conversation

@Wangxiaoxiaoa
Contributor

Description

This PR adds support for structured reward outputs in the reward path for multi-reward RL workflows.

Today, the reward interface assumes a single scalar reward per sample, which makes it hard to represent multiple reward components for one sample.

This PR extends the reward path so reward functions can return either:

  • a scalar reward
  • a structured reward dictionary

while keeping existing scalar-only behavior unchanged.
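As a sketch, a user-level reward function under this change could take either shape. The function and argument names below are illustrative only, not AReaL's actual API:

```python
def accuracy_only_reward(prompt: str, completion: str) -> float:
    # Existing behavior: a single scalar reward (unchanged by this PR).
    return 1.0 if "42" in completion else 0.0


def multi_component_reward(prompt: str, completion: str) -> dict[str, float]:
    # New behavior: a structured reward with one entry per component.
    return {
        "accuracy": 1.0 if "42" in completion else 0.0,
        "length_penalty": -0.001 * len(completion),
        "format": 1.0 if completion.endswith(".") else 0.0,
    }
```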

This is useful for reproducing multi-reward RL setups such as GDPO. With this change, GDPO-style logic can be implemented at the user level:

  • in a custom reward function, by returning multiple reward components
  • or in a workflow, by aggregating those components into the final scalar reward used for training

This PR does not implement GDPO itself in AReaL core. It provides the reward representation needed to build GDPO-style and other multi-reward workflows on top of AReaL.

Related Issue

Fixes #1196

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the reward API to support both float and dictionary-based reward types and introduces a mechanism in the inference engine to aggregate group results through a workflow method. Feedback was provided to refine a type hint from Any to a more specific union type to maintain consistency with the updated documentation.

Comment thread on areal/api/reward_api.py (outdated):

    return None

-async def __call__(self, *args, **kwargs) -> float:
+async def __call__(self, *args, **kwargs) -> Any:
Contributor


medium

The return type hint Any is too generic. Since the reward_fn docstring at line 60 has been updated to specify float | dict[str, float], it is better to use the same specific type hint here to maintain consistency and improve type checking.

Suggested change:

-async def __call__(self, *args, **kwargs) -> Any:
+async def __call__(self, *args, **kwargs) -> float | dict[str, float]:

Contributor Author


Applied, thanks!

@Wangxiaoxiaoa Wangxiaoxiaoa force-pushed the xiao/pr-reward-structured branch 2 times, most recently from babf3ad to b708fef Compare April 17, 2026 10:20
@Wangxiaoxiaoa Wangxiaoxiaoa force-pushed the xiao/pr-reward-structured branch from b708fef to 7365bca Compare April 17, 2026 10:53
@github-actions

github-actions Bot commented May 8, 2026

This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days.

Please add a comment or push new commits to keep it active.

Thank you for your contribution!

@github-actions github-actions Bot added the stale label May 8, 2026


Development

Successfully merging this pull request may close these issues.

[Feature] Support structured reward outputs and grouped reward aggregation for multi-reward RL workflows
