feat: support structured reward outputs and grouped reward aggregation #1200
Wangxiaoxiaoa wants to merge 1 commit into inclusionAI:main
Conversation
Code Review
This pull request updates the reward API to support both float and dictionary-based reward types and introduces a mechanism in the inference engine to aggregate group results through a workflow method. Feedback was provided to refine a type hint from `Any` to a more specific union type to maintain consistency with the updated documentation.
```diff
     return None

-    async def __call__(self, *args, **kwargs) -> float:
+    async def __call__(self, *args, **kwargs) -> Any:
```
The return type hint `Any` is too generic. Since the `reward_fn` docstring at line 60 has been updated to specify `float | dict[str, float]`, it is better to use the same specific type hint here to maintain consistency and improve type checking.
```diff
-    async def __call__(self, *args, **kwargs) -> Any:
+    async def __call__(self, *args, **kwargs) -> float | dict[str, float]:
```
Force-pushed babf3ad to b708fef
Force-pushed b708fef to 7365bca
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
Description
This PR adds support for structured reward outputs in the reward path for multi-reward RL workflows.
Today, the reward interface naturally supports only a single scalar reward, which makes it hard to represent multiple reward components for one sample.
This PR extends the reward path so reward functions can return either:

- a plain `float` (a single scalar reward), or
- a `dict[str, float]` mapping reward-component names to their values,

while keeping existing scalar-only behavior unchanged.
This is useful for reproducing multi-reward RL setups such as GDPO. With this change, GDPO-style logic can be implemented at the user level.
This PR does not implement GDPO itself in AReaL core. It provides the reward representation needed to build GDPO-style and other multi-reward workflows on top of AReaL.
Related Issue
Fixes #1196
Type of Change
Checklist
- `pre-commit run --all-files`
- `./docs/build_all.sh`

Breaking Change Details (if applicable):
N/A