Is training the RM with a preference contrastive loss supported? #9

@SYSUzhouting

Description

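The current RM preference-training tutorial (examples/train/pairwise/run_pairwise.sh) uses a training method that combines GRPO and PPO. Is a more direct and faster training method supported, for example optimizing directly with a preference contrastive loss such as the Bradley-Terry loss?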
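
For concreteness, the loss meant here is the standard Bradley-Terry pairwise objective, -log(sigmoid(r_chosen - r_rejected)), averaged over preference pairs. The snippet below is only a minimal, framework-free sketch of that formula; the function name `bradley_terry_loss` and its list-of-floats interface are hypothetical and are not part of this repository's API.

```python
import math


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs.

    Hypothetical helper for illustration only; inputs are plain lists of
    scalar reward scores, one (chosen, rejected) pair per index.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected must have the same length")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == softplus(-m), written in a numerically
        # stable form: max(-m, 0) + log1p(exp(-|m|)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


# The loss shrinks as the chosen response is scored further above the rejected one.
print(bradley_terry_loss([2.0, 1.5], [0.5, 1.0]))
```

In an actual trainer the same expression would be applied to the reward model's scalar outputs for the chosen and rejected responses and backpropagated directly, without a rollout or policy-optimization loop.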
