Commit 70acd22
feat(trainer): add dpo (#1190)
* feat(trainer): add DPO trainer with FSDP backend
Add Direct Preference Optimization (Rafailov et al. 2023) as a new
trainer. The policy is directly optimized to prefer chosen over
rejected responses via a contrastive loss on log-probability ratios
against a frozen reference model, removing the need for a separately
trained reward model.
Reference logprobs are computed online each step by a colocated ref
engine, following the PPO/GRPO pattern. FSDP is the supported backend;
Megatron and Archon variants raise NotImplementedError as placeholders.
Verified on Qwen2.5-7B-Base + Anthropic/hh-rlhf (1 epoch, no SFT):
reward_accuracy rises from 0.50 to ~0.70 and reward_margin grows
monotonically, matching the original DPO paper's HH-RLHF results.
* fix(trainer): fix DPO config forwarding, require ref model, and correct IPO normalization
Fixes several issues found during PR review of the DPO trainer:
Key changes:
- Create DPOEngineConfig(TrainEngineConfig) embedding beta and loss_type,
fixing silent parameter drop in single-controller mode (as_controller
never forwarded beta/loss_type to workers)
- Make ref a required field in DPOConfig (ref_logprobs are required at
runtime, so config should enforce this upfront)
- Remove zero-ref fallback in compute_dpo_loss; use input_["ref_logprobs"]
directly
- Add IPO loss with per-token length normalization matching TRL author-
confirmed convention (normalize per-sequence logratios by completion
length before the squared loss)
- Remove all ref-is-None guard branches from DPOTrainer
- Update docs, YAML config, and tests for all changes
Refs: #1190
---------
Co-authored-by: 博惟 <bowei.fw@antgroup.com>1 parent 5bd7a18 commit 70acd22
26 files changed
Lines changed: 2158 additions & 13 deletions
File tree
- areal
- api
- dataset
- engine
- experimental/engine
- trainer
- dpo
- utils/functional
- docs
- en
- algorithms
- zh
- algorithms
- examples/alignment
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
207 | 207 | | |
208 | 208 | | |
209 | 209 | | |
210 | | - | |
| 210 | + | |
| 211 | + | |
211 | 212 | | |
212 | 213 | | |
213 | 214 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
15 | 15 | | |
16 | 16 | | |
17 | 17 | | |
18 | | - | |
19 | | - | |
| 18 | + | |
| 19 | + | |
20 | 20 | | |
21 | 21 | | |
| 22 | + | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
| |||
29 | 30 | | |
30 | 31 | | |
31 | 32 | | |
| 33 | + | |
32 | 34 | | |
33 | 35 | | |
34 | 36 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2669 | 2669 | | |
2670 | 2670 | | |
2671 | 2671 | | |
| 2672 | + | |
| 2673 | + | |
| 2674 | + | |
| 2675 | + | |
| 2676 | + | |
| 2677 | + | |
| 2678 | + | |
| 2679 | + | |
| 2680 | + | |
| 2681 | + | |
| 2682 | + | |
| 2683 | + | |
| 2684 | + | |
| 2685 | + | |
| 2686 | + | |
| 2687 | + | |
| 2688 | + | |
| 2689 | + | |
| 2690 | + | |
| 2691 | + | |
| 2692 | + | |
| 2693 | + | |
| 2694 | + | |
| 2695 | + | |
| 2696 | + | |
| 2697 | + | |
| 2698 | + | |
| 2699 | + | |
| 2700 | + | |
| 2701 | + | |
| 2702 | + | |
| 2703 | + | |
| 2704 | + | |
| 2705 | + | |
| 2706 | + | |
| 2707 | + | |
| 2708 | + | |
| 2709 | + | |
| 2710 | + | |
| 2711 | + | |
| 2712 | + | |
| 2713 | + | |
| 2714 | + | |
| 2715 | + | |
| 2716 | + | |
| 2717 | + | |
2672 | 2718 | | |
2673 | 2719 | | |
2674 | 2720 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
113 | 113 | | |
114 | 114 | | |
115 | 115 | | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
116 | 126 | | |
117 | 127 | | |
118 | 128 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
28 | 28 | | |
29 | 29 | | |
30 | 30 | | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
| 9 | + | |
9 | 10 | | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
13 | 14 | | |
| 15 | + | |
14 | 16 | | |
15 | 17 | | |
16 | 18 | | |
| |||
21 | 23 | | |
22 | 24 | | |
23 | 25 | | |
| 26 | + | |
24 | 27 | | |
25 | 28 | | |
26 | 29 | | |
27 | 30 | | |
28 | 31 | | |
| 32 | + | |
29 | 33 | | |
30 | 34 | | |
31 | 35 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
131 | 131 | | |
132 | 132 | | |
133 | 133 | | |
134 | | - | |
| 134 | + | |
135 | 135 | | |
136 | 136 | | |
137 | 137 | | |
| |||
1966 | 1966 | | |
1967 | 1967 | | |
1968 | 1968 | | |
| 1969 | + | |
| 1970 | + | |
| 1971 | + | |
| 1972 | + | |
| 1973 | + | |
| 1974 | + | |
| 1975 | + | |
| 1976 | + | |
| 1977 | + | |
| 1978 | + | |
| 1979 | + | |
| 1980 | + | |
| 1981 | + | |
| 1982 | + | |
| 1983 | + | |
| 1984 | + | |
| 1985 | + | |
| 1986 | + | |
| 1987 | + | |
| 1988 | + | |
| 1989 | + | |
| 1990 | + | |
| 1991 | + | |
| 1992 | + | |
| 1993 | + | |
| 1994 | + | |
| 1995 | + | |
| 1996 | + | |
| 1997 | + | |
| 1998 | + | |
| 1999 | + | |
| 2000 | + | |
| 2001 | + | |
| 2002 | + | |
| 2003 | + | |
| 2004 | + | |
| 2005 | + | |
| 2006 | + | |
| 2007 | + | |
| 2008 | + | |
| 2009 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
113 | 113 | | |
114 | 114 | | |
115 | 115 | | |
116 | | - | |
| 116 | + | |
117 | 117 | | |
118 | 118 | | |
119 | 119 | | |
| |||
1987 | 1987 | | |
1988 | 1988 | | |
1989 | 1989 | | |
| 1990 | + | |
| 1991 | + | |
| 1992 | + | |
| 1993 | + | |
| 1994 | + | |
| 1995 | + | |
| 1996 | + | |
| 1997 | + | |
| 1998 | + | |
| 1999 | + | |
| 2000 | + | |
| 2001 | + | |
| 2002 | + | |
| 2003 | + | |
| 2004 | + | |
| 2005 | + | |
| 2006 | + | |
| 2007 | + | |
| 2008 | + | |
| 2009 | + | |
| 2010 | + | |
| 2011 | + | |
| 2012 | + | |
| 2013 | + | |
| 2014 | + | |
| 2015 | + | |
| 2016 | + | |
| 2017 | + | |
| 2018 | + | |
| 2019 | + | |
| 2020 | + | |
| 2021 | + | |
| 2022 | + | |
| 2023 | + | |
| 2024 | + | |
| 2025 | + | |
| 2026 | + | |
| 2027 | + | |
| 2028 | + | |
| 2029 | + | |
| 2030 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
113 | 113 | | |
114 | 114 | | |
115 | 115 | | |
116 | | - | |
| 116 | + | |
117 | 117 | | |
118 | 118 | | |
119 | 119 | | |
| |||
1476 | 1476 | | |
1477 | 1477 | | |
1478 | 1478 | | |
| 1479 | + | |
| 1480 | + | |
| 1481 | + | |
| 1482 | + | |
| 1483 | + | |
| 1484 | + | |
| 1485 | + | |
| 1486 | + | |
| 1487 | + | |
| 1488 | + | |
| 1489 | + | |
| 1490 | + | |
| 1491 | + | |
| 1492 | + | |
| 1493 | + | |
| 1494 | + | |
| 1495 | + | |
| 1496 | + | |
| 1497 | + | |
| 1498 | + | |
| 1499 | + | |
| 1500 | + | |
| 1501 | + | |
| 1502 | + | |
| 1503 | + | |
| 1504 | + | |
| 1505 | + | |
| 1506 | + | |
| 1507 | + | |
| 1508 | + | |
| 1509 | + | |
| 1510 | + | |
| 1511 | + | |
| 1512 | + | |
| 1513 | + | |
| 1514 | + | |
| 1515 | + | |
| 1516 | + | |
| 1517 | + | |
| 1518 | + | |
| 1519 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
| 3 | + | |
3 | 4 | | |
4 | 5 | | |
5 | 6 | | |
6 | 7 | | |
7 | | - | |
| 8 | + | |
0 commit comments