
Commit 4a91cc2

Author: Gang Li
Commit message: update docs for disco
1 parent 3cc3b8f commit 4a91cc2

2 files changed

Lines changed: 97 additions & 0 deletions

File tree

docs/algo/disco.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Recipe: Discriminative Constrained Optimization (DisCO)

Last updated: 09/04/2025.

📝 [Paper@arXiv](https://arxiv.org/abs/2505.12366) | 🏠 [Repo@GitHub](https://github.com/Optimization-AI/DisCO) | 🤗 [Models@HF](https://huggingface.co/ganglii)

### 💡 Introducing **DisCO**: *Discriminative Constrained Optimization*
**DisCO** is a new RL framework grounded in **discriminative learning**. It trains models by **increasing scores for positive answers while decreasing those for negatives** (sketched in the code below), enabling:

* **No early entropy collapse**
* **Faster convergence**
* 📉 **More stable training**
* ⚖️ **Handles sparse rewards** – robust to imbalanced data with advanced discriminative approaches
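Here is a minimal PyTorch-style sketch of that discriminative objective, based on our reading of the paper rather than verl's actual implementation: the function and argument names (`disco_loss`, `scores_pos`, `kl_to_old`) are illustrative, and `delta` and `beta` mirror the configuration keys shown in the Configuration section below.

```python
import torch

def disco_loss(scores_pos: torch.Tensor,
               scores_neg: torch.Tensor,
               kl_to_old: torch.Tensor,
               delta: float = 1e-4,
               beta: float = 1e3) -> torch.Tensor:
    """Illustrative DisCO-style loss (see the paper for the exact objective).

    scores_pos / scores_neg: model scores s(q, o) of positive / negative
        responses, e.g. length-normalized log-likelihoods.
    kl_to_old: estimated KL divergence from the old (behavior) policy.
    delta: tolerance of the KL constraint; beta: penalty strength.
    """
    # Discriminative term: push positive scores up and negative scores down.
    separation = scores_pos.mean() - scores_neg.mean()
    # Penalty term: active only when the constraint KL <= delta is violated,
    # approximating the constrained optimization step.
    constraint_violation = torch.clamp(kl_to_old - delta, min=0.0)
    return -separation + beta * constraint_violation
```

The `clamp` makes the penalty one-sided, so gradients flow through the KL term only when the trust region is breached; this is one simple way to approximate a hard constraint.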
---
### 📈 Quick Results

On six math reasoning benchmarks with a 1.5B model, **DisCO outperforms GRPO and its variants**:

* **+7% vs GRPO**
* **+6% vs DAPO**

**DisCO with an 8k response length is on par with, or even better than, GRPO with a 32k response length.**
---
## Quickstart

1. Prepare the datasets:

```bash
bash prepare_data.sh  # This downloads the datasets to the current folder
```

2. Run the training script:

```bash
cd verl  # Repo root
bash recipe/disco/run_disco_1.5b.sh  # or other scripts
```
## Configuration
To configure DisCO within the framework, use the following YAML settings.

```yaml
algorithm:
  adv_estimator: disco  # Use the disco dummy advantage function

actor_rollout_ref:
  actor:
    policy_loss:
      loss_mode: 'disco'
      score_func: 'logL'  # Score function used in disco. Options: 'logL', 'Lratio'
      delta: 1e-4
      beta: 1e3
      tau: 10  # tau=10 is recommended for 'logL'; tau=1 is recommended for 'Lratio'

trainer:
  # We input all responses to a given question in one forward pass for better
  # performance. Specifically, it is better to have:
  #   (ppo_micro_batch_size_per_gpu * nnodes * n_gpus_per_node) % rollout.n = 0
  balance_batch: False
```
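For intuition, the two `score_func` options can be read as two ways of scoring a sampled response. The sketch below is our own interpretation, not verl's implementation: `'logL'` as a length-normalized log-likelihood and `'Lratio'` as an average token-level likelihood ratio against the old policy; the helper names and tensor layout are illustrative.

```python
import torch

def score_logL(logp_new: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """'logL': length-normalized log-likelihood of the response,
    s(q, o) = (1/|o|) * sum_t log pi_theta(o_t | q, o_<t).
    logp_new: per-token log-probs, shape (batch, seq_len);
    mask: 1.0 on response tokens, 0.0 on padding."""
    return (logp_new * mask).sum(-1) / mask.sum(-1)

def score_Lratio(logp_new: torch.Tensor,
                 logp_old: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """'Lratio': average token-level likelihood ratio against the old policy,
    s(q, o) = (1/|o|) * sum_t pi_theta(o_t | .) / pi_old(o_t | .)."""
    ratio = torch.exp(logp_new - logp_old)
    return (ratio * mask).sum(-1) / mask.sum(-1)
```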
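The divisibility condition in the `trainer` comment can be verified before launching a run. A tiny sanity check, with placeholder values standing in for a real configuration:

```python
# Sanity-check the batching constraint from the trainer comment above.
# The values here are placeholders, not recommendations.
ppo_micro_batch_size_per_gpu = 4
nnodes = 1
n_gpus_per_node = 8
rollout_n = 8  # rollout.n: number of sampled responses per question

global_micro_batch = ppo_micro_batch_size_per_gpu * nnodes * n_gpus_per_node
assert global_micro_batch % rollout_n == 0, (
    f"{global_micro_batch} % {rollout_n} != 0: responses to the same "
    "question would be split across forward passes"
)
```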
## More Results
Comparison with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-preview is included as a reference. MRL denotes the Max Response Length used in training/testing. The shaded models are trained by other works, and the shaded numbers are reported in their original works or in DeepScaleR. All other results are either evaluated on existing models or on models trained by us using different approaches. The methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1; DSR is short for DeepScaleR.

<p align="center"><img alt="Comparison with baselines on 1.5B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/1p5model.png" width="800"/></p>

Comparison with baseline models and baseline methods for fine-tuning 7B models. The methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset.

<p align="center"><img alt="Comparison with baselines on 7B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/7Bmodel.png" width="800"/></p>

Training dynamics of different methods: the left two plots are for fine-tuning the 1.5B model and the right two for fine-tuning the 7B model. Panels (a) and (c) plot the training reward (averaged over the generated outputs for the questions used in each step) versus the number of training steps; panels (b) and (d) plot the generation entropy versus training steps.

<p align="center"><img alt="Training Dynamics" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/training-dyanmics.png" width="800"/></p>
## Citing DisCO
If you find DisCO useful in your research, please consider citing the following paper:

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ verl is fast with:
   algo/spin.md
   algo/sppo.md
   algo/entropy.md
+  algo/disco.md
   algo/opo.md
   algo/baseline.md
   algo/gpg.md
