|
| 1 | +# Recipe: Discriminative Constrained Optimization (DisCO) |
| 2 | + |
| 3 | +Last updated: 09/04/2025. |
| 4 | + |
| 5 | + |
| 6 | + 📝 [Paper@arXiv](https://arxiv.org/abs/2505.12366) | 🏠 [Repo@GitHub](https://github.com/Optimization-AI/DisCO) | 🤗 [Models@HF](https://huggingface.co/ganglii) |
| 7 | +### 💡 Introducing **DisCO** — *Discriminative Constrained Optimization* |
| 8 | + |
| 9 | +**DisCO** is a new RL framework grounded in **discriminative learning**. It trains models by **increasing scores for positive answers while decreasing those for negatives**, enabling: |
| 10 | + |
| 11 | +* ❌ **No Early Entropy Collapse** |
| 12 | +* ⚡ **Faster convergence** |
| 13 | +* 📉 **More stable training** |
| 14 | +* ⚖️ **Handles sparse rewards** – robust to imbalanced data with advanced discriminative approaches |
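The idea of raising scores for positive answers while lowering them for negatives can be illustrated with a toy objective. This is a minimal sketch under stated assumptions, not the recipe's actual loss: the function name `discriminative_objective`, the per-response scalar scores, and the use of a temperature-scaled log-mean-exp over negatives are all illustrative choices.

```python
import math


def discriminative_objective(pos_scores, neg_scores, tau=1.0):
    """Toy objective (hypothetical): mean positive score minus a soft maximum of negative scores.

    pos_scores / neg_scores are per-response scalar scores for one question.
    tau is an assumed temperature; smaller values focus on the highest-scoring negative.
    """
    if not pos_scores or not neg_scores:
        # No contrast is available when all responses share the same label.
        return 0.0
    mean_pos = sum(pos_scores) / len(pos_scores)
    # Numerically stable log-mean-exp of the scaled negative scores.
    scaled = [s / tau for s in neg_scores]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(x - m) for x in scaled) / len(scaled))
    return mean_pos - tau * log_mean_exp


# Raising a positive score increases the objective; raising a negative score decreases it.
base = discriminative_objective([0.0], [-1.0, -1.0])
better_pos = discriminative_objective([0.5], [-1.0, -1.0])
worse_neg = discriminative_objective([0.0], [-0.5, -1.0])
print(base, better_pos > base, worse_neg < base)
```

Maximizing a quantity of this shape pushes positive scores up and negative scores down, and the soft maximum puts more weight on the hardest negatives, which is one way to stay informative when correct answers are rare.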
| 15 | + |
| 16 | +--- |
| 17 | + |
| 18 | +### 📈 Quick Results |
| 19 | + |
| 20 | +On six math reasoning benchmarks with a 1.5B model, **DisCO outperforms GRPO and its variants**: |
| 21 | + |
| 22 | +* **+7% vs GRPO** |
| 23 | +* **+6% vs DAPO** |
| 24 | + |
| 25 | +**DisCO with an 8k response length is on par with, or even better than, GRPO with a 32k response length.** |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | + |
| 30 | +## Quickstart |
| 31 | + |
| 32 | +1. Prepare the datasets: |
| 33 | + |
| 34 | +```bash |
| 35 | +bash prepare_data.sh # Downloads the datasets to the current folder |
| 36 | +``` |
| 37 | + |
| 38 | +2. Run the training script: |
| 39 | + |
| 40 | +```bash |
| 41 | +cd verl # Repo root |
| 42 | +bash recipe/disco/run_disco_1.5b.sh # or other scripts |
| 43 | +``` |
| 44 | + |
| 45 | +## Configuration |
| 46 | + |
| 47 | +To configure DisCO within the framework, use the following YAML settings. |
| 48 | + |
| 49 | +```yaml |
| 50 | +algorithm: |
| 51 | +  adv_estimator: disco  # Use the DisCO dummy advantage function |
| 52 | +actor_rollout_ref: |
| 53 | +  actor: |
| 54 | +    policy_loss: |
| 55 | +      loss_mode: 'disco' |
| 56 | +      score_func: 'logL'  # Score function used in DisCO. Options: 'logL', 'Lratio' |
| 57 | +      delta: 1e-4 |
| 58 | +      beta: 1e3 |
| 59 | +      tau: 10  # tau=10 is recommended for 'logL'; tau=1 is recommended for 'Lratio' |
| 60 | +trainer: |
| 61 | +  # All responses to a given question are fed through a single forward pass for better performance. |
| 62 | +  # For this to work well, it is better to have: |
| 63 | +  # (ppo_micro_batch_size_per_gpu * nnodes * n_gpus_per_node) % rollout.n == 0 |
| 64 | +  balance_batch: False |
| 65 | +``` |
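The divisibility recommendation in the `trainer` comment can be checked before launching a run. The helper below is a minimal sketch: the function name `groups_fit_evenly` and the example values are hypothetical, and only the arithmetic comes from the configuration comment.

```python
def groups_fit_evenly(ppo_micro_batch_size_per_gpu: int, nnodes: int,
                      n_gpus_per_node: int, rollout_n: int) -> bool:
    """Return True if the divisibility recommendation from the config comment holds."""
    if rollout_n <= 0:
        raise ValueError("rollout_n must be positive")
    # Product of micro-batch size per GPU, node count, and GPUs per node.
    total = ppo_micro_batch_size_per_gpu * nnodes * n_gpus_per_node
    return total % rollout_n == 0


# Hypothetical values chosen only for illustration.
print(groups_fit_evenly(8, 1, 8, 8))  # 64 % 8 == 0
print(groups_fit_evenly(3, 1, 2, 4))  # 6 % 4 == 2
```

If the check fails, adjusting `ppo_micro_batch_size_per_gpu` or `rollout.n` so that the product is a multiple of `rollout.n` satisfies the recommendation.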
| 66 | +
|
| 67 | +
|
| 68 | +## More Results |
| 69 | +
|
| 70 | +Comparison with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-preview is included as a reference. MRL denotes the maximum response length used in training/testing. The shaded models were trained in other works, and the shaded numbers are those reported in the original works or in DeepScaleR. All other results come from evaluating existing models or models we trained with different approaches. Methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1, and DSR is short for DeepScaleR. |
| 71 | +
|
| 72 | +<p align="center"><img alt="Comparison with baselines on 1.5B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/1p5model.png" width="800"/></p> |
| 73 | +
|
| 74 | +
|
| 75 | +Comparison with baseline models and baseline methods for fine-tuning 7B models. Methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset. |
| 76 | +
|
| 77 | +<p align="center"><img alt="Comparison with baselines on 7B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/7Bmodel.png" width="800"/></p> |
| 78 | +
|
| 79 | +Training dynamics of different methods: the left two panels are for fine-tuning the 1.5B model and the right two are for fine-tuning the 7B model. Panels (a) and (c) plot the training reward (averaged over the outputs generated for the questions used in each step) against the number of training steps; panels (b) and (d) plot the generation entropy against the number of training steps. |
| 80 | +
|
| 81 | +<p align="center"><img alt="Training Dynamics" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/training-dyanmics.png" width="800"/></p> |
| 82 | +
|
| 83 | +
|
| 84 | +## Citing DisCO |
| 85 | +
|
| 86 | +If you find DisCO useful in your research, please consider citing the following paper: |
| 87 | +```bibtex |
| 88 | +@article{li2025disco, |
| 89 | + title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization}, |
| 90 | + author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao}, |
| 91 | + journal={arXiv preprint arXiv:2505.12366}, |
| 92 | + year={2025} |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | + |