
Commit 3cc3b8f

Author: Gang Li
Commit message: add disco algorithm, recipe, and CI tests
1 parent e90f18c commit 3cc3b8f

12 files changed

Lines changed: 1834 additions & 1 deletion

File tree

.github/workflows/e2e_disco.yml

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`: runs pytest on all scripts matching the file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`: runs pytest on all test scripts without the `on_cpu.py` suffix
#   - Since the cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_disco

on:
  # Trigger the workflow on push or pull request,
  # but only for the main and v0.* branches.
  # For push, only anti-patterns are specified for now, so the trigger
  # is more conservative and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "verl/*.py"
      # Other entrypoints
      - "!examples/*trainer*"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
      - "!recipe/**"
      - "recipe/disco"
      # Entrypoints
      - ".github/workflows/e2e_disco.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/run_disco.sh"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Other recipes
      - "!recipe/**"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
      # Home
      - "recipe/disco"
      # Entrypoints
      - ".github/workflows/e2e_disco.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/special_e2e/run_disco.sh"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: read repository contents only.
permissions:
  contents: read

jobs:
  e2e_disco:
    runs-on: [L20x8]
    timeout-minutes: 40 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test,gpu]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Run the E2E test with the DisCO algorithm
        run: |
          ray stop --force
          bash tests/special_e2e/run_disco.sh

recipe/disco/README.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
<h1 align="center">🚀 DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization</h1>
<p align="center"><img alt="DisCO" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/disco-final.png" width="300"/></p>
<p align="center" style="font-size: 5px;"><em>Credit: the above image was generated by Sora</em></p>

📝 [Paper@arXiv](https://arxiv.org/abs/2505.12366) | 🏠 [Repo@GitHub](https://github.com/Optimization-AI/DisCO)

---

### 💡 Introducing **DisCO**: *Discriminative Constrained Optimization*

**DisCO** is a new RL framework grounded in **discriminative learning**. It trains models by **increasing scores for positive answers while decreasing those for negatives**, enabling:

* **No early entropy collapse**
* **Faster convergence**
* 📉 **More stable training**
* ⚖️ **Handles sparse rewards** – robust to imbalanced data with advanced discriminative approaches

---
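The discriminative idea above can be sketched in a few lines of PyTorch. This is a schematic illustration only, not the recipe's actual implementation: the function name and the squared-hinge constraint penalty are illustrative, only the `'logL'` (length-normalized log-likelihood) score function is shown, and the `'Lratio'` variant is omitted.

```python
import torch

def disco_loss_sketch(logp, old_logp, is_positive, mask, beta=1e3, delta=1e-4):
    """Schematic DisCO-style loss (illustration only, not verl's implementation).

    logp / old_logp: (batch, seq) per-token log-probs under the current / behavior policy.
    is_positive: (batch,) bool, whether the sampled answer was rewarded as correct.
    mask: (batch, seq) response mask. Assumes the batch contains both positives and negatives.
    """
    # 'logL' score: length-normalized sequence log-likelihood.
    lengths = mask.sum(-1).clamp(min=1)
    score = (logp * mask).sum(-1) / lengths

    # Discriminative term: push scores of positive answers above negatives.
    pos, neg = score[is_positive], score[~is_positive]
    disc = pos.mean() - neg.mean()

    # Constraint term (illustrative): keep the policy close to the behavior
    # policy via a squared-hinge penalty on an estimated KL, with weight beta
    # and tolerance delta (hyper-parameter names taken from the recipe config).
    kl_est = ((old_logp - logp) * mask).sum(-1) / lengths
    penalty = beta * torch.clamp(kl_est.mean() - delta, min=0.0) ** 2

    # Minimize the negative discriminative gap plus the constraint penalty.
    return -disc + penalty
```

In this sketch, a batch where positive and negative answers have identical scores and the policy has not moved yields zero loss, which matches the intuition that there is nothing left to discriminate.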

### 📈 Quick Results

On six math reasoning benchmarks with a 1.5B model, **DisCO outperforms GRPO and its variants**:

* **+7% vs GRPO**
* **+6% vs DAPO**

**DisCO with an 8k response length is on par with or even better than GRPO with a 32k response length.**

---

- [Model Checkpoints](#model-checkpoints)
- [Quickstart](#quickstart)
- [More Results](#more-results)
- [Citing DisCO](#citing-disco)

## Model Checkpoints

- DisCO (Log-L) finetuned DeepSeek-R1-Distill-Qwen-1.5B model: [DisCO-1.5B-logL](https://huggingface.co/ganglii/DisCO-1.5B-logL)
- DisCO (L-Ratio) finetuned DeepSeek-R1-Distill-Qwen-1.5B model: [DisCO-1.5B-Lratio](https://huggingface.co/ganglii/DisCO-1.5B-Lratio)
- DisCO (Log-L) finetuned DeepSeek-R1-Distill-Qwen-7B model: [DisCO-7B-logL](https://huggingface.co/ganglii/DisCO-7B-logL)
- DisCO (L-Ratio) finetuned DeepSeek-R1-Distill-Qwen-7B model: [DisCO-7B-Lratio](https://huggingface.co/ganglii/DisCO-7B-Lratio)

## Quickstart

1. Prepare the datasets:

```bash
bash prepare_data.sh # This downloads the datasets to the current folder
```

2. Run a training script:

```bash
cd verl # Repo root
bash recipe/disco/run_disco_1.5b.sh # or other scripts
```

## More Results

Comparison with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-preview is included as a reference. MRL denotes the Max Response Length used in training/testing. The shaded models were trained by other works, and the shaded numbers are those reported in their original works or in DeepScaleR. All other results are evaluated either on existing models or on models trained by us using different approaches. The methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1; DSR is short for DeepScaleR.

<p align="center"><img alt="Comparison with baselines on 1.5B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/1p5model.png" width="800"/></p>

Comparison with baseline models and baseline methods for fine-tuning 7B models. The methods in the bottom area all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset.

<p align="center"><img alt="Comparison with baselines on 7B model" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/7Bmodel.png" width="800"/></p>

Training dynamics of different methods: the left two panels are for fine-tuning the 1.5B model and the right two for the 7B model. Panels (a) and (c) plot the training reward (averaged over the generated outputs for the questions used in each step) against the number of training steps; panels (b) and (d) plot the generation entropy against training steps.

<p align="center"><img alt="Training Dynamics" src="https://github.com/Optimization-AI/DisCO/blob/main/assets/training-dyanmics.png" width="800"/></p>

## Citing DisCO

If you find DisCO useful in your research, please consider citing the following paper:

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

data:
  gen_batch_size: ${data.train_batch_size}

actor_rollout_ref:
  actor:
    policy_loss:
      loss_mode: 'disco'
      score_func: 'logL' # score function used in disco. Options: 'logL', 'Lratio'
      delta: 1e-4
      beta: 1e3
      tau: 10

reward_model:
  reward_manager: naive

custom_reward_function:
  # The path to the file containing your customized reward function.
  # If not specified, pre-implemented reward functions will be used.
  path: recipe/disco/reward/deepscaler_reward.py
  # The name of the reward function within the specified file. Default is 'compute_score'.
  name: deepscaler_reward_fn

algorithm:
  filter_groups:
    _target_: verl.trainer.config.FilterGroupsConfig
    enable: False # Set explicitly so it is not forgotten
    metric: null # acc / score / seq_reward / seq_final_reward / ...
    max_num_gen_batches: 0 # Non-positive values mean no upper limit

trainer:
  project_name: verl-disco
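The `custom_reward_function` section above points at a plain Python callable resolved by `path` and `name`. A minimal hypothetical stand-in is sketched below; the signature follows verl's common `compute_score` convention, which is an assumption here, and the matching logic is deliberately naive (the actual `deepscaler_reward.py` does proper math-answer extraction).

```python
# Hypothetical custom reward function in the shape verl's naive reward
# manager expects. Signature and logic are illustrative, not this commit's code.
def deepscaler_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the final line of the solution matches the ground truth, else 0.0."""
    stripped = solution_str.strip()
    # Naive check for illustration: exact match on the last non-empty line.
    answer = stripped.splitlines()[-1] if stripped else ""
    return 1.0 if answer.strip() == str(ground_truth).strip() else 0.0
```

A binary 0/1 reward like this is exactly what DisCO's discriminative objective consumes: it partitions sampled answers into positives and negatives.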

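The `algorithm.filter_groups` block in the config above is disabled by default; its comments hint at a group-filtering idea. A minimal sketch of how such filtering could work is shown below, assuming each group holds the per-response metric values for one question; this is illustrative, not verl's `FilterGroupsConfig` implementation.

```python
# Illustrative group filtering: drop question groups whose chosen metric
# (e.g. 'acc') is identical across all sampled responses, since such groups
# carry no discriminative learning signal.
def filter_groups(groups, metric="acc"):
    kept = []
    for responses in groups:  # each group: list of per-response metric dicts for one question
        values = [r[metric] for r in responses]
        if max(values) > min(values):  # mixed outcomes -> informative group
            kept.append(responses)
    return kept
```

Under this reading, `max_num_gen_batches` would cap how many extra generation batches are drawn while collecting enough informative groups, with non-positive values meaning no cap.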