Commit 5ce0057

Bihan Rana authored and committed
Add Distributed Agent Fine Tuning Example
1 parent 1e07524 commit 5ce0057

2 files changed

Lines changed: 154 additions & 0 deletions

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
type: task
name: agent-fine-tuning
nodes: 2
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2

env:
  - WANDB_API_KEY

commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
  - git clone https://github.com/RAGEN-AI/RAGEN.git
  - cd RAGEN
  - bash scripts/setup_ragen.sh
  - conda activate ragen
  - cd verl
  - pip install --no-deps -e .
  - pip install hf_transfer hf_xet
  - pip uninstall -y ray
  - pip install -U "ray[default]"
  - >
    if [ $DSTACK_NODE_RANK = 0 ]; then
        ray start --head --port=6379;
    else
        ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi
ports:
  - 8265 # ray dashboard port
resources:
  gpu: nvidia:8:80GB
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Agent Fine Tuning

This example shows how to use `dstack` and [RAGEN](https://github.com/RAGEN-AI/RAGEN) for multi-node agent fine-tuning. Under the hood, `RAGEN` uses [VERL](https://github.com/volcengine/verl) for reinforcement learning.

## Create fleet

Create an SSH fleet that reaches the cluster nodes through the login node specified via [proxy_jump](https://dstack.ai/blog/gpu-blocks-and-proxy-jump/#proxy-jump).

```yaml
type: fleet
name: lambda-h100-fleet

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/peterschmidt85
  hosts:
    - lambda-cluster-node-001
    - lambda-cluster-node-002
  proxy_jump:
    hostname: 192.222.48.90
    user: ubuntu
    identity_file: ~/.ssh/peterschmidt85

placement: cluster
```

```shell
dstack apply -f lambda-h100-fleet.yaml
```

## Launch Ray cluster

The following `dstack` task sets up `RAGEN` and launches the Ray head and worker nodes.
`dstack` makes the Ray dashboard available at `localhost:8265`.

```yaml
type: task
name: agent-fine-tuning
nodes: 2
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2

env:
  - WANDB_API_KEY

commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
  - git clone https://github.com/RAGEN-AI/RAGEN.git
  - cd RAGEN
  - bash scripts/setup_ragen.sh
  - conda activate ragen
  - cd verl
  - pip install --no-deps -e .
  - pip install hf_transfer hf_xet
  - pip uninstall -y ray
  - pip install -U "ray[default]"
  - >
    if [ $DSTACK_NODE_RANK = 0 ]; then
        ray start --head --port=6379;
    else
        ray start --address=$DSTACK_MASTER_NODE_IP:6379
    fi
ports:
  - 8265 # ray dashboard port
resources:
  gpu: nvidia:8:80GB
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```

!!! Note
    1. We use the `VERL` Docker image for vLLM with FSDP. See [Installation](https://verl.readthedocs.io/en/latest/start/install.html).
    2. The `RAGEN` setup script `scripts/setup_ragen.sh` isolates dependencies in a Conda environment.
    3. The Ray installation in the RAGEN environment lacks the dashboard, so we reinstall Ray with `ray[default]`.

```shell
dstack apply -f agent-fine-tuning.yaml
```
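
The last command in the task picks a Ray role from the node rank. As a rough illustration of that branching, here is a minimal Python sketch. The function name `ray_start_command` is hypothetical and not part of `dstack` or Ray; the sketch only builds the command string and does not start Ray.

```python
import os


def ray_start_command(node_rank: int, master_ip: str, port: int = 6379) -> str:
    """Return the `ray start` command a node would run (hypothetical helper)."""
    if node_rank == 0:
        # Rank 0 becomes the Ray head and listens on the given port.
        return f"ray start --head --port={port}"
    # Every other rank joins the head at the master node's address.
    return f"ray start --address={master_ip}:{port}"


# Inside a dstack task these variables are set per node; the fallbacks
# below are assumptions so the sketch also runs on a laptop.
rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
master = os.environ.get("DSTACK_MASTER_NODE_IP", "127.0.0.1")
print(ray_start_command(rank, master))
```

With `nodes: 2`, exactly one node takes the head branch and the other connects to it on port `6379`.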

## Run Ray jobs

Install Ray locally:

```shell
pip install ray
```

Now you can submit an agent fine-tuning job to the cluster, which is available at `localhost:8265`:

```shell
RAY_ADDRESS='http://localhost:8265' \
ray job submit \
  -- bash -c "\
    export PYTHONPATH=/workflow/RAGEN; \
    cd /workflow/RAGEN; \
    /workflow/miniconda/envs/ragen/bin/python train.py \
      --config-name base \
      system.CUDA_VISIBLE_DEVICES=[0,1,2,3,4,5,6,7] \
      model_path=Qwen/Qwen2.5-7B-Instruct \
      trainer.experiment_name=agent-fine-tuning-Qwen2.5-7B \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=2 \
      micro_batch_size_per_gpu=2 \
      trainer.default_local_dir=/checkpoints \
      trainer.save_freq=50 \
      actor_rollout_ref.rollout.tp_size_check=False \
      actor_rollout_ref.rollout.tensor_model_parallel_size=4"
```
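
The same submission can probably also be scripted with Ray's Python job SDK. The sketch below is an assumption-laden alternative to the CLI call, not part of this commit: `build_entrypoint`, `submit`, and `OVERRIDES` are hypothetical names, and only a subset of the overrides is listed. It relies on `ray.job_submission.JobSubmissionClient`, which is imported lazily so the entrypoint can be inspected without Ray installed.

```python
import shlex

RAGEN_DIR = "/workflow/RAGEN"
PYTHON = "/workflow/miniconda/envs/ragen/bin/python"

# A subset of the overrides from the CLI example (hypothetical grouping).
OVERRIDES = {
    "model_path": "Qwen/Qwen2.5-7B-Instruct",
    "trainer.n_gpus_per_node": "8",
    "trainer.nnodes": "2",
    "micro_batch_size_per_gpu": "2",
    "actor_rollout_ref.rollout.tp_size_check": "False",
    "actor_rollout_ref.rollout.tensor_model_parallel_size": "4",
}


def build_entrypoint(overrides: dict) -> str:
    """Build the `bash -c` entrypoint that runs train.py with the overrides."""
    args = " ".join(f"{key}={value}" for key, value in overrides.items())
    script = (
        f"export PYTHONPATH={RAGEN_DIR}; cd {RAGEN_DIR}; "
        f"{PYTHON} train.py --config-name base {args}"
    )
    # Quote the whole script so bash receives it as a single argument.
    return f"bash -c {shlex.quote(script)}"


def submit(overrides: dict, address: str = "http://localhost:8265") -> str:
    """Submit the job through the dashboard address and return its ID."""
    from ray.job_submission import JobSubmissionClient  # needs `pip install "ray[default]"`

    client = JobSubmissionClient(address)
    return client.submit_job(entrypoint=build_entrypoint(overrides))


# Printing the entrypoint is safe without a cluster; call submit(OVERRIDES)
# only once the dashboard at localhost:8265 is reachable.
print(build_entrypoint(OVERRIDES))
```

Keeping the overrides in a dictionary makes it easier to vary a single setting, such as `trainer.nnodes`, between submissions.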

!!! info "Training Parameters"
    1. `actor_rollout_ref.rollout.tensor_model_parallel_size=4`: Qwen/Qwen2.5-7B-Instruct has 28 attention heads, and the number of attention heads must be divisible by `tensor_model_parallel_size`.
    2. `actor_rollout_ref.rollout.tp_size_check=False`: if `True`, `tensor_model_parallel_size` must equal `trainer.n_gpus_per_node`.
    3. `micro_batch_size_per_gpu=2`: keeps the RAGEN paper's `rollout_filter_ratio` and `es_manager` settings unchanged for a world size of `16`.
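
The arithmetic behind these choices can be checked with a short sketch. The head count, GPU count, and node count come from the text above; the helper `valid_tp_sizes` is hypothetical, and the extra requirement that the tensor-parallel size also divides the GPUs per node is an assumption made here for illustration.

```python
NUM_ATTENTION_HEADS = 28  # Qwen/Qwen2.5-7B-Instruct, as stated above
GPUS_PER_NODE = 8         # trainer.n_gpus_per_node
NNODES = 2                # trainer.nnodes

# World size is the total number of GPUs across all nodes.
WORLD_SIZE = GPUS_PER_NODE * NNODES


def valid_tp_sizes(num_heads: int, gpus_per_node: int) -> list:
    """List tensor-parallel sizes that divide both the heads and the GPUs."""
    return [
        tp
        for tp in range(1, gpus_per_node + 1)
        # Heads must split evenly across tensor-parallel ranks; dividing
        # the GPUs per node evenly is an assumption of this sketch.
        if num_heads % tp == 0 and gpus_per_node % tp == 0
    ]


print(WORLD_SIZE)
print(valid_tp_sizes(NUM_ATTENTION_HEADS, GPUS_PER_NODE))
```

Since 28 is not divisible by 8, a tensor-parallel size equal to `trainer.n_gpus_per_node` is ruled out, which is why `tp_size_check` is set to `False` and the size is set to 4.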

See more examples in the [Ray docs](https://docs.ray.io/en/latest/train/examples.html).

Using Ray via `dstack` is a powerful way to get access to the rich Ray ecosystem while benefiting from `dstack`'s provisioning capabilities.
