
Commit 8db354c

Bihan Rana authored and committed
Add distributed Axolotl and TRL example
1 parent cdf0a4f commit 8db354c

9 files changed

Lines changed: 471 additions & 0 deletions


docs/examples.md

Lines changed: 22 additions & 0 deletions
@@ -97,6 +97,28 @@ hide:
         with RAGEN, verl, and Ray.
       </p>
     </a>
+    <a href="/examples/distributed-training/trl"
+       class="feature-cell sky">
+      <h3>
+        TRL
+      </h3>
+
+      <p>
+        Fine-tune LLMs on multiple nodes
+        with TRL, Accelerate, and DeepSpeed.
+      </p>
+    </a>
+    <a href="/examples/distributed-training/axolotl"
+       class="feature-cell sky">
+      <h3>
+        Axolotl
+      </h3>
+
+      <p>
+        Fine-tune LLMs on multiple nodes
+        with Axolotl.
+      </p>
+    </a>
   </div>

 ## Inference

docs/examples/distributed-training/axolotl/index.md

Whitespace-only changes.

docs/examples/distributed-training/trl/index.md

Whitespace-only changes.
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we use the NGC container instead.
image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
  - HF_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY
  - NCCL_DEBUG=INFO
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - WANDB_NAME=axolotl-dist-llama-qlora-train
  - WANDB_PROJECT
  - HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B

# Commands of the task
commands:
  # The Torch and flash-attn preinstalled in the NGC container are not compatible with Axolotl.
  - pip uninstall torch -y
  - pip uninstall flash-attn -y
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl pulls in hf-xet 1.1.0, which crashes while downloading, so install the latest 1.1.2
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
  - accelerate launch
      --config_file=fsdp1.yaml
      --main_process_ip=$DSTACK_MASTER_NODE_IP
      --main_process_port=8008
      --machine_rank=$DSTACK_NODE_RANK
      --num_processes=$DSTACK_GPUS_NUM
      --num_machines=$DSTACK_NODES_NUM
      -m axolotl.cli.train qlora-fsdp-70b.yaml
      --hub-model-id $HUB_MODEL_ID
      --output-dir /checkpoints/qlora-llama3-70b
      --wandb-project $WANDB_PROJECT
      --wandb-name $WANDB_NAME

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Axolotl

This example walks you through how to run distributed fine-tuning using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead and clone the repo, then run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```
    </div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
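Here's a minimal sketch of such a fleet configuration. The fleet name and GPU spec below are placeholders; adjust them to your backend and hardware.

```yaml
type: fleet
# Placeholder name; pick your own
name: axolotl-fleet

# Two interconnected nodes
nodes: 2
# Provision the nodes as a cluster so they can use the backend's fast interconnect
placement: cluster

resources:
  gpu: 80GB:8
```

Create it with `dstack apply -f` before submitting the task below.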
## Run distributed training

Once the fleet is created, define a distributed task configuration. Here's an example of a distributed QLoRA task using FSDP.

<div editor-title="examples/distributed-training/axolotl/.dstack.yml">

```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we use the NGC container instead.
image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
  - HF_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY
  - NCCL_DEBUG=INFO
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - WANDB_NAME=axolotl-dist-llama-qlora-train
  - WANDB_PROJECT
  - HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B

# Commands of the task
commands:
  # Replace the default Torch and FlashAttention in the NGC container with Axolotl-compatible versions.
  # The preinstalled versions are incompatible with Axolotl.
  - pip uninstall torch -y
  - pip uninstall flash-attn -y
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl includes hf-xet 1.1.0, which fails during downloads. Replace it with the latest version (1.1.2).
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
  - accelerate launch
      --config_file=fsdp1.yaml
      --main_process_ip=$DSTACK_MASTER_NODE_IP
      --main_process_port=8008
      --machine_rank=$DSTACK_NODE_RANK
      --num_processes=$DSTACK_GPUS_NUM
      --num_machines=$DSTACK_NODES_NUM
      -m axolotl.cli.train qlora-fsdp-70b.yaml
      --hub-model-id $HUB_MODEL_ID
      --output-dir /checkpoints/qlora-llama3-70b
      --wandb-project $WANDB_PROJECT
      --wandb-name $WANDB_NAME

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```

</div>

!!! note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

### Applying the configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle

Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y

Provisioning...
---> 100%
```
</div>

## Source code

The source code of this example can be found in
[`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl).

!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
       [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
# TRL

This example walks you through how to run distributed fine-tuning using [TRL](https://github.com/huggingface/trl), [Accelerate](https://github.com/huggingface/accelerate), and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead and clone the repo, then run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```
    </div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
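The sample output below shows `ssh (remote)` instances, i.e. an SSH fleet built from existing hosts. Here's a minimal sketch of such a fleet configuration; the user, identity file, and host addresses are placeholders to replace with your own.

```yaml
type: fleet
# Placeholder name; pick your own
name: trl-ssh-fleet

# Ensure the hosts are interconnected as a cluster
placement: cluster

# Placeholder SSH credentials and host IPs
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.11
    - 192.168.100.12
```

A cloud fleet works just as well; the key requirement is `placement: cluster` so the nodes can reach each other.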
## Run distributed training

Once the fleet is created, define a distributed task configuration. Here's an example of a distributed supervised fine-tuning (SFT) task using FSDP and DeepSpeed ZeRO-3.

=== "FSDP"

    <div editor-title="examples/distributed-training/trl/fsdp.dstack.yml">

    ```yaml
    type: task
    # The name is optional, if not specified, generated randomly
    name: trl-train-fsdp-distrib

    # Size of the cluster
    nodes: 2

    image: nvcr.io/nvidia/pytorch:25.01-py3

    # Required environment variables
    env:
      - HF_TOKEN
      - ACCELERATE_LOG_LEVEL=info
      - WANDB_API_KEY
      - MODEL_ID=meta-llama/Llama-3.1-8B
      - HUB_MODEL_ID

    # Commands of the task
    commands:
      - pip install transformers
      - pip install bitsandbytes
      - pip install peft
      - pip install wandb
      - git clone https://github.com/huggingface/trl
      - cd trl
      - pip install .
      - accelerate launch
          --config_file=examples/accelerate_configs/fsdp1.yaml
          --main_process_ip=$DSTACK_MASTER_NODE_IP
          --main_process_port=8008
          --machine_rank=$DSTACK_NODE_RANK
          --num_processes=$DSTACK_GPUS_NUM
          --num_machines=$DSTACK_NODES_NUM
          trl/scripts/sft.py
          --model_name $MODEL_ID
          --dataset_name OpenAssistant/oasst_top1_2023-08-25
          --dataset_text_field="text"
          --per_device_train_batch_size 1
          --per_device_eval_batch_size 1
          --gradient_accumulation_steps 4
          --learning_rate 2e-4
          --report_to wandb
          --bf16
          --max_seq_length 1024
          --attn_implementation flash_attention_2
          --logging_steps=10
          --output_dir /checkpoints/llama31-ft
          --hub_model_id $HUB_MODEL_ID
          --torch_dtype bfloat16

    resources:
      gpu: 80GB:8
      shm_size: 128GB

    volumes:
      - /checkpoints:/checkpoints
    ```
    </div>

=== "DeepSpeed ZeRO-3"

    <div editor-title="examples/distributed-training/trl/deepspeed.dstack.yml">

    ```yaml
    type: task
    # The name is optional, if not specified, generated randomly
    name: trl-train-deepspeed-distrib

    # Size of the cluster
    nodes: 2

    image: nvcr.io/nvidia/pytorch:25.01-py3

    # Required environment variables
    env:
      - HF_TOKEN
      - ACCELERATE_LOG_LEVEL=info
      - WANDB_API_KEY
      - MODEL_ID=meta-llama/Llama-3.1-8B
      - HUB_MODEL_ID

    # Commands of the task
    commands:
      - pip install transformers
      - pip install bitsandbytes
      - pip install peft
      - pip install wandb
      - pip install deepspeed
      - git clone https://github.com/huggingface/trl
      - cd trl
      - pip install .
      - accelerate launch
          --config_file=examples/accelerate_configs/deepspeed_zero3.yaml
          --main_process_ip=$DSTACK_MASTER_NODE_IP
          --main_process_port=8008
          --machine_rank=$DSTACK_NODE_RANK
          --num_processes=$DSTACK_GPUS_NUM
          --num_machines=$DSTACK_NODES_NUM
          trl/scripts/sft.py
          --model_name $MODEL_ID
          --dataset_name OpenAssistant/oasst_top1_2023-08-25
          --dataset_text_field="text"
          --per_device_train_batch_size 1
          --per_device_eval_batch_size 1
          --gradient_accumulation_steps 4
          --learning_rate 2e-4
          --report_to wandb
          --bf16
          --max_seq_length 1024
          --attn_implementation flash_attention_2
          --logging_steps=10
          --output_dir /checkpoints/llama31-ft
          --hub_model_id $HUB_MODEL_ID
          --torch_dtype bfloat16

    resources:
      gpu: 80GB:8
      shm_size: 128GB

    volumes:
      - /checkpoints:/checkpoints
    ```
    </div>

!!! note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

### Applying the configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle

Submit the run trl-train-fsdp-distrib? [y/n]: y

Provisioning...
---> 100%
```
</div>

## Source code

The source code of this example can be found in
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
       [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
