Merged
22 changes: 22 additions & 0 deletions docs/examples.md
@@ -97,6 +97,28 @@ hide:
with RAGEN, verl, and Ray.
</p>
</a>
<a href="/examples/distributed-training/trl"
class="feature-cell sky">
<h3>
TRL
</h3>

<p>
Fine-tune LLMs on multiple nodes
with TRL, Accelerate, and DeepSpeed.
</p>
</a>
<a href="/examples/distributed-training/axolotl"
class="feature-cell sky">
<h3>
Axolotl
</h3>

<p>
Fine-tune LLMs on multiple nodes
with Axolotl.
</p>
</a>
</div>

## Inference
45 changes: 45 additions & 0 deletions examples/distributed-training/axolotl/.dstack.yml
@@ -0,0 +1,45 @@
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- NCCL_DEBUG=INFO
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- WANDB_NAME=axolotl-dist-llama-qlora-train
- WANDB_PROJECT
- HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B
# Commands of the task
commands:
# The torch and flash-attn preinstalled in the NGC container are not compatible with axolotl.
- pip uninstall torch -y
- pip uninstall flash-attn -y
- pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
- pip install --no-build-isolation axolotl[flash-attn,deepspeed]
- wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
# axolotl includes hf-xet 1.1.0, which crashes while downloading, so we install the latest version (1.1.2)
- pip uninstall -y hf-xet
- pip install hf-xet --no-cache-dir
# Note: accelerate's own launcher flags must come before the training module;
# a literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    -m axolotl.cli.train qlora-fsdp-70b.yaml \
    --hub-model-id $HUB_MODEL_ID \
    --output-dir /checkpoints/qlora-llama3-70b \
    --wandb-project $WANDB_PROJECT \
    --wandb-name $WANDB_NAME


resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
107 changes: 107 additions & 0 deletions examples/distributed-training/axolotl/README.md
@@ -0,0 +1,107 @@
# Axolotl

This example walks you through how to run distributed fine-tuning with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) and `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`.

<div class="termy">

```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
```
</div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
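As a sketch, a minimal cloud fleet configuration matching this example might look like the following (the fleet name is a placeholder, and for SSH fleets of on-prem machines you would list hosts under `ssh_config` instead):

```yaml
type: fleet
# A placeholder name; pick your own
name: axolotl-fleet

# Number of interconnected instances
nodes: 2
# Place instances on the same backplane for fast inter-node communication
placement: cluster

resources:
  gpu: 80GB:8
```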

## Run Distributed Training
Once the fleet is created, define a distributed task configuration. Here's an example of a distributed QLoRA fine-tuning task using FSDP.

<div editor-title="examples/distributed-training/axolotl/.dstack.yml">

```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- NCCL_DEBUG=INFO
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- WANDB_NAME=axolotl-dist-llama-qlora-train
- WANDB_PROJECT
- HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B

# Commands of the task
commands:
# Replacing the default Torch and FlashAttention in the NGC container with Axolotl-compatible versions.
# The preinstalled versions are incompatible with Axolotl.
- pip uninstall torch -y
- pip uninstall flash-attn -y
- pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
- pip install --no-build-isolation axolotl[flash-attn,deepspeed]
- wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
# Axolotl includes hf-xet version 1.1.0, which fails during downloads. Replacing it with the latest version (1.1.2).
- pip uninstall -y hf-xet
- pip install hf-xet --no-cache-dir
# Note: accelerate's own launcher flags must come before the training module;
# a literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    -m axolotl.cli.train qlora-fsdp-70b.yaml \
    --hub-model-id $HUB_MODEL_ID \
    --output-dir /checkpoints/qlora-llama3-70b \
    --wandb-project $WANDB_PROJECT \
    --wandb-name $WANDB_NAME

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>

!!! note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

### Applying the configuration
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml

# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle

Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y

Provisioning...
---> 100%
```
</div>

## Source code

The source code of this example can be found in
[`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
184 changes: 184 additions & 0 deletions examples/distributed-training/trl/README.md
@@ -0,0 +1,184 @@
# TRL

This example walks you through how to run distributed fine-tuning using [TRL](https://github.com/huggingface/trl), [Accelerate](https://github.com/huggingface/accelerate), and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`.

<div class="termy">

```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
```
</div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
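For instance, the fleet can be created from a fleet configuration file and then listed to verify it's provisioned (a sketch; `fleet.dstack.yml` is an assumed file name holding your fleet configuration):

```shell
# Provision the fleet from its configuration
$ dstack apply -f fleet.dstack.yml

# List fleets and their instances to verify provisioning
$ dstack fleet
```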

## Run Distributed Training
Once the fleet is created, define a distributed task configuration. Here are examples of a distributed supervised fine-tuning (SFT) task using FSDP and DeepSpeed ZeRO-3.


=== "FSDP"

<div editor-title="examples/distributed-training/trl/fsdp.dstack.yml">
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: trl-train-fsdp-distrib

# Size of the cluster
nodes: 2

image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
# Review note (@Bihan): HUB_MODEL_ID is the repository ID the fine-tuned model is pushed to
# on the Hugging Face Hub (format: username/repo-name); MODEL_ID is the base model to fine-tune.
# Commands of the task
commands:
- pip install transformers
- pip install bitsandbytes
- pip install peft
- pip install wandb
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
# A literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=examples/accelerate_configs/fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    trl/scripts/sft.py \
    --model_name $MODEL_ID \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation flash_attention_2 \
    --logging_steps=10 \
    --output_dir /checkpoints/llama31-ft \
    --hub_model_id $HUB_MODEL_ID \
    --torch_dtype bfloat16

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>

=== "DeepSpeed ZeRO-3"

<div editor-title="examples/distributed-training/trl/deepspeed.dstack.yml">
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: trl-train-deepspeed-distrib

# Size of the cluster
nodes: 2

image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID

# Commands of the task
commands:
- pip install transformers
# Review note: these separate pip install commands could be combined into one,
# or switched to `uv pip install`, which is now the recommended approach.
- pip install bitsandbytes
- pip install peft
- pip install wandb
- pip install deepspeed
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
# Using a literal block (- |), per review, so the multi-line command is easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    trl/scripts/sft.py \
    --model_name $MODEL_ID \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation flash_attention_2 \
    --logging_steps=10 \
    --output_dir /checkpoints/llama31-ft \
    --hub_model_id $HUB_MODEL_ID \
    --torch_dtype bfloat16

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>


!!! note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
> **Contributor:** Do you know which specific libraries are missing from `dstack`'s default Docker image? cc @un-def
>
> **Author (@Bihan):** As far as I remember, `dstack`'s default Docker image failed because `libnccl-net.so` was not found:
>
> ```
> lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
> lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
> ```
### Applying the configuration
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
# Note (from review): environment variables whose values aren't set in the YAML
# (HF_TOKEN, WANDB_API_KEY, etc.) must be set in the shell before running dstack apply
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle

Submit the run trl-train-fsdp-distrib? [y/n]: y

Provisioning...
---> 100%
```
</div>
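As the review notes, environment variables that are declared without values in the YAML are taken from the shell that runs `dstack apply`, so export them first. A sketch with placeholder values:

```shell
# Placeholder values; substitute your own credentials and repo ID
export HF_TOKEN=<your-hf-token>
export WANDB_API_KEY=<your-wandb-api-key>
export HUB_MODEL_ID=<username/repo-name>
```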

## Source code

The source code of this example can be found in
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
    [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).

> **Contributor:** Add a link to the Clusters guide too.
>
> **Author (@Bihan):** I have added the link to the Clusters guide in the Create fleet section ("For more details on how to use clusters with `dstack`, check the Clusters guide."), so I did not add it in the What's next? section.