Merged
22 changes: 22 additions & 0 deletions docs/examples.md
@@ -97,6 +97,28 @@ hide:
with RAGEN, verl, and Ray.
</p>
</a>
<a href="/examples/distributed-training/trl"
class="feature-cell sky">
<h3>
TRL
</h3>

<p>
Fine-tune LLMs on multiple nodes
with TRL, Accelerate, and DeepSpeed.
</p>
</a>
<a href="/examples/distributed-training/axolotl"
class="feature-cell sky">
<h3>
Axolotl
</h3>

<p>
Fine-tune LLMs on multiple nodes
with Axolotl.
</p>
</a>
</div>

## Inference
45 changes: 45 additions & 0 deletions examples/distributed-training/axolotl/.dstack.yml
@@ -0,0 +1,45 @@
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- NCCL_DEBUG=INFO
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- WANDB_NAME=axolotl-dist-llama-qlora-train
- WANDB_PROJECT
- HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B
# Commands of the task
commands:
# The torch and flash-attn preinstalled in the NGC container are not compatible with axolotl.
- pip uninstall torch -y
- pip uninstall flash-attn -y
- pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
- pip install --no-build-isolation axolotl[flash-attn,deepspeed]
- wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
# axolotl includes hf-xet 1.1.0, which crashes while downloading, so we install the latest version (1.1.2)
- pip uninstall -y hf-xet
- pip install hf-xet --no-cache-dir
# Note: accelerate's own launcher flags must come before the training module;
# a literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    -m axolotl.cli.train qlora-fsdp-70b.yaml \
    --hub-model-id $HUB_MODEL_ID \
    --output-dir /checkpoints/qlora-llama3-70b \
    --wandb-project $WANDB_PROJECT \
    --wandb-name $WANDB_NAME


resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
107 changes: 107 additions & 0 deletions examples/distributed-training/axolotl/README.md
@@ -0,0 +1,107 @@
# Axolotl

This example walks you through how to run distributed fine-tuning with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) and `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`.

<div class="termy">

```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
```
</div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
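As a sketch, a minimal cloud fleet configuration matching this example might look like the following (the fleet name is a placeholder, and for SSH fleets of on-prem machines you would list hosts under `ssh_config` instead):

```yaml
type: fleet
# A placeholder name; pick your own
name: axolotl-fleet

# Number of interconnected instances
nodes: 2
# Place instances on the same backplane for fast inter-node communication
placement: cluster

resources:
  gpu: 80GB:8
```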

## Run Distributed Training
Once the fleet is created, define a distributed task configuration. Here's an example of a distributed QLoRA fine-tuning task using FSDP.

<div editor-title="examples/distributed-training/axolotl/.dstack.yml">

```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- NCCL_DEBUG=INFO
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- WANDB_NAME=axolotl-dist-llama-qlora-train
- WANDB_PROJECT
- HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B

# Commands of the task
commands:
# Replacing the default Torch and FlashAttention in the NGC container with Axolotl-compatible versions.
# The preinstalled versions are incompatible with Axolotl.
- pip uninstall torch -y
- pip uninstall flash-attn -y
- pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
- pip install --no-build-isolation axolotl[flash-attn,deepspeed]
- wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
# Axolotl includes hf-xet version 1.1.0, which fails during downloads. Replacing it with the latest version (1.1.2).
- pip uninstall -y hf-xet
- pip install hf-xet --no-cache-dir
# Note: accelerate's own launcher flags must come before the training module;
# a literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    -m axolotl.cli.train qlora-fsdp-70b.yaml \
    --hub-model-id $HUB_MODEL_ID \
    --output-dir /checkpoints/qlora-llama3-70b \
    --wandb-project $WANDB_PROJECT \
    --wandb-name $WANDB_NAME

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>

!!! note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

### Applying the configuration
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml

# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle

Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y

Provisioning...
---> 100%
```
</div>

## Source code

The source code of this example can be found in
[`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
184 changes: 184 additions & 0 deletions examples/distributed-training/trl/README.md
@@ -0,0 +1,184 @@
# TRL

This example walks you through how to run distributed fine-tuning using [TRL](https://github.com/huggingface/trl), [Accelerate](https://github.com/huggingface/accelerate), and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`.

<div class="termy">

```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
$ dstack init
```
</div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
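For instance, the fleet can be created from a fleet configuration file and then listed to verify it's provisioned (a sketch; `fleet.dstack.yml` is an assumed file name holding your fleet configuration):

```shell
# Provision the fleet from its configuration
$ dstack apply -f fleet.dstack.yml

# List fleets and their instances to verify provisioning
$ dstack fleet
```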

## Run Distributed Training
Once the fleet is created, define a distributed task configuration. Here are examples of a distributed supervised fine-tuning (SFT) task using FSDP and DeepSpeed ZeRO-3.


=== "FSDP"

<div editor-title="examples/distributed-training/trl/fsdp.dstack.yml">
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: trl-train-fsdp-distrib

# Size of the cluster
nodes: 2

image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
# Review note (@Bihan): HUB_MODEL_ID is the repository ID the fine-tuned model is pushed to
# on the Hugging Face Hub (format: username/repo-name); MODEL_ID is the base model to fine-tune.
# Commands of the task
commands:
- pip install transformers
- pip install bitsandbytes
- pip install peft
- pip install wandb
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
# A literal block (- |) keeps the multi-line command easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=examples/accelerate_configs/fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    trl/scripts/sft.py \
    --model_name $MODEL_ID \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation flash_attention_2 \
    --logging_steps=10 \
    --output_dir /checkpoints/llama31-ft \
    --hub_model_id $HUB_MODEL_ID \
    --torch_dtype bfloat16

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>

=== "DeepSpeed ZeRO-3"

<div editor-title="examples/distributed-training/trl/deepspeed.dstack.yml">
```yaml
type: task
# The name is optional, if not specified, generated randomly
name: trl-train-deepspeed-distrib

# Size of the cluster
nodes: 2

image: nvcr.io/nvidia/pytorch:25.01-py3

# Required environment variables
env:
- HF_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID

# Commands of the task
commands:
- pip install transformers
# Review note: these separate pip install commands could be combined into one,
# or switched to `uv pip install`, which is now the recommended approach.
- pip install bitsandbytes
- pip install peft
- pip install wandb
- pip install deepspeed
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
# Using a literal block (- |), per review, so the multi-line command is easy to copy into a shell when debugging
- |
  accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    trl/scripts/sft.py \
    --model_name $MODEL_ID \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation flash_attention_2 \
    --logging_steps=10 \
    --output_dir /checkpoints/llama31-ft \
    --hub_model_id $HUB_MODEL_ID \
    --torch_dtype bfloat16

resources:
gpu: 80GB:8
shm_size: 128GB

volumes:
- /checkpoints:/checkpoints
```
</div>


!!! note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
> **Contributor:** Do you know which specific libraries are missing from `dstack`'s default Docker image? cc @un-def
>
> **Author (@Bihan):** As far as I remember, `dstack`'s default Docker image failed because `libnccl-net.so` was not found:
>
> ```
> lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
> lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
> lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
> ```
### Applying the configuration
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
# Note (from review): environment variables whose values aren't set in the YAML
# (HF_TOKEN, WANDB_API_KEY, etc.) must be set in the shell before running dstack apply
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle

Submit the run trl-train-fsdp-distrib? [y/n]: y

Provisioning...
---> 100%
```
</div>
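As the review notes, environment variables that are declared without values in the YAML are taken from the shell that runs `dstack apply`, so export them first. A sketch with placeholder values:

```shell
# Placeholder values; substitute your own credentials and repo ID
export HF_TOKEN=<your-hf-token>
export WANDB_API_KEY=<your-wandb-api-key>
export HUB_MODEL_ID=<username/repo-name>
```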

## Source code

The source code of this example can be found in
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
    [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).

> **Contributor:** Add a link to the Clusters guide too.
>
> **Author (@Bihan):** I have added the link to the Clusters guide in the Create fleet section ("For more details on how to use clusters with `dstack`, check the Clusters guide."), so I did not add it in the What's next? section.