
Add distributed Axolotl and TRL example#2703

Merged
Bihan merged 6 commits into dstackai:master from Bihan:add_dist_training_axolotl_trl
May 29, 2025
Conversation

Collaborator

@Bihan Bihan commented May 27, 2025

No description provided.

@Bihan Bihan requested a review from peterschmidt85 May 27, 2025 12:37
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
Contributor

What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? Why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 HUB_MODEL_ID: the repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name).

I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID as in TRL example.
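For context, the distinction between the two variables in the task's env section might look like this (a sketch based on the snippet quoted above; HUB_MODEL_ID is left without a value so it is taken from the local environment):

```yaml
env:
  - MODEL_ID=meta-llama/Llama-3.1-8B  # base model to fine-tune
  - HUB_MODEL_ID                      # target Hub repo (username/repo-name), passed from the local shell
  - WANDB_API_KEY
```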



!!! Note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
Contributor

Do you know which specific drivers are missing in dstack's default Docker image?

cc @un-def

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error: libnccl-net.so was not found.

lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
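As a quick sanity check (not from this PR, just a generic way to inspect a container), one can ask the dynamic linker whether any NCCL library, including the net plugin, is visible:

```shell
# Check whether NCCL libraries (including libnccl-net.so) are visible to the
# dynamic linker; if the net plugin is absent, NCCL logs "No plugin found"
# and falls back to its internal network plugin (no RDMA/InfiniBand transport).
ldconfig -p | grep -i "libnccl" || echo "libnccl not registered with ldconfig"
```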


# Commands of the task
commands:
- pip install transformers
Contributor

Why use separate pip install commands instead of a single pip install command with multiple packages?

Collaborator

What about using uv pip install in examples since we now recommend uv?

Collaborator Author

I will update with uv pip install.
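The updated commands section could then look something like this (a sketch; the package list is illustrative, not taken from the final diff):

```yaml
commands:
  # Install uv once, then install all packages in a single resolved invocation
  - pip install uv
  - uv pip install --system transformers datasets accelerate
```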

- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
Contributor

For multi-line commands such as accelerate launch, should we use the - | syntax?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 Yes, we can use - | like below for every multi-line command.

 - |
   accelerate launch \
     --config_file=examples/accelerate_configs/fsdp1.yaml \
     --main_process_ip=$DSTACK_MASTER_NODE_IP \
     --main_process_port=8008 \
     --machine_rank=$DSTACK_NODE_RANK \
     --num_processes=$DSTACK_GPUS_NUM \
     --num_machines=$DSTACK_NODES_NUM \
     trl/scripts/sft.py \
     --model_name $MODEL_ID \
     --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
     --dataset_text_field="text" \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
     --gradient_accumulation_steps 4 \
     --learning_rate 2e-4 \
     --report_to wandb 

This would make copying and pasting a multi-line command into a shell very easy during debugging.
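For reference, the DSTACK_* variables in the launch command above are injected by dstack on each node; a minimal sketch of how accelerate's arguments relate to them (the 2-node, 8-GPU-per-node cluster shape is an illustrative assumption, not from this PR):

```python
import os

# Illustrative cluster shape (an assumption for this sketch): 2 nodes, 8 GPUs each.
os.environ["DSTACK_NODES_NUM"] = "2"
os.environ["DSTACK_GPUS_NUM"] = "16"   # total GPUs across all nodes
os.environ["DSTACK_NODE_RANK"] = "0"   # 0 on the master node

num_machines = int(os.environ["DSTACK_NODES_NUM"])   # --num_machines
num_processes = int(os.environ["DSTACK_GPUS_NUM"])   # --num_processes (one per GPU)
machine_rank = int(os.environ["DSTACK_NODE_RANK"])   # --machine_rank

# accelerate starts num_processes / num_machines workers on each node;
# global ranks on this node start at machine_rank * per_node.
per_node = num_processes // num_machines
first_global_rank = machine_rank * per_node
print(per_node, first_global_rank)  # → 8 0
```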

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
Contributor

Need to set environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc).
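A sketch of what that looks like from the user's shell (placeholder values; dstack picks up env vars that the YAML declares without a value, and the apply command is shown commented since it needs a configured dstack project):

```shell
export HF_TOKEN="<your-hf-token>"        # placeholder, replace with a real token
export WANDB_API_KEY="<your-wandb-key>"  # placeholder
# dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```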

[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Contributor

Add a link to the Clusters guide too.

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 I have added a link to the Clusters guide in the Create fleet section as below:
"For more details on how to use clusters with dstack, check the Clusters guide."

Therefore I did not add it in the What's next? section.

Contributor

@peterschmidt85 peterschmidt85 left a comment

Also:
Remove the multi-node example from Fine-tuning | TRL
Add links to Distributed training | TRL from Fine-tuning | TRL
Add links to Distributed training | Axolotl from Fine-tuning | Axolotl

@Bihan Bihan merged commit 36cb5aa into dstackai:master May 29, 2025
25 checks passed