
Add distributed Axolotl and TRL example#2703

Merged
Bihan merged 6 commits into dstackai:master from Bihan:add_dist_training_axolotl_trl
May 29, 2025
Conversation

Collaborator

@Bihan Bihan commented May 27, 2025

No description provided.

@Bihan Bihan requested a review from peterschmidt85 May 27, 2025 12:37
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
Contributor

What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? Why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 HUB_MODEL_ID: the repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name).

I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID as in TRL example.
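For context, the distinction between the two variables in the task's env section might look like this (a sketch based on the snippet quoted above; HUB_MODEL_ID is left without a value so it is taken from the local environment):

```yaml
env:
  - MODEL_ID=meta-llama/Llama-3.1-8B  # base model to fine-tune
  - HUB_MODEL_ID                      # target Hub repo (username/repo-name), passed from the local shell
  - WANDB_API_KEY
```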



!!! Note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
Contributor

Do you know which specific drivers are missing in dstack's default Docker image?

cc @un-def

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error: libnccl-net.so was not found.

lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
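As a quick sanity check (not from this PR, just a generic way to inspect a container), one can ask the dynamic linker whether any NCCL library, including the net plugin, is visible:

```shell
# Check whether NCCL libraries (including libnccl-net.so) are visible to the
# dynamic linker; if the net plugin is absent, NCCL logs "No plugin found"
# and falls back to its internal network plugin (no RDMA/InfiniBand transport).
ldconfig -p | grep -i "libnccl" || echo "libnccl not registered with ldconfig"
```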


# Commands of the task
commands:
- pip install transformers
Contributor

Why use separate pip install commands instead of a single pip install command with multiple packages?

Collaborator

What about using uv pip install in examples since we now recommend uv?

Collaborator Author

I will update with uv pip install.
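The updated commands section could then look something like this (a sketch; the package list is illustrative, not taken from the final diff):

```yaml
commands:
  # Install uv once, then install all packages in a single resolved invocation
  - pip install uv
  - uv pip install --system transformers datasets accelerate
```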

- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
Contributor

For multi-line commands such as accelerate launch, should we use the - | syntax?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 Yes, we can use - | like below for every multi-line command.

 - |
   accelerate launch \
     --config_file=examples/accelerate_configs/fsdp1.yaml \
     --main_process_ip=$DSTACK_MASTER_NODE_IP \
     --main_process_port=8008 \
     --machine_rank=$DSTACK_NODE_RANK \
     --num_processes=$DSTACK_GPUS_NUM \
     --num_machines=$DSTACK_NODES_NUM \
     trl/scripts/sft.py \
     --model_name $MODEL_ID \
     --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
     --dataset_text_field="text" \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
     --gradient_accumulation_steps 4 \
     --learning_rate 2e-4 \
     --report_to wandb 

This would make copying and pasting a multi-line command into a shell very easy during debugging.
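For reference, the DSTACK_* variables in the launch command above are injected by dstack on each node; a minimal sketch of how accelerate's arguments relate to them (the 2-node, 8-GPU-per-node cluster shape is an illustrative assumption, not from this PR):

```python
import os

# Illustrative cluster shape (an assumption for this sketch): 2 nodes, 8 GPUs each.
os.environ["DSTACK_NODES_NUM"] = "2"
os.environ["DSTACK_GPUS_NUM"] = "16"   # total GPUs across all nodes
os.environ["DSTACK_NODE_RANK"] = "0"   # 0 on the master node

num_machines = int(os.environ["DSTACK_NODES_NUM"])   # --num_machines
num_processes = int(os.environ["DSTACK_GPUS_NUM"])   # --num_processes (one per GPU)
machine_rank = int(os.environ["DSTACK_NODE_RANK"])   # --machine_rank

# accelerate starts num_processes / num_machines workers on each node;
# global ranks on this node start at machine_rank * per_node.
per_node = num_processes // num_machines
first_global_rank = machine_rank * per_node
print(per_node, first_global_rank)  # → 8 0
```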

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
Contributor

Need to set environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc).
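A sketch of what that looks like from the user's shell (placeholder values; dstack picks up env vars that the YAML declares without a value, and the apply command is shown commented since it needs a configured dstack project):

```shell
export HF_TOKEN="<your-hf-token>"        # placeholder, replace with a real token
export WANDB_API_KEY="<your-wandb-key>"  # placeholder
# dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```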

[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Contributor

Add a link to the Clusters guide too.

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 I have added a link to the Clusters guide in the Create fleet section as below:
"For more details on how to use clusters with dstack, check the Clusters guide."

Therefore I did not add it in the What's next? section.

Contributor

@peterschmidt85 peterschmidt85 left a comment

Also:
Remove the multi-node example from Fine-tuning | TRL
Add links to Distributed training | TRL from Fine-tuning | TRL
Add links to Distributed training | Axolotl from Fine-tuning | Axolotl

@Bihan Bihan merged commit 36cb5aa into dstackai:master May 29, 2025
25 checks passed