Add distributed Axolotl and TRL example #2703
Conversation
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? Why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?
@peterschmidt85 HUB_MODEL_ID is the repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name).
I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID, as in the TRL example.
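For reference, a minimal sketch of how the two variables could be declared in the task's `env` section (values are taken from the example; leaving HUB_MODEL_ID without a value means it is supplied by the caller):

```yaml
env:
  # Base model to fine-tune
  - MODEL_ID=meta-llama/Llama-3.1-8B
  # Target repo on the Hugging Face Hub (username/repo-name); value supplied at apply time
  - HUB_MODEL_ID
```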
!!! Note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
Do you know which specific drivers are missing in dstack's default Docker image?
cc @un-def
@peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error: libnccl-net.so was not found.
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
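A sketch of how the NGC image could be referenced in the task configuration; the exact tag is an assumption and should match the cluster's CUDA driver:

```yaml
# The NGC PyTorch image ships NCCL together with the IB/RDMA userspace libraries
# and the libnccl-net plugin that the default image failed to load above.
image: nvcr.io/nvidia/pytorch:24.07-py3
```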
# Commands of the task
commands:
  - pip install transformers
Why use separate pip install commands instead of a single pip install command with multiple packages?
What about using uv pip install in examples since we now recommend uv?
I will update the examples to use uv pip install.
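Roughly like this (a sketch mirroring the current commands; it assumes uv is already available in the image):

```yaml
commands:
  # single uv invocation instead of a separate pip call
  - uv pip install transformers
  - git clone https://github.com/huggingface/trl
  - cd trl
  - uv pip install .
```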
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
For multi-line commands such as accelerate launch, should we use the `- |` syntax?
@peterschmidt85 Yes, we can use `- |` as below for every multi-line command.
- |
accelerate launch \
--config_file=examples/accelerate_configs/fsdp1.yaml \
--main_process_ip=$DSTACK_MASTER_NODE_IP \
--main_process_port=8008 \
--machine_rank=$DSTACK_NODE_RANK \
--num_processes=$DSTACK_GPUS_NUM \
--num_machines=$DSTACK_NODES_NUM \
trl/scripts/sft.py \
--model_name $MODEL_ID \
--dataset_name OpenAssistant/oasst_top1_2023-08-25 \
--dataset_text_field="text" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-4 \
--report_to wandb
This would also make copying and pasting multi-line commands into a shell very easy during debugging.
<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```
We need to set the environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc.).
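For example, by exporting them in the shell before applying (a sketch; it assumes variables declared without values in the YAML `env` section are picked up from the caller's environment):

```shell
$ export HF_TOKEN=<your Hugging Face token>
$ export WANDB_API_KEY=<your W&B API key>
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```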
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).
!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Add a link to the Clusters guide too
@peterschmidt85 I have added a link to the Clusters guide in the Create Fleet section, as below:
"For more details on how to use clusters with dstack, check the Clusters guide."
Therefore, I did not add it to the What's next? section.
peterschmidt85 left a comment:
Also:
- Remove the multi-node example from Fine-tuning | TRL
- Add a link to Distributed training | TRL from Fine-tuning | TRL
- Add a link to Distributed training | Axolotl from Fine-tuning | Axolotl
No description provided.