Add distributed Axolotl and TRL example #2703
New file (@@ -0,0 +1,45 @@):

```yaml
type: task
# The name is optional; if not specified, it's generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
  - HF_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY
  - NCCL_DEBUG=INFO
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - WANDB_NAME=axolotl-dist-llama-qlora-train
  - WANDB_PROJECT
  - HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B
# Commands of the task
commands:
  # The NGC container's torch and flash-attn are not compatible with Axolotl.
  - pip uninstall torch -y
  - pip uninstall flash-attn -y
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl includes hf-xet 1.1.0, which crashes while downloading, so install the latest version instead
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
  # Launcher flags must come before `-m axolotl.cli.train`; everything after it is passed to Axolotl
  - accelerate launch
    --config_file=fsdp1.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    -m axolotl.cli.train qlora-fsdp-70b.yaml
    --hub-model-id $HUB_MODEL_ID
    --output-dir /checkpoints/qlora-llama3-70b
    --wandb-project $WANDB_PROJECT
    --wandb-name $WANDB_NAME

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```
New file (@@ -0,0 +1,107 @@):
# Axolotl

This example walks you through how to run distributed fine-tuning using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead and clone the repo, and run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```

    </div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
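Below is a minimal fleet sketch, assuming an SSH fleet of two interconnected nodes; the fleet name, hosts, user, and key path are placeholders for your own infrastructure:

```yaml
type: fleet
# Placeholder name; pick your own
name: h100-cluster
# Required for multi-node tasks: the hosts must be interconnected
placement: cluster
# Placeholder hosts and credentials
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 10.0.0.1
    - 10.0.0.2
```

Apply it with `dstack apply -f fleet.dstack.yml` before submitting the task below.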
## Run distributed training

Once the fleet is created, define a distributed task configuration. Here's an example of a distributed QLoRA task using FSDP.

<div editor-title="examples/distributed-training/axolotl/.dstack.yml">

```yaml
type: task
# The name is optional; if not specified, it's generated randomly
name: axolotl-multi-node-qlora-llama3-70b

# Size of the cluster
nodes: 2

# The axolotlai/axolotl:main-latest image does not include InfiniBand or RDMA libraries, so we need to use the NGC container.
image: nvcr.io/nvidia/pytorch:25.01-py3
# Required environment variables
env:
  - HF_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY
  - NCCL_DEBUG=INFO
  - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  - WANDB_NAME=axolotl-dist-llama-qlora-train
  - WANDB_PROJECT
  - HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B

# Commands of the task
commands:
  # Replace the default Torch and FlashAttention in the NGC container with Axolotl-compatible versions.
  # The preinstalled versions are incompatible with Axolotl.
  - pip uninstall torch -y
  - pip uninstall flash-attn -y
  - pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/test/cu124
  - pip install --no-build-isolation axolotl[flash-attn,deepspeed]
  - wget https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/fsdp1.yaml
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/qlora-fsdp-70b.yaml
  # Axolotl includes hf-xet 1.1.0, which fails during downloads. Replace it with the latest version.
  - pip uninstall -y hf-xet
  - pip install hf-xet --no-cache-dir
  # Launcher flags must come before `-m axolotl.cli.train`; everything after it is passed to Axolotl
  - accelerate launch
    --config_file=fsdp1.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    -m axolotl.cli.train qlora-fsdp-70b.yaml
    --hub-model-id $HUB_MODEL_ID
    --output-dir /checkpoints/qlora-llama3-70b
    --wandb-project $WANDB_PROJECT
    --wandb-name $WANDB_NAME

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```

</div>
!!! note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

### Applying the configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle

Submit the run axolotl-multi-node-qlora-llama3-70b? [y/n]: y

Provisioning...
---> 100%
```

</div>
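Note that `HF_TOKEN`, `WANDB_API_KEY`, and `WANDB_PROJECT` are declared in the task without values, so `dstack` reads them from the shell that runs `dstack apply`. A sketch with placeholder values:

```shell
$ export HF_TOKEN=hf_...                    # placeholder Hugging Face token
$ export WANDB_API_KEY=...                  # placeholder W&B key
$ export WANDB_PROJECT=axolotl-distributed  # placeholder project name
$ dstack apply -f examples/distributed-training/axolotl/.dstack.yml
```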
## Source code

The source code of this example can be found in
[`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl).

!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
       [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
New file (@@ -0,0 +1,184 @@):
# TRL

This example walks you through how to run distributed fine-tuning using [TRL](https://github.com/huggingface/trl), [Accelerate](https://github.com/huggingface/accelerate), and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead and clone the repo, and run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```

    </div>

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
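For instance, a minimal cloud-fleet sketch (the fleet name and resource spec below are placeholders to adjust for your backend):

```yaml
type: fleet
name: trl-fleet      # placeholder name
nodes: 2             # matches the task's `nodes: 2`
placement: cluster   # provision interconnected instances
resources:
  gpu: 80GB:8
```

Create it with `dstack apply -f fleet.dstack.yml`, then verify it with `dstack fleet`.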
## Run distributed training

Once the fleet is created, define a distributed task configuration. Here's an example of a distributed supervised fine-tuning (SFT) task using FSDP and DeepSpeed ZeRO-3.

=== "FSDP"

    <div editor-title="examples/distributed-training/trl/fsdp.dstack.yml">

    ```yaml
    type: task
    # The name is optional; if not specified, it's generated randomly
    name: trl-train-fsdp-distrib

    # Size of the cluster
    nodes: 2

    image: nvcr.io/nvidia/pytorch:25.01-py3

    # Required environment variables
    env:
      - HF_TOKEN
      - ACCELERATE_LOG_LEVEL=info
      - WANDB_API_KEY
      - MODEL_ID=meta-llama/Llama-3.1-8B
      - HUB_MODEL_ID

    # Commands of the task
    commands:
      - pip install transformers
      - pip install bitsandbytes
      - pip install peft
      - pip install wandb
      - git clone https://github.com/huggingface/trl
      - cd trl
      - pip install .
      - accelerate launch
        --config_file=examples/accelerate_configs/fsdp1.yaml
        --main_process_ip=$DSTACK_MASTER_NODE_IP
        --main_process_port=8008
        --machine_rank=$DSTACK_NODE_RANK
        --num_processes=$DSTACK_GPUS_NUM
        --num_machines=$DSTACK_NODES_NUM
        trl/scripts/sft.py
        --model_name $MODEL_ID
        --dataset_name OpenAssistant/oasst_top1_2023-08-25
        --dataset_text_field="text"
        --per_device_train_batch_size 1
        --per_device_eval_batch_size 1
        --gradient_accumulation_steps 4
        --learning_rate 2e-4
        --report_to wandb
        --bf16
        --max_seq_length 1024
        --attn_implementation flash_attention_2
        --logging_steps=10
        --output_dir /checkpoints/llama31-ft
        --hub_model_id $HUB_MODEL_ID
        --torch_dtype bfloat16

    resources:
      gpu: 80GB:8
      shm_size: 128GB

    volumes:
      - /checkpoints:/checkpoints
    ```

    </div>
| === "Deepseed ZeRO-3" | ||
|
|
||
| <div editor-title="examples/distributed-training/trl/deepspeed.dstack.yml"> | ||
| ```yaml | ||
| type: task | ||
| # The name is optional, if not specified, generated randomly | ||
| name: trl-train-deepspeed-distrib | ||
|
|
||
| # Size of the cluster | ||
| nodes: 2 | ||
|
|
||
| image: nvcr.io/nvidia/pytorch:25.01-py3 | ||
|
|
||
| # Required environment variables | ||
| env: | ||
| - HF_TOKEN | ||
| - ACCELERATE_LOG_LEVEL=info | ||
| - WANDB_API_KEY | ||
| - MODEL_ID=meta-llama/Llama-3.1-8B | ||
| - HUB_MODEL_ID | ||
|
|
||
| # Commands of the task | ||
| commands: | ||
| - pip install transformers | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why use separate
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about using
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will update with |
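The thread above is truncated; assuming the suggestion was simply to consolidate the installs, a single command with the same package set would look like:

```shell
# one dependency-resolution pass instead of five separate ones
pip install transformers bitsandbytes peft wandb deepspeed
```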
Review thread on the multi-line `accelerate launch` command:

> **Contributor:** For such multi-line commands as …
>
> **Collaborator (author):** @peterschmidt85 Yes, we can use … This would make …
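Again truncated; assuming the suggestion was a YAML block scalar, the launch command could be written as a single explicit shell command (the trailing `sft.py` flags are elided here; they are the same as above):

```yaml
commands:
  # ...package installs and `git clone` as above...
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM \
      trl/scripts/sft.py --model_name $MODEL_ID --output_dir /checkpoints/llama31-ft
```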
!!! note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.

Review thread on the note above:

> **Contributor:** Do you know which specific drivers are missing in dstack's default Docker image? cc @un-def
>
> **Collaborator (author):** @peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error.
### Applying the configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle

Submit the run trl-train-fsdp-distrib? [y/n]: y

Provisioning...
---> 100%
```

</div>

Review thread on the `dstack apply` command:

> **Contributor:** Need to set environment variables whose values aren't configured in YAML (`HF_TOKEN`, …)
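As the comment notes, `HF_TOKEN`, `WANDB_API_KEY`, and `HUB_MODEL_ID` have no values in the task YAML, so they must be set in the shell before running `dstack apply` (placeholder values below):

```shell
$ export HF_TOKEN=hf_...                        # placeholder Hugging Face token
$ export WANDB_API_KEY=...                      # placeholder W&B key
$ export HUB_MODEL_ID=your-username/llama31-ft  # placeholder Hub repo to push to
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```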
## Source code

The source code of this example can be found in
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
       [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).

Review thread on the "What's next?" list:

> **Contributor:** Add a link to the Clusters guide too.
>
> **Collaborator (author):** @peterschmidt85 I have added a link to the Clusters guide in …, therefore I did not add it in …
Review thread on the environment variables:

> **Contributor:** What is the `HUB_MODEL_ID` environment variable? How is it different in this context from `MODEL_ID`? Why does the Axolotl example use `HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B`?
>
> **Collaborator (author):** @peterschmidt85 `HUB_MODEL_ID` is the repository ID where the model will be pushed on the Hugging Face Hub (format: `username/repo-name`). I will remove the assignment `HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B` and only use `HUB_MODEL_ID`, as in the TRL example.
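To make the distinction concrete (the destination repo below is a hypothetical placeholder): `MODEL_ID` names the base model pulled from the Hub, while `HUB_MODEL_ID` names the repository the fine-tuned weights are pushed to.

```shell
MODEL_ID=meta-llama/Llama-3.1-8B       # read: base model to fine-tune
HUB_MODEL_ID=your-username/llama31-ft  # write: destination repo on the Hub
```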