# TRL

This example walks you through how to use [TRL](https://github.com/huggingface/trl) to fine-tune Llama-3.1-8B with dstack, using SFT with QLoRA.

## Define a configuration

Below is a task configuration that performs the fine-tuning.

```yaml
type: task
name: trl-train

python: 3.12
# Ensure nvcc is installed (required for Flash Attention)
nvcc: true

env:
  - HF_TOKEN
  - WANDB_API_KEY
  - HUB_MODEL_ID
commands:
  # Pin torch==2.6.0 to avoid building Flash Attention from source.
  # Prebuilt Flash Attention wheels are not available for the latest torch==2.7.0.
  - uv pip install torch==2.6.0
  - uv pip install transformers bitsandbytes peft wandb
  - uv pip install flash_attn --no-build-isolation
  - git clone https://github.com/huggingface/trl
  - cd trl
  - uv pip install .
  - |
    accelerate launch \
      --config_file=examples/accelerate_configs/multi_gpu.yaml \
      --num_processes $DSTACK_GPUS_PER_NODE \
      trl/scripts/sft.py \
      --model_name meta-llama/Meta-Llama-3.1-8B \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field="text" \
      --per_device_train_batch_size 1 \
      --per_device_eval_batch_size 1 \
      --gradient_accumulation_steps 4 \
      --learning_rate 2e-4 \
      --report_to wandb \
      --bf16 \
      --max_seq_length 1024 \
      --lora_r 16 \
      --lora_alpha 32 \
      --lora_target_modules q_proj k_proj v_proj o_proj \
      --load_in_4bit \
      --use_peft \
      --attn_implementation "flash_attention_2" \
      --logging_steps=10 \
      --output_dir models/llama31 \
      --hub_model_id peterschmidt85/FineLlama-3.1-8B

resources:
  gpu:
    # 24GB or more VRAM
    memory: 24GB..
    # One or more GPUs
    count: 1..
  # Shared memory (for multi-gpu)
  shm_size: 24GB
```

Change the `resources` property to specify more GPUs.
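
For example, here is a sketch of requesting four GPUs with more shared memory; the exact sizes below are illustrative, not required by the example:

```yaml
resources:
  gpu:
    # 40GB or more VRAM per GPU (illustrative)
    memory: 40GB..
    # Exactly four GPUs
    count: 4
  # Shared memory for multi-GPU communication (illustrative)
  shm_size: 64GB
```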

!!! info "AMD" The example above uses NVIDIA accelerators. To use it with AMD, check out AMD.

??? info "DeepSpeed" For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.

To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of 
`examples/accelerate_configs/multi_gpu.yaml`.
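
    Below is a minimal sketch of the changed launch step, assuming the installation commands, environment variables, and resources stay exactly as in the task above; only the `--config_file` value differs, and the remaining `sft.py` arguments are elided here:

    ```yaml
    commands:
      # ... same uv pip install and git clone steps as in the task above ...
      - |
        accelerate launch \
          --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
          --num_processes $DSTACK_GPUS_PER_NODE \
          trl/scripts/sft.py \
          --model_name meta-llama/Meta-Llama-3.1-8B
          # ... pass the same sft.py arguments as in the task above ...
    ```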

## Run the configuration

Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the cloud resources and run the configuration.

```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ HUB_MODEL_ID=...
$ dstack apply -f examples/single-node-training/trl/train.dstack.yml

 #  BACKEND              RESOURCES                     INSTANCE TYPE  PRICE     
 1  vastai (cz-czechia)  cpu=64 mem=128GB H100:80GB:2  18794506       $3.8907   
 2  vastai (us-texas)    cpu=52 mem=64GB  H100:80GB:2  20442365       $3.6926   
 3  vastai (fr-france)   cpu=64 mem=96GB  H100:80GB:2  20379984       $3.7389

Submit the run trl-train? [y/n]:

Provisioning...
---> 100%
```

## Source code

The source code of this example can be found in `examples/llms/llama31` and `examples/single-node-training/trl`.

## What's next?

  1. Browse the TRL distributed training example
  2. Check dev environments, tasks, services, and fleets
  3. See the AMD example