1 change: 1 addition & 0 deletions .gitignore
@@ -24,3 +24,4 @@ build/
.vscode
.aider*
uv.lock
.local/
2 changes: 1 addition & 1 deletion docs/docs/guides/clusters.md
@@ -76,5 +76,5 @@ Refer to [instance volumes](../concepts/volumes.md#instance) for an example.

!!! info "What's next?"
1. Read about [distributed tasks](../concepts/tasks.md#distributed-tasks), [fleets](../concepts/fleets.md), and [volumes](../concepts/volumes.md)
2. Browse the [Clusters](../../examples.md#clusters) examples
2. Browse the [Clusters](../../examples.md#clusters) and [Distributed training](../../examples.md#distributed-training) examples

18 changes: 17 additions & 1 deletion docs/examples.md
@@ -83,6 +83,22 @@ hide:
</a>
</div>

## Distributed training

<div class="tx-landing__highlights_grid">
<a href="/examples/distributed-training/ray-ragen"
class="feature-cell sky">
<h3>
Ray+RAGEN
</h3>

<p>
Fine-tune an agent on multiple nodes
with RAGEN, verl, and Ray.
</p>
</a>
</div>

## Inference

<div class="tx-landing__highlights_grid">
@@ -128,7 +144,7 @@ hide:
TensorRT-LLM
</h3>
<p>
Deploy DeepSeek R1 and its distilled version with TensorRT-LLM
Deploy DeepSeek models with TensorRT-LLM
</p>
</a>
</div>
Empty file.
1 change: 1 addition & 0 deletions docs/overrides/main.html
@@ -119,6 +119,7 @@
<div class="tx-footer__section-title">Examples</div>
<a href="/examples#fine-tuning" class="tx-footer__section-link">Fine-tuning</a>
<a href="/examples#clusters" class="tx-footer__section-link">Clusters</a>
<a href="/examples#distributed-training" class="tx-footer__section-link">Distributed training</a>
<a href="/examples#inference" class="tx-footer__section-link">Inference</a>
<a href="/examples#accelerators" class="tx-footer__section-link">Accelerators</a>
<a href="/examples#llms" class="tx-footer__section-link">LLMs</a>
11 changes: 6 additions & 5 deletions examples/.dstack.yml
@@ -2,14 +2,15 @@ type: dev-environment
# The name is optional, if not specified, generated randomly
name: vscode

python: "3.11"
# Uncomment to use a custom Docker image
#image: dstackai/base:py3.13-0.7-cuda-12.1
#python: "3.11"

image: un1def/dstack-base:py3.12-dev-cuda-12.1

ide: vscode

# Use either spot or on-demand instances
spot_policy: auto
#spot_policy: auto

resources:
gpu: 1
cpu: x86:8..32
gpu: 24GB..:1
39 changes: 39 additions & 0 deletions examples/distributed-training/ray-ragen/.dstack.yml
@@ -0,0 +1,39 @@
type: task
name: ray-ragen-cluster

nodes: 2

env:
- WANDB_API_KEY
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
- wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- bash miniconda.sh -b -p /workflow/miniconda
- eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
- git clone https://github.com/RAGEN-AI/RAGEN.git
- cd RAGEN
- bash scripts/setup_ragen.sh
- conda activate ragen
- cd verl
- pip install --no-deps -e .
- pip install hf_transfer hf_xet
- pip uninstall -y ray
- pip install -U "ray[default]"
- |
if [ $DSTACK_NODE_RANK = 0 ]; then
ray start --head --port=6379;
else
ray start --address=$DSTACK_MASTER_NODE_IP:6379
fi

# Expose Ray dashboard port
ports:
- 8265

resources:
gpu: 80GB:8
shm_size: 128GB

# Save checkpoints on the instance
volumes:
- /checkpoints:/checkpoints
133 changes: 133 additions & 0 deletions examples/distributed-training/ray-ragen/README.md
@@ -0,0 +1,133 @@
# Ray + RAGEN

This example shows how to use `dstack` and [RAGEN :material-arrow-top-right-thin:{ .external }](https://github.com/RAGEN-AI/RAGEN){:target="_blank"}
to fine-tune an agent on multiple nodes.

Under the hood, `RAGEN` uses [verl :material-arrow-top-right-thin:{ .external }](https://github.com/volcengine/verl){:target="_blank"} for reinforcement learning and [Ray :material-arrow-top-right-thin:{ .external }](https://docs.ray.io/en/latest/){:target="_blank"} for distributed training.

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
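
As a minimal sketch, a fleet configuration could look like the following (the fleet name and resources are illustrative; adjust them to your hardware):

```yaml
type: fleet
# The name is optional; if not specified, it's generated randomly
name: ray-ragen-fleet

# Number of interconnected nodes
nodes: 2
# Provision instances as a cluster so nodes can communicate
placement: cluster

resources:
  gpu: 80GB:8
```

Once applied with `dstack apply -f`, the fleet can be reused by the distributed task below.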

## Run a Ray cluster

If you want to use Ray with `dstack`, you first have to run a Ray cluster.

The task below runs a Ray cluster on an existing fleet:

<div editor-title="examples/distributed-training/ray-ragen/.dstack.yml">

```yaml
type: task
name: ray-ragen-cluster

nodes: 2

env:
- WANDB_API_KEY
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
- wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- bash miniconda.sh -b -p /workflow/miniconda
- eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
- git clone https://github.com/RAGEN-AI/RAGEN.git
- cd RAGEN
- bash scripts/setup_ragen.sh
- conda activate ragen
- cd verl
- pip install --no-deps -e .
- pip install hf_transfer hf_xet
- pip uninstall -y ray
- pip install -U "ray[default]"
- |
if [ $DSTACK_NODE_RANK = 0 ]; then
ray start --head --port=6379;
else
ray start --address=$DSTACK_MASTER_NODE_IP:6379
fi

# Expose Ray dashboard port
ports:
- 8265

resources:
gpu: 80GB:8
shm_size: 128GB

# Save checkpoints on the instance
volumes:
- /checkpoints:/checkpoints
```

</div>

We use verl's Docker image for vLLM with FSDP. See [Installation :material-arrow-top-right-thin:{ .external }](https://verl.readthedocs.io/en/latest/start/install.html){:target="_blank"} for more details.

The `RAGEN` setup script `scripts/setup_ragen.sh` isolates dependencies within a Conda environment.

Note that the Ray installation in the RAGEN environment does not include the dashboard, so we reinstall Ray with `ray[default]`.

Now, if you run this task via `dstack apply`, it will automatically forward Ray's dashboard port to `localhost:8265`.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/ray-ragen/.dstack.yml
```

</div>

As long as `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
If `dstack apply` has detached, you can re-attach with `dstack attach`.
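
For example, assuming the run name from the task above:

<div class="termy">

```shell
$ dstack attach ray-ragen-cluster
```

</div>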

## Submit Ray jobs

Before you can submit Ray jobs, make sure `ray` is installed locally:

<div class="termy">

```shell
$ pip install ray
```

</div>

Now you can submit the training job to the Ray cluster, which is available at `localhost:8265`:

<div class="termy">

```shell
$ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- bash -c "\
export PYTHONPATH=/workflow/RAGEN; \
cd /workflow/RAGEN; \
/workflow/miniconda/envs/ragen/bin/python train.py \
--config-name base \
system.CUDA_VISIBLE_DEVICES=[0,1,2,3,4,5,6,7] \
model_path=Qwen/Qwen2.5-7B-Instruct \
trainer.experiment_name=agent-fine-tuning-Qwen2.5-7B \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
micro_batch_size_per_gpu=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=50 \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=4"
```

</div>

!!! info "Training parameters"
1. `actor_rollout_ref.rollout.tensor_model_parallel_size=4`, because `Qwen/Qwen2.5-7B-Instruct` has 28 attention heads, and the number of attention heads must be divisible by `tensor_model_parallel_size` (28 is divisible by 4 but not by 8)
2. `actor_rollout_ref.rollout.tp_size_check=False`, because with the check enabled, `tensor_model_parallel_size` would have to equal `trainer.n_gpus_per_node`
3. `micro_batch_size_per_gpu=2`, to keep the RAGEN paper's `rollout_filter_ratio` and `es_manager` settings unchanged for a world size of `16`

Using Ray via `dstack` is a powerful way to access the rich Ray ecosystem while benefiting from `dstack`'s provisioning capabilities.

!!! info "What's next"
1. Check the [Clusters](https://dstack.ai/docs/guides/clusters) guide
2. Read about [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks) and [fleets](https://dstack.ai/docs/concepts/fleets)
3. Browse Ray's [docs :material-arrow-top-right-thin:{ .external }](https://docs.ray.io/en/latest/train/examples.html){:target="_blank"} for other examples.
2 changes: 1 addition & 1 deletion examples/misc/ray/README.md
@@ -33,7 +33,7 @@ name: ray-cluster
nodes: 4
commands:
- pip install -U "ray[default]"
- >
- |
if [ $DSTACK_NODE_RANK = 0 ]; then
ray start --head --port=6379;
else
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -264,6 +264,8 @@ nav:
- RCCL tests: examples/clusters/rccl-tests/index.md
- A3 Mega: examples/clusters/a3mega/index.md
- A3 High: examples/clusters/a3high/index.md
- Distributed training:
- Ray+RAGEN: examples/distributed-training/ray-ragen/index.md
- Deployment:
- SGLang: examples/inference/sglang/index.md
- vLLM: examples/inference/vllm/index.md