**`docs/blog/archive/efa.md`** (+5 -5)
# Efficient distributed training with AWS EFA

[Amazon Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) is a high-performance network interface designed for AWS EC2 instances, enabling
ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling
distributed training workloads across multiple GPUs and instances.
…network interfaces, you’ll need to disable public IPs. Note that the `dstack`
server in this case should have access to the private subnet of the VPC.

You’ll also need to specify an AMI that includes the GDRCopy drivers. For example, you can use the
[AWS Deep Learning Base GPU AMI](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/).
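In a `dstack` fleet configuration, provisioning an EFA-capable cluster typically comes together as follows (a minimal sketch: the fleet name, GPU choice, and node count are illustrative, and the exact fields for pinning a custom AMI may differ across `dstack` versions):

```yaml
type: fleet
name: efa-fleet            # illustrative name
nodes: 2
placement: cluster         # co-locate nodes so the EFA interconnect applies
backends: [aws]
resources:
  gpu: H100:8              # maps to an EFA-capable instance type such as p5.48xlarge
```

With `placement: cluster`, nodes are provisioned close together and the high-bandwidth interconnect is used for inter-node traffic.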
Here is the spec of the bare metal machine we got:

??? info "TGI"
    The `ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm` Docker image was used.

To conduct the tests, we used the [`benchmark_serving`](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) script provided by vLLM.
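A typical invocation looks roughly like this (a sketch only: the flag names follow the vLLM benchmark script at the time of writing and may change across versions, and the model, host, and prompt counts are illustrative):

```bash
# Benchmark a running TGI endpoint with vLLM's benchmark_serving.py
# (model/host/port values below are placeholders).
python benchmarks/benchmark_serving.py \
  --backend tgi \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --host 127.0.0.1 --port 8080 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --request-rate 8
```

The script reports throughput and latency percentiles (TTFT, TPOT) for the configured request rate.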
This difference may be related to how vLLM [pre-allocates GPU cache](https://docs.vllm.ai/en/latest/models/performance.html).

## Conclusion
…like the H100 and H200, as well as possibly Google TPU.

### Source code

The source code used for this benchmark can be found in our
…is the primary sponsor of this benchmark, and we are sincerely grateful for their hardware and support.

If you'd like to use top-tier bare metal compute with AMD GPUs, we recommend going
with Hot Aisle. Once you gain access to a cluster, it can be easily accessed via `dstack`'s [SSH fleets](../../docs/concepts/fleets.md#ssh-fleets).
### RunPod

If you’d like to use on-demand compute with AMD GPUs at affordable prices, you can configure `dstack` to
use [RunPod](https://runpod.io/). In
this case, `dstack` will be able to provision fleets automatically when you run dev environments, tasks, and
**`docs/blog/posts/amd-on-runpod.md`** (+6 -6)
One of the main advantages of the `MI300X` is its VRAM. For example, with the `H100`, you can't fit the FP16
version of Llama 3.1 405B into a single node with 8 GPUs—you'd have to use FP8 instead. However, with the `MI300X`, you
can fit FP16 into a single node with 8 GPUs, and for FP8, you'd only need 4 GPUs.
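The arithmetic behind this claim is easy to check (a back-of-the-envelope sketch counting weight memory only, assuming 192 GB of HBM per `MI300X` and 80 GB per `H100`):

```python
# Weight-only memory estimate for Llama 3.1 405B (ignores KV cache and
# activations, so real deployments need headroom on top of these figures).
PARAMS = 405e9
FP16_BYTES, FP8_BYTES = 2, 1

weights_fp16_gb = PARAMS * FP16_BYTES / 1e9   # 810 GB of weights in FP16
weights_fp8_gb = PARAMS * FP8_BYTES / 1e9     # 405 GB of weights in FP8

h100_node_gb = 8 * 80      # 640 GB: FP16 weights alone do not fit
mi300x_node_gb = 8 * 192   # 1536 GB: FP16 fits with room to spare
mi300x_half_gb = 4 * 192   # 768 GB: FP8 fits on just 4 GPUs

print(weights_fp16_gb > h100_node_gb)    # True -> H100 nodes need FP8
print(weights_fp16_gb < mi300x_node_gb)  # True
print(weights_fp8_gb < mi300x_half_gb)   # True
```

In practice the KV cache shifts these thresholds, but the weight totals alone already explain the FP8 requirement on the `H100`.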
With the [latest update](https://github.com/dstackai/dstack/releases/0.18.11rc1),
you can now specify an AMD GPU under `resources`. Below are a few examples.
## Configuration

=== "Service"
    Here's an example of a [service](../../docs/concepts/services.md) that deploys
    Llama 3.1 70B in FP16 using [TGI](https://huggingface.co/docs/text-generation-inference/en/installation_amd).
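A minimal sketch of such a service configuration (the image tag, model ID, launcher flags, and GPU count below are illustrative and may differ from the full example in the post):

```yaml
type: service
name: llama31-70b-tgi
image: ghcr.io/huggingface/text-generation-inference:latest-rocm
env:
  - HF_TOKEN                  # passed through from your environment
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
commands:
  - text-generation-launcher --port 8000
port: 8000
resources:
  gpu: MI300X:2               # AMD GPU specified directly under resources
```

The key point is the `resources` block: the AMD accelerator is requested by name, and `dstack` resolves it to a matching RunPod offer.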
1. AMD accelerators can also be used with other frameworks like vLLM, Ollama, etc., and we'll be adding more examples soon.
2. RunPod is the first cloud provider where `dstack` supports AMD. More cloud providers will be supported soon as well.
3. Want to give RunPod and `dstack` a try? Make sure you've signed up for [RunPod](https://www.runpod.io/),
then [set up](../../docs/reference/server/config.yml.md#runpod) the `dstack server`.

> Have questions or feedback? Join our [Discord](https://discord.gg/u8SmfwPpMd)
Want to see how it works? Check out the video below:

!!! info "What's next?"
    1. See [SSH fleets](../../docs/concepts/fleets.md#ssh-fleets)
    2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md)
**`docs/blog/posts/benchmark-amd-containers-and-partitions.md`** (+10 -10)
Our new benchmark explores two important areas for optimizing AI workloads on AMD…

<!-- more -->

This benchmark was supported by [Hot Aisle](https://hotaisle.xyz/),
a provider of AMD GPU bare-metal and VM infrastructure.

## Benchmark 1: Bare-metal vs containers
Our experiments consistently demonstrate that running multi-node AI workloads in…

## Benchmark 2: Partition performance isolated vs mesh

The AMD GPU can be [partitioned](https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/overview.html) into smaller, independent units (e.g., NPS4 mode splits one GPU into four partitions). This promises better memory bandwidth utilization. Does this theoretical gain translate to better performance in practice?
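Switching partition modes is done through AMD's management tooling; roughly (a sketch only: the exact subcommands and flags vary across `amd-smi`/ROCm releases, so consult the partitioning docs linked above for your version):

```bash
# Inspect the current compute/memory partition configuration
amd-smi partition

# Switch to CPX compute partitioning and NPS4 memory partitioning
# (requires root; flag spellings differ between ROCm releases)
sudo amd-smi set --gpu all --compute-partition CPX
sudo amd-smi set --gpu all --memory-partition NPS4
```

Workloads must be stopped before repartitioning, which is part of why the post argues partitioning is only practical when used dynamically.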
60
60
61
61
### Finding 1: Higher performance for isolated partitions
62
62
63
-
First, we sought to reproduce and extend findings from the [official ROCm blog :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html){:target="_blank"}. We benchmarked the memory bandwidth of a single partition (in CPX/NPS4 mode) against a full, unpartitioned GPU (in SPX/NPS1 mode).
63
+
First, we sought to reproduce and extend findings from the [official ROCm blog](https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html). We benchmarked the memory bandwidth of a single partition (in CPX/NPS4 mode) against a full, unpartitioned GPU (in SPX/NPS1 mode).
GPU partitioning is only practical if used dynamically—for instance, to run multiple…

#### Limitations

1. **Reproducibility**: AMD’s original blog post on partitioning lacked detailed setup information, so we had to reconstruct the benchmarks independently.
2. **Network tuning**: These benchmarks were run on a default, out-of-the-box network configuration. Our results for RCCL (~339 GB/s) and RDMA (~726 Gbps) are slightly below the peak figures [reported by Dell](https://infohub.delltechnologies.com/en-us/l/generative-ai-in-the-enterprise-with-amd-accelerators/rccl-and-perftest-for-cluster-validation-1/4/). This suggests that further performance could be unlocked with expert tuning of network topology, MTU size, and NCCL environment variables.

## Benchmark setup
The `SIZE` value is `1M`, `2M`, …, `8G`.
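This message-size sweep can be expressed with `rccl-tests` roughly as follows (a sketch; it assumes `rccl-tests` is built locally, and the `-b`/`-e`/`-f`/`-g` flags follow the nccl-tests conventions that rccl-tests mirrors):

```bash
# Sweep all-reduce bus bandwidth from 1M to 8G, doubling the message
# size each step (-f 2), across 8 GPUs in one process (-g 8).
./build/all_reduce_perf -b 1M -e 8G -f 2 -g 8
```

For multi-node runs the same binary is typically launched under `mpirun` with one rank per node.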
**vLLM data parallel**

1. Build nginx container (see [vLLM-nginx](https://docs.vllm.ai/en/stable/deployment/nginx.html#build-nginx-container)).
All source code and findings are available in [our GitHub repo](https://github.com/dstackai/benchmarks/tree/main/amd/baremetal_container_partition).

* [Deep dive into partition modes by AMD](https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html).
* [RCCL and PerfTest for cluster validation by Dell](https://infohub.delltechnologies.com/en-us/l/generative-ai-in-the-enterprise-with-amd-accelerators/rccl-and-perftest-for-cluster-validation-1/4/).
## What's next?

Benchmark the performance impact of VMs vs bare-metal for inference and training.

#### Hot Aisle

Big thanks to [Hot Aisle](https://hotaisle.xyz/) for providing the compute power behind these benchmarks.
If you’re looking for fast AMD GPU bare-metal or VM instances, they’re definitely worth checking out.