Commit 2c05b93

Merge remote-tracking branch 'origin/master' into issue_2181_probes

2 parents: acd006d + 83835ba

47 files changed: 1,206 additions, 352 deletions


.github/workflows/build.yml (12 additions, 0 deletions)

```diff
@@ -35,6 +35,18 @@ jobs:
       - run: uv tool install pre-commit
       - run: pre-commit run -a --show-diff-on-failure

+  frontend-lint:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: frontend
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install modules
+        run: npm install
+      - name: Run Eslint
+        run: npm run precommit
+
   frontend-build:
     runs-on: ubuntu-latest
     defaults:
```

.justfile (1 addition, 1 deletion)

```diff
@@ -14,4 +14,4 @@ set allow-duplicate-recipes

 import "runner/.justfile"

-# TODO: Add frontend/justfile for managing frontend development tasks
+import "frontend/.justfile"
```

docs/blog/posts/benchmark-amd-containers-and-partitions.md (2 additions, 2 deletions)

```diff
@@ -14,11 +14,11 @@ Our new benchmark explores two important areas for optimizing AI workloads on AM

 <img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-containers-and-partitions.png" width="630"/>

+<!-- more -->
+
 This benchmark was supported by [Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"},
 a provider of AMD GPU bare-metal and VM infrastructure.

-<!-- more -->
-
 ## Benchmark 1: Bare-metal vs containers

 ### Finding 1: No loss in interconnect bandwidth
```
New file (219 additions, 0 deletions):
---
title: "Benchmarking AMD GPUs: bare-metal, VMs"
date: 2025-07-22
description: "TBA"
slug: benchmark-amd-vms
image: https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms.png
categories:
  - Benchmarks
---

# Benchmarking AMD GPUs: bare-metal, VMs

This is the first in our series of benchmarks exploring the performance of AMD GPUs in virtualized versus bare-metal environments. As cloud infrastructure increasingly relies on virtualization, a key question arises: can VMs match bare-metal performance for GPU-intensive tasks? For this initial investigation, we focus on a single-GPU setup, comparing a containerized workload on a VM against a bare-metal server, both equipped with the AMD MI300X GPU.

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms.png" width="630"/>

<!-- more -->
Our findings reveal that for single-GPU LLM training and inference, both setups deliver comparable performance. The subtle differences we observed highlight how virtualization overhead can influence performance under specific conditions, but for most practical purposes, the performance is nearly identical.

This benchmark was supported by [Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"},
a provider of AMD GPU bare-metal and VM infrastructure.

## Benchmark 1: Inference

### Finding 1: Identical performance at moderate concurrency levels, slightly worse on the VM otherwise

**Throughput vs latency**

Comparing throughput (tokens/second) against end-to-end latency across multiple concurrency levels is an effective way to measure an LLM inference system's scalability and responsiveness. This benchmark reveals how VM and bare-metal environments handle varying loads and pinpoints their throughput saturation points.

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms-throuput-latency.png" width="750"/>

At moderate concurrency levels (16–64), both bare-metal and VM deliver near-identical inference performance. At lower levels (4–16), bare-metal shows slightly better throughput, likely due to faster kernel launches and direct device access. At high concurrency (64–128), bare-metal maintains a slight edge in latency and throughput. At a concurrency of 256, throughput saturates for both, suggesting a bottleneck from KV cache pressure on GPU memory.
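The KV-cache explanation above can be sanity-checked with rough arithmetic. The sketch below estimates aggregate KV-cache memory for a Llama-3.3-70B-class model; the architecture numbers (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache) are illustrative assumptions, not values taken from the benchmark.

```python
# Rough KV-cache sizing for a Llama-3.3-70B-class model.
# All architecture numbers below are assumptions for illustration.
LAYERS = 80      # assumed transformer layers
KV_HEADS = 8     # assumed grouped-query KV heads
HEAD_DIM = 128   # assumed head dimension
BYTES = 2        # fp16 cache

def kv_bytes_per_token() -> int:
    # One K and one V vector per layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def kv_gib(concurrency: int, seq_len: int) -> float:
    return concurrency * seq_len * kv_bytes_per_token() / 2**30

for c in (64, 128, 256):
    # 1024 input + 1024 output tokens per request, as in the benchmark
    print(f"concurrency {c:>3}: ~{kv_gib(c, 2048):.0f} GiB of KV cache")
```

Under these assumptions, 256 concurrent 2,048-token requests would want on the order of 160 GiB of KV cache on top of the model weights, which is consistent with the saturation observed at that concurrency.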
## Benchmark 2: Training

### Finding 1: Identical performance at large batches with only minor variations

For training, we compare throughput (samples/second) and total runtime across increasing batch sizes. These metrics are crucial for evaluating cost and training efficiency.

**Throughput**

Bare metal performs slightly better at small batch sizes, but the VM consistently shows slightly better throughput and runtime at larger batch sizes (≥8).

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms-throuput.png" width="750"/>

This may be because larger batches are compute-bound, making CPU-GPU synchronization less frequent.

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms-runtime.png" width="750"/>

One plausible explanation for the VM's slight advantage here is that in the bare-metal setup, using only one of eight available GPUs may lead to minor interference from shared background services.

### Finding 2: Identical convergence, GPU utilization, memory consumption

Training/eval loss, GPU utilization, and VRAM usage are key indicators of training stability and system efficiency. Loss shows model convergence, while utilization and memory reflect hardware efficiency.

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms-vm.png" width="750"/>

Both VM and bare-metal setups exhibited nearly identical training and evaluation loss curves, indicating consistent model convergence. GPU utilization remained high (~95–100%) and stable in both environments, with similar VRAM consumption.

<img src="https://dstack.ai/static-assets/static-assets/images/benchmark-amd-vms-bare-metal.png" width="750"/>

This demonstrates that from a model training and hardware utilization perspective, both setups are equally efficient.
## Limitations

**Multi-GPU**

This initial benchmark deliberately focused on a single-GPU setup to establish a baseline. A more production-representative evaluation would compare multi-GPU VMs with multi-GPU bare-metal systems. In multi-GPU inference, bare-metal’s direct hardware access could offer an advantage. For distributed training, however, where all GPUs are fully engaged, the performance between VM and bare-metal would likely be even closer.

Furthermore, it's important to note that the performance gap in virtualized setups can potentially be narrowed significantly with expert hypervisor tuning, such as CPU pinning and NUMA node alignment.

**Multi-node**

For distributed training, models are trained across multi-node clusters where control-plane operations rely on the CPU. This can impact interconnect bandwidth and overall performance. A future comparison is critical, as performance will heavily depend on the network virtualization technology used.

For instance, testing setups that use SR-IOV (Single Root I/O Virtualization)—a technology designed to provide near-native network performance to VMs—would be essential for a complete picture.

## Conclusion

Our initial benchmark shows that performance differences between a VM and bare-metal are minimal. Both environments exhibit near-identical behavior aside from a few subtle variations. These findings suggest that VMs are a highly viable option for demanding GPU tasks, with only minor trade-offs under specific conditions, and that AMD GPUs deliver exceptional performance in both virtualized and bare-metal environments.
## Benchmark setup

### Hardware configuration

**VM**

* CPU: Intel Xeon Platinum 8470: 13c @ 2 GHz
* RAM: 224 GiB
* NVMe: 13 TB
* GPUs: 1x AMD MI300X

**Bare-metal**

* CPU: Intel Xeon Platinum 8470: 13c @ 2 GHz (`--cpuset-cpus="0-12"`)
* RAM: 224 GiB (`--memory="224g"`)
* GPUs: 1x AMD MI300X

### Benchmark methodology

The steps to run benchmarks are identical for both setups, except that the `docker run` command for bare metal includes `--cpuset-cpus="0-12"` and `--memory="224g"` to match the VM's resources.
#### Inference

1. Run a `rocm/vllm` container:

    ```shell
    docker run -it \
      --network=host \
      --group-add=video \
      --ipc=host \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      --device /dev/kfd \
      --device /dev/dri \
      rocm/vllm:latest /bin/bash
    ```

2. Start the vLLM server:

    ```shell
    vllm serve meta-llama/Llama-3.3-70B-Instruct --max-model-len 100000
    ```

3. Start the benchmark:
    ```shell
    isl=1024
    osl=1024
    MaxConcurrency="4 8 16 32 64 128 256"
    RESULT_DIR="./results_concurrency_sweep"
    mkdir -p $RESULT_DIR

    for concurrency in $MaxConcurrency; do
        TIMESTAMP=$(date +%Y%m%d-%H%M%S)
        FILENAME="llama3.3-70B-random-${concurrency}concurrency-${TIMESTAMP}.json"

        python3 /app/vllm/benchmarks/benchmark_serving.py \
            --model meta-llama/Llama-3.3-70B-Instruct \
            --dataset-name random \
            --random-input-len $isl \
            --random-output-len $osl \
            --num-prompts $((10 * $concurrency)) \
            --max-concurrency $concurrency \
            --ignore-eos \
            --percentile-metrics ttft,tpot,e2el \
            --save-result \
            --result-dir "$RESULT_DIR" \
            --result-filename "$FILENAME"
    done
    ```
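Once the sweep finishes, the per-concurrency JSON files can be reduced to a small summary table. A minimal sketch — the `output_throughput` and `mean_e2el_ms` field names are assumptions about the result schema `benchmark_serving.py` writes, so check one generated file first:

```python
import json
from pathlib import Path

def summarize(result_dir: str) -> list[dict]:
    """Collect one throughput/latency row per result file in the sweep."""
    rows = []
    for path in sorted(Path(result_dir).glob("*.json")):
        data = json.loads(path.read_text())
        rows.append({
            "file": path.name,
            # Field names assumed from the benchmark's JSON output:
            "tokens_per_s": data.get("output_throughput"),
            "mean_e2el_ms": data.get("mean_e2el_ms"),
        })
    return rows
```

Running `summarize("./results_concurrency_sweep")` after the sweep yields one row per concurrency level, ready for plotting throughput against latency.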
#### Training

1. Run the `rocm/dev-ubuntu-22.04:6.4-complete` container:

    ```shell
    docker run -it \
      --network=host \
      --group-add=video \
      --ipc=host \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      --device /dev/kfd \
      --device /dev/dri \
      rocm/dev-ubuntu-22.04:6.4-complete /bin/bash
    ```

2. Install TRL:

    ```shell
    sudo apt-get update && sudo apt-get install -y git cmake && \
    pip install torch --index-url https://download.pytorch.org/whl/nightly/rocm6.4 && \
    pip install transformers peft wandb && \
    git clone https://github.com/huggingface/trl && \
    cd trl && \
    pip install .
    ```
3. Run the benchmark:

    ```shell
    python3 trl/scripts/sft.py \
        --model_name_or_path Qwen/Qwen2-0.5B \
        --dataset_name trl-lib/Capybara \
        --learning_rate 2.0e-4 \
        --num_train_epochs 1 \
        --packing \
        --per_device_train_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --gradient_checkpointing \
        --eos_token '<|im_end|>' \
        --eval_strategy steps \
        --eval_steps 100 \
        --use_peft \
        --lora_r 32 \
        --lora_alpha 16
    ```
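When reading the training throughput numbers, note that gradient accumulation changes the effective batch size: each optimizer step processes the per-device batch size times the accumulation steps. A sketch of the arithmetic for the flags above, with a single GPU as in this benchmark:

```python
# Effective batch size implied by the sft.py flags above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # single-GPU benchmark

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 16 packed sequences per optimizer step
```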
## Source code

All source code and findings are available in our [GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/amd/single_gpu_vm_vs_bare-metal){:target="_blank"}.

## References

* [vLLM V1 Meets AMD Instinct GPUs: A New Era for LLM Inference Performance :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/software-tools-optimization/vllmv1-rocm-llm/README.html){:target="_blank"}

## What's next?

Our next steps are to benchmark VM vs. bare-metal performance in multi-GPU and multi-node setups, covering tensor-parallel inference and distributed training scenarios.

## Acknowledgments

#### Hot Aisle

Big thanks to [Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"} for providing the compute power behind these benchmarks.
If you’re looking for fast AMD GPU bare-metal or VM instances, they’re definitely worth checking out.

docs/docs/concepts/dev-environments.md (1 addition, 1 deletion)

```diff
@@ -491,7 +491,7 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli
 ??? info "Cron syntax"
     `dstack` supports [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07). One exception is that days of the week are started from Monday instead of Sunday so `0` corresponds to Monday.

-    The month and day of week fields accept abbreviated English month and weekday names (`jan–de`c and `mon–sun`) respectively.
+    The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively.

     A cron expression consists of five fields:
```
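The Monday-first convention this hunk documents is easy to get wrong when coming from standard cron, where `0` is Sunday. A small illustrative sketch of the mapping the docs describe — not dstack's actual parser:

```python
# Day-of-week mapping as described in the docs: 0 = Monday ... 6 = Sunday,
# with abbreviated English names also accepted.
DAY_NAMES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def parse_dow(field: str) -> int:
    """Resolve a single day-of-week token to dstack's 0=Monday numbering."""
    token = field.strip().lower()
    if token in DAY_NAMES:
        return DAY_NAMES.index(token)
    value = int(token)
    if not 0 <= value <= 6:
        raise ValueError(f"day of week out of range: {field}")
    return value

print(parse_dow("mon"))  # 0
print(parse_dow("sun"))  # 6
```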

docs/docs/concepts/services.md (1 addition, 1 deletion)

```diff
@@ -703,7 +703,7 @@ The `schedule` property can be combined with `max_duration` or `utilization_poli
 ??? info "Cron syntax"
     `dstack` supports [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07). One exception is that days of the week are started from Monday instead of Sunday so `0` corresponds to Monday.

-    The month and day of week fields accept abbreviated English month and weekday names (`jan–de`c and `mon–sun`) respectively.
+    The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively.

     A cron expression consists of five fields:
```

docs/docs/concepts/tasks.md (1 addition, 1 deletion)

```diff
@@ -678,7 +678,7 @@ schedule:
 ??? info "Cron syntax"
     `dstack` supports [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07). One exception is that days of the week are started from Monday instead of Sunday so `0` corresponds to Monday.

-    The month and day of week fields accept abbreviated English month and weekday names (`jan–de`c and `mon–sun`) respectively.
+    The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively.

     A cron expression consists of five fields:
```

frontend/.justfile (new file, 20 additions)

```
# Justfile for building frontend
#
# Run `just` to see all available commands

default:
    @just --list

[private]
install-frontend:
    #!/usr/bin/env bash
    set -e
    cd {{source_directory()}}
    npm install

build-frontend:
    #!/usr/bin/env bash
    set -e
    cd {{source_directory()}}
    npm run build
    cp -r build/ ../src/dstack/_internal/server/statics/
```

frontend/src/api.ts (7 additions, 0 deletions)

```diff
@@ -98,6 +98,13 @@ export const API = {
         // METRICS
         JOB_METRICS: (projectName: IProject['project_name'], runName: IRun['run_spec']['run_name']) =>
             `${API.BASE()}/project/${projectName}/metrics/job/${runName}`,
+
+        // SECRETS
+        SECRETS_LIST: (projectName: IProject['project_name']) => `${API.BASE()}/project/${projectName}/secrets/list`,
+        SECRET_GET: (projectName: IProject['project_name']) => `${API.BASE()}/project/${projectName}/secrets/get`,
+        SECRETS_UPDATE: (projectName: IProject['project_name']) =>
+            `${API.BASE()}/project/${projectName}/secrets/create_or_update`,
+        SECRETS_DELETE: (projectName: IProject['project_name']) => `${API.BASE()}/project/${projectName}/secrets/delete`,
     },

     BACKENDS: {
```
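For anyone calling these endpoints outside the frontend, the URL shapes added above can be mirrored in a small helper. This is a hypothetical sketch, not part of the commit; the `/api` base path is an assumption — in the frontend the base comes from `API.BASE()`:

```python
# Hypothetical Python mirror of the frontend's secrets endpoint map.
BASE = "/api"  # assumed base path; the frontend derives this from API.BASE()

def secrets_list(project: str) -> str:
    return f"{BASE}/project/{project}/secrets/list"

def secret_get(project: str) -> str:
    return f"{BASE}/project/{project}/secrets/get"

def secrets_update(project: str) -> str:
    return f"{BASE}/project/{project}/secrets/create_or_update"

def secrets_delete(project: str) -> str:
    return f"{BASE}/project/{project}/secrets/delete"

print(secrets_list("main"))  # /api/project/main/secrets/list
```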

frontend/src/locale/en.json (18 additions, 0 deletions)

```diff
@@ -297,6 +297,24 @@
             "name": "User name",
             "role": "Project role"
         },
+        "secrets": {
+            "section_title": "Secrets",
+            "empty_message_title": "No secrets",
+            "empty_message_text": "No secrets to display.",
+            "name": "Secret name",
+            "value": "Secret value",
+            "create_secret": "Create secret",
+            "update_secret": "Update secret",
+            "delete_confirm_title": "Delete secret",
+            "delete_confirm_message": "Are you sure you want to delete the {{name}} secret?",
+            "multiple_delete_confirm_title": "Delete secrets",
+            "multiple_delete_confirm_message": "Are you sure you want to delete {{count}} secrets?",
+            "not_permissions_title": "No permissions",
+            "not_permissions_description": "You don't have permissions for managing secrets",
+            "validation": {
+                "secret_name_format": "Invalid secret name"
+            }
+        },
         "error_notification": "Update project error",
         "validation": {
             "user_name_format": "Only letters, numbers, - or _"
```
