Commit 6b563aa

Merge branch 'master' into issue_2650_multinode_on_shared_instances
2 parents 3158183 + fd2adf6 commit 6b563aa

15 files changed

Lines changed: 258 additions & 80 deletions

docs/assets/stylesheets/extra.css

Lines changed: 4 additions & 0 deletions
@@ -29,6 +29,10 @@
     margin-left: 32px;
 }
 
+[dir=ltr] .md-typeset :is(.md-typeset__scrollwrap,p,h4,h3,.tabbed-set,ul):not(.admonition-title) :is(.admonition,details, .termy) {
+    margin-left: 0;
+}
+
 @media screen and (max-width: 76.1875em) {
     .md-header {
         background-color: rgb(255, 255, 255);

docs/docs/concepts/fleets.md

Lines changed: 14 additions & 10 deletions
@@ -263,26 +263,30 @@ Define a fleet configuration as a YAML file in your project directory. The file
 </div>
 
 ??? info "Requirements"
-    1.&nbsp;Hosts should be pre-installed with Docker.
+    1.&nbsp;Hosts must be pre-installed with Docker.
 
     === "NVIDIA"
-        2.&nbsp;Hosts with NVIDIA GPUs should also be pre-installed with CUDA 12.1 and
+        2.&nbsp;Hosts with NVIDIA GPUs must also be pre-installed with CUDA 12.1 and
         [NVIDIA Container Toolkit :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
 
     === "AMD"
-        2.&nbsp;Hosts with AMD GPUs should also be pre-installed with AMDGPU-DKMS kernel driver (e.g. via
+        2.&nbsp;Hosts with AMD GPUs must also be pre-installed with AMDGPU-DKMS kernel driver (e.g. via
         [native package manager :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/native-install/index.html)
         or [AMDGPU installer :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html).)
 
     === "Intel Gaudi"
-        2.&nbsp;Hosts with Intel Gaudi accelerators should be pre-installed with [Gaudi software and drivers](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation).
-        This should include the drivers, `hl-smi`, and Habana Container Runtime.
+        2.&nbsp;Hosts with Intel Gaudi accelerators must be pre-installed with [Gaudi software and drivers](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation).
+        This must include the drivers, `hl-smi`, and Habana Container Runtime.
 
     === "Tenstorrent"
-        2.&nbsp;Hosts with Tenstorrent accelerators should be pre-installed with [Tenstorrent software](https://docs.tenstorrent.com/getting-started/README.html#software-installation).
-        This should include the drivers, `tt-smi`, and HugePages.
+        2.&nbsp;Hosts with Tenstorrent accelerators must be pre-installed with [Tenstorrent software](https://docs.tenstorrent.com/getting-started/README.html#software-installation).
+        This must include the drivers, `tt-smi`, and HugePages.
 
-    3.&nbsp;The user specified should have passwordless `sudo` access.
+    3.&nbsp;The user specified must have passwordless `sudo` access.
+
+    4.&nbsp;The SSH server must be running and configured with `AllowTcpForwarding yes` in `/etc/ssh/sshd_config`.
+
+    5.&nbsp;The firewall must allow SSH and should forbid any other connections from external networks. For `placement: cluster` fleets, it should also allow any communication between fleet nodes.
 
 To create or update the fleet, pass the fleet configuration to [`dstack apply`](../reference/cli/dstack/apply.md):
 
@@ -341,10 +345,10 @@ divided into, allowing multiple jobs to use these blocks concurrently.
   hosts:
     - hostname: 3.255.177.51
       blocks: 4
-    - hostaname: 3.255.177.52
+    - hostname: 3.255.177.52
       # As many as possible, according to numbers of GPUs and CPUs
       blocks: auto
-    - hostaname: 3.255.177.53
+    - hostname: 3.255.177.53
       # Do not slice. This is the default value, may be omitted
       blocks: 1
 ```

docs/docs/concepts/tasks.md

Lines changed: 28 additions & 0 deletions
@@ -426,6 +426,34 @@ retry:
 If one job of a multi-node task fails with retry enabled,
 `dstack` will stop all the jobs and resubmit the run.
 
+### Priority
+
+By default, submitted runs are scheduled in the order they were submitted.
+When compute resources are limited, you may want to prioritize some runs over others.
+This can be done by specifying the [`priority`](../reference/dstack.yml/task.md) property in the run configuration:
+
+<div editor-title=".dstack.yml">
+
+```yaml
+type: task
+name: train
+
+python: "3.10"
+
+# Commands of the task
+commands:
+  - pip install -r fine-tuning/qlora/requirements.txt
+  - python fine-tuning/qlora/train.py
+
+priority: 50
+```
+
+</div>
+
+`dstack` tries to provision runs with higher priority first.
+Note that if a high priority run cannot be scheduled,
+it does not block other runs with lower priority from scheduling.
+
 --8<-- "docs/concepts/snippets/manage-fleets.ext"
 
 --8<-- "docs/concepts/snippets/manage-runs.ext"
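
Note on the `### Priority` section added above: the documented behavior (higher priority first, submission order among equals, and no blocking when a high-priority run cannot be placed) can be illustrated with a small, self-contained sketch. This is not how `dstack` implements scheduling; the `Run` class, the `gpus` demand, and the `schedule` function are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    name: str
    priority: int      # 0..100; higher is provisioned first
    submitted_at: int  # hypothetical submission tick, used for FIFO ordering
    gpus: int          # hypothetical resource demand


def schedule(runs: List[Run], free_gpus: int) -> List[str]:
    """Return the names of the runs that get capacity, in provisioning order."""
    started = []
    # Highest priority first; ties fall back to submission (FIFO) order.
    for run in sorted(runs, key=lambda r: (-r.priority, r.submitted_at)):
        if run.gpus <= free_gpus:
            free_gpus -= run.gpus
            started.append(run.name)
        # A run that does not fit is skipped rather than waited on,
        # so lower-priority runs behind it can still start.
    return started


runs = [
    Run("eval", priority=0, submitted_at=1, gpus=1),
    Run("train-big", priority=50, submitted_at=2, gpus=8),
    Run("train-small", priority=50, submitted_at=3, gpus=2),
]
print(schedule(runs, free_gpus=4))  # ['train-small', 'eval']
```

With only 4 GPUs free, `train-big` cannot be placed, yet the lower-priority `eval` run still starts, which matches the "does not block" sentence in the added docs.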

docs/docs/installation/index.md

Lines changed: 55 additions & 55 deletions
@@ -139,71 +139,71 @@ Configuration is updated at ~/.dstack/config.yml
 
 This configuration is stored in `~/.dstack/config.yml`.
 
-### (Optional) Enable shell autocompletion
+??? info "Shell autocompletion"
 
-`dstack` supports shell autocompletion for `bash` and `zsh`.
+    `dstack` supports shell autocompletion for `bash` and `zsh`.
 
-=== "bash"
+    === "bash"
 
-    First, validate if completion scripts load correctly in your current shell session:
-
-    <div class="termy">
-
-    ```shell
-    $ eval "$(dstack completion bash)"
-    ```
+        First, validate if completion scripts load correctly in your current shell session:
+
+        <div class="termy">
+
+        ```shell
+        $ eval "$(dstack completion bash)"
+        ```
 
-    </div>
-
-    If completions work as expected and you would like them to persist across shell sessions, add the completion script to your shell profile using these commands:
-
-    <div class="termy">
-
-    ```shell
-    $ mkdir -p ~/.dstack
-    $ dstack completion bash > ~/.dstack/completion.sh
-    $ echo 'source ~/.dstack/completion.sh' >> ~/.bashrc
-    ```
-
-    </div>
+        </div>
+
+        If completions work as expected and you would like them to persist across shell sessions, add the completion script to your shell profile using these commands:
+
+        <div class="termy">
+
+        ```shell
+        $ mkdir -p ~/.dstack
+        $ dstack completion bash > ~/.dstack/completion.sh
+        $ echo 'source ~/.dstack/completion.sh' >> ~/.bashrc
+        ```
+
+        </div>
 
-=== "zsh"
-
-    First, validate if completion scripts load correctly in your current shell session:
-
-    <div class="termy">
-
-    ```shell
-    $ eval "$(dstack completion zsh)"
-    ```
+    === "zsh"
+
+        First, validate if completion scripts load correctly in your current shell session:
+
+        <div class="termy">
+
+        ```shell
+        $ eval "$(dstack completion zsh)"
+        ```
 
-    </div>
-
-    If completions work as expected and you would like them to persist across shell sessions, you can install them via Oh My Zsh using these commands:
-
-    <div class="termy">
-
-    ```shell
-    $ mkdir -p ~/.oh-my-zsh/completions
-    $ dstack completion zsh > ~/.oh-my-zsh/completions/_dstack
-    ```
+        </div>
 
-    </div>
+        If completions work as expected and you would like them to persist across shell sessions, you can install them via Oh My Zsh using these commands:
+
+        <div class="termy">
+
+        ```shell
+        $ mkdir -p ~/.oh-my-zsh/completions
+        $ dstack completion zsh > ~/.oh-my-zsh/completions/_dstack
+        ```
+
+        </div>
 
-    And if you don't use Oh My Zsh:
+        And if you don't use Oh My Zsh:
 
-    <div class="termy">
-
-    ```shell
-    $ mkdir -p ~/.dstack
-    $ dstack completion zsh > ~/.dstack/completion.sh
-    $ echo 'source ~/.dstack/completion.sh' >> ~/.zshrc
-    ```
-
-    </div>
+        <div class="termy">
+
+        ```shell
+        $ mkdir -p ~/.dstack
+        $ dstack completion zsh > ~/.dstack/completion.sh
+        $ echo 'source ~/.dstack/completion.sh' >> ~/.zshrc
+        ```
+
+        </div>
 
-    > If you get an error similar to `2: command not found: compdef`, then add the following line to the beginning of your `~/.zshrc` file:
-    > `autoload -Uz compinit && compinit`.
+        > If you get an error similar to `2: command not found: compdef`, then add the following line to the beginning of your `~/.zshrc` file:
+        > `autoload -Uz compinit && compinit`.
 
 
 !!! info "What's next?"

docs/overrides/main.html

Lines changed: 2 additions & 2 deletions
@@ -122,7 +122,7 @@
     <a href="/examples#inference" class="tx-footer__section-link">Inference</a>
     <a href="/examples#accelerators" class="tx-footer__section-link">Accelerators</a>
     <a href="/examples#llms" class="tx-footer__section-link">LLMs</a>
-    <a href="/examples#misc" class="tx-footer__section-link">Misc</a>
+    <!-- <a href="/examples#misc" class="tx-footer__section-link">Misc</a> -->
 </div>
 
 <div class="tx-footer__section">
@@ -136,7 +136,7 @@
 <div class="tx-footer__section">
     <div class="tx-footer__section-title">Company</div>
     <a href="https://calendly.com/dstackai/discovery-call" target="_blank" class="tx-footer__section-link external">Talk to an expert</a>
-    <a href="/partners" class="tx-footer__section-link">Partners</a>
+    <!-- <a href="/partners" class="tx-footer__section-link">Partners</a> -->
     <a href="/terms" class="tx-footer__section-link">Terms</a>
     <a href="/privacy" class="tx-footer__section-link">Privacy</a>
     <!--<a href="https://dstackai.notion.site/Jobs-at-dstack-ff01a450c3174904b70c36124f2d8f4b" target="_blank" class="tx-footer__section-link external">Careers</a>-->

pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -166,7 +166,11 @@ lambda = [
     "dstack[server]",
 ]
 oci = [
-    "oci",
+    "oci>=2.150.0",
+    "cryptography>=44.0.3",
+    # pyopenssl is indirect to avoid uv falling back to the old version
+    # due to an upper limit from oci
+    "pyopenssl>=23.2.0",
     "dstack[server]",
 ]
 nebius = [

src/dstack/_internal/core/models/configurations.py

Lines changed: 17 additions & 2 deletions
@@ -23,6 +23,9 @@
 MAX_INT64 = 2**63 - 1
 SERVICE_HTTPS_DEFAULT = True
 STRIP_PREFIX_DEFAULT = True
+RUN_PRIOTIRY_MIN = 0
+RUN_PRIOTIRY_MAX = 100
+RUN_PRIORITY_DEFAULT = 0
 
 
 class RunConfigurationType(str, Enum):
@@ -221,14 +224,26 @@ class BaseRunConfiguration(CoreModel):
             )
         ),
     ] = None
-    # deprecated since 0.18.31; task, service -- no effect; dev-environment -- executed right before `init`
-    setup: CommandsList = []
     resources: Annotated[
         ResourcesSpec, Field(description="The resources requirements to run the configuration")
     ] = ResourcesSpec()
+    priority: Annotated[
+        Optional[int],
+        Field(
+            ge=RUN_PRIOTIRY_MIN,
+            le=RUN_PRIOTIRY_MAX,
+            description=(
+                f"The priority of the run, an integer between `{RUN_PRIOTIRY_MIN}` and `{RUN_PRIOTIRY_MAX}`."
+                " `dstack` tries to provision runs with higher priority first."
+                f" Defaults to `{RUN_PRIORITY_DEFAULT}`"
+            ),
+        ),
+    ] = None
     volumes: Annotated[
         List[Union[MountPoint, str]], Field(description="The volumes mount points")
     ] = []
+    # deprecated since 0.18.31; task, service -- no effect; dev-environment -- executed right before `init`
+    setup: CommandsList = []
 
     @validator("python", pre=True, always=True)
     def convert_python(cls, v, values) -> Optional[PythonVersion]:
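
Note on the `priority` field above: the field is declared as `Optional[int]` with `None` as its value when unset, while the description promises a default of `RUN_PRIORITY_DEFAULT`. Where `None` is turned into that default is not visible in this excerpt. A rough, dependency-free sketch of the intended semantics might look like the following; `resolve_priority` is a hypothetical helper, and the range check stands in for what pydantic's `Field(ge=..., le=...)` enforces in the real model.

```python
from typing import Optional

# Constant names are copied from the diff, including the original spelling.
RUN_PRIOTIRY_MIN = 0
RUN_PRIOTIRY_MAX = 100
RUN_PRIORITY_DEFAULT = 0


def resolve_priority(priority: Optional[int]) -> int:
    """Hypothetical helper: validate a configured priority and apply the default."""
    if priority is None:
        # An unset priority falls back to the documented default.
        return RUN_PRIORITY_DEFAULT
    if not RUN_PRIOTIRY_MIN <= priority <= RUN_PRIOTIRY_MAX:
        raise ValueError(
            f"priority must be between {RUN_PRIOTIRY_MIN} and {RUN_PRIOTIRY_MAX}, got {priority}"
        )
    return priority


print(resolve_priority(None))  # 0
print(resolve_priority(50))    # 50
```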

src/dstack/_internal/server/background/tasks/process_submitted_jobs.py

Lines changed: 10 additions & 1 deletion
@@ -93,11 +93,20 @@ async def _process_next_submitted_job():
         async with lock:
             res = await session.execute(
                 select(JobModel)
+                .join(JobModel.run)
                 .where(
                     JobModel.status == JobStatus.SUBMITTED,
                     JobModel.id.not_in(lockset),
                 )
-                .order_by(JobModel.last_processed_at.asc())
+                # Jobs are processed in FIFO order, sorted by priority globally,
+                # thus runs from different projects can "overtake" each other by using higher priorities.
+                # That's not a big problem as long as projects do not compete for the same compute resources.
+                # Jobs with lower priorities from other projects will be processed without major lag
+                # as long as new higher priority runs are not constantly submitted.
+                # TODO: Consider processing jobs from different projects fairly/round-robin.
+                # Fully fair processing can be tricky to implement via the current DB queue as
+                # there can be many projects and we are limited by the max DB connections.
+                .order_by(RunModel.priority.desc(), JobModel.last_processed_at.asc())
                 .limit(1)
                 .with_for_update(skip_locked=True)
             )
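
Note on the new ordering: the query now picks the next submitted job by run priority first and by `last_processed_at` second. The effect can be reproduced with a minimal `sqlite3` sketch. The table layout below is a simplified assumption (the real `JobModel`/`RunModel` have many more columns), `next_submitted_job` is a hypothetical stand-in for the SQLAlchemy query, and row locking (`FOR UPDATE SKIP LOCKED`) is left out because SQLite does not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (id INTEGER PRIMARY KEY, priority INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        run_id INTEGER NOT NULL REFERENCES runs(id),
        status TEXT NOT NULL,
        last_processed_at INTEGER NOT NULL
    );
    INSERT INTO runs (id, priority) VALUES (1, 0), (2, 50), (3, 50);
    INSERT INTO jobs (id, run_id, status, last_processed_at) VALUES
        (10, 1, 'submitted', 100),
        (20, 2, 'submitted', 300),
        (30, 3, 'submitted', 200),
        (40, 2, 'running', 50);
    """
)


def next_submitted_job(conn, locked_ids=()):
    """Return the id of the next submitted job, skipping ids in locked_ids."""
    placeholders = ",".join("?" for _ in locked_ids)
    not_in = f"AND jobs.id NOT IN ({placeholders})" if locked_ids else ""
    row = conn.execute(
        f"""
        SELECT jobs.id
        FROM jobs JOIN runs ON runs.id = jobs.run_id
        WHERE jobs.status = 'submitted' {not_in}
        ORDER BY runs.priority DESC, jobs.last_processed_at ASC
        LIMIT 1
        """,
        tuple(locked_ids),
    ).fetchone()
    return row[0] if row else None


print(next_submitted_job(conn))                     # 30
print(next_submitted_job(conn, locked_ids=(30,)))   # 20
```

Job 30 wins over job 20 because both belong to priority-50 runs and job 30 was processed less recently; job 10 is picked only once both are excluded.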
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+"""Add RunModel.priority
+
+Revision ID: bca2fdf130bf
+Revises: 20166748b60c
+Create Date: 2025-05-14 15:24:21.269775
+
+"""
+
+import sqlalchemy as sa
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision = "bca2fdf130bf"
+down_revision = "20166748b60c"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    with op.batch_alter_table("runs", schema=None) as batch_op:
+        batch_op.add_column(sa.Column("priority", sa.Integer(), nullable=True))
+        batch_op.execute("UPDATE runs SET priority = 0")
+    with op.batch_alter_table("runs", schema=None) as batch_op:
+        batch_op.alter_column("priority", nullable=False)
+    # ### end Alembic commands ###
+
+
+def downgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    with op.batch_alter_table("runs", schema=None) as batch_op:
+        batch_op.drop_column("priority")
+
+    # ### end Alembic commands ###
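
Note on the migration: it follows a common three-step pattern for adding a required column to a populated table: add the column as nullable, backfill existing rows, then tighten the column to `NOT NULL`. The sketch below replays those steps with plain `sqlite3` rather than Alembic. The `run_name` column and the `runs_new` table name are made up for the illustration, and the copy-and-rename in step 3 is only an approximation of what Alembic's batch mode does on SQLite, which cannot change a column's nullability in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (id INTEGER PRIMARY KEY, run_name TEXT NOT NULL);
    INSERT INTO runs (run_name) VALUES ('a'), ('b');

    -- Step 1: add the column as nullable so existing rows remain valid.
    ALTER TABLE runs ADD COLUMN priority INTEGER;

    -- Step 2: backfill existing rows with the default priority.
    UPDATE runs SET priority = 0;

    -- Step 3: enforce NOT NULL by copying into a new table and renaming it.
    CREATE TABLE runs_new (
        id INTEGER PRIMARY KEY,
        run_name TEXT NOT NULL,
        priority INTEGER NOT NULL
    );
    INSERT INTO runs_new SELECT id, run_name, priority FROM runs;
    DROP TABLE runs;
    ALTER TABLE runs_new RENAME TO runs;
    """
)

rows = conn.execute("SELECT run_name, priority FROM runs ORDER BY id").fetchall()
print(rows)  # [('a', 0), ('b', 0)]

# After the migration, rows without a priority are rejected.
try:
    conn.execute("INSERT INTO runs (run_name, priority) VALUES ('c', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```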

src/dstack/_internal/server/models.py

Lines changed: 1 addition & 0 deletions
@@ -348,6 +348,7 @@ class RunModel(BaseModel):
     resubmission_attempt: Mapped[int] = mapped_column(Integer, default=0)
     run_spec: Mapped[str] = mapped_column(Text)
     service_spec: Mapped[Optional[str]] = mapped_column(Text)
+    priority: Mapped[int] = mapped_column(Integer, default=0)
 
     jobs: Mapped[List["JobModel"]] = relationship(
         back_populates="run", lazy="selectin", order_by="[JobModel.replica_num, JobModel.job_num]"
