Merged
Changes from 4 commits
8 changes: 8 additions & 0 deletions docs/docs/concepts/dev-environments.md
@@ -175,6 +175,8 @@ name: vscode
ide: vscode

resources:
# 16 or more x86_64 cores
cpu: 16..
# 200GB or more RAM
memory: 200GB..
# 4 GPUs from 40GB to 80GB
@@ -187,10 +189,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.
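
For example, requesting ARM cores looks like this (an illustrative fragment; the count is arbitrary):

```yaml
resources:
  # at least 8 ARM64 cores
  cpu: arm:8..
```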

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.
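
As an illustration, one of the quoted forms in a `resources` block (arbitrary values):

```yaml
resources:
  # two GPUs between 24GB and 40GB
  gpu: 24GB..40GB:2
```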

??? info "Google Cloud TPU"
To use TPUs, specify its architecture via the `gpu` property.
8 changes: 8 additions & 0 deletions docs/docs/concepts/services.md
@@ -316,6 +316,8 @@ commands:
port: 8000

resources:
# 16 or more x86_64 cores
cpu: 16..
# 2 GPUs of 80GB
gpu: 80GB:2

@@ -325,10 +327,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.
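
Put together, a `resources` block using both properties might look like this (a sketch with arbitrary values):

```yaml
resources:
  # 16 x86-64 cores
  cpu: x86:16
  # either an A10G or an A100
  gpu: A10G,A100
```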

??? info "Google Cloud TPU"
To use TPUs, specify its architecture via the `gpu` property.
8 changes: 8 additions & 0 deletions docs/docs/concepts/tasks.md
@@ -192,6 +192,8 @@ commands:
- python fine-tuning/qlora/train.py

resources:
# 16 or more x86_64 cores
cpu: 16..
# 200GB or more RAM
memory: 200GB..
# 4 GPUs from 40GB to 80GB
@@ -204,10 +206,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.
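
For instance (illustrative values only):

```yaml
resources:
  # at least 8 ARM64 cores
  cpu: arm:8..
  # one A100 of 80GB
  gpu: A100:80GB
```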

??? info "Google Cloud TPU"
To use TPUs, specify its architecture via the `gpu` property.
11 changes: 11 additions & 0 deletions docs/docs/reference/api/python/index.md
@@ -136,10 +136,21 @@ finally:
show_root_toc_entry: false
heading_level: 4
item_id_mapping:
cpu: dstack.api.CPU
gpu: dstack.api.GPU
memory: dstack.api.Memory
Range: dstack.api.Range

### `dstack.api.CPU` { #dstack.api.CPU data-toc-label="CPU" }

#SCHEMA# dstack.api.CPU
overrides:
show_root_heading: false
show_root_toc_entry: false
heading_level: 4
item_id_mapping:
Range: dstack.api.Range

### `dstack.api.GPU` { #dstack.api.GPU data-toc-label="GPU" }

#SCHEMA# dstack.api.GPU
8 changes: 8 additions & 0 deletions docs/docs/reference/dstack.yml/dev-environment.md
@@ -35,6 +35,14 @@ The `dev-environment` configuration type allows running [dev environments](../..
required: true
item_id_prefix: resources-

#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/fleet.md
@@ -46,15 +46,23 @@ The `fleet` configuration type allows creating and updating fleets.
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/service.md
@@ -129,15 +129,23 @@ The `service` configuration type allows running [services](../../concepts/servic
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/task.md
@@ -35,15 +35,23 @@ The `task` configuration type allows running [tasks](../../concepts/tasks.md).
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
7 changes: 5 additions & 2 deletions docs/docs/reference/environment-variables.md
@@ -117,8 +117,11 @@ For more details on the options below, refer to the [server deployment](../guide
* `DSTACK_SERVER_MAX_OFFERS_TRIED` - Sets how many instance offers to try when starting a job.
Setting a high value can degrade server performance.
* `DSTACK_RUNNER_VERSION` – Sets exact runner version for debug. Defaults to `latest`. Ignored if `DSTACK_RUNNER_DOWNLOAD_URL` is set.
* `DSTACK_RUNNER_DOWNLOAD_URL` – Overrides `dstack-runner` binary download URL.
* `DSTACK_SHIM_DOWNLOAD_URL` – Overrides `dstack-shim` binary download URL.
* `DSTACK_RUNNER_DOWNLOAD_URL` – Overrides `dstack-runner` binary download URL. The URL can contain `{version}` and/or `{arch}` placeholders,
where `{version}` is the `dstack` version in the `X.Y.Z` format or `latest`, and `{arch}` is either `amd64` or `arm64`, for example,
`https://dstack.example.com/{arch}/{version}/dstack-runner`.
* `DSTACK_SHIM_DOWNLOAD_URL` – Overrides `dstack-shim` binary download URL. The URL can contain `{version}` and/or `{arch}` placeholders;
see `DSTACK_RUNNER_DOWNLOAD_URL` for the details.

> **Contributor:** We also need to update
>
> 1. runner/README.md
> 2. runner/.just (currently it only builds/uploads one arch)
>
> **Collaborator (author):** I'd rather open another PR for justfile — I have some ideas for improvements.
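
The placeholder expansion described above can be sketched in a few lines (this illustrates the documented template syntax only, not dstack's actual server-side implementation):

```python
# Expand the documented {version} and {arch} placeholders in a
# DSTACK_RUNNER_DOWNLOAD_URL-style template. Illustrative only;
# dstack's real substitution logic may differ.
def expand_download_url(template: str, version: str, arch: str) -> str:
    return template.replace("{version}", version).replace("{arch}", arch)

print(expand_download_url(
    "https://dstack.example.com/{arch}/{version}/dstack-runner",
    version="latest",
    arch="arm64",
))
# → https://dstack.example.com/arm64/latest/dstack-runner
```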
* `DSTACK_DEFAULT_CREDS_DISABLED` – Disables default credentials detection if set. Defaults to `None`.
* `DSTACK_LOCAL_BACKEND_ENABLED` – Enables local backend for debug if set. Defaults to `None`.

4 changes: 2 additions & 2 deletions src/dstack/_internal/cli/services/args.py
@@ -19,8 +19,8 @@ def port_mapping(v: str) -> PortMapping:
return PortMapping.parse(v)


def cpu_spec(v: str) -> resources.Range[int]:
return parse_obj_as(resources.Range[int], v)
def cpu_spec(v: str) -> dict:
return resources.CPUSpec.parse(v)


def memory_spec(v: str) -> resources.Range[resources.Memory]:
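The documented `[ARCH:]COUNT` CPU spec format accepted by `--cpu` can be sketched with a simplified parser (this is not dstack's actual `CPUSpec.parse`, just an illustration of the accepted strings):

```python
from typing import Optional, Tuple

# Simplified parse of the documented CPU spec strings such as
# "16", "x86:16", or "arm:8..". Returns (arch, min_count, max_count);
# dstack's real CPUSpec.parse may behave differently in edge cases.
def parse_cpu_spec(v: str) -> Tuple[Optional[str], Optional[int], Optional[int]]:
    arch: Optional[str] = None
    if ":" in v:
        arch, v = v.split(":", 1)
    if ".." in v:
        lo, hi = v.split("..", 1)
        return arch, int(lo) if lo else None, int(hi) if hi else None
    count = int(v)
    return arch, count, count

print(parse_cpu_spec("arm:8.."))  # → ('arm', 8, None)
```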
40 changes: 38 additions & 2 deletions src/dstack/_internal/cli/services/configurators/run.py
@@ -6,9 +6,10 @@
from typing import Dict, List, Optional, Set, Tuple

import gpuhunt
from pydantic import parse_obj_as

import dstack._internal.core.models.resources as resources
from dstack._internal.cli.services.args import disk_spec, gpu_spec, port_mapping
from dstack._internal.cli.services.args import cpu_spec, disk_spec, gpu_spec, port_mapping
from dstack._internal.cli.services.configurators.base import (
ApplyEnvVarsConfiguratorMixin,
BaseApplyConfigurator,
@@ -39,6 +40,7 @@
TaskConfiguration,
)
from dstack._internal.core.models.repos.base import Repo
from dstack._internal.core.models.resources import CPUSpec
from dstack._internal.core.models.runs import JobSubmission, JobTerminationReason, RunStatus
from dstack._internal.core.services.configs import ConfigManager
from dstack._internal.core.services.diff import diff_models
@@ -72,6 +74,7 @@ def apply_configuration(
):
self.apply_args(conf, configurator_args, unknown_args)
self.validate_gpu_vendor_and_image(conf)
self.validate_cpu_arch_and_image(conf)
if repo is None:
repo = self.api.repos.load(Path.cwd())
config_manager = ConfigManager()
@@ -289,6 +292,14 @@ def register_args(cls, parser: argparse.ArgumentParser, default_max_offers: int
default=default_max_offers,
)
cls.register_env_args(configuration_group)
configuration_group.add_argument(
"--cpu",
type=cpu_spec,
help="Request CPU for the run. "
"The format is [code]ARCH[/]:[code]COUNT[/] (all parts are optional)",
dest="cpu_spec",
metavar="SPEC",
)
configuration_group.add_argument(
"--gpu",
type=gpu_spec,
@@ -310,6 +321,8 @@ def apply_args(self, conf: BaseRunConfiguration, args: argparse.Namespace, unkno
apply_profile_args(args, conf)
if args.run_name:
conf.name = args.run_name
if args.cpu_spec:
conf.resources.cpu = resources.CPUSpec.parse_obj(args.cpu_spec)
if args.gpu_spec:
conf.resources.gpu = resources.GPUSpec.parse_obj(args.gpu_spec)
if args.disk_spec:
@@ -342,7 +355,7 @@ def interpolate_env(self, conf: BaseRunConfiguration):

def validate_gpu_vendor_and_image(self, conf: BaseRunConfiguration) -> None:
"""
Infers `resources.gpu.vendor` if not set, requires `image` if the vendor is AMD.
Infers and sets `resources.gpu.vendor` if not set, requires `image` if the vendor is AMD.
"""
gpu_spec = conf.resources.gpu
if gpu_spec is None:
@@ -400,6 +413,29 @@ def validate_gpu_vendor_and_image(self, conf: BaseRunConfiguration) -> None:
"`image` is required if `resources.gpu.vendor` is `tenstorrent`"
)

def validate_cpu_arch_and_image(self, conf: BaseRunConfiguration) -> None:
"""
Infers `resources.cpu.arch` if not set, requires `image` if the architecture is ARM.
"""
# TODO: Remove in 0.20. Use conf.resources.cpu directly
cpu_spec = parse_obj_as(CPUSpec, conf.resources.cpu)
arch = cpu_spec.arch
if arch is None:
gpu_spec = conf.resources.gpu
if (
gpu_spec is not None
and gpu_spec.vendor in [None, gpuhunt.AcceleratorVendor.NVIDIA]
and gpu_spec.name
and any(map(gpuhunt.is_nvidia_superchip, gpu_spec.name))
):
arch = gpuhunt.CPUArchitecture.ARM
else:
arch = gpuhunt.CPUArchitecture.X86
# NOTE: We don't set the inferred resources.cpu.arch for compatibility with older servers.
# Servers with ARM support set the arch using the same logic.
if arch == gpuhunt.CPUArchitecture.ARM and conf.image is None:
raise ConfigurationError("`image` is required if `resources.cpu.arch` is `arm`")


class RunWithPortsConfigurator(BaseRunConfigurator):
@classmethod