Services allow you to deploy models or web apps as secure and scalable endpoints.
First, define a service configuration as a YAML file in your project folder.
The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable).
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: 3.12

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
```

</div>

To run a service, pass the configuration to `dstack apply`:
```shell
$ HF_TOKEN=...
$ dstack apply -f .dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at:
  http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
  http://localhost:3000/proxy/models/main/
```

`dstack apply` automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes),
and runs the service.
If a gateway is not configured, the service's endpoint will be accessible at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming."
          }
        ]
      }'
```

If the service defines the `model` property, the model can be accessed via
the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
or via the dstack UI.

If authorization is not disabled, the service endpoint requires the `Authorization` header set to
`Bearer <dstack token>`.
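A model registered via `model` can also be queried with any OpenAI-compatible client. Below is a minimal sketch using the `openai` Python package; the server URL, project name (`main`), and token are placeholders to replace with your own:

```python
from openai import OpenAI

# Point the client at dstack's OpenAI-compatible endpoint.
# `main` is the project name; the URL and token are placeholders.
client = OpenAI(
    base_url="http://localhost:3000/proxy/models/main/",
    api_key="<dstack token>",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
)
print(response.choices[0].message.content)
```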
??? info "Gateway" Running services for development purposes doesn’t require setting up a gateway.
However, you'll need a gateway in the following cases:
* To use auto-scaling or rate limits
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
Note, if you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
a gateway is already pre-configured for you.
If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
`https://<run name>.<gateway domain>/`.
If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint
at `https://gateway.<gateway domain>/`.
By default, dstack runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling rules.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
commands:
  - uv pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

resources:
  gpu: 24GB

replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10
```

</div>

The `replicas` property can be a number or a range.

The `metric` property of `scaling` only supports the `rps` metric (requests per second). In this
case, dstack adjusts the number of replicas (scales up or down) automatically based on the load.

Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests.

The `scaling` property requires creating a gateway.
If the service is running a chat model with an OpenAI-compatible interface,
set the `model` property to make the model accessible via dstack's
global OpenAI-compatible endpoint, as well as via the dstack UI.
By default, the service enables authorization, meaning the service endpoint requires a dstack user token.
This can be disabled by setting `auth` to `false`.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

# Disable authorization
auth: false

python: 3.12
commands:
  - python3 -m http.server
port: 8000
```

</div>

If your dstack project doesn't have a gateway, services are hosted with the
`/proxy/services/<project name>/<run name>/` path prefix in the URL.
When running web apps, you may need to set some app-specific settings
so that browser-side scripts and CSS work correctly with the path prefix.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: dash

gateway: false
auth: false

# Do not strip the path prefix
strip_prefix: false

env:
  # Configure Dash to work with a path prefix
  # Replace `main` with your dstack project name
  - DASH_ROUTES_PATHNAME_PREFIX=/proxy/services/main/dash/

commands:
  - uv pip install dash
  # Assuming the Dash app is in your repo at app.py
  - python app.py

port: 8050
```

</div>

By default, dstack strips the prefix before forwarding requests to your service,
so to the service it appears as if the prefix isn't there. This allows some apps
to work out of the box. If your app doesn't expect the prefix to be stripped,
set `strip_prefix` to `false`.
If your app cannot be configured to work with a path prefix, you can host it on a dedicated domain name by setting up a gateway.
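As a hypothetical illustration (not from the dstack docs), a FastAPI app could be made prefix-aware through its `root_path` setting. With the default `strip_prefix: true`, requests reach the app without the prefix, while `root_path` keeps generated URLs (e.g., the interactive docs) correct; `main` and the `fastapi-app` run name are placeholders:

```python
from fastapi import FastAPI

# root_path tells FastAPI it sits behind a proxy that strips this prefix.
app = FastAPI(root_path="/proxy/services/main/fastapi-app")

@app.get("/")
def index():
    return {"message": "Hello from behind a path prefix"}
```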
If you have a gateway, you can configure rate limits for your service
using the `rate_limits` property.

<div editor-title="service.dstack.yml">

```yaml
type: service
image: my-app:latest
port: 80

rate_limits:
  # For /api/auth/* - 1 request per second, no bursts
  - prefix: /api/auth/
    rps: 1
  # For other URLs - 4 requests per second + bursts of up to 9 requests
  - rps: 4
    burst: 9
```

</div>

The `rps` limit sets the max requests per second, tracked in milliseconds (e.g., `rps: 4` means one request every 250 ms). Use `burst` to allow short spikes while keeping the average within `rps`.

Limits apply to the whole service (all replicas) and per client (by IP). Clients exceeding the limit receive a 429 error.
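On the client side, a request rejected with 429 can simply be retried after a short backoff. A minimal Python sketch (the service URL is a placeholder):

```python
import time

import requests

def get_with_backoff(url: str, retries: int = 5, delay: float = 0.5) -> requests.Response:
    """Retry on HTTP 429 with exponential backoff."""
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        time.sleep(delay * 2 ** attempt)  # back off before the next attempt
    return resp

resp = get_with_backoff("https://my-service.example.com/api/data")
print(resp.status_code)
```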
??? info "Partitioning key" Instead of partitioning requests by client IP address, you can choose to partition by the value of a header.
<div editor-title="service.dstack.yml">
```yaml
type: service
image: my-app:latest
port: 80
rate_limits:
- rps: 4
burst: 9
# Apply to each user, as determined by the `Authorization` header
key:
type: header
header: Authorization
```
</div>
If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`).
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - |
    vllm serve $MODEL_ID
      --max-model-len $MAX_MODEL_LEN
      --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000

resources:
  # 16 or more x86_64 cores
  cpu: 16..
  # 2 GPUs of 80GB
  gpu: 80GB:2
  # Minimum disk size
  disk: 200GB
```

</div>

The `cpu` property lets you set the architecture (`x86` or `arm`) and core count, e.g., `x86:16` (16 x86 cores) or `arm:8..` (at least 8 ARM cores).
If not set, dstack infers it from the GPU or defaults to `x86`.

The `gpu` property lets you specify vendor, model, memory, and count, e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100s), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s).
If vendor is omitted, dstack infers it from the model or defaults to `nvidia`.
??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
shm_size, e.g. set it to 16GB.
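    For instance, the `resources` section of such a configuration might look like this (a minimal sketch; the sizes are illustrative):

    ```yaml
    resources:
      gpu: 24GB
      # Increase shared memory for parallel data loaders
      shm_size: 16GB
    ```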
If you're unsure which offers (hardware configurations) are available from the configured backends, use the
`dstack offer` command to list them.
If you don't specify `image`, dstack uses its base Docker image pre-configured with
`uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`).

Set the `python` property to pre-install a specific version of Python.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

python: 3.12

commands:
  - python3 -m http.server
port: 8000
```

</div>

By default, the base Docker image doesn't include `nvcc`, which is required for building custom CUDA kernels.
If you need `nvcc`, set the `nvcc` property to `true`.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

python: 3.12
nvcc: true

commands:
  - python3 -m http.server
port: 8000
```

</div>

If you want, you can specify your own Docker image via `image`.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

image: python
commands:
  - python3 -m http.server
port: 8000
```

</div>
Set `docker` to `true` to enable the `docker` CLI in your service, e.g., to run Docker images or use Docker Compose.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: chat-ui-task

auth: false

docker: true

working_dir: examples/misc/docker-compose
commands:
  - docker compose up
port: 9000
```

</div>

This cannot be used with `python` or `image`, and is not supported on `runpod`, `vastai`, or `kubernetes`.

To enable privileged mode, set `privileged` to `true`.
This is not supported with `runpod`, `vastai`, and `kubernetes`.
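A minimal sketch (the image name is a placeholder):

```yaml
type: service
name: privileged-service

image: my-app:latest
privileged: true
port: 80
```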
Use the `registry_auth` property to provide credentials for a private Docker registry.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: serve-distill-deepseek

env:
  - NGC_API_KEY
  - NIM_MAX_MODEL_LEN=4096
image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
port: 8000
model: deepseek-ai/deepseek-r1-distill-llama-8b

resources:
  gpu: H100:1
```

</div>

Use the `env` property to set environment variables:

<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

If you don't assign a value to an environment variable (see `HF_TOKEN` above),
dstack will require the value to be passed via the CLI or set in the current process.
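For example, the value can be set in the shell when applying the configuration (a minimal sketch; the filename is a placeholder):

```shell
$ HF_TOKEN=... dstack apply -f service.dstack.yml
```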
??? info "System environment variables" The following environment variables are available in any run by default:
| Name | Description |
|-------------------------|-----------------------------------------|
| `DSTACK_RUN_NAME` | The name of the run |
| `DSTACK_REPO_ID` | The ID of the repo |
| `DSTACK_GPUS_NUM` | The total number of GPUs in the run |
By default, dstack automatically mounts the repo directory where you ran `dstack init` to any run configuration.
However, in some cases, you may not want to mount the entire directory (e.g., if it's too large),
or you might want to mount files outside of it. In such cases, you can use the `files` property.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

files:
  - .:examples # Maps the directory where `.dstack.yml` resides to `/workflow/examples`
  - ~/.ssh/id_rsa:/root/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

Each entry maps a local directory or file to a path inside the container. Both local and container paths can be relative or absolute.

- If the local path is relative, it's resolved relative to the configuration file.
- If the container path is relative, it's resolved relative to `/workflow`.

The container path is optional. If not specified, it will be automatically calculated.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

files:
  - ../examples # Maps `examples` (the parent directory of `.dstack.yml`) to `/workflow/examples`
  - ~/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

Note: If you want to use `files` without mounting the entire repo directory,
make sure to pass `--no-repo` when running `dstack apply`:

```shell
$ dstack apply -f examples/.dstack.yml --no-repo
```

??? info ".gitignore and .dstackignore"
    dstack automatically excludes files and folders listed in `.gitignore` and `.dstackignore`.

    Uploads are limited to 2MB. To avoid exceeding this limit, make sure to exclude unnecessary files.
    You can increase the default server limit by setting the `DSTACK_SERVER_CODE_UPLOAD_LIMIT` environment variable.
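    For example, the limit can be raised when starting the server (a sketch; assuming the value is given in bytes):

    ```shell
    $ DSTACK_SERVER_CODE_UPLOAD_LIMIT=10485760 dstack server
    ```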
!!! warning "Experimental"
The files feature is experimental. Feedback is highly appreciated.
By default, if dstack can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.
If you'd like dstack to automatically retry, configure the
`retry` property accordingly:

<div editor-title="service.dstack.yml">

```yaml
type: service
image: my-app:latest
port: 80

retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
```

</div>

If one replica of a multi-replica service fails with retry enabled,
dstack will resubmit only the failed replica while keeping active replicas running.
By default, dstack uses on-demand instances. However, you can change that
via the `spot_policy` property. It accepts `spot`, `on-demand`, and `auto`.
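For example, to let dstack pick spot instances when available and fall back to on-demand otherwise (a minimal sketch; the image is a placeholder):

```yaml
type: service
image: my-app:latest
port: 80

spot_policy: auto
```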
Sometimes it's useful to track whether a service is fully utilizing all GPUs. While you can check this with
`dstack metrics`, dstack also lets you set a policy to auto-terminate the run if any GPU is underutilized.

Below is an example of a service that auto-terminates if any GPU stays below 10% utilization for 1 hour.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h
```

</div>

Specify `schedule` to start a service periodically at specific UTC times using the cron syntax:
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

schedule:
  cron: "0 8 * * mon-fri" # at 8:00 UTC from Monday through Friday
```

</div>

The `schedule` property can be combined with `max_duration` or `utilization_policy` to shut down the service automatically when it's not needed.
??? info "Cron syntax"
dstack supports POSIX cron syntax. One exception is that days of the week are started from Monday instead of Sunday so 0 corresponds to Monday.
The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively.
A cron expression consists of five fields:
```
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of the month (1-31)
│ │ │ ┌───────────── month (1-12 or jan-dec)
│ │ │ │ ┌───────────── day of the week (0-6 or mon-sun)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * *
```
The following operators can be used in any of the fields:
| Operator | Description | Example |
|----------|-----------------------|-------------------------------------------------------------------------|
| `*` | Any value | `0 * * * *` runs every hour at minute 0 |
| `,` | Value list separator | `15,45 10 * * *` runs at 10:15 and 10:45 every day. |
| `-` | Range of values | `0 1-3 * * *` runs at 1:00, 2:00, and 3:00 every day. |
| `/` | Step values | `*/10 8-10 * * *` runs every 10 minutes during the hours 8:00 to 10:59. |
--8<-- "docs/concepts/snippets/manage-fleets.ext"
!!! info "Reference"
Services support many more configuration options,
incl. backends,
regions,
max_price, and
among others.
To deploy a new version of a service that is already running, use `dstack apply`. dstack will automatically detect changes and suggest a rolling deployment update.
```shell
$ dstack apply -f my-service.dstack.yml

Active run my-service already exists. Detected changes that can be updated in-place:
- Repo state (branch, commit, or other)
- File archives
- Configuration properties:
  - env
  - files

Update the run? [y/n]:
```

If approved, dstack gradually updates the service replicas. To update a replica, dstack starts a new replica, waits for it to become running, then terminates the old replica. This process is repeated for each replica, one at a time.

You can track the progress of the rolling deployment in both `dstack apply` and `dstack ps`.
Older replicas have lower deployment numbers; newer ones have higher.
```shell
$ dstack apply -f my-service.dstack.yml
 ⠋ Launching my-service...

 NAME                           BACKEND          PRICE    STATUS       SUBMITTED
 my-service deployment=1                                  running      11 mins ago
   replica=0 job=0 deployment=0 aws (us-west-2)  $0.0026  terminating  11 mins ago
   replica=1 job=0 deployment=1 aws (us-west-2)  $0.0026  running      1 min ago
```

The rolling deployment stops when all replicas are updated or when a new deployment is submitted.
??? info "Supported properties"
Rolling deployment supports changes to the following properties: `port`, `resources`, `volumes`, `docker`, `files`, `image`, `user`, `privileged`, `entrypoint`, `working_dir`, `python`, `nvcc`, `single_branch`, `env`, `shell`, `commands`, as well as changes to [repo](repos.md) or [file](#files) contents.
Changes to `replicas` and `scaling` can be applied without redeploying replicas.
Changes to other properties require a full service restart.
To trigger a rolling deployment when no properties have changed (e.g., after updating [secrets](secrets.md) or to restart all replicas),
make a minor config change, such as adding a dummy [environment variable](#environment-variables).
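For example (a sketch; the variable name is arbitrary):

```yaml
env:
  - HF_TOKEN
  # Bump this value to force a rolling deployment
  - DEPLOY_BUMP=1
```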
--8<-- "docs/concepts/snippets/manage-runs.ext"
!!! info "What's next?" 1. Read about dev environments, tasks, and repos 2. Learn how to manage fleets 3. See how to set up gateways 4. Check the TGI :material-arrow-top-right-thin:{ .external }{:target="_blank"}, vLLM :material-arrow-top-right-thin:{ .external }{:target="_blank"}, and NIM :material-arrow-top-right-thin:{ .external }{:target="_blank"} examples