Services allow you to deploy models or web apps as secure and scalable endpoints.
First, define a service configuration as a YAML file in your project folder.
The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml` are both acceptable).
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31

# If `image` is not specified, dstack uses its default image
python: 3.12

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000
# (Optional) Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
```

</div>

To run a service, pass the configuration to `dstack apply`:
```shell
$ HF_TOKEN=...
$ dstack apply -f .dstack.yml

 #  BACKEND  REGION    RESOURCES                    SPOT  PRICE
 1  runpod   CA-MTL-1  18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 2  runpod   EU-SE-1   18xCPU, 100GB, A5000:24GB:2  yes   $0.22
 3  gcp      us-west4  27xCPU, 150GB, A5000:24GB:3  yes   $0.33

Submit the run llama31? [y/n]: y

Provisioning...
---> 100%

Service is published at:
  http://localhost:3000/proxy/services/main/llama31/
Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
  http://localhost:3000/proxy/models/main/
```

`dstack apply` automatically provisions instances, uploads the contents of the repo (incl. your local uncommitted changes),
and runs the service.
If a gateway is not configured, the service's endpoint will be accessible at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell
$ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming."
          }
        ]
      }'
```

If the service defines the `model` property, the model can be accessed via
the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
or via the dstack UI.

If authorization is not disabled, the service endpoint requires the `Authorization` header set to
`Bearer <dstack token>`.
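A model registered via `model` can also be queried with any OpenAI-compatible client. Below is a minimal sketch using the `openai` Python package; the server URL, project name (`main`), and token are placeholders to replace with your own:

```python
from openai import OpenAI

# Point the client at dstack's OpenAI-compatible endpoint.
# `main` is the project name; the URL and token are placeholders.
client = OpenAI(
    base_url="http://localhost:3000/proxy/models/main/",
    api_key="<dstack token>",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
)
print(response.choices[0].message.content)
```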
??? info "Gateway" Running services for development purposes doesn’t require setting up a gateway.
However, you'll need a gateway in the following cases:
* To use auto-scaling or rate limits
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
Note, if you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
a gateway is already pre-configured for you.
If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
`https://<run name>.<gateway domain>/`.
If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint
at `https://gateway.<gateway domain>/`.
By default, dstack runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling rules.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
commands:
  - uv pip install vllm
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
port: 8000

resources:
  gpu: 24GB

replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10
```

</div>

The `replicas` property can be a number or a range.

The `metric` property of `scaling` only supports the `rps` metric (requests per second). In this
case, dstack adjusts the number of replicas (scales up or down) automatically based on the load.

Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests.

The `scaling` property requires creating a gateway.
If the service is running a chat model with an OpenAI-compatible interface,
set the `model` property to make the model accessible via dstack's
global OpenAI-compatible endpoint, as well as via the dstack UI.
By default, the service enables authorization, meaning the service endpoint requires a dstack user token.
This can be disabled by setting `auth` to `false`.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

# Disable authorization
auth: false

python: 3.12
commands:
  - python3 -m http.server
port: 8000
```

</div>

If your dstack project doesn't have a gateway, services are hosted with the
`/proxy/services/<project name>/<run name>/` path prefix in the URL.
When running web apps, you may need to set some app-specific settings
so that browser-side scripts and CSS work correctly with the path prefix.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: dash

gateway: false
auth: false

# Do not strip the path prefix
strip_prefix: false

env:
  # Configure Dash to work with a path prefix
  # Replace `main` with your dstack project name
  - DASH_ROUTES_PATHNAME_PREFIX=/proxy/services/main/dash/

commands:
  - uv pip install dash
  # Assuming the Dash app is in your repo at app.py
  - python app.py

port: 8050
```

</div>

By default, dstack strips the prefix before forwarding requests to your service,
so to the service it appears as if the prefix isn't there. This allows some apps
to work out of the box. If your app doesn't expect the prefix to be stripped,
set `strip_prefix` to `false`.
If your app cannot be configured to work with a path prefix, you can host it on a dedicated domain name by setting up a gateway.
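As a hypothetical illustration (not from the dstack docs), a FastAPI app could be made prefix-aware through its `root_path` setting. With the default `strip_prefix: true`, requests reach the app without the prefix, while `root_path` keeps generated URLs (e.g., the interactive docs) correct; `main` and the `fastapi-app` run name are placeholders:

```python
from fastapi import FastAPI

# root_path tells FastAPI it sits behind a proxy that strips this prefix.
app = FastAPI(root_path="/proxy/services/main/fastapi-app")

@app.get("/")
def index():
    return {"message": "Hello from behind a path prefix"}
```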
If you have a gateway, you can configure rate limits for your service
using the `rate_limits` property.

<div editor-title="service.dstack.yml">

```yaml
type: service
image: my-app:latest
port: 80

rate_limits:
  # For /api/auth/* - 1 request per second, no bursts
  - prefix: /api/auth/
    rps: 1
  # For other URLs - 4 requests per second + bursts of up to 9 requests
  - rps: 4
    burst: 9
```

</div>

The `rps` limit sets the max requests per second, tracked in milliseconds (e.g., `rps: 4` means one request every 250 ms). Use `burst` to allow short spikes while keeping the average within `rps`.

Limits apply to the whole service (all replicas) and per client (by IP). Clients exceeding the limit receive a 429 error.
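On the client side, a request rejected with 429 can simply be retried after a short backoff. A minimal Python sketch (the service URL is a placeholder):

```python
import time

import requests

def get_with_backoff(url: str, retries: int = 5, delay: float = 0.5) -> requests.Response:
    """Retry on HTTP 429 with exponential backoff."""
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        time.sleep(delay * 2 ** attempt)  # back off before the next attempt
    return resp

resp = get_with_backoff("https://my-service.example.com/api/data")
print(resp.status_code)
```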
??? info "Partitioning key" Instead of partitioning requests by client IP address, you can choose to partition by the value of a header.
<div editor-title="service.dstack.yml">
```yaml
type: service
image: my-app:latest
port: 80
rate_limits:
- rps: 4
burst: 9
# Apply to each user, as determined by the `Authorization` header
key:
type: header
header: Authorization
```
</div>
If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
range (e.g. `24GB..`, or `24GB..80GB`, or `..80GB`).
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama31-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=4096
commands:
  - uv pip install vllm
  - |
    vllm serve $MODEL_ID
      --max-model-len $MAX_MODEL_LEN
      --tensor-parallel-size $DSTACK_GPUS_NUM
port: 8000

resources:
  # 16 or more x86_64 cores
  cpu: 16..
  # 2 GPUs of 80GB
  gpu: 80GB:2
  # Minimum disk size
  disk: 200GB
```

</div>

The `cpu` property lets you set the architecture (`x86` or `arm`) and core count, e.g., `x86:16` (16 x86 cores) or `arm:8..` (at least 8 ARM cores).
If not set, dstack infers it from the GPU or defaults to `x86`.

The `gpu` property lets you specify vendor, model, memory, and count, e.g., `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either), `A100:80GB` (one 80GB A100), `A100:2` (two A100s), `24GB..40GB:2` (two GPUs with 24–40GB), `A100:40GB:2` (two 40GB A100s).
If vendor is omitted, dstack infers it from the model or defaults to `nvidia`.
??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
shm_size, e.g. set it to 16GB.
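    For instance, the `resources` section of such a configuration might look like this (a minimal sketch; the sizes are illustrative):

    ```yaml
    resources:
      gpu: 24GB
      # Increase shared memory for parallel data loaders
      shm_size: 16GB
    ```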
If you're unsure which offers (hardware configurations) are available from the configured backends, use the
`dstack offer` command to list them.
If you don't specify `image`, dstack uses its base Docker image pre-configured with
`uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`).

Set the `python` property to pre-install a specific version of Python.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

python: 3.12

commands:
  - python3 -m http.server
port: 8000
```

</div>

By default, the base Docker image doesn't include `nvcc`, which is required for building custom CUDA kernels.
If you need `nvcc`, set the `nvcc` property to `true`.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

python: 3.12
nvcc: true

commands:
  - python3 -m http.server
port: 8000
```

</div>

If you want, you can specify your own Docker image via `image`.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: http-server-service

image: python
commands:
  - python3 -m http.server
port: 8000
```

</div>
Set `docker` to `true` to enable the `docker` CLI in your service, e.g., to run Docker images or use Docker Compose.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: chat-ui-task

auth: false

docker: true

working_dir: examples/misc/docker-compose
commands:
  - docker compose up
port: 9000
```

</div>

This cannot be used with `python` or `image`, and is not supported on `runpod`, `vastai`, or `kubernetes`.

To enable privileged mode, set `privileged` to `true`.
This is not supported with `runpod`, `vastai`, and `kubernetes`.
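A minimal sketch (the image name is a placeholder):

```yaml
type: service
name: privileged-service

image: my-app:latest
privileged: true
port: 80
```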
Use the `registry_auth` property to provide credentials for a private Docker registry.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: serve-distill-deepseek

env:
  - NGC_API_KEY
  - NIM_MAX_MODEL_LEN=4096
image: nvcr.io/nim/deepseek-ai/deepseek-r1-distill-llama-8b
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}
port: 8000
model: deepseek-ai/deepseek-r1-distill-llama-8b

resources:
  gpu: H100:1
```

</div>

Use the `env` property to set environment variables:

<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

If you don't assign a value to an environment variable (see `HF_TOKEN` above),
dstack will require the value to be passed via the CLI or set in the current process.
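For example, the value can be set in the shell when applying the configuration (a minimal sketch; the filename is a placeholder):

```shell
$ HF_TOKEN=... dstack apply -f service.dstack.yml
```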
??? info "System environment variables" The following environment variables are available in any run by default:
| Name | Description |
|-------------------------|-----------------------------------------|
| `DSTACK_RUN_NAME` | The name of the run |
| `DSTACK_REPO_ID` | The ID of the repo |
| `DSTACK_GPUS_NUM` | The total number of GPUs in the run |
By default, dstack automatically mounts the repo directory where you ran `dstack init` to any run configuration.
However, in some cases, you may not want to mount the entire directory (e.g., if it's too large),
or you might want to mount files outside of it. In such cases, you can use the `files` property.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

files:
  - .:examples # Maps the directory where `.dstack.yml` resides to `/workflow/examples`
  - ~/.ssh/id_rsa:/root/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

Each entry maps a local directory or file to a path inside the container. Both local and container paths can be relative or absolute.

- If the local path is relative, it's resolved relative to the configuration file.
- If the container path is relative, it's resolved relative to `/workflow`.

The container path is optional. If not specified, it will be automatically calculated.
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

files:
  - ../examples # Maps `examples` (the parent directory of `.dstack.yml`) to `/workflow/examples`
  - ~/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

</div>

Note: If you want to use `files` without mounting the entire repo directory,
make sure to pass `--no-repo` when running `dstack apply`:

```shell
$ dstack apply -f examples/.dstack.yml --no-repo
```

??? info ".gitignore and .dstackignore"
    dstack automatically excludes files and folders listed in `.gitignore` and `.dstackignore`.

    Uploads are limited to 2MB. To avoid exceeding this limit, make sure to exclude unnecessary files.
    You can increase the default server limit by setting the `DSTACK_SERVER_CODE_UPLOAD_LIMIT` environment variable.
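    For example, the limit can be raised when starting the server (a sketch; assuming the value is given in bytes):

    ```shell
    $ DSTACK_SERVER_CODE_UPLOAD_LIMIT=10485760 dstack server
    ```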
!!! warning "Experimental"
The files feature is experimental. Feedback is highly appreciated.
By default, if dstack can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.
If you'd like dstack to automatically retry, configure the
`retry` property accordingly:

<div editor-title="service.dstack.yml">

```yaml
type: service
image: my-app:latest
port: 80

retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
```

</div>

If one replica of a multi-replica service fails with retry enabled,
dstack will resubmit only the failed replica while keeping active replicas running.
By default, dstack uses on-demand instances. However, you can change that
via the `spot_policy` property. It accepts `spot`, `on-demand`, and `auto`.
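For example, to let dstack pick spot instances when available and fall back to on-demand otherwise (a minimal sketch; the image is a placeholder):

```yaml
type: service
image: my-app:latest
port: 80

spot_policy: auto
```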
Sometimes it's useful to track whether a service is fully utilizing all GPUs. While you can check this with
`dstack metrics`, dstack also lets you set a policy to auto-terminate the run if any GPU is underutilized.

Below is an example of a service that auto-terminates if any GPU stays below 10% utilization for 1 hour.

<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

utilization_policy:
  min_gpu_utilization: 10
  time_window: 1h
```

</div>

Specify `schedule` to start a service periodically at specific UTC times using the cron syntax:
<div editor-title="service.dstack.yml">

```yaml
type: service
name: llama-2-7b-service

python: 3.12

env:
  - HF_TOKEN
  - MODEL=NousResearch/Llama-2-7b-chat-hf
commands:
  - uv pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB

schedule:
  cron: "0 8 * * mon-fri" # at 8:00 UTC from Monday through Friday
```

</div>

The `schedule` property can be combined with `max_duration` or `utilization_policy` to shut down the service automatically when it's not needed.
??? info "Cron syntax"
dstack supports POSIX cron syntax. One exception is that days of the week are started from Monday instead of Sunday so 0 corresponds to Monday.
The month and day of week fields accept abbreviated English month and weekday names (`jan–dec` and `mon–sun`) respectively.
A cron expression consists of five fields:
```
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of the month (1-31)
│ │ │ ┌───────────── month (1-12 or jan-dec)
│ │ │ │ ┌───────────── day of the week (0-6 or mon-sun)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * *
```
The following operators can be used in any of the fields:
| Operator | Description | Example |
|----------|-----------------------|-------------------------------------------------------------------------|
| `*` | Any value | `0 * * * *` runs every hour at minute 0 |
| `,` | Value list separator | `15,45 10 * * *` runs at 10:15 and 10:45 every day. |
| `-` | Range of values | `0 1-3 * * *` runs at 1:00, 2:00, and 3:00 every day. |
| `/` | Step values | `*/10 8-10 * * *` runs every 10 minutes during the hours 8:00 to 10:59. |
--8<-- "docs/concepts/snippets/manage-fleets.ext"
!!! info "Reference"
Services support many more configuration options,
incl. backends,
regions,
max_price, and
among others.
To deploy a new version of a service that is already running, use `dstack apply`. dstack will automatically detect changes and suggest a rolling deployment update.
```shell
$ dstack apply -f my-service.dstack.yml

Active run my-service already exists. Detected changes that can be updated in-place:
- Repo state (branch, commit, or other)
- File archives
- Configuration properties:
  - env
  - files

Update the run? [y/n]:
```

If approved, dstack gradually updates the service replicas. To update a replica, dstack starts a new replica, waits for it to become running, then terminates the old replica. This process is repeated for each replica, one at a time.

You can track the progress of the rolling deployment in both `dstack apply` and `dstack ps`.
Older replicas have lower deployment numbers; newer ones have higher.
```shell
$ dstack apply -f my-service.dstack.yml
 ⠋ Launching my-service...

 NAME                           BACKEND          PRICE    STATUS       SUBMITTED
 my-service deployment=1                                  running      11 mins ago
   replica=0 job=0 deployment=0 aws (us-west-2)  $0.0026  terminating  11 mins ago
   replica=1 job=0 deployment=1 aws (us-west-2)  $0.0026  running      1 min ago
```

The rolling deployment stops when all replicas are updated or when a new deployment is submitted.
??? info "Supported properties"
Rolling deployment supports changes to the following properties: `port`, `resources`, `volumes`, `docker`, `files`, `image`, `user`, `privileged`, `entrypoint`, `working_dir`, `python`, `nvcc`, `single_branch`, `env`, `shell`, `commands`, as well as changes to [repo](repos.md) or [file](#files) contents.
Changes to `replicas` and `scaling` can be applied without redeploying replicas.
Changes to other properties require a full service restart.
To trigger a rolling deployment when no properties have changed (e.g., after updating [secrets](secrets.md) or to restart all replicas),
make a minor config change, such as adding a dummy [environment variable](#environment-variables).
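For example (a sketch; the variable name is arbitrary):

```yaml
env:
  - HF_TOKEN
  # Bump this value to force a rolling deployment
  - DEPLOY_BUMP=1
```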
--8<-- "docs/concepts/snippets/manage-runs.ext"
!!! info "What's next?" 1. Read about dev environments, tasks, and repos 2. Learn how to manage fleets 3. See how to set up gateways 4. Check the TGI :material-arrow-top-right-thin:{ .external }{:target="_blank"}, vLLM :material-arrow-top-right-thin:{ .external }{:target="_blank"}, and NIM :material-arrow-top-right-thin:{ .external }{:target="_blank"} examples