
Commit fd9fbce

[Internal]: Update backend contributing docs (#2369)
1 parent d4a6061 commit fd9fbce


2 files changed: +112 -72 lines changed

contributing/BACKENDS.md

Lines changed: 70 additions & 72 deletions
@@ -23,62 +23,64 @@ To add a new cloud provider to `gpuhunt`, follow these steps:
 git clone https://github.com/dstackai/gpuhunt.git
 ```
 
-### 1.2. Create the provider class
+### 1.2. Decide if you will implement an offline or an online provider
 
-Create the provider class file under `src/gpuhunt/providers`.
-
-Ensure your class...
-
-- Extends the `AbstractProvider` base class.
-- Has the `NAME` property, that will be used as the unique identifier for your provider.
-- Implements the `get` method, that is responsible for fetching the available machine configurations from the cloud provider.
+- **Offline providers** offer static machine configurations that are not frequently updated.
+  `gpuhunt` collects offline providers' instance offers on an hourly basis.
+  Examples: `aws`, `gcp`, `azure`, etc.
+- **Online providers** offer dynamic machine configurations that are available at the very moment
+  when you fetch configurations (e.g., GPU marketplaces).
+  `gpuhunt` collects online providers' instance offers each time a `dstack` user provisions a new instance.
+  Examples: `tensordock`, `vastai`, etc.
 
-[//]: # (TODO: Elaborate better on how to use `query_filter` and `balance_resources`)
+### 1.3. Create the provider class
 
-Refer to examples: [datacrunch.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/datacrunch.py),
-[aws.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/aws.py),
-[gcp.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/gcp.py),
-[azure.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/azure.py),
-[lambdalabs.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/lambdalabs.py),
-[tensordock.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/tensordock.py),
-[vastai.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/vastai.py).
+Create the provider class file under `src/gpuhunt/providers`.
 
-### 1.3. Register the provider with the catalog
+Make sure your class extends the [`AbstractProvider`](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/__init__.py)
+base class. See its docstrings for descriptions of the methods that your class should implement.
 
-Update the `src/gpuhunt/_internal/catalog.py` file by adding the provider name
-to either `OFFLINE_PROVIDERS` or `ONLINE_PROVIDERS` depending on the type of the provider.
+Refer to examples:
+- Offline providers:
+  [datacrunch.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/datacrunch.py),
+  [aws.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/aws.py),
+  [azure.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/azure.py),
+  [lambdalabs.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/lambdalabs.py).
+- Online providers:
+  [vultr.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/vultr.py),
+  [tensordock.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/tensordock.py),
+  [vastai.py](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/vastai.py).
 
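To make the provider's shape concrete, here is a minimal sketch. The `RawCatalogItem` fields and the `AbstractProvider` stand-in below are simplified assumptions for illustration only; the real interface is defined in the linked `providers/__init__.py`, so follow its docstrings rather than this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RawCatalogItem:
    """Simplified stand-in for gpuhunt's catalog item (fields are illustrative)."""
    instance_name: str
    location: str
    price: float  # USD per hour
    gpu_count: int
    gpu_name: Optional[str]
    spot: bool


class AbstractProvider:
    """Simplified stand-in for gpuhunt's AbstractProvider base class."""
    NAME: str = "abstract"

    def get(self, query_filter=None, balance_resources: bool = True) -> List[RawCatalogItem]:
        raise NotImplementedError


class ExampleCloudProvider(AbstractProvider):
    # NAME is the unique identifier used to register the provider in the catalog
    NAME = "examplecloud"

    def get(self, query_filter=None, balance_resources: bool = True) -> List[RawCatalogItem]:
        # A real provider would call the cloud's pricing/instances API here;
        # this sketch returns one hard-coded offer.
        return [
            RawCatalogItem(
                instance_name="gpu-1x-a100",
                location="us-east-1",
                price=1.85,
                gpu_count=1,
                gpu_name="A100",
                spot=False,
            )
        ]
```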
-How do I decide which type my provider is?
+### 1.4. Register the provider with the catalog
 
-- `OFFLINE_PROVIDERS` - Use this type if your provider offers static machine configurations that may be collected and
-  published on a daily basis. Examples: `aws`, `gcp`, `azure`, etc. These providers offer many machine configurations,
-  but they are not updated frequently.
-- `ONLINE_PROVIDERS` - Use this type if your provider offers dynamic machine configurations that are available at the very moment when you fetch configurations (e.g., GPU marketplaces).
-  Examples: `tensordock`, `vast`, etc.
+Add your provider in the following places:
+- Either `OFFLINE_PROVIDERS` or `ONLINE_PROVIDERS` in `src/gpuhunt/_internal/catalog.py`.
+- The `python -m gpuhunt` command in `src/gpuhunt/__main__.py`.
+- (offline providers) The CI workflow in `.github/workflows/catalogs.yml`.
+- (online providers) The default catalog in `src/gpuhunt/_internal/default.py`.
 
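The first registration step might look like the following hypothetical excerpt of `src/gpuhunt/_internal/catalog.py`. The surrounding code and the exact provider lists are assumptions; check the actual file, and note that `examplecloud` is an illustrative placeholder for your provider's name.

```python
# Hypothetical excerpt: provider names registered with the catalog.
# "examplecloud" stands in for the new provider being added.
OFFLINE_PROVIDERS = ["aws", "azure", "datacrunch", "examplecloud", "gcp", "lambdalabs"]
ONLINE_PROVIDERS = ["tensordock", "vastai", "vultr"]
```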
-### 1.4. Add data quality tests
+### 1.5. Add data quality tests
 
-If the provider is registered via `OFFLINE_PROVIDERS`, you can add data quality tests
-under `src/integrity_tests/`.
+For offline providers, you can add data quality tests under `src/integrity_tests/`.
+Data quality tests are run after collecting offline catalogs to ensure their integrity.
 
 Refer to examples: [test_datacrunch.py](https://github.com/dstackai/gpuhunt/blob/main/src/integrity_tests/test_datacrunch.py),
 [test_gcp.py](https://github.com/dstackai/gpuhunt/blob/main/src/integrity_tests/test_gcp.py).
 
-> Anything unclear? Ask questions on the [Discord server](https://discord.gg/u8SmfwPpMd).
+### 1.6. Submit a pull request
 
 Once the cloud provider is added, submit a pull request.
 
+> Anything unclear? Ask questions on the [Discord server](https://discord.gg/u8SmfwPpMd).
 
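A data quality test typically asserts invariants over the collected catalog. The following self-contained sketch uses an inline CSV sample and illustrative column names rather than the real catalog format; see the linked test files for the actual conventions.

```python
import csv
import io

# Inline sample standing in for a collected offline catalog (columns are illustrative).
CATALOG_CSV = """\
instance_name,location,price,gpu_count,gpu_name
gpu-1x-a100,us-east-1,1.85,1,A100
cpu-small,us-east-1,0.10,0,
"""


def load_rows(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))


def check_prices_are_positive(rows: list) -> None:
    # Every offer must have a positive hourly price.
    assert all(float(row["price"]) > 0 for row in rows)


def check_gpu_offers_have_gpu_names(rows: list) -> None:
    # Any offer with GPUs must name the GPU model.
    for row in rows:
        if int(row["gpu_count"]) > 0:
            assert row["gpu_name"], f"GPU offer without a GPU name: {row}"
```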
 ## 2. Integrate the cloud provider to dstackai/dstack
 
 Once the provider is added to `gpuhunt`, we can proceed with implementing
 the corresponding backend with `dstack`. Follow the steps below.
 
-#### 2.1 Clone the repo
+#### 2.1. Determine if you will implement a VM-based or a container-based backend
 
-```bash
-git clone https://github.com/dstackai/dstack.git
-```
+See the Appendix at the end of this document and make sure the provider meets the outlined requirements.
 
 #### 2.2. Set up the development environment
 
@@ -96,6 +98,8 @@ these dependencies, and ensure that you update the `all` section to include them
 Add a new enumeration member for your provider to `BackendType` (`src/dstack/_internal/core/models/backends/base.py`).
 Use the name of the provider.
 
+Then create a database [migration](MIGRATIONS.md) to reflect the new enum member.
+
 ##### 2.4.2. Create the provider directory
 
 Create a new directory under `src/dstack/_internal/core/backends` with the name of the backend type.
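The enum change above can be sketched as follows. The member list is an illustrative subset, with `EXAMPLECLOUD` standing in for your provider; the real enum lives in `src/dstack/_internal/core/models/backends/base.py`.

```python
import enum


class BackendType(str, enum.Enum):
    """Illustrative subset of the real BackendType enum."""
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    EXAMPLECLOUD = "examplecloud"  # new member named after the provider
```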
@@ -108,22 +112,24 @@ backend class there (should extend `dstack._internal.core.backends.base.Backend`
 Refer to examples:
 [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/__init__.py),
 [aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/__init__.py),
-[gcp.py](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/__init__.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/__init__.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/azure/__init__.py), etc.
 
 ##### 2.4.4. Create the backend compute class
 
 Under the backend directory you've created, create the `compute.py` file and define the
 backend compute class there (should extend `dstack._internal.core.backends.base.compute.Compute`).
 
-You'll have to implement `get_offers`, `create_instance`, `run_job` and `terminate_instance`.
+You'll have to implement `get_offers`, `run_job`, and `terminate_instance`.
+You may need to implement `update_provisioning_data`; see its docstring for details.
 
-The `create_instance` method is required for the pool feature. If you implement the `create_instance` method, you should add the provider name to `BACKENDS_WITH_CREATE_INSTANCE_SUPPORT`. (`src/dstack/_internal/server/services/runs.py`).
+For VM-based backends, also implement the `create_instance` method and add the backend name to
+[`BACKENDS_WITH_CREATE_INSTANCE_SUPPORT`](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/__init__.py).
 
 Refer to examples:
 [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/compute.py),
 [aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/compute.py),
-[gcp.py](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/compute.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/compute.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/azure/compute.py), etc.
 
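The compute class can be sketched as below. The method signatures and data shapes here are simplified assumptions (the real ones are defined by the `Compute` base class), and `FakeAPI` stands in for the cloud's SDK client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceOffer:
    instance_name: str
    price: float  # USD per hour


@dataclass
class ProvisioningData:
    instance_id: str
    hostname: Optional[str]  # may be unknown until the instance finishes booting


class FakeAPI:
    """Stand-in for a real cloud SDK client."""

    def __init__(self):
        self.terminated: List[str] = []

    def launch(self, instance_name: str) -> str:
        return f"i-{instance_name}"

    def terminate(self, instance_id: str) -> None:
        self.terminated.append(instance_id)


class ExampleCloudCompute:
    def __init__(self, api_client: FakeAPI):
        self.api = api_client

    def get_offers(self, requirements=None) -> List[InstanceOffer]:
        # A real backend filters the gpuhunt catalog / cloud API by requirements.
        return [InstanceOffer("gpu-1x-a100", 1.85)]

    def run_job(self, job, offer: InstanceOffer) -> ProvisioningData:
        instance_id = self.api.launch(offer.instance_name)
        # The hostname is unknown at this point; update_provisioning_data
        # would fill it in once the instance is reachable.
        return ProvisioningData(instance_id=instance_id, hostname=None)

    def terminate_instance(self, instance_id: str) -> None:
        self.api.terminate(instance_id)
```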
 ##### 2.4.5. Create the backend config model class
@@ -138,7 +144,7 @@ backend config model classes there.
 Refer to examples:
 [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/datacrunch.py),
 [aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/aws.py),
-[gcp.py](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/gcp.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/gcp.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/azure.py), etc.
 
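As a rough sketch of the shape such config models take: the real classes in the linked examples are pydantic models, but plain dataclasses are used here so the sketch stays dependency-free, and all class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleCloudConfigInfo:
    """Non-sensitive backend settings, e.g. which regions to use."""
    regions: Optional[List[str]] = None
    type: str = "examplecloud"


@dataclass
class ExampleCloudCreds:
    """Credentials for the provider's API."""
    api_key: str = ""
    type: str = "api_key"


@dataclass
class ExampleCloudConfigInfoWithCreds(ExampleCloudConfigInfo):
    """Settings plus credentials, as stored server-side."""
    creds: ExampleCloudCreds = field(default_factory=ExampleCloudCreds)
```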
 ##### 2.4.6. Create the backend config class
@@ -152,24 +158,22 @@ and the backend configuration model class defined above).
 Refer to examples:
 [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/config.py),
 [aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/config.py),
-[gcp.py](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/config.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/config.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/azure/config.py), etc.
 
 ##### 2.4.7. Import config model classes
 
 Ensure the config model classes are imported
 into [`src/dstack/_internal/core/models/backends/__init__.py`](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/models/backends/__init__.py).
 
-[//]: # (TODO: The backend configuration is overly complex and needs simplification: https://github.com/dstackai/dstack/issues/888)
-
 ##### 2.4.8. Create the configurator class
 
 Create the file with the backend name under [`src/dstack/_internal/server/services/backends/configurators`](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators)
 and define the backend configurator class (must extend `dstack._internal.server.services.backends.configurators.base.Configurator`).
 
 Refer to examples: [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators/datacrunch.py),
 [aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators/aws.py),
-[gcp.py](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators/gcp.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators/gcp.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/services/backends/configurators/azure.py), etc.
 
 ##### 2.4.9. Create the server config class
@@ -199,49 +203,43 @@ If instances in the backend take more than 10 minutes to start, override the def
 
 #### 3.1.1. VM-based backend compute type
 
-It's when the cloud provider allows provisioning Virtual machines (VMs).
-This is the most flexible backend compute type.
+Used if the cloud provider allows provisioning virtual machines (VMs).
+When `dstack` provisions a VM, it launches the `dstack-shim` agent inside the VM.
+The agent controls the VM and starts Docker containers for users' jobs.
 
-[//]: # (TODO: Elaborate why it's the most flexible)
+Since `dstack` controls the entire VM, VM-based backends can support more features,
+such as blocks, instance volumes, privileged containers, and reusable instances.
 
-To support it, `dstack` expects the following from the cloud provider:
+To support a VM-based backend, `dstack` expects the following:
 
 - An API for creating and terminating VMs
-- Ubuntu 22.04 LTS
-- NVIDIA CUDA driver 535
-- Docker with NVIDIA runtime
-- OpenSSH server
-- Cloud-init script (preferred)
-- An external IP and public port for SSH
-
-When `dstack` provisions a VM, it launches there `dstack-shim`.
-
-[//]: # (TODO: Elaborate on what dstack-shim is and how it works)
+- An external IP and a public port for SSH
+- Cloud-init (preferred)
+- VM images with Ubuntu, OpenSSH, GPU drivers, and Docker with NVIDIA runtime
 
-The examples of VM-based backends include: `aws`, `azure`, `gcp`, `lambda`, `datacrunch`, `tensordock`, etc.
+For some VM-based backends, the `dstack` team also maintains
+[custom VM images](../scripts/packer/README.md) with the required dependencies
+and `dstack`-specific optimizations.
 
-[//]: # (TODO: Elaborate on packer scripts)
+Examples of VM-based backends include: `aws`, `azure`, `gcp`, `lambda`, `datacrunch`, `tensordock`, etc.
 
 #### 3.1.2. Container-based backend compute type
 
-It's when the cloud provider allows provisioning only containers.
-This is the most limited backend compute type.
+Used if the cloud provider only allows provisioning containers.
+When `dstack` provisions a container, it launches the `dstack-runner` agent inside the container.
+The agent accepts and runs users' jobs.
 
-[//]: # (TODO: Elaborate on why it's the most limited)
+Since `dstack` doesn't control the underlying machine, container-based backends don't support some
+`dstack` features, such as blocks, instance volumes, privileged containers, and reusable instances.
 
-To support it, `dstack` expects the following from the cloud provider:
+To support a container-based backend, `dstack` expects the following:
 
 - An API for creating and terminating containers
-- Docker with NVIDIA runtime
+- Containers properly configured to access GPUs
 - An external IP and a public port for SSH
+- A way to specify the Docker image
+- A way to specify credentials for pulling images from private Docker registries
 - A way to override the container entrypoint (at least ~2KB)
+- A way to override the container user to root (as in `docker run --user root ...`)
 
-The examples of container-based backends include: `kubernetes`, `vastai`, etc.
-
-Note: There are two types of computing in dstack:
-
-When `dstack` provisions a VM, it launches their `dstack-runner`.
-
-[//]: # (TODO: Elaborate on what dstack-runner is and how it works)
-
-[//]: # (TODO: Update this guide to incorporate the pool feature)
+Examples of container-based backends include: `kubernetes`, `vastai`, `runpod`.

contributing/MIGRATIONS.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# Database migrations
+
+`dstack` uses Alembic to manage database migrations. If you modify any SQLAlchemy
+[models](../src/dstack/_internal/server/models.py) or related data structures,
+generate a new migration with Alembic:
+
+```shell
+cd src/dstack/_internal/server/
+alembic revision -m "<some message>" --autogenerate
+```
+
+Then adjust the generated migration if needed.
+
+## PostgreSQL enums
+
+If you modify any enums used in SQLAlchemy models, you will need to set up PostgreSQL
+in order to generate a PostgreSQL-specific enum migration.
+
+1. Run PostgreSQL.
+
+   ```shell
+   docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD=password postgres
+   ```
+
+1. Create a database for `dstack`.
+
+   ```shell
+   psql -h localhost -U postgres --command "CREATE DATABASE dstack"
+   ```
+
+1. Run `dstack server` once to create the previous database schema.
+
+   ```shell
+   DSTACK_DATABASE_URL=postgresql+asyncpg://postgres:password@localhost:5432/dstack dstack server
+   ```
+
+1. Generate the migration.
+
+   ```shell
+   cd src/dstack/_internal/server/
+   DSTACK_DATABASE_URL=postgresql+asyncpg://postgres:password@localhost:5432/dstack alembic revision -m "<some message>" --autogenerate
+   ```
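On PostgreSQL, adding a member to an enum typically boils down to an `ALTER TYPE ... ADD VALUE` statement in the generated migration. As a hedged sketch (the enum type name `backendtype` and the value are illustrative assumptions, not taken from the actual migrations), the statement could be built like this:

```python
def build_add_enum_value_sql(enum_type: str, value: str) -> str:
    # ALTER TYPE ... ADD VALUE is the standard PostgreSQL way to extend an enum;
    # SQLite needs no equivalent since it stores enum columns as plain strings.
    return f"ALTER TYPE {enum_type} ADD VALUE '{value}'"
```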
