Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.

-4. Optional: Publish the container privately to NERSC registry(https://registry.nersc.gov):
-```console
+4. (Optional) Publish the container privately to [NERSC registry](https://registry.nersc.gov):
+```bash
docker login registry.nersc.gov
# Username: your NERSC username
# Password: your NERSC password without 2FA
```
-```console
+```bash
docker tag synapse-gui:latest registry.nersc.gov/m558/superfacility/synapse-gui:latest
docker tag synapse-gui:latest registry.nersc.gov/m558/superfacility/synapse-gui:$(date "+%y.%m")
docker push -a registry.nersc.gov/m558/superfacility/synapse-gui
```
-This has been also automated through the Python script [publish_container.py](https://github.com/BLAST-AI-ML/synapse/blob/main/publish_container.py), which can be executed via
-```console
+This has also been automated through the Python script [publish_container.py](../publish_container.py), which can be executed via
+```bash
python publish_container.py --gui
```

-5. Optional: From time to time, as you develop the container, you might want to prune old, unused images to get back GBytes of storage on your development machine:
-```console
+5. (Optional) As you develop the container, you might want to prune old, unused images periodically in order to free space on your development machine:
+```bash
docker system prune -a
```
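
The date-based tag in step 4 comes from the shell substitution `$(date "+%y.%m")`, which expands to a two-digit year and a two-digit month. As a minimal sketch of what that produces, the helper below builds the same two image references in Python. The name `image_tags` is hypothetical: it is not part of the repository, and this is not necessarily how `publish_container.py` works.

```python
from datetime import date

# Registry path taken from the docker tag commands above.
REGISTRY_PREFIX = "registry.nersc.gov/m558/superfacility"


def image_tags(image: str, today: date) -> list[str]:
    """Return the 'latest' and 'year.month' references for an image (hypothetical helper)."""
    # "%y.%m" mirrors the format passed to `date` in the shell command.
    version = today.strftime("%y.%m")
    return [
        f"{REGISTRY_PREFIX}/{image}:latest",
        f"{REGISTRY_PREFIX}/{image}:{version}",
    ]


if __name__ == "__main__":
    for reference in image_tags("synapse-gui", date.today()):
        print(reference)
```

Because the versioned tag only changes once a month, pushing twice in the same month overwrites that month's image in the registry.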

-### How to get the Superfacility API credentials
+## How to get the Superfacility API credentials

Following the instructions at [docs.nersc.gov/services/sfapi/authentication/#client](https://docs.nersc.gov/services/sfapi/authentication/#client):

@@ -127,8 +113,8 @@ Following the instructions at [docs.nersc.gov/services/sfapi/authentication/#cli

4. Enter a client name (e.g., "Synapse"), choose `sf558` for the user, choose "Red" security level, and select either "Your IP" or "Spin" from the "IP Presets" menu, depending on whether the key will be used from a local computer or from Spin.

-5. Download the private key file (in pem format) and save it as `priv_key.pem` in the root directory of the GUI.
-Each time the GUI is launched, it will automatically find the existing key file and load the corresponding credentials.
+5. Download the private key file (in pem format) and save it as `priv_key.pem` in the root directory of the dashboard.
+Each time the dashboard is launched, it will automatically find the existing key file and load the corresponding credentials.

6. Copy your client ID and add it on the first line of your private key file as described in the instructions at [nersc.github.io/sfapi_client/quickstart/#storing-keys-in-files](https://nersc.github.io/sfapi_client/quickstart/#storing-keys-in-files):
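
As a minimal sketch of step 6, assuming the layout described in the linked quickstart (client ID on the first line, followed by the PEM-encoded key), the helper below prepends a client ID to a downloaded key file. The function name `add_client_id`, the `YOUR_CLIENT_ID` placeholder, and the permission change are illustrative assumptions, not part of the dashboard code.

```python
from pathlib import Path


def add_client_id(key_file: Path, client_id: str) -> None:
    """Put the client ID on the first line of a PEM private key file (hypothetical helper)."""
    content = key_file.read_text()
    # A freshly downloaded key starts with the PEM header. If it does not,
    # assume the client ID was already added and leave the file untouched.
    if not content.startswith("-----BEGIN"):
        return
    key_file.write_text(f"{client_id}\n{content}")
    # Assumption: restrict the key to owner read/write only.
    key_file.chmod(0o600)


if __name__ == "__main__":
    key_path = Path("priv_key.pem")
    if key_path.exists():
        # Replace the placeholder with the client ID shown for your client.
        add_client_id(key_path, "YOUR_CLIENT_ID")
```

Running the helper twice is harmless, because the second call sees that the file no longer starts with the PEM header.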
-The ML training (implemented in ``train_model.py``) can be run in two ways:
+This guide contains important instructions on how to train ML models within Synapse.

-- In your local Python environment, for testing/debugging: ``python train_model.py ...``
+## Prerequisites

-- Through the GUI, by clicking the ``Train`` button, or through SLURM by running ``sbatch training_pm.sbatch``.
-In both cases, the training runs in a Docker container at NERSC. This Docker container
-is pulled from the NERSC registry (https://registry.nersc.gov) and does not reflect any local changes
-you may have made to ``train_model.py``, unless you re-build and re-deploy the container.
+Make sure you have installed [conda](https://docs.conda.io/) and [Docker](https://docs.docker.com/).

-Both methods are described in more detail below.
+## Overview

-## Training in a local Python environment (testing/debugging)
+Synapse's ML training is implemented primarily in [train_model.py](train_model.py).
+ML models can be trained in two distinct ways:

-### On your local computer
+1. In a local Python environment, for testing and debugging.

-For local development, ensure you have [Conda](https://conda-forge.org/download/) installed. Then:
+2. Through the dashboard (by clicking the ``Train`` button) or through SLURM (by running ``sbatch training_pm.sbatch``).
+In both cases, the training runs in a Docker container at NERSC.
+This Docker container is pulled from the [NERSC registry](https://registry.nersc.gov) and does not reflect any local changes you may have made to [train_model.py](train_model.py), unless you re-build and re-deploy the container first.

-1. Create the conda environment (this only needs to be done once):
+The following sections describe these two ways of training ML models in more detail.
+
+## How to run ML training in a local Python environment
+
+### On a local computer
+
+1. Create the conda environment defined by the lock file (only once):
```bash
-conda env create -f environment.yml
+conda activate base
+conda install -c conda-forge conda-lock # if conda-lock is not installed
## Training through the dashboard or through SLURM

-> **Warning:**
->
-> Pushing a new Docker container affects training jobs launched from your locally-deployed GUI,
-> but also from the production GUI (deployed on NERSC Spin), since in both cases, the training
-> runs in a Docker container at NERSC, which is pulled from the NERSC registry (https://registry.nersc.gov).
->
-> Yet, currently, this is the only way to test the end-to-end integration of the GUI with the training workflow.
+> [!WARNING]
+> Pushing a new Docker container affects training jobs launched from your locally-deployed dashboard, but also from the production dashboard (deployed at NERSC through Spin), because in both cases the ML training runs in a Docker container at NERSC, which is pulled from the [NERSC registry](https://registry.nersc.gov).
+> Currently, this is the only way to test the end-to-end integration of the dashboard with the ML training workflow.
-3. Optional: From time to time, as you develop the container, you might want to prune old, unused images to get back GBytes of storage on your development machine:
-```console
+3. (Optional) As you develop the container, you might want to prune old, unused images periodically in order to free space on your development machine:
+```bash
docker system prune -a
```

-4. Publish the container privately to NERSC registry(https://registry.nersc.gov):
-```console
+4. Publish the container privately to [NERSC registry](https://registry.nersc.gov):
+```bash
docker login registry.nersc.gov
# Username: your NERSC username
# Password: your NERSC password without 2FA
```
-
-```console
+```bash
docker tag synapse-ml:latest registry.nersc.gov/m558/superfacility/synapse-ml:latest
docker tag synapse-ml:latest registry.nersc.gov/m558/superfacility/synapse-ml:$(date "+%y.%m")
docker push -a registry.nersc.gov/m558/superfacility/synapse-ml
```
-This has been also automated through the Python script [publish_container.py](https://github.com/BLAST-AI-ML/synapse/blob/main/publish_container.py), which can be executed via
-```console
+This has also been automated through the Python script [publish_container.py](../publish_container.py), which can be executed via
+```bash
python publish_container.py --ml
```

-5. Optional test: Run the Docker container manually on Perlmutter:
-```console
+5. (Optional) Run the Docker container manually on Perlmutter:
podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile -v /path/to/config.yaml:/app/ml/config.yaml --rm -it registry.nersc.gov/m558/superfacility/synapse-ml:latest python -u /app/ml/train_model.py --test --config_file /app/ml/config.yaml --model NN
```
Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.
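
The manual `podman-hpc run` command above is long, so it can help to see it split into its parts. The sketch below assembles the same arguments as a Python list. The function `podman_command` and its parameters are hypothetical and only restate the flags shown in the command; nothing here is taken from the repository's scripts.

```python
import shlex
from pathlib import Path

# Image reference taken from the command above.
IMAGE = "registry.nersc.gov/m558/superfacility/synapse-ml:latest"


def podman_command(config_file: str, model: str = "NN") -> list[str]:
    """Build the argument list for a manual test run (hypothetical helper)."""
    home = Path.home()
    return [
        "podman-hpc", "run", "--gpu",
        # Share the host time zone with the container.
        "-v", "/etc/localtime:/etc/localtime",
        # Mount the database profile and the training configuration.
        "-v", f"{home}/db.profile:/root/db.profile",
        "-v", f"{config_file}:/app/ml/config.yaml",
        "--rm", "-it", IMAGE,
        # Command executed inside the container.
        "python", "-u", "/app/ml/train_model.py",
        "--test", "--config_file", "/app/ml/config.yaml",
        "--model", model,
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(podman_command("/path/to/config.yaml")))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without `podman-hpc`.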


-> **Note:**
->
-> When we run ML training jobs through the GUI, we use NERSC's Superfacility API with the collaboration account `sf558`.
-> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to Perlmutter.
-> The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
+> [!NOTE]
+> When we run ML training jobs through the dashboard, we use NERSC's Superfacility API with the collaboration account `sf558`.
+> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from the [NERSC registry](https://registry.nersc.gov) to Perlmutter.
+> The registry login credentials need to be prepared (only once) in the `$HOME` of user `sf558` (`/global/homes/s/sf558/`), in a file named `registry.profile` with the following content: