Commit b0bbfbe

Update README files
1 parent c1ffe3d commit b0bbfbe

2 files changed

Lines changed: 107 additions & 110 deletions

dashboard/README.md

Lines changed: 45 additions & 59 deletions
Original file line number | Diff line number | Diff line change
@@ -1,121 +1,107 @@
1-
## Dashboard
1+
# Dashboard
22

3-
Here are a few how-to guides on how to develop and use the dashboard.
3+
This guide contains important instructions on how to use and develop the Synapse dashboard.
44

5-
### Prerequisites
6-
- Ensure you have Conda installed.
7-
- Ensure you have Docker installed (if you plan to use Docker).
5+
## Prerequisites
86

9-
### How to create a new conda environment lock file
7+
Make sure you have installed [conda](https://docs.conda.io/) and [Docker](https://docs.docker.com/).
108

11-
1. Activate the `base` conda environment:
12-
```console
13-
conda activate base
14-
```
9+
## How to generate the conda environment lock file
1510

16-
2. Install `conda-lock` (if not already installed):
17-
```console
18-
conda install -c conda-forge conda-lock
19-
```
11+
A new conda environment lock file can be generated by running the following commands:
2012

21-
3. Create the lock file starting from the existing minimal environment file:
22-
```console
23-
conda-lock --file environment.yml --lockfile environment-lock.yml
24-
```
13+
```bash
14+
conda activate base
15+
conda install -c conda-forge conda-lock # if conda-lock is not installed
16+
conda-lock --file environment.yml --lockfile environment-lock.yml
17+
```
2518

26-
### How to set up the conda environment
19+
## How to create the conda environment
2720

28-
1. Activate the `base` conda environment:
29-
```console
30-
conda activate base
31-
```
21+
The conda environment defined by the lock file can be created by running the following commands:
3222

33-
2. Install `conda-lock` (if not already installed):
34-
```console
35-
conda install -c conda-forge conda-lock
36-
```
37-
38-
3. Create the `synapse-gui` conda environment from the lock file:
39-
```console
40-
conda-lock install --name synapse-gui environment-lock.yml
41-
```
23+
```bash
24+
conda activate base
25+
conda install -c conda-forge conda-lock # if conda-lock is not installed
26+
conda-lock install --name synapse-gui environment-lock.yml
27+
```
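The commands above assume that `conda` and `conda-lock` can be found on the `PATH`. As a minimal sketch (the helper name `missing_tools` is hypothetical and not part of Synapse), a pre-flight check could look like this in Python:

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical pre-flight check before generating or installing the lock file.
missing = missing_tools(["conda", "conda-lock"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("conda and conda-lock are available.")
```

This only checks that executables with these names exist; it does not verify their versions or that the `base` environment is active.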
4228

43-
### How to run the GUI
29+
## How to run the dashboard
4430

45-
1. Activate the `synapse-gui` conda environment:
46-
```console
31+
1. Activate the conda environment:
32+
```bash
4733
conda activate synapse-gui
4834
```
4935

5036
2. Set the database settings (read only):
51-
```console
37+
```bash
5238
export SF_DB_HOST='127.0.0.1'
5339
export SF_DB_READONLY_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
5440
```
5541

56-
3. For local development, open a separate terminal and keep it open while SSH forwarding the database connection:
57-
```console
42+
3. Open a separate terminal and keep it open while SSH forwarding the database connection:
43+
```bash
5844
ssh -L 27017:mongodb05.nersc.gov:27017 <username>@dtn03.nersc.gov -N
5945
```
6046

61-
4. Run the GUI from the `dashboard/` folder:
47+
4. Run the dashboard from the `dashboard/` folder:
6248
- Via the web browser interface:
63-
```console
49+
```bash
6450
python -u app.py --port 8080
6551
```
6652
- As a desktop application:
67-
```console
53+
```bash
6854
python -u app.py --app
6955
```
70-
If you run the GUI as a desktop application, make sure to set the following environment variable first:
71-
```console
56+
If you run the dashboard as a desktop application, make sure to install `pywebview` with Qt support and set the following environment variable first:
57+
```bash
7258
python -m pip install 'pywebview[qt]'  # quotes prevent shells such as zsh from globbing the brackets
7359
export PYWEBVIEW_GUI=qt
7460
```
7561

76-
5. Terminate the GUI via `Ctrl` + `C`.
62+
5. Terminate the application via `Ctrl` + `C`.
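The database settings from step 2 must be exported in the same shell that launches the dashboard. The following Python sketch shows one way to verify this before starting `app.py`. The variable names come from step 2; the helper `missing_env_vars` is hypothetical, and whether the dashboard performs a similar check itself is not stated here.

```python
import os

# Variable names taken from step 2 of the instructions above.
REQUIRED_VARS = ("SF_DB_HOST", "SF_DB_READONLY_PASSWORD")


def missing_env_vars(environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(os.environ)
if missing:
    print("Set these variables before launching the dashboard:", ", ".join(missing))
```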
7763

78-
### How to build and run the Docker container
64+
## How to build and run the Docker container
7965

8066
1. Move to the root directory of the repository.
8167

8268
2. Build the Docker image based on `dashboard.Dockerfile`:
83-
```console
69+
```bash
8470
docker build --platform linux/amd64 -t synapse-gui -f dashboard.Dockerfile .
8571
```
8672

87-
3. Run the Docker container from the `dashboard/` folder:
88-
```console
73+
3. Run the Docker container:
74+
```bash
8975
docker run --network=host -v /etc/localtime:/etc/localtime -v $PWD/ml:/app/ml -e SF_DB_HOST='127.0.0.1' -e SF_DB_READONLY_PASSWORD='your_password_here' synapse-gui
9076
```
9177
For debugging, you can also enter the container without starting the app:
92-
```console
78+
```bash
9379
docker run --network=host -v /etc/localtime:/etc/localtime -v $PWD/ml:/app/ml -e SF_DB_HOST='127.0.0.1' -e SF_DB_READONLY_PASSWORD='your_password_here' -it synapse-gui bash
9480
```
9581
Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.
9682

97-
4. Optional: Publish the container privately to NERSC registry (https://registry.nersc.gov):
98-
```console
83+
4. (Optional) Publish the container privately to the [NERSC registry](https://registry.nersc.gov):
84+
```bash
9985
docker login registry.nersc.gov
10086
# Username: your NERSC username
10187
# Password: your NERSC password without 2FA
10288
```
103-
```console
89+
```bash
10490
docker tag synapse-gui:latest registry.nersc.gov/m558/superfacility/synapse-gui:latest
10591
docker tag synapse-gui:latest registry.nersc.gov/m558/superfacility/synapse-gui:$(date "+%y.%m")
10692
docker push -a registry.nersc.gov/m558/superfacility/synapse-gui
10793
```
108-
This has been also automated through the Python script [publish_container.py](https://github.com/BLAST-AI-ML/synapse/blob/main/publish_container.py), which can be executed via
109-
```console
94+
This has also been automated through the Python script [publish_container.py](../publish_container.py), which can be executed via:
95+
```bash
11096
python publish_container.py --gui
11197
```
11298

113-
5. Optional: From time to time, as you develop the container, you might want to prune old, unused images to get back GBytes of storage on your development machine:
114-
```console
99+
5. (Optional) As you develop the container, you might want to prune old, unused images periodically in order to free space on your development machine:
100+
```bash
115101
docker system prune -a
116102
```
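The `docker tag` commands in step 4 publish each image under two tags: `latest` and a two-digit year and month produced by `date "+%y.%m"`. As an illustration only, the Python sketch below computes the same two tag names. The helper `image_tags` is hypothetical, and this is not a description of how `publish_container.py` works.

```python
from datetime import date

# Image path taken from the `docker tag` commands in step 4.
REGISTRY_IMAGE = "registry.nersc.gov/m558/superfacility/synapse-gui"


def image_tags(today):
    """Return the `latest` tag and the year.month tag for the given date."""
    return [
        f"{REGISTRY_IMAGE}:latest",
        f"{REGISTRY_IMAGE}:{today.strftime('%y.%m')}",
    ]


for tag in image_tags(date.today()):
    print(tag)
```

For example, an image published in March 2025 would carry the tag `25.03`, so a later push within the same month overwrites that month's tag.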
117103

118-
### How to get the Superfacility API credentials
104+
## How to get the Superfacility API credentials
119105

120106
Following the instructions at [docs.nersc.gov/services/sfapi/authentication/#client](https://docs.nersc.gov/services/sfapi/authentication/#client):
121107

@@ -127,8 +113,8 @@ Following the instructions at [docs.nersc.gov/services/sfapi/authentication/#cli
127113

128114
4. Enter a client name (e.g., "Synapse"), choose `sf558` for the user, choose "Red" security level, and select either "Your IP" or "Spin" from the "IP Presets" menu, depending on whether the key will be used from a local computer or from Spin.
129115

130-
5. Download the private key file (in pem format) and save it as `priv_key.pem` in the root directory of the GUI.
131-
Each time the GUI is launched, it will automatically find the existing key file and load the corresponding credentials.
116+
5. Download the private key file (in pem format) and save it as `priv_key.pem` in the root directory of the dashboard.
117+
Each time the dashboard is launched, it will automatically find the existing key file and load the corresponding credentials.
132118

133119
6. Copy your client ID and add it on the first line of your private key file as described in the instructions at [nersc.github.io/sfapi_client/quickstart/#storing-keys-in-files](https://nersc.github.io/sfapi_client/quickstart/#storing-keys-in-files):
134120
```

ml/README.md

Lines changed: 62 additions & 51 deletions
Original file line numberDiff line numberDiff line change
@@ -1,104 +1,116 @@
11
# ML Training
22

3-
The ML training (implemented in ``train_model.py``) can be run in two ways:
3+
This guide contains important instructions on how to train ML models within Synapse.
44

5-
- In your local Python environment, for testing/debugging: ``python train_model.py ...``
5+
## Prerequisites
66

7-
- Through the GUI, by clicking the ``Train`` button, or through SLURM by running ``sbatch training_pm.sbatch``.
8-
In both cases, the training runs in a Docker container at NERSC. This Docker container
9-
is pulled from the NERSC registry (https://registry.nersc.gov) and does not reflect any local changes
10-
you may have made to ``train_model.py``, unless you re-build and re-deploy the container.
7+
Make sure you have installed [conda](https://docs.conda.io/) and [Docker](https://docs.docker.com/).
118

12-
Both methods are described in more detail below.
9+
## Overview
1310

14-
## Training in a local Python environment (testing/debugging)
11+
Synapse's ML training is implemented primarily in [train_model.py](train_model.py).
12+
ML models can be trained in two distinct ways:
1513

16-
### On your local computer
14+
1. In a local Python environment, for testing and debugging.
1715

18-
For local development, ensure you have [Conda](https://conda-forge.org/download/) installed. Then:
16+
2. Through the dashboard (by clicking the ``Train`` button) or through SLURM (by running ``sbatch training_pm.sbatch``).
17+
In both cases, the training runs in a Docker container at NERSC.
18+
This Docker container is pulled from the [NERSC registry](https://registry.nersc.gov) and does not reflect any local changes you may have made to [train_model.py](train_model.py), unless you re-build and re-deploy the container first.
1919

20-
1. Create the conda environment (this only needs to be done once):
20+
The following sections describe these two ways of training ML models in more detail.
21+
22+
## How to run ML training in a local Python environment
23+
24+
### On a local computer
25+
26+
1. Create the conda environment defined by the lock file (only once):
2127
```bash
22-
conda env create -f environment.yml
28+
conda activate base
29+
conda install -c conda-forge conda-lock # if conda-lock is not installed
30+
conda-lock install --name synapse-ml environment-lock.yml
2331
```
2432

25-
2. Open a separate terminal and keep it open:
33+
2. Open a separate terminal and keep it open while SSH forwarding the database connection:
2634
```bash
2735
ssh -L 27017:mongodb05.nersc.gov:27017 <username>@dtn03.nersc.gov -N
2836
```
2937

30-
3. Activate the conda environment and setup database read-write access:
38+
3. Activate the conda environment:
3139
```bash
3240
conda activate synapse-ml
41+
```
42+
43+
4. Set up database settings (read-write):
44+
```bash
3345
export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
3446
```
3547

36-
4. Run the training script in test mode:
37-
```console
38-
python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
48+
5. Run the ML training script in test mode:
49+
```bash
50+
python train_model.py --test --model <NN/GP> --config_file <your_config_file>
3951
```
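The training script can only reach the database while the SSH tunnel from step 2 is running. A minimal Python sketch for checking that something is listening on the forwarded local port before starting a run (the helper `port_is_open` is hypothetical; port 27017 is taken from the `ssh -L` command above):

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open("127.0.0.1", 27017):
    print("Port 27017 is reachable; the SSH tunnel appears to be up.")
else:
    print("Nothing is listening on 127.0.0.1:27017; start the SSH tunnel first.")
```

A successful connection only shows that a process accepts connections on that port; it does not confirm that the database credentials are valid.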
4052

4153
### At NERSC
4254

43-
1. Create the conda environment (this only needs to be done once):
55+
1. Create the conda environment defined by the lock file (only once):
4456
```bash
4557
module load python
46-
conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/synapse-ml -f environment.yml
58+
conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/synapse-ml -f environment.yml # FIXME
4759
```
4860

49-
2. Activate the environment and setup database read-write access:
61+
2. Activate the conda environment:
5062
```bash
5163
module load python
5264
conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/synapse-ml
65+
```
66+
67+
3. Set up database settings (read-write):
68+
```bash
69+
module load python
5370
export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
5471
```
5572

56-
3. Run the training script in test mode:
57-
```console
58-
python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
73+
4. Run the ML training script in test mode:
74+
```bash
75+
python train_model.py --test --model <NN/GP> --config_file <your_config_file>
5976
```
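The `export` steps above insist on single quotes because, inside double quotes, the shell still interprets characters such as `$`, backticks, and backslashes, so a password containing them could be silently altered. The Python sketch below uses the standard library's `shlex.quote` to show the quoting a POSIX shell needs. The helper `export_line` and the sample password are placeholders for illustration.

```python
import shlex


def export_line(name, value):
    """Build a shell `export` line with the value quoted for POSIX shells."""
    return f"export {name}={shlex.quote(value)}"


# Placeholder password containing characters that double quotes would not protect.
print(export_line("SF_DB_ADMIN_PASSWORD", "pa$$word!"))
```

If the password itself contains a single quote, `shlex.quote` closes the quoted string, adds the quote character in double quotes, and reopens it, which is the form to type at the prompt in that case.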
6077

61-
## Training through the GUI or through SLURM
78+
## Training through the dashboard or through SLURM
6279

63-
> **Warning:**
64-
>
65-
> Pushing a new Docker container affects training jobs launched from your locally-deployed GUI,
66-
> but also from the production GUI (deployed on NERSC Spin), since in both cases, the training
67-
> runs in a Docker container at NERSC, which is pulled from the NERSC registry (https://registry.nersc.gov).
68-
>
69-
> Yet, currently, this is the only way to test the end-to-end integration of the GUI with the training workflow.
80+
> [!WARNING]
81+
> Pushing a new Docker container affects training jobs launched not only from your locally deployed dashboard but also from the production dashboard (deployed at NERSC through Spin). In both cases, the ML training runs in a Docker container at NERSC, which is pulled from the [NERSC registry](https://registry.nersc.gov).
82+
> Currently, this is the only way to test the end-to-end integration of the dashboard with the ML training workflow.
7083
7184
1. Move to the root directory of the repository.
7285

7386
2. Build the Docker image based on `ml.Dockerfile`:
74-
```console
87+
```bash
7588
docker build --platform linux/amd64 -t synapse-ml -f ml.Dockerfile .
7689
```
7790

78-
3. Optional: From time to time, as you develop the container, you might want to prune old, unused images to get back GBytes of storage on your development machine:
79-
```console
91+
3. (Optional) As you develop the container, you might want to prune old, unused images periodically in order to free space on your development machine:
92+
```bash
8093
docker system prune -a
8194
```
8295

83-
4. Publish the container privately to NERSC registry (https://registry.nersc.gov):
84-
```console
96+
4. Publish the container privately to the [NERSC registry](https://registry.nersc.gov):
97+
```bash
8598
docker login registry.nersc.gov
8699
# Username: your NERSC username
87100
# Password: your NERSC password without 2FA
88101
```
89-
90-
```console
102+
```bash
91103
docker tag synapse-ml:latest registry.nersc.gov/m558/superfacility/synapse-ml:latest
92104
docker tag synapse-ml:latest registry.nersc.gov/m558/superfacility/synapse-ml:$(date "+%y.%m")
93105
docker push -a registry.nersc.gov/m558/superfacility/synapse-ml
94106
```
95-
This has been also automated through the Python script [publish_container.py](https://github.com/BLAST-AI-ML/synapse/blob/main/publish_container.py), which can be executed via
96-
```console
107+
This has also been automated through the Python script [publish_container.py](../publish_container.py), which can be executed via:
108+
```bash
97109
python publish_container.py --ml
98110
```
99111

100-
5. Optional test: Run the Docker container manually on Perlmutter:
101-
```console
112+
5. (Optional) Run the Docker container manually on Perlmutter:
113+
```bash
102114
ssh perlmutter-p1.nersc.gov
103115
104116
podman-hpc login --username $USER registry.nersc.gov
@@ -107,27 +119,26 @@ For local development, ensure you have [Conda](https://conda-forge.org/download/
107119
podman-hpc pull registry.nersc.gov/m558/superfacility/synapse-ml:latest
108120
```
109121

110-
Ensure the file `$HOME/db.profile` contains a line `export SF_DB_ADMIN_PASSWORD=...` with the write password to the database.
122+
Ensure the file `$HOME/db.profile` contains a line `export SF_DB_ADMIN_PASSWORD=...` with the read-write password to the database.
111123

112-
```console
124+
```bash
113125
salloc -N 1 --ntasks-per-node=1 -t 1:00:00 -q interactive -C gpu --gpu-bind=single:1 -c 32 -G 1 -A m558
114126
115127
podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile -v /path/to/config.yaml:/app/ml/config.yaml --rm -it registry.nersc.gov/m558/superfacility/synapse-ml:latest python -u /app/ml/train_model.py --test --config_file /app/ml/config.yaml --model NN
116128
```
117129
Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.
118130

119131

120-
> **Note:**
121-
>
122-
> When we run ML training jobs through the GUI, we use NERSC's Superfacility API with the collaboration account `sf558`.
123-
> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to Perlmutter.
124-
> The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
132+
> [!NOTE]
133+
> When we run ML training jobs through the dashboard, we use NERSC's Superfacility API with the collaboration account `sf558`.
134+
> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from the [NERSC registry](https://registry.nersc.gov) to Perlmutter.
135+
> The registry login credentials need to be prepared (only once) in the `$HOME` of user `sf558` (`/global/homes/s/sf558/`), in a file named `registry.profile` with the following content:
125136
> ```bash
126137
> export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov"
127138
> export REGISTRY_PASSWORD="..."
128139
> ```
129140
130141
## References
131142
132-
* https://docs.nersc.gov/development/containers/podman-hpc/overview/
133-
* https://docs.nersc.gov/development/containers/registry/
143+
* [Podman at NERSC](https://docs.nersc.gov/development/containers/podman-hpc/overview/)
144+
* [Using NERSC's `registry.nersc.gov`](https://docs.nersc.gov/development/containers/registry/)
