Commit 8742dc1

skypilot readme updates (#1121)
# Improved SkyPilot Documentation and Removed Unused Sandbox Script

### TL;DR

Enhanced SkyPilot documentation with dashboard access instructions and improved shell aliases, while removing an unused sandbox script.

### What changed?

- Removed the unused `devops/sandbox.sh` script that was only running an infinite loop
- Added documentation for accessing the SkyPilot web dashboard with authentication instructions
- Updated shell aliases documentation to match the current aliases
- Removed the requirement for AWS ECR access in us-east-1 region from prerequisites

### How to test?

1. Try accessing the SkyPilot dashboard using the new instructions:
   - Visit https://skypilot-api.softmax-research.net/
   - Use credentials from `~/.sky/config.yaml` or by running `sky api info`
2. Source the updated shell script to verify the aliases work correctly:
   ```bash
   source ./devops/skypilot/setup_shell.sh
   jj  # Should list active jobs
   ```

### Why make this change?

- The sandbox script was unused and could be safely removed
- The dashboard access instructions help users monitor their jobs more effectively
- Improved documentation makes the SkyPilot tooling more accessible to team members

[Asana Task](https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1210630741514021)
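A minimal sketch for the dashboard test step, assuming the endpoint you get from `sky api info` has the form `https://user:password@host/` (the form the README warning in this commit refers to). The helper name `strip_url_credentials` is hypothetical and not part of the repository; it only drops the `user:password@` part so the dashboard can be opened in a fresh tab without embedded credentials.

```shell
# Hypothetical helper (not in the repo): remove a "user:password@" prefix
# from an http(s) URL and print the result. URLs without credentials
# are printed unchanged.
strip_url_credentials() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}

# Example with placeholder credentials.
strip_url_credentials "https://user:password@skypilot-api.softmax-research.net/"
```

The pattern assumes any literal `@` or `/` inside the credentials is percent-encoded, as is usual for URLs.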
1 parent 2423704 commit 8742dc1

3 files changed

Lines changed: 50 additions & 58 deletions

devops/sandbox.sh

Lines changed: 0 additions & 12 deletions
This file was deleted.

devops/skypilot/README.md

Lines changed: 49 additions & 45 deletions
@@ -5,7 +5,6 @@ This script provides a convenient way to launch training jobs on AWS using SkyPi
 ## Prerequisites
 
 - AWS credentials configured with `softmax` profile
-- Access to AWS ECR in us-east-1 region
 - SkyPilot CLI installed and configured
 - Git repository with pushed commits (unless using `--skip-git-check`)
 
@@ -28,11 +27,20 @@ For a complete list of optional parameters and their descriptions, use:
 ./devops/skypilot/launch.py --help
 ```
 
+## Accessing the Dashboard
+
+There's a [web dashboard](https://skypilot-api.softmax-research.net/) that displays the status of all clusters and jobs.
+
+To sign in, you'll need a username and password. You can read them from the URL in your `~/.sky/config.yaml` or get them by running `sky api info`.
+
+Warning: if you open the `https://user:password@...` URL directly from the terminal, the dashboard will load but might appear empty. To fix this, open the [dashboard](https://skypilot-api.softmax-research.net/) in a new tab **without** the `user:password@` part.
+
 ## Examples
 
 ### Basic Usage
 
 1. **Launch a training run with default parameters:**
+
 ```bash
 devops/skypilot/launch.py train run=my_experiment_001
 ```
@@ -45,11 +53,13 @@ For a complete list of optional parameters and their descriptions, use:
 ### Resource Configuration
 
 3. **Use multiple GPUs:**
+
 ```bash
 devops/skypilot/launch.py train run=gpu_experiment --gpus 4
 ```
 
 4. **Multi-node training:**
+
 ```bash
 devops/skypilot/launch.py train run=distributed_training --nodes 2 --gpus 8
 ```
@@ -62,6 +72,7 @@ For a complete list of optional parameters and their descriptions, use:
 ### Time Management
 
 6. **Quick 30-minute experiment:**
+
 ```bash
 devops/skypilot/launch.py train run=quick_test --timeout-hours 0.5
 ```
@@ -74,11 +85,13 @@ For a complete list of optional parameters and their descriptions, use:
 ### Advanced Usage
 
 8. **Launch multiple identical experiments:**
+
 ```bash
 devops/skypilot/launch.py train run=ablation_study --copies 5 --timeout-hours 2
 ```
 
 9. **Use specific git commit:**
+
 ```bash
 devops/skypilot/launch.py train run=reproducible_exp --git-ref abc123def
 ```
@@ -90,24 +103,24 @@ For a complete list of optional parameters and their descriptions, use:
 
 The `--confirm` flag displays a detailed job summary before launching:
 
-```sh
-============================================================
-Job details:
-============================================================
-Name: my_experiment_001
-GPUs: 1x A10G
-CPUs: 8+
-Spot Instances: Yes
-Auto-termination: 2h
-Git Reference: 56e04aa725000f186ec1bb2de84b359b4f273947
-------------------------------------------------------------
-Command: train
-Task Arguments:
-1. trainer.curriculum=env/mettagrid/curriculum/navigation
-2. trainer.learning_rate=0.001
-============================================================
-Should we launch this task? (Y/n):
-```
+```sh
+============================================================
+Job details:
+============================================================
+Name: my_experiment_001
+GPUs: 1x A10G
+CPUs: 8+
+Spot Instances: Yes
+Auto-termination: 2h
+Git Reference: 56e04aa725000f186ec1bb2de84b359b4f273947
+------------------------------------------------------------
+Command: train
+Task Arguments:
+1. trainer.curriculum=env/mettagrid/curriculum/navigation
+2. trainer.learning_rate=0.001
+============================================================
+Should we launch this task? (Y/n):
+```
 
 11. **Dry run:**
 ```bash
@@ -117,6 +130,7 @@ The `--confirm` flag displays a detailed job summary before launching:
 The `--dry-run` flag allows you to preview the configuration that will be used before launching.
 
 It will output the complete YAML configuration that would be used for the deployment, including:
+
 - Resource specifications (cloud provider, instance types, GPUs)
 - Docker configurations
 - Environment variables
@@ -157,6 +171,7 @@ sky jobs cancel -n "experiment_*"
 ### Job Status
 
 Jobs can have the following statuses:
+
 - `PENDING`: Waiting for resources
 - `RUNNING`: Currently executing
 - `SUCCEEDED`: Completed successfully
@@ -179,42 +194,28 @@ This script also sets `AWS_PROFILE=softmax` automatically.
 
 ### Available Aliases
 
+Run `source ./devops/skypilot/setup_shell.sh` to set up the aliases below.
+
 #### Job Queue Management
-- `jq` - List active jobs (skips finished jobs)
-```bash
-jq # Equivalent to: sky jobs queue --skip-finished
-```
+
+- `jj` - List active jobs (skips finished jobs)
 - `jja` - List all jobs including finished ones
-```bash
-jqa # Equivalent to: sky jobs queue
-```
 
 #### Job Control
-- `jk` - Cancel a job
-```bash
-jk <JOB_ID> # Equivalent to: sky jobs cancel <JOB_ID>
-```
+
+- `jk <JOB_ID>` - Cancel a job
+- `jka` - Cancel all jobs
 
 #### Logs
-- `jl` - View job logs
-```bash
-jl <JOB_ID> # Equivalent to: sky jobs logs <JOB_ID>
-```
-- `jlc` - View controller logs (useful for debugging)
-```bash
-jlc <JOB_ID> # Equivalent to: sky jobs logs --controller <JOB_ID>
-```
+
+- `jl <JOB_ID>` - View job logs
+- `jlc <JOB_ID>` - View controller logs (useful for debugging)
 - `jll` - View logs for the most recent job
-```bash
-jll # Automatically gets logs for the latest running job
-```
 - `jllc` - View controller logs for the most recent job
-```bash
-jllc # Automatically gets controller logs for the latest running job
-```
 
 #### Launching
-- `lt` - Quick launch training jobs
+
+- `lt run=<NAME>` - Quick launch training jobs
 ```bash
 lt run=my_experiment_001 # Equivalent to: ./devops/skypilot/launch.py train run=my_experiment_001
 ```
@@ -277,6 +278,7 @@ sky down <cluster_name>
 ## Configuration
 
 The script uses `./devops/skypilot/config/sk_train.yaml` as the base configuration. This file defines:
+
 - Default resource requirements (CPU, GPU, memory)
 - Docker image settings
 - Environment variables
@@ -285,6 +287,7 @@ The script uses `./devops/skypilot/config/sk_train.yaml` as the base configurati
 ### Environment Variables
 
 The following environment variables are automatically set:
+
 - `METTA_RUN_ID`: The run identifier
 - `METTA_CMD`: The command being executed
 - `METTA_CMD_ARGS`: Additional command arguments
@@ -309,6 +312,7 @@ sky jobs queue -a
 ## Best Practices
 
 1. **Use descriptive run IDs**: Include date, user name, experiment type, and key parameters
+
 ```bash
 lt run=2024_01_15_bert_lr_sweep_001
 ```

devops/skypilot/setup_shell.sh

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 export AWS_PROFILE=softmax
 
 # list jobs
-alias jj="sky jobs queue --skip-finished" # avoid conflict with jq (JSON processor)
+alias jj="sky jobs queue --skip-finished"
 alias jja="sky jobs queue"
 
 # cancel ("kill") job
