Update SKILL#9196
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the SkyPilot skill documentation by adding advanced command options, explaining workdir synchronization behavior, and providing examples for parallel experiment submission. It also refines the recommended agent feedback loop and lists common mistakes. The review feedback suggests making the Python script for job ID extraction more robust against empty results and using explicit job IDs in log commands to avoid ambiguity when multiple jobs are present.
| job_id=$(sky queue exp-vm-01 -o json \ | ||
| | python3 -c "import sys,json; jobs=json.load(sys.stdin)['exp-vm-01']; print(max(j['job_id'] for j in jobs))") |
There was a problem hiding this comment.
The Python one-liner to get the latest job ID is clever, but it's not robust against cases where no jobs are running on the cluster. If sky queue returns no jobs, json.load(sys.stdin)['exp-vm-01'] will likely be an empty list, causing max() to raise a ValueError and break the script. It's better to handle this edge case gracefully.
| job_id=$(sky queue exp-vm-01 -o json \ | |
| | python3 -c "import sys,json; jobs=json.load(sys.stdin)['exp-vm-01']; print(max(j['job_id'] for j in jobs))") | |
| job_id=$(sky queue exp-vm-01 -o json \ | |
| | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')") |
| 4. **Debug**: `sky logs mycluster` (stream logs) or `ssh mycluster` (interactive) | ||
| 5. **Iterate**: `sky exec mycluster updated_task.yaml` (run on existing cluster) | ||
| 6. **Cleanup**: `sky down mycluster` | ||
| 4. **Wait for completion**: `sky logs mycluster --status` (blocks until job finishes; exits 0 on success) |
There was a problem hiding this comment.
For robustness, especially in programmatic use by an agent, it's better to explicitly use a job ID with sky logs. The current command sky logs mycluster --status might be ambiguous if multiple jobs are on the cluster, even if it defaults to the latest one. This would also make the example consistent with the recommendation in the callout on line 408.
Consider mentioning how to obtain the job ID (e.g., from sky launch output or sky queue) and using it here and in the following steps. For example:
4. **Wait for completion**: Get the job ID (e.g., from sky launchoutput), then runsky logs mycluster <JOB_ID> --status...
romilbhardwaj
left a comment
There was a problem hiding this comment.
Thanks @alex000kim!
| 4. **Debug**: `sky logs mycluster` (stream logs) or `ssh mycluster` (interactive) | ||
| 5. **Iterate**: `sky exec mycluster updated_task.yaml` (run on existing cluster) | ||
| 6. **Cleanup**: `sky down mycluster` | ||
| 4. **Wait for completion**: `sky logs mycluster <JOB_ID> --status` (blocks until job finishes; exits 0 on success; get JOB_ID from `sky queue mycluster -o json`) |
There was a problem hiding this comment.
instead of sky logs mycluster <JOB_ID> --status, would just streaming logs with sky logs mycluster be better for an agent? So if an application causes the job stall (e.g., timeout + retry loops), the agent can look at the output to infer next steps?
There was a problem hiding this comment.
good point, updated!
Updates to SKILL based on several autoresearch runs from https://blog.skypilot.co/scaling-autoresearch/