diff --git a/.agents/skills/common/environment-setup.md b/.agents/skills/common/environment-setup.md
index 1745e103297..7af2eac2513 100644
--- a/.agents/skills/common/environment-setup.md
+++ b/.agents/skills/common/environment-setup.md
@@ -24,7 +24,7 @@ If previous runs left patches in `modelopt/` (from 4C unlisted model work), chec
 2. **User doesn't specify** → check for cluster config:
 ```bash
-cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
+cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .agents/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
 ```
 If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.
diff --git a/.agents/skills/deployment/SKILL.md b/.agents/skills/deployment/SKILL.md
index 8e18bf7f0cd..4579a9bd62c 100644
--- a/.agents/skills/deployment/SKILL.md
+++ b/.agents/skills/deployment/SKILL.md
@@ -185,7 +185,7 @@ All checks must pass before reporting success to the user.
 ### 6. Remote deployment (SSH/SLURM)
-If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:
+If a cluster config exists (`~/.config/modelopt/clusters.yaml`, `.agents/clusters.yaml`, or `.claude/clusters.yaml`), or the user mentions running on a remote machine:
 0. **Check container registry auth** — before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**
diff --git a/.agents/skills/deployment/tests/evals.json b/.agents/skills/deployment/tests/evals.json
index 82a36b6b0c9..a94bef08e3a 100644
--- a/.agents/skills/deployment/tests/evals.json
+++ b/.agents/skills/deployment/tests/evals.json
@@ -26,7 +26,7 @@
  "query": "deploy my quantized model on the SLURM cluster",
  "files": [],
  "expected_behavior": [
- "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
+ "Checks for cluster config at ~/.config/modelopt/clusters.yaml, .agents/clusters.yaml, or .claude/clusters.yaml",
  "Sources .agents/skills/common/remote_exec.sh",
  "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
  "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
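
Review note: the lookup order this patch introduces (user config, then `.agents/`, then `.claude/`) can be sketched as a small shell function. This is an illustration only, not code from the repository: `find_cluster_config` and its optional project-root argument are hypothetical names. It also differs from the `cat` chain in one respect. `cat` succeeds on an empty file, so the chain stops at the first file that exists; the sketch uses `[ -s ... ]` and skips empty files, on the assumption that "exists with content" means non-empty.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the path of the
# first cluster config that exists and is non-empty, checking the three
# locations in the same order as the patched `cat` chain.
find_cluster_config() {
    local root="${1:-.}"   # assumed project root for the two relative paths
    local candidate
    for candidate in \
        "${HOME}/.config/modelopt/clusters.yaml" \
        "${root}/.agents/clusters.yaml" \
        "${root}/.claude/clusters.yaml"; do
        # -s is true only when the file exists and its size is greater than zero
        if [ -s "${candidate}" ]; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    # No usable config found; the caller would fall back to local execution
    return 1
}
```

A caller could branch on the exit status, for example `if cfg="$(find_cluster_config)"; then echo "remote: ${cfg}"; else echo "local"; fi`.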