Commit 7824b24

Unblock CI and address mxinO review on remote-execution.md
Two fixes:

1. Exclude vendored upstream skills from markdownlint. `.claude/skills/launching-evals/` and `.claude/skills/accessing-mlflow/` are vendored verbatim from NVIDIA-NeMo/Evaluator and re-synced via .claude/scripts/sync-upstream-skills.sh. Markdownlint wanted to reformat them (trailing blank lines, spacing around fences), but fixing would violate the "verbatim" property documented in their frontmatter. Add an `ignores:` glob to `.markdownlint-cli2.yaml`.

2. Reframe the checkpoint/storage note on `skills/common/remote-execution.md`. Reviewer @mxinO noted (PR #1239) that the previous "compute nodes may not share the same filesystem as login nodes" framing is misleading — compute nodes on a given cluster do share storage with the login node. The real issue is that workstation filesystems aren't mounted on the cluster at all. Also drops the dlcluster-specific row, which @mxinO flagged as an internal quirk that shouldn't ship publicly.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 290f432 commit 7824b24

2 files changed: +12 -7 lines changed

.claude/skills/common/remote-execution.md

Lines changed: 6 additions & 7 deletions
@@ -28,16 +28,15 @@ clusters:
 default_cluster: my-cluster
 ```
 
-### Checkpoint and storage availability
+### Staging checkpoints from your workstation
 
-Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:
+Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
 
-| Cluster type | Compute-node storage | NOT accessible from compute nodes |
-|-------------|---------------------|----------------------------------|
-| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
-| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |
+```bash
+rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
+```
 
-If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
+Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
 
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
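
Review note: the staging step added in this hunk could be wrapped in a small helper so the destination is always derived from the cluster config. The following is a minimal sketch only, not part of the commit. The function name `stage_checkpoint_cmd` and the login and workspace values are hypothetical placeholders; real values would come from the `workspace` field of your cluster config. It prints the `rsync` command instead of running it, so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) the rsync command that stages a checkpoint.
# stage_checkpoint_cmd is a hypothetical helper, not something the repo provides.
set -euo pipefail

stage_checkpoint_cmd() {
  local src="${1%/}"        # drop one trailing slash so rsync copies the directory itself
  local login="$2"          # assumed: cluster login host, e.g. user@host
  local workspace="${3%/}"  # assumed: the `workspace` path from the cluster config
  printf 'rsync -av %s %s\n' "$src" "${login}:${workspace}/checkpoints/"
}

# Example call with placeholder values; prints the command for review.
stage_checkpoint_cmd "/path/to/local/checkpoint/" "user@login.example" "/cluster/workspace"
```

Printing the command first keeps the helper safe to run from a workstation without cluster access; piping the output to `bash` (or adding `-n` to `rsync` for a dry run) is left to the caller.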

.markdownlint-cli2.yaml

Lines changed: 6 additions & 0 deletions
@@ -10,3 +10,9 @@ config:
   MD036: false # no-emphasis-as-heading - allow **bold** as section markers
   MD041: false # first-line-heading
   MD059: false # no-hard-tabs
+
+# Vendored upstream skills — kept byte-identical to upstream via
+# .claude/scripts/sync-upstream-skills.sh; do not reformat.
+ignores:
+  - ".claude/skills/launching-evals/**"
+  - ".claude/skills/accessing-mlflow/**"
