.claude/skills/common/remote-execution.md
4 additions & 4 deletions
@@ -33,10 +33,10 @@ default_cluster: my-cluster
 Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
-Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
+Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.

 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
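
The lookup order in the list above could be sketched in Python as follows. This is only an illustration: the function name `resolve_session_id` and the `term-<sid>` fallback format are assumptions, not part of the skill. The source only asks for "a stable id for the current terminal session"; using the POSIX session id from `os.getsid(0)` is one possible way to get that (Unix only). Reading `session_id` from hook input is not covered here.

```python
import os


def resolve_session_id(env=None):
    """Return the agent session id, following the lookup order above."""
    env = os.environ if env is None else env
    # Claude Code first, then Codex, in the order the list gives them.
    for var in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(var, "").strip()
        if value:
            return value
    # Assumption: the POSIX session id is shared by processes started from
    # the same terminal, so it stays stable for that terminal session.
    return f"term-{os.getsid(0)}"
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment, for example `resolve_session_id({"CODEX_THREAD_ID": "t-1"})`.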
 ## When to Reuse vs Create

-**Before starting any task**, check for an existing workspace:
+**Before starting any task**, check for an existing workspace in the current
+session:

 ```bash
-ls ./workspaces/ 2>/dev/null
+ls ./workspaces/<session_id>/ 2>/dev/null
 ```

 **Reuse** when:

-- Same model (e.g., deploying a model you just quantized)
+- The matching model workspace already exists under `./workspaces/<session_id>/`
 - Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
 - User says "deploy the model I just quantized"

 **Create new** when:

-- New model not seen before
+- No matching model workspace exists under `./workspaces/<session_id>/`
 - User explicitly asks for a fresh start
-- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
+
+## Model Workspace Names
+
+Within `./workspaces/<session_id>/`, create one model workspace per model or
+model variant. Include meaningful variant details in the model workspace name,
+for example quantization format or checkpoint role:
+
+```bash
+mkdir -p ./workspaces/<session_id>/<model-name>
+```
+
+Use descriptive model workspace names, not timestamps:
+
+```text
+# Good
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+workspaces/<session_id>/qwen3-0.6b-fp8/
+workspaces/<session_id>/qwen3-0.6b-baseline/
+
+# Bad
+workspaces/<session_id>/ptq-20260318-143022/
+workspaces/<session_id>/job-001/
+```
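
The reuse-or-create rule for model workspaces can be sketched as a small Python helper. The name `model_workspace` and its `(path, reused)` return value are hypothetical, chosen for illustration. The sketch assumes that "matching" means a directory with the same model workspace name already exists under the session directory.

```python
from pathlib import Path


def model_workspace(root, session_id, model_name):
    """Return (path, reused) for <root>/<session_id>/<model_name>."""
    path = Path(root) / session_id / model_name
    # Reuse when the model workspace already exists in this session.
    reused = path.is_dir()
    # Equivalent to `mkdir -p`; a no-op when the workspace already exists.
    path.mkdir(parents=True, exist_ok=True)
    return path, reused
```

Because the session id is part of the path, two agents working on `qwen3-0.6b-fp8` at the same time get separate directories, while a later step in the same session gets the existing one back with `reused` set to `True`.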
+
+Store outputs (checkpoints, logs) inside the model workspace:
.claude/skills/deployment/SKILL.md
3 additions & 3 deletions
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)
 ### 0. Check workspace (multi-user / Slack bot)

-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:

 ```bash
-ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
+ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
 ```

 If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
 If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
.claude/skills/evaluation/SKILL.md
1 addition & 1 deletion
@@ -14,7 +14,7 @@ You're an expert in NeMo Evaluator Launcher! Guide the user through creating pro
 ### Workspace and Pipeline Integration

-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces in the current session — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.

 This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.
.claude/skills/monitor/SKILL.md
31 additions & 9 deletions
@@ -10,13 +10,31 @@ Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, m
 ## When to use

 1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
-2. **User-initiated** — user asks about a job status, possibly in a new conversation. Check the registry, identify the job, and report.
+2. **User-initiated** — user asks about a job status. Check the current session registry first; if the job is not registered there, use the discovery steps below.

 ---

 ## Job Registry

-All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.
+Active jobs are tracked in per-session registries under `.claude/agents/`.
+This avoids multiple agents clobbering one shared registry when they run at
+the same time.
+
+Use the current agent session id as `<session_id>`:
+
+- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every job registered by that agent
+
+Registry layout:
+
+```text
+.claude/agents/
+  <session_id>/
+    active_jobs.json
+```
+
+Each session's `active_jobs.json` is a JSON array:

 ```json
 [
@@ -27,7 +45,11 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
     "user": "<ssh_user>",
     "submitted": "YYYY-MM-DD HH:MM",
     "description": "<what this job does>",
-    "last_status": "<last known status>"
+    "last_status": "<last known status>",
+    "owner": {
+      "agent": "claude-code|codex|manual",
+      "session_id": "<session_id>"
+    }
   }
 ]
 ```
@@ -40,8 +62,8 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
 Every time a job is submitted (by any skill or manually):

-1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
-2. **Start a durable monitor** (if one isn't already watching the registry) that polls all registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads the registry on each poll, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs, and exits when the registry is empty.
+1. **Add an entry** to `.claude/agents/<session_id>/active_jobs.json`. Create the session directory and file if they don't exist.
+2. **Start a durable monitor** (if one isn't already watching the registry) that polls this session's registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads `.claude/agents/<session_id>/active_jobs.json`, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs from the session registry, and exits when no active jobs remain for this session.

 The monitor should terminate naturally when every registered job has reached a terminal state. If the `Monitor` tool is not available in the current harness, run an equivalent background process that implements the same loop and lets the agent resume/restart when the process exits.
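
The first step above, adding an entry to the session registry, might look roughly like this in Python. The function name `register_job` and its `agents_root` parameter are assumptions for illustration. The entry is passed in as a dict because the full field list is not shown in this diff, and the read-modify-write here is not protected against two writers in the same session.

```python
import json
from pathlib import Path


def register_job(session_id, entry, agents_root=".claude/agents"):
    """Append a job entry to this session's registry and return its path."""
    registry = Path(agents_root) / session_id / "active_jobs.json"
    # Create the session directory and file if they don't exist.
    registry.parent.mkdir(parents=True, exist_ok=True)
    jobs = json.loads(registry.read_text()) if registry.exists() else []
    jobs.append(entry)
    registry.write_text(json.dumps(jobs, indent=2) + "\n")
    return registry
```

Each session writes only under its own `<session_id>` directory, which is what keeps concurrent agents from overwriting each other's entries.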
@@ -53,12 +75,12 @@ Always do both steps. Don't try to predict job duration.
 Whether triggered by monitor output or by the user asking "check status":

-1. **Read the registry** from `.claude/active_jobs.json`
+1. **Read the registry** from `.claude/agents/<session_id>/active_jobs.json`
 2. **Check each job** using the appropriate method (see below)
 3. **Report only state changes** — compare against `last_status` in registry
-4. **Update `last_status`** in the registry
+4. **Update `last_status`** in the session registry
 5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, BOOT_FAIL, DEADLINE)
-6. **If registry is empty** — let the monitor exit
+6. **If no active jobs remain** — let the monitor exit
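
One pass through the six steps above could be sketched in Python as below. This is an illustrative sketch, not the skill's implementation: `poll_once` is a hypothetical name, and `check_status` is a caller-supplied function standing in for whichever status check applies to the job. Change events are reported as `(description, old_status, new_status)` tuples, which is also an assumption.

```python
import json
from pathlib import Path

# Terminal states exactly as listed in step 5 above.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "KILLED", "TIMEOUT",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "BOOT_FAIL", "DEADLINE",
}


def poll_once(registry_path, check_status):
    """Run steps 1-6 once; return (state_changes, registry_is_empty)."""
    path = Path(registry_path)
    if not path.exists():
        return [], True
    jobs = json.loads(path.read_text())
    changes, remaining = [], []
    for job in jobs:
        status = check_status(job)  # hypothetical caller-supplied checker
        previous = job.get("last_status")
        if status != previous:
            # Report only state changes, then record the new status.
            changes.append((job.get("description"), previous, status))
            job["last_status"] = status
        if status not in TERMINAL_STATES:
            remaining.append(job)
    # Terminal jobs are dropped from the session registry.
    path.write_text(json.dumps(remaining, indent=2) + "\n")
    return changes, not remaining
```

A monitor loop would call `poll_once` on an interval, print the returned changes, and exit once the second return value is `True`.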

 ---
@@ -111,7 +133,7 @@ vocabulary of the source you're polling.
 When the user asks about a job without specifying an ID, check in order:

-1. `.claude/active_jobs.json` — most reliable, has context
+1. `.claude/agents/<current_session_id>/active_jobs.json` — current agent's jobs
 2. `nel ls runs --since 1d` — recent NEL runs
 3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
 4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments
.claude/skills/ptq/references/unsupported-models.md
1 addition & 1 deletion
@@ -13,7 +13,7 @@ After download, inspect the model files on the target machine (use `remote_run`
 1. **Read `README.md`** — often lists required transformers versions, dependencies, or `trust_remote_code` requirements
 2. **Check for `modeling_*.py` or `tokenization_*.py`** — custom code shipped with the model. If found, **always use `--trust_remote_code`** with `hf_ptq.py`, and `trust_remote_code=True` in any custom scripts. Without it, `AutoConfig`, `AutoTokenizer`, and `AutoModel` will fail to resolve custom classes.

-Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
+Write custom scripts locally (in `./workspaces/<session_id>/<model>/scripts/`), then sync to remote before running.

 **Check transformers compatibility** (on the target machine):