Commit ffa7558

Scope agent state by session
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
1 parent f360752 commit ffa7558

7 files changed

Lines changed: 108 additions & 67 deletions

.claude/skills/common/remote-execution.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -33,10 +33,10 @@ default_cluster: my-cluster
3333
Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
3434

3535
```bash
36-
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
36+
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
3737
```
3838

39-
Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
39+
Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
4040

4141
See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
4242

@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
118118
Then upload both and submit:
119119

120120
```bash
121-
remote_sync_to /local/scripts/ scripts/
122-
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
121+
remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
122+
JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
123123
```
124124

125125
---

.claude/skills/common/workspace-management.md

Lines changed: 67 additions & 49 deletions
@@ -1,77 +1,95 @@
11
# Workspace Management
22

3-
Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
3+
Organize work by session id and model name so concurrent agents do not
4+
clobber each other, while outputs (checkpoints, logs) stay easy to find and
5+
reuse across PTQ → deploy → eval pipelines within the same session.
46

5-
## Single-user (default)
7+
## Session Workspaces
68

7-
Create a work directory named after the model in the current project:
9+
Use the same `<session_id>` convention as the monitor skill:
810

9-
```bash
10-
mkdir -p ./workspaces/<model-name>
11-
```
12-
13-
Use descriptive names, not timestamps:
14-
15-
```bash
16-
# Good
17-
workspaces/qwen3-0.6b-nvfp4/
18-
workspaces/llama-3.1-8b-fp8/
19-
20-
# Bad
21-
workspaces/ptq-20260318-143022/
22-
workspaces/job-001/
23-
```
24-
25-
Store outputs (checkpoints, logs) inside the workspace:
26-
27-
```bash
28-
workspaces/qwen3-0.6b-nvfp4/
29-
output/ # quantized checkpoint
30-
logs/ # job logs
31-
scripts/ # custom PTQ scripts (if unsupported model)
32-
```
11+
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
12+
- Codex: `$CODEX_THREAD_ID`
13+
- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
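
The lookup order in the list above could be sketched as follows. This is a minimal illustration, not part of the commit: the function name `resolve_session_id` and the fallback variable `AGENT_SESSION_ID` are hypothetical, and the `session_id` field from hook input is not handled here.

```python
import os
import uuid


def resolve_session_id(env=None):
    """Return a session id, preferring the variables set by the agent harness."""
    env = os.environ if env is None else env

    # Variables named in the list above, checked in order.
    for var in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(var, "").strip()
        if value:
            return value

    # Assumption: AGENT_SESSION_ID is a made-up variable for this sketch.
    # The id is generated once and stored in the mapping so that later calls
    # (and child processes, when env is os.environ) reuse the same value.
    fallback = env.get("AGENT_SESSION_ID", "").strip()
    if not fallback:
        fallback = "manual-" + uuid.uuid4().hex[:12]
        env["AGENT_SESSION_ID"] = fallback
    return fallback
```

Calling `resolve_session_id()` once at startup and passing the result to every path-building step keeps local and remote directories aligned for the whole session.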
3314

3415
## When to Reuse vs Create
3516

36-
**Before starting any task**, check for an existing workspace:
17+
**Before starting any task**, check for an existing workspace in the current
18+
session:
3719

3820
```bash
39-
ls ./workspaces/ 2>/dev/null
21+
ls ./workspaces/<session_id>/ 2>/dev/null
4022
```
4123

4224
**Reuse** when:
4325

44-
- Same model (e.g., deploying a model you just quantized)
26+
- The matching model workspace already exists under `./workspaces/<session_id>/`
4527
- Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
4628
- User says "deploy the model I just quantized"
4729

4830
**Create new** when:
4931

50-
- New model not seen before
32+
- No matching model workspace exists under `./workspaces/<session_id>/`
5133
- User explicitly asks for a fresh start
52-
- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
34+
35+
## Model Workspace Names
36+
37+
Within `./workspaces/<session_id>/`, create one model workspace per model or
38+
model variant. Include meaningful variant details in the model workspace name,
39+
for example quantization format or checkpoint role:
40+
41+
```bash
42+
mkdir -p ./workspaces/<session_id>/<model-name>
43+
```
44+
45+
Use descriptive model workspace names, not timestamps:
46+
47+
```text
48+
# Good
49+
workspaces/<session_id>/qwen3-0.6b-nvfp4/
50+
workspaces/<session_id>/qwen3-0.6b-fp8/
51+
workspaces/<session_id>/qwen3-0.6b-baseline/
52+
53+
# Bad
54+
workspaces/<session_id>/ptq-20260318-143022/
55+
workspaces/<session_id>/job-001/
56+
```
57+
58+
Store outputs (checkpoints, logs) inside the model workspace:
59+
60+
```text
61+
workspaces/<session_id>/qwen3-0.6b-nvfp4/
62+
output/ # quantized checkpoint
63+
logs/ # job logs
64+
scripts/ # custom PTQ scripts (if unsupported model)
65+
```
5366

5467
## Remote execution
5568

5669
When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:
5770

58-
- **Local** `./workspaces/<model>/` — write and edit scripts here
59-
- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
71+
- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
72+
- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs
73+
74+
Session-scope newly created remote run directories, logs, response caches,
75+
temporary configs, and output artifacts. Shared read-only or concurrency-safe
76+
caches, such as Hugging Face model caches and prebuilt container image caches,
77+
can remain outside the session directory.
6078

6179
Before running, sync the local ModelOpt source and scripts to the remote workspace:
6280

6381
```bash
6482
# Sync ModelOpt source (first time or after local changes)
65-
remote_sync_to ./ workspaces/<model>/Model-Optimizer/
83+
remote_sync_to ./ <session_id>/<model>/Model-Optimizer/
6684

6785
# Sync custom scripts
68-
remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
86+
remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
6987
```
7088

7189
Download the model on the **remote** machine (avoids transferring large model files):
7290

7391
```bash
74-
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
92+
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
7593
```
7694

7795
Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t
8098

8199
When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:
82100

83-
- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
101+
- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
84102
- `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
85103

86104
To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
89107
rsync -a --quiet \
90108
--exclude .git --exclude __pycache__ --exclude '*.pyc' \
91109
--exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
92-
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
110+
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
93111
```
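
As a rough sketch of the path rule in this section, the helper below picks the workspace root and appends the session and workspace name. The helper name `model_workspace` is an assumption made for illustration; only `MODELOPT_WORKSPACE_ROOT` and the `./workspaces/` default come from the document.

```python
import os
from pathlib import Path


def model_workspace(session_id, name, env=None):
    """Build <root>/<session_id>/<name> for a model workspace."""
    env = os.environ if env is None else env
    # Use MODELOPT_WORKSPACE_ROOT when it is set and non-empty,
    # otherwise fall back to the single-user default ./workspaces.
    root = env.get("MODELOPT_WORKSPACE_ROOT") or "./workspaces"
    return Path(root) / session_id / name
```

For example, `model_workspace("s1", "qwen3-0.6b-nvfp4", {})` yields a path under `workspaces/s1/`, while the same call with `MODELOPT_WORKSPACE_ROOT` set yields a path under that root.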
94112

95113
## Cross-Skill Workspace Flow
96114

97115
Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:
98116

99117
```text
100-
workspaces/model-name-format/
118+
workspaces/<session_id>/model-name-format/
101119
output/ ← PTQ: quantized checkpoint
102120
eval_results/ ← Evaluation: NEL artifacts (results.yml per task)
103121
eval_config.yaml ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/
109127

110128
```text
111129
User: "quantize Qwen3-0.6B with nvfp4"
112-
Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
113-
→ mkdir workspaces/qwen3-0.6b-nvfp4
114-
→ run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
130+
Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
131+
→ mkdir workspaces/<session_id>/qwen3-0.6b-nvfp4
132+
→ run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
115133
116134
User: "deploy the model I just quantized"
117-
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
118-
→ reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
135+
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
136+
→ reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
119137
120138
User: "evaluate the quantized model on MMLU and GSM8K"
121-
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
122-
→ reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
139+
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
140+
→ reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/
123141
124142
User: "now quantize Llama-3.1-8B with fp8"
125-
Agent: ls workspaces/ → no llama
126-
→ mkdir workspaces/llama-3.1-8b-fp8
143+
Agent: ls workspaces/<session_id>/ → no llama
144+
→ mkdir workspaces/<session_id>/llama-3.1-8b-fp8
127145
```

.claude/skills/deployment/SKILL.md

Lines changed: 3 additions & 3 deletions
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)
3838

3939
### 0. Check workspace (multi-user / Slack bot)
4040

41-
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
41+
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:
4242

4343
```bash
44-
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
44+
ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
4545
```
4646

4747
If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
190190
If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
191191

192192
```bash
193-
remote_sync_to <local_checkpoint_path> checkpoints/
193+
remote_sync_to <local_checkpoint_path> <session_id>/<model>/checkpoints/
194194
```
195195

196196
3. **Deploy based on remote environment:**

.claude/skills/evaluation/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ You're an expert in NeMo Evaluator Launcher! Guide the user through creating pro
1414

1515
### Workspace and Pipeline Integration
1616

17-
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
17+
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces in the current session — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
1818

1919
This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.
2020

.claude/skills/monitor/SKILL.md

Lines changed: 31 additions & 9 deletions
@@ -10,13 +10,31 @@ Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, m
1010
## When to use
1111

1212
1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
13-
2. **User-initiated** — user asks about a job status, possibly in a new conversation. Check the registry, identify the job, and report.
13+
2. **User-initiated** — user asks about a job status. Check the current session registry first; if the job is not registered there, use the discovery steps below.
1414

1515
---
1616

1717
## Job Registry
1818

19-
All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.
19+
Active jobs are tracked in per-session registries under `.claude/agents/`.
20+
This avoids multiple agents clobbering one shared registry when they run at
21+
the same time.
22+
23+
Use the current agent session id as `<session_id>`:
24+
25+
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
26+
- Codex: `$CODEX_THREAD_ID`
27+
- If no session id is available, create a stable id for the current terminal session and reuse it for every job registered by that agent
28+
29+
Registry layout:
30+
31+
```text
32+
.claude/agents/
33+
<session_id>/
34+
active_jobs.json
35+
```
36+
37+
Each session's `active_jobs.json` is a JSON array:
2038

2139
```json
2240
[
@@ -27,7 +45,11 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
2745
"user": "<ssh_user>",
2846
"submitted": "YYYY-MM-DD HH:MM",
2947
"description": "<what this job does>",
30-
"last_status": "<last known status>"
48+
"last_status": "<last known status>",
49+
"owner": {
50+
"agent": "claude-code|codex|manual",
51+
"session_id": "<session_id>"
52+
}
3153
}
3254
]
3355
```
@@ -40,8 +62,8 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
4062

4163
Every time a job is submitted (by any skill or manually):
4264

43-
1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
44-
2. **Start a durable monitor** (if one isn't already watching the registry) that polls all registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads the registry on each poll, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs, and exits when the registry is empty.
65+
1. **Add an entry** to `.claude/agents/<session_id>/active_jobs.json`. Create the session directory and file if they don't exist.
66+
2. **Start a durable monitor** (if one isn't already watching the registry) that polls this session's registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads `.claude/agents/<session_id>/active_jobs.json`, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs from the session registry, and exits when no active jobs remain for this session.
4567

4668
The monitor should terminate naturally when every registered job has reached a terminal state. If the `Monitor` tool is not available in the current harness, run an equivalent background process that implements the same loop and lets the agent resume/restart when the process exits.
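
The registration step might look like the sketch below. It is an illustration under stated assumptions: `register_job`, its parameters, and the write-then-rename step are choices made here, not something the skill prescribes. Only the registry path and the `owner` object follow the layout described above.

```python
import json
from pathlib import Path


def register_job(repo_root, session_id, entry, agent="manual"):
    """Append one job entry to the session's active_jobs.json and return its path."""
    registry = Path(repo_root) / ".claude" / "agents" / session_id / "active_jobs.json"
    # Create the session directory if it does not exist yet.
    registry.parent.mkdir(parents=True, exist_ok=True)

    # A missing file is treated as an empty registry.
    jobs = json.loads(registry.read_text()) if registry.exists() else []

    # Attach the owner object described in the registry schema.
    jobs.append(dict(entry, owner={"agent": agent, "session_id": session_id}))

    # Assumption: write to a temporary file and rename it, so a reader
    # never sees a half-written registry.
    tmp = registry.with_name(registry.name + ".tmp")
    tmp.write_text(json.dumps(jobs, indent=2))
    tmp.replace(registry)
    return registry
```

Because each session writes only to its own directory, two agents registering jobs at the same time do not touch the same file.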
4769

@@ -53,12 +75,12 @@ Always do both steps. Don't try to predict job duration.
5375

5476
Whether triggered by monitor output or by the user asking "check status":
5577

56-
1. **Read the registry** from `.claude/active_jobs.json`
78+
1. **Read the registry** from `.claude/agents/<session_id>/active_jobs.json`
5779
2. **Check each job** using the appropriate method (see below)
5880
3. **Report only state changes** — compare against `last_status` in registry
59-
4. **Update `last_status`** in the registry
81+
4. **Update `last_status`** in the session registry
6082
5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, BOOT_FAIL, DEADLINE)
61-
6. **If registry is empty** — let the monitor exit
83+
6. **If no active jobs remain** — let the monitor exit
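
One way to picture steps 2 through 6 is the pure function below. It is a sketch only: `apply_statuses` and the `check_status` callback are hypothetical names, and the callback stands in for whichever polling method applies to a job. The terminal states are the ones listed in step 5.

```python
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "KILLED", "TIMEOUT",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "BOOT_FAIL", "DEADLINE",
}


def apply_statuses(jobs, check_status):
    """Return (remaining_jobs, changes) after one polling pass.

    jobs is the list loaded from the session registry; check_status is a
    caller-supplied function that maps a job entry to its current status.
    """
    remaining, changes = [], []
    for job in jobs:
        status = check_status(job)
        previous = job.get("last_status")
        # Report only state changes, compared against last_status.
        if status != previous:
            changes.append((job.get("description"), previous, status))
        # Jobs in a terminal state are dropped from the registry.
        if status in TERMINAL_STATES:
            continue
        remaining.append(dict(job, last_status=status))
    return remaining, changes
```

The caller would write `remaining` back to the session registry and let the monitor exit once that list is empty.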
6284

6385
---
6486

@@ -111,7 +133,7 @@ vocabulary of the source you're polling.
111133

112134
When the user asks about a job without specifying an ID, check in order:
113135

114-
1. `.claude/active_jobs.json` - most reliable, has context
136+
1. `.claude/agents/<current_session_id>/active_jobs.json` - current agent's jobs
115137
2. `nel ls runs --since 1d` — recent NEL runs
116138
3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
117139
4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ After download, inspect the model files on the target machine (use `remote_run`
1313
1. **Read `README.md`** — often lists required transformers versions, dependencies, or `trust_remote_code` requirements
1414
2. **Check for `modeling_*.py` or `tokenization_*.py`** — custom code shipped with the model. If found, **always use `--trust_remote_code`** with `hf_ptq.py`, and `trust_remote_code=True` in any custom scripts. Without it, `AutoConfig`, `AutoTokenizer`, and `AutoModel` will fail to resolve custom classes.
1515

16-
Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
16+
Write custom scripts locally (in `./workspaces/<session_id>/<model>/scripts/`), then sync to remote before running.
1717

1818
**Check transformers compatibility** (on the target machine):
1919

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ venv/
6161

6262
# Ignore claude local settings
6363
.claude/settings.local.json
64+
.claude/agents/
6465
CLAUDE.local.md
6566
AGENTS.override.md
6667
