Scope agent state by session

chadvoegele · chadvoegele · commit ffa7558aa516 · 2026-05-20T09:44:26.000-05:00
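
The path convention applied throughout this change can be summarized with a small sketch. This is illustrative only and is not code from the commit: the function names `resolve_session_id`, `model_workspace`, and `registry_path` are hypothetical, and the `manual-$$` fallback is an assumption about how a "stable id for the current terminal session" might be produced.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; none of these names exist in the repository.

# Pick the session id: Claude Code first, then Codex, then a fallback.
# Assumption: the default fallback "manual-$$" is only stable within one
# shell process; an agent that spawns a new shell per command would need
# to pass its own persisted id as the first argument.
resolve_session_id() {
  local fallback="${1:-manual-$$}"
  printf '%s\n' "${CLAUDE_CODE_SESSION_ID:-${CODEX_THREAD_ID:-$fallback}}"
}

# Model workspace: <root>/<session_id>/<model>.
# The root is MODELOPT_WORKSPACE_ROOT when set, otherwise ./workspaces.
model_workspace() {
  local session_id="$1" model="$2"
  printf '%s/%s/%s\n' "${MODELOPT_WORKSPACE_ROOT:-./workspaces}" "$session_id" "$model"
}

# Per-session job registry used by the monitor skill.
registry_path() {
  printf '.claude/agents/%s/active_jobs.json\n' "$1"
}
```

With these, `model_workspace "$(resolve_session_id)" qwen3-0.6b-nvfp4` would yield the same directory on every call made by one agent, which is what lets the PTQ, deploy, and eval steps find each other's outputs without colliding with another session.
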
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md
@@ -33,10 +33,10 @@ default_cluster: my-cluster
 Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
 
 ```bash
-rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
+rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
 ```
 
-Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
+Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
 
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
 
@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
 Then upload both and submit:
 
 ```bash
-remote_sync_to /local/scripts/ scripts/
-JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
+remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
+JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
 ```
 
 ---
diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md
@@ -1,77 +1,95 @@
 # Workspace Management
 
-Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
+Organize work by session id and model name so concurrent agents do not
+clobber each other, while outputs (checkpoints, logs) stay easy to find and
+reuse across PTQ → deploy → eval pipelines within the same session.
 
-## Single-user (default)
+## Session Workspaces
 
-Create a work directory named after the model in the current project:
+Use the same `<session_id>` convention as the monitor skill:
 
-```bash
-mkdir -p ./workspaces/<model-name>
-```
-
-Use descriptive names, not timestamps:
-
-```bash
-# Good
-workspaces/qwen3-0.6b-nvfp4/
-workspaces/llama-3.1-8b-fp8/
-
-# Bad
-workspaces/ptq-20260318-143022/
-workspaces/job-001/
-```
-
-Store outputs (checkpoints, logs) inside the workspace:
-
-```bash
-workspaces/qwen3-0.6b-nvfp4/
-  output/          # quantized checkpoint
-  logs/            # job logs
-  scripts/         # custom PTQ scripts (if unsupported model)
-```
+- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
 
 ## When to Reuse vs Create
 
-**Before starting any task**, check for an existing workspace:
+**Before starting any task**, check for an existing workspace in the current
+session:
 
 ```bash
-ls ./workspaces/ 2>/dev/null
+ls ./workspaces/<session_id>/ 2>/dev/null
 ```
 
 **Reuse** when:
 
-- Same model (e.g., deploying a model you just quantized)
+- The matching model workspace already exists under `./workspaces/<session_id>/`
 - Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
 - User says "deploy the model I just quantized"
 
 **Create new** when:
 
-- New model not seen before
+- No matching model workspace exists under `./workspaces/<session_id>/`
 - User explicitly asks for a fresh start
-- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
+
+## Model Workspace Names
+
+Within `./workspaces/<session_id>/`, create one model workspace per model or
+model variant. Include meaningful variant details in the model workspace name,
+for example quantization format or checkpoint role:
+
+```bash
+mkdir -p ./workspaces/<session_id>/<model-name>
+```
+
+Use descriptive model workspace names, not timestamps:
+
+```text
+# Good
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+workspaces/<session_id>/qwen3-0.6b-fp8/
+workspaces/<session_id>/qwen3-0.6b-baseline/
+
+# Bad
+workspaces/<session_id>/ptq-20260318-143022/
+workspaces/<session_id>/job-001/
+```
+
+Store outputs (checkpoints, logs) inside the model workspace:
+
+```text
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+  output/          # quantized checkpoint
+  logs/            # job logs
+  scripts/         # custom PTQ scripts (if unsupported model)
+```
 
 ## Remote execution
 
 When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:
 
-- **Local** `./workspaces/<model>/` — write and edit scripts here
-- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
+- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
+- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs
+
+Session-scope newly created remote run directories, logs, response caches,
+temporary configs, and output artifacts. Shared read-only or concurrency-safe
+caches, such as Hugging Face model caches and prebuilt container image caches,
+can remain outside the session directory.
 
 Before running, sync the local ModelOpt source and scripts to the remote workspace:
 
 ```bash
 # Sync ModelOpt source (first time or after local changes)
-remote_sync_to ./ workspaces/<model>/Model-Optimizer/
+remote_sync_to ./ <session_id>/<model>/Model-Optimizer/
 
 # Sync custom scripts
-remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
+remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
 ```
 
 Download the model on the **remote** machine (avoids transferring large model files):
 
 ```bash
-remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
+remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
 ```
 
 Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t
 
 When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:
 
-- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
+- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
 - `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
 
 To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
 rsync -a --quiet \
     --exclude .git --exclude __pycache__ --exclude '*.pyc' \
     --exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
-    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
+    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
 ```
 
 ## Cross-Skill Workspace Flow
 
 Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:
 
 ```text
-workspaces/model-name-format/
+workspaces/<session_id>/model-name-format/
   output/              ← PTQ: quantized checkpoint
   eval_results/        ← Evaluation: NEL artifacts (results.yml per task)
   eval_config.yaml     ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/
 
 ```text
 User: "quantize Qwen3-0.6B with nvfp4"
-Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
-       → mkdir workspaces/qwen3-0.6b-nvfp4
-       → run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
+       → mkdir workspaces/<session_id>/qwen3-0.6b-nvfp4
+       → run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "deploy the model I just quantized"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "evaluate the quantized model on MMLU and GSM8K"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/
 
 User: "now quantize Llama-3.1-8B with fp8"
-Agent: ls workspaces/ → no llama
-       → mkdir workspaces/llama-3.1-8b-fp8
+Agent: ls workspaces/<session_id>/ → no llama
+       → mkdir workspaces/<session_id>/llama-3.1-8b-fp8
 ```
diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)
 
 ### 0. Check workspace (multi-user / Slack bot)
 
-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:
 
 ```bash
-ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
+ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
 ```
 
 If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
    If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
 
    ```bash
-   remote_sync_to <local_checkpoint_path> checkpoints/
+   remote_sync_to <local_checkpoint_path> <session_id>/<model>/checkpoints/
    ```
 
 3. **Deploy based on remote environment:**
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -14,7 +14,7 @@ You're an expert in NeMo Evaluator Launcher! Guide the user through creating pro
 
 ### Workspace and Pipeline Integration
 
-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces in the current session — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
 
 This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.
 
diff --git a/.claude/skills/monitor/SKILL.md b/.claude/skills/monitor/SKILL.md
@@ -10,13 +10,31 @@ Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, m
 ## When to use
 
 1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
-2. **User-initiated** — user asks about a job status, possibly in a new conversation. Check the registry, identify the job, and report.
+2. **User-initiated** — user asks about a job status. Check the current session registry first; if the job is not registered there, use the discovery steps below.
 
 ---
 
 ## Job Registry
 
-All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.
+Active jobs are tracked in per-session registries under `.claude/agents/`.
+This avoids multiple agents clobbering one shared registry when they run at
+the same time.
+
+Use the current agent session id as `<session_id>`:
+
+- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every job registered by that agent
+
+Registry layout:
+
+```text
+.claude/agents/
+  <session_id>/
+    active_jobs.json
+```
+
+Each session's `active_jobs.json` is a JSON array:
 
 ```json
 [
@@ -27,7 +45,11 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
     "user": "<ssh_user>",
     "submitted": "YYYY-MM-DD HH:MM",
     "description": "<what this job does>",
-    "last_status": "<last known status>"
+    "last_status": "<last known status>",
+    "owner": {
+      "agent": "claude-code|codex|manual",
+      "session_id": "<session_id>"
+    }
   }
 ]
 ```
@@ -40,8 +62,8 @@ All active jobs are tracked in `.claude/active_jobs.json`. This file is the sing
 
 Every time a job is submitted (by any skill or manually):
 
-1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
-2. **Start a durable monitor** (if one isn't already watching the registry) that polls all registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads the registry on each poll, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs, and exits when the registry is empty.
+1. **Add an entry** to `.claude/agents/<session_id>/active_jobs.json`. Create the session directory and file if they don't exist.
+2. **Start a durable monitor** (if one isn't already watching the registry) that polls this session's registered jobs until they reach terminal status. Prefer the Claude Code `Monitor` tool when it is available: write a small watcher that reads `.claude/agents/<session_id>/active_jobs.json`, checks every job with the appropriate method below, prints state-change events, updates `last_status`, removes terminal jobs from the session registry, and exits when no active jobs remain for this session.
 
 The monitor should terminate naturally when every registered job has reached a terminal state. If the `Monitor` tool is not available in the current harness, run an equivalent background process that implements the same loop and lets the agent resume/restart when the process exits.
 
@@ -53,12 +75,12 @@ Always do both steps. Don't try to predict job duration.
 
 Whether triggered by monitor output or by the user asking "check status":
 
-1. **Read the registry** from `.claude/active_jobs.json`
+1. **Read the registry** from `.claude/agents/<session_id>/active_jobs.json`
 2. **Check each job** using the appropriate method (see below)
 3. **Report only state changes** — compare against `last_status` in registry
-4. **Update `last_status`** in the registry
+4. **Update `last_status`** in the session registry
 5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, BOOT_FAIL, DEADLINE)
-6. **If registry is empty** — let the monitor exit
+6. **If no active jobs remain** — let the monitor exit
 
 ---
 
@@ -111,7 +133,7 @@ vocabulary of the source you're polling.
 
 When the user asks about a job without specifying an ID, check in order:
 
-1. `.claude/active_jobs.json` — most reliable, has context
+1. `.claude/agents/<current_session_id>/active_jobs.json` — current agent's jobs
 2. `nel ls runs --since 1d` — recent NEL runs
 3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
 4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments
diff --git a/.claude/skills/ptq/references/unsupported-models.md b/.claude/skills/ptq/references/unsupported-models.md
@@ -13,7 +13,7 @@ After download, inspect the model files on the target machine (use `remote_run`
 1. **Read `README.md`** — often lists required transformers versions, dependencies, or `trust_remote_code` requirements
 2. **Check for `modeling_*.py` or `tokenization_*.py`** — custom code shipped with the model. If found, **always use `--trust_remote_code`** with `hf_ptq.py`, and `trust_remote_code=True` in any custom scripts. Without it, `AutoConfig`, `AutoTokenizer`, and `AutoModel` will fail to resolve custom classes.
 
-Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
+Write custom scripts locally (in `./workspaces/<session_id>/<model>/scripts/`), then sync to remote before running.
 
 **Check transformers compatibility** (on the target machine):
 
diff --git a/.gitignore b/.gitignore
@@ -61,6 +61,7 @@ venv/
 
 # Ignore claude local settings
 .claude/settings.local.json
+.claude/agents/
 CLAUDE.local.md
 AGENTS.override.md