diff --git a/.claude/skills/common/environment-setup.md b/.claude/skills/common/environment-setup.md
Lines changed: 2 additions & 0 deletions
diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md
Lines changed: 4 additions & 4 deletions
diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md
Lines changed: 67 additions & 49 deletions
diff --git a/.claude/skills/compare-results/SKILL.md b/.claude/skills/compare-results/SKILL.md
Lines changed: 73 additions & 0 deletions
diff --git a/.claude/skills/compare-results/tests/evals.json b/.claude/skills/compare-results/tests/evals.json
Lines changed: 30 additions & 0 deletions
diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
Lines changed: 3 additions & 3 deletions
@@ -29,6 +29,8 @@ cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>
 
 If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.
 
+If the cluster config contains multiple clusters and the user did not name the target cluster, ask which cluster to use before calling `remote_load_cluster`. Do not silently fall back to `default_cluster` in multi-cluster configs; different clusters can have different filesystems, GPU types, auth paths, and SSH setup.
+
 For remote, connect:
 
 ```bash
 
@@ -33,10 +33,10 @@ default_cluster: my-cluster
 Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
 
 ```bash
-rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
+rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
 ```
 
-Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
+Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
 
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
 
@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
 Then upload both and submit:
 
 ```bash
-remote_sync_to /local/scripts/ scripts/
-JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
+remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
+JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
 ```
 
 ---
 
@@ -1,77 +1,95 @@
 # Workspace Management
 
-Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
+Organize work by session id and model name so concurrent agents do not
+clobber each other, while outputs (checkpoints, logs) stay easy to find and
+reuse across PTQ → deploy → eval pipelines within the same session.
 
-## Single-user (default)
+## Session Workspaces
 
-Create a work directory named after the model in the current project:
+Use the same `<session_id>` convention as the monitor skill:
 
-```bash
-mkdir -p ./workspaces/<model-name>
-```
-
-Use descriptive names, not timestamps:
-
-```bash
-# Good
-workspaces/qwen3-0.6b-nvfp4/
-workspaces/llama-3.1-8b-fp8/
-
-# Bad
-workspaces/ptq-20260318-143022/
-workspaces/job-001/
-```
-
-Store outputs (checkpoints, logs) inside the workspace:
-
-```bash
-workspaces/qwen3-0.6b-nvfp4/
-  output/          # quantized checkpoint
-  logs/            # job logs
-  scripts/         # custom PTQ scripts (if unsupported model)
-```
+- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
 
 ## When to Reuse vs Create
 
-**Before starting any task**, check for an existing workspace:
+**Before starting any task**, check for an existing workspace in the current
+session:
 
 ```bash
-ls ./workspaces/ 2>/dev/null
+ls ./workspaces/<session_id>/ 2>/dev/null
 ```
 
 **Reuse** when:
 
-- Same model (e.g., deploying a model you just quantized)
+- The matching model workspace already exists under `./workspaces/<session_id>/`
 - Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
 - User says "deploy the model I just quantized"
 
 **Create new** when:
 
-- New model not seen before
+- No matching model workspace exists under `./workspaces/<session_id>/`
 - User explicitly asks for a fresh start
-- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
+
+## Model Workspace Names
+
+Within `./workspaces/<session_id>/`, create one model workspace per model or
+model variant. Include meaningful variant details in the model workspace name,
+for example quantization format or checkpoint role:
+
+```bash
+mkdir -p ./workspaces/<session_id>/<model-name>
+```
+
+Use descriptive model workspace names, not timestamps:
+
+```text
+# Good
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+workspaces/<session_id>/qwen3-0.6b-fp8/
+workspaces/<session_id>/qwen3-0.6b-baseline/
+
+# Bad
+workspaces/<session_id>/ptq-20260318-143022/
+workspaces/<session_id>/job-001/
+```
+
+Store outputs (checkpoints, logs) inside the model workspace:
+
+```text
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+  output/          # quantized checkpoint
+  logs/            # job logs
+  scripts/         # custom PTQ scripts (if unsupported model)
+```
 
 ## Remote execution
 
 When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:
 
-- **Local** `./workspaces/<model>/` — write and edit scripts here
-- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
+- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
+- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs
+
+Scope newly created remote run directories, logs, response caches, temporary
+configs, and output artifacts to the session directory. Shared read-only or
+concurrency-safe caches, such as Hugging Face model caches and prebuilt
+container image caches, can remain outside it.
 
 Before running, sync the local ModelOpt source and scripts to the remote workspace:
 
 ```bash
 # Sync ModelOpt source (first time or after local changes)
-remote_sync_to ./ workspaces/<model>/Model-Optimizer/
+remote_sync_to ./ <session_id>/<model>/Model-Optimizer/
 
 # Sync custom scripts
-remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
+remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
 ```
 
 Download the model on the **remote** machine (avoids transferring large model files):
 
 ```bash
-remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
+remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
 ```
 
 Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t
 
 When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:
 
-- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
+- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
 - `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
 
 To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
 rsync -a --quiet \
     --exclude .git --exclude __pycache__ --exclude '*.pyc' \
     --exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
-    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
+    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
 ```
 
 ## Cross-Skill Workspace Flow
 
 Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:
 
 ```text
-workspaces/model-name-format/
+workspaces/<session_id>/model-name-format/
   output/              ← PTQ: quantized checkpoint
   eval_results/        ← Evaluation: NEL artifacts (results.yml per task)
   eval_config.yaml     ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/
 
 ```text
 User: "quantize Qwen3-0.6B with nvfp4"
-Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
-       → mkdir workspaces/qwen3-0.6b-nvfp4
-       → run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
+       → mkdir workspaces/<session_id>/qwen3-0.6b-nvfp4
+       → run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "deploy the model I just quantized"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "evaluate the quantized model on MMLU and GSM8K"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/
 
 User: "now quantize Llama-3.1-8B with fp8"
-Agent: ls workspaces/ → no llama
-       → mkdir workspaces/llama-3.1-8b-fp8
+Agent: ls workspaces/<session_id>/ → no llama
+       → mkdir workspaces/<session_id>/llama-3.1-8b-fp8
 ```
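
Reviewer sketch (not part of the patch): the session-scoped layout described in this file could be resolved programmatically along the following lines. The helper names, the precedence of `CLAUDE_CODE_SESSION_ID` over `CODEX_THREAD_ID`, and the `local-session` fallback id are assumptions made for illustration; the patch only states which variables carry the session id and that `MODELOPT_WORKSPACE_ROOT` replaces `./workspaces/` when set.

```python
from pathlib import Path


def resolve_session_id(env, fallback):
    """Return the first non-empty session id found in env, else the fallback."""
    # Assumed precedence: Claude Code first, then Codex.
    for var in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(var, "").strip()
        if value:
            return value
    return fallback


def model_workspace(env, model_name, fallback_session="local-session"):
    """Build <root>/<session_id>/<model_name>, honoring MODELOPT_WORKSPACE_ROOT."""
    # "local-session" is a hypothetical stable id for when no agent id is set.
    root = Path(env.get("MODELOPT_WORKSPACE_ROOT") or "./workspaces")
    return root / resolve_session_id(env, fallback_session) / model_name
```

Calling `model_workspace(os.environ, "qwen3-0.6b-nvfp4")` and then testing `is_dir()` on the result would give the reuse-versus-create decision; the functions take the environment as a plain mapping so they stay free of side effects.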
@@ -0,0 +1,73 @@
+---
+name: compare-results
+description: Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).
+license: Apache-2.0
+---
+
+# Compare Results
+
+Use this skill to plan and complete a baseline-vs-candidate comparison. The baseline
+is the reference checkpoint, and the candidate is the checkpoint whose accuracy
+change is being measured, typically a further quantized version of the baseline.
+
+## Workflow
+
+1. Establish the candidate checkpoint/run and the matching baseline. Infer the
+   baseline from the PTQ source model/checkpoint in the workspace or config used
+   to create the candidate. If it cannot be inferred, ask the user for the
+   baseline checkpoint or an existing baseline invocation/run path.
+2. If a required baseline or candidate evaluation is missing, delegate to the
+   evaluation skill to create, run, and verify it. The companion evaluation
+   config should match benchmark versions, task configs, serving args, token
+   limits, dataset setup, credentials, cluster, and container as closely as
+   possible; change only the model/checkpoint and checkpoint-specific serving or
+   quantization flags.
+3. Fetch the baseline and candidate task list, configs, score artifacts, and
+   logs. If the user provides MLflow runs or invocation IDs, use the
+   accessing-mlflow skill to fetch configs and artifacts.
+4. Confirm each run passed evaluation Step 9, "Verify completed evaluation run",
+   before comparing scores. If not, validate logs, server health,
+   judge/code-execution status, sample accounting, and reasoning parsing before
+   computing deltas.
+5. For each task, use the canonical score field from the matching
+   `.claude/skills/evaluation/recipes/tasks/<task>.md` Score Extraction
+   section.
+6. Compute exact deltas outside the chat context when there are multiple tasks
+   or repeated runs.
+7. Report comparability and quantized-feasibility verdicts before interpreting
+   the delta as model quality. If the user did not provide an acceptance
+   threshold, report feasibility as inconclusive instead of inventing one.
+
+## Comparability Checklist
+
+Before treating a baseline-vs-quantized delta as a model quality result, verify
+the validated runs are comparable:
+
+1. Prompt text, system prompt, chat template, and rendered messages match.
+2. Task name, benchmark version, dataset split, container, harness, and task
+   fragment match.
+3. Generation settings match, including temperature, top_p, top_k, max tokens,
+   stop strings, chat-template kwargs, reasoning mode/budget, and task-specific
+   overrides.
+4. Reasoning traces are enabled, disabled, parsed, stripped, or ignored
+   consistently between runs.
+5. The number of evaluated and scored samples/repeats matches for each task and
+   split.
+6. Judge-backed or simulator-backed tasks use the same judge/user model,
+   endpoint class, prompt, and scoring config.
+7. The same accuracy metric and score field are used for both runs.
+
+If any item differs, either rerun with matched settings or label the result as
+not an apples-to-apples quantization comparison.
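
Reviewer sketch (not part of the patch): one way to mechanize the generation-settings item of the checklist above is to diff the two run configs over a fixed set of keys. The key names and the flat dictionary shape are hypothetical; real NEL configs may name or nest these settings differently.

```python
def differing_settings(baseline, candidate, keys):
    """Return {key: (baseline_value, candidate_value)} for keys whose values differ."""
    diffs = {}
    for key in keys:
        base, cand = baseline.get(key), candidate.get(key)
        if base != cand:
            diffs[key] = (base, cand)
    return diffs


# Hypothetical key names and values, for illustration only.
GENERATION_KEYS = ("temperature", "top_p", "top_k", "max_tokens", "stop")

baseline_cfg = {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 4096, "stop": None}
candidate_cfg = {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 2048, "stop": None}

diffs = differing_settings(baseline_cfg, candidate_cfg, GENERATION_KEYS)
verdict = "comparable" if not diffs else "not comparable"
print(verdict, diffs)
```

A key absent from both configs compares as equal here, so a stricter check might also require every key to be present in each run.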
+
+## Report Format
+
+Include:
+
+- Baseline and candidate identifiers.
+- Per-task metric path, baseline score, candidate score, delta, and stderr if
+  available.
+- Comparability status for prompt/template, generation settings, sample counts,
+  reasoning handling, judge/simulator setup, and score field.
+- Comparability verdict: comparable, not comparable, or inconclusive.
+- Quantization feasibility verdict: acceptable, not acceptable, or inconclusive.
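
Reviewer sketch (not part of the patch): workflow step 6 asks for exact deltas computed outside the chat context. A minimal version, assuming scores have already been extracted into task-to-score mappings, might look like this. The function names and the `max_drop` threshold parameter are hypothetical. Treating missing tasks as inconclusive is also an assumption; the skill only specifies the no-threshold case.

```python
from decimal import Decimal


def compare_scores(baseline, candidate):
    """Return per-task delta rows for shared tasks, plus tasks present in only one run."""
    shared = sorted(set(baseline) & set(candidate))
    missing = sorted(set(baseline) ^ set(candidate))
    rows = []
    for task in shared:
        # Decimal(str(x)) keeps the printed score exact and avoids float rounding noise.
        base = Decimal(str(baseline[task]))
        cand = Decimal(str(candidate[task]))
        rows.append({"task": task, "baseline": base, "candidate": cand, "delta": cand - base})
    return rows, missing


def feasibility(rows, missing, max_drop=None):
    """Return 'acceptable', 'not acceptable', or 'inconclusive'."""
    # Without a user-supplied threshold the verdict stays inconclusive.
    if max_drop is None or missing or not rows:
        return "inconclusive"
    worst = min(row["delta"] for row in rows)
    return "acceptable" if worst >= -Decimal(str(max_drop)) else "not acceptable"
```

Decimal arithmetic is used so that a reported delta such as 0.705 minus 0.712 comes out as exactly -0.007 rather than a rounded float.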
@@ -0,0 +1,30 @@
+[
+  {
+    "name": "baseline-vs-quantized-delta",
+    "skills": ["compare-results"],
+    "query": "Compare the baseline and quantized NEL runs and tell me the accuracy drop",
+    "files": [],
+    "expected_behavior": [
+      "Establishes the baseline and candidate before computing deltas",
+      "Identifies baseline and candidate run artifacts before computing deltas",
+      "Checks that both runs were validated before accepting scores",
+      "Uses task recipe Score Extraction sections for canonical score fields",
+      "Verifies prompt/template, generation settings, reasoning handling, sample counts, judge/simulator setup, and score field comparability",
+      "Computes exact per-task deltas outside the chat context when there are multiple tasks or repeated runs",
+      "Labels the result as not apples-to-apples if comparability checks fail",
+      "Reports quantization feasibility as inconclusive when no acceptance threshold is provided"
+    ]
+  },
+  {
+    "name": "missing-baseline",
+    "skills": ["compare-results", "evaluation"],
+    "query": "Is this quantized eval good enough if I only have the quantized run?",
+    "files": [],
+    "expected_behavior": [
+      "Does not treat a standalone quantized score as a release-ready delta",
+      "Establishes the matching baseline from the PTQ source model/checkpoint or asks the user for one",
+      "Delegates missing baseline or candidate evaluations to the evaluation skill",
+      "Explains that the baseline config must match task versions, serving args, dataset setup, credentials, cluster, and container except for the checkpoint/model"
+    ]
+  }
+]
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)
 
 ### 0. Check workspace (multi-user / Slack bot)
 
-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:
 
 ```bash
-ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
+ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
 ```
 
 If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
    If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
 
    ```bash
-   remote_sync_to <local_checkpoint_path> checkpoints/
+   remote_sync_to <local_checkpoint_path> <session_id>/<model>/checkpoints/
    ```
 
 3. **Deploy based on remote environment:**