NVIDIA · chadvoegele · May 8, 2026 · May 11, 2026
diff --git a/.claude/skills/common/environment-setup.md b/.claude/skills/common/environment-setup.md
@@ -29,6 +29,8 @@ cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>
 
 If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.
 
+If the cluster config contains multiple clusters and the user did not name the target cluster, ask which cluster to use before calling `remote_load_cluster`. Do not silently fall back to `default_cluster` in multi-cluster configs; different clusters can have different filesystems, GPU types, auth paths, and SSH setup.
+
 For remote, connect:
 
 ```bash

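The multi-cluster rule added to `environment-setup.md` above can be sketched in Python. This is only an illustration, not the skill's implementation: the `choose_cluster` helper, the `ClusterChoiceRequired` exception, and the assumption that the parsed config exposes a `clusters` mapping keyed by cluster name are all hypothetical. Only `default_cluster` and `remote_load_cluster` appear in the source.

```python
from typing import Mapping, Optional


class ClusterChoiceRequired(Exception):
    """Raised when the agent must ask the user which cluster to use."""


def choose_cluster(config: Mapping, requested: Optional[str] = None) -> Optional[str]:
    """Return the cluster name to load, or None for local execution.

    Assumption: `config` is the parsed clusters.yaml and holds a `clusters`
    mapping keyed by cluster name (hypothetical schema).
    """
    clusters = config.get("clusters") or {}
    if not clusters:
        return None  # no cluster config content: local execution
    if requested is not None:
        if requested not in clusters:
            raise KeyError(f"unknown cluster: {requested!r}")
        return requested
    if len(clusters) == 1:
        # Assumption: a single configured cluster is unambiguous.
        return next(iter(clusters))
    # Several clusters and none named: do not fall back to default_cluster.
    names = ", ".join(sorted(clusters))
    raise ClusterChoiceRequired(f"multiple clusters configured ({names}); ask the user")
```

A caller would catch `ClusterChoiceRequired`, ask the user, and only then pass the chosen name on to `remote_load_cluster`.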
diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md
@@ -33,10 +33,10 @@ default_cluster: my-cluster
 Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
 
 ```bash
-rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
+rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
 ```
 
-Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
+Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
 
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
 
@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
 Then upload both and submit:
 
 ```bash
-remote_sync_to /local/scripts/ scripts/
-JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
+remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
+JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
 ```
 
 ---
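
The `grep -o '[0-9]\+' | tail -1` pipeline in the hunk above keeps the last run of digits in the `sbatch` output. A rough Python equivalent is sketched below, assuming the output has the usual `Submitted batch job <id>` shape. The `parse_job_id` name is hypothetical and is not part of the remote helpers.

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Return the last run of digits in the output, like grep -o | tail -1."""
    matches = re.findall(r"[0-9]+", sbatch_output)
    if not matches:
        # The shell pipeline would yield an empty JOBID here; fail loudly instead.
        raise ValueError(f"no job id found in sbatch output: {sbatch_output!r}")
    return matches[-1]
```

Raising on an empty match is a deliberate difference from the shell version, which would silently leave `JOBID` empty if submission failed.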

diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md
@@ -1,77 +1,95 @@
 # Workspace Management
 
-Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
+Organize work by session id and model name so concurrent agents do not
+clobber each other's files, while outputs (checkpoints, logs) stay easy to
+find and reuse across PTQ → deploy → eval pipelines within the same session.
 
-## Single-user (default)
+## Session Workspaces
 
-Create a work directory named after the model in the current project:
+Use the same `<session_id>` convention as the monitor skill:
 
-```bash
-mkdir -p ./workspaces/<model-name>
-```
-
-Use descriptive names, not timestamps:
-
-```bash
-# Good
-workspaces/qwen3-0.6b-nvfp4/
-workspaces/llama-3.1-8b-fp8/
-
-# Bad
-workspaces/ptq-20260318-143022/
-workspaces/job-001/
-```
-
-Store outputs (checkpoints, logs) inside the workspace:
-
-```bash
-workspaces/qwen3-0.6b-nvfp4/
-  output/          # quantized checkpoint
-  logs/            # job logs
-  scripts/         # custom PTQ scripts (if unsupported model)
-```
+- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
 
 ## When to Reuse vs Create
 
-**Before starting any task**, check for an existing workspace:
+**Before starting any task**, check for an existing workspace in the current
+session:
 
 ```bash
-ls ./workspaces/ 2>/dev/null
+ls ./workspaces/<session_id>/ 2>/dev/null
 ```
 
 **Reuse** when:
 
-- Same model (e.g., deploying a model you just quantized)
+- The matching model workspace already exists under `./workspaces/<session_id>/`
 - Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
 - User says "deploy the model I just quantized"
 
 **Create new** when:
 
-- New model not seen before
+- No matching model workspace exists under `./workspaces/<session_id>/`
 - User explicitly asks for a fresh start
-- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
+
+## Model Workspace Names
+
+Within `./workspaces/<session_id>/`, create one model workspace per model or
+model variant. Include meaningful variant details in the model workspace name,
+for example quantization format or checkpoint role:
+
+```bash
+mkdir -p ./workspaces/<session_id>/<model-name>
+```
+
+Use descriptive model workspace names, not timestamps:
+
+```text
+# Good
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+workspaces/<session_id>/qwen3-0.6b-fp8/
+workspaces/<session_id>/qwen3-0.6b-baseline/
+
+# Bad
+workspaces/<session_id>/ptq-20260318-143022/
+workspaces/<session_id>/job-001/
+```
+
+Store outputs (checkpoints, logs) inside the model workspace:
+
+```text
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+  output/          # quantized checkpoint
+  logs/            # job logs
+  scripts/         # custom PTQ scripts (if unsupported model)
+```
 
 ## Remote execution
 
 When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:
 
-- **Local** `./workspaces/<model>/` — write and edit scripts here
-- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
+- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
+- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs
+
+Keep newly created remote run directories, logs, response caches,
+temporary configs, and output artifacts under the session directory. Shared
+read-only or concurrency-safe caches, such as Hugging Face model caches and
+prebuilt container image caches, can remain outside it.
 
 Before running, sync the local ModelOpt source and scripts to the remote workspace:
 
 ```bash
 # Sync ModelOpt source (first time or after local changes)
-remote_sync_to ./ workspaces/<model>/Model-Optimizer/
+remote_sync_to ./ <session_id>/<model>/Model-Optimizer/
 
 # Sync custom scripts
-remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
+remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
 ```
 
 Download the model on the **remote** machine (avoids transferring large model files):
 
 ```bash
-remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
+remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
 ```
 
 Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t
 
 When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:
 
-- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
+- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
 - `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
 
 To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
 rsync -a --quiet \
     --exclude .git --exclude __pycache__ --exclude '*.pyc' \
     --exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
-    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
+    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
 ```
 
 ## Cross-Skill Workspace Flow
 
 Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:
 
 ```text
-workspaces/model-name-format/
+workspaces/<session_id>/model-name-format/
   output/              ← PTQ: quantized checkpoint
   eval_results/        ← Evaluation: NEL artifacts (results.yml per task)
   eval_config.yaml     ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/
 
 ```text
 User: "quantize Qwen3-0.6B with nvfp4"
-Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
-       → mkdir workspaces/qwen3-0.6b-nvfp4
-       → run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
+       → mkdir workspaces/<session_id>/qwen3-0.6b-nvfp4
+       → run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "deploy the model I just quantized"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
 
 User: "evaluate the quantized model on MMLU and GSM8K"
-Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
-       → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
+Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
+       → reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/
 
 User: "now quantize Llama-3.1-8B with fp8"
-Agent: ls workspaces/ → no llama
-       → mkdir workspaces/llama-3.1-8b-fp8
+Agent: ls workspaces/<session_id>/ → no llama
+       → mkdir workspaces/<session_id>/llama-3.1-8b-fp8
 ```
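
The session-scoped layout introduced in `workspace-management.md` above can be sketched as a few path helpers. This is a minimal sketch under stated assumptions: the function names are hypothetical, the lookup order between `CLAUDE_CODE_SESSION_ID` and `CODEX_THREAD_ID` is an assumption (the source lists them per agent, not by priority), and the fallback id is supplied by the caller because the source does not say how to generate it.

```python
from pathlib import Path
from typing import Mapping


def resolve_session_id(env: Mapping[str, str], fallback: str) -> str:
    """Pick the session id from the environment, else the caller's stable id."""
    for var in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(var)
        if value:
            return value
    return fallback


def model_workspace(env: Mapping[str, str], session_id: str, model: str) -> Path:
    """Build <root>/<session_id>/<model>, honoring MODELOPT_WORKSPACE_ROOT."""
    root = Path(env.get("MODELOPT_WORKSPACE_ROOT") or "./workspaces")
    return root / session_id / model


def reuse_or_create(workspace: Path) -> bool:
    """Return True if the workspace already existed (reuse), False if created."""
    existed = workspace.is_dir()
    workspace.mkdir(parents=True, exist_ok=True)
    return existed
```

In practice `env` would be `os.environ`; passing it in explicitly keeps the helpers easy to check without touching the real environment.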
diff --git a/.claude/skills/debugging-playbooks/SKILL.md b/.claude/skills/debugging-playbooks/SKILL.md
@@ -0,0 +1,22 @@
+---
+name: debugging-playbooks
+description: Diagnostic playbooks for tricky failures — failures where the traceback misdirects and the first 2-3 reasonable hypotheses turn out wrong. Use when a run fails with a framework-internal-looking error (cryptic torch.compile / dynamo / NCCL / vLLM / transformers / CUDA / pyxis / enroot / NEL / SLURM / container runtime), the top frame appears to blame the wrong layer (e.g. the user's code, ModelOpt, the quantized linear, the wrapper class) but fixing that layer doesn't help, or the symptom recurs across unrelated changes. Use this skill when you've eliminated the obvious suspects and the bug hasn't budged. Don't reach for this on the first guess; reach for it when the obvious answers don't pan out. Each playbook is keyed by a literal symptom string from logs so future agents can grep for it.
+---
+
+# Debugging playbooks
+
+When a failure surfaces a symptom that doesn't clearly map to the code under change, check whether one of the documented playbooks below already describes it. Each playbook is keyed by the literal symptom string so future agents can match by grep.
+
+| Symptom (literal string from logs) | Playbook |
+| --- | --- |
+| `AttributeError: 'NoneType' object has no attribute 'size'` during vLLM `profile_run` / `_dummy_run` / CUDA-graph capture | [vllm-aot-cache-poisoning.md](references/vllm-aot-cache-poisoning.md) |
+
+## When to add a new playbook
+
+Add an entry when **all three** are true:
+
+1. The root cause was non-obvious from the traceback — the immediate frame was misleading (e.g. blames ModelOpt when the bug is in vLLM).
+2. The symptom is likely to recur across runs (different models, different containers).
+3. There is a concrete fix (config change, env var, cache invalidation) that future agents should reach for before deeper debugging.
+
+Each playbook should include: the literal symptom string, the actual mechanism, how to confirm the diagnosis, and the minimal fix.
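
The symptom table in `SKILL.md` above is keyed by literal log strings, so matching can be a plain substring search. A minimal sketch follows; the `PLAYBOOKS` dict and `match_playbooks` helper are hypothetical names, and the single entry is copied from the table in the diff.

```python
from typing import Dict, List

# Literal symptom string -> playbook path, mirroring the table in SKILL.md.
PLAYBOOKS: Dict[str, str] = {
    "AttributeError: 'NoneType' object has no attribute 'size'":
        "references/vllm-aot-cache-poisoning.md",
}


def match_playbooks(log_text: str) -> List[str]:
    """Return the playbooks whose literal symptom string occurs in the log."""
    return [path for symptom, path in PLAYBOOKS.items() if symptom in log_text]
```

A match only says the symptom string is present; the playbook's own confirmation steps still decide whether the diagnosis applies.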
diff --git a/.claude/skills/debugging-playbooks/references/vllm-aot-cache-poisoning.md b/.claude/skills/debugging-playbooks/references/vllm-aot-cache-poisoning.md
@@ -0,0 +1,139 @@
+# vLLM AOT compile-cache poisoning across multimodal-on / multimodal-off runs
+
+Applies to **any** model whose vLLM architecture supports multimodal input —
+this is modality-agnostic, covering image, video, audio, or any other
+modality (`vllm/multimodal/registry.py: supports_multimodal_inputs` iterates
+the model's `supported_mm_limits`, which can be `{"image": N}`,
+`{"video": N}`, `{"audio": N}`, `{"image": N, "video": N}`, etc.). The hazard
+appears when multiple vLLM runs against the **same checkpoint** share a
+`VLLM_CACHE_ROOT` and differ in whether **all** of the model's modalities
+are zeroed out via `--limit-mm-per-prompt`.
+
+## Symptom
+
+vLLM startup crashes during `profile_run` / `_dummy_run` / CUDA-graph capture
+with:
+
+```text
+AttributeError: 'NoneType' object has no attribute 'size'
+```
+
+The traceback ends inside `torch/_dynamo/utils.py call_size → x.size(i)`,
+after passing through `vllm/compilation/decorators.py: aot_compiled_fn`.
+**There is no model-layer frame** in the failing stack — no attention op,
+no MLP, no quantized linear. The compiled function is loaded from disk and
+crashes in dynamo's prologue, before any decoder layer runs. The log line
+just above the traceback is the smoking gun:
+
+```text
+INFO ... [decorators.py:...] Directly load AOT compilation from path
+  /vllm-cache/torch_compile_cache/torch_aot_compile/<hash>/rank_*/model
+```
+
+## Mechanism
+
+vLLM's `@support_torch_compile` decorator caches one compiled `forward` per
+`(aot_compile_hash_factors(vllm_config), _model_hash_key(forward))` key
+(`vllm/compilation/decorators.py`). That key includes the model config and
+quantization, but **does not include** `--limit-mm-per-prompt` or the
+derived `supports_mm_inputs` flag.
+
+`vllm/v1/worker/gpu_model_runner.py: _dummy_run` branches on
+`supports_mm_inputs`:
+
+```python
+if self.supports_mm_inputs and not self.model_config.is_encoder_decoder:
+    input_ids, inputs_embeds = self._prepare_mm_inputs(...)   # (None, Tensor)
+else:
+    input_ids = self.input_ids.gpu[:num_tokens_padded]        # (Tensor, None)
+    inputs_embeds = None
+```
+
+`supports_mm_inputs` (`vllm/multimodal/registry.py: supports_multimodal_inputs`)
+returns `False` when `--limit-mm-per-prompt` sets the limit of **every**
+supported modality to 0. So:
+
+| Run config | `supports_mm_inputs` | Pattern compiled / loaded |
+| --- | --- | --- |
+| `--limit-mm-per-prompt '{"image":0}'` (and `"video":0` etc.) | False | `input_ids=Tensor, inputs_embeds=None` |
+| default, or any modality non-zero | True | `input_ids=None, inputs_embeds=Tensor` |
+
+The `@support_torch_compile` docstring explicitly forbids the same argument
+slot from being `None` on one invocation and a Tensor on another — Dynamo
+specializes on None-vs-Tensor identity per argument, so one cached graph
+cannot serve both patterns. When run A populates the cache slot and run B
+shares the slot but uses the opposite pattern, the prologue calls
+`.size()` on what is now `None` and dies.
+
+This is symmetric: a multimodal-first run followed by a text-only-via-image:0
+run fails the same way, just with the None/Tensor roles swapped.
+
+## How to confirm
+
+1. **Cache hit before the crash.** Look in the server log for
+   `Directly load AOT compilation from path ...` shortly before the
+   traceback. A cache *hit* immediately before the
+   `'NoneType' object has no attribute 'size'` error is the diagnostic.
+   (A cold compile would instead print `Dynamo bytecode transform time`
+   and `Inductor compile took ...`.)
+2. **Config delta on `--limit-mm-per-prompt`.** Compare the failing run's
+   serving args against the most recent successful runs that share
+   `$VLLM_CACHE_ROOT`. If they disagree on whether any modality is
+   zero-limited (or one side omits the flag while the other passes
+   `{"image":0}`), the cache slot is colliding.
+3. **Positive control.** Relaunch the failing config with
+   `VLLM_DISABLE_COMPILE_CACHE=1` and change nothing else. If `profile_run`
+   passes, the cache was the cause.
+
+## Fix
+
+Two parts — stop the poisoning, then heal what's already poisoned.
+
+### Stop poisoning
+
+For multimodal-architecture models, do **not** zero out a modality with
+`--limit-mm-per-prompt '{"image":0}'` (or `"video":0`, …) on runs intended
+to share a cache root with multimodal runs. The vision tower weights are
+loaded from the checkpoint regardless of this flag; zeroing only flips
+`supports_mm_inputs` and creates the cache hazard. Text-only inference
+still works without the flag because vLLM's `_preprocess` routes both
+text and multimodal prompts through the same `inputs_embeds` path when
+`supports_mm_inputs=True`:
+
+```python
+# vllm/v1/worker/gpu_model_runner.py: _preprocess
+# NOTE(woosuk): To unify token ids and soft tokens (vision embeddings),
+# we always use embeddings (rather than token ids) as input to the
+# multimodal model, even when the input is text.
+inputs_embeds_scheduled = self.model.embed_input_ids(
+    self.input_ids.gpu[:num_scheduled_tokens],
+    multimodal_embeddings=mm_embeds,
+    is_multimodal=is_mm_embed,
+)
+```
+
+A text-only prompt simply has `mm_embeds=[]` / `is_multimodal=False`; the
+call signature into the language model is unchanged. The small cost of
+keeping multimodal inputs enabled is that vLLM allocates an encoder cache
+budget at startup (e.g. a few hundred MB) and prints a vision warmup line.
+
+### Heal existing cache
+
+Either fully wipe and let the next run repopulate:
+
+```bash
+rm -rf "$VLLM_CACHE_ROOT/torch_compile_cache/torch_aot_compile/"
+```
+
+…or sidestep by separating cache roots per multimodal-ness (set a different
+`VLLM_CACHE_ROOT` for the runs that need a different pattern), or just set
+`VLLM_DISABLE_COMPILE_CACHE=1` on the affected runs and accept a
+recompile (~20-30 s) at every startup.
+
+## See also
+
+- `vllm/compilation/decorators.py` — `support_torch_compile` decorator and
+  its docstring on the None-vs-Tensor invariant.
+- `vllm/v1/worker/gpu_model_runner.py` — the input-construction branch in
+  `_dummy_run` and the unified-`inputs_embeds` comment in `_preprocess`.
+- `vllm/multimodal/registry.py` — how `supports_multimodal_inputs` is
+  computed from `--limit-mm-per-prompt`.
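
The first two confirmation steps in the playbook above can be approximated with a small log and config check. This is a sketch built only from the strings and rules quoted in the playbook, not from vLLM's code: the function names are hypothetical, and treating a modality that is missing from the limits as non-zero stands in for vLLM's default.

```python
from typing import Iterable, Mapping

AOT_LOAD = "Directly load AOT compilation from path"
SYMPTOM = "AttributeError: 'NoneType' object has no attribute 'size'"
COLD_COMPILE_MARKERS = ("Dynamo bytecode transform time", "Inductor compile took")


def cache_hit_before_crash(log_text: str) -> bool:
    """Step 1: an AOT cache load, and no cold compile, precedes the symptom."""
    crash_at = log_text.find(SYMPTOM)
    if crash_at == -1:
        return False
    before = log_text[:crash_at]
    return AOT_LOAD in before and not any(m in before for m in COLD_COMPILE_MARKERS)


def mm_inputs_enabled(supported: Iterable[str], limits: Mapping[str, int]) -> bool:
    """Simplified model of supports_mm_inputs: False only if all limits are 0."""
    # Assumption: a modality absent from `limits` keeps a non-zero default.
    return any(limits.get(modality, 1) != 0 for modality in supported)


def configs_collide(
    supported: Iterable[str], limits_a: Mapping[str, int], limits_b: Mapping[str, int]
) -> bool:
    """Step 2: two runs sharing a cache root disagree on supports_mm_inputs."""
    modalities = list(supported)
    return mm_inputs_enabled(modalities, limits_a) != mm_inputs_enabled(modalities, limits_b)
```

If both checks come back true, the positive control with `VLLM_DISABLE_COMPILE_CACHE=1` remains the deciding test.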