Open
26 commits
66fd103
Update evaluation skill guidance
chadvoegele May 8, 2026
8aad1eb
Refine agent skill guidance
chadvoegele May 8, 2026
02ec0b2
Clarify quantized eval baseline comparison
chadvoegele May 8, 2026
267c19e
Document repeat guidance for reasoning evals
chadvoegele May 11, 2026
de7cd39
Add PTQ and evaluation verification guidance
chadvoegele May 11, 2026
874581c
Deduplicate PTQ checkpoint size guidance
chadvoegele May 11, 2026
187ca1e
Deduplicate evaluation recipe guidance
chadvoegele May 14, 2026
947074b
Add SLURM QoS launcher option
chadvoegele May 14, 2026
d885ad6
Make PTQ checkpoint validation a required gate
chadvoegele May 14, 2026
be79555
Refine evaluation run gating guidance
chadvoegele May 14, 2026
8b6cc5f
Document NEL timeout resume behavior
chadvoegele May 14, 2026
717507f
Split evaluation validation and comparability steps
chadvoegele May 14, 2026
9092bc4
Convert evaluation task snippets to references
chadvoegele May 14, 2026
1b5e031
Add evaluation task references
chadvoegele May 15, 2026
cce42cc
Use NeMo Skills MMLU-Pro recipe
chadvoegele May 18, 2026
f40898d
Update evaluation task recipes
chadvoegele May 18, 2026
b793ea5
Add debugging playbooks skill
chadvoegele May 19, 2026
a662e43
Clarify monitor status handling
chadvoegele May 19, 2026
f360752
Use ns_hle_aa for HLE AA evaluations
chadvoegele May 19, 2026
ffa7558
Scope agent state by session
chadvoegele May 19, 2026
343fe71
Add evaluation score extraction helpers
chadvoegele May 20, 2026
32c2072
Add robust monitor status parsing
chadvoegele May 20, 2026
74635c9
Increase GPQA evaluation repeats
chadvoegele May 20, 2026
44499ba
Fix markdownlint formatting in skill docs
chadvoegele May 20, 2026
fd07d91
Fix launcher Slurm config typing
chadvoegele May 20, 2026
2595c72
Use launcher-compatible optional type hints
chadvoegele May 20, 2026
2 changes: 2 additions & 0 deletions .claude/skills/common/environment-setup.md
@@ -29,6 +29,8 @@ cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>

If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.

If the cluster config contains multiple clusters and the user did not name the target cluster, ask which cluster to use before calling `remote_load_cluster`. Do not silently fall back to `default_cluster` in multi-cluster configs; different clusters can have different filesystems, GPU types, auth paths, and SSH setup.
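
As a rough illustration of the rule above, the selection logic might look like the sketch below. It assumes the parsed config is a dict with a top-level `clusters` mapping (that key name is an assumption; only `default_cluster` is shown in the documented config). `choose_cluster` and `ClusterChoiceRequired` are hypothetical names, not part of any skill API.

```python
from typing import Optional


class ClusterChoiceRequired(Exception):
    """Signals that the user must be asked which cluster to use."""


def choose_cluster(config: dict, requested: Optional[str] = None) -> Optional[str]:
    """Return the cluster name to load, or None for local execution."""
    clusters = config.get("clusters") or {}  # assumed key name
    if not clusters:
        return None  # no usable cluster config: run locally
    if requested is not None:
        if requested not in clusters:
            raise KeyError(f"unknown cluster: {requested}")
        return requested
    if len(clusters) == 1:
        return next(iter(clusters))
    # Several clusters and no explicit choice: do not fall back to
    # default_cluster, ask the user instead.
    raise ClusterChoiceRequired(sorted(clusters))
```

Raising instead of returning `default_cluster` keeps the "ask first" decision visible to the caller.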

For remote, connect:

```bash
8 changes: 4 additions & 4 deletions .claude/skills/common/remote-execution.md
@@ -33,10 +33,10 @@ default_cluster: my-cluster
Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.

```bash
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
```

Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
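
The `grep -o '[0-9]\+' | tail -1` pipeline keeps the last run of digits in the `sbatch` output. A Python equivalent, as a sketch (the helper name `parse_sbatch_job_id` is hypothetical), could look like this:

```python
import re


def parse_sbatch_job_id(output: str) -> str:
    """Return the last run of digits in sbatch output as the job id."""
    numbers = re.findall(r"[0-9]+", output)
    if not numbers:
        raise ValueError(f"no job id found in sbatch output: {output!r}")
    return numbers[-1]
```

Failing loudly on empty output avoids carrying an empty `JOBID` into later monitoring commands.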

---
116 changes: 67 additions & 49 deletions .claude/skills/common/workspace-management.md
@@ -1,77 +1,95 @@
# Workspace Management

Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
Organize work by session id and model name so concurrent agents do not
clobber each other, while outputs (checkpoints, logs) stay easy to find and
reuse across PTQ → deploy → eval pipelines within the same session.

## Single-user (default)
## Session Workspaces

Create a work directory named after the model in the current project:
Use the same `<session_id>` convention as the monitor skill:

```bash
mkdir -p ./workspaces/<model-name>
```

Use descriptive names, not timestamps:

```bash
# Good
workspaces/qwen3-0.6b-nvfp4/
workspaces/llama-3.1-8b-fp8/

# Bad
workspaces/ptq-20260318-143022/
workspaces/job-001/
```

Store outputs (checkpoints, logs) inside the workspace:

```bash
workspaces/qwen3-0.6b-nvfp4/
  output/   # quantized checkpoint
  logs/     # job logs
  scripts/  # custom PTQ scripts (if unsupported model)
```
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
- Codex: `$CODEX_THREAD_ID`
- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
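
A minimal sketch of that lookup order follows. The precedence of Claude Code over Codex and the `term-<sid>` fallback (built from the POSIX terminal session id via `os.getsid`) are assumptions for illustration; `resolve_session_id` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def resolve_session_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a session id from the agent environment, else a stable fallback."""
    env = os.environ if env is None else env
    for var in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(var, "").strip()
        if value:
            return value
    # Assumed fallback: the POSIX session id of the current terminal, which
    # stays the same for every process started from that terminal session.
    return f"term-{os.getsid(0)}"
```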

## When to Reuse vs Create

**Before starting any task**, check for an existing workspace:
**Before starting any task**, check for an existing workspace in the current
session:

```bash
ls ./workspaces/ 2>/dev/null
ls ./workspaces/<session_id>/ 2>/dev/null
```

**Reuse** when:

- Same model (e.g., deploying a model you just quantized)
- The matching model workspace already exists under `./workspaces/<session_id>/`
- Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
- User says "deploy the model I just quantized"

**Create new** when:

- New model not seen before
- No matching model workspace exists under `./workspaces/<session_id>/`
- User explicitly asks for a fresh start
- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
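
The reuse-versus-create rules can be sketched as a small decision function. It takes the directory names listed under `./workspaces/<session_id>/` as input; `workspace_action` is a hypothetical name and the sketch only covers the name-matching rules, not the user-intent cues.

```python
from typing import Iterable


def workspace_action(existing: Iterable[str], model_name: str,
                     fresh_start: bool = False) -> str:
    """Return "reuse" or "create" for a model workspace in this session."""
    if fresh_start:
        return "create"  # the user explicitly asked for a fresh start
    # A different quantization format is a different name, so it never matches.
    return "reuse" if model_name in set(existing) else "create"
```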

## Model Workspace Names

Within `./workspaces/<session_id>/`, create one model workspace per model or
model variant. Include meaningful variant details in the model workspace name,
for example quantization format or checkpoint role:

```bash
mkdir -p ./workspaces/<session_id>/<model-name>
```

Use descriptive model workspace names, not timestamps:

```text
# Good
workspaces/<session_id>/qwen3-0.6b-nvfp4/
workspaces/<session_id>/qwen3-0.6b-fp8/
workspaces/<session_id>/qwen3-0.6b-baseline/

# Bad
workspaces/<session_id>/ptq-20260318-143022/
workspaces/<session_id>/job-001/
```

Store outputs (checkpoints, logs) inside the model workspace:

```text
workspaces/<session_id>/qwen3-0.6b-nvfp4/
  output/   # quantized checkpoint
  logs/     # job logs
  scripts/  # custom PTQ scripts (if unsupported model)
```
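
One way to catch the "Bad" names above before creating a directory is a small heuristic check. The patterns below are assumptions derived only from the two bad examples (a date-like digit run, or a bare `job-NNN` counter), and `is_descriptive_name` is a hypothetical helper.

```python
import re

# Assumed heuristics: an 8-digit date-like run, or a bare job/run counter.
_TIMESTAMP = re.compile(r"\d{8}")
_COUNTER = re.compile(r"^(job|run)-\d+$")


def is_descriptive_name(name: str) -> bool:
    """Return False for timestamp-style or counter-style workspace names."""
    return not (_TIMESTAMP.search(name) or _COUNTER.match(name))
```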

## Remote execution

When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:

- **Local** `./workspaces/<model>/` — write and edit scripts here
- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs

Session-scope newly created remote run directories, logs, response caches,
temporary configs, and output artifacts. Shared read-only or concurrency-safe
caches, such as Hugging Face model caches and prebuilt container image caches,
can remain outside the session directory.
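
Note that the remote layout drops the `workspaces/` segment. A sketch of the mapping from a local workspace path to its remote counterpart, with `to_remote_path` as a hypothetical helper and the paths in the usage line purely illustrative:

```python
from pathlib import PurePosixPath


def to_remote_path(local_path: str, remote_workspace: str,
                   local_root: str = "./workspaces") -> PurePosixPath:
    """Map ./workspaces/<session_id>/<model>/... to <remote_workspace>/<session_id>/<model>/..."""
    # relative_to raises ValueError if local_path is not under local_root.
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_root))
    return PurePosixPath(remote_workspace) / relative
```

For example, `to_remote_path("./workspaces/s1/qwen3-0.6b-nvfp4/scripts", "/lustre/ws")` yields `/lustre/ws/s1/qwen3-0.6b-nvfp4/scripts`.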

Before running, sync the local ModelOpt source and scripts to the remote workspace:

```bash
# Sync ModelOpt source (first time or after local changes)
remote_sync_to ./ workspaces/<model>/Model-Optimizer/
remote_sync_to ./ <session_id>/<model>/Model-Optimizer/

# Sync custom scripts
remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
```

Download the model on the **remote** machine (avoids transferring large model files):

```bash
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
```

Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t

When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:

- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
- `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
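
A sketch of how the root override could be resolved in code, assuming an unset or empty `MODELOPT_WORKSPACE_ROOT` means the default `./workspaces/` root applies (`workspace_dir` is a hypothetical helper):

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def workspace_dir(session_id: str, name: str,
                  env: Optional[Mapping[str, str]] = None) -> Path:
    """Return <root>/<session_id>/<name>, honoring MODELOPT_WORKSPACE_ROOT."""
    env = os.environ if env is None else env
    root = env.get("MODELOPT_WORKSPACE_ROOT") or "./workspaces"
    return Path(root) / session_id / name
```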

To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
rsync -a --quiet \
    --exclude .git --exclude __pycache__ --exclude '*.pyc' \
    --exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
    "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
```

## Cross-Skill Workspace Flow

Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:

```text
workspaces/model-name-format/
workspaces/<session_id>/model-name-format/
  output/            ← PTQ: quantized checkpoint
  eval_results/      ← Evaluation: NEL artifacts (results.yml per task)
  eval_config.yaml   ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/

```text
User: "quantize Qwen3-0.6B with nvfp4"
Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
       → mkdir workspaces/qwen3-0.6b-nvfp4
       → run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
       → mkdir workspaces/<session_id>/qwen3-0.6b-nvfp4
       → run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/

User: "deploy the model I just quantized"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
       → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
       → reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/

User: "evaluate the quantized model on MMLU and GSM8K"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
       → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
       → reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/

User: "now quantize Llama-3.1-8B with fp8"
Agent: ls workspaces/ → no llama
       → mkdir workspaces/llama-3.1-8b-fp8
Agent: ls workspaces/<session_id>/ → no llama
       → mkdir workspaces/<session_id>/llama-3.1-8b-fp8
```
22 changes: 22 additions & 0 deletions .claude/skills/debugging-playbooks/SKILL.md
@@ -0,0 +1,22 @@
---
name: debugging-playbooks
description: Diagnostic playbooks for tricky failures — failures where the traceback misdirects and the first 2-3 reasonable hypotheses turn out wrong. Use when a run fails with a framework-internal-looking error (cryptic torch.compile / dynamo / NCCL / vLLM / transformers / CUDA / pyxis / enroot / NEL / SLURM / container runtime), the top frame appears to blame the wrong layer (e.g. the user's code, ModelOpt, the quantized linear, the wrapper class) but fixing that layer doesn't help, or the symptom recurs across unrelated changes. Use this skill when you've eliminated the obvious suspects and the bug hasn't budged. Don't reach for this on the first guess; reach for it when the obvious answers don't pan out. Each playbook is keyed by a literal symptom string from logs so future agents can grep for it.
---

# Debugging playbooks

When a failure surfaces a symptom that doesn't clearly map to the code under change, check whether one of the documented playbooks below already describes it. Each playbook is keyed by the literal symptom string so future agents can match by grep.

| Symptom (literal string from logs) | Playbook |
| --- | --- |
| `AttributeError: 'NoneType' object has no attribute 'size'` during vLLM `profile_run` / `_dummy_run` / CUDA-graph capture | [vllm-aot-cache-poisoning.md](references/vllm-aot-cache-poisoning.md) |
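
Matching by literal symptom string can be sketched as a plain substring lookup over a log. The `PLAYBOOKS` mapping mirrors the table above; `match_playbooks` is a hypothetical helper, not an existing tool.

```python
from typing import List

# Literal symptom string -> playbook path (relative to this skill directory).
PLAYBOOKS = {
    "AttributeError: 'NoneType' object has no attribute 'size'":
        "references/vllm-aot-cache-poisoning.md",
}


def match_playbooks(log_text: str) -> List[str]:
    """Return playbooks whose literal symptom string occurs in the log."""
    return [path for symptom, path in PLAYBOOKS.items() if symptom in log_text]
```

A match only nominates a playbook; the confirmation steps inside the playbook still decide whether it applies.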

## When to add a new playbook

Add an entry when **all three** are true:

1. The root cause was non-obvious from the traceback — the immediate frame was misleading (e.g. blames ModelOpt when the bug is in vLLM).
2. The symptom is likely to recur across runs (different models, different containers).
3. There is a concrete fix (config change, env var, cache invalidation) that future agents should reach for before deeper debugging.

Each playbook should include: the literal symptom string, the actual mechanism, how to confirm the diagnosis, and the minimal fix.
139 changes: 139 additions & 0 deletions .claude/skills/debugging-playbooks/references/vllm-aot-cache-poisoning.md
@@ -0,0 +1,139 @@
# vLLM AOT compile-cache poisoning across multimodal-on / multimodal-off runs

Applies to **any** model whose vLLM architecture supports multimodal input —
this is modality-agnostic, covering image, video, audio, or any other
modality (`vllm/multimodal/registry.py: supports_multimodal_inputs` iterates
the model's `supported_mm_limits`, which can be `{"image": N}`,
`{"video": N}`, `{"audio": N}`, `{"image": N, "video": N}`, etc.). The hazard
appears when multiple vLLM runs against the **same checkpoint** share a
`VLLM_CACHE_ROOT` and differ in whether **all** of the model's modalities
are zeroed out via `--limit-mm-per-prompt`.

## Symptom

vLLM startup crashes during `profile_run` / `_dummy_run` / CUDA-graph capture
with:

```text
AttributeError: 'NoneType' object has no attribute 'size'
```

The traceback ends inside `torch/_dynamo/utils.py call_size → x.size(i)`,
after passing through `vllm/compilation/decorators.py: aot_compiled_fn`.
**There is no model-layer frame** in the failing stack — no attention op,
no MLP, no quantized linear. The compiled function is loaded from disk and
crashes in dynamo's prologue, before any decoder layer runs. The log line
just above the traceback is the smoking gun:

```text
INFO ... [decorators.py:...] Directly load AOT compilation from path
/vllm-cache/torch_compile_cache/torch_aot_compile/<hash>/rank_*/model
```

## Mechanism

vLLM's `@support_torch_compile` decorator caches one compiled `forward` per
`(aot_compile_hash_factors(vllm_config), _model_hash_key(forward))` key
(`vllm/compilation/decorators.py`). That key includes the model config and
quantization, but **does not include** `--limit-mm-per-prompt` or the
derived `supports_mm_inputs` flag.

`vllm/v1/worker/gpu_model_runner.py: _dummy_run` branches on
`supports_mm_inputs`:

```python
if self.supports_mm_inputs and not self.model_config.is_encoder_decoder:
    input_ids, inputs_embeds = self._prepare_mm_inputs(...)  # (None, Tensor)
else:
    input_ids = self.input_ids.gpu[:num_tokens_padded]  # (Tensor, None)
    inputs_embeds = None
```

`supports_mm_inputs` (`vllm/multimodal/registry.py: supports_multimodal_inputs`)
returns `False` when **every** supported modality has
`--limit-mm-per-prompt = 0`. So:

| Run config | `supports_mm_inputs` | Pattern compiled / loaded |
| --- | --- | --- |
| `--limit-mm-per-prompt '{"image":0}'` (and `"video":0` etc.) | False | `input_ids=Tensor, inputs_embeds=None` |
| default, or any modality non-zero | True | `input_ids=None, inputs_embeds=Tensor` |
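
The rule in the table can be modeled in a few lines. This is a simplified stand-in for the behavior described above, not vLLM's implementation. It assumes `supported_mm_limits` maps each modality to its default per-prompt limit and that `--limit-mm-per-prompt` overrides individual entries.

```python
from typing import Dict, Optional


def supports_mm_inputs(supported_mm_limits: Dict[str, int],
                       limit_mm_per_prompt: Optional[Dict[str, int]] = None) -> bool:
    """True unless every supported modality ends up with a limit of 0."""
    overrides = limit_mm_per_prompt or {}
    effective = {
        modality: overrides.get(modality, default)
        for modality, default in supported_mm_limits.items()
    }
    return any(limit > 0 for limit in effective.values())
```

Under this model, zeroing only `image` on a model that also supports `video` leaves the flag `True`. Only zeroing every supported modality flips it.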

The `@support_torch_compile` docstring explicitly forbids the same argument
slot from being `None` on one invocation and a Tensor on another — Dynamo
specializes on None-vs-Tensor identity per argument, so one cached graph
cannot serve both patterns. When run A populates the cache slot and run B
shares the slot but uses the opposite pattern, the prologue calls
`.size()` on what is now `None` and dies.

This is symmetric: a multimodal-first run followed by a run that zeroes every
modality (e.g. `{"image":0}`) fails the same way, just with the None/Tensor
roles swapped.
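
The collision can be reproduced with a toy model in plain Python. Nothing here is vLLM or Dynamo code: `FakeTensor`, `compile_for`, and `run` are hypothetical stand-ins that only mimic a compiled function specialized on which argument is `None`, stored under a cache key that ignores that choice.

```python
class FakeTensor:
    """Minimal stand-in for a tensor that can report a size."""

    def __init__(self, length):
        self.length = length

    def size(self, dim=0):
        return self.length


def compile_for(input_ids, inputs_embeds):
    """Specialize on which argument is None, as the first caller saw it."""
    use_embeds = input_ids is None

    def compiled(input_ids, inputs_embeds):
        tensor = inputs_embeds if use_embeds else input_ids
        return tensor.size(0)  # the prologue assumes this slot holds a tensor

    return compiled


def run(cache, key, input_ids, inputs_embeds):
    """Load the compiled function for key, compiling only on a cache miss."""
    if key not in cache:
        cache[key] = compile_for(input_ids, inputs_embeds)
    return cache[key](input_ids, inputs_embeds)
```

Calling `run` with `(FakeTensor, None)` and then with `(None, FakeTensor)` under the same key raises `AttributeError` on the second call, while a fresh cache handles either pattern.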

## How to confirm

1. **Cache hit before the crash.** Look in the server log for
`Directly load AOT compilation from path ...` shortly before the
traceback. A cache *hit* immediately before a `NoneType.size()` is the
diagnostic. (A cold compile would print `Dynamo bytecode transform
time` and `Inductor compile took ...` instead.)
2. **Config delta on `--limit-mm-per-prompt`.** Compare the failing run's
serving args against the most recent successful runs that share
`$VLLM_CACHE_ROOT`. If they disagree on whether any modality is
zero-limited (or one side omits the flag while the other passes
`{"image":0}`), the cache slot is colliding.
3. **Positive control.** Relaunch the failing config with
`VLLM_DISABLE_COMPILE_CACHE=1` and change nothing else. If `profile_run`
passes, the cache was the cause.
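
Step 1 can be automated with a simple log scan. The sketch below only checks that the cache-hit line appears before the crash line, using the two literal strings quoted in this playbook; `cache_hit_precedes_crash` is a hypothetical helper.

```python
CACHE_HIT = "Directly load AOT compilation from path"
CRASH = "AttributeError: 'NoneType' object has no attribute 'size'"


def cache_hit_precedes_crash(log_text: str) -> bool:
    """True if an AOT cache hit is logged before the NoneType.size() crash."""
    hit = log_text.find(CACHE_HIT)
    crash = log_text.find(CRASH)
    return hit != -1 and crash != -1 and hit < crash
```

A `True` result is a hint rather than proof. The positive control in step 3 is still the decisive check.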

## Fix

Two parts — stop the poisoning, then heal what's already poisoned.

### Stop poisoning

For multimodal-architecture models, do **not** zero out a modality with
`--limit-mm-per-prompt '{"image":0}'` (or `"video":0`, …) on runs intended
to share a cache root with multimodal runs. The vision tower weights are
loaded from the checkpoint regardless of this flag; zeroing only flips
`supports_mm_inputs` and creates the cache hazard. Text-only inference
still works without the flag because vLLM's `_preprocess` routes both
text and multimodal prompts through the same `inputs_embeds` path when
`supports_mm_inputs=True`:

```python
# vllm/v1/worker/gpu_model_runner.py: _preprocess
# NOTE(woosuk): To unify token ids and soft tokens (vision embeddings),
# we always use embeddings (rather than token ids) as input to the
# multimodal model, even when the input is text.
inputs_embeds_scheduled = self.model.embed_input_ids(
    self.input_ids.gpu[:num_scheduled_tokens],
    multimodal_embeddings=mm_embeds,
    is_multimodal=is_mm_embed,
)
```

A text-only prompt simply has `mm_embeds=[]` / `is_multimodal=False`; the
call signature into the language model is unchanged. The small cost of
keeping multimodal inputs enabled is that vLLM allocates an encoder cache
budget at startup (e.g. a few hundred MB) and prints a vision warmup line.

### Heal existing cache

Either fully wipe and let the next run repopulate:

```bash
rm -rf "$VLLM_CACHE_ROOT/torch_compile_cache/torch_aot_compile/"
```

…or sidestep by separating cache roots per multimodal-ness (set a different
`VLLM_CACHE_ROOT` for the runs that need a different pattern), or just set
`VLLM_DISABLE_COMPILE_CACHE=1` on the affected runs and accept a
recompile (~20-30 s) at every startup.
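
The two sidestep options could be wired into a launcher as environment overrides. In this sketch the `mm-on` / `mm-off` subdirectory names and the helper `compile_cache_env` are hypothetical. Only the two environment variable names come from the text above.

```python
from typing import Dict


def compile_cache_env(cache_root: str, mm_enabled: bool,
                      disable_cache: bool = False) -> Dict[str, str]:
    """Return env overrides that keep the two input patterns apart."""
    if disable_cache:
        # Skip the compile cache entirely and recompile at each startup.
        return {"VLLM_DISABLE_COMPILE_CACHE": "1"}
    subdir = "mm-on" if mm_enabled else "mm-off"  # assumed naming
    return {"VLLM_CACHE_ROOT": f"{cache_root.rstrip('/')}/{subdir}"}
```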

## See also

- `vllm/compilation/decorators.py` — `support_torch_compile` decorator and
its docstring on the None-vs-Tensor invariant.
- `vllm/v1/worker/gpu_model_runner.py` — the input-construction branch in
`_dummy_run` and the unified-`inputs_embeds` comment in `_preprocess`.
- `vllm/multimodal/registry.py` — how `supports_multimodal_inputs` is
computed from `--limit-mm-per-prompt`.