### What does this PR do?
Type of change: bug fix
<!-- Details about the change. -->
### Usage
Ask Claude Code:
```
Quantize `mistralai/Mistral-Medium-3.5-128B` to NVFP4 using the ModelOpt NVFP4 experts-only recipe.
Run on $cluster
Evaluate the resulting quantized checkpoint on:
- GPQA Diamond AA v3
- SciCode AA v2
Complete the quantization and evaluation workflow end to end. Prompt when you require user input, otherwise keep going.
```
### Testing
I'm running the full loop with the above prompt, and iterating on skills
to resolve undesired agent behavior.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
- Did you get Claude approval on this PR?: TODO
### Additional Information
See trials log for details.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
  * Added SLURM Quality of Service (QoS) configuration support for job
    submission
  * Introduced 9 new evaluation task recipes (AIME 2025, GPQA, IFBench,
    LiveCodeBench, SciCode, AA-LCR, HLE-AA, MMMU-Pro, tau2_bench)
  * Enhanced job monitoring with continuous polling-based tracking
* **Documentation**
  * Restructured evaluation workflow with explicit dry-run, canary, and
    full-run validation stages
  * Expanded PTQ validation with mandatory pre-deployment verification
    gates
  * Updated remote cluster selection and quantization detection guidance
* **Tests**
  * Updated evaluation test expectations to reflect refined workflow
    stages
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.

+If the cluster config contains multiple clusters and the user did not name the target cluster, ask which cluster to use before calling `remote_load_cluster`. Do not silently fall back to `default_cluster` in multi-cluster configs; different clusters can have different filesystems, GPU types, auth paths, and SSH setup.
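
The selection rule in this hunk can be sketched roughly as follows. This is an illustration only, not the skill's implementation: the config shape (a `clusters` mapping next to `default_cluster`) and the names `select_cluster` and `ClusterChoiceRequired` are assumptions made for the example.

```python
from typing import Optional


class ClusterChoiceRequired(Exception):
    """Signals that the user must be asked which cluster to use."""


def select_cluster(config: dict, requested: Optional[str] = None) -> Optional[str]:
    """Return the cluster name to use, or None for local execution.

    Assumes a hypothetical config shape:
    {"clusters": {name: {...}}, "default_cluster": name}.
    """
    clusters = config.get("clusters") or {}
    if not clusters:
        return None  # no cluster config content: run locally
    if requested is not None:
        if requested not in clusters:
            raise KeyError(f"unknown cluster: {requested}")
        return requested
    if len(clusters) > 1:
        # Multi-cluster config and no named target: ask, never guess.
        raise ClusterChoiceRequired(", ".join(sorted(clusters)))
    default = config.get("default_cluster")
    return default if default in clusters else next(iter(clusters))
```

Raising an exception instead of returning `default_cluster` keeps the "ask the user" step explicit at the call site.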
.claude/skills/common/remote-execution.md (4 additions, 4 deletions)
@@ -33,10 +33,10 @@ default_cluster: my-cluster
Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
-Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
+Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.

 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
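
As a rough illustration of the staging guidance in this hunk, the sketch below builds a destination directory under the cluster `workspace` root and an `rsync` argument list. The `<workspace>/<session_id>/<model_workspace>` layout, the function names, and the use of `rsync` are assumptions for the example; the repository's own sync helper may work differently.

```python
import posixpath
from typing import List


def staged_checkpoint_dir(workspace: str, session_id: str, model_workspace: str) -> str:
    """Build the remote directory for a staged checkpoint (assumed layout)."""
    for part in (session_id, model_workspace):
        # Reject empty or path-like components so the result stays under workspace.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(workspace, session_id, model_workspace)


def rsync_command(local_checkpoint: str, login_host: str, remote_dir: str) -> List[str]:
    """Return an rsync argument list; the command is not executed here."""
    # Trailing slashes copy the directory contents into remote_dir.
    source = local_checkpoint.rstrip("/") + "/"
    return ["rsync", "-a", source, f"{login_host}:{remote_dir}/"]
```

Because compute nodes share storage with the login node, the returned directory can be passed unchanged to any job on that cluster once the copy has finished.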
@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
+- Codex: `$CODEX_THREAD_ID`
+- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
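
A minimal sketch of the session id lookup described in this list, assuming the two environment variables are checked in the order shown. The `session_id` field from hook input is not handled, and the fallback here is stable only within one Python process; persisting it for a whole terminal session (for example in a file) is left out. The function name and the `session-` prefix are hypothetical.

```python
import os
import uuid
from typing import Mapping, Optional

_FALLBACK_ID: Optional[str] = None  # cached so repeated calls agree


def resolve_session_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the agent session id, generating a cached fallback if none is set."""
    global _FALLBACK_ID
    env = os.environ if env is None else env
    for name in ("CLAUDE_CODE_SESSION_ID", "CODEX_THREAD_ID"):
        value = env.get(name, "").strip()
        if value:
            return value
    if _FALLBACK_ID is None:
        # Hypothetical fallback format; any stable unique string would do.
        _FALLBACK_ID = f"session-{uuid.uuid4().hex[:12]}"
    return _FALLBACK_ID
```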

 ## When to Reuse vs Create

-**Before starting any task**, check for an existing workspace:
+**Before starting any task**, check for an existing workspace in the current
+session:

 ```bash
-ls ./workspaces/ 2>/dev/null
+ls ./workspaces/<session_id>/ 2>/dev/null
 ```

 **Reuse** when:

-- Same model (e.g., deploying a model you just quantized)
+- The matching model workspace already exists under `./workspaces/<session_id>/`
 - Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
 - User says "deploy the model I just quantized"

 **Create new** when:

-- New model not seen before
+- No matching model workspace exists under `./workspaces/<session_id>/`
 - User explicitly asks for a fresh start
-- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
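
The reuse-versus-create rule above could look roughly like this in code. It is a sketch under assumptions: the function name is hypothetical, and the "user explicitly asks for a fresh start" case is not covered because the diff does not say how a fresh workspace should be named.

```python
from pathlib import Path
from typing import Tuple


def reuse_or_create_workspace(root: str, session_id: str, model_name: str) -> Tuple[Path, bool]:
    """Return (workspace_path, reused) for a model within the current session."""
    workspace = Path(root) / session_id / model_name
    reused = workspace.is_dir()  # a matching model workspace already exists
    workspace.mkdir(parents=True, exist_ok=True)  # no-op when reusing
    return workspace, reused
```

Keying the lookup on both the session id and the model workspace name means two quantization formats of the same model land in separate directories without any extra rule.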
+
+## Model Workspace Names
+
+Within `./workspaces/<session_id>/`, create one model workspace per model or
+model variant. Include meaningful variant details in the model workspace name,
+for example quantization format or checkpoint role:
+
+```bash
+mkdir -p ./workspaces/<session_id>/<model-name>
+```
+
+Use descriptive model workspace names, not timestamps:
+
+```text
+# Good
+workspaces/<session_id>/qwen3-0.6b-nvfp4/
+workspaces/<session_id>/qwen3-0.6b-fp8/
+workspaces/<session_id>/qwen3-0.6b-baseline/
+
+# Bad
+workspaces/<session_id>/ptq-20260318-143022/
+workspaces/<session_id>/job-001/
+```
+
+Store outputs (checkpoints, logs) inside the model workspace:
+description: Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).
+license: Apache-2.0
+---
+
+# Compare Results
+
+Use this to plan and complete a baseline-vs-candidate comparison. The baseline
+is the reference checkpoint, and the candidate is the checkpoint whose accuracy
+change is being measured, typically a further quantized version of the baseline.
+
+## Workflow
+
+1. Establish the candidate checkpoint/run and the matching baseline. Infer the
+   baseline from the PTQ source model/checkpoint in the workspace or config used
+   to create the candidate. If it cannot be inferred, ask the user for the
+   baseline checkpoint or an existing baseline invocation/run path.
+2. If a required baseline or candidate evaluation is missing, delegate to the
+   evaluation skill to create, run, and verify it. The companion evaluation
+   config should match benchmark versions, task configs, serving args, token
+   limits, dataset setup, credentials, cluster, and container as closely as
+   possible; change only the model/checkpoint and checkpoint-specific serving or
+   quantization flags.
+3. Fetch the baseline and candidate task list, configs, score artifacts, and
+   logs. If the user provides MLflow runs or invocation IDs, use the
+   accessing-mlflow skill to fetch configs and artifacts.
+4. Confirm each run passed evaluation Step 9, "Verify completed evaluation run",
+   before comparing scores. If not, validate logs, server health,
+   judge/code-execution status, sample accounting, and reasoning parsing before
+   computing deltas.
+5. For each task, use the canonical score field from the matching
+    "query": "Compare the baseline and quantized NEL runs and tell me the accuracy drop",
+    "files": [],
+    "expected_behavior": [
+      "Establishes the baseline and candidate before computing deltas",
+      "Identifies baseline and candidate run artifacts before computing deltas",
+      "Checks that both runs were validated before accepting scores",
+      "Uses task recipe Score Extraction sections for canonical score fields",
+      "Verifies prompt/template, generation settings, reasoning handling, sample counts, judge/simulator setup, and score field comparability",
+      "Computes exact per-task deltas outside the chat context when there are multiple tasks or repeated runs",
+      "Labels the result as not apples-to-apples if comparability checks fail",
+      "Reports quantization feasibility as inconclusive when no acceptance threshold is provided"
+    ]
+  },
+  {
+    "name": "missing-baseline",
+    "skills": ["compare-results", "evaluation"],
+    "query": "Is this quantized eval good enough if I only have the quantized run?",
+    "files": [],
+    "expected_behavior": [
+      "Does not treat a standalone quantized score as a release-ready delta",
+      "Establishes the matching baseline from the PTQ source model/checkpoint or asks the user for one",
+      "Delegates missing baseline or candidate evaluations to the evaluation skill",
+      "Explains that the baseline config must match task versions, serving args, dataset setup, credentials, cluster, and container except for the checkpoint/model"
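
Two of the expected behaviors in these eval cases, computing exact per-task deltas outside the chat context and reporting an inconclusive result when no acceptance threshold is given, can be sketched as below. The function name, the `max_drop` parameter, and the verdict labels are assumptions for illustration; the scores are assumed to have already been extracted from each task's canonical score field.

```python
from typing import Dict, Optional


def compare_scores(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: Optional[float] = None,
) -> dict:
    """Compute candidate-minus-baseline deltas per task and a coarse verdict."""
    shared = sorted(set(baseline) & set(candidate))
    missing = sorted(set(baseline) ^ set(candidate))  # tasks present in only one run
    deltas = {task: candidate[task] - baseline[task] for task in shared}
    if max_drop is None or missing or not shared:
        # No threshold, or incomplete coverage: do not claim feasibility.
        verdict = "inconclusive"
    elif all(delta >= -max_drop for delta in deltas.values()):
        verdict = "acceptable"
    else:
        verdict = "regression"
    return {"deltas": deltas, "missing": missing, "verdict": verdict}
```

A call such as `compare_scores(baseline_scores, candidate_scores, max_drop=1.0)` flags any task whose score fell by more than one point; omitting `max_drop` always yields `"inconclusive"`.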
.claude/skills/deployment/SKILL.md (3 additions, 3 deletions)
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)

 ### 0. Check workspace (multi-user / Slack bot)

-If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:

 ```bash
-ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
+ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
 ```

 If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local: