
Commit e2d4d73

chadvoegele and claude authored
Agent Skills Updates From Live Trials (#1493)
### What does this PR do?

Type of change: bug fix

<!-- Details about the change. -->

### Usage

Ask Claude Code:

```
Quantize `mistralai/Mistral-Medium-3.5-128B` to NVFP4 using the ModelOpt NVFP4 experts-only recipe.
Run on $cluster

Evaluate the resulting quantized checkpoint on:
- GPQA Diamond AA v3
- SciCode AA v2

Complete the quantization and evaluation workflow end to end. Prompt when you require user input, otherwise keep going.
```

### Testing

I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: TODO

### Additional Information

See trials log for details.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added SLURM Quality of Service (QoS) configuration support for job submission
  * Introduced 8 new evaluation task recipes (AIME 2025, GPQA, IFBench, LiveCodeBench, SciCode, AA-LCR, HLE-AA, MMMU-Pro, tau2_bench)
  * Enhanced job monitoring with continuous polling-based tracking
* **Documentation**
  * Restructured evaluation workflow with explicit dry-run, canary, and full-run validation stages
  * Expanded PTQ validation with mandatory pre-deployment verification gates
  * Updated remote cluster selection and quantization detection guidance
* **Tests**
  * Updated evaluation test expectations to reflect refined workflow stages

<!-- review_stack_entry_start -->
[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/Model-Optimizer/pull/1493?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)
<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c9098b6 commit e2d4d73

36 files changed

Lines changed: 1070 additions & 279 deletions

.claude/skills/common/environment-setup.md

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,8 @@ cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>
2929

3030
If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.
3131

32+
If the cluster config contains multiple clusters and the user did not name the target cluster, ask which cluster to use before calling `remote_load_cluster`. Do not silently fall back to `default_cluster` in multi-cluster configs; different clusters can have different filesystems, GPU types, auth paths, and SSH setup.
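The selection rule above can be sketched as a small shell function. This is only an illustration under stated assumptions: `select_cluster` and the `ASK_USER` sentinel are hypothetical names, not part of the skill, and the cluster names are assumed to have been read from the config beforehand.

```bash
# Hypothetical helper: decide which cluster to load.
# Arguments: <user-named cluster, may be empty> <cluster names from the config...>
select_cluster() {
    local requested="$1"
    shift
    if [ -n "$requested" ]; then
        # The user named the target cluster explicitly.
        printf '%s\n' "$requested"
    elif [ "$#" -eq 1 ]; then
        # Only one cluster is configured, so there is no ambiguity.
        printf '%s\n' "$1"
    else
        # Several clusters and no explicit choice: ask instead of
        # silently falling back to default_cluster.
        echo "ASK_USER"
    fi
}
```

A caller would prompt the user whenever the function prints `ASK_USER`, and only then pass the chosen name to `remote_load_cluster`.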
33+
3234
For remote, connect:
3335

3436
```bash

.claude/skills/common/remote-execution.md

Lines changed: 4 additions & 4 deletions
@@ -33,10 +33,10 @@ default_cluster: my-cluster
3333
Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
3434

3535
```bash
36-
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/checkpoints/
36+
rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session_id>/<model>/checkpoints/
3737
```
3838

39-
Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
39+
Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
4040

4141
See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
4242

@@ -118,8 +118,8 @@ When submitting SLURM jobs remotely, write **two files** locally to avoid shell
118118
Then upload both and submit:
119119

120120
```bash
121-
remote_sync_to /local/scripts/ scripts/
122-
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
121+
remote_sync_to /local/scripts/ <session_id>/<model>/scripts/
122+
JOBID=$(remote_run "sbatch <remote_workspace>/<session_id>/<model>/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
123123
```
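The job-id extraction in the command above relies on `\+`, which is a GNU extension in basic regular expressions. A more portable variant might look like the sketch below; `parse_job_id` is a hypothetical helper name, and the input is assumed to be the usual `Submitted batch job <id>` line printed by `sbatch`.

```bash
# Hypothetical helper: read sbatch output on stdin and print the job id.
parse_job_id() {
    # -E enables extended regex, so '+' works without the GNU-specific '\+'.
    # tail keeps only the last number in case earlier lines contain digits.
    grep -Eo '[0-9]+' | tail -n 1
}
```

It would slot into the pipeline in place of the `grep | tail` pair. Where the installed Slurm supports it, `sbatch --parsable` prints only the job id (plus the cluster name, if any), which avoids parsing altogether.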
124124

125125
---
Lines changed: 67 additions & 49 deletions
@@ -1,77 +1,95 @@
11
# Workspace Management
22

3-
Organize work by model name so outputs (checkpoints, logs) are easy to find and reuse across PTQ → deploy → eval pipelines.
3+
Organize work by session id and model name so concurrent agents do not
4+
clobber each other, while outputs (checkpoints, logs) stay easy to find and
5+
reuse across PTQ → deploy → eval pipelines within the same session.
46

5-
## Single-user (default)
7+
## Session Workspaces
68

7-
Create a work directory named after the model in the current project:
9+
Use the same `<session_id>` convention as the monitor skill:
810

9-
```bash
10-
mkdir -p ./workspaces/<model-name>
11-
```
12-
13-
Use descriptive names, not timestamps:
14-
15-
```bash
16-
# Good
17-
workspaces/qwen3-0.6b-nvfp4/
18-
workspaces/llama-3.1-8b-fp8/
19-
20-
# Bad
21-
workspaces/ptq-20260318-143022/
22-
workspaces/job-001/
23-
```
24-
25-
Store outputs (checkpoints, logs) inside the workspace:
26-
27-
```bash
28-
workspaces/qwen3-0.6b-nvfp4/
29-
output/ # quantized checkpoint
30-
logs/ # job logs
31-
scripts/ # custom PTQ scripts (if unsupported model)
32-
```
11+
- Claude Code: `$CLAUDE_CODE_SESSION_ID`, or the `session_id` field from hook input
12+
- Codex: `$CODEX_THREAD_ID`
13+
- If no session id is available, create a stable id for the current terminal session and reuse it for every local and remote path created by that agent
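The lookup order in the list above could be sketched as follows. The function name `resolve_session_id`, the `SESSION_ID_FILE` variable, and the idea of persisting the fallback id in a small file are all assumptions made for illustration; the skill itself only requires that the fallback id stays stable and is reused.

```bash
# Hypothetical helper: pick the session id in the order listed above.
resolve_session_id() {
    # Assumed location for persisting a fallback id; not defined by the skill.
    local id_file="${SESSION_ID_FILE:-./.session_id}"

    if [ -n "${CLAUDE_CODE_SESSION_ID:-}" ]; then
        printf '%s\n' "$CLAUDE_CODE_SESSION_ID"
    elif [ -n "${CODEX_THREAD_ID:-}" ]; then
        printf '%s\n' "$CODEX_THREAD_ID"
    else
        # No agent-provided id: create one once, then reuse it on later calls.
        if [ ! -s "$id_file" ]; then
            printf 'session-%s-%s\n' "$(date +%Y%m%d%H%M%S)" "$$" > "$id_file"
        fi
        cat "$id_file"
    fi
}
```

Calling `SESSION_ID="$(resolve_session_id)"` once per command and using `$SESSION_ID` in every local and remote path keeps all paths from one agent under the same directory.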
3314

3415
## When to Reuse vs Create
3516

36-
**Before starting any task**, check for an existing workspace:
17+
**Before starting any task**, check for an existing workspace in the current
18+
session:
3719

3820
```bash
39-
ls ./workspaces/ 2>/dev/null
21+
ls ./workspaces/<session_id>/ 2>/dev/null
4022
```
4123

4224
**Reuse** when:
4325

44-
- Same model (e.g., deploying a model you just quantized)
26+
- The matching model workspace already exists under `./workspaces/<session_id>/`
4527
- Task requires output from a previous step (e.g., eval requires the PTQ checkpoint)
4628
- User says "deploy the model I just quantized"
4729

4830
**Create new** when:
4931

50-
- New model not seen before
32+
- No matching model workspace exists under `./workspaces/<session_id>/`
5133
- User explicitly asks for a fresh start
52-
- Different quantization format for same model (e.g., `qwen3-0.6b-fp8` vs `qwen3-0.6b-nvfp4`)
34+
35+
## Model Workspace Names
36+
37+
Within `./workspaces/<session_id>/`, create one model workspace per model or
38+
model variant. Include meaningful variant details in the model workspace name,
39+
for example quantization format or checkpoint role:
40+
41+
```bash
42+
mkdir -p ./workspaces/<session_id>/<model-name>
43+
```
44+
45+
Use descriptive model workspace names, not timestamps:
46+
47+
```text
48+
# Good
49+
workspaces/<session_id>/qwen3-0.6b-nvfp4/
50+
workspaces/<session_id>/qwen3-0.6b-fp8/
51+
workspaces/<session_id>/qwen3-0.6b-baseline/
52+
53+
# Bad
54+
workspaces/<session_id>/ptq-20260318-143022/
55+
workspaces/<session_id>/job-001/
56+
```
57+
58+
Store outputs (checkpoints, logs) inside the model workspace:
59+
60+
```text
61+
workspaces/<session_id>/qwen3-0.6b-nvfp4/
62+
output/ # quantized checkpoint
63+
logs/ # job logs
64+
scripts/ # custom PTQ scripts (if unsupported model)
65+
```
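As a rough sketch, the layout above could be created in one step. `create_model_workspace` is a hypothetical helper, not something the skill defines; it assumes the session id and model workspace name are already known and falls back to `./workspaces` when `MODELOPT_WORKSPACE_ROOT` is unset.

```bash
# Hypothetical helper: create the model workspace layout described above.
# Arguments: <session_id> <model-workspace-name>
create_model_workspace() {
    local session_id="$1" model_name="$2"
    # Use MODELOPT_WORKSPACE_ROOT when it is set, otherwise ./workspaces.
    local root="${MODELOPT_WORKSPACE_ROOT:-./workspaces}"
    local ws="$root/$session_id/$model_name"

    # -p also creates the session directory and is a no-op on reuse.
    mkdir -p "$ws/output" "$ws/logs" "$ws/scripts"
    printf '%s\n' "$ws"
}
```

Because `mkdir -p` is idempotent, the same call works for both the "create new" and the "reuse" case; the printed path can be captured and passed to later steps.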
5366

5467
## Remote execution
5568

5669
When using a remote machine (clusters.yaml configured), create matching workspaces on **both** local and remote:
5770

58-
- **Local** `./workspaces/<model>/` — write and edit scripts here
59-
- **Remote** `<remote_workspace>/workspaces/<model>/` — model downloads, execution, outputs
71+
- **Local** `./workspaces/<session_id>/<model>/` — write and edit scripts here
72+
- **Remote** `<remote_workspace>/<session_id>/<model>/` — model downloads, execution, outputs
73+
74+
Session-scope newly created remote run directories, logs, response caches,
75+
temporary configs, and output artifacts. Shared read-only or concurrency-safe
76+
caches, such as Hugging Face model caches and prebuilt container image caches,
77+
can remain outside the session directory.
6078

6179
Before running, sync the local ModelOpt source and scripts to the remote workspace:
6280

6381
```bash
6482
# Sync ModelOpt source (first time or after local changes)
65-
remote_sync_to ./ workspaces/<model>/Model-Optimizer/
83+
remote_sync_to ./ <session_id>/<model>/Model-Optimizer/
6684

6785
# Sync custom scripts
68-
remote_sync_to ./workspaces/<model>/scripts/ workspaces/<model>/scripts/
86+
remote_sync_to ./workspaces/<session_id>/<model>/scripts/ <session_id>/<model>/scripts/
6987
```
7088

7189
Download the model on the **remote** machine (avoids transferring large model files):
7290

7391
```bash
74-
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/workspaces/<model>/model')\""
92+
remote_run "python -c \"from huggingface_hub import snapshot_download; snapshot_download('<model_id>', local_dir='<remote_workspace>/<session_id>/<model>/model')\""
7593
```
7694

7795
Inspect remote files with `remote_run "cat ..."` — read README, config.json, tokenizer_config.json to understand requirements before writing scripts locally.
@@ -80,7 +98,7 @@ Inspect remote files with `remote_run "cat ..."` — read README, config.json, t
8098

8199
When `MODELOPT_WORKSPACE_ROOT` is set, use it instead of `./workspaces/`:
82100

83-
- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot)
101+
- `MODELOPT_WORKSPACE_ROOT` — user's workspace root (set by the bot); use `$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/`
84102
- `MODELOPT_REPO_DIR` — shared upstream repo (read-only, use for fresh copies)
85103

86104
To create a workspace, copy the upstream repo (without `.git`):
@@ -89,15 +107,15 @@ To create a workspace, copy the upstream repo (without `.git`):
89107
rsync -a --quiet \
90108
--exclude .git --exclude __pycache__ --exclude '*.pyc' \
91109
--exclude node_modules --exclude '*.egg-info' --exclude '*.sqsh' \
92-
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
110+
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<session_id>/<name>/"
93111
```
94112

95113
## Cross-Skill Workspace Flow
96114

97115
Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:
98116

99117
```text
100-
workspaces/model-name-format/
118+
workspaces/<session_id>/model-name-format/
101119
output/ ← PTQ: quantized checkpoint
102120
eval_results/ ← Evaluation: NEL artifacts (results.yml per task)
103121
eval_config.yaml ← Evaluation: NEL config
@@ -109,19 +127,19 @@ workspaces/model-name-format/
109127

110128
```text
111129
User: "quantize Qwen3-0.6B with nvfp4"
112-
Agent: ls workspaces/ → no "qwen3-0.6b-nvfp4"
113-
→ mkdir workspaces/qwen3-0.6b-nvfp4
114-
→ run PTQ, output to workspaces/qwen3-0.6b-nvfp4/output/
130+
Agent: ls workspaces/<session_id>/ → no "qwen3-0.6b-nvfp4"
131+
→ mkdir -p workspaces/<session_id>/qwen3-0.6b-nvfp4
132+
→ run PTQ, output to workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
115133
116134
User: "deploy the model I just quantized"
117-
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
118-
→ reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/
135+
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
136+
→ reuse, find checkpoint at workspaces/<session_id>/qwen3-0.6b-nvfp4/output/
119137
120138
User: "evaluate the quantized model on MMLU and GSM8K"
121-
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
122-
→ reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/
139+
Agent: ls workspaces/<session_id>/ → sees "qwen3-0.6b-nvfp4"
140+
→ reuse, write eval_config.yaml, results to workspaces/<session_id>/qwen3-0.6b-nvfp4/eval_results/
123141
124142
User: "now quantize Llama-3.1-8B with fp8"
125-
Agent: ls workspaces/ → no llama
126-
→ mkdir workspaces/llama-3.1-8b-fp8
143+
Agent: ls workspaces/<session_id>/ → no llama
144+
→ mkdir -p workspaces/<session_id>/llama-3.1-8b-fp8
127145
```
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
---
2+
name: compare-results
3+
description: Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).
4+
license: Apache-2.0
5+
---
6+
7+
# Compare Results
8+
9+
Use this to plan and complete a baseline-vs-candidate comparison. The baseline
10+
is the reference checkpoint, and the candidate is the checkpoint whose accuracy
11+
change is being measured, typically a further quantized version of the baseline.
12+
13+
## Workflow
14+
15+
1. Establish the candidate checkpoint/run and the matching baseline. Infer the
16+
baseline from the PTQ source model/checkpoint in the workspace or config used
17+
to create the candidate. If it cannot be inferred, ask the user for the
18+
baseline checkpoint or an existing baseline invocation/run path.
19+
2. If a required baseline or candidate evaluation is missing, delegate to the
20+
evaluation skill to create, run, and verify it. The companion evaluation
21+
config should match benchmark versions, task configs, serving args, token
22+
limits, dataset setup, credentials, cluster, and container as closely as
23+
possible; change only the model/checkpoint and checkpoint-specific serving or
24+
quantization flags.
25+
3. Fetch the baseline and candidate task list, configs, score artifacts, and
26+
logs. If the user provides MLflow runs or invocation IDs, use the
27+
accessing-mlflow skill to fetch configs and artifacts.
28+
4. Confirm each run passed evaluation Step 9, "Verify completed evaluation run",
29+
before comparing scores. If not, validate logs, server health,
30+
judge/code-execution status, sample accounting, and reasoning parsing before
31+
computing deltas.
32+
5. For each task, use the canonical score field from the matching
33+
`.claude/skills/evaluation/recipes/tasks/<task>.md` Score Extraction
34+
section.
35+
6. Compute exact deltas outside the chat context when there are multiple tasks
36+
or repeated runs.
37+
7. Report comparability and quantized-feasibility verdicts before interpreting
38+
the delta as model quality. If the user did not provide an acceptance
39+
threshold, report feasibility as inconclusive instead of inventing one.
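Step 6 asks for deltas to be computed outside the chat context. One minimal way to do that, assuming the canonical scores have already been extracted into two plain-text files of `<task> <score>` pairs, is an `awk` join. The helper name `compute_deltas` and the two-column file format are assumptions for illustration, not part of the skill.

```bash
# Hypothetical helper: print per-task deltas (candidate minus baseline).
# Each input file holds "<task> <score>" pairs, one per line (assumed format).
compute_deltas() {
    local baseline_file="$1" candidate_file="$2"
    awk '
        # First file: remember the baseline score for each task.
        NR == FNR { base[$1] = $2; next }
        # Second file: report only tasks that also have a baseline score.
        ($1 in base) {
            printf "%s %.4f %.4f %+.4f\n", $1, base[$1], $2, $2 - base[$1]
        }
    ' "$baseline_file" "$candidate_file"
}
```

Each output line lists the task, baseline score, candidate score, and signed delta. Tasks that appear in only one file are skipped silently, so sample and task accounting still has to be checked separately.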
40+
41+
## Comparability Checklist
42+
43+
Before treating a baseline-vs-quantized delta as a model quality result, verify
44+
the validated runs are comparable:
45+
46+
1. Prompt text, system prompt, chat template, and rendered messages match.
47+
2. Task name, benchmark version, dataset split, container, harness, and task
48+
fragment match.
49+
3. Generation settings match, including temperature, top_p, top_k, max tokens,
50+
stop strings, chat-template kwargs, reasoning mode/budget, and task-specific
51+
overrides.
52+
4. Reasoning traces are enabled, disabled, parsed, stripped, or ignored
53+
consistently between runs.
54+
5. The number of evaluated and scored samples/repeats matches for each task and
55+
split.
56+
6. Judge-backed or simulator-backed tasks use the same judge/user model,
57+
endpoint class, prompt, and scoring config.
58+
7. The same accuracy metric and score field is used for both runs.
59+
60+
If any item differs, either rerun with matched settings or label the result as
61+
not an apples-to-apples quantization comparison.
62+
63+
## Report Format
64+
65+
Include:
66+
67+
- Baseline and candidate identifiers.
68+
- Per-task metric path, baseline score, candidate score, delta, and stderr if
69+
available.
70+
- Comparability status for prompt/template, generation settings, sample counts,
71+
reasoning handling, judge/simulator setup, and score field.
72+
- Comparability verdict: comparable, not comparable, or inconclusive.
73+
- Quantization feasibility verdict: acceptable, not acceptable, or inconclusive.
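The feasibility verdict can be derived mechanically once a delta and a user-provided threshold are available. The sketch below is an assumption-laden illustration: `feasibility_verdict` is a hypothetical name, and it interprets the threshold as the maximum allowed score drop, which the skill does not specify.

```bash
# Hypothetical helper: map a delta and an optional user threshold to a verdict.
# Arguments: <delta> [max_allowed_drop]; both are plain decimal numbers.
feasibility_verdict() {
    local delta="$1" max_drop="${2:-}"
    if [ -z "$max_drop" ]; then
        # No user-provided threshold: report inconclusive, do not invent one.
        echo "inconclusive"
        return
    fi
    # awk does the floating-point comparison; a drop is a negative delta.
    awk -v d="$delta" -v t="$max_drop" 'BEGIN {
        if (-d <= t) print "acceptable"; else print "not acceptable"
    }'
}
```

This only covers a single task. A real report would still apply the comparability verdict first and treat a non-comparable pair as inconclusive regardless of the delta.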
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
[
2+
{
3+
"name": "baseline-vs-quantized-delta",
4+
"skills": ["compare-results"],
5+
"query": "Compare the baseline and quantized NEL runs and tell me the accuracy drop",
6+
"files": [],
7+
"expected_behavior": [
8+
"Establishes the baseline and candidate before computing deltas",
9+
"Identifies baseline and candidate run artifacts before computing deltas",
10+
"Checks that both runs were validated before accepting scores",
11+
"Uses task recipe Score Extraction sections for canonical score fields",
12+
"Verifies prompt/template, generation settings, reasoning handling, sample counts, judge/simulator setup, and score field comparability",
13+
"Computes exact per-task deltas outside the chat context when there are multiple tasks or repeated runs",
14+
"Labels the result as not apples-to-apples if comparability checks fail",
15+
"Reports quantization feasibility as inconclusive when no acceptance threshold is provided"
16+
]
17+
},
18+
{
19+
"name": "missing-baseline",
20+
"skills": ["compare-results", "evaluation"],
21+
"query": "Is this quantized eval good enough if I only have the quantized run?",
22+
"files": [],
23+
"expected_behavior": [
24+
"Does not treat a standalone quantized score as a release-ready delta",
25+
"Establishes the matching baseline from the PTQ source model/checkpoint or asks the user for one",
26+
"Delegates missing baseline or candidate evaluations to the evaluation skill",
27+
"Explains that the baseline config must match task versions, serving args, dataset setup, credentials, cluster, and container except for the checkpoint/model"
28+
]
29+
}
30+
]

.claude/skills/deployment/SKILL.md

Lines changed: 3 additions & 3 deletions
@@ -38,10 +38,10 @@ The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4)
3838

3939
### 0. Check workspace (multi-user / Slack bot)
4040

41-
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:
41+
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:
4242

4343
```bash
44-
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
44+
ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null
4545
```
4646

4747
If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
@@ -190,7 +190,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
190190
If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
191191

192192
```bash
193-
remote_sync_to <local_checkpoint_path> checkpoints/
193+
remote_sync_to <local_checkpoint_path> <session_id>/<model>/checkpoints/
194194
```
195195

196196
3. **Deploy based on remote environment:**
