Merged
Changes from 8 commits
Commits
21 commits
4bf8253
Polish eval skills
Edwardf0t1 Apr 12, 2026
2e84f3b
Polish eval skills
Edwardf0t1 Apr 12, 2026
e952bcd
update
Edwardf0t1 Apr 12, 2026
2cb3b39
Add end-to-end workflow doc and cross-skill references
Edwardf0t1 Apr 12, 2026
1b94fc9
fix format
Edwardf0t1 Apr 13, 2026
b1be817
Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hyd…
Edwardf0t1 Apr 13, 2026
7dcede4
fix format
Edwardf0t1 Apr 13, 2026
8176fc7
Add served_model_name mismatch to NEL CI common issues
Edwardf0t1 Apr 13, 2026
b0748dd
Vendor launching-evals and accessing-mlflow skills from NVIDIA-NeMo/E…
Edwardf0t1 Apr 19, 2026
edc3a9b
Merge origin/main into zhiyu/polish-eval-skills
Edwardf0t1 Apr 19, 2026
03dfca7
Add sync-upstream-skills.sh to re-vendor upstream NEL skills
Edwardf0t1 Apr 19, 2026
9cb309b
Delete end-to-end-workflow.md per review feedback
Edwardf0t1 Apr 19, 2026
290f432
Move nel-ci-guide.md to Model-Optimizer-Internal per review feedback
Edwardf0t1 Apr 19, 2026
7824b24
Unblock CI and address mxinO review on remote-execution.md
Edwardf0t1 Apr 19, 2026
31c4fe8
fix format
Edwardf0t1 Apr 19, 2026
645a545
Split credential setup out of slurm-setup.md into credentials.md
Edwardf0t1 Apr 19, 2026
8d63c0e
Merge origin/main — pulls in monitor skill from #1252
Edwardf0t1 Apr 19, 2026
c664a30
Add CHANGELOG entry for evaluation skills polish
Edwardf0t1 Apr 19, 2026
86adcf4
Bump vendored upstream NEL skills SHA to 8fa16b2
Edwardf0t1 Apr 27, 2026
634ac7d
Update credentials.md per @mxinO review on PR #1239
Edwardf0t1 Apr 27, 2026
655e224
Merge origin/main — bring branch up to date
Edwardf0t1 Apr 27, 2026
70 changes: 70 additions & 0 deletions .claude/skills/common/end-to-end-workflow.md
Contributor

Is this workflow really necessary? It feels like the existing skills and Claude Code already cover it.

Contributor

+1. As I remember, we talked about this and decided not to add an e2e workflow; the idea is that Claude knows to call multiple skills when the user says something like "quantize and deploy".

Contributor Author

You're right — deleted in 9cb309b. The skill descriptions already handle cross-skill routing ("quantize and deploy" chains ptq → deployment); an e2e prose doc was duplicating that in a form agents don't read.

Contributor Author

Acknowledged, sorry for re-introducing it despite the earlier discussion. Deleted in 9cb309b. The one useful insight ("carry PTQ patches forward to deploy/eval") is preserved as a one-liner in evaluation/SKILL.md.

@@ -0,0 +1,70 @@
# End-to-End Workflow: PTQ → Deploy → Eval

This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy.

## Pipeline Overview

```text
PTQ (quantize)        →  Deployment (serve)       →  Evaluation (benchmark)
─────────────────        ──────────────────          ────────────────────────
hf_ptq.py                vLLM / SGLang / TRT-LLM     NEL (SLURM or JET)
        ↓                        ↓                           ↓
NVFP4/FP8 checkpoint     OpenAI-compatible API       MMLU, GSM8K, GPQA scores
(safetensors)            (http://host:8000)          (results.yml)
```

## Workspace Continuity

All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside:

```text
workspaces/model-name-format/
  output/            ← PTQ checkpoint (safetensors + config.json)
  eval_results/      ← NEL evaluation artifacts (results.yml per task)
  eval_config.yaml   ← NEL config for evaluation
  scripts/           ← Custom run scripts (if needed)
  logs/              ← SLURM job logs
```

When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run:

```bash
ls workspaces/
```

## Unsupported Models

Models not in the verified support matrices require extra work at each stage:

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` |

Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too.

## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
    sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
    ${deployment.base_command}
```

This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`).
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Outdated

## Decision: NEL SLURM Executor vs NEL CI (JET)

| Factor | NEL SLURM executor | NEL CI (JET) |
|--------|-------------------|--------------|
| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs |
| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage |
| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets |
| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` |
| **MLflow export** | Manual setup | Automatic |
| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` |
11 changes: 11 additions & 0 deletions .claude/skills/common/remote-execution.md
@@ -28,6 +28,17 @@ clusters:
default_cluster: my-cluster
```

### Checkpoint and storage availability

Contributor

I didn't get this part. The compute nodes should have the same storage access as the login node; the only special case I have seen is dlcluster, where each node needs an IT path export to access user storage. This shouldn't be added publicly, and internally we should use our team's storage path to avoid this.

Contributor Author

Good catch — framing was wrong. Compute nodes on a given cluster do share login-node storage; the real issue is workstation paths aren't on the cluster at all. Reframed as "Staging checkpoints from your workstation" in 7824b24, and dropped the dlcluster-specific row per your note that it's an internal quirk that shouldn't ship publicly.


Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:

| Cluster type | Compute-node storage | NOT accessible from compute nodes |
|-------------|---------------------|----------------------------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |

If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
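
The "verify before submitting" step above could be sketched as a small pre-flight check. This is only an illustration: `checkpoint_ready` is a hypothetical helper, not part of NEL or SLURM, and it assumes a checkpoint directory is usable when it holds `config.json` plus at least one `.safetensors` file. It is meant to run on the target cluster, for example over SSH.

```bash
# Hypothetical pre-flight check (not part of NEL or SLURM).
# Assumption: a usable checkpoint has config.json and at least one .safetensors file.
checkpoint_ready() {
    dir="$1"
    if [ ! -f "$dir/config.json" ]; then
        echo "missing $dir/config.json" >&2
        return 1
    fi
    # At least one weights file must be visible from this node.
    if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
        echo "no .safetensors files in $dir" >&2
        return 1
    fi
    return 0
}
```

If the check fails on the target cluster, the checkpoint still has to be copied there first, for instance with `rsync -a` over SSH to a path that cluster can read.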

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---
14 changes: 14 additions & 0 deletions .claude/skills/common/slurm-setup.md
@@ -51,6 +51,20 @@ srun \
"
```

### Container registry credentials (pyxis)

If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing:

```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```

Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.

Contributor

@mxinO mxinO Apr 14, 2026

This doesn't seem SLURM-specific. For env setup, we should have a general setup guide that helps users set the HF token, Docker login token, NGC login token, etc. Maybe we can create another env-setup.md to handle this in one place? But this sounds beyond a modelopt skill. cc @kaix-nv

Contributor Author

Agreed. Split out into `.claude/skills/common/credentials.md` in 645a545, covering HF_TOKEN, the NGC API key (Docker + enroot paths), and Docker Hub. `slurm-setup.md` keeps only a one-paragraph pyxis pointer to it. The same change also handles CodeRabbit's `$oauthtoken`-is-literal note and Copilot's append-don't-overwrite concern.
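
The review thread on this hunk raises two concerns: `$oauthtoken` is a literal login name, and the credentials file should be appended to rather than overwritten. A minimal sketch of that idea follows. `add_enroot_credential` is a hypothetical helper written for illustration, not the code that landed in `credentials.md`, and it assumes the file already ends with a newline if it has content.

```bash
# Hypothetical helper, for illustration only.
# Adds an nvcr.io entry unless one is already present, and never truncates the file.
add_enroot_credential() {
    cred_file="$1"
    api_key="$2"
    mkdir -p "$(dirname "$cred_file")"
    touch "$cred_file"
    chmod 600 "$cred_file"
    if grep -q '^machine nvcr\.io ' "$cred_file"; then
        echo "nvcr.io entry already present; leaving $cred_file unchanged"
        return 0
    fi
    # '$oauthtoken' is single-quoted so the shell writes it literally instead of expanding it.
    printf 'machine nvcr.io login %s password %s\n' '$oauthtoken' "$api_key" >> "$cred_file"
}
```

A call might look like `add_enroot_credential ~/.config/enroot/.credentials "<NGC_API_KEY>"`, with the placeholder replaced by a real key. Running it a second time leaves the existing entry untouched.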

Submit and capture the job ID:

```bash
19 changes: 19 additions & 0 deletions .claude/skills/common/workspace-management.md
@@ -92,6 +92,21 @@ rsync -a --quiet \
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
```

## Cross-Skill Workspace Flow

Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:

```text
workspaces/model-name-format/
  output/            ← PTQ: quantized checkpoint
  eval_results/      ← Evaluation: NEL artifacts (results.yml per task)
  eval_config.yaml   ← Evaluation: NEL config
  scripts/           ← Deployment/PTQ: custom run scripts
  logs/              ← All: SLURM job logs
```

See `skills/common/end-to-end-workflow.md` for the full pipeline.
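
The reuse-or-create decision behind this flow could look roughly like the sketch below. `reuse_or_create_workspace` is a hypothetical name used only for illustration, and the sketch assumes workspaces are plain directories under a single root such as `$MODELOPT_WORKSPACE_ROOT`.

```bash
# Hypothetical helper, for illustration only.
# Reuses an existing workspace directory, or creates it on first use.
reuse_or_create_workspace() {
    root="$1"
    name="$2"
    if [ -d "$root/$name" ]; then
        echo "reuse $root/$name"
    else
        mkdir -p "$root/$name"
        echo "create $root/$name"
    fi
}
```

The first call for a given name creates the directory; later calls from the deploy or eval stage report `reuse`, so each stage keeps writing into the same tree.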

## Example Flow

```text
@@ -104,6 +119,10 @@ User: "deploy the model I just quantized"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
       → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/

User: "evaluate the quantized model on MMLU and GSM8K"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
       → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/

User: "now quantize Llama-3.1-8B with fp8"
Agent: ls workspaces/ → no llama
       → mkdir workspaces/llama-3.1-8b-fp8
17 changes: 16 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -12,10 +12,12 @@ license: Apache-2.0

You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.

-### Workspace (multi-user / Slack bot)
+### Workspace and Pipeline Integration

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.

This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. See `skills/common/end-to-end-workflow.md` for the full pipeline.

### Workflow

```text
@@ -286,6 +288,19 @@ After job submission, you can monitor progress using:

---

### NEL CI and Cluster-Specific Notes

For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`. It covers:
- NEL CI GitLab trigger pattern vs NEL SLURM executor
- Cluster-specific GPU counts and storage paths
- Checkpoint availability (compute nodes may not share login node filesystems)
- Environment variable prefixes (`host:`, `lit:`) for SLURM executor
- SGLang must bind `--host 0.0.0.0` for health checks
- Directory setup and `chmod 777` for JET service account access
- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting)

---

Direct users with issues to:

- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>