[1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator #1239
Merged
21 commits
4bf8253 Polish eval skills (Edwardf0t1)
2e84f3b Polish eval skills (Edwardf0t1)
e952bcd update (Edwardf0t1)
2cb3b39 Add end-to-end workflow doc and cross-skill references (Edwardf0t1)
1b94fc9 fix format (Edwardf0t1)
b1be817 Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hyd… (Edwardf0t1)
7dcede4 fix format (Edwardf0t1)
8176fc7 Add served_model_name mismatch to NEL CI common issues (Edwardf0t1)
b0748dd Vendor launching-evals and accessing-mlflow skills from NVIDIA-NeMo/E… (Edwardf0t1)
edc3a9b Merge origin/main into zhiyu/polish-eval-skills (Edwardf0t1)
03dfca7 Add sync-upstream-skills.sh to re-vendor upstream NEL skills (Edwardf0t1)
9cb309b Delete end-to-end-workflow.md per review feedback (Edwardf0t1)
290f432 Move nel-ci-guide.md to Model-Optimizer-Internal per review feedback (Edwardf0t1)
7824b24 Unblock CI and address mxinO review on remote-execution.md (Edwardf0t1)
31c4fe8 fix format (Edwardf0t1)
645a545 Split credential setup out of slurm-setup.md into credentials.md (Edwardf0t1)
8d63c0e Merge origin/main — pulls in monitor skill from #1252 (Edwardf0t1)
c664a30 Add CHANGELOG entry for evaluation skills polish (Edwardf0t1)
86adcf4 Bump vendored upstream NEL skills SHA to 8fa16b2 (Edwardf0t1)
634ac7d Update credentials.md per @mxinO review on PR #1239 (Edwardf0t1)
655e224 Merge origin/main — bring branch up to date (Edwardf0t1)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| #!/usr/bin/env bash | ||
| # SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # Re-vendor upstream Claude skills from NVIDIA-NeMo/Evaluator at a pinned SHA. | ||
| # | ||
| # Scope: only skills we vendor verbatim (launching-evals, accessing-mlflow). | ||
| # The `evaluation` skill is a *modified* fork of upstream nel-assistant and is | ||
| # NOT managed by this script — update it manually when pulling upstream changes. | ||
| # | ||
| # Usage: | ||
| # .claude/scripts/sync-upstream-skills.sh # re-vendor at the pinned SHA | ||
| # UPSTREAM_SHA=<sha> .claude/scripts/sync-upstream-skills.sh # bump to a new SHA | ||
| # | ||
| # Requires: gh, base64, awk. Run from the repo root. | ||
| # | ||
| # The script overwrites .claude/skills/<skill>/ with upstream contents and | ||
| # re-applies our provenance lines into each SKILL.md frontmatter. If you have | ||
| # local changes to a vendored skill, they will be lost — that is expected, | ||
| # since vendored-verbatim skills should not be modified locally. | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| # Pinned upstream commit. Bump this (or pass UPSTREAM_SHA=...) when syncing. | ||
| DEFAULT_SHA="8fa16b237d11e213ea665d5bad6b44d393762317" | ||
| SHA="${UPSTREAM_SHA:-$DEFAULT_SHA}" | ||
| SHORT_SHA="${SHA:0:7}" | ||
|
|
||
| UPSTREAM_REPO="NVIDIA-NeMo/Evaluator" | ||
| UPSTREAM_BASE="packages/nemo-evaluator-launcher/.claude/skills" | ||
| DEST_BASE=".claude/skills" | ||
|
|
||
| if [[ ! -d "$DEST_BASE" ]]; then | ||
| echo "error: run from the repo root (expected $DEST_BASE/ to exist)" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| echo "Syncing upstream skills from $UPSTREAM_REPO @ $SHORT_SHA" | ||
|
|
||
| fetch_tree() { | ||
| local skill="$1" | ||
| local path="$2" | ||
| gh api "repos/$UPSTREAM_REPO/contents/$UPSTREAM_BASE/$skill/$path?ref=$SHA" \ | ||
| -q '.[] | "\(.type)\t\(.name)"' | ||
| } | ||
|
|
||
| fetch_file() { | ||
| local src="$1" | ||
| local dst="$2" | ||
| mkdir -p "$(dirname "$dst")" | ||
| gh api "repos/$UPSTREAM_REPO/contents/$src?ref=$SHA" -q '.content' | base64 -d > "$dst" | ||
| } | ||
|
|
||
| fetch_skill_recursive() { | ||
| local skill="$1" | ||
| local subpath="${2:-}" | ||
| local remote="$UPSTREAM_BASE/$skill" | ||
| [[ -n "$subpath" ]] && remote="$remote/$subpath" | ||
|
|
||
| local entries | ||
| entries=$(gh api "repos/$UPSTREAM_REPO/contents/$remote?ref=$SHA" -q '.[] | "\(.type)\t\(.name)"') | ||
|
|
||
| while IFS=$'\t' read -r type name; do | ||
| local rel_path | ||
| if [[ -n "$subpath" ]]; then | ||
| rel_path="$subpath/$name" | ||
| else | ||
| rel_path="$name" | ||
| fi | ||
|
|
||
| if [[ "$type" == "file" ]]; then | ||
| local dst="$DEST_BASE/$skill/$rel_path" | ||
| echo " fetch: $dst" | ||
| fetch_file "$UPSTREAM_BASE/$skill/$rel_path" "$dst" | ||
| elif [[ "$type" == "dir" ]]; then | ||
| fetch_skill_recursive "$skill" "$rel_path" | ||
| fi | ||
| done <<< "$entries" | ||
| } | ||
|
|
||
| # Inject our provenance lines into a SKILL.md frontmatter, right after the | ||
| # `description:` line. Idempotent — removes any existing provenance block first. | ||
| inject_provenance() { | ||
| local skill="$1" | ||
| local extra_note="${2:-}" | ||
| local path="$DEST_BASE/$skill/SKILL.md" | ||
|
|
||
| awk -v sha="$SHA" -v short="$SHORT_SHA" -v skill="$skill" -v extra="$extra_note" ' | ||
| BEGIN { in_fm = 0; injected = 0; fm_end_seen = 0 } | ||
| # Skip any pre-existing provenance or license lines we own | ||
| /^license: Apache-2\.0$/ && in_fm && !fm_end_seen { next } | ||
| /^# Vendored verbatim/ && in_fm && !fm_end_seen { next } | ||
| /^# https:\/\/github\.com\/NVIDIA-NeMo\/Evaluator\/tree\// && in_fm && !fm_end_seen { next } | ||
| /^# To re-sync:/ && in_fm && !fm_end_seen { next } | ||
| /^# Note: this skill depends on the mlflow-mcp/ && in_fm && !fm_end_seen { next } | ||
| /^# configured in the user/ && in_fm && !fm_end_seen { next } | ||
| { | ||
| if ($0 == "---") { | ||
| if (in_fm == 0) { in_fm = 1 } | ||
| else { in_fm = 0; fm_end_seen = 1 } | ||
| } | ||
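| # Emit the current line first so the provenance block lands right after description: | ||
| print | ||
| if (in_fm && !injected && $0 ~ /^description: /) { | ||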
| print "license: Apache-2.0" | ||
| print "# Vendored verbatim from NVIDIA NeMo Evaluator (commit " short ")" | ||
| print "# https://github.com/NVIDIA-NeMo/Evaluator/tree/" sha "/packages/nemo-evaluator-launcher/.claude/skills/" skill | ||
| print "# To re-sync: .claude/scripts/sync-upstream-skills.sh" | ||
| if (extra != "") { | ||
| n = split(extra, lines, "\\|") | ||
| for (i = 1; i <= n; i++) print "# " lines[i] | ||
| } | ||
| injected = 1 | ||
| } | ||
| } | ||
| ' "$path" > "$path.tmp" | ||
| mv "$path.tmp" "$path" | ||
| } | ||
|
|
||
| for skill in launching-evals accessing-mlflow; do | ||
| echo "" | ||
| echo "== $skill ==" | ||
| rm -rf "${DEST_BASE:?}/$skill" | ||
| fetch_skill_recursive "$skill" | ||
|
|
||
| case "$skill" in | ||
| accessing-mlflow) | ||
| inject_provenance "$skill" \ | ||
| "Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp)|configured in the user's Claude Code setup." | ||
| ;; | ||
| *) | ||
| inject_provenance "$skill" | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| echo "" | ||
| echo "Done. Review with: git diff $DEST_BASE/launching-evals $DEST_BASE/accessing-mlflow" | ||
| echo "If the SHA changed, update DEFAULT_SHA at the top of this script before committing." |
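
The idempotency claim for `inject_provenance` can be checked offline with a reduced sketch. Everything here is illustrative: `inject_demo`, the sample frontmatter, and the single provenance line are assumptions made for the demo, and the `awk` program is a trimmed variant of the one in the diff, not the script's actual helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical, reduced variant of inject_provenance: one provenance line only.
inject_demo() {
    local path="$1"
    local short="$2"
    awk -v short="$short" '
        # Drop a provenance line left behind by an earlier run (frontmatter only).
        /^# Vendored verbatim/ && in_fm && !fm_end_seen { next }
        {
            if ($0 == "---") {
                if (in_fm == 0) { in_fm = 1 }
                else { in_fm = 0; fm_end_seen = 1 }
            }
            # Pass the original line through, then inject after description:.
            print
            if (in_fm && !injected && $0 ~ /^description: /) {
                print "# Vendored verbatim from NVIDIA NeMo Evaluator (commit " short ")"
                injected = 1
            }
        }
    ' "$path" > "$path.tmp"
    mv "$path.tmp" "$path"
}

demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

# Made-up SKILL.md with minimal frontmatter.
printf '%s\n' '---' 'name: demo-skill' 'description: Demo skill.' '---' '# Demo' \
    > "$demo_dir/SKILL.md"

inject_demo "$demo_dir/SKILL.md" 8fa16b2
first_pass=$(cat "$demo_dir/SKILL.md")
inject_demo "$demo_dir/SKILL.md" 8fa16b2
second_pass=$(cat "$demo_dir/SKILL.md")

if [[ "$first_pass" == "$second_pass" ]]; then
    echo "second run left the file unchanged"
fi
```

Running the injection twice and comparing the two results is a cheap way to confirm that the skip rules remove exactly what the injection step adds.
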
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| --- | ||
| name: accessing-mlflow | ||
| description: Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup. | ||
| license: Apache-2.0 | ||
| # Vendored verbatim from NVIDIA NeMo Evaluator (commit 8fa16b2) | ||
| # https://github.com/NVIDIA-NeMo/Evaluator/tree/8fa16b237d11e213ea665d5bad6b44d393762317/packages/nemo-evaluator-launcher/.claude/skills/accessing-mlflow | ||
| # To re-sync: .claude/scripts/sync-upstream-skills.sh | ||
| # Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp) | ||
| # configured in the user's Claude Code setup. | ||
| --- | ||
|
|
||
| # Accessing MLflow | ||
|
|
||
| ## MCP Server | ||
|
|
||
| [mlflow-mcp](https://github.com/kkruglik/mlflow-mcp) gives agents direct access to MLflow — query runs, compare metrics, browse artifacts, all through natural language. | ||
|
|
||
| ## ID Convention | ||
|
|
||
| When the user provides a hex ID (e.g. `71f3f3199ea5e1f0`) without specifying what it is, assume it is an **invocation_id** (not an MLflow run_id). An invocation_id identifies a launcher invocation and is stored as both a tag and a param on MLflow runs. One invocation can produce multiple MLflow runs (one per task). You may need to search across multiple experiments if you don't know which experiment the run belongs to. | ||
|
|
||
| ## Querying Runs | ||
|
|
||
| ```python | ||
| # Find runs by invocation_id | ||
| MLflow:search_runs_by_tags(experiment_id, {"invocation_id": "<invocation_id>"}) | ||
|
|
||
| # Query for example model/task runs | ||
| MLflow:query_runs(experiment_id, "tags.model LIKE '%<model>%'") | ||
| MLflow:query_runs(experiment_id, "tags.task_name LIKE '%<task_name>%'") | ||
|
|
||
| # Get a config from run's artifacts | ||
| MLflow:get_artifact_content(run_id, "config.yml") | ||
|
|
||
| # Get nested stats from run's artifacts | ||
| MLflow:get_artifact_content(run_id, "artifacts/eval_factory_metrics.json") | ||
| ``` | ||
|
|
||
| NOTE: You WILL NOT find PENDING, RUNNING, KILLED, or FAILED runs in MLflow! Only SUCCESSFUL runs are exported to MLflow. | ||
|
|
||
| ## Workflow Tips | ||
|
|
||
| When comparing metrics across runs, fetch the data via MCP, then run the computation in Python for exact results rather than doing math in-context: | ||
|
|
||
| ```bash | ||
| uv run --with pandas python3 << 'EOF' | ||
| import pandas as pd | ||
| # ... compute deltas, averages, etc. | ||
| EOF | ||
| ``` | ||
|
|
||
| ## Artifacts Structure | ||
|
|
||
| ``` | ||
| <harness>.<task>/ | ||
| ├── artifacts/ | ||
| │ ├── config.yml # Fully resolved config used during the evaluation | ||
| │ ├── launcher_unresolved_config.yaml # Unresolved config passed to the launcher | ||
| │ ├── results.yml # All results in YAML format | ||
| │ ├── eval_factory_metrics.json # Runtime stats (latency, tokens count, memory) | ||
| │ ├── report.html # Request-Response Pairs samples in HTML format (if enabled) | ||
| │ └── report.json # Request-Response Pairs samples in JSON format (if enabled) | ||
| └── logs/ | ||
| ├── client-*.log # Evaluation client | ||
| ├── server-*-N.log # Deployment per node | ||
| ├── slurm-*.log # Slurm job | ||
| └── proxy-*.log # Request proxy | ||
| ``` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| If the MLflow MCP server fails to load or its tools are unavailable: | ||
|
|
||
| 1. **`uvx` not found** — install [uv](https://docs.astral.sh/uv/getting-started/installation/): | ||
| ```bash | ||
| curl -LsSf https://astral.sh/uv/install.sh | sh | ||
| ``` | ||
| 2. **MCP server not configured** — add the config and restart the agent: | ||
|
|
||
| **For Claude Code** — add to `.claude/settings.json` (project or user level), under `"mcpServers"`: | ||
| ```json | ||
| "MLflow": { | ||
| "command": "uvx", | ||
| "args": ["mlflow-mcp"], | ||
| "env": { | ||
| "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/" | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| **For Cursor** — edit `~/.cursor/mcp.json` (Settings > Tools & MCP > New MCP Server): | ||
| ```json | ||
| { | ||
| "mcpServers": { | ||
| "MLflow": { | ||
| "command": "uvx", | ||
| "args": ["mlflow-mcp"], | ||
| "env": { | ||
| "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # Credentials Setup | ||
|
|
||
| Tokens and registry credentials that ModelOpt workflows need across local and cluster environments. Not SLURM-specific — referenced from PTQ, deployment, evaluation, and slurm-setup skills. | ||
|
|
||
| ## Check what's already set first | ||
|
|
||
| Before configuring anything, check what the user already has — many of these are likely in place from prior `hf auth login`, `docker login`, or previous SLURM work. Skip any section below for which credentials are already present. | ||
|
|
||
| ```bash | ||
| # HF token: env var or persisted from `hf auth login` | ||
| [ -n "$HF_TOKEN" ] && echo "✓ HF_TOKEN set in env" | ||
| [ -s ~/.cache/huggingface/token ] && echo "✓ HF token at ~/.cache/huggingface/token (from 'hf auth login')" | ||
|
|
||
| # Docker / NGC registry credentials | ||
| grep -qE '"(nvcr\.io|https://index\.docker\.io/v1/)"' ~/.docker/config.json 2>/dev/null && echo "✓ Docker login present" | ||
|
|
||
| # Enroot / pyxis credentials (on cluster login node, for SLURM users) | ||
| grep -qE '^machine nvcr\.io ' ~/.config/enroot/.credentials 2>/dev/null && echo "✓ Enroot NGC entry present" | ||
| ``` | ||
|
|
||
| For remote clusters, run the same checks via SSH (`ssh <cluster-login> '<check>'`) — credentials live on the cluster, not your workstation. | ||
|
|
||
| ## HuggingFace token (`HF_TOKEN`) | ||
|
|
||
| Required for gated models (e.g., Llama, Mistral, some Nemotron variants) and gated datasets (e.g., GPQA, HLE). | ||
|
|
||
| Generate at <https://huggingface.co/settings/tokens>. Two persistence options (you can use either or both): | ||
|
|
||
| 1. **`hf auth login`** (recommended for interactive use) — stores the token at `~/.cache/huggingface/token`. The HF Python client picks it up automatically; `transformers`, `datasets`, and the `hf` CLI all read this file without needing `HF_TOKEN` in the env. | ||
|
|
||
| ```bash | ||
| pip install -U huggingface_hub | ||
| hf auth login # paste the token interactively | ||
| ``` | ||
|
|
||
| 2. **Environment variable** (good for scripts, CI, and remote sessions): | ||
|
|
||
| ```bash | ||
| export HF_TOKEN=hf_... | ||
| ``` | ||
|
|
||
| Persist in `~/.bashrc` or a project-local `.env` file. When both the environment variable and a token stored by `hf auth login` are present, `HF_TOKEN` takes precedence. | ||
|
|
||
| ## NGC API key (for `nvcr.io`) | ||
|
|
||
| Required for pulling NGC images (`nvcr.io/nvidia/pytorch:...`, `nvcr.io/nvidia/vllm:...`) via Docker, `srun --container-image`, or enroot. | ||
|
|
||
| Generate at <https://ngc.nvidia.com/setup/api-key>. | ||
|
|
||
| ### Docker | ||
|
|
||
| ```bash | ||
| docker login nvcr.io -u '$oauthtoken' -p <NGC_API_KEY> | ||
| ``` | ||
|
|
||
| ### Enroot (SLURM / pyxis) | ||
|
|
||
| Add an entry to `~/.config/enroot/.credentials` on the cluster. The file may already hold credentials for other registries — **append rather than overwrite**: | ||
|
|
||
| ```bash | ||
| mkdir -p ~/.config/enroot | ||
| CREDS=~/.config/enroot/.credentials | ||
| touch "$CREDS" | ||
| grep -q '^machine nvcr.io ' "$CREDS" || \ | ||
| echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CREDS" | ||
| chmod 600 "$CREDS" | ||
| ``` | ||
|
|
||
| > **Note**: `$oauthtoken` is a **literal string** required by NGC, not a shell variable. Do not replace it and do not let your shell expand it — the single quotes above keep it literal. | ||
|
|
||
| Without this, `srun --container-image=nvcr.io/...` fails with `401 Unauthorized` when the compute node tries to pull. | ||
|
|
||
| ## Docker Hub login | ||
|
|
||
| Only needed if you hit rate limits pulling public images: | ||
|
|
||
| ```bash | ||
| docker login | ||
| ``` | ||
|
|
||
| ## Summary | ||
|
|
||
| | Credential | Used for | Set via | | ||
| |---|---|---| | ||
| | `HF_TOKEN` | Gated HF models / datasets | `hf auth login`, env var (`export HF_TOKEN=...`), or `.env` | |
| | NGC API key | `nvcr.io` image pulls | `docker login` or `~/.config/enroot/.credentials` | | ||
| | Docker Hub | Rate-limited public image pulls | `docker login` | | ||
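
The "check first, then configure only what is missing" flow could be wrapped in one helper. The sketch below is an assumption-laden illustration: `missing_credentials` and its output labels (`hf`, `docker-ngc`, `enroot-ngc`) are hypothetical names, it only looks for the `nvcr.io` Docker entry, and it takes the home directory as a parameter so it can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print one label per credential that still needs setup.
missing_credentials() {
    local home="${1:-$HOME}"

    # HF token: either the env var or the file written by `hf auth login`.
    if [[ -z "${HF_TOKEN:-}" && ! -s "$home/.cache/huggingface/token" ]]; then
        echo "hf"
    fi

    # Docker credentials for nvcr.io.
    if ! grep -qE '"nvcr\.io"' "$home/.docker/config.json" 2>/dev/null; then
        echo "docker-ngc"
    fi

    # Enroot / pyxis credentials for nvcr.io.
    if ! grep -qE '^machine nvcr\.io ' "$home/.config/enroot/.credentials" 2>/dev/null; then
        echo "enroot-ngc"
    fi
}

# Scratch home directory that only has the enroot entry in place.
demo_home=$(mktemp -d)
trap 'rm -rf "$demo_home"' EXIT
mkdir -p "$demo_home/.config/enroot"
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' \
    > "$demo_home/.config/enroot/.credentials"

# With HF_TOKEN emptied for this call, the HF and Docker entries are reported.
HF_TOKEN='' missing_credentials "$demo_home"
```

On a remote cluster the same function body could be run over SSH, since the credentials live on the cluster rather than on the workstation.
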
We can add a section before setup: the user may already have some or all of the credentials set up, so the first step should be to check whether the needed token is already set.
Good idea. Added a "Check what's already set first" section at the top of `credentials.md` in 634ac7d, with one-liner detection for the HF token (env + `~/.cache/huggingface/token`), Docker login, and enroot/NGC entries. The agent skips any section already in place.