[1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator #1239
Changes from 8 commits
4bf8253
2e84f3b
e952bcd
2cb3b39
1b94fc9
b1be817
7dcede4
8176fc7
b0748dd
edc3a9b
03dfca7
9cb309b
290f432
7824b24
31c4fe8
645a545
8d63c0e
c664a30
86adcf4
634ac7d
655e224
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| # End-to-End Workflow: PTQ → Deploy → Eval | ||
|
|
||
| This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy. | ||
|
|
||
| ## Pipeline Overview | ||
|
|
||
| ```text | ||
| PTQ (quantize) → Deployment (serve) → Evaluation (benchmark) | ||
| ───────────────── ────────────────── ──────────────────────── | ||
| hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET) | ||
| ↓ ↓ ↓ | ||
| NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores | ||
| (safetensors) (http://host:8000) (results.yml) | ||
| ``` | ||
|
|
||
| ## Workspace Continuity | ||
|
|
||
| All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside: | ||
|
|
||
| ```text | ||
| workspaces/model-name-format/ | ||
| output/ ← PTQ checkpoint (safetensors + config.json) | ||
| eval_results/ ← NEL evaluation artifacts (results.yml per task) | ||
| eval_config.yaml ← NEL config for evaluation | ||
| scripts/ ← Custom run scripts (if needed) | ||
| logs/ ← SLURM job logs | ||
| ``` | ||
|
|
||
| When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run: | ||
|
|
||
| ```bash | ||
| ls workspaces/ | ||
| ``` | ||
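A bare `ls` only shows that a directory exists. The sketch below narrows the check to workspaces that actually hold a PTQ checkpoint, assuming the layout shown above (`output/` containing `config.json` plus at least one `*.safetensors` file). The function name `list_ptq_workspaces` and the default root `workspaces` are illustrative and not part of any skill.

```bash
# List workspaces under a root directory that already contain a PTQ checkpoint.
# Assumption: a usable checkpoint is output/config.json plus at least one
# *.safetensors file, per the workspace layout described above.
list_ptq_workspaces() (
    root="${1:-workspaces}"
    for ws in "$root"/*/; do
        [ -d "$ws" ] || continue                      # glob matched nothing
        [ -f "${ws}output/config.json" ] || continue  # no model config
        # ls fails when the glob matches no weight files
        ls "${ws}output/"*.safetensors > /dev/null 2>&1 || continue
        basename "$ws"
    done
)
```

Calling `list_ptq_workspaces` with no argument scans `./workspaces`; an empty result suggests the PTQ stage still has to run.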
|
|
||
| ## Unsupported Models | ||
|
|
||
| Models not in the verified support matrices require extra work at each stage: | ||
|
|
||
| | Stage | What can go wrong | Reference | | ||
| |-------|-------------------|-----------| | ||
| | **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` | | ||
| | **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` | | ||
| | **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` | | ||
|
|
||
| Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too. | ||
|
|
||
| ## NEL Evaluation with Custom Deployments | ||
|
|
||
| When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving: | ||
|
|
||
| ```yaml | ||
| deployment: | ||
| command: >- | ||
| pip install "transformers>=5.0.0.dev0" --pre -q && | ||
| sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py && | ||
| ${deployment.base_command} | ||
| ``` | ||
|
|
||
| This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`). | ||
|
coderabbitai[bot] marked this conversation as resolved.
Outdated
|
||
|
|
||
| ## Decision: NEL SLURM Executor vs NEL CI (JET) | ||
|
|
||
| | Factor | NEL SLURM executor | NEL CI (JET) | | ||
| |--------|-------------------|--------------| | ||
| | **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs | | ||
| | **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage | | ||
| | **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets | | ||
| | **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` | | ||
| | **MLflow export** | Manual setup | Automatic | | ||
| | **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` | | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,17 @@ clusters: | |
| default_cluster: my-cluster | ||
| ``` | ||
|
|
||
| ### Checkpoint and storage availability | ||
|
Contributor
I didn't get this part. The compute node should have the same storage access as the login node; the only special case I have seen is
Contributor
Author
Good catch, the framing was wrong. Compute nodes on a given cluster do share login-node storage; the real issue is that workstation paths aren't on the cluster at all. Reframed as "Staging checkpoints from your workstation" in 7824b24, and dropped the dlcluster-specific row per your note that it's an internal quirk that shouldn't ship publicly.
||
|
|
||
| Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes: | ||
|
|
||
| | Cluster type | Compute-node storage | NOT accessible from compute nodes | | ||
| |-------------|---------------------|----------------------------------| | ||
| | JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts | | ||
| | dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths | | ||
|
|
||
| If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically. | ||
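One way to catch a missing copy before a long job fails is to run a small guard on the node where the workload will execute. The following is a minimal sketch under stated assumptions: the function name `require_checkpoint_dir` is hypothetical, and treating `config.json` as the marker of a complete copy is an assumption taken from the checkpoint layout, not a NEL or SLURM feature.

```bash
# Fail fast when a checkpoint path is not visible on the current node.
# Hypothetical helper; meant to be sourced and run where the job executes.
require_checkpoint_dir() (
    ckpt="$1"
    if [ ! -d "$ckpt" ] || [ ! -r "$ckpt" ]; then
        echo "checkpoint not accessible on $(uname -n): $ckpt" >&2
        exit 1    # exits only this subshell, so the function returns 1
    fi
    # Assumption: config.json marks a complete copy of the checkpoint.
    if [ ! -f "$ckpt/config.json" ]; then
        echo "no config.json under $ckpt (incomplete copy?)" >&2
        exit 1
    fi
)
```

The function could be invoked through `srun` on a single node before the real submission. If it fails, a plain `rsync -a <local-output-dir>/ <user>@<login-node>:<cluster-path>/` is one common way to stage the files; the angle-bracket values are placeholders to fill in for your cluster.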
|
|
||
| See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types. | ||
|
|
||
| --- | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,6 +51,20 @@ srun \ | |
| " | ||
| ``` | ||
|
|
||
| ### Container registry credentials (pyxis) | ||
|
|
||
| If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add them if missing: | ||
|
|
||
| ```bash | ||
| cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials" | ||
| # To add NGC credentials: | ||
| mkdir -p ~/.config/enroot | ||
| echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials | ||
|
Contributor
This doesn't seem SLURM-specific. Considering the env setup, we should have a general setup guide to help users set the HF token, Docker login token, NGC login token, etc. Maybe we can create another
Contributor
Author
Agreed, split out into
||
| chmod 600 ~/.config/enroot/.credentials | ||
| ``` | ||
|
|
||
| Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`. | ||
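Note that redirecting with `>` replaces the whole credentials file, which would drop entries for any other registry. The sketch below appends an entry only when the registry has none yet. The helper name `add_enroot_credential` and its optional fourth argument are illustrative assumptions; the entry format is the same netrc-style line used above.

```bash
# Add a netrc-style entry to the enroot credentials file only if the
# registry has no entry yet, so credentials for other registries survive.
# Hypothetical helper: args are machine, login, password, [credentials file].
add_enroot_credential() (
    machine="$1"; login="$2"; password="$3"
    cred_file="${4:-$HOME/.config/enroot/.credentials}"
    mkdir -p "$(dirname "$cred_file")"
    touch "$cred_file"
    chmod 600 "$cred_file"            # keep the key readable by owner only
    if grep -qF "machine $machine " "$cred_file"; then
        echo "entry for $machine already present; left unchanged" >&2
        exit 0                        # leaves only this subshell
    fi
    printf 'machine %s login %s password %s\n' \
        "$machine" "$login" "$password" >> "$cred_file"
)
```

A typical call would be `add_enroot_credential nvcr.io '$oauthtoken' "$NGC_API_KEY"`. The single quotes matter: `$oauthtoken` is the literal NGC login name, not a shell variable.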
|
|
||
| Submit and capture the job ID: | ||
|
|
||
| ```bash | ||
|
|
||
Is this workflow really necessary? It feels like the existing skills and Claude Code already cover it.
+1. As I remember, we talked about this and decided not to add an e2e workflow; the idea is that Claude knows to call multiple skills when the user says something like "quantize and deploy".
You're right — deleted in 9cb309b. The skill descriptions already handle cross-skill routing ("quantize and deploy" chains ptq → deployment); an e2e prose doc was duplicating that in a form agents don't read.
Acknowledged, sorry for re-introducing it despite the earlier discussion. Deleted in 9cb309b. The one useful insight ("carry PTQ patches forward to deploy/eval") is preserved as a one-liner in `evaluation/SKILL.md`.