# [1/N] Polish evaluation skills and common skills based on an E2E workflow testing #1239

@@ -0,0 +1,70 @@

# End-to-End Workflow: PTQ → Deploy → Eval

This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy.

## Pipeline Overview

```text
PTQ (quantize)        →   Deployment (serve)        →   Evaluation (benchmark)
─────────────────         ──────────────────            ────────────────────────
hf_ptq.py                 vLLM / SGLang / TRT-LLM       NEL (SLURM or JET)
        ↓                         ↓                             ↓
NVFP4/FP8 checkpoint      OpenAI-compatible API         MMLU, GSM8K, GPQA scores
(safetensors)             (http://host:8000)            (results.yml)
```

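As a concrete sketch of the three stages as shell commands — a sketch only: the `hf_ptq.py` flags follow the ModelOpt PTQ examples, the workspace name follows the convention in the next section, and the NEL invocation is a placeholder to verify against the NEL docs:

```bash
WS=workspaces/llama-3.1-8b-nvfp4   # hypothetical workspace (see Workspace Continuity)

# 1. PTQ: quantize to NVFP4 and export a safetensors checkpoint.
python hf_ptq.py \
    --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct \
    --qformat nvfp4 \
    --export_path "$WS/output"

# 2. Deployment: serve the checkpoint behind an OpenAI-compatible API
#    (vLLM shown; SGLang or TRT-LLM work analogously).
vllm serve "$WS/output" --port 8000 &

# 3. Evaluation: point NEL at the endpoint using the workspace config.
#    (CLI name and flags are placeholders; check the NEL docs.)
nemo-evaluator-launcher run --config "$WS/eval_config.yaml"
```
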
## Workspace Continuity

All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside:

```text
workspaces/model-name-format/
  output/            ← PTQ checkpoint (safetensors + config.json)
  eval_results/      ← NEL evaluation artifacts (results.yml per task)
  eval_config.yaml   ← NEL config for evaluation
  scripts/           ← Custom run scripts (if needed)
  logs/              ← SLURM job logs
```

When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run:

```bash
ls workspaces/
```

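If no prior workspace exists, the layout can be created up front. A minimal sketch using the directory names from the tree above (the workspace name itself is a placeholder):

```bash
WS=workspaces/llama-3.1-8b-nvfp4   # substitute your model and quant format
mkdir -p "$WS"/{output,eval_results,scripts,logs}
```
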
## Unsupported Models

Models not in the verified support matrices require extra work at each stage:

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` |

Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next; e.g., if PTQ required a transformers upgrade, deployment and evaluation will too.

## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
    sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
    ${deployment.base_command}
```

This works with both the NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`).

> **Review comment (lines +47 to +59):** NEL supports overriding `deployment.command` in SLURM executor mode, but no evidence was found for CI-mode support; the official docs only document local, Slurm, and Lepton AI executors. Remove or verify the unsupported CI-mode claim: the sentence above states the override also works with NEL CI via `NEL_DEPLOYMENT_COMMAND`.

## Decision: NEL SLURM Executor vs NEL CI (JET)

| Factor | NEL SLURM executor | NEL CI (JET) |
|--------|-------------------|--------------|
| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs |
| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage |
| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets |
| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` |
| **MLflow export** | Manual setup | Automatic |
| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` |

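For the SLURM-executor secrets row, a sketch of exporting your own tokens before launching. The variable names are common conventions, not verified NEL keys; check your NEL config schema for the exact `host:` entries:

```bash
export HF_TOKEN="hf_xxxxxxxx"       # gated HF datasets and models
export NGC_API_KEY="nvapi-xxxxxxx"  # container pulls from nvcr.io
```
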
---

@@ -28,6 +28,17 @@ clusters:

### Checkpoint and storage availability

> **Contributor:** I didn't get this part; the compute node should have the same storage access as the login node. The only special case I have seen is […]

Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:

| Cluster type | Compute-node storage | NOT accessible from compute nodes |
|-------------|---------------------|----------------------------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |

If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.

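A minimal sketch of both steps, assuming placeholder partition names and paths:

```bash
# Verify the checkpoint is visible from a compute node.
srun -N1 --partition=batch --time=00:02:00 \
    ls -d /lustre/fsw/project/my-checkpoint \
    || echo "Checkpoint not visible from compute nodes"

# If missing, copy it to the target cluster's storage (rsync over SSH).
rsync -a --info=progress2 \
    workstation:/home/scratch.user/my-checkpoint/ \
    /lustre/fsw/project/my-checkpoint/
```
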
See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---

@@ -51,6 +51,20 @@ srun \

### Container registry credentials (pyxis)

If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add them if missing:

```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```

> **Contributor:** This doesn't seem SLURM-specific. Considering the env setup, we should have a general setup guide to help users set the HF token, Docker login token, NGC login token, etc. Maybe we can create another […]

Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.

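A quick way to confirm the credentials before a real job is a throwaway pull; the image tag below is illustrative (pyxis uses `registry#image` syntax for `--container-image`):

```bash
srun -N1 --time=00:05:00 \
    --container-image=nvcr.io#nvidia/cuda:12.4.1-base-ubuntu22.04 \
    true && echo "Registry credentials OK"
```
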
Submit and capture the job ID:

```bash
# (The original snippet is truncated in this diff; a typical pattern,
# sketched with --parsable so sbatch prints only the job ID.)
JOB_ID=$(sbatch --parsable job.sh)
echo "Submitted job $JOB_ID"
```

> **Reviewer:** Is this workflow really necessary? It feels like the existing skills and Claude Code already cover it.

> **Reviewer:** +1, we talked about this and decided not to add an E2E workflow, as I remember; the idea is that Claude knows to call multiple skills when the user says something like "quantize and deploy".