Refactor evaluation run validation guidance (#1535)

chadvoegele · jenchen13 · commit 09bef05a5b11 · 2026-05-27T15:34:36.000-07:00
### What does this PR do?

Type of change: documentation.

Refactors the evaluation skill for lazy loading by moving detailed NEL timeout/resume behavior and completed-run validation guidance from `.claude/skills/evaluation/SKILL.md` into `.claude/skills/evaluation/references/run-validation.md`. The main skill keeps Step 9 as dispatch/navigation and links to the reference.

### Usage

N/A

### Testing

- `git diff --cached --check`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

Follow-up from review feedback on evaluation/SKILL.md lazy-loading drift.

## Summary by CodeRabbit

* **Documentation**
  * Reorganized evaluation workflow guidance to streamline post-submission validation procedures and improve reference organization.
  * Added detailed validation reference guide covering timeout recovery, run resumption from cached artifacts, comprehensive log verification across multiple layers, score extraction and validation, and diagnostic commands for evaluation submissions.

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -18,17 +18,6 @@ If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md
 
 This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.
 
-### NEL Timeout and Resume Behavior
-
-NEL submissions commonly create a dependency chain of SLURM jobs. The first job
-runs the evaluation and writes response/result caches. A dependent follow-on job
-resumes from those caches if the first job times out, then queues another follow-on
-job so long-running evals can continue across walltime windows.
-
-Do not assume a timeout means the evaluation failed or produced invalid results.
-Treat timeouts as expected resume events until `nel status`/`nel info`, artifacts,
-and logs show a terminal failure or invalid run.
-
 ### Workflow
 
 ```text
@@ -349,46 +338,18 @@ parallelism, then rerun the canary before launching the full evaluation.
 
 **Monitoring Progress**
 
-After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.
+After job submission, register the job per the **monitor skill** for durable
+cross-session tracking. For one-off queries (live status, debugging a failed
+run, analyzing results) use the **launching-evals skill**; for querying past
+runs in MLflow use **accessing-mlflow**. If a NEL job times out or resumes,
+read `references/run-validation.md` before treating the run as failed.
 
 **Step 9: Verify completed evaluation run**
 
-Before pulling/reporting scores, validate the completed run itself. Do not accept a run as complete just because `results.yml` or a summary file exists.
-
-For each completed invocation/run directory, whether baseline, quantized, or a single-model run:
-
-1. Inspect client, server/deployment, SLURM, judge, and task-specific/code-execution logs as applicable. Search for `Traceback`, `Exception`, `ERROR`, `FAILED`, `OOM`, `Killed`, `timeout`, `rate limit`, `unauthorized`, `connection refused/reset`, `health check`, `sandbox`, `container`, `judge`, `parse`, `scoring`, and task-specific failure strings.
-2. Confirm the inference server loaded the intended checkpoint/model and stayed healthy through the run: no startup failure, mid-run crash/restart, OOM, request validation failure, max-context truncation, quantization load error, or repeated 4xx/5xx responses.
-3. For judge-backed tasks, confirm judge calls succeeded and were parsed/scored correctly: no auth/rate-limit failures, malformed judge responses, invalid JSON, missing scores, or fallback/default scores.
-4. For code-execution tasks, inspect executor/sandbox/container logs for setup failures, package install failures, timeouts, thread/process exhaustion, permission errors, harness crashes, or skipped tests that would make scores non-comparable.
-5. Confirm sample accounting: expected samples/repeats match completed, scored samples; no unexpected dropped/skipped/failed samples, `unknown_agent_error`, `failed_samples_policy` aborts, empty outputs, or partial result files.
-6. If reasoning traces are present, confirm they are parsed/stripped/ignored before scoring consistently. Check for parser errors, unmatched reasoning delimiters, `finish_reason: length`, reasoning text leaked into answers, answers stripped with the reasoning, or reasoning disabled when the config intended it to be active.
-
-Report the run-validation summary before any score: log scan status, sample accounting, reasoning/answer parsing status, and any errors or warnings found. If any validation item fails, either rerun/fix it or label the result as incomplete or invalid.
-
-For score harvesting, use the `Score Extraction` section from the matching task
-reference in `recipes/tasks/<task>.md`. Do not rely on ad hoc `results.yml`
-greps when a task reference defines the canonical score and stderr fields.
-
-For baseline-vs-quantized deltas, use the compare-results skill after run
-validation.
-
-**NEL-specific diagnostics** (for debugging failures):
-
-```bash
-# Quick status check
-nel status <invocation_id>
-nel info <invocation_id>
-
-# Get log paths
-nel info <invocation_id> --logs
-
-# Inspect logs via SSH
-ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log"   # deployment errors
-ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log"     # evaluation errors
-ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log"      # scheduling/walltime
-ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"  # search all logs
-```
+Before pulling/reporting scores, validate the completed run itself. Read
+`references/run-validation.md` for NEL timeout/resume behavior, completed-run
+validation, diagnostics, score-harvesting guidance, and the handoff to
+`compare-results` for baseline-vs-candidate deltas.
 
 ---
 
diff --git a/.claude/skills/evaluation/references/run-validation.md b/.claude/skills/evaluation/references/run-validation.md
@@ -0,0 +1,60 @@
+# Run Validation
+
+Use this reference when checking NEL progress after submission, resuming from
+timeouts, validating completed runs, or handing completed baseline/candidate runs
+to `compare-results`.
+
+## NEL Timeout and Resume Behavior
+
+NEL submissions commonly create a dependency chain of SLURM jobs. The first job
+runs the evaluation and writes response/result caches. A dependent follow-on job
+resumes from those caches if the first job times out, then queues another
+follow-on job so long-running evals can continue across walltime windows.
+
+Do not assume a timeout means the evaluation failed or produced invalid results.
+Treat timeouts as expected resume events until `nel status`/`nel info`,
+artifacts, and logs show a terminal failure or invalid run.
+
+## Verify Completed Evaluation Run
+
+Before pulling/reporting scores, validate the completed run itself. Do not
+accept a run as complete just because `results.yml` or a summary file exists.
+
+For each completed invocation/run directory, whether baseline, quantized, or a
+single-model run:
+
+1. Inspect client, server/deployment, SLURM, judge, and task-specific/code-execution logs as applicable. Search for `Traceback`, `Exception`, `ERROR`, `FAILED`, `OOM`, `Killed`, `timeout`, `rate limit`, `unauthorized`, `connection refused/reset`, `health check`, `sandbox`, `container`, `judge`, `parse`, `scoring`, and task-specific failure strings.
+2. Confirm the inference server loaded the intended checkpoint/model and stayed healthy through the run: no startup failure, mid-run crash/restart, OOM, request validation failure, max-context truncation, quantization load error, or repeated 4xx/5xx responses.
+3. For judge-backed tasks, confirm judge calls succeeded and were parsed/scored correctly: no auth/rate-limit failures, malformed judge responses, invalid JSON, missing scores, or fallback/default scores.
+4. For code-execution tasks, inspect executor/sandbox/container logs for setup failures, package install failures, timeouts, thread/process exhaustion, permission errors, harness crashes, or skipped tests that would make scores non-comparable.
+5. Confirm sample accounting: expected samples/repeats match completed, scored samples; no unexpected dropped/skipped/failed samples, `unknown_agent_error`, `failed_samples_policy` aborts, empty outputs, or partial result files.
+6. If reasoning traces are present, confirm they are parsed/stripped/ignored before scoring consistently. Check for parser errors, unmatched reasoning delimiters, `finish_reason: length`, reasoning text leaked into answers, answers stripped with the reasoning, or reasoning disabled when the config intended it to be active.
+
+Report the run-validation summary before any score: log scan status, sample
+accounting, reasoning/answer parsing status, and any errors or warnings found.
+If any validation item fails, either rerun/fix it or label the result as
+incomplete or invalid.
+
+For score harvesting, use the `Score Extraction` section from the matching task
+reference in `recipes/tasks/<task>.md`. Do not rely on ad hoc `results.yml`
+greps when a task reference defines the canonical score and stderr fields.
+
+For baseline-vs-candidate deltas, use the `compare-results` skill after each run
+passes validation.
+
+## NEL Diagnostics
+
+```bash
+# Quick status check
+nel status <invocation_id>
+nel info <invocation_id>
+
+# Get log paths
+nel info <invocation_id> --logs
+
+# Inspect logs via SSH
+ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log"   # deployment errors
+ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log"     # evaluation errors
+ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log"      # scheduling/walltime
+ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"  # search all logs
+```
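
A minimal sketch, not part of this commit or of NEL: the log scan in step 1 of the new `run-validation.md` could be wrapped in a small shell function that reports, per log file, how many lines match the failure strings. `scan_logs` and `PATTERN` are hypothetical names introduced here for illustration; the pattern list is copied from the diagnostics grep in the diff, and the function assumes the logs are readable as `*.log` files in a single directory (for example, on the cluster host or after copying them locally).

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; scan_logs is not a NEL command.
# Failure strings copied from the diagnostics grep in run-validation.md.
PATTERN='traceback|exception|error|failed|oom|killed|timeout|unauthorized|rate limit|sandbox|container|judge|parse|scoring'

# Print "<file>: <n> matching line(s)" for every *.log file in the given
# directory that contains at least one case-insensitive match.
# Returns 0 when no file matches, 1 when at least one file matches.
scan_logs() {
  local log_dir="$1" status=0 file count
  for file in "$log_dir"/*.log; do
    # Skip the unexpanded glob when the directory has no *.log files.
    [ -e "$file" ] || continue
    # grep -c exits non-zero on zero matches, so guard it with "|| true".
    count=$(grep -Eic "$PATTERN" "$file" || true)
    if [ "${count:-0}" -gt 0 ]; then
      echo "$file: $count matching line(s)"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `scan_logs <log_path>` only flags files worth reading. Broad terms like `judge`, `container`, or `parse` also match benign lines, and `timeout` hits may be the expected resume events described above, so a non-zero count is a prompt to inspect the log, not proof of an invalid run.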