Commit 09bef05

chadvoegele authored and jenchen13 committed
Refactor evaluation run validation guidance (#1535)
### What does this PR do?

Type of change: documentation.

Refactors the evaluation skill for lazy loading by moving detailed NEL timeout/resume behavior and completed-run validation guidance from `.claude/skills/evaluation/SKILL.md` into `.claude/skills/evaluation/references/run-validation.md`. The main skill keeps Step 9 as dispatch/navigation and links to the reference.

### Usage

N/A

### Testing

- `git diff --cached --check`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

Follow-up from review feedback on evaluation/SKILL.md lazy-loading drift.

## Summary by CodeRabbit

* **Documentation**
  * Reorganized evaluation workflow guidance to streamline post-submission validation procedures and improve reference organization.
  * Added detailed validation reference guide covering timeout recovery, run resumption from cached artifacts, comprehensive log verification across multiple layers, score extraction and validation, and diagnostic commands for evaluation submissions.

Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
1 parent 5b869d1 commit 09bef05

2 files changed

Lines changed: 69 additions & 48 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 9 additions & 48 deletions
@@ -18,17 +18,6 @@ If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md
 
 This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.
 
-### NEL Timeout and Resume Behavior
-
-NEL submissions commonly create a dependency chain of SLURM jobs. The first job
-runs the evaluation and writes response/result caches. A dependent follow-on job
-resumes from those caches if the first job times out, then queues another follow-on
-job so long-running evals can continue across walltime windows.
-
-Do not assume a timeout means the evaluation failed or produced invalid results.
-Treat timeouts as expected resume events until `nel status`/`nel info`, artifacts,
-and logs show a terminal failure or invalid run.
-
 ### Workflow
 
 ```text
@@ -349,46 +338,18 @@ parallelism, then rerun the canary before launching the full evaluation.
 
 **Monitoring Progress**
 
-After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.
+After job submission, register the job per the **monitor skill** for durable
+cross-session tracking. For one-off queries (live status, debugging a failed
+run, analyzing results) use the **launching-evals skill**; for querying past
+runs in MLflow use **accessing-mlflow**. If a NEL job times out or resumes,
+read `references/run-validation.md` before treating the run as failed.
 
 **Step 9: Verify completed evaluation run**
 
-Before pulling/reporting scores, validate the completed run itself. Do not accept a run as complete just because `results.yml` or a summary file exists.
-
-For each completed invocation/run directory, whether baseline, quantized, or a single-model run:
-
-1. Inspect client, server/deployment, SLURM, judge, and task-specific/code-execution logs as applicable. Search for `Traceback`, `Exception`, `ERROR`, `FAILED`, `OOM`, `Killed`, `timeout`, `rate limit`, `unauthorized`, `connection refused/reset`, `health check`, `sandbox`, `container`, `judge`, `parse`, `scoring`, and task-specific failure strings.
-2. Confirm the inference server loaded the intended checkpoint/model and stayed healthy through the run: no startup failure, mid-run crash/restart, OOM, request validation failure, max-context truncation, quantization load error, or repeated 4xx/5xx responses.
-3. For judge-backed tasks, confirm judge calls succeeded and were parsed/scored correctly: no auth/rate-limit failures, malformed judge responses, invalid JSON, missing scores, or fallback/default scores.
-4. For code-execution tasks, inspect executor/sandbox/container logs for setup failures, package install failures, timeouts, thread/process exhaustion, permission errors, harness crashes, or skipped tests that would make scores non-comparable.
-5. Confirm sample accounting: expected samples/repeats match completed, scored samples; no unexpected dropped/skipped/failed samples, `unknown_agent_error`, `failed_samples_policy` aborts, empty outputs, or partial result files.
-6. If reasoning traces are present, confirm they are parsed/stripped/ignored before scoring consistently. Check for parser errors, unmatched reasoning delimiters, `finish_reason: length`, reasoning text leaked into answers, answers stripped with the reasoning, or reasoning disabled when the config intended it to be active.
-
-Report the run-validation summary before any score: log scan status, sample accounting, reasoning/answer parsing status, and any errors or warnings found. If any validation item fails, either rerun/fix it or label the result as incomplete or invalid.
-
-For score harvesting, use the `Score Extraction` section from the matching task
-reference in `recipes/tasks/<task>.md`. Do not rely on ad hoc `results.yml`
-greps when a task reference defines the canonical score and stderr fields.
-
-For baseline-vs-quantized deltas, use the compare-results skill after run
-validation.
-
-**NEL-specific diagnostics** (for debugging failures):
-
-```bash
-# Quick status check
-nel status <invocation_id>
-nel info <invocation_id>
-
-# Get log paths
-nel info <invocation_id> --logs
-
-# Inspect logs via SSH
-ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log" # deployment errors
-ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log" # evaluation errors
-ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log" # scheduling/walltime
-ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log" # search all logs
-```
+Before pulling/reporting scores, validate the completed run itself. Read
+`references/run-validation.md` for NEL timeout/resume behavior, completed-run
+validation, diagnostics, score-harvesting guidance, and the handoff to
+`compare-results` for baseline-vs-candidate deltas.
 
 ---
 
.claude/skills/evaluation/references/run-validation.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Run Validation
+
+Use this reference when checking NEL progress after submission, resuming from
+timeouts, validating completed runs, or handing completed baseline/candidate runs
+to `compare-results`.
+
+## NEL Timeout and Resume Behavior
+
+NEL submissions commonly create a dependency chain of SLURM jobs. The first job
+runs the evaluation and writes response/result caches. A dependent follow-on job
+resumes from those caches if the first job times out, then queues another
+follow-on job so long-running evals can continue across walltime windows.
+
+Do not assume a timeout means the evaluation failed or produced invalid results.
+Treat timeouts as expected resume events until `nel status`/`nel info`,
+artifacts, and logs show a terminal failure or invalid run.
+
+## Verify Completed Evaluation Run
+
+Before pulling/reporting scores, validate the completed run itself. Do not
+accept a run as complete just because `results.yml` or a summary file exists.
+
+For each completed invocation/run directory, whether baseline, quantized, or a
+single-model run:
+
+1. Inspect client, server/deployment, SLURM, judge, and task-specific/code-execution logs as applicable. Search for `Traceback`, `Exception`, `ERROR`, `FAILED`, `OOM`, `Killed`, `timeout`, `rate limit`, `unauthorized`, `connection refused/reset`, `health check`, `sandbox`, `container`, `judge`, `parse`, `scoring`, and task-specific failure strings.
+2. Confirm the inference server loaded the intended checkpoint/model and stayed healthy through the run: no startup failure, mid-run crash/restart, OOM, request validation failure, max-context truncation, quantization load error, or repeated 4xx/5xx responses.
+3. For judge-backed tasks, confirm judge calls succeeded and were parsed/scored correctly: no auth/rate-limit failures, malformed judge responses, invalid JSON, missing scores, or fallback/default scores.
+4. For code-execution tasks, inspect executor/sandbox/container logs for setup failures, package install failures, timeouts, thread/process exhaustion, permission errors, harness crashes, or skipped tests that would make scores non-comparable.
+5. Confirm sample accounting: expected samples/repeats match completed, scored samples; no unexpected dropped/skipped/failed samples, `unknown_agent_error`, `failed_samples_policy` aborts, empty outputs, or partial result files.
+6. If reasoning traces are present, confirm they are parsed/stripped/ignored before scoring consistently. Check for parser errors, unmatched reasoning delimiters, `finish_reason: length`, reasoning text leaked into answers, answers stripped with the reasoning, or reasoning disabled when the config intended it to be active.
+
+Report the run-validation summary before any score: log scan status, sample
+accounting, reasoning/answer parsing status, and any errors or warnings found.
+If any validation item fails, either rerun/fix it or label the result as
+incomplete or invalid.
+
+For score harvesting, use the `Score Extraction` section from the matching task
+reference in `recipes/tasks/<task>.md`. Do not rely on ad hoc `results.yml`
+greps when a task reference defines the canonical score and stderr fields.
+
+For baseline-vs-candidate deltas, use the `compare-results` skill after each run
+passes validation.
+
+## NEL Diagnostics
+
+```bash
+# Quick status check
+nel status <invocation_id>
+nel info <invocation_id>
+
+# Get log paths
+nel info <invocation_id> --logs
+
+# Inspect logs via SSH
+ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log" # deployment errors
+ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log" # evaluation errors
+ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log" # scheduling/walltime
+ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log" # search all logs
+```
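
Note (not part of this commit): the log scan in step 1 of the new reference could be wrapped in a small local helper along the following lines. This is only a sketch under stated assumptions: the function name `scan_logs` is hypothetical and not an NEL command, the run's `*.log` files are assumed to have been copied to a local directory first, and the pattern list is a subset of the strings named in step 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of NEL: scan locally copied run logs for
# the failure strings listed in step 1 of the validation checklist.

# Subset of the step 1 search strings, as one extended regular expression.
PATTERNS='traceback|exception|error|failed|oom|killed|timeout|rate limit|unauthorized|connection refused|connection reset|health check'

# Usage: scan_logs <dir>
# Prints "file:line:text" for every case-insensitive match in <dir>/*.log.
# Exit status follows grep: 0 if any match was found, 1 if the logs are
# clean, 2 on error (for example, no *.log files in <dir>).
scan_logs() {
  local dir="$1"
  grep -E -i -n -H "$PATTERNS" "$dir"/*.log
}
```

A match is a prompt for manual review rather than proof of failure: broad strings such as `error` or `oom` also hit benign lines (for example `0 errors`), and an expected timeout/resume event will match `timeout`.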
