Commit 33c3528

Merge branch 'main' into fridah/calib-registry
2 parents 57b33f3 + c20f9c4

111 files changed: 8340 additions & 1538 deletions

.claude/skills/deployment/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust

 3. **Deploy based on remote environment:**

-   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
+   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). After submitting, register the job and set up monitoring per the **monitor skill**. Get the node hostname from `squeue -j $JOBID -o %N`.

    - **Bare metal / Docker** — use `remote_run` to start the server directly:
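
A minimal sketch of the submit-then-resolve-hostname flow this step describes, assuming a hypothetical job script `serve_model.sbatch` built from the `slurm-setup.md` templates:

```bash
# Submit the server job and resolve the node it landed on.
# serve_model.sbatch is a hypothetical script name.
JOBID=$(sbatch --parsable serve_model.sbatch)
squeue -j "$JOBID" -o %N   # node hostname, once the job is running
```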

.claude/skills/evaluation/SKILL.md

Lines changed: 16 additions & 56 deletions
@@ -256,64 +256,24 @@ After the dry-run, check the output from `nel` for any problems with the config.

 **Monitoring Progress**

-After job submission, you can monitor progress using:
+After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results), use the **launching-evals skill**; for querying past runs in MLflow, use **accessing-mlflow**.

-1. **Check job status:**
+**NEL-specific diagnostics** (for debugging failures):

-   ```bash
-   nel status <invocation_id>
-   nel info <invocation_id>
-   ```
-
-2. **Stream logs** (Local execution only):
-
-   ```bash
-   nel logs <invocation_id>
-   ```
-
-   Note: `nel logs` is not supported for SLURM execution.
-
-3. **Inspect logs via SSH** (SLURM workaround):
-
-   When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly:
-
-   First, get log locations:
-
-   ```bash
-   nel info <invocation_id> --logs
-   ```
-
-   Then, use SSH to view logs:
-
-   **Check server deployment logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"
-   ```
-
-   Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).
-
-   **Check evaluation client logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"
-   ```
-
-   Shows evaluation progress, task execution, and results.
-
-   **Check SLURM scheduler logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"
-   ```
-
-   Shows job scheduling, health checks, and overall execution flow.
-
-   **Search for errors:**
-
-   ```bash
-   ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"
-   ```
+```bash
+# Quick status check
+nel status <invocation_id>
+nel info <invocation_id>
+
+# Get log paths
+nel info <invocation_id> --logs
+
+# Inspect logs via SSH
+ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log"   # deployment errors
+ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log"     # evaluation errors
+ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log"      # scheduling/walltime
+ssh <user>@<host> "grep -i 'error\|failed' <log_path>/*.log"           # search all logs
+```

 ---

.claude/skills/monitor/SKILL.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
name: monitor
description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job <slurm_job_id>", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.
---

# Job Monitor

Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs.

## When to use

1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
2. **User-initiated** — the user asks about a job's status, possibly in a new conversation. Check the registry, identify the job, and report.

---

## Job Registry

All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.

```json
[
  {
    "type": "nel",
    "id": "<invocation_id or slurm_job_id>",
    "host": "<cluster_hostname>",
    "user": "<ssh_user>",
    "submitted": "YYYY-MM-DD HH:MM",
    "description": "<what this job does>",
    "last_status": "<last known status>"
  }
]
```

`type` is one of: `nel`, `slurm`, `launcher`.

---

## On Job Submission

Every time a job is submitted (by any skill or manually):

1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Always do both steps. Don't try to predict job duration.

---

## On Cron Fire / Status Check

Whether triggered by the cron or by the user asking "check status":

1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
3. **Report only state changes** — compare against `last_status` in registry
4. **Update `last_status`** in the registry
5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED)
6. **If registry is empty** — delete the recurring cron

---

## How to Check Each Job Type

### NEL jobs (`type: nel`)

- **Check:** `nel status <id>`
- **On completion:** `nel info <id>` to fetch results
- **On failure:** `nel info <id> --logs`, then inspect server/client/SLURM logs via SSH

### Launcher jobs (`type: launcher`)

- **Check:** Tail the launcher's background output file for key events
- **Key events:** experiment ID, SLURM job ID, container import, calibration progress, export path, final status
- **On failure:** Look for `Traceback`, `Error`, or `FAILED` in the output

### Raw SLURM jobs (`type: slurm`)

- **Check:** `ssh <host> "squeue -j <id> -h -o '%T %M %R'"` — if empty, the job has left the queue
- **On completion:** `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"`
- **On failure:** Check the job's output log file

---

## Identifying Jobs (user-initiated, no ID given)

When the user asks about a job without specifying an ID, check in order:

1. `.claude/active_jobs.json` — most reliable, has context
2. `nel ls runs --since 1d` — recent NEL runs
3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments

---

## Reporting Guidelines

- **Report state changes proactively** — PENDING → RUNNING, or job completes
- **Aggregate multiple jobs** — "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending"
- **Summarize, don't echo** — interpret events ("Calibration complete, exporting checkpoint"), not raw logs
- **On failure, diagnose immediately** — check logs and report the root cause without waiting for the user to ask
- **Minimize noise** — don't report "still running" unless the user is actively asking
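
The registry append in **On Job Submission** can be scripted; a minimal sketch using `jq`, with illustrative placeholder values for the entry fields:

```bash
# Append a job entry to .claude/active_jobs.json, creating it if missing.
# All field values below are illustrative placeholders.
REGISTRY=.claude/active_jobs.json
[ -f "$REGISTRY" ] || echo '[]' > "$REGISTRY"
jq --arg id "inv-12345" \
   --arg host "cluster.example.com" \
   --arg desc "MMLU-Pro eval of quantized checkpoint" \
   --arg ts "$(date '+%Y-%m-%d %H:%M')" \
   '. += [{type: "nel", id: $id, host: $host, user: env.USER,
           submitted: $ts, description: $desc, last_status: "PENDING"}]' \
   "$REGISTRY" > "$REGISTRY.tmp" && mv "$REGISTRY.tmp" "$REGISTRY"
```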

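Likewise, one polling pass from **On Cron Fire / Status Check** might look like the sketch below, assuming `nel status` prints a short state string; only the `nel` and raw-SLURM checks are shown, and rewriting `last_status` / pruning finished jobs in the JSON file are omitted:

```bash
# Check every registered job once and report state changes (sketch).
REGISTRY=.claude/active_jobs.json
jq -c '.[]' "$REGISTRY" | while read -r job; do
  type=$(jq -r .type <<<"$job")
  id=$(jq -r .id <<<"$job")
  host=$(jq -r .host <<<"$job")
  last=$(jq -r .last_status <<<"$job")
  case "$type" in
    nel)   status=$(nel status "$id") ;;
    slurm) status=$(ssh "$host" "squeue -j $id -h -o %T")
           # Empty squeue output means the job left the queue; ask sacct.
           [ -z "$status" ] && status=$(ssh "$host" "sacct -j $id --format=State -n" | head -1) ;;
    *)     continue ;;  # launcher jobs: tail the background output file instead
  esac
  [ "$status" != "$last" ] && echo "Job $id: $last -> $status"
done
```
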
.claude/skills/ptq/SKILL.md

Lines changed: 1 addition & 3 deletions
@@ -118,9 +118,7 @@ For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md

 ### Monitoring

-- **Launcher**: blocks and tails logs automatically
-- **SLURM (manual)**: poll with `squeue -u $USER` + `sleep` (not cron or background tasks)
-- **Local**: watch stdout
+After job submission, register the job and set up monitoring per the **monitor skill**.

 ## Step 5 — Verify output

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ LICENSE @NVIDIA/modelopt-setup-codeowners
 LICENSE_HEADER @NVIDIA/modelopt-setup-codeowners
 pyproject.toml @NVIDIA/modelopt-setup-codeowners
 SECURITY.md @NVIDIA/modelopt-setup-codeowners
-tox.ini @NVIDIA/modelopt-setup-codeowners
+noxfile.py @NVIDIA/modelopt-setup-codeowners
 uv.lock @NVIDIA/modelopt-setup-codeowners

 # Library

.github/codecov.yml

Lines changed: 12 additions & 0 deletions
@@ -11,3 +11,15 @@ coverage:
       target: auto
       threshold: 1% # Allow atmost 1% coverage drop from main branch.
   patch: false
+
+# Exclude GPU-only Triton kernel files from ALL codecov calculations (project
+# and patch checks, all flags). Rationale: these files are dominated by
+# @triton.jit kernel bodies that CPU unit tests cannot exercise. GPU tests
+# cover them end-to-end (see tests/gpu/torch/sparsity/attention_sparsity/) but
+# the `gpu`-flag upload may race with the PR status check, so relying on flag
+# combination alone leaves the project check flaky. Dropping these files here
+# makes the check deterministic — local `pytest --cov` and GPU runs still
+# measure them; only the codecov PR status ignores them.
+ignore:
+  - "modelopt/torch/kernels/triton_fa.py"
+  - "modelopt/torch/kernels/hf_triton_attention.py"
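
Since only the codecov PR status ignores these files, local coverage still measures them; e.g., a quick check, assuming `pytest-cov` is installed:

```bash
# Local coverage still includes the ignored Triton kernels.
pytest --cov=modelopt/torch/kernels --cov-report=term-missing
```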

.github/workflows/_example_tests_runner.yml

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ jobs:
       - name: Install dependencies
         run: |
           # use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
-          pip uninstall -y nvidia-modelopt
           python -m pip install ".${{ inputs.pip_install_extras }}"

           if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then

.github/workflows/_pr_gate.yml

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
name: PR Gate

on:
  workflow_call:
    inputs:
      files:
        description: "Newline-separated list of file patterns to watch for changes"
        required: true
        type: string
    outputs:
      any_changed:
        description: "Whether any relevant files changed"
        value: ${{ jobs.check-file-changes.outputs.any_changed }}

jobs:
  check-file-changes:
    runs-on: ubuntu-latest
    outputs:
      any_changed: ${{ steps.changed-tests.outputs.any_changed || steps.non-pr.outputs.any_changed }}
    steps:
      # For non-PR triggers (schedule, workflow_dispatch), always run tests
      - id: non-pr
        if: "!startsWith(github.ref, 'refs/heads/pull-request/')"
        run: echo "any_changed=true" >> $GITHUB_OUTPUT
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: get-pr-info
        uses: nv-gha-runners/get-pr-info@main
      # Extract SHAs from pr-info JSON via shell to avoid fromJSON on potentially-empty outputs
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: pr-shas
        env:
          PR_INFO: ${{ steps.get-pr-info.outputs.pr-info }}
        run: |
          echo "head_sha=$(echo "$PR_INFO" | jq -r '.head.sha')" >> $GITHUB_OUTPUT
          echo "base_sha=$(echo "$PR_INFO" | jq -r '.base.sha')" >> $GITHUB_OUTPUT
      # Get commit from main branch that is present in the PR to use as base for changed files
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: calculate-merge-base
        run: |
          (echo -n "merge-base="; git merge-base "${{ steps.pr-shas.outputs.base_sha }}" "${{ steps.pr-shas.outputs.head_sha }}") | tee --append "${GITHUB_OUTPUT}"
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        name: Check for changes in test-relevant directories
        id: changed-tests
        uses: step-security/changed-files@v46.0.5
        with:
          base_sha: ${{ steps.calculate-merge-base.outputs.merge-base }}
          sha: ${{ steps.pr-shas.outputs.head_sha }}
          files: ${{ inputs.files }}
          fail_on_initial_diff_error: true

  wait-checks:
    needs: [check-file-changes]
    if: >-
      startsWith(github.ref, 'refs/heads/pull-request/') &&
      needs.check-file-changes.outputs.any_changed == 'true'
    uses: ./.github/workflows/_wait_for_checks.yml
    permissions:
      checks: read
    secrets: inherit
    with:
      match_pattern: "^linux$" # Wait for Unit tests / linux (DCO is a prerequisite of linux)
      delay: 300s
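
In `check-file-changes`, the merge-base step pins the diff to the commit where the PR branched from main. A local bash equivalent of the gate's change detection, where `BASE_SHA`/`HEAD_SHA` stand in for the values the `pr-shas` step extracts and the watched paths are illustrative:

```bash
# Local equivalent of the gate's change detection (sketch).
MERGE_BASE=$(git merge-base "$BASE_SHA" "$HEAD_SHA")
# Any changed file under the watched paths flips the gate on.
if git diff --name-only "$MERGE_BASE" "$HEAD_SHA" -- modelopt/ tests/ | grep -q .; then
  echo "any_changed=true"
fi
```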

.github/workflows/bump_uv_lock.yml

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@ name: Bump uv.lock
 on:
   schedule:
     - cron: "0 9 * * 1" # Every Monday at 9:00 UTC
-  workflow_dispatch: # On-demand
+  workflow_dispatch:
+    # On-demand

 permissions:
   contents: write

.github/workflows/code_quality.yml

Lines changed: 5 additions & 3 deletions
@@ -5,10 +5,12 @@ on:
     branches: [main, release/*, feature/*]
   schedule:
     - cron: "0 0 * * *" # Nightly
-  workflow_dispatch: # On-demand
+  workflow_dispatch:
+    # On-demand
+

-# Cancel previous runs if new commit is pushed to the same PR
 concurrency:
+  # Cancel previous runs if new commit is pushed to the same PR
   group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
   cancel-in-progress: true

@@ -24,4 +26,4 @@ jobs:
       with:
         extra_args: --results=verified,unknown
       - name: Run code quality checks
-        run: pip install tox && tox -e pre-commit-all
+        run: pip install nox uv && nox -s pre_commit_all
