Commit 33c3528

Merge branch 'main' into fridah/calib-registry
2 parents 57b33f3 + c20f9c4

111 files changed: 8340 additions & 1538 deletions

.claude/skills/deployment/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust

 3. **Deploy based on remote environment:**

-   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
+   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). After submitting, register the job and set up monitoring per the **monitor skill**. Get the node hostname from `squeue -j $JOBID -o %N`.

    - **Bare metal / Docker** — use `remote_run` to start the server directly:
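
A minimal sketch of the submit-then-resolve-hostname flow this step describes, assuming a hypothetical job script `serve_model.sbatch` built from the `slurm-setup.md` templates:

```bash
# Submit the server job and resolve the node it landed on.
# serve_model.sbatch is a hypothetical script name.
JOBID=$(sbatch --parsable serve_model.sbatch)
squeue -j "$JOBID" -o %N   # node hostname, once the job is running
```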

.claude/skills/evaluation/SKILL.md

Lines changed: 16 additions & 56 deletions
@@ -256,64 +256,24 @@ After the dry-run, check the output from `nel` for any problems with the config.

 **Monitoring Progress**

-After job submission, you can monitor progress using:
+After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results), use the **launching-evals skill**; for querying past runs in MLflow, use **accessing-mlflow**.

-1. **Check job status:**
+**NEL-specific diagnostics** (for debugging failures):

-   ```bash
-   nel status <invocation_id>
-   nel info <invocation_id>
-   ```
-
-2. **Stream logs** (Local execution only):
-
-   ```bash
-   nel logs <invocation_id>
-   ```
-
-   Note: `nel logs` is not supported for SLURM execution.
-
-3. **Inspect logs via SSH** (SLURM workaround):
-
-   When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly:
-
-   First, get log locations:
-
-   ```bash
-   nel info <invocation_id> --logs
-   ```
-
-   Then, use SSH to view logs:
-
-   **Check server deployment logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"
-   ```
-
-   Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).
-
-   **Check evaluation client logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"
-   ```
-
-   Shows evaluation progress, task execution, and results.
-
-   **Check SLURM scheduler logs:**
-
-   ```bash
-   ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"
-   ```
-
-   Shows job scheduling, health checks, and overall execution flow.
-
-   **Search for errors:**
-
-   ```bash
-   ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"
-   ```
+```bash
+# Quick status check
+nel status <invocation_id>
+nel info <invocation_id>
+
+# Get log paths
+nel info <invocation_id> --logs
+
+# Inspect logs via SSH
+ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log"   # deployment errors
+ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log"     # evaluation errors
+ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log"      # scheduling/walltime
+ssh <user>@<host> "grep -i 'error\|failed' <log_path>/*.log"           # search all logs
+```

 ---

.claude/skills/monitor/SKILL.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
name: monitor
description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job <slurm_job_id>", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.
---

# Job Monitor

Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs.

## When to use

1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
2. **User-initiated** — the user asks about a job's status, possibly in a new conversation. Check the registry, identify the job, and report.

---

## Job Registry

All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.

```json
[
  {
    "type": "nel",
    "id": "<invocation_id or slurm_job_id>",
    "host": "<cluster_hostname>",
    "user": "<ssh_user>",
    "submitted": "YYYY-MM-DD HH:MM",
    "description": "<what this job does>",
    "last_status": "<last known status>"
  }
]
```

`type` is one of: `nel`, `slurm`, `launcher`.

---

## On Job Submission

Every time a job is submitted (by any skill or manually):

1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Always do both steps. Don't try to predict job duration.

---

## On Cron Fire / Status Check

Whether triggered by the cron or by the user asking "check status":

1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
3. **Report only state changes** — compare against `last_status` in registry
4. **Update `last_status`** in the registry
5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED)
6. **If registry is empty** — delete the recurring cron

---

## How to Check Each Job Type

### NEL jobs (`type: nel`)

- **Check:** `nel status <id>`
- **On completion:** `nel info <id>` to fetch results
- **On failure:** `nel info <id> --logs`, then inspect server/client/SLURM logs via SSH

### Launcher jobs (`type: launcher`)

- **Check:** Tail the launcher's background output file for key events
- **Key events:** experiment ID, SLURM job ID, container import, calibration progress, export path, final status
- **On failure:** Look for `Traceback`, `Error`, or `FAILED` in the output

### Raw SLURM jobs (`type: slurm`)

- **Check:** `ssh <host> "squeue -j <id> -h -o '%T %M %R'"` — if empty, the job has left the queue
- **On completion:** `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"`
- **On failure:** Check the job's output log file

---

## Identifying Jobs (user-initiated, no ID given)

When the user asks about a job without specifying an ID, check in order:

1. `.claude/active_jobs.json` — most reliable, has context
2. `nel ls runs --since 1d` — recent NEL runs
3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments

---

## Reporting Guidelines

- **Report state changes proactively** — PENDING → RUNNING, or job completes
- **Aggregate multiple jobs** — "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending"
- **Summarize, don't echo** — interpret events ("Calibration complete, exporting checkpoint"), not raw logs
- **On failure, diagnose immediately** — check logs and report the root cause without waiting for the user to ask
- **Minimize noise** — don't report "still running" unless the user is actively asking
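
The registry append in **On Job Submission** can be scripted; a minimal sketch using `jq`, with illustrative placeholder values for the entry fields:

```bash
# Append a job entry to .claude/active_jobs.json, creating it if missing.
# All field values below are illustrative placeholders.
REGISTRY=.claude/active_jobs.json
[ -f "$REGISTRY" ] || echo '[]' > "$REGISTRY"
jq --arg id "inv-12345" \
   --arg host "cluster.example.com" \
   --arg desc "MMLU-Pro eval of quantized checkpoint" \
   --arg ts "$(date '+%Y-%m-%d %H:%M')" \
   '. += [{type: "nel", id: $id, host: $host, user: env.USER,
           submitted: $ts, description: $desc, last_status: "PENDING"}]' \
   "$REGISTRY" > "$REGISTRY.tmp" && mv "$REGISTRY.tmp" "$REGISTRY"
```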

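Likewise, one polling pass from **On Cron Fire / Status Check** might look like the sketch below, assuming `nel status` prints a short state string; only the `nel` and raw-SLURM checks are shown, and rewriting `last_status` / pruning finished jobs in the JSON file are omitted:

```bash
# Check every registered job once and report state changes (sketch).
REGISTRY=.claude/active_jobs.json
jq -c '.[]' "$REGISTRY" | while read -r job; do
  type=$(jq -r .type <<<"$job")
  id=$(jq -r .id <<<"$job")
  host=$(jq -r .host <<<"$job")
  last=$(jq -r .last_status <<<"$job")
  case "$type" in
    nel)   status=$(nel status "$id") ;;
    slurm) status=$(ssh "$host" "squeue -j $id -h -o %T")
           # Empty squeue output means the job left the queue; ask sacct.
           [ -z "$status" ] && status=$(ssh "$host" "sacct -j $id --format=State -n" | head -1) ;;
    *)     continue ;;  # launcher jobs: tail the background output file instead
  esac
  [ "$status" != "$last" ] && echo "Job $id: $last -> $status"
done
```
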
.claude/skills/ptq/SKILL.md

Lines changed: 1 addition & 3 deletions
@@ -118,9 +118,7 @@ For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md

 ### Monitoring

-- **Launcher**: blocks and tails logs automatically
-- **SLURM (manual)**: poll with `squeue -u $USER` + `sleep` (not cron or background tasks)
-- **Local**: watch stdout
+After job submission, register the job and set up monitoring per the **monitor skill**.

 ## Step 5 — Verify output

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ LICENSE @NVIDIA/modelopt-setup-codeowners
 LICENSE_HEADER @NVIDIA/modelopt-setup-codeowners
 pyproject.toml @NVIDIA/modelopt-setup-codeowners
 SECURITY.md @NVIDIA/modelopt-setup-codeowners
-tox.ini @NVIDIA/modelopt-setup-codeowners
+noxfile.py @NVIDIA/modelopt-setup-codeowners
 uv.lock @NVIDIA/modelopt-setup-codeowners

 # Library

.github/codecov.yml

Lines changed: 12 additions & 0 deletions
@@ -11,3 +11,15 @@ coverage:
       target: auto
       threshold: 1% # Allow atmost 1% coverage drop from main branch.
   patch: false
+
+# Exclude GPU-only Triton kernel files from ALL codecov calculations (project
+# and patch checks, all flags). Rationale: these files are dominated by
+# @triton.jit kernel bodies that CPU unit tests cannot exercise. GPU tests
+# cover them end-to-end (see tests/gpu/torch/sparsity/attention_sparsity/) but
+# the `gpu`-flag upload may race with the PR status check, so relying on flag
+# combination alone leaves the project check flaky. Dropping these files here
+# makes the check deterministic — local `pytest --cov` and GPU runs still
+# measure them; only the codecov PR status ignores them.
+ignore:
+  - "modelopt/torch/kernels/triton_fa.py"
+  - "modelopt/torch/kernels/hf_triton_attention.py"
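
Since only the codecov PR status ignores these files, local coverage still measures them; e.g., a quick check, assuming `pytest-cov` is installed:

```bash
# Local coverage still includes the ignored Triton kernels.
pytest --cov=modelopt/torch/kernels --cov-report=term-missing
```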

.github/workflows/_example_tests_runner.yml

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ jobs:
       - name: Install dependencies
         run: |
           # use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
-          pip uninstall -y nvidia-modelopt
           python -m pip install ".${{ inputs.pip_install_extras }}"

           if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then

.github/workflows/_pr_gate.yml

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
name: PR Gate

on:
  workflow_call:
    inputs:
      files:
        description: "Newline-separated list of file patterns to watch for changes"
        required: true
        type: string
    outputs:
      any_changed:
        description: "Whether any relevant files changed"
        value: ${{ jobs.check-file-changes.outputs.any_changed }}

jobs:
  check-file-changes:
    runs-on: ubuntu-latest
    outputs:
      any_changed: ${{ steps.changed-tests.outputs.any_changed || steps.non-pr.outputs.any_changed }}
    steps:
      # For non-PR triggers (schedule, workflow_dispatch), always run tests
      - id: non-pr
        if: "!startsWith(github.ref, 'refs/heads/pull-request/')"
        run: echo "any_changed=true" >> $GITHUB_OUTPUT
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: get-pr-info
        uses: nv-gha-runners/get-pr-info@main
      # Extract SHAs from pr-info JSON via shell to avoid fromJSON on potentially-empty outputs
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: pr-shas
        env:
          PR_INFO: ${{ steps.get-pr-info.outputs.pr-info }}
        run: |
          echo "head_sha=$(echo "$PR_INFO" | jq -r '.head.sha')" >> $GITHUB_OUTPUT
          echo "base_sha=$(echo "$PR_INFO" | jq -r '.base.sha')" >> $GITHUB_OUTPUT
      # Get commit from main branch that is present in the PR to use as base for changed files
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: calculate-merge-base
        run: |
          (echo -n "merge-base="; git merge-base "${{ steps.pr-shas.outputs.base_sha }}" "${{ steps.pr-shas.outputs.head_sha }}") | tee --append "${GITHUB_OUTPUT}"
      - if: startsWith(github.ref, 'refs/heads/pull-request/')
        name: Check for changes in test-relevant directories
        id: changed-tests
        uses: step-security/changed-files@v46.0.5
        with:
          base_sha: ${{ steps.calculate-merge-base.outputs.merge-base }}
          sha: ${{ steps.pr-shas.outputs.head_sha }}
          files: ${{ inputs.files }}
          fail_on_initial_diff_error: true

  wait-checks:
    needs: [check-file-changes]
    if: >-
      startsWith(github.ref, 'refs/heads/pull-request/') &&
      needs.check-file-changes.outputs.any_changed == 'true'
    uses: ./.github/workflows/_wait_for_checks.yml
    permissions:
      checks: read
    secrets: inherit
    with:
      match_pattern: "^linux$" # Wait for Unit tests / linux (DCO is a prerequisite of linux)
      delay: 300s
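
In `check-file-changes`, the merge-base step pins the diff to the commit where the PR branched from main. A local bash equivalent of the gate's change detection, where `BASE_SHA`/`HEAD_SHA` stand in for the values the `pr-shas` step extracts and the watched paths are illustrative:

```bash
# Local equivalent of the gate's change detection (sketch).
MERGE_BASE=$(git merge-base "$BASE_SHA" "$HEAD_SHA")
# Any changed file under the watched paths flips the gate on.
if git diff --name-only "$MERGE_BASE" "$HEAD_SHA" -- modelopt/ tests/ | grep -q .; then
  echo "any_changed=true"
fi
```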

.github/workflows/bump_uv_lock.yml

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@ name: Bump uv.lock
 on:
   schedule:
     - cron: "0 9 * * 1" # Every Monday at 9:00 UTC
-  workflow_dispatch: # On-demand
+  workflow_dispatch:
+    # On-demand

 permissions:
   contents: write

.github/workflows/code_quality.yml

Lines changed: 5 additions & 3 deletions
@@ -5,10 +5,12 @@ on:
     branches: [main, release/*, feature/*]
   schedule:
     - cron: "0 0 * * *" # Nightly
-  workflow_dispatch: # On-demand
+  workflow_dispatch:
+    # On-demand
+

-# Cancel previous runs if new commit is pushed to the same PR
 concurrency:
+  # Cancel previous runs if new commit is pushed to the same PR
   group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
   cancel-in-progress: true

@@ -24,4 +26,4 @@ jobs:
       with:
         extra_args: --results=verified,unknown
       - name: Run code quality checks
-        run: pip install tox && tox -e pre-commit-all
+        run: pip install nox uv && nox -s pre_commit_all
