2 changes: 1 addition & 1 deletion .claude/skills/deployment/SKILL.md
@@ -193,7 +193,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust

3. **Deploy based on remote environment:**

- **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
- **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). After submitting, register the job and set up monitoring per the **monitor skill**. Get the node hostname from `squeue -j $JOBID -o %N`.

- **Bare metal / Docker** — use `remote_run` to start the server directly:

72 changes: 16 additions & 56 deletions .claude/skills/evaluation/SKILL.md
@@ -225,64 +225,24 @@ After the dry-run, check the output from `nel` for any problems with the config.

**Monitoring Progress**

After job submission, you can monitor progress using:
After job submission, register the job and set up monitoring per the **monitor skill**.

1. **Check job status:**
**NEL-specific diagnostics** (for debugging failures):

```bash
nel status <invocation_id>
nel info <invocation_id>
```

2. **Stream logs** (Local execution only):

```bash
nel logs <invocation_id>
```

Note: `nel logs` is not supported for SLURM execution.

3. **Inspect logs via SSH** (SLURM workaround):

When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly:

First, get log locations:

```bash
nel info <invocation_id> --logs
```

Then, use SSH to view logs:

**Check server deployment logs:**

```bash
ssh <username>@<hostname> "tail -100 <log path from 'nel info <invocation_id> --logs'>/server-<slurm_job_id>-*.log"
```

Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).

**Check evaluation client logs:**

```bash
ssh <username>@<hostname> "tail -100 <log path from 'nel info <invocation_id> --logs'>/client-<slurm_job_id>.log"
```

Shows evaluation progress, task execution, and results.

**Check SLURM scheduler logs:**

```bash
ssh <username>@<hostname> "tail -100 <log path from 'nel info <invocation_id> --logs'>/slurm-<slurm_job_id>.log"
```

Shows job scheduling, health checks, and overall execution flow.

**Search for errors:**

```bash
ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from 'nel info <invocation_id> --logs'>/*.log"
```
```bash
# Quick status check
nel status <invocation_id>
nel info <invocation_id>

# Get log paths
nel info <invocation_id> --logs

# Inspect logs via SSH
ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log" # deployment errors
ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log" # evaluation errors
ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log" # scheduling/walltime
ssh <user>@<host> "grep -i 'error\|failed' <log_path>/*.log" # search all logs
```

---

102 changes: 102 additions & 0 deletions .claude/skills/monitor/SKILL.md
@@ -0,0 +1,102 @@
---
name: monitor
description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job 12345", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.
---

# Job Monitor

Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs.

## When to use

1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
2. **User-initiated** — user asks about a job status, possibly in a new conversation. Check the registry, identify the job, and report.

---

## Job Registry

All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.

```json
[
{
"type": "nel",
"id": "<invocation_id or slurm_job_id>",
"host": "<cluster_hostname>",
"user": "<ssh_user>",
"submitted": "YYYY-MM-DD HH:MM",
"description": "<what this job does>",
"last_status": "<last known status>"
}
]
```

`type` is one of: `nel`, `slurm`, `launcher`.
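
A minimal sketch of registering a job in this schema, assuming the standard library `json` module; the helper name `register_job` and the initial `SUBMITTED` status are illustrative, not part of the skill:

```python
import json
from datetime import datetime
from pathlib import Path

REGISTRY = Path(".claude/active_jobs.json")

def register_job(job_type, job_id, host, user, description):
    """Append a job entry to the registry, creating the file if it doesn't exist."""
    assert job_type in ("nel", "slurm", "launcher")
    jobs = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    jobs.append({
        "type": job_type,
        "id": job_id,
        "host": host,
        "user": user,
        "submitted": datetime.now().strftime("%Y-%m-%d %H:%M"),
        "description": description,
        # Assumed initial value; the schema only says "<last known status>".
        "last_status": "SUBMITTED",
    })
    REGISTRY.parent.mkdir(parents=True, exist_ok=True)
    REGISTRY.write_text(json.dumps(jobs, indent=2))
```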

---

## On Job Submission

Every time a job is submitted (by any skill or manually):

1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

**Review comment on lines +43 to +45** (coderabbitai bot, Apr 14, 2026):

⚠️ Potential issue | 🟠 Major

Define atomic registry updates and idempotent cron setup.

Lines 54-59 imply concurrent read/modify/write from cron-triggered and user-triggered checks, but no locking is specified. Combined with line 44 ("if one isn't already running"), this can cause duplicate pollers and registry clobbering. Please specify a lock (e.g., `flock` around all registry mutations) and a unique cron marker to ensure single-instance scheduling.

**Author reply** (kaix-nv): Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.

**coderabbitai bot:** @kaix-nv, that's a valid point: Claude Code's turn-based execution guarantees that only one operation runs at a time (a user turn or a cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!

Always do both steps. Don't try to predict job duration.

---

## On Cron Fire / Status Check

Whether triggered by the cron or by the user asking "check status":

1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
3. **Report only state changes** — compare against `last_status` in registry
4. **Update `last_status`** in the registry
5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED)
6. **If registry is empty** — delete the recurring cron
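
The six steps above can be sketched as a single pass; `check_job` and `report` are hypothetical callbacks standing in for the per-type checks and user reporting described in this skill, and the terminal-state set comes from step 5:

```python
import json
from pathlib import Path

REGISTRY = Path(".claude/active_jobs.json")
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "KILLED"}

def poll_cycle(check_job, report):
    """One monitoring pass over the registry.

    check_job(entry) -> current status string (per-type logic);
    report(entry, status) -> notify the user of a state change.
    Returns True when the registry is empty, i.e. the cron can delete itself.
    """
    jobs = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    remaining = []
    for job in jobs:
        status = check_job(job)
        if status != job["last_status"]:   # step 3: report state changes only
            report(job, status)
        job["last_status"] = status        # step 4: update last known status
        if status not in TERMINAL:         # step 5: drop completed jobs
            remaining.append(job)
    REGISTRY.write_text(json.dumps(remaining, indent=2))
    return not remaining                   # step 6: empty registry, remove cron
```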

---

## How to Check Each Job Type

### NEL jobs (`type: nel`)

- **Check:** `nel status <id>`
- **On completion:** `nel info <id>` to fetch results
- **On failure:** `nel info <id> --logs` then inspect server/client/SLURM logs via SSH

### Launcher jobs (`type: launcher`)

- **Check:** Tail the launcher's background output file for key events
- **Key events:** experiment ID, SLURM job ID, container import, calibration progress, export path, final status
- **On failure:** Look for `Traceback`, `Error`, or `FAILED` in the output

### Raw SLURM jobs (`type: slurm`)

- **Check:** `ssh <host> "squeue -j <id> -h -o '%T %M %R'"` — if empty, job left the queue
- **On completion:** `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"`
- **On failure:** Check the job's output log file
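
The per-type checks above can be dispatched from one helper that only builds the command string, leaving execution to the caller; the `output_file` field for launcher jobs is an assumption, not part of the registry schema:

```python
def status_command(job):
    """Build the status-check command for a registry entry, per job type."""
    jid = job["id"]
    if job["type"] == "nel":
        return f"nel status {jid}"
    if job["type"] == "slurm":
        # Empty squeue output means the job has left the queue.
        return f"ssh {job['host']} \"squeue -j {jid} -h -o '%T %M %R'\""
    if job["type"] == "launcher":
        # Hypothetical field: launcher jobs are checked by tailing their output.
        return f"tail -n 50 {job.get('output_file', '<output_file>')}"
    raise ValueError(f"unknown job type: {job['type']}")
```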

---

## Identifying Jobs (user-initiated, no ID given)

When the user asks about a job without specifying an ID, check in order:

1. `.claude/active_jobs.json` — most reliable, has context
2. `nel ls runs --since 1d` — recent NEL runs
3. `ssh <host> "squeue -u <user>"` — active SLURM jobs
4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments

---

## Reporting Guidelines

- **Report state changes proactively** — PENDING → RUNNING, or job completes
- **Aggregate multiple jobs** — "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending"
- **Summarize, don't echo** — interpret events ("Calibration complete, exporting checkpoint") not raw logs
- **On failure, diagnose immediately** — check logs and report root cause without waiting for user to ask
- **Minimize noise** — don't report "still running" unless the user is actively asking
4 changes: 1 addition & 3 deletions .claude/skills/ptq/SKILL.md
@@ -100,9 +100,7 @@ For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md

### Monitoring

- **Launcher**: blocks and tails logs automatically
- **SLURM (manual)**: poll with `squeue -u $USER` + `sleep` (not cron or background tasks)
- **Local**: watch stdout
After job submission, register the job and set up monitoring per the **monitor skill**.

## Step 5 — Verify output
