Add a standalone monitor skill for persistent job tracking #1252
Open
kaix-nv wants to merge 1 commit into main from kaix/monitor-skill
+120
−60
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| --- | ||
| name: monitor | ||
| description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job 12345", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job. | ||
| --- | ||
|
|
||
| # Job Monitor | ||
|
|
||
| Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs. | ||
|
|
||
| ## When to use | ||
|
|
||
| 1. **Auto-monitor** — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately. | ||
| 2. **User-initiated** — user asks about a job status, possibly in a new conversation. Check the registry, identify the job, and report. | ||
|
|
||
| --- | ||
|
|
||
| ## Job Registry | ||
|
|
||
| All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored. | ||
|
|
||
| ```json | ||
| [ | ||
| { | ||
| "type": "nel", | ||
| "id": "<invocation_id or slurm_job_id>", | ||
| "host": "<cluster_hostname>", | ||
| "user": "<ssh_user>", | ||
| "submitted": "YYYY-MM-DD HH:MM", | ||
| "description": "<what this job does>", | ||
| "last_status": "<last known status>" | ||
| } | ||
| ] | ||
| ``` | ||
|
|
||
| `type` is one of: `nel`, `slurm`, `launcher`. | ||
|
|
||
| --- | ||
|
|
||
| ## On Job Submission | ||
|
|
||
| Every time a job is submitted (by any skill or manually): | ||
|
|
||
| 1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist. | ||
| 2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty. | ||
|
|
||
| Always do both steps. Don't try to predict job duration. | ||
|
|
||
| --- | ||
|
|
||
| ## On Cron Fire / Status Check | ||
|
|
||
| Whether triggered by the cron or by the user asking "check status": | ||
|
|
||
| 1. **Read the registry** from `.claude/active_jobs.json` | ||
| 2. **Check each job** using the appropriate method (see below) | ||
| 3. **Report only state changes** — compare against `last_status` in registry | ||
| 4. **Update `last_status`** in the registry | ||
| 5. **Remove completed jobs** — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED) | ||
| 6. **If registry is empty** — delete the recurring cron | ||
|
|
||
| --- | ||
|
|
||
| ## How to Check Each Job Type | ||
|
|
||
| ### NEL jobs (`type: nel`) | ||
|
|
||
| - **Check:** `nel status <id>` | ||
| - **On completion:** `nel info <id>` to fetch results | ||
| - **On failure:** `nel info <id> --logs` then inspect server/client/SLURM logs via SSH | ||
|
|
||
| ### Launcher jobs (`type: launcher`) | ||
|
|
||
| - **Check:** Tail the launcher's background output file for key events | ||
| - **Key events:** experiment ID, SLURM job ID, container import, calibration progress, export path, final status | ||
| - **On failure:** Look for `Traceback`, `Error`, or `FAILED` in the output | ||
|
|
||
| ### Raw SLURM jobs (`type: slurm`) | ||
|
|
||
| - **Check:** `ssh <host> "squeue -j <id> -h -o '%T %M %R'"` — if empty, job left the queue | ||
| - **On completion:** `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"` | ||
| - **On failure:** Check the job's output log file | ||
|
|
||
| --- | ||
|
|
||
| ## Identifying Jobs (user-initiated, no ID given) | ||
|
|
||
| When the user asks about a job without specifying an ID, check in order: | ||
|
|
||
| 1. `.claude/active_jobs.json` — most reliable, has context | ||
| 2. `nel ls runs --since 1d` — recent NEL runs | ||
| 3. `ssh <host> "squeue -u <user>"` — active SLURM jobs | ||
| 4. `ls -lt tools/launcher/experiments/cicd/ | head -10` — recent launcher experiments | ||
|
|
||
| --- | ||
|
|
||
| ## Reporting Guidelines | ||
|
|
||
| - **Report state changes proactively** — PENDING → RUNNING, or job completes | ||
| - **Aggregate multiple jobs** — "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending" | ||
| - **Summarize, don't echo** — interpret events ("Calibration complete, exporting checkpoint") not raw logs | ||
| - **On failure, diagnose immediately** — check logs and report root cause without waiting for user to ask | ||
| - **Minimize noise** — don't report "still running" unless the user is actively asking | ||
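
As an illustration of the "On Cron Fire / Status Check" steps in the skill file above, here is a minimal Python sketch of one polling pass. It is not part of the PR. The function name `poll_registry` and the `check_job` callback are hypothetical; the callback stands in for whichever per-type check applies (`nel status`, `squeue`, or tailing launcher output) and is assumed to return a status string.

```python
import json
from pathlib import Path
from typing import Callable

# Terminal states as listed in the skill file.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "KILLED"}


def poll_registry(
    registry_path: Path, check_job: Callable[[dict], str]
) -> tuple[list[str], list[dict]]:
    """Run one status-check pass over the registry.

    Returns (reports, remaining): one message per state change, and the
    jobs still being tracked after terminal jobs are removed.
    """
    if not registry_path.exists():
        return [], []

    jobs = json.loads(registry_path.read_text())
    reports: list[str] = []
    remaining: list[dict] = []

    for job in jobs:
        status = check_job(job)  # hypothetical per-type checker
        previous = job.get("last_status")
        if status != previous:
            # Report only state changes, then record the new status.
            reports.append(f"{job['type']} job {job['id']}: {previous} -> {status}")
            job["last_status"] = status
        if status not in TERMINAL_STATES:
            remaining.append(job)

    # Persist the updated registry without the finished jobs.
    registry_path.write_text(json.dumps(remaining, indent=2))
    return reports, remaining
```

When `remaining` comes back empty, the caller would delete the recurring cron, which matches step 6 of the skill file.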
Define atomic registry updates and idempotent cron setup.
Line 54-Line 59 implies concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with Line 44 (“if one isn't already running”), this can cause duplicate pollers and registry clobbering.
Please specify a lock (e.g., `flock` around all registry mutations) and a unique cron marker to ensure single-instance scheduling.
Also applies to: 54-59
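
For reference, the kind of lock suggested in this comment could look roughly like the following Python sketch. It assumes a POSIX system, where `fcntl.flock` is available. The helper name `locked_registry` and the sidecar `.lock` file are hypothetical and do not appear in the PR.

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_registry(registry_path: Path):
    """Yield the job list under an exclusive advisory lock, then write it back."""
    # Hypothetical sidecar lock file next to the registry.
    lock_path = registry_path.with_suffix(".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            if registry_path.exists():
                jobs = json.loads(registry_path.read_text())
            else:
                jobs = []
            yield jobs
            # Reached only if the caller's block did not raise.
            registry_path.write_text(json.dumps(jobs, indent=2))
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller would mutate the list inside `with locked_registry(path) as jobs:` and the updated list is written back on exit. The lock is advisory, so it only protects against other writers that take the same lock.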
Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.
@kaix-nv, that's a valid point: Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!