Skip to content

Add a standalone monitor skill for persistent job tracking#1252

Open
kaix-nv wants to merge 1 commit intomainfrom
kaix/monitor-skill
Open

Add a standalone monitor skill for persistent job tracking#1252
kaix-nv wants to merge 1 commit intomainfrom
kaix/monitor-skill

Conversation

@kaix-nv
Copy link
Copy Markdown
Contributor

@kaix-nv kaix-nv commented Apr 14, 2026

What does this PR do?

Type of change: ?
Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills.

Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time.

Solution: A centralized monitor skill with:

  • Job registry (.claude/active_jobs.json): single source of truth for all active jobs
  • Durable recurring cron: polls every 15 min, survives session restarts, self-cleans when all jobs complete
  • User-initiated mode: works in new conversations by reading the registry
  • Aggregated reporting: "2 of 4 completed" instead of per-job noise

Usage

After any skill submits a job, the monitor skill automatically:

  1. Registers the job in .claude/active_jobs.json
  2. Sets up a durable cron to poll status every 15 minutes

User can also trigger manually:
User: "check my eval status" → reads registry, reports current state
User: "is the PTQ done?" → finds job, checks status
User: "what jobs are running?" → lists all registered jobs

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added monitor skill for tracking SLURM jobs, NEL evaluations, and launcher experiments with persistent job registry.
  • Documentation

    • Updated deployment, evaluation, and PTQ documentation to use the new monitor skill.
    • Simplified diagnostic and troubleshooting instructions.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: de1c5e64-6db6-447f-a578-bdd8cb845b11

📥 Commits

Reviewing files that changed from the base of the PR and between e2f7cd6 and d781534.

📒 Files selected for processing (4)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md

📝 Walkthrough

Walkthrough

The changes introduce a new monitor skill specification for tracking SLURM and other job types through a centralized registry and recurring polling mechanism. Four skill documentation files are updated: the new monitor skill is added, and three existing skills (deployment, evaluation, PTQ) are refactored to delegate job monitoring to this new monitor skill instead of manual polling approaches.

Changes

Cohort / File(s) Summary
New Monitor Skill
.claude/skills/monitor/SKILL.md
Added comprehensive specification for monitor skill. Defines job registry (.claude/active_jobs.json) as single source of truth, supports three job types (nel, launcher, slurm), implements automatic 15-minute cron polling, and specifies state change detection and cleanup logic.
Monitoring Refactoring
.claude/skills/deployment/SKILL.md, .claude/skills/evaluation/SKILL.md, .claude/skills/ptq/SKILL.md
Updated to reference monitor skill for job monitoring. Replaced manual polling instructions (nel status, squeue with sleep, log tailing) with directive to register jobs and use monitor skill for tracking. Condensed diagnostics in evaluation skill.

Sequence Diagram

sequenceDiagram
    participant User
    participant Deployment as Deployment Skill
    participant Registry as Job Registry<br/>(.claude/active_jobs.json)
    participant Cron as Cron Scheduler
    participant Monitor as Monitor Skill
    participant Job as Job Backend<br/>(NEL/SLURM/Launcher)

    User->>Deployment: Submit job
    Deployment->>Registry: Register job entry
    Deployment->>Cron: Ensure 15-min poll scheduled
    Deployment-->>User: Job submitted

    rect rgba(100, 150, 200, 0.5)
    Note over Cron,Monitor: Automatic polling loop (every 15 min)
    Cron->>Monitor: Trigger poll
    Monitor->>Registry: Read active jobs
    loop For each job
        Monitor->>Job: Query status<br/>(squeue, sacct, nel status, etc.)
        Job-->>Monitor: Current status
        Monitor->>Monitor: Compare to last_status<br/>detect state change
        Monitor->>Registry: Update last_status
        alt Job in terminal state
            Monitor->>Registry: Remove job entry
        end
    end
    Monitor-->>User: Report state changes only
    end

    alt Registry empty
        Monitor->>Cron: Delete recurring cron
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add a standalone monitor skill for persistent job tracking' directly and accurately summarizes the main change: introduction of a new monitor skill for durable job tracking across sessions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed PR contains no Python code changes to modelopt package, examples, or dependencies. Only documentation updates to .claude/skills/ markdown files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kaix/monitor-skill

Comment @coderabbitai help to get the list of available commands and usage tips.

@kaix-nv kaix-nv requested review from Edwardf0t1 and mxinO April 14, 2026 00:33
@kaix-nv kaix-nv marked this pull request as ready for review April 14, 2026 00:34
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv force-pushed the kaix/monitor-skill branch from e2f7cd6 to d781534 Compare April 14, 2026 00:36
@kaix-nv kaix-nv changed the title Kaix/monitor skill Add a standalone monitor skill for persistent job tracking Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 202-214: The current decision flow's submit-then-fallback behavior
must be replaced so the runtime/image is chosen based on auth checks before any
SLURM submission; update the text around "Default behavior: use DockerHub image
first." and the numbered "Decision flow" to: validate DockerHub credentials up
front, if valid use vllm/vllm-openai:latest, else set deployment.image to the
NGC image (nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that
describe submitting first and retrying on 401 and ensure the doc enforces "Do
not retry more than once" by requiring credential fix before any additional
submission.
- Around line 31-32: The final user-facing checklist is missing the newly added
"Step 7.5: Check container registry auth (SLURM only)"; update the copied
checklist section so it includes that same line immediately before "Step 8: Run
the evaluation" to keep both checklists in sync, ensuring the phrase "Step 7.5:
Check container registry auth (SLURM only)" appears exactly as in the main
workflow checklist.

In @.claude/skills/monitor/SKILL.md:
- Around line 43-45: Add atomic locking and an idempotent cron marker: wrap
every read/modify/write to the registry file `.claude/active_jobs.json` with an
exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and
user-triggered checks cannot clobber each other, and ensure the cron setup step
uses a unique cron marker (e.g., a comment/token or a marker file name) so the
"if one isn't already running" logic checks for that marker before installing a
new recurring poller; acquire the lock for the whole mutation sequence (read
registry, modify entries, write back, remove completed jobs) and release after
write, and make the cron install routine idempotent by searching for the marker
before adding the cron entry so duplicates are never created.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 26e86888-6c9a-497f-b376-8ab0c3569a1d

📥 Commits

Reviewing files that changed from the base of the PR and between 202c3d3 and e2f7cd6.

📒 Files selected for processing (6)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

Comment on lines 31 to 32
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Checklist drift: new Step 7.5 is missing from the final copied checklist.

You added Step 7.5 in the main workflow checklist, but the final checklist users are told to copy still skips it. Please keep both checklists in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 31 - 32, The final
user-facing checklist is missing the newly added "Step 7.5: Check container
registry auth (SLURM only)"; update the copied checklist section so it includes
that same line immediately before "Step 8: Run the evaluation" to keep both
checklists in sync, ensuring the phrase "Step 7.5: Check container registry auth
(SLURM only)" appears exactly as in the main workflow checklist.

Comment on lines +202 to +214
**Default behavior: use DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
image: nvcr.io/nvidia/vllm:<YY.MM>-py3 # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

**Decision flow:**
1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`)
2. If the job fails with image pull auth error (401) → check credentials
3. If DockerHub credentials can be added → add them and resubmit
4. If not → override `deployment.image` to the NGC vLLM image and resubmit
5. **Do not retry more than once** without fixing the auth issue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Step 7.5 currently contradicts its own pre-submit auth gate.

Line 196 says to verify credentials before submit, but Line 202-Line 214 still instructs a submit-first-then-fallback path. This reintroduces avoidable failed SLURM submissions and conflicts with the stricter deployment flow.

Please switch to: choose authenticated image/runtime before first submission, then submit once.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 202 - 214, The current
decision flow's submit-then-fallback behavior must be replaced so the
runtime/image is chosen based on auth checks before any SLURM submission; update
the text around "Default behavior: use DockerHub image first." and the numbered
"Decision flow" to: validate DockerHub credentials up front, if valid use
vllm/vllm-openai:latest, else set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that describe
submitting first and retrying on 401 and ensure the doc enforces "Do not retry
more than once" by requiring credential fix before any additional submission.

Comment on lines +43 to +45
1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Define atomic registry updates and idempotent cron setup.

Line 54-Line 59 implies concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with Line 44 (“if one isn't already running”), this can cause duplicate pollers and registry clobbering.

Please specify a lock (e.g., flock around all registry mutations) and a unique cron marker to ensure single-instance scheduling.

Also applies to: 54-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/monitor/SKILL.md around lines 43 - 45, Add atomic locking and
an idempotent cron marker: wrap every read/modify/write to the registry file
`.claude/active_jobs.json` with an exclusive file lock (e.g., `flock` on a
dedicated lockfile) so cron and user-triggered checks cannot clobber each other,
and ensure the cron setup step uses a unique cron marker (e.g., a comment/token
or a marker file name) so the "if one isn't already running" logic checks for
that marker before installing a new recurring poller; acquire the lock for the
whole mutation sequence (read registry, modify entries, write back, remove
completed jobs) and release after write, and make the cron install routine
idempotent by searching for the marker before adding the cron entry so
duplicates are never created.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaix-nv, that's a valid point — Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!


✏️ Learnings added
Learnt from: kaix-nv
Repo: NVIDIA/Model-Optimizer PR: 1252
File: .claude/skills/monitor/SKILL.md:43-45
Timestamp: 2026-04-14T00:42:02.848Z
Learning: In the NVIDIA/Model-Optimizer repository, `.claude/skills/monitor/SKILL.md` uses `.claude/active_jobs.json` as a job registry. No file locking (e.g., `flock`) is needed for this registry because Claude Code processes turns sequentially and crons only fire when idle — registry operations are inherently serialized by the Claude Code runtime. Do not flag concurrent access concerns for this registry.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@github-actions
Copy link
Copy Markdown
Contributor

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1252/

Built to branch gh-pages at 2026-04-14 00:41 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.92%. Comparing base (202c3d3) to head (d781534).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1252   +/-   ##
=======================================
  Coverage   76.92%   76.92%           
=======================================
  Files         350      350           
  Lines       40524    40524           
=======================================
  Hits        31173    31173           
  Misses       9351     9351           
Flag Coverage Δ
unit 55.59% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant