Add a standalone monitor skill for persistent job tracking#1252
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
No actionable comments were generated in the recent review. 🎉

ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)

📝 Walkthrough
The changes introduce a new monitor skill for persistent job tracking.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Deployment as Deployment Skill
    participant Registry as Job Registry<br/>(.claude/active_jobs.json)
    participant Cron as Cron Scheduler
    participant Monitor as Monitor Skill
    participant Job as Job Backend<br/>(NEL/SLURM/Launcher)
    User->>Deployment: Submit job
    Deployment->>Registry: Register job entry
    Deployment->>Cron: Ensure 15-min poll scheduled
    Deployment-->>User: Job submitted
    rect rgba(100, 150, 200, 0.5)
        Note over Cron,Monitor: Automatic polling loop (every 15 min)
        Cron->>Monitor: Trigger poll
        Monitor->>Registry: Read active jobs
        loop For each job
            Monitor->>Job: Query status<br/>(squeue, sacct, nel status, etc.)
            Job-->>Monitor: Current status
            Monitor->>Monitor: Compare to last_status<br/>detect state change
            Monitor->>Registry: Update last_status
            alt Job in terminal state
                Monitor->>Registry: Remove job entry
            end
        end
        Monitor-->>User: Report state changes only
    end
    alt Registry empty
        Monitor->>Cron: Delete recurring cron
    end
```
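For reference, a registry entry in `.claude/active_jobs.json` might look like the following. The field names here (`job_id`, `backend`, `last_status`, `submitted_at`) are assumptions inferred from the diagram, not the skill's actual schema:

```json
[
  {
    "job_id": "1234567",
    "backend": "slurm",
    "description": "PTQ run (hypothetical example)",
    "last_status": "RUNNING",
    "submitted_at": "2026-04-14T00:00:00Z"
  }
]
```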
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed from e2f7cd6 to d781534.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 202-214: The current decision flow's submit-then-fallback behavior
must be replaced so the runtime/image is chosen based on auth checks before any
SLURM submission; update the text around "Default behavior: use DockerHub image
first." and the numbered "Decision flow" to: validate DockerHub credentials up
front, if valid use vllm/vllm-openai:latest, else set deployment.image to the
NGC image (nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that
describe submitting first and retrying on 401 and ensure the doc enforces "Do
not retry more than once" by requiring credential fix before any additional
submission.
- Around line 31-32: The final user-facing checklist is missing the newly added
"Step 7.5: Check container registry auth (SLURM only)"; update the copied
checklist section so it includes that same line immediately before "Step 8: Run
the evaluation" to keep both checklists in sync, ensuring the phrase "Step 7.5:
Check container registry auth (SLURM only)" appears exactly as in the main
workflow checklist.
In @.claude/skills/monitor/SKILL.md:
- Around line 43-45: Add atomic locking and an idempotent cron marker: wrap
every read/modify/write to the registry file `.claude/active_jobs.json` with an
exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and
user-triggered checks cannot clobber each other, and ensure the cron setup step
uses a unique cron marker (e.g., a comment/token or a marker file name) so the
"if one isn't already running" logic checks for that marker before installing a
new recurring poller; acquire the lock for the whole mutation sequence (read
registry, modify entries, write back, remove completed jobs) and release after
write, and make the cron install routine idempotent by searching for the marker
before adding the cron entry so duplicates are never created.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 26e86888-6c9a-497f-b376-8ab0c3569a1d
📒 Files selected for processing (6)
.claude/skills/common/slurm-setup.md
.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md
.claude/skills/monitor/SKILL.md
.claude/skills/ptq/SKILL.md
tools/launcher/common/hf/ptq.sh
.claude/skills/evaluation/SKILL.md (outdated diff)

- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
Checklist drift: new Step 7.5 is missing from the final copied checklist.
You added Step 7.5 in the main workflow checklist, but the final checklist users are told to copy still skips it. Please keep both checklists in sync.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/SKILL.md around lines 31 - 32, The final
user-facing checklist is missing the newly added "Step 7.5: Check container
registry auth (SLURM only)"; update the copied checklist section so it includes
that same line immediately before "Step 8: Run the evaluation" to keep both
checklists in sync, ensuring the phrase "Step 7.5: Check container registry auth
(SLURM only)" appears exactly as in the main workflow checklist.
.claude/skills/evaluation/SKILL.md (outdated diff)

**Default behavior: use DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

**Decision flow:**
1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`)
2. If the job fails with image pull auth error (401) → check credentials
3. If DockerHub credentials can be added → add them and resubmit
4. If not → override `deployment.image` to the NGC vLLM image and resubmit
5. **Do not retry more than once** without fixing the auth issue
Step 7.5 currently contradicts its own pre-submit auth gate.
Line 196 says to verify credentials before submit, but Line 202-Line 214 still instructs a submit-first-then-fallback path. This reintroduces avoidable failed SLURM submissions and conflicts with the stricter deployment flow.
Please switch to: choose authenticated image/runtime before first submission, then submit once.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/evaluation/SKILL.md around lines 202 - 214, The current
decision flow's submit-then-fallback behavior must be replaced so the
runtime/image is chosen based on auth checks before any SLURM submission; update
the text around "Default behavior: use DockerHub image first." and the numbered
"Decision flow" to: validate DockerHub credentials up front, if valid use
vllm/vllm-openai:latest, else set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that describe
submitting first and retrying on 401 and ensure the doc enforces "Do not retry
more than once" by requiring credential fix before any additional submission.
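The auth-first, submit-once flow this prompt asks for can be sketched as a pre-submission image selection. This assumes DockerHub credentials are detectable from `~/.docker/config.json`; the helper name and the NGC tag are illustrative (check the NGC catalog for the current tag), and the `sbatch` line is only an example of submitting once:

```shell
select_vllm_image() {
  # Choose the image *before* any SLURM submission: DockerHub if credentials
  # exist, otherwise the NGC mirror. No submit-then-retry-on-401 path.
  if grep -qs '"https://index.docker.io/v1/"' "$HOME/.docker/config.json"; then
    echo "vllm/vllm-openai:latest"
  else
    echo "nvcr.io/nvidia/vllm:25.09-py3"  # example tag; check the NGC catalog
  fi
}

# Submit exactly once with the chosen image, e.g.:
#   sbatch --export=IMAGE="$(select_vllm_image)" eval_job.sh
```

Deciding the image up front removes the failed-submission round trip entirely, which is the behavior the review asks the doc to enforce.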
.claude/skills/monitor/SKILL.md (quoted diff)

1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.
Define atomic registry updates and idempotent cron setup.
Line 54-Line 59 implies concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with Line 44 (“if one isn't already running”), this can cause duplicate pollers and registry clobbering.
Please specify a lock (e.g., flock around all registry mutations) and a unique cron marker to ensure single-instance scheduling.
Also applies to: 54-59
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/monitor/SKILL.md around lines 43 - 45, Add atomic locking and
an idempotent cron marker: wrap every read/modify/write to the registry file
`.claude/active_jobs.json` with an exclusive file lock (e.g., `flock` on a
dedicated lockfile) so cron and user-triggered checks cannot clobber each other,
and ensure the cron setup step uses a unique cron marker (e.g., a comment/token
or a marker file name) so the "if one isn't already running" logic checks for
that marker before installing a new recurring poller; acquire the lock for the
whole mutation sequence (read registry, modify entries, write back, remove
completed jobs) and release after write, and make the cron install routine
idempotent by searching for the marker before adding the cron entry so
duplicates are never created.
Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.
@kaix-nv, that's a valid point — Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!
✏️ Learnings added
Learnt from: kaix-nv
Repo: NVIDIA/Model-Optimizer PR: 1252
File: .claude/skills/monitor/SKILL.md:43-45
Timestamp: 2026-04-14T00:42:02.848Z
Learning: In the NVIDIA/Model-Optimizer repository, `.claude/skills/monitor/SKILL.md` uses `.claude/active_jobs.json` as a job registry. No file locking (e.g., `flock`) is needed for this registry because Claude Code processes turns sequentially and crons only fire when idle — registry operations are inherently serialized by the Claude Code runtime. Do not flag concurrent access concerns for this registry.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1252 +/- ##
=======================================
Coverage 76.92% 76.92%
=======================================
Files 350 350
Lines 40524 40524
=======================================
Hits 31173 31173
Misses 9351 9351
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
What does this PR do?
Type of change: ?
Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills.
Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time.
Solution: A centralized monitor skill with:
Usage
After any skill submits a job, the monitor skill automatically:
User can also trigger manually:
User: "check my eval status" → reads registry, reports current state
User: "is the PTQ done?" → finds job, checks status
User: "what jobs are running?" → lists all registered jobs
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Documentation