Commit f2b92ba

Add robust monitor status parsing
1 parent 05e931f commit f2b92ba

1 file changed

Lines changed: 42 additions & 20 deletions

.claude/skills/monitor/SKILL.md

@@ -93,22 +93,29 @@ vocabulary of the source you're polling.

 ### NEL jobs (`type: nel`)

-- **Check:** `nel status <id>`: the second pipe-delimited column carries the state with a Unicode prefix (e.g. `▶ RUNNING`, `✓ SUCCESS`).
-- **States** (from `nemo_evaluator_launcher.executors.base.ExecutionState` + the CLI status formatter):
-
-| State (uppercase, as printed) | Terminal? | Indicator |
-| --- | --- | --- |
-| `PENDING` | no | `` |
-| `RUNNING` | no | `▶` |
-| `SUCCESS` | **yes** | `✓` |
-| `FAILED` | **yes** | `` |
-| `KILLED` | **yes** | `` |
-| `ERROR` | **yes** | `` (synthetic, CLI error path) |
-| `NOT FOUND` | **yes** | `?` (synthetic, CLI: invocation id unknown) |
-
-Watcher terminal regex: `^(SUCCESS|FAILED|KILLED|ERROR|NOT FOUND)$`.
-Strip the Unicode indicator (`▶✓✗⧗?`) and surrounding whitespace before
-matching.
+- **Check:** `nel status <id>`.
+
+```bash
+extract_nel_state() {
+  local jid="$1" nel_bin="${NEL:-nel}" output state_col
+  output=$("$nel_bin" status "$jid" 2>&1)
+  state_col=$(echo "$output" \
+    | awk -F'|' -v prefix="$jid." 'index($1, prefix) == 1 { print $2; exit }')
+  [ -z "$state_col" ] && state_col="$output"
+  echo "$state_col" \
+    | LC_ALL=C tr '[:lower:]' '[:upper:]' \
+    | awk 'match($0, /(PENDING|RUNNING|SUCCESS|FAILED|KILLED|ERROR|NOT[[:space:]]+FOUND)/) { print substr($0, RSTART, RLENGTH); exit }' \
+    | sed 's/[[:space:]][[:space:]]*/ /g'
+}
+
+is_nel_terminal() {
+  case "$(extract_nel_state "$1")" in
+    SUCCESS|FAILED|KILLED|ERROR|"NOT FOUND") return 0 ;;
+    *) return 1 ;;
+  esac
+}
+```
+
 - **On completion:** `nel info <id>` to fetch results.
 - **On failure:** `nel info <id> --logs` then inspect server/client/SLURM logs via SSH.

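A watcher could drive a predicate such as `is_nel_terminal` from a bounded polling loop. The sketch below is only an illustration: `wait_until_terminal` and its interval and attempt-budget parameters are hypothetical names, not part of this commit or of the `nel` CLI.

```bash
# Hypothetical helper (not in the commit): poll a predicate command until it
# reports a terminal state or the attempt budget is exhausted.
#   $1 = seconds to sleep between polls
#   $2 = maximum number of polls
#   $3... = predicate and its arguments, e.g. is_nel_terminal <id>
wait_until_terminal() {
  local interval="$1" max_attempts="$2" attempt=0
  shift 2
  while [ "$attempt" -lt "$max_attempts" ]; do
    # A zero exit status from the predicate means "terminal".
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  # Budget exhausted without seeing a terminal state.
  return 1
}
```

With the functions from the hunk above loaded, `wait_until_terminal 60 120 is_nel_terminal "$id" && nel info "$id"` would poll roughly once a minute for about two hours before giving up. The helper is agnostic to the job type, so the same loop should work with any predicate that follows the same exit-status convention.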
@@ -120,10 +127,25 @@ vocabulary of the source you're polling.

 ### Raw SLURM jobs (`type: slurm`)

-- **Check:** `ssh <host> "sacct -j <id> --format=JobID%12,JobName%25,State%12,Elapsed%10 -n"` and filter out `extern`, `batch`, and step rows like `.<step>`. Use `sacct` for the termination check; `squeue` can lag in `COMPLETING` after `sacct` reports a terminal state.
-- **States (terminal):** `COMPLETED`, `FAILED`, `CANCELLED` (also appears as `CANCELLED by <uid>`), `TIMEOUT`, `NODE_FAIL`, `OUT_OF_MEMORY`, `PREEMPTED`, `BOOT_FAIL`, `DEADLINE`.
-- **States (non-terminal):** `PENDING`, `RUNNING`, `CONFIGURING`, `COMPLETING`, `RESIZING`, `SUSPENDED`, `REQUEUED`.
-Watcher terminal regex: `^(COMPLETED|FAILED|CANCELLED( by .*)?|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|PREEMPTED|BOOT_FAIL|DEADLINE)$`.
+- **Check:** `sacct`; use `sacct` for the termination check because `squeue`
+can lag in `COMPLETING` after `sacct` reports a terminal state.
+
+```bash
+extract_slurm_state() {
+  local jid="$1" host="$2"
+  ssh "$host" "sacct -j $jid -X --format=State --noheader -P 2>/dev/null | head -1" \
+    | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
+    | sed 's/^CANCELLED by .*/CANCELLED/'
+}
+
+is_slurm_terminal() {
+  case "$(extract_slurm_state "$1" "$2")" in
+    COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|PREEMPTED|BOOT_FAIL|DEADLINE) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+```
+
 - **On completion:** `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"`.
 - **On failure:** Check the job's output log file.

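The SLURM normalization steps (take the first row, trim whitespace, collapse `CANCELLED by <uid>`) can be exercised without a cluster. The sketch below applies the same `sed` expressions to text on stdin; `normalize_slurm_state` is a hypothetical name and the sample input is made up for illustration.

```bash
# Hypothetical helper (not in the commit): apply the same normalization as
# extract_slurm_state to text on stdin instead of live sacct output.
normalize_slurm_state() {
  head -n 1 \
    | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
    | sed 's/^CANCELLED by .*/CANCELLED/'
}

# Illustrative input only; the uid is invented.
printf 'CANCELLED by 1234\n' | normalize_slurm_state   # prints: CANCELLED
```

Collapsing the `by <uid>` suffix before matching is what lets the `case` pattern in `is_slurm_terminal` list plain `CANCELLED` instead of a regex alternative.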