Commit d3eca35

add: create SKILL.md for tmux monitoring with training progress
1 parent 4da809a commit d3eca35

1 file changed: 181 additions, 0 deletions

@@ -0,0 +1,181 @@
---
name: monitor-with-tmux
description: Monitor training progress by reading tmux content with exponential backoff intervals (30s, 1min, 2min, 4min, 8min, 16min), analyze logs when anomalies occur, and provide fix suggestions
license: See LICENSE.txt for full terms
---

# Monitor with Tmux

Monitor training progress in tmux, detect anomalies, analyze errors, and provide fix suggestions.

## Step Zero

Create a sleep script for tmux monitoring:

1. Create `./tmp/tmux_wait.py`

```python
import argparse
import subprocess
import time

# Foreground commands that mean the pane is idle at a shell prompt.
SHELLS = {"bash", "zsh", "sh", "fish", "csh", "tcsh", "ksh", "dash", "ash"}

def smart_sleep(session: str, seconds: float, check_every: float = 2.0) -> bool:
    """
    Alternative to time.sleep(), but returns early when commands finish.

    Returns:
        True - Normal timeout (command still running)
        False - Early return (command finished or session gone)
    """
    end_time = time.time() + seconds
    while time.time() < end_time:
        try:
            # Ask tmux for the foreground command of every pane in the session.
            r = subprocess.run(
                ["tmux", "list-panes", "-F", "#{pane_current_command}", "-t", session],
                capture_output=True, text=True, timeout=5
            )
            if r.returncode != 0:
                return False  # session gone
            cmds = [line.strip().lower() for line in r.stdout.splitlines() if line.strip()]
            if all(c in SHELLS for c in cmds):
                return False  # command finished, back to shell
        except Exception:
            return False

        # Clamp at zero: the deadline may have passed while tmux was running,
        # and time.sleep() raises ValueError on a negative argument.
        time.sleep(max(0.0, min(check_every, end_time - time.time())))

    return True


def main():
    parser = argparse.ArgumentParser(description="Wait for a tmux session with smart early-exit.")
    parser.add_argument("session", help="tmux session name")
    parser.add_argument("seconds", type=float, help="total seconds to wait")
    args = parser.parse_args()

    timed_out = smart_sleep(args.session, args.seconds, 2)
    # Exit code 0: timed out normally; exit code 1: returned early.
    raise SystemExit(0 if timed_out else 1)


if __name__ == "__main__":
    main()
```
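
The early-return decision in `smart_sleep` depends only on the text that `tmux list-panes` prints, so it can be checked without a running tmux server. The following is a minimal sketch under that assumption; the helper name `pane_is_idle` is hypothetical and not part of the script above.

```python
# Same shell list as in the wait script.
SHELLS = {"bash", "zsh", "sh", "fish", "csh", "tcsh", "ksh", "dash", "ash"}

def pane_is_idle(list_panes_output: str) -> bool:
    """Return True when every pane's foreground command is a known shell.

    `list_panes_output` is the stdout of
    `tmux list-panes -F "#{pane_current_command}" -t <session>`.
    """
    cmds = [line.strip().lower() for line in list_panes_output.splitlines() if line.strip()]
    return all(c in SHELLS for c in cmds)

print(pane_is_idle("bash\n"))          # True: only a shell is in the foreground
print(pane_is_idle("python\nbash\n"))  # False: one pane still runs python
print(pane_is_idle(""))                # True: no panes reported counts as idle
```

Note that an empty pane list is treated as idle, which matches the original condition: the script returns early in that case as well.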

## Start Monitoring

When you need to monitor a tmux window, run:

```bash
python ./tmp/tmux_wait.py my_ajet_session_name 30
```

This means:
1. Monitor the tmux session named `my_ajet_session_name`
2. Wait up to 30 seconds, returning early if the command finishes or the session disappears

- Exit code 0: Normal timeout (command still running)
- Exit code 1: Command ended early or session disappeared

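
The skill description lists waits that double from 30 seconds up to 16 minutes. As a rough sketch, the `seconds` argument for successive calls to the wait script could be generated like this; the function name `backoff_intervals` and its parameters are illustrative assumptions, not part of the skill.

```python
def backoff_intervals(first=30, factor=2, last=960):
    """Yield wait times in seconds: 30, 60, 120, ... up to `last` (16 minutes)."""
    seconds = first
    while seconds <= last:
        yield seconds
        seconds *= factor

print(list(backoff_intervals()))  # [30, 60, 120, 240, 480, 960]
```

Each yielded value would be passed as the second argument of the wait script, for example `python ./tmp/tmux_wait.py my_ajet_session_name 120` for the third check.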
## Using SSH

When using SSH, always use a local tmux window to establish the SSH connection.

## When You Want to Delay Before Reading tmux Window Again

The wait must return immediately when the command fails or exits, so do not use `sleep xxx`; use `python ./tmp/tmux_wait.py my_ajet_session_name xxx` instead.

Don't use: `sleep 60 && tmux capture-pane -t my_ajet_session_name -p | tail -80`

You should use: `python ./tmp/tmux_wait.py my_ajet_session_name 30 && tmux capture-pane -t my_ajet_session_name -p | tail -80`

- 60 seconds is too long for the first check; start at 30 seconds
- Always use `python ./tmp/tmux_wait.py` to wait

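
The wait-then-capture command can also be driven from Python. The sketch below mirrors the `&&` chain: the pane is captured only when the wait script exits with code 0. The function name `wait_then_capture`, its `tail` parameter, and the injectable `run` argument are assumptions made for illustration and testing.

```python
import subprocess
import sys

def wait_then_capture(session, seconds, tail=80,
                      script="./tmp/tmux_wait.py", run=subprocess.run):
    """Wait with the tmux wait script, then return the last `tail` pane lines.

    Returns None when the wait script exits non-zero, meaning the command
    ended early or the session disappeared (same behavior as `&&`).
    """
    waited = run([sys.executable, script, session, str(seconds)])
    if waited.returncode != 0:
        return None
    captured = run(["tmux", "capture-pane", "-t", session, "-p"],
                   capture_output=True, text=True)
    return "\n".join(captured.stdout.splitlines()[-tail:])
```

A call such as `wait_then_capture("my_ajet_session_name", 30)` corresponds to the recommended shell command; a `None` result signals that the agent should inspect the pane right away rather than keep waiting.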
## Examples

### Examples Without SSH

0 examples available

### Examples With SSH

1 example available

```agent

--- example user instruction begin ---
login
ssh root@localhost -p 8824
goto
/mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet
run with venv and monitor
source .venv/bin/activate && python -m ajet.launcher --conf tests/bench/benchmark_math/benchmark_math.yaml --autokill
--- example user instruction end ---

--- example agent reaction begin ---
The user wants me to monitor a tmux session while running a training task. Let me break down what they're asking:
1. SSH to remote machine (root@localhost -p 8824)
2. Go to specific directory
3. Activate virtual environment and run Python command
4. Monitor tmux session

# Write ./tmp/tmux_wait.py
....

# Create tmux session for training (note: session name must contain keyword `ajet`)
$ tmux new-session -d -s ajet_session -c "/mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet"
...

# Send SSH command to tmux session
$ tmux send-keys -t ajet_session "ssh root@localhost -p 8824" Enter
...

# Wait and check SSH connection
$ python3 ./tmp/tmux_wait.py ajet_session 10 && tmux capture-pane -t ajet_session -p | tail -20
SSH connection successful. Now I need to navigate to the directory and run the training command. Let me send commands to the tmux session.
...

# Navigate to working directory
$ tmux send-keys -t ajet_session "cd /mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet" Enter
...

# Start training with venv and launcher
$ tmux send-keys -t ajet_session "source .venv/bin/activate && python -m ajet.launcher --conf tests/bench/benchmark_math/benchmark_math.yaml --autokill" Enter
...

# Capture initial training output
$ tmux capture-pane -t ajet_session -p | tail -40
...

# Wait 30 seconds and check training progress
$ python3 ./tmp/tmux_wait.py ajet_session 30 && tmux capture-pane -t ajet_session -p | tail -50
...
Training is running. Commands are executing correctly. Let me continue monitoring at 1-minute intervals.

# Wait 1 minute and check training progress
$ python3 ./tmp/tmux_wait.py ajet_session 60 && tmux capture-pane -t ajet_session -p | tail -50
...

# Wait 2 minutes and check training progress
$ python3 ./tmp/tmux_wait.py ajet_session 120 && tmux capture-pane -t ajet_session -p | tail -60
...
Training is progressing well. Checkpoints loaded, vLLM server started, swanlab tracking the run. No errors so far, only warnings. Let me continue monitoring.

# Wait 4 minutes and check training progress
$ python3 ./tmp/tmux_wait.py ajet_session 240 && tmux capture-pane -t ajet_session -p | tail -60
...

... multiple rounds of monitoring ...

--- example agent reaction end ---

# Destroy tmux session
tmux kill-session -t ajet_session

```
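
In the example reaction, the agent judges each capture by eye ("No errors so far, only warnings"). A first-pass filter over the captured text could look like the sketch below. The regular expression and the name `find_anomalies` are hypothetical; the patterns are guesses at common Python training failures and would need tuning to the real log format.

```python
import re

# Hypothetical patterns: Python tracebacks, exception names ending in "Error",
# and out-of-memory messages. Warnings are deliberately not matched.
ERROR_PATTERNS = re.compile(
    r"Traceback \(most recent call last\)|\w*Error\b|out of memory",
    re.IGNORECASE,
)

def find_anomalies(pane_text: str) -> list:
    """Return the captured lines that look like errors rather than progress."""
    return [line for line in pane_text.splitlines() if ERROR_PATTERNS.search(line)]

sample = (
    "step 10 loss=0.42\n"
    "WARNING: slow dataloader\n"
    "RuntimeError: CUDA out of memory\n"
)
print(find_anomalies(sample))  # ['RuntimeError: CUDA out of memory']
```

An empty result does not prove the run is healthy; it only means none of the assumed patterns appeared in the captured lines.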
