| 1 | +--- |
| 2 | +name: monitor-with-tmux |
| 3 | +description: Monitor training progress by reading tmux content with exponential backoff intervals (30s, 1min, 2min, 4min, 8min, 16min), analyze logs when anomalies occur, and provide fix suggestions |
| 4 | +license: See LICENSE.txt for full terms |
| 5 | +--- |
| 6 | + |
| 7 | +# Monitor with Tmux |
| 8 | + |
| 9 | +Monitor training progress in a tmux session, detect anomalies, analyze errors, and suggest fixes. |
| 10 | + |
| 11 | +## Step Zero |
| 12 | + |
| 13 | +Create a sleep script for tmux monitoring: |
| 14 | + |
| 15 | +1. Create `./tmp/tmux_wait.py` |
| 16 | + |
| 17 | +```python |
| 18 | +import argparse |
| 19 | +import subprocess |
| 20 | +import time |
| 21 | + |
| 22 | +SHELLS = {"bash", "zsh", "sh", "fish", "csh", "tcsh", "ksh", "dash", "ash"} |
| 23 | + |
| 24 | +def smart_sleep(session: str, seconds: float, check_every: float = 2.0) -> bool: |
| 25 | +    """ |
| 26 | +    Alternative to time.sleep() that returns early when the command finishes. |
| 27 | + |
| 28 | +    Returns: |
| 29 | +        True - Normal timeout (command still running) |
| 30 | +        False - Early return (command finished or session gone) |
| 31 | +    """ |
| 32 | +    end_time = time.time() + seconds |
| 33 | +    while time.time() < end_time: |
| 34 | +        try: |
| 35 | +            r = subprocess.run( |
| 36 | +                ["tmux", "list-panes", "-F", "#{pane_current_command}", "-t", session], |
| 37 | +                capture_output=True, text=True, timeout=5 |
| 38 | +            ) |
| 39 | +            if r.returncode != 0: |
| 40 | +                return False  # session gone |
| 41 | +            cmds = [line.strip().lower() for line in r.stdout.splitlines() if line.strip()] |
| 42 | +            if all(c in SHELLS for c in cmds): |
| 43 | +                return False  # command finished, every pane is back at a shell |
| 44 | +        except Exception: |
| 45 | +            return False  # tmux missing or unresponsive; treat like a lost session |
| 46 | + |
| 47 | +        time.sleep(max(0.0, min(check_every, end_time - time.time())))  # never pass a negative value |
| 48 | + |
| 49 | +    return True |
| 50 | + |
| 51 | + |
| 52 | +def main(): |
| 53 | +    parser = argparse.ArgumentParser(description="Wait for a tmux session with smart early-exit.") |
| 54 | +    parser.add_argument("session", help="tmux session name") |
| 55 | +    parser.add_argument("seconds", type=float, help="total seconds to wait") |
| 56 | +    args = parser.parse_args() |
| 57 | + |
| 58 | +    timed_out = smart_sleep(args.session, args.seconds, 2) |
| 59 | +    raise SystemExit(0 if timed_out else 1)  # 0 = still running, 1 = ended early |
| 60 | + |
| 61 | + |
| 62 | +if __name__ == "__main__": |
| 63 | +    main() |
| 64 | +``` |
| 65 | + |
| 66 | +## Start Monitoring |
| 67 | + |
| 68 | +When you need to monitor a tmux window, run: |
| 69 | + |
| 70 | +```bash |
| 71 | +python ./tmp/tmux_wait.py my_ajet_session_name 30 |
| 72 | +``` |
| 73 | + |
| 74 | +This means: |
| 75 | +1. Monitor the tmux session named `my_ajet_session_name` |
| 76 | +2. Wait up to 30 seconds, returning early if the command in the session finishes |
| 77 | + |
| 78 | +- Exit code 0: Normal timeout (command still running) |
| 79 | +- Exit code 1: Command ended early or session disappeared |
| 81 | +## Using SSH |
| 82 | + |
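| 83 | +When using SSH, always establish the SSH connection from inside a local tmux window, so the remote command is read with the same local `tmux capture-pane` calls. Note that the pane's current command is then `ssh`, so `tmux_wait.py` returns early only when the SSH connection itself closes, not when the remote command finishes. |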
| 84 | + |
| 85 | +## When You Want to Delay Before Reading tmux Window Again |
| 86 | + |
| 87 | +The wait must return immediately when the command fails or exits, so do not use `sleep xxx`; use `python ./tmp/tmux_wait.py my_ajet_session_name xxx` instead. |
| 88 | + |
| 89 | +Don't use: `sleep 60 && tmux capture-pane -t my_ajet_session_name -p | tail -80` |
| 90 | + |
| 91 | +You should use: `python ./tmp/tmux_wait.py my_ajet_session_name 30 && tmux capture-pane -t my_ajet_session_name -p | tail -80` |
| 92 | + |
| 93 | +- 60 seconds is too long for a first check; start with 30 seconds |
| 94 | +- Always use `python ./tmp/tmux_wait.py` to wait |
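
With `&&`, the capture in the recommended command runs only when the wait exits with code 0. After an early return (exit code 1) the pane still has to be read separately to see why the command stopped. The sketch below shows one way to wait and then capture in both cases. The name `wait_then_capture`, its parameters, and the injectable `run` argument are hypothetical, and the script path assumes the file from Step Zero.

```python
import subprocess
import sys
from typing import Callable, Tuple


def wait_then_capture(session: str, seconds: float, tail: int = 80,
                      script: str = "./tmp/tmux_wait.py",
                      run: Callable = subprocess.run) -> Tuple[bool, str]:
    """Wait with the helper script, then capture the pane whatever the outcome.

    Returns (still_running, last_lines). `run` defaults to subprocess.run and
    can be swapped out for testing.
    """
    waited = run([sys.executable, script, session, str(seconds)])
    # Exit code 0 = normal timeout, 1 = command ended early or session gone.
    still_running = waited.returncode == 0
    captured = run(["tmux", "capture-pane", "-t", session, "-p"],
                   capture_output=True, text=True)
    # Keep only the last `tail` lines, like `| tail -80` in the shell version.
    lines = captured.stdout.splitlines()[-tail:]
    return still_running, "\n".join(lines)
```

If the session has disappeared, the capture is expected to fail and the returned text is empty; `still_running` is then `False`.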
| 95 | + |
| 96 | + |
| 97 | + |
| 98 | +## Examples |
| 99 | + |
| 100 | + |
| 101 | +### Examples Without SSH |
| 102 | + |
| 103 | +0 examples available |
| 104 | + |
| 105 | + |
| 106 | +### Examples With SSH |
| 107 | + |
| 108 | +1 example available |
| 109 | + |
| 110 | +```agent |
| 111 | +
|
| 112 | +--- example user instruction begin --- |
| 113 | +login |
| 114 | +ssh root@localhost -p 8824 |
| 115 | +goto |
| 116 | +/mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet |
| 117 | +run with venv and monitor |
| 118 | +source .venv/bin/activate && python -m ajet.launcher --conf tests/bench/benchmark_math/benchmark_math.yaml --autokill |
| 119 | +--- example user instruction end --- |
| 120 | +
|
| 121 | +--- example agent reaction begin --- |
| 122 | +The user wants me to monitor a tmux session while running a training task. Let me break down what they're asking: |
| 123 | +1. SSH to remote machine (root@localhost -p 8824) |
| 124 | +2. Go to specific directory |
| 125 | +3. Activate virtual environment and run Python command |
| 126 | +4. Monitor tmux session |
| 127 | +
|
| 128 | +# Write ./tmp/tmux_wait.py |
| 129 | +.... |
| 130 | +
|
| 131 | +# Create tmux session for training (note: session name must contain keyword `ajet`) |
| 132 | +$ tmux new-session -d -s ajet_session -c "/mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet" |
| 133 | +... |
| 134 | +
|
| 135 | +# Send SSH command to tmux session |
| 136 | +$ tmux send-keys -t ajet_session "ssh root@localhost -p 8824" Enter |
| 137 | +... |
| 138 | +
|
| 139 | +# Wait and check SSH connection |
| 140 | +$ python3 ./tmp/tmux_wait.py ajet_session 10 && tmux capture-pane -t ajet_session -p | tail -20 |
| 141 | +SSH connection successful. Now I need to navigate to the directory and run the training command. Let me send commands to the tmux session. |
| 142 | +... |
| 143 | +
|
| 144 | +# Navigate to working directory |
| 145 | +$ tmux send-keys -t ajet_session "cd /mnt/data_cpfs/qingxu.fu/agentjet/good-luck-agentjet" Enter |
| 146 | +... |
| 147 | +
|
| 148 | +# Start training with venv and launcher |
| 149 | +$ tmux send-keys -t ajet_session "source .venv/bin/activate && python -m ajet.launcher --conf tests/bench/benchmark_math/benchmark_math.yaml --autokill" Enter |
| 150 | +... |
| 151 | +
|
| 152 | +# Capture initial training output |
| 153 | +$ tmux capture-pane -t ajet_session -p | tail -40 |
| 154 | +... |
| 155 | +
|
| 156 | +# Wait 30 seconds and check training progress |
| 157 | +$ python3 ./tmp/tmux_wait.py ajet_session 30 && tmux capture-pane -t ajet_session -p | tail -50 |
| 158 | +... |
| 159 | +Training is running. Commands are executing correctly. Let me continue monitoring at 1-minute intervals. |
| 160 | +
|
| 161 | +# Wait 1 minute and check training progress |
| 162 | +$ python3 ./tmp/tmux_wait.py ajet_session 60 && tmux capture-pane -t ajet_session -p | tail -50 |
| 163 | +... |
| 164 | +
|
| 165 | +# Wait 2 minutes and check training progress |
| 166 | +$ python3 ./tmp/tmux_wait.py ajet_session 120 && tmux capture-pane -t ajet_session -p | tail -60 |
| 167 | +... |
| 168 | +Training is progressing well. Checkpoints loaded, vLLM server started, swanlab tracking the run. No errors so far, only warnings. Let me continue monitoring. |
| 169 | +
|
| 170 | +# Wait 4 minutes and check training progress |
| 171 | +$ python3 ./tmp/tmux_wait.py ajet_session 240 && tmux capture-pane -t ajet_session -p | tail -60 |
| 172 | +... |
| 173 | +
|
| 174 | +... multiple rounds of monitoring ... |
| 175 | +
|
| 176 | +--- example agent reaction end --- |
| 177 | +
|
| 178 | +# Destroy tmux session |
| 179 | +tmux kill-session -t ajet_session |
| 180 | +
|
| 181 | +``` |
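
In the example, the agent reads each capture and reports "No errors so far, only warnings". A rough sketch of that kind of check on captured pane text is shown below. The pattern list and the name `find_anomalies` are hypothetical, not part of the skill, and the sample log lines are made up for illustration; real patterns would need to match the training framework's actual output.

```python
import re
from typing import List

# Hypothetical patterns; warnings are deliberately not included.
ANOMALY_PATTERNS = [
    r"Traceback \(most recent call last\)",
    r"\w*Error\b",       # e.g. RuntimeError, ValueError, or a bare "error"
    r"out of memory",
    r"\bnan\b",          # e.g. loss=nan
]

_COMBINED = re.compile("|".join(ANOMALY_PATTERNS), re.IGNORECASE)


def find_anomalies(pane_text: str) -> List[str]:
    """Return the pane lines that match any anomaly pattern, in order."""
    return [line for line in pane_text.splitlines() if _COMBINED.search(line)]


if __name__ == "__main__":
    # Made-up sample output, not taken from a real run.
    sample = (
        "step 10 loss=0.52\n"
        "WARNING: deprecated flag\n"
        "RuntimeError: CUDA out of memory\n"
        "step 11 loss=nan\n"
    )
    for hit in find_anomalies(sample):
        print(hit)
```

An empty result means the monitoring loop can continue to the next wait; any hit is a cue to read more of the log before suggesting a fix.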