Merged
Changes from 2 commits
10 changes: 0 additions & 10 deletions bin/baudbot
@@ -351,16 +351,6 @@ case "${1:-}" in
shift
require_root "restart"
if has_systemd; then
-# Ensure any pre-existing detached bridge tmux session is torn down so
-# restart always boots a fresh bridge from currently deployed runtime files.
-AGENT_USER="${BAUDBOT_AGENT_USER:-baudbot_agent}"
-if command -v tmux >/dev/null 2>&1; then
-if command -v sudo >/dev/null 2>&1; then
-sudo -u "$AGENT_USER" tmux kill-session -t slack-bridge 2>/dev/null || true
-elif command -v runuser >/dev/null 2>&1; then
-runuser -u "$AGENT_USER" -- tmux kill-session -t slack-bridge 2>/dev/null || true
-fi
-fi
exec systemctl restart baudbot "$@"
else
echo "systemd not available."
13 changes: 3 additions & 10 deletions bin/baudbot.test.sh
@@ -130,7 +130,7 @@ EOF
)
}

-test_restart_restarts_systemd_and_kills_bridge_tmux() {
+test_restart_restarts_systemd() {
(
set -euo pipefail
local tmp fakebin log_file
@@ -168,12 +168,6 @@ if [ "${1:-}" = "-u" ]; then
fi
echo "sudo $*" >> "${BAUDBOT_TEST_LOG}"
exec "$@"
EOF

-cat > "$fakebin/tmux" <<'EOF'
-#!/bin/bash
-echo "tmux $*" >> "${BAUDBOT_TEST_LOG}"
-exit 0
-EOF
-
cat > "$fakebin/systemctl" <<'EOF'
@@ -182,11 +176,10 @@ echo "systemctl $*" >> "${BAUDBOT_TEST_LOG}"
exit 0
EOF

-chmod +x "$fakebin/id" "$fakebin/sudo" "$fakebin/tmux" "$fakebin/systemctl"
+chmod +x "$fakebin/id" "$fakebin/sudo" "$fakebin/systemctl"

PATH="$fakebin:$PATH" BAUDBOT_TEST_LOG="$log_file" BAUDBOT_ROOT="$tmp" bash "$CLI" restart

-grep -q '^tmux kill-session -t slack-bridge$' "$log_file"
grep -q '^systemctl restart baudbot$' "$log_file"
)
}
@@ -198,7 +191,7 @@ run_test "version reads package.json" test_version_uses_package_json
run_test "status dispatches via runtime module" test_status_dispatches_via_runtime_module
run_test "attach requires root" test_attach_requires_root
run_test "broker register requires root" test_broker_register_requires_root
-run_test "restart kills bridge tmux then restarts systemd" test_restart_restarts_systemd_and_kills_bridge_tmux
+run_test "restart restarts systemd" test_restart_restarts_systemd

echo ""
echo "=== $PASSED/$TOTAL passed, $FAILED failed ==="
2 changes: 1 addition & 1 deletion bin/lib/baudbot-runtime.sh
@@ -398,7 +398,7 @@ cmd_attach() {
echo " sudo baudbot attach # defaults to control-agent"
echo " sudo baudbot attach --pi control-agent"
echo " sudo baudbot attach --pi <uuid>"
-echo " sudo baudbot attach --tmux slack-bridge"
+echo " sudo baudbot attach --tmux sentry-agent"
exit 0
;;
*)
12 changes: 5 additions & 7 deletions pi/skills/control-agent/SKILL.md
@@ -339,14 +339,12 @@ The sentry-agent operates in **on-demand mode** — it does NOT poll. Sentry ale

### Starting the Slack Bridge

-The `startup-cleanup.sh` script handles bridge (re)start automatically — it detects broker vs Socket Mode, reads the control-agent UUID, and launches the bridge in a `slack-bridge` tmux session.
+The `startup-cleanup.sh` script handles bridge (re)start automatically — it detects broker vs Socket Mode, reads the control-agent UUID, and starts the bridge as a normal background process.

-If you need to restart the bridge manually:
+If you need to restart the bridge manually, rerun startup cleanup and then inspect logs:
```bash
-MY_UUID=$(readlink ~/.pi/session-control/control-agent.alias | sed 's/.sock$//')
-tmux kill-session -t slack-bridge 2>/dev/null || true
-tmux new-session -d -s slack-bridge \
-"unset PKG_EXECPATH; export PATH=\$HOME/.varlock/bin:\$HOME/opt/node-v22.14.0-linux-x64/bin:\$PATH && export PI_SESSION_ID=$MY_UUID && cd ~/runtime/slack-bridge && exec varlock run --path ~/.config/ -- node broker-bridge.mjs"
+bash ~/.pi/agent/skills/control-agent/startup-cleanup.sh UUID1 UUID2 UUID3
+tail -n 200 ~/.pi/agent/logs/slack-bridge.log
```

Verify: `curl -s -o /dev/null -w '%{http_code}' -X POST http://127.0.0.1:7890/send -H 'Content-Type: application/json' -d '{}'` → should return `400`.
@@ -364,7 +362,7 @@ If you need to check manually, use `heartbeat trigger` to run all checks immedia
When the heartbeat reports a failure, take the appropriate action:
1. **Missing sentry-agent**: Respawn with tmux and re-send role assignment.
2. **Orphaned dev-agents**: Kill tmux session and remove worktree.
-3. **Bridge down**: Restart the `slack-bridge` tmux session.
+3. **Bridge down**: Restart via `startup-cleanup.sh`, then check `~/.pi/agent/logs/slack-bridge.log`.
4. **Stale worktrees**: `git worktree remove --force` + `rmdir` empty parents.
5. **Stuck todos**: Escalate to user via Slack.

36 changes: 28 additions & 8 deletions pi/skills/control-agent/startup-cleanup.sh
@@ -7,7 +7,7 @@
# Pass the live session UUIDs (from list_sessions) as arguments.
# Any .sock file whose UUID is NOT in the live set gets removed.
# Stale .alias symlinks pointing to removed sockets also get cleaned.
-# Then restarts the slack-bridge tmux session with the current control-agent UUID.
+# Then restarts the slack-bridge process with the current control-agent UUID.

set -euo pipefail

@@ -66,11 +66,20 @@ else
exit 1
fi

-# Kill existing slack-bridge tmux session if running
-if tmux has-session -t slack-bridge 2>/dev/null; then
-echo "Killing existing slack-bridge session..."
-tmux kill-session -t slack-bridge
-sleep 1
+BRIDGE_PID_FILE="$HOME/.pi/agent/slack-bridge.pid"
+BRIDGE_LOG_DIR="$HOME/.pi/agent/logs"
+BRIDGE_LOG_FILE="$BRIDGE_LOG_DIR/slack-bridge.log"
+
+# Kill existing slack-bridge process if running
+if [ -f "$BRIDGE_PID_FILE" ]; then
+BRIDGE_PID="$(cat "$BRIDGE_PID_FILE" 2>/dev/null || true)"
+if [ -n "$BRIDGE_PID" ] && kill -0 "$BRIDGE_PID" 2>/dev/null; then
+echo "Killing existing slack-bridge process (pid=$BRIDGE_PID)..."
+kill "$BRIDGE_PID" 2>/dev/null || true
+sleep 1
+kill -9 "$BRIDGE_PID" 2>/dev/null || true
+fi
+rm -f "$BRIDGE_PID_FILE"
fi

# Select bridge script: prefer broker pull mode when SLACK_BROKER_* vars are present,
@@ -101,8 +110,19 @@ fi

# Start fresh slack-bridge
echo "Starting slack-bridge ($BRIDGE_SCRIPT) with PI_SESSION_ID=$MY_UUID..."
-tmux new-session -d -s slack-bridge \
-"unset PKG_EXECPATH; export PATH=\$HOME/.varlock/bin:\$HOME/opt/node-v22.14.0-linux-x64/bin:\$PATH && export PI_SESSION_ID=$MY_UUID && cd /opt/baudbot/current/slack-bridge && exec varlock run --path ~/.config/ -- node $BRIDGE_SCRIPT"
+mkdir -p "$BRIDGE_LOG_DIR"
+(
+unset PKG_EXECPATH
+export PATH="$HOME/.varlock/bin:$HOME/opt/node-v22.14.0-linux-x64/bin:$PATH"
+export PI_SESSION_ID="$MY_UUID"
+cd /opt/baudbot/current/slack-bridge
+exec varlock run --path ~/.config/ -- node "$BRIDGE_SCRIPT"
+) >>"$BRIDGE_LOG_FILE" 2>&1 &
+NEW_BRIDGE_PID=$!
+echo "$NEW_BRIDGE_PID" > "$BRIDGE_PID_FILE"
+chmod 600 "$BRIDGE_PID_FILE"
+echo "Bridge pid: $NEW_BRIDGE_PID"
+echo "Bridge logs: $BRIDGE_LOG_FILE"

# Wait for bridge to come up
sleep 3
39 changes: 29 additions & 10 deletions start.sh
@@ -83,16 +83,35 @@ fi

if [ -n "$BRIDGE_SCRIPT" ]; then
RELEASE_BRIDGE="/opt/baudbot/current/slack-bridge"
-tmux kill-session -t slack-bridge 2>/dev/null || true
-echo "Starting Slack bridge ($BRIDGE_SCRIPT)..."
-tmux new-session -d -s slack-bridge \
-"export PATH=$HOME/.varlock/bin:$HOME/opt/node-v22.14.0-linux-x64/bin:\$PATH && \
-cd $RELEASE_BRIDGE && \
-while true; do \
-varlock run --path ~/.config/ -- node $BRIDGE_SCRIPT; \
-echo '⚠️ Bridge exited (\$?), restarting in 5s...'; \
-sleep 5; \
-done"
+BRIDGE_LOG_DIR="$HOME/.pi/agent/logs"
+BRIDGE_LOG_FILE="$BRIDGE_LOG_DIR/slack-bridge.log"
+BRIDGE_PID_FILE="$HOME/.pi/agent/slack-bridge.pid"
+
+mkdir -p "$BRIDGE_LOG_DIR"
+
+# Stop any previous bridge process tracked by pid file.
+if [ -f "$BRIDGE_PID_FILE" ]; then
+old_pid="$(cat "$BRIDGE_PID_FILE" 2>/dev/null || true)"
+if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
+kill "$old_pid" 2>/dev/null || true
+sleep 1
+kill -9 "$old_pid" 2>/dev/null || true
+fi
+rm -f "$BRIDGE_PID_FILE"
+fi
+
+echo "Starting Slack bridge ($BRIDGE_SCRIPT)... logs: $BRIDGE_LOG_FILE"
+(
+export PATH="$HOME/.varlock/bin:$HOME/opt/node-v22.14.0-linux-x64/bin:$PATH"
+cd "$RELEASE_BRIDGE"
+while true; do
+varlock run --path ~/.config/ -- node "$BRIDGE_SCRIPT" >>"$BRIDGE_LOG_FILE" 2>&1
+echo "[$(date -Is)] ⚠️ Bridge exited ($?), restarting in 5s..." >>"$BRIDGE_LOG_FILE"
+sleep 5
+done
+) &
+echo $! > "$BRIDGE_PID_FILE"
Review comment on start.sh (line 113):

The PID file stores the subshell PID, not the actual Node process. When the bridge exits and the while true loop restarts it, the PID file becomes stale, pointing to the subshell wrapping the loop rather than tracking each new Node process instance.

Reply from the PR author (Member):

Good callout. In this case the PID file is intentionally storing the supervisor subshell PID (the restart loop), not the child node PID. That PID remains stable across child restarts and is the one we need to kill to stop bridge supervision.

I added an inline clarification comment next to the PID write so this intent is explicit.

Responded by pi-coding-agent using openai/gpt-5.
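
The point made in the reply can be illustrated with a minimal sketch. It is not the project's code: a short-lived `sh -c` child stands in for the Node bridge, and the names `workdir`, `pid_file`, and `child_log` are hypothetical. It shows that `$!` captured after backgrounding the loop is the supervisor subshell's PID, which stays the same while the child PIDs change on every restart.

```shell
#!/usr/bin/env bash
# Sketch only: a short-lived `sh -c` child stands in for the real bridge process.
set -euo pipefail

workdir="$(mktemp -d)"              # hypothetical scratch directory, not the real paths
pid_file="$workdir/supervisor.pid"
child_log="$workdir/child-pids.log"

(
  # Supervisor loop: every iteration launches a brand-new child process.
  while true; do
    sh -c 'echo $$' >>"$child_log"  # the child records its own PID, then exits
    sleep 1
  done
) &
echo $! > "$pid_file"               # $! is the PID of the backgrounded subshell

sleep 3                             # give the loop time to restart the child

supervisor_pid="$(cat "$pid_file")"
distinct_children="$(sort -u "$child_log" | wc -l | tr -d ' ')"

if kill -0 "$supervisor_pid" 2>/dev/null; then
  supervisor_alive="yes"
else
  supervisor_alive="no"
fi
echo "supervisor pid $supervisor_pid alive: $supervisor_alive"
echo "distinct child pids seen: $distinct_children"

# Stopping supervision means signalling the PID stored in the file.
kill "$supervisor_pid"
wait "$supervisor_pid" 2>/dev/null || true

if kill -0 "$supervisor_pid" 2>/dev/null; then
  supervisor_stopped="no"
else
  supervisor_stopped="yes"
fi
echo "supervisor stopped: $supervisor_stopped"
rm -rf "$workdir"
```

One caveat worth keeping in mind when reading the sketch: signalling the supervisor PID ends the restart loop, but it does not by itself signal a child that happens to be running at that moment. The stand-in child here exits immediately, so the sketch does not exercise that case.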

+chmod 600 "$BRIDGE_PID_FILE"
fi

# Set session name (read by auto-name.ts extension)