
runtime: add adaptive Slack bridge restart policy#148

Merged
benvinegar merged 2 commits into main from runtime/adaptive-bridge-restart-121 on Feb 23, 2026


@benvinegar benvinegar commented Feb 23, 2026

Implements #121 by replacing the fixed-delay Slack bridge restart loop with a shared supervisor that supports adaptive backoff, jitter, and failure-threshold signaling while preserving legacy behavior by default.

What changed

  • Added bin/lib/bridge-restart-policy.sh with reusable supervisor helpers:
    • policy mode selection (legacy vs adaptive)
    • exponential backoff with cap
    • jitter support
    • stable-window reset behavior
    • threshold-exceeded state signaling
    • structured bridge-supervisor log lines
    • status file writer (~/.pi/agent/slack-bridge-supervisor.json)
  • Updated start.sh to use bb_bridge_supervise instead of hardcoded fixed-delay restart loops.
  • Updated pi/skills/control-agent/startup-cleanup.sh to use the same supervisor helper (with legacy fallback if helper is unavailable).
  • Updated bin/deploy.sh to stage/deploy bin/lib/bridge-restart-policy.sh into runtime.
  • Updated bin/lib/baudbot-runtime.sh (baudbot status) to surface supervisor status, including degraded/threshold-exceeded state.
  • Added tests: bin/lib/bridge-restart-policy.test.sh and wired it into test/shell-scripts.test.mjs.
  • Documented new env vars in CONFIGURATION.md and .env.schema.
  • Updated control-agent skill docs with supervisor status file check.
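The backoff and jitter helpers listed above could look roughly like the following. This is a minimal sketch, not the contents of bin/lib/bridge-restart-policy.sh: the function names `sketch_backoff_delay` and `sketch_jitter` are hypothetical, and the use of `awk` for randomness is an assumption made for portability.

```bash
#!/usr/bin/env bash
# Hypothetical helper names; the real functions in bridge-restart-policy.sh may differ.

# Print base * 2^doublings, capped at max (all values in whole seconds).
sketch_backoff_delay() {
  local doublings="$1" base="$2" max="$3"
  local delay="$base" i=0
  # Stop doubling once the cap is reached so large inputs cannot overflow.
  while [ "$i" -lt "$doublings" ] && [ "$delay" -lt "$max" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max" ]; then delay="$max"; fi
  printf '%s\n' "$delay"
}

# Print a random whole number of seconds in [0, max_jitter]; 0 disables jitter.
sketch_jitter() {
  local max_jitter="$1"
  if [ "$max_jitter" -le 0 ]; then
    printf '0\n'
  else
    # rand() is in [0, 1), so the result is an integer in [0, max_jitter].
    awk -v n="$max_jitter" 'BEGIN { srand(); print int(rand() * (n + 1)) }'
  fi
}
```

A caller would sleep for the sum of the two values, for example `sleep $(( $(sketch_backoff_delay 3 5 300) + $(sketch_jitter 2) ))`. The jitter spreads restarts out so that several supervisors recovering from the same outage do not reconnect at the same instant.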

Backward compatibility

  • If restart-policy env vars are unset, behavior remains legacy fixed-delay restart (5s), matching existing deployments.
  • Adaptive mode activates when:
    • BAUDBOT_BRIDGE_RESTART_POLICY=adaptive, or
    • any adaptive restart knobs are set.
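The selection rule above can be sketched as a small function. The name `sketch_policy_mode` is hypothetical, and the knob values are passed as plain arguments because this page does not list the individual knob variable names. One further assumption: an explicit `legacy` policy value wins even if knobs are set, which the description does not state either way.

```bash
#!/usr/bin/env bash
# sketch_policy_mode POLICY [KNOB_VALUE...]
# POLICY is the value of BAUDBOT_BRIDGE_RESTART_POLICY (may be empty).
# Each KNOB_VALUE is the value of one adaptive tuning variable (may be empty).
sketch_policy_mode() {
  local policy="$1" knob
  shift
  # An explicit policy value takes precedence (assumption for the legacy case).
  case "$policy" in
    adaptive) echo adaptive; return 0 ;;
    legacy)   echo legacy;   return 0 ;;
  esac
  # No explicit policy: any non-empty knob switches adaptive mode on.
  for knob in "$@"; do
    if [ -n "$knob" ]; then
      echo adaptive
      return 0
    fi
  done
  # Nothing configured: keep the fixed-delay behavior of existing deployments.
  echo legacy
}
```

For example, `sketch_policy_mode "${BAUDBOT_BRIDGE_RESTART_POLICY:-}"` prints `legacy` on a deployment that sets none of the new variables.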

Validation

  • bash -n on updated shell scripts ✅
  • npm run test:shell ✅ (12 tests)
  • npm run lint:shell ⚠️ blocked locally (shellcheck missing in PATH)


greptile-apps Bot commented Feb 23, 2026

Greptile Summary

Replaced fixed-delay Slack bridge restart loop with shared supervisor library supporting adaptive backoff, jitter, and failure-threshold signaling while preserving backward compatibility.

Key improvements:

  • Exponential backoff with configurable base/max delays (default: 5s base, 300s cap)
  • Random jitter support to prevent thundering herd (default: 0-2s)
  • Stable-window reset behavior (default: 120s runtime resets failure counters)
  • Threshold-exceeded state signaling when consecutive failures reach limit (default: 5)
  • Structured logging with bridge-supervisor event lines
  • JSON status file writer for runtime observability via baudbot status
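As an illustration of the status file writer, the sketch below writes a one-line JSON document atomically. The function name `sketch_write_status` and the field names are hypothetical; only the `mode` and `state` concepts appear in this PR, and the actual schema of ~/.pi/agent/slack-bridge-supervisor.json is not shown here.

```bash
#!/usr/bin/env bash
# sketch_write_status FILE MODE STATE FAILURES DELAY
# Field names are illustrative assumptions, not the real schema.
# Values are assumed to be simple tokens and integers, so no JSON escaping is done.
sketch_write_status() {
  local file="$1" mode="$2" state="$3" failures="$4" delay="$5"
  local tmp="${file}.tmp.$$"
  mkdir -p "$(dirname "$file")"
  printf '{"mode":"%s","state":"%s","consecutive_failures":%s,"next_delay_seconds":%s}\n' \
    "$mode" "$state" "$failures" "$delay" > "$tmp"
  # Rename within the same directory so readers never observe a partial file.
  mv "$tmp" "$file"
}
```

Writing to a temporary file and renaming it means a concurrent reader such as a status command sees either the old document or the new one, never a half-written file.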

Backward compatibility:

  • Legacy mode (fixed 5s restart) remains default when no policy env vars are set
  • Adaptive mode activates when BAUDBOT_BRIDGE_RESTART_POLICY=adaptive OR any adaptive knobs are configured
  • Graceful fallback in startup-cleanup.sh when helper unavailable (legacy deployments)
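The fallback described in the last bullet might be structured as follows. This is a sketch under assumptions: `sketch_load_supervisor` is a hypothetical name, and only `bb_bridge_supervise` comes from the PR itself.

```bash
#!/usr/bin/env bash
# sketch_load_supervisor HELPER_PATH
# Sources the helper if it is readable, then reports which restart path to use.
sketch_load_supervisor() {
  local helper="$1"
  if [ -r "$helper" ]; then
    # shellcheck disable=SC1090
    . "$helper"
  fi
  # Only use the supervisor if the helper actually defined the entry point.
  if command -v bb_bridge_supervise >/dev/null 2>&1; then
    echo supervisor
  else
    echo legacy-fallback
  fi
}
```

Checking for the function rather than only for the file covers a partially deployed runtime where the file exists but predates `bb_bridge_supervise`.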

Integration points:

  • start.sh - uses supervisor for main runtime bridge startup
  • startup-cleanup.sh - uses supervisor with fallback for control-agent restarts
  • baudbot status - surfaces supervisor health state (healthy/degraded/restarting)
  • deploy.sh - stages and deploys new helper library

All changes properly tested with 6 new test cases covering mode detection, integer parsing, backoff computation, and jitter bounds.

Confidence Score: 5/5

  • This PR is safe to merge with no identified issues
  • Well-architected implementation with comprehensive testing, backward compatibility, graceful degradation, proper documentation, and clear separation of concerns - follows project conventions and maintains existing behavior by default
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| bin/lib/bridge-restart-policy.sh | New shared supervisor library with policy mode detection, exponential backoff, jitter, and status tracking - well-structured with comprehensive helper functions |
| start.sh | Replaced hardcoded restart loop with bb_bridge_supervise call, added status file support - clean integration maintaining backward compatibility |
| pi/skills/control-agent/startup-cleanup.sh | Updated to use supervisor helper with graceful fallback to legacy behavior when helper unavailable - properly handles legacy deployment scenarios |
| bin/lib/baudbot-runtime.sh | Added print_bridge_supervisor_status function to surface supervisor state in status output - cleanly integrates with existing status reporting |
| .env.schema | Added 6 new optional environment variables for adaptive restart policy configuration with clear documentation |
| CONFIGURATION.md | Documented new bridge restart policy variables in Bridge section with defaults - comprehensive and clear |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[start.sh / startup-cleanup.sh] --> B{Policy Mode Detection}
    B -->|No env vars set| C[Legacy Mode]
    B -->|BAUDBOT_BRIDGE_RESTART_POLICY set| D{Policy Value}
    B -->|Any adaptive knob set| E[Adaptive Mode]
    D -->|adaptive| E
    D -->|legacy| C
    
    C --> C1[Fixed 5s delay restart loop]
    C1 --> C2[Write status file: mode=legacy]
    C2 --> C3[Log structured events]
    
    E --> E1[Load adaptive parameters]
    E1 --> E2[base_delay, max_delay, stable_window, max_failures, jitter]
    E2 --> E3[Restart loop with runtime tracking]
    E3 --> E4{Runtime >= stable_window?}
    E4 -->|Yes| E5[Reset counters to base_delay]
    E4 -->|No| E6[Increment failures, double delay]
    E5 --> E7{failures >= threshold?}
    E6 --> E7
    E7 -->|Yes| E8[Set state=threshold_exceeded]
    E7 -->|No| E9[Set state=restarting]
    E8 --> E10[Add jitter, write status, sleep]
    E9 --> E10
    E10 --> E3
    
    E10 -.-> S1[Status File JSON]
    C2 -.-> S1
    S1 --> S2[baudbot status reads supervisor state]
    S2 --> S3[Display: healthy/degraded/restarting]
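
The adaptive branch of the flowchart (E4 through E9) can be read as a single state-transition step. The sketch below follows the diagram literally; `sketch_next_state` is a hypothetical name, and capping the doubled delay at `max_delay` is taken from the "exponential backoff with cap" description rather than from the diagram.

```bash
#!/usr/bin/env bash
# sketch_next_state RUNTIME FAILURES DELAY BASE MAX STABLE_WINDOW THRESHOLD
# Prints "<failures> <delay> <state>" for the next loop iteration.
sketch_next_state() {
  local runtime="$1" failures="$2" delay="$3"
  local base="$4" max="$5" stable="$6" threshold="$7"
  local state
  if [ "$runtime" -ge "$stable" ]; then
    # The bridge ran long enough to count as stable: reset the counters.
    failures=0
    delay="$base"
  else
    # Short-lived run: count the failure and double the delay, up to the cap.
    failures=$((failures + 1))
    delay=$((delay * 2))
    if [ "$delay" -gt "$max" ]; then delay="$max"; fi
  fi
  if [ "$failures" -ge "$threshold" ]; then
    state=threshold_exceeded
  else
    state=restarting
  fi
  printf '%s %s %s\n' "$failures" "$delay" "$state"
}
```

With the defaults quoted in the summary (5s base, 300s cap, 120s stable window, threshold 5), a bridge that exits after 3 seconds with 4 prior failures and a 160s delay yields `5 300 threshold_exceeded`, while one that ran for 200 seconds yields `0 5 restarting`.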

Last reviewed commit: d12b6f4

@benvinegar benvinegar merged commit c8acc8b into main Feb 23, 2026
9 checks passed
baudbot-agent pushed a commit that referenced this pull request Feb 23, 2026
When startup-cleanup.sh runs mid-session (called by the control agent),
two inherited env vars cause bridge startup failures:

1. PKG_EXECPATH — leaked from the parent varlock-launched process, causes
   varlock's SEA binary to misinterpret subcommands as Node module paths.
   The varlock broker-key probes (lines 115-122) silently fail, resulting
   in 'No Slack transport configured' and the bridge never starting.

2. SLACK_BROKER_ACCESS_TOKEN / SLACK_BROKER_ACCESS_TOKEN_EXPIRES_AT —
   varlock does not override env vars already present in the parent
   process. If the broker token was rotated after session start, the
   supervisor passes the stale (expired) values instead of reading
   fresh ones from ~/.config/.env.

Fix: unset PKG_EXECPATH at script top (before varlock probes), and
unset broker token vars in the supervisor subshell (before varlock run).

Regression from #148.
baudbot-agent pushed a commit that referenced this pull request Feb 24, 2026
…-cleanup

When startup-cleanup.sh runs mid-session (called by the control agent),
inherited env vars cause bridge startup failures:

1. PKG_EXECPATH — leaked from the parent varlock-launched process, causes
   varlock's SEA binary to misinterpret subcommands as Node module paths.
   The varlock broker-key probes (lines 115-122) silently fail, resulting
   in 'No Slack transport configured' and the bridge never starting.

2. varlock run does not override env vars already present in the parent
   process. If any managed value (broker tokens, API keys, config) was
   rotated after session start, the supervisor passes the stale values
   instead of reading fresh ones from ~/.config/.env.

Fix:
- unset PKG_EXECPATH at the script top (before varlock probes)
- In the supervisor subshell, dynamically unset ALL varlock-managed keys
  via 'varlock load --format env' before calling 'varlock run', so every
  restart gets fresh values regardless of which keys changed.
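
The "unset every managed key" step from this fix could be sketched as below. The function name `sketch_unset_env_keys` is hypothetical, and the input is assumed to be plain `KEY=value` lines; the exact output of `varlock load --format env` is not shown on this page, so treat that format as an assumption.

```bash
#!/usr/bin/env bash
# Reads KEY=value lines on stdin and unsets each KEY in the current shell.
# Blank lines, comments, and lines without a valid variable name are skipped.
sketch_unset_env_keys() {
  local line key
  while IFS= read -r line; do
    key="${line%%=*}"
    # Skip lines that contain no '=' at all.
    [ "$key" != "$line" ] || continue
    case "$key" in
      ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*) continue ;;  # not a valid shell name
    esac
    unset -v "$key"
  done
}

# Possible use inside the supervisor subshell (not executed in this sketch):
#   unset PKG_EXECPATH
#   sketch_unset_env_keys <<EOF
#   $(varlock load --format env)
#   EOF
```

The here-document matters: piping into the function would run it in a subshell in most shells, and the `unset` calls would then have no effect on the process that later launches the bridge.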

Regression from #148.