fix(telegram): v0.0.7 reliability rollup — state-dir, PID guard, ppid watchdog, install stdout#1424
noahzweben wants to merge 3 commits into
…nd server

The server already reads TELEGRAM_STATE_DIR for multi-bot setups, but the /telegram:access and /telegram:configure skills hardcoded ~/.claude/channels/telegram/ in 11 places. So with a custom state dir the skill writes access.json to the default location while the server reads from the override, and pairing and allowlist edits silently don't take effect.

Skills now resolve the state dir via shell expansion (TELEGRAM_STATE_DIR → CLAUDE_CONFIG_DIR/channels/telegram → ~/.claude/channels/telegram) before any read/write. Server gets the same CLAUDE_CONFIG_DIR fallback.

Also adds Bash(echo)/Bash(chmod) to configure skill's allowed-tools (chmod was already documented but not allowlisted).
PID files race with OS PID recycling. The lockfile from #1349 stored only a PID; after enough churn that PID can be reassigned to anything, including the new launch's own bun-run wrapper. SIGTERMing the wrapper closes our stdin and triggers immediate self-shutdown ('replacing stale poller' then 'shutting down' within seconds, which matches #1459 item 3). Now check that 'ps -p <pid> -o args=' contains 'server.ts' before killing. The check uses execFileSync (no shell); the whole block is already wrapped in try/catch, so on Windows or when ps is missing it falls through to just overwriting the lockfile.
…to stderr

Two v0.0.5/0.0.6 regressions causing the plugin to fail at startup:

1. The orphan watchdog's process.ppid !== bootPpid check false-fires when the bun-run/shell wrapper exits or execs during normal startup and we get reparented to init; the plugin self-terminates ~5s after launch. Stdin-close alone is the correct signal: the kernel closes the MCP pipe on any CLI death regardless of intermediate wrappers, so the ppid check was both unnecessary and harmful. (#1467; also the actual cause of #1459 item 3 and likely #1425.)

2. 'bun install --no-summary' in the start script writes to stdout, which is the MCP JSON-RPC transport. The harness sees non-JSON bytes during the handshake and drops the connection ('Failed to connect'). Redirect install output to stderr. (#1470; also explains #1425 on Windows.)
+1: this rollup matches symptoms we've been seeing in a 24/7 production deployment (launchd+tmux wrapper on a Mac Mini, one user, allowlist policy).

Evidence for commit 3 (ppid watchdog false-fire)

On v0.0.6 the plugin would self-terminate ~5s after startup on a non-trivial fraction of restarts.

Evidence for commit 4 (
This needs to be merged quickly...
+1 to shipping this as v0.0.7 soon, @noahzweben. Just spent an evening tracking down 7 orphaned Commit 3 (drop ppid check, rely on stdin-close) directly addresses my case. Given the rollup is:
...would it be acceptable to mark this ready and ship as v0.0.7 standalone, and let #1560's inbound-delivery fix land in v0.0.8? The orphan-process bugs are actively burning CPU on user machines today.

Happy to test a prerelease build on macOS Apple Silicon if useful.
For broader context on what users are seeing with v0.0.6 disconnect symptoms: I filed anthropics/claude-code#54544 documenting an apparent CC-side cause in the MCP host's ping/keepalive cycle (5-min interval JSON-RPC

This PR's fixes are real and worth shipping; they address bot-side reliability issues. But even with this PR fully merged, the ping-timeout kill cycle described in #54544 would persist whenever the bot is busy enough to miss a ping response. The two are related but different layers; flagging here in case the reviewer team wants to coordinate.
Hitting this on v0.0.6, macOS (Darwin 25.4.0), Claude Code 2.1.119. Adding observations in case useful for prioritization:

Pattern: Mid-session Telegram MCP disconnects roughly 1 to 3 times per day. Bot still responsive (Telegram API works fine) right up until the disconnect, then /mcp shows it dropped. Manual /mcp Reconnect brings it back instantly.

Evidence it's a process exit, not API connectivity: ~/.claude/channels/telegram/bot.pid mtime always matches the moment I run /mcp Reconnect (the new server writes a fresh PID on launch).

+1 on getting v0.0.7 out. Both commit 2 (PID-recycle SIGTERM hitting its own wrapper) and commit 3 (ppid watchdog false-fire on reparenting) feel plausibly involved here; the evidence above can't distinguish between them since both produce the same surface symptom: bun server.ts exits, its stdin pipe to Claude Code closes, MCP child marked disconnected.

Thanks for the rigorous fix.
+1, independent reproduction this morning on a single-user macOS terminal session: no launchd, no tmux, no fleet. Adds to the existing repros from different deployment shapes.

Setup: macOS 25.5.0, Claude Code 2.1.138, plugin v0.0.6, bun 1.x, DM pairing policy, single approved chat. Bot launched via the standard

End-user symptom worth naming: "first Telegram reply works, every subsequent message arrives but gets no response." This is the v0.0.6 ppid-watchdog kill, but it doesn't always look like a generic MCP disconnect; the channel-listener side keeps queueing inbound messages in

Forensics from one hit:

Matches the commit-3 hypothesis perfectly: bun-run intermediate exits during normal startup, ppid changes, next 5s interval tick fires

Ship it: a month of draft is producing more harm than the regressions in commits 1-2 would prevent.
+1, independent reproduction on macOS multi-bot setup: two Telegram bots in separate launcher dirs, each

Hit today:

Locally patched both SKILL.md files along the same lines as 223c9b2; symptom gone, single-bot sessions unchanged (env var unset → default path).

+1 to shipping as v0.0.7.
Four independent fixes for v0.0.7. Commits 3 and 4 fix regressions introduced by #1349.
1. Skills ignore TELEGRAM_STATE_DIR / CLAUDE_CONFIG_DIR (223c9b2)
`server.ts:26` already honors `TELEGRAM_STATE_DIR`, but the `/telegram:access` and `/telegram:configure` skills hardcode `~/.claude/channels/telegram/` in 11 places, so skill writes and server reads diverge and pairing/allowlist edits silently no-op. Skills now resolve the dir via shell expansion first; server gets a `CLAUDE_CONFIG_DIR` fallback. Adds `Bash(echo *)`/`Bash(chmod *)` to allowed-tools.

Fixes #931, fixes #914, fixes #933, fixes #851; addresses anthropics/claude-code#37173.
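For readers following along, the fallback order can be sketched in TypeScript roughly as below. This is only an illustration: `resolveStateDir` and its parameters are hypothetical names, not the code in `server.ts`, and treating an empty variable as unset is an assumption that mirrors shell `${VAR:-default}` expansion.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical helper; name and signature are illustrative, not taken from server.ts.
// `env` and `home` are injectable so the lookup order can be exercised in isolation.
function resolveStateDir(
  env: Record<string, string | undefined> = process.env,
  home: string = homedir(),
): string {
  // 1. An explicit per-bot override wins (empty string is treated as unset, an assumption).
  if (env.TELEGRAM_STATE_DIR) return env.TELEGRAM_STATE_DIR;
  // 2. Otherwise nest under a relocated Claude config dir, if one is set.
  if (env.CLAUDE_CONFIG_DIR) return join(env.CLAUDE_CONFIG_DIR, "channels", "telegram");
  // 3. Fall back to the default location under the user's home directory.
  return join(home, ".claude", "channels", "telegram");
}
```

The point of routing both the skills and the server through one lookup order is that a write to `access.json` and the later read always land in the same directory.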
2. PID-lockfile SIGTERM can hit a recycled PID (6ceddea)
#1349's lockfile stores only a PID. After enough churn the OS recycles it, potentially to the new launch's own `bun run` wrapper or any unrelated process. Now verify `ps -p <pid> -o args=` contains `server.ts` before SIGTERM (`execFileSync`, no shell).

Hardens #1349; partial mitigation for #1459 item 3.
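A minimal sketch of that guard, assuming Node-compatible `child_process` APIs. The helper names `looksLikeServer` and `killIfStalePoller` are hypothetical and the actual diff may be structured differently; the only parts taken from the description are the `ps -p <pid> -o args=` call, the `server.ts` substring check, and the use of `execFileSync` without a shell.

```typescript
import { execFileSync } from "node:child_process";

// Pure predicate, split out so the decision can be checked without spawning anything.
function looksLikeServer(psArgs: string): boolean {
  return psArgs.includes("server.ts");
}

// Hypothetical helper: returns true only if SIGTERM was actually sent.
function killIfStalePoller(pid: number): boolean {
  try {
    // execFileSync passes argv directly, so no shell parses the pid.
    const args = execFileSync("ps", ["-p", String(pid), "-o", "args="], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    });
    if (!looksLikeServer(args)) return false; // recycled PID: leave it alone
    process.kill(pid, "SIGTERM");
    return true;
  } catch {
    // ps missing (e.g. Windows), PID already gone, or no permission to signal:
    // skip the kill and let the caller overwrite the lockfile.
    return false;
  }
}
```

The substring match is deliberately loose; it only has to rule out the common case where the PID now belongs to something that is clearly not the poller.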
3. Orphan-watchdog ppid check false-fires on normal startup (1efdff0) — #1349 regression
The watchdog's `process.ppid !== bootPpid` check fires when the bun-run/shell wrapper exits or execs during normal startup and we get reparented to init; the plugin self-terminates ~5s after launch. Dropped the ppid check; stdin-close is the correct signal (kernel closes the MCP pipe on any CLI death regardless of intermediate wrappers), so ppid was both unnecessary and harmful.

Fixes #1467. This is the actual root cause of #1459 item 3 and likely #1425 (not PID-recycling as commit 2 theorized, though that guard remains a valid safety improvement).
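As a rough illustration of relying on stdin-close alone, a watchdog can be reduced to a pair of stream listeners. `onStdinClosed` is a hypothetical name and this is not the code from 1efdff0; it only shows the shape of the signal.

```typescript
import { Readable } from "node:stream";

// Hypothetical helper: run `shutdown` once when the parent's end of the pipe goes away.
// No ppid polling is involved, so reparenting to init during startup cannot trigger it.
function onStdinClosed(stdin: Readable, shutdown: () => void): void {
  let fired = false;
  const fire = (): void => {
    if (fired) return; // "end" and "close" can both arrive; act only once
    fired = true;
    shutdown();
  };
  stdin.on("end", fire);
  stdin.on("close", fire);
}
```

In a server this would be wired up as something like `onStdinClosed(process.stdin, () => process.exit(0))`. Note that `end` is only emitted while the stream is being read, which is assumed here to already be the case because the MCP transport consumes stdin.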
4. `bun install` stdout corrupts MCP JSON-RPC handshake (1efdff0)

`bun install --no-summary` in the start script writes to stdout, which is the MCP transport. The harness sees non-JSON during handshake → "Failed to connect". Redirect install output to stderr (`1>&2`). Verified `bun run --shell=bun` supports the redirect.

Fixes #1470; addresses #1425 on Windows.
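To see why a single stray line on stdout is fatal, consider a strict reader for newline-delimited JSON-RPC. `parseFrames` below is a hypothetical stand-in for whatever the harness does internally, and the stray text is a made-up placeholder rather than real `bun install` output.

```typescript
// Hypothetical strict reader: every non-empty line must be a complete JSON message.
function parseFrames(chunk: string): unknown[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line)); // throws SyntaxError on any non-JSON line
}

// A clean stream parses into one message per line.
const clean = '{"jsonrpc":"2.0","id":1,"result":{}}\n';

// Placeholder installer chatter ahead of the first message makes the same reader throw.
const polluted = "installing dependencies\n" + clean;
```

Calling `parseFrames(clean)` returns one parsed message, while `parseFrames(polluted)` throws before the valid message is ever reached. Sending installer output to stderr keeps stdout reserved for protocol traffic.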
Testing
- `bun build server.ts --target=bun` ✅ (all commits)
- `bun run --shell=bun` redirect smoke test: `echo OUT 1>&2` → stderr ✅
- `ps -p $$ -o args=` smoke-tested on Linux

Net diff: +57/−28. Plugin v0.0.7.