fix(telegram): v0.0.7 reliability rollup — state-dir, PID guard, ppid watchdog, install stdout#1424

Draft
noahzweben wants to merge 3 commits into main from claude/dreamy-einstein-1WBn5

@noahzweben
Collaborator

@noahzweben noahzweben commented Apr 15, 2026

Four independent fixes for v0.0.7. Commits 3 and 4 fix regressions introduced by #1349.

1. Skills ignore TELEGRAM_STATE_DIR / CLAUDE_CONFIG_DIR (223c9b2)

server.ts:26 already honors TELEGRAM_STATE_DIR, but the /telegram:access and /telegram:configure skills hardcode ~/.claude/channels/telegram/ in 11 places — skill writes and server reads diverge, pairing/allowlist edits silently no-op. Skills now resolve the dir via shell expansion first; server gets CLAUDE_CONFIG_DIR fallback. Adds Bash(echo *) / Bash(chmod *) to allowed-tools.

Fixes #931, fixes #914, fixes #933, fixes #851; addresses anthropics/claude-code#37173.
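
A minimal sketch of the resolution order described above (TELEGRAM_STATE_DIR, then CLAUDE_CONFIG_DIR/channels/telegram, then ~/.claude/channels/telegram). The helper name `resolveStateDir` and its parameters are assumptions for illustration and are not taken from server.ts or the skills, which resolve the directory via shell expansion.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical helper; name and signature are illustrative only.
function resolveStateDir(
  env: Record<string, string | undefined> = process.env,
  home: string = homedir(),
): string {
  // 1. An explicit per-bot override wins.
  if (env.TELEGRAM_STATE_DIR) return env.TELEGRAM_STATE_DIR;
  // 2. Otherwise nest under a relocated Claude config dir, if one is set.
  if (env.CLAUDE_CONFIG_DIR) {
    return join(env.CLAUDE_CONFIG_DIR, "channels", "telegram");
  }
  // 3. Fall back to the default location.
  return join(home, ".claude", "channels", "telegram");
}
```

The point of the fix is that skill and server both walk the same chain, so a write and the matching read always land in the same directory.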

2. PID-lockfile SIGTERM can hit a recycled PID (6ceddea)

#1349's lockfile stores only a PID. After enough churn the OS recycles it — potentially to the new launch's own bun run wrapper or any unrelated process. Now verify ps -p <pid> -o args= contains server.ts before SIGTERM (execFileSync, no shell).

Hardens #1349; partial mitigation for #1459 item 3.
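
A rough sketch of the guard described above, assuming a POSIX `ps` on PATH. The function names `looksLikeServer` and `killStalePoller` are hypothetical and not the names used in the commit.

```typescript
import { execFileSync } from "node:child_process";

// Pure check, split out so it can be exercised without spawning anything.
function looksLikeServer(args: string): boolean {
  return args.includes("server.ts");
}

// Hypothetical helper: returns true only if a SIGTERM was actually sent.
function killStalePoller(pid: number): boolean {
  try {
    // execFileSync passes argv directly, so no shell is involved.
    const args = execFileSync("ps", ["-p", String(pid), "-o", "args="], {
      encoding: "utf8",
    });
    // A recycled PID now belongs to something else: leave it alone.
    if (!looksLikeServer(args)) return false;
    process.kill(pid, "SIGTERM");
    return true;
  } catch {
    // ps missing (e.g. Windows) or the PID no longer exists: fall through.
    return false;
  }
}
```

A substring match on `server.ts` is deliberately loose; it narrows the window for hitting an unrelated process but does not prove the process is this plugin's server.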

3. Orphan-watchdog ppid check false-fires on normal startup (1efdff0) — #1349 regression

The watchdog's process.ppid !== bootPpid check fires when the bun-run/shell wrapper exits or execs during normal startup and we get reparented to init — plugin self-terminates ~5s after launch. Dropped the ppid check; stdin-close is the correct signal (kernel closes the MCP pipe on any CLI death regardless of intermediate wrappers), so ppid was both unnecessary and harmful.

Fixes #1467. This is the actual root cause of #1459 item 3 and likely #1425 (not PID-recycling as commit 2 theorized — though that guard remains a valid safety improvement).
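
For illustration, a stdin-only orphan check could look roughly like the sketch below. `watchForHostExit` is a hypothetical name, and the sketch assumes something else (such as the MCP stdio transport) is already reading the stream so that `end` can fire.

```typescript
import { Readable } from "node:stream";

// Hypothetical helper: invoke onOrphaned once when the input stream
// ends or closes, whichever is observed first.
function watchForHostExit(input: Readable, onOrphaned: () => void): void {
  let fired = false;
  const fire = (): void => {
    if (fired) return; // both events may arrive; act only once
    fired = true;
    onOrphaned();
  };
  input.once("end", fire);
  input.once("close", fire);
}

// Possible wiring in a server (shutdown is assumed to exist elsewhere):
// watchForHostExit(process.stdin, () => shutdown());
```

Unlike a ppid comparison, this does not depend on how many wrapper processes sit between the CLI and the server.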

4. bun install stdout corrupts MCP JSON-RPC handshake (1efdff0)

bun install --no-summary in the start script writes to stdout, which is the MCP transport. The harness sees non-JSON during handshake → "Failed to connect". Redirect install output to stderr (1>&2). Verified bun run --shell=bun supports the redirect.

Fixes #1470; addresses #1425 on Windows.
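
To see why stray stdout bytes are fatal, here is a toy reader for newline-delimited JSON-RPC. `parseFrames` is a hypothetical stand-in for the harness, not its actual code, and the `chatter` string is a placeholder rather than real bun output.

```typescript
// Hypothetical stand-in for a host reading newline-delimited JSON-RPC
// from the server's stdout.
function parseFrames(stdout: string): unknown[] {
  return stdout
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line)); // throws SyntaxError on non-JSON lines
}

// A well-formed JSON-RPC response frame.
const handshake = '{"jsonrpc":"2.0","id":1,"result":{}}\n';
// Placeholder text standing in for whatever an installer prints.
const chatter = "installing dependencies...\n";

// parseFrames(handshake) succeeds; parseFrames(chatter + handshake) throws,
// because the first line on the transport is not JSON.
```

Redirecting the install step to stderr keeps stdout reserved for frames like `handshake`.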

Testing

  • bun build server.ts --target=bun ✅ (all commits)
  • bun run --shell=bun redirect smoke test: echo OUT 1>&2 → stderr ✅
  • ps -p $$ -o args= smoke-tested on Linux

Net diff: +57/−28. Plugin v0.0.7.

claude added 2 commits April 15, 2026 18:40
…nd server

The server already reads TELEGRAM_STATE_DIR for multi-bot setups, but the
/telegram:access and /telegram:configure skills hardcoded
~/.claude/channels/telegram/ in 11 places. So with a custom state dir the
skill writes access.json to the default location while the server reads
from the override — pairing and allowlist edits silently don't take effect.

Skills now resolve the state dir via shell expansion (TELEGRAM_STATE_DIR →
CLAUDE_CONFIG_DIR/channels/telegram → ~/.claude/channels/telegram) before
any read/write. Server gets the same CLAUDE_CONFIG_DIR fallback. Also adds
Bash(echo)/Bash(chmod) to configure skill's allowed-tools (chmod was already
documented but not allowlisted).

PID files race with OS PID recycling. The lockfile from #1349 stored only a
PID; after enough churn that PID can be reassigned to anything — including
the new launch's own bun-run wrapper. SIGTERMing the wrapper closes our
stdin and triggers immediate self-shutdown ('replacing stale poller' then
'shutting down' within seconds — matches #1459 item 3).

Now check 'ps -p <pid> -o args=' contains 'server.ts' before killing.
execFileSync (no shell); whole block already try/catch so Windows/ps-missing
falls through to just overwriting the lockfile.
@noahzweben noahzweben changed the title fix(telegram): honor TELEGRAM_STATE_DIR/CLAUDE_CONFIG_DIR in skills and server fix(telegram): honor STATE_DIR env in skills; guard PID-lockfile SIGTERM against recycling Apr 17, 2026
…to stderr

Two v0.0.5/0.0.6 regressions causing the plugin to fail at startup:

1. The orphan watchdog's process.ppid !== bootPpid check false-fires when the
   bun-run/shell wrapper exits or execs during normal startup and we get
   reparented to init — plugin self-terminates ~5s after launch. Stdin-close
   alone is the correct signal: the kernel closes the MCP pipe on any CLI
   death regardless of intermediate wrappers, so the ppid check was both
   unnecessary and harmful. (#1467; also the actual cause of #1459 item 3
   and likely #1425.)

2. 'bun install --no-summary' in the start script writes to stdout, which is
   the MCP JSON-RPC transport. The harness sees non-JSON bytes during the
   handshake and drops the connection ('Failed to connect'). Redirect install
   output to stderr. (#1470; also explains #1425 on Windows.)
@noahzweben noahzweben changed the title fix(telegram): honor STATE_DIR env in skills; guard PID-lockfile SIGTERM against recycling fix(telegram): v0.0.7 reliability rollup — state-dir, PID guard, ppid watchdog, install stdout Apr 22, 2026
@bradyrobbins

+1 — this rollup matches symptoms we've been seeing in a 24/7 production deployment (launchd+tmux wrapper on a Mac Mini, one user, allowlist policy).

Evidence for commit 3 (ppid watchdog false-fire)

On v0.0.6 the plugin would self-terminate ~5s after startup on a non-trivial fraction of restarts. ps during the window showed bun server.ts alive then vanished with no stderr other than the generic "shutting down" line. Applying just commit 3 locally (1efdff0 — dropping the ppid check) eliminated the 5s-death class entirely. The stdin-close signal is sufficient in practice.

Evidence for commit 4 (bun install stdout → stderr)

Separately observed MCP · ✗ failed on fresh starts with no other error surfaced. The plugin process was running and polling successfully (visible in ps, tmux pane, and bot.getMe()), so commit 4's diagnosis (install chatter corrupting the JSON-RPC handshake) fits.

One gap the rollup doesn't cover

Inbound-message delivery is still fire-and-forget at handleInbound()'s mcp.notification().catch(...). In the same deployment we observed 12 attachments accumulating in inbox/ between 2026-04-07 and 2026-04-19 with no Claude-side activity, despite the bot process staying alive the whole time. Filed #1560 with the await + on-failure retry queue fix; happy to fold it into this PR if that's preferred.

The four fixes in this PR have been running locally without regression since applying them alongside #1560's changes.
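
For readers unfamiliar with the pattern, "await + on-failure retry queue" could be sketched as below. This is not the code from #1560; `RetryQueue`, `Notify`, `deliver`, and `flush` are hypothetical names chosen for illustration.

```typescript
// Hypothetical delivery function; in the plugin this role would be played
// by the MCP notification call.
type Notify = (payload: string) => Promise<void>;

class RetryQueue {
  private pending: string[] = [];
  private notify: Notify;

  constructor(notify: Notify) {
    this.notify = notify;
  }

  // Await delivery; on failure keep the payload for a later flush.
  async deliver(payload: string): Promise<boolean> {
    try {
      await this.notify(payload);
      return true;
    } catch {
      this.pending.push(payload);
      return false;
    }
  }

  // Retry in order; stop at the first failure so ordering is preserved.
  async flush(): Promise<number> {
    let sent = 0;
    while (this.pending.length > 0) {
      try {
        await this.notify(this.pending[0]);
      } catch {
        break;
      }
      this.pending.shift();
      sent++;
    }
    return sent;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

The difference from fire-and-forget is that a failed send leaves a record behind instead of disappearing into a `.catch(...)`.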

@alexgodlewski

This needs to be merged quickly...

@onlylemi

+1 to shipping this as v0.0.7 soon, @noahzweben. Just spent an evening tracking down 7 orphaned bun server.ts workers on macOS pinning ~685% CPU combined for up to 8+ days each — full repro data in #1513 (comment).

Commit 3 (drop ppid check, rely on stdin-close) directly addresses my case. Given the rollup is:

  • Approved by @k6l3 since 2026-04-22
  • mergeable: clean, CI green
  • Last code change 6 days ago

...would it be acceptable to mark this ready and ship as v0.0.7 standalone, and let #1560's inbound-delivery fix land in v0.0.8? The orphan-process bugs are actively burning CPU on user machines today.

Happy to test a prerelease build on macOS Apple Silicon if useful.

@cversek

cversek commented Apr 29, 2026

For broader context on what users are seeing with v0.0.6 disconnect symptoms: I filed anthropics/claude-code#54544 documenting an apparent CC-side cause in the MCP host's ping/keepalive cycle (5-min interval JSON-RPC ping with a timeout that, when missed, SIGTERMs the stdio MCP server).

This PR's fixes are real and worth shipping — they address bot-side reliability issues. But even with this PR fully merged, the ping-timeout kill cycle described in #54544 would persist whenever the bot is busy enough to miss a ping response. The two are related-but-different layers; flagging here in case the reviewer team wants to coordinate.

@Jay-uk

Jay-uk commented Apr 29, 2026

Hitting this on v0.0.6, macOS (Darwin 25.4.0), Claude Code 2.1.119. Adding observations in case useful for prioritization:

Pattern: Mid-session Telegram MCP disconnects roughly 1 to 3 times per day. Bot still responsive (Telegram API works fine) right up until the disconnect, then /mcp shows it dropped. Manual /mcp Reconnect brings it back instantly.

Evidence it's a process exit, not API connectivity:

  • ~/.claude/channels/telegram/bot.pid mtime always matches the moment I run /mcp Reconnect (the new server writes a fresh PID on launch).
  • bun server.ts process uptime, measured anytime after a reconnect, equals "time since last /mcp Reconnect". For example, when I checked tonight ~58 minutes post-reconnect, the process had been alive exactly that long.
  • claude --channels plugin:telegram@claude-plugins-official host process stays alive across these events (current uptime measured in days), so the channels host is fine; only the bun MCP child dies.

No specific repro on demand, but it recurs reliably enough that I notice within an hour or two each time it drops. Happy to capture stderr / dtrace next time if useful.

+1 on getting v0.0.7 out. Both commit 2 (PID-recycle SIGTERM hitting its own wrapper) and commit 3 (ppid watchdog false-fire on reparenting) feel plausibly involved here; the evidence above can't distinguish between them since both produce the same surface symptom: bun server.ts exits, its stdin pipe to Claude Code closes, MCP child marked disconnected.

Thanks for the rigorous fix.

@AdelElo13

+1, independent reproduction this morning on a single-user macOS terminal session — no launchd, no tmux, no fleet. Adds to the existing repros from different deployment shapes.

Setup: macOS 25.5.0, Claude Code 2.1.138, plugin v0.0.6, bun 1.x, DM pairing policy, single approved chat. Bot launched via the standard claude --channels plugin:telegram@claude-plugins-official flow.

End-user symptom worth naming: "first Telegram reply works, every subsequent message arrives but gets no response." This is the v0.0.6 ppid-watchdog kill, but it doesn't always look like a generic MCP disconnect — the channel-listener side keeps queueing inbound messages in inbox/, so the user sees their messages "land" on the server while Claude stays silent. From the sender's perspective it looks like Claude went unresponsive, not like a plugin died.

Forensics from one hit:

  • ~/.claude/channels/telegram/bot.pid = 51231; ps -p 51231 returns empty
  • No crash dump in ~/Library/Logs/DiagnosticReports → clean exit, not a crash
  • Parent channel-listener (Claude Code session process) still alive with 9 other healthy children, but no bun server.ts among them
  • In-session: an MCP-disconnect notice for plugin:telegram:telegram arrived ~10s after a successful reply tool call returned sent (id: 1242) — the bot serviced exactly one tool call before the watchdog tick caught it
  • Terminal still attached, parent CLI pid unchanged → no external signal source

Matches the commit-3 hypothesis perfectly: bun-run intermediate exits during normal startup, ppid changes, next 5s interval tick fires process.ppid !== bootPpid, plugin shuts itself down via its own clean-exit path.

Ship it. A month in draft is doing more harm than the regressions in commits 1-2 would prevent.

@william-drakemond

+1, independent reproduction on macOS multi-bot setup — two Telegram bots in separate launcher dirs, each export TELEGRAM_STATE_DIR=$HOME/.claude/channels/telegram-<botN> before exec claude --channels plugin:telegram@claude-plugins-official ....

Hit today: /telegram:access pair <code> in session2 read bot1's access.json and reported bot2's pending entry as missing. Server-side state was correct (bot2's access.json did contain the pending entry — server.ts had written it there), but the skill was looking at the wrong file. Allowlist edits and any subsequent /telegram:configure <token> would have clobbered bot1 too — that's the latent corruption risk for anyone following the multi-bot pattern documented in the README.

Locally patched both SKILL.md files along the same lines as 223c9b2; symptom gone, single-bot sessions unchanged (env var unset → default path). +1 to shipping as v0.0.7.