fix(telegram): v0.0.7 reliability rollup — state-dir, PID guard, ppid watchdog, install stdout by noahzweben · Pull Request #1424 · anthropics/claude-plugins-official

noahzweben · 2026-04-15T18:41:17Z

Four independent fixes for v0.0.7. Commits 3 and 4 fix regressions introduced by #1349.

1. Skills ignore TELEGRAM_STATE_DIR / CLAUDE_CONFIG_DIR (`223c9b2`)

server.ts:26 already honors TELEGRAM_STATE_DIR, but the /telegram:access and /telegram:configure skills hardcode ~/.claude/channels/telegram/ in 11 places — skill writes and server reads diverge, pairing/allowlist edits silently no-op. Skills now resolve the dir via shell expansion first; server gets CLAUDE_CONFIG_DIR fallback. Adds Bash(echo *) / Bash(chmod *) to allowed-tools.

Fixes #931, fixes #914, fixes #933, fixes #851; addresses anthropics/claude-code#37173.

2. PID-lockfile SIGTERM can hit a recycled PID (`6ceddea`)

#1349's lockfile stores only a PID. After enough churn the OS recycles it — potentially to the new launch's own bun run wrapper or any unrelated process. Now verify ps -p <pid> -o args= contains server.ts before SIGTERM (execFileSync, no shell).

Hardens #1349; partial mitigation for #1459 item 3.
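The guard described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from `6ceddea`: the helper names `psArgs` and `isOurServer`, and the injectable `lookup` parameter (added so the check can be exercised without a real process), are hypothetical.

```typescript
import { execFileSync } from "node:child_process";

// Returns the command line of `pid`, or null when ps fails
// (no such process, or ps is unavailable, e.g. on Windows).
function psArgs(pid: number): string | null {
  try {
    // execFileSync passes argv directly, so no shell is involved.
    const out = execFileSync("ps", ["-p", String(pid), "-o", "args="], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    });
    return out.trim();
  } catch {
    return null;
  }
}

// Hypothetical helper: treat the lockfile PID as a stale poller only
// when its command line still mentions server.ts.
function isOurServer(
  pid: number,
  lookup: (pid: number) => string | null = psArgs,
): boolean {
  const args = lookup(pid);
  return args !== null && args.includes("server.ts");
}
```

A caller would send SIGTERM only when `isOurServer(pid)` returns true, and otherwise just overwrite the lockfile.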

3. Orphan-watchdog ppid check false-fires on normal startup (`1efdff0`) — #1349 regression

The watchdog's process.ppid !== bootPpid check fires when the bun-run/shell wrapper exits or execs during normal startup and we get reparented to init — plugin self-terminates ~5s after launch. Dropped the ppid check; stdin-close is the correct signal (kernel closes the MCP pipe on any CLI death regardless of intermediate wrappers), so ppid was both unnecessary and harmful.

Fixes #1467. This is the actual root cause of #1459 item 3 and likely #1425 (not PID-recycling as commit 2 theorized — though that guard remains a valid safety improvement).
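For illustration, relying on stdin-close alone might be sketched as follows. The helper name `shutdownOnStdinClose` is hypothetical, and the sketch only attaches listeners; it assumes the MCP transport is already reading the stream, which is what makes the `end` event fire in practice.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper: run `shutdown` exactly once when the input stream
// ends or closes. In a real server `input` would be process.stdin.
function shutdownOnStdinClose(input: EventEmitter, shutdown: () => void): void {
  let fired = false;
  const once = () => {
    if (fired) return;
    fired = true;
    shutdown();
  };
  input.on("end", once); // the writing side of the pipe reached EOF
  input.on("close", once); // the stream itself was destroyed
}

// Usage sketch (left commented out so the example has no side effects):
// shutdownOnStdinClose(process.stdin, () => process.exit(0));
```

Unlike a ppid comparison, this signal does not depend on which wrapper process happens to be the direct parent.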

4. `bun install` stdout corrupts MCP JSON-RPC handshake (`1efdff0`)

bun install --no-summary in the start script writes to stdout, which is the MCP transport. The harness sees non-JSON during handshake → "Failed to connect". Redirect install output to stderr (1>&2). Verified bun run --shell=bun supports the redirect.

Fixes #1470; addresses #1425 on Windows.
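To see why stray stdout matters, recall that stdio MCP messages are newline-delimited JSON. The sketch below is not how the harness parses its input; `firstNonJsonLine` is a hypothetical helper, and the installer line in the usage note is made up, since the PR does not quote what `bun install` actually prints.

```typescript
// Returns the first non-empty line that is not valid JSON,
// or null if every non-empty line parses.
function firstNonJsonLine(stdout: string): string | null {
  for (const line of stdout.split("\n")) {
    if (line.trim() === "") continue;
    try {
      JSON.parse(line);
    } catch {
      return line;
    }
  }
  return null;
}
```

With a clean stream such as `'{"jsonrpc":"2.0","id":1,"result":{}}\n'` the function returns `null`; if a hypothetical line like `installing dependencies...` precedes the handshake, that line is returned, which is the kind of input a strict reader would reject.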

Testing

- bun build server.ts --target=bun ✅ (all commits)
- bun run --shell=bun redirect smoke test: echo OUT 1>&2 → stderr ✅
- ps -p $$ -o args= smoke-tested on Linux

Net diff: +57/−28. Plugin v0.0.7.

…nd server The server already reads TELEGRAM_STATE_DIR for multi-bot setups, but the /telegram:access and /telegram:configure skills hardcoded ~/.claude/channels/telegram/ in 11 places. So with a custom state dir the skill writes access.json to the default location while the server reads from the override — pairing and allowlist edits silently don't take effect. Skills now resolve the state dir via shell expansion (TELEGRAM_STATE_DIR → CLAUDE_CONFIG_DIR/channels/telegram → ~/.claude/channels/telegram) before any read/write. Server gets the same CLAUDE_CONFIG_DIR fallback. Also adds Bash(echo)/Bash(chmod) to configure skill's allowed-tools (chmod was already documented but not allowlisted).
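The precedence described above can be sketched in TypeScript as follows. This is an illustration of the documented order, not the server's actual code: the function name `resolveStateDir` and its parameters are hypothetical, and treating an empty variable as unset is an assumption.

```typescript
import { join } from "node:path";

// Precedence: TELEGRAM_STATE_DIR, then CLAUDE_CONFIG_DIR/channels/telegram,
// then ~/.claude/channels/telegram.
function resolveStateDir(
  env: Record<string, string | undefined>,
  home: string,
): string {
  // Assumption: an empty string counts as unset.
  if (env.TELEGRAM_STATE_DIR) return env.TELEGRAM_STATE_DIR;
  if (env.CLAUDE_CONFIG_DIR) {
    return join(env.CLAUDE_CONFIG_DIR, "channels", "telegram");
  }
  return join(home, ".claude", "channels", "telegram");
}
```

The point of the fix is that skills and server apply the same chain, so both sides end up reading and writing the same `access.json`.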

PID files race with OS PID recycling. The lockfile from #1349 stored only a PID; after enough churn that PID can be reassigned to anything — including the new launch's own bun-run wrapper. SIGTERMing the wrapper closes our stdin and triggers immediate self-shutdown ('replacing stale poller' then 'shutting down' within seconds — matches #1459 item 3). Now check 'ps -p <pid> -o args=' contains 'server.ts' before killing. execFileSync (no shell); whole block already try/catch so Windows/ps-missing falls through to just overwriting the lockfile.

…to stderr

Two v0.0.5/0.0.6 regressions causing the plugin to fail at startup:

1. The orphan watchdog's process.ppid !== bootPpid check false-fires when the bun-run/shell wrapper exits or execs during normal startup and we get reparented to init, so the plugin self-terminates ~5s after launch. Stdin-close alone is the correct signal: the kernel closes the MCP pipe on any CLI death regardless of intermediate wrappers, so the ppid check was both unnecessary and harmful. (#1467; also the actual cause of #1459 item 3 and likely #1425.)
2. 'bun install --no-summary' in the start script writes to stdout, which is the MCP JSON-RPC transport. The harness sees non-JSON bytes during the handshake and drops the connection ('Failed to connect'). Redirect install output to stderr. (#1470; also explains #1425 on Windows.)

bradyrobbins · 2026-04-23T19:28:20Z

+1 — this rollup matches symptoms we've been seeing in a 24/7 production deployment (launchd+tmux wrapper on a Mac Mini, one user, allowlist policy).

Evidence for commit 3 (ppid watchdog false-fire)

On v0.0.6 the plugin would self-terminate ~5s after startup on a non-trivial fraction of restarts. ps during the window showed bun server.ts alive then vanished with no stderr other than the generic "shutting down" line. Applying just commit 3 locally (1efdff0 — dropping the ppid check) eliminated the 5s-death class entirely. The stdin-close signal is sufficient in practice.

Evidence for commit 4 (`bun install` stdout → stderr)

Separately observed MCP · ✗ failed on fresh starts with no other error surfaced. The plugin process was running and polling successfully (visible in ps, tmux pane, and bot.getMe()), so commit 4's diagnosis (install chatter corrupting the JSON-RPC handshake) fits.

One gap the rollup doesn't cover

Inbound-message delivery is still fire-and-forget at handleInbound()'s mcp.notification().catch(...). In the same deployment we observed 12 attachments accumulating in inbox/ between 2026-04-07 and 2026-04-19 with no Claude-side activity, despite the bot process staying alive the whole time. Filed #1560 with the await + on-failure retry queue fix; happy to fold it into this PR if that's preferred.
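The difference between fire-and-forget and "await plus retry queue" can be illustrated with a small sketch. This is not the implementation from #1560; `RetryQueue`, `deliver`, `flush`, and the `Send` type are hypothetical names chosen for the example.

```typescript
type Send<T> = (item: T) => Promise<void>;

// Hypothetical retry queue: a failed delivery is kept for a later retry
// instead of being swallowed in a .catch().
class RetryQueue<T> {
  private pending: T[] = [];

  get size(): number {
    return this.pending.length;
  }

  // Await the send; on failure, keep the item for a later flush.
  async deliver(item: T, send: Send<T>): Promise<boolean> {
    try {
      await send(item);
      return true;
    } catch {
      this.pending.push(item);
      return false;
    }
  }

  // Retry queued items in order; stop at the first failure so order is kept.
  async flush(send: Send<T>): Promise<number> {
    let delivered = 0;
    while (this.pending.length > 0) {
      const next = this.pending[0] as T;
      try {
        await send(next);
      } catch {
        break;
      }
      this.pending.shift();
      delivered++;
    }
    return delivered;
  }
}
```

A caller would invoke `flush` once the transport is known to be healthy again, for example after a reconnect.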

The four fixes in this PR have been running locally without regression since applying them alongside #1560's changes.

alexgodlewski · 2026-04-28T14:14:15Z

This needs to be merged quickly...

onlylemi · 2026-04-28T15:50:34Z

+1 to shipping this as v0.0.7 soon, @noahzweben. Just spent an evening tracking down 7 orphaned bun server.ts workers on macOS pinning ~685% CPU combined for up to 8+ days each — full repro data in #1513 (comment).

Commit 3 (drop ppid check, rely on stdin-close) directly addresses my case. Given the rollup is:

- Approved by @k6l3 since 2026-04-22
- mergeable: clean, CI green
- Last code change 6 days ago

...would it be acceptable to mark this ready and ship as v0.0.7 standalone, and let #1560's inbound-delivery fix land in v0.0.8? The orphan-process bugs are actively burning CPU on user machines today.

Happy to test a prerelease build on macOS Apple Silicon if useful.

cversek · 2026-04-29T05:11:44Z

For broader context on what users are seeing with v0.0.6 disconnect symptoms: I filed anthropics/claude-code#54544 documenting an apparent CC-side cause in the MCP host's ping/keepalive cycle (5-min interval JSON-RPC ping with a timeout that, when missed, SIGTERMs the stdio MCP server).

This PR's fixes are real and worth shipping — they address bot-side reliability issues. But even with this PR fully merged, the ping-timeout kill cycle described in #54544 would persist whenever the bot is busy enough to miss a ping response. The two are related-but-different layers; flagging here in case the reviewer team wants to coordinate.

Jay-uk · 2026-04-29T20:56:16Z

Hitting this on v0.0.6, macOS (Darwin 25.4.0), Claude Code 2.1.119. Adding observations in case useful for prioritization:

Pattern: Mid-session Telegram MCP disconnects roughly 1 to 3 times per day. Bot still responsive (Telegram API works fine) right up until the disconnect, then /mcp shows it dropped. Manual /mcp Reconnect brings it back instantly.

Evidence it's a process exit, not API connectivity:

- ~/.claude/channels/telegram/bot.pid mtime always matches the moment I run /mcp Reconnect (the new server writes a fresh PID on launch).
- bun server.ts process uptime, measured anytime after a reconnect, equals "time since last /mcp Reconnect". For example, when I checked tonight ~58 minutes post-reconnect, the process had been alive exactly that long.
- claude --channels plugin:telegram@claude-plugins-official host process stays alive across these events (current uptime measured in days), so the channels host is fine; only the bun MCP child dies.

No specific repro on demand, but it recurs reliably enough that I notice within an hour or two each time it drops. Happy to capture stderr / dtrace next time if useful.

+1 on getting v0.0.7 out. Both commit 2 (PID-recycle SIGTERM hitting its own wrapper) and commit 3 (ppid watchdog false-fire on reparenting) feel plausibly involved here; the evidence above can't distinguish between them since both produce the same surface symptom: bun server.ts exits, its stdin pipe to Claude Code closes, MCP child marked disconnected.

Thanks for the rigorous fix.

AdelElo13 · 2026-05-17T13:17:48Z

+1, independent reproduction this morning on a single-user macOS terminal session — no launchd, no tmux, no fleet. Adds to the existing repros from different deployment shapes.

Setup: macOS 25.5.0, Claude Code 2.1.138, plugin v0.0.6, bun 1.x, DM pairing policy, single approved chat. Bot launched via the standard claude --channels plugin:telegram@claude-plugins-official flow.

End-user symptom worth naming: "first Telegram reply works, every subsequent message arrives but gets no response." This is the v0.0.6 ppid-watchdog kill, but it doesn't always look like a generic MCP disconnect — the channel-listener side keeps queueing inbound messages in inbox/, so the user sees their messages "land" on the server while Claude stays silent. From the sender's perspective it looks like Claude went unresponsive, not like a plugin died.

Forensics from one hit:

- ~/.claude/channels/telegram/bot.pid = 51231; ps -p 51231 returns empty
- No crash dump in ~/Library/Logs/DiagnosticReports → clean exit, not a crash
- Parent channel-listener (Claude Code session process) still alive with 9 other healthy children, but no bun server.ts among them
- In-session: an MCP-disconnect notice for plugin:telegram:telegram arrived ~10s after a successful reply tool call returned sent (id: 1242); the bot serviced exactly one tool call before the watchdog tick caught it
- Terminal still attached, parent CLI pid unchanged → no external signal source

Matches the commit-3 hypothesis perfectly: bun-run intermediate exits during normal startup, ppid changes, next 5s interval tick fires process.ppid !== bootPpid, plugin shuts itself down via its own clean-exit path.

Ship it — a month of draft is producing more harm than the regressions in commits 1-2 would prevent.

william-drakemond · 2026-05-27T01:23:09Z

+1, independent reproduction on macOS multi-bot setup — two Telegram bots in separate launcher dirs, each export TELEGRAM_STATE_DIR=$HOME/.claude/channels/telegram-<botN> before exec claude --channels plugin:telegram@claude-plugins-official ....

Hit today: /telegram:access pair <code> in session2 read bot1's access.json and reported bot2's pending entry as missing. Server-side state was correct (bot2's access.json did contain the pending entry — server.ts had written it there), but the skill was looking at the wrong file. Allowlist edits and any subsequent /telegram:configure <token> would have clobbered bot1 too — that's the latent corruption risk for anyone following the multi-bot pattern documented in the README.

Locally patched both SKILL.md files along the same lines as 223c9b2; symptom gone, single-bot sessions unchanged (env var unset → default path). +1 to shipping as v0.0.7.

claude added 2 commits April 15, 2026 18:40

noahzweben changed the title ~~fix(telegram): honor TELEGRAM_STATE_DIR/CLAUDE_CONFIG_DIR in skills and server~~ fix(telegram): honor STATE_DIR env in skills; guard PID-lockfile SIGTERM against recycling Apr 17, 2026

noahzweben changed the title ~~fix(telegram): honor STATE_DIR env in skills; guard PID-lockfile SIGTERM against recycling~~ fix(telegram): v0.0.7 reliability rollup — state-dir, PID guard, ppid watchdog, install stdout Apr 22, 2026

k6l3 approved these changes Apr 22, 2026

bradyrobbins mentioned this pull request Apr 23, 2026

fix(telegram): await inbound MCP notification with on-failure retry queue #1560

Closed

5 tasks

onlylemi mentioned this pull request Apr 28, 2026

telegram plugin: orphaned bun server.ts procs survive SIGTERM and orphan-watchdog on macOS #1513

Open

thomasetienne mentioned this pull request May 2, 2026

fix(telegram): don't kill live siblings during startup #1690

Closed

flenard mentioned this pull request May 29, 2026

fix(telegram): stop EPIPE uncaughtException loop that pegs a CPU core after the owner exits #2081

Closed

noahzweben wants to merge 3 commits into main from claude/dreamy-einstein-1WBn5

10 participants
