fix(telegram): stop EPIPE uncaughtException loop that pegs a CPU core after the owner exits #2081

Closed
flenard wants to merge 1 commit into anthropics:main from flenard:fix/telegram-epipe-uncaughtexception-loop

@flenard flenard commented May 29, 2026

Summary

Fixes the zombie bun server.ts process that pegs a CPU core at ~100% after a Claude Code session exits. This is the bug reported in #1049 and the root cause behind the "stuck event loop survives SIGTERM" symptom in #1794 (and #1500, #1713).

When the Claude Code process that owns the server's stdio exits, stderr becomes a broken pipe. The crash handlers log through that same broken stream:

process.on('uncaughtException', err => {
  process.stderr.write(`telegram channel: uncaught exception: ${err}\n`)
})

So the sequence is:

  1. Owner exits → stdio pipes break.
  2. Something writes to stderr (a handler, the poll loop, or shutdown()) → EPIPE.
  3. The EPIPE surfaces as another uncaughtException → the handler writes to stderr again → EPIPE again → re-enters → …

This is a tight synchronous loop, so the event loop never idles. Every safeguard in the file — the ppid/stdin orphan watchdog, the SIGTERM/SIGINT handlers, the setTimeout(process.exit, 2000) in shutdown() — depends on the event loop turning, so none of them ever run. The process is unkillable via SIGTERM, keeps holding the bot's getUpdates slot (→ 409 Conflict for new sessions), and pins a core until SIGKILL. On multi-session hosts the orphans accumulate across restarts and can saturate the machine.
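The cascade above can be modeled without a real pipe. The sketch below is illustrative only: brokenStderrWrite, countDispatches, and naiveHandler are hypothetical names, not code from server.ts, and the model assumes (as the sequence above describes) that an error thrown inside the handler is delivered straight back to the same handler. The dispatch cap exists only so the model terminates.

```typescript
// Model of the cascade: stderr is a pipe whose reader has gone away, so every
// write throws EPIPE. All names here are illustrative, not from server.ts.
type Handler = (err: unknown) => void

class EpipeError extends Error {
  code = 'EPIPE'
  constructor() {
    super('EPIPE: broken pipe, write')
  }
}

// Stand-in for process.stderr.write once the owner has exited.
function brokenStderrWrite(_msg: string): void {
  throw new EpipeError()
}

// Stand-in for the runtime's uncaughtException dispatch (an assumption of this
// model): an error thrown by the handler is fed back into the handler.
// The cap only bounds the model; the real loop has no such limit.
function countDispatches(handler: Handler, firstError: unknown, cap: number): number {
  let pending: unknown = firstError
  let dispatches = 0
  while (dispatches < cap) {
    dispatches++
    try {
      handler(pending)
      return dispatches // handler returned normally: the cascade stops
    } catch (next) {
      pending = next // handler threw: that becomes the next uncaught exception
    }
  }
  return dispatches
}

// The original handler shape: log the error through the same broken stream.
const naiveHandler: Handler = err => {
  brokenStderrWrite(`telegram channel: uncaught exception: ${err}\n`)
}

const dispatches = countDispatches(naiveHandler, new Error('poll failed'), 100_000)
console.log(dispatches) // prints 100000: the handler never returns normally
```

Because the handler never returns normally, the only thing that ends the model is the cap, which is the model's analogue of SIGKILL.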

Reproduced via strace

write(11, "telegram channel: uncaught exception: Error: EPIPE: broken pipe, write\n", 71) = -1 EPIPE (Broken pipe)
--- SIGPIPE ---
write(11, "telegram channel: uncaught exception: Error: EPIPE: broken pipe, write\n", 71) = -1 EPIPE (Broken pipe)
--- SIGPIPE ---
(repeats ~14,000×/sec)

The handler is literally logging its own EPIPE in a loop. Matches the capture in #1049.

Fix

Route every stderr write through a crash-safe helper that swallows the error and exits when the pipe is dead. A dead stderr means the owning process is gone, so there is nothing left to serve and nowhere to log. Also short-circuit the two global handlers (uncaughtException and unhandledRejection) on EPIPE so they exit instead of re-entering:

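function safeStderrWrite(msg: string): void {
  try {
    process.stderr.write(msg)
  } catch (e) {
    // A dead stderr means the owning process is gone: exit instead of rethrowing.
    const code = (e as { code?: string })?.code
    if (code === 'EPIPE' || code === 'EBADF' || code === 'ERR_STREAM_DESTROYED') process.exit(0)
    // Any other write error is swallowed so logging can never crash the server.
  }
}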

This removes the cause rather than papering over it: the event loop never jams, so the existing SIGTERM handler and orphan watchdog work as designed.
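The handler short-circuit is described above but not shown, so here is a minimal sketch of one way it could look. It is not the code from the patch: isDeadPipeError, makeCrashHandler, and the injected write and exit parameters are hypothetical, introduced so the behavior can be exercised without a real pipe or a real process exit.

```typescript
// Sketch only: these names are hypothetical and not taken from the patch.

// True for the error codes the helper above treats as "stderr is dead".
function isDeadPipeError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code
  return code === 'EPIPE' || code === 'EBADF' || code === 'ERR_STREAM_DESTROYED'
}

// Builds a handler for uncaughtException / unhandledRejection.
// write and exit are injected so the sketch runs without touching the process.
function makeCrashHandler(
  label: string,
  write: (msg: string) => void,
  exit: (code: number) => void,
): (err: unknown) => void {
  return err => {
    // The error being reported is itself a dead-pipe error: logging it would
    // hit the same pipe, so exit without writing.
    if (isDeadPipeError(err)) {
      exit(0)
      return
    }
    try {
      write(`telegram channel: ${label}: ${err}\n`)
    } catch (writeErr) {
      // The pipe died while logging an unrelated error: exit rather than let
      // the failure travel back into the handler. Other write errors are swallowed.
      if (isDeadPipeError(writeErr)) exit(0)
    }
  }
}

// Demo with recording stand-ins for stderr and process.exit.
const written: string[] = []
const exitCodes: number[] = []
const onUncaught = makeCrashHandler(
  'uncaught exception',
  msg => { written.push(msg) },
  code => { exitCodes.push(code) },
)

onUncaught({ code: 'EPIPE', message: 'EPIPE: broken pipe, write' }) // exits, logs nothing
onUncaught(new Error('network timeout'))                            // logs, keeps serving
```

In a real server the injected functions would presumably wrap process.stderr.write and process.exit. Injecting them keeps the exit decision testable and makes it explicit that only dead-pipe codes end the process.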

Why this is safe

  • It only acts on an actual broken-pipe error from a write — normal network rejections in the poll loop still log and keep serving.
  • stderr breaking is an unambiguous "owner is gone" signal. Unlike stdin EOF (which Claude Code closes momentarily during context compaction — the reason this file deliberately avoids exiting on stdin close), the stderr pipe only breaks when the parent process actually dies. So exiting on stderr EPIPE won't false-fire during compaction.
  • shutdown()'s log write now goes through the helper too, so a broken pipe there can no longer abort shutdown before the exit path.

Testing

  • bun build server.ts succeeds (243 modules bundled, no type errors).
  • Modeled the cascade deterministically: the original handler re-enters indefinitely (100k+ dispatches); the patched handler exits after one.
  • Deployed to a production 2-core host that had been rebooting from this exact issue (two orphans → 92% user CPU sustained overnight); after the fix, new servers reap cleanly on session end and no orphans accumulate.

Notes

… after the owner exits

When the Claude Code process that owns the server's stdio exits, stderr
becomes a broken pipe. The uncaughtException / unhandledRejection handlers
wrote to stderr, so the write threw EPIPE, which surfaced as another
uncaughtException, whose handler wrote to stderr again — a tight synchronous
loop that pinned a CPU core and was immune to SIGTERM and the orphan watchdog
(the event loop never idled to run them). Orphans accumulated across session
restarts and saturated the host (forcing reboots), and held the bot's
getUpdates slot, causing 409 Conflict for new sessions.

Route every stderr write through a crash-safe helper that swallows the error
and exits when the pipe is dead, and short-circuit the two global handlers on
EPIPE so a dead pipe ends the process instead of re-entering. Confirmed via
strace that the loop emitted ~14k failed writes/sec (matches the capture in
anthropics#1049).

Fixes anthropics#1049. Addresses the root cause behind anthropics#1794, anthropics#1500, and anthropics#1713.
@github-actions
Contributor

Thanks for your interest! This repo only accepts contributions from Anthropic team members. If you'd like to submit a plugin to the marketplace, please submit your plugin here.

@github-actions github-actions Bot closed this May 29, 2026
@flenard flenard deleted the fix/telegram-epipe-uncaughtexception-loop branch May 29, 2026 10:07