Chat & Shell crash with SIGKILL on first user message — "Starting async generator loop" then process dies (Linux, all versions 1.28.0 → 1.30.0) #707

@fg59-flo


Summary

On a clean Linux deployment (Ubuntu 24.04 LTS, Node 22.22.2), the CloudCLI Node process is killed with SIGKILL ~1 second after the user sends any Chat message or starts a Shell PTY session. The UI hangs (no response visible) and systemd respawns the service in a loop.

This appears to be the same root issue as #496 ("Claude Code process exited with code 1, but cli ok") and #486 (closed). Reproduced on every released version from 1.28.0 (oldest available on npm) through 1.30.0 (latest).

Environment

| Item | Value |
| --- | --- |
| OS | Ubuntu 24.04.3 LTS |
| Node | v22.22.2 (via nvm) |
| @cloudcli-ai/cloudcli versions tested | 1.28.0, 1.29.5, 1.30.0 (all failing) |
| @anthropic-ai/claude-agent-sdk (embedded) | 0.2.119 (same in all 3 versions) |
| claude binary | 2.1.119 (Claude Code, in $PATH) |
| Auth | OAuth Claude Max ×20 plan via ~/.claude/.credentials.json (not API key) |
| systemd unit | cloudcli@flo.service (Type=simple, User=flo, Restart=always) |

Reproduction

  1. Start cloudcli (any version 1.28.0+) on Linux with valid Claude OAuth credentials in ~/.claude/.credentials.json.
  2. Open the Web UI, log in, select a project, click New Session in Chat mode.
  3. Send any message (even a single word like ping).
  4. Observe: no response appears, the UI hangs.
  5. Server-side log:
[DEBUG] User message: ping
📁 Project: /home/flo
🔄 Session: New
Starting async generator loop for session: NEW
                                                  ← ~1 second later ↓
systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
systemd[1]: cloudcli@flo.service: Failed with result 'signal'.

The same SIGKILL happens in Shell mode shortly after the user confirms "Yes, I trust this folder" (the Claude UI renders correctly for a brief moment first). The service then restarts, the UI reconnects to the new instance and shows the same trust-folder prompt, which looks like a "loop" to the user.
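The key field in the journal lines above is status=9/KILL: signal 9 is SIGKILL, which a process can neither catch nor log. As a rough sketch, that field can be pulled out of a journal line like this (parseSystemdExit is a hypothetical helper name, and the regex assumes only the line format shown in the log above):

```javascript
// Hypothetical helper: extract exit info from a systemd "Main process exited" line.
// Returns null when the line does not have the expected shape.
function parseSystemdExit(line) {
  const m = /Main process exited, code=(\w+), status=(\d+)\/(\w+)/.exec(line);
  if (!m) return null;
  return { code: m[1], status: Number(m[2]), name: m[3] };
}

const journalLine =
  "systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL";
console.log(parseSystemdExit(journalLine));
```

Running this over a longer journal export would show whether every restart carries the same signal or whether some exits look different.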

What is NOT the cause (ruled out by elimination)

I went through every plausible angle. None explains the crash:

  • claude binary: claude -p "ping" over SSH returns pong, exit 0, in ~1.5s.
  • @anthropic-ai/claude-agent-sdk — running it standalone outside cloudcli works perfectly:
    // Standalone test (replaces cloudcli's queryClaudeSDK)
    import { query } from "@anthropic-ai/claude-agent-sdk";
    for await (const m of query({ prompt: "ping", options: { model: "opus" } })) {
      if (m.type === "result") console.log("RESULT:", m.result);
    }
    // → RESULT: pong  (in ~6.9s, exit 0, hooks fire correctly, no crash)
  • claude SessionStart hooks (claude-mem in our case) — disabled them, crash identical.
  • systemd / cgroup / OOM:
    • MemoryPeak = 82 MB
    • MemoryMax = infinity
    • WatchdogUSec = 0
    • kernel dmesg has no OOM event
    • coredumpctl list is empty
    • No entries in journalctl -u systemd-oomd
  • systemd itself — running cloudcli in foreground via nohup env PORT=3001 cloudcli & (no systemd) → process dies silently within 1s of "Starting async generator loop", same way. So the SIGKILL is reported by systemd but originates inside the Node process (or its native deps).
  • React StrictMode + double /shell WS — yes, 2 simultaneous shell WS are observed (StrictMode bit ON), but handleShellConnection in server/index.js shares a single PTY across WS via ptySessionsMap, so this is a non-issue causally.
  • CloudCLI version regression — bug exists in every release tested (1.28.0, 1.29.5, 1.30.0). Embedded SDK is the same 0.2.119 in all of them.
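For the foreground test in the list above, nohup hides how the process actually ended. One way to capture it, sketched here under assumptions, is to launch cloudcli from a small Node wrapper and read the child's 'exit' event, which reports either an exit code or the terminating signal. runAndReport is a hypothetical helper name. The dynamic import is used only so the snippet works from both ESM and CommonJS files.

```javascript
// Hypothetical wrapper: run a command in the foreground and report how it ended.
async function runAndReport(command, args = [], extraEnv = {}) {
  const { spawn } = await import("node:child_process");
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      stdio: "inherit",                       // keep the child's logs visible
      env: { ...process.env, ...extraEnv },
    });
    child.on("error", reject);                // e.g. command not found
    // code is null when the child was terminated by a signal, and vice versa
    child.on("exit", (code, signal) => resolve({ code, signal }));
  });
}

// Usage (assumes cloudcli is in $PATH, as in the report):
// runAndReport("cloudcli", [], { PORT: "3001" }).then((r) => console.log("cloudcli ended:", r));
```

If this reports signal: 'SIGKILL' without systemd involved, that would confirm the kill is delivered to the Node process itself rather than inferred from the service manager.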

What might be the cause

I can't conclusively isolate it without instrumenting the SDK or running under a debugger, but the strongest remaining hypothesis is:

The way queryClaudeSDK() (in server/claude-sdk.js) invokes the SDK from inside a WebSocket 'message' handler triggers a fatal abort in a native dependency (likely related to stdio piping, signals, or process forking) that does not happen when the SDK is invoked from a plain Node script.

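Since the standalone SDK call works and the cloudcli call dies silently with no stack trace, the death most likely originates in native code or in a child process (e.g. node-pty, better-sqlite3, or something the SDK loads transitively). One caveat: an abort() from a native binding would normally be reported as SIGABRT (status=6/ABRT), whereas systemd reports status=9/KILL, so something appears to be sending SIGKILL to the Node process explicitly. A minimal repro that wraps the same query() call inside a WebSocketServer 'message' handler would help.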
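A possible starting point for that minimal repro is sketched below. It is not cloudcli's actual queryClaudeSDK code: runQueryForMessage is a hypothetical helper, and the query function is injected so the same loop can be driven first by a stub and then by the real SDK from inside a WebSocket 'message' handler.

```javascript
// Hypothetical repro helper: run the SDK's async generator for one incoming message.
// queryFn is injected (the real SDK `query`, or a stub for dry runs).
async function runQueryForMessage(queryFn, rawMessage, send) {
  const prompt = rawMessage.toString();
  console.log("Starting async generator loop for prompt:", prompt);
  let result = null;
  for await (const m of queryFn({ prompt, options: { model: "opus" } })) {
    send(JSON.stringify({ type: m.type }));   // forward a small frame per message
    if (m.type === "result") result = m.result;
  }
  console.log("Generator loop finished");
  return result;
}

// Wiring sketch, assuming the `ws` package and the SDK are installed:
//   import { WebSocketServer } from "ws";
//   import { query } from "@anthropic-ai/claude-agent-sdk";
//   const wss = new WebSocketServer({ port: 3002 });
//   wss.on("connection", (ws) => {
//     ws.on("message", (data) => runQueryForMessage(query, data, (s) => ws.send(s)));
//   });
```

If the process survives with this handler but dies inside cloudcli, the difference between the two invocation paths (options passed, environment, other native modules loaded at that point) becomes the next thing to diff.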

Workaround in use

Until this is fixed, we use claude directly via SSH/RustDesk on the VM. The web UI is unusable for both Chat and Shell.

Logs

1. Crash signature — Chat mode (cloudcli 1.30.0, repeats identically on 1.29.5 and 1.28.0)
Apr 26 16:35:33 claude-flo bash[1039599]: [DEBUG] User message: Réponds juste ping
Apr 26 16:35:33 claude-flo bash[1039599]: 📁 Project: /home/flo
Apr 26 16:35:33 claude-flo bash[1039599]: 🔄 Session: New
Apr 26 16:35:33 claude-flo bash[1039599]: Starting async generator loop for session: NEW
Apr 26 16:35:34 claude-flo systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
Apr 26 16:35:35 claude-flo systemd[1]: cloudcli@flo.service: Failed with result 'signal'.
Apr 26 16:35:35 claude-flo systemd[1]: cloudcli@flo.service: Consumed 3.396s CPU time.
Apr 26 16:35:40 claude-flo systemd[1]: cloudcli@flo.service: Scheduled restart job, restart counter is at 1.
2. Crash signature — Shell mode (after "Yes I trust this folder")
Apr 26 15:53:35 bash[400358]: [INFO] Using Claude Agents SDK for Claude integration
Apr 26 15:53:35 bash[400358]: 📨 Shell message received: init
Apr 26 15:53:35 bash[400358]: [INFO] Starting shell in: /home/flo
Apr 26 15:53:35 bash[400358]: 🔧 Executing shell command: claude
Apr 26 15:53:35 bash[400358]: 📐 Using terminal dimensions: 149 x 43
Apr 26 15:53:35 bash[400358]: 🟢 Shell process started with PTY, PID: …
Apr 26 15:54:00 bash[400358]: 📨 Shell message received: input    ← user confirms Yes
Apr 26 15:54:00 bash[400358]: 📨 Shell message received: input
Apr 26 15:54:01 bash[400358]: 🔚 Shell process exited with code: 1 signal: 0
Apr 26 15:54:15 systemd[1]: cloudcli@flo.service: Main process exited, code=killed, status=9/KILL
3. SDK works perfectly in standalone (same Node, same SDK, same claude binary, same project dir)
# Node script using the SDK directly (skips cloudcli's WebSocket layer)
$ cat > /tmp/test-sdk.mjs << 'EOF'
import { query } from "/path/to/@anthropic-ai/claude-agent-sdk/sdk.mjs";
console.log("Calling query...");
for await (const m of query({ prompt: "ping", options: { model: "opus" } })) {
  // Log every message so the full stream is visible in the output
  console.log("MSG:", JSON.stringify(m));
}
console.log("DONE");
EOF
$ node /tmp/test-sdk.mjs
Calling query...
MSG: {"type":"system","subtype":"hook_started","hook_name":"SessionStart:startup", ...}
MSG: {"type":"system","subtype":"init","cwd":"/home/flo","session_id":"...","tools":[...]}
MSG: {"type":"assistant","message":{"model":"claude-opus-4-7", ...,"content":[{"type":"text","text":"pong"}], ...}}
MSG: {"type":"result","subtype":"success","is_error":false,"duration_ms":6886, ...,"result":"pong","session_id":"..."}
DONE
$ echo $?
0
4. systemd service unit (irrelevant since crash also happens in foreground)
[Service]
Type=simple
User=flo
Environment=NODE_ENV=production
Environment=PORT=3001
Environment=HOST=127.0.0.1
ExecStart=/bin/bash -c 'export NVM_DIR=/home/%i/.nvm && . $NVM_DIR/nvm.sh && exec cloudcli'
Restart=always
RestartSec=5

Confirmed crash is independent of systemd: launching via nohup env PORT=3001 cloudcli > /tmp/cloudcli-fg.log 2>&1 & then sending a Chat message reproduces the same silent process death within ~1s of "Starting async generator loop". The PID disappears, no stack trace, no coredump.
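To narrow down where the silent death comes from, a few process-level hooks could be added near the top of the server entry point. This is a sketch, not something cloudcli ships, and installExitDiagnostics is a hypothetical name. It uses 'uncaughtExceptionMonitor' rather than 'uncaughtException' so the default crash behavior is left unchanged.

```javascript
// Hypothetical diagnostics: log every exit path that Node itself can observe.
// SIGKILL cannot be intercepted, so silence from these hooks is itself a signal.
function installExitDiagnostics(log = console.error) {
  const stamp = () => new Date().toISOString();

  // Fires before the default handling of uncaught errors, without altering it.
  process.on("uncaughtExceptionMonitor", (err, origin) => {
    log(`[diag ${stamp()}] uncaught (${origin}): ${err && err.stack ? err.stack : err}`);
  });

  // Fires on any exit that goes through Node (process.exit or natural end).
  process.on("exit", (code) => {
    log(`[diag ${stamp()}] exit event, code=${code}`);
  });

  return ["uncaughtExceptionMonitor", "exit"];
}

// Usage: call installExitDiagnostics() once, before the WebSocket server starts.
```

If neither hook logs anything before the PID disappears, the process did not exit through JavaScript at all, which would point to an external kill or a native-level termination.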

5. Memory / OOM is NOT the cause
$ systemctl show cloudcli@flo | grep -E "Memory|Watchdog"
MemoryCurrent=50593792
MemoryPeak=82640896           ← 82 MB peak, way under any limit
MemoryMax=infinity
MemoryHigh=infinity
WatchdogUSec=0                ← no watchdog
ManagedOOMMemoryPressure=auto

$ sudo dmesg -T | grep -iE "oom|kill"
(empty)

$ sudo journalctl -u systemd-oomd --since "1 hour ago"
-- No entries --

$ sudo coredumpctl list --since "2h ago"
(empty)
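For anyone re-checking the memory figures, the systemctl show values are plain byte counts. The sketch below parses KEY=VALUE output and converts it to MiB. Both helper names are hypothetical, and the sample string is copied from the output above without the annotations.

```javascript
// Hypothetical helper: parse `systemctl show` KEY=VALUE output into an object.
function parseSystemctlShow(text) {
  const props = {};
  for (const row of text.split("\n")) {
    const idx = row.indexOf("=");
    if (idx > 0) props[row.slice(0, idx).trim()] = row.slice(idx + 1).trim();
  }
  return props;
}

// Convert a byte-count string to MiB; "infinity" (no limit) maps to Infinity.
function toMiB(value) {
  return value === "infinity" ? Infinity : Number(value) / (1024 * 1024);
}

const sample = "MemoryCurrent=50593792\nMemoryPeak=82640896\nMemoryMax=infinity";
const memProps = parseSystemctlShow(sample);
console.log(toMiB(memProps.MemoryPeak).toFixed(1), "MiB peak, limit:", toMiB(memProps.MemoryMax));
// prints: 78.8 MiB peak, limit: Infinity
```

The 82 MB quoted above is the decimal reading of the same number (82,640,896 bytes). Either way the peak is tiny and no limit is set.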

Happy to provide more logs, diff the SDK invocation paths, or run additional diagnostics. This blocks the entire web UI for our team — we currently fall back to running claude directly via SSH/RustDesk.
