Headless server leaks kqueue file descriptors — bash tool stops working after prolonged use

## Summary

The copilot headless server (`copilot --headless`) accumulates kqueue file descriptors over its lifetime. Each session creates `fs.watch()` file watchers that are never released when sessions become idle or are closed. After enough sessions (especially with multi-agent/fleet workflows), PTY allocation for the `bash` tool fails with the generic error `"Failed to start bash process"`.

## Environment

- **CLI version:** 1.0.10
- **OS:** macOS 26.3 (Darwin 25.3.0, arm64)
- **Node:** v24.11.1
- **Client type:** cli-server (PolyPilot app connecting via TCP)

## Reproduction

1. Start `copilot --headless --port 4321`
2. Create many sessions over time (fleet/multi-agent workflows accelerate this — we had 93 unique sessions over the server's lifetime)
3. After enough sessions, all `bash` tool calls fail with `"Failed to start bash process"`
4. Existing PTY handles start throwing `EIO` (I/O error) on write

Restarting the headless server immediately fixes the issue.

## Evidence

### kqueue FD leak comparison

| Metric | Old Server (leaked) | Fresh Server (just restarted) |
|---|---|---|
| **Total FDs** | **10,384** | 69 |
| **KQUEUE FDs** | **9,779** | 27 |
| **Node processes** | 300 | 21 |
| **User processes** | 922 | 362 |

That's ~105 kqueue file descriptors leaked per session (9,779 kqueues / 93 sessions).

A fresh server starts with ~27 KQUEUE FDs. After 93 sessions (including multi-agent sub-agents), this grew to 9,779 — a **362x increase**.
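
The two ratios quoted above can be re-derived from the table figures. The snippet below is only a convenience check. It also prints the per-session rate after subtracting the fresh-server baseline of 27, which the figures above do not do; the variable names are illustrative.

```javascript
// Figures copied from the comparison table above.
const leakedKqueues = 9779;  // KQUEUE FDs on the old server
const baselineKqueues = 27;  // KQUEUE FDs on a freshly restarted server
const sessions = 93;         // unique sessions over the server's lifetime

// Average kqueue FDs per session, with and without the baseline subtracted.
const perSessionGross = leakedKqueues / sessions;
const perSessionNet = (leakedKqueues - baselineKqueues) / sessions;

// Growth relative to a fresh server.
const growthFactor = leakedKqueues / baselineKqueues;

console.log(perSessionGross.toFixed(1)); // 105.2
console.log(perSessionNet.toFixed(1));   // 104.9
console.log(growthFactor.toFixed(1));    // 362.2
```

Either way the per-session figure rounds to about 105, so the baseline does not change the conclusion.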

### Timeline from the process log (`process-*.log`)

- **First PTY EIO error:** `2026-03-29T14:38:07.722Z`
  ```
  [ERROR] Unhandled pty write error [Error: EIO: i/o error, write] {
    errno: -5,
    code: 'EIO',
    syscall: 'write'
  }
  ```
- **First bash failure:** `2026-03-29T14:38:07.925Z` (about 200 ms after the first EIO)
  ```json
  {
    "tool_name": "bash",
    "result_type": "FAILURE",
    "error": "<exited with error: Failed to start bash process>"
  }
  ```
- **Last bash failure:** `2026-03-29T14:46:32.763Z` (~8 minutes of total bash unavailability)
- **Total PTY EIO errors:** 149
- **Total bash spawn failures:** 641
- **Multiple sessions affected** — the failure is server-wide, not session-specific

### lsof output on the leaked server (before restart)

```
$ lsof -p 26081 | awk '{print $5}' | sort | uniq -c | sort -rn
9779 KQUEUE
 436 unix
 120 CHR
  27 REG
   8 DIR
   6 PIPE
   5 IPv4
   2 IPv6
   1 systm
```

### lsof output on fresh server (after restart)

```
$ lsof -p 33499 | awk '{print $5}' | sort | uniq -c | sort -rn
  27 KQUEUE
  17 REG
   6 PIPE
   4 unix
   4 IPv4
   4 DIR
   2 IPv6
   2 CHR
   1 systm
```

## Root Cause Analysis

Node.js uses kqueue on macOS for `fs.watch()`. The headless server likely creates file watchers per session (working directory monitoring, `.copilot/` config watches, session state directory watches, etc.). When sessions go idle or are explicitly closed, **these watchers are never cleaned up**.

With multi-agent workflows that spawn many short-lived sub-agent sessions (fleet mode, parallel task execution), the leak accelerates rapidly. In our case, 93 sessions accumulated ~9,779 kqueue FDs.

When the kqueue count gets high enough, `pty.spawn()` (used by the bash tool to create pseudo-terminal sessions) fails — likely due to macOS kernel resource pressure on PTY allocation or the child process inheriting too many FDs.
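
One way to test this hypothesis from inside a Node process is to sample the open-descriptor count before and after a session lifecycle. This is a minimal sketch, assuming `/dev/fd` (macOS) or `/proc/self/fd` (Linux) is readable. `countOpenFds` and `fdDelta` are hypothetical helper names, not part of the CLI.

```javascript
const fs = require('node:fs');

// Count this process's open descriptors by listing its fd directory.
// The listing itself briefly holds one descriptor, so compare deltas
// between two samples rather than trusting the absolute number.
function countOpenFds() {
  for (const dir of ['/dev/fd', '/proc/self/fd']) {
    try {
      return fs.readdirSync(dir).length;
    } catch {
      // Directory not available on this platform; try the next candidate.
    }
  }
  return -1; // no fd directory found on this platform
}

// Run a synchronous lifecycle (create resources, then dispose of them)
// and report how many descriptors it left behind.
function fdDelta(lifecycle) {
  const before = countOpenFds();
  lifecycle();
  return countOpenFds() - before;
}
```

A result of `0` for a create-and-close cycle suggests cleanup is working, while a positive value that grows with each cycle points to a leak. Watcher teardown may complete asynchronously on some platforms, so it can be worth sampling again after a tick of the event loop. The same count could feed the threshold warning proposed under Suggested Fix.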

## Impact

- **All bash tool calls fail server-wide** — not just the session that triggered the limit
- **Existing PTY sessions get EIO errors** — even previously-working bash sessions break
- **Only fix is server restart** — the FDs are never reclaimed without killing the process
- **Multi-agent/fleet workflows hit this faster** due to many short-lived sessions
- **8+ minutes of total bash unavailability** in our observed incident

## Suggested Fix

1. **Close file watchers when sessions are disposed/idle.** Each `fs.watch()` handle should be tracked per-session and `.close()`d when the session ends.
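2. **Make sure watcher FDs are not inherited by child processes.** kqueue descriptors themselves are not inherited across `fork(2)`, but any regular file descriptors opened for watching should have `CLOEXEC` set so that bash children do not receive them.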
3. **Add a resource limit guard** — if the server detects its FD count exceeding a threshold, log a warning and/or proactively clean up stale watchers.
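
The first suggestion could look roughly like the sketch below. It assumes the server knows a session ID at the point where it calls `fs.watch()`. `SessionWatcherRegistry` and its method names are hypothetical and do not reflect the CLI's actual internals.

```javascript
const fs = require('node:fs');

// Hypothetical registry that remembers which session owns each watcher.
class SessionWatcherRegistry {
  constructor() {
    this.watchersBySession = new Map(); // sessionId -> Set<fs.FSWatcher>
  }

  // Start a watcher and record it under the owning session.
  watch(sessionId, path, listener) {
    const watcher = fs.watch(path, listener);
    let owned = this.watchersBySession.get(sessionId);
    if (!owned) {
      owned = new Set();
      this.watchersBySession.set(sessionId, owned);
    }
    owned.add(watcher);
    return watcher;
  }

  // Close every watcher the session owns; returns how many were closed.
  disposeSession(sessionId) {
    const owned = this.watchersBySession.get(sessionId);
    if (!owned) return 0;
    for (const watcher of owned) {
      watcher.close(); // releases the underlying OS handle
    }
    this.watchersBySession.delete(sessionId);
    return owned.size;
  }

  // Total watchers currently tracked across all sessions.
  openCount() {
    let total = 0;
    for (const owned of this.watchersBySession.values()) total += owned.size;
    return total;
  }
}
```

Calling `disposeSession(id)` from whatever code path handles session close or idle timeout would keep the watcher count proportional to live sessions rather than to lifetime sessions.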

## Workaround

Restart the headless server periodically, or when bash failures are detected. PolyPilot users can use Settings → Save & Reconnect to trigger a server restart.
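
Detecting bash failures on the client side could be sketched as follows. The field names (`tool_name`, `result_type`, `error`) and the error text come from the failure payload shown under Evidence. `createRestartTrigger`, its threshold, and the restart callback are hypothetical and would need to be wired into whatever restart mechanism the client has.

```javascript
// Marker text taken from the failure payload in the Evidence section.
const BASH_SPAWN_ERROR = 'Failed to start bash process';

// True when a tool result looks like the server-wide bash spawn failure.
function isBashSpawnFailure(result) {
  return (
    result != null &&
    result.tool_name === 'bash' &&
    result.result_type === 'FAILURE' &&
    typeof result.error === 'string' &&
    result.error.includes(BASH_SPAWN_ERROR)
  );
}

// Hypothetical helper: calls onTrip() once after `threshold` consecutive
// bash spawn failures. Returns an observer to feed tool results into.
function createRestartTrigger(threshold, onTrip) {
  let consecutive = 0;
  let tripped = false;
  return function observe(result) {
    if (isBashSpawnFailure(result)) {
      consecutive += 1;
    } else if (result != null && result.tool_name === 'bash') {
      consecutive = 0; // any other bash result resets the streak
    }
    if (!tripped && consecutive >= threshold) {
      tripped = true;
      onTrip();
    }
    return consecutive;
  };
}
```

Requiring several consecutive failures before restarting avoids reacting to a single unrelated spawn error, and the one-shot flag prevents repeated restarts while the first is still in progress.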
