Commit 99e5040

feat(codex-fleet): halve worker pool default + cap Node MCP children + idle worker auto-exit (#195)
Three changes, all in scripts/codex-fleet/, addressing the ~10 GB resident memory floor observed when the fleet is up (16 codex CLIs + 258 node helpers; each codex CLI holds 200-400 MB of native heap that does not shrink while idle).

codex-fleet-2.sh
- New WORKER_COUNT env (default 4, was hardcoded 8). The RESERVE_ACCOUNTS array is unchanged as the upper bound; spawn loop iterates 0..WORKER_COUNT-1. Bump back via WORKER_COUNT=8 when a heavy plan needs more parallel lanes.
- worker_cmd_for() now exports NODE_OPTIONS=--max-old-space-size=400 so any Node MCP-server child codex spawns is capped. codex itself is native; this only affects Node helpers.

worker-prompt.md
- Added empty_streak counter to the worker loop. After IDLE_EXIT_THRESHOLD (default 5) consecutive empty task_ready_for_agent polls (~5 min idle), the worker posts a Colony note and exits 0. Supervisor respawns it when Colony reports new claimable work. Override per-pane via IDLE_EXIT_THRESHOLD=0.

Expected impact
- workers at bringup: 8 → 4
- idle floor when plan exhausted: ~2 GB → ~0
- active-work peak: ~10 GB → ~5 GB
- node MCP child heap cap: unbounded → 400 MB

To activate, tear down + bring up the fleet:
  bash scripts/codex-fleet/down.sh
  bash scripts/codex-fleet/full-bringup.sh ...
1 parent 3e1246d commit 99e5040
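
As a rough sanity check on the figures in the commit message, the native heap held by idle workers can be estimated as worker count times per-CLI heap. The helper below is a back-of-envelope sketch and not part of the commit: the function name `codex_heap_range_mb` is hypothetical, it defaults to the 200-400 MB per-CLI range quoted above, and it ignores the Node helper processes, whose per-process footprint the commit does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): print the low-high range, in MB,
# of native heap held by N idle codex CLIs.
# $1 = worker count, $2 = low per-CLI estimate (MB), $3 = high per-CLI estimate (MB).
codex_heap_range_mb() {
  local workers="$1"
  local low_mb="${2:-200}"
  local high_mb="${3:-400}"
  printf '%d-%d\n' "$(( workers * low_mb ))" "$(( workers * high_mb ))"
}
```

Under these assumptions `codex_heap_range_mb 8` prints `1600-3200` and `codex_heap_range_mb 4` prints `800-1600`, which is in line with the quoted ~2 GB idle floor for 8 workers.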

4 files changed

Lines changed: 86 additions & 4 deletions

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-18

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+# agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 (minimal / T1)
+
+Branch: `agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`
+
+## Why
+
+The codex-fleet currently holds ~10 GB of resident memory (16 codex CLIs
++ 258 node helper procs, observed via `ps -C codex -o rss`). Each codex
+CLI is a native binary with ~200-400 MB of heap that does NOT shrink
+while idle. Even when the plan is `plan-exhausted` and every worker is
+in `sleep 60`, the native heap stays resident.
+
+## What
+
+Three changes, all in `scripts/codex-fleet/`:
+
+1. `codex-fleet-2.sh` — worker count is now `WORKER_COUNT` env (default
+   **4**, was **8**). Spawn loop driven by the env. The full
+   `RESERVE_ACCOUNTS` array stays as the upper bound; bump by setting
+   `WORKER_COUNT=8` for heavy plans.
+2. `codex-fleet-2.sh`: `worker_cmd_for()` now exports
+   `NODE_OPTIONS=--max-old-space-size=400` so any Node MCP-server child
+   codex spawns is capped. (codex itself is native; the flag does not
+   apply to its own heap.)
+3. `worker-prompt.md` — added `empty_streak` counter to the worker
+   loop. After 5 consecutive `plan-exhausted` polls (~5 min idle), the
+   worker posts a Colony note and exits with status 0. Supervisor
+   respawns it when Colony reports new claimable work for the account.
+   Override per-pane via `IDLE_EXIT_THRESHOLD=0`.
+
+## Expected impact
+
+| Metric | Before | After |
+| --- | --- | --- |
+| Workers spawned at bringup | 8 | 4 |
+| Idle floor when plan exhausted | ~2 GB | near 0 |
+| Active-work peak | ~10 GB | ~5 GB |
+| Node MCP child heap cap | unbounded | 400 MB |
+
+## Handoff
+
+- Handoff: change=`agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; branch=`agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; scope=`codex-fleet-2.sh + worker-prompt.md`; action=`finish via PR`.
+
+## Cleanup
+
+- [ ] Run: `gx branch finish --branch agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 --base main --via-pr --wait-for-merge --cleanup`
+- [ ] Tear down + bring up the fleet to pick up the new defaults:
+  `bash scripts/codex-fleet/down.sh && bash scripts/codex-fleet/full-bringup.sh ...`
+- [ ] Record PR URL + `MERGED` state in the completion handoff.
+- [ ] Confirm sandbox worktree is gone (`git worktree list`, `git branch -a`).
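
The "Why" section cites `ps -C codex -o rss` as the source of the memory figure. One way to turn that per-process listing into a single number is sketched below. The function names `sum_rss_mib` and `codex_rss_mib` are hypothetical, and the sketch assumes procps-ng `ps` on Linux, where `rss` is reported in KiB and a trailing `=` suppresses the column header.

```shell
#!/usr/bin/env bash
# Sum RSS values read from stdin (KiB, one per line, as printed by `ps -o rss=`)
# and print the total in whole MiB. Empty input prints 0.
sum_rss_mib() {
  awk '{ total += $1 } END { printf "%d\n", total / 1024 }'
}

# Hypothetical wrapper: total resident memory of every process named `codex`.
# `-C codex` selects by command name; `rss=` drops the header line.
codex_rss_mib() {
  ps -C codex -o rss= | sum_rss_mib
}
```

Running `codex_rss_mib` before and after a fleet restart would give a like-for-like comparison against the "Expected impact" table; it does not include the Node helper processes unless they are summed separately.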

scripts/codex-fleet/codex-fleet-2.sh

Lines changed: 18 additions & 3 deletions
@@ -143,20 +143,35 @@ RESERVE_ACCOUNTS=(
   admin-kollarrobert admin-mite bia-zazrifka fico-magnolia
   koncita-pipacs mesi-lebenyse recodee-mite ricsi-zazrifka
 )
+# Worker count is now parameterizable. Default halved from 8 → 4 to cut
+# the codex-fleet-2 RSS floor by ~50% (each codex CLI holds ~200-400 MB
+# of native heap that does not shrink while idle). Bump back via
+# `WORKER_COUNT=8 bash codex-fleet-2.sh ...` when a heavy plan needs
+# more parallel lanes; the array above caps the upper bound.
+WORKER_COUNT="${WORKER_COUNT:-4}"
+if (( WORKER_COUNT < 1 )); then WORKER_COUNT=1; fi
+if (( WORKER_COUNT > ${#RESERVE_ACCOUNTS[@]} )); then
+  WORKER_COUNT=${#RESERVE_ACCOUNTS[@]}
+fi
 worker_cmd_for() {
   local acct="$1"
   # Launch codex directly as the pane command (matches codex-fleet:overview's
   # pattern in scripts/codex-fleet/full-bringup.sh). codex inherits the pane's
   # TTY cleanly because we skip the bash-lc indirection. The guard wrapper
   # (codex-guard.sh) sees CODEX_GUARD_BYPASS=1 and execs the real codex.
-  printf 'env CODEX_GUARD_BYPASS=1 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
+  #
+  # NODE_OPTIONS=--max-old-space-size=400 caps any Node MCP-server child the
+  # codex binary spawns. codex itself is a native binary so the V8 flag
+  # does not apply to its own heap, but it keeps helper Node processes
+  # from growing unbounded.
+  printf 'env CODEX_GUARD_BYPASS=1 NODE_OPTIONS=--max-old-space-size=400 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
     "$acct" "$acct" "$acct" "$SESSION"
 }
-# Force a generous virtual size so 8 worker splits have room before the
+# Force a generous virtual size so the worker splits have room before the
 # kitty client attaches. tmux resizes to the client on attach anyway.
 tmux new-session -d -s "$SESSION" -x 274 -y 78 -n overview \
   "$(worker_cmd_for "${RESERVE_ACCOUNTS[0]}")"
-for i in 1 2 3 4 5 6 7; do
+for (( i = 1; i < WORKER_COUNT; i++ )); do
   acct="${RESERVE_ACCOUNTS[$i]}"
   tmux split-window -t "$SESSION:overview" "$(worker_cmd_for "$acct")" >/dev/null 2>&1 || true
   tmux select-layout -t "$SESSION:overview" tiled >/dev/null
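
The clamp added in this hunk can be exercised in isolation. The sketch below pulls the same bounds check into a standalone function so its edge cases are easy to try. The name `clamp_worker_count` is hypothetical, and the fallback to 4 on non-numeric input is an assumption that goes beyond the diff, which relies on bash arithmetic evaluation alone.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the clamp in the diff: print a worker count
# forced into the range [1, max].
# $1 = requested count, $2 = upper bound (e.g. ${#RESERVE_ACCOUNTS[@]}).
clamp_worker_count() {
  local requested="$1"
  local max="$2"
  # Assumption beyond the commit: anything that is not a plain non-negative
  # decimal integer falls back to the default of 4.
  if ! [[ "$requested" =~ ^(0|[1-9][0-9]*)$ ]]; then
    requested=4
  fi
  if (( requested < 1 )); then
    requested=1
  fi
  if (( requested > max )); then
    requested="$max"
  fi
  printf '%d\n' "$requested"
}
```

With the 8-entry `RESERVE_ACCOUNTS` array shown above, `clamp_worker_count 12 8` prints `8` and `clamp_worker_count 0 8` prints `1`, matching the two `if` guards in the script.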

scripts/codex-fleet/worker-prompt.md

Lines changed: 16 additions & 1 deletion
@@ -89,18 +89,33 @@ the file-claim, gx, or PR-merge contracts in this prompt.
 ## Loop
 
 ```
+1. empty_streak = 0 # tracked per-pane; reset whenever ready.ready is non-empty
 2. ready = mcp__colony__task_ready_for_agent({ agent: $CODEX_FLEET_AGENT_NAME, limit: 1 })
 3. if ready.ready is empty:
+     empty_streak += 1
+     if empty_streak >= IDLE_EXIT_THRESHOLD (default 5, i.e. ~5 minutes idle):
+       task_post(kind:'note', content:'idle-exit: empty_streak=<n>; supervisor will respawn on demand')
+       exit 0 # native heap is reclaimed; supervisor watcher respawns the pane
+              # only when Colony reports new claimable work for this account.
      if ready.next_action contains "rescue" or ready.next_tool == "rescue_stranded_scan":
        sleep 60 # claim-release-supervisor daemon owns rescue; do not loop on it
      else:
        sleep 60
      goto 2
-4. task = ready.ready[0]
+4. empty_streak = 0
+   task = ready.ready[0]
 ```
 
 Then preflight, claim, work, report. Sequence below.
 
+**Why the idle-exit.** Each codex CLI holds ~200-400 MB of native heap that
+does not shrink while idle. Leaving 8 workers spinning at `sleep 60` keeps
+~2-3 GB of RSS resident even when no plan is claimable. Exiting after 5
+consecutive empty polls (~5 min idle) drops the floor to active workers
+only; the supervisor / `claim-release-supervisor.sh` respawns the pane on
+the next Colony work signal. Override with `IDLE_EXIT_THRESHOLD=0` in the
+pane env to disable the auto-exit for that one pane.
+
 ### Tier + specialty gate (REQUIRED before preflight)
 
 Read once at boot: `tier=$CODEX_FLEET_TIER` (default `high`),
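
The idle-exit loop in this hunk can be sketched as plain bash to make the counter behaviour concrete. This is an illustration under stated assumptions, not the worker's implementation: `poll_ready` is a hypothetical stand-in for the `mcp__colony__task_ready_for_agent` call, and `run_worker_loop` is a hypothetical name. Note that the pseudocode compares `empty_streak >= IDLE_EXIT_THRESHOLD` without special-casing 0, so a literal reading would exit on the first empty poll when the threshold is 0. The sketch assumes the prose is the intent, treats 0 as "never exit", and guards for it explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the Colony poll: prints one task id, or nothing
# when no work is claimable. This stub always reports an empty queue.
poll_ready() {
  :
}

# Sketch of the idle-exit loop.
# $1 = idle-exit threshold (defaults to $IDLE_EXIT_THRESHOLD, then 5; 0 disables the exit)
# $2 = seconds to sleep between polls (default 60, as in the prompt)
run_worker_loop() {
  local threshold="${1:-${IDLE_EXIT_THRESHOLD:-5}}"
  local poll_sleep="${2:-60}"
  local empty_streak=0
  local task

  while true; do
    task="$(poll_ready)"
    if [[ -z "$task" ]]; then
      empty_streak=$(( empty_streak + 1 ))
      # Assumption: guard on threshold > 0 so that 0 means "never auto-exit".
      if (( threshold > 0 && empty_streak >= threshold )); then
        echo "idle-exit: empty_streak=${empty_streak}"
        return 0
      fi
      sleep "$poll_sleep"
      continue
    fi
    # Work arrived: reset the streak, then preflight, claim, work, report.
    empty_streak=0
    echo "claimed: ${task}"
  done
}
```

Calling `run_worker_loop 3 0` with the stub above returns after three empty polls. In a real pane the function would be given a `poll_ready` that actually queries Colony, and the `echo` lines would be replaced by the note post and the task handling.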
