You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix(server): add graceful shutdown for zero-downtime rolling deploys (LFXV2-1703) (#738)
* feat(server): add graceful shutdown for zero-downtime rolling deploys (LFXV2-1703)
- Wire SIGTERM/SIGINT handlers in server.ts: capture http.Server,
drain in-flight requests (25s window) before SIGKILL
- Flip /readyz to 503 during shutdown so kube-proxy/Traefik
stops routing new traffic to the terminating pod
- Add shutdown.ts registry to coordinate ordered teardown hooks
- Track active SSE streams on CopilotController; send
event:shutdown frame and res.end() before HTTP drain
- Add NatsService.shutdownAll() static method to drain all
instantiated NATS connections concurrently
- Set PM2 kill_timeout: 30000 (exceeds 25s in-app drain window)
- Add Helm chart values: strategy (maxSurge 100%/maxUnavailable 0),
terminationGracePeriodSeconds: 60, lifecycle.preStop sleep 10s
- Expose all three as value-driven fields for per-env tuning
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(server): address review findings on graceful shutdown (LFXV2-1703)
- Export markShuttingDown() from shutdown.ts; call it first in
gracefulShutdown() so /readyz flips synchronously, removing the
dual-flag race between server.ts and shutdown.ts
- Increase PM2 kill_timeout to 45000ms (was 30000) so NATS/Snowflake
drain has 15s instead of 5s after the 25s HTTP drain window
- Add isShuttingDown() guard at copilot chat() entry to reject any
connections that arrive during the LB race window
- Replace res.write()+res.end() with res.end(payload) in closeAllStreams
for atomic write+close; add logger.debug for swallowed stream errors
- Document autorestart/process.exit(0) interaction in ecosystem.config.js
- Update values.yaml comment to reflect corrected timing chain
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(server): address review findings on graceful shutdown (LFXV2-1703)
- Wrap signal handlers with .catch() so unhandled rejections no longer
silently crash the process before process.exit(0) is reached (C-001)
- Enforce 15s service drain budget via Promise.race() so a hung NATS or
Snowflake drain cannot outlast PM2's kill_timeout (C-002)
- Bump terminationGracePeriodSeconds 60→75 and correct comment to reflect
that preStop runs inside the grace period, not on top of it (C-003)
- Add SnowflakeService.shutdownIfInitialized() to avoid lazy-creating a
connection pool on pods that never served a Snowflake-backed route (C-004)
- Wrap each shutdown hook in Promise.resolve().then() so synchronous throws
are caught by allSettled; log failed hooks at ERROR level (C-005)
- Make closeAllStreams() async and await write flush so SSE clients receive
the shutdown event before closeAllConnections() destroys the socket (M-003)
- Gate sendEvent() on isShuttingDown() to close the write-after-end window
between res.end() and the async close event (M-005)
- Move NatsService instance deregistration to finally block so a drain
failure no longer leaves the connection untracked (M-004)
- Return deregistration function from addShutdownHook() to prevent hook
accumulation across multiple controller instantiations in tests (M-006)
- Differentiate expected "already closed" stream errors (DEBUG) from
unexpected errors (WARN) in closeAllStreams() (m-001)
Jira: LFXV2-1703
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* docs(chart): fix terminationGracePeriodSeconds default and timing note
Update README row to match values.yaml: default 60→75 and timing note
(35s)→(55s), reflecting 10s preStop + 45s PM2 kill_timeout = 55s.
LFXV2-1703
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback
Address review comments from copilot-pull-request-reviewer[bot]:
- copilot.controller.ts: remove unused removeShutdownHook class field;
addShutdownHook() return value discarded since the hook lives for the
full process lifetime and deregistration is never needed in production
- ecosystem.config.js: add stop_exit_codes: [0] so PM2 does not restart
on a clean shutdown regardless of pm2 vs pm2-runtime context; update
autorestart comment to reflect the actual intent
Resolves 2 review threads. Thread on res.end(resolve) left open with
explanation — Node.js end(cb) overload treats a function first-arg as
the callback, not as data (confirmed by @types/node signature and
runtime source).
LFXV2-1703
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback (round 2)
Address review comments from copilot-pull-request-reviewer[bot]:
- server.ts: use consistent operation name 'graceful_shutdown' for both
startOperation and success calls; the previous names 'shutdown_initiated'
and 'shutdown_complete' split one operation across two names, violating
the repo logging convention of pairing startOperation/success with the
same operation identifier
- server.ts: add rejection handlers to NATS and Snowflake drain .then()
calls so drain failures emit a logger.warning instead of silently
disappearing; previously a rejected drain promise would skip the
success callback with no log entry at all
Resolves 2 review threads.
LFXV2-1703
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* Fixed formatting
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback (round 3)
Address review comments from coderabbitai[bot] and copilot-pull-request-reviewer[bot]:
- charts/lfx-self-serve/README.md: fix image pull secrets parameter name from
image.pullSecrets to imagePullSecrets to match the actual top-level value path
used in values.yaml and the deployment template (per copilot-pull-request-reviewer[bot])
- server.ts: add 5s budget to runShutdownHooks() call; without it a blocked SSE
write (backpressure) can stall the entire shutdown sequence past PM2 kill_timeout
(per copilot-pull-request-reviewer[bot])
- server.ts: expand raceDrain to accept a name and log a warning when the 15s budget
fires before the drain completes, making timeout events visible in logs instead of
silently succeeding (per copilot-pull-request-reviewer[bot])
- copilot.controller.ts: add 2s per-stream timeout in closeAllStreams(); if a client's
TCP receive buffer is full (backpressure), res.write()'s flush callback stalls
indefinitely; the timeout ensures the hook completes within budget regardless
(per copilot-pull-request-reviewer[bot])
- Prettier CI failure flagged by coderabbitai[bot] is already resolved (format:check
passes; previous commits reformatted affected files)
Resolves 5 review threads.
LFXV2-1703
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback (round 4)
Address review comments from copilot-pull-request-reviewer:
- apps/lfx-one/src/server/server.ts: add hooksCompleted flag to hooks budget
race so shutdown_hooks_timeout is not spuriously logged when runShutdownHooks()
completes before the 5s budget fires (mirrors completed guard in raceDrain)
(per copilot-pull-request-reviewer)
- apps/lfx-one/src/server/server.ts: rename shutdown_nats_drained →
shutdown_nats_complete and remove dead failure handler; NatsService.shutdownAll()
uses Promise.allSettled so always resolves — per-connection drain failures are
logged at ERROR inside NatsService.shutdown() (per copilot-pull-request-reviewer)
- apps/lfx-one/src/server/server.ts: rename shutdown_snowflake_drained →
shutdown_snowflake_complete; SnowflakeService.shutdown() has internal try/catch
so pool drain failures are logged inside the service — kept rejection handler
since shutdownIfInitialized() can reject before the internal try/catch
(per copilot-pull-request-reviewer)
Resolves 3 review threads.
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(server): add LB drain sleep and fix shutdown ordering (LFXV2-1703)
Address review feedback from @emsearcy:
- apps/lfx-one/src/server/server.ts: replace the Promise.race hook-budget
block with Promise.all([lbSleep, hooks]) so the server waits a mandatory
15s (1.5× readyz periodSeconds) after flipping readyz to 503 before closing
the HTTP listener — covers the window where the LB still routes requests
despite the 503 response. SSE shutdown hooks run concurrently within the
15s window so clients are notified and can reconnect before HTTP closes.
Added comment above NATS/Snowflake drain making explicit that they drain
*after* HTTP (not before), preserving in-flight NATS request/reply calls.
- apps/lfx-one/ecosystem.config.js: increase kill_timeout from 45s to 60s
to account for the new 15s LB drain sleep (15s LB + 25s HTTP + 15s service
+ 5s margin = 60s); terminationGracePeriodSeconds (75s) still exceeds
preStop (10s) + kill_timeout (60s) = 70s
- charts/lfx-self-serve/values.yaml: update timing breakdown comment to
include the 15s LB drain sleep line item; adjust active-time summary from
55s to 70s and safety margin from 20s to 5s
- charts/lfx-self-serve/README.md: update terminationGracePeriodSeconds
timing note from (55s) to (70s) to match new kill_timeout
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback (round 5)
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback (round 6)
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
* fix(review): address PR #738 review feedback
- server.ts: replace `Promise.all([lbDrain, hooks])` with `await lbDrain`
to hard-cap the hooks wait at LB_DRAIN_MS — a hung hook can no longer
delay the HTTP drain beyond the 15 s LB window (per copilot-pull-request-reviewer[bot])
- server.ts: introduce `hooksStartTime` for `shutdown_hooks_error` log so
`duration_ms` reflects time spent in hooks, not time since graceful
shutdown began (per copilot-pull-request-reviewer[bot])
- server.ts: update comment block to describe best-effort / hard-ceiling
semantics of the revised hooks/LB-drain sequence
Resolves 2 review threads on PR #738.
LFXV2-1703
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
---------
Signed-off-by: David Deal <ddeal@linuxfoundation.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
0 commit comments