Fix infinite fork loop when node renderer worker fails to bind port #2881
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
Walkthrough: Master now detects worker startup failures reported over IPC (`WORKER_STARTUP_FAILURE`) and aborts instead of reforking.
Sequence Diagram

sequenceDiagram
participant Worker
participant IPC
participant Master
rect rgba(255, 0, 0, 0.5)
note over Worker,Master: Startup failure reporting & abort
Worker->>Worker: app.listen() error (e.g. EADDRINUSE)
Worker->>IPC: process.send(WorkerStartupFailureMessage)
IPC->>Master: deliver message
Master->>Master: isWorkerStartupFailureMessage() -> true
Master->>Master: record fatalStartupFailure & set isAbortingForStartupFailure
Worker->>Worker: process.exit(1)
Worker->>Master: cluster 'exit' event
Master->>Master: detect recorded failure -> format message (port if EADDRINUSE)
Master->>Master: process.exit(1)
end
rect rgba(0, 255, 0, 0.5)
note over Worker,Master: Scheduled restart / runtime crash preserved
Worker->>Master: cluster 'exit' event (scheduled restart or runtime crash without startup failure)
Master->>Master: isScheduledRestart? -> true (or runtime crash handling)
Master->>Master: cluster.fork() (preserved behavior)
end
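For orientation, here is a minimal sketch of the worker-side half of that flow, assuming the message constant from `src/shared/workerMessages.ts`; the helper name and exact payload shape are illustrative, not the PR's actual code:

```ts
import { WORKER_STARTUP_FAILURE } from './shared/workerMessages';

// Sketch only — the real handler lives in src/worker.ts.
function reportListenFailureAndExit(err: NodeJS.ErrnoException, host: string, port: number): void {
  const payload = {
    type: WORKER_STARTUP_FAILURE,
    stage: 'listen' as const,
    code: err.code, // e.g. 'EADDRINUSE'
    host,
    port,
    message: err.message,
  };
  if (typeof process.send === 'function') {
    // Exit only after the message has been handed to the IPC channel.
    process.send(payload, undefined, undefined, () => process.exit(1));
  } else {
    process.exit(1);
  }
}

// Wired into the listen error path, roughly:
// server.on('error', (err) => reportListenFailureAndExit(err, host, port));
```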
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7563af6c62
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Greptile Summary: This PR fixes an infinite fork/crash loop in the node renderer master process by introducing a `WORKER_STARTUP_FAILURE` IPC message that workers send before exiting, plus an abort flag in the master.
Confidence Score: 5/5. Safe to merge — the core fix is correct and well-tested; only minor style/defensive-coding suggestions remain. All findings are P2 (style and defensive hardening). The abort-flag + IPC message pattern is sound: Node.js guarantees IPC messages are delivered before the corresponding exit event fires, so the master will always see the failure message before the worker's exit.
Important Files Changed
Sequence Diagram

sequenceDiagram
participant M as Master
participant W1 as Worker 1
participant W2 as Worker N
M->>W1: cluster.fork()
M->>W2: cluster.fork()
W1->>W1: app.listen() → EADDRINUSE
W2->>W2: app.listen() → EADDRINUSE
W1-->>M: IPC: WORKER_STARTUP_FAILURE {code, port, host}
Note over M: isAbortingForStartupFailure = true
W2-->>M: IPC: WORKER_STARTUP_FAILURE (ignored — flag already set)
W1->>W1: process.exit(1)
W2->>W2: process.exit(1)
M->>M: cluster 'exit' (worker 1)
Note over M: errorReporter.message(...)<br/>process.exit(1) — master terminates
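And a correspondingly simplified sketch of the master side shown in the diagram (the real logic is in `src/master.ts`; `errorReporter` is assumed to be the renderer's existing reporter, and the one-tick deferral discussed later in the review is omitted here):

```ts
import cluster from 'cluster';
import { isWorkerStartupFailureMessage, type WorkerStartupFailureMessage } from './shared/workerMessages';

declare const errorReporter: { message(msg: string): void }; // assumed to already exist

let isAbortingForStartupFailure = false;
let fatalStartupFailure: { workerId: number; failure: WorkerStartupFailureMessage } | undefined;

cluster.on('message', (worker, message) => {
  if (isWorkerStartupFailureMessage(message) && !isAbortingForStartupFailure) {
    // First failure wins; duplicates from sibling workers are ignored.
    isAbortingForStartupFailure = true;
    fatalStartupFailure = { workerId: worker.id, failure: message };
  }
});

cluster.on('exit', (worker) => {
  if (isAbortingForStartupFailure) {
    const failure = fatalStartupFailure?.failure;
    const msg =
      failure?.code === 'EADDRINUSE'
        ? `Node renderer startup failed: port ${failure.port} is already in use`
        : `Node renderer startup failed in worker ${fatalStartupFailure?.workerId ?? worker.id}: ${failure?.message}`;
    errorReporter.message(msg);
    process.exit(1);
  }
  cluster.fork(); // scheduled restart or runtime crash — refork as before
});
```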
Reviews (1): Last reviewed commit: "Fix infinite fork loop when node rendere..."
size-limit report 📦
Actionable comments posted: 1
🧹 Nitpick comments (3)
packages/react-on-rails-pro-node-renderer/tests/workerStartupFailure.test.ts (1)
51-117: Exercise `src/worker.ts`, not a hand-built copy of the branch.

These assertions build the payload and call `process.send`/`process.exit` directly, so they can stay green even if the real `app.listen` error path stops sending the message or changes the payload shape. Extract the listen-failure branch into a helper and call that helper from both `src/worker.ts` and this suite.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/react-on-rails-pro-node-renderer/tests/workerStartupFailure.test.ts` around lines 51-117: The tests are exercising a hand-built error path instead of the real code in src/worker.ts; extract the listen-failure logic into a reusable exported helper (e.g., sendWorkerStartupFailure or handleListenError) that builds the WorkerStartupFailureMessage and calls process.send/process.exit, then call that helper from both src/worker.ts where the app.listen error handler currently lives and from tests so the suite verifies the actual production logic; update worker.ts to import/export the helper and replace the inline listen-error branch with a call to that helper (refer to symbols WORKER_STARTUP_FAILURE, WorkerStartupFailureMessage, isWorkerStartupFailureMessage and the app.listen error handler) so tests exercise the real implementation.

packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts (1)
14-19: Tighten the IPC payload guard.
`cluster.on('message')` hands you `unknown`, but this guard only checks `type`. A malformed `{ type: WORKER_STARTUP_FAILURE }` payload would still arm the fatal-abort path and later produce incomplete diagnostics. Validate `stage`, `host`, `port`, and `message` here.

💡 Proposed fix
 export function isWorkerStartupFailureMessage(value: unknown): value is WorkerStartupFailureMessage {
-  return (
-    typeof value === 'object' &&
-    value !== null &&
-    (value as { type?: string }).type === WORKER_STARTUP_FAILURE
-  );
+  if (typeof value !== 'object' || value === null) {
+    return false;
+  }
+
+  const message = value as Partial<WorkerStartupFailureMessage>;
+
+  return (
+    message.type === WORKER_STARTUP_FAILURE &&
+    message.stage === 'listen' &&
+    typeof message.host === 'string' &&
+    typeof message.port === 'number' &&
+    typeof message.message === 'string'
+  );
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts` around lines 14-19: The guard in isWorkerStartupFailureMessage currently only checks the top-level type and can be spoofed; update isWorkerStartupFailureMessage to validate the full payload shape by checking that (value as any).stage and .host and .message are strings and .port is a number (and not NaN), in addition to the existing (value as { type?: string }).type === WORKER_STARTUP_FAILURE check; ensure you reference WORKER_STARTUP_FAILURE and the isWorkerStartupFailureMessage function so the cluster.on('message') handler only treats objects with valid stage, host, port, and message fields as WorkerStartupFailureMessage.

packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts (1)
40-73: Avoid re-implementing `src/master.ts` inside the test.

`handleMessage` and `handleExit` duplicate the production control flow instead of invoking it, so this suite will not catch wiring regressions in the real cluster listeners, and it currently mirrors the same worker-id mix-up as `src/master.ts`. Extract the decision logic into importable helpers and assert against those.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts` around lines 40-73: The test duplicates production flow by re-implementing handleMessage/handleExit instead of exercising src/master.ts logic; refactor the decision logic into small, exported helpers (e.g., decideStartupFailureMessage or formatWorkerExitMessage) in the production code, export them from src/master.ts (or a new module), update the test to import and call those helpers with simulated worker and message objects, remove the local handleMessage/handleExit in the test, and assert on the helper return values; also ensure the helper uses the passed worker.id consistently to fix the worker-id mix-up.
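A sketch of what that extraction could look like — `formatWorkerExitMessage` is a hypothetical helper name, not something that exists in the PR; the message strings mirror the ones asserted in the tests:

```ts
import type { WorkerStartupFailureMessage } from '../src/shared/workerMessages';

// Hypothetical helper, exported so src/master.ts and the test exercise the same code path.
export function formatWorkerExitMessage(
  failedWorkerId: number,
  failure: WorkerStartupFailureMessage | undefined,
  exitCode: number | null,
): string {
  if (failure?.code === 'EADDRINUSE') {
    return `Node renderer startup failed: port ${failure.port} is already in use`;
  }
  return `Node renderer startup failed in worker ${failedWorkerId}: ${failure?.message || `exit code ${exitCode}`}`;
}
```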
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/react-on-rails-pro-node-renderer/src/master.ts`:
- Around line 103-108: The abort message uses the currently-iterated worker id
(worker.id) rather than the worker that actually sent the fatal startup failure;
update the message construction in the isAbortingForStartupFailure branch to
reference fatalStartupFailure.workerId (and still use
failure/failure.port/failure.message and fallback to worker.process.exitCode
where appropriate) so the abort log points to the worker that reported the
startup failure (symbols: isAbortingForStartupFailure, fatalStartupFailure,
worker.id, worker.process.exitCode).
---
Nitpick comments:
In `@packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts`:
- Around line 14-19: The guard in isWorkerStartupFailureMessage currently only
checks the top-level type and can be spoofed; update
isWorkerStartupFailureMessage to validate the full payload shape by checking
that (value as any).stage and .host and .message are strings and .port is a
number (and not NaN), in addition to the existing (value as { type?: string
}).type === WORKER_STARTUP_FAILURE check; ensure you reference
WORKER_STARTUP_FAILURE and the isWorkerStartupFailureMessage function so the
cluster.on('message') handler only treats objects with valid stage, host, port,
and message fields as WorkerStartupFailureMessage.
In
`@packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts`:
- Around line 40-73: The test duplicates production flow by re-implementing
handleMessage/handleExit instead of exercising src/master.ts logic; refactor the
decision logic into small, exported helpers (e.g., decideStartupFailureMessage
or formatWorkerExitMessage) in the production code, export them from
src/master.ts (or a new module), update the test to import and call those
helpers with simulated worker and message objects, remove the local
handleMessage/handleExit in the test, and assert on the helper return values;
also ensure the helper uses the passed worker.id consistently to fix the
worker-id mix-up.
In
`@packages/react-on-rails-pro-node-renderer/tests/workerStartupFailure.test.ts`:
- Around line 51-117: The tests are exercising a hand-built error path instead
of the real code in src/worker.ts; extract the listen-failure logic into a
reusable exported helper (e.g., sendWorkerStartupFailure or handleListenError)
that builds the WorkerStartupFailureMessage and calls process.send/process.exit,
then call that helper from both src/worker.ts where the app.listen error handler
currently lives and from tests so the suite verifies the actual production
logic; update worker.ts to import/export the helper and replace the inline
listen-error branch with a call to that helper (refer to symbols
WORKER_STARTUP_FAILURE, WorkerStartupFailureMessage,
isWorkerStartupFailureMessage and the app.listen error handler) so tests
exercise the real implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5f3f10d4-3c39-48dd-b45e-01e7e35d8674
📒 Files selected for processing (6)
.github/workflows/claude-code-review.yml
packages/react-on-rails-pro-node-renderer/src/master.ts
packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts
packages/react-on-rails-pro-node-renderer/src/worker.ts
packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
packages/react-on-rails-pro-node-renderer/tests/workerStartupFailure.test.ts
Code Review

Good fix for a real operational problem. Three issues worth addressing before merge:
- Bug 1 — Wrong worker ID in error message (master.ts:108)
- Bug 2 — Loose type guard arms a process-killing path (workerMessages.ts:14-19)
- Potential hang — process.send callback may never fire (worker.ts:463-466)
Follow-up on review summaries:
Thanks for the review.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Test diverges from production error message logic
  - Updated the test handler to use `fatalStartupFailure?.workerId ?? worker.id` and added coverage for differing failed/exiting workers to match production logic.
Or push these changes by commenting:
@cursor push 47dd98baeb
Preview (47dd98baeb)
diff --git a/packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts b/packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
--- a/packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
+++ b/packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
@@ -57,10 +57,11 @@
if (isAbortingForStartupFailure) {
const failure = fatalStartupFailure?.failure;
+ const failedWorkerId = fatalStartupFailure?.workerId ?? worker.id;
const msg =
failure?.code === 'EADDRINUSE'
? `Node renderer startup failed: port ${failure.port} is already in use`
- : `Node renderer startup failed in worker ${worker.id}: ${failure?.message || `exit code ${worker.process.exitCode}`}`;
+ : `Node renderer startup failed in worker ${failedWorkerId}: ${failure?.message || `exit code ${worker.process.exitCode}`}`;
reportedMessages.push(msg);
exitCode = 1;
@@ -113,6 +114,24 @@
]);
});
+ it('uses the originally failed worker id when a different worker exits', () => {
+ const failedWorker = { id: 2, process: { exitCode: 1 } };
+ const exitingWorker = { id: 7, process: { exitCode: 1 } };
+ const failure = buildStartupFailureMessage({
+ code: 'EACCES',
+ message: 'listen EACCES: permission denied 0.0.0.0:80',
+ });
+
+ handleMessage(failedWorker, failure);
+ handleExit(exitingWorker);
+
+ expect(forkCount).toBe(0);
+ expect(exitCode).toBe(1);
+ expect(reportedMessages).toEqual([
+ 'Node renderer startup failed in worker 2: listen EACCES: permission denied 0.0.0.0:80',
+ ]);
+ });
+
it('deduplicates multiple workers failing simultaneously', () => {
const worker1 = { id: 1, process: { exitCode: 1 } };
const worker2 = { id: 2, process: { exitCode: 1 } };

You can send follow-ups to this agent here.
Update: the
Review: Fix infinite fork loop on node renderer port bind failure

Good fix overall — the core idea (typed IPC message + abort flag) is sound and solves the real problem. A few issues worth addressing before merge:

Reliability concern (medium): IPC message vs. exit event ordering is not guaranteed. The message handler sets isAbortingForStartupFailure, and the exit handler reads it, but Node.js cluster does not guarantee the message event fires before the exit event for the same worker. The 500 ms fallback timer in worker.ts narrows the window but does not close it. See inline comment on master.ts:84-89 for a suggested mitigation.

Correctness nit (low): Multiple workers failing simultaneously (all binding the same port) each trigger an exit event. The first calls errorReporter.message and process.exit(1); if a second exit is already queued before the process terminates, the error message fires twice. See inline on master.ts:103-114. The unreachable return after process.exit(1) on line 113 can also be removed (process.exit is typed never).

Testing gap (low): masterStartupFailure.test.ts tests a hand-rolled copy of the logic, not the actual masterRun() function. A regression in the real wiring would not be caught. Worth adding even one thin integration-level test that calls masterRun() against a mocked cluster. See inline comment on the test file.

Documentation nit (low): The 500 ms fallback timer in worker.ts is unexplained. A one-line comment stating why it exists would help future maintainers. See inline on worker.ts:463-466.
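A sketch of the kind of mitigation being suggested — defer the refork decision by one tick so an in-flight failure message can land first (the flag name matches the PR; the abort helper is hypothetical):

```ts
import cluster from 'cluster';

declare let isAbortingForStartupFailure: boolean; // set by the cluster 'message' handler
declare function reportFatalStartupFailureAndExit(): never; // hypothetical: logs the failure, then process.exit(1)

cluster.on('exit', (worker) => {
  // Give an in-flight WORKER_STARTUP_FAILURE message one event-loop tick to arrive
  // before classifying this exit as a runtime crash and reforking.
  setImmediate(() => {
    if (isAbortingForStartupFailure) {
      reportFatalStartupFailureAndExit();
    }
    cluster.fork();
  });
});
```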
Review Summary

The fix is well-structured and correctly solves the infinite refork loop. The IPC message approach is the right pattern — typed message, validator guard, and deduplication via the `isAbortingForStartupFailure` flag. A few items worth addressing before merge:

Medium priority
Minor / polish
Overall the approach is sound and the test suite covers the key paths well. The sibling worker cleanup is the only change I would consider before shipping.
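For reference, the sibling-worker cleanup mentioned above could look roughly like this (a sketch, not the PR's exact code; `errorReporter` is assumed to already exist):

```ts
import cluster from 'cluster';

declare const errorReporter: { message(msg: string): void };

function abortForStartupFailure(msg: string): never {
  errorReporter.message(msg);
  // Disconnect any still-running siblings so they are not orphaned while the master exits.
  for (const worker of Object.values(cluster.workers ?? {})) {
    worker?.disconnect();
  }
  return process.exit(1);
}
```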
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/repro-node-renderer-startup-failure.mjs`:
- Around line 38-51: The blocker server created as const blocker =
net.createServer() lacks an 'error' event handler so listen() failures (e.g.,
EADDRINUSE) become unhandled exceptions; add an error listener on blocker
(before calling blocker.listen) that logs the error (using console.error or
processLogger) and exits or otherwise handles the failure deterministically
(mirror the current process.exit(1) behavior used in the address check), and
ensure closeBlocker still works when the server errored.
- Around line 21-24: The current validation for timeoutMs (derived from
process.env.REPRO_TIMEOUT_MS) only rejects NaN or non-positive numbers but
allows non-finite values and values outside the Node setTimeout range; update
the check around timeoutMs to ensure Number.isFinite(timeoutMs) and
Number.isInteger(timeoutMs) and that timeoutMs is within the valid range (1
through 2147483647), and if not log an error including the raw
process.env.REPRO_TIMEOUT_MS and the parsed timeoutMs before process.exit(1) so
setTimeout is never called with Infinity or out-of-range values.
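Taken together, those two comments imply roughly this shape for the repro script — a sketch under assumptions (port 3800 and the logging are illustrative; the real script may differ):

```ts
import net from 'net';

const rawTimeout = process.env.REPRO_TIMEOUT_MS ?? '10000';
const timeoutMs = Number(rawTimeout);
// Reject anything setTimeout cannot handle (non-integers, Infinity, values above 2^31 - 1).
if (!Number.isInteger(timeoutMs) || timeoutMs < 1 || timeoutMs > 2147483647) {
  console.error(`Invalid REPRO_TIMEOUT_MS: ${rawTimeout} (parsed as ${timeoutMs})`);
  process.exit(1);
}

// Occupy the renderer port so the worker's app.listen() fails with EADDRINUSE.
const blocker = net.createServer();
blocker.on('error', (err) => {
  console.error('Blocker server failed to listen:', err);
  process.exit(1);
});
blocker.listen(3800, () => {
  // ...spawn the node renderer here and assert the master exits with code 1
  // instead of refork-looping; call blocker.close() during teardown.
});
```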
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a760fa37-a1cb-4951-b5ba-103e2a54c88a
📒 Files selected for processing (3)
packages/react-on-rails-pro-node-renderer/src/worker.ts
packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
scripts/repro-node-renderer-startup-failure.mjs
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/react-on-rails-pro-node-renderer/tests/masterStartupFailure.test.ts
Review

Good fix with thorough test coverage. The IPC-message approach is solid — workers signal their own startup failures, the master sets an abort flag, and a deferred check on worker exit aborts instead of reforking.

Two minor items (left as inline comments)
Everything else looks good
}

function validatePort(port: number): string | null {
  if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
!Number.isFinite(port) is redundant here — Number.isInteger already returns false for NaN, Infinity, and -Infinity, so the finite check can never add a failure that isInteger didn't already catch.
- if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
+ if (!Number.isInteger(port) || port < 0 || port > 65535) {
const doExit = (sendErr?: Error | null) => {
  if (exited) return;
  exited = true;
  if (sendErr) log.error({ err: sendErr }, 'Failed to send startup failure message to master');
  exitFn(1);
};
sendFn(startupFailure, undefined, undefined, doExit);
// Safety net: if the IPC channel is half-broken the callback may never
// fire, leaving this worker alive indefinitely. Force exit after a timeout.
const IPC_SEND_TIMEOUT_MS = 2000;
const timer = setTimeout(() => doExit(), IPC_SEND_TIMEOUT_MS);
if (typeof timer.unref === 'function') timer.unref();
The timeout timer is never cleared when doExit fires via the IPC callback in the normal path. In production this is harmless (the process exits before the timer fires), and in tests the exited guard makes doExit idempotent — but clearing the timer explicitly inside doExit would be cleaner and avoids leaving a dangling handle:
- const doExit = (sendErr?: Error | null) => {
-   if (exited) return;
-   exited = true;
-   if (sendErr) log.error({ err: sendErr }, 'Failed to send startup failure message to master');
-   exitFn(1);
- };
- sendFn(startupFailure, undefined, undefined, doExit);
- // Safety net: if the IPC channel is half-broken the callback may never
- // fire, leaving this worker alive indefinitely. Force exit after a timeout.
- const IPC_SEND_TIMEOUT_MS = 2000;
- const timer = setTimeout(() => doExit(), IPC_SEND_TIMEOUT_MS);
- if (typeof timer.unref === 'function') timer.unref();
+ let exited = false;
+ let timer: NodeJS.Timeout | undefined;
+ const doExit = (sendErr?: Error | null) => {
+   if (exited) return;
+   exited = true;
+   clearTimeout(timer);
+   if (sendErr) log.error({ err: sendErr }, 'Failed to send startup failure message to master');
+   exitFn(1);
+ };
+ sendFn(startupFailure, undefined, undefined, doExit);
+ // Safety net: if the IPC channel is half-broken the callback may never
+ // fire, leaving this worker alive indefinitely. Force exit after a timeout.
+ const IPC_SEND_TIMEOUT_MS = 2000;
+ timer = setTimeout(() => doExit(), IPC_SEND_TIMEOUT_MS);
+ if (typeof timer.unref === 'function') timer.unref();
if (worker.isScheduledRestart) {
  log.info('Restarting worker #%d on schedule', worker.id);
} else {
  cluster.fork();
  return;
}
The isScheduledRestart branch forks synchronously without the setImmediate deferred check that the runtime-crash branch uses. This creates an asymmetric race window: if a scheduled-restart exit fires while a WORKER_STARTUP_FAILURE IPC message is in-flight (sent but not yet processed), the master will fork a new replacement worker before setting the abort flag.
The new worker will likely fail with the same error and exit, triggering the abort on its exit — so this isn't a correctness failure (the master will still eventually abort), but it does transiently over-fork by one worker during the startup-failure shutdown sequence.
To close the window consistently, apply the same setImmediate deferral here:
- if (worker.isScheduledRestart) {
-   log.info('Restarting worker #%d on schedule', worker.id);
- } else {
-   cluster.fork();
-   return;
- }
+ if (worker.isScheduledRestart) {
+   setImmediate(() => {
+     if (abortForStartupFailure()) return;
+     log.info('Restarting worker #%d on schedule', worker.id);
+     cluster.fork();
+   });
+   return;
+ }
The existing test 'does not refork a scheduled-restart worker when aborting for startup failure' already asserts the desired outcome for the synchronous case; adding a parallel async test (IPC arrives after the exit event) would cover this path.
Code Review

Overall this is a clean, well-motivated fix with good test coverage. The IPC-based signalling design is solid — namespaced message type, validated type guard, idempotent shutdown guard, and a fallback hard-deadline timer. A few specific findings below.

Issues found
1. Scheduled-restart branch has the same IPC race as the runtime-crash branch (medium)
2. Dangling timer reference in `doExit`
3. Redundant `Number.isFinite` check in `validatePort`

What looks good
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON, but it could not run because on-demand usage is turned off. To enable Bugbot Autofix, turn on on-demand usage and set a spend limit in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 97784b9. Configure here.
}

function validatePort(port: number): string | null {
  if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
Dead condition in port validation function
Low Severity
The validatePort condition !Number.isInteger(port) || !Number.isFinite(port) contains a dead sub-expression. Number.isInteger already implies Number.isFinite — every integer is finite, and Number.isInteger returns false for NaN, Infinity, and -Infinity. So when !Number.isInteger(port) is false (the only time !Number.isFinite is even evaluated), port is guaranteed to be finite, making !Number.isFinite(port) always false at that point. The !Number.isFinite check never affects the result and misleads readers into thinking there is an input class that Number.isInteger alone would miss.
Reviewed by Cursor Bugbot for commit 97784b9. Configure here.
The validatePort function uses Number.isInteger() which returns false for string values like "3800". When users pass env-derived port values (e.g. `port: env.RENDERER_PORT || 3800`), the env var is a string, causing the validation to reject valid ports and crash the renderer. This was the root cause of all Pro integration test failures in CI — the node renderer exited immediately with "RENDERER_PORT must be an integer between 0 and 65535. Received: 3800".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
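The follow-up fix therefore needs to coerce before validating — roughly like this (sketch; `coercePort` is a hypothetical name, and the real change lives in `configBuilder.ts`):

```ts
// Accept env-derived strings like "3800" as well as numbers.
function coercePort(raw: string | number | undefined, fallback: number): number {
  if (raw === undefined || raw === '') return fallback;
  return typeof raw === 'number' ? raw : Number(raw);
}

function validatePort(port: number): string | null {
  if (!Number.isInteger(port) || port < 0 || port > 65535) {
    return `RENDERER_PORT must be an integer between 0 and 65535. Received: ${port}`;
  }
  return null;
}

// e.g. const port = coercePort(process.env.RENDERER_PORT, 3800);
```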
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca6ac0162f
return (
  message.type === WORKER_STARTUP_FAILURE &&
  message.stage === 'listen' &&
  typeof message.host === 'string' &&
There was a problem hiding this comment.
Handle undefined host in startup-failure message guard
This guard drops startup-failure IPC messages unless host is a string, but buildConfig can still propagate host: undefined from JS callers (for example host: process.env.RENDERER_HOST when unset). In that case, workers still send WORKER_STARTUP_FAILURE on app.listen errors, yet the master rejects the payload here and treats exits as ordinary crashes, re-entering the fork/crash loop this change is meant to prevent (notably on EADDRINUSE). Either normalize/validate host before workers start or allow missing host values in this guard.
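One way to accommodate that would be to relax the guard rather than normalize upstream — a sketch only (`WORKER_STARTUP_FAILURE` and the interface are defined in the same module, `src/shared/workerMessages.ts`):

```ts
export function isWorkerStartupFailureMessage(value: unknown): value is WorkerStartupFailureMessage {
  if (typeof value !== 'object' || value === null) return false;
  const message = value as Partial<WorkerStartupFailureMessage>;
  return (
    message.type === WORKER_STARTUP_FAILURE &&
    message.stage === 'listen' &&
    // host may legitimately be undefined when RENDERER_HOST is unset.
    (typeof message.host === 'string' || message.host === undefined) &&
    typeof message.port === 'number' &&
    typeof message.message === 'string'
  );
}
```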
}

function validatePort(port: number): string | null {
  if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
!Number.isFinite(port) is redundant here. Number.isInteger() already returns false for NaN, Infinity, and -Infinity, so the isFinite guard adds no coverage.
- if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
+ if (!Number.isInteger(port) || port < 0 || port > 65535) {
if (worker.isScheduledRestart) {
  log.info('Restarting worker #%d on schedule', worker.id);
} else {
  cluster.fork();
  return;
}
The setImmediate deferral that gives in-flight IPC messages one tick to arrive only applies to the non-scheduled-restart branch. If a scheduled-restart worker happens to exit in the same very-narrow window that a startup-failure IPC message is in-flight (i.e. isAbortingForStartupFailure is still false at the synchronous check on line 135 but the message arrives before the next tick), this branch will fork a new worker before the abort kicks in.
In practice this race is almost impossible — scheduled restarts fire minutes after startup — but it's worth a comment so future readers don't wonder why the parity is missing:
- if (worker.isScheduledRestart) {
-   log.info('Restarting worker #%d on schedule', worker.id);
- } else {
-   cluster.fork();
-   return;
- }
+ if (worker.isScheduledRestart) {
+   log.info('Restarting worker #%d on schedule', worker.id);
+   // Scheduled restarts happen well after startup (order of minutes), so the
+   // startup-failure window has long closed; no setImmediate deferral needed.
+   cluster.fork();
+   return;
+ }
for (let i = 0; i < workersCount; i += 1) {
  cluster.fork();
}

// Listen for dying workers:
cluster.on('exit', (worker) => {
Minor ordering note: the exit listener is registered after workers are forked. A worker that fails synchronously and exits before the master reaches cluster.on('exit', …) (line 131) would be silently missed — no refork, no abort. In practice this is vanishingly unlikely (the forked process needs its own event loop to start and crash), but registering the listener before cluster.fork() (as is already done for cluster.on('message', …)) would make the intent clearer and eliminate the gap entirely.
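If that parity is wanted, the reorder is small (sketch; the handler names are illustrative):

```ts
import cluster, { type Worker } from 'cluster';

declare const workersCount: number;
declare function handleWorkerMessage(worker: Worker, message: unknown): void;
declare function handleWorkerExit(worker: Worker): void;

// Register handlers first so even an improbably fast worker exit is observed.
cluster.on('message', handleWorkerMessage);
cluster.on('exit', handleWorkerExit);

for (let i = 0; i < workersCount; i += 1) {
  cluster.fork();
}
```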
Review

The fix is well-structured. The core approach — workers send a typed IPC message before exiting, and the master arms an abort flag then defers classification of unexpected exits by one event-loop tick with `setImmediate` — is sound.

Findings
- One-line nit (configBuilder.ts)
- Design gap (master.ts — low risk)
- Listener ordering (master.ts — cosmetic)

Summary
The logic is correct, the edge cases are handled, and the tests validate the key scenarios. The issues above are all minor. No blockers.
…e-dedup * origin/main: (51 commits) Fix Content-Length mismatch and null renderingRequest errors in node renderer (#3069) Improve memory debugging docs with simpler heap snapshot approach (#3072) Enforce strict version matching in doctor and recommend doctor on errors (#3070) Remove immediate_hydration feature from codebase (#2834) Fix infinite fork loop when node renderer worker fails to bind port (#2881) Fix TanStack Router SSR hydration mismatch in Pro async path (#2932) Improve node renderer error messages for malformed render requests (#3068) Add interactive mode prompt to create-react-on-rails-app (#3063) docs: replace hardcoded version numbers with unversioned install commands (#2893) Remove Contributing section from docs sidebar (#3064) Consolidate docs comparison pages into single evaluation entry (#3065) Docs route cleanup: canonicalize worst verbose URL slugs (#3067) Add agent summary blocks to high-value docs pages (#3066) Fix brittle positional assertions in create-app tests (#2923) Auto-resolve renderer password from ENV in Rails Pro (#2921) Standardize bundle env vars from =yes to =true (#2925) Fix Pro generator multiline and template-literal rewrites (#2918) Fix spec/dummy Procfile.dev SERVER_BUNDLE_ONLY to match template convention (#2922) Add llms docs entry points (#2916) Bump version to 16.6.0.rc.0 ... # Conflicts: # packages/create-react-on-rails-app/src/create-app.ts
…lacement-2835-main * origin/main: (44 commits) Fix Content-Length mismatch and null renderingRequest errors in node renderer (#3069) Improve memory debugging docs with simpler heap snapshot approach (#3072) Enforce strict version matching in doctor and recommend doctor on errors (#3070) Remove immediate_hydration feature from codebase (#2834) Fix infinite fork loop when node renderer worker fails to bind port (#2881) Fix TanStack Router SSR hydration mismatch in Pro async path (#2932) Improve node renderer error messages for malformed render requests (#3068) Add interactive mode prompt to create-react-on-rails-app (#3063) docs: replace hardcoded version numbers with unversioned install commands (#2893) Remove Contributing section from docs sidebar (#3064) Consolidate docs comparison pages into single evaluation entry (#3065) Docs route cleanup: canonicalize worst verbose URL slugs (#3067) Add agent summary blocks to high-value docs pages (#3066) Fix brittle positional assertions in create-app tests (#2923) Auto-resolve renderer password from ENV in Rails Pro (#2921) Standardize bundle env vars from =yes to =true (#2925) Fix Pro generator multiline and template-literal rewrites (#2918) Fix spec/dummy Procfile.dev SERVER_BUNDLE_ONLY to match template convention (#2922) Add llms docs entry points (#2916) Bump version to 16.6.0.rc.0 ... # Conflicts: # CHANGELOG.md
### Summary Adds missing changelog entries to the `[Unreleased]` section for PRs merged since `v16.6.0.rc.0`. Analyzed 19 commits on `origin/main` since the last tag, identified 1 missing user-visible PR (#3069), and skipped 10 non-user-visible PRs (docs, tests, internal refactors). **New entries added:** - **Added**: `[Pro] Configurable HTTP keep-alive timeout for node renderer connections` (PR #3069) - **Fixed**: `[Pro] Fixed SSR failures from stale persistent HTTP/2 connections` (PR #3069) **Entries already present** (added in prior changelog updates): PRs #2834, #2881, #2918, #2921, #2923, #2932, #3063, #3068, #3070. **Skipped** (not user-visible): #2893 (docs), #2916 (docs), #2922 (test fix), #2923 (test fix), #2925 (internal refactor), #3064 (docs), #3065 (docs), #3066 (docs), #3067 (docs), #3072 (docs). ### Pull Request checklist - [x] ~Add/update test to cover these changes~ - [x] ~Update documentation~ - [x] Update CHANGELOG file ### Other Information No code changes — CHANGELOG.md only. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Low risk because this PR only changes documentation (`CHANGELOG.md`) and does not modify runtime code. > > **Overview** > Updates `CHANGELOG.md` *[Unreleased]* to include missing Pro release notes for PR `#3069`, documenting the new `renderer_http_keep_alive_timeout` config and the associated fix for SSR failures from stale persistent HTTP/2 connections to the node renderer. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 8d1a480. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Interactive mode selection prompt for CLI tool when no explicit mode is specified * Configurable keep-alive timeout setting for Pro users (default 30 seconds) * **Bug Fixes** * Enhanced validation and error handling for invalid request payloads * Improved template literal handling in code generation * Better HTTP connection stability with enhanced diagnostic messaging <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…2881) ## Summary When a worker fails during app.listen() (e.g. EADDRINUSE), the master previously reforked unconditionally, causing an infinite fork/crash loop. Workers now send a WORKER_STARTUP_FAILURE IPC message to the master before exiting. The master sets an abort flag and exits with a clear error message instead of reforking. Scheduled restarts and runtime crashes continue to refork as before. Based on #2843 by @Yassa-hue — recreated on the main repo (not a fork) so CI runs with full secrets. Closes #2843 fix issue: shakacode/react_on_rails_pro#582 ## Test plan - [ ] Verify new unit tests pass (`masterStartupFailure.test.ts`, `workerStartupFailure.test.ts`) - [ ] Confirm EADDRINUSE scenario aborts master instead of infinite reforking - [ ] Confirm scheduled restarts and runtime crashes still refork as before 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches clustered process lifecycle and shutdown behavior in the Pro Node renderer; a mistake could cause premature master exits or missed reforks in edge cases despite added test coverage. > > **Overview** > Prevents the Pro Node renderer master process from entering an infinite refork loop when a worker fails during `app.listen()` (e.g., `EADDRINUSE`). Workers now report a structured `WORKER_STARTUP_FAILURE` IPC message before exiting, and the master captures the first failure, disconnects workers, and exits with a clear error instead of reforking. > > Also adds `port` coercion + TCP-range validation in `configBuilder`, and introduces focused unit tests covering startup-failure messaging, master abort semantics, and the “wait one tick” behavior to distinguish startup failures from runtime crashes. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit ca6ac01. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Yassa-hue <yassatamer99@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>



Summary
When a worker fails during app.listen() (e.g. EADDRINUSE), the master previously reforked unconditionally, causing an infinite fork/crash loop.
Workers now send a WORKER_STARTUP_FAILURE IPC message to the master before exiting. The master sets an abort flag and exits with a clear error message instead of reforking. Scheduled restarts and runtime crashes continue to refork as before.
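For reference, the IPC payload carries enough context for the master to format a useful error. The field list below is inferred from the review discussion (the literal constant value is an assumption); the authoritative definition lives in `packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts`:

```ts
export const WORKER_STARTUP_FAILURE = 'WORKER_STARTUP_FAILURE' as const;

export interface WorkerStartupFailureMessage {
  type: typeof WORKER_STARTUP_FAILURE;
  stage: 'listen';
  code?: string; // e.g. 'EADDRINUSE', 'EACCES'
  host?: string; // may be undefined when RENDERER_HOST is unset
  port: number;
  message: string;
}
```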
Based on #2843 by @Yassa-hue — recreated on the main repo (not a fork) so CI runs with full secrets.
Closes #2843
Fixes shakacode/react_on_rails_pro#582
Test plan
- [ ] Verify new unit tests pass (`masterStartupFailure.test.ts`, `workerStartupFailure.test.ts`)
- [ ] Confirm EADDRINUSE scenario aborts master instead of infinite reforking
- [ ] Confirm scheduled restarts and runtime crashes still refork as before

🤖 Generated with Claude Code
Note
Medium Risk
Touches clustered process lifecycle and shutdown behavior in the Pro Node renderer; a mistake could cause premature master exits or missed reforks in edge cases despite added test coverage.
Overview
Prevents the Pro Node renderer master process from entering an infinite refork loop when a worker fails during `app.listen()` (e.g., `EADDRINUSE`). Workers now report a structured `WORKER_STARTUP_FAILURE` IPC message before exiting, and the master captures the first failure, disconnects workers, and exits with a clear error instead of reforking.

Also adds `port` coercion + TCP-range validation in `configBuilder`, and introduces focused unit tests covering startup-failure messaging, master abort semantics, and the "wait one tick" behavior to distinguish startup failures from runtime crashes.

Reviewed by Cursor Bugbot for commit ca6ac01. Bugbot is set up for automated code reviews on this repo. Configure here.