Commit bdb839d

Committed by justin808, yassa-tamer, and claude
Fix infinite fork loop when node renderer worker fails to bind port (#2881)
## Summary

When a worker fails during `app.listen()` (e.g. `EADDRINUSE`), the master previously reforked unconditionally, causing an infinite fork/crash loop. Workers now send a `WORKER_STARTUP_FAILURE` IPC message to the master before exiting. The master sets an abort flag and exits with a clear error message instead of reforking. Scheduled restarts and runtime crashes continue to refork as before.

Based on #2843 by @Yassa-hue — recreated on the main repo (not a fork) so CI runs with full secrets. Closes #2843. Fixes shakacode/react_on_rails_pro#582.

## Test plan

- [ ] Verify new unit tests pass (`masterStartupFailure.test.ts`, `workerStartupFailure.test.ts`)
- [ ] Confirm the `EADDRINUSE` scenario aborts the master instead of reforking infinitely
- [ ] Confirm scheduled restarts and runtime crashes still refork as before

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Touches clustered process lifecycle and shutdown behavior in the Pro Node renderer; a mistake could cause premature master exits or missed reforks in edge cases despite added test coverage.
>
> **Overview**
> Prevents the Pro Node renderer master process from entering an infinite refork loop when a worker fails during `app.listen()` (e.g., `EADDRINUSE`). Workers now report a structured `WORKER_STARTUP_FAILURE` IPC message before exiting, and the master captures the first failure, disconnects workers, and exits with a clear error instead of reforking.
>
> Also adds `port` coercion + TCP-range validation in `configBuilder`, and introduces focused unit tests covering startup-failure messaging, master abort semantics, and the "wait one tick" behavior to distinguish startup failures from runtime crashes.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit ca6ac01.</sup>

Co-authored-by: Yassa-hue <yassatamer99@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: 57220dd

9 files changed: 764 additions & 6 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -31,6 +31,7 @@ After a release, run `/update-changelog` in Claude Code to analyze commits, writ
 #### Fixed
 
 - **[Pro]** **Fixed TanStack Router SSR hydration mismatches in the async path**: Client hydration now restores server match data before first render, uses `RouterProvider` directly to match the server-rendered tree, and stops the post-hydration load when a custom `router.options.hydrate` callback fails instead of continuing with partially hydrated client state. [PR 2932](https://github.com/shakacode/react_on_rails/pull/2932) by [justin808](https://github.com/justin808).
+- **[Pro] Fixed infinite fork loop when node renderer worker fails to bind port**: When a worker failed during `app.listen()` (e.g., `EADDRINUSE`), the master previously reforked unconditionally, causing an infinite fork/crash loop that consumed CPU and filled logs. Workers now send a `WORKER_STARTUP_FAILURE` IPC message to the master before exiting; the master sets an abort flag and exits with a clear error message instead of reforking. Scheduled restarts and runtime crashes continue to refork as before. [PR 2881](https://github.com/shakacode/react_on_rails/pull/2881) by [justin808](https://github.com/justin808).
 
 ### [16.6.0.rc.0] - 2026-04-01
```
packages/react-on-rails-pro-node-renderer/src/master.ts

Lines changed: 64 additions & 4 deletions

```diff
@@ -10,6 +10,7 @@ import { buildConfig, Config, logSanitizedConfig } from './shared/configBuilder.
 import restartWorkers from './master/restartWorkers.js';
 import * as errorReporter from './shared/errorReporter.js';
 import { getLicenseStatus } from './shared/licenseValidator.js';
+import { isWorkerStartupFailureMessage, type WorkerStartupFailureMessage } from './shared/workerMessages.js';
 
 const MILLISECONDS_IN_MINUTE = 60000;
 // How often to scan for orphaned upload directories.
@@ -77,22 +78,81 @@ export default function masterRun(runningConfig?: Partial<Config>) {
     })();
   }, ORPHAN_CLEANUP_INTERVAL_MS);
 
+  let isAbortingForStartupFailure = false;
+  let fatalStartupFailure: { workerId: number; failure: WorkerStartupFailureMessage } | null = null;
+  let hasInitiatedShutdown = false;
+
+  const abortForStartupFailure = (): boolean => {
+    if (!(isAbortingForStartupFailure && fatalStartupFailure)) return false;
+
+    if (!hasInitiatedShutdown) {
+      hasInitiatedShutdown = true;
+      // Note: the exiting worker may differ from the one that sent the
+      // failure message if multiple workers exit in rapid succession.
+      // We always report the first failure received.
+      const { failure, workerId: failedWorkerId } = fatalStartupFailure;
+      const msg =
+        failure.code === 'EADDRINUSE'
+          ? `Node renderer startup failed: ${failure.host}:${failure.port} is already in use`
+          : `Node renderer startup failed in worker ${failedWorkerId}: ${failure.message}`;
+
+      errorReporter.message(msg);
+      // Disconnect all live workers so they release their ports before the
+      // master exits. cluster.disconnect() is async — the callback fires
+      // once every worker has disconnected. A hard-deadline timer guarantees
+      // the master still exits if a worker is stuck (leaked handle, blocking
+      // syscall, etc.), following the same pattern as restartWorkers.ts.
+      const MASTER_SHUTDOWN_TIMEOUT_MS = 5000;
+      const shutdownTimer = setTimeout(() => process.exit(1), MASTER_SHUTDOWN_TIMEOUT_MS);
+      if (typeof shutdownTimer.unref === 'function') shutdownTimer.unref();
+      cluster.disconnect(() => {
+        clearTimeout(shutdownTimer);
+        process.exit(1);
+      });
+    }
+
+    return true;
+  };
+
+  cluster.on('message', (worker, message) => {
+    // Check the abort flag first to short-circuit the type-guard on every
+    // ordinary IPC message once we are already aborting.
+    if (isAbortingForStartupFailure || !isWorkerStartupFailureMessage(message)) return;
+
+    isAbortingForStartupFailure = true;
+    fatalStartupFailure = { workerId: worker.id, failure: message };
+  });
+
   for (let i = 0; i < workersCount; i += 1) {
     cluster.fork();
   }
 
   // Listen for dying workers:
   cluster.on('exit', (worker) => {
+    // Once a startup failure has been detected, abort regardless of whether
+    // this particular exit was from the failing worker, a scheduled restart,
+    // or an unrelated crash. Don't fork any more workers.
+    if (abortForStartupFailure()) {
+      return;
+    }
+
     if (worker.isScheduledRestart) {
       log.info('Restarting worker #%d on schedule', worker.id);
-    } else {
+      cluster.fork();
+      return;
+    }
+
+    // Give in-flight startup-failure IPC messages one event-loop turn to be
+    // processed before classifying this as an ordinary runtime crash.
+    setImmediate(() => {
+      if (abortForStartupFailure()) return;
+
       // TODO: Track last rendering request per worker.id
       // TODO: Consider blocking a given rendering request if it kills a worker more than X times
       const msg = `Worker ${worker.id} died UNEXPECTEDLY :(, restarting`;
       errorReporter.message(msg);
-    }
-    // Replace the dead worker:
-    cluster.fork();
+      cluster.fork();
+    });
   });
 
   // Schedule regular restarts of workers
```

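The master-side bookkeeping above is easier to follow in isolation. Below is a condensed, hypothetical model of the first-failure-wins recording and the `EADDRINUSE` message special-casing; the `StartupAbortTracker` class and its names are invented for this sketch and are not part of the renderer.

```typescript
// Hypothetical miniature of the master's abort bookkeeping (names invented here).
type Failure = { workerId: number; code?: string; host: string; port: number; message: string };

class StartupAbortTracker {
  private failure: Failure | null = null;

  // Master 'message' handler path: only the first failure is kept, so a cascade
  // of follow-on worker exits cannot overwrite the root-cause report.
  record(failure: Failure): void {
    if (this.failure === null) this.failure = failure;
  }

  get aborting(): boolean {
    return this.failure !== null;
  }

  // Mirrors the EADDRINUSE special-casing in the real exit handler.
  describe(): string | null {
    if (this.failure === null) return null;
    const f = this.failure;
    return f.code === 'EADDRINUSE'
      ? `Node renderer startup failed: ${f.host}:${f.port} is already in use`
      : `Node renderer startup failed in worker ${f.workerId}: ${f.message}`;
  }
}

const tracker = new StartupAbortTracker();
tracker.record({ workerId: 1, code: 'EADDRINUSE', host: '0.0.0.0', port: 3800, message: 'listen EADDRINUSE' });
tracker.record({ workerId: 2, host: '0.0.0.0', port: 3800, message: 'second failure, ignored' });
console.log(tracker.describe()); // Node renderer startup failed: 0.0.0.0:3800 is already in use
```

Keeping only the first failure is the design choice that makes the abort message deterministic even when several workers race to bind the same port.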
packages/react-on-rails-pro-node-renderer/src/shared/configBuilder.ts

Lines changed: 18 additions & 0 deletions

```diff
@@ -142,6 +142,13 @@ function logLevel(level: string): LevelWithSilent {
   }
 }
 
+function validatePort(port: number): string | null {
+  if (!Number.isInteger(port) || !Number.isFinite(port) || port < 0 || port > 65535) {
+    return `RENDERER_PORT must be an integer between 0 and 65535. Received: ${String(port)}`;
+  }
+  return null;
+}
+
 function normalizedRuntimeEnvs() {
   return [env.RAILS_ENV, env.NODE_ENV]
     .filter((value): value is string => Boolean(value))
@@ -380,6 +387,17 @@ export function buildConfig(providedUserConfig?: Partial<Config>): Config {
     }
   });
 
+  // Coerce port to a number — user configs frequently pass env-derived strings
+  // (e.g. `port: env.RENDERER_PORT || 3800` yields the string "3800").
+  // eslint-disable-next-line @typescript-eslint/no-unnecessary-type-conversion -- runtime value may be string despite the type
+  config.port = Number(config.port);
+
+  const portValidationError = validatePort(config.port);
+  if (portValidationError) {
+    log.error(portValidationError);
+    process.exit(1);
+  }
+
   if (
     'honeybadgerApiKey' in config ||
     'sentryDsn' in config ||
```
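The coerce-then-validate flow can be sketched standalone. `validatePort` mirrors the diff's logic; `coercePort` is a wrapper name invented for this sketch rather than a function in the codebase.

```typescript
// validatePort mirrors the rule added in configBuilder; Number.isInteger
// already returns false for NaN and Infinity, so no separate finiteness check
// is needed here.
function validatePort(port: number): string | null {
  if (!Number.isInteger(port) || port < 0 || port > 65535) {
    return `RENDERER_PORT must be an integer between 0 and 65535. Received: ${String(port)}`;
  }
  return null;
}

// Hypothetical wrapper: coerce first (env vars arrive as strings), then validate.
function coercePort(raw: unknown): { port: number; error: string | null } {
  const port = Number(raw); // "3800" -> 3800; "abc" -> NaN (rejected by validatePort)
  return { port, error: validatePort(port) };
}

console.log(coercePort('3800').port); // 3800
console.log(coercePort(0).error);     // null (port 0 means "pick an ephemeral port")
console.log(coercePort(70000).error !== null); // true
```

Coercing before validating is what lets the same error message cover both out-of-range numbers and non-numeric strings such as `"abc"` (which coerce to `NaN`).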
packages/react-on-rails-pro-node-renderer/src/shared/workerMessages.ts (new file)

Lines changed: 34 additions & 0 deletions

```ts
export const WORKER_STARTUP_FAILURE = 'NODE_RENDERER_WORKER_STARTUP_FAILURE' as const;

export interface WorkerStartupFailureMessage {
  type: typeof WORKER_STARTUP_FAILURE;
  stage: 'listen';
  code?: string;
  errno?: number;
  syscall?: string;
  host: string;
  port: number;
  message: string;
}

export function isWorkerStartupFailureMessage(value: unknown): value is WorkerStartupFailureMessage {
  if (typeof value !== 'object' || value === null) {
    return false;
  }

  const message = value as Partial<WorkerStartupFailureMessage>;

  // stage: 'listen' is the only supported stage today. To handle pre-listen
  // failures (e.g. plugin registration), add a new stage value here and
  // update the master handler accordingly.
  return (
    message.type === WORKER_STARTUP_FAILURE &&
    message.stage === 'listen' &&
    typeof message.host === 'string' &&
    typeof message.port === 'number' &&
    Number.isInteger(message.port) &&
    message.port >= 0 &&
    message.port <= 65535 &&
    typeof message.message === 'string'
  );
}
```

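Cluster IPC delivers JSON-serialized payloads typed as `unknown`, which is why a runtime guard is needed at all. The following standalone sketch (a restatement for illustration, not an import of the real module) exercises the same shape checks against well-formed and malformed payloads:

```typescript
// Restated guard: IPC messages cannot be trusted at compile time, so every
// field the master relies on is re-checked at runtime.
const WORKER_STARTUP_FAILURE = 'NODE_RENDERER_WORKER_STARTUP_FAILURE';

type Msg = { type: string; stage: string; host: string; port: number; message: string };

function isStartupFailure(value: unknown): value is Msg {
  if (typeof value !== 'object' || value === null) return false;
  const m = value as Partial<Msg>;
  return (
    m.type === WORKER_STARTUP_FAILURE &&
    m.stage === 'listen' &&
    typeof m.host === 'string' &&
    typeof m.port === 'number' &&
    Number.isInteger(m.port) &&
    m.port >= 0 &&
    m.port <= 65535 &&
    typeof m.message === 'string'
  );
}

const good = { type: WORKER_STARTUP_FAILURE, stage: 'listen', host: '0.0.0.0', port: 3800, message: 'listen EADDRINUSE' };
console.log(isStartupFailure(good)); // true
console.log(isStartupFailure({ type: WORKER_STARTUP_FAILURE })); // false (missing fields)
console.log(isStartupFailure('not an object')); // false
```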
packages/react-on-rails-pro-node-renderer/src/worker.ts

Lines changed: 3 additions & 2 deletions

```diff
@@ -23,6 +23,7 @@ import {
   handleNewBundlesProvided,
 } from './worker/handleRenderRequest.js';
 import handleGracefulShutdown from './worker/handleGracefulShutdown.js';
+import { handleStartupListenError } from './worker/startupErrorHandler.js';
 import {
   handleIncrementalRenderRequest,
   type IncrementalRenderInitialRequest,
@@ -622,8 +623,8 @@ export default function run(config: Partial<Config>) {
   if (workersCount === 0 || cluster.isWorker) {
     app.listen({ port, host }, (err, address) => {
       if (err) {
-        log.error({ err, host, port }, 'Node renderer failed to start');
-        process.exit(1);
+        handleStartupListenError({ err, host, port });
+        return;
       }
       const workerName = worker ? `worker #${worker.id}` : 'master (single-process)';
       log.info({ workerName, address }, 'Node renderer listening');
```
packages/react-on-rails-pro-node-renderer/src/worker/startupErrorHandler.ts (new file)

Lines changed: 65 additions & 0 deletions

```ts
import cluster from 'cluster';
import log from '../shared/log.js';
import { WORKER_STARTUP_FAILURE, type WorkerStartupFailureMessage } from '../shared/workerMessages.js';

export type StartupListenErrorHandlerOptions = {
  err: Error;
  host: string;
  port: number;
  isWorker?: boolean;
  send?: NodeJS.Process['send'];
  exit?: NodeJS.Process['exit'];
};

export function handleStartupListenError({
  err,
  host,
  port,
  isWorker = cluster.isWorker,
  send,
  exit,
}: StartupListenErrorHandlerOptions) {
  const sendFn = send ?? process.send?.bind(process);
  const exitFn = exit ?? ((code?: number) => process.exit(code));

  log.error({ err, host, port }, 'Node renderer failed to start');

  if (isWorker) {
    if (!sendFn) {
      log.error('Cluster worker has no IPC channel; cannot notify master of startup failure');
      exitFn(1);
      return;
    }

    const startupFailure: WorkerStartupFailureMessage = {
      type: WORKER_STARTUP_FAILURE,
      stage: 'listen',
      code: (err as NodeJS.ErrnoException).code,
      errno: (err as NodeJS.ErrnoException).errno,
      syscall: (err as NodeJS.ErrnoException).syscall,
      host,
      port,
      message: err.message,
    };
    try {
      let exited = false;
      const doExit = (sendErr?: Error | null) => {
        if (exited) return;
        exited = true;
        if (sendErr) log.error({ err: sendErr }, 'Failed to send startup failure message to master');
        exitFn(1);
      };
      sendFn(startupFailure, undefined, undefined, doExit);
      // Safety net: if the IPC channel is half-broken the callback may never
      // fire, leaving this worker alive indefinitely. Force exit after a timeout.
      const IPC_SEND_TIMEOUT_MS = 2000;
      const timer = setTimeout(() => doExit(), IPC_SEND_TIMEOUT_MS);
      if (typeof timer.unref === 'function') timer.unref();
    } catch (sendErr) {
      log.error({ err: sendErr as Error }, 'Failed to send startup failure message to master');
      exitFn(1);
    }
  } else {
    exitFn(1);
  }
}
```

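The send-callback-plus-timeout pattern in this handler guarantees the worker exits exactly once, whether the IPC callback fires first, the safety-net timer fires first, or both fire. A standalone sketch under invented names:

```typescript
// Isolated sketch of the exit-exactly-once pattern: whichever of the IPC send
// callback or the safety-net timer runs first wins; the other becomes a no-op.
type ExitFn = (code: number) => void;

function sendThenExit(
  send: (msg: unknown, cb: (err?: Error | null) => void) => void,
  exit: ExitFn,
  timeoutMs = 2000,
): void {
  let exited = false;
  const doExit = (err?: Error | null) => {
    if (exited) return; // the second caller (callback or timer) is ignored
    exited = true;
    if (err) console.error('send failed:', err.message);
    exit(1);
  };
  send({ type: 'STARTUP_FAILURE' }, doExit);
  // unref() keeps this safety-net timer from holding the process open.
  const timer = setTimeout(() => doExit(), timeoutMs);
  if (typeof timer.unref === 'function') timer.unref();
}

// Demo with a synchronous fake transport: exit(1) is recorded exactly once.
const codes: number[] = [];
sendThenExit((_msg, cb) => cb(null), (code) => codes.push(code));
console.log(codes); // [ 1 ]
```

Making `send` and `exit` injectable, as the real handler does, is what lets the unit tests exercise this path with a fake transport instead of a live cluster IPC channel.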
packages/react-on-rails-pro-node-renderer/tests/configBuilder.test.ts

Lines changed: 34 additions & 0 deletions

```diff
@@ -1,6 +1,7 @@
 describe('configBuilder', () => {
   const envVarsToRestore = [
     'RENDERER_HOST',
+    'RENDERER_PORT',
     'NODE_ENV',
     'RENDERER_PASSWORD',
     'RAILS_ENV',
@@ -113,6 +114,39 @@ describe('configBuilder', () => {
     expect(finalSettings.password).toBe('<EMPTY STRING>');
   });
 
+  describe('port validation', () => {
+    it('throws when configured port is outside the valid TCP range', () => {
+      process.env.NODE_ENV = 'development';
+      process.env.RAILS_ENV = 'development';
+      const processExit = mockProcessExit();
+      const { buildConfig, error } = loadConfigBuilderWithMockedLogger();
+
+      expect(() => buildConfig({ port: 70000 })).toThrow('process.exit: 1');
+      expect(processExit).toHaveBeenCalledWith(1);
+      expect(error).toHaveBeenCalledWith(
+        'RENDERER_PORT must be an integer between 0 and 65535. Received: 70000',
+      );
+    });
+
+    it('allows port 0 for ephemeral-port test setups', () => {
+      process.env.NODE_ENV = 'development';
+      process.env.RAILS_ENV = 'development';
+      const { buildConfig } = loadConfigBuilderWithMockedLogger();
+
+      expect(buildConfig({ port: 0 }).port).toBe(0);
+    });
+
+    it('coerces a string port from env vars to a number', () => {
+      process.env.NODE_ENV = 'development';
+      process.env.RAILS_ENV = 'development';
+      const { buildConfig } = loadConfigBuilderWithMockedLogger();
+
+      // Simulates `port: env.RENDERER_PORT || 3800` where env var is the string "3800"
+      const config = buildConfig({ port: '3800' as unknown as number });
+      expect(config.port).toBe(3800);
+    });
+  });
+
   describe('password validation in production-like environments', () => {
     it('throws when no password is set in production', () => {
       process.env.NODE_ENV = 'production';
```
