Skip to content

Commit efc16a9

Browse files
authored
fix(engine): treat ffmpegStreamingTimeout as per-frame inactivity, not total render time (#901)
## Summary - Convert `streamingEncoder.ts`'s safety timer from a total-render hard cap to a per-frame inactivity timeout - Reset the timer only on `accepted === true` writes — buffered writes don't count as consumer progress - Update the `ffmpegStreamingTimeout` config doc to reflect the new semantics ## The bug The timer was set once at spawn and fired SIGTERM unconditionally at `ffmpegStreamingTimeout` ms — turning a "FFmpeg is hung" guard into a hard cap on total render duration. Slow-but-progressing captures (CI runner under load, large compositions, slower compositor paths after [#838](#838 always-clip change) regularly exceeded the 600s default and were killed mid-encode. The symptom surfaced as: ``` Streaming encode failed: FFmpeg exited with code 255 video:NNNkB audio:0kB ... [libx264 @ ...] frame I:3 Avg QP:12.91 size: 73263 [libx264 @ ...] frame P:431 Avg QP:14.72 size: 31633 ... [libx264 @ ...] kb/s:7661.05 Exiting normally, received signal 15. ``` libx264 had encoded most frames cleanly; SIGTERM arrived during the encode, libx264 printed its end-of-encode stats, and Node observed a non-zero exit. The `audio:0kB` in stderr is incidental — `streamingEncoder` is video-only; audio is muxed later in `assembleStage`. Downstream reproduction: `style-13-prod` fails deterministically in `heygen-com/hyperframes-internal` CI after bumping `@hyperframes/producer` from 0.6.7 → 0.6.10. Bisects to #838 widening the SDR capture path at dpr=1 — same composition shape, slower per-frame, total render now crosses 600s. ## The fix Convert the timer to a heartbeat: each `writeFrame` that goes through to the kernel pipe (i.e. `stdin.write` returns `true`) resets it. Only true hangs (no successful frame write for the timeout window) trip SIGTERM now; "slow but progressing" renders are unbounded. Crucially, the heartbeat does **not** reset on `accepted === false`. A `false` return means Node had to buffer the write because FFmpeg hasn't drained the pipe yet — that's not proof of consumer progress, just proof we produced. Without this distinction, a hung FFmpeg with a live Chrome would queue frames into Node's writable buffer indefinitely (no backpressure path back to the capture loop) and grow until OOM. In steady state with a slow-but-alive FFmpeg, writes alternate between `true` and `false` as the buffer drains and refills; the `true`s are enough to keep the heartbeat ticking. Renames are intentionally avoided — `ffmpegStreamingTimeout` keeps its name and `600_000` default; only the semantics changed. The config doc spells out the new behavior so downstream consumers know what 600s now means. ## Test plan - [x] **Slow-but-progressing capture** (`accepted=true`): 9× `writeFrame` at 900ms intervals (under the 1000ms threshold) — encoder stays alive through 8.1s. Stall past the threshold — SIGTERM fires. - [x] **Stalled FFmpeg with live producer** (`accepted=false`): override `stdin.write` to return false; pump 9× `writeFrame` at 900ms intervals. SIGTERM still fires inside the 1000ms window — buffered writes don't keep the heartbeat alive. - [x] Existing 33 tests in `streamingEncoder.test.ts` still pass - [x] Lint (`oxlint`) + format (`oxfmt --check`) clean - [ ] CI regression suite 🤖 Generated with [Claude Code](https://claude.com/claude-code)
1 parent 64b3ae7 commit efc16a9

3 files changed

Lines changed: 109 additions & 12 deletions

File tree

packages/engine/src/config.ts

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -94,7 +94,12 @@ export interface EngineConfig {
9494
ffmpegEncodeTimeout: number;
9595
/** Timeout for FFmpeg mux/faststart processes (ms). Default: 300_000 */
9696
ffmpegProcessTimeout: number;
97-
/** Timeout for FFmpeg streaming encode (ms). Default: 600_000 */
97+
/**
98+
* Inactivity timeout for FFmpeg streaming encode (ms). The timer resets on
99+
* every successful `writeFrame` call, so this caps the duration of a
100+
* single "no frame arrived" gap (capture hang, dead Chrome), not the total
101+
* render time. Default: 600_000 (10 minutes without any frame = dead).
102+
*/
98103
ffmpegStreamingTimeout: number;
99104

100105
// ── HDR ──────────────────────────────────────────────────────────────

packages/engine/src/services/streamingEncoder.test.ts

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -572,4 +572,71 @@ describe("spawnStreamingEncoder lifecycle and cleanup", () => {
572572
controller.abort();
573573
expect(proc.kill).not.toHaveBeenCalled();
574574
});
575+
576+
it("inactivity timeout fires only after a no-frame gap exceeds ffmpegStreamingTimeout", async () => {
577+
vi.useFakeTimers();
578+
try {
579+
const { spawn, calls } = createSpawnSpy();
580+
vi.resetModules();
581+
vi.doMock("child_process", () => ({ spawn }));
582+
583+
const { spawnStreamingEncoder } = await import("./streamingEncoder.js");
584+
const dir = mkdtempSync(join(tmpdir(), "se-heartbeat-"));
585+
const encoder = await spawnStreamingEncoder(join(dir, "out.mp4"), baseOptions, undefined, {
586+
ffmpegStreamingTimeout: 1000,
587+
});
588+
589+
const proc = calls[0]!.proc;
590+
591+
// Frames every 900ms — under the 1000ms inactivity threshold — should
592+
// keep resetting the timer. After 9× 900ms = 8.1s of "slow but
593+
// progressing" capture the encoder must still be alive. The old total-
594+
// render timeout would have fired SIGTERM at ~1000ms.
595+
for (let i = 0; i < 9; i++) {
596+
encoder.writeFrame(Buffer.from([i]));
597+
vi.advanceTimersByTime(900);
598+
}
599+
expect(proc.kill).not.toHaveBeenCalled();
600+
601+
// Now stall — no writeFrame for longer than the threshold. SIGTERM fires.
602+
vi.advanceTimersByTime(1100);
603+
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");
604+
} finally {
605+
vi.useRealTimers();
606+
}
607+
});
608+
609+
it("inactivity timeout still fires when stdin is backpressured (stalled ffmpeg, live producer)", async () => {
610+
vi.useFakeTimers();
611+
try {
612+
// Simulate the FFmpeg-hangs-but-Chrome-keeps-producing case: stdin.write
613+
// always returns false (Node has to buffer because ffmpeg isn't draining
614+
// the pipe). The heartbeat must NOT reset on those buffered writes —
615+
// otherwise a hung ffmpeg with a steady frame producer would never
616+
// SIGTERM and we'd grow Node's stdin buffer until OOM.
617+
const { spawn, calls } = createSpawnSpy();
618+
vi.resetModules();
619+
vi.doMock("child_process", () => ({ spawn }));
620+
621+
const { spawnStreamingEncoder } = await import("./streamingEncoder.js");
622+
const dir = mkdtempSync(join(tmpdir(), "se-backpressure-"));
623+
const encoder = await spawnStreamingEncoder(join(dir, "out.mp4"), baseOptions, undefined, {
624+
ffmpegStreamingTimeout: 1000,
625+
});
626+
627+
const proc = calls[0]!.proc;
628+
proc.stdin.write = (_chunk: Buffer) => false;
629+
630+
// Pump 9 frames at 900ms intervals — all returning false. The reset
631+
// should NOT fire (every write was buffered, not accepted), so the
632+
// 1000ms timer (last reset on spawn) elapses near the start.
633+
for (let i = 0; i < 9; i++) {
634+
encoder.writeFrame(Buffer.from([i]));
635+
vi.advanceTimersByTime(900);
636+
}
637+
expect(proc.kill).toHaveBeenCalledWith("SIGTERM");
638+
} finally {
639+
vi.useRealTimers();
640+
}
641+
});
575642
});

packages/engine/src/services/streamingEncoder.ts

Lines changed: 36 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -416,13 +416,26 @@ export async function spawnStreamingEncoder(
416416
}
417417
}
418418

419-
// Timeout safety
419+
// Inactivity timeout: fires only when no frame has been written for
420+
// `ffmpegStreamingTimeout` ms. A slow-but-progressing capture (e.g. a CI
421+
// runner under load) keeps resetting the timer on each writeFrame, so total
422+
// wall-clock render time is unbounded — only a true hang (Chrome dead,
423+
// capture stuck, no frames arriving) trips SIGTERM. The 600s default was
424+
// previously a total-render cap, which intermittently killed legitimate
425+
// slow renders mid-encode (FFmpeg got SIGTERM after most frames were sent;
426+
// libx264 printed its summary and exited 255, observable as
427+
// "Streaming encode failed: FFmpeg exited with code 255" with audio:0kB).
420428
const streamingTimeout = config?.ffmpegStreamingTimeout ?? DEFAULT_CONFIG.ffmpegStreamingTimeout;
421-
const timer = setTimeout(() => {
422-
if (exitStatus === "running") {
423-
ffmpeg.kill("SIGTERM");
424-
}
425-
}, streamingTimeout);
429+
let timer: NodeJS.Timeout | null = null;
430+
const resetTimer = () => {
431+
if (timer) clearTimeout(timer);
432+
timer = setTimeout(() => {
433+
if (exitStatus === "running") {
434+
ffmpeg.kill("SIGTERM");
435+
}
436+
}, streamingTimeout);
437+
};
438+
resetTimer();
426439

427440
const encoder: StreamingEncoder = {
428441
writeFrame: (buffer: Buffer): boolean => {
@@ -433,11 +446,20 @@ export async function spawnStreamingEncoder(
433446
// provided buffer and drain it asynchronously. The HDR path's compositor
434447
// reuses pre-allocated transOutput/normalCanvas buffers across frames,
435448
// so without this copy the pipe would read partially-overwritten data
436-
// and flicker. The SDR path doesn't invoke writeFrame at all (it pipes
437-
// PNG files via encodeFramesFromDir), so the memcpy here is HDR-only
438-
// and justified by correctness.
449+
// and flicker.
439450
const copy = Buffer.from(buffer);
440-
return ffmpeg.stdin.write(copy);
451+
const accepted = ffmpeg.stdin.write(copy);
452+
// Reset inactivity timer ONLY on `accepted === true`. `true` means the
453+
// write went through to the kernel pipe without buffering in Node —
454+
// proof FFmpeg is actually consuming. `false` means Node's writable
455+
// stream had to buffer (FFmpeg hasn't drained the pipe yet); we deliberately
456+
// don't reset on `false` so a hung FFmpeg with a still-producing Chrome
457+
// can't keep us alive forever while Node's stdin buffer grows to OOM. In
458+
// steady state with a slower-but-alive FFmpeg, writes alternate between
459+
// true and false as the buffer drains and refills; the trues are enough
460+
// to keep the heartbeat ticking.
461+
if (accepted) resetTimer();
462+
return accepted;
441463
},
442464

443465
close: async (): Promise<StreamingEncoderResult> => {
@@ -455,7 +477,10 @@ export async function spawnStreamingEncoder(
455477
// repeated calls. If you change this method, preserve idempotency or
456478
// a regression here will silently double-close ffmpeg and produce
457479
// harder-to-trace errors at the orchestrator layer.
458-
clearTimeout(timer);
480+
if (timer) {
481+
clearTimeout(timer);
482+
timer = null;
483+
}
459484
if (signal) signal.removeEventListener("abort", onAbort);
460485

461486
const stdin = ffmpeg.stdin;

0 commit comments

Comments
 (0)