fix(rivetkit-core): orphan engine process and detect crashes by NathanFlurry · Pull Request #4733 · rivet-dev/rivet

NathanFlurry · 2026-04-24T08:13:48Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NathanFlurry · 2026-04-24T08:14:33Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-04-24T09:55:51Z

PR Review: fix(rivetkit-core): orphan engine process and detect crashes

Overview

This PR refactors engine_process.rs to implement intentional process orphaning of the rivet-engine subprocess. The goal is for the engine to outlive the rivetkit host so in-flight actor state survives dev-server hot reloads. The design is architecturally sound and the module-level doc comment explains the contract well.

Key changes:

Removes kill_on_drop(true) - engine now survives host death
Routes engine stdio to timestamped log files (file fds survive host pipe closure)
Uses process_group(0) on Unix to isolate engine from terminal signals (Ctrl+C, SIGHUP)
Adds probe_existing_engine() to detect and reuse a running engine
Introduces persistent stable storage at ~/.rivetkit/ (replaces random UUID temp dirs)
Adds crash-detection watcher task (spawn_engine_watcher)
Removes the shutdown() method entirely
Adds PortOccupied error variant

Issues

Bug: terminate_failed_spawn() leaks the process on timeout

When the 5-second timeout fires, the function returns Ok(()) after dropping child - but with kill_on_drop=false (the new default), dropping Child does NOT kill the process. The half-started engine leaks and a subsequent start() call will hit a port conflict. child.start_kill() should be called before returning Ok(()) here.

Bug: probe_existing_engine() can spuriously block start when runtime field is absent

If an older engine version does not include the runtime field in its health response (it is Option), this produces a false PortOccupied error against a healthy engine that should be reused. Consider treating None the same as Some(ENGINE_RUNTIME), or document the minimum engine version that includes this field.

Unintuitive env var behavior in storage_root()

Setting RIVETKIT_STORAGE_PATH=~/data puts storage at ~/data/.rivetkit, not ~/data, because .rivetkit is unconditionally joined onto the env var path. Either treat the env var as the direct storage root or rename it to make the parent-dir semantics explicit.

Surprising current_dir() fallback in storage_root()

If HOME is unset (containers, CI), state silently lands in whatever the cwd is at startup. Failing with an actionable error pointing to RIVETKIT_STORAGE_PATH would be more predictable.

Suggestions

Collapse redundant match arms in probe_existing_engine - Some(RIVETKIT_RUNTIME) and Some(other) do identical things and can be collapsed with a wildcard arm that uses health.runtime.as_deref().unwrap_or("unknown") for the PortOccupied runtime field.

Add pid to terminate_failed_spawn logs - capture child.id() before start_kill and include it in every log line for easier correlation with ps/journalctl. The original code logged pid here.

Use sub-second precision in log_timestamp() - two spawns within the same second produce identically named log files. With append mode they interleave silently. Using milliseconds or including the process pid in the filename would make filenames unique.

Log file permissions - log files are created with default OpenOptions permissions (mode 0666 before umask). Engine stderr may contain credentials or error details. Consider creating the logs directory with mode 0700 on Unix.

Correctness

The core orphaning contract is sound: kill_on_drop=false, process_group(0), and file-fd stdio together correctly implement the intentional-orphan pattern. The zombie-reaping comment in spawn_engine_watcher is accurate - calling child.wait() in the task reaps the process via waitpid.

Windows gap: process_group(0) is #[cfg(unix)] only. On Windows the engine may still be terminated through a job object when the parent exits. Worth documenting.

Test Coverage

No tests are added. Since this is a draft PR that may be intentional, but high-value cases to cover before merge:

probe_existing_engine reattaches to a healthy engine (Some(ENGINE_RUNTIME))
probe_existing_engine errors on unknown runtime (PortOccupied)
storage_root() env var override produces the expected path

Draft PR - the two most important issues to address before marking ready are the timeout-path process leak in terminate_failed_spawn and the None runtime false-positive in probe_existing_engine.

NathanFlurry · 2026-04-24T12:54:29Z

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

This was referenced Apr 24, 2026

chore(rivetkit): add child spans for logging #4734

Closed

fix(rivetkit-sqlite): preserve text with embedded nul bytes #4735

Closed

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from 2a3027f to fbf0f28 Compare April 24, 2026 08:43

This was referenced Apr 24, 2026

fix(rivetkit-core): gate startup until runtime is ready #4736

Closed

fix(rivetkit): recover stale dynamic actors across sleep and wake #4737

Closed

test(rivetkit): relax sqlite sleep fixture timing #4738

Closed

chore: sql fuzz #4739

Closed

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from 1250cce to b819d4e Compare April 24, 2026 09:52

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from fbf0f28 to b73cea0 Compare April 24, 2026 09:52

NathanFlurry mentioned this pull request Apr 24, 2026

chore: fixes loop #4740

Draft

11 tasks

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from b819d4e to c674f60 Compare April 24, 2026 10:19

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from b73cea0 to c59f860 Compare April 24, 2026 10:19

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from c674f60 to 60c167f Compare April 24, 2026 10:32

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from c59f860 to 315c2ce Compare April 24, 2026 10:32

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch 2 times, most recently from 5ba2e02 to ae1b9e4 Compare April 24, 2026 12:14

fix(rivetkit-core): orphan engine process and detect crashes

58ce7d0

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from 3133b52 to 894be1e Compare April 24, 2026 12:32

NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from ae1b9e4 to 58ce7d0 Compare April 24, 2026 12:32

NathanFlurry closed this Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rivetkit-core): orphan engine process and detect crashes#4733

fix(rivetkit-core): orphan engine process and detect crashes#4733
NathanFlurry wants to merge 1 commit into04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_errorfrom
04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes

NathanFlurry commented Apr 24, 2026

Uh oh!

NathanFlurry commented Apr 24, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

NathanFlurry commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented Apr 24, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

NathanFlurry commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(rivetkit-core): orphan engine process and detect crashes

Overview

Issues

Bug: terminate_failed_spawn() leaks the process on timeout

Bug: probe_existing_engine() can spuriously block start when runtime field is absent

Unintuitive env var behavior in storage_root()

Surprising current_dir() fallback in storage_root()

Suggestions

Correctness

Test Coverage

Uh oh!

NathanFlurry commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NathanFlurry commented Apr 24, 2026 •

edited

Loading

claude Bot commented Apr 24, 2026 •

edited

Loading