Skip to content

fix(rivetkit-core): orphan engine process and detect crashes#4733

Closed
NathanFlurry wants to merge 1 commit into04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_errorfrom
04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes
Closed

fix(rivetkit-core): orphan engine process and detect crashes#4733
NathanFlurry wants to merge 1 commit into04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_errorfrom
04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 24, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from fbf0f28 to b73cea0 Compare April 24, 2026 09:52
@NathanFlurry NathanFlurry mentioned this pull request Apr 24, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

PR Review: fix(rivetkit-core): orphan engine process and detect crashes

Overview

This PR refactors engine_process.rs to implement intentional process orphaning of the rivet-engine subprocess. The goal is for the engine to outlive the rivetkit host so in-flight actor state survives dev-server hot reloads. The design is architecturally sound and the module-level doc comment explains the contract well.

Key changes:

  • Removes kill_on_drop(true) - engine now survives host death
  • Routes engine stdio to timestamped log files (file fds survive host pipe closure)
  • Uses process_group(0) on Unix to isolate engine from terminal signals (Ctrl+C, SIGHUP)
  • Adds probe_existing_engine() to detect and reuse a running engine
  • Introduces persistent stable storage at ~/.rivetkit/ (replaces random UUID temp dirs)
  • Adds crash-detection watcher task (spawn_engine_watcher)
  • Removes the shutdown() method entirely
  • Adds PortOccupied error variant

Issues

Bug: terminate_failed_spawn() leaks the process on timeout

When the 5-second timeout fires, the function returns Ok(()) after dropping child - but with kill_on_drop=false (the new default), dropping Child does NOT kill the process. The half-started engine leaks and a subsequent start() call will hit a port conflict. child.start_kill() should be called before returning Ok(()) here.

Bug: probe_existing_engine() can spuriously block start when runtime field is absent

If an older engine version does not include the runtime field in its health response (it is Option), this produces a false PortOccupied error against a healthy engine that should be reused. Consider treating None the same as Some(ENGINE_RUNTIME), or document the minimum engine version that includes this field.

Unintuitive env var behavior in storage_root()

Setting RIVETKIT_STORAGE_PATH=~/data puts storage at ~/data/.rivetkit, not ~/data, because .rivetkit is unconditionally joined onto the env var path. Either treat the env var as the direct storage root or rename it to make the parent-dir semantics explicit.

Surprising current_dir() fallback in storage_root()

If HOME is unset (containers, CI), state silently lands in whatever the cwd is at startup. Failing with an actionable error pointing to RIVETKIT_STORAGE_PATH would be more predictable.


Suggestions

Collapse redundant match arms in probe_existing_engine - Some(RIVETKIT_RUNTIME) and Some(other) do identical things and can be collapsed with a wildcard arm that uses health.runtime.as_deref().unwrap_or("unknown") for the PortOccupied runtime field.

Add pid to terminate_failed_spawn logs - capture child.id() before start_kill and include it in every log line for easier correlation with ps/journalctl. The original code logged pid here.

Use sub-second precision in log_timestamp() - two spawns within the same second produce identically named log files. With append mode they interleave silently. Using milliseconds or including the process pid in the filename would make filenames unique.

Log file permissions - log files are created with default OpenOptions permissions (mode 0666 before umask). Engine stderr may contain credentials or error details. Consider creating the logs directory with mode 0700 on Unix.


Correctness

The core orphaning contract is sound: kill_on_drop=false, process_group(0), and file-fd stdio together correctly implement the intentional-orphan pattern. The zombie-reaping comment in spawn_engine_watcher is accurate - calling child.wait() in the task reaps the process via waitpid.

Windows gap: process_group(0) is #[cfg(unix)] only. On Windows the engine may still be terminated through a job object when the parent exits. Worth documenting.


Test Coverage

No tests are added. Since this is a draft PR that may be intentional, but high-value cases to cover before merge:

  • probe_existing_engine reattaches to a healthy engine (Some(ENGINE_RUNTIME))
  • probe_existing_engine errors on unknown runtime (PortOccupied)
  • storage_root() env var override produces the expected path

Draft PR - the two most important issues to address before marking ready are the timeout-path process leak in terminate_failed_spawn and the None runtime false-positive in probe_existing_engine.

@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from b819d4e to c674f60 Compare April 24, 2026 10:19
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from b73cea0 to c59f860 Compare April 24, 2026 10:19
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from c674f60 to 60c167f Compare April 24, 2026 10:32
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from c59f860 to 315c2ce Compare April 24, 2026 10:32
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch 2 times, most recently from 5ba2e02 to ae1b9e4 Compare April 24, 2026 12:14
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_return_none_for_empty_inspector_state_to_avoid_frontend_cbor_decode_error branch from 3133b52 to 894be1e Compare April 24, 2026 12:32
@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from ae1b9e4 to 58ce7d0 Compare April 24, 2026 12:32
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant