fix(rivetkit-core): orphan engine process and detect crashes#4733
Conversation
2a3027f to
fbf0f28
Compare
1250cce to
b819d4e
Compare
fbf0f28 to
b73cea0
Compare
PR Review: fix(rivetkit-core): orphan engine process and detect crashesOverviewThis PR refactors engine_process.rs to implement intentional process orphaning of the rivet-engine subprocess. The goal is for the engine to outlive the rivetkit host so in-flight actor state survives dev-server hot reloads. The design is architecturally sound and the module-level doc comment explains the contract well. Key changes:
IssuesBug: terminate_failed_spawn() leaks the process on timeoutWhen the 5-second timeout fires, the function returns Ok(()) after dropping child - but with kill_on_drop=false (the new default), dropping Child does NOT kill the process. The half-started engine leaks and a subsequent start() call will hit a port conflict. child.start_kill() should be called before returning Ok(()) here. Bug: probe_existing_engine() can spuriously block start when runtime field is absentIf an older engine version does not include the runtime field in its health response (it is Option), this produces a false PortOccupied error against a healthy engine that should be reused. Consider treating None the same as Some(ENGINE_RUNTIME), or document the minimum engine version that includes this field. Unintuitive env var behavior in storage_root()Setting RIVETKIT_STORAGE_PATH= Surprising current_dir() fallback in storage_root()If HOME is unset (containers, CI), state silently lands in whatever the cwd is at startup. Failing with an actionable error pointing to RIVETKIT_STORAGE_PATH would be more predictable. SuggestionsCollapse redundant match arms in probe_existing_engine - Some(RIVETKIT_RUNTIME) and Some(other) do identical things and can be collapsed with a wildcard arm that uses Add pid to terminate_failed_spawn logs - capture child.id() before start_kill and include it in every log line for easier correlation with ps/journalctl. The original code logged pid here. Use sub-second precision in log_timestamp() - two spawns within the same second produce identically named log files. With append mode they interleave silently. Using milliseconds or including the process pid in the filename would make filenames unique. Log file permissions - log files are created with default OpenOptions permissions (mode 0666 before umask). Engine stderr may contain credentials or error details. Consider creating the logs directory with mode 0700 on Unix. CorrectnessThe core orphaning contract is sound: kill_on_drop=false, process_group(0), and file-fd stdio together correctly implement the intentional-orphan pattern. The zombie-reaping comment in spawn_engine_watcher is accurate - calling child.wait() in the task reaps the process via waitpid. Windows gap: process_group(0) is #[cfg(unix)] only. On Windows the engine may still be terminated through a job object when the parent exits. Worth documenting. Test CoverageNo tests are added. Since this is a draft PR that may be intentional, but high-value cases to cover before merge:
Draft PR - the two most important issues to address before marking ready are the timeout-path process leak in terminate_failed_spawn and the None runtime false-positive in probe_existing_engine. |
b819d4e to
c674f60
Compare
b73cea0 to
c59f860
Compare
c674f60 to
60c167f
Compare
c59f860 to
315c2ce
Compare
5ba2e02 to
ae1b9e4
Compare
3133b52 to
894be1e
Compare
ae1b9e4 to
58ce7d0
Compare
|
Landed in main via stack-merge fast-forward push. Commits are in main; closing to match. |

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: