server: fix stopping_thread hang on child process exit#22452
Open
xwinwin wants to merge 1 commit intoggml-org:masterfrom
Open
server: fix stopping_thread hang on child process exit#22452xwinwin wants to merge 1 commit intoggml-org:masterfrom
xwinwin wants to merge 1 commit intoggml-org:masterfrom
Conversation
In router mode, stopping_thread uses subprocess_alive() as the predicate for cv_stop.wait(). subprocess_alive() calls waitpid(WNOHANG) which reaps the child process, setting process->child to 0. This destructive check means the function can only detect the alive→dead transition once — after the first call, subsequent calls return the same value with no state change to trigger a CV wake-up. This commonly manifests when a child process is OOM-killed (e.g., during model loading or inference). The main thread notifies cv_stop once after log_thread.join(). If stopping_thread wakes before waitpid has detected the exit, the predicate evaluates to false and the thread goes back to waiting with no further notifications, causing a permanent hang that blocks model reload. Fix by introducing std::atomic<bool> child_finished to bypass the subprocess_alive() check as a CV predicate. The main thread sets this flag after log_thread.join() (child stdout EOF) and notifies cv_stop. stopping_thread now wakes on either an explicit stop request or this atomic flag, completely decoupling thread synchronization from the destructive subprocess_alive() behavior. Also guard subprocess_terminate() with process->child <= 0 to prevent kill(0, 9) which would send SIGKILL to the entire process group.
|
Hi @xwinwin, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
Author
|
AI usage disclosure is already set to YES with a description of AI's role included. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Summary
stopping_threadhang in router mode when the child process exitsunexpectedly (e.g., OOM during model loading or inference). The thread
could block permanently, preventing model reload.
subprocess_terminate()from callingkill(0, 9)which sendsSIGKILL to the entire process group instead of a single process.
Problem
1. stopping_thread hang
In router mode,
stopping_threadmonitors both explicit stop requests andchild-process crashes. The original implementation used
subprocess_alive()as the predicate for
cv_stop.wait():subprocess_alive()callswaitpid(pid, &status, WNOHANG)which reaps thechild process on the first detection of exit, returning
0or-1thereafter. This destructive behavior means the function can only detect the alive→dead transition once. The fix bypasses this by using an explicit atomic flag instead ofsubprocess_alive()as the CV predicate.The hang scenario (triggered by OOM during model loading, inference, or other crashes):
log_threadexits on stdout EOFcv_stop.notify_all()oncestopping_threadwakes up and evaluatesshould_wake()subprocess_alive()has not yet detected the crash, so it returnstrue(still "alive")
should_wake()→false,stopping_threadgoes back to waitingnotify_all()is ever called → permanent hangThis is a missed notification problem:
stopping_threadhas only onechance to wake up from the CV, but the predicate relies on a racy check that
may not be ready yet. No other code path sends further notifications once the
child has crashed, so the thread can block permanently.
Additionally, concurrent calls to
subprocess_alive()from the main thread andstopping_threadcan race to reap the child. The function returnsint(1= alive,0= exited,-1= error), and the code uses<= 0to detect non-alive states.Platform difference
On Unix,
subprocess_alive()useswaitpid(WNOHANG)which is destructive (reaps the zombie on first detection). On Windows, it usesWaitForSingleObject(hProcess, 0)— a non-destructive poll. The hang is specific to Unix; the fix applies uniformly for consistency.waitpid detection delay
Measured on Intel i9-14900KS, Ubuntu 24.04, Linux 6.17 (x86_64):
The first
waitpid(WNOHANG)call never detects the exit immediately (0% across 200 runs) — there is always a delay before the kernel makes the zombie reaping-able. This delay alone means a single CV notification is unreliable: ifstopping_threadwakes during this window, the check returns "alive" and the thread goes back to waiting with no further notifications.2. kill(0, 9) in subprocess_terminate
After
subprocess_alive()reaps the child,process->childis0. Ifsubprocess_terminate()is then called on the timeout path, it executeskill(0, 9), which sends SIGKILL to all processes in the current processgroup — potentially killing the router itself.
Fix
server-models.cpp
std::atomic<bool> child_finishedto bypass the racysubprocess_alive()check as a CV predicate. Afterlog_thread.join()(child stdout EOF), the main thread setschild_finished = trueand callscv_stop.notify_all().stopping_threadnow wakes on either an explicit stop request (stopping_models) or thechild_finishedflag — completely decoupling thread synchronization from the destructivesubprocess_alive()check.subprocess_alive()checks (returningint, code uses<= 0for non-alive) are kept for early-exit guards (e.g. "is child still alive before sending exit command?").mutexbeforesubprocess_terminate()on the timeout path to avoid holding the lock during a blocking system call.subprocess.h
subprocess_terminate()(Unix path) withprocess->child <= 0checkto prevent
kill(0, 9).Files Changed
tools/server/server-models.cpp— stopping_thread synchronization fixvendor/sheredom/subprocess.h— PID validation in subprocess_terminateTesting
unload()/unload_all()) — no behavior changestopping_threadwakes up viachild_finishedflag, model can be reloadedkill -9 <child_pid>) — same hang is triggeredand verified to be fixed
subprocess_terminate()Additional information
Requirements