Skip to content

server: fix stopping_thread hang on child process exit#22452

Open
xwinwin wants to merge 1 commit intoggml-org:masterfrom
xwinwin:fix/router-stopping-thread-hang
Open

server: fix stopping_thread hang on child process exit#22452
xwinwin wants to merge 1 commit intoggml-org:masterfrom
xwinwin:fix/router-stopping-thread-hang

Conversation

@xwinwin
Copy link
Copy Markdown

@xwinwin xwinwin commented Apr 28, 2026

Overview

Summary

  • Fix stopping_thread hang in router mode when the child process exits
    unexpectedly (e.g., OOM during model loading or inference). The thread
    could block permanently, preventing model reload.
  • Prevent subprocess_terminate() from calling kill(0, 9) which sends
    SIGKILL to the entire process group instead of a single process.

Problem

1. stopping_thread hang

In router mode, stopping_thread monitors both explicit stop requests and
child-process crashes. The original implementation used subprocess_alive()
as the predicate for cv_stop.wait():

auto should_wake = [&]() {
    return is_stopping() || !subprocess_alive(child_proc.get());
};

subprocess_alive() calls waitpid(pid, &status, WNOHANG) which reaps the
child process
on the first detection of exit, returning 0 or -1 thereafter. This destructive behavior means the function can only detect the alive→dead transition once. The fix bypasses this by using an explicit atomic flag instead of subprocess_alive() as the CV predicate.

The hang scenario (triggered by OOM during model loading, inference, or other crashes):

  1. Child crashes (e.g., killed by OOM killer during model loading or inference) →
    log_thread exits on stdout EOF
  2. Main thread calls cv_stop.notify_all() once
  3. stopping_thread wakes up and evaluates should_wake()
  4. subprocess_alive() has not yet detected the crash, so it returns true
    (still "alive")
  5. should_wake()false, stopping_thread goes back to waiting
  6. No further notify_all() is ever called → permanent hang

This is a missed notification problem: stopping_thread has only one
chance to wake up from the CV, but the predicate relies on a racy check that
may not be ready yet. No other code path sends further notifications once the
child has crashed, so the thread can block permanently.

Additionally, concurrent calls to subprocess_alive() from the main thread and stopping_thread can race to reap the child. The function returns int (1 = alive, 0 = exited, -1 = error), and the code uses <= 0 to detect non-alive states.

Platform difference

On Unix, subprocess_alive() uses waitpid(WNOHANG) which is destructive (reaps the zombie on first detection). On Windows, it uses WaitForSingleObject(hProcess, 0) — a non-destructive poll. The hang is specific to Unix; the fix applies uniformly for consistency.

waitpid detection delay

Measured on Intel i9-14900KS, Ubuntu 24.04, Linux 6.17 (x86_64):

$ gcc -O2 -o /tmp/test /tmp/test.c && /tmp/test 200
Running 200 iterations of kill -9 + waitpid(WNOHANG) latency test

Results (200 runs):
  Min:    108 us
  Avg:    289 us
  Max:    502 us
  Immediate (0 retries): 0/200 (0.0%)

The first waitpid(WNOHANG) call never detects the exit immediately (0% across 200 runs) — there is always a delay before the kernel makes the zombie reaping-able. This delay alone means a single CV notification is unreliable: if stopping_thread wakes during this window, the check returns "alive" and the thread goes back to waiting with no further notifications.

2. kill(0, 9) in subprocess_terminate

After subprocess_alive() reaps the child, process->child is 0. If
subprocess_terminate() is then called on the timeout path, it executes
kill(0, 9), which sends SIGKILL to all processes in the current process
group
— potentially killing the router itself.

Fix

server-models.cpp

  • Introduce std::atomic<bool> child_finished to bypass the racy subprocess_alive() check as a CV predicate. After log_thread.join() (child stdout EOF), the main thread sets child_finished = true and calls cv_stop.notify_all().
  • stopping_thread now wakes on either an explicit stop request (stopping_models) or the child_finished flag — completely decoupling thread synchronization from the destructive subprocess_alive() check.
  • One-shot subprocess_alive() checks (returning int, code uses <= 0 for non-alive) are kept for early-exit guards (e.g. "is child still alive before sending exit command?").
  • Unlock mutex before subprocess_terminate() on the timeout path to avoid holding the lock during a blocking system call.

subprocess.h

  • Guard subprocess_terminate() (Unix path) with process->child <= 0 check
    to prevent kill(0, 9).

Files Changed

  • tools/server/server-models.cpp — stopping_thread synchronization fix
  • vendor/sheredom/subprocess.h — PID validation in subprocess_terminate

Testing

  • Normal graceful shutdown (via unload() / unload_all()) — no behavior change
  • OOM during model loading or inference — stopping_thread wakes up via
    child_finished flag, model can be reloaded
  • Child crash reproduction (kill -9 <child_pid>) — same hang is triggered
    and verified to be fixed
  • Timeout path — mutex is unlocked before subprocess_terminate()

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI assisted with identifying the root cause and suggesting the fix, generated this description and commit log. All code was authored and reviewed by me before submission.

In router mode, stopping_thread uses subprocess_alive() as the predicate for
cv_stop.wait(). subprocess_alive() calls waitpid(WNOHANG) which reaps the
child process, setting process->child to 0. This destructive check means the
function can only detect the alive→dead transition once — after the first
call, subsequent calls return the same value with no state change to trigger
a CV wake-up.

This commonly manifests when a child process is OOM-killed (e.g., during
model loading or inference). The main thread notifies cv_stop once after
log_thread.join(). If stopping_thread wakes before waitpid has detected the
exit, the predicate evaluates to false and the thread goes back to waiting
with no further notifications, causing a permanent hang that blocks model
reload.

Fix by introducing std::atomic<bool> child_finished to bypass the
subprocess_alive() check as a CV predicate. The main thread sets this flag
after log_thread.join() (child stdout EOF) and notifies cv_stop.
stopping_thread now wakes on either an explicit stop request or this atomic
flag, completely decoupling thread synchronization from the destructive
subprocess_alive() behavior.

Also guard subprocess_terminate() with process->child <= 0 to prevent
kill(0, 9) which would send SIGKILL to the entire process group.
@ggml-gh-bot
Copy link
Copy Markdown

ggml-gh-bot Bot commented Apr 28, 2026

Hi @xwinwin, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@xwinwin
Copy link
Copy Markdown
Author

xwinwin commented Apr 28, 2026

AI usage disclosure is already set to YES with a description of AI's role included.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant