Commit 68f5a5b

feat: cross-platform force-kill primitive for stuck PHP threads
Introduces a small, self-contained primitive that unblocks a PHP thread stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers and DrainWorkers can make progress instead of waiting for the block to return on its own. The primitive is useful on its own and gives follow-up graceful-shutdown work a reviewed foundation to build on.

- frankenphp.c: add frankenphp_init_force_kill / frankenphp_save_php_timer / frankenphp_force_kill_thread / frankenphp_destroy_force_kill. The per-thread PHP timer handle (Linux/FreeBSD ZTS) or OS thread handle (Windows) is captured at thread boot and stored in a pre-sized array so the kill path can fire from any goroutine without touching per-thread PHP state. Linux/FreeBSD arm PHP's max_execution_time timer (delivers SIGALRM -> "Maximum execution time exceeded"); Windows uses CancelSynchronousIo + QueueUserAPC to interrupt I/O and alertable waits; macOS and other platforms are a safe no-op (the thread is abandoned and exits when the blocking call returns naturally).
- phpmainthread.go: wire frankenphp_init_force_kill into initPHPThreads (sized to maxThreads, matching the thread_metrics allocation) and frankenphp_destroy_force_kill into drainPHPThreads.
- worker.go: add a 5-second graceful-drain grace period to drainWorkerThreads. Once elapsed, arm the force-kill primitive on any thread still outside Yielding and keep waiting on ready.Wait(); the kill lets the thread return from its blocking call so the drain completes in bounded time instead of hanging.
- worker_test.go + testdata/worker-sleep.php: TestRestartWorkersForceKillsStuckThread drives the path end-to-end. A worker blocks inside sleep(60) below frankenphp_handle_request (so drainChan close can't reach it); the test asserts RestartWorkers returns within 8s (grace + slack). The test skips on platforms without the underlying primitive.
1 parent a05e6dd commit 68f5a5b

7 files changed

Lines changed: 339 additions & 10 deletions

frankenphp.c

Lines changed: 114 additions & 0 deletions
@@ -92,6 +92,115 @@ static bool is_forked_child = false;
 static void frankenphp_fork_child(void) { is_forked_child = true; }
 #endif
 
+/* Best-effort force-kill for PHP threads after the graceful-drain grace
+ * period. Each thread captures pointers to its own executor_globals'
+ * vm_interrupt and timed_out atomic bools at boot and hands them back to
+ * Go via go_frankenphp_store_force_kill_slot. From any goroutine, the
+ * Go side passes that slot back to frankenphp_force_kill_thread, which
+ * stores true into both bools, waking the VM at the next opcode boundary
+ * and unwinding the thread through zend_timeout().
+ *
+ * On platforms with POSIX realtime signals (Linux, FreeBSD), force-kill
+ * also delivers SIGRTMIN+3 to the target thread so any in-flight blocking
+ * syscall (select, sleep, nanosleep, blocking I/O without SA_RESTART)
+ * returns EINTR and the VM gets a chance to observe the atomic bools on
+ * the next opcode. On Windows, CancelSynchronousIo + QueueUserAPC does
+ * the equivalent for alertable I/O and SleepEx. Non-alertable Sleep()
+ * (including PHP's usleep on Windows) stays uninterruptible - the VM
+ * must wait for it to return naturally before bailing.
+ *
+ * macOS has no realtime signals exposed to user-space, so the atomic
+ * bool path is the only mechanism there: threads busy-looping in PHP
+ * are killed promptly, threads stuck in blocking syscalls wait to
+ * return on their own.
+ *
+ * The slot lives in the Go-side phpThread struct - there is no C-side
+ * array or init/destroy dance. Signal handler installation happens once
+ * via pthread_once the first time a thread registers. */
+#ifdef PHP_WIN32
+static void CALLBACK frankenphp_noop_apc(ULONG_PTR param) { (void)param; }
+#endif
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+/* No-op handler: signal delivery is sufficient on its own because it
+ * forces the in-flight syscall to return EINTR. The VM then observes
+ * vm_interrupt on the next opcode and unwinds via zend_timeout(). */
+static void frankenphp_kill_signal_handler(int sig) { (void)sig; }
+
+static pthread_once_t kill_signal_handler_installed = PTHREAD_ONCE_INIT;
+static void install_kill_signal_handler(void) {
+  /* Install the no-op handler process-wide with SA_RESTART cleared so
+   * blocking syscalls return EINTR when the signal is delivered rather
+   * than being transparently restarted by libc. */
+  struct sigaction sa;
+  memset(&sa, 0, sizeof(sa));
+  sa.sa_handler = frankenphp_kill_signal_handler;
+  sigemptyset(&sa.sa_mask);
+  sa.sa_flags = 0;
+  sigaction(FRANKENPHP_KILL_SIGNAL, &sa, NULL);
+}
+#endif
+
+/* Called by each PHP thread at boot, from its own TSRM context, so that
+ * the EG-backed addresses resolve to the thread's private executor_globals
+ * and the captured thread identity refers to itself. Hands the slot to
+ * the Go side via go_frankenphp_store_force_kill_slot; the slot's
+ * lifetime is the phpThread's. */
+void frankenphp_register_thread_for_kill(uintptr_t idx) {
+  force_kill_slot slot;
+  memset(&slot, 0, sizeof(slot));
+  slot.vm_interrupt = &EG(vm_interrupt);
+  slot.timed_out = &EG(timed_out);
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  slot.tid = pthread_self();
+  pthread_once(&kill_signal_handler_installed, install_kill_signal_handler);
+#elif defined(PHP_WIN32)
+  if (!DuplicateHandle(GetCurrentProcess(), GetCurrentThread(),
+                       GetCurrentProcess(), &slot.thread_handle, 0, FALSE,
+                       DUPLICATE_SAME_ACCESS)) {
+    /* DuplicateHandle can fail under resource pressure; leave the handle
+     * NULL so force_kill_thread falls back to the atomic-bool path only. */
+    slot.thread_handle = NULL;
+  }
+#endif
+  go_frankenphp_store_force_kill_slot(idx, slot);
+}
+
+void frankenphp_force_kill_thread(force_kill_slot slot) {
+  if (slot.vm_interrupt == NULL) {
+    /* Thread never reached register_thread_for_kill (aborted during boot). */
+    return;
+  }
+  /* Set the atomic bools first so that by the time the thread wakes up -
+   * whether from our signal/APC or naturally - the VM sees them and
+   * routes through zend_timeout() -> "Maximum execution time exceeded". */
+  zend_atomic_bool_store(slot.timed_out, true);
+  zend_atomic_bool_store(slot.vm_interrupt, true);
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* Return value intentionally ignored: ESRCH (thread already exited) and
+   * EINVAL are both benign - there is simply nothing to unblock. */
+  pthread_kill(slot.tid, FRANKENPHP_KILL_SIGNAL);
+#elif defined(PHP_WIN32)
+  if (slot.thread_handle != NULL) {
+    CancelSynchronousIo(slot.thread_handle);
+    QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, slot.thread_handle, 0);
+  }
+#endif
+}
+
+/* Releases any OS resource tied to the slot (currently: CloseHandle on
+ * Windows). Called by the Go side when a phpThread is torn down. */
+void frankenphp_release_thread_for_kill(force_kill_slot slot) {
+#ifdef PHP_WIN32
+  if (slot.thread_handle != NULL) {
+    CloseHandle(slot.thread_handle);
+  }
+#else
+  (void)slot;
+#endif
+}
+
 void frankenphp_update_local_thread_context(bool is_worker) {
   is_worker_thread = is_worker;
 
@@ -1073,6 +1182,11 @@ static void *php_thread(void *arg) {
 #endif
 #endif
 
+  /* Register this thread's vm_interrupt/timed_out addresses so the Go side
+   * can force-kill it after the graceful-drain grace period if it gets stuck
+   * in a busy PHP loop. */
+  frankenphp_register_thread_for_kill(thread_index);
+
   bool thread_is_healthy = true;
   bool has_attempted_shutdown = false;

frankenphp.h

Lines changed: 34 additions & 0 deletions
@@ -46,6 +46,28 @@ static inline HRESULT LongLongSub(LONGLONG llMinuend, LONGLONG llSubtrahend,
 #include <stdbool.h>
 #include <stdint.h>
 
+#ifndef PHP_WIN32
+#include <pthread.h>
+#include <signal.h>
+#endif
+
+/* Platform capabilities for the force-kill primitive; declared in the
+ * header so Go (via CGo) gets the correct struct layout too. */
+#if !defined(PHP_WIN32) && defined(SIGRTMIN)
+#define FRANKENPHP_HAS_KILL_SIGNAL 1
+#define FRANKENPHP_KILL_SIGNAL (SIGRTMIN + 3)
+#endif
+
+typedef struct {
+  zend_atomic_bool *vm_interrupt;
+  zend_atomic_bool *timed_out;
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  pthread_t tid;
+#elif defined(PHP_WIN32)
+  HANDLE thread_handle;
+#endif
+} force_kill_slot;
+
 #ifndef FRANKENPHP_VERSION
 #define FRANKENPHP_VERSION dev
 #endif
@@ -193,6 +215,18 @@ void frankenphp_init_thread_metrics(int max_threads);
 void frankenphp_destroy_thread_metrics(void);
 size_t frankenphp_get_thread_memory_usage(uintptr_t thread_index);
 
+/* Best-effort force-kill primitives. The slot is populated by each PHP
+ * thread at boot (frankenphp_register_thread_for_kill calls back into Go
+ * via go_frankenphp_store_force_kill_slot) and lives in the Go-side
+ * phpThread. force_kill_thread interrupts the Zend VM at the next opcode
+ * boundary; on POSIX it also delivers SIGRTMIN+3 to the target thread,
+ * on Windows it calls CancelSynchronousIo + QueueUserAPC. release_thread
+ * drops any OS-owned resource tied to the slot (currently the Windows
+ * thread handle). */
+void frankenphp_register_thread_for_kill(uintptr_t thread_index);
+void frankenphp_force_kill_thread(force_kill_slot slot);
+void frankenphp_release_thread_for_kill(force_kill_slot slot);
+
 void register_extensions(zend_module_entry **m, int len);
 
 #endif

phpmainthread.go

Lines changed: 3 additions & 0 deletions
@@ -97,6 +97,9 @@ func drainPHPThreads() {
 	}
 
 	doneWG.Wait()
+	for _, thread := range phpThreads {
+		C.frankenphp_release_thread_for_kill(thread.forceKill)
+	}
 	mainThread.state.Set(state.Done)
 	mainThread.state.WaitFor(state.Reserved)
 	C.frankenphp_destroy_thread_metrics()

phpthread.go

Lines changed: 14 additions & 0 deletions
@@ -25,6 +25,11 @@ type phpThread struct {
 	contextMu    sync.RWMutex
 	state        *state.ThreadState
 	requestCount atomic.Int64
+	// forceKill is populated by go_frankenphp_store_force_kill_slot from
+	// the PHP thread's own TSRM context at boot. Read by other goroutines
+	// via RestartWorkers/DrainWorkers; the write-before-Ready state
+	// transition provides the happens-before edge.
+	forceKill C.force_kill_slot
 }
 
 // threadHandler defines how the callbacks from the C thread should be handled
@@ -203,6 +208,15 @@ func go_frankenphp_after_script_execution(threadIndex C.uintptr_t, exitStatus C.
 	thread.Unpin()
 }
 
+//export go_frankenphp_store_force_kill_slot
+func go_frankenphp_store_force_kill_slot(threadIndex C.uintptr_t, slot C.force_kill_slot) {
+	// Release any resource (Windows thread HANDLE) tied to the previous
+	// slot: a phpThread can reboot (max_requests, unhealthy restart) and
+	// register a fresh DuplicateHandle each time.
+	C.frankenphp_release_thread_for_kill(phpThreads[threadIndex].forceKill)
+	phpThreads[threadIndex].forceKill = slot
+}
+
 //export go_frankenphp_on_thread_shutdown
 func go_frankenphp_on_thread_shutdown(threadIndex C.uintptr_t) {
 	thread := phpThreads[threadIndex]
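
Reviewer note: go_frankenphp_store_force_kill_slot releases whatever the slot held before overwriting it, so a thread that reboots does not leak its previous handle. A minimal Go sketch of that release-then-store pattern follows. `handle`, `slotTable`, `store`, and `release` are hypothetical names for illustration only, and the `closed` flag stands in for a real CloseHandle call.

```go
package main

import "fmt"

// handle is a hypothetical OS resource, for example a duplicated thread HANDLE.
type handle struct {
	id     int
	closed bool
}

// slotTable is a hypothetical registry indexed by thread index.
type slotTable struct {
	slots []*handle
}

// release frees the resource if there is one. A nil handle is a no-op,
// mirroring the NULL check in frankenphp_release_thread_for_kill.
func release(h *handle) {
	if h != nil {
		h.closed = true
	}
}

// store releases the previous occupant of the slot, then records the new one.
func (t *slotTable) store(idx int, h *handle) {
	release(t.slots[idx])
	t.slots[idx] = h
}

func main() {
	table := &slotTable{slots: make([]*handle, 2)}

	first := &handle{id: 1}
	table.store(0, first) // initial boot of thread 0

	second := &handle{id: 2}
	table.store(0, second) // thread 0 rebooted and registered again

	fmt.Println(first.closed, second.closed) // prints "true false"
}
```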

testdata/worker-sleep.php

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+<?php
+
+// Worker that sleeps inside the handler to simulate a stuck request blocking
+// drain. Used to test the force-kill grace period.
+//
+// Before sleeping we touch a marker file whose path is passed via the
+// SLEEP_MARKER header. The Go test polls for the file so it only arms
+// RestartWorkers once the worker is proven to be inside sleep(), removing
+// the fixed-time race of a bare time.Sleep on the caller side.
+$fn = static function () {
+    $marker = $_SERVER['HTTP_SLEEP_MARKER'] ?? '';
+    if ($marker !== '') {
+        @touch($marker);
+    }
+    sleep(60);
+    echo 'should not reach';
+};
+
+do {
+    $ret = \frankenphp_handle_request($fn);
+} while ($ret);

worker.go

Lines changed: 88 additions & 10 deletions
@@ -4,6 +4,7 @@ package frankenphp
 import "C"
 import (
 	"fmt"
+	"log/slog"
 	"os"
 	"path/filepath"
 	"runtime"
@@ -165,16 +166,27 @@ func newWorker(o workerOpt) (*worker, error) {
 	return w, nil
 }
 
+// drainGracePeriod is the time a worker thread has to stop gracefully after
+// receiving the drain signal before the force-kill primitive is armed on it.
+// Well-behaved scripts return promptly on drainChan close; stuck ones (e.g.
+// blocking C calls inside the VM) would otherwise hang drainWorkerThreads
+// forever.
+const drainGracePeriod = 5 * time.Second
+
+// forceKillDeadline is how long drainWorkerThreads waits after arming the
+// force-kill primitive before giving up. Force-kill is best-effort: on
+// macOS (no realtime signals) and on Windows non-alertable Sleep() it
+// cannot interrupt the blocking syscall, so we abandon the thread rather
+// than hang the drain forever.
+const forceKillDeadline = 5 * time.Second
+
 // EXPERIMENTAL: DrainWorkers finishes all worker scripts before a graceful shutdown
 func DrainWorkers() {
-	_ = drainWorkerThreads()
+	_, _ = drainWorkerThreads()
 }
 
-func drainWorkerThreads() []*phpThread {
-	var (
-		ready          sync.WaitGroup
-		drainedThreads []*phpThread
-	)
+func drainWorkerThreads() (drainedThreads []*phpThread, abandoned []*phpThread) {
+	var ready sync.WaitGroup
 
 	for _, worker := range workers {
 		worker.threadMutex.RLock()
@@ -193,17 +205,71 @@ func drainWorkerThreads() []*phpThread {
 			drainedThreads = append(drainedThreads, thread)
 
 			go func(thread *phpThread) {
-				thread.state.WaitFor(state.Yielding)
+				// Accept any state that releases us: Yielding is the
+				// graceful-drain goal; Ready covers the abandon path
+				// where RestartWorkers resumes the thread without it
+				// having yielded; ShuttingDown/Done covers shutdown.
+				// Without these extras, threads that never yield would
+				// leak this goroutine permanently across repeated
+				// RestartWorkers calls.
+				thread.state.WaitFor(state.Yielding, state.Ready, state.ShuttingDown, state.Done)
 				ready.Done()
 			}(thread)
 		}
 
 		worker.threadMutex.RUnlock()
 	}
 
-	ready.Wait()
+	// Wait for graceful drain, then arm the force-kill primitive on any
+	// thread still stuck. On Linux/FreeBSD pthread_kill(SIGRTMIN+3) breaks
+	// out of blocking syscalls; on Windows CancelSynchronousIo covers
+	// alertable waits and I/O. Platforms/syscalls where force-kill can't
+	// actually unblock the thread (macOS, Windows non-alertable Sleep,
+	// PHP's usleep on Windows) get logged and abandoned rather than
+	// blocking the drain forever.
+	done := make(chan struct{})
+	go func() {
+		ready.Wait()
+		close(done)
+	}()
+
+	select {
+	case <-done:
+		// everyone yielded in time
+	case <-time.After(drainGracePeriod):
+		for _, thread := range drainedThreads {
+			if !thread.state.Is(state.Yielding) {
+				C.frankenphp_force_kill_thread(thread.forceKill)
+			}
+		}
+		if globalLogger.Enabled(globalCtx, slog.LevelWarn) {
+			globalLogger.LogAttrs(globalCtx, slog.LevelWarn, "worker threads did not yield within grace period, force-killing stuck threads")
+		}
+		// Give the kill signal a bounded window to land. If the thread
+		// was in a syscall force-kill can't interrupt, we abandon it
+		// here rather than hanging RestartWorkers/DrainWorkers/Shutdown.
+		select {
+		case <-done:
+		case <-time.After(forceKillDeadline):
+			if globalLogger.Enabled(globalCtx, slog.LevelWarn) {
+				globalLogger.LogAttrs(globalCtx, slog.LevelWarn, "worker threads did not yield after force-kill; abandoning to unblock drain")
+			}
+		}
+	}
+
+	// Split drained threads into those that actually yielded (will go
+	// through the proper restart / opcache-reset path on resume) and
+	// those still stuck in a blocking syscall. Callers like
+	// RestartWorkers surface the abandoned count so watcher / admin
+	// restarts don't silently report success for threads that never
+	// reloaded code.
+	for _, thread := range drainedThreads {
+		if !thread.state.Is(state.Yielding) {
+			abandoned = append(abandoned, thread)
+		}
+	}
 
-	return drainedThreads
+	return drainedThreads, abandoned
 }
 
 // RestartWorkers attempts to restart all workers gracefully
@@ -213,12 +279,24 @@ func RestartWorkers() {
 	scalingMu.Lock()
 	defer scalingMu.Unlock()
 
-	threadsToRestart := drainWorkerThreads()
+	threadsToRestart, abandoned := drainWorkerThreads()
 
 	for _, thread := range threadsToRestart {
 		thread.drainChan = make(chan struct{})
 		thread.state.Set(state.Ready)
 	}
+
+	// Abandoned threads did not reach Yielding, so they will resume
+	// without going through the drain branch in threadworker.go that
+	// clears opcache and re-executes the worker script. That means
+	// "restart" did not actually reload code for those threads. Log
+	// loudly so watcher / admin callers do not silently report success.
+	if len(abandoned) > 0 && globalLogger.Enabled(globalCtx, slog.LevelError) {
+		globalLogger.LogAttrs(globalCtx, slog.LevelError,
+			"workers restart incomplete: some threads were stuck in an uninterruptible blocking call and did not reload code",
+			slog.Int("abandoned", len(abandoned)),
+			slog.Int("restarted", len(threadsToRestart)-len(abandoned)))
+	}
 }
 
 func (worker *worker) attachThread(thread *phpThread) {
