Commit 41cfd22

feat: cross-platform force-kill primitive for stuck PHP threads
Introduces a small, self-contained primitive that unblocks a PHP thread stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers and DrainWorkers can make progress instead of waiting for the block to return on its own. The primitive is useful on its own and gives follow-up graceful-shutdown work a reviewed foundation to build on.

- frankenphp.c: add frankenphp_init_force_kill / frankenphp_save_php_timer / frankenphp_force_kill_thread / frankenphp_destroy_force_kill. The per-thread PHP timer handle (Linux/FreeBSD ZTS) or OS thread handle (Windows) is captured at thread boot and stored in a pre-sized array so the kill path can fire from any goroutine without touching per-thread PHP state. Linux/FreeBSD arm PHP's max_execution_time timer (delivers SIGALRM -> "Maximum execution time exceeded"); Windows uses CancelSynchronousIo + QueueUserAPC to interrupt I/O and alertable waits; macOS and other platforms are a safe no-op (the thread is abandoned and exits when the blocking call returns naturally).
- phpmainthread.go: wire frankenphp_init_force_kill into initPHPThreads (sized to maxThreads, matching the thread_metrics allocation) and frankenphp_destroy_force_kill into drainPHPThreads.
- worker.go: add a 5-second graceful-drain grace period to drainWorkerThreads. Once elapsed, arm the force-kill primitive on any thread still outside Yielding and keep waiting on ready.Wait(); the kill lets the thread return from its blocking call so the drain completes in bounded time instead of hanging.
- worker_test.go + testdata/worker-sleep.php: TestRestartWorkersForceKillsStuckThread drives the path end-to-end. A worker blocks inside sleep(60) below frankenphp_handle_request (so drainChan close can't reach it); the test asserts RestartWorkers returns within 8s (grace + slack). The test skips on platforms without the underlying primitive.
1 parent a05e6dd commit 41cfd22

9 files changed

Lines changed: 543 additions & 24 deletions

caddy/admin.go

Lines changed: 7 additions & 1 deletion
@@ -39,7 +39,13 @@ func (admin *FrankenPHPAdmin) restartWorkers(w http.ResponseWriter, r *http.Requ
 		return admin.error(http.StatusMethodNotAllowed, fmt.Errorf("method not allowed"))
 	}
 
-	frankenphp.RestartWorkers()
+	if err := frankenphp.RestartWorkers(); err != nil {
+		// Restart is incomplete: at least one worker thread was stuck in
+		// an uninterruptible blocking call and did not reload code. Do
+		// not let the admin endpoint lie to automation with a 200.
+		caddy.Log().Sugar().Errorf("workers restart incomplete: %v", err)
+		return admin.error(http.StatusInternalServerError, err)
+	}
 	caddy.Log().Info("workers restarted from admin api")
 	admin.success(w, "workers restarted successfully\n")

frankenphp.c

Lines changed: 167 additions & 6 deletions
@@ -92,6 +92,135 @@ static bool is_forked_child = false;
 static void frankenphp_fork_child(void) { is_forked_child = true; }
 #endif
 
+/* Best-effort force-kill for PHP threads after the graceful-drain grace
+ * period. Each thread captures pointers to its own executor_globals'
+ * vm_interrupt and timed_out atomic bools at boot and hands them back to
+ * Go via go_frankenphp_store_force_kill_slot. From any goroutine, the
+ * Go side passes that slot back to frankenphp_force_kill_thread, which
+ * stores true into both bools, waking the VM at the next opcode boundary
+ * and unwinding the thread through zend_timeout().
+ *
+ * On platforms with POSIX realtime signals (Linux, FreeBSD), force-kill
+ * also delivers SIGRTMIN+3 to the target thread so any in-flight blocking
+ * syscall (select, sleep, nanosleep, blocking I/O without SA_RESTART)
+ * returns EINTR and the VM gets a chance to observe the atomic bools on
+ * the next opcode. On Windows, CancelSynchronousIo + QueueUserAPC does
+ * the equivalent for alertable I/O and SleepEx. Non-alertable Sleep()
+ * (including PHP's usleep on Windows) stays uninterruptible - the VM
+ * must wait for it to return naturally before bailing.
+ *
+ * macOS has no realtime signals exposed to user-space, so the atomic
+ * bool path is the only mechanism there: threads busy-looping in PHP
+ * are killed promptly, threads stuck in blocking syscalls wait to
+ * return on their own.
+ *
+ * JIT caveat: when the OPcache JIT is enabled, some hot code paths do
+ * not check vm_interrupt between opcodes. A thread stuck in a
+ * JIT-compiled busy loop may not observe the atomic-bool store at all
+ * (see https://github.com/php/php-src/issues/21267). The syscall-
+ * interruption path (signal -> EINTR) still works since the kernel
+ * wakes the thread regardless of JIT state, so the regression surface
+ * is pure-PHP busy loops under JIT. Those fall through to the abandon
+ * path after forceKillDeadline.
+ *
+ * Signal number reservation: SIGRTMIN+3 is reserved by FrankenPHP for
+ * force-kill. If a PHP user script registers its own handler via
+ * pcntl_signal(SIGRTMIN+3, ...), it clobbers ours and force-kill stops
+ * working for threads it runs on. Projects embedding FrankenPHP
+ * alongside their own Go code that also uses that signal must choose a
+ * different one here. Keep this in mind if ever changing the constant.
+ *
+ * The slot lives in the Go-side phpThread struct - there is no C-side
+ * array or init/destroy dance. Signal handler installation happens once
+ * via pthread_once the first time a thread registers. */
+#ifdef PHP_WIN32
+static void CALLBACK frankenphp_noop_apc(ULONG_PTR param) { (void)param; }
+#endif
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+/* No-op handler: signal delivery is sufficient on its own because it
+ * forces the in-flight syscall to return EINTR. The VM then observes
+ * vm_interrupt on the next opcode and unwinds via zend_timeout(). */
+static void frankenphp_kill_signal_handler(int sig) { (void)sig; }
+
+static pthread_once_t kill_signal_handler_installed = PTHREAD_ONCE_INIT;
+static void install_kill_signal_handler(void) {
+  /* Install the no-op handler process-wide with SA_RESTART cleared so
+   * blocking syscalls return EINTR when the signal is delivered rather
+   * than being transparently restarted by libc. SA_ONSTACK is set
+   * defensively: the signal targets non-Go pthreads via pthread_kill,
+   * but if it's ever delivered to a Go-managed thread (e.g. through
+   * accidental process-level raise), Go requires the handler to run on
+   * the alternate signal stack to avoid corrupting the goroutine's. */
+  struct sigaction sa;
+  memset(&sa, 0, sizeof(sa));
+  sa.sa_handler = frankenphp_kill_signal_handler;
+  sigemptyset(&sa.sa_mask);
+  sa.sa_flags = SA_ONSTACK;
+  sigaction(FRANKENPHP_KILL_SIGNAL, &sa, NULL);
+}
+#endif
+
+/* Called by each PHP thread at boot, from its own TSRM context, so that
+ * the EG-backed addresses resolve to the thread's private executor_globals
+ * and the captured thread identity refers to itself. Hands the slot to
+ * the Go side via go_frankenphp_store_force_kill_slot; the slot's
+ * lifetime is the phpThread's. */
+void frankenphp_register_thread_for_kill(uintptr_t idx) {
+  force_kill_slot slot;
+  memset(&slot, 0, sizeof(slot));
+  slot.vm_interrupt = &EG(vm_interrupt);
+  slot.timed_out = &EG(timed_out);
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  slot.tid = pthread_self();
+  pthread_once(&kill_signal_handler_installed, install_kill_signal_handler);
+#elif defined(PHP_WIN32)
+  if (!DuplicateHandle(GetCurrentProcess(), GetCurrentThread(),
+                       GetCurrentProcess(), &slot.thread_handle, 0, FALSE,
+                       DUPLICATE_SAME_ACCESS)) {
+    /* DuplicateHandle can fail under resource pressure; leave the handle
+     * NULL so force_kill_thread falls back to the atomic-bool path only. */
+    slot.thread_handle = NULL;
+  }
+#endif
+  go_frankenphp_store_force_kill_slot(idx, slot);
+}
+
+void frankenphp_force_kill_thread(force_kill_slot slot) {
+  if (slot.vm_interrupt == NULL) {
+    /* Thread never reached register_thread_for_kill (aborted during boot). */
+    return;
+  }
+  /* Set the atomic bools first so that by the time the thread wakes up -
+   * whether from our signal/APC or naturally - the VM sees them and
+   * routes through zend_timeout() -> "Maximum execution time exceeded". */
+  zend_atomic_bool_store(slot.timed_out, true);
+  zend_atomic_bool_store(slot.vm_interrupt, true);
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* Return value intentionally ignored: ESRCH (thread already exited) and
+   * EINVAL are both benign - there is simply nothing to unblock. */
+  pthread_kill(slot.tid, FRANKENPHP_KILL_SIGNAL);
+#elif defined(PHP_WIN32)
+  if (slot.thread_handle != NULL) {
+    CancelSynchronousIo(slot.thread_handle);
+    QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, slot.thread_handle, 0);
+  }
+#endif
+}
+
+/* Releases any OS resource tied to the slot (currently: CloseHandle on
+ * Windows). Called by the Go side when a phpThread is torn down. */
+void frankenphp_release_thread_for_kill(force_kill_slot slot) {
+#ifdef PHP_WIN32
+  if (slot.thread_handle != NULL) {
+    CloseHandle(slot.thread_handle);
+  }
+#else
+  (void)slot;
+#endif
+}
+
 void frankenphp_update_local_thread_context(bool is_worker) {
   is_worker_thread = is_worker;
 
@@ -253,8 +382,14 @@ static frankenphp_thread_metrics *thread_metrics = NULL;
 
 /* Adapted from php_request_shutdown */
 static void frankenphp_worker_request_shutdown() {
-  __atomic_store_n(&thread_metrics[thread_index].last_memory_usage,
-                   zend_memory_usage(0), __ATOMIC_RELAXED);
+  /* thread_metrics can be NULL if the Go side already ran
+   * frankenphp_destroy_thread_metrics because Shutdown timed out waiting
+   * for this thread: tolerate the race rather than dereferencing freed
+   * memory when the blocked call finally unwinds. */
+  if (thread_metrics != NULL) {
+    __atomic_store_n(&thread_metrics[thread_index].last_memory_usage,
+                     zend_memory_usage(0), __ATOMIC_RELAXED);
+  }
 
   /* Flush all output buffers */
   zend_try { php_output_end_all(); }
@@ -1073,6 +1208,11 @@ static void *php_thread(void *arg) {
 #endif
 #endif
 
+  /* Register this thread's vm_interrupt/timed_out addresses so the Go side
+   * can force-kill it after the graceful-drain grace period if it gets stuck
+   * in a busy PHP loop. */
+  frankenphp_register_thread_for_kill(thread_index);
+
   bool thread_is_healthy = true;
   bool has_attempted_shutdown = false;
 
@@ -1108,9 +1248,12 @@ static void *php_thread(void *arg) {
       zend_destroy_file_handle(&file_handle);
       reset_sandboxed_environment();
 
-      /* Update the last memory usage for metrics */
-      __atomic_store_n(&thread_metrics[thread_index].last_memory_usage,
-                       zend_memory_usage(0), __ATOMIC_RELAXED);
+      /* Update the last memory usage for metrics (see
+       * frankenphp_worker_request_shutdown for the NULL-check rationale). */
+      if (thread_metrics != NULL) {
+        __atomic_store_n(&thread_metrics[thread_index].last_memory_usage,
+                         zend_memory_usage(0), __ATOMIC_RELAXED);
+      }
 
       has_attempted_shutdown = true;
 
@@ -1150,6 +1293,15 @@ static void *php_thread(void *arg) {
   }
   zend_end_try();
 
+  /* Clear the force-kill slot BEFORE ts_free_thread: that call frees
+   * the TSRM storage that &EG(vm_interrupt) / &EG(timed_out) point at.
+   * Clearing afterwards (even under a write lock) would leave a window
+   * where a concurrent delivery reads the still-populated slot and
+   * writes into freed memory. Applies to both the healthy exit and the
+   * unhealthy-restart path below so every call to force_kill_thread
+   * sees either a valid or a zero-valued slot. */
+  go_frankenphp_clear_force_kill_slot(thread_index);
+
   /* free all global PHP memory reserved for this thread */
 #ifdef ZTS
   ts_free_thread();
@@ -1163,7 +1315,16 @@ static void *php_thread(void *arg) {
   }
 
   /* Thread is unhealthy, PHP globals might be in a bad state after a bailout,
-   * restart the entire thread */
+   * restart the entire thread - unless we're already past Shutdown (detected
+   * via thread_metrics having been freed). Respawning after Shutdown would
+   * hand a fresh pthread a nil phpThreads slice on the Go side and a freed
+   * thread_metrics array on the C side, so we simply drop the restart. */
+  if (thread_metrics == NULL) {
+    frankenphp_log_message(
+        "Unhealthy thread unwinding after Shutdown; not restarting",
+        LOG_WARNING);
+    return NULL;
+  }
   frankenphp_log_message("Restarting unhealthy thread", LOG_WARNING);
 
   if (!frankenphp_new_php_thread(thread_index)) {

frankenphp.h

Lines changed: 34 additions & 0 deletions
@@ -46,6 +46,28 @@ static inline HRESULT LongLongSub(LONGLONG llMinuend, LONGLONG llSubtrahend,
 #include <stdbool.h>
 #include <stdint.h>
 
+#ifndef PHP_WIN32
+#include <pthread.h>
+#include <signal.h>
+#endif
+
+/* Platform capabilities for the force-kill primitive; declared in the
+ * header so Go (via CGo) gets the correct struct layout too. */
+#if !defined(PHP_WIN32) && defined(SIGRTMIN)
+#define FRANKENPHP_HAS_KILL_SIGNAL 1
+#define FRANKENPHP_KILL_SIGNAL (SIGRTMIN + 3)
+#endif
+
+typedef struct {
+  zend_atomic_bool *vm_interrupt;
+  zend_atomic_bool *timed_out;
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  pthread_t tid;
+#elif defined(PHP_WIN32)
+  HANDLE thread_handle;
+#endif
+} force_kill_slot;
+
 #ifndef FRANKENPHP_VERSION
 #define FRANKENPHP_VERSION dev
 #endif
@@ -193,6 +215,18 @@ void frankenphp_init_thread_metrics(int max_threads);
 void frankenphp_destroy_thread_metrics(void);
 size_t frankenphp_get_thread_memory_usage(uintptr_t thread_index);
 
+/* Best-effort force-kill primitives. The slot is populated by each PHP
+ * thread at boot (frankenphp_register_thread_for_kill calls back into Go
+ * via go_frankenphp_store_force_kill_slot) and lives in the Go-side
+ * phpThread. force_kill_thread interrupts the Zend VM at the next opcode
+ * boundary; on POSIX it also delivers SIGRTMIN+3 to the target thread,
+ * on Windows it calls CancelSynchronousIo + QueueUserAPC. release_thread
+ * drops any OS-owned resource tied to the slot (currently the Windows
+ * thread handle). */
+void frankenphp_register_thread_for_kill(uintptr_t thread_index);
+void frankenphp_force_kill_thread(force_kill_slot slot);
+void frankenphp_release_thread_for_kill(force_kill_slot slot);
+
 void register_extensions(zend_module_entry **m, int len);
 
 #endif
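
The two-flag design of force_kill_slot (an interrupt flag polled at a safe point, plus a timed-out flag that says why) can be modelled without PHP. The following is a rough sketch, not the Zend VM: `demo_kill_slot`, `demo_vm_state`, `demo_vm_loop`, `demo_force_kill` and `demo_interrupt_busy_loop` are hypothetical names, C11 `atomic_bool` stands in for `zend_atomic_bool`, and each loop iteration stands in for an opcode boundary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of the portable part of force_kill_slot. */
typedef struct {
  atomic_bool *vm_interrupt;
  atomic_bool *timed_out;
} demo_kill_slot;

typedef struct {
  atomic_bool vm_interrupt;
  atomic_bool timed_out;
  bool unwound_by_timeout; /* written by the worker, read after join */
} demo_vm_state;

/* Stand-in for an interpreter loop: the interrupt flag is consumed once
 * per iteration, and the timed-out flag decides whether to unwind. */
static void *demo_vm_loop(void *arg) {
  demo_vm_state *vm = arg;
  for (;;) {
    /* ... one unit of work ("opcode") would execute here ... */
    if (atomic_exchange(&vm->vm_interrupt, false)) {
      if (atomic_load(&vm->timed_out)) {
        vm->unwound_by_timeout = true;
        return NULL;
      }
    }
  }
}

/* Same store order as the commit: timed_out first, vm_interrupt second,
 * so a loop that sees the interrupt is guaranteed to see the reason. */
static void demo_force_kill(demo_kill_slot slot) {
  if (slot.vm_interrupt == NULL) {
    return; /* zero-valued slot: thread never registered */
  }
  atomic_store(slot.timed_out, true);
  atomic_store(slot.vm_interrupt, true);
}

/* Returns 1 if the busy loop unwound through the timed-out branch. */
static int demo_interrupt_busy_loop(void) {
  demo_vm_state vm;
  atomic_init(&vm.vm_interrupt, false);
  atomic_init(&vm.timed_out, false);
  vm.unwound_by_timeout = false;

  demo_kill_slot slot = {&vm.vm_interrupt, &vm.timed_out};

  pthread_t tid;
  int rc = pthread_create(&tid, NULL, demo_vm_loop, &vm);
  assert(rc == 0);
  (void)rc;

  demo_force_kill(slot); /* callable from any thread */
  pthread_join(tid, NULL);
  return vm.unwound_by_timeout ? 1 : 0;
}
```

This model only covers a thread that keeps reaching its polling point. A thread parked inside a blocking syscall never polls, which is the case the signal and APC fields of the real struct are for.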

phpmainthread.go

Lines changed: 8 additions & 0 deletions
@@ -97,6 +97,14 @@ func drainPHPThreads() {
 	}
 
 	doneWG.Wait()
+	// Slots are released by the PHP threads themselves, under the
+	// per-thread write lock, right before ts_free_thread() runs (see
+	// go_frankenphp_clear_force_kill_slot). A second release here would
+	// be a double-CloseHandle on Windows (potentially on a reused handle)
+	// and bypass the lock discipline on every platform, so we rely on
+	// the thread-exit path instead. Threads that were abandoned by
+	// phpThread.shutdown() still hold their slot; the OS reclaims the
+	// handle when the process exits.
 	mainThread.state.Set(state.Done)
 	mainThread.state.WaitFor(state.Reserved)
 	C.frankenphp_destroy_thread_metrics()
