
Commit d36b9eb

feat: cross-platform force-kill primitive for stuck PHP threads
Introduces a self-contained primitive that unblocks a PHP thread stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers, DrainWorkers, and Shutdown makes progress instead of hanging for the duration of the block. The primitive is useful on its own and gives follow-up graceful-shutdown work a reviewed foundation to build on.

Design: each PHP thread, at boot from its own TSRM context, hands a force_kill_slot (pointers to its own EG(vm_interrupt) and EG(timed_out) atomic bools, plus pthread_t / Windows HANDLE for the wake-up) back to Go via go_frankenphp_store_force_kill_slot. The slot lives on phpThread and is protected by a per-thread RWMutex so the zero-and-release path at thread exit cannot race an in-flight kill. From any goroutine, Go passes the slot back to frankenphp_force_kill_thread, which stores true into both bools (waking the VM at the next opcode boundary and routing through zend_timeout -> "Maximum execution time exceeded") and then delivers a platform-specific wake-up:

- Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler installed via pthread_once, SA_ONSTACK, no SA_RESTART. Signal delivery causes any in-flight blocking syscall to return EINTR.
- Windows: CancelSynchronousIo + QueueUserAPC covers alertable I/O and SleepEx. Non-alertable Sleep (including PHP's usleep) stays uninterruptible.
- macOS: atomic-bool-only path. Threads stuck in blocking syscalls wait to return on their own.

JIT caveat: under the OPcache JIT some hot code paths skip vm_interrupt checks (see php-src#21267), so a pure-PHP busy loop under JIT may not observe the store and will fall through to the abandon path below.

Drain flow:

- worker.go: drainWorkerThreads waits drainGracePeriod (5s) for each drained thread to reach Yielding; then arms force-kill on stragglers and waits forceKillDeadline (5s) more. Threads still stuck past that are abandoned rather than hanging the drain forever.
- drainWorkerThreads returns (drained, abandoned). RestartWorkers puts drained threads back to Ready and abandoned ones into the new state.Abandoned (handlers treat it like ShuttingDown on next callback) so an abandoned thread that finally unwinds exits instead of re-entering the serve loop under stale request state. If any were abandoned, RestartWorkers returns errIncompleteRestart wrapped with abandoned/restarted counts - admin endpoint and watcher surface it.
- phpthread.go: phpThread.shutdown mirrors the same grace + force-kill + abandon pattern so Shutdown cannot hang on an uninterruptible blocking call either.

Lifecycle hardening: Shutdown intentionally leaves phpThreads allocated and thread_metrics alive - an abandoned thread that eventually unwinds still calls through the SAPI and lifecycle callbacks which index those structures. initPHPThreads blocks on a package-level sync.WaitGroup (Add on every php_thread entry, Done on every exit path) so the next Init cycle cannot reassign them out from under a lingering abandoned thread, then frees the previous allocation inside frankenphp_init_thread_metrics before allocating fresh. A dedicated C-side atomic (shutdown_in_progress, toggled by frankenphp_set_shutdown_in_progress) is the signal the unhealthy-thread restart path uses to refuse respawning past Shutdown.

- go_frankenphp_store_force_kill_slot / clear_force_kill_slot / on_thread_shutdown: take the per-thread write lock; clear runs before ts_free_thread on both healthy and unhealthy exit paths so the captured &EG() pointers are zeroed before their backing storage is freed. A threadForLateCallback helper guards callbacks against phpThreads races anyway, belt-and-suspenders.
- php_thread unblocks FRANKENPHP_KILL_SIGNAL with pthread_sigmask at startup so Go's runtime signal mask cannot silently drop deliveries.
- worker_test.go + testdata/worker-sleep.php: the regression test drives the full path via a request marker file so it only arms RestartWorkers once the worker is proven parked in sleep(), then asserts both the bounded elapsed time and that the "should not reach" line after sleep never runs (which would indicate the VM interrupt was never observed).

RestartWorkers now returns an error - a source-compatible Go API change (callers that ignored it still compile) but worth noting for embedders.
1 parent a05e6dd commit d36b9eb

15 files changed

Lines changed: 758 additions & 43 deletions

caddy/admin.go

Lines changed: 7 additions & 1 deletion
@@ -39,7 +39,13 @@ func (admin *FrankenPHPAdmin) restartWorkers(w http.ResponseWriter, r *http.Requ
 		return admin.error(http.StatusMethodNotAllowed, fmt.Errorf("method not allowed"))
 	}
 
-	frankenphp.RestartWorkers()
+	if err := frankenphp.RestartWorkers(); err != nil {
+		// Restart is incomplete: at least one worker thread was stuck in
+		// an uninterruptible blocking call and did not reload code. Do
+		// not let the admin endpoint lie to automation with a 200.
+		caddy.Log().Sugar().Errorf("workers restart incomplete: %v", err)
+		return admin.error(http.StatusInternalServerError, err)
+	}
 	caddy.Log().Info("workers restarted from admin api")
 	admin.success(w, "workers restarted successfully\n")
 

frankenphp.c

Lines changed: 212 additions & 11 deletions
@@ -92,6 +92,148 @@ static bool is_forked_child = false;
 static void frankenphp_fork_child(void) { is_forked_child = true; }
 #endif
 
+/* Best-effort force-kill for PHP threads after the graceful-drain grace
+ * period. Each thread captures pointers to its own executor_globals'
+ * vm_interrupt and timed_out atomic bools at boot and hands them back to
+ * Go via go_frankenphp_store_force_kill_slot. From any goroutine, the
+ * Go side passes that slot back to frankenphp_force_kill_thread, which
+ * stores true into both bools, waking the VM at the next opcode boundary
+ * and unwinding the thread through zend_timeout().
+ *
+ * On platforms with POSIX realtime signals (Linux, FreeBSD), force-kill
+ * also delivers SIGRTMIN+3 to the target thread so any in-flight blocking
+ * syscall (select, sleep, nanosleep, blocking I/O without SA_RESTART)
+ * returns EINTR and the VM gets a chance to observe the atomic bools on
+ * the next opcode. On Windows, CancelSynchronousIo + QueueUserAPC does
+ * the equivalent for alertable I/O and SleepEx. Non-alertable Sleep()
+ * (including PHP's usleep on Windows) stays uninterruptible - the VM
+ * must wait for it to return naturally before bailing.
+ *
+ * macOS has no realtime signals exposed to user-space, so the atomic
+ * bool path is the only mechanism there: threads busy-looping in PHP
+ * are killed promptly, threads stuck in blocking syscalls wait to
+ * return on their own.
+ *
+ * JIT caveat: when the OPcache JIT is enabled, some hot code paths do
+ * not check vm_interrupt between opcodes. A thread stuck in a
+ * JIT-compiled busy loop may not observe the atomic-bool store at all
+ * (see https://github.com/php/php-src/issues/21267). The syscall-
+ * interruption path (signal -> EINTR) still works since the kernel
+ * wakes the thread regardless of JIT state, so the regression surface
+ * is pure-PHP busy loops under JIT. Those fall through to the abandon
+ * path after forceKillDeadline.
+ *
+ * Signal number reservation: SIGRTMIN+3 is reserved by FrankenPHP for
+ * force-kill. If a PHP user script registers its own handler via
+ * pcntl_signal(SIGRTMIN+3, ...), it clobbers ours and force-kill stops
+ * working for threads it runs on. Projects embedding FrankenPHP
+ * alongside their own Go code that also uses that signal must choose a
+ * different one here. glibc's NPTL reserves SIGRTMIN..SIGRTMIN+2 for
+ * its own use, so do not move this offset downward.
+ *
+ * The slot lives in the Go-side phpThread struct - there is no C-side
+ * array or init/destroy dance. Signal handler installation happens once
+ * via pthread_once the first time a thread registers. */
+#ifdef PHP_WIN32
+static void CALLBACK frankenphp_noop_apc(ULONG_PTR param) { (void)param; }
+#endif
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+/* No-op handler: signal delivery is sufficient on its own because it
+ * forces the in-flight syscall to return EINTR. The VM then observes
+ * vm_interrupt on the next opcode and unwinds via zend_timeout(). */
+static void frankenphp_kill_signal_handler(int sig) { (void)sig; }
+
+static pthread_once_t kill_signal_handler_installed = PTHREAD_ONCE_INIT;
+static void install_kill_signal_handler(void) {
+  /* Install the no-op handler process-wide without SA_RESTART so blocking
+   * syscalls return EINTR when the signal is delivered rather than being
+   * transparently restarted by libc. SA_ONSTACK is set defensively: the
+   * signal targets non-Go pthreads via pthread_kill, but if it's ever
+   * delivered to a Go-managed thread (e.g. through accidental process-
+   * level raise), Go requires the handler to run on the alternate signal
+   * stack to avoid corrupting the goroutine's. */
+  struct sigaction sa;
+  memset(&sa, 0, sizeof(sa));
+  sa.sa_handler = frankenphp_kill_signal_handler;
+  sigemptyset(&sa.sa_mask);
+  sa.sa_flags = SA_ONSTACK;
+  sigaction(FRANKENPHP_KILL_SIGNAL, &sa, NULL);
+}
+#endif
+
+/* shutdown_in_progress is toggled by the Go side through
+ * frankenphp_set_shutdown_in_progress(). It is the only honest signal the
+ * unhealthy-thread restart path has to tell "we are tearing the runtime
+ * down, do not respawn" apart from normal operation - thread_metrics is
+ * never NULL anymore because Shutdown intentionally leaves it allocated
+ * for abandoned threads still writing into it. */
+static zend_atomic_bool shutdown_in_progress;
+
+void frankenphp_set_shutdown_in_progress(bool v) {
+  zend_atomic_bool_store(&shutdown_in_progress, v);
+}
+
+/* Called by each PHP thread at boot, from its own TSRM context, so that
+ * the EG-backed addresses resolve to the thread's private executor_globals
+ * and the captured thread identity refers to itself. Hands the slot to
+ * the Go side via go_frankenphp_store_force_kill_slot; the slot's
+ * lifetime is the phpThread's. */
+void frankenphp_register_thread_for_kill(uintptr_t idx) {
+  force_kill_slot slot;
+  memset(&slot, 0, sizeof(slot));
+  slot.vm_interrupt = &EG(vm_interrupt);
+  slot.timed_out = &EG(timed_out);
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  slot.tid = pthread_self();
+  pthread_once(&kill_signal_handler_installed, install_kill_signal_handler);
+#elif defined(PHP_WIN32)
+  if (!DuplicateHandle(GetCurrentProcess(), GetCurrentThread(),
+                       GetCurrentProcess(), &slot.thread_handle, 0, FALSE,
+                       DUPLICATE_SAME_ACCESS)) {
+    /* DuplicateHandle can fail under resource pressure; leave the handle
+     * NULL so force_kill_thread falls back to the atomic-bool path only. */
+    slot.thread_handle = NULL;
+  }
+#endif
+  go_frankenphp_store_force_kill_slot(idx, slot);
+}
+
+void frankenphp_force_kill_thread(force_kill_slot slot) {
+  if (slot.vm_interrupt == NULL) {
+    /* Thread never reached register_thread_for_kill (aborted during boot). */
+    return;
+  }
+  /* Set the atomic bools first so that by the time the thread wakes up -
+   * whether from our signal/APC or naturally - the VM sees them and
+   * routes through zend_timeout() -> "Maximum execution time exceeded". */
+  zend_atomic_bool_store(slot.timed_out, true);
+  zend_atomic_bool_store(slot.vm_interrupt, true);
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* Return value intentionally ignored: ESRCH (thread already exited) and
+   * EINVAL are both benign - there is simply nothing to unblock. */
+  pthread_kill(slot.tid, FRANKENPHP_KILL_SIGNAL);
+#elif defined(PHP_WIN32)
+  if (slot.thread_handle != NULL) {
+    CancelSynchronousIo(slot.thread_handle);
+    QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, slot.thread_handle, 0);
+  }
+#endif
+}
+
+/* Releases any OS resource tied to the slot (currently: CloseHandle on
+ * Windows). Called by the Go side when a phpThread is torn down. */
+void frankenphp_release_thread_for_kill(force_kill_slot slot) {
+#ifdef PHP_WIN32
+  if (slot.thread_handle != NULL) {
+    CloseHandle(slot.thread_handle);
+  }
+#else
+  (void)slot;
+#endif
+}
+
 void frankenphp_update_local_thread_context(bool is_worker) {
   is_worker_thread = is_worker;
 

@@ -1065,6 +1207,23 @@ static void *php_thread(void *arg) {
   snprintf(thread_name, 16, "php-%" PRIxPTR, thread_index);
   set_thread_name(thread_name);
 
+  /* Tell the Go side a new native thread is entering the main loop so
+   * initPHPThreads can Wait() for abandoned threads from a previous
+   * Init cycle to fully unwind before reassigning phpThreads. Paired
+   * with go_frankenphp_thread_exited() at the single exit: label below. */
+  go_frankenphp_thread_spawned();
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* pthread_create inherits the caller's signal mask. frankenphp_new_php_thread
+   * is typically called from a goroutine pinned to a Go-managed M whose mask
+   * may block realtime signals. Explicitly unblock FRANKENPHP_KILL_SIGNAL so
+   * force-kill delivery is not silently discarded on this thread. */
+  sigset_t unblock;
+  sigemptyset(&unblock);
+  sigaddset(&unblock, FRANKENPHP_KILL_SIGNAL);
+  pthread_sigmask(SIG_UNBLOCK, &unblock, NULL);
+#endif
+
   /* Initial allocation of all global PHP memory for this thread */
 #ifdef ZTS
   (void)ts_resource(0);
@@ -1073,6 +1232,11 @@ static void *php_thread(void *arg) {
 #endif
 #endif
 
+  /* Register this thread's vm_interrupt/timed_out addresses so the Go side
+   * can force-kill it after the graceful-drain grace period if it gets stuck
+   * in a busy PHP loop. */
+  frankenphp_register_thread_for_kill(thread_index);
+
   bool thread_is_healthy = true;
   bool has_attempted_shutdown = false;
 
@@ -1150,6 +1314,15 @@ static void *php_thread(void *arg) {
   }
   zend_end_try();
 
+  /* Clear the force-kill slot BEFORE ts_free_thread: that call frees
+   * the TSRM storage that &EG(vm_interrupt) / &EG(timed_out) point at.
+   * Clearing afterwards (even under a write lock) would leave a window
+   * where a concurrent delivery reads the still-populated slot and
+   * writes into freed memory. Applies to both the healthy exit and the
+   * unhealthy-restart path below so every call to force_kill_thread
+   * sees either a valid or a zero-valued slot. */
+  go_frankenphp_clear_force_kill_slot(thread_index);
+
   /* free all global PHP memory reserved for this thread */
 #ifdef ZTS
   ts_free_thread();
@@ -1158,19 +1331,33 @@ static void *php_thread(void *arg) {
   /* Thread is healthy, signal to Go that the thread has shut down */
   if (thread_is_healthy) {
     go_frankenphp_on_thread_shutdown(thread_index);
-
-    return NULL;
+    goto exit;
   }
 
   /* Thread is unhealthy, PHP globals might be in a bad state after a bailout,
-   * restart the entire thread */
+   * restart the entire thread - unless the Go side has already declared the
+   * runtime to be shutting down via frankenphp_set_shutdown_in_progress().
+   * Respawning past that point would hand a fresh pthread a phpThreads
+   * slice that drainPHPThreads has already stopped tracking. */
+  if (zend_atomic_bool_load(&shutdown_in_progress)) {
+    frankenphp_log_message(
+        "Unhealthy thread unwinding after Shutdown; not restarting",
+        LOG_WARNING);
+    goto exit;
+  }
   frankenphp_log_message("Restarting unhealthy thread", LOG_WARNING);
 
   if (!frankenphp_new_php_thread(thread_index)) {
     /* probably unreachable */
     frankenphp_log_message("Failed to restart an unhealthy thread", LOG_ERR);
   }
 
+exit:
+  /* Single exit point: every path above that took the spawned() Add must
+   * route through here so lingeringThreads.Wait() in initPHPThreads can
+   * observe termination. Adding a new return above without going through
+   * exit would leak one Add across Init cycles. */
+  go_frankenphp_thread_exited();
   return NULL;
 }
 
@@ -1265,17 +1452,25 @@ static void *php_main(void *arg) {
 
   go_frankenphp_main_thread_is_ready();
 
-  /* channel closed, shutdown gracefully */
-  frankenphp_sapi_module.shutdown(&frankenphp_sapi_module);
-
-  sapi_shutdown();
+  /* channel closed, shutdown gracefully. If an abandoned PHP thread is
+   * still alive in a blocked syscall (RestartWorkers/Shutdown gave up
+   * after the force-kill deadline), wait a bounded window for it to
+   * unwind before running SAPI/TSRM teardown. On timeout, skip teardown
+   * entirely so the late-unwinding thread cannot touch freed state via
+   * ts_free_thread / php_request_shutdown (zend_catch) / SAPI callbacks.
+   * Process exit will reclaim the leaked state. */
+  if (go_frankenphp_wait_for_threads_exited()) {
+    frankenphp_sapi_module.shutdown(&frankenphp_sapi_module);
+
+    sapi_shutdown();
 #ifdef ZTS
-  tsrm_shutdown();
+    tsrm_shutdown();
 #endif
 
-  if (frankenphp_sapi_module.ini_entries) {
-    free((char *)frankenphp_sapi_module.ini_entries);
-    frankenphp_sapi_module.ini_entries = NULL;
+    if (frankenphp_sapi_module.ini_entries) {
+      free((char *)frankenphp_sapi_module.ini_entries);
+      frankenphp_sapi_module.ini_entries = NULL;
+    }
   }
 
   go_frankenphp_shutdown_main_thread();
@@ -1470,6 +1665,12 @@ int frankenphp_reset_opcache(void) {
 int frankenphp_get_current_memory_limit() { return PG(memory_limit); }
 
 void frankenphp_init_thread_metrics(int max_threads) {
+  /* Free any allocation left over from a prior Init: Shutdown no longer
+   * calls frankenphp_destroy_thread_metrics (abandoned threads may still
+   * be writing into the array when the blocked syscall unwinds), but
+   * initPHPThreads waits on lingeringThreads before reaching us so any
+   * such abandoned thread has already exited by the time we reallocate. */
+  free(thread_metrics);
   thread_metrics = calloc(max_threads, sizeof(frankenphp_thread_metrics));
 }
 

frankenphp.h

Lines changed: 35 additions & 0 deletions
@@ -46,6 +46,28 @@ static inline HRESULT LongLongSub(LONGLONG llMinuend, LONGLONG llSubtrahend,
 #include <stdbool.h>
 #include <stdint.h>
 
+#ifndef PHP_WIN32
+#include <pthread.h>
+#include <signal.h>
+#endif
+
+/* Platform capabilities for the force-kill primitive; declared in the
+ * header so Go (via CGo) gets the correct struct layout too. */
+#if !defined(PHP_WIN32) && defined(SIGRTMIN)
+#define FRANKENPHP_HAS_KILL_SIGNAL 1
+#define FRANKENPHP_KILL_SIGNAL (SIGRTMIN + 3)
+#endif
+
+typedef struct {
+  zend_atomic_bool *vm_interrupt;
+  zend_atomic_bool *timed_out;
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  pthread_t tid;
+#elif defined(PHP_WIN32)
+  HANDLE thread_handle;
+#endif
+} force_kill_slot;
+
 #ifndef FRANKENPHP_VERSION
 #define FRANKENPHP_VERSION dev
 #endif
@@ -193,6 +215,19 @@ void frankenphp_init_thread_metrics(int max_threads);
 void frankenphp_destroy_thread_metrics(void);
 size_t frankenphp_get_thread_memory_usage(uintptr_t thread_index);
 
+/* Best-effort force-kill primitives. The slot is populated by each PHP
+ * thread at boot (frankenphp_register_thread_for_kill calls back into Go
+ * via go_frankenphp_store_force_kill_slot) and lives in the Go-side
+ * phpThread. force_kill_thread interrupts the Zend VM at the next opcode
+ * boundary; on POSIX it also delivers SIGRTMIN+3 to the target thread,
+ * on Windows it calls CancelSynchronousIo + QueueUserAPC. release_thread
+ * drops any OS-owned resource tied to the slot (currently the Windows
+ * thread handle). */
+void frankenphp_set_shutdown_in_progress(bool v);
+void frankenphp_register_thread_for_kill(uintptr_t thread_index);
+void frankenphp_force_kill_thread(force_kill_slot slot);
+void frankenphp_release_thread_for_kill(force_kill_slot slot);
+
 void register_extensions(zend_module_entry **m, int len);
 
 #endif
