Commit bb8461b

feat: cross-platform force-kill primitive for stuck PHP threads
Introduces a self-contained primitive that unblocks a PHP thread stuck in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers, DrainWorkers, and Shutdown makes progress instead of hanging for the duration of the block. The primitive is useful on its own and gives follow-up graceful-shutdown work a reviewed foundation to build on.

Design: each PHP thread, at boot from its own TSRM context, hands a force_kill_slot (pointers to its own EG(vm_interrupt) and EG(timed_out) atomic bools, plus pthread_t / Windows HANDLE for the wake-up) back to Go via go_frankenphp_store_force_kill_slot. The slot lives on phpThread and is protected by a per-thread RWMutex so the zero-and-release path at thread exit cannot race an in-flight kill. From any goroutine, Go passes the slot back to frankenphp_force_kill_thread, which stores true into both bools (waking the VM at the next opcode boundary and routing through zend_timeout -> "Maximum execution time exceeded") and then delivers a platform-specific wake-up:

- Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler installed via pthread_once, SA_ONSTACK, no SA_RESTART. Signal delivery causes any in-flight blocking syscall to return EINTR.
- Windows: CancelSynchronousIo + QueueUserAPC covers alertable I/O and SleepEx. Non-alertable Sleep (including PHP's usleep) stays uninterruptible.
- macOS: atomic-bool-only path. Threads stuck in blocking syscalls wait to return on their own.

JIT caveat: under the OPcache JIT some hot code paths skip vm_interrupt checks (see php-src#21267), so a pure-PHP busy loop under JIT may not observe the store and will fall through to the abandon path below.

Drain flow:

- worker.go: drainWorkerThreads waits drainGracePeriod (5s) for each drained thread to reach Yielding; then arms force-kill on stragglers and waits forceKillDeadline (5s) more. Threads still stuck past that are abandoned rather than hanging the drain forever.
- drainWorkerThreads returns (drained, abandoned). RestartWorkers puts drained threads back to Ready and abandoned ones into the new state.Abandoned (handlers treat it like ShuttingDown on next callback) so an abandoned thread that finally unwinds exits instead of re-entering the serve loop under stale request state. Abandoned threads are also filtered out of worker.threads immediately so isAtThreadLimit and the scaler see accurate capacity; the matching deactivateThreads cleanup drops Abandoned/Done entries from the auto-scaling slice so they do not permanently consume a global scaling slot. If any were abandoned, RestartWorkers returns errIncompleteRestart wrapped with abandoned/restarted counts - admin endpoint and watcher surface it.
- phpthread.go: phpThread.shutdown mirrors the same grace + force-kill + abandon pattern so Shutdown cannot hang on an uninterruptible blocking call either. RequestSafeStateChange and shutdown's fast-fail WaitFor both accept Abandoned so Shutdown racing a RestartWorkers that marks a thread Abandoned does not park on a transition that will never come.

Lifecycle hardening: Shutdown intentionally leaves phpThreads allocated and thread_metrics alive - an abandoned thread that eventually unwinds still calls through the SAPI and lifecycle callbacks which index those structures. initPHPThreads blocks on a package-level sync.WaitGroup (Add on every php_thread entry, Done on every exit path, routed through a single goto exit: label in php_thread so future return paths cannot silently leak an Add) so the next Init cycle cannot reassign them out from under a lingering abandoned thread, then frees the previous allocation inside frankenphp_init_thread_metrics before allocating fresh. A dedicated C-side atomic (shutdown_in_progress, toggled by frankenphp_set_shutdown_in_progress) is the signal the unhealthy-thread restart path uses to refuse respawning past Shutdown.

- go_frankenphp_store_force_kill_slot / clear_force_kill_slot / on_thread_shutdown: take the per-thread write lock; clear runs before ts_free_thread on both healthy and unhealthy exit paths so the captured &EG() pointers are zeroed before their backing storage is freed.
- php_thread unblocks FRANKENPHP_KILL_SIGNAL with pthread_sigmask at startup so Go's runtime signal mask cannot silently drop deliveries.

Teardown after abandonment: php_main runs sapi_shutdown / tsrm_shutdown unconditionally once mainThread.state.Done is observed. By definition we already gave up on abandoned threads in drainPHPThreads, so we tear down rather than try to outlive them. If an abandoned thread ever does unwind after teardown it would touch torn-down state; embedders that observe errIncompleteRestart and care about cleanliness should terminate rather than re-Init.

- worker_test.go + testdata/worker-sleep.php: the regression test drives the full path via a request marker file so it only arms RestartWorkers once the worker is proven parked in sleep(), then asserts both the bounded elapsed time and that the "should not reach" line after sleep never runs (which would indicate the VM interrupt was never observed).

RestartWorkers now returns an error - a source-compatible Go API change (callers that ignored it still compile) but worth noting for embedders.
1 parent a05e6dd commit bb8461b

16 files changed

Lines changed: 546 additions & 46 deletions

caddy/admin.go

Lines changed: 7 additions & 1 deletion
@@ -39,7 +39,13 @@ func (admin *FrankenPHPAdmin) restartWorkers(w http.ResponseWriter, r *http.Requ
 		return admin.error(http.StatusMethodNotAllowed, fmt.Errorf("method not allowed"))
 	}

-	frankenphp.RestartWorkers()
+	if err := frankenphp.RestartWorkers(); err != nil {
+		// Restart is incomplete: at least one worker thread was stuck in
+		// an uninterruptible blocking call and did not reload code. Do
+		// not let the admin endpoint lie to automation with a 200.
+		caddy.Log().Sugar().Errorf("workers restart incomplete: %v", err)
+		return admin.error(http.StatusInternalServerError, err)
+	}
 	caddy.Log().Info("workers restarted from admin api")
 	admin.success(w, "workers restarted successfully\n")

frankenphp.c

Lines changed: 150 additions & 5 deletions
@@ -92,6 +92,107 @@ static bool is_forked_child = false;
 static void frankenphp_fork_child(void) { is_forked_child = true; }
 #endif

+/* Best-effort force-kill for stuck PHP threads.
+ *
+ * Each thread captures &EG(vm_interrupt) / &EG(timed_out) at boot and
+ * hands them to Go via go_frankenphp_store_force_kill_slot. To kill,
+ * Go passes the slot back to frankenphp_force_kill_thread, which stores
+ * true into both bools (the VM bails through zend_timeout() at the next
+ * opcode boundary) and then wakes any in-flight syscall:
+ *   - Linux/FreeBSD: pthread_kill(SIGRTMIN+3) -> EINTR.
+ *   - Windows: CancelSynchronousIo + QueueUserAPC for alertable I/O +
+ *     SleepEx. Non-alertable Sleep (including PHP's usleep) stays stuck.
+ *   - macOS: atomic-bool only; busy loops bail, blocking syscalls don't.
+ *
+ * Reserved signal: SIGRTMIN+3. PHP's pcntl_signal(SIGRTMIN+3, ...)
+ * clobbers it. glibc NPTL reserves SIGRTMIN..SIGRTMIN+2; embedders with
+ * their own Go signal usage may need to patch this constant.
+ *
+ * The slot lives Go-side on phpThread; the C side has no global table.
+ * The signal handler is installed once via pthread_once. */
+#ifdef PHP_WIN32
+static void CALLBACK frankenphp_noop_apc(ULONG_PTR param) { (void)param; }
+#endif
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+/* No-op: delivery itself is what unblocks the syscall via EINTR. */
+static void frankenphp_kill_signal_handler(int sig) { (void)sig; }
+
+static pthread_once_t kill_signal_handler_installed = PTHREAD_ONCE_INIT;
+static void install_kill_signal_handler(void) {
+  /* No SA_RESTART so syscalls return EINTR rather than being restarted.
+   * SA_ONSTACK guards against an accidental process-level delivery to a
+   * Go-managed thread, where Go requires the alternate signal stack. */
+  struct sigaction sa;
+  memset(&sa, 0, sizeof(sa));
+  sa.sa_handler = frankenphp_kill_signal_handler;
+  sigemptyset(&sa.sa_mask);
+  sa.sa_flags = SA_ONSTACK;
+  sigaction(FRANKENPHP_KILL_SIGNAL, &sa, NULL);
+}
+#endif
+
+/* Set by frankenphp_set_shutdown_in_progress to gate the unhealthy-thread
+ * respawn loop off once Shutdown begins. */
+static zend_atomic_bool shutdown_in_progress;
+
+void frankenphp_set_shutdown_in_progress(bool v) {
+  zend_atomic_bool_store(&shutdown_in_progress, v);
+}
+
+/* Must run on the PHP thread itself: EG() resolves to its own TSRM
+ * context and pthread_self() captures the right tid. */
+void frankenphp_register_thread_for_kill(uintptr_t idx) {
+  force_kill_slot slot;
+  memset(&slot, 0, sizeof(slot));
+  slot.vm_interrupt = &EG(vm_interrupt);
+  slot.timed_out = &EG(timed_out);
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  slot.tid = pthread_self();
+  pthread_once(&kill_signal_handler_installed, install_kill_signal_handler);
+#elif defined(PHP_WIN32)
+  if (!DuplicateHandle(GetCurrentProcess(), GetCurrentThread(),
+                       GetCurrentProcess(), &slot.thread_handle, 0, FALSE,
+                       DUPLICATE_SAME_ACCESS)) {
+    /* On failure, force_kill falls back to atomic-bool only. */
+    slot.thread_handle = NULL;
+  }
+#endif
+  go_frankenphp_store_force_kill_slot(idx, slot);
+}
+
+void frankenphp_force_kill_thread(force_kill_slot slot) {
+  if (slot.vm_interrupt == NULL) {
+    /* Boot aborted before register_thread_for_kill. */
+    return;
+  }
+  /* Atomic stores first: by the time the thread wakes (signal-driven or
+   * natural) the VM sees them and bails through zend_timeout(). */
+  zend_atomic_bool_store(slot.timed_out, true);
+  zend_atomic_bool_store(slot.vm_interrupt, true);
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* ESRCH (thread already exited) / EINVAL are both benign here. */
+  pthread_kill(slot.tid, FRANKENPHP_KILL_SIGNAL);
+#elif defined(PHP_WIN32)
+  if (slot.thread_handle != NULL) {
+    CancelSynchronousIo(slot.thread_handle);
+    QueueUserAPC((PAPCFUNC)frankenphp_noop_apc, slot.thread_handle, 0);
+  }
+#endif
+}
+
+/* CloseHandle on Windows; no-op on POSIX. */
+void frankenphp_release_thread_for_kill(force_kill_slot slot) {
+#ifdef PHP_WIN32
+  if (slot.thread_handle != NULL) {
+    CloseHandle(slot.thread_handle);
+  }
+#else
+  (void)slot;
+#endif
+}
+
 void frankenphp_update_local_thread_context(bool is_worker) {
   is_worker_thread = is_worker;

@@ -1065,6 +1166,20 @@ static void *php_thread(void *arg) {
   snprintf(thread_name, 16, "php-%" PRIxPTR, thread_index);
   set_thread_name(thread_name);

+  /* Paired with go_frankenphp_thread_exited at the exit: label below;
+   * lets initPHPThreads wait for prior-generation threads. */
+  go_frankenphp_thread_spawned();
+
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  /* The spawning Go-managed M may block realtime signals, which the
+   * new pthread inherits. Unblock FRANKENPHP_KILL_SIGNAL here so
+   * force-kill deliveries are not silently dropped. */
+  sigset_t unblock;
+  sigemptyset(&unblock);
+  sigaddset(&unblock, FRANKENPHP_KILL_SIGNAL);
+  pthread_sigmask(SIG_UNBLOCK, &unblock, NULL);
+#endif
+
   /* Initial allocation of all global PHP memory for this thread */
 #ifdef ZTS
   (void)ts_resource(0);
@@ -1073,6 +1188,11 @@ static void *php_thread(void *arg) {
 #endif
 #endif

+  /* Register this thread's vm_interrupt/timed_out addresses so the Go side
+   * can force-kill it after the graceful-drain grace period if it gets stuck
+   * in a busy PHP loop. */
+  frankenphp_register_thread_for_kill(thread_index);
+
   bool thread_is_healthy = true;
   bool has_attempted_shutdown = false;

@@ -1150,6 +1270,11 @@ static void *php_thread(void *arg) {
   }
   zend_end_try();

+  /* Must precede ts_free_thread: that frees the TSRM storage backing
+   * the slot's &EG() pointers. Clearing first means any concurrent
+   * force-kill either ran before us or sees a zero slot. */
+  go_frankenphp_clear_force_kill_slot(thread_index);
+
   /* free all global PHP memory reserved for this thread */
 #ifdef ZTS
   ts_free_thread();
@@ -1158,19 +1283,28 @@ static void *php_thread(void *arg) {
   /* Thread is healthy, signal to Go that the thread has shut down */
   if (thread_is_healthy) {
     go_frankenphp_on_thread_shutdown(thread_index);
-
-    return NULL;
+    goto exit;
   }

-  /* Thread is unhealthy, PHP globals might be in a bad state after a bailout,
-   * restart the entire thread */
+  /* Unhealthy: respawn unless Shutdown is in progress; respawning then
+   * would hand a fresh pthread a phpThreads slice already untracked. */
+  if (zend_atomic_bool_load(&shutdown_in_progress)) {
+    frankenphp_log_message(
+        "Unhealthy thread unwinding after Shutdown; not restarting",
+        LOG_WARNING);
+    goto exit;
+  }
   frankenphp_log_message("Restarting unhealthy thread", LOG_WARNING);

   if (!frankenphp_new_php_thread(thread_index)) {
     /* probably unreachable */
     frankenphp_log_message("Failed to restart an unhealthy thread", LOG_ERR);
   }

+exit:
+  /* Single exit so spawned()/exited() pairing can never drift if a new
+   * return is added above. */
+  go_frankenphp_thread_exited();
   return NULL;
 }

@@ -1265,7 +1399,14 @@ static void *php_main(void *arg) {

   go_frankenphp_main_thread_is_ready();

-  /* channel closed, shutdown gracefully */
+  /* channel closed, shutdown gracefully. Abandoned threads (force-kill
+   * could not interrupt them within forceKillDeadline) may still unwind
+   * later when the syscall returns naturally - sleep finishes, select
+   * times out, etc. Waiting for them here would mean blocking Shutdown
+   * for an unbounded duration, so we run SAPI/TSRM teardown anyway and
+   * accept the use-after-free risk if one of them resumes afterwards.
+   * Embedders that observe errIncompleteRestart should terminate the
+   * process rather than re-Init - see RestartWorkers' doc. */
   frankenphp_sapi_module.shutdown(&frankenphp_sapi_module);

   sapi_shutdown();
@@ -1470,6 +1611,10 @@ int frankenphp_reset_opcache(void) {
 int frankenphp_get_current_memory_limit() { return PG(memory_limit); }

 void frankenphp_init_thread_metrics(int max_threads) {
+  /* Frees any prior generation's allocation; Shutdown leaves it alive
+   * for abandoned threads. lingeringThreads.Wait() upstream guarantees
+   * those have all exited before we get here. free(NULL) is a no-op. */
+  free(thread_metrics);
   thread_metrics = calloc(max_threads, sizeof(frankenphp_thread_metrics));
 }

frankenphp.h

Lines changed: 35 additions & 0 deletions
@@ -46,6 +46,28 @@ static inline HRESULT LongLongSub(LONGLONG llMinuend, LONGLONG llSubtrahend,
 #include <stdbool.h>
 #include <stdint.h>

+#ifndef PHP_WIN32
+#include <pthread.h>
+#include <signal.h>
+#endif
+
+/* Platform capabilities for the force-kill primitive; declared in the
+ * header so Go (via CGo) gets the correct struct layout too. */
+#if !defined(PHP_WIN32) && defined(SIGRTMIN)
+#define FRANKENPHP_HAS_KILL_SIGNAL 1
+#define FRANKENPHP_KILL_SIGNAL (SIGRTMIN + 3)
+#endif
+
+typedef struct {
+  zend_atomic_bool *vm_interrupt;
+  zend_atomic_bool *timed_out;
+#ifdef FRANKENPHP_HAS_KILL_SIGNAL
+  pthread_t tid;
+#elif defined(PHP_WIN32)
+  HANDLE thread_handle;
+#endif
+} force_kill_slot;
+
 #ifndef FRANKENPHP_VERSION
 #define FRANKENPHP_VERSION dev
 #endif
@@ -193,6 +215,19 @@ void frankenphp_init_thread_metrics(int max_threads);
 void frankenphp_destroy_thread_metrics(void);
 size_t frankenphp_get_thread_memory_usage(uintptr_t thread_index);

+/* Best-effort force-kill primitives. The slot is populated by each PHP
+ * thread at boot (frankenphp_register_thread_for_kill calls back into Go
+ * via go_frankenphp_store_force_kill_slot) and lives in the Go-side
+ * phpThread. force_kill_thread interrupts the Zend VM at the next opcode
+ * boundary; on POSIX it also delivers SIGRTMIN+3 to the target thread,
+ * on Windows it calls CancelSynchronousIo + QueueUserAPC. release_thread
+ * drops any OS-owned resource tied to the slot (currently the Windows
+ * thread handle). */
+void frankenphp_set_shutdown_in_progress(bool v);
+void frankenphp_register_thread_for_kill(uintptr_t thread_index);
+void frankenphp_force_kill_thread(force_kill_slot slot);
+void frankenphp_release_thread_for_kill(force_kill_slot slot);
+
 void register_extensions(zend_module_entry **m, int len);

 #endif

internal/state/state.go

Lines changed: 15 additions & 5 deletions
@@ -35,6 +35,13 @@ const (
 	Rebooting
 	// C thread has exited and ZTS state is cleaned up, ready to spawn a new C thread
 	RebootReady
+
+	// Abandoned is set by RestartWorkers on threads that did not yield
+	// within the force-kill deadline. Handlers treat it like ShuttingDown
+	// on the next callback, so an abandoned thread that eventually
+	// unwinds exits cleanly instead of re-entering the serve loop with
+	// stale request state.
+	Abandoned
 )

 func (s State) String() string {
@@ -67,6 +74,8 @@ func (s State) String() string {
 		return "rebooting"
 	case RebootReady:
 		return "reboot ready"
+	case Abandoned:
+		return "abandoned"
 	default:
 		return "unknown"
 	}
@@ -172,12 +181,12 @@ func (ts *ThreadState) WaitFor(states ...State) {
 func (ts *ThreadState) RequestSafeStateChange(nextState State) bool {
 	ts.mu.Lock()
 	switch ts.currentState {
-	// disallow state changes if shutting down or done
-	case ShuttingDown, Done, Reserved:
+	// Terminal: Abandoned never transitions to Ready/Inactive/ShuttingDown,
+	// so waiting would park forever.
+	case ShuttingDown, Done, Reserved, Abandoned:
 		ts.mu.Unlock()

 		return false
-	// ready and inactive are safe states to transition from
 	case Ready, Inactive:
 		ts.currentState = nextState
 		ts.notifySubscribers(nextState)
@@ -187,8 +196,9 @@ func (ts *ThreadState) RequestSafeStateChange(nextState State) bool {
 	}
 	ts.mu.Unlock()

-	// wait for the state to change to a safe state
-	ts.WaitFor(Ready, Inactive, ShuttingDown)
+	// Done and Abandoned in the set so a concurrent terminal transition
+	// wakes us; the recursive call below then hits the reject branch.
+	ts.WaitFor(Ready, Inactive, ShuttingDown, Done, Abandoned)

 	return ts.RequestSafeStateChange(nextState)
 }
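
Why the wait set has to include the terminal states can be shown with a reduced model. The sketch below is not the package's implementation: it swaps the subscriber-based WaitFor and the recursive call for a sync.Cond loop, uses a cut-down state list, and the `New` and `Set` helpers are hypothetical. The point it illustrates is the same: a waiter parked on a busy thread must be woken, and rejected, when that thread becomes Abandoned.

```go
package main

import (
	"fmt"
	"sync"
)

type State int

const (
	Ready State = iota
	Busy
	ShuttingDown
	Abandoned
)

// ThreadState is a reduced model of the real type, built on sync.Cond.
type ThreadState struct {
	mu   sync.Mutex
	cond *sync.Cond
	cur  State
}

func New(s State) *ThreadState {
	ts := &ThreadState{cur: s}
	ts.cond = sync.NewCond(&ts.mu)
	return ts
}

// Set changes the state unconditionally and wakes every waiter.
func (ts *ThreadState) Set(s State) {
	ts.mu.Lock()
	ts.cur = s
	ts.mu.Unlock()
	ts.cond.Broadcast()
}

// RequestSafeStateChange moves to next only from Ready. Terminal states
// reject immediately; any other state waits for the next transition.
func (ts *ThreadState) RequestSafeStateChange(next State) bool {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	for {
		switch ts.cur {
		case ShuttingDown, Abandoned:
			return false // terminal: waiting would park forever
		case Ready:
			ts.cur = next
			ts.cond.Broadcast()
			return true
		}
		ts.cond.Wait() // woken by any transition, including Abandoned
	}
}

func main() {
	ts := New(Busy)
	result := make(chan bool)
	go func() { result <- ts.RequestSafeStateChange(ShuttingDown) }()

	// A concurrent restart gives up on the thread.
	ts.Set(Abandoned)
	fmt.Println("change accepted:", <-result)
}
```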
