Skip to content

Commit 521e032

Browse files
committed
Fix epoll_ctl dropping registrations in multi-threaded guests
In a multi-threaded guest host_fd_ref_open() hands back a dup of the target fd that host_fd_ref_close() closes when the syscall returns. sys_epoll_ctl() used that transient dup as the kqueue knote ident, and the kernel drops a knote the moment its fd is closed -- so every epoll registration made while multi-threaded was torn down the instant epoll_ctl() returned, and epoll_pwait() never reported readiness again. Single-threaded guests borrow the raw fd (no dup, no close) and never hit it. Node's libuv DelayedTaskScheduler (eventfd + epoll backing uv_async_send) relied on this path and hung forever at process exit: the main thread blocked in WorkerThreadsTaskRunner::Shutdown -> uv_thread_join on a scheduler thread that could no longer be woken. Key the knote on the persistent host fd from the fd table. Take it from the same atomic fd_snapshot() that validates the fd, so the ident comes from the entry that was validated rather than a second fd_to_host() lookup that could race a concurrent close/reopen. Result mapping already uses udata (the guest fd), so the ident only needs to stay open and refer to the same open file description. Guard the close+reopen ABA with a per-slot generation counter. fd_table entries now carry a monotonic generation bumped on every allocation; epoll registrations stamp it at ADD/MOD. If the guest closes a watched fd and reopens it (reusing the guest fd number), the kernel has already dropped the original knote, yet reg->active still looks live -- a later DEL/MOD would EV_DELETE the wrong knote on the reused host fd. A mismatched generation now marks the registration gone, so DEL/MOD report ENOENT (matching Linux's auto-removal on close) and ADD starts fresh. Also implement the FIONBIO / FIOCLEX / FIONCLEX ioctls, which were falling through to ENOTTY. libuv's uv_pipe_open() sets non-blocking via FIONBIO, so Node's console.log() to a pipe threw "open ENOTTY". FIONBIO maps to F_SETFL O_NONBLOCK (status flag, shared across the dup). FIOCLEX/FIONCLEX mirror F_SETFD by toggling the fd_table cloexec bit rather than the host fd's FD_CLOEXEC, which is per-descriptor and lost on the dup. They need no host fd, so they dispatch before host_fd_ref_open_regular_io() -- which rejects O_PATH (FD_PATH) with EBADF, while Linux allows these ioctls (like F_SETFD) on O_PATH fds -- and validate the slot and flip the flag in a single fd_lock section, so there is no validate-then-mutate window for a concurrent close/reuse to flip cloexec on a different file. Add tests/test-epoll-mt.c: a CLONE_THREAD sibling keeps the guest multi-threaded across epoll_ctl, then asserts a registered eventfd and pipe still deliver an EPOLLIN edge. It fails without the poll.c fix. Add tests/test-ioctl-cloexec.c covering FIOCLEX/FIONCLEX round-trip on both a regular and an O_PATH fd. Both are listed in tests/manifest.txt so the driver runs them under make check. With these, node:alpine (node v26.3.0) runs JavaScript, timers, the libuv threadpool, and promises, and exits cleanly.
1 parent 23b8300 commit 521e032

7 files changed

Lines changed: 539 additions & 17 deletions

File tree

src/syscall/abi.h

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -352,6 +352,9 @@ typedef struct {
352352
#define LINUX_TIOCSCTTY 0x540E /* -> macOS TIOCSCTTY (same semantics) */
353353
#define LINUX_TIOCGWINSZ 0x5413 /* -> macOS TIOCGWINSZ (same struct) */
354354
#define LINUX_FIONREAD 0x541B /* -> macOS FIONREAD (same semantics) */
355+
#define LINUX_FIONBIO 0x5421 /* set/clear O_NONBLOCK (arg: int *) */
356+
#define LINUX_FIONCLEX 0x5450 /* clear close-on-exec on fd */
357+
#define LINUX_FIOCLEX 0x5451 /* set close-on-exec on fd */
355358
#define LINUX_TIOCNOTTY 0x5422 /* -> macOS TIOCNOTTY (same semantics) */
356359
#define LINUX_TIOCGSID 0x5429 /* -> macOS TIOCGSID (same semantics) */
357360
/* termios2 variant (adds c_ispeed/c_ospeed) */
@@ -705,7 +708,10 @@ typedef struct {
705708
typedef struct {
706709
int type; /* FD_CLOSED, FD_STDIO, FD_REGULAR, FD_DIR */
707710
int host_fd; /* Underlying macOS file descriptor */
708-
uint64_t generation; /* Bumped each time this guest fd slot is reused. */
711+
uint64_t generation; /* Bumped each time this guest fd slot is reused. Lets
712+
* long-lived references (e.g. epoll registrations)
713+
* detect a close+reopen ABA where the slot now holds a
714+
* different open file. */
709715
int linux_flags; /* Linux open flags (for CLOEXEC tracking) */
710716
void *dir; /* DIR* for FD_DIR entries (NULL otherwise) */
711717
char proc_path[FD_VIRTUAL_PATH_MAX]; /* Virtual /proc dir root for *at */

src/syscall/io.c

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1476,6 +1476,33 @@ int64_t sys_pwritev2(guest_t *g,
14761476

14771477
int64_t sys_ioctl(guest_t *g, int fd, uint64_t request, uint64_t arg)
14781478
{
1479+
/* FIOCLEX/FIONCLEX are the ioctl form of fcntl(F_SETFD): they set/clear the
1480+
* guest close-on-exec flag, which lives in fd_table linux_flags (not the
1481+
* host fd's FD_CLOEXEC, which is per-descriptor and would be lost on the dup
1482+
* that host_fd_ref hands multi-threaded callers, so mirror the F_SETFD path
1483+
* in sys_fcntl). They need no host fd, so dispatch them before
1484+
* host_fd_ref_open_regular_io(): that helper rejects O_PATH (FD_PATH) fds
1485+
* with EBADF, but Linux allows these ioctls -- like fcntl(F_SETFD) -- on
1486+
* O_PATH descriptors. Validate the slot and mutate the flag in a single
1487+
* fd_lock section so there is no validate-then-mutate window in which a
1488+
* concurrent close/reuse could flip CLOEXEC on a different file that took
1489+
* the slot. The arg is ignored. */
1490+
if (request == LINUX_FIOCLEX || request == LINUX_FIONCLEX) {
1491+
if (!RANGE_CHECK(fd, 0, FD_TABLE_SIZE))
1492+
return -LINUX_EBADF;
1493+
pthread_mutex_lock(&fd_lock);
1494+
if (fd_table[fd].type == FD_CLOSED) {
1495+
pthread_mutex_unlock(&fd_lock);
1496+
return -LINUX_EBADF;
1497+
}
1498+
if (request == LINUX_FIOCLEX)
1499+
fd_table[fd].linux_flags |= LINUX_O_CLOEXEC;
1500+
else
1501+
fd_table[fd].linux_flags &= ~LINUX_O_CLOEXEC;
1502+
pthread_mutex_unlock(&fd_lock);
1503+
return 0;
1504+
}
1505+
14791506
host_fd_ref_t host_ref;
14801507
int64_t err = host_fd_ref_open_regular_io(fd, &host_ref);
14811508
if (err < 0)
@@ -1688,6 +1715,29 @@ int64_t sys_ioctl(guest_t *g, int fd, uint64_t request, uint64_t arg)
16881715
return 0;
16891716
}
16901717

1718+
case LINUX_FIONBIO: {
1719+
/* Set/clear O_NONBLOCK on the fd. Linux FIONBIO takes an int* arg:
1720+
* nonzero enables non-blocking, zero disables it. libuv's
1721+
* uv__nonblock_ioctl() (its default on Linux) issues this on pipe and
1722+
* socket fds at setup; without it the guest's uv_pipe_open() fails with
1723+
* ENOTTY and Node's stdio stream construction throws.
1724+
*/
1725+
int32_t on = 0;
1726+
if (guest_read_small(g, arg, &on, sizeof(on)) < 0) {
1727+
host_fd_ref_close(&host_ref);
1728+
return -LINUX_EFAULT;
1729+
}
1730+
int flags = fcntl(host_fd, F_GETFL);
1731+
if (flags < 0) {
1732+
host_fd_ref_close(&host_ref);
1733+
return linux_errno();
1734+
}
1735+
flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
1736+
int r = fcntl(host_fd, F_SETFL, flags);
1737+
host_fd_ref_close(&host_ref);
1738+
return r < 0 ? linux_errno() : 0;
1739+
}
1740+
16911741
default:
16921742
host_fd_ref_close(&host_ref);
16931743
return -LINUX_ENOTTY;

src/syscall/poll.c

Lines changed: 43 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -694,6 +694,11 @@ typedef struct {
694694
typedef struct {
695695
uint32_t events; /* Registered EPOLL* events mask */
696696
uint64_t data; /* User data to return in epoll_wait */
697+
uint64_t generation; /* fd_entry_t.generation captured at ADD/MOD. Detects a
698+
* close+reopen ABA: if the guest fd's current
699+
* generation no longer matches, the registered open
700+
* file is gone and this stale entry must not drive
701+
* kevent against the reused host fd. */
697702
bool active; /* Registered in this instance */
698703
bool oneshot_armed; /* EPOLLONESHOT and event already fired,
699704
* waiting for EPOLL_CTL_MOD re-arm.
@@ -781,18 +786,43 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
781786
return -LINUX_EINVAL;
782787
}
783788

784-
host_fd_ref_t target_ref;
785-
if (host_fd_ref_open(fd, &target_ref) < 0) {
789+
/* Validate the target fd and read its persistent host fd in a single
790+
* fd_lock snapshot, so the kqueue knote ident is taken from the same entry
791+
* that was validated. A kqueue knote is keyed by the fd number and the
792+
* kernel drops it the moment that fd is closed, so the ident must be the
793+
* persistent host fd from the fd table -- not the dup that
794+
* host_fd_ref_open() hands multi-threaded callers, which
795+
* host_fd_ref_close() closes when the syscall returns (silently tearing the
796+
* registration down). Snapshotting (rather than host_fd_ref_open() + a
797+
* separate fd_to_host()) keeps the validate and the ident read atomic under
798+
* one fd_lock. The snapshot's generation then guards the cross-call ABA
799+
* below. Result mapping uses udata (the guest fd), so the ident only needs
800+
* to stay open and refer to the same open file description. */
801+
fd_entry_t target_snap;
802+
if (!fd_snapshot(fd, &target_snap)) {
786803
host_fd_ref_close(&epoll_ref);
787804
return -LINUX_EBADF;
788805
}
806+
int target_host_fd = target_snap.host_fd;
789807

790808
epoll_reg_t *reg = &inst->regs[fd];
791809

810+
/* Cross-call ABA guard. If the guest closed this fd and reopened it (or the
811+
* slot was reused) since the registration was stamped, the kernel already
812+
* dropped the original knote when the old host fd closed, yet the guest fd
813+
* number -- and thus reg->active -- still looks live. Acting on it would
814+
* EV_DELETE/EV_MOD the wrong knote on the reused host fd. A mismatched
815+
* generation means the registration is gone: drop it so DEL/MOD report
816+
* ENOENT (matching Linux's auto-removal on close) and ADD starts fresh. */
817+
if ((reg->active || reg->oneshot_armed) &&
818+
reg->generation != target_snap.generation) {
819+
reg->active = false;
820+
reg->oneshot_armed = false;
821+
}
822+
792823
if (op == LINUX_EPOLL_CTL_DEL) {
793824
/* Linux returns ENOENT when removing an unregistered fd */
794825
if (!reg->active) {
795-
host_fd_ref_close(&target_ref);
796826
host_fd_ref_close(&epoll_ref);
797827
return -LINUX_ENOENT;
798828
}
@@ -804,12 +834,12 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
804834
int nchanges = 0;
805835
{
806836
if (reg->events & (LINUX_EPOLLIN | LINUX_EPOLLRDHUP)) {
807-
EV_SET(&changes[nchanges], target_ref.fd, EVFILT_READ,
837+
EV_SET(&changes[nchanges], target_host_fd, EVFILT_READ,
808838
EV_DELETE, 0, 0, NULL);
809839
nchanges++;
810840
}
811841
if (reg->events & LINUX_EPOLLOUT) {
812-
EV_SET(&changes[nchanges], target_ref.fd, EVFILT_WRITE,
842+
EV_SET(&changes[nchanges], target_host_fd, EVFILT_WRITE,
813843
EV_DELETE, 0, 0, NULL);
814844
nchanges++;
815845
}
@@ -819,7 +849,6 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
819849
/* Clear stale state for potential re-add */
820850
reg->oneshot_armed = false;
821851
}
822-
host_fd_ref_close(&target_ref);
823852
host_fd_ref_close(&epoll_ref);
824853
return 0;
825854
}
@@ -829,20 +858,17 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
829858
* (EPOLLONESHOT fired, waiting for re-arm) are still valid for MOD.
830859
*/
831860
if (op == LINUX_EPOLL_CTL_ADD && reg->active) {
832-
host_fd_ref_close(&target_ref);
833861
host_fd_ref_close(&epoll_ref);
834862
return -LINUX_EEXIST;
835863
}
836864
if (op == LINUX_EPOLL_CTL_MOD && !reg->active && !reg->oneshot_armed) {
837-
host_fd_ref_close(&target_ref);
838865
host_fd_ref_close(&epoll_ref);
839866
return -LINUX_ENOENT;
840867
}
841868

842869
/* ADD or MOD: read the epoll_event from guest */
843870
linux_epoll_event_t ev;
844871
if (guest_read_small(g, event_gva, &ev, sizeof(ev)) < 0) {
845-
host_fd_ref_close(&target_ref);
846872
host_fd_ref_close(&epoll_ref);
847873
return -LINUX_EFAULT;
848874
}
@@ -860,11 +886,11 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
860886
if (op == LINUX_EPOLL_CTL_MOD && reg->active) {
861887
struct kevent del;
862888
if (reg->events & (LINUX_EPOLLIN | LINUX_EPOLLRDHUP)) {
863-
EV_SET(&del, target_ref.fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
889+
EV_SET(&del, target_host_fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
864890
kevent(epoll_ref.fd, &del, 1, NULL, 0, NULL);
865891
}
866892
if (reg->events & LINUX_EPOLLOUT) {
867-
EV_SET(&del, target_ref.fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
893+
EV_SET(&del, target_host_fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
868894
kevent(epoll_ref.fd, &del, 1, NULL, 0, NULL);
869895
}
870896
}
@@ -894,33 +920,34 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva)
894920
void *udata = (void *) (uintptr_t) fd;
895921

896922
if (ev.events & (LINUX_EPOLLIN | LINUX_EPOLLRDHUP)) {
897-
EV_SET(&changes[nchanges], target_ref.fd, EVFILT_READ, kflags, 0, 0,
923+
EV_SET(&changes[nchanges], target_host_fd, EVFILT_READ, kflags, 0, 0,
898924
udata);
899925
nchanges++;
900926
}
901927
if (ev.events & LINUX_EPOLLOUT) {
902-
EV_SET(&changes[nchanges], target_ref.fd, EVFILT_WRITE, kflags, 0, 0,
928+
EV_SET(&changes[nchanges], target_host_fd, EVFILT_WRITE, kflags, 0, 0,
903929
udata);
904930
nchanges++;
905931
}
906932

907933
if (nchanges > 0) {
908934
if (kevent(epoll_ref.fd, changes, nchanges, NULL, 0, NULL) < 0) {
909-
host_fd_ref_close(&target_ref);
910935
host_fd_ref_close(&epoll_ref);
911936
return linux_errno();
912937
}
913938
}
914939

915940
/* Store registration data in per-instance table.
916-
* Clear oneshot_armed when MOD successfully re-arms.
941+
* Clear oneshot_armed when MOD successfully re-arms. Stamp the snapshot's
942+
* generation so a later close+reopen of this guest fd is detected as a
943+
* stale registration by the ABA guard above.
917944
*/
918945
reg->events = ev.events;
919946
reg->data = ev.data;
947+
reg->generation = target_snap.generation;
920948
reg->active = true;
921949
reg->oneshot_armed = false;
922950

923-
host_fd_ref_close(&target_ref);
924951
host_fd_ref_close(&epoll_ref);
925952
return 0;
926953
}

tests/manifest.txt

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,8 +62,11 @@ test-signalfd
6262
test-signalfd-hardening
6363
test-epoll
6464
test-epoll-edge
65+
test-epoll-mt
66+
test-epoll-aba
6567
test-timerfd
6668
test-large-io-boundary
69+
test-ioctl-cloexec
6770

6871
[section] /proc and /dev emulation tests
6972
test-proc

0 commit comments

Comments
 (0)