[Bug]: Frequent crashes due to vsock #678

@DePasqualeOrg

Description

I have done the following

  • I have searched the existing issues
  • If possible, I've reproduced the issue using the 'main' branch of this project

Steps to reproduce

Not well understood

Current behavior

I've been seeing frequent crashes in an app that uses Containerization. The following is Claude Code's analysis of three crash reports from today:

Related: #403, #503, #552, #572

Environment

  • macOS 26.4 (25E246), Apple Silicon (Mac15,3)
  • Xcode 26.4 (17E192), Swift 6.3
  • Containerization from commit 250546f
  • grpc-swift 1.x (the version currently pinned in Containerization's Package.resolved)
  • Debug build run from Xcode

Three crashes, two patterns

All three crashes are EXC_BREAKPOINT (SIGTRAP) caused by errno 9 (EBADF) on a vsock file descriptor used by the gRPC connection to the VM agent. The gRPC/NIO stack hits a precondition failure (preconditionIsNotUnacceptableErrno) when the syscall returns EBADF.
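For context, the trap comes from NIO's checked-syscall pattern, which classifies certain errno values as programmer errors. A minimal sketch of that pattern (my simplification, not NIO's actual code):

import Darwin
import Foundation

// Simplified model of NIO's syscall checking: EINTR is retried, most
// errors are thrown, but EBADF is "unacceptable" and traps, because it
// means an fd the event loop believes it owns was closed behind its back.
func checkedSyscall(_ body: () -> Int) throws -> Int {
    while true {
        let result = body()
        if result >= 0 { return result }
        switch errno {
        case EINTR:
            continue  // interrupted: retry
        case EBADF:
            preconditionFailure("unacceptable errno \(errno) Bad file descriptor")
        default:
            throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
        }
    }
}

This is why the failure surfaces as EXC_BREAKPOINT rather than a thrown error: from NIO's perspective, EBADF on a managed fd is unrecoverable state corruption, not an ordinary I/O failure.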

Crash 1: 07:36 (overnight, sleep possibly involved)

  • Runtime: ~14 hours (launched previous day at 17:25)
  • Crash site: fcntl(F_SETNOSIGPIPE) on the vsock fd — the gRPC client was idle, and when a new exec() triggered an RPC, the ConnectionManager tried to reconnect using the original .connectedSocket(fd) target. The fd was already closed.
  • Likely cause: The grpc-swift default idle timeout (30 minutes) closed the connection's socket, and with connectionBackoff = nil, the reconnection attempt reused the closed fd. Sleep/wake may also be a factor given the 14-hour runtime.

Relevant stack:

0: _assertionFailure
1: preconditionIsNotUnacceptableErrno(err:where:)
2: syscall(blocking:where:_:)
3: Posix.fcntl(descriptor:command:value:)
4: BaseSocketProtocol.ignoreSIGPIPE(descriptor:)
...
12: ClientBootstrap.withConnectedSocket(_:)
...
17: ConnectionManager.startConnecting(...)

Crashes 2 and 3: 11:46 and 13:42 (no sleep)

  • Runtime: ~1h 49m and ~1h 47m respectively
  • Machine did not sleep during these runs
  • Crash site: writev on the vsock fd — the gRPC client believed the connection was still active and attempted a new RPC. The writev on the vsock socket got EBADF.

Unlike crash 1, this is not a reconnection attempt. The gRPC client considers the connection live and tries to use it normally, but the underlying vsock fd has been invalidated by something external.

Both crashes have identical stacks. Key frames:

0: _assertionFailure
1: preconditionIsNotUnacceptableErrno(err:where:)
2: syscall(blocking:where:_:)
3: Posix.writev(descriptor:iovecs:)          <-- EBADF on the vsock fd
...
57: GRPCClientChannelHandler.flush(context:)  <-- gRPC initiating a new RPC
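One way to test the external-invalidation theory would be to periodically probe the vsock fd and log when it goes bad relative to the ~107-minute mark. A hypothetical diagnostic helper (the name is mine, not part of Containerization):

import Darwin

// Returns false once `fd` no longer refers to an open descriptor.
// fcntl(F_GETFD) is side-effect free and fails with EBADF exactly when
// the descriptor has been closed or invalidated.
func isFileDescriptorOpen(_ fd: Int32) -> Bool {
    if fcntl(fd, F_GETFD) != -1 { return true }
    return errno != EBADF  // failed for some other reason; fd still exists
}

Sampling this for the gRPC client's fd (and for the TimeSyncer's) would show whether the descriptor dies at a fixed offset from launch or only at the moment of the next write.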

Timeline from system logs

Crash 2 (11:46), launched at 09:57

09:57:44  VM 1 created
09:57:45  VM 2 created
          (no container-related activity in between)
11:46:04  Precondition failed: unacceptable errno 9 Bad file descriptor in writev

Crash 3 (13:42), launched at 11:55

11:55:44  VM 1 created
11:55:45  VM 2 created
          (no container-related activity until a third VM is created ~97 minutes later)
13:32:59  VM 3 created
13:42:09  Precondition failed: unacceptable errno 9 Bad file descriptor in writev

No TimeSyncer errors ("failed to sync time") were logged during either run, which means the TimeSyncer's agent connection (also created via dupHandle()) remained healthy throughout.

Observations

  1. The timing is consistent. Both non-sleep crashes happened 107-109 minutes (~1h 47m to ~1h 49m) after launch. This suggests a timer or deferred operation rather than a random race.

  2. The fd was valid initially. The crashes happen on a vsock connection that was already established and working. Something invalidated the fd after it was in use.

  3. The fd was not closed by the gRPC/NIO layer. The gRPC channel is still in an active state at the time of the crash. The fd was invalidated by something outside the gRPC stack.

  4. connectionBackoff = nil and default idle timeout. Vminitd.Client.init sets connectionBackoff = nil but does not override connectionIdleTimeout, which defaults to 30 minutes in grpc-swift (see the configuration sketch after this list). The init process agent (created during LinuxContainer.start()) goes idle after the createProcess/startProcess RPCs complete, so the 30-minute idle timeout would close that connection's socket. This explains crash 1 (the reconnection path), but 30 minutes doesn't match the ~107-minute timing of crashes 2 and 3.

  5. Two VMs, different crashes. The app creates two VMs at launch. Crash 1 was on one VM's gRPC client, crashes 2 and 3 were on the other's.
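To make observation 4 concrete, this is roughly what that configuration amounts to in grpc-swift 1.x terms (a sketch of my reading, not verbatim Vminitd.Client code; fd and group are placeholders):

import GRPC
import NIOCore
import NIOPosix

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let fd: NIOBSDSocket.Handle = -1  // placeholder: the duplicated vsock fd

var config = ClientConnection.Configuration.default(
    target: .connectedSocket(fd),
    eventLoopGroup: group
)
config.connectionBackoff = nil
// connectionIdleTimeout is not overridden, so it stays at the grpc-swift
// default of 30 minutes. When it fires, the socket is closed, but a
// .connectedSocket target can only "reconnect" by reusing the same,
// now-closed fd: the crash-1 path.
let connection = ClientConnection(configuration: config)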

Questions

  • Could the Virtualization framework be doing deferred cleanup of vsock endpoints after VZVirtioSocketConnection.close() is called in dupHandle()? The ~107-minute delay suggests something on a schedule rather than immediate invalidation.
  • Is there any internal timeout or resource reclamation in the vsock/XPC layer that could invalidate file descriptors for established connections?
  • Would disabling the grpc-swift idle timeout (setting connectionIdleTimeout to a very large value) and/or adding keepalive be worth trying, to narrow down whether the idle machinery is involved? A sketch of that experiment follows below.
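On the last question, a minimal sketch of that experiment, reusing the placeholders from the sketch above (the durations are arbitrary values chosen for narrowing things down, not recommendations):

var experiment = ClientConnection.Configuration.default(
    target: .connectedSocket(fd),
    eventLoopGroup: group
)
experiment.connectionBackoff = nil
// Effectively disable the idle timeout so grpc-swift never closes the
// socket for inactivity.
experiment.connectionIdleTimeout = .hours(24)
// Ping periodically so a dead transport shows up as a detectable
// connection failure instead of an EBADF trap on the next write.
experiment.connectionKeepalive = ClientConnectionKeepalive(
    interval: .seconds(30),
    timeout: .seconds(10),
    permitWithoutCalls: true
)
let client = ClientConnection(configuration: experiment)

If crash 1 disappears but the ~107-minute writev crash persists, that would isolate the idle timeout as one bug and leave the external fd invalidation as a separate one.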

Expected behavior

Shouldn't crash

Environment

See above

Relevant log output

See above

Code of Conduct

  • I agree to follow this project's Code of Conduct
