DNS TLS transport + naive detour causes goroutine leak → iOS OOM #4105

@laurdawn

Description

On iOS (NetworkExtension, ~50MB memory limit), using a naive proxy as the detour for DNS-over-TLS causes goroutines to leak indefinitely, eventually triggering OOM.

Environment

  • sing-box version: 1.14.0-alpha.20
  • Go version: go1.25.9, ios/arm64
  • iOS NetworkExtension

Configuration (relevant parts)

```jsonc
{
  "dns": {
    "servers": [
      {
        "type": "tls",
        "tag": "google",
        "detour": "auto-out",       // ← DNS queries go through auto-out → naive
        "server": "8.8.8.8"
      }
    ]
  },
  "outbounds": [
    {
      "type": "naive",
      "tag": "...",
      "server": "...",
      "server_port": 443,
      "username": "...",
      "password": "...",
      "tls": { "enabled": true }
    }
  ]
}
```

Symptoms

  • Memory usage grows to ~45MB (iOS limit ~50MB)
  • 364 goroutines, of which 262 are stuck in cronet/naive/DNS TLS path
  • Process eventually OOM-killed by iOS jetsam

Goroutine profile breakdown

| Count | Stuck at | Stage |
|---|---|---|
| 120 | `crypto/tls.(*Conn).writeRecordLocked` | TLS handshake write blocked |
| 48 | `crypto/tls.(*Conn).handshakeContext.func2` | Timed out, goroutine leaked |
| 11 | `cronet.(*BidirectionalConn).Write` | Post-handshake data write |
| 11 | `cronet.(*BidirectionalConn).Read` | Post-handshake data read |
| 2 | `crypto/tls.(*Conn).readRecordOrCCS` | TLS handshake read blocked |

The spread across five different stages indicates that goroutines accumulate over time rather than in a one-time event. 48 goroutines have already exceeded the 15s timeout; 72 are still within the timeout window.

Full call chain for the 120 stuck goroutines:

```
dns.Router.Exchange
  → dns.Client.exchangeToTransport
    → TLSTransport.Exchange
      → ConnPool.acquireOrdered
        → ConnPool.dial
          → tls.DialTLSContext → clientHandshake
            → writeRecordLocked → BidirectionalConn.Write
              → select { case <-c.write: ... }  ← blocked forever
```

Root cause

Two bugs combine to cause this:

1. BidirectionalConn does not support deadlines

https://github.com/SagerNet/cronet-go/blob/main/bidirectional_conn.go#L291-L301

```go
func (c *BidirectionalConn) SetDeadline(t time.Time) error      { return os.ErrInvalid }
func (c *BidirectionalConn) SetReadDeadline(t time.Time) error  { return os.ErrInvalid }
func (c *BidirectionalConn) SetWriteDeadline(t time.Time) error { return os.ErrInvalid }
```

When crypto/tls.HandshakeContext(ctx) tries to set the deadline on the underlying connection, it gets os.ErrInvalid. The TLS handshake proceeds without any write timeout.

BidirectionalConn.Write() calls c.stream.Write() (async, non-blocking CGO) then blocks on a Go channel waiting for the OnWriteCompleted callback from the Cronet engine. If the callback never arrives (server unresponsive, HTTP/2 flow control exhausted, etc.), the goroutine is permanently stuck.
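To illustrate the difference, here is a minimal sketch of a completion wait that can give up. It is not the cronet-go code: `waitWrite` and its three parameters are hypothetical stand-ins for the channel wait inside `BidirectionalConn.Write`, and the sketch assumes the connection has a "closed" channel and a stored write deadline to select on.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// waitWrite is a hypothetical stand-in for the blocking part of a Write that
// waits for an asynchronous completion callback. Unlike a bare channel
// receive, it also returns when the connection is closed or the deadline
// passes.
func waitWrite(completed, closed <-chan struct{}, deadline time.Time) error {
	var timeout <-chan time.Time // nil channel: never fires if no deadline is set
	if !deadline.IsZero() {
		timer := time.NewTimer(time.Until(deadline))
		defer timer.Stop()
		timeout = timer.C
	}
	select {
	case <-completed:
		return nil
	case <-closed:
		return net.ErrClosed
	case <-timeout:
		return os.ErrDeadlineExceeded
	}
}

// timesOut simulates a callback that never arrives.
func timesOut() bool {
	never := make(chan struct{})
	err := waitWrite(never, never, time.Now().Add(20*time.Millisecond))
	return errors.Is(err, os.ErrDeadlineExceeded)
}

// completes simulates a callback that has already fired.
func completes() bool {
	done := make(chan struct{})
	close(done)
	return waitWrite(done, make(chan struct{}), time.Time{}) == nil
}

func main() {
	fmt.Println("deadline enforced:", timesOut())
	fmt.Println("normal completion:", completes())
}
```

With a wait shaped like this, a missing callback costs one deadline period instead of a permanently parked goroutine. Whether the Cronet stream can be safely abandoned at that point is a separate question that this sketch does not address.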

2. ConnPoolOrdered has no concurrent dial limiting

https://github.com/SagerNet/sing-box/blob/main/dns/transport/conn_pool.go#L247-L301

```go
func (p *ConnPool[T]) acquireOrdered(...) {
    for {
        // check idle connections → none found
        conn, err := p.dial(ctx, currentState, dial)  // ← every caller starts a new dial
        // dial blocks forever (bug #1) → never returns → goroutine leaked
    }
}
```

Unlike ConnPoolSingle mode (which has current.connecting to prevent concurrent dials), ConnPoolOrdered allows every caller to start a new dial concurrently. When each dial blocks forever, every new DNS query leaks a goroutine.

Combined effect

  1. DNS query arrives → tries to dial DNS-over-TLS through naive proxy
  2. TLS handshake starts → writes to BidirectionalConn → blocks waiting for Cronet callback
  3. Callback never arrives (server unresponsive / flow control stalled)
  4. 15s timeout → HandshakeContext returns error, but the spawned clientHandshake goroutine is still stuck in Write → goroutine leaked
  5. Next DNS query → same thing → more goroutines leaked
  6. Repeat 120+ times → OOM
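The steps above can be modeled with a simplified sketch. This is not sing-box or cronet-go code: `simulateQueries` is a hypothetical helper, the "write" is a receive on a channel that is never signalled, and the timeout is shortened to a few milliseconds. It only shows that a caller returning on timeout does not free a goroutine that has no way to be interrupted.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// simulateQueries runs `queries` sequential requests. Each request starts a
// worker that waits for a completion that never arrives, then gives up when
// its context expires. It returns how many workers are still parked.
func simulateQueries(queries int) int {
	var stuck atomic.Int32
	never := make(chan struct{}) // stands in for a callback that never fires
	for i := 0; i < queries; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Millisecond)
		started := make(chan struct{})
		done := make(chan struct{})
		go func() {
			stuck.Add(1)
			close(started)
			defer stuck.Add(-1)
			defer close(done)
			<-never // no deadline and no close signal: blocks forever
		}()
		<-started
		select {
		case <-done: // never happens in this model
		case <-ctx.Done(): // the caller gives up, the worker does not
		}
		cancel()
	}
	return int(stuck.Load())
}

func main() {
	fmt.Println("goroutines still stuck:", simulateQueries(20))
}
```

Every simulated query leaves one goroutine behind, which matches the steady growth seen in the goroutine profile.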

Suggested fixes

  1. ConnPoolOrdered should limit concurrent dials, similar to ConnPoolSingle's connecting mechanism
  2. BidirectionalConn should support deadlines by using context.AfterFunc to close the connection (send to c.close / c.done) when the deadline expires, since the underlying Cronet stream API doesn't support native deadlines
  3. DNS transport dial should have a hard timeout that forcibly closes the underlying connection, not just cancels the context
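A minimal sketch of the "forcibly close on timeout" idea from fixes 2 and 3, using `context.AfterFunc` (Go 1.21+). `handshakeWithHardTimeout` is a hypothetical helper, and `net.Pipe` stands in for the proxy connection. The sketch relies on one assumption: `Close` on the connection must unblock a pending `Write`, which is true for `net.Pipe` but would have to be verified for `BidirectionalConn`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// handshakeWithHardTimeout runs handshake on conn and closes conn as soon as
// ctx is done, so a blocked Read or Write inside the handshake is released.
func handshakeWithHardTimeout(ctx context.Context, conn net.Conn, handshake func(net.Conn) error) error {
	stop := context.AfterFunc(ctx, func() { _ = conn.Close() })
	defer stop()
	err := handshake(conn)
	if ctxErr := ctx.Err(); err != nil && ctxErr != nil {
		return ctxErr // report the timeout rather than the resulting I/O error
	}
	return err
}

// blockedWriteIsReleased writes to a pipe whose peer never reads, so the
// write can only return once the connection is closed by the timeout.
func blockedWriteIsReleased() bool {
	client, server := net.Pipe()
	defer server.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := handshakeWithHardTimeout(ctx, client, func(c net.Conn) error {
		_, werr := c.Write([]byte("client hello")) // blocks: nobody reads
		return werr
	})
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("blocked write released by timeout:", blockedWriteIsReleased())
}
```

The handshake runs in the caller's goroutine here, so no goroutine outlives the call once the connection is closed.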

Profile data

Full memory profile dump available if needed (heap, goroutine, allocs profiles from iOS memory pressure callback).
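For anyone who wants to collect a comparable goroutine profile, a text dump can be produced with `runtime/pprof`. This is a generic sketch, not the code used in the app; `goroutineProfile` and `profileLooksValid` are hypothetical helper names, and hooking the dump into the iOS memory pressure callback is left out.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineProfile returns the goroutine profile as text. With debug=1,
// goroutines that share an identical stack are grouped and each group is
// prefixed with its count.
func goroutineProfile() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// profileLooksValid checks for the header line of the text format.
func profileLooksValid() bool {
	text, err := goroutineProfile()
	return err == nil && strings.HasPrefix(text, "goroutine profile: total ")
}

func main() {
	text, err := goroutineProfile()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Print(text)
}
```

The per-stack counts in this output are the kind of data the breakdown table above summarizes.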

Labels: bug (Something isn't working)
