Description
On iOS (NetworkExtension, ~50MB memory limit), using a naive proxy as the detour for DNS-over-TLS causes goroutines to leak indefinitely, eventually triggering OOM.
Environment
- sing-box version: 1.14.0-alpha.20
- Go version: go1.25.9, ios/arm64
- iOS NetworkExtension
Configuration (relevant parts)
```json
{
  "dns": {
    "servers": [
      {
        "type": "tls",
        "tag": "google",
        "detour": "auto-out", // ← DNS queries go through auto-out → naive
        "server": "8.8.8.8"
      }
    ]
  },
  "outbounds": [
    {
      "type": "naive",
      "tag": "...",
      "server": "...",
      "server_port": 443,
      "username": "...",
      "password": "...",
      "tls": { "enabled": true }
    }
  ]
}
```
Symptoms
- Memory usage grows to ~45MB (iOS limit ~50MB)
- 364 goroutines, of which 262 are stuck in cronet/naive/DNS TLS path
- Process eventually OOM-killed by iOS jetsam
Goroutine profile breakdown
| Count | Stuck at | Stage |
|------:|----------|-------|
| 120 | `crypto/tls.(*Conn).writeRecordLocked` | TLS handshake write blocked |
| 48 | `crypto/tls.(*Conn).handshakeContext.func2` | Timed out, goroutine leaked |
| 11 | `cronet.(*BidirectionalConn).Write` | Post-handshake data write |
| 11 | `cronet.(*BidirectionalConn).Read` | Post-handshake data read |
| 2 | `crypto/tls.(*Conn).readRecordOrCCS` | TLS handshake read blocked |
The distribution across five different stages shows these accumulated over time rather than from a single event: 48 goroutines have already exceeded the 15s timeout, while 72 are still within the timeout window.
Full call chain for the 120 stuck goroutines:
```
dns.Router.Exchange
→ dns.Client.exchangeToTransport
→ TLSTransport.Exchange
→ ConnPool.acquireOrdered
→ ConnPool.dial
→ tls.DialTLSContext → clientHandshake
→ writeRecordLocked → BidirectionalConn.Write
→ select { case <-c.write: ... } ← blocked forever
```
Root cause
Two bugs combine to cause this:
1. BidirectionalConn does not support deadlines
https://github.com/SagerNet/cronet-go/blob/main/bidirectional_conn.go#L291-L301
```go
func (c *BidirectionalConn) SetDeadline(t time.Time) error      { return os.ErrInvalid }
func (c *BidirectionalConn) SetReadDeadline(t time.Time) error  { return os.ErrInvalid }
func (c *BidirectionalConn) SetWriteDeadline(t time.Time) error { return os.ErrInvalid }
```
When crypto/tls.HandshakeContext(ctx) tries to set a deadline on the underlying connection, it gets os.ErrInvalid, so the TLS handshake proceeds without any write timeout.
BidirectionalConn.Write() calls c.stream.Write() (an asynchronous, non-blocking CGO call) and then blocks on a Go channel waiting for the OnWriteCompleted callback from the Cronet engine. If the callback never arrives (server unresponsive, HTTP/2 flow control exhausted, etc.), the goroutine is stuck permanently.
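To make the failure mode concrete, here is a simplified stand-in for that pattern (types, fields, and channel names are illustrative; see the linked cronet-go source for the real thing):

```go
package cronetsketch

import "net"

// Stand-ins for the real cronet-go types, just to make the blocking
// pattern concrete (names are illustrative).
type stream struct{}

func (s *stream) Write(p []byte) { /* async CGO call; returns immediately */ }

type BidirectionalConn struct {
	stream *stream
	write  chan int      // signalled by the OnWriteCompleted callback
	done   chan struct{} // closed when the stream is torn down
}

// Write starts the async write, then blocks until the engine reports
// completion. There is no deadline or timeout case: if OnWriteCompleted
// never fires and the stream is never closed, the goroutine parks here
// forever.
func (c *BidirectionalConn) Write(p []byte) (int, error) {
	c.stream.Write(p)
	select {
	case n := <-c.write:
		return n, nil
	case <-c.done:
		return 0, net.ErrClosed
	}
}
```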
2. ConnPoolOrdered does not limit concurrent dials
https://github.com/SagerNet/sing-box/blob/main/dns/transport/conn_pool.go#L247-L301
```go
func (p *ConnPool[T]) acquireOrdered(...) {
	for {
		// check idle connections → none found
		conn, err := p.dial(ctx, currentState, dial) // ← every caller starts a new dial
		// dial blocks forever (bug #1) → never returns → goroutine leaked
	}
}
```
Unlike ConnPoolSingle mode (which has current.connecting to prevent concurrent dials), ConnPoolOrdered allows every caller to start a new dial concurrently. When each dial blocks forever, every new DNS query leaks a goroutine.
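A sketch of what that gating could look like, modeled loosely on ConnPoolSingle's connecting flag (all names here are hypothetical, not the actual sing-box code): the first caller dials, everyone else waits for that dial to finish and then re-checks the pool.

```go
package poolsketch

import (
	"context"
	"errors"
	"net"
	"sync"
)

// errRetry tells the caller to loop back and re-check the idle pool after
// another caller's dial has completed, mirroring acquireOrdered's for loop.
var errRetry = errors.New("dial finished elsewhere, retry pool acquire")

// dialGate is a hypothetical single-flight gate in the spirit of
// ConnPoolSingle's `connecting` mechanism.
type dialGate struct {
	mu         sync.Mutex
	connecting chan struct{} // non-nil while a dial is in flight
}

func (g *dialGate) dialOnce(ctx context.Context, dial func(context.Context) (net.Conn, error)) (net.Conn, error) {
	g.mu.Lock()
	if wait := g.connecting; wait != nil {
		// Another caller is already dialing: wait for it instead of
		// starting a dial of our own (this is what acquireOrdered lacks).
		g.mu.Unlock()
		select {
		case <-wait:
			return nil, errRetry
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	ch := make(chan struct{})
	g.connecting = ch
	g.mu.Unlock()

	conn, err := dial(ctx)

	g.mu.Lock()
	g.connecting = nil
	g.mu.Unlock()
	close(ch) // wake all waiters so they can re-check the pool
	return conn, err
}
```

Even if the one gated dial still blocks forever (bug #1), at most one goroutine is parked inside it; every other query fails on its own context instead of leaking a fresh dial.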
Combined effect
- DNS query arrives → tries to dial DNS-over-TLS through naive proxy
- TLS handshake starts → writes to BidirectionalConn → blocks waiting for Cronet callback
- Callback never arrives (server unresponsive / flow control stalled)
- 15s timeout → HandshakeContext returns an error, but the spawned clientHandshake goroutine is still stuck in Write → goroutine leaked
- Next DNS query → same thing → more goroutines leaked
- Repeat 120+ times → OOM
Suggested fixes
- ConnPoolOrdered should limit concurrent dials, similar to ConnPoolSingle's connecting mechanism (see the gate sketch above)
- BidirectionalConn should support deadlines by arming a timer (e.g. time.AfterFunc) that closes the connection (send to c.close / c.done) when the deadline expires, since the underlying Cronet stream API doesn't support native deadlines (sketched below)
- DNS transport dial should have a hard timeout that forcibly closes the underlying connection, not just cancels the context (also sketched below)
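For the second fix, a minimal sketch of deadline support, assuming the conn keeps a done channel whose close unblocks pending Read/Write (all names hypothetical; this is not the actual cronet-go code). Since the Cronet stream API has no native deadlines, an expired deadline simply tears the conn down:

```go
package cronetsketch

import (
	"sync"
	"time"
)

// deadlineConn sketches deadline support for a conn whose Read/Write park
// on channels, as BidirectionalConn does. All names are hypothetical.
type deadlineConn struct {
	mu        sync.Mutex
	done      chan struct{} // closing this unblocks pending Read/Write
	closeOnce sync.Once
	timer     *time.Timer
}

func newDeadlineConn() *deadlineConn {
	return &deadlineConn{done: make(chan struct{})}
}

func (c *deadlineConn) Close() error {
	c.closeOnce.Do(func() { close(c.done) })
	return nil
}

// SetDeadline arms a timer that closes the conn when the deadline expires,
// instead of returning os.ErrInvalid.
func (c *deadlineConn) SetDeadline(t time.Time) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.timer != nil {
		c.timer.Stop() // replace any previously armed deadline
		c.timer = nil
	}
	if t.IsZero() {
		return nil // zero time clears the deadline
	}
	c.timer = time.AfterFunc(time.Until(t), func() {
		_ = c.Close() // expired: unblock every goroutine parked in I/O
	})
	return nil
}
```

Closing is coarser than the net.Conn contract (a real deadline returns a timeout error and leaves the conn usable), but for a pooled DNS connection that is about to be discarded anyway, it is enough to reclaim the goroutines.

For the third fix, a sketch of a hard dial timeout that captures the raw connection before the TLS handshake so a watchdog can force-close it; dialRaw and handshake are hypothetical stand-ins for the transport's dial and TLS client handshake:

```go
package dialsketch

import (
	"context"
	"net"
	"time"
)

// dialTLSHard sketches a hard dial timeout: the raw conn exists before the
// TLS handshake starts, so the watchdog can force-close it and unblock a
// handshake stuck in Read/Write. Cancelling the context alone does not do
// that when the conn ignores deadlines.
func dialTLSHard(
	ctx context.Context,
	timeout time.Duration,
	dialRaw func(context.Context) (net.Conn, error),
	handshake func(context.Context, net.Conn) (net.Conn, error),
) (net.Conn, error) {
	raw, err := dialRaw(ctx)
	if err != nil {
		return nil, err
	}

	// Watchdog: force-close the raw conn when the budget expires.
	watchdog := time.AfterFunc(timeout, func() { _ = raw.Close() })

	tlsConn, err := handshake(ctx, raw)
	if !watchdog.Stop() || err != nil {
		// Either the deadline fired (conn already closed) or the
		// handshake failed: make sure nothing stays half-open.
		_ = raw.Close()
		if err == nil {
			err = context.DeadlineExceeded
		}
		return nil, err
	}
	return tlsConn, nil
}
```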
Profile data
Full memory profile dump available if needed (heap, goroutine, allocs profiles from iOS memory pressure callback).