Fix DNS loop from seeding defaultNetwork with own VPN#61
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6ff502a12
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // back through the tun (auto_route) and loop. The NetworkRequest carries | ||
| // NET_CAPABILITY_NOT_VPN, but that does not apply to this direct getter, | ||
| // so filter the VPN transport out explicitly. | ||
| Application.connectivity.activeNetwork?.takeUnless(::isVpn) |
There was a problem hiding this comment.
Wait for a non-VPN network instead of publishing null
When activeNetwork is the VPN and the NOT_VPN listener has not delivered the underlying network yet (the startup race this change is targeting), this expression stores null and start() returns. LocalResolver.exchange()/lookup() read DefaultNetworkMonitor.defaultNetwork directly and throw "missing default interface" rather than calling require(), so the first DNS requests during this window fail instead of waiting for the listener-provided physical network.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Thanks — this is a real edge, but it's a pre-existing one that this PR deliberately doesn't widen, and I'd argue it's out of scope here. I traced it end-to-end in the downstream client as a separate task — LxBox §122 — missing default interface: LocalResolver on null defaultNetwork (won't-fix, with the full reasoning). Summary:
1. This is not a regression introduced by the PR. defaultNetwork already goes null on every onLost (DefaultNetworkListener sends Msg.Lost → network = null), and that path predates this change. takeUnless(::isVpn) adds exactly one new way to reach null: a VPN seed at startup, in the narrow window before the NOT_VPN callback lands. The net effect of the PR is guaranteed DNS loop → rare fast fail in a startup race — strictly better. The recurring null source is onLost, which the seed filter neither caused nor touches.
2. The window is genuinely narrow. Between the seed and the first DNS query there's setMemoryLimit + startOrReloadService (config parse + tun bring-up, tens-to-hundreds of ms), while the OS typically delivers onAvailable in single-to-low-double-digit ms — so the callback usually wins and null never reaches the resolver. No guarantee, but it's not the common path.
3. error("missing default interface") is not a crash. LocalDNSTransport.exchange/lookup are declared throws Exception in the .aar; the Go wrapper (experimental/libbox/dns.go) propagates it as a failed exchange to the DNS engine, which then fails over / SERVFAILs per the config's DNS rules. It's a handled failure path, not a hang and not a panic. (No silent retry in the wrapper either — so I'm not claiming "it just retries and it's fine"; I'm claiming the engine sees a normal failed exchange.)
4. Waiting (require()) costs more than it buys here. Upstream sing-box-for-android has require(), but doing the wait properly means moving the network read inside runBlocking in both exchange()/lookup() (suspend reshuffle), plus a mandatory timeout with an onCancel — otherwise a never-arriving network hangs the Go DNS thread indefinitely. That's real surface for a window nobody has observed firing in the field. For a fix whose whole point is to be minimal and root-cause-scoped, I'd rather not bundle that in.
So I'd keep this PR limited to the seed filter and leave the null→require() question as its own change. Happy to send the require() follow-up separately if you'd prefer it covered — the §122 doc has a ready concept (with the timeout fork already decided) for exactly that. Same applies to the API < 23 DefaultNetworkListener.get() fallback I mentioned in the description.
§120 — затрекали баг §119 (DefaultNetworkMonitor сидит defaultNetwork из getActiveNetwork() без VPN-фильтра → DNS-loop) публично: - upstream PR SagerNet/sing-box-for-android#61 (base dev, +14/−1, takeUnless(::isVpn)) - локальный issue #10 (closed/resolved §119) Issue в upstream невозможен (issues/discussions у SFA отключены; форма sing-box требует TUN-free CLI-репро). Полный текст репорта — в таске. §119 — поправлена атрибуция цитаты: «…VPN that applies to the application» дословно из javadoc registerDefaultNetworkCallback(), не из гайда Read network state.
start() seeds defaultNetwork from getActiveNetwork(), which returns the per-app default and may be the app's own VPN when the tun is already up. LocalResolver then queries DNS through the tun (auto_route) and loops, so apps under the VPN fail to resolve names. The async DefaultNetworkListener callback overwrites the seed with the NOT_VPN-filtered network shortly after, but the synchronous seed is used for the first resolutions; on ROMs where the tun is up at start() it is the VPN until the callback corrects it. The request's NOT_VPN capability does not apply to the direct getActiveNetwork() getter, so filter the VPN transport from the seed explicitly.
b6ff502 to
53eca0c
Compare
DefaultNetworkMonitor.start()seedsdefaultNetworkfromgetActiveNetwork()without filtering the VPN transport (L19-L20).The seed can be the app's own VPN
getActiveNetwork()returns the per-app default network. Per theregisterDefaultNetworkCallback()docs it "may be a physical network or a virtual network, such as a VPN that applies to the application", andgetActiveNetwork()returns the same value. So if the tun is already up whenstart()runs, the seed is sing-box's own VPNNetwork.How that loops DNS
LocalResolverresolves viaDnsResolver.rawQuery(network = defaultNetwork, ...). WhendefaultNetworkis the VPN:auto_routehijacks all traffic through the tun, DNS included, and hands the query to sing-box.LocalResolver.LocalResolverqueries onnetwork = VPNagain → back to step 1.The query never reaches a real upstream resolver — apps under the VPN fail to resolve names.
Why the seed, and not the request
defaultNetworkhas two writers:DefaultNetworkListener. ItsNetworkRequestcarriesNET_CAPABILITY_NOT_VPN(part ofNetworkCapabilities.DEFAULT_CAPABILITIES), so this value is already VPN-free.start()— a directgetActiveNetwork()getter. The request'sNOT_VPNcapability does not apply to it.The seed is read synchronously and is what the first resolutions use. The callback overwrites it with the NOT_VPN-filtered network, but only after the listener fires — so it's a race. On ROMs where the tun is already up at
start(), the seed is the VPN for the window before the callback corrects it. Filtering the request capability therefore can't fix this; the seed itself has to be filtered.ROM dependence
Whether the seed catches the VPN depends on the ordering of "tun came up" vs "seed was read" at start, which varies by device timing:
The fix
takeUnless(::isVpn)on the seed, so it is never the app's own tun. The callback path is unchanged; on devices where the seed was already an underlying network, behavior is unchanged.The same unfiltered read exists on the API < 23
DefaultNetworkListener.get()fallback (only reachable in theotherLegacyflavor). Left untouched to keep the diff minimal; can extend if preferred.Found in LxBox, a downstream client built on SFA.
Related: sing-box#3637, sing-box#2643.