Summary
The Coder Desktop tray indicator currently reflects "service running / signed in"
rather than "tunnel + DNS are actually serving traffic." When the dataplane goes
unhealthy (network flap, sleep/wake, control-plane DNS hiccup, etc.) but the
service is still running, the tray stays green while:
- The .coder NRPT rule and Wintun adapter remain in place.
- The embedded DNS server either keeps answering with fd60:627a:a42b::/48
addresses that no longer route, or silently returns empty answers.
This misleads both users and downstream tooling that relies on *.coder
resolving correctly, producing two distinct user-visible failure modes.
Failure mode 1 — stale DNS still answering
Embedded DNS continues to return fd60:627a:a42b::/48 addresses for
<workspace>.coder, but packets to those addresses go nowhere. Downstream effects:
- ssh coder.<workspace> (via a coder config-ssh-style ProxyCommand) hits
coder ssh --stdio, which calls workspacesdk.ExistsViaCoderConnect, sees an
fd60:627a:a42b::/48 answer, commits to the direct-Connect path, and times out
with dial tcp [fd60:...]:22: connect: operation timed out.
- coder ssh <workspace> (no --stdio) still works because it bypasses the
Connect probe and builds its own tailnet client.
- Reported recovery in the field has been "reboot the laptop," which tears down
the stale state.
(Filed separately against coder/coder as coder/coder#26669 for the CLI-side
defensive fix.)
Failure mode 2 — embedded DNS returns empty answers
Conversely, the embedded DNS server stops returning anything for <workspace>.coder
while the tray still shows connected:
> Resolve-DnsName -Name myworkspace.coder -Server fd60:627a:a42b::53
(empty answer — no IPAddress, no NXDOMAIN)
Downstream effects:
- File sync's remote directory picker (POST /api/v0/list-directory against
http://<workspace>.coder:4/) fails with a raw
System.Net.Sockets.SocketException (11001): No such host is known, which is
surfaced to the user as a wall of stack trace in the directory picker dialog.
- The Mutagen sync daemon repeatedly fails to resolve <workspace>.coder hostnames.
- Tailnet logs show "failed to dial tailnet v2+ API", "no matching peer", etc.
This is the same underlying state-management gap as #170, and tends to follow
the "service restarted without full off/on toggle" path from #149.
What I'd like to see
1. Tray indicator should reflect dataplane health, not just service state.
When the tunnel cannot carry traffic (no peers, can't dial coordinator, embedded
DNS not serving), the tray should turn yellow/red and surface a one-click
recovery action ("Restart Coder Connect").
2. State should stay coherent when the tunnel is unhealthy. If the tunnel is not
actually serving traffic, the NRPT rule and embedded DNS should either be
torn down (so callers get a clean NXDOMAIN) or kept in lockstep with the
tunnel's real status. The current "DNS lingers, tunnel is gone" intermediate
state is what trips both failure modes above.
3. Friendlier error in the directory picker. Even after the underlying
issue is fixed, list-directory should not surface a raw AggregateException
stack to the user. Catch the socket / DNS failure class and render
"Workspace unreachable: check Coder Connect status," ideally with a link to
the tray.
4. Recovery shouldn't require a reboot. Today users in the field are
resolving this by rebooting, or by toggling Coder Connect off/on (#149 notes
that a service restart alone leaves the system in a broken state; the
off/on toggle from the tray is the documented recovery path). Either the
service-restart path should fully re-establish state, or the off/on toggle
should be discoverable from inside the broken state (e.g. surfaced by item 1).
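The tray behaviour requested above amounts to deriving the indicator from dataplane signals instead of service state alone. The following Go sketch shows one possible derivation. Every name in it (Signals, TrayState, trayState) is hypothetical, and the choice of which condition maps to yellow versus red is an assumption for illustration, since the request only says "yellow/red".

```go
package main

import "fmt"

// TrayState is a hypothetical tray indicator colour.
type TrayState string

const (
	TrayGreen  TrayState = "green"
	TrayYellow TrayState = "yellow"
	TrayRed    TrayState = "red"
)

// Signals are hypothetical health inputs mirroring the conditions listed
// in this request: no peers, can't dial coordinator, embedded DNS not serving.
type Signals struct {
	ServiceRunning       bool
	CoordinatorReachable bool
	PeerCount            int
	DNSServing           bool
}

// trayState derives the indicator from dataplane health rather than from
// ServiceRunning alone. The yellow/red split is an assumption.
func trayState(s Signals) TrayState {
	if !s.ServiceRunning || !s.CoordinatorReachable || !s.DNSServing {
		return TrayRed
	}
	if s.PeerCount == 0 {
		return TrayYellow
	}
	return TrayGreen
}

func main() {
	// Service is up but nothing else is healthy: today's tray shows green.
	fmt.Println(trayState(Signals{ServiceRunning: true})) // prints "red"
}
```

Any non-green state would be the natural place to surface the "Restart Coder Connect" action, which also covers the discoverability concern in item 4.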
Reproduction
Hard to reproduce deterministically, but reported triggers include:
- [...] mix, per #149).
- [...] deletes the Wintun adapter, NRPT rule, and CoderVpnService entry without
recreating them).
In all cases the tray UI continues to display the connected state.
Related
- ExistsViaCoderConnect false positives when Coder Desktop has stale DNS but no
working tunnel, causing coder ssh --stdio to hang (coder/coder#26669): companion
CLI-side issue for ExistsViaCoderConnect to add a liveness probe so the CLI
degrades gracefully even if Desktop is in this state.
- [...] despite Connect being on.
Created on behalf of @mdanter