Tray reports Coder Connect as healthy while tunnel/DNS is broken; stale NRPT rule causes downstream tooling to hang or fail #171

@blinkagent

Summary

The Coder Desktop tray indicator currently reflects "service running / signed in"
rather than "tunnel + DNS are actually serving traffic." When the dataplane goes
unhealthy (network flap, sleep/wake, control-plane DNS hiccup, etc.) but the
service is still running, the tray stays green while:

  • The .coder NRPT rule and Wintun adapter remain in place.
  • The embedded DNS server either keeps answering with fd60:627a:a42b::/48
    addresses that no longer route, or silently returns empty answers.

This misleads both users and downstream tooling that relies on *.coder
resolving correctly, producing two distinct user-visible failure modes.

Failure mode 1 — stale DNS still answering

Embedded DNS continues to return fd60:627a:a42b::/48 addresses for
<workspace>.coder, but packets to those addresses go nowhere. Downstream effects:

  • ssh coder.<workspace> (via a coder config-ssh-style ProxyCommand) hits
    coder ssh --stdio, which calls workspacesdk.ExistsViaCoderConnect,
    sees an answer inside fd60:627a:a42b::/48, commits to the direct-Connect
    path, and times out with
    dial tcp [fd60:...]:22: connect: operation timed out.
  • coder ssh <workspace> (no --stdio) still works because it bypasses the
    Connect probe and builds its own tailnet client.
  • Reported recovery in the field has been "reboot the laptop," which tears down
    the stale state.

(Filed separately against coder/coder as coder/coder#26669 for the CLI-side
defensive fix.)
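
One possible shape for that kind of defensive check, sketched in Python rather than the CLI's Go: treat an address inside fd60:627a:a42b::/48 as a hint only, and confirm it with a short TCP dial before committing to the direct path. The function names, the 2-second timeout, and the injectable `dial` callable are illustrative assumptions, not the actual workspacesdk behavior.

```python
import ipaddress
import socket

# Coder Connect service prefix, as quoted in this issue.
CODER_PREFIX = ipaddress.ip_network("fd60:627a:a42b::/48")


def in_coder_prefix(addr: str) -> bool:
    """True if addr parses as an IPv6 address inside the Coder Connect prefix."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    return ip.version == 6 and ip in CODER_PREFIX


def tcp_reachable(addr: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a short TCP connect; any socket error counts as unreachable."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_use_coder_connect(addrs, port=22, dial=tcp_reachable) -> bool:
    """Take the direct path only if an address is in the prefix AND answers a dial."""
    return any(in_coder_prefix(a) and dial(a, port) for a in addrs)
```

With a stale tunnel the dial fails within the timeout, and the caller can fall back to building its own tailnet client, as plain coder ssh <workspace> already does.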

Failure mode 2 — embedded DNS returns empty answers

Conversely, the embedded DNS server stops returning anything for <workspace>.coder
while the tray still shows connected:

> Resolve-DnsName -Name myworkspace.coder -Server fd60:627a:a42b::53
(empty answer — no IPAddress, no NXDOMAIN)

Downstream effects:

  • File sync's remote directory picker (POST /api/v0/list-directory against
    http://<workspace>.coder:4/) fails with a raw
    System.Net.Sockets.SocketException (11001): No such host is known,
    which is surfaced to the user as a wall of stack trace in the directory
    picker dialog.
  • Mutagen sync daemon repeatedly fails to resolve <workspace>.coder hostnames.
  • Tailnet logs show errors such as "failed to dial tailnet v2+ API" and
    "no matching peer".

This is the same underlying state-management gap as #170, and tends to follow
the "service restarted without full off/on toggle" path from #149.
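
The directory-picker failure above is a case where a small amount of error classification would go a long way. The sketch below is in Python rather than the app's C#, and every name in it is hypothetical: it walks a wrapped exception chain and, if the root cause is a DNS or socket failure, returns a short message instead of the stack trace. The message text is a placeholder.

```python
import socket
from typing import Optional

FRIENDLY = "Workspace unreachable. Check Coder Connect status."

# Failure classes treated as "the workspace cannot be reached" (an assumption;
# the real app would match its own SocketException / DNS error types).
NETWORK_ERRORS = (socket.gaierror, socket.timeout, ConnectionError, TimeoutError)


def friendly_message(exc: Optional[BaseException]) -> Optional[str]:
    """Return a friendly message if a DNS/socket failure is anywhere in the chain."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cyclic exception chains
        if isinstance(exc, NETWORK_ERRORS):
            return FRIENDLY
        # Follow explicit "raise ... from" links first, then implicit context.
        exc = exc.__cause__ or exc.__context__
    return None
```

A caller would show the returned string when it is not None and fall back to its existing error display otherwise, so unrelated bugs still surface with full detail.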

What I'd like to see

  1. Tray indicator should reflect dataplane health, not just service state.
    When the tunnel cannot carry traffic (no peers, can't dial coordinator, embedded
    DNS not serving), the tray should turn yellow/red and surface a one-click
    recovery action ("Restart Coder Connect").
  2. State should stay coherent when the tunnel is unhealthy. If the tunnel is
    not actually serving traffic, the NRPT rule and embedded DNS should either be
    torn down (so callers get a clean NXDOMAIN) or kept in lockstep with the
    tunnel's real status. The current "DNS lingers, tunnel is gone" intermediate
    state is what trips both failure modes above.
  3. Friendlier error in the directory picker. Even after the underlying
    issue is fixed, list-directory should not surface a raw AggregateException
    stack to the user. Catch the socket / DNS failure class and render
    "Workspace unreachable — check Coder Connect status," ideally with a link to
    the tray.
  4. Recovery shouldn't require a reboot. Today users in the field are
    resolving this by rebooting, or by toggling Coder Connect off/on (#149 notes
    that a service restart alone leaves the system in a broken state — the
    off/on toggle from the tray is the documented recovery path). Either the
    service-restart path should fully re-establish state, or the off/on toggle
    should be discoverable from inside the broken state (e.g. surfaced by item 1).
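
Items 1 and 2 could be driven by the same derived state. A minimal sketch in Python, assuming the service can observe the four signals below; the signal names, the colour mapping, and the choice to treat zero peers as yellow rather than red are all assumptions for illustration, not existing Coder Desktop behavior.

```python
from dataclasses import dataclass
from enum import Enum


class TrayState(Enum):
    GREEN = "connected"
    YELLOW = "degraded"
    RED = "unhealthy"
    GREY = "stopped"


@dataclass(frozen=True)
class Signals:
    service_running: bool        # what the tray reflects today
    coordinator_reachable: bool  # can the tunnel dial the coordinator?
    peer_count: int              # peers currently known to the tunnel
    dns_serving: bool            # does embedded DNS return real answers?


def tray_state(s: Signals) -> TrayState:
    """Derive the tray colour from dataplane health, not just service state."""
    if not s.service_running:
        return TrayState.GREY
    if not s.coordinator_reachable or not s.dns_serving:
        return TrayState.RED
    if s.peer_count == 0:
        return TrayState.YELLOW
    return TrayState.GREEN


def keep_nrpt_rule(s: Signals) -> bool:
    """Item 2: keep the .coder NRPT rule only while the tunnel is fully healthy."""
    return tray_state(s) is TrayState.GREEN
```

Any non-green state is also the natural place to surface the "Restart Coder Connect" action, which would cover the discoverability concern in item 4.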

Reproduction

Hard to deterministically reproduce, but reported triggers include:

In all cases the tray UI continues to display the connected state.

Related

Created on behalf of @mdanter
