| 1 | +# Proposal: Make TLS + key-pinning the default daemon→registry transport (H1) |
| 2 | + |
| 3 | +Status: DRAFT — staged rollout design, not yet approved. No default flip is included. |
| 4 | +Audit finding: H1 — the `node_id → public_key` mapping that all peer-handshake |
| 5 | +authentication (and badge/recovery trust) rests on is fetched over plaintext TCP |
| 6 | +by default. TLS + pinning exists but is opt-in. |
| 7 | + |
| 8 | +This document is a design only. It changes no current default behaviour. Any |
| 9 | +default flip is gated on the maintainer decisions in §7. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 1. Current state (verified in code + against prod) |
| 14 | + |
| 15 | +### 1.1 The trust that rests on the mapping |
| 16 | + |
| 17 | +A receiving daemon authenticates an inbound key-exchange by fetching the peer's |
| 18 | +expected Ed25519 public key from the registry and comparing it to the key carried |
| 19 | +in the handshake packet: |
| 20 | + |
| 21 | +- `pkg/daemon/keyexchange/handle.go:46-75` — `HandleAuthFrame` calls |
| 22 | + `GetPeerPubKey(peerNodeID)` *first*, rejects if unavailable, and verifies the |
| 23 | + packet-carried Ed25519 key + signature against the registry-returned key. This |
| 24 | + is the actual MITM surface: whoever controls the `node_id → pubkey` answer |
| 25 | + controls which key is "trusted". |
| 26 | +- `pkg/daemon/keyexchange/keyexchange.go:382-407` — `GetPeerPubKey` resolves a |
| 27 | + cache miss through `verifyFunc`. |
| 28 | +- `pkg/daemon/daemon.go:994` — `verifyFunc` is wired to `d.lookupPeerPubKey`. |
| 29 | +- `pkg/daemon/daemon.go:5674-5686` — `lookupPeerPubKey` calls |
| 30 | + `d.regConn.Lookup(nodeID)` and reads `resp["public_key"]`. |
| 31 | + |
| 32 | +So the entire peer-auth chain terminates at one registry lookup over `d.regConn`. |
| 33 | + |
| 34 | +### 1.2 The lookup is application-layer-unauthenticated |
| 35 | + |
| 36 | +The registry's lookup response is **not signed**. `EncodeLookupResp` |
| 37 | +(`common/registry/wire/wire.go:183`) emits |
| 38 | +`[node_id][flags][reserved][nets][pubkey_len][pubkey]...` with no registry |
| 39 | +signature over the pubkey. The lookup *request* (`EncodeLookupReq`, |
| 40 | +`wire.go:168`) is `[4B node_id]` — also unauthenticated. Integrity of the pubkey |
| 41 | +therefore depends **entirely on the transport**. There is no second line of |
| 42 | +defence at the application layer. |
| 43 | + |
| 44 | +### 1.3 The transport is plaintext by default |
| 45 | + |
| 46 | +Daemon (`cmd/daemon/main.go`): |
| 47 | + |
| 48 | +- `-registry` default `34.71.57.205:9000` (`registryDefault`, line ~46). |
| 49 | +- `-registry-tls` default **`false`** (line 65). |
| 50 | +- `-registry-trust` default `"pinned"` (line 67) — but only consulted when |
| 51 | + `-registry-tls` is set. |
| 52 | +- `-registry-fingerprint` default `""` (line 66). |
| 53 | + |
| 54 | +Dial selection (`pkg/daemon/daemon.go:888-912`): |
| 55 | + |
| 56 | +``` |
| 57 | +if RegistryTLS: |
| 58 | + trust=="pinned" -> registry.DialTLSPinned(addr, fingerprint) // needs fingerprint |
| 59 | + trust=="system" -> registry.DialTLSPool(addr, {MinVersion:TLS12}, 4) |
| 60 | +else: |
| 61 | + registry.DialPool(addr, 4) // PLAINTEXT — default |
| 62 | +``` |
| 63 | + |
| 64 | +Because `-registry-tls` defaults to false, **the default path is |
| 65 | +`registry.DialPool` = plaintext TCP** (`common/registry/client/client.go:150`). |
| 66 | + |
| 67 | +`pilotctl` is worse: it has **no TLS option at all**. `connectRegistry()` |
| 68 | +(`cmd/pilotctl/main.go:487`) and the badge/recovery path |
| 69 | +(`cmd/pilotctl/verify.go:301`, `RecoverIdentity`) both call `registry.Dial(addr)` |
| 70 | +(plaintext) against `34.71.57.205:9000`. Recovery — re-binding a node_id to a new |
| 71 | +key — runs over plaintext. |
| 72 | + |
| 73 | +The compat tunnel mode (`-transport=compat`) does auto-route the registry to |
| 74 | +`registry.pilotprotocol.network:443` with `system` trust |
| 75 | +(`cmd/daemon/main.go:158-173`), but that is opt-in and only for UDP-blocked |
| 76 | +environments — the 222K-node default fleet is on `:9000` plaintext. |
| 77 | + |
| 78 | +### 1.4 What "pinning" means here |
| 79 | + |
| 80 | +`DialTLSPinned` (`common/registry/client/client.go:239-258`): |
| 81 | +`InsecureSkipVerify: true` + `VerifyPeerCertificate` comparing the **SHA-256 of |
| 82 | +the server's DER-encoded leaf certificate** to a hex fingerprint. This is a |
| 83 | +**leaf-cert pin**, not an SPKI pin and not a registry-pubkey pin. Consequence: |
| 84 | +it breaks the moment the registry's leaf cert rotates (including Let's Encrypt's |
| 85 | +~60-day renewals). Pinning the SPKI (public-key info) instead would survive |
| 86 | +same-key renewals. |
| 87 | + |
| 88 | +### 1.5 What the registry actually serves (verified against prod 2026-06) |
| 89 | + |
| 90 | +- `34.71.57.205:9000` → **plaintext** (raw TCP; TLS handshake fails). Matches the |
| 91 | + prod log line `registry listening transport=plaintext`. Server default |
| 92 | + `-tls=false` (`rendezvous/cmd/rendezvous/main.go:68`, |
| 93 | + `accept/accept.go:662-669`). |
| 94 | +- `registry.pilotprotocol.network:443` → **valid Let's Encrypt cert** |
| 95 | + (CN=registry.pilotprotocol.network, ECDSA via E7 intermediate, ~90-day cert, |
| 96 | + renewed ~every 60 days). Fronted by Caddy/nginx terminating TLS; the Go binary |
| 97 | + speaks plain WS/TCP upstream (`-wss-addr`, `main.go:80`). |
| 98 | + |
| 99 | +So a publicly-trusted TLS endpoint **already exists** and works today — it is |
| 100 | +just not the default the fleet dials, and `system` trust (CA chain) is weaker |
| 101 | +than a pin against a network-controlled identity. |
| 102 | + |
| 103 | +### 1.6 Self-signed instability (why naïve pinning is fragile) |
| 104 | + |
| 105 | +`accept.SetTLS("","")` → `GenerateSelfSignedCert()` produces an **in-memory** |
| 106 | +cert regenerated on every server restart (`accept/accept.go:419,632`). Pinning a |
| 107 | +leaf fingerprint against an auto-self-signed registry would break every pinned |
| 108 | +client on each registry restart. A pinned deployment therefore **requires a |
| 109 | +persistent cert/key on disk** (`-tls-cert`/`-tls-key`) with a stable identity. |
| 110 | + |
| 111 | +--- |
| 112 | + |
| 113 | +## 2. Target state |
| 114 | + |
| 115 | +The `node_id → pubkey` answer must be authenticated against a **network-controlled |
| 116 | +identity that clients verify without trusting a public CA** and without a |
| 117 | +first-contact TOFU window. |
| 118 | + |
| 119 | +There are two complementary mechanisms. We want **both**, but only (A) is |
| 120 | +required to close H1 at the transport layer: |
| 121 | + |
| 122 | +**(A) Pinned TLS to the registry, by default.** Daemon (and eventually pilotctl) |
| 123 | +dial the registry over TLS and verify the server identity against a pin that |
| 124 | +ships in the binary, not the OS trust store. Pin against the **SPKI** of a stable |
| 125 | +registry key (Pilot-controlled), surviving cert renewals. Maintain a small set of |
| 126 | +valid pins (current + next) to allow key rotation. |
| 127 | + |
| 128 | +**(B) Application-layer signing of the lookup response (defence in depth).** The |
| 129 | +registry signs `(node_id, pubkey, ...)` with a Pilot registry signing key whose |
| 130 | +public half ships in the binary; clients verify the signature on every lookup. |
| 131 | +This makes the `node_id → pubkey` mapping authenticated **independent of |
| 132 | +transport**, so a future TLS-terminating proxy, a mis-pinned client, or a |
| 133 | +compromised CA cannot forge the mapping. (B) is a wire-format change in |
| 134 | +`common/registry` and is the durable fix; (A) is the faster one. Recommend |
| 135 | +shipping (A) first, (B) as a fast-follow. |
| 136 | + |
| 137 | +Plaintext `:9000` is ultimately retired (or kept only as an explicitly-flagged, |
| 138 | +loudly-warned escape hatch). |
| 139 | + |
| 140 | +--- |
| 141 | + |
| 142 | +## 3. Staged migration (never breaks live nodes) |
| 143 | + |
| 144 | +Guiding rule: **clients learn to prefer the secure path before the server removes |
| 145 | +the insecure one, and the server offers the secure path before clients require |
| 146 | +it.** No stage makes a previously-working daemon fail. |
| 147 | + |
| 148 | +### Stage 0 — Pin distribution & stable server identity (server + build, no client behaviour change) |
| 149 | + |
| 150 | +This solves the pin-distribution / TOFU problem *before* any client prefers TLS. |
| 151 | + |
| 152 | +1. Generate a long-lived registry TLS identity (dedicated keypair, not the |
| 153 | + Let's Encrypt leaf) and persist it. Serve it from a TLS listener on a port |
| 154 | + alongside `:9000` (e.g. `:9443`) via `-tls -tls-cert -tls-key`, **in addition to** |
| 155 | + plaintext `:9000`. Keep `:443` (Let's Encrypt) as-is for compat/system trust. |
| 156 | +2. **Bake the pin into the release.** The SPKI pin and, for Stage B, the registry |
| 157 | + signing pubkey are embedded as compiled constants in the daemon/pilotctl |
| 158 | + binaries — same model already used for `internal/transport/compat/roots.go` |
| 159 | + (the embedded Pilot CA root) and `internal/trustedagents`. Because the pin |
| 160 | + ships *in the signed binary the auto-updater already distributes*, there is |
| 161 | + **no TOFU first-contact window**: the client knows the expected identity |
| 162 | + before it ever connects. |
| 163 | +3. Also publish the pin out-of-band for verification: in the pinned catalogue, |
| 164 | + on the website Configuration page, and in release notes — so operators can |
| 165 | + audit that the embedded pin matches. |
| 166 | +4. Add SPKI-pin support to `DialTLSPinned` (today it is a leaf-cert pin): |
| 167 | + add an SPKI-fingerprint comparison path so same-key renewals don't break pins. |
| 168 | + |
| 169 | +Compatibility impact: none. No client dials the new listener yet. Rollback: stop |
| 170 | +serving the extra listener. |
| 171 | + |
| 172 | +### Stage 1 — Ship clients that CAN pin, opt-in (client release N) |
| 173 | + |
| 174 | +1. Add embedded-pin support to the daemon: a `-registry-trust=pinned-builtin` |
| 175 | + value (or `RegistryFingerprint`/SPKI auto-populated from the embedded |
| 176 | + constant when empty) so an operator can select pinned TLS **without manually |
| 177 | + copying a fingerprint**. |
| 178 | +2. Add the same TLS+pin capability to `pilotctl` (currently has none): a |
| 179 | + `--registry-tls`/built-in-pin path through `connectRegistry()` and the |
| 180 | + recovery/verify dial in `verify.go`. |
| 181 | +3. Default remains plaintext. Operators opt in. Document it. |
| 182 | + |
| 183 | +Compatibility: zero default change. Older daemons unaffected. Rollback: operators |
| 184 | +drop the flag. |
| 185 | + |
| 186 | +### Stage 2 — Prefer-TLS-with-fallback + adoption telemetry (client release N+1) |
| 187 | + |
| 188 | +1. Default behaviour becomes: **try pinned TLS first; on dial/handshake failure, |
| 189 | + fall back to plaintext with a loud WARN** (`slog.Warn("registry: falling back |
| 190 | + to PLAINTEXT — MITM-exposed; …")`). This makes the secure path the *preferred* |
| 191 | + path while guaranteeing no daemon loses connectivity if the TLS listener is |
| 192 | + unreachable for it. |
| 193 | +2. Emit consent-gated telemetry recording which transport each daemon actually |
| 194 | + used (pinned-tls / system-tls / plaintext-fallback). Wire through the existing |
| 195 | + `pkg/telemetry` client (the same consent-gated path used elsewhere) so we can |
| 196 | + measure TLS adoption across the fleet. |
| 197 | +3. The fallback is the safety net that lets us change the default without a flag |
| 198 | + day; the telemetry is what tells us when fallback has stopped firing. |
| 199 | + |
| 200 | +Compatibility: a daemon whose network blocks the TLS port keeps working via |
| 201 | +fallback (with warnings). Rollback: revert the default to plaintext-first; the |
| 202 | +code paths already exist. |
| 203 | + |
| 204 | +### Stage 3 — Flip the default to pinned-TLS-required-with-warned-fallback (client release N+2) |
| 205 | + |
| 206 | +Precondition: Stage-2 telemetry shows TLS success ≥ an agreed threshold |
| 207 | +(e.g. ≥99% of active daemons reaching the TLS listener) **and** the registry has |
| 208 | +served the pinned listener stably for ≥1 renewal cycle. |
| 209 | + |
| 210 | +1. Default `-registry-tls=true`, `-registry-trust=pinned-builtin`. |
| 211 | +2. Fallback to plaintext **still exists** but is no longer automatic: it requires |
| 212 | + the explicit `-registry-allow-plaintext` flag (with a loud warning), so a default |
| 213 | + flip can never strand a daemon, because the operator can always re-enable plaintext. |
| 214 | +3. The auto-updater rolls release N+2 across the fleet gradually (it already |
| 215 | + checks GitHub releases hourly and is opt-in per host — see §4). |
| 216 | + |
| 217 | +Compatibility: daemons that genuinely can't reach the TLS listener fail closed by |
| 218 | +default but have a documented escape hatch. Rollback: ship N+2.1 reverting the |
| 219 | +default; or operators set `-registry-allow-plaintext`. |
| 220 | + |
| 221 | +### Stage 4 — Deprecate, then disable plaintext (server, release N+3+) |
| 222 | + |
| 223 | +1. Server logs every plaintext accept with the connecting node_id (deprecation |
| 224 | + signal); dashboards surface remaining plaintext talkers. |
| 225 | +2. Once the plaintext-talker count is ~0 for a sustained window, drop the |
| 226 | + plaintext `:9000` listener (or gate it behind a server `-allow-plaintext` |
| 227 | + flag, default off). |
| 228 | + |
| 229 | +Compatibility: only daemons still on plaintext break — by then a measured ~0. |
| 230 | +Rollback: re-enable the plaintext listener (one flag). |
| 231 | + |
| 232 | +--- |
| 233 | + |
| 234 | +## 4. Compatibility analysis — the 222K fleet + auto-updater |
| 235 | + |
| 236 | +- **Auto-updater is opt-in and per-host** (writes `~/.pilot/auto-update.json`, |
| 237 | + re-read each tick; controlled via `pilotctl update enable|disable`). It checks |
| 238 | + GitHub releases hourly. This means **the fleet does not update atomically** — |
| 239 | + at any moment a mix of release N, N-1, N-2 is live. Every stage must tolerate |
| 240 | + version skew, which the prefer-with-fallback design (Stage 2) and the |
| 241 | + server-offers-before-client-requires ordering (Stage 0 before Stage 3) |
| 242 | + guarantee. |
| 243 | +- Hosts with auto-update **off** stay on old releases indefinitely. They keep |
| 244 | + working as long as plaintext `:9000` is served — which is why Stage 4 |
| 245 | + (disable plaintext) is gated on observed ~0 plaintext talkers, not on a date. |
| 246 | +- **Per-stage break matrix:** |
| 247 | + - Stage 0: nothing breaks. |
| 248 | + - Stage 1: nothing breaks (opt-in). |
| 249 | + - Stage 2: nothing breaks; daemons on TLS-blocked networks fall back + warn. |
| 250 | + - Stage 3: daemons that take release N+2 but can't reach the TLS listener |
| 251 | + fail closed by default. This is mitigated by the `-registry-allow-plaintext` escape |
| 252 | + hatch and by not flipping until telemetry shows ~full reachability. |
| 253 | + - Stage 4: only still-plaintext daemons break (by then ~0). |
| 254 | +- **pilotctl** is a separate binary (Homebrew tap + install.sh). Its TLS path |
| 255 | + (Stage 1) and default flip should track the daemon's but can lag; until then |
| 256 | + recovery/verify stay plaintext, so **prioritise pilotctl pinning** given |
| 257 | + recovery is the highest-value target. |
| 258 | + |
| 259 | +--- |
| 260 | + |
| 261 | +## 5. Concrete code touch-points |
| 262 | + |
| 263 | +Stage 0 (server + build): |
| 264 | +- `rendezvous/cmd/rendezvous/main.go` — run a second TLS listener with a |
| 265 | + persistent cert (`-tls -tls-cert -tls-key`) alongside plaintext; deploy a |
| 266 | + stable registry keypair. |
| 267 | +- `rendezvous/accept/accept.go:604-654` — ensure a stable on-disk cert path |
| 268 | + (avoid the in-memory self-signed regen for the pinned listener). |
| 269 | +- `common/registry/client/client.go:239-258` — add an **SPKI** pin comparison |
| 270 | + alongside the existing leaf-cert fingerprint pin. |
| 271 | +- New embedded constant (mirror `internal/transport/compat/roots.go`): registry |
| 272 | + SPKI pin(s) + (Stage B) registry signing pubkey. |
| 273 | + |
| 274 | +Stage 1 (clients can pin): |
| 275 | +- `cmd/daemon/main.go:65-67,226-228` — add `pinned-builtin` trust mode; populate |
| 276 | + `RegistryFingerprint`/SPKI from the embedded constant when empty. |
| 277 | +- `pkg/daemon/daemon.go:888-912` — dial selection: add the builtin-pin branch. |
| 278 | +- `cmd/pilotctl/main.go:264-272,487-495` — add TLS+pin to `getRegistry`/ |
| 279 | + `connectRegistry`. |
| 280 | +- `cmd/pilotctl/verify.go:301` (and `RecoverIdentity` path) — TLS+pin the |
| 281 | + recovery dial. |
| 282 | + |
| 283 | +Stage 2 (prefer + telemetry): |
| 284 | +- `pkg/daemon/daemon.go:888-926` — wrap the dial loop: try pinned TLS, fall back |
| 285 | + to plaintext with WARN; record the chosen transport. |
| 286 | +- `pkg/telemetry/client.go` — add a transport-adoption metric (consent-gated). |
| 287 | + |
| 288 | +Stage 3 (flip): |
| 289 | +- `cmd/daemon/main.go` — defaults `-registry-tls=true`, |
| 290 | + `-registry-trust=pinned-builtin`; add `-registry-allow-plaintext`. |
| 291 | + |
| 292 | +Stage 4 (server retire): |
| 293 | +- `rendezvous/cmd/rendezvous/main.go` / `accept/accept.go` — log/gate plaintext |
| 294 | + accepts; later drop the plaintext listener. |
| 295 | + |
| 296 | +Stage B (defence in depth, any time after Stage 0): |
| 297 | +- `common/registry/wire/wire.go:183` (`EncodeLookupResp`) + client decode — add a |
| 298 | + registry signature over `(node_id, pubkey, …)`; verify in |
| 299 | + `pkg/daemon/daemon.go:5674-5686` (`lookupPeerPubKey`). |
| 300 | + |
| 301 | +--- |
| 302 | + |
| 303 | +## 6. Pin-distribution problem (explicit) |
| 304 | + |
| 305 | +The classic objection to pinning is the bootstrap: how does a client learn the |
| 306 | +pin without a TOFU window an attacker can occupy? Resolution here: |
| 307 | + |
| 308 | +- The pin is **compiled into the binary** the auto-updater already ships and that |
| 309 | + install.sh/Homebrew already distribute — the same trust path already used for |
| 310 | + the embedded compat CA root and trusted-agents list. The client knows the |
| 311 | + expected registry identity *before first connect*. **No TOFU.** |
| 312 | +- The binary's own integrity is the root of trust (release signing + the |
| 313 | + updater's existing verification). We are not adding a new trust anchor, we are |
| 314 | + reusing the one that already gates code execution. |
| 315 | +- We pin **SPKI** (stable across renewals) and carry **multiple valid pins** |
| 316 | + (current + next) so the registry key can be rotated by shipping a release that |
| 317 | + adds the next pin before the server switches to it. |
| 318 | +- The pin is also published out-of-band (catalogue, website, release notes) for |
| 319 | + auditability. |
| 320 | + |
| 321 | +--- |
| 322 | + |
| 323 | +## 7. Risks & open questions (maintainer decisions needed before any flip) |
| 324 | + |
| 325 | +1. **SPKI vs leaf-cert pin.** Confirm we move `DialTLSPinned` to SPKI pinning |
| 326 | + (recommended) so Let's Encrypt / cert renewals don't break pins. The current |
| 327 | + leaf-cert pin would require re-pinning every ~60 days. |
| 328 | +2. **Registry TLS identity.** Do we mint a dedicated long-lived Pilot registry |
| 329 | + keypair for pinning (recommended), or pin the SPKI of the existing Let's |
| 330 | + Encrypt-fronted endpoint? The latter couples the pin to ACME key rotation. |
| 331 | +3. **Port topology.** Serve pinned TLS on `:9000` itself (protocol-sniff or flag |
| 332 | + day on the port) or on a new sibling port (e.g. `:9443`)? Sibling port is |
| 333 | + non-breaking; reusing `:9000` is cleaner long-term but needs care. |
| 334 | +4. **Stage B scope.** Do we also ship application-layer signed lookups (the |
| 335 | + transport-independent fix)? Recommended, but it is a `common/registry` |
| 336 | + wire-format change with its own compat story. |
| 337 | +5. **Adoption threshold for the Stage-3 flip.** What TLS-reachability % and what |
| 338 | + soak time gate the default flip? |
| 339 | +6. **pilotctl priority.** Recovery/verify over plaintext is arguably the |
| 340 | + single highest-value target. Confirm we fast-track pilotctl pinning ahead of |
| 341 | + the daemon default flip. |
| 342 | +7. **Escape hatch policy.** Keep `-registry-allow-plaintext` indefinitely |
| 343 | + (operability) or sunset it on a date (security)? |
| 344 | + |
| 345 | +No code in this proposal changes a default. Approval of the above unblocks |
| 346 | +Stage 0 (server-side, fully backward-compatible) as the first concrete step. |