Commit bd13ec8

TeoSlayer and teovl authored
docs: staged TLS+pinning rollout for registry transport (H1) (#310)
docs: propose staged TLS+pinning rollout for registry transport (H1) Co-authored-by: Teodor <teodor@vulturelabs.io>
1 parent f43fe95 commit bd13ec8

1 file changed

Lines changed: 346 additions & 0 deletions

# Proposal: Make TLS + key-pinning the default daemon→registry transport (H1)

Status: DRAFT (staged rollout design, not yet approved). No default flip is included.

Audit finding: H1. The `node_id → public_key` mapping that all peer-handshake
authentication (and badge/recovery trust) rests on is fetched over plaintext TCP
by default. TLS + pinning exists but is opt-in.

This document is a design only. It changes no current default behaviour. Any
default flip is gated on the maintainer decisions in §7.

---

## 1. Current state (verified in code + against prod)

### 1.1 The trust that rests on the mapping

A receiving daemon authenticates an inbound key-exchange by fetching the peer's
expected Ed25519 public key from the registry and comparing it to the key carried
in the handshake packet:

- `pkg/daemon/keyexchange/handle.go:46-75`: `HandleAuthFrame` calls
  `GetPeerPubKey(peerNodeID)` *first*, rejects if unavailable, and verifies the
  packet-carried Ed25519 key + signature against the registry-returned key. This
  is the actual MITM surface: whoever controls the `node_id → pubkey` answer
  controls which key is "trusted".
- `pkg/daemon/keyexchange/keyexchange.go:382-407`: `GetPeerPubKey` resolves a
  cache miss through `verifyFunc`.
- `pkg/daemon/daemon.go:994`: `verifyFunc` is wired to `d.lookupPeerPubKey`.
- `pkg/daemon/daemon.go:5674-5686`: `lookupPeerPubKey` calls
  `d.regConn.Lookup(nodeID)` and reads `resp["public_key"]`.

So the entire peer-auth chain terminates at one registry lookup over `d.regConn`.
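
To make the MITM surface concrete, here is a minimal sketch of the check described in the bullets above. It is not the actual `HandleAuthFrame`: `lookupFunc`, `verifyHandshake`, `demoForgedLookup`, and the placeholder transcript are hypothetical names invented for illustration, and the real signed payload is not specified in this document.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"errors"
	"fmt"
)

// lookupFunc stands in for the registry lookup (hypothetical signature).
type lookupFunc func(nodeID uint32) (ed25519.PublicKey, error)

// verifyHandshake accepts a handshake only if the key it carries equals the
// registry-returned key and the signature over transcript verifies under it.
func verifyHandshake(lookup lookupFunc, nodeID uint32, packetKey ed25519.PublicKey, transcript, sig []byte) error {
	want, err := lookup(nodeID)
	if err != nil {
		return fmt.Errorf("peer key unavailable: %w", err)
	}
	if len(want) != ed25519.PublicKeySize || !bytes.Equal(want, packetKey) {
		return errors.New("packet key does not match registry key")
	}
	if !ed25519.Verify(want, transcript, sig) {
		return errors.New("bad handshake signature")
	}
	return nil
}

// demoForgedLookup runs the same attacker handshake against an honest and a
// forged registry answer and returns both verification results.
func demoForgedLookup() (honest, forged error) {
	peerPub, _, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return err, err
	}
	evilPub, evilPriv, err := ed25519.GenerateKey(nil)
	if err != nil {
		return err, err
	}
	transcript := []byte("handshake transcript (placeholder)")
	sig := ed25519.Sign(evilPriv, transcript)

	// answer builds a lookup that always returns key k.
	answer := func(k ed25519.PublicKey) lookupFunc {
		return func(uint32) (ed25519.PublicKey, error) { return k, nil }
	}
	honest = verifyHandshake(answer(peerPub), 7, evilPub, transcript, sig)
	forged = verifyHandshake(answer(evilPub), 7, evilPub, transcript, sig)
	return honest, forged
}
```

With an honest lookup answer the attacker's handshake is rejected; with a forged answer the identical handshake verifies, which is why the integrity of the lookup carries the whole chain.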

### 1.2 The lookup is application-layer-unauthenticated

The registry's lookup response is **not signed**. `EncodeLookupResp`
(`common/registry/wire/wire.go:183`) emits
`[node_id][flags][reserved][nets][pubkey_len][pubkey]...` with no registry
signature over the pubkey. The lookup *request* (`EncodeLookupReq`,
`wire.go:168`) is `[4B node_id]`, which is also unauthenticated. Integrity of the
pubkey therefore depends **entirely on the transport**. There is no second line
of defence at the application layer.

### 1.3 The transport is plaintext by default

Daemon (`cmd/daemon/main.go`):

- `-registry` default `34.71.57.205:9000` (`registryDefault`, line ~46).
- `-registry-tls` default **`false`** (line 65).
- `-registry-trust` default `"pinned"` (line 67), but only consulted when
  `-registry-tls` is set.
- `-registry-fingerprint` default `""` (line 66).

Dial selection (`pkg/daemon/daemon.go:888-912`):

```
if RegistryTLS:
    trust=="pinned" -> registry.DialTLSPinned(addr, fingerprint)   // needs fingerprint
    trust=="system" -> registry.DialTLSPool(addr, {MinVersion:TLS12}, 4)
else:
    registry.DialPool(addr, 4)                                     // PLAINTEXT (default)
```

Because `-registry-tls` defaults to false, **the default path is
`registry.DialPool` = plaintext TCP** (`common/registry/client/client.go:150`).

`pilotctl` is worse: it has **no TLS option at all**. `connectRegistry()`
(`cmd/pilotctl/main.go:487`) and the badge/recovery path
(`cmd/pilotctl/verify.go:301`, `RecoverIdentity`) both call `registry.Dial(addr)`
(plaintext) against `34.71.57.205:9000`. Recovery, which re-binds a node_id to a
new key, runs over plaintext.

The compat tunnel mode (`-transport=compat`) does auto-route the registry to
`registry.pilotprotocol.network:443` with `system` trust
(`cmd/daemon/main.go:158-173`), but that is opt-in and only for UDP-blocked
environments. The 222K-node default fleet is on `:9000` plaintext.

### 1.4 What "pinning" means here

`DialTLSPinned` (`common/registry/client/client.go:239-258`) sets
`InsecureSkipVerify: true` and a `VerifyPeerCertificate` callback that compares
the **SHA-256 of the server's DER-encoded leaf certificate** to a hex
fingerprint. This is a **leaf-cert pin**, not an SPKI pin and not a
registry-pubkey pin. Consequence: it breaks the moment the registry's leaf cert
rotates (including Let's Encrypt's ~60-day renewals). Pinning the SPKI
(SubjectPublicKeyInfo) instead would survive same-key renewals.
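
The difference between the two pin types can be shown with the standard library alone. The sketch below is illustrative and not taken from `client.go`: `leafFingerprint`, `spkiFingerprint`, `issue`, and `renewalChangesPins` are hypothetical helpers, and a self-signed certificate stands in for the registry's real one. It issues two certificates for the same key, as a same-key renewal would, and compares both fingerprints.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/hex"
	"math/big"
	"time"
)

// leafFingerprint hashes the whole DER certificate, which is what a
// leaf-cert pin compares.
func leafFingerprint(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.Raw)
	return hex.EncodeToString(sum[:])
}

// spkiFingerprint hashes only the DER-encoded SubjectPublicKeyInfo.
func spkiFingerprint(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return hex.EncodeToString(sum[:])
}

// issue creates a self-signed certificate for key with the given serial.
func issue(key *ecdsa.PrivateKey, serial int64) (*x509.Certificate, error) {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "registry.example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(90 * 24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// renewalChangesPins reports whether each pin type changes when a
// certificate is re-issued for the same key.
func renewalChangesPins() (leafChanged, spkiChanged bool, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return false, false, err
	}
	before, err := issue(key, 1)
	if err != nil {
		return false, false, err
	}
	after, err := issue(key, 2)
	if err != nil {
		return false, false, err
	}
	return leafFingerprint(before) != leafFingerprint(after),
		spkiFingerprint(before) != spkiFingerprint(after), nil
}
```

The leaf fingerprint changes on re-issue while the SPKI fingerprint does not, provided the renewal reuses the key. If the ACME client rotates the key on renewal, an SPKI pin breaks too.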

### 1.5 What the registry actually serves (verified against prod 2026-06)

- `34.71.57.205:9000`: **plaintext** (raw TCP; TLS handshake fails). Matches the
  prod log line `registry listening transport=plaintext`. Server default
  `-tls=false` (`rendezvous/cmd/rendezvous/main.go:68`,
  `accept/accept.go:662-669`).
- `registry.pilotprotocol.network:443`: **valid Let's Encrypt cert**
  (CN=registry.pilotprotocol.network, ECDSA via E7 intermediate, ~90-day cert,
  renewed ~every 60 days). Fronted by Caddy/nginx terminating TLS; the Go binary
  speaks plain WS/TCP upstream (`-wss-addr`, `main.go:80`).

So a publicly-trusted TLS endpoint **already exists** and works today. It is
just not the default the fleet dials, and `system` trust (CA chain) is weaker
than a pin against a network-controlled identity.

### 1.6 Self-signed instability (why naïve pinning is fragile)

`accept.SetTLS("","")` → `GenerateSelfSignedCert()` produces an **in-memory**
cert regenerated on every server restart (`accept/accept.go:419,632`). Pinning a
leaf fingerprint against an auto-self-signed registry would break every pinned
client on each registry restart. A pinned deployment therefore **requires a
persistent cert/key on disk** (`-tls-cert`/`-tls-key`) with a stable identity.

---

## 2. Target state

The `node_id → pubkey` answer must be authenticated against a **network-controlled
identity that clients verify without trusting a public CA** and without a
first-contact TOFU window.

Two complementary mechanisms; we want **both**, but only (A) is required to close
H1 at the transport layer:

**(A) Pinned TLS to the registry, by default.** Daemon (and eventually pilotctl)
dial the registry over TLS and verify the server identity against a pin that
ships in the binary, not the OS trust store. Pin against the **SPKI** of a stable
registry key (Pilot-controlled), surviving cert renewals. Maintain a small set of
valid pins (current + next) to allow key rotation.

**(B) Application-layer signing of the lookup response (defence in depth).** The
registry signs `(node_id, pubkey, ...)` with a Pilot registry signing key whose
public half ships in the binary; clients verify the signature on every lookup.
This makes the `node_id → pubkey` mapping authenticated **independent of
transport**, so a future TLS-terminating proxy, a mis-pinned client, or a
compromised CA cannot forge the mapping. (B) is a wire-format change in
`common/registry` and is the durable fix; (A) is the faster one. We recommend
shipping (A) first, with (B) as a fast-follow.
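
A minimal sketch of what (B) could look like, assuming an Ed25519 registry signing key. Nothing here reflects the actual `common/registry` wire format: the domain tag, the byte layout built by `signedMessage`, and the names `signLookup`, `verifyLookup`, and `demoSignedLookup` are all assumptions made for illustration.

```go
package main

import (
	"crypto/ed25519"
	"encoding/binary"
	"errors"
)

// signedMessage builds the bytes the registry would sign: a domain tag,
// the node_id, and the pubkey. This layout is hypothetical.
func signedMessage(nodeID uint32, pubkey []byte) []byte {
	msg := []byte("pilot-lookup-v1:")
	msg = binary.BigEndian.AppendUint32(msg, nodeID)
	return append(msg, pubkey...)
}

// signLookup is the registry side: sign the (node_id, pubkey) binding.
func signLookup(regPriv ed25519.PrivateKey, nodeID uint32, pubkey []byte) []byte {
	return ed25519.Sign(regPriv, signedMessage(nodeID, pubkey))
}

// verifyLookup is the client side: check the binding against the registry
// public key that would ship in the binary.
func verifyLookup(regPub ed25519.PublicKey, nodeID uint32, pubkey, sig []byte) error {
	if len(regPub) != ed25519.PublicKeySize {
		return errors.New("lookup response: malformed registry key")
	}
	if !ed25519.Verify(regPub, signedMessage(nodeID, pubkey), sig) {
		return errors.New("lookup response: bad registry signature")
	}
	return nil
}

// demoSignedLookup verifies a genuine response and one whose pubkey was
// swapped in transit while the original signature was kept.
func demoSignedLookup() (genuine, tampered error) {
	regPub, regPriv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return err, err
	}
	peerPub, _, err := ed25519.GenerateKey(nil)
	if err != nil {
		return err, err
	}
	evilPub, _, err := ed25519.GenerateKey(nil)
	if err != nil {
		return err, err
	}
	sig := signLookup(regPriv, 42, peerPub)
	return verifyLookup(regPub, 42, peerPub, sig), verifyLookup(regPub, 42, evilPub, sig)
}
```

Binding `node_id` into the signed bytes matters: signing the pubkey alone would let an attacker replay a valid signature for a different node. A real design would probably also need freshness or revocation data in the signed payload, which this sketch leaves out.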

Plaintext `:9000` is ultimately retired (or kept only as an explicitly-flagged,
loudly-warned escape hatch).

---

## 3. Staged migration (never breaks live nodes)

Guiding rule: **clients learn to prefer the secure path before the server removes
the insecure one, and the server offers the secure path before clients require
it.** No stage makes a previously-working daemon fail.

### Stage 0: Pin distribution & stable server identity (server + build, no client behaviour change)

This solves the pin-distribution / TOFU problem *before* any client prefers TLS.

1. Generate a long-lived registry TLS identity (dedicated keypair, not the
   Let's Encrypt leaf) and persist it. Serve it on a TLS listener on a sibling
   port of `:9000` (e.g. `:9443`) via `-tls -tls-cert -tls-key`, **in addition
   to** plaintext `:9000`. Keep `:443` (Let's Encrypt) as-is for compat/system
   trust.
2. **Bake the pin into the release.** The SPKI pin (and the registry signing
   pubkey for Stage B) are embedded as compiled constants in the daemon/pilotctl
   binaries, the same model already used for
   `internal/transport/compat/roots.go` (the embedded Pilot CA root) and
   `internal/trustedagents`. Because the pin ships *in the signed binary the
   auto-updater already distributes*, there is **no TOFU first-contact window**:
   the client knows the expected identity before it ever connects.
3. Also publish the pin out-of-band for verification (in the pinned catalogue,
   on the website Configuration page, and in release notes) so operators can
   audit that the embedded pin matches.
4. Add SPKI-pin support to `DialTLSPinned` (today it is a leaf-cert pin):
   add an SPKI-fingerprint comparison path so renewals don't break pins.

Compatibility impact: none. No client dials the new listener yet. Rollback: stop
serving the extra listener.
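
Step 4 combined with the "current + next" pin set could be sketched as follows. This is an assumption-laden illustration rather than the real `DialTLSPinned`: `spkiPinnedConfig`, `selfSignedDER`, and `demoPinSet` are hypothetical names, and throwaway self-signed Ed25519 certificates stand in for the registry identity.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"math/big"
	"time"
)

// spkiPinnedConfig returns a tls.Config that skips CA-chain validation and
// accepts the server only if the SHA-256 of its leaf SPKI is in pins.
func spkiPinnedConfig(pins [][sha256.Size]byte) *tls.Config {
	return &tls.Config{
		MinVersion:         tls.VersionTLS12,
		InsecureSkipVerify: true, // chain validation is replaced by the pin check below
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("registry presented no certificate")
			}
			leaf, err := x509.ParseCertificate(rawCerts[0])
			if err != nil {
				return err
			}
			got := sha256.Sum256(leaf.RawSubjectPublicKeyInfo)
			for _, pin := range pins {
				if subtle.ConstantTimeCompare(got[:], pin[:]) == 1 {
					return nil
				}
			}
			return errors.New("registry SPKI does not match any embedded pin")
		},
	}
}

// selfSignedDER returns a throwaway self-signed certificate (DER) and the
// SHA-256 of its SPKI, standing in for a registry identity.
func selfSignedDER() ([]byte, [sha256.Size]byte, error) {
	var pin [sha256.Size]byte
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, pin, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "registry.example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
	if err != nil {
		return nil, pin, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, pin, err
	}
	return der, sha256.Sum256(cert.RawSubjectPublicKeyInfo), nil
}

// demoPinSet pins a "current" and a "next" key, then checks three servers:
// one per pinned key plus an unpinned stranger.
func demoPinSet() (current, next, stranger error) {
	curDER, curPin, err := selfSignedDER()
	if err != nil {
		return err, err, err
	}
	nextDER, nextPin, err := selfSignedDER()
	if err != nil {
		return err, err, err
	}
	strangerDER, _, err := selfSignedDER()
	if err != nil {
		return err, err, err
	}
	verify := spkiPinnedConfig([][sha256.Size]byte{curPin, nextPin}).VerifyPeerCertificate
	return verify([][]byte{curDER}, nil), verify([][]byte{nextDER}, nil), verify([][]byte{strangerDER}, nil)
}
```

Carrying two pins lets a release add the next key before the server switches to it. One caveat from the Go `crypto/tls` documentation: `VerifyPeerCertificate` is not invoked on resumed connections, so a real implementation may want `VerifyConnection` instead or should leave session resumption disabled.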

### Stage 1: Ship clients that CAN pin, opt-in (client release N)

1. Add embedded-pin support to the daemon: a `-registry-trust=pinned-builtin`
   value (or `RegistryFingerprint`/SPKI auto-populated from the embedded
   constant when empty) so an operator can select pinned TLS **without manually
   copying a fingerprint**.
2. Add the same TLS+pin capability to `pilotctl` (currently has none): a
   `--registry-tls`/built-in-pin path through `connectRegistry()` and the
   recovery/verify dial in `verify.go`.
3. Default remains plaintext. Operators opt in. Document it.

Compatibility impact: no default change. Older daemons are unaffected. Rollback:
operators drop the flag.

### Stage 2: Prefer-TLS-with-fallback + adoption telemetry (client release N+1)

1. Default behaviour becomes: **try pinned TLS first; on dial/handshake failure,
   fall back to plaintext with a loud WARN** (`slog.Warn("registry: falling back
   to PLAINTEXT - MITM-exposed; …")`). This makes the secure path the *preferred*
   path while guaranteeing no daemon loses connectivity if the TLS listener is
   unreachable for it.
2. Emit consent-gated telemetry recording which transport each daemon actually
   used (pinned-tls / system-tls / plaintext-fallback). Wire through the existing
   `pkg/telemetry` client (the same consent-gated path used elsewhere) so we can
   measure TLS adoption across the fleet.
3. The fallback is the safety net that lets us change the default without a flag
   day; the telemetry is what tells us when fallback has stopped firing.

Compatibility: a daemon whose network blocks the TLS port keeps working via
fallback (with warnings). Rollback: revert the default to plaintext-first; the
code paths already exist.
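
The prefer-with-fallback policy can be sketched independently of the real registry client. In the sketch below `dialFunc`, `dialPreferTLS`, the `transport` labels, and `errUnreachable` are hypothetical; a real implementation would return a connection rather than just an error, and would hand the chosen label to the telemetry client.

```go
package main

import (
	"errors"
	"log/slog"
)

// transport labels which path a daemon ended up on (for telemetry).
type transport string

const (
	transportPinnedTLS         transport = "pinned-tls"
	transportPlaintextFallback transport = "plaintext-fallback"
)

// dialFunc abstracts one registry dial attempt (hypothetical; real code
// would return a client or connection as well).
type dialFunc func(addr string) error

// errUnreachable is a stand-in failure for examples and tests.
var errUnreachable = errors.New("registry listener unreachable")

// dialPreferTLS tries pinned TLS first and falls back to plaintext with a
// loud warning, reporting which transport was actually used.
func dialPreferTLS(tlsAddr, plainAddr string, dialTLS, dialPlain dialFunc) (transport, error) {
	tlsErr := dialTLS(tlsAddr)
	if tlsErr == nil {
		return transportPinnedTLS, nil
	}
	slog.Warn("registry: falling back to PLAINTEXT - MITM-exposed",
		"tls_addr", tlsAddr, "plain_addr", plainAddr, "err", tlsErr)
	if err := dialPlain(plainAddr); err != nil {
		// Both paths failed; report both causes.
		return "", errors.Join(tlsErr, err)
	}
	return transportPlaintextFallback, nil
}
```

Under this policy a pin mismatch also triggers fallback, so an active attacker who can break the TLS handshake can still force plaintext. Stage 2 therefore improves the default for passive and opportunistic attackers only; it is Stage 3's fail-closed default that removes the downgrade.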

### Stage 3: Flip the default to pinned-TLS-required-with-warned-fallback (client release N+2)

Precondition: Stage-2 telemetry shows TLS success ≥ an agreed threshold
(e.g. ≥99% of active daemons reaching the TLS listener) **and** the registry has
served the pinned listener stably for ≥1 renewal cycle.

1. Default `-registry-tls=true`, `-registry-trust=pinned-builtin`.
2. Fallback to plaintext **still exists** but now requires an explicit
   `-registry-allow-plaintext` flag (loud warning), so a default flip can never
   strand a daemon: the operator can always re-enable plaintext.
3. The auto-updater rolls release N+2 across the fleet gradually (it already
   checks GitHub releases hourly and is opt-in per host; see §4).

Compatibility: daemons that genuinely can't reach the TLS listener fail closed by
default but have a documented escape hatch. Rollback: ship N+2.1 reverting the
default; or operators set `-registry-allow-plaintext`.
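
As a rough illustration of the Stage 3 defaults and the fail-closed rule, the sketch below parses the three flags named above with the standard `flag` package. The flag names and defaults come from this proposal; `registryOpts`, `parseRegistryOpts`, `onTLSFailure`, and `errPinMismatch` are hypothetical and do not describe `cmd/daemon/main.go`.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// registryOpts holds the Stage 3 registry transport settings.
type registryOpts struct {
	TLS            bool
	Trust          string
	AllowPlaintext bool
}

// errPinMismatch is a stand-in TLS failure for examples and tests.
var errPinMismatch = errors.New("registry SPKI does not match any embedded pin")

// parseRegistryOpts parses daemon arguments using the proposed Stage 3
// defaults: TLS on, built-in pin, plaintext not allowed.
func parseRegistryOpts(args []string) (registryOpts, error) {
	var o registryOpts
	fs := flag.NewFlagSet("daemon", flag.ContinueOnError)
	fs.BoolVar(&o.TLS, "registry-tls", true, "dial the registry over TLS")
	fs.StringVar(&o.Trust, "registry-trust", "pinned-builtin", "TLS trust mode")
	fs.BoolVar(&o.AllowPlaintext, "registry-allow-plaintext", false,
		"permit plaintext fallback (MITM-exposed)")
	err := fs.Parse(args)
	return o, err
}

// onTLSFailure decides what a Stage 3 daemon does when the pinned dial
// fails: fail closed unless the operator explicitly allowed plaintext.
func onTLSFailure(o registryOpts, tlsErr error) (fallBack bool, err error) {
	if o.AllowPlaintext {
		return true, nil
	}
	return false, fmt.Errorf("registry TLS failed and -registry-allow-plaintext is not set: %w", tlsErr)
}
```

For example, `parseRegistryOpts(nil)` yields the fail-closed defaults, while passing `-registry-allow-plaintext` makes `onTLSFailure` permit the warned fallback from Stage 2.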

### Stage 4: Deprecate, then disable plaintext (server, release N+3+)

1. Server logs every plaintext accept with the connecting node_id (deprecation
   signal); dashboards surface remaining plaintext talkers.
2. Once the plaintext-talker count is ~0 for a sustained window, drop the
   plaintext `:9000` listener (or gate it behind a server `-allow-plaintext`
   flag, default off).

Compatibility: only daemons still on plaintext break, and by then that is a
measured ~0. Rollback: re-enable the plaintext listener (one flag).

---

## 4. Compatibility analysis: the 222K fleet + auto-updater

- **Auto-updater is opt-in and per-host** (writes `~/.pilot/auto-update.json`,
  re-read each tick; controlled via `pilotctl update enable|disable`). It checks
  GitHub releases hourly. This means **the fleet does not update atomically**:
  at any moment a mix of release N, N-1, N-2 is live. Every stage must tolerate
  version skew, which the prefer-with-fallback design (Stage 2) and the
  server-offers-before-client-requires ordering (Stage 0 before Stage 3)
  guarantee.
- Hosts with auto-update **off** stay on old releases indefinitely. They keep
  working as long as plaintext `:9000` is served, which is why Stage 4
  (disable plaintext) is gated on observed ~0 plaintext talkers, not on a date.
- **Per-stage break matrix:**
  - Stage 0: nothing breaks.
  - Stage 1: nothing breaks (opt-in).
  - Stage 2: nothing breaks; daemons on TLS-blocked networks fall back + warn.
  - Stage 3: daemons that can't reach TLS **and** can't get the new release
    would fail closed. This is mitigated by the `-registry-allow-plaintext`
    escape hatch and by not flipping until telemetry shows ~full reachability.
  - Stage 4: only still-plaintext daemons break (by then ~0).
- **pilotctl** is a separate binary (Homebrew tap + install.sh). Its TLS path
  (Stage 1) and default flip should track the daemon's but can lag; until then
  recovery/verify stay plaintext, so **prioritise pilotctl pinning** given
  recovery is the highest-value target.

---

## 5. Concrete code touch-points

Stage 0 (server + build):
- `rendezvous/cmd/rendezvous/main.go`: run a second TLS listener with a
  persistent cert (`-tls -tls-cert -tls-key`) alongside plaintext; deploy a
  stable registry keypair.
- `rendezvous/accept/accept.go:604-654`: ensure a stable on-disk cert path
  (avoid the in-memory self-signed regen for the pinned listener).
- `common/registry/client/client.go:239-258`: add an **SPKI** pin comparison
  alongside the existing leaf-cert fingerprint pin.
- New embedded constant (mirror `internal/transport/compat/roots.go`): registry
  SPKI pin(s) + (Stage B) registry signing pubkey.

Stage 1 (clients can pin):
- `cmd/daemon/main.go:65-67,226-228`: add `pinned-builtin` trust mode; populate
  `RegistryFingerprint`/SPKI from the embedded constant when empty.
- `pkg/daemon/daemon.go:888-912`: dial selection, add the builtin-pin branch.
- `cmd/pilotctl/main.go:264-272,487-495`: add TLS+pin to `getRegistry`/
  `connectRegistry`.
- `cmd/pilotctl/verify.go:301` (and `RecoverIdentity` path): TLS+pin the
  recovery dial.

Stage 2 (prefer + telemetry):
- `pkg/daemon/daemon.go:888-926`: wrap the dial loop so it tries pinned TLS,
  falls back to plaintext with WARN, and records the chosen transport.
- `pkg/telemetry/client.go`: add a transport-adoption metric (consent-gated).

Stage 3 (flip):
- `cmd/daemon/main.go`: defaults `-registry-tls=true`,
  `-registry-trust=pinned-builtin`; add `-registry-allow-plaintext`.

Stage 4 (server retire):
- `rendezvous/cmd/rendezvous/main.go` / `accept/accept.go`: log/gate plaintext
  accepts; later drop the plaintext listener.

Stage B (defence in depth, any time after Stage 0):
- `common/registry/wire/wire.go:183` (`EncodeLookupResp`) + client decode: add a
  registry signature over `(node_id, pubkey, …)`; verify in
  `pkg/daemon/daemon.go:5674-5686` (`lookupPeerPubKey`).

---

## 6. Pin-distribution problem (explicit)

The classic objection to pinning is the bootstrap: how does a client learn the
pin without a TOFU window an attacker can occupy? Resolution here:

- The pin is **compiled into the binary** the auto-updater already ships and that
  install.sh/Homebrew already distribute. This is the same trust path already
  used for the embedded compat CA root and trusted-agents list. The client knows
  the expected registry identity *before first connect*. **No TOFU.**
- The binary's own integrity is the root of trust (release signing + the
  updater's existing verification). We are not adding a new trust anchor, we are
  reusing the one that already gates code execution.
- We pin **SPKI** (stable across renewals) and carry **multiple valid pins**
  (current + next) so the registry key can be rotated by shipping a release that
  adds the next pin before the server switches to it.
- The pin is also published out-of-band (catalogue, website, release notes) for
  auditability.

---

## 7. Risks & open questions (maintainer decisions needed before any flip)

1. **SPKI vs leaf-cert pin.** Confirm we move `DialTLSPinned` to SPKI pinning
   (recommended) so Let's Encrypt / cert renewals don't break pins. The current
   leaf-cert pin would require re-pinning every ~60 days.
2. **Registry TLS identity.** Do we mint a dedicated long-lived Pilot registry
   keypair for pinning (recommended), or pin the SPKI of the existing Let's
   Encrypt-fronted endpoint? The latter couples the pin to ACME key rotation.
3. **Port topology.** Serve pinned TLS on `:9000` itself (protocol-sniff or flag
   day on the port) or on a new sibling port (e.g. `:9443`)? Sibling port is
   non-breaking; reusing `:9000` is cleaner long-term but needs care.
4. **Stage B scope.** Do we also ship application-layer signed lookups (the
   transport-independent fix)? Recommended, but it is a `common/registry`
   wire-format change with its own compat story.
5. **Adoption threshold for the Stage-3 flip.** What TLS-reachability % and what
   soak time gate the default flip?
6. **pilotctl priority.** Recovery/verify over plaintext is arguably the
   single highest-value target. Confirm we fast-track pilotctl pinning ahead of
   the daemon default flip.
7. **Escape hatch policy.** Keep `-registry-allow-plaintext` indefinitely
   (operability) or sunset it on a date (security)?

No code in this proposal changes a default. Approval of the above unblocks
Stage 0 (server-side, fully backward-compatible) as the first concrete step.
