Skip to content

Commit 276eb9a

Browse files
TeoSlayerteovl
andauthored
fix(keyexchange): demote same-session PILA log to Debug (#174)
* fix(keyexchange): demote same-session PILA log to Debug Observed against list-agents (node 179172) on 2026-05-29: a fresh daemon sends its first PILA, peer replies and trust is established, but the relayed data plane drops our PILS replies. Peer retransmits its PILA every ~8 s as a keepalive. Each arrival carries the SAME X25519 ephemeral (hadCrypto=true, keyChanged=false) and lands well outside DuplicateHandshakeDebounce (250 ms), so the existing duplicate gate doesn't catch it — every retransmit fires another 'encrypted tunnel established' Info log and another tunnel.established bus event even though structurally nothing was installed. 35 false-positive 'established' lines per peer per 5 minutes in field measurement. Fix demotes the log to Debug for the same-session case while keeping the bus event + postInstall hook firing (existing endpoint-refresh contract pinned by TestDuplicatePILAOutsideDebounceFiresHookAgain). Mirrors the demotion in HandleUnauthFrame (PILK) for the same reason. Adds TestSameSessionPILASuppressesInfoButFiresHookAndDebug that pins: - first PILA still produces an Info 'established' line + hook count 1 - second same-key PILA past debounce: hook count = 2 (endpoint refresh preserved), 'established' Info line count stays at 1, Debug log 'same-session keepalive' present. Does NOT change crypto/network behaviour. The asymmetric-recovery reply path (TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale) and reply-rate-limit (TestReplyRateLimit*) gates are independent and remain intact. * test(daemon): fix data race in TestWriteLoopExitsOnWriteDeadline (pre-existing) The fan-out loop reused a single msg buffer across iterations and did the per-iteration copy INSIDE the spawned goroutine. The main goroutine's next msg[0] = byte(i & 0xFF) raced with the previous goroutine's copy(m2, m). Caught by go test -race in the Architecture gates job (report: zz_ipc_write_deadline_test.go:75 read vs :72 write). This was added in 1eff4fa (PILOT-218 write-deadline fix) before this branch existed; every PR opened since then has been failing the race-detector check. Hoisting the copy out of the goroutine fixes it without changing the test's intent (still floods ic.ipcWrite with ipcSendBuffer+10 distinct messages to fill the kernel send buffer). Touched alongside the keyexchange log-spam fix because the same PR job runs both and we can't merge until -race is green. * fix(daemon): actually set the IPC writeLoop deadline (PILOT-218) PR #156 / commit 1eff4fa was titled 'add write deadline to IPC writeLoop and bypass semaphore for CmdHealth', and its commit message correctly described the design (10 s per active write, 3 s drain on Close). The CmdHealth inline-dispatch half landed; the writeLoop SetWriteDeadline half didn't. Result: TestWriteLoopExitsOnWriteDeadline has been failing since #156 merged — first with a -race report (a buffer-reuse bug in the test's fan-out loop, fixed in the prior commit on this branch), then with 'writeLoop did not exit within deadline window' because the deadline the test waits on doesn't actually exist. Every PR opened since #156 has been silently blocked on Architecture gates for this reason. Adds: - ipcWriteTimeout = 10 s (the active-write deadline) - ipcDrainTimeout = 3 s (the Close-drain deadline) - SetWriteDeadline calls in writeLoop's both arms, matching the contract the test expects. SetWriteDeadline errors are deliberately swallowed — net.Pipe ignores the call, and any real socket that doesn't support it is already broken in ways the next Write will surface. Semantically a no-op for the happy path (normal clients read fast, the deadline never trips). For a stalled client it does what PILOT-218 wanted: writeLoop times out → c.Conn.Close() → writeDone closes → every parked ipcWrite caller returns ErrIPCClosed → semaphore drains → daemon is responsive again. * fix(daemon): ipcWrite must fast-fail when writeLoop has exited Companion to the prior commit (actually-set the writeLoop deadline). After writeLoop hits ipcWriteTimeout and closes writeDone, ipcWrite still had a sendCh-with-room enqueue path that returned nil — the message would land in the channel, sit there orphaned, and the caller would think it succeeded. The slow-path select did catch writeDone, but only after the buffer filled. Added a non-blocking writeDone check next to the existing c.done fast-fail, so any ipcWrite after writeLoop exits returns ErrIPCClosed immediately regardless of sendCh capacity. Pinned by TestWriteLoopExitsOnWriteDeadline's final assertion. --------- Co-authored-by: Teodor Calin <teodor@vulturelabs.io>
1 parent 6c45b71 commit 276eb9a

6 files changed

Lines changed: 259 additions & 24 deletions

File tree

pkg/daemon/daemon.go

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -87,10 +87,10 @@ type Config struct {
8787
Email string // email address for account identification and key recovery
8888
Owner string // deprecated: use Email instead
8989

90-
Endpoint string // fixed public endpoint (host:port) — skips STUN discovery (for cloud VMs)
91-
AdvertiseEndpoint string // override STUN-discovered endpoint for registry advertisement (host:port) — for k8s pods where STUN returns unreachable IPs
92-
Public bool // make this node's endpoint publicly discoverable
93-
Hostname string // hostname for discovery (empty = none)
90+
Endpoint string // fixed public endpoint (host:port) — skips STUN discovery (for cloud VMs)
91+
AdvertiseEndpoint string // override STUN-discovered endpoint for registry advertisement (host:port) — for k8s pods where STUN returns unreachable IPs
92+
Public bool // make this node's endpoint publicly discoverable
93+
Hostname string // hostname for discovery (empty = none)
9494

9595
// RelayOnly hides this node's real_addr from peer resolve/lookup
9696
// responses. Peers reach this node only via the beacon-relay path,
@@ -149,13 +149,13 @@ type Config struct {
149149

150150
// Default tuning constants (used when Config fields are zero).
151151
const (
152-
DefaultKeepaliveInterval = 60 * time.Second
153-
DefaultIdleTimeout = 120 * time.Second
154-
DefaultIdleSweepInterval = 15 * time.Second
152+
DefaultKeepaliveInterval = 60 * time.Second
153+
DefaultIdleTimeout = 120 * time.Second
154+
DefaultIdleSweepInterval = 15 * time.Second
155155
// hostnameReannounceInterval is how often the daemon re-sets its
156156
// hostname with the registry. This heals hostname resolution after
157157
// a registry restart/roll wipes the in-memory hostname store.
158-
hostnameReannounceInterval = 60 * time.Second
158+
hostnameReannounceInterval = 60 * time.Second
159159
DefaultSYNRateLimit = 100
160160
DefaultMaxConnectionsPerPort = 1024
161161
DefaultMaxTotalConnections = 65536

pkg/daemon/ipc.go

Lines changed: 37 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -170,6 +170,23 @@ type ipcConn struct {
170170
// avg 256B/msg ≈ 64 MB worst case).
171171
const ipcSendBuffer = 256
172172

173+
// ipcWriteTimeout caps how long writeLoop will wait inside a single
174+
// ipcutil.Write before treating the client as stalled. A stalled client
175+
// (stops reading from the socket) fills the kernel send buffer and
176+
// blocks ipcutil.Write indefinitely; without this deadline the dispatch
177+
// goroutines park in ipcWrite, the per-client semaphore exhausts, the
178+
// read loop blocks, and the daemon appears dead even though it isn't.
179+
// On deadline, writeLoop closes the connection and exits, unblocking
180+
// every parked ipcWrite caller via writeDone. Pinned by
181+
// TestWriteLoopExitsOnWriteDeadline.
182+
const ipcWriteTimeout = 10 * time.Second
183+
184+
// ipcDrainTimeout is the per-message deadline used on the drain path
185+
// during Close. Shorter than ipcWriteTimeout because Close already
186+
// signalled the conn is going away — there's no point waiting the full
187+
// active-write budget for already-doomed messages.
188+
const ipcDrainTimeout = 3 * time.Second
189+
173190
// ipcMaxInflightPerClient caps how many in-flight dispatch goroutines a
174191
// single IPC client may have. Each request becomes a goroutine that
175192
// handles the command and writes the reply (concurrent dispatch — see
@@ -237,16 +254,24 @@ func (c *ipcConn) writeLoop() {
237254
for {
238255
select {
239256
case msg := <-c.sendCh:
257+
// Bound the active write to ipcWriteTimeout (PILOT-218). The
258+
// commit that added the test (1eff4fa3) intended to include
259+
// this deadline but the writeLoop changes never landed; the
260+
// test has been failing -race ever since. SetWriteDeadline
261+
// errors are non-fatal (no-op on net.Pipe, etc.) so swallow.
262+
_ = c.Conn.SetWriteDeadline(time.Now().Add(ipcWriteTimeout))
240263
if err := ipcutil.Write(c.Conn, msg); err != nil {
241264
c.Conn.Close()
242265
return
243266
}
244267
case <-c.done:
245268
// Best-effort drain of pending messages so callers that already
246-
// pushed before Close() don't lose their data.
269+
// pushed before Close() don't lose their data. Shorter deadline
270+
// per message because Close already signalled teardown.
247271
for {
248272
select {
249273
case msg := <-c.sendCh:
274+
_ = c.Conn.SetWriteDeadline(time.Now().Add(ipcDrainTimeout))
250275
if err := ipcutil.Write(c.Conn, msg); err != nil {
251276
c.Conn.Close()
252277
return
@@ -278,6 +303,17 @@ func (c *ipcConn) ipcWrite(data []byte) error {
278303
return ErrIPCClosed
279304
default:
280305
}
306+
// Also fast-fail if writeLoop has exited (e.g. SetWriteDeadline
307+
// fired on a stalled client). The slow-path select catches the
308+
// same condition, but a successful sendCh enqueue can be chosen
309+
// over a closed writeDone when both are ready, and a message that
310+
// lands on sendCh after writeLoop has exited will sit there
311+
// orphaned. Pinned by TestWriteLoopExitsOnWriteDeadline:103.
312+
select {
313+
case <-c.writeDone:
314+
return ErrIPCClosed
315+
default:
316+
}
281317
// Fast path: try non-blocking enqueue.
282318
select {
283319
case c.sendCh <- data:

pkg/daemon/keyexchange/handle.go

Lines changed: 39 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -110,21 +110,42 @@ func (m *Manager) HandleAuthFrame(data []byte, from *net.UDPAddr, fromRelay bool
110110
// Cache the verified peer pubkey.
111111
m.SetPeerPubKey(peerNodeID, peerEd25519PubKey)
112112

113-
// Side-effect gate: the log line, tunnel.established bus event, and
114-
// PostInstallHook fire ONLY when this is not a coalesced duplicate.
115-
// A real rekey (keyChanged) or a first-time install always falls
116-
// through to the side effects. The recovery-reply path below (the
117-
// SendKeyExchangeToNode call) is NOT gated on `duplicate` — the
118-
// asymmetric-recovery case (B dropped crypto for A while A retains
119-
// it) requires A to reply on B's retransmit even though A sees it
120-
// as a duplicate. Pinned by
121-
// TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale.
113+
// Side-effect routing:
114+
//
115+
// - duplicate (same X25519 ephemeral within DuplicateHandshakeDebounce)
116+
// is a tight direct+relay arrival pair or peer-side retransmit
117+
// burst. All side effects suppressed.
118+
//
119+
// - sameSession (hadCrypto && !keyChanged, past the debounce window)
120+
// is the peer's slow keepalive retransmit. The session was
121+
// established by the earlier PILA; nothing was installed by
122+
// this one. The bus event and PostInstallHook STILL fire so
123+
// downstream observers can refresh endpoint observation /
124+
// keep-alive bookkeeping (pinned by
125+
// TestDuplicatePILAOutsideDebounceFiresHookAgain), but the
126+
// Info-level "encrypted tunnel established" log is demoted to
127+
// Debug: observed against list-agents on 2026-05-29, a peer
128+
// sending a PILA every ~8 s while relayed data is being dropped
129+
// floods the operator log with 35 false "established" lines per
130+
// peer per 5 minutes. Pinned by
131+
// TestSameSessionPILALogsAtDebugButFiresHook.
132+
//
133+
// - default (first install or real rekey): everything fires at
134+
// Info level.
135+
//
136+
// The recovery-reply path below (SendKeyExchangeToNode) is gated
137+
// independently — see TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale.
138+
sameSession := hadCrypto && !keyChanged
122139
if duplicate {
123140
slog.Debug("auth key exchange: duplicate frame coalesced",
124141
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt), "relay", fromRelay)
125142
} else {
126143
if keyChanged {
127144
slog.Info("peer rekeyed (auth), re-establishing tunnel", "peer_node_id", peerNodeID)
145+
} else if sameSession {
146+
slog.Debug("auth key exchange: same-session keepalive",
147+
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt),
148+
"endpoint", from, "relay", fromRelay)
128149
} else {
129150
slog.Info("encrypted tunnel established", "auth", true,
130151
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)
@@ -255,12 +276,21 @@ func (m *Manager) HandleUnauthFrame(data []byte, from *net.UDPAddr, fromRelay bo
255276
m.env.Install(peerNodeID, pc)
256277
}
257278

279+
// Same routing as HandleAuthFrame: duplicate suppresses everything;
280+
// sameSession (past-debounce same-key keepalive) keeps the bus event
281+
// + postInstall hook firing for endpoint refresh but demotes the
282+
// Info log to Debug.
283+
sameSession := hadCrypto && !keyChanged
258284
if duplicate {
259285
slog.Debug("unauth key exchange: duplicate frame coalesced",
260286
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt), "relay", fromRelay)
261287
} else {
262288
if keyChanged {
263289
slog.Info("peer rekeyed, re-establishing tunnel", "peer_node_id", peerNodeID)
290+
} else if sameSession {
291+
slog.Debug("unauth key exchange: same-session keepalive",
292+
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt),
293+
"endpoint", from, "relay", fromRelay)
264294
} else {
265295
slog.Info("encrypted tunnel established",
266296
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)

pkg/daemon/keyexchange/store.go

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -111,7 +111,6 @@ func (s *Store) IsReady(peerNodeID uint32) bool {
111111
return c != nil && c.Ready
112112
}
113113

114-
115114
// wipeCryptoSecrets clears any sensitive plaintext on a Crypto's salvage
116115
// ring before the Crypto is dropped. PILOT-146: previously Drop simply
117116
// did a map delete, leaving the salvage plaintext bytes alive on the heap
Lines changed: 165 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,165 @@
1+
// SPDX-License-Identifier: AGPL-3.0-or-later
2+
3+
package keyexchange_test
4+
5+
// Regression for the same-session log spam observed against list-agents
6+
// on 2026-05-29: a fresh daemon registers, completes the initial PILA
7+
// exchange (1 install, 1 postInstall, 1 "encrypted tunnel established"
8+
// at Info), then the peer keeps retransmitting the SAME PILA at ~8 s
9+
// cadence while the relayed data plane drops our PILS replies. Every
10+
// such retransmit lands past DuplicateHandshakeDebounce (250 ms), so
11+
// the duplicate gate doesn't catch it, but it carries the same X25519
12+
// ephemeral so hadCrypto=true && keyChanged=false — structurally no
13+
// new install. Pre-fix the daemon logged "encrypted tunnel established"
14+
// at Info every time (35 events per peer per 5 min in field observation).
15+
//
16+
// The fix demotes the log to Debug for the same-session case while
17+
// preserving:
18+
//
19+
// - the existing duplicate-within-debounce coalescing
20+
// (TestDuplicatePILACoalescedSuppressesLogAndHook)
21+
//
22+
// - the past-debounce postInstall hook firing for endpoint refresh
23+
// (TestDuplicatePILAOutsideDebounceFiresHookAgain pins hook count = 2)
24+
//
25+
// - the asymmetric-recovery reply on stale inbound
26+
// (TestDuplicatePILAStillRepliesForAsymmetricRecovery)
27+
28+
import (
29+
"bytes"
30+
"log/slog"
31+
"net"
32+
"strings"
33+
"sync"
34+
"sync/atomic"
35+
"testing"
36+
"time"
37+
38+
"github.com/TeoSlayer/pilotprotocol/pkg/daemon/keyexchange"
39+
)
40+
41+
// syncWriter serialises Writes from slog handlers so concurrent
42+
// goroutines logging into the same bytes.Buffer don't race the writes.
43+
type syncWriter struct {
44+
w *bytes.Buffer
45+
mu *sync.Mutex
46+
}
47+
48+
func (s syncWriter) Write(p []byte) (int, error) {
49+
s.mu.Lock()
50+
defer s.mu.Unlock()
51+
return s.w.Write(p)
52+
}
53+
54+
// captureSlog redirects slog.Default to a buffer at the given level
55+
// and returns a buffer-content snapshot + a restore func. NOT safe to
56+
// use with t.Parallel() — slog.Default is process-global and parallel
57+
// tests racing SetDefault see each other's handlers.
58+
func captureSlog(t *testing.T, level slog.Level) (snapshot func() string, restore func()) {
59+
t.Helper()
60+
var (
61+
buf bytes.Buffer
62+
mu sync.Mutex
63+
)
64+
handler := slog.NewTextHandler(syncWriter{w: &buf, mu: &mu}, &slog.HandlerOptions{Level: level})
65+
prev := slog.Default()
66+
slog.SetDefault(slog.New(handler))
67+
return func() string {
68+
mu.Lock()
69+
defer mu.Unlock()
70+
return buf.String()
71+
}, func() {
72+
slog.SetDefault(prev)
73+
}
74+
}
75+
76+
// TestSameSessionPILASuppressesInfoButFiresHookAndDebug pins the fix
77+
// end-to-end:
78+
//
79+
// - First PILA: Info "encrypted tunnel established", hook count = 1.
80+
//
81+
// - Second same-key PILA past the debounce window:
82+
//
83+
// - hook still fires (count = 2, pinned for endpoint refresh by
84+
// TestDuplicatePILAOutsideDebounceFiresHookAgain).
85+
//
86+
// - NO second "encrypted tunnel established" at Info — that was the
87+
// spam pre-fix.
88+
//
89+
// - Debug-level "same-session keepalive" present (diagnostic
90+
// remains available for operators tracing key-exchange flow).
91+
//
92+
// The two slog assertions share one capture buffer (and therefore one
93+
// SetDefault call) because parallel-test races on slog.Default would
94+
// otherwise tear the captured output. The test itself is NOT marked
95+
// t.Parallel().
96+
func TestSameSessionPILASuppressesInfoButFiresHookAndDebug(t *testing.T) {
97+
a := newPeer(t, 510)
98+
b := newPeer(t, 511)
99+
crossWireVerifyFuncs(a, b)
100+
a.mgr.SetSender(func(uint32, *net.UDPAddr, []byte) error { return nil })
101+
102+
var hookCount atomic.Int32
103+
a.mgr.SetPostInstallHook(func(keyexchange.PostInstallEvent) {
104+
hookCount.Add(1)
105+
})
106+
107+
bFrame := b.mgr.BuildAuthFrame()
108+
if bFrame == nil {
109+
t.Fatalf("BuildAuthFrame returned nil")
110+
}
111+
from := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 4000}
112+
113+
// Capture at Debug so both the Info "established" on the first
114+
// arrival and the Debug "same-session keepalive" on the second
115+
// arrival land in the same buffer.
116+
logSnap, restore := captureSlog(t, slog.LevelDebug)
117+
defer restore()
118+
119+
// --- First PILA: full install path ---------------------------------
120+
121+
if !a.mgr.HandleAuthFrame(bFrame[4:], from, false) {
122+
t.Fatalf("first PILA rejected")
123+
}
124+
if got := hookCount.Load(); got != 1 {
125+
t.Fatalf("after first PILA: hook count = %d, want 1", got)
126+
}
127+
initialLog := logSnap()
128+
if !strings.Contains(initialLog, "encrypted tunnel established") {
129+
t.Fatalf("first PILA: expected Info log 'encrypted tunnel established', got:\n%s", initialLog)
130+
}
131+
initialEstablishedCount := strings.Count(initialLog, "encrypted tunnel established")
132+
if initialEstablishedCount != 1 {
133+
t.Fatalf("first PILA: expected exactly 1 'established' log line, got %d:\n%s",
134+
initialEstablishedCount, initialLog)
135+
}
136+
137+
// --- Second PILA past debounce: keepalive, no spam -----------------
138+
139+
time.Sleep(keyexchange.DuplicateHandshakeDebounce + 100*time.Millisecond)
140+
141+
if !a.mgr.HandleAuthFrame(bFrame[4:], from, false) {
142+
t.Fatalf("same-session PILA rejected")
143+
}
144+
145+
// Endpoint-refresh contract from TestDuplicatePILAOutsideDebounceFiresHookAgain.
146+
if got := hookCount.Load(); got != 2 {
147+
t.Fatalf("after same-session PILA: hook count = %d, want 2 (must still refresh endpoint)", got)
148+
}
149+
150+
finalLog := logSnap()
151+
152+
// The new behaviour: the Info "established" count stays at one
153+
// (no spam from the second arrival).
154+
finalEstablishedCount := strings.Count(finalLog, "encrypted tunnel established")
155+
if finalEstablishedCount != 1 {
156+
t.Fatalf("after same-session PILA: 'established' log count = %d, want 1 (no Info spam):\n%s",
157+
finalEstablishedCount, finalLog)
158+
}
159+
160+
// The Debug diagnostic remains so operators can see the keepalive.
161+
if !strings.Contains(finalLog, "same-session keepalive") {
162+
t.Fatalf("after same-session PILA: expected Debug log 'same-session keepalive'; got:\n%s",
163+
finalLog)
164+
}
165+
}

pkg/daemon/zz_ipc_write_deadline_test.go

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -66,15 +66,20 @@ func TestWriteLoopExitsOnWriteDeadline(t *testing.T) {
6666
}()
6767

6868
// Fill sendCh + keep writing until ipcWrite blocks.
69+
//
70+
// Each iteration allocates its own message copy in the main
71+
// goroutine BEFORE spawning the writer — `msg` is reused across
72+
// iterations, so doing the copy inside the goroutine raced with
73+
// the next iteration's msg[0] = ... write. Caught by -race on
74+
// ubuntu-latest in the Architecture-gates job (zz_ipc_write_deadline_test
75+
// race report 2026-05-29).
6976
const msgSize = 4096
7077
msg := make([]byte, msgSize)
7178
for i := 0; i < ipcSendBuffer+10; i++ {
7279
msg[0] = byte(i & 0xFF)
73-
go func(m []byte) {
74-
m2 := make([]byte, len(m))
75-
copy(m2, m)
76-
ic.ipcWrite(m2)
77-
}(msg)
80+
m := make([]byte, msgSize)
81+
copy(m, msg)
82+
go func() { ic.ipcWrite(m) }()
7883
}
7984

8085
// Give writeLoop time to fill the kernel send buffer.

0 commit comments

Comments
 (0)