Skip to content

Commit f3b3023

Browse files
committed
feat(dist): add structured logging to dist backend (Phase A.1)
Introduce `WithDistLogger(*slog.Logger)` option for `DistMemory`. When supplied, the dist backend emits structured `slog` records for all operational events across its background loops and error surfaces: - HTTP listener bind success and failure - Serve-goroutine exits (previously silent in `serveErr` only) - Peer state transitions: suspect-on-timeout, suspect-on-probe-failure, and dead-peer pruning - Merkle sync fetch failures - Rebalance migration forward failures - Hint dropped after replay error All records are pre-bound with `component=dist_memory` and `node_id=<id>` at construction so operators can grep/filter without each call site re-attaching the node identity. Default behaviour is unchanged: the zero-value path installs a `slog.DiscardHandler` so library code never writes to stderr unless the caller opts in. Add `tests/dist_logging_test.go` with two contract tests: - `TestDistMemory_LoggerEmitsListenerStart` — asserts the listener-start record is emitted with the expected level and attributes. - `TestDistMemory_LoggerDefaultIsSilent` — asserts no-option construction still succeeds without any side effects. Update `.mdl_style.rb` to allow `MD024` (`allow_different_nesting`) so the Keep-a-Changelog heading structure passes linting. Update `CHANGELOG.md`: add `[Unreleased]` entry for this work; correct version numbers from `2.0.x` to `0.5.0`/`0.4.3`; fix comparison links to match actual tags.
1 parent fd2db48 commit f3b3023

6 files changed

Lines changed: 331 additions & 9 deletions

File tree

.mdl_style.rb

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,3 +5,11 @@
55
rule 'MD007', :indent => 3
66

77
rule "MD029", style => "one"
8+
9+
# Keep-a-Changelog (https://keepachangelog.com) uses repeated `### Added`,
10+
# `### Fixed`, `### Security` headings under each `## [version]` heading by
11+
# design. MD024 with the default config flags those as duplicates.
12+
# allow_different_nesting permits same-text headings as long as they sit
13+
# under distinct parent headings — which is exactly the Keep-a-Changelog
14+
# shape, and still catches genuine duplicates within the same section.
15+
rule "MD024", :allow_different_nesting => true

CHANGELOG.md

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,20 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

77
## [Unreleased]
88

9-
## [2.0.1] — 2026-05-05
9+
### Added
10+
11+
- **Structured logging on the dist backend.** New `WithDistLogger(*slog.Logger)`
12+
option wires a structured logger into the dist backend's background
13+
loops (heartbeat, hint replay, rebalance, merkle sync) and operational
14+
error surfaces (HTTP listener bind failures, serve-goroutine exits,
15+
failed migrations during rebalance, dropped hints, peer state
16+
transitions). Library default is silent — `WithDistLogger` not called
17+
installs a `slog.DiscardHandler` so the dist backend never writes to
18+
stderr unless the caller opts in. Every record is pre-bound with
19+
`component=dist_memory` and `node_id=<id>` attributes for grep/filter.
20+
Phase A.1 of the production-readiness work.
21+
22+
## [0.5.0] — 2026-05-05
1023

1124
### Security
1225

@@ -25,7 +38,7 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
2538
in via the new `DistHTTPAuth.AllowAnonymousInbound` field. All other
2639
configurations (`Token`-only, `Token+ServerVerify`, `Token+ClientSign`,
2740
`ServerVerify`-only) are unaffected. Reported by the post-tag
28-
security review; addressed before any v2.0.0 public announcement.
41+
security review; addressed before any v0.5.0 public announcement.
2942

3043
### Added
3144

@@ -34,7 +47,7 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
3447
- `sentinel.ErrInsecureAuthConfig` — surfaced from `NewDistMemory` when
3548
the auth policy would silently disable inbound enforcement.
3649

37-
## [2.0.0] — 2026-05-04
50+
## [0.4.3] — 2026-05-04
3851

3952
A modernization release. The headline themes:
4053

@@ -86,7 +99,6 @@ RFCs that informed the design decisions live under [docs/rfcs/](docs/rfcs/).
8699
### Performance
87100

88101
Measurements on Apple M4 Pro, `go test -bench`, `count=5`, benchstat.
89-
Full release snapshot captured in [bench-v2.0.0.txt](bench-v2.0.0.txt).
90102

91103
- **Per-shard atomic `Count`.** `BenchmarkConcurrentMap_Count`:
92104
53 → ~10 ns/op. `_CountParallel`: 1181 → ~13 ns/op. Eliminates the
@@ -186,5 +198,5 @@ Worth surfacing for contributors:
186198
[RFC document](docs/rfcs/0001-backend-owned-eviction.md) preserves
187199
the measurement and the lessons.
188200

189-
[Unreleased]: https://github.com/hyp3rd/hypercache/compare/v2.0.0...HEAD
190-
[2.0.0]: https://github.com/hyp3rd/hypercache/releases/tag/v2.0.0
201+
Unreleased: <https://github.com/hyp3rd/hypercache/compare/v0.5.0...HEAD>
202+
Released: [0.5.0](https://github.com/hyp3rd/hypercache/releases/tag/v0.5.0)

pkg/backend/dist_http_server.go

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ import (
44
"context"
55
"crypto/subtle"
66
"crypto/tls"
7+
"log/slog"
78
"net"
89
"net/http"
910
"strconv"
@@ -47,6 +48,12 @@ type distHTTPServer struct {
4748
// use, TLS handshake failure on accept) instead of having them
4849
// silently swallowed.
4950
serveErr atomic.Pointer[error]
51+
// logger is the structured logger inherited from the parent
52+
// DistMemory. Used to surface serve-goroutine errors that previously
53+
// only landed in serveErr (LastServeError accessor) — operators
54+
// running with a configured logger now see them in their log stream
55+
// at the moment of failure, not just on demand.
56+
logger *slog.Logger
5057
}
5158

5259
// DistHTTPAuth configures authentication for the dist HTTP server
@@ -482,8 +489,17 @@ func (s *distHTTPServer) listen(ctx context.Context) error {
482489
if serveErr != nil {
483490
// Stash so operators can read it via LastServeError(); a
484491
// listener that crashed silently is the worst kind of
485-
// production bug.
492+
// production bug. Also surface to the structured logger when
493+
// configured so the failure shows up in the operator's log
494+
// stream at the moment it happens, not just on demand.
486495
s.serveErr.Store(&serveErr)
496+
497+
if s.logger != nil {
498+
s.logger.Error("dist HTTP serve goroutine exited",
499+
slog.String("addr", s.addr),
500+
slog.Any("err", serveErr),
501+
)
502+
}
487503
}
488504
}()
489505

pkg/backend/dist_memory.go

Lines changed: 100 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ import (
88
"errors"
99
"hash"
1010
"hash/fnv"
11+
"log/slog"
1112
"math/big"
1213
"slices"
1314
"strings"
@@ -157,6 +158,16 @@ type DistMemory struct {
157158

158159
// stopped guards Stop() against double-invocation (idempotent shutdown).
159160
stopped atomic.Bool
161+
162+
// logger is the structured logger used by background loops
163+
// (heartbeat, hint replay, rebalance, gossip, merkle sync) and
164+
// error surfaces (transport bind failures, sync errors, dropped
165+
// hints). Defaults to a no-op handler writing to io.Discard so
166+
// library code does not write to stderr unless the caller opts
167+
// in via WithDistLogger. All log lines are pre-bound with
168+
// `node_id` so operators can grep/filter without the call sites
169+
// having to weave the ID through every record.
170+
logger *slog.Logger
160171
}
161172

162173
const (
@@ -433,6 +444,11 @@ func (dm *DistMemory) SyncWith(ctx context.Context, nodeID string) error {
433444

434445
remoteTree, err := transport.FetchMerkle(ctx, nodeID)
435446
if err != nil {
447+
dm.logger.Warn("merkle sync fetch failed",
448+
slog.String("peer_id", nodeID),
449+
slog.Any("err", err),
450+
)
451+
436452
return err
437453
}
438454

@@ -625,6 +641,29 @@ func WithDistHTTPAuth(auth DistHTTPAuth) DistMemoryOption {
625641
return func(dm *DistMemory) { dm.httpAuth = auth }
626642
}
627643

644+
// WithDistLogger supplies a structured logger for the dist backend's
645+
// background loops (heartbeat, hint replay, rebalance, gossip, merkle
646+
// auto-sync) and operational error surfaces (HTTP listener failures,
647+
// transport errors, dropped hints). The supplied logger is wrapped with
648+
// `node_id` and `component=dist_memory` attributes before use, so call
649+
// sites do not need to weave the node ID through every record.
650+
//
651+
// Pass slog.Default() to inherit the application's logger, or supply a
652+
// custom *slog.Logger with the desired level / handler. Zero-value (no
653+
// option call) keeps the dist backend silent — the default uses an
654+
// io.Discard handler, which means library code never writes to stderr
655+
// unless the caller opts in.
656+
//
657+
// nil is treated as "no change" — useful when callers conditionally
658+
// build options.
659+
func WithDistLogger(logger *slog.Logger) DistMemoryOption {
660+
return func(dm *DistMemory) {
661+
if logger != nil {
662+
dm.logger = logger
663+
}
664+
}
665+
}
666+
628667
// NewDistMemory creates a new DistMemory backend.
629668
func NewDistMemory(ctx context.Context, opts ...DistMemoryOption) (IBackend[DistMemory], error) {
630669
// Derive a server-lifetime context from the caller's ctx so that:
@@ -659,6 +698,20 @@ func NewDistMemory(ctx context.Context, opts ...DistMemoryOption) (IBackend[Dist
659698
return nil, authErr
660699
}
661700

701+
// Default the logger to a no-op handler so library code never writes
702+
// to stderr unless the caller opts in via WithDistLogger. Bind the
703+
// node identity once here so call sites can log without re-attaching
704+
// it on every record. Done before subsystem startup so background
705+
// loops capture the bound logger when they spawn.
706+
if dm.logger == nil {
707+
dm.logger = slog.New(slog.DiscardHandler)
708+
}
709+
710+
dm.logger = dm.logger.With(
711+
slog.String("component", "dist_memory"),
712+
slog.String("node_id", dm.nodeID),
713+
)
714+
662715
dm.ensureShardConfig()
663716
dm.initMembershipIfNeeded()
664717
// Pass the lifecycle ctx to subsystems that capture it (HTTP handlers,
@@ -1917,7 +1970,18 @@ func (dm *DistMemory) migrateIfNeeded(ctx context.Context, item *cache.Item) {
19171970
dm.metrics.rebalancedKeys.Add(1)
19181971
dm.metrics.rebalancedPrimary.Add(1)
19191972

1920-
_ = transport.ForwardSet(ctx, string(owners[0]), item, true)
1973+
// Fire-and-forget forwarding: failures are dropped silently today
1974+
// (Phase B will introduce a retry queue). Logging is the minimum
1975+
// surface so operators can correlate vanished keys with transport
1976+
// failures during rolling deploys.
1977+
migrationErr := transport.ForwardSet(ctx, string(owners[0]), item, true)
1978+
if migrationErr != nil {
1979+
dm.logger.Warn("rebalance migration forward failed",
1980+
slog.String("key", item.Key),
1981+
slog.String("new_primary", string(owners[0])),
1982+
slog.Any("err", migrationErr),
1983+
)
1984+
}
19211985

19221986
// Update originalPrimary so we don't recount repeatedly.
19231987
sh := dm.shardFor(item.Key)
@@ -2077,14 +2141,26 @@ func (dm *DistMemory) tryStartHTTP(ctx context.Context) {
20772141
server := newDistHTTPServer(dm.nodeAddr, limits, dm.httpAuth)
20782142

20792143
server.ctx = dm.lifeCtx // handler-side cancellation tied to Stop
2144+
server.logger = dm.logger
20802145

20812146
err := server.start(ctx, dm)
2082-
if err != nil { // best-effort
2147+
if err != nil { // best-effort, but the operator must see this
2148+
dm.logger.Error("dist HTTP listener bind failed",
2149+
slog.String("addr", dm.nodeAddr),
2150+
slog.Any("err", err),
2151+
)
2152+
20832153
return
20842154
}
20852155

20862156
dm.httpServer = server
20872157

2158+
dm.logger.Info("dist HTTP listener started",
2159+
slog.String("addr", dm.nodeAddr),
2160+
slog.Bool("tls", limits.TLSConfig != nil),
2161+
slog.Bool("auth", dm.httpAuth.inboundConfigured()),
2162+
)
2163+
20882164
resolver := dm.makePeerURLResolver(limits)
20892165

20902166
dm.storeTransport(NewDistHTTPTransportWithAuth(limits, dm.httpAuth, resolver))
@@ -2609,6 +2685,12 @@ func (dm *DistMemory) processHint(ctx context.Context, nodeID string, entry hint
26092685

26102686
dm.metrics.hintedDropped.Add(1)
26112687

2688+
dm.logger.Warn("hint dropped after replay error",
2689+
slog.String("peer_id", nodeID),
2690+
slog.String("key", entry.item.Key),
2691+
slog.Any("err", err),
2692+
)
2693+
26122694
return 1
26132695
}
26142696

@@ -3153,6 +3235,12 @@ func (dm *DistMemory) evaluateLiveness(ctx context.Context, now time.Time, node
31533235
if dm.membership.Remove(node.ID) {
31543236
dm.metrics.nodesRemoved.Add(1)
31553237
dm.metrics.nodesDead.Add(1)
3238+
3239+
dm.logger.Warn("peer pruned (dead)",
3240+
slog.String("peer_id", string(node.ID)),
3241+
slog.String("peer_addr", node.Address),
3242+
slog.Duration("elapsed_since_seen", elapsed),
3243+
)
31563244
}
31573245

31583246
return
@@ -3161,6 +3249,11 @@ func (dm *DistMemory) evaluateLiveness(ctx context.Context, now time.Time, node
31613249
if dm.hbSuspectAfter > 0 && elapsed > dm.hbSuspectAfter && node.State == cluster.NodeAlive { // suspect
31623250
dm.membership.Mark(node.ID, cluster.NodeSuspect)
31633251
dm.metrics.nodesSuspect.Add(1)
3252+
3253+
dm.logger.Info("peer marked suspect (timeout)",
3254+
slog.String("peer_id", string(node.ID)),
3255+
slog.Duration("elapsed_since_seen", elapsed),
3256+
)
31643257
}
31653258

31663259
transport := dm.loadTransport()
@@ -3179,6 +3272,11 @@ func (dm *DistMemory) evaluateLiveness(ctx context.Context, now time.Time, node
31793272
if node.State == cluster.NodeAlive { // escalate
31803273
dm.membership.Mark(node.ID, cluster.NodeSuspect)
31813274
dm.metrics.nodesSuspect.Add(1)
3275+
3276+
dm.logger.Info("peer marked suspect (probe failed)",
3277+
slog.String("peer_id", string(node.ID)),
3278+
slog.Any("err", err),
3279+
)
31823280
}
31833281

31843282
return

pkg/backend/dist_memory_test_helpers.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ func (dm *DistMemory) EnableHTTPForTest(ctx context.Context) {
3131
server := newDistHTTPServer(dm.nodeAddr, limits, dm.httpAuth)
3232

3333
server.ctx = dm.lifeCtx // handler-side cancellation tied to Stop
34+
server.logger = dm.logger
3435

3536
err := server.start(ctx, dm)
3637
if err != nil {

0 commit comments

Comments
 (0)