Skip to content

Commit 74e65c4

Browse files
committed
feat(dist): graceful drain, cursor-based key enumeration, and ops runbook (Phase C.1–C.3)
Phase C.1 — Drain endpoint: - Add DistMemory.Drain(ctx) and POST /dist/drain HTTP endpoint; marks the node for graceful shutdown in a one-way, idempotent transition - /health returns 503 while draining so load balancers stop routing - Set/Remove reject with sentinel.ErrDraining; Get continues to serve - Add IsDraining() accessor and dist.drains metric (CAS ensures it fires exactly once per transition) Phase C.2 — Cursor-based key enumeration: - Replace the naive full-set /internal/keys response with shard-level cursor pagination (next_cursor token per page) - Add optional ?limit=<n> param; truncated=true in the response flags a partially-read shard and returns the same cursor for re-request - DistHTTPTransport.ListKeys now walks pages internally with a 1024-page safety cap; all existing callers (anti-entropy fallback, tests) are unchanged - Extract listKeysPage helper and keysPageResp wire type Phase C.3 — Operations runbook: - Add docs/operations.md covering split-brain, hint-queue overflow, rebalance under load, and replica-loss failure modes; each mode maps to the metrics that surface it - Document observability wiring (logger/tracer/meter), drain procedure, and capacity-planning notes Tests: dist_drain_test.go (3 cases) and dist_keys_cursor_test.go (2 cases)
1 parent f81592f commit 74e65c4

8 files changed

Lines changed: 885 additions & 26 deletions

File tree

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,30 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
9797
raising it on any peer — a server with compression disabled will
9898
reject a gzip body with HTTP 400. Phase B.3 of the
9999
production-readiness work.
100+
- **Drain endpoint for graceful shutdown.** New
101+
`DistMemory.Drain(ctx)` method and `POST /dist/drain` HTTP
102+
endpoint mark the node for shutdown: `/health` returns 503 so
103+
load balancers stop routing, `Set`/`Remove` return
104+
`sentinel.ErrDraining`, `Get` continues to serve so in-flight
105+
reads complete. New `IsDraining()` accessor for dashboards. New
106+
metric `dist.drains` records transitions. Drain is one-way and
107+
idempotent. Phase C.1 of the production-readiness work.
108+
- **Cursor-based key enumeration** replaces the pre-Phase-C
109+
testing-only `/internal/keys` endpoint. The endpoint now returns
110+
shard-level pages with a `next_cursor` token; clients walk the
111+
cursor chain to enumerate the full key set. New `?limit=<n>` query
112+
parameter truncates within a shard for clusters with very large
113+
shards (response then carries `truncated=true` and the same
114+
`next_cursor`). The `DistHTTPTransport.ListKeys` helper now walks
115+
pages internally so existing callers (anti-entropy fallback, tests)
116+
keep their full-set semantics unchanged. Phase C.2 of the
117+
production-readiness work.
118+
- **Operations runbook** at [docs/operations.md](docs/operations.md)
119+
covering split-brain, hint-queue overflow, rebalance under load,
120+
replica loss, observability wiring (logger/tracer/meter), drain
121+
procedure, and capacity-planning notes. Cross-links each failure
122+
mode to the metrics that surface it. Phase C.3 of the
123+
production-readiness work.
100124

101125
## [0.5.0] — 2026-05-05
102126

docs/operations.md

Lines changed: 214 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,214 @@
1+
# Operations runbook — DistMemory
2+
3+
This document is for operators running the `pkg/backend.DistMemory`
4+
distributed backend in production. It assumes the design background in
5+
[distributed.md](distributed.md). Sections are deliberately short — each
6+
one stands on its own and links to code.
7+
8+
## At a glance
9+
10+
| Concern | First place to look |
11+
|---|---|
12+
| Node not receiving traffic | `dist.members.alive`, `/health` |
13+
| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` |
14+
| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` |
15+
| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` |
16+
| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` |
17+
| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` |
18+
| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` |
19+
20+
Live metric values come from `DistMemory.Metrics()` (Go struct),
21+
`/dist/metrics` (JSON, when wrapped in `hypercache.HyperCache`), or
22+
the OpenTelemetry pipeline you wired via `WithDistMeterProvider`.
23+
The OTel names use the `dist.` prefix.
24+
25+
## Wiring observability
26+
27+
Three opt-in entry points, all defaulting to no-op:
28+
29+
- **Logging**`backend.WithDistLogger(*slog.Logger)` routes background
30+
loops (heartbeat, hint replay, rebalance, merkle sync) and operational
31+
errors into your logger. Records are pre-bound with
32+
`component=dist_memory` and `node_id=<id>`.
33+
- **Tracing**`backend.WithDistTracerProvider(trace.TracerProvider)`
34+
opens spans on `Get`/`Set`/`Remove` plus per-peer
35+
`dist.replicate.*` child spans. Cache key *values* are never put on
36+
spans (they can be PII); only `cache.key.length`.
37+
- **Metrics**`backend.WithDistMeterProvider(metric.MeterProvider)`
38+
exposes every field on `DistMetrics` as an observable instrument.
39+
40+
Wire all three to the same `otel.SetTracerProvider` /
41+
`otel.SetMeterProvider` your application uses; the logger inherits via
42+
`slog.Default()` if you want a one-liner.
43+
44+
## Failure mode — split-brain
45+
46+
**Symptom.** Two subsets of the cluster lose connectivity to each
47+
other. Each subset elects local primaries for the keys it owns.
48+
Writes from clients on subset A land on A-side primaries; writes from
49+
B-side clients land on B-side primaries. When the partition heals, the
50+
versions diverge.
51+
52+
**Detection.** `dist.heartbeat.failure` rises on both sides during the
53+
partition. After healing, `dist.version.conflicts` increments as
54+
anti-entropy reconciles.
55+
56+
**Resolution.** DistMemory uses last-write-wins by `(version, origin)`
57+
ordering — the higher version wins, ties broken by origin string. This
58+
is automatic. Anti-entropy via `SyncWith` (manual) or
59+
`WithDistMerkleAutoSync` (background) closes the gap. There is no
60+
manual reconciliation step today.
61+
62+
**Mitigation.** Run an odd number of nodes with quorum writes
63+
(`WithDistWriteConsistency(ConsistencyQuorum)`); a partition that
64+
isolates a minority leaves only the majority side accepting writes
65+
because the minority cannot reach quorum. The minority returns
66+
`ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set.
67+
68+
## Failure mode — hint queue overflow
69+
70+
**Symptom.** A peer is unreachable for a long time. Every replicated
71+
write to that peer turns into a queued hint. Eventually the queue
72+
hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints
73+
get dropped.
74+
75+
**Detection.** `dist.hinted.bytes` (gauge) climbs steadily.
76+
`dist.hinted.global_dropped` increments when caps are exceeded.
77+
`dist.hinted.dropped` (a different metric — replay errors) also rises
78+
if the peer is reachable but rejecting writes (auth, schema mismatch).
79+
80+
**Resolution.**
81+
82+
1. Restore the unreachable peer; the replay loop drains automatically
83+
(`dist.hinted.replayed` rises).
84+
1. If the peer is permanently gone, remove it from membership
85+
(`DistMemory.RemovePeer(addr)`); queued hints expire on the
86+
`WithDistHintTTL` timer.
87+
1. If hints are dropping faster than they replay, raise
88+
`WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — but understand
89+
that the cap exists to bound process memory under sustained
90+
failure. Raising it without fixing the underlying peer just delays
91+
the bound.
92+
93+
**Phase B note.** Migration failures during rebalance now also funnel
94+
through the hint queue (Phase B.2). A surge in `dist.hinted.queued`
95+
during a rolling deploy is expected; it should drain as the new node
96+
becomes reachable.
97+
98+
## Failure mode — rebalance under load
99+
100+
**Symptom.** Adding a node triggers a rebalance scan that migrates
101+
keys to their new primary. Under sustained write load the migration
102+
saturates and `dist.rebalance.throttle` increments — batches queue
103+
behind the configured concurrency cap.
104+
105+
**Detection.** `dist.rebalance.last_ns` (gauge — last full scan
106+
duration) climbs. `dist.rebalance.throttle` (counter) increments when
107+
the concurrency limit blocks a batch dispatch. `dist.rebalance.batches`
108+
should still climb steadily.
109+
110+
**Resolution.**
111+
112+
1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and
113+
network headroom allow.
114+
1. Lower `WithDistRebalanceBatchSize` (default 64) so individual
115+
batches finish faster and concurrency slots cycle more often —
116+
counter-intuitively, smaller batches sometimes throughput-win.
117+
1. Pause writes (drain a subset of clients via your LB) until the
118+
scan finishes. The dist backend has no built-in
119+
write-throttling — that's the application's job.
120+
121+
**Phase C note.** Drain (`POST /dist/drain`) does *not* trigger an
122+
expedited rebalance today; the next scheduled
123+
`WithDistRebalanceInterval` tick does the work. If you need to force
124+
a faster ownership transfer, call `Stop` after Drain to cancel
125+
in-flight work and let restart-time rebalance handle migration.
126+
127+
## Failure mode — replica loss
128+
129+
**Symptom.** A replica node dies hard (kernel panic, hardware
130+
failure). Its keys still have other replicas (when `replication >= 2`),
131+
but until membership notices, writes try to fan out to it and
132+
silently retry via the hint queue.
133+
134+
**Detection.** `dist.heartbeat.failure` increments steadily for the
135+
lost peer. After `WithDistHeartbeat`'s `deadAfter` window, the peer
136+
is pruned (`dist.nodes.removed` increments) and ring lookups stop
137+
including it.
138+
139+
**Resolution.**
140+
141+
1. Wait for the heartbeat to detect the dead peer. With default
142+
timing, this is on the order of seconds.
143+
1. Spin up a replacement node with the same membership (or let
144+
gossip discover it).
145+
1. The new node's rebalance scan pulls its assigned keys from
146+
surviving replicas via Merkle anti-entropy.
147+
148+
**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters
149+
caller-side network blips that would otherwise mark a healthy peer
150+
suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates
151+
indirect probes are saving you from spurious flapping; rising
152+
`dist.heartbeat.indirect_probe.failure` indicates the peer is
153+
genuinely unreachable from multiple vantage points.
154+
155+
## Operational tasks
156+
157+
### Drain a node
158+
159+
```sh
160+
curl -X POST http://node-A:8080/dist/drain
161+
```
162+
163+
After drain:
164+
165+
- `/health` returns 503; load balancers should stop routing.
166+
- New `Set`/`Remove` calls return `sentinel.ErrDraining`.
167+
- `Get` continues to serve until the process exits.
168+
169+
Drain is one-way. Restart the process to clear it.
170+
171+
### Inspect cluster state
172+
173+
```sh
174+
# Membership snapshot.
175+
curl http://node-A:8080/cluster/members
176+
177+
# Key enumeration (paginated, shard-by-shard since Phase C.2).
178+
curl 'http://node-A:8080/internal/keys'
179+
curl 'http://node-A:8080/internal/keys?cursor=1'
180+
# ... follow next_cursor until empty.
181+
```
182+
183+
### Force anti-entropy sync
184+
185+
```go
186+
// Pull missing keys from peer "node-B" onto this node.
187+
err := dm.SyncWith(ctx, "node-B")
188+
```
189+
190+
`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls
191+
are useful for debugging.
192+
193+
## Capacity planning notes
194+
195+
- Each shard mutex is independent — write throughput scales with
196+
shard count up to CPU saturation.
197+
- Hint queue memory is approximately `HintedBytes` + 64 bytes of
198+
bookkeeping per queued hint. Cap via `WithDistHintMaxBytes` to
199+
bound total process memory under partition.
200+
- Merkle tree storage scales O(N/chunk) for N keys at
201+
`WithDistMerkleChunkSize` (default 128). For a million keys, the
202+
default chunk gives ~8K leaf hashes per node — negligible.
203+
- Replication factor 3 with quorum reads/writes tolerates 1 failure;
204+
raise to 5 for tolerating 2 failures, at 5× the storage cost.
205+
206+
## Where things are
207+
208+
| Concern | File |
209+
|---|---|
210+
| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) |
211+
| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) |
212+
| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) |
213+
| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) |
214+
| Membership / ring | [internal/cluster/](../internal/cluster) |

internal/sentinel/sentinel.go

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,11 @@ var (
7777
// Operators who genuinely want asymmetric auth must set DistHTTPAuth.AllowAnonymousInbound explicitly.
7878
ErrInsecureAuthConfig = ewrap.New("dist HTTP auth: ClientSign without inbound verifier (set Token, ServerVerify, or AllowAnonymousInbound)")
7979

80+
// ErrDraining is returned by Set/Remove when the dist backend has been Drained — the node is preparing to
81+
// shut down, /health is reporting 503, and new writes must be redirected by the caller. Reads still succeed
82+
// because the node continues to hold data while in-flight ownership transfers complete.
83+
ErrDraining = ewrap.New("dist node is draining")
84+
8085
// ErrTypeMismatch is returned by the typed cache wrapper when a stored value is not assertable to the wrapper's V parameter.
8186
ErrTypeMismatch = ewrap.New("cached value type mismatch")
8287

0 commit comments

Comments
 (0)