|
| 1 | +# Operations runbook — DistMemory |
| 2 | + |
| 3 | +This document is for operators running the `pkg/backend.DistMemory` |
| 4 | +distributed backend in production. It assumes the design background in |
| 5 | +[distributed.md](distributed.md). Sections are deliberately short — each |
| 6 | +one stands on its own and links to code. |
| 7 | + |
| 8 | +## At a glance |
| 9 | + |
| 10 | +| Concern | First place to look | |
| 11 | +|---|---| |
| 12 | +| Node not receiving traffic | `dist.members.alive`, `/health` | |
| 13 | +| Writes failing | `dist.write.quorum_failures`, `sentinel.ErrDraining`, `sentinel.ErrQuorumFailed` | |
| 14 | +| Replicas falling behind | `dist.hinted.queued`, `dist.hinted.replayed`, `dist.hinted.dropped` | |
| 15 | +| Bandwidth pressure | `DistHTTPLimits.CompressionThreshold` | |
| 16 | +| Spurious peer flapping | `dist.heartbeat.indirect_probe.refuted`, `WithDistIndirectProbes` | |
| 17 | +| Slow rebalance | `dist.rebalance.throttle`, `dist.rebalance.last_ns` | |
| 18 | +| Anti-entropy backlog | `dist.merkle.last_diff_ns`, `dist.auto_sync.last_ns` | |
| 19 | + |
| 20 | +Live metric values come from `DistMemory.Metrics()` (Go struct), |
| 21 | +`/dist/metrics` (JSON, when wrapped in `hypercache.HyperCache`), or |
| 22 | +the OpenTelemetry pipeline you wired via `WithDistMeterProvider`. |
| 23 | +The OTel names use the `dist.` prefix. |
| 24 | + |
| 25 | +## Wiring observability |
| 26 | + |
| 27 | +Three opt-in entry points, all defaulting to no-op: |
| 28 | + |
| 29 | +- **Logging** — `backend.WithDistLogger(*slog.Logger)` routes background |
| 30 | + loops (heartbeat, hint replay, rebalance, merkle sync) and operational |
| 31 | + errors into your logger. Records are pre-bound with |
| 32 | + `component=dist_memory` and `node_id=<id>`. |
| 33 | +- **Tracing** — `backend.WithDistTracerProvider(trace.TracerProvider)` |
| 34 | + opens spans on `Get`/`Set`/`Remove` plus per-peer |
| 35 | + `dist.replicate.*` child spans. Cache key *values* are never put on |
| 36 | + spans (they can be PII); only `cache.key.length`. |
| 37 | +- **Metrics** — `backend.WithDistMeterProvider(metric.MeterProvider)` |
| 38 | + exposes every field on `DistMetrics` as an observable instrument. |
| 39 | + |
| 40 | +Wire all three to the same `otel.SetTracerProvider` / |
| 41 | +`otel.SetMeterProvider` your application uses; the logger inherits via |
| 42 | +`slog.Default()` if you want a one-liner. |
| 43 | + |
| 44 | +## Failure mode — split-brain |
| 45 | + |
| 46 | +**Symptom.** Two subsets of the cluster lose connectivity to each |
| 47 | +other. Each subset elects local primaries for the keys it owns. |
| 48 | +Writes from clients on subset A land on A-side primaries; writes from |
| 49 | +B-side clients land on B-side primaries. When the partition heals, the |
| 50 | +versions diverge. |
| 51 | + |
| 52 | +**Detection.** `dist.heartbeat.failure` rises on both sides during the |
| 53 | +partition. After healing, `dist.version.conflicts` increments as |
| 54 | +anti-entropy reconciles. |
| 55 | + |
| 56 | +**Resolution.** DistMemory uses last-write-wins by `(version, origin)` |
| 57 | +ordering — the higher version wins, ties broken by origin string. This |
| 58 | +is automatic. Anti-entropy via `SyncWith` (manual) or |
| 59 | +`WithDistMerkleAutoSync` (background) closes the gap. There is no |
| 60 | +manual reconciliation step today. |
| 61 | + |
| 62 | +**Mitigation.** Run an odd number of nodes with quorum writes |
| 63 | +(`WithDistWriteConsistency(ConsistencyQuorum)`); a partition that |
| 64 | +isolates a minority leaves only the majority side accepting writes |
| 65 | +because the minority cannot reach quorum. The minority returns |
| 66 | +`ErrQuorumFailed` (`sentinel.ErrQuorumFailed`) on Set. |
| 67 | + |
| 68 | +## Failure mode — hint queue overflow |
| 69 | + |
| 70 | +**Symptom.** A peer is unreachable for a long time. Every replicated |
| 71 | +write to that peer turns into a queued hint. Eventually the queue |
| 72 | +hits `WithDistHintMaxPerNode` or `WithDistHintMaxBytes` and new hints |
| 73 | +get dropped. |
| 74 | + |
| 75 | +**Detection.** `dist.hinted.bytes` (gauge) climbs steadily. |
| 76 | +`dist.hinted.global_dropped` increments when caps are exceeded. |
| 77 | +`dist.hinted.dropped` (a different metric — replay errors) also rises |
| 78 | +if the peer is reachable but rejecting writes (auth, schema mismatch). |
| 79 | + |
| 80 | +**Resolution.** |
| 81 | + |
| 82 | +1. Restore the unreachable peer; the replay loop drains automatically |
| 83 | + (`dist.hinted.replayed` rises). |
| 84 | +1. If the peer is permanently gone, remove it from membership |
| 85 | + (`DistMemory.RemovePeer(addr)`); queued hints expire on the |
| 86 | + `WithDistHintTTL` timer. |
| 87 | +1. If hints are dropping faster than they replay, raise |
| 88 | + `WithDistHintMaxPerNode` / `WithDistHintMaxBytes` — but understand |
| 89 | + that the cap exists to bound process memory under sustained |
| 90 | + failure. Raising it without fixing the underlying peer just delays |
| 91 | + the bound. |
| 92 | + |
| 93 | +**Phase B note.** Migration failures during rebalance now also funnel |
| 94 | +through the hint queue (Phase B.2). A surge in `dist.hinted.queued` |
| 95 | +during a rolling deploy is expected; it should drain as the new node |
| 96 | +becomes reachable. |
| 97 | + |
| 98 | +## Failure mode — rebalance under load |
| 99 | + |
| 100 | +**Symptom.** Adding a node triggers a rebalance scan that migrates |
| 101 | +keys to their new primary. Under sustained write load the migration |
| 102 | +saturates and `dist.rebalance.throttle` increments — batches queue |
| 103 | +behind the configured concurrency cap. |
| 104 | + |
| 105 | +**Detection.** `dist.rebalance.last_ns` (gauge — last full scan |
| 106 | +duration) climbs. `dist.rebalance.throttle` (counter) increments when |
| 107 | +the concurrency limit blocks a batch dispatch. `dist.rebalance.batches` |
| 108 | +should still climb steadily. |
| 109 | + |
| 110 | +**Resolution.** |
| 111 | + |
| 112 | +1. Raise `WithDistRebalanceMaxConcurrent` (default 1) if CPU and |
| 113 | + network headroom allow. |
| 114 | +1. Lower `WithDistRebalanceBatchSize` (default 64) so individual |
| 115 | + batches finish faster and concurrency slots cycle more often — |
| 116 | + counter-intuitively, smaller batches sometimes throughput-win. |
| 117 | +1. Pause writes (drain a subset of clients via your LB) until the |
| 118 | + scan finishes. The dist backend has no built-in |
| 119 | + write-throttling — that's the application's job. |
| 120 | + |
| 121 | +**Phase C note.** Drain (`POST /dist/drain`) does *not* trigger an |
| 122 | +expedited rebalance today; the next scheduled |
| 123 | +`WithDistRebalanceInterval` tick does the work. If you need to force |
| 124 | +a faster ownership transfer, call `Stop` after Drain to cancel |
| 125 | +in-flight work and let restart-time rebalance handle migration. |
| 126 | + |
| 127 | +## Failure mode — replica loss |
| 128 | + |
| 129 | +**Symptom.** A replica node dies hard (kernel panic, hardware |
| 130 | +failure). Its keys still have other replicas (when `replication >= 2`), |
| 131 | +but until membership notices, writes try to fan out to it and |
| 132 | +silently retry via the hint queue. |
| 133 | + |
| 134 | +**Detection.** `dist.heartbeat.failure` increments steadily for the |
| 135 | +lost peer. After `WithDistHeartbeat`'s `deadAfter` window, the peer |
| 136 | +is pruned (`dist.nodes.removed` increments) and ring lookups stop |
| 137 | +including it. |
| 138 | + |
| 139 | +**Resolution.** |
| 140 | + |
| 141 | +1. Wait for the heartbeat to detect the dead peer. With default |
| 142 | + timing, this is on the order of seconds. |
| 143 | +1. Spin up a replacement node with the same membership (or let |
| 144 | + gossip discover it). |
| 145 | +1. The new node's rebalance scan pulls its assigned keys from |
| 146 | + surviving replicas via Merkle anti-entropy. |
| 147 | + |
| 148 | +**Indirect probes.** `WithDistIndirectProbes(k, timeout)` filters |
| 149 | +caller-side network blips that would otherwise mark a healthy peer |
| 150 | +suspect. `dist.heartbeat.indirect_probe.refuted` rising indicates |
| 151 | +indirect probes are saving you from spurious flapping; rising |
| 152 | +`dist.heartbeat.indirect_probe.failure` indicates the peer is |
| 153 | +genuinely unreachable from multiple vantage points. |
| 154 | + |
| 155 | +## Operational tasks |
| 156 | + |
| 157 | +### Drain a node |
| 158 | + |
| 159 | +```sh |
| 160 | +curl -X POST http://node-A:8080/dist/drain |
| 161 | +``` |
| 162 | + |
| 163 | +After drain: |
| 164 | + |
| 165 | +- `/health` returns 503; load balancers should stop routing. |
| 166 | +- New `Set`/`Remove` calls return `sentinel.ErrDraining`. |
| 167 | +- `Get` continues to serve until the process exits. |
| 168 | + |
| 169 | +Drain is one-way. Restart the process to clear it. |
| 170 | + |
| 171 | +### Inspect cluster state |
| 172 | + |
| 173 | +```sh |
| 174 | +# Membership snapshot. |
| 175 | +curl http://node-A:8080/cluster/members |
| 176 | + |
| 177 | +# Key enumeration (paginated, shard-by-shard since Phase C.2). |
| 178 | +curl 'http://node-A:8080/internal/keys' |
| 179 | +curl 'http://node-A:8080/internal/keys?cursor=1' |
| 180 | +# ... follow next_cursor until empty. |
| 181 | +``` |
| 182 | + |
| 183 | +### Force anti-entropy sync |
| 184 | + |
| 185 | +```go |
| 186 | +// Pull missing keys from peer "node-B" onto this node. |
| 187 | +err := dm.SyncWith(ctx, "node-B") |
| 188 | +``` |
| 189 | + |
| 190 | +`WithDistMerkleAutoSync(interval)` runs this on a timer; manual calls |
| 191 | +are useful for debugging. |
| 192 | + |
| 193 | +## Capacity planning notes |
| 194 | + |
| 195 | +- Each shard mutex is independent — write throughput scales with |
| 196 | + shard count up to CPU saturation. |
| 197 | +- Hint queue memory is approximately `HintedBytes` + 64 bytes of |
| 198 | + bookkeeping per queued hint. Cap via `WithDistHintMaxBytes` to |
| 199 | + bound total process memory under partition. |
| 200 | +- Merkle tree storage scales O(N/chunk) for N keys at |
| 201 | + `WithDistMerkleChunkSize` (default 128). For a million keys, the |
| 202 | + default chunk gives ~8K leaf hashes per node — negligible. |
| 203 | +- Replication factor 3 with quorum reads/writes tolerates 1 failure; |
| 204 | + raise to 5 for tolerating 2 failures, at 5× the storage cost. |
| 205 | + |
| 206 | +## Where things are |
| 207 | + |
| 208 | +| Concern | File | |
| 209 | +|---|---| |
| 210 | +| Public surface | [pkg/backend/dist_memory.go](../pkg/backend/dist_memory.go) | |
| 211 | +| Transport interface | [pkg/backend/dist_transport.go](../pkg/backend/dist_transport.go) | |
| 212 | +| HTTP transport | [pkg/backend/dist_http_transport.go](../pkg/backend/dist_http_transport.go) | |
| 213 | +| HTTP server | [pkg/backend/dist_http_server.go](../pkg/backend/dist_http_server.go) | |
| 214 | +| Membership / ring | [internal/cluster/](../internal/cluster) | |
0 commit comments