hyp3rd
diff --git a/‎CHANGELOG.md‎
Lines changed: 45 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 45 additions & 0 deletions
diff --git a/‎Makefile‎
Lines changed: 22 additions & 17 deletions b/‎Makefile‎
Lines changed: 22 additions & 17 deletions
diff --git a/‎cspell.config.yaml‎
Lines changed: 6 additions & 0 deletions b/‎cspell.config.yaml‎
Lines changed: 6 additions & 0 deletions
diff --git a/‎docs/operations.md‎
Lines changed: 38 additions & 0 deletions b/‎docs/operations.md‎
Lines changed: 38 additions & 0 deletions
diff --git a/‎go.mod‎
Lines changed: 2 additions & 0 deletions b/‎go.mod‎
Lines changed: 2 additions & 0 deletions
diff --git a/‎go.sum‎
Lines changed: 4 additions & 2 deletions b/‎go.sum‎
Lines changed: 4 additions & 2 deletions
@@ -8,6 +8,34 @@ All notable changes to HyperCache are recorded here. The format follows
 
 ### Added
 
+- **Async read-repair batching (Phase 4) + unconditional `ForwardSet`-only repair.** Two composing changes
+  in the same PR that together cut the wire-call cost of read-repair under quorum reads. (1) The defensive
+  `ForwardGet` probe in `repairRemoteReplica` is gone — every repair is now exactly one `ForwardSet`,
+  because the receiver's `applySet` already version-compares and noops downgrades, so the probe was pure
+  duplication. ~50% wire-call reduction per repair regardless of batching. (2) New opt-in
+  [`backend.WithDistReadRepairBatch(interval, maxBatchSize)`](pkg/backend/dist_memory.go) option queues
+  repairs by destination peer + key (last-write-wins by `(version, origin)`) and dispatches per-peer batches
+  on the interval or when a peer's pending count hits `maxBatchSize`. Concurrent reads of the same hot key
+  produce ONE repair through the queue, not N — the coalescer collapses duplicate `(peer, key)` entries
+  and bumps the new `dist.read_repair.coalesced` counter per collapsed enqueue. Disabled by default
+  (`interval == 0` = current synchronous behavior preserved, so `TestDistMemoryReadRepair` and
+  `TestDistMemoryRemoveReplication` pass byte-identical). Clean shutdown drains the queue inside `Stop()`;
+  crash exit loses queued repairs by design, with merkle anti-entropy as the convergence safety net.
+  New [`pkg/backend/dist_read_repair.go`](pkg/backend/dist_read_repair.go) hosts the `repairQueue` type
+  with errgroup-driven per-peer parallel `ForwardSet` dispatch. Eight unit tests in
+  [`pkg/backend/dist_read_repair_test.go`](pkg/backend/dist_read_repair_test.go) cover the coalesce rule
+  (same `(peer, key)` keeps the higher version, distinct peers stay independent), the size-threshold
+  inline flush, the nil-transport noop path, the `Stop()` drain semantics, the `(version, origin)`
+  tie-break rule, and concurrent-enqueue race-safety. Three integration tests in
+  [`tests/hypercache_distmemory_readrepair_batch_test.go`](tests/hypercache_distmemory_readrepair_batch_test.go)
+  drive the end-to-end shape — a 3-node RF=3 ConsistencyQuorum cluster, one node's local copy dropped,
+  N concurrent Gets from a third node — and assert the batched flush heals the dropped node, parallel
+  reads coalesce to ≤2 dispatches (one per remote owner) regardless of N, and `Stop()` drains queued
+  repairs before returning. Two new OTel metrics:
+  `dist.read_repair.batched` (per actual `ForwardSet` dispatched by the queue's flusher) and
+  `dist.read_repair.coalesced` (per duplicate-enqueue collapsed). New "Tuning — read-repair batching"
+  section in [`docs/operations.md`](docs/operations.md) covers the option shape, the divergence-window
+  trade-off, the two metrics, and when to enable it (high read-amplification with stable hot keys).
 - **Token-refresh visibility for the OIDC source.** Closes RFC 0003 open question 6: the
   `WithOIDCClientCredentials` source now wraps its `oauth2.TokenSource` with a logger that emits one
   `"oidc token rotated"` Info line per real rotation (expiry change), staying silent on cached returns.
@@ -258,6 +286,23 @@ All notable changes to HyperCache are recorded here. The format follows
 
 ### Fixed
 
+- **Set-forward promotion no longer requires the in-process `ErrBackendNotFound` sentinel.**
+  [`handleForwardPrimary`](pkg/backend/dist_memory.go) used to gate "primary unreachable → promote to
+  replica" on `errors.Is(errFwd, sentinel.ErrBackendNotFound)`, the error the in-process transport returns
+  for an unregistered peer. HTTP/gRPC transports against a stopped container surface
+  `net.OpError` / `io.EOF` / `context.DeadlineExceeded` instead — none of which matched the condition.
+  Result: when a cluster node was killed (e.g. `docker stop` in
+  [`scripts/tests/20-test-cluster-resilience.sh`](scripts/tests/20-test-cluster-resilience.sh)), writes for
+  keys whose primary was the dead node failed immediately at the forwarding hop, no hint was queued, and
+  the data never landed anywhere — the same 7 of 50 "during-*" writes failed reproducibly in CI's cluster
+  workflow. Promotion now triggers on **any** non-nil forward error when the local node is in `owners[1:]`,
+  matching the in-process and production transport behavior under the same resilience contract. Spurious
+  promotion on a transient blip is benign — `applySet` version-compares on the receiver, and merkle
+  anti-entropy / `chooseNewer` reconcile any divergent `(version, origin)` pair via the existing
+  last-write-wins rule. New test [`TestDistSet_PromotesOnGenericForwardError`](tests/hypercache_distmemory_forward_primary_promotion_test.go)
+  uses the chaos hooks at `DropRate=1.0` to deterministically force a generic forward error and asserts the
+  Set succeeds via promotion; the existing `TestDistFailureRecovery` continues to pass byte-identical (the
+  change widens the promotion gate, doesn't narrow it).
 - **`TestDistRebalanceReplicaDiffThrottle` no longer flakes under `make test-race`.** The test's 900ms hard
   sleep wasn't enough wall-clock budget for the rebalancer's 80ms-tick loop to actually fire 11 ticks under
   `-race` + `-shuffle=on`'s scheduler pressure. Replaced the sleep with a 5-second polling loop that exits as
 
@@ -99,6 +99,16 @@ run-example:
 update-deps:
 	go get -u -t ./... && go mod tidy -v && go mod verify
 
+# check_command_exists expands to a recipe line that succeeds if the
+# given command resolves on PATH, otherwise prints "<cmd> command not
+# found" and exits non-zero. Used by prepare-base-tools' chain of
+# `$(call check_command_exists,X) || install-fallback` lines: the
+# first failure message above is the developer-facing hint; the
+# install fallback fires when the tool itself is genuinely missing.
+define check_command_exists
+@which $(1) > /dev/null 2>&1 || (echo "$(1) command not found" && exit 1)
+endef
+
 prepare-toolchain: prepare-base-tools
 
 prepare-base-tools:
@@ -216,23 +226,18 @@ docs-serve: docs-build
 	PYENV_VERSION=mkdocs mkdocs serve
 
 pre-commit:
-	@eval "$$(pyenv init -)" && \
-	pyenv activate pre-commit && \
-	pre-commit run -a trailing-whitespace && \
-	pre-commit run -a end-of-file-fixer && \
-	pre-commit run -a markdownlint && \
-	pre-commit run -a yamllint && \
-	pre-commit run -a cspell && \
-	pre-commit run -a cspell
-
-# check_command_exists is a helper function that checks if a command exists.
-define check_command_exists
-@which $(1) > /dev/null 2>&1 || (echo "$(1) command not found" && exit 1)
-endef
-
-ifeq ($(call check_command_exists,$(1)),false)
-  $(error "$(1) command not found")
-endif
+	@if command -v pyenv >/dev/null 2>&1; then \
+		eval "$$(pyenv init -)" && \
+		pyenv activate pre-commit && \
+		pre-commit run -a trailing-whitespace && \
+		pre-commit run -a end-of-file-fixer && \
+		pre-commit run -a markdownlint && \
+		pre-commit run -a yamllint && \
+		pre-commit run -a cspell && \
+		pre-commit run -a cspell; \
+	else \
+		echo "pyenv command not found"; \
+	fi
 
 # help prints a list of available targets and their descriptions.
 help:
 
@@ -39,6 +39,7 @@ words:
   - Akudx
   - aliceonly
   - ALPN
+  - amortisation
   - APITLS
   - APITLSCA
   - assertable
@@ -68,6 +69,7 @@ words:
   - clientcredentials
   - cmap
   - Cmder
+  - coalescer
   - codacy
   - codebook
   - codegen
@@ -86,12 +88,14 @@ words:
   - derr
   - disambiguator
   - distconfig
+  - distmemory
   - distroless
   - EDITMSG
   - elif
   - Equalf
   - errcheck
   - errchkjson
+  - errgroup
   - Errorf
   - errp
   - eventbus
@@ -164,6 +168,7 @@ words:
   - keepalive
   - keepalives
   - keyf
+  - keyfmt
   - keypair
   - lamport
   - lblll
@@ -216,6 +221,7 @@ words:
   - pygments
   - pymdownx
   - reaad
+  - readrepair
   - recvcheck
   - rediscluster
   - Redocly
 
@@ -123,6 +123,44 @@ otherwise mark a healthy peer suspect. `dist.heartbeat.indirect_probe.refuted` r
 probes are saving you from spurious flapping; rising `dist.heartbeat.indirect_probe.failure` indicates the
 peer is genuinely unreachable from multiple vantage points.
 
+## Tuning — read-repair batching
+
+Quorum reads (`ConsistencyQuorum` / `ConsistencyAll`) issue best-effort read-repair to every owner of the
+chosen item — one `ForwardSet` per owner per read. Under read-heavy workloads on hot keys with stale replicas,
+this fan-out becomes the dominant network cost on the dist transport.
+
+`backend.WithDistReadRepairBatch(interval, maxBatchSize)` opts into async coalescing: repairs are queued by
+destination peer + key (last-write-wins by `(version, origin)`) and dispatched by a background flusher on the
+configured interval or when a per-peer batch hits `maxBatchSize`. Concurrent reads of the same hot key
+produce one repair through the queue, not N.
+
+```go
+dm, _ := backend.NewDistMemory(ctx,
+    backend.WithDistReadConsistency(backend.ConsistencyQuorum),
+    backend.WithDistReadRepairBatch(100*time.Millisecond, 64),
+)
+```
+
+**Trade-off.** Batched mode introduces a divergence window of up to `interval` where a stale replica stays
+stale. Merkle anti-entropy (`WithDistMerkleAutoSync`) is the safety net for any repair the queue drops on
+crash exit; clean shutdown drains the queue inside `Stop()`. Disabled by default — the existing synchronous
+read-repair path is preserved when the option is not set.
+
+**Detection.** Two new metrics quantify the amortisation:
+
+- `dist.read_repair.batched` — repairs dispatched via the queue's flusher (one bump per actual `ForwardSet`).
+- `dist.read_repair.coalesced` — repairs short-circuited because a same-version-or-higher entry was already
+  queued for the same `(peer, key)`. Every concurrent same-key read past the first bumps this.
+
+The aggregate `dist.read_repair` counter still bumps per repair attempt (sync dispatch or enqueue). Operators
+read `dist.read_repair.batched / dist.read_repair` for the fraction routed through the queue, and
+`dist.read_repair.coalesced` as the saved-wire-call counter.
+
+**When to enable.** High read-amplification with stable hot keys; the typical pattern is a partitioned
+workload where a small set of keys takes the majority of reads and one of the owners drops out briefly. Not
+useful for write-heavy paths (which don't go through read-repair) or for workloads with a flat key
+distribution (the coalescer has little to collapse).
+
 ## Operational tasks
 
 ### Drain a node
 
@@ -20,6 +20,7 @@ require (
 	go.opentelemetry.io/otel/trace v1.43.0
 	golang.org/x/crypto v0.51.0
 	golang.org/x/oauth2 v0.36.0
+	golang.org/x/sync v0.20.0
 	gopkg.in/yaml.v3 v3.0.1
 )
 
@@ -36,6 +37,7 @@ require (
 	github.com/mattn/go-isatty v0.0.22 // indirect
 	github.com/philhofer/fwd v1.2.0 // indirect
 	github.com/pmezard/go-difflib v1.0.0 // indirect
+	github.com/shamaton/msgpack/v3 v3.1.1 // indirect
 	github.com/tinylib/msgp v1.6.4 // indirect
 	github.com/valyala/bytebufferpool v1.0.0 // indirect
 	github.com/valyala/fasthttp v1.71.0 // indirect
 
@@ -55,8 +55,8 @@ github.com/redis/go-redis/v9 v9.19.0 h1:XPVaaPSnG6RhYf7p+rmSa9zZfeVAnWsH5h3lxthO
 github.com/redis/go-redis/v9 v9.19.0/go.mod h1:v/M13XI1PVCDcm01VtPFOADfZtHf8YW3baQf57KlIkA=
 github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ=
 github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc=
-github.com/shamaton/msgpack/v3 v3.1.0 h1:jsk0vEAqVvvS9+fTZ5/EcQ9tz860c9pWxJ4Iwecz8gU=
-github.com/shamaton/msgpack/v3 v3.1.0/go.mod h1:DcQG8jrdrQCIxr3HlMYkiXdMhK+KfN2CitkyzsQV4uc=
+github.com/shamaton/msgpack/v3 v3.1.1 h1:1EkrTpc68/H9bziTVw9eDLHLeK2v8aAyyv60quMqIY4=
+github.com/shamaton/msgpack/v3 v3.1.1/go.mod h1:DcQG8jrdrQCIxr3HlMYkiXdMhK+KfN2CitkyzsQV4uc=
 github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
 github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
 github.com/tinylib/msgp v1.6.4 h1:mOwYbyYDLPj35mkA2BjjYejgJk9BuHxDdvRnb6v2ZcQ=
@@ -95,6 +95,8 @@ golang.org/x/net v0.54.0 h1:2zJIZAxAHV/OHCDTCOHAYehQzLfSXuf/5SoL/Dv6w/w=
 golang.org/x/net v0.54.0/go.mod h1:Sj4oj8jK6XmHpBZU/zWHw3BV3abl4Kvi+Ut7cQcY+cQ=
 golang.org/x/oauth2 v0.36.0 h1:peZ/1z27fi9hUOFCAZaHyrpWG5lwe0RJEEEeH0ThlIs=
 golang.org/x/oauth2 v0.36.0/go.mod h1:YDBUJMTkDnJS+A4BP4eZBjCqtokkg1hODuPjwiGPO7Q=
+golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
+golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
 golang.org/x/sys v0.44.0 h1:ildZl3J4uzeKP07r2F++Op7E9B29JRUy+a27EibtBTQ=
 golang.org/x/sys v0.44.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
 golang.org/x/text v0.37.0 h1:Cqjiwd9eSg8e0QAkyCaQTNHFIIzWtidPahFWR83rTrc=