Skip to content

Commit 3bb32ec

Browse files
committed
fix(dist): widen replica fan-out on promotion and add forward_promotion metric
When `setImpl` promotes a local replica due to an unreachable primary, pass the full `owners` list to `replicateTo` (instead of `owners[1:]`). This ensures the dead primary's slot is included in best-effort replication, so `replicateTo`'s existing failure-path queues a hinted handoff for it. Post-restart convergence is now bounded by `WithDistHintReplayInterval` (~200ms default) rather than the next merkle tick. Add a new OTel counter `dist.write.forward_promotion` (internal atomic `writeForwardPromotion`) that increments each time promotion fires. A steadily rising counter surfaces a flapping primary well before any read/write error spikes. Expand `TestDistSet_PromotesOnGenericForwardError` to: - assert `WriteForwardPromotion` increments on every promotion - assert `HintedQueued` increments (proving the hint was enqueued) - heal chaos and confirm the original primary receives the write via natural hint-replay, using a `waitForLocalContains` polling helper to absorb scheduling jitter - configure a 20ms hint-replay interval for fast, deterministic recovery assertions Also updates CHANGELOG.md with a detailed description of the defense-in- depth approach and the extended test coverage.
1 parent e19ab38 commit 3bb32ec

3 files changed

Lines changed: 85 additions & 7 deletions

File tree

CHANGELOG.md

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -286,7 +286,8 @@ All notable changes to HyperCache are recorded here. The format follows
286286

287287
### Fixed
288288

289-
- **Set-forward promotion no longer requires the in-process `ErrBackendNotFound` sentinel.**
289+
- **Set-forward promotion no longer requires the in-process `ErrBackendNotFound` sentinel, and the dead
290+
primary now converges via the hint queue (not just the next merkle tick).**
290291
[`handleForwardPrimary`](pkg/backend/dist_memory.go) used to gate "primary unreachable → promote to
291292
replica" on `errors.Is(errFwd, sentinel.ErrBackendNotFound)`, the error the in-process transport returns
292293
for an unregistered peer. HTTP/gRPC transports against a stopped container surface
@@ -299,10 +300,17 @@ All notable changes to HyperCache are recorded here. The format follows
299300
matching the in-process and production transport behavior under the same resilience contract. Spurious
300301
promotion on a transient blip is benign — `applySet` version-compares on the receiver, and merkle
301302
anti-entropy / `chooseNewer` reconcile any divergent `(version, origin)` pair via the existing
302-
last-write-wins rule. New test [`TestDistSet_PromotesOnGenericForwardError`](tests/hypercache_distmemory_forward_primary_promotion_test.go)
303-
uses the chaos hooks at `DropRate=1.0` to deterministically force a generic forward error and asserts the
304-
Set succeeds via promotion; the existing `TestDistFailureRecovery` continues to pass byte-identical (the
305-
change widens the promotion gate, doesn't narrow it).
303+
last-write-wins rule. Defense-in-depth follow-up: when promotion fires, `setImpl` now widens the replica
304+
fan-out from `owners[1:]` to the full `owners` list, so `replicateTo`'s existing best-effort hint queueing
305+
catches the failed forward to the dead primary. Its post-restart convergence window is bounded by
306+
hint-replay (`WithDistHintReplayInterval`, ~200ms in the default cluster config) rather than waiting for
307+
the next merkle tick. New OTel counter `dist.write.forward_promotion` exposes how often promotion fired —
308+
a flapping primary surfaces as a steady rise here, well before any read- or write-side error spikes.
309+
Test [`TestDistSet_PromotesOnGenericForwardError`](tests/hypercache_distmemory_forward_primary_promotion_test.go)
310+
uses the chaos hooks at `DropRate=1.0` to deterministically force a generic forward error, asserts the
311+
Set succeeds via promotion, that `HintedQueued` bumps, and — after chaos clears — that the original
312+
primary receives the write through the natural hint-replay loop. The existing `TestDistFailureRecovery`
313+
continues to pass byte-identical (the change widens the promotion gate, doesn't narrow it).
306314
- **`TestDistRebalanceReplicaDiffThrottle` no longer flakes under `make test-race`.** The test's 900ms hard
307315
sleep wasn't enough wall-clock budget for the rebalancer's 80ms-tick loop to actually fire 11 ticks under
308316
`-race` + `-shuffle=on`'s scheduler pressure. Replaced the sleep with a 5-second polling loop that exits as

pkg/backend/dist_memory.go

Lines changed: 27 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1274,6 +1274,7 @@ type distMetrics struct {
12741274
writeQuorumFailures atomic.Int64 // number of write operations that failed quorum
12751275
writeAcks atomic.Int64 // cumulative replica write acks (includes primary)
12761276
writeAttempts atomic.Int64 // total write operations attempted (Set)
1277+
writeForwardPromotion atomic.Int64 // times a Set forward to primary failed and the local replica self-promoted
12771278
rebalancedKeys atomic.Int64 // keys migrated during rebalancing
12781279
rebalanceBatches atomic.Int64 // number of batches processed
12791280
rebalanceThrottle atomic.Int64 // times rebalance was throttled due to concurrency limits
@@ -1334,6 +1335,7 @@ type DistMetrics struct {
13341335
WriteQuorumFailures int64
13351336
WriteAcks int64
13361337
WriteAttempts int64
1338+
WriteForwardPromotion int64
13371339
RebalancedKeys int64
13381340
RebalanceBatches int64
13391341
RebalanceThrottle int64
@@ -1409,6 +1411,7 @@ func (dm *DistMemory) Metrics() DistMetrics {
14091411
WriteQuorumFailures: dm.metrics.writeQuorumFailures.Load(),
14101412
WriteAcks: dm.metrics.writeAcks.Load(),
14111413
WriteAttempts: dm.metrics.writeAttempts.Load(),
1414+
WriteForwardPromotion: dm.metrics.writeForwardPromotion.Load(),
14121415
RebalancedKeys: dm.metrics.rebalancedKeys.Load(),
14131416
RebalanceBatches: dm.metrics.rebalanceBatches.Load(),
14141417
RebalanceThrottle: dm.metrics.rebalanceThrottle.Load(),
@@ -3657,6 +3660,8 @@ func (dm *DistMemory) handleForwardPrimary(ctx context.Context, owners []cluster
36573660
if len(owners) > 1 {
36583661
for _, oid := range owners[1:] {
36593662
if oid == dm.localNode.ID && dm.ownsKeyInternal(item.Key) {
3663+
dm.metrics.writeForwardPromotion.Add(1)
3664+
36603665
return true, nil
36613666
}
36623667
}
@@ -3823,6 +3828,8 @@ func (dm *DistMemory) setImpl(ctx context.Context, item *cache.Item, span trace.
38233828

38243829
span.SetAttributes(attribute.Int("dist.owners.count", len(owners)))
38253830

3831+
promoted := false
3832+
38263833
if owners[0] != dm.localNode.ID { // attempt forward; may promote
38273834
proceedAsPrimary, ferr := dm.handleForwardPrimary(ctx, owners, item)
38283835
if ferr != nil {
@@ -3832,6 +3839,8 @@ func (dm *DistMemory) setImpl(ctx context.Context, item *cache.Item, span trace.
38323839
if !proceedAsPrimary { // forwarded successfully; nothing else to do
38333840
return nil
38343841
}
3842+
3843+
promoted = true
38353844
}
38363845

38373846
// primary path: assign version & timestamp
@@ -3840,7 +3849,19 @@ func (dm *DistMemory) setImpl(ctx context.Context, item *cache.Item, span trace.
38403849
item.LastUpdated = time.Now()
38413850
dm.applySet(ctx, item, false)
38423851

3843-
acks := 1 + dm.replicateTo(ctx, item, owners[1:])
3852+
// When we promoted, the original primary is unreachable but still
3853+
// a listed owner. Pass the full owners list (not owners[1:]) so
3854+
// replicateTo's existing failure-path queues a hint for the dead
3855+
// primary — its post-restart convergence is bounded by hint-replay
3856+
// (~200ms by default) rather than waiting for the next merkle tick.
3857+
// replicateTo already skips the local node, so adding owners[0]
3858+
// back in doesn't cause a self-forward.
3859+
replicas := owners[1:]
3860+
if promoted {
3861+
replicas = owners
3862+
}
3863+
3864+
acks := 1 + dm.replicateTo(ctx, item, replicas)
38443865
dm.metrics.writeAcks.Add(int64(acks))
38453866

38463867
span.SetAttributes(attribute.Int("dist.acks", acks))
@@ -4192,6 +4213,11 @@ var distMetricSpecs = []distMetricSpec{
41924213
desc: "Set operations that failed quorum",
41934214
get: func(m DistMetrics) int64 { return m.WriteQuorumFailures },
41944215
},
4216+
{
4217+
name: "dist.write.forward_promotion", unit: unitOp, counter: true,
4218+
desc: "Set forwards that promoted the local replica because the primary was unreachable",
4219+
get: func(m DistMetrics) int64 { return m.WriteForwardPromotion },
4220+
},
41954221

41964222
// --- Rebalance ---
41974223
{

tests/hypercache_distmemory_forward_primary_promotion_test.go

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@ package tests
33
import (
44
"context"
55
"testing"
6+
"time"
67

78
"github.com/hyp3rd/hypercache/internal/cluster"
89
"github.com/hyp3rd/hypercache/pkg/backend"
@@ -35,11 +36,15 @@ func TestDistSet_PromotesOnGenericForwardError(t *testing.T) {
3536

3637
// 3 nodes, RF=3, ConsistencyOne. Chaos is wired onto every
3738
// node's transport; we only ever Set from one node, so only
38-
// that node's forwards exercise the promotion path.
39+
// that node's forwards exercise the promotion path. Short
40+
// hint-replay interval so the recovery assertion runs in
41+
// well under a second.
3942
dc := SetupInProcessClusterRF(
4043
t, 3, 3,
4144
backend.WithDistChaos(chaos),
4245
backend.WithDistWriteConsistency(backend.ConsistencyOne),
46+
backend.WithDistHintTTL(time.Minute),
47+
backend.WithDistHintReplayInterval(20*time.Millisecond),
4348
)
4449

4550
a := dc.Nodes[0]
@@ -71,4 +76,43 @@ func TestDistSet_PromotesOnGenericForwardError(t *testing.T) {
7176
if chaos.Drops() == 0 {
7277
t.Errorf("chaos.Drops: got 0, want > 0 (chaos didn't see the forward attempt)")
7378
}
79+
80+
if got := a.Metrics().WriteForwardPromotion; got == 0 {
81+
t.Errorf("WriteForwardPromotion: got 0, want > 0 (counter should bump on every promotion)")
82+
}
83+
84+
// The defense-in-depth contract: when we promote, the failed
85+
// forward to the dead primary queues a hint via replicateTo's
86+
// existing best-effort logic. Without this, the original
87+
// primary would only see the write at the next merkle tick.
88+
if got := a.Metrics().HintedQueued; got == 0 {
89+
t.Errorf("HintedQueued: got 0, want > 0 (replicateTo should have queued a hint for the dead primary)")
90+
}
91+
92+
// End-to-end recovery: heal chaos and wait for the natural
93+
// hint-replay tick (20ms interval, configured above). This
94+
// proves the hint queued during promotion actually carries
95+
// the write back to the primary once it's reachable again.
96+
chaos.SetDropRate(0)
97+
98+
if !waitForLocalContains(b, key, 2*time.Second) {
99+
t.Errorf("primary did not receive the write via hint replay after chaos cleared")
100+
}
101+
}
102+
103+
// waitForLocalContains polls node.LocalContains(key) until it returns
104+
// true or the timeout elapses. Returns the final state. Used by the
105+
// promotion test to absorb the hint-replay tick's scheduling jitter
106+
// without busy-waiting the whole 2s deadline on the happy path.
107+
func waitForLocalContains(node *backend.DistMemory, key string, timeout time.Duration) bool {
108+
deadline := time.Now().Add(timeout)
109+
for time.Now().Before(deadline) {
110+
if node.LocalContains(key) {
111+
return true
112+
}
113+
114+
time.Sleep(10 * time.Millisecond)
115+
}
116+
117+
return node.LocalContains(key)
74118
}

0 commit comments

Comments
 (0)