Skip to content

Commit f598868

Browse files
authored
perf(lease-read): store lease expiry as atomic.Int64 nanoseconds (zero-alloc extend) (#948)
Closes #554. ## Behavior change `leaseState` no longer heap-allocates on the lease write path. The expiry is now stored as a single `atomic.Int64` of monotonic-raw nanoseconds and the invalidation generation as a single `atomic.Uint64`, replacing the `atomic.Pointer[leaseSlot]` whose every successful `extend` allocated a fresh `*leaseSlot`. That allocation was 1:1 with Dispatch throughput (issue #554: ~50k allocs/s at 50k Dispatch/s) and a matching GC source. All externally observable lease semantics are preserved: - `valid(now)` boundary unchanged: `now.Before(expiry)` (strict / exclusive at the expiry instant), zero-expiry treated as "no lease", zero `now` fails closed. - `extend` monotonicity unchanged: a shorter target never regresses a longer live lease. - Generation guard unchanged: an `extend` carrying a generation captured before a racing `invalidate` is dropped, so a leader-loss callback cannot be resurrected. - `invalidate` still clears unconditionally and bumps the generation. - Nil-receiver and zero-value `leaseState` behavior unchanged. The method signatures (`valid`, `generation`, `extend`, `invalidate`) are byte-for-byte identical, so the callers in `kv/coordinator.go` and `kv/sharded_coordinator.go` are untouched. ## Clock-source decision (called out explicitly) The issue offered two directions for the expiry value: 1. `time.Now().UnixNano()` (wall clock) — zero alloc but **loses Go's monotonic-clock comparison**: a backward NTP step prematurely expires the lease (safe), a forward step extends it past its true safety window (**unsafe**). 2. A monotonic source (`runtime.nanotime` / `CLOCK_MONOTONIC_RAW`) — zero alloc **and** step-immune, avoiding the caveat entirely. This PR takes **option 2**, and in fact the repo already adopted `CLOCK_MONOTONIC_RAW` via `internal/monoclock` (PR #551, which landed after #554 was filed). `expiryNanos` stores `monoclock.Instant.Nanos()` and `valid()` compares it against a caller-supplied `monoclock.Now()`. Both endpoints share the same arbitrary monotonic-raw zero point, so the lease-vs-safety-window comparison is immune to NTP rate adjustment **and** wall-clock step events — strictly stronger than `time.Now().UnixNano()` and with no allocation. The wall-clock dependency the issue asked to document is captured in a code comment on `leaseState` (constraint stated as an invariant, no PR/issue reference): both the stored expiry and the `now` passed to `valid()` must originate from monoclock; mixing in `time.Now()`-derived nanoseconds would reintroduce the NTP-step hazard. So the only outstanding part of #554 was the per-`extend` `*leaseSlot` allocation, which this PR removes. ## Concurrency design `valid()` (the hot path, one per read) stays lock-free: a single atomic load of `expiryNanos` and a comparison. `extend` and `invalidate` (one per Dispatch / leadership change, not per read) serialize on a writer-only `sync.Mutex`, so their two-field `(gen, expiry)` updates appear atomic to each other without needing a 128-bit CAS and without the post-write rollback dance a lock-free two-atomic scheme would require. Because writers are mutually exclusive, an `extend` and an `invalidate` can never interleave: the `extend` either runs fully before the `invalidate` (and is then cleared) or fully after (and is dropped by the generation guard). Readers never take the mutex. This replaces the previous pointer-identity rollback machinery (the clock-granularity-tie disambiguation): with serialized writers there is no rollback, so the value-clobber hazard that machinery guarded cannot occur. The test that pinned the pointer-identity invariant is replaced by one pinning the same observable safety property (`TestLeaseState_StaleExtendCannotClobberFreshLeaseSameExpiry`). ## Risk - Adds a `sync.Mutex` on the lease write paths. These run once per Dispatch (already gated by Raft consensus) / once per leadership change, not on the per-read hot path. The critical section is two atomic stores. Lock contention is negligible relative to the consensus round-trip; the read hot path is unaffected. - The clock-source choice is the load-bearing safety decision. monoclock (CLOCK_MONOTONIC_RAW) preserves the existing step-immune behavior; no wall-clock regression is introduced. ## Test evidence `go test -race -run 'TestLeaseState|TestIsLeadershipLossError' ./kv/`: ``` ok github.com/bootjp/elastickv/kv 1.094s ``` Full package, `go test -race ./kv/...`: ``` ok github.com/bootjp/elastickv/kv 10.721s ``` `golangci-lint --config=.golangci.yaml run ./kv/...`: ``` 0 issues. ``` Benchmark, `go test -run='^$' -bench='BenchmarkLeaseState' -benchmem ./kv/`: ``` goos: darwin goarch: arm64 pkg: github.com/bootjp/elastickv/kv cpu: Apple M1 Max BenchmarkLeaseStateExtend-10 79673116 14.65 ns/op 0 B/op 0 allocs/op BenchmarkLeaseStateValid-10 1000000000 0.5511 ns/op 0 B/op 0 allocs/op ``` Before this change the same `extend` benchmark measured `23.08 ns/op 16 B/op 1 allocs/op`. Acceptance criterion (0 allocs/op) met; `extend` is also ~37% faster. Per-PR Jepsen CI covers the lease-read window (DynamoDB / Redis workloads), which is the integration-level safety net for this path. ## Self-review 1. **Data loss** — No persistence path touched. The lease is leader-local, in-memory, advisory fast-path state; it gates whether a read takes the fast or slow (LinearizableRead) path but never affects what is written or committed. A lost/cleared lease only downgrades to the slow path (safe). No FSM, Raft, Pebble, or snapshot semantics change. 2. **Concurrency / distributed failures** — Writers serialize on `writeMu`; readers are lock-free single-atomic. Eliminates the extend/invalidate interleave the old design handled via pointer-identity rollback (the interleave can no longer occur). Generation guard preserved, so leader-loss invalidation still beats a racing stale extend. Verified with `go test -race` on the full `kv` package, including the concurrent extend/read test. 3. **Performance** — Removes the per-`extend` 16 B heap allocation (0 allocs/op, confirmed by benchmark); read path stays lock-free at 0.55 ns/op. New writer mutex critical section is two atomic stores, off the read hot path and dwarfed by the consensus round-trip that precedes every extend. 4. **Data consistency** — Lease validity boundary, monotonic-extend rule, zero/sentinel handling, and the monotonic-raw clock source are all preserved, so the lease-read freshness bound is unchanged. Reads still go through the leader-issued read pipeline; nothing here bypasses `HLC.Next()` or the read-timestamp path. 5. **Test coverage** — All existing lease tests retained and passing. Replaced the implementation-internal pointer-identity test with `TestLeaseState_StaleExtendCannotClobberFreshLeaseSameExpiry` (same observable invariant against the new internals) and `TestLeaseState_StaleExtendAfterInvalidateIsNoop`. Added `BenchmarkLeaseStateExtend` (the acceptance criterion, 0 allocs/op) and `BenchmarkLeaseStateValid`.
2 parents 128e16c + ac9b7f8 commit f598868

2 files changed

Lines changed: 241 additions & 163 deletions

File tree

kv/lease_state.go

Lines changed: 115 additions & 100 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
package kv
22

33
import (
4+
"sync"
45
"sync/atomic"
56

67
"github.com/bootjp/elastickv/internal/monoclock"
@@ -30,55 +31,56 @@ func isLeadershipLossError(err error) bool {
3031
errors.Is(err, raftengine.ErrLeadershipTransferInProgress)
3132
}
3233

33-
// leaseSlot is the immutable payload stored behind leaseState.current.
34-
// Each successful extend installs a freshly-allocated *leaseSlot, so
35-
// pointer identity alone disambiguates two extenders that happened to
36-
// compute the same expiry value (clock-granularity tie). A rollback
37-
// CAS that compares against the extender's own *leaseSlot therefore
38-
// cannot clobber a newer lease installed by a concurrent winner, even
39-
// if the newer lease carries an equal expiryNanos.
40-
//
41-
// expiryNanos == 0 is the sentinel for "no lease"; the zero-valued
42-
// *leaseSlot (returned on startup before any extend) carries gen 0.
43-
type leaseSlot struct {
44-
// expiryNanos is monoclock.Instant.Nanos(); 0 means "no lease".
45-
expiryNanos int64
46-
// gen is the monotonic invalidation counter. Each invalidate
47-
// installs a new slot with gen+1; each extend observes it via
48-
// generation() BEFORE the quorum operation and refuses to install
49-
// a slot whose gen no longer matches. This preserves the
50-
// leader-loss guard: a Dispatch that returned just before an
51-
// invalidate fires must not resurrect the lease.
52-
gen uint64
53-
}
54-
5534
// leaseState tracks the monotonic-raw expiry of a leader-local read
56-
// lease. The hot-path read (valid) remains lock-free via a single
57-
// atomic pointer load; writes go through CAS on the pointer so
58-
// pointer identity gates the extender-vs-rollback race.
35+
// lease.
5936
//
60-
// expiryNanos == 0 in the current slot means the lease has never been
61-
// issued or has been invalidated. A non-zero value is the
62-
// monotonic-raw instant after which the lease is considered expired;
63-
// a caller comparing monoclock.Now() against the loaded value can
64-
// decide whether to skip a quorum confirmation. The monotonic-raw
65-
// clock is immune to NTP rate adjustment and wall-clock steps (see
66-
// internal/monoclock).
37+
// Storage layout. The expiry is held as a single atomic int64 of
38+
// monotonic-raw nanoseconds and the invalidation generation as a single
39+
// atomic uint64. Neither write path allocates: the previous
40+
// pointer-swap design heap-allocated a slot on every successful extend,
41+
// which is 1:1 with Dispatch throughput and a measurable GC source at
42+
// tens of thousands of writes per second. Two plain atomics remove that
43+
// allocation entirely.
44+
//
45+
// Clock source. expiryNanos stores monoclock.Instant.Nanos(), i.e. a
46+
// reading of CLOCK_MONOTONIC_RAW (see internal/monoclock). It is NOT
47+
// wall-clock nanoseconds. valid() compares it against a caller-supplied
48+
// monoclock.Now(), so the lease-vs-safety-window comparison is immune to
49+
// NTP rate adjustment and wall-clock step events: a backward or forward
50+
// wall-clock step cannot prematurely expire the lease or, worse, extend
51+
// it past its true safety window. Both the stored expiry and the now
52+
// passed to valid() MUST originate from monoclock so they share that
53+
// arbitrary zero point; mixing in time.Now()-derived nanoseconds here
54+
// would reintroduce the NTP-step hazard this clock source exists to
55+
// avoid.
56+
//
57+
// expiryNanos == 0 means the lease has never been issued or has been
58+
// invalidated; valid() returns false for it unconditionally. A non-zero
59+
// value is the monotonic-raw instant after which the lease is expired.
60+
//
61+
// Concurrency. valid() is the hot path (one per read) and stays
62+
// lock-free: a single atomic load of expiryNanos. The write paths
63+
// (extend / invalidate) run once per Dispatch / leadership change, not
64+
// per read, and serialize on writeMu so an extend and an invalidate can
65+
// never interleave their (gen, expiry) updates. Serializing only the
66+
// writers keeps the pair update atomic without a 128-bit CAS and without
67+
// the post-write rollback dance a lock-free two-atomic scheme would
68+
// otherwise need, while leaving the read path uncontended.
6769
type leaseState struct {
68-
// current is the live *leaseSlot. A nil pointer means "never
69-
// extended"; valid() treats it identically to an explicit
70-
// zero-expiry slot.
71-
current atomic.Pointer[leaseSlot]
72-
}
73-
74-
// slotOrZero returns the current slot, substituting a synthetic zero
75-
// slot when the pointer has never been stored. Keeps callers free of
76-
// nil checks.
77-
func slotOrZero(p *leaseSlot) *leaseSlot {
78-
if p == nil {
79-
return &leaseSlot{}
80-
}
81-
return p
70+
// writeMu serializes extend and invalidate so their two-field
71+
// updates appear atomic to each other. Readers never take it.
72+
writeMu sync.Mutex
73+
// expiryNanos is monoclock.Instant.Nanos(); 0 means "no lease".
74+
// Read lock-free by valid(); written only under writeMu.
75+
expiryNanos atomic.Int64
76+
// gen is the monotonic invalidation counter. Each invalidate
77+
// increments it; each extend observes it via generation() BEFORE
78+
// the quorum operation and refuses to install an expiry whose
79+
// generation no longer matches. This preserves the leader-loss
80+
// guard: a Dispatch that returned just before an invalidate fires
81+
// must not resurrect the lease. Read lock-free by generation();
82+
// written only under writeMu.
83+
gen atomic.Uint64
8284
}
8385

8486
// valid reports whether the lease is unexpired at now.
@@ -93,11 +95,11 @@ func (s *leaseState) valid(now monoclock.Instant) bool {
9395
if s == nil || now.IsZero() {
9496
return false
9597
}
96-
slot := s.current.Load()
97-
if slot == nil || slot.expiryNanos == 0 {
98+
expiry := s.expiryNanos.Load()
99+
if expiry == 0 {
98100
return false
99101
}
100-
return now.Before(monoclock.FromNanos(slot.expiryNanos))
102+
return now.Before(monoclock.FromNanos(expiry))
101103
}
102104

103105
// generation returns the current invalidation counter. Callers MUST
@@ -110,80 +112,93 @@ func (s *leaseState) generation() uint64 {
110112
if s == nil {
111113
return 0
112114
}
113-
return slotOrZero(s.current.Load()).gen
115+
return s.gen.Load()
114116
}
115117

116118
// extend sets the lease expiry to until iff (a) until is strictly
117119
// after the currently stored expiry (or no expiry is stored) and
118120
// (b) no invalidate has happened since the caller captured
119121
// expectedGen via generation() BEFORE the quorum operation.
120122
//
121-
// Pointer-identity CAS guards the rollback: after a racing
122-
// invalidate fires, the rollback clears ONLY the exact *leaseSlot
123-
// this extender installed, so a concurrent extender that captured
124-
// the post-invalidate generation and computed the same expiry value
125-
// (clock-granularity tie) cannot be clobbered -- its slot is a
126-
// distinct allocation with a distinct pointer.
123+
// The generation recheck under writeMu is the leader-loss guard: an
124+
// invalidate that fired DURING the quorum operation has bumped gen, so
125+
// the captured expectedGen no longer matches and the extend is dropped
126+
// before it can resurrect a lease the leadership-loss callback just
127+
// cleared. Because extend and invalidate are mutually exclusive on
128+
// writeMu, no post-write rollback is needed: an invalidate cannot land
129+
// between the generation recheck and the expiry store.
127130
func (s *leaseState) extend(until monoclock.Instant, expectedGen uint64) {
128131
if s == nil {
129132
return
130133
}
131134
target := until.Nanos()
132135
if target == 0 {
133136
// Refuse to store 0 -- that value is the sentinel for "no
134-
// lease" and would race with invalidate's zero-store.
137+
// lease" and would be indistinguishable from invalidate's
138+
// zero-store.
135139
return
136140
}
137-
for {
138-
// Load the live pointer once per iteration; all decisions
139-
// below are based on THIS observation. The CAS old-value
140-
// must match this exact pointer.
141-
stored := s.current.Load()
142-
old := slotOrZero(stored)
143-
// Pre-CAS gate: if invalidate already advanced the generation
144-
// past expectedGen, skip the CAS entirely.
145-
if old.gen != expectedGen {
146-
return
147-
}
148-
if old.expiryNanos != 0 && target <= old.expiryNanos {
149-
return
150-
}
151-
next := &leaseSlot{expiryNanos: target, gen: expectedGen}
152-
if !s.current.CompareAndSwap(stored, next) {
153-
continue
154-
}
155-
// CAS landed. If invalidate raced in between the pre-CAS
156-
// gate and the CAS itself, the current gen now exceeds
157-
// expectedGen; undo our write via a pointer-identity CAS on
158-
// `next`. A later writer (concurrent extender, concurrent
159-
// invalidate) that already replaced `next` with its own
160-
// allocation owns the state; its pointer differs from
161-
// `next`, our CAS fails, and we leave its value intact --
162-
// EVEN IF its expiryNanos equals our target.
163-
if observed := slotOrZero(s.current.Load()); observed.gen != expectedGen {
164-
rb := &leaseSlot{expiryNanos: 0, gen: observed.gen}
165-
s.current.CompareAndSwap(next, rb)
166-
}
141+
s.writeMu.Lock()
142+
defer s.writeMu.Unlock()
143+
// Generation guard: an invalidate since expectedGen was captured
144+
// means a leader-loss callback fired during the quorum op; refuse
145+
// to resurrect the lease.
146+
if s.gen.Load() != expectedGen {
167147
return
168148
}
149+
// Monotonicity: never regress a lease. An out-of-order writer that
150+
// sampled monoclock.Now() earlier must not shorten a freshly
151+
// extended lease and force callers onto the slow path while the
152+
// leader is still confirmed.
153+
cur := s.expiryNanos.Load()
154+
if cur != 0 && target <= cur {
155+
return
156+
}
157+
s.expiryNanos.Store(target)
169158
}
170159

171-
// invalidate clears the lease so the next read takes the slow path.
172-
// Each call installs a fresh zero-expiry slot whose gen is one more
173-
// than the current slot's, via CAS retry. A concurrent extender that
174-
// captured the pre-invalidate generation will see the advanced gen
175-
// on its post-CAS recheck and roll back its own slot via pointer-
176-
// identity CAS.
160+
// invalidate clears the lease so the next read takes the slow path and
161+
// bumps the generation so a concurrent extender that captured the
162+
// pre-invalidate generation cannot resurrect the lease. The store of
163+
// the zero sentinel is unconditional, even when the current expiry is
164+
// in the future: otherwise leadership-loss callbacks would be powerless
165+
// once a lease is in place.
166+
//
167+
// Store order matters for the lock-free reader. valid() reads only
168+
// expiryNanos; generation() reads only gen. Clearing the expiry BEFORE
169+
// bumping the generation guarantees that the fast-path guard (valid())
170+
// fails closed at least as early as the generation bump becomes visible.
171+
// If the generation were bumped first, a reader that loaded gen and then
172+
// loaded expiryNanos in the window before the zero-store landed could
173+
// observe "new generation, old (still valid) expiry" and serve a
174+
// fast-path read after leadership loss was already detected. With the
175+
// expiry cleared first, once any reader can observe the post-invalidate
176+
// generation it can also observe the cleared expiry, so the fast path is
177+
// never served past the point invalidation began.
178+
//
179+
// extend serializes on writeMu so the two never interleave: an extend
180+
// either runs fully before this invalidate (and is then cleared here) or
181+
// fully after (and observes the bumped generation via its guard and is
182+
// dropped). The intra-invalidate store order is invisible to a
183+
// mutex-serialized extend, so reordering does not weaken the extend
184+
// generation guard.
177185
func (s *leaseState) invalidate() {
178186
if s == nil {
179187
return
180188
}
181-
for {
182-
stored := s.current.Load()
183-
old := slotOrZero(stored)
184-
next := &leaseSlot{expiryNanos: 0, gen: old.gen + 1}
185-
if s.current.CompareAndSwap(stored, next) {
186-
return
187-
}
189+
s.writeMu.Lock()
190+
defer s.writeMu.Unlock()
191+
s.expiryNanos.Store(0)
192+
if onInvalidateBetweenStores != nil {
193+
onInvalidateBetweenStores()
188194
}
195+
s.gen.Add(1)
189196
}
197+
198+
// onInvalidateBetweenStores is a test-only hook invoked inside
199+
// invalidate() after the expiry has been cleared but before the
200+
// generation is bumped. It exists to make the intra-invalidate store
201+
// ordering deterministically observable so the fail-closed invariant can
202+
// be regression-tested; it is nil in production and adds a single
203+
// predictable-branch nil check to the once-per-leadership-change path.
204+
var onInvalidateBetweenStores func()

0 commit comments

Comments
 (0)