fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5545
fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5545AnTengye wants to merge 2 commits intozeromicro:masterfrom
Conversation
… expiry The publisher's Watch-based self-healing (added after v1.8) re-puts the key with p.lease whenever a DELETE event is observed. When the DELETE is caused by lease expiry and races with the KeepAlive channel close, the publisher may Put with an expired or zero LeaseID, which makes the key permanent in etcd (no TTL) and leaks forever. This fix adds a TimeToLive check before re-putting: if the lease is already expired or gone, the publisher skips the Put and immediately restarts the keepalive flow instead, avoiding orphaned keys.
a50c8cc to
b14d290
Compare
|
TimeToLive() and Put(...WithLease(p.lease)) are still not atomic, so there is still a small window where issues can happen:
This means there is still a possibility of stale lease re-put. Also, it would be better to add unit tests for at least these cases:
|
7d6bf28 to
4dba2e7
Compare
It's fixed, please check. |
ReviewGood fix for a subtle but critical etcd service discovery bug. Problem confirmed: The race between a Watch DELETE event and lease expiry can cause What the fix does:
Concerns / suggestions:
Overall: The fix is correct and addresses a real production issue (orphaned etcd keys after pod restarts). The two-phase validation approach (pre-check TTL, post-check lease binding) provides defense in depth. The approach is sound. Consider addressing the comment about |
Problem
The publisher's Watch-based self-healing mechanism (introduced after v1.8.x) re-Puts the key with
p.leasewhenever a DELETE event is observed. However, when the DELETE is caused by the lease itself expiring which happens concurrently with the KeepAlive channel being closed the publisher may race toPutwith an expired or zeroLeaseID.In etcd,
Put(key, value, WithLease(expired_or_zero_LeaseID))succeeds but effectively detaches the key from its lease, turning it into a permanent key (no TTL). This key will never be auto-deleted by etcd and will leak forever, causing the symptom:2 live pods 4+ keys in etcdThe orphaned keys are always the ones without a lease (verifiable by
etcdctl get --prefix ... -w json).Root Cause
When the lease expires, etcd atomically deletes both the lease and all attached keys. The publisher's Watch channel receives a DELETE event for its key at almost the same time the KeepAlive channel is closed. Go
selectrandomly picks one branch:cli.Put(fullKey, value, WithLease(p.lease))using a lease that is already expired or being concurrently revoked key becomes permanent.doRegistercycle, compounding the race.Fix
Add a
TimeToLivecheck before the Watch-based re-Put:TTL <= 0or the call errors, the lease is gone skip thePut, callrevoke()anddoKeepAlive()to restart the full registration flow (which grants a fresh lease).Putas before.This preserves the self-healing intent (recovering keys deleted by external/manual operations) while eliminating the etcd key-leak scenario.
Changes
core/discov/internal/etcdclient.goTimeToLivemethod toEtcdClientinterfacecore/discov/internal/etcdclient_mock.gocore/discov/publisher.goTesting
go build ./core/discov/...compiles successfully.