fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5549
fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5549kevwan wants to merge 1 commit into
Conversation
… expiry When a Watch DELETE event fires at the same time the KeepAlive channel closes (both triggered by lease expiry), the subsequent re-Put uses an already-expired LeaseID. etcd accepts the Put but strips the TTL, leaking the key permanently with no expiry. Fix: call TimeToLive before re-putting the key. If the lease is already expired (TTL <= 0) or the call errors, skip the Put and restart the full registration flow via revoke() + doKeepAlive(). - Add TimeToLive to EtcdClient interface - Add TimeToLive mock implementation - Add TTL check guard in keepAliveAsync Watch DELETE handler Closes zeromicro#5545
There was a problem hiding this comment.
Pull request overview
Fixes a race in discov publisher self-healing where a Watch DELETE can re-Put a key with an expired lease, causing etcd to drop the TTL and permanently leak the key.
Changes:
- Add
TimeToLiveto the internalEtcdClientinterface (and gomock) to validate lease liveness. - Guard the Watch DELETE “re-put” path with a
TimeToLivecheck; if the lease is invalid, restart the full registration/keepalive flow instead of re-putting.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
core/discov/publisher.go |
Adds TTL validation before re-putting on Watch DELETE; restarts keepalive on invalid/expired lease to prevent permanent keys. |
core/discov/internal/etcdclient.go |
Extends EtcdClient interface with TimeToLive to support lease validation. |
core/discov/internal/etcdclient_mock.go |
Updates gomock implementation to include TimeToLive. |
| // Make sure the lease is still valid before re-putting the key. | ||
| // Otherwise the Put may happen with an already-expired or zero | ||
| // LeaseID (e.g. when the DELETE event is caused by lease expiry | ||
| // and races with KeepAlive channel close), which makes the key | ||
| // permanent in etcd (no TTL) and leaks forever. | ||
| ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease) | ||
| if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 { | ||
| logc.Errorf(cli.Ctx(), | ||
| "etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v", | ||
| p.lease, ttlErr) |
There was a problem hiding this comment.
This new TimeToLive call will break existing gomock-based tests that exercise the Watch DELETE path unless they set an expectation for TimeToLive (e.g. Return(&clientv3.LeaseTimeToLiveResponse{TTL: 1}, nil)). In the current repo, core/discov/publisher_test.go’s key-deletion tests only expect Watch + Put, so they will now fail with an unexpected TimeToLive call.
| // Make sure the lease is still valid before re-putting the key. | |
| // Otherwise the Put may happen with an already-expired or zero | |
| // LeaseID (e.g. when the DELETE event is caused by lease expiry | |
| // and races with KeepAlive channel close), which makes the key | |
| // permanent in etcd (no TTL) and leaks forever. | |
| ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease) | |
| if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 { | |
| logc.Errorf(cli.Ctx(), | |
| "etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v", | |
| p.lease, ttlErr) | |
| // Avoid re-putting the key with an invalid lease. When the | |
| // lease is already cleared locally, re-putting would create a | |
| // permanent key with no TTL. | |
| if p.lease <= 0 { | |
| logc.Errorf(cli.Ctx(), | |
| "etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d", | |
| p.lease) |
| logc.Errorf(cli.Ctx(), | ||
| "etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v", | ||
| p.lease, ttlErr) |
There was a problem hiding this comment.
The log message says "lease expired" even when TimeToLive returns a non-nil error (which could be a transient RPC issue, auth issue, context cancellation, etc.). Consider wording this as "failed to verify lease TTL" / "lease TTL invalid" and include whether the TTL was <= 0 vs an error, so logs are accurate for debugging.
| ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease) | ||
| if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 { | ||
| logc.Errorf(cli.Ctx(), | ||
| "etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v", | ||
| p.lease, ttlErr) | ||
| p.revoke(cli) | ||
| if err := p.doKeepAlive(); err != nil { | ||
| logc.Errorf(cli.Ctx(), "etcd publisher KeepAlive: %v", err) | ||
| } | ||
| return |
There was a problem hiding this comment.
This introduces a new branch where TimeToLive returns TTL <= 0 (or errors) and the code restarts keepalive instead of re-putting the key. There are existing unit tests for the Watch DELETE re-put behavior in core/discov/publisher_test.go, but they don’t cover this new TTL-expired/error branch; adding a test for the "TTL invalid -> no Put, restart registration" path would help prevent regressions of the leak fix.
Problem
When a Watch DELETE event fires at the same time the KeepAlive channel closes — both triggered by the same lease expiry — the subsequent re-
Putcall uses an already-expiredLeaseID. etcd accepts thePutbut silently strips the TTL, making the key permanent in etcd. The key leaks forever and is never cleaned up.This is a race between two concurrent signal paths:
KeepAlivechannel closes →revoke()+doKeepAlive()restarts registrationWatchDELETE event fires → re-Putwith the old (expired)LeaseIDIf (2) wins the race, the key is re-put without a TTL.
Solution
Before re-putting the key on a DELETE event, call
TimeToLiveto verify the lease is still alive. If TTL ≤ 0 or the call errors, skip thePutand restart the full registration flow viarevoke()+doKeepAlive()instead.Changes
core/discov/internal/etcdclient.go: AddTimeToLiveto theEtcdClientinterfacecore/discov/internal/etcdclient_mock.go: Add gomock implementation forTimeToLivecore/discov/publisher.go: Guard the re-Putwith aTimeToLivecheck in the Watch DELETE handlerCloses #5545