Skip to content

fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5549

Closed
kevwan wants to merge 1 commit into
zeromicro:masterfrom
kevwan:fix/etcd-publisher-lease-expiry-race
Closed

fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5549
kevwan wants to merge 1 commit into
zeromicro:masterfrom
kevwan:fix/etcd-publisher-lease-expiry-race

Conversation

@kevwan
Copy link
Copy Markdown
Contributor

@kevwan kevwan commented Apr 25, 2026

Problem

When a Watch DELETE event fires at the same time the KeepAlive channel closes — both triggered by the same lease expiry — the subsequent re-Put call uses an already-expired LeaseID. etcd accepts the Put but silently strips the TTL, making the key permanent in etcd. The key leaks forever and is never cleaned up.

This is a race between two concurrent signal paths:

  1. KeepAlive channel closes → revoke() + doKeepAlive() restarts registration
  2. Watch DELETE event fires → re-Put with the old (expired) LeaseID

If (2) wins the race, the key is re-put without a TTL.

Solution

Before re-putting the key on a DELETE event, call TimeToLive to verify the lease is still alive. If TTL ≤ 0 or the call errors, skip the Put and restart the full registration flow via revoke() + doKeepAlive() instead.

Changes

  • core/discov/internal/etcdclient.go: Add TimeToLive to the EtcdClient interface
  • core/discov/internal/etcdclient_mock.go: Add gomock implementation for TimeToLive
  • core/discov/publisher.go: Guard the re-Put with a TimeToLive check in the Watch DELETE handler

Closes #5545

… expiry

When a Watch DELETE event fires at the same time the KeepAlive channel
closes (both triggered by lease expiry), the subsequent re-Put uses an
already-expired LeaseID. etcd accepts the Put but strips the TTL,
leaking the key permanently with no expiry.

Fix: call TimeToLive before re-putting the key. If the lease is already
expired (TTL <= 0) or the call errors, skip the Put and restart the
full registration flow via revoke() + doKeepAlive().

- Add TimeToLive to EtcdClient interface
- Add TimeToLive mock implementation
- Add TTL check guard in keepAliveAsync Watch DELETE handler

Closes zeromicro#5545
Copilot AI review requested due to automatic review settings April 25, 2026 03:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in discov publisher self-healing where a Watch DELETE can re-Put a key with an expired lease, causing etcd to drop the TTL and permanently leak the key.

Changes:

  • Add TimeToLive to the internal EtcdClient interface (and gomock) to validate lease liveness.
  • Guard the Watch DELETE “re-put” path with a TimeToLive check; if the lease is invalid, restart the full registration/keepalive flow instead of re-putting.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
core/discov/publisher.go Adds TTL validation before re-putting on Watch DELETE; restarts keepalive on invalid/expired lease to prevent permanent keys.
core/discov/internal/etcdclient.go Extends EtcdClient interface with TimeToLive to support lease validation.
core/discov/internal/etcdclient_mock.go Updates gomock implementation to include TimeToLive.

Comment thread core/discov/publisher.go
Comment on lines +155 to +164
// Make sure the lease is still valid before re-putting the key.
// Otherwise the Put may happen with an already-expired or zero
// LeaseID (e.g. when the DELETE event is caused by lease expiry
// and races with KeepAlive channel close), which makes the key
// permanent in etcd (no TTL) and leaks forever.
ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease)
if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 {
logc.Errorf(cli.Ctx(),
"etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v",
p.lease, ttlErr)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new TimeToLive call will break existing gomock-based tests that exercise the Watch DELETE path unless they set an expectation for TimeToLive (e.g. Return(&clientv3.LeaseTimeToLiveResponse{TTL: 1}, nil)). In the current repo, core/discov/publisher_test.go’s key-deletion tests only expect Watch + Put, so they will now fail with an unexpected TimeToLive call.

Suggested change
// Make sure the lease is still valid before re-putting the key.
// Otherwise the Put may happen with an already-expired or zero
// LeaseID (e.g. when the DELETE event is caused by lease expiry
// and races with KeepAlive channel close), which makes the key
// permanent in etcd (no TTL) and leaks forever.
ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease)
if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 {
logc.Errorf(cli.Ctx(),
"etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v",
p.lease, ttlErr)
// Avoid re-putting the key with an invalid lease. When the
// lease is already cleared locally, re-putting would create a
// permanent key with no TTL.
if p.lease <= 0 {
logc.Errorf(cli.Ctx(),
"etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d",
p.lease)

Copilot uses AI. Check for mistakes.
Comment thread core/discov/publisher.go
Comment on lines +162 to +164
logc.Errorf(cli.Ctx(),
"etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v",
p.lease, ttlErr)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message says "lease expired" even when TimeToLive returns a non-nil error (which could be a transient RPC issue, auth issue, context cancellation, etc.). Consider wording this as "failed to verify lease TTL" / "lease TTL invalid" and include whether the TTL was <= 0 vs an error, so logs are accurate for debugging.

Copilot uses AI. Check for mistakes.
Comment thread core/discov/publisher.go
Comment on lines +160 to +169
ttlResp, ttlErr := cli.TimeToLive(cli.Ctx(), p.lease)
if ttlErr != nil || ttlResp == nil || ttlResp.TTL <= 0 {
logc.Errorf(cli.Ctx(),
"etcd publisher lease expired, skip re-put and restart keepalive: leaseID=%d, err=%v",
p.lease, ttlErr)
p.revoke(cli)
if err := p.doKeepAlive(); err != nil {
logc.Errorf(cli.Ctx(), "etcd publisher KeepAlive: %v", err)
}
return
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a new branch where TimeToLive returns TTL <= 0 (or errors) and the code restarts keepalive instead of re-putting the key. There are existing unit tests for the Watch DELETE re-put behavior in core/discov/publisher_test.go, but they don’t cover this new TTL-expired/error branch; adding a test for the "TTL invalid -> no Put, restart registration" path would help prevent regressions of the leak fix.

Copilot generated this review using guidance from repository custom instructions.
@kevwan kevwan closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants