Skip to content

fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5545

Open
AnTengye wants to merge 2 commits intozeromicro:masterfrom
AnTengye:fix/etcd-publisher-revalidate-lease-on-watch
Open

fix(discov): prevent etcd key leak when Watch DELETE races with lease expiry#5545
AnTengye wants to merge 2 commits intozeromicro:masterfrom
AnTengye:fix/etcd-publisher-revalidate-lease-on-watch

Conversation

@AnTengye
Copy link
Copy Markdown

Problem

The publisher's Watch-based self-healing mechanism (introduced after v1.8.x) re-Puts the key with p.lease whenever a DELETE event is observed. However, when the DELETE is caused by the lease itself expiring which happens concurrently with the KeepAlive channel being closed the publisher may race to Put with an expired or zero LeaseID.

In etcd, Put(key, value, WithLease(expired_or_zero_LeaseID)) succeeds but effectively detaches the key from its lease, turning it into a permanent key (no TTL). This key will never be auto-deleted by etcd and will leak forever, causing the symptom:

2 live pods 4+ keys in etcd

The orphaned keys are always the ones without a lease (verifiable by etcdctl get --prefix ... -w json).

Root Cause

When the lease expires, etcd atomically deletes both the lease and all attached keys. The publisher's Watch channel receives a DELETE event for its key at almost the same time the KeepAlive channel is closed. Go select randomly picks one branch:

  • If the DELETE branch wins, the publisher executes cli.Put(fullKey, value, WithLease(p.lease)) using a lease that is already expired or being concurrently revoked key becomes permanent.
  • The old goroutine may still be alive from a prior doRegister cycle, compounding the race.

Fix

Add a TimeToLive check before the Watch-based re-Put:

  • If TTL <= 0 or the call errors, the lease is gone skip the Put, call revoke() and doKeepAlive() to restart the full registration flow (which grants a fresh lease).
  • If the lease is still alive, proceed with the original Put as before.

This preserves the self-healing intent (recovering keys deleted by external/manual operations) while eliminating the etcd key-leak scenario.

Changes

File Change
core/discov/internal/etcdclient.go Added TimeToLive method to EtcdClient interface
core/discov/internal/etcdclient_mock.go Added corresponding mock
core/discov/publisher.go Added TTL validation before Watch-based re-Put

Testing

  • go build ./core/discov/... compiles successfully.

… expiry

The publisher's Watch-based self-healing (added after v1.8) re-puts
the key with p.lease whenever a DELETE event is observed. When the
DELETE is caused by lease expiry and races with the KeepAlive channel
close, the publisher may Put with an expired or zero LeaseID, which
makes the key permanent in etcd (no TTL) and leaks forever.

This fix adds a TimeToLive check before re-putting: if the lease is
already expired or gone, the publisher skips the Put and immediately
restarts the keepalive flow instead, avoiding orphaned keys.
@kevwan kevwan force-pushed the fix/etcd-publisher-revalidate-lease-on-watch branch from a50c8cc to b14d290 Compare April 25, 2026 03:30
@zhoushuguang
Copy link
Copy Markdown
Collaborator

TimeToLive() and Put(...WithLease(p.lease)) are still not atomic, so there is still a small window where issues can happen:

  1. TTL > 0
  2. the lease expires immediately after that
  3. Put() still executes with the old lease

This means there is still a possibility of stale lease re-put.

Also, it would be better to add unit tests for at least these cases:

  • TTL > 0: should continue to call Put
  • TTL <= 0: should skip Put and trigger re-registration
  • TimeToLive() returns error / nil: should not execute Put

@AnTengye AnTengye force-pushed the fix/etcd-publisher-revalidate-lease-on-watch branch from 7d6bf28 to 4dba2e7 Compare April 26, 2026 08:47
@AnTengye
Copy link
Copy Markdown
Author

TimeToLive() and Put(...WithLease(p.lease)) are still not atomic, so there is still a small window where issues can happen:

  1. TTL > 0
  2. the lease expires immediately after that
  3. Put() still executes with the old lease

This means there is still a possibility of stale lease re-put.

Also, it would be better to add unit tests for at least these cases:

  • TTL > 0: should continue to call Put
  • TTL <= 0: should skip Put and trigger re-registration
  • TimeToLive() returns error / nil: should not execute Put

It's fixed, please check.

@kevwan kevwan added kind/bug Something isn't working area/grpc Categorizes issue or PR as related to gRPC. labels Apr 30, 2026
@kevwan
Copy link
Copy Markdown
Contributor

kevwan commented Apr 30, 2026

Review

Good fix for a subtle but critical etcd service discovery bug.

Problem confirmed: The race between a Watch DELETE event and lease expiry can cause Put(key, WithLease(expired_lease)) to succeed — but etcd creates a permanent (no TTL) key, which leaks forever.

What the fix does:

  1. Before the re-put: calls TimeToLive to verify the lease is still alive (TTL > 0)
  2. After the re-put: calls Get to verify the key actually got attached to the lease
  3. If either check fails, calls restartKeepAlive() to revoke + re-register with a fresh lease

Concerns / suggestions:

  1. Interface breaking change — Adding TimeToLive to EtcdClient is a breaking change for anyone with a custom implementation. Consider whether this warrants a deprecation period or interface embedding. (In practice, most users rely on the concrete clientv3.Client so this is minor.)

  2. Extra round-trips — Every time a key is manually deleted, the fix now adds 2 extra etcd round-trips (TTL check + Get verify). This is acceptable given the low frequency of such events, but worth documenting.

  3. restartKeepAlivereturn — When restartKeepAlive is called, the current goroutine returns (exits). doKeepAlive() will start a new goroutine with a fresh lease. This is correct but the interaction is subtle. A comment explaining this flow would help future readers.

  4. Test coverage — Good addition of comprehensive table-driven tests (TestPublisher_keepAliveAsyncDeleteSkipsPutWhenTTLUnavailable). The waitForSignal helper is clean.

  5. Missing test for keyBoundToLease error branchkeyBoundToLease logs an error when the Get itself fails, but this branch isn't tested. Non-blocking.

Overall: The fix is correct and addresses a real production issue (orphaned etcd keys after pod restarts). The two-phase validation approach (pre-check TTL, post-check lease binding) provides defense in depth.

The approach is sound. Consider addressing the comment about restartKeepAlive flow documentation before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/grpc Categorizes issue or PR as related to gRPC. kind/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants