[Bug]: lockservice can hang on stale remote lock binds #24205

@LeftHandCold

Description

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

3.0-dev

Commit ID

b79773d

Other Environment Information

  • Hardware parameters:
  • OS type:
  • Others: multi-CN freetier environment. The requester CN logged "bind closed" and "cannot find lockservice address", followed by remote lock/unlock failures, while the owner CN still held the row lock.

Actual Behavior

A remote lock can hang indefinitely after the requester loses a specific remote bind. The owner CN still treats the remote txn as a valid lock holder, so waiters remain blocked even though the requester side has already lost the bind and cannot complete the remote lock/unlock path.

Expected Behavior

Remote lock requests should not hang indefinitely when the response path breaks, and the owner CN should release stale remote holders once the corresponding bind heartbeat is lost.

Steps to Reproduce

  1. Run a multi-CN cluster with remote row locking enabled.
  2. Let one CN acquire a remote row lock on a table owned by another CN.
  3. Break the requester-side bind / routing state so that the requester logs "bind closed" and later "cannot find lockservice address" for that remote lock.
  4. Observe that the owner CN still reports the original remote txn as the lock holder and waiters remain blocked for a long time.

Additional information

Root cause analysis shows two protocol gaps:

  1. handleRemoteLock / handleForwardLock wrote the response through an async one-way path after the owner had already taken the lock. Because the owner did not wait for that write to complete, a lost response left the requester stuck in morpc.Future.Get() while the owner kept the lock.
  2. Remote lock keepalive was effectively tracked at service granularity, so a lost bind could leave a stale remote holder on the owner CN.
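
The second gap can be illustrated with a small, self-contained Go sketch. It is not the lockservice code: `bindKey`, `heartbeats`, `touch`, `stale`, the service name "cn-1", and the 10 second timeout are all hypothetical names and values chosen for illustration. The sketch only shows why a keepalive keyed by service can hide a lost bind that a keepalive keyed by bind would flag.

```go
package main

import (
	"fmt"
	"time"
)

// bindKey identifies one remote bind. Both fields are hypothetical
// stand-ins for however the lock service names a peer and a bind.
type bindKey struct {
	serviceID string
	tableID   uint64
}

// heartbeats records the last keepalive time seen for each key.
type heartbeats[K comparable] struct {
	lastSeen map[K]time.Time
}

func newHeartbeats[K comparable]() *heartbeats[K] {
	return &heartbeats[K]{lastSeen: make(map[K]time.Time)}
}

// touch records a keepalive for k at time now.
func (h *heartbeats[K]) touch(k K, now time.Time) { h.lastSeen[k] = now }

// stale reports whether k has gone longer than timeout without a keepalive.
// A key that was never seen counts as stale.
func (h *heartbeats[K]) stale(k K, now time.Time, timeout time.Duration) bool {
	t, ok := h.lastSeen[k]
	return !ok || now.Sub(t) > timeout
}

// demo returns whether the lost bind looks stale under service-level
// tracking and under bind-level tracking.
func demo() (serviceStale, bindStale bool) {
	start := time.Unix(0, 0)
	timeout := 10 * time.Second // assumed value, for illustration only
	byService := newHeartbeats[string]()
	byBind := newHeartbeats[bindKey]()

	// Both binds on the same peer are alive at the start.
	byService.touch("cn-1", start)
	byBind.touch(bindKey{"cn-1", 1}, start)
	byBind.touch(bindKey{"cn-1", 2}, start)

	// Later, only the bind for table 1 still sends keepalives.
	later := start.Add(30 * time.Second)
	byService.touch("cn-1", later)
	byBind.touch(bindKey{"cn-1", 1}, later)

	return byService.stale("cn-1", later, timeout),
		byBind.stale(bindKey{"cn-1", 2}, later, timeout)
}

func main() {
	serviceStale, bindStale := demo()
	fmt.Println(serviceStale) // false: the peer as a whole still looks alive
	fmt.Println(bindStale)    // true: the bind for table 2 has been lost
}
```

With service-level keys, the keepalive from the surviving bind refreshes the only entry for the peer, so the holder behind the lost bind is never considered for cleanup.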

The fix uses bounded synchronous response writes for remote lock results, tracks bind-level remote heartbeats in orphan detection, and sends KeepRemoteLock heartbeats per bind to avoid multi-table overwrite on the same peer.
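
A minimal sketch of what a bounded synchronous response write could look like, assuming a generic write callback rather than the real morpc API. `writeFunc`, `boundedWrite`, and the 50 ms timeout are hypothetical and exist only for this example. What the owner does when the write fails (for instance, releasing the lock it just took) is left to the caller and is not taken from the actual fix.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// writeFunc is a hypothetical stand-in for whatever sends the lock
// result back to the requester.
type writeFunc func(ctx context.Context) error

// boundedWrite runs write and waits for it to finish, but never longer
// than timeout. The caller always learns whether the response went out.
func boundedWrite(parent context.Context, timeout time.Duration, write writeFunc) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	// Buffered so the goroutine can finish even if we stopped waiting.
	done := make(chan error, 1)
	go func() { done <- write(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The write may still be in flight; it gets ctx and should honor it.
		return ctx.Err()
	}
}

// demo runs one write that succeeds at once and one that is stuck.
func demo() (fastErr error, slowTimedOut bool) {
	timeout := 50 * time.Millisecond // assumed value, for illustration only

	fastErr = boundedWrite(context.Background(), timeout,
		func(ctx context.Context) error { return nil })

	// Simulate a broken response path: the write ignores ctx and blocks
	// until demo returns.
	block := make(chan struct{})
	defer close(block)
	slowErr := boundedWrite(context.Background(), timeout,
		func(ctx context.Context) error { <-block; return nil })

	return fastErr, errors.Is(slowErr, context.DeadlineExceeded)
}

func main() {
	fastErr, slowTimedOut := demo()
	fmt.Println(fastErr, slowTimedOut) // <nil> true
}
```

The point of the design is that a stuck response path turns into an error on the owner within a fixed time, instead of a silent one-way send that leaves both sides waiting.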

Metadata

Labels

kind/bug: Something isn't working
