Fix/issue 4954 etcd pause resume #5530

Draft
ljluestc wants to merge 2 commits into zeromicro:master from ljluestc:fix/issue-4954-etcd-pause-resume
ljluestc commented Apr 5, 2026

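Background

After startup, a service currently keeps renewing the lease on its registration in etcd.
When the API on some nodes is temporarily unavailable and those nodes should be taken out of service discovery, the only option is usually to stop the process or container, which is operationally costly and inflexible.

Main changes

1) core/discov: Publisher supports safely pausing and resuming lease renewal

  • In Publisher, pauseChan / resumeChan are changed to buffered channels (capacity 1)
  • A new internal notify method is added; Pause() / Resume() send non-blocking notifications so that repeated calls cannot block
  • In the keepalive loop:
    • On a pause signal, the current lease is actively revoked and renewal stops
    • On a resume signal, the service re-registers and keepalive is restored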
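
The non-blocking notification described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `signaler` type, `newSignaler`, and the `bool` return value of `notify` are assumptions made for the example; only the channel names and the capacity of 1 come from the description.

```go
package main

import "fmt"

// signaler is a hypothetical stand-in for the pause/resume part of Publisher.
type signaler struct {
	pauseChan  chan struct{}
	resumeChan chan struct{}
}

func newSignaler() *signaler {
	return &signaler{
		// Capacity 1: at most one pending signal per direction.
		pauseChan:  make(chan struct{}, 1),
		resumeChan: make(chan struct{}, 1),
	}
}

// notify performs a non-blocking send. If a signal is already pending,
// the new one is dropped and notify reports false instead of blocking.
func notify(ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s *signaler) Pause()  { notify(s.pauseChan) }
func (s *signaler) Resume() { notify(s.resumeChan) }

func main() {
	s := newSignaler()
	// Repeated calls return immediately even though nothing is receiving.
	for i := 0; i < 3; i++ {
		s.Pause()
	}
	fmt.Println("pending pause signals:", len(s.pauseChan))
}
```

With an unbuffered channel and a plain send, the second `Pause()` call would block until the keepalive loop received the first signal; the buffered channel plus `select`/`default` turns repeated calls into a single pending notification.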

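2) zrpc/internal: expose an entry point for controlling etcd registration

  • Add an abstract interface for the publisher to keepAliveServer
  • Add:
    • PauseEtcdRegister()
    • ResumeEtcdRegister()
  • Start() keeps its original semantics: call KeepAlive() first, then start the gRPC server

3) zrpc: RpcServer provides an API for upper-layer callers

  • Add to RpcServer:
    • PauseEtcdRegister()
    • ResumeEtcdRegister()
  • When the underlying server does not support etcd control, the calls are no-ops, which keeps non-etcd setups compatible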
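
One common Go way to get the no-op behavior described in 3) is an optional interface checked with a type assertion. The sketch below assumes that approach; the PR may implement it differently. All types here (`server`, `etcdRegisterController`, `plainServer`, `etcdServer`, and the lowercase `rpcServer`) are hypothetical stand-ins, not go-zero types; only the two method names come from the description.

```go
package main

import "fmt"

// server is a hypothetical minimal view of the underlying internal server.
type server interface {
	Start() error
}

// etcdRegisterController is a hypothetical optional interface that only
// etcd-backed servers implement.
type etcdRegisterController interface {
	PauseEtcdRegister()
	ResumeEtcdRegister()
}

// plainServer has no etcd registration, so it does not implement the controller.
type plainServer struct{}

func (plainServer) Start() error { return nil }

// etcdServer stands in for a keepalive server backed by a publisher.
type etcdServer struct{ paused bool }

func (s *etcdServer) Start() error        { return nil }
func (s *etcdServer) PauseEtcdRegister()  { s.paused = true }
func (s *etcdServer) ResumeEtcdRegister() { s.paused = false }

// rpcServer stands in for the public wrapper that forwards the calls.
type rpcServer struct{ server server }

// PauseEtcdRegister forwards to the underlying server if it supports
// etcd control and silently does nothing otherwise.
func (rs *rpcServer) PauseEtcdRegister() {
	if c, ok := rs.server.(etcdRegisterController); ok {
		c.PauseEtcdRegister()
	}
}

// ResumeEtcdRegister mirrors PauseEtcdRegister.
func (rs *rpcServer) ResumeEtcdRegister() {
	if c, ok := rs.server.(etcdRegisterController); ok {
		c.ResumeEtcdRegister()
	}
}

func main() {
	es := &etcdServer{}
	withEtcd := &rpcServer{server: es}
	withEtcd.PauseEtcdRegister()
	fmt.Println("etcd server paused:", es.paused)

	direct := &rpcServer{server: plainServer{}}
	direct.PauseEtcdRegister() // no-op: plainServer has no etcd control
}
```

The type assertion keeps the base `server` interface unchanged, so servers that never register with etcd need no new methods.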

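4) Additional tests

  • core/discov/publisher_test.go
    • Add and strengthen tests for non-blocking Pause/Resume and for concurrent scenarios
  • zrpc/internal/rpcpubserver_test.go
    • Add tests for keepalive startup, error branches, and the Pause/Resume control path
  • zrpc/server_test.go
    • Add tests for RpcServer Pause/Resume forwarding and no-op behavior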

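Compatibility and impact

  • No breaking change to existing default behavior: if the new API is not called, the lease-renewal flow is unchanged
  • The new capability is optional and intended for runtime scenarios that need to temporarily remove and later restore a node
  • The service protocol and the registration data structure are unchanged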

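Usage (example)

While the service is running, call as needed:

  • rpcServer.PauseEtcdRegister(): pause etcd registration lease renewal (the node goes offline in service discovery)
  • rpcServer.ResumeEtcdRegister(): resume etcd registration lease renewal (the node comes back online)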
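
As an illustration of how a caller might drive these two methods, the sketch below toggles registration from a health signal. Everything except the two method names is an assumption for the example: `registerController`, `applyHealth`, and `recorder` are hypothetical, and a real service would pass its RpcServer rather than a recorder.

```go
package main

import "fmt"

// registerController is a hypothetical interface covering the two
// methods this PR proposes for RpcServer.
type registerController interface {
	PauseEtcdRegister()
	ResumeEtcdRegister()
}

// applyHealth pauses registration when a registered node turns unhealthy
// and resumes it when an unregistered node turns healthy again.
// It returns the new "registered" state.
func applyHealth(ctl registerController, registered, healthy bool) bool {
	switch {
	case registered && !healthy:
		ctl.PauseEtcdRegister()
		return false
	case !registered && healthy:
		ctl.ResumeEtcdRegister()
		return true
	}
	return registered
}

// recorder is a fake controller that records the calls it receives.
type recorder struct{ calls []string }

func (r *recorder) PauseEtcdRegister()  { r.calls = append(r.calls, "pause") }
func (r *recorder) ResumeEtcdRegister() { r.calls = append(r.calls, "resume") }

func main() {
	rec := &recorder{}
	registered := true
	// Simulated health probe results over time.
	for _, healthy := range []bool{true, false, false, true} {
		registered = applyHealth(rec, registered, healthy)
	}
	fmt.Println(rec.calls, registered)
}
```

Tracking the registered state on the caller side means each transition triggers exactly one call, although the description states that repeated calls are safe as well.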

Test results

  • go test ./core/discov ./zrpc/internal ./zrpc
  • go test ./...

private-bot added 2 commits April 4, 2026 21:44
Expose RpcServer-level control to pause/resume etcd lease renewal for gRPC services, and make publisher pause/resume signaling non-blocking to avoid call-site blocking.

Also add unit tests for keepalive server controls and RpcServer controller forwarding.
kevwan added the area/grpc and kind/in-progress labels Apr 30, 2026

kevwan (Contributor) commented Apr 30, 2026

Translation / English summary:

This draft PR adds a PauseEtcdRegister() / ResumeEtcdRegister() API to RpcServer, allowing a service to temporarily deregister from etcd service discovery without stopping the process (addressing issue #4954).

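Original description (translated from Chinese): When the API on some nodes is temporarily unavailable and those nodes should be taken out of service discovery, the only option is usually to stop the process or container, which is operationally costly and inflexible. This PR adds a safe pause/resume capability for lease renewal to Publisher and exposes the PauseEtcdRegister / ResumeEtcdRegister API at the zrpc.RpcServer layer.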


Review

Concept: Solid and useful — the ability to temporarily remove a node from service discovery without killing the process is a common operational need (e.g., graceful drain before maintenance, canary rollout control).

Key changes:

  1. core/discov: pauseChan / resumeChan changed to buffered channels (capacity 1) + non-blocking notify() to prevent duplicate-call deadlock
  2. core/discov: On pause, revokes current lease; on resume, re-registers and restarts keepalive
  3. zrpc/internal: keepAliveServer gets a Publisher interface abstraction with Pause()/Resume()
  4. zrpc: RpcServer.PauseEtcdRegister() / ResumeEtcdRegister() — no-op when not using etcd

Review points:

  1. Interface change in core/discov: if the EtcdClient interface (or a similar one) is changed, this is a breaking change for users who implement it. Please confirm which interfaces are changed and whether the change is backward-compatible.

  2. Lease revocation timing — After calling PauseEtcdRegister(), there's a race between the keepalive loop processing the pause signal and existing clients receiving the stale registration from etcd before TTL expiration. The TTL on the lease controls how quickly clients learn the node is gone — please document this delay.

  3. Resume re-registration — When resuming, the node creates a new lease and re-registers. If this fails (etcd unavailable during resume), does the error propagate to the caller or is it silently retried? Users need visibility into failed resumes.

  4. Tests — The test additions are comprehensive. Please also add an integration test (or at minimum a unit test with a mock) for the resume-after-failed-lease scenario.
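
For the resume-after-failed-lease scenario in point 4, a unit test with a mock could be built around something like the sketch below. It is only an assumption about shape: `registrar`, `flakyRegistrar`, and `resumeWithRetry` are hypothetical names, and the bounded-retry policy that returns the last error is one possible answer to point 3, not what the PR currently does.

```go
package main

import (
	"errors"
	"fmt"
)

// registrar is a hypothetical abstraction of "grant a new lease and
// re-register the key", the step performed on resume.
type registrar interface {
	Register() error
}

var errUnavailable = errors.New("etcd unavailable")

// flakyRegistrar is a mock that fails the first `failures` attempts.
type flakyRegistrar struct {
	failures int
	calls    int
}

func (f *flakyRegistrar) Register() error {
	f.calls++
	if f.calls <= f.failures {
		return errUnavailable
	}
	return nil
}

// resumeWithRetry tries to re-register up to maxAttempts times (at least
// once) and returns the last error so the caller can see a failed resume.
func resumeWithRetry(r registrar, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = r.Register(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("resume failed after %d attempts: %w", maxAttempts, err)
}

func main() {
	recovering := &flakyRegistrar{failures: 2}
	fmt.Println(resumeWithRetry(recovering, 3), recovering.calls)

	down := &flakyRegistrar{failures: 10}
	fmt.Println(resumeWithRetry(down, 3))
}
```

Wrapping the final error with `%w` lets callers check the cause with `errors.Is`, which gives the visibility into failed resumes that point 3 asks for.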

This is a draft — please mark ready for review once the above concerns are addressed. Also, please consider writing the PR description in English or including an English translation to help all contributors review the change.
