maintainer remove handoff can still recreate dispatchers after shutdown starts #4827

@wlwilliamx

What did you do?

Investigated a real maintainer failover/remove report against pingcap/ticdc master, then traced the current control-plane code paths and validated them with unit tests.

The problematic sequence is:

  1. the old maintainer receives RemoveMaintainer and enters removing state;
  2. before it is fully stopped, legacy control-plane paths on the old maintainer are still active;
  3. late heartbeat / block-status / node-change / queued operator activity can still mutate scheduling state;
  4. the old maintainer can recreate or reschedule dispatchers after shutdown has already started.
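
The sequence above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toyMaintainer`, `onRemove`, `onLateHeartbeat`, and `schedule` are hypothetical names invented for illustration and do not correspond to TiCDC types or functions. The point is that a handler which never consults the removing flag can still feed the scheduler after shutdown has started.

```go
package main

import "fmt"

// toyMaintainer is a hypothetical stand-in for the old maintainer.
// It is NOT the TiCDC type; it tracks only a removing flag and the
// set of spans that have been marked absent.
type toyMaintainer struct {
	removing bool
	absent   map[string]bool
}

func newToyMaintainer() *toyMaintainer {
	return &toyMaintainer{absent: make(map[string]bool)}
}

// onRemove models step 1: the maintainer enters the removing state.
func (m *toyMaintainer) onRemove() { m.removing = true }

// onLateHeartbeat models steps 2-3: the handler does not check
// m.removing, so a late Stopped/Removed status still marks the span absent.
func (m *toyMaintainer) onLateHeartbeat(span string) {
	m.absent[span] = true
}

// schedule models step 4: every absent span gets a dispatcher (re)created.
func (m *toyMaintainer) schedule() []string {
	var created []string
	for span := range m.absent {
		created = append(created, span)
	}
	return created
}

func main() {
	m := newToyMaintainer()
	m.onRemove()                 // shutdown has started
	m.onLateHeartbeat("table-1") // late status arrives anyway
	fmt.Println("removing:", m.removing, "recreated:", m.schedule())
}
```

In this toy model the span is rescheduled even though `removing` is already true, which mirrors the shape of the reported problem without claiming to reproduce the real code paths.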

The affected paths on current master include:

  • heartbeat self-healing on late Stopped / Removed statuses without an operator;
  • node-change handling that can still call OnNodeRemoved and mark spans absent;
  • barrier block-status / resend paths that can still drive DDL-triggered scheduling;
  • queued operator and scheduler work that can still send ordinary scheduling requests.

What did you expect to see?

Once RemoveMaintainer starts, the old maintainer should stop ordinary control-plane scheduling immediately and only finish the minimal close path needed for the DDL trigger dispatcher.

The old maintainer should not recreate, reschedule, or otherwise mutate normal table dispatcher state after shutdown handoff starts.
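
One possible shape for the expected behavior is a single gate checked at the top of every control-plane handler. The sketch below is an assumption, not the TiCDC implementation or a proposed patch: `gatedMaintainer`, `eventKind`, and its constants are hypothetical names, and the sketch assumes `handle` runs on a single event-loop goroutine while the flag may be set from elsewhere.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// eventKind is a hypothetical classification of control-plane inputs.
type eventKind int

const (
	heartbeat eventKind = iota
	nodeChange
	blockStatus
	operatorTick
	ddlTriggerClose // the minimal close path that must still run
)

// gatedMaintainer is a hypothetical model, not the TiCDC maintainer.
type gatedMaintainer struct {
	removing atomic.Bool // set once RemoveMaintainer is received
	applied  []eventKind // events that were allowed to mutate state
}

// startRemove marks the beginning of the shutdown handoff.
func (m *gatedMaintainer) startRemove() { m.removing.Store(true) }

// handle applies an event unless the maintainer is removing and the
// event is not part of the DDL trigger dispatcher close path.
// It reports whether the event was applied.
func (m *gatedMaintainer) handle(k eventKind) bool {
	if m.removing.Load() && k != ddlTriggerClose {
		return false // dropped: no ordinary scheduling after remove starts
	}
	m.applied = append(m.applied, k)
	return true
}

func main() {
	m := &gatedMaintainer{}
	fmt.Println("before remove, heartbeat applied:", m.handle(heartbeat))
	m.startRemove()
	for _, k := range []eventKind{heartbeat, nodeChange, blockStatus, operatorTick} {
		fmt.Println("after remove, event", k, "applied:", m.handle(k))
	}
	fmt.Println("after remove, close applied:", m.handle(ddlTriggerClose))
}
```

Centralizing the check in one place keeps the four affected paths from drifting apart, at the cost of requiring every entry point to go through the gate.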

What did you see instead?

Current master still allows the old maintainer to drive ordinary scheduling during the remove handoff window.

That can:

  • re-mark spans absent and trigger rescheduling after shutdown starts;
  • race with the new maintainer during handoff;
  • leave orphan dispatchers or continued event-service traffic after the maintainer is already gone.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

Not required for this reproducer. The issue is in TiCDC maintainer control-plane
logic and was confirmed by code-path validation against current upstream master.

Upstream TiKV version (execute tikv-server --version):

Not required for this reproducer. The issue is in TiCDC maintainer control-plane
logic and was confirmed by code-path validation against current upstream master.

TiCDC version (execute cdc version):

Confirmed on `pingcap/ticdc` `master` at
`0a418b4132466aa084517ec7137b3d5f24013dcc`.
