What did you do?
Investigated a real maintainer failover/remove report against pingcap/ticdc master, then traced and validated the current control-plane code paths with unit coverage.
The problematic sequence is:
- the old maintainer receives `RemoveMaintainer` and enters removing state;
- before it is fully stopped, legacy control-plane paths on the old maintainer are still active;
- late heartbeat / block-status / node-change / queued operator activity can still mutate scheduling state;
- the old maintainer can recreate or reschedule dispatchers after shutdown has already started.
The affected paths on current master include:
- heartbeat self-healing on late `Stopped` / `Removed` statuses without an operator;
- node-change handling that can still call `OnNodeRemoved` and mark spans absent;
- barrier block-status / resend paths that can still drive DDL-triggered scheduling;
- queued operator and scheduler work that can still send ordinary scheduling requests.
What did you expect to see?
Once RemoveMaintainer starts, the old maintainer should stop ordinary control-plane scheduling immediately and only finish the minimal close path needed for the DDL trigger dispatcher.
The old maintainer should not recreate, reschedule, or otherwise mutate normal table dispatcher state after shutdown handoff starts.
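The expected behavior can be sketched as a single gate in front of every control-plane handler. This is a minimal illustration only, not TiCDC code: `Maintainer`, `EventKind`, `Handle`, `OnRemoveMaintainer`, and `allowedWhileRemoving` are all hypothetical names, and the event classes simply mirror the affected paths listed above. It assumes a single event-loop goroutine; a real implementation with concurrent handlers would presumably need an atomic flag or a lock.

```go
package main

import "fmt"

// EventKind is a hypothetical classification of control-plane events.
// It does not correspond to any type in pingcap/ticdc.
type EventKind int

const (
	EventHeartbeat          EventKind = iota // late Stopped / Removed statuses
	EventBlockStatus                         // barrier block-status / resend
	EventNodeChange                          // node removal handling
	EventOperator                            // queued operator / scheduler work
	EventCloseDDLDispatcher                  // minimal close path for the DDL trigger dispatcher
)

// Maintainer is a toy stand-in used only to illustrate the gate.
type Maintainer struct {
	removing bool        // set once RemoveMaintainer has been received
	handled  []EventKind // events that were allowed to mutate state
}

// OnRemoveMaintainer marks the start of the shutdown handoff.
func (m *Maintainer) OnRemoveMaintainer() { m.removing = true }

// allowedWhileRemoving reports whether an event belongs to the minimal
// close path (assumption: only the DDL trigger dispatcher close qualifies).
func allowedWhileRemoving(k EventKind) bool {
	return k == EventCloseDDLDispatcher
}

// Handle processes an event and returns true, or drops it and returns
// false when the maintainer is removing and the event is ordinary scheduling.
func (m *Maintainer) Handle(k EventKind) bool {
	if m.removing && !allowedWhileRemoving(k) {
		return false
	}
	m.handled = append(m.handled, k)
	return true
}

func main() {
	m := &Maintainer{}
	fmt.Println("before remove, heartbeat handled:", m.Handle(EventHeartbeat))

	m.OnRemoveMaintainer()
	for _, k := range []EventKind{EventHeartbeat, EventBlockStatus, EventNodeChange, EventOperator} {
		fmt.Println("after remove, ordinary event handled:", m.Handle(k))
	}
	fmt.Println("after remove, DDL close handled:", m.Handle(EventCloseDDLDispatcher))
}
```

The point of the sketch is that the check lives in one place, so a late heartbeat, block status, node change, or queued operator cannot bypass it by arriving through a different path.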
What did you see instead?
Current master still allows the old maintainer to drive ordinary scheduling during the remove handoff window.
That can:
- re-mark spans absent and trigger rescheduling after shutdown starts;
- race with the new maintainer during handoff;
- leave orphan dispatchers or continued event-service traffic after the maintainer is already gone.
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
Not required for this reproducer. The issue is in TiCDC maintainer control-plane
logic and was confirmed by code-path validation against current upstream master.
Upstream TiKV version (execute tikv-server --version):
Not required for this reproducer. The issue is in TiCDC maintainer control-plane
logic and was confirmed by code-path validation against current upstream master.
TiCDC version (execute cdc version):
Confirmed on `pingcap/ticdc` `master` at `0a418b4132466aa084517ec7137b3d5f24013dcc`.