feat: support prechecking down peers before restarting tikv pod#6877

Merged
ti-chi-bot[bot] merged 4 commits into pingcap:main from liubog2008:support-detect-peer-down
May 9, 2026
@liubog2008 (Member) commented May 5, 2026

  • Support prechecking down peers before restarting a TiKV pod
  • Support waiting until leaders are evicted before restarting a TiKV pod

liubog2008 added 2 commits May 5, 2026 21:31
Signed-off-by: liubo02 <liubo02@pingcap.com>
Signed-off-by: liubo02 <liubo02@pingcap.com>
@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 May 5, 2026 13:42
@github-actions github-actions Bot added the v2 for operator v2 label May 5, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XXL label May 5, 2026
codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 70.75472% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.61%. Comparing base (2b81667) to head (c7b20b0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6877      +/-   ##
==========================================
+ Coverage   37.44%   37.61%   +0.17%     
==========================================
  Files         392      392              
  Lines       22432    22483      +51     
==========================================
+ Hits         8399     8458      +59     
+ Misses      14033    14025       -8     
Flag       Coverage Δ
unittest   37.61% <70.75%> (+0.17%) ⬆️

Signed-off-by: liubo02 <liubo02@pingcap.com>
Copilot AI left a comment

Pull request overview

This PR enhances TiKV pod restart safety by introducing PD-based prechecks (down-peer regions and leader eviction) before allowing TiKV pod recreation, and refactors leader-eviction condition syncing into the eviction task flow.

Changes:

  • Add a PD API client method and types for querying regions with down peers (/pd/api/v1/regions/check/down-peer).
  • Gate TiKV pod recreation on (a) zero non-self down peers and (b) leaders being evicted, and trigger leader-eviction scheduling when needed.
  • Refactor syncing of TiKVCondLeadersEvicted from the status task into the leader-eviction task, with updated/added unit tests.
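
The two restart gates listed above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the types `DownPeer`, `Region`, and `RegionsCheckInfo` are simplified stand-ins for the PD response types, and `countBlockingRegions` and `canRecreatePod` are hypothetical helper names. It assumes the store being restarted is excluded because its own peers may already be reported down, and that an unknown store ID falls back to the reported region count.

```go
package main

import "fmt"

// DownPeer, Region and RegionsCheckInfo are hypothetical, simplified stand-ins
// for the PD response types (the real ones live in pkg/pdapi/v1/types.go).
type DownPeer struct {
	StoreID string
}

type Region struct {
	ID        string
	DownPeers []DownPeer
}

type RegionsCheckInfo struct {
	Count   int
	Regions []Region
}

// countBlockingRegions returns how many regions have at least one down peer
// on a store other than selfStoreID. If selfStoreID is unknown (empty),
// nothing can be excluded, so the reported region count is returned as is.
func countBlockingRegions(info *RegionsCheckInfo, selfStoreID string) int {
	if info == nil {
		return 0
	}
	if selfStoreID == "" {
		return info.Count
	}
	blocking := 0
	for _, region := range info.Regions {
		for _, peer := range region.DownPeers {
			if peer.StoreID != selfStoreID {
				blocking++
				break // one foreign down peer is enough to block on this region
			}
		}
	}
	return blocking
}

// canRecreatePod applies both gates in order: (a) no down peers on other
// stores, (b) leaders already evicted from this store.
func canRecreatePod(info *RegionsCheckInfo, selfStoreID string, leadersEvicted bool) (bool, string) {
	if n := countBlockingRegions(info, selfStoreID); n > 0 {
		return false, fmt.Sprintf("%d region(s) have down peers on other stores", n)
	}
	if !leadersEvicted {
		return false, "leaders are not evicted yet"
	}
	return true, ""
}

func main() {
	info := &RegionsCheckInfo{
		Count: 2,
		Regions: []Region{
			{ID: "10", DownPeers: []DownPeer{{StoreID: "1"}}},
			{ID: "11", DownPeers: []DownPeer{{StoreID: "1"}, {StoreID: "2"}}},
		},
	}
	ok, reason := canRecreatePod(info, "1", true)
	fmt.Println(ok, reason)
}
```

In this sketch a caller that gets `false` would wait and retry on a later reconcile rather than fail, which matches the `task.Wait()` results visible in the review excerpts further down.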

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pkg/pdapi/v1/types.go | Adds PD response types for down-peer region checks. |
| pkg/pdapi/v1/client.go | Adds GetDownPeerRegions PD client call and endpoint constant. |
| pkg/pdapi/v1/mock_generated.go | Updates PD client mock to include GetDownPeerRegions. |
| pkg/pdapi/v1/client_test.go | Adds unit test coverage for GetDownPeerRegions. |
| pkg/controllers/tikv/tasks/util.go | Adds helper checks for leader-eviction status/timeout; fixes VolumeName import aliasing. |
| pkg/controllers/tikv/tasks/pod.go | Adds restart prechecks (down peers + leaders evicted) and wires PD client usage into restart flow. |
| pkg/controllers/tikv/tasks/pod_test.go | Extends pod task tests to cover down-peer filtering and leader-eviction gating behavior. |
| pkg/controllers/tikv/tasks/evict_leader.go | Changes eviction scheduler management based on ShouldEvictLeader and syncs LeadersEvicted condition here. |
| pkg/controllers/tikv/tasks/evict_leader_test.go | Adds tests for starting/stopping leader eviction scheduler behavior. |
| pkg/controllers/tikv/tasks/offline.go | Switches offline flow to use the new leader-eviction check helper and ShouldEvictLeader. |
| pkg/controllers/tikv/tasks/status.go | Removes leader-eviction condition syncing and related wait behavior from status task. |
| pkg/controllers/tikv/tasks/status_test.go | Updates expectations after removing leader-eviction condition management from status task. |
| pkg/controllers/tikv/tasks/ctx.go | Minor formatting/structure adjustments; no functional change observed. |
| pkg/controllers/tikv/builder.go | Updates runner wiring to pass PD client manager into TaskPod. |
| api/core/v1alpha1/tikv_types.go | Adds ReasonStoreNotExist and deprecates ReasonStoreIsRemoved. |

Comment thread pkg/controllers/tikv/tasks/pod.go Outdated
Comment on lines +105 to +109
    return task.Wait().With("cannot recreate pod, check down peer: %v", err)
}

if err := CheckTiKVLeadersEvicted(state.TiKV()); err != nil {
    return task.Wait().With("cannot recreate pod, check leader count: %v", err)
Comment on lines +150 to +159
func countNonSelfDownPeers(downPeerInfo *pdapi.RegionsCheckInfo, store *pdv1.Store) int {
    if store == nil || store.ID == "" {
        return downPeerInfo.Count
    }
    if downPeerInfo.Count == 0 {
        return 0
    }

    nonSelfDownPeerCount := 0
    for _, region := range downPeerInfo.Regions {
case !state.PDSynced:
return task.Wait().With("pd is unsynced")
case state.Store == nil:
if state.Store == nil {
Comment thread pkg/controllers/tikv/tasks/pod.go Outdated
Comment on lines +78 to +82
pc, ok := state.GetPDClient(cm)
if !ok {
    return task.Wait().With("wait if pd client is not registered")
}

Signed-off-by: liubo02 <liubo02@pingcap.com>
@liubog2008
Member Author

/cherry-pick release-2.1

@ti-chi-bot
Member

@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

@ti-chi-bot ti-chi-bot Bot added the lgtm label May 9, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-09 01:33:32 UTC: ☑️ agreed by fgksgf.

@ti-chi-bot ti-chi-bot Bot added the approved label May 9, 2026
@ti-chi-bot ti-chi-bot Bot merged commit c3de909 into pingcap:main May 9, 2026
10 checks passed
@ti-chi-bot
Member

@liubog2008: new pull request created to branch release-2.1: #6882.

5 participants