test(e2e): add kind-based stress/scale harness (1/3/7 members, churn, quorum watcher) #369
xrl wants to merge 2 commits into
Adds a build-tagged stress target (`//go:build stress` + `make test-stress`) that exercises the operator at 1/3/7-member scale and under scale-churn, crash-during-scale, and pod-recovery, reusing the existing kind bootstrap and e2e primitives. The fast `make test-e2e` suite is unaffected (stress tag excluded).

Green tests (pass on current main): TestStressBringUp, TestStressScaleChurn, TestStressSingleEditJump, TestStressCrashDuringScale, TestStressPodRecoveryAtScale.

Skip-gated bug-proofs (flip to passing alongside their fix PR): TestStressVersionUpgrade, TestStressEvenSizeRejected, TestStressLeaderlessScaleIn.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
quorumWatcher polled `etcdctl endpoint health --cluster`, which reports health for every member in pod-0's member list. During scale churn a member that is mid-join (a not-yet-serving learner) or mid-removal transiently reports unhealthy under `--cluster` even though quorum is fully intact. That tripped the watcher's 3-consecutive-bad-poll threshold and failed TestStressScaleChurn with a phantom "1 sustained quorum-loss window" (13 unhealthy polls), despite every scale step converging and the keyset staying intact -- i.e. quorum was never actually lost.

Switch the watcher to a dedicated endpointHealthQuorum check that runs `etcdctl endpoint health` against pod-0's *local* endpoint only. A healthy result there means etcd committed a proposal through Raft, which requires quorum -- the true write-stall / quorum-loss signal -- and is immune to transient joining/leaving members. endpointHealthAllHealthy is left unchanged for waitForClusterHealthy, where all-endpoints-healthy is the correct convergence gate.

With the fix TestStressScaleChurn passes with 0 unhealthy polls.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
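The threshold behaviour described in this commit can be sketched in isolation. The following is a minimal illustration, not the PR's `quorumWatcher`: the `quorumTracker` type, its fields, and the `replay` helper are hypothetical names, and the health probe is replaced by a recorded slice of booleans instead of a real `etcdctl endpoint health` call. It only shows how a 3-consecutive-bad-poll rule separates isolated blips from a sustained window.

```go
package main

import "fmt"

// quorumTracker is a hypothetical stand-in for the bookkeeping a watcher
// like quorumWatcher might keep; the real type is not shown in the PR text.
type quorumTracker struct {
	threshold      int // consecutive bad polls that count as a sustained window
	consecutiveBad int // length of the current unhealthy streak
	unhealthyPolls int // total unhealthy polls seen
	lossWindows    int // number of streaks that reached the threshold
}

// observe records the result of one health poll.
func (q *quorumTracker) observe(healthy bool) {
	if healthy {
		q.consecutiveBad = 0
		return
	}
	q.unhealthyPolls++
	q.consecutiveBad++
	// Count each window once, at the moment the streak reaches the threshold.
	if q.consecutiveBad == q.threshold {
		q.lossWindows++
	}
}

// replay feeds a recorded sequence of poll results through a fresh tracker.
func replay(threshold int, polls []bool) quorumTracker {
	q := quorumTracker{threshold: threshold}
	for _, healthy := range polls {
		q.observe(healthy)
	}
	return q
}

func main() {
	// One single blip, one two-poll blip, then a four-poll outage.
	polls := []bool{true, false, true, false, false, true, false, false, false, false, true}
	q := replay(3, polls)
	fmt.Printf("unhealthy polls=%d sustained windows=%d\n", q.unhealthyPolls, q.lossWindows)
	// prints "unhealthy polls=7 sustained windows=1"
}
```

Under this rule the choice of probe matters more than the threshold: a probe that reports transiently unhealthy joining or leaving members can produce a long enough streak to register a window even when writes never stalled, which is the failure mode the commit describes.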
What this adds
A build-tagged, kind-based stress/integration harness that exercises the operator at the cluster sizes and under the churn where its known hazards live. Today the largest cluster any e2e test creates is 3 members, and there is no version-upgrade test, so several real failure modes are simply unreachable by the current suite. This closes that gap without slowing the fast path.

Everything is gated behind `//go:build stress` + a new `make test-stress` target, so `make test-e2e` is completely unaffected (the stress tag is excluded). It reuses the existing kind bootstrap, deployed operator, gofail wiring, and `test/e2e/helpers_test.go` primitives, with no new infrastructure.

Green tests (passing on current `main`)

- `TestStressBringUp`: bootstrap at `size ∈ {1,3,7}`, logs time-to-healthy per size, asserts hashKV consistency + data round-trip.
- `TestStressScaleChurn`: `1→3→7→3→1` with a background quorum-invariant watcher running throughout; asserts one-member-at-a-time progression, no stuck learners, hashKV consistency, and keyset integrity after every step.
- `TestStressSingleEditJump`: a single `1→7` edit; asserts the operator never adds more than one learner at a time and converges.
- `TestStressCrashDuringScale`: arms the existing `exceptionAfterMemberAdd` / `exceptionAfterMemberDelete` failpoints during `3→7` and `7→3`, asserts the operator recovers and converges.
- `TestStressPodRecoveryAtScale`: deletes a member pod at size 7, asserts member ID stability + data replication (extends the current 3-member recovery test).

The quorum-invariant watcher (`quorumWatcher`) directly answers the request from the contribution discussion for "an upgrade e2e with a continuous quorum-invariant watcher". It polls a member-local endpoint (a healthy commit there requires a Raft quorum), so it flags genuine write-stalls without false positives from members that are mid-join/mid-removal.

Skip-gated regression guards (paired with future fix PRs)
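One of the guards in this section concerns even cluster sizes. As background, here is a minimal sketch of the standard majority-quorum arithmetic that makes even sizes unattractive. It is not code from this PR, and the `quorum` and `faultTolerance` helper names are hypothetical.

```go
package main

import "fmt"

// quorum returns the majority size for an n-member Raft cluster.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while a majority remains.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 7} {
		fmt.Printf("size=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
}
```

Sizes 2 and 4 tolerate the same number of failures as sizes 1 and 3 respectively, while requiring a larger quorum for every write.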
Three tests are committed `t.Skip`-gated as executable proof + regression guard for known issues; each flips to passing in the same PR as its fix:

- `TestStressVersionUpgrade`: proves the silent no-op `.spec.version` upgrade.
- `TestStressEvenSizeRejected`: even sizes (2/4) are currently admitted.
- `TestStressLeaderlessScaleIn`: scale-in that removes the current leader.

Test evidence
Full `make test-stress` run on a single-node kind cluster (`kindest/node:v1.32.0`): all green tests pass, and all three guards are correctly skipped.

`make test-e2e` is unchanged and does not pick up any of these tests.

Notes for reviewers
`size=7` does not trigger the lexical-sort member-ordering bug (that needs ≥11, where `etcd-9` sorts after `etcd-10`); a dedicated ordinal case can be added when that fix is in flight. 1/3/7 is the target here.

Related work: etcd TLS & operability
Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (`→` marks this PR):

- `spec.tls.{peer,client}` surfaces; breaking alpha API change (no conversion webhook, by design)
- `TLSReady` condition + TLS lifecycle Events
- `PeerCANotShared`

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client `*tls.Config` now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.