roachtest: replace kv/splits with splits perturbation test#170090
Conversation
|
😎 Merged successfully - details. |
7d94271 to
ab0285a
Compare
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
686c2d2 to
19d2e8e
Compare
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
9a94da4 to
5275a36
Compare
95f0931 to
927a718
Compare
miraradeva
left a comment
There was a problem hiding this comment.
Looks really nice! It's a bit sad how much code it takes to do waitForVoterPlacement and rebalanceLeases, both things we do in many tests.
A later commit in this PR reframes "how many splits can the cluster sustain" as a perturbation test that runs concurrently with a foreground KV workload and reports impact in roachperf-comparable terms (latency ratio, throughput ratio, replicas/vcpu/node density). The legacy kv/splits/nodes=3/quiesce=*/lease=* tests answered a binary "did splits finish in 2h", repeatedly tripped the same known cascade (#170041, #166202, #169319, #168374), and produced no trendable signal when they passed. Drop them in favor of perturbation/full/splits. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Some perturbations need a wall-clock budget beyond the registry default (e.g. tests that pile up cluster-wide work that has to drain afterwards). Add an optional v.timeout field and a v.applyTimeout helper that addFull, addDev, and addMetamorphic all use to copy it into their TestSpec. No behavior change for existing tests — they leave v.timeout at zero and continue to get the registry default. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…tests The kv workload's `init` subcommand supports a --scatter flag that scatters the table immediately after splitting; this gives the allocator a chance to spread leaseholders across the cluster while queues are idle, instead of leaving every split's children with the parent's leaseholder. Add an opt-in v.scatter field that initWorkload passes through. Off by default so existing tests are unaffected. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…ntervals Some perturbations are expected to crush foreground throughput while they run, with the meaningful pass/fail signal being whether the cluster returns to baseline afterwards. Letting one threshold gate both intervals forces a choice between flaky-during-perturbation and permissive-during-recovery. Add recoveryImpact alongside impact. Zero value falls back to impact at the use site (no eager copy), so tests that don't set it independently continue to use a single threshold for both intervals. Perturbations that need a permissive in-flight threshold can set the two independently. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…ests Some perturbations leave non-trivial kvadmission flow-control churn at the end of the workload phase that takes longer than the 10-minute default to fully drain to zero in the post-test ValidateTokensReturned check. A previous workaround (raise to 1h when v.diskBandwidthLimit is set) covers one such case. Generalize that to a per-test v.tokenReturnTime field that any perturbation can set; the existing defaults remain in place when the field is zero. No behavior change for existing tests. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…work Add an optional perturbationNamer interface a perturbation may implement to append a variation suffix to its test name. With this a single struct type can register multiple test entries that differ in configuration; a later commit uses it to register the splits test in two flavors (followers asleep vs. forced awake) under distinct names like perturbation/full/splits/asleep=true perturbation/full/splits/asleep=false without resorting to one struct per variant. No behavior change for existing perturbations — none of them implement the interface, so perturbationName() falls through to the reflection-based type name as before. Epic: none Release note: None
Reframe "how many splits can the cluster sustain" as a perturbation test that runs concurrently with a foreground KV workload and reports impact in roachperf-comparable terms (throughput ratio, density in replicas/vcpu/node). Supersedes the legacy kv/splits/*/quiesce=*/lease=* binary tests dropped earlier in this PR. Two flavors are registered together: asleep=true measures total replica capacity (followers park via store liveness quiescence), asleep=false measures active-replica capacity (every replica costs full per-tick CPU, the relevant ceiling during outages where followers can't sleep). asleep=true targets ~9,400 replicas/vcpu (40% over the production telemetry cluster's ~6,700 replicas/vcpu non-quiescent density) to stress-test the primary quiescence path beyond current production density; asleep=false targets ~6,250 replicas/vcpu to match the telemetry cluster's observed non-quiescent density. asleep=false also disables store-liveness follower sleep via the kv.raft.store_liveness.quiescence.enabled cluster setting; leader leases keep leaseholders awake unconditionally so no additional knob is needed. Design (see the struct-level comment in splits.go for the long form): - Partition the cluster's stores into disjoint groups of three on distinct nodes; pin one database per group with voter_constraints that one-to-one map each voter to a store. With this every split's children inherit their parent's correct placement, so no SCATTER and no rebalance churn during the split storm. - Issue splits per group in parallel batches; descending keys per batch to keep each split landing on the leftmost (stable, single-leader) range rather than chaining new raft groups together. - After batch 0 (and again at end-of-perturbation), do an explicit deterministic round-robin lease rebalance via ALTER RANGE RELOCATE LEASE. Plain SCATTER stalls at 1.5-3x spread under per-store voter_constraints because the allocator's lease scorer picks the same biased target on every re-scatter. Round-robin converges in one pass; dispatch is parallelized within a group (concurrency 16) and across groups (one task per group). - Per-call ALTER RANGE failures are tolerated up to 10% (catches systemic problems like a node down while shrugging off transient range-merge races); a 100%-failure floor catches small-batch pathologies that the percentage gate would skip. First error is surfaced in the failure message and in a warn log when any failure happens below the threshold. Knobs in splits.setup(): - asleep=true: targetSplits=600k, batchSize=10k (full); the optional devSizer / metamorphicSizer interfaces (added in this commit) let the dev and metamorphic variants shrink to 4k and 60k respectively. - asleep=false: targetSplits=400k, batchSize=5k (full); 40k metamorphic; shares the 4k dev with asleep=true. - v.splits=500 + v.scatter=true: enough kv ranges to avoid a single-range hotspot for the foreground workload, scattered so leases distribute evenly. - v.timeout=5h: full 600k storm + end-of-perturbation rebalance runs for over an hour; the default 3h budget is too tight. asleep=false shares the budget since the rebalance phase dominates either way. - v.tokenReturnTime=1h: real outstanding flow-control tokens drain within seconds of workload cancel, but ValidateTokensReturned's SQL query races against background flow-control activity (replicate queue, lease queue, jobs) at 1-2M replicas and observed transient non-zero diffs at the 10m default. A longer window gives the retry loop a better chance to land on a clean snapshot. - impact=10x, recoveryImpact=3x: during the split storm the cluster is expected to absorb significant throughput loss; the meaningful pass/fail signal is whether it returns to baseline afterwards. Both flavors share these thresholds; the asleep=false gates may be tightened in a follow-up now that the flavor lands at production density. - SkipPostValidations=PostValidationReplicaDivergence: at 1.2M-1.8M ranges the post-test consistency check can't scan a meaningful fraction within the framework's 20m budget (observed: 0 ranges in 20m). The framework tolerates the timeout but logs it as a post-assertion failure; skipping explicitly keeps the artifact clean. addDev now also zeros v.recoveryImpact, v.timeout, and v.tokenReturnTime, all of which splits.setup() sets to non-zero values. Without these clears the dev variant would inherit a 3x recovery throughput gate (too tight for noisy local processes), a 5h test timeout (would hide hangs behind a long wait), and a 1h token-return wait (wildly overshoots what's meaningful on a 5-node local cluster). Test fixtures: - waitForVoterPlacement polls per-range placement against the group's expected store set and nudges non-conforming ranges into the replicate queue via crdb_internal.kv_enqueue_replica (must run on the leaseholder). - waitForVoterPlacement budget defaults to 10 minutes and scales up to v.timeout/3 (capped at 90 minutes) when the test sets a per-test timeout. The 10-minute default suits the dev variant; the lift makes the nightly variant tolerant of slow up-replication. - reportDensity / reportBalance / summarize give per-node and per-store range and lease distribution at every batch boundary, with deterministic min/max tiebreaks so log diffs are stable across runs. Epic: none Release note: None
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
|
Final pre-merge roachtest run completed — both flavors PASSED ✅
Worst-case ratios per phase (all within
p99 is unbounded by design (the split storm temporarily crushes tail latency); the gates are on throughput, which both flavors cleared comfortably. Most striking: |
Replace the legacy
kv/splits/nodes=3/quiesce=*/lease=*tests with anew perturbation test that measures foreground KV workload impact
during sustained range-split activity. The test registers in two
flavors that exercise complementary capacity questions.
Why
The legacy tests answered a binary "did ~300k splits finish on a
3n × 4vcpu cluster within 2h" and repeatedly tripped the known
slow-proposals → circuit-breaker → poisoned-latch cascade at the edge
of the supported replicas-per-node ceiling (#170041, #166202, #169319,
#168374). When they passed they produced no trendable signal; when
they failed it was almost always the same known cascade rather than a
regression. They also ran without a foreground workload, so the
cascade only manifested at extreme density rather than at densities
real workloads would actually encounter.
What
The new
perturbation/{dev,full,metamorphic}/splits/asleep={true,false}test runs the standard perturbation framework: a fixed-rate KV workload
at 50% of max throughput, with the perturbation phase issuing splits
off to the side on a separate set of pinned tables. The output is a
roachperf signal that trends across releases.
Two flavors are registered together:
asleep=truemeasures total replica capacity. Followers fallasleep via store-liveness quiescence, so most of the replica
population costs no per-tick CPU. The default 600k splits on the
standard perturbation cluster (12 nodes × 16 vcpu, RF=3) land at
~9,400 replicas/vcpu — 40% higher than the production telemetry
cluster (crl-prod-38z, ~6,700 replicas/vcpu), so this flavor
stress-tests the primary quiescence path beyond current production
density.
asleep=falsemeasures active-replica capacity, the relevantceiling during outages where store liveness can't withdraw and
followers can't fall asleep. The cluster setting
kv.raft.store_liveness.quiescence.enabled=falsekeeps followersawake; leader leases keep leaseholders awake unconditionally so no
additional knob is needed. The 400k split target lands at ~6,250
replicas/vcpu, matching the telemetry cluster's observed non-quiescent
density. Because awake replicas are far more expensive than quiescent
ones, this is 33% lower than asleep=true's target despite being close
to production density.
Design highlights (long form in the struct-level comment in
splits.go):disjoint 3-store groups on distinct nodes; pin one database per
group via
voter_constraints. With this every split's childreninherit their parent's correct placement, so no SCATTER and no
rebalance churn during the split storm. An earlier node-pinning
version of this test ran ~1 TB of background rebalance traffic
during the perturbation; per-store pinning eliminates that.
group issues splits in parallel against its own pinned table;
descending keys keep every split landing on the leftmost
(single-leader) range rather than chaining new raft groups.
again at end-of-perturbation), explicit
ALTER RANGE … RELOCATE LEASEbrings the lease distribution to perfect balance in onepass per group. Plain
SCATTERstalls at 1.5-3x spread underper-store
voter_constraintsbecause the allocator's lease scorerpicks the same biased target on every re-scatter. Dispatch is
parallelized within a group (concurrency 16) and across groups.
Thresholds
The split storm itself is expected to crush foreground throughput;
the meaningful pass/fail signal is whether the cluster returns to
baseline afterwards. This PR adds a separate
recoveryImpactfieldon
variations(defaults toimpactvia use-site fallback soexisting tests are unchanged) and the splits test sets:
impact(perturbation interval): 10× throughput, p50/p99 disabled.recoveryImpact(recovery interval): 3× throughput, p50/p99 disabled.Both flavors share these thresholds for the initial landing; the
asleep=falsegates may be tightened after the first nightly-equivalentcalibration run if the smaller population produces materially less
impact.
Results from nightly-equivalent runs (
asleep=true)perturbation/full/splits/asleep=trueon 12n × 16vcpu, 600k splits,~1.8M replicas at peak (≈9.4k replicas/vcpu/node):
All gates pass. Splitperf user table holds exactly 25 000 leases on
each of its 3 voter stores per group after the end-of-perturbation
rebalance.
asleep=falsenumbers will be filled in after acalibration run.
Commits
This PR is a small framework cleanup followed by a single test
introduction with both flavors:
roachtest: remove kv/splits testsroachtest: add per-test timeout override for perturbation testsroachtest: add --scatter option to kv workload init for perturbation testsroachtest: split impact threshold between perturbation and recovery intervalsroachtest: add per-test token-return wait override for perturbation testsroachtest: support test name variation suffixes in perturbation frameworkroachtest: add splits perturbation testCommits 2–6 are framework groundwork; commit 7 introduces the splits
test and registers both flavors via the
perturbationNamerinterfaceadded in commit 6.
Depends on the allocator fix in #170377 (merged), which is a hard
prerequisite for the test to converge.
See also: #170041
Epic: none