roachtest: replace kv/splits with splits perturbation test by tbg · Pull Request #170090 · cockroachdb/cockroach

tbg · 2026-05-11T07:37:23Z

Replace the legacy kv/splits/nodes=3/quiesce=*/lease=* tests with a
new perturbation test that measures foreground KV workload impact
during sustained range-split activity. The test registers in two
flavors that exercise complementary capacity questions.

Why

The legacy tests answered a binary "did ~300k splits finish on a
3n × 4vcpu cluster within 2h" and repeatedly tripped the known
slow-proposals → circuit-breaker → poisoned-latch cascade at the edge
of the supported replicas-per-node ceiling (#170041, #166202, #169319,
#168374). When they passed they produced no trendable signal; when
they failed it was almost always the same known cascade rather than a
regression. They also ran without a foreground workload, so the
cascade only manifested at extreme density rather than at densities
real workloads would actually encounter.

What

The new perturbation/{dev,full,metamorphic}/splits/asleep={true,false}
test runs the standard perturbation framework: a fixed-rate KV workload
at 50% of max throughput, with the perturbation phase issuing splits
off to the side on a separate set of pinned tables. The output is a
roachperf signal that trends across releases.

Two flavors are registered together:

asleep=true measures total replica capacity. Followers fall
asleep via store-liveness quiescence, so most of the replica
population costs no per-tick CPU. The default 600k splits on the
standard perturbation cluster (12 nodes × 16 vcpu, RF=3) land at
~9,400 replicas/vcpu — 40% higher than the production telemetry
cluster (crl-prod-38z, ~6,700 replicas/vcpu), so this flavor
stress-tests the primary quiescence path beyond current production
density.
asleep=false measures active-replica capacity, the relevant
ceiling during outages where store liveness can't withdraw and
followers can't fall asleep. The cluster setting
kv.raft.store_liveness.quiescence.enabled=false keeps followers
awake; leader leases keep leaseholders awake unconditionally so no
additional knob is needed. The 400k split target lands at ~6,250
replicas/vcpu, matching the telemetry cluster's observed non-quiescent
density. Because awake replicas are far more expensive than quiescent
ones, this is 33% lower than asleep=true's target despite being close
to production density.

Design highlights (long form in the struct-level comment in
splits.go):

Per-store voter pinning. Partition the cluster's stores into
disjoint 3-store groups on distinct nodes; pin one database per
group via voter_constraints. With this every split's children
inherit their parent's correct placement, so no SCATTER and no
rebalance churn during the split storm. An earlier node-pinning
version of this test ran ~1 TB of background rebalance traffic
during the perturbation; per-store pinning eliminates that.
Parallel splits per group, descending keys per batch. Each
group issues splits in parallel against its own pinned table;
descending keys keep every split landing on the leftmost
(single-leader) range rather than chaining new raft groups.
Deterministic round-robin lease rebalance. After batch 0 (and
again at end-of-perturbation), explicit ALTER RANGE … RELOCATE LEASE brings the lease distribution to perfect balance in one
pass per group. Plain SCATTER stalls at 1.5-3x spread under
per-store voter_constraints because the allocator's lease scorer
picks the same biased target on every re-scatter. Dispatch is
parallelized within a group (concurrency 16) and across groups.

Thresholds

The split storm itself is expected to crush foreground throughput;
the meaningful pass/fail signal is whether the cluster returns to
baseline afterwards. This PR adds a separate recoveryImpact field
on variations (defaults to impact via use-site fallback so
existing tests are unchanged) and the splits test sets:

impact (perturbation interval): 10× throughput, p50/p99 disabled.
recoveryImpact (recovery interval): 3× throughput, p50/p99 disabled.

Both flavors share these thresholds for the initial landing; the
asleep=false gates may be tightened after the first nightly-equivalent
calibration run if the smaller population produces materially less
impact.

Results from nightly-equivalent runs (`asleep=true`)

perturbation/full/splits/asleep=true on 12n × 16vcpu, 600k splits,
~1.8M replicas at peak (≈9.4k replicas/vcpu/node):

Op	Perturbation (limit 10×)	Recovery (limit 3×)
write throughput	2.35–3.43× ✅	1.77–1.98× ✅
read throughput	2.35–3.38× ✅	1.78–1.97× ✅
follower-read throughput	2.35–3.44× ✅	1.76–1.97× ✅

All gates pass. Splitperf user table holds exactly 25 000 leases on
each of its 3 voter stores per group after the end-of-perturbation
rebalance. asleep=false numbers will be filled in after a
calibration run.

Commits

This PR is a small framework cleanup followed by a single test
introduction with both flavors:

roachtest: remove kv/splits tests
roachtest: add per-test timeout override for perturbation tests
roachtest: add --scatter option to kv workload init for perturbation tests
roachtest: split impact threshold between perturbation and recovery intervals
roachtest: add per-test token-return wait override for perturbation tests
roachtest: support test name variation suffixes in perturbation framework
roachtest: add splits perturbation test

Commits 2–6 are framework groundwork; commit 7 introduces the splits
test and registers both flavors via the perturbationNamer interface
added in commit 6.

Depends on the allocator fix in #170377 (merged), which is a hard
prerequisite for the test to converge.

See also: #170041
Epic: none

trunk-io · 2026-05-11T07:37:28Z

😎 Merged successfully - details.

cockroach-teamcity · 2026-05-11T07:37:37Z

This change is

blathers-crl · 2026-05-11T15:22:32Z

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

blathers-crl · 2026-05-11T16:16:46Z

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

blathers-crl · 2026-05-18T12:12:43Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

miraradeva

Looks really nice! It's a bit sad how much code it takes to do waitForVoterPlacement and rebalanceLeases, both things we do in many tests.

A later commit in this PR reframes "how many splits can the cluster sustain" as a perturbation test that runs concurrently with a foreground KV workload and reports impact in roachperf-comparable terms (latency ratio, throughput ratio, replicas/vcpu/node density). The legacy kv/splits/nodes=3/quiesce=*/lease=* tests answered a binary "did splits finish in 2h", repeatedly tripped the same known cascade (#170041, #166202, #169319, #168374), and produced no trendable signal when they passed. Drop them in favor of perturbation/full/splits. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Some perturbations need a wall-clock budget beyond the registry default (e.g. tests that pile up cluster-wide work that has to drain afterwards). Add an optional v.timeout field and a v.applyTimeout helper that addFull, addDev, and addMetamorphic all use to copy it into their TestSpec. No behavior change for existing tests — they leave v.timeout at zero and continue to get the registry default. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…tests The kv workload's `init` subcommand supports a --scatter flag that scatters the table immediately after splitting; this gives the allocator a chance to spread leaseholders across the cluster while queues are idle, instead of leaving every split's children with the parent's leaseholder. Add an opt-in v.scatter field that initWorkload passes through. Off by default so existing tests are unaffected. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…ntervals Some perturbations are expected to crush foreground throughput while they run, with the meaningful pass/fail signal being whether the cluster returns to baseline afterwards. Letting one threshold gate both intervals forces a choice between flaky-during-perturbation and permissive-during-recovery. Add recoveryImpact alongside impact. Zero value falls back to impact at the use site (no eager copy), so tests that don't set it independently continue to use a single threshold for both intervals. Perturbations that need a permissive in-flight threshold can set the two independently. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…ests Some perturbations leave non-trivial kvadmission flow-control churn at the end of the workload phase that takes longer than the 10-minute default to fully drain to zero in the post-test ValidateTokensReturned check. A previous workaround (raise to 1h when v.diskBandwidthLimit is set) covers one such case. Generalize that to a per-test v.tokenReturnTime field that any perturbation can set; the existing defaults remain in place when the field is zero. No behavior change for existing tests. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…work Add an optional perturbationNamer interface a perturbation may implement to append a variation suffix to its test name. With this a single struct type can register multiple test entries that differ in configuration; a later commit uses it to register the splits test in two flavors (followers asleep vs. forced awake) under distinct names like perturbation/full/splits/asleep=true perturbation/full/splits/asleep=false without resorting to one struct per variant. No behavior change for existing perturbations — none of them implement the interface, so perturbationName() falls through to the reflection-based type name as before. Epic: none Release note: None

Reframe "how many splits can the cluster sustain" as a perturbation test that runs concurrently with a foreground KV workload and reports impact in roachperf-comparable terms (throughput ratio, density in replicas/vcpu/node). Supersedes the legacy kv/splits/*/quiesce=*/lease=* binary tests dropped earlier in this PR. Two flavors are registered together: asleep=true measures total replica capacity (followers park via store liveness quiescence), asleep=false measures active-replica capacity (every replica costs full per-tick CPU, the relevant ceiling during outages where followers can't sleep). asleep=true targets ~9,400 replicas/vcpu (40% over the production telemetry cluster's ~6,700 replicas/vcpu non-quiescent density) to stress-test the primary quiescence path beyond current production density; asleep=false targets ~6,250 replicas/vcpu to match the telemetry cluster's observed non-quiescent density. asleep=false also disables store-liveness follower sleep via the kv.raft.store_liveness.quiescence.enabled cluster setting; leader leases keep leaseholders awake unconditionally so no additional knob is needed. Design (see the struct-level comment in splits.go for the long form): - Partition the cluster's stores into disjoint groups of three on distinct nodes; pin one database per group with voter_constraints that one-to-one map each voter to a store. With this every split's children inherit their parent's correct placement, so no SCATTER and no rebalance churn during the split storm. - Issue splits per group in parallel batches; descending keys per batch to keep each split landing on the leftmost (stable, single-leader) range rather than chaining new raft groups together. - After batch 0 (and again at end-of-perturbation), do an explicit deterministic round-robin lease rebalance via ALTER RANGE RELOCATE LEASE. Plain SCATTER stalls at 1.5-3x spread under per-store voter_constraints because the allocator's lease scorer picks the same biased target on every re-scatter. Round-robin converges in one pass; dispatch is parallelized within a group (concurrency 16) and across groups (one task per group). - Per-call ALTER RANGE failures are tolerated up to 10% (catches systemic problems like a node down while shrugging off transient range-merge races); a 100%-failure floor catches small-batch pathologies that the percentage gate would skip. First error is surfaced in the failure message and in a warn log when any failure happens below the threshold. Knobs in splits.setup(): - asleep=true: targetSplits=600k, batchSize=10k (full); the optional devSizer / metamorphicSizer interfaces (added in this commit) let the dev and metamorphic variants shrink to 4k and 60k respectively. - asleep=false: targetSplits=400k, batchSize=5k (full); 40k metamorphic; shares the 4k dev with asleep=true. - v.splits=500 + v.scatter=true: enough kv ranges to avoid a single-range hotspot for the foreground workload, scattered so leases distribute evenly. - v.timeout=5h: full 600k storm + end-of-perturbation rebalance runs for over an hour; the default 3h budget is too tight. asleep=false shares the budget since the rebalance phase dominates either way. - v.tokenReturnTime=1h: real outstanding flow-control tokens drain within seconds of workload cancel, but ValidateTokensReturned's SQL query races against background flow-control activity (replicate queue, lease queue, jobs) at 1-2M replicas and observed transient non-zero diffs at the 10m default. A longer window gives the retry loop a better chance to land on a clean snapshot. - impact=10x, recoveryImpact=3x: during the split storm the cluster is expected to absorb significant throughput loss; the meaningful pass/fail signal is whether it returns to baseline afterwards. Both flavors share these thresholds; the asleep=false gates may be tightened in a follow-up now that the flavor lands at production density. - SkipPostValidations=PostValidationReplicaDivergence: at 1.2M-1.8M ranges the post-test consistency check can't scan a meaningful fraction within the framework's 20m budget (observed: 0 ranges in 20m). The framework tolerates the timeout but logs it as a post-assertion failure; skipping explicitly keeps the artifact clean. addDev now also zeros v.recoveryImpact, v.timeout, and v.tokenReturnTime, all of which splits.setup() sets to non-zero values. Without these clears the dev variant would inherit a 3x recovery throughput gate (too tight for noisy local processes), a 5h test timeout (would hide hangs behind a long wait), and a 1h token-return wait (wildly overshoots what's meaningful on a 5-node local cluster). Test fixtures: - waitForVoterPlacement polls per-range placement against the group's expected store set and nudges non-conforming ranges into the replicate queue via crdb_internal.kv_enqueue_replica (must run on the leaseholder). - waitForVoterPlacement budget defaults to 10 minutes and scales up to v.timeout/3 (capped at 90 minutes) when the test sets a per-test timeout. The 10-minute default suits the dev variant; the lift makes the nightly variant tolerant of slow up-replication. - reportDensity / reportBalance / summarize give per-node and per-store range and lease distribution at every batch boundary, with deterministic min/max tiebreaks so log diffs are stable across runs. Epic: none Release note: None

blathers-crl · 2026-05-27T13:56:26Z

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

tbg · 2026-05-27T14:38:41Z

Final pre-merge roachtest run completed — both flavors PASSED ✅

Build: https://teamcity.cockroachdb.com/viewLog.html?buildId=21361997&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel (3h 17min total)
perturbation/full/splits/asleep=true ✅ (2h 14min)
perturbation/full/splits/asleep=false ✅ (1h 3min)

Worst-case ratios per phase (all within 10× perturbation / 3× recovery throughput limits):

flavor	phase	follower-read p99 / tput	read p99 / tput	write p99 / tput
`asleep=true`	perturbation	242× / 3.27×	273× / 3.30×	33× / 3.28×
`asleep=true`	recovery	103× / 1.84×	121× / 1.86×	15× / 1.85×
`asleep=false`	perturbation	130× / 1.82×	122× / 1.83×	21× / 1.83×
`asleep=false`	recovery	69× / 1.37×	69× / 1.35×	12× / 1.37×

p99 is unbounded by design (the split storm temporarily crushes tail latency); the gates are on throughput, which both flavors cleared comfortably. Most striking: asleep=true carries higher throughput impact (3.3× vs 1.8×) because of the much larger split count (600k vs 400k), but recovers to the same ~1.8× recovery throughput as asleep=false's perturbation impact.

tbg changed the title ~~roachtest: add splits perturbation test~~ roachtest: replace kv/splits with splits perturbation test May 11, 2026

tbg force-pushed the splits-perturbation branch 3 times, most recently from 7d94271 to ab0285a Compare May 11, 2026 14:47

tbg force-pushed the splits-perturbation branch from ab0285a to f7661e2 Compare May 11, 2026 15:51

tbg force-pushed the splits-perturbation branch 12 times, most recently from 686c2d2 to 19d2e8e Compare May 18, 2026 12:12

tbg force-pushed the splits-perturbation branch from 7f1d822 to 7134c35 Compare May 19, 2026 05:25

tbg mentioned this pull request May 19, 2026

.claude: add run-roachtest project skill #170188

Merged

tbg force-pushed the splits-perturbation branch 3 times, most recently from 9a94da4 to 5275a36 Compare May 20, 2026 13:52

tbg marked this pull request as ready for review May 20, 2026 16:03

tbg requested review from a team as code owners May 20, 2026 16:03

tbg requested a review from miraradeva May 20, 2026 16:03

tbg force-pushed the splits-perturbation branch 3 times, most recently from 95f0931 to 927a718 Compare May 22, 2026 08:22

miraradeva approved these changes May 22, 2026

View reviewed changes

tbg force-pushed the splits-perturbation branch from 927a718 to 0d6f6d8 Compare May 27, 2026 05:19

tbg and others added 7 commits May 27, 2026 15:03

tbg force-pushed the splits-perturbation branch from 835ed25 to 9314462 Compare May 27, 2026 13:25

trunk-io Bot merged commit 2cd752f into cockroachdb:master May 27, 2026
36 of 37 checks passed

celeste-cockroachdb Bot added the target-release-26.3.0 label May 27, 2026

tbg mentioned this pull request May 27, 2026

roachtest: re-enable perturbation/full tests #170724

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

roachtest: replace kv/splits with splits perturbation test#170090

roachtest: replace kv/splits with splits perturbation test#170090
trunk-io[bot] merged 7 commits into
cockroachdb:masterfrom
tbg:splits-perturbation

tbg commented May 11, 2026 •

edited

Loading

Uh oh!

trunk-io Bot commented May 11, 2026 •

edited

Loading

Uh oh!

cockroach-teamcity commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 18, 2026

Uh oh!

miraradeva left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blathers-crl Bot commented May 27, 2026

Uh oh!

tbg commented May 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

tbg commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Thresholds

Results from nightly-equivalent runs (asleep=true)

Commits

Uh oh!

trunk-io Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 11, 2026

Uh oh!

blathers-crl Bot commented May 18, 2026

Uh oh!

miraradeva left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blathers-crl Bot commented May 27, 2026

Uh oh!

tbg commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tbg commented May 11, 2026 •

edited

Loading

Results from nightly-equivalent runs (`asleep=true`)

trunk-io Bot commented May 11, 2026 •

edited

Loading

tbg commented May 27, 2026 •

edited

Loading