Skip to content

roachtest: replace kv/splits with splits perturbation test#170090

Merged
trunk-io[bot] merged 7 commits into
cockroachdb:masterfrom
tbg:splits-perturbation
May 27, 2026
Merged

roachtest: replace kv/splits with splits perturbation test#170090
trunk-io[bot] merged 7 commits into
cockroachdb:masterfrom
tbg:splits-perturbation

Conversation

@tbg

@tbg tbg commented May 11, 2026

Copy link
Copy Markdown
Member

Replace the legacy kv/splits/nodes=3/quiesce=*/lease=* tests with a
new perturbation test that measures foreground KV workload impact
during sustained range-split activity. The test registers in two
flavors that exercise complementary capacity questions.

Why

The legacy tests answered a binary "did ~300k splits finish on a
3n × 4vcpu cluster within 2h" and repeatedly tripped the known
slow-proposals → circuit-breaker → poisoned-latch cascade at the edge
of the supported replicas-per-node ceiling (#170041, #166202, #169319,
#168374). When they passed they produced no trendable signal; when
they failed it was almost always the same known cascade rather than a
regression. They also ran without a foreground workload, so the
cascade only manifested at extreme density rather than at densities
real workloads would actually encounter.

What

The new perturbation/{dev,full,metamorphic}/splits/asleep={true,false}
test runs the standard perturbation framework: a fixed-rate KV workload
at 50% of max throughput, with the perturbation phase issuing splits
off to the side on a separate set of pinned tables. The output is a
roachperf signal that trends across releases.

Two flavors are registered together:

  • asleep=true measures total replica capacity. Followers fall
    asleep via store-liveness quiescence, so most of the replica
    population costs no per-tick CPU. The default 600k splits on the
    standard perturbation cluster (12 nodes × 16 vcpu, RF=3) land at
    ~9,400 replicas/vcpu — 40% higher than the production telemetry
    cluster (crl-prod-38z, ~6,700 replicas/vcpu), so this flavor
    stress-tests the primary quiescence path beyond current production
    density.
  • asleep=false measures active-replica capacity, the relevant
    ceiling during outages where store liveness can't withdraw and
    followers can't fall asleep. The cluster setting
    kv.raft.store_liveness.quiescence.enabled=false keeps followers
    awake; leader leases keep leaseholders awake unconditionally so no
    additional knob is needed. The 400k split target lands at ~6,250
    replicas/vcpu, matching the telemetry cluster's observed non-quiescent
    density. Because awake replicas are far more expensive than quiescent
    ones, this is 33% lower than asleep=true's target despite being close
    to production density.

Design highlights (long form in the struct-level comment in
splits.go):

  • Per-store voter pinning. Partition the cluster's stores into
    disjoint 3-store groups on distinct nodes; pin one database per
    group via voter_constraints. With this every split's children
    inherit their parent's correct placement, so no SCATTER and no
    rebalance churn during the split storm. An earlier node-pinning
    version of this test ran ~1 TB of background rebalance traffic
    during the perturbation; per-store pinning eliminates that.
  • Parallel splits per group, descending keys per batch. Each
    group issues splits in parallel against its own pinned table;
    descending keys keep every split landing on the leftmost
    (single-leader) range rather than chaining new raft groups.
  • Deterministic round-robin lease rebalance. After batch 0 (and
    again at end-of-perturbation), explicit ALTER RANGE … RELOCATE LEASE brings the lease distribution to perfect balance in one
    pass per group. Plain SCATTER stalls at 1.5-3x spread under
    per-store voter_constraints because the allocator's lease scorer
    picks the same biased target on every re-scatter. Dispatch is
    parallelized within a group (concurrency 16) and across groups.

Thresholds

The split storm itself is expected to crush foreground throughput;
the meaningful pass/fail signal is whether the cluster returns to
baseline afterwards. This PR adds a separate recoveryImpact field
on variations (defaults to impact via use-site fallback so
existing tests are unchanged) and the splits test sets:

  • impact (perturbation interval): 10× throughput, p50/p99 disabled.
  • recoveryImpact (recovery interval): 3× throughput, p50/p99 disabled.

Both flavors share these thresholds for the initial landing; the
asleep=false gates may be tightened after the first nightly-equivalent
calibration run if the smaller population produces materially less
impact.

Results from nightly-equivalent runs (asleep=true)

perturbation/full/splits/asleep=true on 12n × 16vcpu, 600k splits,
~1.8M replicas at peak (≈9.4k replicas/vcpu/node):

Op Perturbation (limit 10×) Recovery (limit 3×)
write throughput 2.35–3.43× ✅ 1.77–1.98× ✅
read throughput 2.35–3.38× ✅ 1.78–1.97× ✅
follower-read throughput 2.35–3.44× ✅ 1.76–1.97× ✅

All gates pass. Splitperf user table holds exactly 25 000 leases on
each of its 3 voter stores per group after the end-of-perturbation
rebalance. asleep=false numbers will be filled in after a
calibration run.

Commits

This PR is a small framework cleanup followed by a single test
introduction with both flavors:

  1. roachtest: remove kv/splits tests
  2. roachtest: add per-test timeout override for perturbation tests
  3. roachtest: add --scatter option to kv workload init for perturbation tests
  4. roachtest: split impact threshold between perturbation and recovery intervals
  5. roachtest: add per-test token-return wait override for perturbation tests
  6. roachtest: support test name variation suffixes in perturbation framework
  7. roachtest: add splits perturbation test

Commits 2–6 are framework groundwork; commit 7 introduces the splits
test and registers both flavors via the perturbationNamer interface
added in commit 6.

Depends on the allocator fix in #170377 (merged), which is a hard
prerequisite for the test to converge.

See also: #170041
Epic: none

@trunk-io

trunk-io Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

😎 Merged successfully - details.

@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg changed the title roachtest: add splits perturbation test roachtest: replace kv/splits with splits perturbation test May 11, 2026
@tbg tbg force-pushed the splits-perturbation branch 3 times, most recently from 7d94271 to ab0285a Compare May 11, 2026 14:47
@blathers-crl

blathers-crl Bot commented May 11, 2026

Copy link
Copy Markdown

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@tbg tbg force-pushed the splits-perturbation branch from ab0285a to f7661e2 Compare May 11, 2026 15:51
@blathers-crl

blathers-crl Bot commented May 11, 2026

Copy link
Copy Markdown

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@tbg tbg force-pushed the splits-perturbation branch 12 times, most recently from 686c2d2 to 19d2e8e Compare May 18, 2026 12:12
@blathers-crl

blathers-crl Bot commented May 18, 2026

Copy link
Copy Markdown

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg force-pushed the splits-perturbation branch from 7f1d822 to 7134c35 Compare May 19, 2026 05:25
@tbg tbg force-pushed the splits-perturbation branch 3 times, most recently from 9a94da4 to 5275a36 Compare May 20, 2026 13:52
@tbg tbg marked this pull request as ready for review May 20, 2026 16:03
@tbg tbg requested review from a team as code owners May 20, 2026 16:03
@tbg tbg requested a review from miraradeva May 20, 2026 16:03
@tbg tbg force-pushed the splits-perturbation branch 3 times, most recently from 95f0931 to 927a718 Compare May 22, 2026 08:22

@miraradeva miraradeva left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice! It's a bit sad how much code it takes to do waitForVoterPlacement and rebalanceLeases, both things we do in many tests.

Comment thread pkg/cmd/roachtest/tests/perturbation/splits.go
Comment thread pkg/cmd/roachtest/tests/perturbation/splits.go
Comment thread pkg/cmd/roachtest/tests/perturbation/splits.go
Comment thread pkg/cmd/roachtest/tests/perturbation/framework.go
Comment thread pkg/cmd/roachtest/tests/perturbation/framework.go
Comment thread pkg/cmd/roachtest/tests/perturbation/splits.go
Comment thread pkg/cmd/roachtest/tests/perturbation/splits.go Outdated
@tbg tbg force-pushed the splits-perturbation branch from 927a718 to 0d6f6d8 Compare May 27, 2026 05:19
tbg and others added 7 commits May 27, 2026 15:03
A later commit in this PR reframes "how many splits can the cluster
sustain" as a perturbation test that runs concurrently with a
foreground KV workload and reports impact in roachperf-comparable
terms (latency ratio, throughput ratio, replicas/vcpu/node density).
The legacy kv/splits/nodes=3/quiesce=*/lease=* tests answered a
binary "did splits finish in 2h", repeatedly tripped the same known
cascade (#170041, #166202, #169319, #168374), and produced no
trendable signal when they passed. Drop them in favor of
perturbation/full/splits.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Some perturbations need a wall-clock budget beyond the registry default
(e.g. tests that pile up cluster-wide work that has to drain
afterwards). Add an optional v.timeout field and a v.applyTimeout
helper that addFull, addDev, and addMetamorphic all use to copy it
into their TestSpec.

No behavior change for existing tests — they leave v.timeout at zero
and continue to get the registry default.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…tests

The kv workload's `init` subcommand supports a --scatter flag that
scatters the table immediately after splitting; this gives the
allocator a chance to spread leaseholders across the cluster while
queues are idle, instead of leaving every split's children with the
parent's leaseholder. Add an opt-in v.scatter field that initWorkload
passes through. Off by default so existing tests are unaffected.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…ntervals

Some perturbations are expected to crush foreground throughput while
they run, with the meaningful pass/fail signal being whether the
cluster returns to baseline afterwards. Letting one threshold gate
both intervals forces a choice between flaky-during-perturbation and
permissive-during-recovery.

Add recoveryImpact alongside impact. Zero value falls back to impact
at the use site (no eager copy), so tests that don't set it
independently continue to use a single threshold for both intervals.
Perturbations that need a permissive in-flight threshold can set the
two independently.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…ests

Some perturbations leave non-trivial kvadmission flow-control churn at
the end of the workload phase that takes longer than the 10-minute
default to fully drain to zero in the post-test ValidateTokensReturned
check. A previous workaround (raise to 1h when v.diskBandwidthLimit is
set) covers one such case. Generalize that to a per-test
v.tokenReturnTime field that any perturbation can set; the existing
defaults remain in place when the field is zero.

No behavior change for existing tests.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…work

Add an optional perturbationNamer interface a perturbation may implement
to append a variation suffix to its test name. With this a single struct
type can register multiple test entries that differ in configuration; a
later commit uses it to register the splits test in two flavors (followers
asleep vs. forced awake) under distinct names like

  perturbation/full/splits/asleep=true
  perturbation/full/splits/asleep=false

without resorting to one struct per variant.

No behavior change for existing perturbations — none of them implement
the interface, so perturbationName() falls through to the reflection-based
type name as before.

Epic: none
Release note: None
Reframe "how many splits can the cluster sustain" as a perturbation
test that runs concurrently with a foreground KV workload and reports
impact in roachperf-comparable terms (throughput ratio, density in
replicas/vcpu/node). Supersedes the legacy kv/splits/*/quiesce=*/lease=*
binary tests dropped earlier in this PR.

Two flavors are registered together: asleep=true measures total replica
capacity (followers park via store liveness quiescence), asleep=false
measures active-replica capacity (every replica costs full per-tick
CPU, the relevant ceiling during outages where followers can't sleep).
asleep=true targets ~9,400 replicas/vcpu (40% over the production
telemetry cluster's ~6,700 replicas/vcpu non-quiescent density) to
stress-test the primary quiescence path beyond current production
density; asleep=false targets ~6,250 replicas/vcpu to match the
telemetry cluster's observed non-quiescent density. asleep=false also
disables store-liveness follower sleep via the
kv.raft.store_liveness.quiescence.enabled cluster setting; leader
leases keep leaseholders awake unconditionally so no additional knob
is needed.

Design (see the struct-level comment in splits.go for the long form):

- Partition the cluster's stores into disjoint groups of three on
  distinct nodes; pin one database per group with voter_constraints
  that one-to-one map each voter to a store. With this every split's
  children inherit their parent's correct placement, so no SCATTER and
  no rebalance churn during the split storm.
- Issue splits per group in parallel batches; descending keys per batch
  to keep each split landing on the leftmost (stable, single-leader)
  range rather than chaining new raft groups together.
- After batch 0 (and again at end-of-perturbation), do an explicit
  deterministic round-robin lease rebalance via ALTER RANGE RELOCATE
  LEASE. Plain SCATTER stalls at 1.5-3x spread under per-store
  voter_constraints because the allocator's lease scorer picks the
  same biased target on every re-scatter. Round-robin converges in one
  pass; dispatch is parallelized within a group (concurrency 16) and
  across groups (one task per group).
- Per-call ALTER RANGE failures are tolerated up to 10% (catches
  systemic problems like a node down while shrugging off transient
  range-merge races); a 100%-failure floor catches small-batch
  pathologies that the percentage gate would skip. First error is
  surfaced in the failure message and in a warn log when any failure
  happens below the threshold.

Knobs in splits.setup():
- asleep=true: targetSplits=600k, batchSize=10k (full); the optional
  devSizer / metamorphicSizer interfaces (added in this commit) let
  the dev and metamorphic variants shrink to 4k and 60k respectively.
- asleep=false: targetSplits=400k, batchSize=5k (full); 40k metamorphic;
  shares the 4k dev with asleep=true.
- v.splits=500 + v.scatter=true: enough kv ranges to avoid a
  single-range hotspot for the foreground workload, scattered so
  leases distribute evenly.
- v.timeout=5h: full 600k storm + end-of-perturbation rebalance runs
  for over an hour; the default 3h budget is too tight. asleep=false
  shares the budget since the rebalance phase dominates either way.
- v.tokenReturnTime=1h: real outstanding flow-control tokens drain
  within seconds of workload cancel, but ValidateTokensReturned's SQL
  query races against background flow-control activity (replicate
  queue, lease queue, jobs) at 1-2M replicas and observed transient
  non-zero diffs at the 10m default. A longer window gives the retry
  loop a better chance to land on a clean snapshot.
- impact=10x, recoveryImpact=3x: during the split storm the cluster is
  expected to absorb significant throughput loss; the meaningful
  pass/fail signal is whether it returns to baseline afterwards. Both
  flavors share these thresholds; the asleep=false gates may be
  tightened in a follow-up now that the flavor lands at production
  density.
- SkipPostValidations=PostValidationReplicaDivergence: at 1.2M-1.8M
  ranges the post-test consistency check can't scan a meaningful
  fraction within the framework's 20m budget (observed: 0 ranges in
  20m). The framework tolerates the timeout but logs it as a
  post-assertion failure; skipping explicitly keeps the artifact clean.

addDev now also zeros v.recoveryImpact, v.timeout, and v.tokenReturnTime,
all of which splits.setup() sets to non-zero values. Without these
clears the dev variant would inherit a 3x recovery throughput gate
(too tight for noisy local processes), a 5h test timeout (would hide
hangs behind a long wait), and a 1h token-return wait (wildly
overshoots what's meaningful on a 5-node local cluster).

Test fixtures:
- waitForVoterPlacement polls per-range placement against the group's
  expected store set and nudges non-conforming ranges into the
  replicate queue via crdb_internal.kv_enqueue_replica (must run on
  the leaseholder).
- waitForVoterPlacement budget defaults to 10 minutes and scales up to
  v.timeout/3 (capped at 90 minutes) when the test sets a per-test
  timeout. The 10-minute default suits the dev variant; the lift makes
  the nightly variant tolerant of slow up-replication.
- reportDensity / reportBalance / summarize give per-node and per-store
  range and lease distribution at every batch boundary, with
  deterministic min/max tiebreaks so log diffs are stable across runs.

Epic: none
Release note: None
@tbg tbg force-pushed the splits-perturbation branch from 835ed25 to 9314462 Compare May 27, 2026 13:25
@blathers-crl

blathers-crl Bot commented May 27, 2026

Copy link
Copy Markdown

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@tbg

tbg commented May 27, 2026

Copy link
Copy Markdown
Member Author

Final pre-merge roachtest run completed — both flavors PASSED

Worst-case ratios per phase (all within 10× perturbation / recovery throughput limits):

flavor phase follower-read p99 / tput read p99 / tput write p99 / tput
asleep=true perturbation 242× / 3.27× 273× / 3.30× 33× / 3.28×
asleep=true recovery 103× / 1.84× 121× / 1.86× 15× / 1.85×
asleep=false perturbation 130× / 1.82× 122× / 1.83× 21× / 1.83×
asleep=false recovery 69× / 1.37× 69× / 1.35× 12× / 1.37×

p99 is unbounded by design (the split storm temporarily crushes tail latency); the gates are on throughput, which both flavors cleared comfortably. Most striking: asleep=true carries higher throughput impact (3.3× vs 1.8×) because of the much larger split count (600k vs 400k), but recovers to the same ~1.8× recovery throughput as asleep=false's perturbation impact.

@trunk-io trunk-io Bot merged commit 2cd752f into cockroachdb:master May 27, 2026
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants