roachtest: re-enable perturbation/full tests by tbg · Pull Request #170724 · cockroachdb/cockroach

tbg · 2026-05-21T12:28:47Z

Context

The perturbation/full/* tests measure the throughput/latency impact of various perturbations (network partition, node restart, decommission, slow disk, etc.) on a CockroachDB cluster. They were valuable for catching latency regressions, but were also persistently flaky.

In July 2025 (#149662), most of them were skipped on a "stabilize one at a time" basis. As of master, only restart, backup, and backfill are running; the rest (intents, decommission, elasticWorkload, partition, slowDisk, addNode) are skipped.

What this PR does

This PR re-enables all six skipped tests and tunes their pass/fail thresholds.

Threshold history

At the time of the skip, each test had its own latency-impact threshold (single float, ratio of post-perturbation latency to baseline):

Test	Pre-skip threshold
partition	`Inf` (no gate)
slowDisk	`Inf` (no gate)
restart	`Inf` (no gate)
backfill	`Inf` (no gate)
intents	`40x` latency
decommission	`5x` latency
addNode	`5x` latency
backup	`5x` latency

The pattern: perturbations that intentionally break the cluster (partition, slow disk, restart, backfill) weren't gated at all. Perturbations with expected mild impact had 5x-40x latency thresholds.

What changed in the harness

Since the skip, the perturbation framework has been substantially rewritten:

Throughput-based scoring (roachtest/perturbation: simplify scoring to impact ratios #167696): impact is now measured as a throughput ratio (p5 of per-second throughput vs baseline) rather than blended latency. p99/p50 latency gates exist but are off by default — too noisy at current sample sizes.
Default lenient threshold (roachtest/perturbation: enforce 0.8x throughput floor #169659): all tests now share defaultThresholds() with maxThroughputImpact: 1.25 (fail if throughput drops below 80% of baseline).
Separate perturbation vs recovery thresholds (roachtest: replace kv/splits with splits perturbation test #170090): tests can now gate the perturbation interval and the recovery interval independently via recoveryImpact. Some perturbations are expected to crush foreground throughput while they run; the meaningful pass/fail signal is whether the cluster returns to baseline afterwards.

Per-test thresholds in this PR

For each test, why we picked what we picked:

Test	Perturbation	Recovery	Rationale
partition	`Inf`	`1.25x`	Isolating an entire region removes 1/3 of leaseholders. Foreground throughput drops sharply during the partition (observed ~50% drop, ratio ~2x). The meaningful signal is whether the cluster recovers once the partition heals. This matches the pre-skip behavior, where partition was never gated during the perturbation.
slowDisk	`1.25x`	`1.25x`	With `walFailover=true` and 2 disks per node (the full variant's default), raft log writes fail over to the non-throttled store, so foreground throughput stays close to baseline. The 1.25x floor mainly absorbs noise from the `slowLiveness` leg, which routes liveness heartbeats through the slow disk.
restart	`1.25x`	`1.25x`	Clean restart (`cleanRestart=true`, SIGTERM) gracefully drains leases off the target node before stopping. The cluster spends the 10-minute downtime serving traffic from the other nodes with no missing leases. Throughput should stay near baseline.
backfill	`1.25x`	`1.25x`	Index backfill writes a lot but uses elastic admission control; foreground throughput shouldn't drop meaningfully.
intents	`1.25x`	`1.25x`	Builds up a large intent backlog then rolls back. Some cluster pressure during cleanup, but not enough to crush throughput.
decommission	`1.25x`	`1.25x`	Graceful, AC-aware data movement. Doesn't crush throughput.
addNode	`1.25x`	`1.25x`	Triggers rebalancing; doesn't crush throughput.
backup	`1.25x`	`1.25x`	Uses elastic priority via AC.
elasticWorkload	`1.25x`	`1.25x`	The perturbation IS an elastic workload — AC keeps foreground unaffected.

Only partition needs the asymmetric thresholds; the rest use defaultThresholds() for both intervals.
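The asymmetric gating can be sketched as follows. The names `defaultThresholds()`, `noImpactThresholds()`, `maxThroughputImpact`, and `recoveryImpact` come from this PR; the struct layout, the `perturbationImpact` field, `thresholdsFor`, and `check` are hypothetical and only illustrate the idea, not the harness's actual types.

```go
package main

import (
	"fmt"
	"math"
)

// impactThresholds is a hypothetical stand-in for the harness's threshold type.
type impactThresholds struct {
	maxThroughputImpact float64
}

// defaultThresholds mirrors the lenient default: fail above a 1.25x ratio.
func defaultThresholds() impactThresholds {
	return impactThresholds{maxThroughputImpact: 1.25}
}

// noImpactThresholds leaves the interval ungated (Inf in the table above).
func noImpactThresholds() impactThresholds {
	return impactThresholds{maxThroughputImpact: math.Inf(1)}
}

// testThresholds pairs the gate for the perturbation interval with the gate
// for the recovery interval. The layout is assumed for this sketch.
type testThresholds struct {
	perturbationImpact impactThresholds
	recoveryImpact     impactThresholds
}

// thresholdsFor returns the thresholds from the table above: only partition
// is ungated while the perturbation runs.
func thresholdsFor(test string) testThresholds {
	if test == "partition" {
		return testThresholds{
			perturbationImpact: noImpactThresholds(),
			recoveryImpact:     defaultThresholds(),
		}
	}
	return testThresholds{
		perturbationImpact: defaultThresholds(),
		recoveryImpact:     defaultThresholds(),
	}
}

// check returns an error naming the first interval whose ratio exceeds its gate.
func (t testThresholds) check(perturbation, recovery float64) error {
	if limit := t.perturbationImpact.maxThroughputImpact; perturbation > limit {
		return fmt.Errorf("perturbation interval: %.2fx > %.2fx", perturbation, limit)
	}
	if limit := t.recoveryImpact.maxThroughputImpact; recovery > limit {
		return fmt.Errorf("recovery interval: %.2fx > %.2fx", recovery, limit)
	}
	return nil
}

func main() {
	// A ~2x drop during the partition is tolerated as long as recovery is within 1.25x.
	fmt.Println(thresholdsFor("partition").check(1.99, 1.15))
	// The same drop under symmetric defaults fails the perturbation gate.
	fmt.Println(thresholdsFor("slowDisk").check(1.99, 1.00))
}
```

Representing "no gate" as positive infinity keeps the comparison uniform across intervals, since no finite ratio exceeds it.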

Validation

See the run table in the comment thread for results from running each test on TeamCity. With USE_SPOT=never (now the trigger script default per #171008), all six previously-skipped tests pass under the chosen thresholds.

Resolves: #149662
Epic: none

Release note: None

trunk-io · 2026-05-21T12:28:52Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.


tbg · 2026-05-22T08:09:46Z

TeamCity Roachtest Runs

Test	Attempt 1	Attempt 2	Attempt 3 (USE_SPOT=never)	Worst tput impact
perturbation/full/intents	21357048 ✅ Passed	—	—	1.00x
perturbation/full/decommission	21357049 ⚠️ VM preempted	21360396 ⚠️ VM preempted (4 VMs in us-east1-c)	21361625 ✅ Passed	1.00x
perturbation/full/elasticWorkload	21357050 ⚠️ VM preempted	21360398 ⚠️ VM preempted (throughput→0 from preempted node)	21361632 ✅ Passed	1.05x
perturbation/full/partition	21357051 ❌ Throughput drop (1.99x > 1.25x)	21360399 ⚠️ VM preempted (failed to get max rate, context canceled)	21361633 ✅ Passed	1.15x ⚠️ (recovery interval, near 1.2x limit)
perturbation/full/slowDisk	21357052 ❌ GCE stockout (us-east1-c)	21360400 ❌ GCE stockout (us-east1-c, even with suggested zones available)	21361805 ✅ Passed	1.00x
perturbation/full/addNode	21357053 ⚠️ VM preempted	21360401 ❓ No test results	21361634 ✅ Passed	1.00x

All tests use the default lenient thresholds (maxThroughputImpact: 1.25, p99/p50 disabled). Worst tput impact is the max ratio across all operations (follower-read / read / write) and both intervals (perturbation / recovery) — lower = less impact, 1.00x = no impact.
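The "Worst tput impact" column can be read as a simple maximum over every (operation, interval) pair. A minimal sketch follows, assuming a hypothetical `measurement` type and made-up per-operation numbers; only the operation and interval names are taken from the description above.

```go
package main

import "fmt"

// measurement is a hypothetical record of one impact ratio.
type measurement struct {
	op       string // "follower-read", "read", or "write"
	interval string // "perturbation" or "recovery"
	impact   float64
}

// worstImpact returns the measurement with the largest impact ratio.
// The boolean is false when there are no measurements.
func worstImpact(ms []measurement) (measurement, bool) {
	if len(ms) == 0 {
		return measurement{}, false
	}
	worst := ms[0]
	for _, m := range ms[1:] {
		if m.impact > worst.impact {
			worst = m
		}
	}
	return worst, true
}

func main() {
	// Illustrative values only; these are not the numbers from the runs above.
	ms := []measurement{
		{"follower-read", "perturbation", 1.00},
		{"read", "perturbation", 1.02},
		{"write", "perturbation", 1.05},
		{"follower-read", "recovery", 1.00},
		{"read", "recovery", 1.01},
		{"write", "recovery", 1.15},
	}
	if w, ok := worstImpact(ms); ok {
		fmt.Printf("worst: %.2fx (%s, %s interval)\n", w.impact, w.op, w.interval)
	}
}
```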

Summary

✅ 6/6 tests passed on attempt 3 (with USE_SPOT=never)

⚠️ Notable: partition test passed but with a 1.15x throughput impact in the recovery interval, leaving very little headroom under the 1.2x display threshold (actual code limit is 1.25x). This test also failed attempt 1 with a 1.99x ratio. The default lenient threshold may need revisiting for partition specifically — it's borderline.

Attempt 3 changes

Attempts 1 and 2 used the nightly default --use-spot=auto, which runs the first iteration on spot VMs (preemptible). With COUNT=1, every preemption surfaced as a failure. Attempt 3 explicitly sets USE_SPOT=never to use regular (non-preemptible) VMs. See #171008 for the trigger script change that makes this the default going forward.

Classification basis

Preemption labels come from the test framework's explicit signals (not inference):

Header line: VMs preempted during the test run: teamcity-...-NNNN (us-east1-X)
Runner message: (test_runner.go:2732).func1: monitorForPreemptedVMs detected VM Preemptions: ...

When a preemption signal is present, downstream errors like COMMAND_PROBLEM, throughput→0, failed to get cluster max rate, or context canceled are caused by the preemption, not independent failures.

"GCE stockout" labels (slowDisk attempts 1-2): the cluster creation itself failed with ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS — no preemption, no test, just no VMs available in us-east1-c.
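The classification rule above amounts to checking for explicit markers before looking at any downstream error. The sketch below applies it to raw log text with plain substring matching. The marker strings are the ones quoted in this comment; the `classify` function and the label type are hypothetical, and real triage would likely read structured test output rather than grepping logs.

```go
package main

import (
	"fmt"
	"strings"
)

// failureClass is a hypothetical label for a failed attempt.
type failureClass string

const (
	classPreempted   failureClass = "VM preempted"
	classStockout    failureClass = "GCE stockout"
	classTestFailure failureClass = "test failure"
)

// classify checks for explicit infrastructure signals first. When a
// preemption marker is present, downstream errors such as COMMAND_PROBLEM
// or "context canceled" are attributed to the preemption rather than
// counted as independent failures.
func classify(log string) failureClass {
	switch {
	case strings.Contains(log, "VMs preempted during the test run:"),
		strings.Contains(log, "monitorForPreemptedVMs detected VM Preemptions"):
		return classPreempted
	case strings.Contains(log, "ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"):
		return classStockout
	default:
		return classTestFailure
	}
}

func main() {
	logs := []string{
		"monitorForPreemptedVMs detected VM Preemptions: ...\nCOMMAND_PROBLEM: context canceled",
		"cluster creation failed: ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS",
		"throughput impact 1.99x exceeds 1.25x",
	}
	for _, l := range logs {
		fmt.Println(classify(l))
	}
}
```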

blathers-crl · 2026-05-27T16:22:58Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

The following tests were previously skipped via cockroachdb#149662 to focus on stabilizing one test at a time:

- perturbation/full/intents
- perturbation/full/decommission
- perturbation/full/elasticWorkload
- perturbation/full/partition
- perturbation/full/slowDisk
- perturbation/full/addNode

Re-enable all of them. The metamorphic and dev variants remain unchanged.

All re-enabled tests use the lenient defaultThresholds() (1.25x throughput floor, p99/p50 disabled) for both the perturbation and recovery intervals, with one exception: the partition test isolates an entire region (4 of 12 nodes) and removes 1/3 of leaseholders, which causes foreground throughput to drop sharply (~2x) while the partition is in effect. The meaningful pass/fail signal for partition is whether the cluster returns to baseline once the partition heals, so the perturbation interval is left ungated (noImpactThresholds()) and the 1.25x floor is enforced only on the recovery interval, via the recoveryImpact field.

While here, add a comment to slowDisk explaining why the default threshold is appropriate for the full variant: with walFailover=true and 2 disks per node, raft log writes fail over to the non-throttled store and foreground throughput stays close to baseline. The lenient 1.25x floor is mainly to absorb noise from the slowLiveness leg.

Resolves: cockroachdb#149662
Epic: none

Release note: None

tbg force-pushed the tbg/unskip-perturbation-full-tests branch from a49d119 to 3150e77 on May 21, 2026 13:48

This was referenced May 22, 2026

build/github: quiet bazel output in unit_tests CI job #170521

Merged

scripts: improve trigger-pr-roachtest.sh #171008

Merged

tbg force-pushed the tbg/unskip-perturbation-full-tests branch from 4314b7a to c3bc9cd on May 27, 2026 16:22

tbg force-pushed the tbg/unskip-perturbation-full-tests branch from c3bc9cd to d43059f on May 27, 2026 16:25

tbg requested a review from stevendanna May 27, 2026 16:32

tbg marked this pull request as ready for review May 27, 2026 16:33

tbg wants to merge 1 commit into cockroachdb:master from tbg:tbg/unskip-perturbation-full-tests