roachtest: re-enable perturbation/full tests #170724

Open
tbg wants to merge 1 commit into cockroachdb:master from tbg:tbg/unskip-perturbation-full-tests

@tbg
Member

@tbg tbg commented May 21, 2026

Context

The perturbation/full/* tests measure the throughput/latency impact of various perturbations (network partition, node restart, decommission, slow disk, etc.) on a CockroachDB cluster. They were valuable for catching latency regressions, but were also persistently flaky.

In July 2025 (#149662), most of them were skipped on a "stabilize one at a time" basis. As of master, only restart, backup, and backfill are running; the rest (intents, decommission, elasticWorkload, partition, slowDisk, addNode) are skipped.

What this PR does

This PR re-enables all six skipped tests and tunes their pass/fail thresholds.

Threshold history

At the time of the skip, each test had its own latency-impact threshold (single float, ratio of post-perturbation latency to baseline):

| Test | Pre-skip threshold |
| --- | --- |
| partition | Inf (no gate) |
| slowDisk | Inf (no gate) |
| restart | Inf (no gate) |
| backfill | Inf (no gate) |
| intents | 40x latency |
| decommission | 5x latency |
| addNode | 5x latency |
| backup | 5x latency |

The pattern: perturbations that intentionally break the cluster (partition, slow disk, restart, backfill) weren't gated at all. Perturbations with expected mild impact had 5x-40x latency thresholds.

What changed in the harness

Since the skip, the perturbation framework has been substantially rewritten:

Per-test thresholds in this PR

For each test, why we picked what we picked:

| Test | Perturbation | Recovery | Rationale |
| --- | --- | --- | --- |
| partition | Inf | 1.25x | Isolating an entire region removes 1/3 of leaseholders. Foreground throughput drops sharply during the partition (observed ~50% drop, ratio ~2x). The meaningful signal is whether the cluster recovers once the partition heals. Matches historical "never gated during perturbation". |
| slowDisk | 1.25x | 1.25x | With walFailover=true and 2 disks per node (the full variant's default), raft log writes fail over to the non-throttled store, so foreground throughput stays close to baseline. The 1.25x floor mainly absorbs noise from the slowLiveness leg, which routes liveness heartbeats through the slow disk. |
| restart | 1.25x | 1.25x | Clean restart (cleanRestart=true, SIGTERM) gracefully drains leases off the target node before stopping. Cluster spends the 10-minute downtime serving traffic from the other nodes with no missing leases. Throughput should stay near baseline. |
| backfill | 1.25x | 1.25x | Index backfill writes a lot but uses elastic admission control; foreground throughput shouldn't drop meaningfully. |
| intents | 1.25x | 1.25x | Builds up a large intent backlog then rolls back. Some cluster pressure during cleanup, but not enough to crush throughput. |
| decommission | 1.25x | 1.25x | Graceful, AC-aware data movement. Doesn't crush throughput. |
| addNode | 1.25x | 1.25x | Triggers rebalancing; doesn't crush throughput. |
| backup | 1.25x | 1.25x | Uses elastic priority via AC. |
| elasticWorkload | 1.25x | 1.25x | The perturbation IS an elastic workload, so AC keeps foreground unaffected. |

Only partition needs the asymmetric thresholds; the rest use defaultThresholds() for both intervals.
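
A minimal sketch of the asymmetric gating, for illustration only. The names defaultThresholds(), noImpactThresholds(), maxThroughputImpact, and recoveryImpact come from this PR, but the struct layouts, the perturbationImpact field, and the passes and partitionConfig helpers are assumptions and may not match the real harness:

```go
package main

import (
	"fmt"
	"math"
)

// thresholds is a hypothetical stand-in for the harness's threshold type.
// maxThroughputImpact is the largest allowed baseline/observed throughput
// ratio; +Inf means the interval is not gated.
type thresholds struct {
	maxThroughputImpact float64
}

// defaultThresholds mirrors the lenient 1.25x default described above.
func defaultThresholds() thresholds {
	return thresholds{maxThroughputImpact: 1.25}
}

// noImpactThresholds leaves an interval ungated.
func noImpactThresholds() thresholds {
	return thresholds{maxThroughputImpact: math.Inf(1)}
}

// passes reports whether an observed impact ratio is within the threshold.
func (t thresholds) passes(ratio float64) bool {
	return ratio <= t.maxThroughputImpact
}

// testConfig pairs the two intervals. perturbationImpact is an assumed
// field name; recoveryImpact is the field named in the commit message.
type testConfig struct {
	perturbationImpact thresholds
	recoveryImpact     thresholds
}

// partitionConfig gates only the recovery interval.
func partitionConfig() testConfig {
	return testConfig{
		perturbationImpact: noImpactThresholds(),
		recoveryImpact:     defaultThresholds(),
	}
}

func main() {
	cfg := partitionConfig()
	fmt.Println(cfg.perturbationImpact.passes(1.99)) // true: ungated
	fmt.Println(cfg.recoveryImpact.passes(1.15))     // true: under 1.25x
	fmt.Println(cfg.recoveryImpact.passes(1.99))     // false: over 1.25x
}
```

Under this sketch, a ~2x drop while the partition is active never fails the test, while the same ratio after the partition heals does.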

Validation

See the run table in the comment thread for results from running each test on TeamCity. With USE_SPOT=never (now the trigger script default per #171008), all six previously-skipped tests pass under the chosen thresholds.

Resolves: #149662
Epic: none

Release note: None

@trunk-io
Contributor

trunk-io Bot commented May 21, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the tbg/unskip-perturbation-full-tests branch from a49d119 to 3150e77 Compare May 21, 2026 13:48

@tbg
Member Author

tbg commented May 22, 2026

TeamCity Roachtest Runs

| Test | Attempt 1 | Attempt 2 | Attempt 3 (USE_SPOT=never) | Worst tput impact |
| --- | --- | --- | --- | --- |
| perturbation/full/intents | 21357048 ✅ Passed | | | 1.00x |
| perturbation/full/decommission | 21357049 ⚠️ VM preempted | 21360396 ⚠️ VM preempted (4 VMs in us-east1-c) | 21361625 ✅ Passed | 1.00x |
| perturbation/full/elasticWorkload | 21357050 ⚠️ VM preempted | 21360398 ⚠️ VM preempted (throughput→0 from preempted node) | 21361632 ✅ Passed | 1.05x |
| perturbation/full/partition | 21357051 ❌ Throughput drop (1.99x > 1.25x) | 21360399 ⚠️ VM preempted (failed to get max rate, context canceled) | 21361633 ✅ Passed | 1.15x ⚠️ (recovery interval, near 1.2x limit) |
| perturbation/full/slowDisk | 21357052 ❌ GCE stockout (us-east1-c) | 21360400 ❌ GCE stockout (us-east1-c, even with suggested zones available) | 21361805 ✅ Passed | 1.00x |
| perturbation/full/addNode | 21357053 ⚠️ VM preempted | 21360401 ❓ No test results | 21361634 ✅ Passed | 1.00x |

All tests use the default lenient thresholds (maxThroughputImpact: 1.25, p99/p50 disabled). Worst tput impact is the max ratio across all operations (follower-read / read / write) and both intervals (perturbation / recovery); lower means less impact, and 1.00x means no impact.
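
A rough sketch of how the worst throughput impact could be computed, assuming the ratio is baseline throughput divided by observed throughput (consistent with a ~50% drop showing up as ~2x). The sample type, its fields, and the numbers in main are hypothetical and are not taken from the harness or from these runs:

```go
package main

import (
	"fmt"
	"math"
)

// sample holds baseline and observed throughput (ops/sec) for one
// operation type in one interval. Hypothetical shape for illustration.
type sample struct {
	op, interval       string
	baseline, observed float64
}

// impact returns baseline/observed: 1.00 means no impact, 2.00 means
// throughput halved. Zero observed throughput counts as infinite impact.
func impact(s sample) float64 {
	if s.observed <= 0 {
		return math.Inf(1)
	}
	return s.baseline / s.observed
}

// worstImpact returns the maximum ratio across all samples, or 0 if
// there are none.
func worstImpact(samples []sample) float64 {
	worst := 0.0
	for _, s := range samples {
		if r := impact(s); r > worst {
			worst = r
		}
	}
	return worst
}

func main() {
	// Made-up numbers, not results from the runs in the table.
	samples := []sample{
		{op: "follower-read", interval: "perturbation", baseline: 1000, observed: 1000},
		{op: "read", interval: "recovery", baseline: 1000, observed: 950},
		{op: "write", interval: "recovery", baseline: 1000, observed: 800},
	}
	fmt.Printf("worst tput impact: %.2fx\n", worstImpact(samples)) // worst tput impact: 1.25x
}
```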

Summary

All six tests have passed: intents on attempt 1, and the other five on attempt 3 (with USE_SPOT=never).

⚠️ Notable: the partition test passed, but with a 1.15x throughput impact in the recovery interval, which leaves little headroom under the 1.2x display threshold (the actual code limit is 1.25x). This test also failed attempt 1 with a 1.99x ratio. The default lenient threshold may need revisiting for partition specifically, since the result is borderline.

Attempt 3 changes

Attempts 1 and 2 used the nightly default --use-spot=auto, which runs the first iteration on spot VMs (preemptible). With COUNT=1, every preemption surfaced as a failure. Attempt 3 explicitly sets USE_SPOT=never to use regular (non-preemptible) VMs. See #171008 for the trigger script change that makes this the default going forward.

Classification basis

Preemption labels come from the test framework's explicit signals (not inference):

  • Header line: VMs preempted during the test run: teamcity-...-NNNN (us-east1-X)
  • Runner message: (test_runner.go:2732).func1: monitorForPreemptedVMs detected VM Preemptions: ...

When a preemption signal is present, downstream errors like COMMAND_PROBLEM, throughput→0, failed to get cluster max rate, or context canceled are caused by the preemption, not independent failures.
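
The labelling rule can be sketched as a small log classifier. This is an illustration, not the tooling used to produce the table: the outcome type and classify function are hypothetical, and only the marker strings are taken from this comment. The stockout branch keys on the cluster-creation error that the slowDisk runs hit.

```go
package main

import (
	"fmt"
	"strings"
)

// outcome is a hypothetical label for a failed run.
type outcome string

const (
	preempted   outcome = "VM preempted"
	stockout    outcome = "GCE stockout"
	testFailure outcome = "test failure"
)

// classify labels a failed run from its log text. A preemption signal
// takes precedence, so downstream errors in the same log are attributed
// to the preemption rather than counted as independent failures.
func classify(log string) outcome {
	switch {
	case strings.Contains(log, "VMs preempted during the test run:"),
		strings.Contains(log, "monitorForPreemptedVMs detected VM Preemptions:"):
		return preempted
	case strings.Contains(log, "ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"):
		return stockout
	default:
		return testFailure
	}
}

func main() {
	// Abbreviated, made-up log snippets for illustration.
	fmt.Println(classify("COMMAND_PROBLEM ... monitorForPreemptedVMs detected VM Preemptions: ..."))
	fmt.Println(classify("cluster creation failed: ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"))
	fmt.Println(classify("throughput ratio exceeded threshold"))
}
```

Checking for the preemption markers first is what keeps errors such as COMMAND_PROBLEM or context canceled from being reported as separate failures.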

"GCE stockout" labels (slowDisk attempts 1-2): the cluster creation itself failed with ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS — no preemption, no test, just no VMs available in us-east1-c.

@blathers-crl

blathers-crl Bot commented May 27, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

The following tests were previously skipped via cockroachdb#149662 to focus on
stabilizing one test at a time:

- perturbation/full/intents
- perturbation/full/decommission
- perturbation/full/elasticWorkload
- perturbation/full/partition
- perturbation/full/slowDisk
- perturbation/full/addNode

Re-enable all of them. The metamorphic and dev variants remain
unchanged.

All re-enabled tests use the lenient defaultThresholds() (1.25x
throughput floor, p99/p50 disabled) for both the perturbation and
recovery intervals, with one exception: the partition test isolates an
entire region (4 of 12 nodes) and removes 1/3 of leaseholders, which
causes foreground throughput to drop sharply (~2x) while the partition
is in effect. The meaningful pass/fail signal for partition is whether
the cluster returns to baseline once the partition heals, so the
perturbation interval is left ungated (noImpactThresholds()) and the
1.25x floor is enforced only on the recovery interval, via the
recoveryImpact field.

While here, add a comment to slowDisk explaining why the default
threshold is appropriate for the full variant: with walFailover=true
and 2 disks per node, raft log writes fail over to the non-throttled
store and foreground throughput stays close to baseline. The lenient
1.25x floor is mainly to absorb noise from the slowLiveness leg.

Resolves: cockroachdb#149662
Epic: none

Release note: None
@tbg tbg force-pushed the tbg/unskip-perturbation-full-tests branch from c3bc9cd to d43059f Compare May 27, 2026 16:25
@tbg tbg requested a review from stevendanna May 27, 2026 16:32
@tbg tbg marked this pull request as ready for review May 27, 2026 16:33

Development

Successfully merging this pull request may close these issues.

kv,roachtest: investigate and recalibrate perturbation tests

2 participants