roachtest: re-enable perturbation/full tests #170724
TeamCity Roachtest Runs
All tests use the default lenient thresholds (
Summary
✅ 6/6 tests passed on attempt 3 (with
Attempt 3 changes
Attempts 1 and 2 used the nightly default
Classification basis
Preemption labels come from the test framework's explicit signals (not inference):
When a preemption signal is present, downstream errors like
"GCE stockout" labels (slowDisk attempts 1-2): the cluster creation itself failed with
The following tests were previously skipped via cockroachdb#149662 to focus on stabilizing one test at a time:

- perturbation/full/intents
- perturbation/full/decommission
- perturbation/full/elasticWorkload
- perturbation/full/partition
- perturbation/full/slowDisk
- perturbation/full/addNode

Re-enable all of them. The metamorphic and dev variants remain unchanged.

All re-enabled tests use the lenient defaultThresholds() (1.25x throughput floor, p99/p50 disabled) for both the perturbation and recovery intervals, with one exception: the partition test isolates an entire region (4 of 12 nodes) and removes 1/3 of leaseholders, which causes foreground throughput to drop sharply (~2x) while the partition is in effect. The meaningful pass/fail signal for partition is whether the cluster returns to baseline once the partition heals, so the perturbation interval is left ungated (noImpactThresholds()) and the 1.25x floor is enforced only on the recovery interval, via the recoveryImpact field.

While here, add a comment to slowDisk explaining why the default threshold is appropriate for the full variant: with walFailover=true and 2 disks per node, raft log writes fail over to the non-throttled store and foreground throughput stays close to baseline. The lenient 1.25x floor is mainly to absorb noise from the slowLiveness leg.

Resolves: cockroachdb#149662
Epic: none
Release note: None
Context
The perturbation/full/* tests measure the throughput/latency impact of various perturbations (network partition, node restart, decommission, slow disk, etc.) on a CockroachDB cluster. They were valuable for catching latency regressions, but were also persistently flaky.

In July 2025 (#149662), most of them were skipped on a "stabilize one at a time" basis. As of master, only restart, backup, and backfill are running; the rest (intents, decommission, elasticWorkload, partition, slowDisk, addNode) are skipped.

What this PR does
This PR re-enables all six skipped tests and tunes their pass/fail thresholds.
Threshold history
At the time of the skip, each test had its own latency-impact threshold (single float, ratio of post-perturbation latency to baseline):
- Inf (no gate)
- Inf (no gate)
- Inf (no gate)
- Inf (no gate)
- 40x latency
- 5x latency
- 5x latency
- 5x latency

The pattern: perturbations that intentionally break the cluster (partition, slow disk, restart, backfill) weren't gated at all. Perturbations with expected mild impact had 5x-40x latency thresholds.
What changed in the harness
Since the skip, the perturbation framework has been substantially rewritten:
- defaultThresholds() with maxThroughputImpact: 1.25 (fail if throughput drops below 80% of baseline).
- recoveryImpact. Some perturbations are expected to crush foreground throughput while they run; the meaningful pass/fail signal is whether the cluster returns to baseline afterwards.

Per-test thresholds in this PR
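A minimal Go sketch of how the throughput gates described in this PR could be modeled. Only defaultThresholds, noImpactThresholds, maxThroughputImpact, and recoveryImpact are names taken from this PR; the thresholds and gates types, the impact field, the passes method, and the modeling of "no gate" as +Inf are assumptions made for illustration and may not match the real harness.

```go
package main

import (
	"fmt"
	"math"
)

// thresholds is a hypothetical stand-in for the harness's threshold type.
// The real type may carry more fields (for example latency gates).
type thresholds struct {
	// maxThroughputImpact is the largest allowed ratio of baseline
	// throughput to observed throughput. +Inf is assumed to mean "no gate".
	maxThroughputImpact float64
}

// defaultThresholds models the lenient default: a 1.25x floor, i.e. fail
// if throughput drops below 1/1.25 = 80% of baseline.
func defaultThresholds() thresholds {
	return thresholds{maxThroughputImpact: 1.25}
}

// noImpactThresholds models an ungated interval (assumed to be +Inf).
func noImpactThresholds() thresholds {
	return thresholds{maxThroughputImpact: math.Inf(1)}
}

// passes reports whether observed throughput is at or above
// baseline/maxThroughputImpact. An infinite limit always passes.
func (t thresholds) passes(baseline, observed float64) bool {
	if math.IsInf(t.maxThroughputImpact, 1) {
		return true
	}
	return observed >= baseline/t.maxThroughputImpact
}

// gates pairs a perturbation-interval gate with a recovery-interval gate.
// The impact field name is hypothetical.
type gates struct {
	impact         thresholds
	recoveryImpact thresholds
}

func main() {
	// partition: ungated while the partition is in effect, 1.25x on recovery.
	partition := gates{
		impact:         noImpactThresholds(),
		recoveryImpact: defaultThresholds(),
	}
	// Illustrative numbers only: a ~2x drop during the perturbation,
	// then a return to 90% of baseline.
	baseline, during, after := 1000.0, 500.0, 900.0
	fmt.Println("perturbation interval passes:", partition.impact.passes(baseline, during))
	fmt.Println("recovery interval passes:", partition.recoveryImpact.passes(baseline, after))
}
```

Under this model, the same ~2x drop would fail a test that applies defaultThresholds() to the perturbation interval, which is why partition is the one test that needs asymmetric gates.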
For each test, why we picked what we picked:
- partition: Inf for the perturbation interval, 1.25x for the recovery interval.
- slowDisk: 1.25x for both intervals. With walFailover=true and 2 disks per node (the full variant's default), raft log writes fail over to the non-throttled store, so foreground throughput stays close to baseline. The 1.25x floor mainly absorbs noise from the slowLiveness leg, which routes liveness heartbeats through the slow disk.
- restart: 1.25x for both intervals. A clean restart (cleanRestart=true, SIGTERM) gracefully drains leases off the target node before stopping. The cluster spends the 10-minute downtime serving traffic from the other nodes with no missing leases. Throughput should stay near baseline.
- intents, decommission, elasticWorkload, addNode, backup, backfill: 1.25x for both intervals.

Only partition needs the asymmetric thresholds; the rest use defaultThresholds() for both intervals.

Validation
See the run table in the comment thread for results from running each test on TeamCity. With USE_SPOT=never (now the trigger script default per #171008), all six previously-skipped tests pass under the chosen thresholds.

Resolves: #149662
Epic: none
Release note: None