Commit c3bc9cd

roachtest: skip perturbation-interval gate for partition
The partition test isolates an entire region (4 of 12 nodes), removing 1/3 of leaseholders. Foreground throughput drops sharply while the partition is in effect, and the meaningful pass/fail signal is whether the cluster returns to baseline once the partition heals. Gating the perturbation interval at the default 1.25x throughput floor mostly produces flakes.

Use noImpactThresholds() for the perturbation interval and keep the default 1.25x throughput floor for the recovery interval, via the recoveryImpact field added in the parent commit.

While here, add a comment to slowDisk explaining why the default threshold is appropriate for the full variant: with walFailover=true and 2 disks per node, raft log writes fail over to the non-throttled store and foreground throughput stays close to baseline. The lenient 1.25x floor is mainly to absorb noise from the slowLiveness leg.

Release note: None
1 parent 8fff0e1 commit c3bc9cd

2 files changed

Lines changed: 16 additions & 1 deletion


pkg/cmd/roachtest/tests/perturbation/network_partition.go

Lines changed: 7 additions & 1 deletion
@@ -31,7 +31,13 @@ var _ perturbation = partition{}
 
 func (p partition) setup() variations {
 	p.partitionSite = true
-	v := setup(p, defaultThresholds())
+	// The partition test isolates an entire region (4 of 12 nodes), removing
+	// 1/3 of leaseholders. Foreground throughput naturally drops sharply
+	// while the partition is in effect, and the meaningful pass/fail signal
+	// is whether the cluster returns to baseline once the partition heals.
+	// Skip the perturbation-interval gate; keep the default recovery gate.
+	v := setup(p, noImpactThresholds())
+	v.recoveryImpact = defaultThresholds()
 	v.leaseType = registry.ExpirationLeases
 	// TODO(baptist): Remove this setting once #120073 is fixed.
 	v.clusterSettings["kv.lease.reject_on_leader_unknown.enabled"] = "true"
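
To illustrate the intent of the change above, here is a minimal, self-contained sketch of per-interval gating. It is not the roachtest implementation: the `thresholds` and `variations` layouts, the `maxThroughputDrop` and `perturbationImpact` fields, and the `passes` and `evaluate` helpers are all hypothetical. It also assumes that the "1.25x throughput floor" means baseline throughput may be at most 1.25 times the observed throughput, which the commit does not spell out.

```go
package main

import (
	"fmt"
	"math"
)

// thresholds is a hypothetical stand-in for the roachtest type. It holds the
// largest tolerated ratio of baseline throughput to observed throughput.
type thresholds struct {
	maxThroughputDrop float64
}

// defaultThresholds uses the 1.25x figure from the commit message
// (its exact meaning here is an assumption).
func defaultThresholds() thresholds {
	return thresholds{maxThroughputDrop: 1.25}
}

// noImpactThresholds disables the gate by tolerating any drop.
func noImpactThresholds() thresholds {
	return thresholds{maxThroughputDrop: math.Inf(1)}
}

// passes reports whether the observed throughput is within the allowed drop.
func (t thresholds) passes(baseline, observed float64) bool {
	if observed <= 0 {
		// Zero throughput only passes when the gate is disabled.
		return math.IsInf(t.maxThroughputDrop, 1)
	}
	return baseline/observed <= t.maxThroughputDrop
}

// variations carries one gate per measured interval (hypothetical layout).
type variations struct {
	perturbationImpact thresholds
	recoveryImpact     thresholds
}

// evaluate returns an error naming the first interval that violates its gate.
func (v variations) evaluate(baseline, during, after float64) error {
	if !v.perturbationImpact.passes(baseline, during) {
		return fmt.Errorf("perturbation interval: %.0f -> %.0f ops/s", baseline, during)
	}
	if !v.recoveryImpact.passes(baseline, after) {
		return fmt.Errorf("recovery interval: %.0f -> %.0f ops/s", baseline, after)
	}
	return nil
}

func main() {
	// Same shape as the partition setup: no gate during the perturbation,
	// default gate once the partition heals.
	v := variations{
		perturbationImpact: noImpactThresholds(),
		recoveryImpact:     defaultThresholds(),
	}
	// Deep drop during the partition, near-baseline afterwards.
	fmt.Println(v.evaluate(1000, 300, 950))
	// Same drop, but throughput does not return to baseline.
	fmt.Println(v.evaluate(1000, 300, 600))
}
```

Under these assumptions the first call reports no error, because only the recovery interval is gated, while the second fails on the recovery interval. That is the signal the commit message describes as meaningful for the partition test.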

pkg/cmd/roachtest/tests/perturbation/slow_disk.go

Lines changed: 9 additions & 0 deletions
@@ -32,6 +32,15 @@ var _ perturbation = &slowDisk{}
 func (s *slowDisk) setup() variations {
 	s.slowLiveness = true
 	s.walFailover = true
+	// With walFailover=true and 2 disks per node (the default for the full
+	// variant), raft log writes fail over to the non-throttled store, so
+	// foreground throughput is expected to stay close to baseline even
+	// while the staller is active. Default thresholds apply to both
+	// intervals; we keep the 1.25x floor (rather than tightening) only to
+	// avoid flakes from the slowLiveness leg, which routes liveness
+	// heartbeats through the slow disk. The metamorphic variant exercises
+	// configurations where walFailover is off and throughput can drop
+	// substantially -- those should override impact independently.
 	return setup(s, defaultThresholds())
 }
