Commit 7e08f8c

roachtest/perturbation: wait for rebalance after baseline; scatter restart init
Two changes motivated by the recovery-phase write stall analyzed in #170849.

- In the shared framework runTest, wait for in-flight rebalancing to settle
  after the baseline interval and before T3. This applies to all perturbation
  tests; the wait is a no-op when nothing is rebalancing, so non-restart tests
  only pay the polling cost. On restart it closes the per-store range-count
  skew left by the fill phase, which otherwise concentrates the recovery
  snapshot storm on whichever of the target node's stores looks most
  underfull.
- Pass --scatter to `kv workload init` for restart so the workload table's
  initial splits are distributed across stores rather than landing wherever
  the allocator happens to pick at split time.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
1 parent e99f26b commit 7e08f8c

2 files changed

Lines changed: 14 additions & 0 deletions

pkg/cmd/roachtest/tests/perturbation/framework.go

Lines changed: 8 additions & 0 deletions
@@ -992,6 +992,14 @@ func (v variations) runTest(ctx context.Context, t test.Test, c cluster.Cluster)
 
 	// Collect the baseline after the workload has stabilized.
 	baselineInterval := intervalSince(v.validationDuration / 2)
+
+	// Let any in-flight rebalancing settle before perturbing. The fill phase
+	// can leave per-store range counts noticeably skewed: when the perturbed
+	// node returns and the cluster tries to re-balance to its underfull
+	// stores, the snapshot storm concentrates on one store and can push it
+	// into a Pebble write stall.
+	v.waitForRebalanceToStop(ctx, t)
+
 	// Now start the perturbation.
 	t.Status("T3: inducing perturbation")
 	perturbationDuration := v.perturbation.startPerturbation(ctx, t, v)
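
The diff shows only the call to v.waitForRebalanceToStop; its body is not part of this commit. As a rough illustration of the "wait until rebalancing settles" idea, and not the framework's actual implementation, the sketch below polls a probe until it reports zero in-flight operations for several consecutive samples. The names sampleFn, waitUntilQuiet, and exampleWait, as well as the choice of "in-flight operation count" as the signal, are assumptions made for this example.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// sampleFn is a hypothetical probe that returns the number of in-flight
// rebalance operations. The signal the real framework uses is not shown
// in the diff.
type sampleFn func(ctx context.Context) (int, error)

// waitUntilQuiet polls sample once per interval and returns nil after it has
// reported zero for `quiet` consecutive polls. It returns early if the probe
// fails or ctx is done.
func waitUntilQuiet(ctx context.Context, sample sampleFn, interval time.Duration, quiet int) error {
	if quiet <= 0 || interval <= 0 {
		return errors.New("quiet and interval must be positive")
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	streak := 0 // consecutive polls that saw no in-flight work
	for {
		n, err := sample(ctx)
		if err != nil {
			return err
		}
		if n == 0 {
			streak++
			if streak >= quiet {
				return nil
			}
		} else {
			streak = 0 // activity resumed, start counting again
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// exampleWait drives waitUntilQuiet with a canned sequence of samples and
// returns how many samples were taken before the wait returned.
func exampleWait() (int, error) {
	samples := []int{3, 1, 0, 2, 0, 0}
	calls := 0
	probe := func(context.Context) (int, error) {
		n := samples[calls]
		calls++
		return n, nil
	}
	err := waitUntilQuiet(context.Background(), probe, time.Millisecond, 2)
	return calls, err
}
```

Requiring a streak of quiet samples rather than a single zero reading avoids returning during a brief pause between rebalance operations. When nothing is rebalancing, the loop exits after `quiet` polls, which matches the commit message's note that non-restart tests only pay the polling cost.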

pkg/cmd/roachtest/tests/perturbation/restart_node.go

Lines changed: 6 additions & 0 deletions
@@ -34,6 +34,12 @@ func (r restart) setup() variations {
 	// flush/compaction pipeline can fall behind into a Pebble write stall.
 	v.ratioOfMax = 0.3
 
+	// Scatter the workload table at init so initial replica placement isn't
+	// skewed across the per-node stores. A pre-existing per-store imbalance
+	// concentrates the recovery snapshot storm on whichever store appears
+	// underfull, which is what causes the stall described above.
+	v.scatter = true
+
 	// TODO(baptist): Remove this setting once #120073 is fixed.
 	v.clusterSettings["kv.lease.reject_on_leader_unknown.enabled"] = "true"
 
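
The commit message says the scatter setting is passed to `kv workload init` as --scatter, but the code that turns v.scatter into a flag is outside this diff. A minimal sketch of how a boolean option could be mapped onto the init command line follows. The initOptions struct, the initArgs helper, and the exact argument layout are hypothetical; only the --scatter flag itself comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// initOptions is a hypothetical stand-in for the relevant fields of the
// test's variations struct; the real struct is not shown in the diff.
type initOptions struct {
	splits  int  // number of initial splits for the workload table
	scatter bool // whether to scatter ranges after splitting
}

// initArgs builds the argument string for a kv workload init invocation.
// The argument layout is an assumption made for illustration.
func initArgs(o initOptions) string {
	args := []string{"workload", "init", "kv"}
	if o.splits > 0 {
		args = append(args, fmt.Sprintf("--splits=%d", o.splits))
	}
	if o.scatter {
		// Ask init to distribute the split ranges across stores instead of
		// leaving them wherever the allocator placed them at split time.
		args = append(args, "--scatter")
	}
	return strings.Join(args, " ")
}
```

With initOptions{splits: 10, scatter: true} the helper returns "workload init kv --splits=10 --scatter"; with scatter left false the flag is omitted, so tests that do not opt in keep their previous command line.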
