Skip to content

Commit e31fe74

Browse files
sandy2008claude
andcommitted
fix(compactor): de-flake sharded-ownership compactor tests
TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning and its partition twin seed one shared bucket.ClientMock with 100 tenants (~3,208/~3,707 registered testify expectations). Every bucket operation from both compactors and both blocks cleaners holds the mock's single mutex through an O(expectations) reflective scan (~38% of CPU in two independent profiles), so the first compaction cycle alone takes ~20s under -race on an idle machine and 6-7x longer on starved CI runners. Each per-tenant meta sync additionally runs under a deadline equal to -compactor.compaction-interval (5s in these tests, pkg/compactor/compactor.go:1085), so starved syncs turn whole runs into failures and CompactionRunsCompleted can stay pinned at 0 longer than ANY poll budget: in CI run 26632776611 the partition test failed its 60s poll AND the non-partition sibling burned its full 120s poll, in both attempts. The ring itself was ACTIVE and stable before the poll started (starting() waits for both), correcting the issue's original ring-convergence framing. Fix (test-only): - Cut numUsers from 100 to 20 in both tests (~25x less quadratic mock-matching work; the per-user ownership assertions are tenant-count independent). - Align the partition test's run-completion poll budget with its non-partition sibling (60s -> 120s) as a seatbelt. - Set WaitActiveInstanceTimeout=30s in both tests (same hardening as #7503): prepareConfig's 5s ACTIVE ceiling is within reach of the observed CI starvation envelope. The shuffle-sharding twins are deliberately untouched: their live arm64 failure in the same run is an ownership-exclusivity violation (compactor_test.go:1318), a different failure class that needs its own root-causing. Before/after on Apple Silicon, -race: 21.00s/24.01s -> 3.57s/3.55s. Fixes #7607 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
1 parent 74185ef commit e31fe74

3 files changed

Lines changed: 6 additions & 3 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,7 @@
5757
* [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
5858
* [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
5959
* [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
60+
* [BUGFIX] Compactor: Fix flake in `TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning` and its partition twin: shrink the 100-tenant mock-bucket fixture, whose quadratic testify-mock matching could starve the first compaction run past the poll budget on loaded CI runners, and align the partition test's run-completion poll timeout with its non-partition sibling. #7607
6061

6162
## 1.21.0 2026-04-24
6263

pkg/compactor/compactor_paritioning_test.go

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1142,7 +1142,7 @@ func TestPartitionCompactor_ShouldCompactAllUsersOnShardingEnabledButOnlyOneInst
11421142

11431143
func TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning(t *testing.T) {
11441144

1145-
numUsers := 100
1145+
numUsers := 20
11461146

11471147
// Setup user IDs
11481148
userIDs := make([]string, 0, numUsers)
@@ -1192,6 +1192,7 @@ func TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEn
11921192
cfg.ShardingRing.InstanceAddr = fmt.Sprintf("127.0.0.%d", i)
11931193
cfg.ShardingRing.WaitStabilityMinDuration = time.Second
11941194
cfg.ShardingRing.WaitStabilityMaxDuration = 5 * time.Second
1195+
cfg.ShardingRing.WaitActiveInstanceTimeout = 30 * time.Second // Give the ring ACTIVE wait headroom on slow CI (#7503).
11951196
cfg.ShardingRing.KVStore.Mock = kvstore
11961197

11971198
c, _, tsdbPlanner, l, _ := prepareForPartitioning(t, cfg, bucketClient, nil, nil)
@@ -1214,7 +1215,7 @@ func TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEn
12141215

12151216
// Wait until a run has been completed on each compactor
12161217
for _, c := range compactors {
1217-
cortex_testutil.Poll(t, 60*time.Second, true, func() any {
1218+
cortex_testutil.Poll(t, 120*time.Second, true, func() any {
12181219
return prom_testutil.ToFloat64(c.CompactionRunsCompleted) >= 1
12191220
})
12201221
}

pkg/compactor/compactor_test.go

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1091,7 +1091,7 @@ func TestCompactor_ShouldCompactAllUsersOnShardingEnabledButOnlyOneInstanceRunni
10911091
func TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning(t *testing.T) {
10921092
t.Parallel()
10931093

1094-
numUsers := 100
1094+
numUsers := 20
10951095

10961096
// Setup user IDs
10971097
userIDs := make([]string, 0, numUsers)
@@ -1139,6 +1139,7 @@ func TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndM
11391139
cfg.ShardingRing.InstanceAddr = fmt.Sprintf("127.0.0.%d", i)
11401140
cfg.ShardingRing.WaitStabilityMinDuration = time.Second
11411141
cfg.ShardingRing.WaitStabilityMaxDuration = 5 * time.Second
1142+
cfg.ShardingRing.WaitActiveInstanceTimeout = 30 * time.Second // Give the ring ACTIVE wait headroom on slow CI (#7503).
11421143
cfg.ShardingRing.KVStore.Mock = kvstore
11431144

11441145
c, _, tsdbPlanner, l, _ := prepare(t, cfg, bucketClient, nil)

0 commit comments

Comments
 (0)