Skip to content

Commit 40a27ad

Browse files
sandy2008claude
andauthored
fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64 (#7567)
* fix(compactor): poll on user ownership in TestCompactor_DeleteLocalSyncFiles TestCompactor_DeleteLocalSyncFiles and TestPartitionCompactor_DeleteLocalSyncFiles were flaking on arm64 with "Should not be zero, but was 0" at the require.NotZero(t, c2Users) assertion. The polling condition only checked CompactionRunsCompleted >= 1, but that counter increments even when the compactor's first cycle ran with zero owned users due to a transient ring-view skew at startup. The subsequent listTenantsWithMetaSyncDirectories() call could therefore return an empty slice. Poll on the actual condition under test (both the cycle has completed AND the compactor sees at least one owned user) so the test waits until the ring has stabilized. Fixes #7565 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> * fix(compactor): require two completed cycles before sampling local sync dirs Round 2 review flagged a remaining mid-cycle sampling race: with the previous predicate CompactionRunsCompleted >= 1, Poll could return after the first cycle (counter incremented despite c2 owning zero users due to ring-view skew) while the second cycle's fetcher.NewBaseFetcher was still creating meta-sync directories. c2Users would then be partial, breaking require.Equal(t, numUsers, c1Users+c2Users) downstream. Bumping to >= 2 ensures we sample only after at least one steady-state cycle has completed end-to-end with the dirs in place. Refs #7565 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> * fix(compactor): remove redundant NotZero assertion after Poll predicate The Poll predicate added in the prior commits already requires len(c2.listTenantsWithMetaSyncDirectories()) > 0, so the subsequent require.NotZero(t, c2Users) is dead — it can only fire if Poll already timed out and failed the test with a clearer message. Addresses @SungJin1212 review feedback on PR #7567. Refs #7565 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> * ci: retrigger to re-run flaky arm64 integration_querier The previous CI run had an unrelated arm64 integration_querier flake (TestQuerierWithBlocksStorageLimits, in pkg integration, not pkg/compactor). This empty commit re-triggers CI to confirm the flake. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> --------- Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> Signed-off-by: Sandy <Yuxuan.Chen@morganstanley.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4aed3a0 commit 40a27ad

3 files changed

Lines changed: 19 additions & 6 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@
4444
* [BUGFIX] Compactor: Fix stale `cortex_bucket_index_last_successful_update_timestamp_seconds` metric not being cleaned up when tenant ownership changes due to ring rebalancing. This caused false alarms on bucket index update rate when a tenant moved between compactors. #7485
4545
* [BUGFIX] Security: Fix stored XSS vulnerability in Alertmanager and Store Gateway status pages by replacing `text/template` with `html/template`. #7512
4646
* [BUGFIX] Security: Limit decompressed gzip output in `ParseProtoReader` and OTLP ingestion path. The decompressed body is now capped by `-distributor.otlp-max-recv-msg-size`. #7515
47+
* [BUGFIX] Compactor: Fix flake in `TestCompactor_DeleteLocalSyncFiles` and `TestPartitionCompactor_DeleteLocalSyncFiles` by polling on user ownership rather than just the `CompactionRunsCompleted` counter, which increments even when the second compactor sees zero owned users due to a transient ring-view skew at startup. #7565
4748
* [BUGFIX] Ingester: Close TSDB when compaction fails during `createTSDB`, preventing resource leaks (file descriptors, mmap handles) that could lead to ingester instability. #7560
4849
* [BUGFIX] Tenant Federation: Fix result cache returning stale data after a new tenant is added when `-tenant-federation.regex-matcher-enabled=true`. The resolved tenant set is now hashed and included in the cache key so that any change to the matched tenant list automatically invalidates cached entries. Non-regex users are unaffected. #7562
4950
* [BUGFIX] Tenant Federation: Fix regex resolver clearing known users list when user scan fails. #7534

pkg/compactor/compactor_paritioning_test.go

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1622,13 +1622,19 @@ func TestPartitionCompactor_DeleteLocalSyncFiles(t *testing.T) {
16221622

16231623
// Now start second compactor, and wait until it runs compaction.
16241624
require.NoError(t, services.StartAndAwaitRunning(context.Background(), c2))
1625-
cortex_testutil.Poll(t, 20*time.Second, 1.0, func() any {
1626-
return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
1625+
// Wait for at least two completed cycles so we sample after a steady-state
1626+
// ownership cycle, not mid-cycle following a zero-owned first cycle. The
1627+
// first cycle's CompactionRunsCompleted can increment with zero owned users
1628+
// due to transient ring-view skew at startup; sampling then would race with
1629+
// the second cycle's fetcher.NewBaseFetcher creating meta-sync directories
1630+
// and return a partial count.
1631+
cortex_testutil.Poll(t, 30*time.Second, true, func() any {
1632+
return prom_testutil.ToFloat64(c2.CompactionRunsCompleted) >= 2 &&
1633+
len(c2.listTenantsWithMetaSyncDirectories()) > 0
16271634
})
16281635

16291636
// Let's check how many users second compactor has.
16301637
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
1631-
require.NotZero(t, c2Users)
16321638

16331639
// Force new compaction cycle on first compactor. It will run the cleanup of un-owned users at the end of compaction cycle.
16341640
c1.compactUsers(context.Background())

pkg/compactor/compactor_test.go

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1846,13 +1846,19 @@ func TestCompactor_DeleteLocalSyncFiles(t *testing.T) {
18461846

18471847
// Now start second compactor, and wait until it runs compaction.
18481848
require.NoError(t, services.StartAndAwaitRunning(context.Background(), c2))
1849-
cortex_testutil.Poll(t, 10*time.Second, 1.0, func() any {
1850-
return prom_testutil.ToFloat64(c2.CompactionRunsCompleted)
1849+
// Wait for at least two completed cycles so we sample after a steady-state
1850+
// ownership cycle, not mid-cycle following a zero-owned first cycle. The
1851+
// first cycle's CompactionRunsCompleted can increment with zero owned users
1852+
// due to transient ring-view skew at startup; sampling then would race with
1853+
// the second cycle's fetcher.NewBaseFetcher creating meta-sync directories
1854+
// and return a partial count.
1855+
cortex_testutil.Poll(t, 30*time.Second, true, func() any {
1856+
return prom_testutil.ToFloat64(c2.CompactionRunsCompleted) >= 2 &&
1857+
len(c2.listTenantsWithMetaSyncDirectories()) > 0
18511858
})
18521859

18531860
// Let's check how many users second compactor has.
18541861
c2Users := len(c2.listTenantsWithMetaSyncDirectories())
1855-
require.NotZero(t, c2Users)
18561862

18571863
// Force new compaction cycle on first compactor. It will run the cleanup of un-owned users at the end of compaction cycle.
18581864
c1.compactUsers(context.Background())

0 commit comments

Comments
 (0)