Commit 9101b6f

sandy2008 and claude committed
fix(querier): wait for store-gateway ACTIVE in querier ring view in TestQuerierWithBlocksStorageOnMissingBlocksFromStorage
The first happy-path query in this integration test intermittently failed with a 500 on arm64 CI (issue #7605). Decoding the gzipped 500 response body logged by the querier in both failing runs shows the failure is querier-local and the query never reached the store-gateway:

    expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095

The store-gateway registers in the ring with all its tokens in JOINING state (pkg/storegateway/gateway.go:461), runs the initial blocks sync (which is what drives cortex_bucket_store_blocks_loaded to 1), and only then switches to ACTIVE (gateway.go:333). The test's existing waits (querier ring tokens 512*2, store-gateway ring tokens 512, blocks_loaded 1) are therefore all satisfiable while the querier's view of the store-gateway ring still holds the instance in JOINING, and the BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55), so the first query fails with the error above (pkg/ring/replication_strategy.go:93, wrapped at pkg/querier/blocks_store_replicated_set.go:127).

The window is structural: the querier's consul watch is rate-limited to 1 req/s (pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can reach the querier's ring view over a second after the store-gateway CASed it, while all three metric waits can pass on their first poll.

Close the race by additionally waiting until the querier exposes cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1 before the first query - the same inline readiness wait used to fix the identical race in backward_compatibility_test.go (#5975, 32bd46f). The waited gauge is computed from the same mutex-guarded ring descriptor that the querier's GetClientsFor consults, so the wait condition is the exact negation of the failure condition. The deliberate post-deletion 500 assertions are unchanged.

Fixes #7605

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
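The failure mode described in the commit message can be illustrated with a toy model. The sketch below is not Cortex code: `Instance`, `replicationSetFor`, and `activeMembers` are hypothetical names, and the ACTIVE-only filter is a simplification of what the message attributes to the BlocksRead ring operation. Under those assumptions, a ring whose only instance is JOINING produces the "could only find 0" error, and the count of ACTIVE members reaching 1 is exactly the condition under which the lookup succeeds.

```go
package main

import (
	"fmt"
	"strings"
)

// InstanceState is a simplified stand-in for ring instance states.
type InstanceState int

const (
	Joining InstanceState = iota
	Active
)

// Instance is a hypothetical, pared-down ring entry.
type Instance struct {
	Addr  string
	State InstanceState
}

// replicationSetFor returns the instances eligible for reads. Only ACTIVE
// instances qualify (an assumption mirroring the commit message), and an
// error is returned when fewer than minHealthy instances qualify.
func replicationSetFor(ring []Instance, minHealthy int) ([]Instance, error) {
	var healthy []Instance
	var unhealthy []string
	for _, inst := range ring {
		if inst.State == Active {
			healthy = append(healthy, inst)
		} else {
			unhealthy = append(unhealthy, inst.Addr)
		}
	}
	if len(healthy) < minHealthy {
		return nil, fmt.Errorf("at least %d healthy replica required, could only find %d - unhealthy instances: %s",
			minHealthy, len(healthy), strings.Join(unhealthy, ","))
	}
	return healthy, nil
}

// activeMembers plays the role of a per-state members gauge computed from
// the same ring view that replicationSetFor reads.
func activeMembers(ring []Instance) int {
	n := 0
	for _, inst := range ring {
		if inst.State == Active {
			n++
		}
	}
	return n
}

func main() {
	ring := []Instance{{Addr: "172.18.0.8:9095", State: Joining}}
	_, err := replicationSetFor(ring, 1)
	fmt.Println("while JOINING:", err)

	ring[0].State = Active
	set, err := replicationSetFor(ring, 1)
	fmt.Println("once ACTIVE:", len(set), "replica(s), err =", err, "active members =", activeMembers(ring))
}
```

Because both functions read the same slice, waiting for `activeMembers(ring) == 1` in this model rules out the error path. That is the property the commit message claims for the real gauge and `GetClientsFor`.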
1 parent 74185ef commit 9101b6f
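The "structural window" argument in the commit message is simple arithmetic, sketched here as a toy model. `visibleAt` is a hypothetical helper, not part of the Cortex consul client. It assumes a watcher that may start at most one request per `minInterval`, and it ignores network latency.

```go
package main

import (
	"fmt"
	"time"
)

// visibleAt returns the earliest time at which a rate-limited watcher can
// observe a change made at changeAt, given that its previous request started
// at lastRequest. All times are offsets from an arbitrary origin.
func visibleAt(lastRequest, changeAt, minInterval time.Duration) time.Duration {
	nextAllowed := lastRequest + minInterval
	if changeAt > nextAllowed {
		// The limiter is already open, so the change is seen right away.
		return changeAt
	}
	// The change has to wait for the limiter to admit the next request.
	return nextAllowed
}

func main() {
	last := 0 * time.Millisecond  // previous watch request
	flip := 50 * time.Millisecond // JOINING->ACTIVE written shortly after
	seen := visibleAt(last, flip, time.Second)
	fmt.Printf("flip at %v, visible at %v, lag %v\n", flip, seen, seen-flip)
}
```

In this model a state change written just after a watch request is delayed by almost the full interval, whereas metric polls on the other components are not delayed at all.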

2 files changed

Lines changed: 10 additions & 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@
 * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
+* [BUGFIX] Querier: Fix flake in integration test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for the querier to see the store-gateway ACTIVE in the ring before the first query. #7615
 
 ## 1.21.0 2026-04-24

integration/querier_test.go

Lines changed: 9 additions & 0 deletions
@@ -376,6 +376,15 @@ func TestQuerierWithBlocksStorageOnMissingBlocksFromStorage(t *testing.T) {
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(1), "cortex_bucket_store_blocks_loaded"))
 
+	// Wait until the querier observes the store-gateway as ACTIVE in its view of the store-gateway
+	// ring: the store-gateway registers as JOINING and switches to ACTIVE only after the initial
+	// blocks sync, so the waits above can all pass while queries would still fail with
+	// "at least 1 healthy replica required, could only find 0" (500). Keep after the tokens wait.
+	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
+		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"),
+	)))
+
 	// Query back the series.
 	c, err = e2ecortex.NewClient("", querier.HTTPEndpoint(), "", "", "user-1")
 	require.NoError(t, err)
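The added wait follows a general poll-until-ready pattern. The sketch below shows that pattern in isolation and is not the e2e framework's implementation: `waitFor` and `equals` are hypothetical helpers, and the fake `read` function stands in for scraping a gauge such as the ACTIVE member count.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls read until ok(value) holds or the timeout elapses.
// Read errors are treated as "not ready yet" and retried.
func waitFor(read func() (float64, error), ok func(float64) bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		v, err := read()
		if err == nil && ok(v) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

// equals returns a predicate that matches a single expected value.
func equals(want float64) func(float64) bool {
	return func(got float64) bool { return got == want }
}

func main() {
	// Fake gauge: reports 0 ACTIVE members on the first two polls, then 1.
	polls := 0
	read := func() (float64, error) {
		polls++
		if polls < 3 {
			return 0, nil
		}
		return 1, nil
	}

	err := waitFor(read, equals(1), 10*time.Millisecond, time.Second)
	fmt.Println("err:", err, "polls:", polls)
}
```

The sketch only shows the polling loop. As the commit message argues, the choice of which gauge to wait on is what actually closes the race.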