Commit 7d732db

sandy2008 and claude committed
fix(querier): wait for store-gateway ACTIVE in querier ring view in store-gateway limits integration tests
TestQuerierWithStoreGatewayDataBytesLimits intermittently fails with HTTP 500 instead of the expected 422 (#7606, arm64 CI). The decoded (gzipped) 500 response body from the failing run is the querier-local ring error:

    expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095

i.e. the ring lookup failed before any store-gateway RPC was made.

The store-gateway registers in the ring as JOINING (already owning tokens) and switches to ACTIVE only after its initial blocks sync, while the querier's BlocksRead ring operation only selects ACTIVE instances and its consul watch is rate-limited (1 rps by default). So the existing waits (ring tokens registered, blocks loaded on the store-gateway) can all pass while the querier's view of the store-gateway ring still says JOINING, and the first query 500s.

The hypothesis originally filed on the issue - that the bytes-limit error loses its 422/ResourceExhausted coding in the vendored Thanos refetch ("series size exceeded expected size; refetching") path - was falsified during investigation:

- those log lines belong to an earlier, passing test in the same CI job;
- the failing query never reached store-gateway limiter code at all; and
- all 10 vendored limiter consumption sites (including the refetch recursion) re-code the error as ResourceExhausted, which the querier maps to a 422 LimitError (#5286).

Fix the race in the tests by waiting until the querier sees the store-gateway ACTIVE in its store-gateway ring view before querying (same idiom as backward_compatibility_test.go, #5975). Apply the same wait to the sibling TestQuerierWithBlocksStorageLimits, which has the identical vulnerable shape (every query expected to hit a 422 limit against a freshly started store-gateway).

Same root cause as #7605, which is fixed separately for TestQuerierWithBlocksStorageOnMissingBlocksFromStorage in a non-overlapping PR.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
1 parent 74185ef commit 7d732db
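The race described in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `Instance`, `InstanceState`, and `selectForRead` are hypothetical names invented here, not the Cortex ring API. The model shows a read operation that accepts only ACTIVE instances finding zero healthy replicas while the sole store-gateway is still JOINING, even though it already owns tokens.

```go
package main

import "errors"

// InstanceState is a simplified, hypothetical stand-in for a ring instance state.
type InstanceState int

const (
	Joining InstanceState = iota // registered in the ring, may already own tokens
	Active                       // initial blocks sync finished
)

// Instance is a hypothetical, simplified ring entry (not the Cortex type).
type Instance struct {
	Addr   string
	State  InstanceState
	Tokens int
}

var errNoHealthyReplica = errors.New("at least 1 healthy replica required, could only find 0")

// selectForRead returns the instances eligible for a read operation that
// accepts only ACTIVE instances. Owning tokens is not enough to be selected.
func selectForRead(ring []Instance) ([]Instance, error) {
	var healthy []Instance
	for _, inst := range ring {
		if inst.State == Active {
			healthy = append(healthy, inst)
		}
	}
	if len(healthy) == 0 {
		return nil, errNoHealthyReplica
	}
	return healthy, nil
}
```

In this model, a ring holding one JOINING instance with 512 tokens makes `selectForRead` return the error. Flipping that instance to `Active` makes the same call succeed. This is why a wait on the token count alone does not guarantee the first query can be routed.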

2 files changed

Lines changed: 17 additions & 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@
 * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
+* [BUGFIX] Querier: Fix flake in integration tests TestQuerierWithStoreGatewayDataBytesLimits and TestQuerierWithBlocksStorageLimits by waiting for the querier to see the store-gateway ACTIVE in the ring before querying. #7606

 ## 1.21.0 2026-04-24


integration/querier_test.go

Lines changed: 16 additions & 0 deletions
@@ -474,6 +474,14 @@ func TestQuerierWithBlocksStorageLimits(t *testing.T) {
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(1), "cortex_bucket_store_blocks_loaded"))

+	// Wait until the store-gateway is ACTIVE in the querier's view of the store-gateway ring. The
+	// store-gateway registers JOINING (with tokens) and switches to ACTIVE only after the initial
+	// blocks sync, so the waits above can pass while the querier would still fail queries with
+	// "at least 1 healthy replica required, could only find 0" (HTTP 500) instead of the expected 422.
+	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
+		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"))))
+
 	// Query back the series.
 	c, err = e2ecortex.NewClient("", querier.HTTPEndpoint(), "", "", "user-1")
 	require.NoError(t, err)
@@ -571,6 +579,14 @@ func TestQuerierWithStoreGatewayDataBytesLimits(t *testing.T) {
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(1), "cortex_bucket_store_blocks_loaded"))

+	// Wait until the store-gateway is ACTIVE in the querier's view of the store-gateway ring. The
+	// store-gateway registers JOINING (with tokens) and switches to ACTIVE only after the initial
+	// blocks sync, so the waits above can pass while the querier would still fail queries with
+	// "at least 1 healthy replica required, could only find 0" (HTTP 500) instead of the expected 422.
+	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
+		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"))))
+
 	// Query back the series.
 	c, err = e2ecortex.NewClient("", querier.HTTPEndpoint(), "", "", "user-1")
 	require.NoError(t, err)
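The wait added in both hunks follows a poll-until-equal idiom: keep reading a metric until it reaches the expected value, and give up after a deadline. The sketch below shows that idiom using only the standard library. `waitUntilEquals` and its parameters are hypothetical names introduced for illustration. It is not the e2e framework's implementation of `WaitSumMetricsWithOptions`, whose retry and backoff behaviour may differ.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntilEquals polls read until it returns want or the timeout elapses.
// Hypothetical helper for illustration only; read stands in for "sum the
// samples of a metric that match some label matchers".
func waitUntilEquals(read func() float64, want float64, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		got := read()
		if got == want {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for value %v, last seen %v", want, got)
		}
		// Sleep between polls so the metric source is not queried in a tight loop.
		time.Sleep(interval)
	}
}
```

In the test above, the value being read would correspond to `cortex_ring_members` on the querier, filtered to `name="store-gateway-client"` and `state="ACTIVE"`. The expected value is 1 because a single store-gateway is started.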
