Commit 9101b6f
fix(querier): wait for store-gateway ACTIVE in querier ring view in TestQuerierWithBlocksStorageOnMissingBlocksFromStorage

The first happy-path query in this integration test intermittently
failed with a 500 on arm64 CI (issue #7605). Decoding the gzipped 500
response body logged by the querier in both failing runs shows the
failure is querier-local and the query never reached the store-gateway:

    expanding series: failed to get store-gateway replication set owning
    the block <ULID>: at least 1 healthy replica required, could only
    find 0 - unhealthy instances: 172.18.0.8:9095

The store-gateway registers in the ring with all its tokens in JOINING
state (pkg/storegateway/gateway.go:461), runs the initial blocks sync
(which is what drives cortex_bucket_store_blocks_loaded to 1), and only
then switches to ACTIVE (gateway.go:333). The test's existing waits
(querier ring tokens 512*2, store-gateway ring tokens 512,
blocks_loaded 1) are therefore all satisfiable while the querier's view
of the store-gateway ring still holds the instance in JOINING, and the
BlocksRead ring operation admits ACTIVE instances only
(pkg/storegateway/gateway_ring.go:49-55), so the first query fails with
the error above (pkg/ring/replication_strategy.go:93, wrapped at
pkg/querier/blocks_store_replicated_set.go:127). The window is
structural: the querier's consul watch is rate-limited to 1 req/s
(pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can
reach the querier's ring view over a second after the store-gateway
CASed it, while all three metric waits can pass on their first poll.

Close the race by additionally waiting until the querier exposes
cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1
before the first query - the same inline readiness wait used to fix the
identical race in backward_compatibility_test.go (#5975, 32bd46f).

The waited gauge is computed from the same mutex-guarded ring
descriptor that the querier's GetClientsFor consults, so the wait
condition is the exact negation of the failure condition. The
deliberate post-deletion 500 assertions are unchanged.

Fixes #7605
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

1 parent 74185ef commit 9101b6f
2 files changed
Lines changed: 10 additions & 0 deletions
[Diff view: file names and changed line contents not preserved]
File 1: 1 line added, at new line 60.
File 2: 9 lines added, at new lines 379-387.