Endpoint and timeout fixes for sharded-CI flakes #621
Merged
The public-tier endpoint is rate-limited (RPC error -32029, "Please apply an OnFinality API key") under the sustained load produced by sharded CI runs, observed in polkadot-fellows/runtimes#1180.
Acala XCM tests (acala.astar, acala.bifrostPolkadot, etc.) hit the 60s timeout on every Acala endpoint Subway cycles through, on the same shard that surfaced the OnFinality rate-limit failure in polkadot-fellows/runtimes#1180. The Acala public RPC pool is slow enough under load that the heavy XCM-Transact storage queries don't return in 60s on any individual endpoint, so Subway burns the full timeout per upstream before rotating, never gets a response, and the test fails. 90s gives those queries enough headroom while still capping a genuinely stuck call. Block numbers are bumped at the same time to keep state lookups close to chain head.
`wss://us.bifrost-rpc.liebi.com/ws` was the only endpoint for bifrostKusama and is currently network-dead (handshake timeout, probed live). Test runs stall on bifrostKusama because Subway has no fallback to cycle to. The two replacements (`hk.` and the no-region default) both respond and serve state at current tip; the no-region host is Liebi's DNS-load-balanced entrypoint and adds geographic redundancy in case `hk.` ever goes the way of `us.`.
`ba34d62` shipped `BIFROSTKUSAMA_BLOCK_NUMBER=13903082` as a stale script fallback because the only configured Bifrost Kusama endpoint was unreachable at the time, and no public RPC retained that block's state. The previous commit fixes the endpoint; this re-runs `yarn update-known-good` against the live endpoint to record a block number that is actually servable, and refreshes every other chain's block in the same pass.
`defineChain.ts` already set `timeout: 90_000` in the per-chain chopsticks config, but `SetupOption.timeout` only controls the test-side WsProvider that talks to the in-process chopsticks server; it leaves chopsticks' own upstream WsProvider on its 60s default, which is what produces the `No response received from RPC endpoint in 60s` errors seen on Acala in the previous CI run on this branch. Bundles a yarn patch that adds an `rpcTimeout` field to `SetupOption` and forwards it as `rpc-timeout` in the chopsticks config (mirroring AcalaNetwork/chopsticks#1034), and sets it to 90s in `defineChain.ts`. The patch can be dropped once a chopsticks release includes #1034.
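The forwarding described above can be sketched roughly as follows. This is not the bundled yarn patch: `SetupOptionSketch` and `toChopsticksConfig` are hypothetical names, the `endpoint` and `block` keys are illustrative, the endpoint URL is a placeholder, and millisecond units for `rpcTimeout` are assumed from the existing `timeout: 90_000`.

```typescript
// Hypothetical, trimmed-down stand-in for chopsticks-utils' SetupOption.
interface SetupOptionSketch {
  endpoint: string | string[]
  blockNumber?: number
  /** Test-side WsProvider timeout (ms); never reaches chopsticks' upstream provider. */
  timeout?: number
  /** Field added by the patch: upstream WsProvider timeout (assumed ms). */
  rpcTimeout?: number
}

// Build the config object handed to chopsticks. Only the `rpc-timeout`
// key name comes from the PR text; the other keys are illustrative.
function toChopsticksConfig(option: SetupOptionSketch): Record<string, unknown> {
  const config: Record<string, unknown> = {
    endpoint: option.endpoint,
    block: option.blockNumber,
  }
  // Forward only when set, so chopsticks keeps its 60s default otherwise.
  if (option.rpcTimeout !== undefined) {
    config['rpc-timeout'] = option.rpcTimeout
  }
  return config
}

const acalaConfig = toChopsticksConfig({
  endpoint: 'wss://acala.example.invalid', // placeholder, not a real endpoint
  timeout: 90_000,
  rpcTimeout: 90_000,
})
```

Keeping `timeout` out of the generated config mirrors the distinction drawn above: it configures the test-side client only, which is why raising it alone did not remove the 60s upstream errors.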
`wss://us.bifrost-rpc.liebi.com/ws` (the only configured endpoint until the previous commit) is network-dead, and the alternative Liebi hosts (`hk.`, no-region) only serve current-tip state; they don't retain the historical state at the block PET pins to, so chopsticks setup fails with `UnknownBlock: State already discarded` on every fresh shard. Until a public Bifrost Kusama endpoint retains state at our pinned block (or PET runs against an endpoint from an operator that specifically offers archive-quality state), the four `bifrostKusama.*` E2E suites and the cross-chain `karura.bifrostKusama.xcm` suite are excluded from collection. The other Kusama suites are unaffected.
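As a rough illustration of what such an exclusion covers, the sketch below pairs two glob patterns with a tiny matcher. The patterns, the `**/` prefix, the file paths, and the helper names are all assumptions made for illustration; the repository's actual `vitest.config.mts` may be laid out differently, and a real config would normally also keep Vitest's default excludes.

```typescript
// Hypothetical exclusion globs for the Bifrost Kusama suites.
const excludedSuites: string[] = [
  '**/bifrostKusama.*.test.ts',
  '**/karura.bifrostKusama.xcm.test.ts',
]

// Shape of the relevant part of a Vitest config (plain object, no import).
const vitestConfigSketch = { test: { exclude: excludedSuites } }

// Minimal glob-to-RegExp conversion supporting only `**/` and `*`.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .replace(/[.+^${}()|[\]\\?]/g, '\\$&') // escape regex metacharacters
    .split('**/').join('\u0000')           // protect `**/` before handling `*`
    .replace(/\*/g, '[^/]*')               // `*` matches within one path segment
    .split('\u0000').join('(?:.*/)?')      // `**/` matches any directory prefix
  return new RegExp(`^${source}$`)
}

// True when a test file would be skipped at collection time.
function isExcluded(file: string): boolean {
  return vitestConfigSketch.test.exclude.some((glob) => globToRegExp(glob).test(file))
}
```

Under these assumed patterns, a file such as `packages/kusama/src/kusama.staking.test.ts` is still collected, which matches the note that the other Kusama suites are unaffected.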
The chopsticks-side patch in this PR raised `rpcTimeout` to 90s, but Subway hardcodes its own per-upstream `request_timeout` to 30s (with no field exposed in `ClientConfig` to override it). Heavy Acala storage queries take longer than 30s, so Subway cycles through the 3 Acala endpoints (~30s each) without serving a response, and chopsticks times out before the cycle completes. This commit excludes the Acala suites until Subway exposes `request_timeout` as a config field.
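The timing interaction can be made concrete with a small model. This is a simplification rather than Subway's actual retry logic: it assumes every upstream needs the same time to answer, that attempts run strictly one after another, and that the caller (chopsticks) gives up at a fixed deadline. All names are hypothetical.

```typescript
interface CycleOutcome {
  served: boolean
  elapsedSeconds: number
  attempts: number
}

// Model a proxy that tries each upstream in turn, abandoning an attempt
// once `perUpstreamTimeoutSeconds` has passed.
function simulateUpstreamCycle(
  queryLatencySeconds: number,
  perUpstreamTimeoutSeconds: number,
  endpointCount: number,
  callerTimeoutSeconds: number,
): CycleOutcome {
  let elapsed = 0
  for (let attempt = 1; attempt <= endpointCount; attempt++) {
    // An attempt costs the query latency, capped by the per-upstream timeout.
    const attemptCost = Math.min(queryLatencySeconds, perUpstreamTimeoutSeconds)
    if (elapsed + attemptCost > callerTimeoutSeconds) {
      // The caller's own timeout fires before this attempt can finish.
      return { served: false, elapsedSeconds: callerTimeoutSeconds, attempts: attempt }
    }
    elapsed += attemptCost
    if (queryLatencySeconds <= perUpstreamTimeoutSeconds) {
      return { served: true, elapsedSeconds: elapsed, attempts: attempt }
    }
  }
  // Every upstream was cut off before it could answer.
  return { served: false, elapsedSeconds: elapsed, attempts: endpointCount }
}

// A 45s query (illustrative figure) with 3 upstreams and a 90s caller timeout:
const withDefault = simulateUpstreamCycle(45, 30, 3, 90) // 30s per upstream
const withRaised = simulateUpstreamCycle(45, 90, 3, 90)  // 90s per upstream
```

In this model the 30s setting spends the caller's whole 90s budget on three aborted attempts, while a 90s per-upstream timeout lets the first attempt finish at 45s.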
xlc approved these changes on May 20, 2026
xlc added a commit to AcalaNetwork/subway that referenced this pull request on May 20, 2026
* Expose per-upstream client timeouts and retries in `ClientConfig`

  `Client::new` already accepts `request_timeout`, `connection_timeout`, and `retries` arguments, but `from_config` hardcodes all three to `None` because `ClientConfig` only exposes `endpoints` and `shuffle_endpoints`. As a result the only way to override the 30s per-upstream request timeout (and the 30s connection timeout, and the default retry count) is to construct `Client` directly in Rust, which isn't reachable from the YAML-driven config.

  Adds three optional fields to `ClientConfig`:

  - `request_timeout_seconds`
  - `connection_timeout_seconds`
  - `retries`

  `from_config` plumbs them into `Client::new`. None of the existing defaults change when the fields are omitted.

  The motivating case is heavy storage queries against slow public RPCs (Acala under load is the case that surfaced this in `polkadot-fellows/runtimes#1180` / `open-web3-stack/polkadot-ecosystem-tests#621`) where 30s per upstream is not enough and Subway exhausts its endpoint cycle without serving a response.

* cargo fmt

* feat(bench): Add client config options for connection timeout, request timeout, and retries

---------

Co-authored-by: Bryan Chen <xlchen1291@gmail.com>
e99a143 to dbb5d6d
dbb5d6d to f6c715a
rockbmb added a commit that referenced this pull request on May 20, 2026
`request_timeout_seconds: 90` on Subway's upstream client (added to `subway-template.yml` in the previous commit) gives Subway enough time per upstream attempt for Acala storage queries to land before the 30s default forced it to cycle endpoints. The exclusion added in PR #621 is no longer needed and is removed; the exclusion comment is narrowed to bifrostKusama, which still lacks a workable endpoint set.
rockbmb added a commit that referenced this pull request on May 20, 2026
…pstream timeout (#622)

* Install Subway from upstream `v0.1.0` musl release in `ci.yml`

  Switches `cargo install --git` to a `curl | tar -xz` of the released static binary (https://github.com/AcalaNetwork/subway/releases/tag/v0.1.0, published by AcalaNetwork/subway#202). Removes the Rust toolchain install, Subway-HEAD commit-hash lookup, and Swatinem cache layer that existed only to amortise the `cargo install` cost; none of them have any other consumer in this workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-known-good.yml`

  Same swap as the previous commit, applied to the periodic block-number update workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-snapshot.yml`

  Same swap as the previous two commits, applied to the snapshot-update workflow.

* Fail Subway download fast on HTTP errors (`curl -f`)

  Without `-f`, an HTTP 4xx/5xx response (e.g. release deleted, GitHub degraded) leaves `curl` exiting zero with the error body on stdout, and the downstream `tar -xz` fails with a confusing "not in gzip format" message instead. Per review on PR #622.

* Install Subway by extracting binary from `acala/subway:v0.1.1` Docker image

  The `v0.1.1` GitHub Release at AcalaNetwork/subway is missing its `x86_64-unknown-linux-musl.tar.gz` asset; the release workflow's `Build release binary` step failed (`cargo build --locked` mismatched the bumped `Cargo.toml` version), so the upload was skipped. The upstream tag still produces a working Docker image because `docker.yml` doesn't use `--locked`, so `acala/subway:v0.1.1` is the only working consumption path for v0.1.1.

  The image's binary lives at `/usr/local/bin/subway` (per Subway's Dockerfile); copying it out with `docker create` + `docker cp` lands in roughly the same wall time as the curl-and-untar path and unblocks consumption of PR #203's `request_timeout_seconds` config field.

* Set Subway per-upstream `request_timeout_seconds` to 90s

  Subway's default per-upstream request timeout is 30s. With three Acala public RPC endpoints, heavy storage queries that take longer than 30s cause Subway to cycle through all three endpoints (~90s) before any single upstream has a chance to respond, and the test-side waiting client times out.

  `request_timeout_seconds` was added to `ClientConfig` in AcalaNetwork/subway#203 (Subway v0.1.1+). Setting it to 90 lets a single upstream attempt run long enough to complete those queries instead of being preempted by Subway's own per-endpoint clock.

  The companion exclusion of Acala tests in `vitest.config.mts` is intentionally left in place; this commit only restores Subway's ability to wait long enough. Lifting the exclusion is a separate verification step.

* Re-enable Acala test suites

  `request_timeout_seconds: 90` on Subway's upstream client (added to `subway-template.yml` in the previous commit) gives Subway enough time per upstream attempt for Acala storage queries to land before the 30s default forced it to cycle endpoints. The exclusion added in PR #621 is no longer needed and is removed; the exclusion comment is narrowed to bifrostKusama, which still lacks a workable endpoint set.
A batch of CI-reliability fixes surfaced by sharded test runs (see polkadot-fellows/runtimes#1180, which consumes PET via runtimes-master).

Endpoint changes

- Remove `wss://collectives.api.onfinality.io/public-ws` from `collectivesPolkadot`. The public-tier endpoint returns `-32029: Too Many Requests` under sustained load.
- Replace `wss://us.bifrost-rpc.liebi.com/ws` (the only configured endpoint for `bifrostKusama`) with `hk.` plus the no-region Liebi default.
- Refresh `KNOWN_GOOD_BLOCK_NUMBERS_*.env`. The previous bump shipped a stale Bifrost Kusama fallback because the dead `us.` endpoint blocked `yarn update-known-good`.

Timeouts

- `defineChain.ts` raises the per-chain `timeout` to 90s.
- `SetupOption.timeout` in `chopsticks-utils` only governs the test-side `WsProvider`; chopsticks' upstream `WsProvider` has a separate `rpc-timeout` that has no path from `SetupOption`. This PR bundles a `.yarn/patches` patch that adds an `rpcTimeout` field to `SetupOption` and forwards it as `rpc-timeout`, mirroring AcalaNetwork/chopsticks#1034. The patch can be dropped once a chopsticks release includes that change.

Exclusions

- `bifrostKusama.*` and `karura.bifrostKusama.xcm.test.ts`: every public Bifrost Kusama RPC either rejects connections or prunes state at the pinned block. Excluded until a workable endpoint set exists.
- `acala.*.test.ts`: Subway hardcodes its per-upstream `request_timeout` to 30s and doesn't expose it in `ClientConfig`, so heavy Acala storage queries force Subway to cycle through the 3 Acala endpoints without serving a response. AcalaNetwork/subway#203 adds the missing field and is merged but pending a fresh tag with a working release artifact; the exclusion can be reverted once that lands.