Use sharding in CI workflows #619
Merged
The introduction of sharding in the CI workflow is a great step for performance. However, reducing the test retry count could impact the stability of the CI pipeline if tests are flaky. The changes to the .env files are just block number updates and have no issues.
- ci.yml: add --reporter=blob per shard, upload as artifact, and a merge-reports job that combines them via vitest --merge-reports so the run summary shows one unified set of test totals instead of one section per shard.
- update-known-good.yml: shard the tests matrix 3 ways to match ci.yml, and disambiguate the failed-chains artifact name by shard so the notify job downloads all per-(network, shard) reports without collisions.
The previous amperfix endpoint has pruned state at PET's pinned block (probed live: UnknownBlock error in 136ms). bulletin-rpc.polkadot.io serves the same block in 110ms. simplystaking's spectrum endpoint is also pruned and not used.
Vitest's default blob filename is blob-${shard}-${total}.json, which
collides across networks when 'actions/download-artifact' merges all
artifacts into one directory. Each shard now writes to a uniquely
named file, so the merge-reports job can parse all six blobs without
the JSON-after-JSON SyntaxError seen in run 26061812546.
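The collision described in this commit message happens at the download step, so a rough sketch of a merge job may help. This is an illustration under assumptions, not the PR's actual workflow: the job name `merge-reports` comes from the text, but the upstream job name `tests`, the artifact name pattern `blob-*`, and the `npx` invocation are hypothetical. It relies on `actions/download-artifact@v4` with `merge-multiple: true`, which flattens all matching artifacts into one directory, which is why each blob needs a unique filename.

```yaml
# Illustrative fragment only; job and artifact names are assumptions.
jobs:
  merge-reports:
    needs: tests            # hypothetical name of the sharded test job
    if: ${{ !cancelled() }} # still merge when some shards failed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Dependency installation omitted.
      - name: Download every per-(network, shard) blob into one directory
        uses: actions/download-artifact@v4
        with:
          pattern: blob-*          # assumed artifact naming
          path: .vitest-reports    # Vitest's default merge-reports directory
          merge-multiple: true     # flatten all artifacts into `path`
      - name: Merge blobs into a single report
        run: npx vitest --merge-reports
```

With `merge-multiple: true`, two artifacts that both contain a file called `blob-1-3.json` end up at the same path, so network-qualified filenames keep all six blobs distinct.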
xlc approved these changes May 19, 2026
I am also testing this in the runtimes repo: polkadot-fellows/runtimes#1180. If all looks good over there as well (that work is based on this), I will merge this one.
Closes #610.
Context
#610 listed three CI perf items:
* a shared SQLite DB cache,
* `--pool=forks`.
The latter two already landed in `master`; the DB cache was not worthwhile: a full 14-run benchmark (sqlite3 ± DB, better-sqlite3 ± DB) showed ~0-5% deltas at PET scale, well within run-to-run noise. Test time is WASM-bound, not storage-bound, so adding a shared SQLite layer did not improve it.
What does produce significant improvement is splitting the work across more runners.
Changes
Shard the test matrix 3 ways per network via Vitest's `--shard` flag. The two ecosystem-test jobs (polkadot, kusama) become six (polkadot×{1,2,3}, kusama×{1,2,3}), each running ~1/3 of the test files in parallel on separate runners. Applied to both `ci.yml` and `update-known-good.yml`.
Per-shard `--reporter=blob` outputs are uploaded as artifacts and combined by a follow-on `merge-reports` job (`vitest --merge-reports`), so the GitHub run summary still shows a single unified set of test totals instead of one section per shard. Each shard writes to `blob-${network}-${shard}.json` to avoid filename collisions when artifacts are merged.
`update-known-good.yml`'s `failed-chains-${network}` artifact is renamed to `failed-chains-${network}-${shard}` to avoid collisions; the existing notify job already unions chain names across all downloaded artifacts, so per-chain GitHub-issue notifications continue to work.
Also drops two pruned RPC endpoints (`wss://bridgehub-kusama.public.curie.radiumblock.co/ws`, `wss://bulletin.amperfix.de`) and adds `wss://bulletin-rpc.polkadot.io` for Polkadot Bulletin. Both removed endpoints accept connections and serve block headers, but fail on `state_getStorage` at PET's pinned blocks with `UnknownBlock: State already discarded`.
Impact
CI wall time on a clean run drops from ~35min to ~15min (measured on this branch). Each shard still gets the full 9-chain Subway pool (no infra changes), so per-shard RPC cache duplication across runners is the tradeoff for the parallelism. A future move to self-hosted runners with a long-lived Subway pool would close that gap further.
`--retry=3` reduced to `--retry=2` since #616 cut endpoint flakiness; `timeout-minutes` reduced from 150 to 60 to match the new envelope.
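For readers who have not used Vitest sharding in GitHub Actions, the sharded test job described under Changes could look roughly like the fragment below. It is a sketch under assumptions rather than the contents of `ci.yml`: the job name `tests`, the artifact name, the `npx vitest run` invocation, and the report directory are hypothetical, and the way a network selects its test files is project-specific and left out. The flags (`--shard`, `--retry`, `--reporter=blob`, `--outputFile`) are standard Vitest CLI options.

```yaml
# Illustrative fragment only; names and paths are assumptions, not the PR's actual ci.yml.
jobs:
  tests:
    runs-on: ubuntu-latest
    timeout-minutes: 60              # reduced from 150 per the description
    strategy:
      fail-fast: false               # let the other shards finish if one fails
      matrix:
        network: [polkadot, kusama]
        shard: [1, 2, 3]             # 2 networks x 3 shards = 6 jobs
    steps:
      - uses: actions/checkout@v4
      # Dependency installation and Subway setup omitted.
      # How a network maps to its test files is project-specific and omitted here.
      - name: Run one shard of the test files
        run: >
          npx vitest run
          --shard=${{ matrix.shard }}/3
          --retry=2
          --reporter=default
          --reporter=blob
          --outputFile.blob=.vitest-reports/blob-${{ matrix.network }}-${{ matrix.shard }}.json
      - name: Upload blob report
        if: ${{ !cancelled() }}
        uses: actions/upload-artifact@v4
        with:
          name: blob-${{ matrix.network }}-${{ matrix.shard }}
          path: .vitest-reports/
          include-hidden-files: true # .vitest-reports is a dot-directory
```

Putting both `network` and `shard` into the blob filename and the artifact name keeps every output unique across the six matrix jobs.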