
Use sharding in CI workflows #619

Merged: rockbmb merged 6 commits into master from ci-sharding on May 19, 2026


@rockbmb (Collaborator) commented May 18, 2026

Closes #610.

Context

#610 listed three CI perf items:
  • a shared SQLite DB cache,
  • Subway binary caching, and
  • a switch to --pool=forks.

The latter two already landed in master. The DB cache was not worthwhile: a full 14-run benchmark (sqlite3 ± DB, better-sqlite3 ± DB) showed ~0-5% deltas at PET scale, well within run-to-run noise. Test time is WASM-bound, not storage-bound, so a shared SQLite layer did not improve it.

What does produce a significant improvement is splitting the work across more runners.

Changes

Shard the test matrix 3 ways per network via Vitest's --shard flag. The two ecosystem-test jobs (polkadot, kusama) become six (polkadot×{1,2,3}, kusama×{1,2,3}), each running ~1/3 of the test files in parallel on separate runners. Applied to both ci.yml and update-known-good.yml.
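A minimal sketch of what `--shard=<index>/<count>` means for the matrix above, assuming a simple deterministic split: every runner sorts the same file list and keeps one contiguous slice. This is an illustration, not Vitest's own sequencer (which may order files differently before slicing), and the test file names are hypothetical.

```typescript
// Sketch of `--shard=<index>/<count>` semantics (not Vitest's own sequencer):
// every runner sorts the same file list, then keeps one contiguous slice.
function shardFiles(files: readonly string[], index: number, count: number): string[] {
  if (!Number.isInteger(count) || count < 1) throw new Error("count must be >= 1");
  if (!Number.isInteger(index) || index < 1 || index > count) {
    throw new Error("index must be in 1..count");
  }
  // Sorting lets independent runners agree on the order without coordination.
  const sorted = [...files].sort();
  const base = Math.floor(sorted.length / count);
  const extra = sorted.length % count;
  // The first `extra` shards take one extra file, so sizes differ by at most 1.
  const start = (index - 1) * base + Math.min(index - 1, extra);
  const end = start + base + (index <= extra ? 1 : 0);
  return sorted.slice(start, end);
}

// The matrix described in this PR: 2 networks x 3 shards = 6 jobs.
const SHARD_COUNT = 3;
const networks = ["polkadot", "kusama"];
const matrix = networks.flatMap((network) =>
  Array.from({ length: SHARD_COUNT }, (_, i) => ({
    network,
    shard: i + 1,
    flag: `--shard=${i + 1}/${SHARD_COUNT}`,
  })),
);

// Hypothetical test file names, for illustration only.
const testFiles = [
  "relay.test.ts", "assetHub.test.ts", "bridgeHub.test.ts", "bulletin.test.ts",
  "collectives.test.ts", "coretime.test.ts", "people.test.ts",
];

for (const job of matrix.filter((j) => j.network === "polkadot")) {
  console.log(job.flag, shardFiles(testFiles, job.shard, SHARD_COUNT));
}
```

Because the slices are disjoint and together cover the whole list, each network's three jobs run every test file exactly once, which is what lets the runners work in parallel without coordinating.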

Per-shard --reporter=blob outputs are uploaded as artifacts and combined by a follow-on merge-reports job (vitest --merge-reports), so the GitHub run summary still shows a single unified set of test totals instead of one section per shard. Each shard writes to blob-${network}-${shard}.json to avoid filename collisions when artifacts are merged.
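A small sketch of why the blob name must include the network. The default pattern `blob-<shard>-<total>.json` is the one cited in this PR's commit messages; how the workflow actually overrides it (for example via Vitest's `--outputFile` option) is not shown here, and the helper names below are hypothetical.

```typescript
// Illustrates why per-shard blob names must include the network.
// Job list mirrors this PR's 2 x 3 matrix; helper names are hypothetical.
interface ShardJob {
  network: string;
  shard: number;
  total: number;
}

const jobs: ShardJob[] = ["polkadot", "kusama"].flatMap((network) =>
  [1, 2, 3].map((shard) => ({ network, shard, total: 3 })),
);

const defaultBlobName = (j: ShardJob): string => `blob-${j.shard}-${j.total}.json`;
const networkBlobName = (j: ShardJob): string => `blob-${j.network}-${j.shard}.json`;

// Models all artifacts being downloaded into one directory:
// any name that appears twice means two shards' reports land on the same path.
function collidingNames(names: string[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const name of names) {
    if (seen.has(name)) dupes.add(name);
    else seen.add(name);
  }
  return Array.from(dupes).sort();
}

console.log("default:", collidingNames(jobs.map(defaultBlobName)));
console.log("per-network:", collidingNames(jobs.map(networkBlobName)));
```

With the default pattern, polkadot shard 1 and kusama shard 1 both produce `blob-1-3.json`, and likewise for shards 2 and 3; prefixing the network makes all six names distinct.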

update-known-good.yml's failed-chains-${network} artifact is renamed to failed-chains-${network}-${shard} to avoid collisions; the existing notify job already unions chain names across all downloaded artifacts, so per-chain GitHub-issue notifications continue to work.
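A minimal sketch of the union step, assuming each downloaded `failed-chains-<network>-<shard>` artifact can be read as a list of chain names. The real artifact format and notify-job code are not described in this PR, so the report contents and chain names below are hypothetical.

```typescript
// Sketch of the notify step's union, assuming each downloaded
// failed-chains-<network>-<shard> artifact yields a list of chain names.
// The artifact contents below are hypothetical.
type FailedChainsReports = Record<string, string[]>;

function unionFailedChains(reports: FailedChainsReports): string[] {
  const chains = new Set<string>();
  for (const names of Object.values(reports)) {
    for (const name of names) chains.add(name);
  }
  // Sorted so the notification order does not depend on download order.
  return Array.from(chains).sort();
}

const reports: FailedChainsReports = {
  "failed-chains-polkadot-1": ["assetHubPolkadot"],
  "failed-chains-polkadot-2": [],
  "failed-chains-polkadot-3": ["peoplePolkadot", "assetHubPolkadot"],
  "failed-chains-kusama-1": [],
  "failed-chains-kusama-2": ["bridgeHubKusama"],
  "failed-chains-kusama-3": [],
};

// One notification per distinct chain, however many shards reported it.
for (const chain of unionFailedChains(reports)) {
  console.log(`would notify: ${chain}`);
}
```

Because the union deduplicates, splitting one network's tests across three shards does not produce duplicate per-chain issues even if a chain shows up in more than one shard's report.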

Also drops two pruned RPC endpoints (wss://bridgehub-kusama.public.curie.radiumblock.co/ws, wss://bulletin.amperfix.de) and adds wss://bulletin-rpc.polkadot.io for Polkadot Bulletin. Both removed endpoints accept connections and serve block headers, but fail on state_getStorage at PET's pinned blocks with UnknownBlock: State already discarded.
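A sketch of the kind of check that separates "serves headers" from "serves state at the pinned block": a JSON-RPC 2.0 `state_getStorage` request with an explicit block hash. The block hash is a placeholder (not PET's actual pin), the `:code` key is used only as a convenient well-known storage key, and matching on the `State already discarded` text is an assumption based on the error quoted in this PR.

```typescript
// Builds a JSON-RPC 2.0 `state_getStorage` request pinned to one block and
// classifies the reply. PLACEHOLDER_HASH is hypothetical, not PET's pinned block.
interface RpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown[];
}

interface RpcResponse {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

// Hex-encodes an ASCII storage key such as ":code" (ASCII only, by assumption).
function toHexKey(key: string): string {
  return "0x" + Array.from(key, (c) => c.charCodeAt(0).toString(16).padStart(2, "0")).join("");
}

function storageAt(key: string, blockHash: string, id = 1): RpcRequest {
  return { jsonrpc: "2.0", id, method: "state_getStorage", params: [key, blockHash] };
}

type Verdict = "ok" | "pruned" | "other-error";

function classify(res: RpcResponse): Verdict {
  if (res.error === undefined) return "ok";
  // Assumption: pruned nodes report "UnknownBlock: State already discarded ...".
  return res.error.message.includes("State already discarded") ? "pruned" : "other-error";
}

const PLACEHOLDER_HASH = "0x" + "ab".repeat(32);
const request = storageAt(toHexKey(":code"), PLACEHOLDER_HASH);
console.log(JSON.stringify(request));
```

Sending `request` over the endpoint's WebSocket and passing the reply to `classify` would flag an endpoint as pruned even when header queries succeed, which matches the failure mode described above.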

Impact

CI wall time on a clean run drops from ~35 min to ~15 min (measured on this branch). Each shard still starts the full 9-chain Subway pool (no infra changes), so every runner maintains its own RPC cache; that duplication across runners is the tradeoff for the parallelism. A future move to self-hosted runners with a long-lived Subway pool could remove the duplication and shorten wall time further.
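A back-of-envelope way to read the ~35 min to ~15 min drop, assuming (not measured) that per-job wall time is a fixed overhead (for instance runner setup and starting the Subway pool) plus test work divided evenly across shards. It ignores the merge-reports job, queueing, and uneven shard sizes.

```typescript
// Back-of-envelope model (an assumption, not a measurement): per-job wall time
// = fixed overhead + work / shards.
interface WallTimeModel {
  overheadMin: number;
  workMin: number;
}

// Fits the model to two points: t1 minutes unsharded, tn minutes with n shards.
// From t1 = o + w and tn = o + w / n it follows that w = (t1 - tn) * n / (n - 1).
function fitModel(t1: number, tn: number, n: number): WallTimeModel {
  if (n <= 1) throw new Error("need n > 1 to separate overhead from work");
  const workMin = ((t1 - tn) * n) / (n - 1);
  return { overheadMin: t1 - workMin, workMin };
}

function predictMinutes(m: WallTimeModel, shards: number): number {
  return m.overheadMin + m.workMin / shards;
}

// The two approximate figures quoted in this PR: ~35 min before, ~15 min with 3 shards.
const model = fitModel(35, 15, 3);
console.log(model); // { overheadMin: 5, workMin: 30 }
```

Under this model the speedup is about 2.3x rather than 3x because the ~5 min fixed part does not shrink with more shards; `predictMinutes(model, 6)` gives 10 min, which is a model extrapolation and not a measured result.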

--retry=3 reduced to --retry=2 since #616 cut endpoint flakiness; timeout-minutes reduced from 150 to 60 to match the new envelope.
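A quick sketch of what the retry change trades off. With Vitest's `retry` option a failing test is re-run up to N more times, so it fails only if all N + 1 attempts fail. The independence assumption and the 10% per-attempt failure rate are hypothetical, chosen only to make the arithmetic concrete.

```typescript
// With `retry: N`, a failing test gets at most N + 1 attempts.
// Assumes (simplification) each attempt fails independently with probability p.
function maxAttempts(retry: number): number {
  return retry + 1;
}

// Probability that a flaky test still fails the job after all retries.
function residualFailureProbability(p: number, retry: number): number {
  return Math.pow(p, maxAttempts(retry));
}

// Hypothetical per-attempt failure rate for a flaky endpoint.
const p = 0.1;
for (const retry of [3, 2]) {
  console.log(
    `--retry=${retry}: up to ${maxAttempts(retry)} attempts, ` +
      `P(job-visible failure) = ${residualFailureProbability(p, retry).toExponential(1)}`,
  );
}
```

Going from 3 to 2 retries multiplies that residual probability by 1/p (the stability concern the bot raises below), so it pays off only if #616 really lowered p; it also caps worst-case attempts per test at 3 instead of 4, which is consistent with the tighter timeout-minutes budget.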

@rockbmb rockbmb added this to the Refactors & redesigns milestone May 18, 2026
@rockbmb rockbmb self-assigned this May 18, 2026
@rockbmb added the enhancement and ci labels May 18, 2026
@github-actions (bot, Contributor) left a comment


The introduction of sharding in the CI workflow is a great step for performance. However, reducing the test retry count could impact the stability of the CI pipeline if tests are flaky. The changes to the .env files are just block number updates and have no issues.

Comment thread on .github/workflows/ci.yml (outdated)
rockbmb added 4 commits May 18, 2026 18:35
- ci.yml: add --reporter=blob per shard, upload as artifact, and a
  merge-reports job that combines them via vitest --merge-reports so
  the run summary shows one unified set of test totals instead of one
  section per shard.
- update-known-good.yml: shard the tests matrix 3 ways to match ci.yml,
  and disambiguate the failed-chains artifact name by shard so the
  notify job downloads all per-(network, shard) reports without
  collisions.

The previous amperfix endpoint has pruned state at PET's pinned block
(probed live: UnknownBlock error in 136ms). bulletin-rpc.polkadot.io
serves the same block in 110ms. simplystaking's spectrum endpoint is
also pruned and not used.

Vitest's default blob filename is blob-${shard}-${total}.json, which
collides across networks when 'actions/download-artifact' merges all
artifacts into one directory. Each shard now writes to a uniquely
named file, so the merge-reports job can parse all six blobs without
the JSON-after-JSON SyntaxError seen in run 26061812546.
@rockbmb changed the title from "Use sharding in CI workflow" to "Use sharding in CI workflows" May 18, 2026
@rockbmb (Collaborator, Author) commented May 19, 2026

I am also testing this in the runtimes repo: polkadot-fellows/runtimes#1180

If all looks good over there as well (that work is based on this), I will merge this one.

@rockbmb merged commit 5fd1659 into master May 19, 2026
13 checks passed
@rockbmb deleted the ci-sharding branch May 19, 2026 14:26


Development

Successfully merging this pull request may close these issues.

CI workflow performance improvements

2 participants