ci: shard alpine-test into parallel jobs to reduce CI time #2966

Merged
rst0git merged 1 commit into checkpoint-restore:criu-dev from adrianreber:2026-03-15-shard
May 24, 2026

adrianreber (Member) commented Mar 15, 2026

The alpine-test CI job runs all ~483 zdtm tests sequentially three times (normal, mntns-compat-mode, criu-config), followed by many non-shardable tests. This dominates overall CI wait time. With only 2 jobs running in parallel (GCC and CLANG) the alpine tests take around 30 minutes.

Use the existing --test-shard-index and --test-shard-count flags already built into test/zdtm.py to split the zdtm test suite across four parallel runners (shards 0-3). A fifth shard runs all non-shardable tests (lazy pages, fault injection, test/others/*, rootless, compel, plugins, etc.) independently and in parallel with the zdtm shards. This increases parallelism from 2 to 10 jobs and reduces the alpine test wall-clock time from ~30 to ~10 minutes.
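To make the splitting concrete, here is a minimal sketch of one way a test list can be divided by a shard index and shard count. It is not taken from test/zdtm.py, whose actual slicing strategy may differ; the round-robin scheme, the `shard_slice` helper, and the test names are all assumptions for illustration.

```python
def shard_slice(tests, shard_index, shard_count):
    """Return the tests assigned to one shard (round-robin over a sorted list).

    Hypothetical helper: zdtm.py may partition its tests differently.
    """
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Sorting first makes the assignment identical on every runner.
    return sorted(tests)[shard_index::shard_count]


# Made-up test names; the PR only states that there are roughly 483 tests.
tests = [f"zdtm/static/test{i:03d}" for i in range(483)]
shards = [shard_slice(tests, i, 4) for i in range(4)]

# Every test lands in exactly one shard, so nothing is skipped or run twice.
assert sorted(t for s in shards for t in s) == sorted(tests)
print([len(s) for s in shards])  # → [121, 121, 121, 120]
```

Whatever scheme is used, the property that matters for CI is the one asserted above: the shards are disjoint and together cover the whole suite.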

Changes:

  • run-ci-tests.sh: Build SHARD_OPTS from ZDTM_SHARD_INDEX/COUNT env vars and pass them to zdtm.py. Extract all non-shardable tests into a run_non_shardable_tests() function. Dispatch based on shard index: 0-3 run zdtm slices, 4 runs non-shardable tests, unset runs everything sequentially (preserving existing behavior).
  • Makefile: Pass ZDTM_SHARD_INDEX and ZDTM_SHARD_COUNT into the container when set.
  • ci.yml: Add shard: [0, 1, 2, 3, 4] to the alpine-test matrix, producing 10 jobs (2 compilers x 5 shards). Job labels now show descriptive shard names (e.g. "zdtm 1/4", "non-zdtm") instead of raw indices.
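The job count and labels from the ci.yml change can be checked with a short sketch. This models the matrix expansion in Python rather than reproducing the workflow file; the `shard_label` helper and the 1-based numbering are assumptions inferred from the "zdtm 1/4" example.

```python
from itertools import product

COMPILERS = ["GCC", "CLANG"]
SHARDS = [0, 1, 2, 3, 4]
ZDTM_SHARDS = 4  # shards 0-3 run zdtm slices; shard 4 runs the non-shardable tests


def shard_label(shard):
    """Map a raw shard index to a descriptive label (hypothetical helper)."""
    if shard < ZDTM_SHARDS:
        # Assumed 1-based numbering, matching the "zdtm 1/4" example.
        return f"zdtm {shard + 1}/{ZDTM_SHARDS}"
    return "non-zdtm"


# A matrix over two dimensions expands to their Cartesian product.
jobs = [(compiler, shard_label(shard)) for compiler, shard in product(COMPILERS, SHARDS)]
print(len(jobs))  # → 10
```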

When sharding is not configured the script behaves identically to before, so other CI jobs (aarch64, compat, gcov, etc.) are unaffected.
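The dispatch rules described for run-ci-tests.sh can be modeled as follows. The real script is shell; this Python sketch only mirrors the stated behavior, and the `plan_run` function, its return values, and the requirement that the count be set for zdtm shards are assumptions.

```python
NON_SHARDABLE_INDEX = "4"  # per the description: shard 4 runs the non-shardable tests


def plan_run(env):
    """Decide what a CI job runs from its environment (hypothetical model)."""
    index = env.get("ZDTM_SHARD_INDEX")
    count = env.get("ZDTM_SHARD_COUNT")
    if count and not index:
        raise ValueError("ZDTM_SHARD_INDEX must be set when ZDTM_SHARD_COUNT is set")
    if not index:
        # Sharding not configured: run everything sequentially, as before.
        return "everything", []
    if index == NON_SHARDABLE_INDEX:
        return "non-shardable", []
    if not count:
        # Assumption: a zdtm shard is meaningless without a shard count.
        raise ValueError("ZDTM_SHARD_COUNT must be set for zdtm shards")
    # These options are passed through to zdtm.py.
    return "zdtm", ["--test-shard-index", index, "--test-shard-count", count]


print(plan_run({"ZDTM_SHARD_INDEX": "1", "ZDTM_SHARD_COUNT": "4"}))
```

Because an empty environment falls through to the "everything" branch, jobs that never set the shard variables keep their previous behavior.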

Comment thread scripts/ci/run-ci-tests.sh
Comment thread scripts/ci/Makefile Outdated
rst0git requested a review from avagin on March 20, 2026, 08:50
codecov-commenter commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.52%. Comparing base (5546c06) to head (e1ddb0e).
⚠️ Report is 4 commits behind head on criu-dev.

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2966      +/-   ##
============================================
+ Coverage     57.21%   57.52%   +0.31%     
============================================
  Files           154      158       +4     
  Lines         40400    40448      +48     
  Branches       8856     8864       +8     
============================================
+ Hits          23113    23268     +155     
- Misses        17023    17036      +13     
+ Partials        264      144     -120     

avagin (Member) commented Mar 24, 2026

LGTM, thanks.

An alternative solution would be to split run-ci-tests.sh. We don't need to run all these tests in one job; the zdtm tests, pre-dump tests, fault tests, etc. could each run as a separate job.

The alpine-test CI job runs all ~483 zdtm tests sequentially three
times (normal, mntns-compat-mode, criu-config), followed by many
non-shardable tests. This dominates overall CI wait time. With only
2 jobs running in parallel (GCC and CLANG) the alpine tests take
around 30 minutes.

Use the existing --test-shard-index and --test-shard-count flags
already built into test/zdtm.py to split the zdtm test suite across
four parallel runners (shards 0-3). A fifth shard runs all
non-shardable tests (lazy pages, fault injection, test/others/*,
rootless, compel, plugins, etc.) independently and in parallel with
the zdtm shards. This increases parallelism from 2 to 10 jobs and
reduces the alpine test wall-clock time from ~30 to ~10 minutes.

Changes:
- run-ci-tests.sh: Build SHARD_OPTS from ZDTM_SHARD_INDEX/COUNT
  env vars and pass them to zdtm.py. Extract all non-shardable
  tests into a run_non_shardable_tests() function. Dispatch based
  on shard index: 0-3 run zdtm slices, 4 runs non-shardable
  tests, unset runs everything sequentially (preserving existing
  behavior). Validate that ZDTM_SHARD_INDEX is set when
  ZDTM_SHARD_COUNT is set.
- Makefile: Pass ZDTM_SHARD_INDEX and ZDTM_SHARD_COUNT into the
  container when set. Split long container run command across
  multiple lines for readability.
- ci.yml: Add shard: [0, 1, 2, 3, 4] to the alpine-test matrix,
  producing 10 jobs (2 compilers x 5 shards). Job labels now show
  descriptive shard names (e.g. "zdtm 1/4", "non-zdtm") instead
  of raw indices.

When sharding is not configured the script behaves identically to
before, so other CI jobs (aarch64, compat, gcov, etc.) are
unaffected.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Adrian Reber <areber@redhat.com>

github-actions commented

A friendly reminder that this PR had no activity for 30 days.

@rst0git rst0git merged commit 4d76d1a into checkpoint-restore:criu-dev May 24, 2026
46 of 51 checks passed