feat(test-split): add grouped-split pytest plugin by danceratopz · Pull Request #2721 · ethereum/execution-specs

danceratopz · 2026-04-19T23:51:59Z

Important

Requires #2720 (refactor(test-fill): allow phase-1-only pre-alloc generation) to land first. Do not merge until that PR is in.

🗒️ Description

Add pytest-split integration to fill so a CI release build can fan fixture generation out across multiple runners while keeping each per-test-function fixture file owned by a single runner.

The new --grouped-split flag replaces pytest-split's default splitter with a (test_function_path, fork)-aware algorithm, so every parametrization of one test function under one fork is guaranteed to land on the same runner.

Because fill's native output layout already groups per-function fixtures into one file per (fork, function), this invariant makes plain cp -r fan-in safe: no JSON-level merge is needed to generate fixture artifacts.
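The fan-in argument above can be illustrated with a small sketch. It is not the PR's code: the directory layout, file names, and the `fan_in` helper are hypothetical, chosen only to show that when each output file is owned by exactly one runner, a plain copy never hits a collision.

```python
import shutil
import tempfile
from pathlib import Path


def fan_in(runner_dirs: list[Path], merged: Path) -> None:
    """Merge runner outputs by plain file copy, failing on any path collision."""
    for runner_dir in runner_dirs:
        for src in sorted(runner_dir.rglob("*.json")):
            dst = merged / src.relative_to(runner_dir)
            if dst.exists():
                # Would only happen if two runners owned the same (fork, function).
                raise FileExistsError(f"two runners wrote {dst.relative_to(merged)}")
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)


def demo() -> list[str]:
    """Write two hypothetical runner outputs and merge them by copy."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout: one file per (fork, function); names are made up.
        outputs = {
            "runner1": ["fork_a/test_x.json"],
            "runner2": ["fork_b/test_x.json", "fork_a/test_y.json"],
        }
        for runner, files in outputs.items():
            for rel in files:
                path = root / runner / rel
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_text("{}")
        merged = root / "merged"
        fan_in([root / runner for runner in outputs], merged)
        return sorted(p.relative_to(merged).as_posix() for p in merged.rglob("*.json"))
```

If two runners ever produced the same relative path, `fan_in` would raise instead of silently overwriting, which is the failure the grouping invariant is meant to rule out.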

What lands:

packages/testing/src/execution_testing/cli/pytest_commands/plugins/split/: the new plugin, split into four single-concern modules.
- durations.py: single source of truth for .test_durations handling. strip_xdist_suffix preserves non-t8n-cache xdist_group markers (e.g. @bigmem) and @ inside parametrize values (email@example.com), matching filler.py:_strip_xdist_group_suffix. Also exports normalize_durations, merge_durations, load_durations, write_durations.
- grouping.py: correctness-only module. group_key(item) returns the (function_path, fork) key. Reads the fork from item.callspec.params["parametrized_fork"] (the forks plugin's canonical param) when available, so a parametrize value that happens to start with fork_ cannot hijack the key. Falls back to a nodeid fork_* token scan only for items without a callspec.
- scheduling.py: performance-only module, three public functions composed by assign_runners.
  - build_group_durations groups keyed items and totals per-group duration (unknown items inherit the known-items mean).
  - lpt_schedule runs Longest-Processing-Time-first bin-packing across runners: sort groups heaviest-first, assign each to the least-loaded runner via a min-heap. Standard 4/3-approximation for makespan minimization.
  - sort_items_within_groups orders each group's items by individual duration DESC so that xdist workers start on the slow tests first; reduces end-of-run stragglers within one runner's -n N worker pool. Default-on, disabled via assign_runners(..., sort_intra_group=False).
- plugin.py: pytest hooks + summary plumbing only. Adds --grouped-split; when combined with --splits and --group, unregisters upstream pytestsplitplugin and partitions collected items by group_key. Emits a compact summary whose first line classifies the run as mode: duration-aware (...), mode: average-only (no durations ...), or mode: average-only (... KEY MISMATCH), plus a GitHub Actions ::warning:: annotation when durations load but zero items match.
.github/scripts/{normalize_durations,merge_durations_files,diagnose_durations}.py: thin CI shims over durations.py so workflows can call the helpers without duplicating JSON-handling logic.
packages/testing/pyproject.toml: pin pytest-split==0.11.0 exactly. The plugin unregisters by upstream's internal plugin name pytestsplitplugin, an implementation detail that could shift across minor releases.
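As a rough illustration of the `group_key` behaviour described for grouping.py, here is a minimal sketch. It assumes the function path is the nodeid with its parametrization brackets removed and that the nodeid fork token has the shape `fork_<name>`; neither detail is confirmed by this description, and the stand-in items are `SimpleNamespace` objects rather than real pytest items.

```python
import re
from types import SimpleNamespace

# Assumed shape of the fork token inside a nodeid, e.g. "fork_Prague".
FORK_TOKEN = re.compile(r"fork_(\w+)")


def group_key(item) -> tuple[str, str]:
    """Return (function_path, fork) for a collected item."""
    # Assumption: the function path is the nodeid without its "[...]" part.
    function_path = item.nodeid.split("[", 1)[0]
    callspec = getattr(item, "callspec", None)
    if callspec is not None and "parametrized_fork" in callspec.params:
        # Authoritative source: the forks plugin's parameter.
        return function_path, str(callspec.params["parametrized_fork"])
    # Fallback: scan the nodeid for the first fork_* token.
    match = FORK_TOKEN.search(item.nodeid)
    return function_path, (match.group(1) if match else "")


# A parametrize value "fork_candidate" appears before the real fork in the nodeid.
tricky = SimpleNamespace(
    nodeid="tests/test_a.py::test_x[fork_candidate-fork_Prague]",
    callspec=SimpleNamespace(params={"parametrized_fork": "Prague", "label": "fork_candidate"}),
)
# An item without a callspec falls back to the nodeid scan.
plain = SimpleNamespace(nodeid="tests/test_a.py::test_y[fork_Cancun]")
```

With the callspec present, the misleading `fork_candidate` value is never consulted; a pure nodeid scan would have picked it up first.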

Design notes:

Two orthogonal concerns inside the same plugin. grouping.py enforces the correctness invariant (items with the same key share a runner, which is what makes plain-copy fan-in safe). scheduling.py handles the performance concerns: LPT across CI runners, and slowest-first within each group for intra-runner xdist workers. Swapping in a different per-group distribution rule would keep fan-in safe and only affect wallclock.
Authoritative fork lookup. group_key reads the fork from item.callspec.params["parametrized_fork"], the forks plugin's canonical param name (not a nodeid-token heuristic). A parametrize value like fork_candidate is no longer ambiguous with a real fork param.
Intra-group ordering respects xdist. Under xdist's default scheduler settings (--dist=load or --dist=loadgroup without --loadscopereorder) the order set via items[:] is preserved to worker dispatch, verified against xdist 3.8's LoadScopeScheduling.schedule() source. --loadscopereorder is the one setting that would re-sort scopes by item count and override our order; not used in this repo's CI.
Plugin summary travels from xdist workers to the controller via workeroutput (pytest_sessionfinish on workers, pytest_testnodedown on the controller) so the terminal summary emits correctly under --dist loadgroup.
End-to-end verified against a non-split baseline. fill --generate-all-formats (no plugin) vs. fill --grouped-split (N groups merged by plain file copy) produces byte-equivalent output per uv run hasher compare across every fixture format: blockchain_tests, blockchain_tests_engine, state_tests, and blockchain_tests_engine_x per fork. The harness lives at tmp/verify_split_matches_baseline.py (gitignored). It also asserts the plugin's duration coverage N/M is non-zero, so a silent fallback to average-based splitting fails the check.

🔗 Related Issues or PRs

Requires #2720.

✅ Checklist

All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
```
just static
```
All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.
All: Considered updating the online docs in the ./docs/ directory.
All: Set appropriate labels for the changes (only maintainers can apply labels).

Cute Animal Picture

Pin `pytest-split==0.11.0` exactly: the `--grouped-split` plugin relies on the internal `pytestsplitplugin` name to unregister the upstream splitter when grouped splitting is active, which is an implementation detail that could shift across minor releases.

Add `scheduling.py` exposing three functions for grouped test splits:

- `build_group_durations` groups keyed items and totals per-group duration (unknown items fall back to the known-items mean).
- `lpt_schedule` assigns groups to runners heaviest-first via a min-heap (Longest-Processing-Time-first, 4/3-approximation of optimal makespan).
- `assign_runners` composes the two and returns one `SplitGroup` per runner, preserving intra-group collection order.

The module separates the two concerns that together implement `--grouped-split`: the grouping invariant (correctness) and LPT scheduling (performance). Swapping in a different per-group scheduling rule would keep fan-in safe and only affect wallclock.
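The two building blocks named in this commit can be sketched as follows. The function names come from the commit message, but the signatures, the `(key, nodeid)` input shape, and the 1.0-second default used when no durations are known at all are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def build_group_durations(keyed_items, durations):
    """Total duration per group; items with no recorded duration get the known mean."""
    known = [durations[nodeid] for _, nodeid in keyed_items if nodeid in durations]
    # Assumption: fall back to 1.0 when no item has a recorded duration.
    mean = sum(known) / len(known) if known else 1.0
    totals = defaultdict(float)
    for key, nodeid in keyed_items:
        totals[key] += durations.get(nodeid, mean)
    return dict(totals)


def lpt_schedule(group_durations, runners):
    """Assign groups heaviest-first to the currently least-loaded runner."""
    heap = [(0.0, index) for index in range(runners)]  # (load, runner index)
    heapq.heapify(heap)
    assignment = {}
    # Sort by duration descending; break ties by key for a deterministic result.
    for key, duration in sorted(group_durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        assignment[key] = index
        heapq.heappush(heap, (load + duration, index))
    return assignment
```

Because whole groups are assigned rather than individual items, any scheduling rule plugged in at this level keeps every `(function_path, fork)` key on one runner.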

Provide a single source of truth for `.test_durations` handling used by both the split plugin and the CI helper scripts: strip the `@xdist_group` suffix, normalize a raw durations dict, merge per-runner files, and load/write the JSON format. Fold `strip_xdist_suffix` out of `grouped_least_duration.py` so the algorithm module depends on this utility rather than duplicating it.
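A minimal sketch of what such helpers might look like is shown below. The exact suffix format is not given here, so the pattern `@t8n-cache-<token>` is an assumption, as are the collision rules (largest duration wins within one file, later files win across files).

```python
import re

# Assumption: the suffix to strip looks like "@t8n-cache-<token>" at the end of a nodeid.
_T8N_CACHE_SUFFIX = re.compile(r"@t8n-cache-[^@\[\]]*$")


def strip_xdist_suffix(nodeid: str) -> str:
    """Drop a trailing t8n-cache xdist group suffix; leave every other "@" alone."""
    return _T8N_CACHE_SUFFIX.sub("", nodeid)


def normalize_durations(raw: dict[str, float]) -> dict[str, float]:
    """Re-key a raw durations dict by suffix-free nodeids."""
    normalized: dict[str, float] = {}
    for nodeid, seconds in raw.items():
        key = strip_xdist_suffix(nodeid)
        # Assumption: when two raw keys collapse to one, keep the larger duration.
        normalized[key] = max(seconds, normalized.get(key, 0.0))
    return normalized


def merge_durations(per_runner: list[dict[str, float]]) -> dict[str, float]:
    """Merge per-runner durations into one sorted dict."""
    merged: dict[str, float] = {}
    for durations in per_runner:
        merged.update(normalize_durations(durations))  # assumption: later files win
    return dict(sorted(merged.items()))
```

Keeping the suffix rule in one function is what lets the plugin and the CI scripts agree on keys; a mismatch there is the kind of problem the summary's KEY MISMATCH mode reports.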

Register via `pytest-fill.ini` under the conventional plugin path `execution_testing.cli.pytest_commands.plugins.split.plugin`. When `--grouped-split` is combined with pytest-split's `--splits` and `--group`, unregister the upstream `pytestsplitplugin` and delegate collection partitioning to `grouped_least_duration`. The plugin emits a compact summary showing the selected runner's load and all runners' load distribution for independent CI log inspection.

The plugin's `(test_case, fork)` grouping is what makes the default multi-fixture-per-file output layout safe under split: every fixture format of a test case lands on one runner, so no two runners ever write the same output file. Pairing `--grouped-split` with `--single-fixture-per-file` is therefore unnecessary and should be avoided.

Pytester-based integration tests cover partition coverage, format-variant co-location, upstream plugin unregistration, summary emission, no-op behavior without the flag, and durations matching.

Re-order `FillCommand.create_executions` so that the execution plan reflects the user's intent:

- `--use-pre-alloc-groups` now takes priority. The flag means pre-alloc groups already exist on disk from a previous run, so even alongside `--generate-all-formats` the run is single-phase.
- `--generate-pre-alloc-groups` without `--generate-all-formats` now runs phase 1 only. CI can populate pre-alloc groups on a dedicated runner without wasting time on phase 2.
- `--generate-all-formats` continues to trigger the full two-phase run.

Update `test_legacy_generate_pre_alloc_groups_still_works` to reflect the new phase-1-only behaviour and add a test covering the `--use-pre-alloc-groups` priority.
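The priority order described above can be written out as a small decision function. This is a sketch only: `FillFlags`, `plan_phases`, and the phase labels are hypothetical names, not the API of `FillCommand`, and the behaviour with no flags set (a single fill phase) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FillFlags:
    """Hypothetical container for the three flags discussed in the commit."""

    use_pre_alloc_groups: bool = False
    generate_pre_alloc_groups: bool = False
    generate_all_formats: bool = False


def plan_phases(flags: FillFlags) -> list[str]:
    """Return the phases to run, in order; labels are illustrative."""
    if flags.use_pre_alloc_groups:
        # Groups already exist on disk, so the run is single-phase.
        return ["fill"]
    if flags.generate_all_formats:
        # Full two-phase run.
        return ["generate-pre-alloc-groups", "fill"]
    if flags.generate_pre_alloc_groups:
        # Phase 1 only.
        return ["generate-pre-alloc-groups"]
    # Assumption: with none of the flags set, a normal single-phase fill runs.
    return ["fill"]
```

Checking `use_pre_alloc_groups` first is what gives it priority over `generate_all_formats` when both are passed.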

Thin shims over the split-plugin `durations` module so the CI workflow can call `normalize`, `merge`, and `diagnose` without duplicating the JSON-handling logic. The scripts import directly from `execution_testing.cli.pytest_commands.plugins.split.durations`, so any future change to the format or suffix convention stays in one place.

After LPT assigns groups to CI runners, sort each group's items by individual duration DESC so xdist workers receive slow tests first. This reduces trailing stragglers on a runner's `-n N` worker pool. The new order flows through `items[:]` in the existing `pytest_collection_modifyitems` hook (`trylast=True`), which xdist respects under its default scheduler settings. The one override is `--loadscopereorder`: when enabled, xdist re-sorts scopes by item count and our order is lost. Not used in CI today. Disable via `assign_runners(..., sort_intra_group=False)` for bisecting or A/B comparison; no CLI flag for now.
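The slowest-first ordering can be sketched in a few lines. The function name matches the one mentioned in the PR description, but the signature and the treatment of items without a recorded duration (a `default` of 0.0, which sorts them last) are assumptions.

```python
def sort_items_within_groups(groups, durations, default=0.0):
    """Order each group's items by duration, slowest first.

    sorted() is stable, so items with equal durations keep their collection order.
    """
    return {
        key: sorted(nodeids, key=lambda nodeid: -durations.get(nodeid, default))
        for key, nodeids in groups.items()
    }
```

Only the order inside each group changes; group membership, and therefore the runner assignment, is untouched.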

codecov · 2026-04-20T08:15:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.26%. Comparing base (e8f01a6) to head (7683e28).
⚠️ Report is 1 commits behind head on forks/amsterdam.

Additional details and impacted files

@@                 Coverage Diff                 @@
##           forks/amsterdam    #2721      +/-   ##
===================================================
+ Coverage            84.72%   86.26%   +1.53%     
===================================================
  Files                  524      599      +75     
  Lines                31117    37038    +5921     
  Branches              3036     3795     +759     
===================================================
+ Hits                 26365    31949    +5584     
- Misses                4181     4525     +344     
+ Partials               571      564       -7

Flag	Coverage Δ
unittests	`86.26% <ø> (+1.53%)`	⬆️


danceratopz added 7 commits April 19, 2026 00:58

danceratopz added the C-feat (Category: an improvement or new feature) and A-test-fill (Area: execution_testing.cli.pytest_commands.plugins.filler) labels Apr 19, 2026

danceratopz mentioned this pull request Apr 20, 2026

Tracker: Faster, More Targeted Test Fixture Releases #2736

Open

7 tasks

danceratopz wants to merge 7 commits into ethereum:forks/amsterdam from danceratopz:add-pytest-split
