feat(test-split): add grouped-split pytest plugin#2721

Draft
danceratopz wants to merge 7 commits into ethereum:forks/amsterdam from danceratopz:add-pytest-split

Conversation

@danceratopz
Member

Important

Requires #2720 (refactor(test-fill): allow phase-1-only pre-alloc generation) to land first. Do not merge until that PR is in.

🗒️ Description

Add pytest-split integration to fill so a CI release build can fan fixture generation out across multiple runners while keeping each per-test-function fixture file owned by a single runner.

The new --grouped-split flag replaces pytest-split's default splitter with a (test_function_path, fork)-aware algorithm. With this, every parametrization of one test function under one fork is guaranteed to land on the same runner.

Because fill's native output layout already groups per-function fixtures into one file per (fork, function), this invariant makes plain cp -r fan-in safe: there is no need for a JSON-level merge to generate fixture artifacts.
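
To illustrate why the invariant matters for fan-in, here is a minimal sketch of a plain-copy merge that refuses to overwrite an existing file. `fan_in` and the directory layout are hypothetical and not part of this PR. The sketch also assumes runners produce strictly disjoint files, which may not hold for any shared metadata or index files.

```python
import shutil
from pathlib import Path


def fan_in(runner_dirs: list[Path], dest: Path) -> int:
    """Copy each runner's output tree into dest; return the number of files copied."""
    copied = 0
    for runner_dir in runner_dirs:
        for src in sorted(runner_dir.rglob("*")):
            if not src.is_file():
                continue
            target = dest / src.relative_to(runner_dir)
            if target.exists():
                # Two runners produced the same file, so the grouping
                # invariant (one owner per fixture file) was violated.
                raise FileExistsError(f"{target} written by more than one runner")
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
            copied += 1
    return copied
```

If the grouping invariant holds, the collision branch is never taken and the result matches what `cp -r` would produce.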

What lands:

  • packages/testing/src/execution_testing/cli/pytest_commands/plugins/split/: the new plugin, split into four single-concern modules.
    • durations.py: single source of truth for .test_durations handling. strip_xdist_suffix preserves non-t8n-cache xdist_group markers (e.g. @bigmem) and @ inside parametrize values (email@example.com), matching filler.py:_strip_xdist_group_suffix. Also exports normalize_durations, merge_durations, load_durations, write_durations.
    • grouping.py: correctness-only module. group_key(item) returns the (function_path, fork) key. Reads the fork from item.callspec.params["parametrized_fork"] (the forks plugin's canonical param) when available, so a parametrize value that happens to start with fork_ cannot hijack the key. Falls back to a nodeid fork_* token scan only for items without a callspec.
    • scheduling.py: performance-only module, three public functions composed by assign_runners.
      • build_group_durations groups keyed items and totals per-group duration (unknown items inherit the known-items mean).
      • lpt_schedule runs Longest-Processing-Time-first bin-packing across runners: sort groups heaviest-first, assign each to the least-loaded runner via a min-heap. Standard 4/3-approximation for makespan minimization.
      • sort_items_within_groups orders each group's items by individual duration DESC so that xdist workers start on the slow tests first; reduces end-of-run stragglers within one runner's -n N worker pool. Default-on, disabled via assign_runners(..., sort_intra_group=False).
    • plugin.py: pytest hooks + summary plumbing only. Adds --grouped-split; when combined with --splits and --group, unregisters upstream pytestsplitplugin and partitions collected items by group_key. Emits a compact summary whose first line classifies the run as mode: duration-aware (...), mode: average-only (no durations ...), or mode: average-only (... KEY MISMATCH), plus a GitHub Actions ::warning:: annotation when durations load but zero items match.
  • .github/scripts/{normalize_durations,merge_durations_files,diagnose_durations}.py: thin CI shims over durations.py so workflows can call the helpers without duplicating JSON-handling logic.
  • packages/testing/pyproject.toml: pin pytest-split==0.11.0 exactly. The plugin unregisters by upstream's internal plugin name pytestsplitplugin, an implementation detail that could shift across minor releases.
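
The grouping rule described for grouping.py can be sketched roughly as follows. This is an illustration under assumptions, not the module's code: the regex, the return type, and the behaviour when a callspec exists but lacks `parametrized_fork` are guesses. Only `item.callspec.params["parametrized_fork"]` and the nodeid `fork_*` fallback come from the description above.

```python
from __future__ import annotations

import re
from types import SimpleNamespace

# Assumed shape of the fork token inside a parametrize id, e.g. "fork_Prague".
FORK_TOKEN = re.compile(r"fork_(\w+)")


def group_key(item) -> tuple[str, str | None]:
    """Return (function_path, fork) for a collected pytest item."""
    function_path, _, param_id = item.nodeid.partition("[")
    callspec = getattr(item, "callspec", None)
    if callspec is not None:
        # Authoritative source: the forks plugin's canonical param.
        fork = callspec.params.get("parametrized_fork")
        return function_path, None if fork is None else str(fork)
    # Fallback only for items without a callspec: scan the id tokens.
    match = FORK_TOKEN.search(param_id)
    return function_path, match.group(1) if match else None


# Stand-ins for pytest items, for illustration only.
with_callspec = SimpleNamespace(
    nodeid="tests/test_a.py::test_x[fork_candidate-fork_Prague]",
    callspec=SimpleNamespace(params={"parametrized_fork": "Prague", "label": "fork_candidate"}),
)
without_callspec = SimpleNamespace(nodeid="tests/test_a.py::test_x[fork_Cancun-state_test]")
```

In this sketch the `fork_candidate` value cannot influence the key when a callspec is present, which is the property the description calls out.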

Design notes:

  • Two orthogonal concerns inside the same plugin. grouping.py enforces the correctness invariant (items with the same key share a runner, which is what makes plain-copy fan-in safe). scheduling.py handles the performance concerns: LPT across CI runners, and slowest-first within each group for intra-runner xdist workers. Swapping in a different per-group distribution rule would keep fan-in safe and only affect wallclock.
  • Authoritative fork lookup. group_key reads the fork from item.callspec.params["parametrized_fork"], the forks plugin's canonical param name (not a nodeid-token heuristic). A parametrize value like fork_candidate is no longer ambiguous with a real fork param.
  • Intra-group ordering respects xdist. Under xdist's default scheduler settings (--dist=load or --dist=loadgroup without --loadscopereorder) the order set via items[:] is preserved to worker dispatch, verified against xdist 3.8's LoadScopeScheduling.schedule() source. --loadscopereorder is the one setting that would re-sort scopes by item count and override our order; not used in this repo's CI.
  • Plugin summary travels from xdist workers to the controller via workeroutput (pytest_sessionfinish on workers, pytest_testnodedown on the controller) so the terminal summary emits correctly under --dist loadgroup.
  • End-to-end verified against a non-split baseline. fill --generate-all-formats (no plugin) vs. fill --grouped-split (N groups merged by plain file copy) produces byte-equivalent output per uv run hasher compare across every fixture format: blockchain_tests, blockchain_tests_engine, state_tests, and blockchain_tests_engine_x per fork. The harness lives at tmp/verify_split_matches_baseline.py (gitignored). It also asserts the plugin's duration coverage N/M is non-zero, so a silent fallback to average-based splitting fails the check.
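
The coverage check mentioned in the last note can be pictured with a small sketch. `classify_mode` and its labels are hypothetical. They mirror the three summary modes described earlier but are not the plugin's actual strings or API.

```python
def classify_mode(collected_ids, durations):
    """Return (label, matched, total) for a run's duration coverage."""
    matched = sum(1 for nodeid in collected_ids if nodeid in durations)
    if not durations:
        label = "average-only (no durations)"
    elif matched == 0:
        # Durations loaded, but none of their keys match collected items.
        label = "average-only (key mismatch)"
    else:
        label = "duration-aware"
    return label, matched, len(collected_ids)
```

A harness could then assert that `matched` is non-zero, so a silent fallback to average-based splitting fails loudly.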

🔗 Related Issues or PRs

Requires #2720.

✅ Checklist

  • All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:

    just static
  • All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.

  • All: Considered updating the online docs in the ./docs/ directory.

  • All: Set appropriate labels for the changes (only maintainers can apply labels).

Pin `pytest-split==0.11.0` exactly: the `--grouped-split` plugin relies
on the internal `pytestsplitplugin` name to unregister the upstream
splitter when grouped splitting is active, which is an implementation
detail that could shift across minor releases.

Add `scheduling.py` exposing three functions for grouped test splits:

- `build_group_durations` groups keyed items and totals per-group
  duration (unknown items fall back to the known-items mean).
- `lpt_schedule` assigns groups to runners heaviest-first via a
  min-heap (Longest-Processing-Time-first, 4/3-approximation of
  optimal makespan).
- `assign_runners` composes the two and returns one `SplitGroup`
  per runner, preserving intra-group collection order.

The module separates the two concerns that together implement
`--grouped-split`: the grouping invariant (correctness) and LPT
scheduling (performance). Swapping in a different per-group
scheduling rule would keep fan-in safe and only affect wallclock.
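
As a rough illustration of the two scheduling steps named above, the sketch below totals per-group durations and then runs LPT with a min-heap. The signatures, the `1.0` fallback when no durations are known, and the tie-break on the key are assumptions made for the example. They are not taken from `scheduling.py`.

```python
import heapq
from collections import defaultdict


def build_group_durations(keyed_items, durations):
    """keyed_items: iterable of (group_key, nodeid). Returns {group_key: seconds}."""
    keyed_items = list(keyed_items)
    known = [durations[n] for _, n in keyed_items if n in durations]
    # Unknown items inherit the mean of the known ones (1.0 if none: assumption).
    mean = sum(known) / len(known) if known else 1.0
    totals = defaultdict(float)
    for key, nodeid in keyed_items:
        totals[key] += durations.get(nodeid, mean)
    return dict(totals)


def lpt_schedule(group_durations, runners):
    """Assign groups heaviest-first to the currently least-loaded runner."""
    heap = [(0.0, r) for r in range(runners)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(runners)}
    # Sort by duration descending; tie-break on key so every runner
    # computes the same schedule independently.
    ordered = sorted(group_durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, seconds in ordered:
        load, runner = heapq.heappop(heap)
        assignment[runner].append(key)
        heapq.heappush(heap, (load + seconds, runner))
    return assignment
```

Because a whole group is the scheduling unit, any assignment rule used here keeps same-key items on one runner. Only the balance of runner loads changes.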

Provide a single source of truth for `.test_durations` handling used
by both the split plugin and the CI helper scripts: strip the
`@xdist_group` suffix, normalize a raw durations dict, merge per-runner
files, and load/write the JSON format.

Fold `strip_xdist_suffix` out of `grouped_least_duration.py` so the
algorithm module depends on this utility rather than duplicating it.
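
A minimal sketch of the suffix handling and merge, under stated assumptions: the `t8n-cache-` group-name prefix and the keep-the-larger collision rule are guesses for illustration, and the real `durations.py` may differ on both.

```python
# Assumed prefix of the xdist group names that should be stripped.
T8N_CACHE_PREFIX = "t8n-cache-"


def strip_xdist_suffix(nodeid: str) -> str:
    """Drop a trailing '@<group>' only when it is a t8n-cache group."""
    head, sep, tail = nodeid.rpartition("@")
    if sep and tail.startswith(T8N_CACHE_PREFIX):
        return head
    # Other groups (e.g. '@bigmem') and '@' inside param values are kept.
    return nodeid


def normalize_durations(raw: dict) -> dict:
    """Re-key a raw durations dict by suffix-stripped nodeid."""
    out = {}
    for nodeid, seconds in raw.items():
        key = strip_xdist_suffix(nodeid)
        # On collision keep the larger value (assumption).
        out[key] = max(seconds, out.get(key, 0.0))
    return out


def merge_durations(per_runner: list) -> dict:
    """Merge per-runner durations dicts into one normalized dict."""
    merged = {}
    for raw in per_runner:
        for key, seconds in normalize_durations(raw).items():
            merged[key] = max(seconds, merged.get(key, 0.0))
    return merged
```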

Register via `pytest-fill.ini` under the conventional plugin path
`execution_testing.cli.pytest_commands.plugins.split.plugin`.

When `--grouped-split` is combined with pytest-split's `--splits` and
`--group`, unregister the upstream `pytestsplitplugin` and delegate
collection partitioning to `grouped_least_duration`. The plugin emits
a compact summary showing the selected runner's load and all
runners' load distribution for independent CI log inspection.

The plugin's `(test_case, fork)` grouping is what makes the default
multi-fixture-per-file output layout safe under split: every fixture
format of a test case lands on one runner, so no two runners ever
write the same output file. Pairing `--grouped-split` with
`--single-fixture-per-file` is therefore unnecessary and should be
avoided.

Pytester-based integration tests cover partition coverage, format-
variant co-location, upstream plugin unregistration, summary
emission, no-op behavior without the flag, and durations matching.

Re-order `FillCommand.create_executions` so that the execution plan
reflects the user's intent:

- `--use-pre-alloc-groups` now takes priority. The flag means pre-alloc
  groups already exist on disk from a previous run, so even alongside
  `--generate-all-formats` the run is single-phase.
- `--generate-pre-alloc-groups` without `--generate-all-formats` now
  runs phase 1 only. CI can populate pre-alloc groups on a dedicated
  runner without wasting time on phase 2.
- `--generate-all-formats` continues to trigger the full two-phase
  run.

Update `test_legacy_generate_pre_alloc_groups_still_works` to reflect
the new phase-1-only behaviour and add a test covering the
`--use-pre-alloc-groups` priority.
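
The flag priority described above can be summarised as a small decision function. `plan_phases` and the phase names are hypothetical, as is the no-flags default. The sketch only restates the ordering from this commit message and is not the body of `FillCommand.create_executions`.

```python
def plan_phases(use_pre_alloc_groups=False,
                generate_pre_alloc_groups=False,
                generate_all_formats=False):
    """Return the list of phases a fill run would execute."""
    if use_pre_alloc_groups:
        # Groups already exist on disk: single-phase, even with
        # --generate-all-formats.
        return ["fill"]
    if generate_all_formats:
        # Full two-phase run.
        return ["pre-alloc", "fill"]
    if generate_pre_alloc_groups:
        # Phase 1 only.
        return ["pre-alloc"]
    # No relevant flags: a plain single-phase fill (assumption).
    return ["fill"]
```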

Thin shims over the split-plugin `durations` module so the CI workflow
can call `normalize`, `merge`, and `diagnose` without duplicating the
JSON-handling logic. The scripts import directly from
`execution_testing.cli.pytest_commands.plugins.split.durations`, so any
future change to the format or suffix convention stays in one place.

After LPT assigns groups to CI runners, sort each group's items by
individual duration DESC so xdist workers receive slow tests first.
This reduces trailing stragglers on a runner's `-n N` worker pool.

The new order flows through `items[:]` in the existing
`pytest_collection_modifyitems` hook (`trylast=True`), which xdist
respects under its default scheduler settings. The one override is
`--loadscopereorder`: when enabled, xdist re-sorts scopes by item
count and our order is lost. Not used in CI today.

Disable via `assign_runners(..., sort_intra_group=False)` for
bisecting or A/B comparison; no CLI flag for now.
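
A sketch of the intra-group ordering, assuming items are identified by nodeid and that items without a recorded duration sort last. The function name, signature, and the `0.0` default are illustrative only.

```python
def sort_items_within_group(nodeids, durations, sort_intra_group=True):
    """Order one group's items slowest-first; stable for equal durations."""
    if not sort_intra_group:
        # Keep collection order, e.g. for bisecting or A/B comparison.
        return list(nodeids)
    # sorted() is stable, so ties keep their collection order.
    return sorted(nodeids, key=lambda n: -durations.get(n, 0.0))
```

Negating the key rather than passing `reverse=True` keeps ties in collection order, which keeps the result deterministic across runners.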

@danceratopz danceratopz added C-feat Category: an improvement or new feature A-test-fill Area: execution_testing.cli.pytest_commands.plugins.filler labels Apr 19, 2026
@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.26%. Comparing base (e8f01a6) to head (7683e28).
⚠️ Report is 1 commits behind head on forks/amsterdam.

Additional details and impacted files
@@                 Coverage Diff                 @@
##           forks/amsterdam    #2721      +/-   ##
===================================================
+ Coverage            84.72%   86.26%   +1.53%     
===================================================
  Files                  524      599      +75     
  Lines                31117    37038    +5921     
  Branches              3036     3795     +759     
===================================================
+ Hits                 26365    31949    +5584     
- Misses                4181     4525     +344     
+ Partials               571      564       -7     
Flag Coverage Δ
unittests 86.26% <ø> (+1.53%) ⬆️

