feat(test-split): add grouped-split pytest plugin#2721
Draft
danceratopz wants to merge 7 commits intoethereum:forks/amsterdamfrom
Draft
feat(test-split): add grouped-split pytest plugin#2721danceratopz wants to merge 7 commits intoethereum:forks/amsterdamfrom
danceratopz wants to merge 7 commits intoethereum:forks/amsterdamfrom
Conversation
Pin `pytest-split==0.11.0` exactly: the `--grouped-split` plugin relies on the internal `pytestsplitplugin` name to unregister the upstream splitter when grouped splitting is active, which is an implementation detail that could shift across minor releases.
Add `scheduling.py` exposing three functions for grouped test splits: - `build_group_durations` groups keyed items and totals per-group duration (unknown items fall back to the known-items mean). - `lpt_schedule` assigns groups to runners heaviest-first via a min-heap (Longest-Processing-Time-first, 4/3-approximation of optimal makespan). - `assign_runners` composes the two and returns one `SplitGroup` per runner, preserving intra-group collection order. The module separates the two concerns that together implement `--grouped-split`: the grouping invariant (correctness) and LPT scheduling (performance). Swapping in a different per-group scheduling rule would keep fan-in safe and only affect wallclock.
Provide a single source of truth for `.test_durations` handling used by both the split plugin and the CI helper scripts: strip the `@xdist_group` suffix, normalize a raw durations dict, merge per-runner files, and load/write the JSON format. Fold `strip_xdist_suffix` out of `grouped_least_duration.py` so the algorithm module depends on this utility rather than duplicating it.
Register via `pytest-fill.ini` under the conventional plugin path `execution_testing.cli.pytest_commands.plugins.split.plugin`. When `--grouped-split` is combined with pytest-split's `--splits` and `--group`, unregister the upstream `pytestsplitplugin` and delegate collection partitioning to `grouped_least_duration`. The plugin emits a compact summary showing the selected runner's load and all runners' load distribution for independent CI log inspection. The plugin's `(test_case, fork)` grouping is what makes the default multi-fixture-per-file output layout safe under split: every fixture format of a test case lands on one runner, so no two runners ever write the same output file. Pairing `--grouped-split` with `--single-fixture-per-file` is therefore unnecessary and should be avoided. Pytester-based integration tests cover partition coverage, format- variant co-location, upstream plugin unregistration, summary emission, no-op behavior without the flag, and durations matching.
Re-order `FillCommand.create_executions` so that the execution plan reflects the user's intent: - `--use-pre-alloc-groups` now takes priority. The flag means pre-alloc groups already exist on disk from a previous run, so even alongside `--generate-all-formats` the run is single-phase. - `--generate-pre-alloc-groups` without `--generate-all-formats` now runs phase 1 only. CI can populate pre-alloc groups on a dedicated runner without wasting time on phase 2. - `--generate-all-formats` continues to trigger the full two-phase run. Update `test_legacy_generate_pre_alloc_groups_still_works` to reflect the new phase-1-only behaviour and add a test covering the `--use-pre-alloc-groups` priority.
Thin shims over the split-plugin `durations` module so the CI workflow can call `normalize`, `merge`, and `diagnose` without duplicating the JSON-handling logic. The scripts import directly from `execution_testing.cli.pytest_commands.plugins.split.durations`, so any future change to the format or suffix convention stays in one place.
After LPT assigns groups to CI runners, sort each group's items by individual duration DESC so xdist workers receive slow tests first. This reduces trailing stragglers on a runner's `-n N` worker pool. The new order flows through `items[:]` in the existing `pytest_collection_modifyitems` hook (`trylast=True`), which xdist respects under its default scheduler settings. The one override is `--loadscopereorder`: when enabled, xdist re-sorts scopes by item count and our order is lost. Not used in CI today. Disable via `assign_runners(..., sort_intra_group=False)` for bisecting or A/B comparison; no CLI flag for now.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## forks/amsterdam #2721 +/- ##
===================================================
+ Coverage 84.72% 86.26% +1.53%
===================================================
Files 524 599 +75
Lines 31117 37038 +5921
Branches 3036 3795 +759
===================================================
+ Hits 26365 31949 +5584
- Misses 4181 4525 +344
+ Partials 571 564 -7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important
Requires #2720 (
refactor(test-fill): allow phase-1-only pre-alloc generation) to land first. Do not merge until that PR is in.🗒️ Description
Add
pytest-splitintegration tofillso a CI release build can fan fixture generation out across multiple runners while keeping each per-test-function fixture file owned by a single runner.The new
--grouped-splitflag replacespytest-split's default splitter with a(test_function_path, fork)-aware algorithm. With this, every parametrization of one test function under one fork is guarenteed to land on the same runner.Because fill's native output layout already groups per-function fixtures into one file per
(fork, function), this invariant makes plaincp -rfan-in safe: There's no need to perform a JSON-level merge to generate fixture artifacts.What lands:
packages/testing/src/execution_testing/cli/pytest_commands/plugins/split/: the new plugin, split into four single-concern modules.durations.py: single source of truth for.test_durationshandling.strip_xdist_suffixpreserves non-t8n-cachexdist_groupmarkers (e.g.@bigmem) and@inside parametrize values (email@example.com), matchingfiller.py:_strip_xdist_group_suffix. Also exportsnormalize_durations,merge_durations,load_durations,write_durations.grouping.py: correctness-only module.group_key(item)returns the(function_path, fork)key. Reads the fork fromitem.callspec.params["parametrized_fork"](the forks plugin's canonical param) when available, so a parametrize value that happens to start withfork_cannot hijack the key. Falls back to a nodeidfork_*token scan only for items without a callspec.scheduling.py: performance-only module, three public functions composed byassign_runners.build_group_durationsgroups keyed items and totals per-group duration (unknown items inherit the known-items mean).lpt_scheduleruns Longest-Processing-Time-first bin-packing across runners: sort groups heaviest-first, assign each to the least-loaded runner via a min-heap. Standard 4/3-approximation for makespan minimization.sort_items_within_groupsorders each group's items by individual duration DESC so that xdist workers start on the slow tests first; reduces end-of-run stragglers within one runner's-n Nworker pool. Default-on, disabled viaassign_runners(..., sort_intra_group=False).plugin.py: pytest hooks + summary plumbing only. Adds--grouped-split; when combined with--splitsand--group, unregisters upstreampytestsplitpluginand partitions collected items bygroup_key. Emits a compact summary whose first line classifies the run asmode: duration-aware (...),mode: average-only (no durations ...), ormode: average-only (... KEY MISMATCH), plus a GitHub Actions::warning::annotation when durations load but zero items match..github/scripts/{normalize_durations,merge_durations_files,diagnose_durations}.py: thin CI shims overdurations.pyso workflows can call the helpers without duplicating JSON-handling logic.packages/testing/pyproject.toml: pinpytest-split==0.11.0exactly. The plugin unregisters by upstream's internal plugin namepytestsplitplugin, an implementation detail that could shift across minor releases.Design notes:
grouping.pyenforces the correctness invariant (items with the same key share a runner, which is what makes plain-copy fan-in safe).scheduling.pyhandles the performance concerns: LPT across CI runners, and slowest-first within each group for intra-runner xdist workers. Swapping in a different per-group distribution rule would keep fan-in safe and only affect wallclock.group_keyreads the fork fromitem.callspec.params["parametrized_fork"], the forks plugin's canonical param name (not a nodeid-token heuristic). A parametrize value likefork_candidateis no longer ambiguous with a real fork param.--dist=loador--dist=loadgroupwithout--loadscopereorder) the order set viaitems[:]is preserved to worker dispatch, verified against xdist 3.8'sLoadScopeScheduling.schedule()source.--loadscopereorderis the one setting that would re-sort scopes by item count and override our order; not used in this repo's CI.workeroutput(pytest_sessionfinishon workers,pytest_testnodedownon the controller) so the terminal summary emits correctly under--dist loadgroup.fill --generate-all-formats(no plugin) vs.fill --grouped-split(N groups merged by plain file copy) produces byte-equivalent output peruv run hasher compareacross every fixture format:blockchain_tests,blockchain_tests_engine,state_tests, andblockchain_tests_engine_xper fork. The harness lives attmp/verify_split_matches_baseline.py(gitignored). It also asserts the plugin'sduration coverage N/Mis non-zero, so a silent fallback to average-based splitting fails the check.🔗 Related Issues or PRs
Requires #2720.
✅ Checklist
All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
just staticAll: PR title adheres to the repo standard - it will be used as the squash commit message and should start
type(scope):.All: Considered updating the online docs in the ./docs/ directory.
All: Set appropriate labels for the changes (only maintainers can apply labels).
Cute Animal Picture