Performance optimizations #105

Open
m9o8 wants to merge 4 commits into 4Freye:main from m9o8:perf/optimizations

m9o8 commented May 21, 2026

Speeding up PanelSplit (Caching, Vectorization, & Parallel Execution)

Hey @4Freye, I profiled panelsplit and made some performance optimizations so that spatio-temporal cross-validation and pipeline execution run faster. I am using it on a large dataset and found a few places to improve, e.g., adding support for parallel fold fitting and fixing the test suite to run headlessly. Let me know what you think. Summary below:


Changes

1. Spatial Splitter Caching

  • What: Previously, spatial splits were recalculated inside the spatio-temporal loop. Even with some local caching, repeating split() calls (which happen constantly in pipeline steps, plotting, etc.) ended up running the spatial splitter and performing index intersections over and over.
  • Fix:
    • Added a caching system.
    • For splitters independent of X and y (like GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit), the splits are pre-generated on init and cached. I upgraded this check to inspect the class MRO (cls.__name__) so it automatically catches custom user subclasses of these splitters.
    • For dependent splitters (like StratifiedGroupKFold), the splits are cached based on the identity (is check) of the X and y inputs.

2. Vectorized Split Reconstruction

  • What: Reconstructing the out-of-fold predictions was using slow Python element-by-element loops.
  • Fix:
    • Swapped the loops for a fast vectorized NumPy implementation using np.argsort(..., kind="stable") (stable sort is used to ensure correctness when training indices overlap).
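
As a rough illustration of the vectorized approach (a sketch, not the PR's `_sort_and_combine`), per-fold outputs can be concatenated once and reordered with a single stable argsort. The function name `combine_fold_predictions` and the toy data are assumptions made for this example.

```python
import numpy as np


def combine_fold_predictions(fold_indices, fold_predictions):
    """Concatenate per-fold outputs and reorder them by original row position."""
    all_idx = np.concatenate(fold_indices)
    all_pred = np.concatenate(fold_predictions)
    # A stable sort keeps rows with equal indices in fold order.
    order = np.argsort(all_idx, kind="stable")
    return all_idx[order], all_pred[order]


# Folds arrive out of order; each prediction is paired with its row index.
fold_indices = [np.array([4, 5]), np.array([0, 1]), np.array([2, 3])]
fold_predictions = [
    np.array([0.4, 0.5]),
    np.array([0.0, 0.1]),
    np.array([0.2, 0.3]),
]
idx, preds = combine_fold_predictions(fold_indices, fold_predictions)

# With overlapping indices, ties are resolved by fold order.
dup_idx, dup_preds = combine_fold_predictions(
    [np.array([1, 0]), np.array([1])],
    [np.array([10.0, 20.0]), np.array([30.0])],
)
```

With a non-stable sort, the relative order of the two rows for index 1 in the second call would not be guaranteed, which is why `kind="stable"` matters when indices repeat.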

3. Parallel Fold Fitting

  • What: Fitting estimators across cross-validation folds was strictly sequential.
  • Fix:
    • Added an optional n_jobs parameter (defaults to 1) to SequentialCVPipeline.
    • Parallelized fold fitting using joblib.Parallel.
    • Refactored pipeline steps to use module-level helpers to prevent process-boundary serialization/pickling issues.
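
The pattern described above might look roughly like the following sketch. It is not the SequentialCVPipeline implementation: `fit_predict_fold`, `cross_val_predictions`, and the `backend` argument are hypothetical names introduced here. The per-fold helper lives at module level so that joblib's process-based backends can serialize it.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def fit_predict_fold(estimator, X, y, train_idx, test_idx):
    """Module-level helper: fit a fresh clone on one fold and predict its test rows."""
    model = clone(estimator).fit(X[train_idx], y[train_idx])
    return model.predict(X[test_idx])


def cross_val_predictions(estimator, X, y, splits, n_jobs=1, backend=None):
    """Run one task per fold; results come back in the order of `splits`."""
    return Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(fit_predict_fold)(estimator, X, y, train_idx, test_idx)
        for train_idx, test_idx in splits
    )


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Two expanding-window folds, written out by hand for the example.
splits = [
    (np.arange(0, 40), np.arange(40, 50)),
    (np.arange(0, 50), np.arange(50, 60)),
]

sequential = cross_val_predictions(LinearRegression(), X, y, splits, n_jobs=1)
# Threads keep this demo lightweight; leaving `backend` as None uses joblib's
# default process-based backend, which is where picklable helpers matter.
parallel = cross_val_predictions(
    LinearRegression(), X, y, splits, n_jobs=2, backend="threading"
)
```

Because `Parallel` returns results in submission order, the per-fold outputs line up with `splits` regardless of which fold finishes first, which is what makes a parity check between `n_jobs=1` and `n_jobs=2` meaningful.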

4. Headless Test Execution

  • What: Running the tests crashed for me on Windows, apparently because Matplotlib tried to start a Tkinter GUI backend.
  • Fix:
    • Added a pytest config setting the Matplotlib backend to Agg for headless execution.
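
For reference, a headless setup of this kind can be as small as the sketch below. The file path comes from the commit message; the exact contents of the PR's conftest.py are not shown on this page, so treat this as an assumption about its shape.

```python
# tests/conftest.py (sketch)
import matplotlib

# Select the non-interactive Agg backend so no GUI toolkit (e.g. Tkinter) is loaded.
# pytest imports conftest.py before collecting test modules, so this runs first.
matplotlib.use("Agg")
```

Setting the `MPLBACKEND=Agg` environment variable before running pytest is an alternative that needs no code change.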

Verification

All 134 unit tests passing.

  • Added caching tests to verify both the base caching behavior and custom subclass/GroupShuffleSplit caching.
  • Added parallel CV tests to verify that n_jobs=2 produces results identical to n_jobs=1.

m9o8 added 2 commits May 21, 2026 21:58
…fold fitting

- Implement spatial splitter evaluation caching and two-level caching in PanelSplit (by identity of X/y, and initialization pre-generation).
- Generalize independent group splitter detection using MRO class checks and support GroupShuffleSplit.
- Vectorize out-of-fold prediction reconstruction in SequentialCVPipeline using np.argsort.
- Implement parallel fold fitting and prediction using joblib.Parallel in SequentialCVPipeline (n_jobs).
- Configure matplotlib Agg backend in tests/conftest.py for headless test execution.
- Add unit tests for parallel pipelines, spatial caching, and custom subclass independent splitters.
Copilot AI review requested due to automatic review settings May 21, 2026 20:26

gemini-code-assist (bot) left a comment

Code Review

This pull request adds parallel execution support to SequentialCVPipeline using joblib and introduces a caching mechanism for spatial splits in PanelSplit to enhance performance. It also refactors the _sort_and_combine function to support vectorized operations across various data formats. Feedback highlights the need for more robust type checking using isinstance and identifies potential edge-case bugs in the combination logic related to empty folds and object length mismatches.

Comment thread panelsplit/cross_validation.py Outdated
Comment thread panelsplit/pipeline.py
Comment thread panelsplit/pipeline.py Outdated

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds caching for spatio-temporal splits in PanelSplit and introduces optional fold-level parallelism (n_jobs) in SequentialCVPipeline, along with tests verifying caching behavior and deterministic parallel vs sequential pipeline results.

Changes:

  • Add cached spatial/spatio-temporal split computation in PanelSplit.split() / _compute_spatio_temporal_splits().
  • Add n_jobs to SequentialCVPipeline and parallelize per-fold fit/predict via joblib.Parallel.
  • Add tests for caching behavior, parallel pipeline parity, and configure matplotlib to use a headless backend in tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_spatial_cv.py | Adds tests asserting caching semantics for independent vs dependent group splitters. |
| tests/test_pipeline.py | Adds a test comparing sequential vs parallel pipeline execution and validates n_jobs params handling. |
| tests/conftest.py | Forces matplotlib to use Agg backend for headless test runs. |
| panelsplit/pipeline.py | Adds joblib-based parallel fold execution and refactors fold fit/predict utilities and output recombination. |
| panelsplit/cross_validation.py | Adds caching for spatial and spatio-temporal splits, including optimized reuse across calls. |

Comment thread panelsplit/cross_validation.py
Comment thread panelsplit/pipeline.py
Comment thread tests/test_pipeline.py
Comment thread panelsplit/pipeline.py Outdated