Performance optimizations #105
…fold fitting
- Implement spatial splitter evaluation caching and two-level caching in PanelSplit (by identity of X/y, and initialization pre-generation).
- Generalize independent group splitter detection using MRO class checks and support GroupShuffleSplit.
- Vectorize out-of-fold prediction reconstruction in SequentialCVPipeline using np.argsort.
- Implement parallel fold fitting and prediction using joblib.Parallel in SequentialCVPipeline (n_jobs).
- Configure matplotlib Agg backend in tests/conftest.py for headless test execution.
- Add unit tests for parallel pipelines, spatial caching, and custom subclass independent splitters.
Code Review
This pull request adds parallel execution support to SequentialCVPipeline using joblib and introduces a caching mechanism for spatial splits in PanelSplit to enhance performance. It also refactors the _sort_and_combine function to support vectorized operations across various data formats. Feedback highlights the need for more robust type checking using isinstance and identifies potential edge-case bugs in the combination logic related to empty folds and object length mismatches.
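To make the edge cases above concrete, here is a minimal sketch of argsort-based out-of-fold reconstruction that skips empty folds and rejects length mismatches. It is not the PR's `_sort_and_combine`; the helper name `combine_out_of_fold` and its signature are assumptions made for illustration, and it handles NumPy arrays only.

```python
import numpy as np


def combine_out_of_fold(test_indices, fold_predictions):
    """Reassemble per-fold predictions into ascending row order.

    Hypothetical helper for illustration, not the library's implementation.
    """
    pairs = []
    for idx, pred in zip(test_indices, fold_predictions):
        idx, pred = np.asarray(idx), np.asarray(pred)
        # A fold whose index and prediction lengths disagree cannot be aligned.
        if len(idx) != len(pred):
            raise ValueError("fold indices and predictions differ in length")
        # Empty folds contribute no rows, so leave them out of the concatenation.
        if len(idx) > 0:
            pairs.append((idx, pred))

    if not pairs:
        return np.array([], dtype=int), np.array([])

    all_idx = np.concatenate([idx for idx, _ in pairs])
    all_pred = np.concatenate([pred for _, pred in pairs])

    # One stable argsort restores row order for every fold at once.
    order = np.argsort(all_idx, kind="stable")
    return all_idx[order], all_pred[order]


idx, preds = combine_out_of_fold(
    [np.array([4, 5]), np.array([], dtype=int), np.array([2, 3])],
    [np.array([0.4, 0.5]), np.array([]), np.array([0.2, 0.3])],
)
```

The empty middle fold is dropped before concatenation, so it cannot affect the dtype or the ordering of the combined output.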
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds caching for spatio-temporal splits in PanelSplit and introduces optional fold-level parallelism (n_jobs) in SequentialCVPipeline, along with tests verifying caching behavior and deterministic parallel vs sequential pipeline results.
Changes:
- Add cached spatial/spatio-temporal split computation in `PanelSplit.split()`/`_compute_spatio_temporal_splits()`.
- Add `n_jobs` to `SequentialCVPipeline` and parallelize per-fold fit/predict via `joblib.Parallel`.
- Add tests for caching behavior, parallel pipeline parity, and configure matplotlib to use a headless backend in tests.
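As a rough illustration of the fold-level parallelism described above, the sketch below runs a per-fold fit/predict function through `joblib.Parallel` and compares it with a sequential run. The functions `fit_predict_fold` and `run_folds` are hypothetical stand-ins, not the PR's code, and a plain least-squares fit is substituted for a real estimator. The `prefer="threads"` hint is only there to keep the example lightweight.

```python
import numpy as np
from joblib import Parallel, delayed


def fit_predict_fold(X, y, train_idx, test_idx):
    """Fit least squares on the training rows and predict the test rows."""
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    return test_idx, X[test_idx] @ coef


def run_folds(X, y, splits, n_jobs=1):
    # Parallel returns results in the order the tasks were submitted,
    # so fold outputs line up with `splits` regardless of n_jobs.
    return Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(fit_predict_fold)(X, y, train, test) for train, test in splits
    )


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Expanding-window style splits: each fold trains on all earlier rows.
splits = [
    (np.arange(0, 20), np.arange(20, 30)),
    (np.arange(0, 30), np.arange(30, 40)),
]

sequential = run_folds(X, y, splits, n_jobs=1)
parallel = run_folds(X, y, splits, n_jobs=2)
```

Because each fold depends only on its own train/test indices, the folds are independent tasks, which is what makes a parity check between `n_jobs=1` and `n_jobs=2` meaningful.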
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_spatial_cv.py | Adds tests asserting caching semantics for independent vs dependent group splitters. |
| tests/test_pipeline.py | Adds a test comparing sequential vs parallel pipeline execution and validates n_jobs params handling. |
| tests/conftest.py | Forces matplotlib to use Agg backend for headless test runs. |
| panelsplit/pipeline.py | Adds joblib-based parallel fold execution and refactors fold fit/predict utilities and output recombination. |
| panelsplit/cross_validation.py | Adds caching for spatial and spatio-temporal splits, including optimized reuse across calls. |
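For readers unfamiliar with identity-based caching, the following is a minimal sketch of the idea, assuming the cache key is the object identity of `X` and `y`. The class `CachedSplitter` and its attributes are hypothetical and do not mirror the actual `PanelSplit` internals.

```python
import numpy as np


class CachedSplitter:
    """Hypothetical splitter that reuses splits when called with the same objects."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits
        self._cache_key = None
        self._cache_refs = None  # keeps X and y alive so their ids cannot be reused
        self._cached_splits = None
        self.n_computations = 0  # counter used only to observe cache hits

    def _compute(self, X):
        self.n_computations += 1
        indices = np.arange(len(X))
        folds = np.array_split(indices, self.n_splits)
        return [(np.setdiff1d(indices, test), test) for test in folds]

    def split(self, X, y=None):
        # Identity, not equality: comparing ids is O(1) even for large arrays.
        key = (id(X), id(y))
        if key != self._cache_key:
            self._cached_splits = self._compute(X)
            self._cache_key = key
            self._cache_refs = (X, y)
        return self._cached_splits


X = np.zeros((9, 2))
splitter = CachedSplitter()
first = splitter.split(X)
second = splitter.split(X)         # same object: served from the cache
third = splitter.split(X.copy())   # different object: recomputed
```

One trade-off of keying on identity is that in-place mutation of `X` is not detected, so a cache like this would return stale splits if the caller modifies the array between calls.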
Speeding up PanelSplit (Caching, Vectorization, & Parallel Execution)
Hey @4Freye, I did some profiling and performance optimization on panelsplit to make spatio-temporal cross-validation and pipeline execution run a bit faster. I am using it for a large dataset, and I guess we can make some improvements here and there, e.g., adding support for parallel fold fitting and fixing the test suite to run headlessly. Let me know what you think. Summary below:
Changes
1. Spatial Splitter Caching
2. Vectorized Split Reconstruction
3. Parallel Fold Fitting
4. Headless Test Execution
Verification
All 134 unit tests passing.