Skip to content

Add PreprocessingPipeline#3438

Merged
alejoe91 merged 53 commits into
SpikeInterface:mainfrom
chrishalcrow:preprocessing-pipeline
Jun 19, 2025
Merged

Add PreprocessingPipeline#3438
alejoe91 merged 53 commits into
SpikeInterface:mainfrom
chrishalcrow:preprocessing-pipeline

Conversation

@chrishalcrow
Copy link
Copy Markdown
Member

@chrishalcrow chrishalcrow commented Sep 25, 2024

Add a PreprocessingPipeline class, which contains ordered preprocessing steps and their kwargs in a dictionary.

Docs:
https://spikeinterface--3438.org.readthedocs.build/en/3438/modules/preprocessing.html#the-preprocessing-pipeline
https://spikeinterface--3438.org.readthedocs.build/en/3438/how_to/build_pipeline_with_dicts.html

You can apply_preprocessing_pipeline to a recording to make a preprocessed recording:

preprocessor_dict = {'bandpass_filter': {'freq_max': 3000}, 'common_reference': {}}

from spikeinterface.preprocessing import apply_preprocessing_pipeline
preprocessed_recording = apply_preprocessing_pipeline(recording, preprocessor_dict)

Under the hood, this uses the _apply method of the new PreprocessingPipeline.

Also adds a function which takes in a provenance.json, provenance.pkl, 'recording.jsonorrecording.pklprovenance file and makes apreprocessor_dict`. So it's easy to extract preprocessing steps from a saved recording.

from spikeinterface.preprocessing import get_preprocessing_dict_from_json
my_dict = get_preprocessing_dict_from_provenance('/path/to/provenance.json')

Can also get it from an analyzer, using get_preprocessing_dict_from_analyzer function.

After you load this, you can either apply the precomputable_kwargs or ignore them and compute on application:

# this will apply the precomputed stuff, like the `M` and `W` matrices from whitening:
pp_rec = si.apply_preprocessing_pipeline(rec, my_dict, apply_precomputed_kwargs=True)
# this will ignore this stuff, and recompute the kwargs on application:
pp_rec = si.apply_preprocessing_pipeline(rec, my_dict, apply_precomputed_kwargs=False)

PR allow for some cool things:

  1. Users can pass a single dictionary to construct a preprocessed recording (as above). Hence it completes the “dictionary workflow”; since you can use dicts in sorting, run_sorter, and postprocessing in compute.
  2. Users can easily visualise their preprocessing pipeline using the repr, including an HTML repr in Jupyter notebook
  3. Increases portability between labs, and should make giving advice to users easier (from us, and from spike sorting developers), since we can just say "Oh, for KS4 NP2.0 we use this dict for preprocessing".
  4. Increases the usefulness of our provenance system, since you can reconstruct human-readable preprocessing steps from the provenance.json file without the original recording (and worrying about paths).

The repr currently looks like this:
Screenshot 2025-05-20 at 15 33 45

@chrishalcrow chrishalcrow added enhancement New feature or request preprocessing Related to preprocessing module labels Sep 25, 2024
@alejoe91 alejoe91 modified the milestone: 0.101.2 Oct 1, 2024
Copy link
Copy Markdown
Member

@zm711 zm711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm out of the loop on this. So I'm not sure how helpful my review of code would be. I would prefer to hear what Alessio and Sam think to feel safe. I think my only concern is that dictionaries since 3.7 maintain order of insertion, but since preprocessing steps are ordered (doing filtering then whitening is not the same as whitening then filtering) I would feel more comfortable if we had some way to guarantee the order rather than relying on dictionary implementation details since they've changed once before. Right?

Comment thread src/spikeinterface/preprocessing/pipeline.py Outdated
Comment thread src/spikeinterface/preprocessing/pipeline.py Outdated
provenance_dict = pickle.load(f)

pipeline_dict_from_provenance = {}
_load_pp_from_dict(provenance_dict, pipeline_dict_from_provenance)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this return something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This _load_pp_from_dict is a gross recursive function since our provenance json/pkl are gross recursive dicts. Due to this structure, it returns the kwargs of the recursive step it's at. So it does return something, but the important bit of the function is that it modifies pipeline_dict_from_provenance. It does return something, so maybe it's best practice to put that returned value somewhere (which I've done). I've also significantly updated the _load_pp_from_dict docstring and variable names, which should hopefully make the purpose of the function clearer.

@chrishalcrow
Copy link
Copy Markdown
Member Author

I guess I'm out of the loop on this. So I'm not sure how helpful my review of code would be. I would prefer to hear what Alessio and Sam think to feel safe. I think my only concern is that dictionaries since 3.7 maintain order of insertion, but since preprocessing steps are ordered (doing filtering then whitening is not the same as whitening then filtering) I would feel more comfortable if we had some way to guarantee the order rather than relying on dictionary implementation details since they've changed once before. Right?

It was incredibly helpful - thanks!!

They made the change in 3.6, and from 3.7 it was added to the language spec ("Changed in version 3.7: Dictionary order is guaranteed to be insertion order." from the bottom of https://docs.python.org/3/library/stdtypes.html#dict), as ruled by Guido (https://mail.python.org/pipermail/python-dev/2017-December/151283.html). So I think it's pretty safe.

Not against implementing OrderedDicts under the hood, but I'd prefer to allow the user to pass a plain dict, to avoid them having to from collections import OrderedDict. If we used OrderdDicts in the codebase, it would be pretty easy to update if python broke dictionary ordering. But I'm personally in favour of keeping it the way it is. The tests will (sometimes) fail if ordered is removed.

Copy link
Copy Markdown
Member

@alejoe91 alejoe91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this amazing work Chris! My main point is to make the functions more generally available to recordings, not just provenance files.

Comment thread doc/modules/preprocessing.rst Outdated
preprocessing_pipeline = PreprocessingPipeline(recording, preprocessing_dict)
# to view the pipeline:
preprocessing_pipeline

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could paste html here, right?

return preprocessed_recording


def get_preprocessing_dict_from_provenance(recording_provenance_path):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be provenance, since you could pass it any json/pkl recording file. What about get_preprocessing_pipeline_from_file?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, sounds good.

Note: I prefer to return the dict than a PreprocessingPipeline itself since this allows the user to then use apply_pipeline(recording, pipeline_dict) and to share it easily. But I could be convinced otherwise!

Comment thread src/spikeinterface/preprocessing/pipeline.py Outdated
return pipeline_dict


def _load_pp_from_dict(prov_dict, kwargs_dict):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this public. It could be helpful

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against this. Mostly because this funcitons main purpose is to modify kwargs_dict. So it doesn't return what you would naively expect. It's like this to make the recursive loading work. In this last round of refactoring, I've added a _make_pipeline_dict_from_recording_dict - maybe that could be made public?? Do you really think it's helpful?

Comment thread src/spikeinterface/preprocessing/pipeline.py Outdated
Comment thread src/spikeinterface/preprocessing/pipeline.py Outdated
Comment thread src/spikeinterface/preprocessing/pipeline.py
@chrishalcrow
Copy link
Copy Markdown
Member Author

Tests failing due to #3990

@alejoe91
Copy link
Copy Markdown
Member

LGTM! Thanks Chris this is going to be supeeeer helpful!

Comment thread examples/how_to/build_pipeline_with_dicts.py Outdated
Comment thread doc/how_to/build_pipeline_with_dicts.rst Outdated
@zm711
Copy link
Copy Markdown
Member

zm711 commented Jun 17, 2025

Not required I guess, but we could fix both of them once tests pass so that we don't forget to fix in the future.

@chrishalcrow
Copy link
Copy Markdown
Member Author

Just fixed a couple of lil doc things - happy once tests pass :)

Copy link
Copy Markdown
Member

@zm711 zm711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't reread the code, but at least from the documentation side this looks good to me too



pp_names_to_functions = {preprocessor.__name__: preprocessor for preprocessor in preprocessor_dict.values()}
pp_names_to_classes = {pp_function.__name__: pp_class for pp_class, pp_function in _all_preprocesser_dict.items()}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside if we are actually going to use this private dict then I probably shouldn't have left the typo as an homage to history :). We should consider fixing that in another PR....

@alejoe91
Copy link
Copy Markdown
Member

Ok to merge for me. @samuelgarcia ok?

@alejoe91 alejoe91 merged commit 172ba18 into SpikeInterface:main Jun 19, 2025
15 checks passed
@chrishalcrow chrishalcrow deleted the preprocessing-pipeline branch July 30, 2025 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request preprocessing Related to preprocessing module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants