Remove the use of pickle throughout codebase #15629

Merged
blisc merged 16 commits into NVIDIA-NeMo:main from blisc:2604_rm_pickle
Apr 29, 2026
@blisc (Collaborator) commented Apr 21, 2026

What does this PR do?

  • Replaces pickle with msgpack in speaker diarization setups that save embeddings to a binary file
  • Removes Tabular Tokenizers from the codebase
  • Updates TTS's files_to_ignore to use a JSON list instead of a pickled list
  • Switches ngram_lm binary files from pickle to msgpack
  • Removes freesound scripts
  • Replaces pickle with msgpack in the speech data explorer cache
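
For the files_to_ignore change, a minimal sketch of the JSON approach might look like the following. The helper names and the file name are hypothetical, and the sketch assumes the list holds file paths as plain strings; the actual NeMo code may differ.

```python
import json
import os
import tempfile


def save_files_to_ignore(path, files):
    """Write the ignore list as a sorted JSON array of strings."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(files), f, indent=2)


def load_files_to_ignore(path):
    """Read the ignore list back, rejecting anything that is not a list of strings."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError(f"{path} must contain a JSON list of strings")
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    ignore_path = os.path.join(tmp_dir, "files_to_ignore.json")
    save_files_to_ignore(ignore_path, {"clip_b.wav", "clip_a.wav"})
    ignored = load_files_to_ignore(ignore_path)

print(ignored)
```

Unlike a pickled list, the JSON file is human-readable, and loading it only ever produces plain data types.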

Collection: core

Changelog

  • Removes use of pickle throughout codebase

PR Type:

  • New Feature
  • Bugfix
  • Documentation

…iles elsewhere

Signed-off-by: Jason <jasoli@nvidia.com>
blisc and others added 2 commits April 21, 2026 15:43
Signed-off-by: blisc <blisc@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
@blisc blisc added the Run CICD label Apr 21, 2026
Signed-off-by: blisc <blisc@users.noreply.github.com>
Comment thread examples/speaker_tasks/recognition/extract_speaker_embeddings.py Fixed
Comment thread nemo/collections/asr/models/clustering_diarizer.py Fixed
Signed-off-by: Jason <jasoli@nvidia.com>
copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: blisc <blisc@users.noreply.github.com>
Signed-off-by: blisc <blisc@users.noreply.github.com>
  logging.info(f"Loading the cached pickle file of hypotheses from '{cfg.hyps_cache_file}' ...")
  with open(cfg.hyps_cache_file, 'rb') as probs_file:
-     all_hyps = pickle.load(probs_file)
+     all_hyps = msgpack.load(probs_file)
Collaborator:

Not sure. Did you check whether the default extension of cfg.hyps_cache_file is .pkl? If it is, we should update it, and likewise for the other changes.

Collaborator (Author):

TBH, I don't know where this config file is supposed to exist, so I am leaving it as is.
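
As an illustration of the cache round trip discussed in this thread, here is a minimal sketch that writes and reads a hypotheses cache with msgpack. The helper names, the sample data, and the .msgpack extension are assumptions made for illustration and are not taken from the PR; the sketch also assumes msgpack 1.0 or later, where strings are decoded to str by default.

```python
import os
import tempfile

import msgpack  # third-party package: pip install msgpack


def save_hyps(path, all_hyps):
    """Serialize the hypotheses to a binary msgpack file."""
    with open(path, "wb") as f:
        msgpack.dump(all_hyps, f)


def load_hyps(path):
    """Load hypotheses previously written by save_hyps."""
    with open(path, "rb") as f:
        return msgpack.load(f)


# Hypothetical data: one utterance with two (text, score) candidates.
all_hyps = [[("hello world", -1.5), ("hello word", -2.25)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache name with a non-.pkl extension.
    cache_path = os.path.join(tmp_dir, "hyps_cache.msgpack")
    save_hyps(cache_path, all_hyps)
    loaded = load_hyps(cache_path)

# msgpack has no tuple type, so each (text, score) pair comes back as a list.
print(loaded)
```

One difference from pickle worth keeping in mind: tuples are stored as msgpack arrays and come back as lists, so code that unpacks each hypothesis positionally keeps working, while code that checks for tuples would need adjusting.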

timestamp = json_mtime.strftime('%Y%m%d_%H%M')
pickle_filename += '_' + timestamp + '.pkl'
if os.path.exists(pickle_filename):
    with open(pickle_filename, 'rb') as f:
Collaborator:

Update pickle_filename and the extension.

Collaborator (Author):

It has been changed throughout the codebase, as best as I can find.

@blisc (Collaborator, Author), Apr 23, 2026:

It looks like there is an update to SDE in https://github.com/NVIDIA-NeMo/NeMo/pull/15500/changes. I will unstage my changes from this PR.

Signed-off-by: Jason <jasoli@nvidia.com>
Comment thread tools/speech_data_explorer/data_explorer.py Fixed
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
@blisc (Collaborator, Author) commented Apr 28, 2026

/ok to test 878336f

@github-actions (Contributor)

[🤖]: Hi @blisc 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@blisc blisc merged commit 0558934 into NVIDIA-NeMo:main Apr 29, 2026
136 checks passed
@blisc blisc deleted the 2604_rm_pickle branch April 29, 2026 13:49
4 participants