This major release represents a fundamental architecture shift from Dask to Ray, expanding NeMo Curator to support multimodal data curation with new video and audio capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
- New Docker container: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the NGC Catalog (`nvcr.io/nvidia/nemo-curator:25.09`)
- Docker file to build own image: Simplified Dockerfile structure for custom container builds with FFmpeg support
- UV source installations: Integrated UV package manager (v0.8.22) for faster dependency management
- PyPI improvements: Enhanced PyPI installation with modular extras for targeted functionality:
| Extra | Installation Command | Description |
| --- | --- | --- |
| All Modalities | `nemo-curator[all]` | Complete installation with all modalities and GPU support |
| Text Curation | `nemo-curator[text_cuda12]` | GPU-accelerated text processing with RAPIDS |
| Image Curation | `nemo-curator[image_cuda12]` | Image processing with NVIDIA DALI |
| Audio Curation | `nemo-curator[audio_cuda12]` | Speech recognition with NeMo ASR models |
| Video Curation | `nemo-curator[video_cuda12]` | Video processing with GPU acceleration |
| Basic GPU | `nemo-curator[cuda12]` | CUDA utilities without modality-specific dependencies |

All GPU installations require the NVIDIA PyPI index:
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA]
NeMo Curator now supports comprehensive video data curation with distributed processing capabilities:
- Video splitting: Fixed-stride and scene-change detection (TransNetV2) for clip extraction
- Semantic deduplication: K-means clustering and pairwise similarity for near-duplicate clip removal
- Content filtering: Motion-based filtering and aesthetic filtering for quality improvement
- Embedding generation: Cosmos-Embed1 models for clip-level embeddings
- Ray-based distributed architecture: Scalable video processing with autoscaling support
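As a rough illustration of the fixed-stride splitting mentioned above, the sketch below computes clip boundaries for a video of known duration. It is not NeMo Curator's implementation: the function name `fixed_stride_spans` and its parameters (`clip_len_s`, `stride_s`, `min_clip_len_s`) are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    stride_s: float,
    min_clip_len_s: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in seconds for fixed-stride clip extraction.

    Hypothetical helper for illustration; not part of the NeMo Curator API.
    """
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")

    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        # Clamp the last clip to the end of the video.
        end = min(start + clip_len_s, duration_s)
        # Skip a trailing clip that is shorter than the requested minimum.
        if end - start >= min_clip_len_s:
            spans.append((start, end))
        start += stride_s
    return spans


if __name__ == "__main__":
    # A 25-second video cut into 10-second clips every 10 seconds.
    for span in fixed_stride_spans(25.0, 10.0, 10.0, min_clip_len_s=2.0):
        print(span)
```

Setting `stride_s` smaller than `clip_len_s` yields overlapping clips; scene-change detection would replace the fixed boundaries with model-predicted ones.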
New audio curation capabilities for speech data processing:
- ASR inference: Automatic speech recognition using NeMo Framework pretrained models
- Quality assessment: Word Error Rate (WER) and Character Error Rate (CER) calculation
- Speech metrics: Duration analysis and speech rate metrics (words/characters per second)
- Text integration: Seamless integration with text curation workflows via `AudioToDocumentStage`
- Manifest support: JSONL manifest format for audio file management
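For readers unfamiliar with the quality metrics listed above, here is a minimal sketch of how Word Error Rate and Character Error Rate are conventionally defined: the edit distance between the ASR hypothesis and the reference transcript, divided by the reference length. The helper names (`edit_distance`, `wer`, `cer`) are assumptions for this example and say nothing about how NeMo Curator computes these values internally.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER={wer(ref, hyp):.3f} CER={cer(ref, hyp):.3f}")
```

Because insertions count as errors, both metrics can exceed 1.0 when the hypothesis is much longer than the reference.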
- Ray backend migration: Complete transition from Dask to Ray for distributed text processing
- Improved model-based classifier throughput: Better overlapping of compute between tokenization and inference through length-based sequence sorting for optimal GPU memory utilization
- Task-centric architecture: New `Task`-based processing model for finer-grained control
- Pipeline redesign: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
- Pipeline-based architecture: Transitioned from legacy `ImageTextPairDataset` to modern stage-based processing with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
- DALI-based image loading: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
- Modular processing stages: Separate stages for embedding generation, aesthetic filtering, and NSFW filtering
- Task-based data flow: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
Learn more about image curation.
Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
- Exact and Fuzzy deduplication: Updated rapidsmpf-based shuffle backend for more efficient GPU-to-GPU data transfer and better spilling capabilities
- Semantic deduplication: Support for deduplicating text, image, and video datasets using unified embedding-based workflows
- New ranking strategies: Added `RankingStrategy`, which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting metadata-based ranking to prioritize specific datasets or inputs
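The idea behind ranking during semantic deduplication can be sketched in a few lines: within each cluster, order the members by a metadata field, then greedily keep a record only if it is not too similar to one already kept. Everything below is an assumption made for illustration (the record layout, the `priority` field, the cosine-similarity threshold, and the function name `dedup_with_ranking`); it does not reproduce the `RankingStrategy` API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_with_ranking(
    records: List[Dict],
    rank_key: str = "priority",
    threshold: float = 0.9,
    descending: bool = True,
) -> List[str]:
    """Return ids of records kept after rank-aware duplicate removal.

    Hypothetical sketch: each record is a dict with "id", "cluster",
    "embedding", and a metadata field named by `rank_key`.
    """
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster"]].append(rec)

    kept_ids: List[str] = []
    for members in clusters.values():
        # Higher-ranked records are visited first, so they win ties.
        ranked = sorted(members, key=lambda r: r[rank_key], reverse=descending)
        kept: List[Dict] = []
        for rec in ranked:
            if all(cosine(rec["embedding"], k["embedding"]) < threshold for k in kept):
                kept.append(rec)
        kept_ids.extend(r["id"] for r in kept)
    return kept_ids


if __name__ == "__main__":
    data = [
        {"id": "a", "cluster": 0, "embedding": [1.0, 0.0], "priority": 1},
        {"id": "b", "cluster": 0, "embedding": [0.99, 0.01], "priority": 5},
        {"id": "c", "cluster": 0, "embedding": [0.0, 1.0], "priority": 2},
        {"id": "d", "cluster": 1, "embedding": [1.0, 1.0], "priority": 0},
    ]
    print(dedup_with_ranking(data))
```

In this toy data, `a` and `b` are near-duplicates; ranking by descending `priority` keeps `b`, while ranking in ascending order would keep `a` instead.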
The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
User Layer: Pipeline → ProcessingStage X→Y → ProcessingStage Y→Z → ProcessingStage Z→W
↓
Orchestration Layer: BaseExecutor Interface
↓
Backend Layer: XennaExecutor (Production Ready) | RayActorPoolExecutor (Experimental) | RayDataExecutor (Experimental)
↓
Adaptation Layer: Xenna Adapter | Ray Actor Adapter | Ray Data Adapter
↓
Execution Layer: Cosmos-Xenna (Streaming/Batch) | Ray Actor Pool (Load Balancing) | Ray Data API (Dataset Processing)
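The layering in the diagram above can be approximated with a toy model: typed stages composed into a pipeline, and an executor interface that decides how the stages are run. The classes below (`ToyStage`, `ToyExecutor`, `InProcessExecutor`, `ToyPipeline`, `Resources`) are stand-ins invented for this sketch; they mirror the roles of the stage, executor, and pipeline layers but are not the NeMo Curator classes, and the in-process executor merely takes the place of a real Ray-based backend.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Generic, List, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


@dataclass(frozen=True)
class Resources:
    """Hypothetical per-stage resource request."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0


class ToyStage(ABC, Generic[X, Y]):
    """User layer: a stage that turns a task of type X into one of type Y."""
    resources = Resources()

    @abstractmethod
    def process(self, task: X) -> Y:
        ...


class ToyExecutor(ABC):
    """Orchestration layer: a common interface over execution backends."""

    @abstractmethod
    def execute(self, stages: Sequence[ToyStage], tasks: Sequence[Any]) -> List[Any]:
        ...


class InProcessExecutor(ToyExecutor):
    """Stand-in backend that runs every stage sequentially in this process."""

    def execute(self, stages: Sequence[ToyStage], tasks: Sequence[Any]) -> List[Any]:
        current = list(tasks)
        for stage in stages:
            current = [stage.process(task) for task in current]
        return current


class ToyPipeline:
    """Holds an ordered list of stages and delegates execution to a backend."""

    def __init__(self, stages: Sequence[ToyStage]) -> None:
        self.stages = list(stages)

    def run(self, executor: ToyExecutor, tasks: Sequence[Any]) -> List[Any]:
        return executor.execute(self.stages, tasks)


class Tokenize(ToyStage[str, List[str]]):
    def process(self, task: str) -> List[str]:
        return task.split()


class CountTokens(ToyStage[List[str], int]):
    resources = Resources(cpus=0.5)

    def process(self, task: List[str]) -> int:
        return len(task)


if __name__ == "__main__":
    pipeline = ToyPipeline([Tokenize(), CountTokens()])
    print(pipeline.run(InProcessExecutor(), ["a b c", "d e"]))
```

Because the pipeline only talks to the executor interface, swapping the backend does not require changing any stage code, which is the point of the separate orchestration and adaptation layers.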
- New Pipeline API: Ray-based pipeline execution with `BaseExecutor` interface
- Multiple backends: Support for Xenna, Ray Actor Pool, and Ray Data execution backends
- Resource specification: Configurable CPU and GPU memory requirements per stage
- Stage composition: Improved stage validation and execution orchestration
- ProcessingStage redesign: Generic `ProcessingStage[X, Y]` base class with type safety
- Resource requirements: Built-in resource specification for CPU and GPU memory
- Backend adapters: Stage adaptation layer for different Ray orchestration systems
- Input/output validation: Enhanced type checking and data validation
- Text tutorials: Updated all text curation tutorials to use new Ray-based API
- Image tutorials: Migrated image processing tutorials to unified backend
- Audio tutorials: New audio curation tutorials
- Video tutorials: New video processing tutorials
For all tutorial content, refer to the tutorials directory in the NeMo Curator GitHub repository.
(Pending Refactor in Future Release)
- Synthetic data generation: Synthetic text generation features are being refactored for Ray compatibility
- Hard negative mining: Retrieval-based data generation workflows under development
- PII processing: Personally Identifiable Information removal tools are being updated for Ray backend
- Privacy workflows: Enhanced privacy-preserving data curation capabilities in development
- Data blending: Multi-source dataset blending functionality being refactored
- Dataset shuffling: Large-scale data shuffling operations under development
- Local preview capability: Improved documentation build system with local preview support
- Modality-specific guides: Comprehensive documentation for each supported modality (text, image, audio, video)
- API reference: Complete API documentation with type annotations and examples
The next release will focus on completing the refactor of Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.
- New How-to Data Recipes (Tutorials)
- Multimodal DAPT Curation w/ PDF Extraction
- Llama Nemotron Data Curation
- LLM NIM - PII Redaction
- Performance and Code Optimizations
- Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
- Removed convoluted backend switching logic that caused performance issues
- Eliminated expensive length assertions that could cause timeouts on large datasets
- Improved GPU utilization during KMeans clustering operations
- Tested on 37M embedding dataset (80GB) across 7 GPUs with substantial performance gains
- FastText Download URL Fix
- Corrected the `fasttext` model download URL in nemotron-cc tutorial
- Changed from `dl.fbaipublicfiles.com/fastText/` to `dl.fbaipublicfiles.com/fasttext/`
- Ensures reliable model downloads for language identification
- NeMo Retriever Tutorial Bug Fix
- Fixed lambda function bug in `RetrieverEvalSetGenerator`
- Corrected score assignment from `df["question"].apply(lambda: 1)` to `df["score"] = 1`
- API Usage Updates
- Updated examples and tutorials to use correct `DocumentDataset` API
- Replaced deprecated `write_to_disk(result, output_dir, output_type="parquet")` with `result.to_parquet(output_dir)`
- Updated exact deduplication workflows: `deduplicator.remove()` now returns `DocumentDataset` directly
- Llama Based PII Redaction
- Trafilatura Text Extractor
- Chinese & Japanese Stopwords for Text Extractors
- Writing gzip compressed jsonl datasets
- Training dataset curation for retriever customization using hard-negative mining
- Implemented a memory efficient pairwise similarity in Semantic Deduplication
- Fix Transformers + Cuda Context bug
- Fix rate limit in SDG Retriever Eval Tutorial
- Python 3.12 Support
- Curator on Blackwell
- Nemotron-CC Dataset Recipe
- Performant S3 for Fuzzy Deduplication
- Synthetic Data Generation for Text Retrieval
- LLM-based Filters
- Easiness
- Answerability
- Q&A Retrieval Generation Pipeline
- Image Curation
- Image Embedding Creation
- Aesthetic Classifier
- NSFW Classifier
- Semantic Deduplication
- Text Curation
- Quality Classifier
- Aegis Classifier
- FineWeb-Edu Classifier
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0
- Add spacy<3.8 pin to r0.4.1 by @ayushdg in NVIDIA-NeMo#279
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1
- Semantic Deduplication
- Resiliparse for Text Extraction
- Improve Distributed Data Classification - Domain classifier is 1.55x faster through intelligent batching
- Synthetic data generation for fine-tuning
- Update README by @ryantwolf in NVIDIA-NeMo#6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA-NeMo#5
- Add workflow for running cpu pytests by @ayushdg in NVIDIA-NeMo#13
- Add pre-commit style checks by @ayushdg in NVIDIA-NeMo#14
- Add citation by @ryantwolf in NVIDIA-NeMo#15
- Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA-NeMo#20
- Bump Python and RAPIDS versions by @ryantwolf in NVIDIA-NeMo#16
- Add batched decorator by @ryantwolf in NVIDIA-NeMo#18
- Add issue templates by @ayushdg in NVIDIA-NeMo#22
- Add dependency to fix justext by @ryantwolf in NVIDIA-NeMo#24
- Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA-NeMo#35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA-NeMo#34
- Improve speed of AddId module by @ryantwolf in NVIDIA-NeMo#36
- Make GPU dependencies optional by @ayushdg in NVIDIA-NeMo#27
- Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA-NeMo#41
- Adds Nemo Curator K8s example by @terrykong in NVIDIA-NeMo#40
- Move common dedup utils and remove unused code by @ayushdg in NVIDIA-NeMo#42
- Fix lang id example by @ryantwolf in NVIDIA-NeMo#37
- Add dataset blending tool by @ryantwolf in NVIDIA-NeMo#32
- High level fuzzy duplicates module by @ayushdg in NVIDIA-NeMo#46
- Fix indexing in PII Modifier by @ryantwolf in NVIDIA-NeMo#55
- Disable string conversion globally by @ryantwolf in NVIDIA-NeMo#56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA-NeMo#57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA-NeMo#45
- Only import PII constants during Curator import by @ayushdg in NVIDIA-NeMo#61
- Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in NVIDIA-NeMo#60
- [REVIEW] Switch Models to use Crossfit by @VibhuJawa in NVIDIA-NeMo#58
- Remove argparse from get_client function signature by @ryantwolf in NVIDIA-NeMo#12
- Fuzzy Dedup: Use text_field instead of hardcoded text column by @ayushdg in NVIDIA-NeMo#74
- Add pull request template by @ayushdg in NVIDIA-NeMo#78
- Add jupyter notebook tutorial for single node mulilingual dataset by @nicoleeeluo in NVIDIA-NeMo#30
- Update issue templates by @ryantwolf in NVIDIA-NeMo#81
- Fix #91 - Incorrect reference to domain_classifier_example.py by @miguelusque in NVIDIA-NeMo#92
- Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in NVIDIA-NeMo#75
- Update readme by @ayushdg in NVIDIA-NeMo#93
- Update documentation for new version by @ryantwolf in NVIDIA-NeMo#83
- Update requirements documentation. by @ayushdg in NVIDIA-NeMo#98
- Make sure query-planning is disabled for now by @rjzamora in NVIDIA-NeMo#97
- Applying SEO Best Pratices by @aschilling-nv in NVIDIA-NeMo#104
- Shuffle CC result on group before writing out by @ayushdg in NVIDIA-NeMo#110
- Added tutorials to index.rst by @jgerh in NVIDIA-NeMo#113
- Pin to numpy<2 to avoid spacy compat issues by @ayushdg in NVIDIA-NeMo#119
- Fix #116. Fix broken links by @miguelusque in NVIDIA-NeMo#117
- Update index.rst by @aschilling-nv in NVIDIA-NeMo#129
- Fix nemo_curator import in CPU only environment when GPU packages are installed. by @ayushdg in NVIDIA-NeMo#123
- Improve Common Crawl download by @ryantwolf in NVIDIA-NeMo#82
- Update README.md by @Maghoumi in NVIDIA-NeMo#126
- Allow multiple filenames per partition when separating by metadata by @ayushdg in NVIDIA-NeMo#99
- [REVIEW] Add Resiliparse option for text extraction by @sarahyurick in NVIDIA-NeMo#128
- Fix 69 - Refactor how arguments are added to scripts by @miguelusque in NVIDIA-NeMo#102
- Stricter check for query planning. by @ayushdg in NVIDIA-NeMo#107
- Add DataFrame example to Distributed Data Classification tutorial by @sarahyurick in NVIDIA-NeMo#137
- Enable Sem-dedup by @VibhuJawa in NVIDIA-NeMo#130
- Remove lxml installation by @ryantwolf in NVIDIA-NeMo#140
- Nemotron 340 SDG Pipeline Tutorial by @chrisalexiuk-nvidia in NVIDIA-NeMo#144
- Add Synthetic Data Generation Module by @ryantwolf in NVIDIA-NeMo#136
- Skip explicit comms shuffle for dask-cuda 24.06 by @ayushdg in NVIDIA-NeMo#147
- Add support for NeMo SDK by @ryantwolf in NVIDIA-NeMo#131
- [REVIEW] Fix SemDedup bugs by @VibhuJawa in NVIDIA-NeMo#151
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in NVIDIA-NeMo#135
- Fix bug with torch rmm and nemo by @ryantwolf in NVIDIA-NeMo#155
- @ayushdg made their first contribution in NVIDIA-NeMo#13
- @terrykong made their first contribution in NVIDIA-NeMo#40
- @rjzamora made their first contribution in NVIDIA-NeMo#60
- @nicoleeeluo made their first contribution in NVIDIA-NeMo#30
- @aschilling-nv made their first contribution in NVIDIA-NeMo#104
- @pre-commit-ci made their first contribution in NVIDIA-NeMo#135
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.4.0
- Update README by @ryantwolf in NVIDIA-NeMo#6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in NVIDIA-NeMo#5
- Add workflow for running cpu pytests by @ayushdg in NVIDIA-NeMo#13
- Add pre-commit style checks by @ayushdg in NVIDIA-NeMo#14
- Add citation by @ryantwolf in NVIDIA-NeMo#15
- Fix Noisy CUDA Shutdown by @ryantwolf in NVIDIA-NeMo#20
- Bump Python and RAPIDS versions by @ryantwolf in NVIDIA-NeMo#16
- Add batched decorator by @ryantwolf in NVIDIA-NeMo#18
- Add issue templates by @ayushdg in NVIDIA-NeMo#22
- Add dependency to fix justext by @ryantwolf in NVIDIA-NeMo#24
- Fix metadata inference with pandas and dask by @ryantwolf in NVIDIA-NeMo#35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in NVIDIA-NeMo#34
- Improve speed of AddId module by @ryantwolf in NVIDIA-NeMo#36
- Make GPU dependencies optional by @ayushdg in NVIDIA-NeMo#27
- Fix failing GPU tests with latest pandas bump by @ayushdg in NVIDIA-NeMo#41
- Adds Nemo Curator K8s example by @terrykong in NVIDIA-NeMo#40
- Move common dedup utils and remove unused code by @ayushdg in NVIDIA-NeMo#42
- Fix lang id example by @ryantwolf in NVIDIA-NeMo#37
- Add dataset blending tool by @ryantwolf in NVIDIA-NeMo#32
- High level fuzzy duplicates module by @ayushdg in NVIDIA-NeMo#46
- Fix indexing in PII Modifier by @ryantwolf in NVIDIA-NeMo#55
- Disable string conversion globally by @ryantwolf in NVIDIA-NeMo#56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in NVIDIA-NeMo#57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in NVIDIA-NeMo#45
- Only import PII constants during Curator import by @ayushdg in NVIDIA-NeMo#61
- Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in NVIDIA-NeMo#60
- @Maghoumi made their first contribution in NVIDIA-NeMo#5
- @terrykong made their first contribution in NVIDIA-NeMo#40
- @miguelusque made their first contribution in NVIDIA-NeMo#57
- @rjzamora made their first contribution in NVIDIA-NeMo#60
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0