Hands-on tutorials for curating data across all modalities with NeMo Curator. Each tutorial is a complete working example with detailed explanations.
New to NeMo Curator? Start with the Getting Started Guide or try the quickstart.py example to understand core concepts.
| Modality | Description | Key Tutorials |
|---|---|---|
| Text | Natural language processing and curation | Deduplication, Classification, Quality Assessment, Tokenization |
| Image | Computer vision and image processing | Aesthetic Classification, NSFW Detection, Deduplication |
| Video | Video processing and analysis | Clipping, Frame Extraction, Filtering |
| Audio | Speech and audio data curation | FLEURS Dataset Processing |
| Interleaved | Multimodal (text + image) data curation | Getting Started, PDF Extraction Pipeline (Nemotron-Parse) |
Complete, production-grade pipelines built on NeMo Curator:
| Recipe | Description | Key Components |
|---|---|---|
| Nemotron-CC • SDG tutorial (in-repo) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the Nemotron-CC datasets | CommonCrawlDownloadExtractStage • Language ID & Filtering • Exact/Fuzzy/Substring Dedup • Ensemble Quality Classification (1 fasttext + 2 FineWeb classifiers) • Synthetic Data Generation (4 tasks) |
The quickstart.py example demonstrates NeMo Curator's foundational architecture:
- Task: Defines a data processing objective
- ProcessingStage: Implements an individual processing step
- Pipeline: Orchestrates multiple stages
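To make the relationship between these three concepts concrete, here is a minimal, library-free sketch. It does not use or reproduce the NeMo Curator API: the `Task`, `ProcessingStage`, and `Pipeline` classes below, along with `LowercaseStage`, `MinWordsStage`, `add_stage`, and `run`, are hypothetical stand-ins written for illustration. The sketch also assumes that a task is a unit of work carrying a batch of text records. See quickstart.py for the actual classes and signatures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical stand-in: a unit of work carrying a batch of text records."""
    task_id: str
    data: List[str]


class ProcessingStage:
    """Hypothetical stand-in: one processing step that maps a Task to a Task."""

    def process(self, task: Task) -> Task:
        raise NotImplementedError


class LowercaseStage(ProcessingStage):
    """Normalizes every record to lowercase."""

    def process(self, task: Task) -> Task:
        return Task(task.task_id, [text.lower() for text in task.data])


class MinWordsStage(ProcessingStage):
    """Drops records with fewer than `min_words` whitespace-separated words."""

    def __init__(self, min_words: int) -> None:
        self.min_words = min_words

    def process(self, task: Task) -> Task:
        kept = [t for t in task.data if len(t.split()) >= self.min_words]
        return Task(task.task_id, kept)


class Pipeline:
    """Hypothetical stand-in: runs each task through the stages in order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stages: List[ProcessingStage] = []

    def add_stage(self, stage: ProcessingStage) -> "Pipeline":
        self.stages.append(stage)
        return self  # returning self allows chained add_stage calls

    def run(self, tasks: List[Task]) -> List[Task]:
        results = []
        for task in tasks:
            for stage in self.stages:
                task = stage.process(task)
            results.append(task)
        return results


# Build a two-stage pipeline and run a single task through it.
pipeline = Pipeline("demo").add_stage(LowercaseStage()).add_stage(MinWordsStage(3))
results = pipeline.run([Task("t0", ["Hello World", "NeMo Curator cleans data"])])
```

In this sketch each stage returns a new `Task` rather than mutating its input, which keeps stages independent of one another and easy to reorder or test in isolation.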
| Category | Links |
|---|---|
| Getting Started | Installation • Core Concepts |
| Modality Guides | Text Curation • Image Curation • Video Curation |
| Advanced | Custom Pipelines • Execution Backends • API Reference |
Documentation: Main Docs • GitHub Discussions