Merged
3 changes: 3 additions & 0 deletions README.md
@@ -103,6 +103,8 @@ NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGrap
- **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives
- **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)

**Real-World Recipe:** The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is also available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).

<p align="center">
<img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
</p>
@@ -125,6 +127,7 @@ Data curation modules measurably improve model performance. In ablation studies
|----------|-------|
| **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
| **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](tutorials/synthetic/nemotron_cc/) |
| **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) |
| **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions) • [Issues](https://github.com/NVIDIA-NeMo/Curator/issues) |

8 changes: 8 additions & 0 deletions tutorials/README.md
@@ -15,6 +15,14 @@ Hands-on tutorials for curating data across all modalities with NeMo Curator. Co
| **[Video](video/)** | Video processing and analysis | Clipping, Frame Extraction, Filtering |
| **[Audio](audio/)** | Speech and audio data curation | FLEURS Dataset Processing |

## Production Recipes

Complete, production-grade pipelines built on NeMo Curator:

| Recipe | Description | Key Components |
|--------|-------------|----------------|
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](synthetic/nemotron_cc/) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Language ID & Filtering • Exact/Fuzzy/Substring Dedup • Ensemble Quality Classification (1 fasttext + 2 FineWeb classifiers) • Synthetic Data Generation (4 tasks) |

## Core Concepts Example

The [`quickstart.py`](quickstart.py) demonstrates NeMo Curator's foundational architecture: