diff --git a/README.md b/README.md
index ab1f6ba7e..6f756a44c 100644
--- a/README.md
+++ b/README.md
@@ -103,6 +103,8 @@ NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGrap
 - **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives
 - **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
 
+**Real-World Recipe:** The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is also available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).
+

Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling

@@ -125,6 +127,7 @@ Data curation modules measurably improve model performance. In ablation studies
 |----------|-------|
 | **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
 | **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
+| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](tutorials/synthetic/nemotron_cc/) |
 | **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) |
 | **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions) • [Issues](https://github.com/NVIDIA-NeMo/Curator/issues) |
 
diff --git a/tutorials/README.md b/tutorials/README.md
index 6954c5ba3..8cb753617 100644
--- a/tutorials/README.md
+++ b/tutorials/README.md
@@ -15,6 +15,14 @@ Hands-on tutorials for curating data across all modalities with NeMo Curator. Co
 | **[Video](video/)** | Video processing and analysis | Clipping, Frame Extraction, Filtering |
 | **[Audio](audio/)** | Speech and audio data curation | FLEURS Dataset Processing |
 
+## Production Recipes
+
+Complete, production-grade pipelines built on NeMo Curator:
+
+| Recipe | Description | Key Components |
+|--------|-------------|----------------|
+| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](synthetic/nemotron_cc/) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Language ID & Filtering • Exact/Fuzzy/Substring Dedup • Ensemble Quality Classification (1 fasttext + 2 FineWeb classifiers) • Synthetic Data Generation (4 tasks) |
+
 ## Core Concepts Example
 
 The [`quickstart.py`](quickstart.py) demonstrates NeMo Curator's foundational architecture: