
Commit 06b8388

Authored by arhamm1, claude, and sarahyurick
docs: cross-link Nemotron-CC recipe as a production NeMo Curator example (#1767)
* docs: cross-link Nemotron-CC recipe as a production NeMo Curator example

Adds references to the Nemotron-CC data curation pipeline in README.md and tutorials/README.md so users can discover a production-scale, end-to-end example of NeMo Curator in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: address review comments — add in-repo SDG link and complete pipeline stages

- Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
- Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
- Add language identification step to the Real-World Recipe callout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
1 parent 4c3af97 commit 06b8388

2 files changed

Lines changed: 11 additions & 0 deletions


README.md

Lines changed: 3 additions & 0 deletions
@@ -103,6 +103,8 @@ NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGrap
 - **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives
 - **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
 
+**Real-World Recipe:** The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is also available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).
+
 <p align="center">
 <img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
 </p>
@@ -125,6 +127,7 @@ Data curation modules measurably improve model performance. In ablation studies
 |----------|-------|
 | **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/)[API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html)[Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
 | **Tutorials** | [Text](tutorials/text/)[Image](tutorials/image/)[Video](tutorials/video/)[Audio](tutorials/audio/) |
+| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc)[SDG tutorial (in-repo)](tutorials/synthetic/nemotron_cc/) |
 | **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html)[Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) |
 | **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions)[Issues](https://github.com/NVIDIA-NeMo/Curator/issues) |
 

tutorials/README.md

Lines changed: 8 additions & 0 deletions
@@ -15,6 +15,14 @@ Hands-on tutorials for curating data across all modalities with NeMo Curator. Co
 | **[Video](video/)** | Video processing and analysis | Clipping, Frame Extraction, Filtering |
 | **[Audio](audio/)** | Speech and audio data curation | FLEURS Dataset Processing |
 
+## Production Recipes
+
+Complete, production-grade pipelines built on NeMo Curator:
+
+| Recipe | Description | Key Components |
+|--------|-------------|----------------|
+| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc)[SDG tutorial (in-repo)](synthetic/nemotron_cc/) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Language ID & Filtering • Exact/Fuzzy/Substring Dedup • Ensemble Quality Classification (1 fasttext + 2 FineWeb classifiers) • Synthetic Data Generation (4 tasks) |
+
 ## Core Concepts Example
 
 The [`quickstart.py`](quickstart.py) demonstrates NeMo Curator's foundational architecture:
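
The stage order named in the added README text (language identification, then deduplication, then quality classification) can be sketched in plain Python. This is an illustration only and is not part of the commit: it does not use NeMo Curator's API, every function name below is hypothetical, and the `lang` and `quality` fields are assumed to be precomputed, whereas the real recipe derives them with language-ID and classifier models. Only exact deduplication is shown; fuzzy and substring deduplication are omitted.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical record shape: {"text": str, "lang": str, "quality": float}
Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def filter_language(docs: Iterable[Doc], keep: str = "en") -> Iterator[Doc]:
    """Keep only documents whose (assumed precomputed) language tag matches."""
    for doc in docs:
        if doc.get("lang") == keep:
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Drop documents whose text hashes to a digest that was already seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def filter_quality(docs: Iterable[Doc], threshold: float = 0.5) -> Iterator[Doc]:
    """Keep documents whose (assumed precomputed) quality score meets the threshold."""
    for doc in docs:
        if doc.get("quality", 0.0) >= threshold:
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: list[Stage]) -> list[Doc]:
    """Chain the stages lazily, then materialize the surviving documents."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


SAMPLE = [
    {"text": "GPUs speed up deduplication.", "lang": "en", "quality": 0.9},
    {"text": "GPUs speed up deduplication.", "lang": "en", "quality": 0.9},
    {"text": "Les GPU accélèrent le traitement.", "lang": "fr", "quality": 0.8},
    {"text": "click here click here", "lang": "en", "quality": 0.1},
]

curated = run_pipeline(SAMPLE, [filter_language, exact_dedup, filter_quality])
```

Each stage is a generator, so documents stream through the chain one at a time. Running the cheap language filter first means the later stages see fewer documents, which is one plausible reason for the order the README lists.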
