docs: cross-link Nemotron-CC recipe as a production NeMo Curator example (#1767)

arhamm1 · claude · sarahyurick · web-flow · commit 06b8388a21bf · 2026-04-22T19:37:08.000Z
* docs: cross-link Nemotron-CC recipe as a production NeMo Curator example

Adds references to the Nemotron-CC data curation pipeline in README.md
and tutorials/README.md so users can discover a production-scale,
end-to-end example of NeMo Curator in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: address review comments — add in-repo SDG link and complete pipeline stages

- Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
- Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
- Add language identification step to the Real-World Recipe callout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
diff --git a/README.md b/README.md
@@ -103,6 +103,8 @@ NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGrap
 - **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives
 - **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
 
+**Real-World Recipe:** The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is also available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).
+
 <p align="center">
   <img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
 </p>
@@ -125,6 +127,7 @@ Data curation modules measurably improve model performance. In ablation studies
 |----------|-------|
 | **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
 | **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
+| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](tutorials/synthetic/nemotron_cc/) |
 | **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) |
 | **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions) • [Issues](https://github.com/NVIDIA-NeMo/Curator/issues) |
 
diff --git a/tutorials/README.md b/tutorials/README.md
@@ -15,6 +15,14 @@ Hands-on tutorials for curating data across all modalities with NeMo Curator. Co
 | **[Video](video/)** | Video processing and analysis | Clipping, Frame Extraction, Filtering |
 | **[Audio](audio/)** | Speech and audio data curation | FLEURS Dataset Processing |
 
+## Production Recipes
+
+Complete, production-grade pipelines built on NeMo Curator:
+
+| Recipe | Description | Key Components |
+|--------|-------------|----------------|
+| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) • [SDG tutorial (in-repo)](synthetic/nemotron_cc/) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Language ID & Filtering • Exact/Fuzzy/Substring Dedup • Ensemble Quality Classification (1 fasttext + 2 FineWeb classifiers) • Synthetic Data Generation (4 tasks) |
+
 ## Core Concepts Example
 
 The [`quickstart.py`](quickstart.py) demonstrates NeMo Curator's foundational architecture: