docs: cross-link Nemotron-CC recipe as a production NeMo Curator example #1767
sarahyurick merged 4 commits into main
Adds references to the Nemotron-CC data curation pipeline in README.md and tutorials/README.md so users can discover a production-scale, end-to-end example of NeMo Curator in use. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile Summary
This documentation-only PR cross-links the Nemotron-CC production curation pipeline from both …
Confidence Score: 5/5
Safe to merge: documentation-only changes with correct relative paths and no code impact. All changes are documentation (markdown only). Relative links to …
No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[README.md<br/>Why NeMo Curator?] -->|Real-World Recipe callout| B[External: NVIDIA-NeMo/Nemotron<br/>nemotron-cc pipeline]
A -->|Real-World Recipe callout| C[In-repo: tutorials/synthetic/nemotron_cc/]
D[README.md<br/>Learn More table — Recipes row] -->|link| B
D -->|link| C
E[tutorials/README.md<br/>Production Recipes section] -->|Nemotron-CC link| B
E -->|SDG tutorial link| C
C --> F[nemotron_cc_pipelines.py<br/>nemotron_cc_sdg_*.py]
Reviews (6): Last reviewed commit: "Merge branch 'main' into docs/crosslink-..."
| Recipe | Description | Key Components |
|--------|-------------|----------------|
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |
New section misses in-repo Nemotron-CC content
The new "Production Recipes" row links exclusively to NVIDIA-NeMo/Nemotron (an external repo), but this repository already contains the Nemotron-CC synthetic data generation pipeline at tutorials/synthetic/nemotron_cc/. The NVIDIA technical blog states the pipeline was merged into NeMo Curator, and tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py exists in this repo.
Pointing only at the external recipe means users browsing tutorials/README.md will miss the SDG portion that lives here. Consider adding a link to tutorials/synthetic/nemotron_cc/ alongside (or instead of) the external recipe link, and clarifying which parts of the end-to-end pipeline live where:
Current:
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |
Suggested:
| [Nemotron-CC (SDG)](tutorials/synthetic/nemotron_cc/) | Synthetic data generation stage of the Nemotron-CC pipeline | `nemotron_cc_pipelines.py` • Diverse QA • Knowledge List |
| [Nemotron-CC (full curation)](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |
|----------|-------|
| **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
| **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |
Learn More table link points to external repo, not canonical in-repo location
The Recipes row links to NVIDIA-NeMo/Nemotron (external), but this repo already hosts the Nemotron-CC SDG pipeline at tutorials/synthetic/nemotron_cc/. Users clicking from the main README.md learn-more table would be sent off-site when a directly relevant tutorial lives in the same repo. Consider adding a link to the internal tutorial alongside the external recipe:
Current:
| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |
Suggested:
| **Recipes** | [Nemotron-CC SDG (in-repo)](tutorials/synthetic/nemotron_cc/) • [Nemotron-CC full pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |
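The distinction both review comments draw, in-repo relative links versus links to an external repository, can be checked mechanically. The sketch below is only an illustration and is not part of this PR or of NeMo Curator: `classify_links`, `missing_relative_links`, and the simplified regex are hypothetical helpers that split Markdown link targets into external URLs and relative paths, then report relative paths that do not exist under a given directory.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
# Simplified on purpose: no nested brackets, no reference-style links.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def classify_links(markdown: str) -> dict[str, list[str]]:
    """Split link targets into external URLs and relative (in-repo) paths."""
    result: dict[str, list[str]] = {"external": [], "relative": []}
    for _text, target in LINK_RE.findall(markdown):
        is_external = target.startswith(("http://", "https://"))
        result["external" if is_external else "relative"].append(target)
    return result


def missing_relative_links(markdown: str, base_dir: Path) -> list[str]:
    """Return relative targets that do not exist under base_dir."""
    relative = classify_links(markdown)["relative"]
    return [t for t in relative if not (base_dir / t).exists()]


# The suggested Recipes row from the review comment, used as sample input.
row = (
    "| **Recipes** | [Nemotron-CC SDG (in-repo)](tutorials/synthetic/nemotron_cc/) • "
    "[Nemotron-CC full pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/"
    "src/nemotron/recipes/data_curation/nemotron-cc) |"
)
links = classify_links(row)
print(links["relative"])
print(links["external"])
```

When running such a check against a checkout, `base_dir` should be the directory of the Markdown file that contains the link, since a relative target resolves differently from `README.md` than from `tutorials/README.md`.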
| Recipe | Description | Key Components |
|--------|-------------|----------------|
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |
Not sure how detailed you want this description to be, but technically all of the stages are:
- Download and extract Common Crawl
- Language identification and filtering
- Exact deduplication
- Fuzzy deduplication
- Substring deduplication
- Ensemble quality classification using 3 classifiers (1 fastText and 2 FineWeb-based)
- SDG (4 generation tasks)
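
As a reading aid, the stage order listed above can be written down as plain data. This is only a sketch: the `Stage` dataclass, the `group` labels, and `NEMOTRON_CC_STAGES` are hypothetical names chosen for illustration and do not correspond to NeMo Curator classes or to the recipe's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    group: str  # coarse grouping, an assumption made for this sketch


# Stage names follow the reviewer's list, in the order given there.
NEMOTRON_CC_STAGES = (
    Stage("Download and extract Common Crawl", "extraction"),
    Stage("Language identification and filtering", "filtering"),
    Stage("Exact deduplication", "deduplication"),
    Stage("Fuzzy deduplication", "deduplication"),
    Stage("Substring deduplication", "deduplication"),
    Stage("Ensemble quality classification", "classification"),
    Stage("Synthetic data generation", "generation"),
)


def stages_in_group(group: str) -> list[Stage]:
    """Return the stages belonging to one group, preserving pipeline order."""
    return [s for s in NEMOTRON_CC_STAGES if s.group == group]


for index, stage in enumerate(NEMOTRON_CC_STAGES, start=1):
    print(f"{index}. {stage.name} [{stage.group}]")
```

Grouping the three deduplication steps under one label mirrors how the table's Key Components column condenses them into "Exact/Fuzzy/Substring Dedup".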
8b5702d to f0ae237
…eline stages

- Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
- Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
- Add language identification step to the Real-World Recipe callout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/ok to test d504b64
/ok to test 14b7b7e
…ple (#1767) (#1860)

* docs: cross-link Nemotron-CC recipe as a production NeMo Curator example

  Adds references to the Nemotron-CC data curation pipeline in README.md and tutorials/README.md so users can discover a production-scale, end-to-end example of NeMo Curator in use.

* docs: address review comments - add in-repo SDG link and complete pipeline stages

  - Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
  - Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
  - Add language identification step to the Real-World Recipe callout

---------

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Summary
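- README.md: Real-World Recipe callout under the "Why NeMo Curator?" section, linking to the Nemotron-CC pipeline
- README.md: Recipes row in the Learn More table
- tutorials/README.md: Production Recipes section

Context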
The Nemotron-CC curation pipeline is a complete end-to-end example of NeMo Curator used at production scale — covering Common Crawl extraction, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation. Cross-linking it here helps users discover a real-world reference implementation of the library.
🤖 Generated with Claude Code