docs: cross-link Nemotron-CC recipe as a production NeMo Curator example #1767

Merged
sarahyurick merged 4 commits into main from docs/crosslink-nemotron-cc-recipe
Apr 22, 2026

arhamm1 (Contributor) commented Apr 8, 2026

Summary

Context

The Nemotron-CC curation pipeline is a complete end-to-end example of NeMo Curator used at production scale — covering Common Crawl extraction, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation. Cross-linking it here helps users discover a real-world reference implementation of the library.

🤖 Generated with Claude Code

Adds references to the Nemotron-CC data curation pipeline in README.md
and tutorials/README.md so users can discover a production-scale,
end-to-end example of NeMo Curator in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arhamm1 arhamm1 requested a review from a team as a code owner April 8, 2026 16:12
@arhamm1 arhamm1 requested review from oyilmaz-nvidia and removed request for a team April 8, 2026 16:12
copy-pr-bot (Bot) commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@arhamm1 arhamm1 requested review from ayushdg and lbliii April 8, 2026 16:12
greptile-apps (Bot, Contributor) commented Apr 8, 2026

Greptile Summary

This documentation-only PR cross-links the Nemotron-CC production curation pipeline from both README.md and tutorials/README.md, adding a "Real-World Recipe" callout, a Recipes row in the Learn More table, and a new "Production Recipes" section. The previously raised concern about missing in-repo tutorial links has been addressed — all new entries now point to both the external NVIDIA-NeMo/Nemotron pipeline and the in-repo tutorials/synthetic/nemotron_cc/ tutorial.

Confidence Score: 5/5

Safe to merge — documentation-only changes with correct relative paths and no code impact.

All changes are documentation (markdown only). Relative links to tutorials/synthetic/nemotron_cc/ resolve correctly from both files. The previously flagged concern about missing in-repo links is fully addressed. No logic, security, or runtime issues are possible in this change.

No files require special attention.
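
The relative-path claim above can be checked mechanically. The following is a minimal sketch, not part of the PR: it assumes the two READMEs use the link strings shown in the `LINKS` dictionary (inferred from this summary, not copied from the diff) and resolves each one against the directory of the file that contains it, which is how relative Markdown links are interpreted on GitHub. The helper name `resolve_relative_link` is hypothetical.

```python
import posixpath

# Expected in-repo target, as named in the review summary.
TARGET = "tutorials/synthetic/nemotron_cc"

# Assumed link strings per source file (inferred, not copied from the diff).
LINKS = {
    "README.md": "tutorials/synthetic/nemotron_cc/",
    "tutorials/README.md": "synthetic/nemotron_cc/",
}


def resolve_relative_link(source_file: str, link: str) -> str:
    """Resolve a relative Markdown link against the directory of its source file."""
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


if __name__ == "__main__":
    for source, link in LINKS.items():
        resolved = resolve_relative_link(source, link)
        status = "ok" if resolved == TARGET else "MISMATCH"
        print(f"{source}: {link} -> {resolved} [{status}]")
```

Note that a link written as `tutorials/synthetic/nemotron_cc/` inside `tutorials/README.md` would resolve to `tutorials/tutorials/synthetic/nemotron_cc` under the same rule, so the shorter `synthetic/nemotron_cc/` form is the one that works from that file.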

Important Files Changed

Filename Overview
README.md Adds a "Real-World Recipe" callout and a Recipes row to the Learn More table, both linking to the external Nemotron pipeline and the in-repo SDG tutorial; relative paths are correct.
tutorials/README.md Adds a "Production Recipes" section with a table row linking to the external Nemotron repo and the in-repo synthetic/nemotron_cc/ tutorial; relative path resolves correctly from the tutorials/ directory.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[README.md<br/>Why NeMo Curator?] -->|Real-World Recipe callout| B[External: NVIDIA-NeMo/Nemotron<br/>nemotron-cc pipeline]
    A -->|Real-World Recipe callout| C[In-repo: tutorials/synthetic/nemotron_cc/]

    D[README.md<br/>Learn More table — Recipes row] -->|link| B
    D -->|link| C

    E[tutorials/README.md<br/>Production Recipes section] -->|Nemotron-CC link| B
    E -->|SDG tutorial link| C

    C --> F[nemotron_cc_pipelines.py<br/>nemotron_cc_sdg_*.py]

Reviews (6): Last reviewed commit: "Merge branch 'main' into docs/crosslink-..."

Comment thread tutorials/README.md Outdated

| Recipe | Description | Key Components |
|--------|-------------|----------------|
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |

P1 New section misses in-repo Nemotron-CC content

The new "Production Recipes" row links exclusively to NVIDIA-NeMo/Nemotron (an external repo), but this repository already contains the Nemotron-CC synthetic data generation pipeline at tutorials/synthetic/nemotron_cc/. The NVIDIA technical blog states the pipeline was merged into NeMo Curator, and tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py exists in this repo.

Pointing only at the external recipe means users browsing tutorials/README.md will miss the SDG portion that lives here. Consider adding a link to tutorials/synthetic/nemotron_cc/ alongside (or instead of) the external recipe link, and clarifying which parts of the end-to-end pipeline live where:

Suggested change
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |
| [Nemotron-CC (SDG)](tutorials/synthetic/nemotron_cc/) | Synthetic data generation stage of the Nemotron-CC pipeline | `nemotron_cc_pipelines.py` • Diverse QA • Knowledge List |
| [Nemotron-CC (full curation)](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |

Comment thread README.md Outdated
|----------|-------|
| **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
| **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |

P1 Learn More table link points to external repo, not canonical in-repo location

The Recipes row links to NVIDIA-NeMo/Nemotron (external), but this repo already hosts the Nemotron-CC SDG pipeline at tutorials/synthetic/nemotron_cc/. Users clicking from the main README.md learn-more table would be sent off-site when a directly relevant tutorial lives in the same repo. Consider adding a link to the internal tutorial alongside the external recipe:

Suggested change
| **Recipes** | [Nemotron-CC: end-to-end web data curation](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |
| **Recipes** | [Nemotron-CC SDG (in-repo)](tutorials/synthetic/nemotron_cc/) • [Nemotron-CC full pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) |

greptile-apps (Bot, Contributor) commented Apr 8, 2026

Tip:

Greploops — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Comment thread tutorials/README.md Outdated

| Recipe | Description | Key Components |
|--------|-------------|----------------|
| [Nemotron-CC](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) | Curate Common Crawl snapshots into an LLM-ready dataset, reproducing the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) | `CommonCrawlDownloadExtractStage` • Exact/Fuzzy/Substring Dedup • `FineWebNemotronEduClassifier` • Synthetic Data Generation |

Not sure how detailed you want this description to be, but technically all of the stages are:

  1. Download and extract Common Crawl
  2. Language identification and filtering
  3. Exact deduplication
  4. Fuzzy deduplication
  5. Substring deduplication
  6. Ensemble quality classification using 3 classifiers (1 fasttext and 2 FineWeb-based)
  7. SDG (4 generation tasks)
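
For the table cell itself, the seven stages above could be kept as ordered data and rendered into the bullet-separated "Key Components" string. This is only an illustrative sketch: the `Stage` class and `render_key_components` helper are hypothetical and are not NeMo Curator APIs; only the stage names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical record for one pipeline stage (not a NeMo Curator class)."""
    order: int
    name: str


# Stage names follow the numbered list in this comment.
NEMOTRON_CC_STAGES = (
    Stage(1, "Download and extract Common Crawl"),
    Stage(2, "Language identification and filtering"),
    Stage(3, "Exact deduplication"),
    Stage(4, "Fuzzy deduplication"),
    Stage(5, "Substring deduplication"),
    Stage(6, "Ensemble quality classification"),
    Stage(7, "Synthetic data generation"),
)


def render_key_components(stages) -> str:
    """Join stage names in pipeline order using the table's bullet separator."""
    ordered = sorted(stages, key=lambda stage: stage.order)
    return " • ".join(stage.name for stage in ordered)


if __name__ == "__main__":
    print(render_key_components(NEMOTRON_CC_STAGES))
```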

@arhamm1 arhamm1 force-pushed the docs/crosslink-nemotron-cc-recipe branch from 8b5702d to f0ae237 Compare April 8, 2026 18:36
…eline stages

- Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
- Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
- Add language identification step to the Real-World Recipe callout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sarahyurick sarahyurick added the r1.2.0 Pick this label for auto cherry-picking into r1.2.0 label Apr 21, 2026
sarahyurick (Contributor) commented

/ok to test d504b64

sarahyurick (Contributor) commented

/ok to test 14b7b7e

@sarahyurick sarahyurick enabled auto-merge (squash) April 22, 2026 19:03
@sarahyurick sarahyurick merged commit 06b8388 into main Apr 22, 2026
45 checks passed
sarahyurick added a commit that referenced this pull request Apr 22, 2026
…ple (#1767) (#1860)

* docs: cross-link Nemotron-CC recipe as a production NeMo Curator example

Adds references to the Nemotron-CC data curation pipeline in README.md
and tutorials/README.md so users can discover a production-scale,
end-to-end example of NeMo Curator in use.



* docs: address review comments — add in-repo SDG link and complete pipeline stages

- Link to tutorials/synthetic/nemotron_cc/ alongside external recipe in both README.md and tutorials/README.md
- Expand Key Components to include language ID & filtering and all 7 pipeline stages per sarahyurick's feedback
- Add language identification step to the Real-World Recipe callout



---------

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Labels

r1.2.0 Pick this label for auto cherry-picking into r1.2.0

Development

Successfully merging this pull request may close these issues.

Docs - Add how Curator is used to build Nemotron datasets

2 participants