Commit 27f063d

lbliii, ayushdg, claude, greptile-apps[bot], and jgerh authored
26.04 staging (#1655)
* Add Fern documentation site (synced from llane/fern-docs-migration) Introduce the full fern/ tree on top of origin/main for a clean PR base. Made-with: Cursor Signed-off-by: Lawrence Lane <llane@nvidia.com> * regen fern effort on fresh branch Signed-off-by: Lawrence Lane <llane@nvidia.com> * fern ci Signed-off-by: Lawrence Lane <llane@nvidia.com> * fix: satisfy Ruff EXE001/INP001 for Fern helper scripts Signed-off-by: Lawrence Lane <llane@nvidia.com> Made-with: Cursor * remove Signed-off-by: Lawrence Lane <llane@nvidia.com> * chore: fix detect-secrets for Fern docs - Ignore and stop scanning generated fern/product-docs (Fern autodoc output) - Extend baseline allowlist for v26.02 synthetic MDX (doc examples with api_key) Signed-off-by: Lawrence Lane <llane@nvidia.com> Made-with: Cursor * secrets baseline Signed-off-by: Lawrence Lane <llane@nvidia.com> * index page redirects Signed-off-by: Lawrence Lane <llane@nvidia.com> * simplify nested for articles Signed-off-by: Lawrence Lane <llane@nvidia.com> * simplify sections Signed-off-by: Lawrence Lane <llane@nvidia.com> * more flattening Signed-off-by: Lawrence Lane <llane@nvidia.com> * prep 26.04 Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: update docs and release notes for Cosmos-Xenna 0.2.0 (PR #1571) (#1683) * docs: update docs for Cosmos-Xenna 0.2.0 bump (PR #1571) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * ci: add FERN_TOKEN to fern docs CI workflow The generate-library-reference job was failing because fern docs md generate requires authentication via the FERN_TOKEN secret. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * ci: use fern check instead of fern docs md generate fern docs md generate requires cloud authentication via FERN_TOKEN, which is not configured as a repo secret. fern check validates the configuration locally without authentication. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * ci: use DOCS_FERN_TOKEN org secret for fern docs generation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address PR feedback on release notes and GPU wording - Add version sub-headers (26.04/26.02) to shared Dependency Updates and Breaking Changes sections in cumulative release notes - Fix misleading "multi-GPU" wording for gpus field — it supports 1 or more full GPUs, not just multi-GPU Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes for audio stage name fix (PR #1470) (#1691) * docs: add release notes for audio stage name fix (PR #1470) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: apply style guide fixes to release notes entries Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: revert Sphinx release notes changes, keep Fern only Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes for vLLM race condition fix (PR #1590) (#1696) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: update dedup docs for WorkflowRunResult (PR #1275) (#1654) * docs: update dedup docs for WorkflowRunResult (PR #1275) Update v26.04 fern docs to reflect the new WorkflowRunResult return type from all deduplication workflow run() methods. 
Add API reference docs for WorkflowRunResult and WorkflowBase, update code examples across dedup pages, and replace 26.02 release notes with 26.04 skeleton. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * style: fix latinism and split long breaking changes sentence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: add missing metadata keys from audit findings Add input_filegroups_time, connected_components_pipeline_time, and complete TextSemanticDedup keys to API reference table and inline comments. Note PR #1275 provenance on workflow.py source link. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address Greptile review comments - Qualify WorkflowBase claim: "most" workflows inherit, not "all" (TextSemanticDeduplicationWorkflow duck-types the interface) - Add missing id_generator_path to exact/fuzzy comments in index.mdx - Complete semdedup.mdx metadata comments with identification_time, removal_time, and final_output_path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: remove internal WorkflowRunResult/WorkflowBase API reference Per praateekmahajan review feedback — these sections are too internal for public docs. Also removes dead link in dedup index page. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: document multi-user metrics isolation (PR #1523) (#1656) * docs: add 26.04 release notes and monitoring setup for PR #1523 Document the multi-user metrics isolation feature (per-user metrics directories, metrics_dir parameter, PID-file tracking, auto-generated Ray dashboards, graceful cleanup). 
Expand the monitoring setup section in memory-management.mdx with step-by-step instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: apply style guide fixes to PR #1523 docs Add periods to complete-sentence list items in release notes. Fix passive voice ("are tracked" → active, "is stored" → active). Adjust phrasing for PACE voice consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix inaccurate dashboard name (core → default) The Ray dashboard generator uses "default" not "core" as the name (generate_default_grafana_dashboard → ray_default_dashboard.json). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: extract monitoring setup into dedicated page Move Prometheus/Grafana content from memory-management.mdx into a new monitoring.mdx page under reference/infrastructure. Update nav, release notes link, and best practices cross-references. Rename step headers from "Step N:" to "N." for consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Logan Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: update semdedup docs for vLLM default backend (PR #1606) (#1659) * docs: update semdedup docs for vLLM default backend (PR #1606) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix remaining merge conflict markers in semdedup.mdx Resolve leftover conflict markers that were missed in the previous merge resolution commit. 
Signed-off-by: Logan Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Logan Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add InferenceServer guide and update SDG docs for PR #1541 (#1665) Documents the new InferenceServer and InferenceModelConfig APIs (Ray Serve + vLLM) in v26.04 fern docs. Adds new how-to page, updates LLM client with extra_kwargs and local inference example, adds install extras to installation guide, updates SDG overview with InferenceServer references, and replaces 26.02 release notes with 26.04 skeleton containing the Inference Server entry. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes for cryptography bump and uv min version (PR #1682) (#1705) Signed-off-by: Logan Lane <llane@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add actor pool progress bar documentation (PR #1457) (#1706) * docs: add actor pool progress bar documentation for PR #1457 Document the new show_progress and progress_interval parameters added to RayActorPoolExecutor in the execution backends reference, experimental executors API reference, and 26.04 release notes. 
Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update fern/versions/v26.04/pages/reference/infrastructure/execution-backends.mdx Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * docs: update fern docs for vLLM default in semantic dedup (PR #1606) (#1704) Update semantic deduplication docs and release notes for the switch from SentenceTransformers to vLLM as the default embedding backend in TextSemanticDeduplicationWorkflow. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release note for Pygments bump (PR #1681) (#1729) * docs: add release note for Pygments bump (PR #1681) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: remove transitive dependency entries from release notes Remove Pygments and cryptography from Dependency Updates since they are transitive dependencies, not core project dependencies. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and docs for batched shuffle insertion (PR #1369) (#1698) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and container docs for CVE fixes (PR #1612) (#1733) * docs: add release notes and container docs for CVE fixes (PR #1612) Document four HIGH-severity CVE fixes (nemo-toolkit RCE, xgrammar DoS, jackson-core DoS) and dependency updates (pynvml removal) in the 26.04 release notes. Add security hardening section to container environments page for the ray_dist.jar removal. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix CVE vs GHSA terminology and clarify nemo-toolkit version range Address review feedback: - Change "four HIGH-severity CVEs" to "four HIGH-severity vulnerabilities" since GHSA-72hv-8253-57qq is a GitHub Security Advisory, not a CVE. - Clarify that the nemo-toolkit CVE was fixed in 2.6.1 but bumped to >=2.7.2 for additional fixes and dependency compatibility. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: remove pynvml dependency note per review feedback Remove pynvml entry from Dependency Updates — it was accidentally added this release and was not present in the previous release. 
Signed-off-by: Logan Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Logan Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add exact dedup tuning params to v26.04 docs and release notes (#1699) * docs: add exact dedup tuning params to v26.04 docs and release notes Documents three new ExactDeduplicationWorkflow parameters exposed in PR #1561: total_nparts, rmm_pool_size, and spill_memory_limit. Updates the configuration table and performance best practices in exact.mdx, and adds an Enhancements entry to the 26.04 release notes. Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update fern/versions/v26.04/pages/curate-text/process-data/deduplication/exact.mdx Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address PR feedback on exact dedup tuning guidance Update total_nparts guidance per ayushdg's feedback: explicitly recommend smaller values (256/512) for better shuffle performance on large runs. 
Signed-off-by: Logan Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Logan Lane <llane@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: document ScoreFilter benchmark metrics for 26.04 (#1686) * docs: add ScoreFilter benchmark metrics to 26.04 release notes and heuristic filtering guide Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix undefined executor and division-by-zero in pipeline metrics snippet Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: apply NVIDIA style guide fixes to filter metrics docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: generalize metrics section per reviewer feedback Remove ScoreFilter-specific framing from release notes and heuristic guide since pipeline stage metrics are not unique to ScoreFilter. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add shared tokenizer docs and release notes (PR #1528) (#1700) * docs: add shared tokenizer docs and release notes for PR #1528 Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: replace latinism in shared tokenizer docs Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address Greptile review concerns for shared tokenizer PR Add AegisClassifier aegis_prompt_field warning when using use_existing_tokens, and include MultilingualDomainClassifier in the release notes DeBERTa group list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: update docs and release notes for fused iterate-extract (PR #1458) (#1684) * docs: update docs and release notes for fused iterate-extract stages (PR #1458) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address PR feedback from Greptile review Revert 26.02 release notes line to avoid referencing 26.04 class name in wrong version section. Update custom.mdx wording to reflect that the fused step maps to multiple abstract base classes. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address PR review feedback from jgerh and greptile Apply tech pubs copyedits: add periods to list items, fix link spacing, backtick code references, correct product names (FastText, GLiNER), replace OOM'd with "ran out of memory", rewrite component descriptions for clarity, convert <Note> to :::{note} directive, remove horizontal dividers, and align fern/non-fern breaking changes (add Three-Stage Pipeline entry to non-fern). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: revert non-fern release notes to staging state Release notes live in fern only; remove docs/ changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update docs/about/release-notes/index.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update docs/about/release-notes/index.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update docs/about/release-notes/index.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update docs/about/release-notes/index.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update docs/about/release-notes/index.md Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Apply suggestion from @jgerh Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address sarahyurick feedback and remove last divider Remove ID Field Standardization entries from non-fern 26.02 release notes since PR #1390 is a 26.04 feature. 
Remove fused iterate-extract entry from same section (26.04 via PR #1458). Remove remaining horizontal divider in custom.mdx. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: revert all changes to non-fern release notes Restore docs/about/release-notes/index.md to staging state; this PR should not modify the non-fern release notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> * docs: update Fern docs for filter/modifier directory reorganization (PR #1472) (#1685) Update import paths and release notes in Fern v26.04 docs to reflect the DocumentFilter/DocumentModifier directory restructuring that avoids eagerly importing heavy dependencies. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add AEGIS classifier GPU utilization note and release notes (Issue #878) (#1702) * docs: add AEGIS classifier GPU utilization note and release notes (Issue #878) Document confirmed full GPU utilization for the AEGIS safety classifier on multi-GPU setups and add performance expectations note about the LlamaGuard-7b generative model being slower than encoder-based classifiers. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address Greptile review feedback on AEGIS docs Remove "now" from release notes to clarify this confirms existing behavior rather than implying a bug fix. Link to the NVIDIA AEGIS model page instead of Meta's base LlamaGuard-7b model. 
Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: address PR feedback for FastText filter benchmarking release notes (#1690) Add --fasttext-quality-model-path mention and model reference links per reviewer suggestions from sarahyurick and greptile bot. Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: add per-stage runtime environment documentation (PR #1623) (#1753) * docs: add per-stage runtime environment documentation (PR #1623) Add reference page, release note, and API reference updates for the per-stage runtime_env feature that enables isolated Python dependencies per pipeline stage using Ray's native runtime_env support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address PR feedback for per-stage runtime docs - Clarify that pip/uv keys control the worker virtualenv installer, not the local package manager - Replace undefined stage references in example with full RecordPackagingVersionStage definition from PR #1623 tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and docs for jusText OOM fix (PR #1534) (#1697) Addresses PR #1697 review feedback: - Condensed "What's New" entry to 1-2 lines (ayushdg verbosity feedback) - Consolidated all bug fixes under single "## Bug Fixes" section - Fixed "Audio Stage Name Propagation" being orphaned under "Dependency Updates" - Kept feature in "What's New" and bug fix in "Bug Fixes" as separate concise entries Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add text 
embeddings guide and release notes for PR #1346 (#1687) * docs: add text embeddings guide and release notes for PR #1346 Add Fern documentation for vLLM and Sentence Transformers embedding support. Creates new Text Embeddings section with overview and vLLM Embedder pages. Updates 26.04 release notes and expands semdedup page with vLLM embedding example. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: deduplicate code snippets and fix placeholder model names Replace duplicated vLLM Quick Start in embeddings overview and semdedup page with cross-references to the canonical vllm-embedder page. Replace placeholder "large-embedding-model" with consistent model identifiers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix internal stage reference for TextSemanticDeduplicationWorkflow The workflow now uses VLLMEmbeddingModelStage internally, not EmbeddingCreatorStage. 
Signed-off-by: Logan Lane <lbliii@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: address sarahyurick review feedback on embedding docs - Fix "HuggingFace" to "Hugging Face" everywhere - Remove vllm install instructions (included in text_cuda12) - Fix "classe" typo to "classes" in release notes - Update Setup column to recommend text_cuda12 - Position vLLM as recommended for semantic dedup - Fix pretokenize recommendation (model-dependent, not universal) - Remove vLLM vs ST comparison table per reviewer request - Use correct model identifier google/embeddinggemma-300m Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Logan Lane <lbliii@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes for vLLM setup race condition fix (PR #1590) (#1708) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add image reader Ray Data support docs (PR #1610) (#1731) Document ImageReaderStage RayDataExecutor compatibility and fanout behavior in tar archives loading guide, and add release note entry. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and docs for LSH memory config (PR #1603) (#1732) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and docs for math benchmarking support (PR #1604) (#1737) Document S3 transport for CommonCrawlWARCReader, serialization fixes for MathContentExtractor and CommonCrawlWARCReader, and boto3 dependency. 
Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add release notes and docs for Ray 2.54 update (PR #1557) (#1734) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * ci: remove combined fern-docs-preview workflow (#1771) Replace with the more secure two-part workflow split into fern-docs-preview-build.yml and fern-docs-preview-comment.yml. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: update fern docs for RayDataExecutor promotion from experimental (PR #1619) (#1703) * docs: update fern docs for RayDataExecutor promotion from experimental (PR #1619) Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: add RayDataExecutor config table to execution-backends page Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: rename AudioBatch to AudioTask in audio curation docs (#1694) * docs: rename AudioBatch to AudioTask across audio curation docs Rename AudioBatch class/concept to AudioTask throughout the 26.04 documentation to reflect the upstream API rename. Updates navigation, redirects, concepts, API reference, tutorials, and release notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix style guide issues in audio curation docs Replace latinisms (via, etc.) with plain English equivalents and fix code formatting spacing in text-integration how-to. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix AudioTask API mismatches flagged in PR review Address review comments from #1694: - Fix process() return type in duration-calculation.mdx (returns single AudioTask, not a list) - Fix AudioTask construction in manifests-ingest.mdx (single dict per task, not a list) - Update "batch-level validation" wording to match single-entry model - Correct asr-inference/index.mdx to state ASR stage defines process_batch() as its canonical method - Fix "file paths" plural to singular for single-entry AudioTask - Replace invalid list comprehension over AudioTask.data dict keys in asr-pipeline.mdx with validate() call - Replace direct process() call on ASR stage with pipeline-driven pattern Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: address 9 open documentation issues (#1722) * docs: address open documentation issues (#1264, #1326, #1546, #1547, #1548, #1549, #1550, #1551, #1552) - Fix broken Installation and Configuration links in tutorials/README.md (#1264) - Document pip install dependency conflict and recommend uv (#1326) - Clarify that Curator uses Ray (not Dask) in migration guide and about page (#1546) - Add architecture diagram section to README (#1547) - Add Nemotron dataset usage section to README (#1548) - Add data curation importance section to README (#1549) - Add deep-dives section to fern docs (resource allocation, streaming, auto-balancing, throughput) (#1550) - Add citation section to README (#1551) - Add Updates/News section to README (#1552) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: move assets to fern/, revert docs/ changes Images now live in fern/assets/images/ and README references updated. 
Reverted docs/about/release-notes/migration-guide.md and docs/admin/installation.md since docs/ is deprecated in favor of fern/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: enrich deep-dive pages with slide content Update resource allocation, streaming, and auto-balancing deep-dives with concrete details from internal slides: ~5x CPU stage speedup, 20% streaming overlap improvement, before/after auto-balancing example (1 vs 4 videos/s), and code examples matching actual API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: add Berkeley lecture throughput example to README and deep-dives Add illustrative pipeline example from UCB lecture (lang ID → tokenization → 5B model, 13,000s naive → ~1,000s with Curator) to the "Why Data Curation?" section and throughput deep-dive. Update auto-balancing with accurate tasks/s numbers and streaming with 99% GPU utilization stat. Framed as illustrative, not a benchmark claim. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: move deep-dives into concepts as Scaling & Performance section These are concept articles, not a new content type. Moved resource allocation, streaming, auto-balancing, and throughput pages under About > Concepts > Scaling & Performance. Removed the separate Deep Dives nav section and index page. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * Address PR review comments on concept docs - resource-allocation: Use class-level resources attr, fix .with_() API, replace fractional GPU example with Resources(gpus=0.25), remove unsourced 5x claim, remove deduplication from CPU-bound example - streaming: Remove unsourced 20% claim, use quickstart-style example, soften batch size trade-off language - throughput: Clarify GPU memory math in example, emphasize streaming, use RayClient/RaySlurmClient, add ScoreFilter default note, note auto-balancing behavior - auto-balancing: Remove untested stage_stats snippet, reference Ray Dashboard instead - tutorials/README: Add .html extensions to doc URLs for consistency Signed-off-by: Lawrence Lane <llane@nvidia.com> * Replace architecture diagram with version without red underline Signed-off-by: Lawrence Lane <llane@nvidia.com> * Replace architecture diagram with clean version Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: add MegatronTokenizerWriter documentation for 26.04 (#1742) * docs: add MegatronTokenizerWriter documentation for 26.04 Add Save and Export page for text curation documenting MegatronTokenizerWriter (PR #1259), including configuration reference, output format details, and pipeline examples. Update release notes with feature entry, add page to navigation, and update related-tools to reflect Curator's tokenization capability. Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> * docs: fix output tree placeholders and .idx format description Use distinct hash placeholders to clarify per-partition output files. Correct .idx layout to show three separate contiguous arrays instead of implying an interleaved per-record format. 
Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add NeMo Data Designer (NDD) integration docs for 26.04 (#1743)

* docs: add NeMo Data Designer integration documentation for 26.04

Add fern/ docs for the NDD + Curator integration, covering DataDesignerStage, NDD-backed Nemotron-CC stages, configuration builder patterns, and local/remote inference setup. Includes release note line items and nav config updates.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add missing import os in NDD code examples

Fixes Greptile review comments — two code snippets using os.environ["NVIDIA_API_KEY"] were missing the os import.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(ndd): address review feedback on NDD docs

- Link to NDD docs in the intro for reader context
- Add callout linking to NDD config builder reference
- Include missing DataDesignerStage import in remote provider snippet

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

* docs(ndd): define pipeline in NDD-backed stages example

The NDD-Backed Stages snippet in the Nemotron-CC page called pipeline.add_stage(...) without constructing the Pipeline, so copying it verbatim raised NameError. Add the Pipeline import and instantiation to make the example self-contained.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fix ImageReaderStage batch_size parameter name in v26.04 docs (#1837)

QA found that v26.04 image curation docs use batch_size when the actual ImageReaderStage parameter is dali_batch_size, causing all documented examples to fail with TypeError at runtime.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs: add ALM pipeline fern docs and release notes (PRs #1419, #1608, #1676) (#1738)

* docs: add ALM pipeline docs and release notes for PRs #1419, #1608, #1676

New fern pages for ALM data curation:
- Concept page: about/concepts/audio/alm-pipeline
- Tutorial: curate-audio/tutorials/alm
- Processing pages: curate-audio/process-data/alm/ (index, data-builder, overlap-filtering)

Updated existing pages:
- release-notes/index: added entries for ALM pipeline, AudioTask redesign, audio profiling
- version.yml: added ALM nav entries under concepts, tutorials, and process-data
- Audio concepts index: added ALM Pipeline card
- Audio tutorials index: added ALM Tutorial card
- Audio process-data index: added ALM Data Curation section with cards

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

* docs: fix incomplete output JSON examples and deduplicate release notes

- Add missing fields to output JSON examples in alm-pipeline.mdx and alm.mdx: filtered_dur_list, total_dur_window, truncation_events, and lost_no_spkr in stats
- Clarify PR #1608 release note to reference PR #1419 instead of repeating Hydra configuration details

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

* docs(alm): address reviewer feedback on ALM pipeline docs

- Loss Tracking prose: document lost_no_spkr and lost_next_seg_bm as sub-categories of lost_win on the concept page
- Describe overlap filtering as nested all-pairs rather than "consecutive pairs"; note the greedy removal rule
- Correct the overlap_percentage table: >=50% (not >50%), and clarify that 100% still removes fully-contained duplicates
- Output JSON examples: surface the pre-filter windows field, add lost_next_seg_bm, and add a note that real output carries additional duration and diagnostic fields
- data-builder.mdx: add lost_next_seg_bm row; reword lost_no_spkr as a sub-category of lost_win
- overlap-filtering.mdx: add manifest_filepath row and a note about the other intermediate fields the stage writes
- Tutorial: expand Loss Statistics tuning table with the two new sub-categories; switch sample-data command to run from the repo root so the fixture path matches the in-repo README

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* docs: remove duplicate PR #1608 release note entry

The "AudioBatch to AudioTask Redesign (PR #1608)" section duplicated the more complete "Audio Task Redesign (PR #1608)" entry above it. Greptile flagged this as a P1 issue on PR #1738.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* docs: warn users of Python 3.10 support removal in 26.06 (#1868)

* docs: warn users of Python 3.10 support removal in 26.06

Adds a deprecation notice to the 26.04 release notes, installation guide, and deployment requirements so users upgrading to 26.06 are prepared to move off Python 3.10.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

* docs: soften Python 3.10 deprecation wording to allow 3.13+

Avoid enumerating "3.11 and 3.12 only" since Python 3.13 support in 26.06 is not yet confirmed. Rephrases the deprecation notices to point to any newer supported version (3.11+).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* ci: refresh detect-secrets baseline for synthetic docs

Add the 9 placeholder API key examples in fern/versions/v26.04/pages/curate-text/synthetic/ to the detect-secrets baseline as known false positives so the secrets-detector check passes.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
1 parent 2420065 commit 27f063d

172 files changed

Lines changed: 26674 additions & 124 deletions


.github/workflows/config/.secrets.baseline

Lines changed: 75 additions & 2 deletions
@@ -124,7 +124,7 @@
     {
       "path": "detect_secrets.filters.regex.should_exclude_file",
       "pattern": [
-        "pyproject\\.toml|\\.github/workflows/config/\\.secrets\\.baseline"
+        "pyproject\\.toml|\\.github/workflows/config/\\.secrets\\.baseline|fern/product-docs/"
       ]
     }
   ],
@@ -243,6 +243,79 @@
         "line_number": 28
       }
     ],
+    "fern/versions/v26.04/pages/curate-text/synthetic/index.mdx": [
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/index.mdx",
+        "hashed_secret": "6d9c68c603e465077bdd49c62347fe54717f83a3",
+        "is_verified": false,
+        "line_number": 72
+      }
+    ],
+    "fern/versions/v26.04/pages/curate-text/synthetic/inference-server.mdx": [
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/inference-server.mdx",
+        "hashed_secret": "ce7501007f04a6529e650f1f1b3fc0586d1d94eb",
+        "is_verified": false,
+        "line_number": 173
+      }
+    ],
+    "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx": [
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
+        "hashed_secret": "e6bdb3f031eea3001ca83dd43d7d49d65a7a6ce5",
+        "is_verified": false,
+        "line_number": 32
+      },
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
+        "hashed_secret": "2083c49ad8d63838a4d18f1de0c419f06eb464db",
+        "is_verified": false,
+        "line_number": 43
+      },
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
+        "hashed_secret": "ec3810e10fb78db55ce38b9c18d1c3eb1db739e0",
+        "is_verified": false,
+        "line_number": 127
+      },
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
+        "hashed_secret": "11fa7c37d697f30e6aee828b4426a10f83ab2380",
+        "is_verified": false,
+        "line_number": 134
+      },
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
+        "hashed_secret": "ce7501007f04a6529e650f1f1b3fc0586d1d94eb",
+        "is_verified": false,
+        "line_number": 155
+      }
+    ],
+    "fern/versions/v26.04/pages/curate-text/synthetic/multilingual-qa.mdx": [
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/multilingual-qa.mdx",
+        "hashed_secret": "2083c49ad8d63838a4d18f1de0c419f06eb464db",
+        "is_verified": false,
+        "line_number": 28
+      }
+    ],
+    "fern/versions/v26.04/pages/curate-text/synthetic/nemo-data-designer.mdx": [
+      {
+        "type": "Secret Keyword",
+        "filename": "fern/versions/v26.04/pages/curate-text/synthetic/nemo-data-designer.mdx",
+        "hashed_secret": "ce7501007f04a6529e650f1f1b3fc0586d1d94eb",
+        "is_verified": false,
+        "line_number": 183
+      }
+    ],
     "nemo_curator/models/nemotron_h_vl.py": [
       {
         "type": "Hex High Entropy String",
@@ -333,5 +406,5 @@
       }
     ]
   },
-  "generated_at": "2026-04-24T18:02:35Z"
+  "generated_at": "2026-04-27T14:07:36Z"
 }
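
The widened `should_exclude_file` pattern in this file explains why `fern/product-docs/` is skipped entirely while the nine findings under `fern/versions/v26.04/` still need baseline entries. A minimal sketch, assuming detect-secrets applies the pattern with regex search semantics (a match anywhere in the path excludes the file); the `fern/product-docs/...` path and the `is_excluded` helper are made up for illustration:

```python
import re

# Pattern copied from the updated baseline, with the JSON escaping removed.
EXCLUDE = re.compile(
    r"pyproject\.toml|\.github/workflows/config/\.secrets\.baseline|fern/product-docs/"
)


def is_excluded(path: str) -> bool:
    """Assumed behaviour: skip a file when the pattern matches anywhere in its path."""
    return EXCLUDE.search(path) is not None


paths = [
    "fern/product-docs/reference/index.mdx",  # hypothetical generated page
    "fern/versions/v26.04/pages/curate-text/synthetic/llm-client.mdx",
    "pyproject.toml",
]
for path in paths:
    print(f"{path}: {'excluded' if is_excluded(path) else 'scanned'}")
```

Under that assumption, versioned pages are still scanned, so their placeholder keys have to be allowlisted in the baseline rather than excluded by path.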

.github/workflows/fern-docs-preview.yml

Lines changed: 0 additions & 115 deletions
This file was deleted.

README.md

Lines changed: 70 additions & 2 deletions
@@ -15,6 +15,11 @@
 
 > *Part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle.*
 
+## Updates
+
+- 2026-04: NeMo Curator 26.04 released with Cosmos-Xenna 0.2.0 upgrade, simplified `Resources` API, and Ray 2.54. See the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).
+- 2026-02: NeMo Curator 26.02 released with Ray-based pipeline architecture for all modalities — text, image, video, and audio.
+
 ## What You Can Do
 
 | Modality | Key Capabilities | Get Started |
@@ -38,6 +43,16 @@ python tutorials/quickstart.py
 
 ---
 
+## Architecture
+
+NeMo Curator uses a modular, Ray-based pipeline architecture. Data flows through composable processing stages — each stage handles a discrete curation task (loading, filtering, deduplication, etc.) and can be configured with independent resource requirements.
+
+<p align="center">
+<img src="./fern/assets/images/architecture-diagram.png" alt="NeMo Curator architecture diagram showing modular pipeline stages" width="700"/>
+</p>
+
+---
+
 ## Features by Modality
 
 ### Text Curation
@@ -92,6 +107,33 @@ Prepare high-quality speech datasets for automatic speech recognition (ASR) and
 
 ---
 
+## Why Data Curation?
+
+High-quality training data is the single most important factor in building performant AI models. Raw datasets contain noise, duplicates, low-quality content, and potentially harmful material that degrade model performance and increase training costs.
+
+<p align="center">
+<img src="./fern/assets/images/data-curation-challenges.png" alt="Common data curation challenges: quality, deduplication, filtering, and scale" width="700"/>
+</p>
+
+At scale, data curation is a **throughput maximization problem**. A typical pipeline chains stages with very different compute profiles — lightweight CPU tokenization, small GPU classifiers, large GPU inference models — and a naive sequential approach leaves most hardware idle most of the time.
+
+**Example:** Consider a pipeline with language identification (0.5B model, 1 GB VRAM, 2s/sample), tokenization (CPU-only, 1s/sample), and a 5B answer model (10 GB VRAM, 10s/sample) processing 1,000 questions on a single 102 GB GPU:
+
+| Approach | How it works | Total runtime |
+|----------|-------------|---------------|
+| **Sequential** | Process each sample through all stages, one at a time | ~13,000 seconds |
+| **NeMo Curator** | Stream batches, auto-scale replicas per stage, overlap CPU/GPU work | ~1,000 seconds |
+
+NeMo Curator achieves this by streaming data through the pipeline so all stages run concurrently, auto-balancing replicas to match each stage's throughput (2× language ID, 1× tokenizer, 10× answer model), and keeping GPU workers busy over 99% of the time after an initial warm-up period. See the [scaling concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling) for details.
+
+---
+
+## Proven at Scale: Nemotron
+
+NeMo Curator powers the data pipelines behind [NVIDIA Nemotron](https://developer.nvidia.com/nemotron) models. For example, the [Nemotron-4 pre-training dataset](https://arxiv.org/abs/2402.16819) was curated using NeMo Curator's text processing pipeline across 8+ trillion tokens of multilingual web data, applying quality filtering, deduplication, and domain classification at scale.
+
+---
+
 ## Why NeMo Curator?
 
 ### Performance at Scale
@@ -106,15 +148,15 @@ NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGrap
 **Real-World Recipe:** The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is also available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).
 
 <p align="center">
-<img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
+<img src="./fern/assets/images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
 </p>
 
 ### Quality Improvements
 
 Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:
 
 <p align="center">
-<img src="./docs/_images/ablation.png" alt="Model accuracy improvements across curation pipeline stages" width="700"/>
+<img src="./fern/assets/images/ablation.png" alt="Model accuracy improvements across curation pipeline stages" width="700"/>
 </p>
 
 **Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
@@ -136,3 +178,29 @@ Data curation modules measurably improve model performance. In ablation studies
 ## Contribute
 
 We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for guidelines.
+
+---
+
+## Citation
+
+If you find NeMo Curator useful in your research, please cite:
+
+```bibtex
+@misc{nemo_curator,
+title = {NeMo Curator: GPU-Accelerated Data Curation for Training AI Models},
+author = {NVIDIA},
+year = {2024},
+url = {https://github.com/NVIDIA-NeMo/Curator}
+}
+```
+
+For the data curation pipeline behind Nemotron models, please also cite:
+
+```bibtex
+@article{parmar2024nemotron4,
+title = {Nemotron-4 15B Technical Report},
+author = {Parmar, Jupinder and Satheesh, Shrimai and others},
+journal = {arXiv preprint arXiv:2402.16819},
+year = {2024}
+}
+```
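
The figures in the README's "Why Data Curation?" example (about 13,000 s sequential, about 1,000 s streamed, one 102 GB GPU) can be reproduced with a few lines of arithmetic. This is a steady-state sketch rather than NeMo Curator code: the stage latencies, VRAM sizes, and replica counts are taken from the README text, while the dictionary layout and variable names are illustrative, and warm-up time is ignored.

```python
# Per-stage numbers copied from the README example.
stages = {
    "language_id": {"seconds_per_sample": 2, "vram_gb": 1},
    "tokenization": {"seconds_per_sample": 1, "vram_gb": 0},  # CPU-only
    "answer_model": {"seconds_per_sample": 10, "vram_gb": 10},
}
replicas = {"language_id": 2, "tokenization": 1, "answer_model": 10}
num_samples = 1_000

# Sequential: each sample pays every stage's latency, one sample at a time.
sequential_seconds = num_samples * sum(
    s["seconds_per_sample"] for s in stages.values()
)

# Streaming: a stage processes replicas / latency samples per second,
# and the slowest stage sets the pace of the whole pipeline.
throughput = {
    name: replicas[name] / s["seconds_per_sample"] for name, s in stages.items()
}
streaming_seconds = num_samples / min(throughput.values())

# GPU memory needed to keep every replica resident at once.
gpu_memory_gb = sum(replicas[name] * s["vram_gb"] for name, s in stages.items())

print(sequential_seconds, streaming_seconds, gpu_memory_gb)
```

With these replica counts every stage runs at one sample per second, so no stage waits on another, and the replicas together fill the 102 GB of GPU memory quoted in the example.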

docs/about/release-notes/index.md

Lines changed: 23 additions & 0 deletions
@@ -12,6 +12,16 @@ modality: "universal"
 
 # NeMo Curator Release Notes: {{ current_release }}
 
+## What's New in 26.04
+
+### Cosmos-Xenna 0.2.0
+
+Upgraded Cosmos-Xenna from 0.1.2 to 0.2.0 with a simplified resource model and improved GPU management:
+
+- **Simplified `Resources` API**: Removed `nvdecs`, `nvencs`, and `entire_gpu` fields. GPU allocation now uses `gpu_memory_gb` (fractional single-GPU) or `gpus` (one or more full GPUs) exclusively.
+- **Xenna-managed CUDA devices**: Xenna now manages CUDA device visibility directly, replacing the previous Ray-managed approach.
+- **Ray 2.54**: Updated Ray dependency to version 2.54 for compatibility with Cosmos-Xenna 0.2.0.
+
 ## What's New in 26.02
 
 ### Benchmarking Infrastructure
@@ -106,6 +116,13 @@ New API for tracking and analyzing pipeline execution:
 
 ## Dependency Updates
 
+### 26.04
+
+- **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model
+- **Ray**: Updated to 2.54
+
+### 26.02
+
 - **Transformers**: Pinned to 4.55.2 for stability and compatibility
 - **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
 - **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
@@ -136,6 +153,12 @@ New API for tracking and analyzing pipeline execution:
 
 ## Breaking Changes
 
+### 26.04
+
+- **`Resources` API**: The `nvdecs`, `nvencs`, and `entire_gpu` fields have been removed from `Resources`. Stages that previously used `entire_gpu=True` should use `gpus=1` instead. Stages that used `nvdecs` or `nvencs` should use `gpus` for GPU allocation.
+
+### 26.02
+
 - **InternVideo2 Removed**: Video pipelines must use alternative embedding models (Cosmos-Embed1)
 - **ID Field Standardization**: Custom deduplication workflows may need updates to use standardized ID field names
 
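
The 26.04 breaking change to `Resources` amounts to a small keyword-argument rewrite. The sketch below works on plain dictionaries so that it runs without NeMo Curator installed; `migrate_resource_kwargs` is a hypothetical helper rather than part of the library, and the `cpus` key is used only as an illustrative pass-through field.

```python
# Fields removed from Resources in 26.04, per the release notes.
REMOVED_FIELDS = ("nvdecs", "nvencs", "entire_gpu")


def migrate_resource_kwargs(old: dict) -> dict:
    """Hypothetical helper: rewrite pre-26.04 Resources kwargs for 26.04."""
    # Drop every removed field; everything else passes through unchanged.
    new = {key: value for key, value in old.items() if key not in REMOVED_FIELDS}
    if old.get("entire_gpu"):
        # Release notes: entire_gpu=True becomes gpus=1.
        new["gpus"] = 1
    # nvdecs/nvencs have no one-to-one replacement: the notes only say to
    # allocate through gpus, so the value stays a per-stage decision.
    return new


print(migrate_resource_kwargs({"cpus": 4.0, "entire_gpu": True}))
print(migrate_resource_kwargs({"cpus": 2.0, "nvdecs": 1, "gpu_memory_gb": 8.0}))
```

The resulting dictionary would then be passed to `Resources(**kwargs)`. Stages that relied on `nvdecs` or `nvencs` still need a manual review, since the release notes do not specify how many `gpus` to request in their place.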
