Commit 27f063d
26.04 staging (#1655)
* Add Fern documentation site (synced from llane/fern-docs-migration)
Introduce the full fern/ tree on top of origin/main for a clean PR base.
Made-with: Cursor
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* regen fern effort on fresh branch
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* fern ci
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* fix: satisfy Ruff EXE001/INP001 for Fern helper scripts
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Made-with: Cursor
* remove
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* chore: fix detect-secrets for Fern docs
- Ignore and stop scanning generated fern/product-docs (Fern autodoc output)
- Extend baseline allowlist for v26.02 synthetic MDX (doc examples with api_key)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Made-with: Cursor
* secrets baseline
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* index page redirects
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* simplify nesting for articles
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* simplify sections
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* more flattening
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* prep 26.04
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: update docs and release notes for Cosmos-Xenna 0.2.0 (PR #1571) (#1683)
* docs: update docs for Cosmos-Xenna 0.2.0 bump (PR #1571)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* ci: add FERN_TOKEN to fern docs CI workflow
The generate-library-reference job was failing because fern docs md
generate requires authentication via the FERN_TOKEN secret.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* ci: use fern check instead of fern docs md generate
fern docs md generate requires cloud authentication via FERN_TOKEN,
which is not configured as a repo secret. fern check validates the
configuration locally without authentication.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* ci: use DOCS_FERN_TOKEN org secret for fern docs generation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address PR feedback on release notes and GPU wording
- Add version sub-headers (26.04/26.02) to shared Dependency Updates
and Breaking Changes sections in cumulative release notes
- Fix misleading "multi-GPU" wording for gpus field — it supports
1 or more full GPUs, not just multi-GPU
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes for audio stage name fix (PR #1470) (#1691)
* docs: add release notes for audio stage name fix (PR #1470)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: apply style guide fixes to release notes entries
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: revert Sphinx release notes changes, keep Fern only
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes for vLLM race condition fix (PR #1590) (#1696)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update dedup docs for WorkflowRunResult (PR #1275) (#1654)
* docs: update dedup docs for WorkflowRunResult (PR #1275)
Update v26.04 fern docs to reflect the new WorkflowRunResult return type
from all deduplication workflow run() methods. Add API reference docs for
WorkflowRunResult and WorkflowBase, update code examples across dedup
pages, and replace 26.02 release notes with 26.04 skeleton.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* style: fix latinism and split long breaking changes sentence
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: add missing metadata keys from audit findings
Add input_filegroups_time, connected_components_pipeline_time, and
complete TextSemanticDedup keys to API reference table and inline
comments. Note PR #1275 provenance on workflow.py source link.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address Greptile review comments
- Qualify WorkflowBase claim: "most" workflows inherit, not "all"
(TextSemanticDeduplicationWorkflow duck-types the interface)
- Add missing id_generator_path to exact/fuzzy comments in index.mdx
- Complete semdedup.mdx metadata comments with identification_time,
removal_time, and final_output_path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: remove internal WorkflowRunResult/WorkflowBase API reference
Per praateekmahajan review feedback — these sections are too internal
for public docs. Also removes dead link in dedup index page.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: document multi-user metrics isolation (PR #1523) (#1656)
* docs: add 26.04 release notes and monitoring setup for PR #1523
Document the multi-user metrics isolation feature (per-user metrics
directories, metrics_dir parameter, PID-file tracking, auto-generated
Ray dashboards, graceful cleanup). Expand the monitoring setup section
in memory-management.mdx with step-by-step instructions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: apply style guide fixes to PR #1523 docs
Add periods to complete-sentence list items in release notes.
Fix passive voice ("are tracked" → active, "is stored" → active).
Adjust phrasing for PACE voice consistency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix inaccurate dashboard name (core → default)
The Ray dashboard generator uses "default" not "core" as the name
(generate_default_grafana_dashboard → ray_default_dashboard.json).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: extract monitoring setup into dedicated page
Move Prometheus/Grafana content from memory-management.mdx into a new
monitoring.mdx page under reference/infrastructure. Update nav, release
notes link, and best practices cross-references. Rename step headers
from "Step N:" to "N." for consistency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update semdedup docs for vLLM default backend (PR #1606) (#1659)
* docs: update semdedup docs for vLLM default backend (PR #1606)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix remaining merge conflict markers in semdedup.mdx
Resolve leftover conflict markers that were missed in the previous
merge resolution commit.
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add InferenceServer guide and update SDG docs for PR #1541 (#1665)
Documents the new InferenceServer and InferenceModelConfig APIs
(Ray Serve + vLLM) in v26.04 fern docs. Adds new how-to page,
updates LLM client with extra_kwargs and local inference example,
adds install extras to installation guide, updates SDG overview
with InferenceServer references, and replaces 26.02 release notes
with 26.04 skeleton containing the Inference Server entry.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes for cryptography bump and uv min version (PR #1682) (#1705)
Signed-off-by: Logan Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add actor pool progress bar documentation (PR #1457) (#1706)
* docs: add actor pool progress bar documentation for PR #1457
Document the new show_progress and progress_interval parameters
added to RayActorPoolExecutor in the execution backends reference,
experimental executors API reference, and 26.04 release notes.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update fern/versions/v26.04/pages/reference/infrastructure/execution-backends.mdx
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* docs: update fern docs for vLLM default in semantic dedup (PR #1606) (#1704)
Update semantic deduplication docs and release notes for the switch
from SentenceTransformers to vLLM as the default embedding backend
in TextSemanticDeduplicationWorkflow.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release note for Pygments bump (PR #1681) (#1729)
* docs: add release note for Pygments bump (PR #1681)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: remove transitive dependency entries from release notes
Remove Pygments and cryptography from Dependency Updates since
they are transitive dependencies, not core project dependencies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and docs for batched shuffle insertion (PR #1369) (#1698)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and container docs for CVE fixes (PR #1612) (#1733)
* docs: add release notes and container docs for CVE fixes (PR #1612)
Document four HIGH-severity CVE fixes (nemo-toolkit RCE, xgrammar DoS,
jackson-core DoS) and dependency updates (pynvml removal) in the 26.04
release notes. Add security hardening section to container environments
page for the ray_dist.jar removal.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix CVE vs GHSA terminology and clarify nemo-toolkit version range
Address review feedback:
- Change "four HIGH-severity CVEs" to "four HIGH-severity vulnerabilities"
since GHSA-72hv-8253-57qq is a GitHub Security Advisory, not a CVE.
- Clarify that the nemo-toolkit CVE was fixed in 2.6.1, but the requirement
was bumped to >=2.7.2 for additional fixes and dependency compatibility.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: remove pynvml dependency note per review feedback
Remove pynvml entry from Dependency Updates: it was accidentally
added in this release and was not present in the previous release.
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add exact dedup tuning params to v26.04 docs and release notes (#1699)
* docs: add exact dedup tuning params to v26.04 docs and release notes
Documents three new ExactDeduplicationWorkflow parameters exposed in
PR #1561: total_nparts, rmm_pool_size, and spill_memory_limit.
Updates the configuration table and performance best practices in
exact.mdx, and adds an Enhancements entry to the 26.04 release notes.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update fern/versions/v26.04/pages/curate-text/process-data/deduplication/exact.mdx
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address PR feedback on exact dedup tuning guidance
Update total_nparts guidance per ayushdg's feedback: explicitly
recommend smaller values (256/512) for better shuffle performance
on large runs.
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: document ScoreFilter benchmark metrics for 26.04 (#1686)
* docs: add ScoreFilter benchmark metrics to 26.04 release notes and heuristic filtering guide
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix undefined executor and division-by-zero in pipeline metrics snippet
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
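For context on the division-by-zero fix above, a minimal sketch of the kind of guard a metrics snippet might use. The function name and arguments here are hypothetical and are not taken from the Curator docs or API; the actual snippet may differ.

```python
def items_per_second(items_processed: int, elapsed_seconds: float) -> float:
    """Return throughput, falling back to 0.0 when no time has elapsed.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Guard against a zero (or negative) duration before dividing.
    if elapsed_seconds <= 0:
        return 0.0
    return items_processed / elapsed_seconds
```

Returning 0.0 is one possible choice; returning None or skipping the metric would work just as well, depending on how the value is reported.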
* docs: apply NVIDIA style guide fixes to filter metrics docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: generalize metrics section per reviewer feedback
Remove ScoreFilter-specific framing from release notes and heuristic
guide since pipeline stage metrics are not unique to ScoreFilter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add shared tokenizer docs and release notes (PR #1528) (#1700)
* docs: add shared tokenizer docs and release notes for PR #1528
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: replace latinism in shared tokenizer docs
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address Greptile review concerns for shared tokenizer PR
Add AegisClassifier aegis_prompt_field warning when using
use_existing_tokens, and include MultilingualDomainClassifier
in the release notes DeBERTa group list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update docs and release notes for fused iterate-extract (PR #1458) (#1684)
* docs: update docs and release notes for fused iterate-extract stages (PR #1458)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address PR feedback from Greptile review
Revert 26.02 release notes line to avoid referencing 26.04 class name
in wrong version section. Update custom.mdx wording to reflect that the
fused step maps to multiple abstract base classes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address PR review feedback from jgerh and greptile
Apply tech pubs copyedits: add periods to list items, fix link spacing,
backtick code references, correct product names (FastText, GLiNER),
replace OOM'd with "ran out of memory", rewrite component descriptions
for clarity, convert <Note> to :::{note} directive, remove horizontal
dividers, and align fern/non-fern breaking changes (add Three-Stage
Pipeline entry to non-fern).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: revert non-fern release notes to staging state
Release notes live in fern only; remove docs/ changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/about/release-notes/index.md
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/about/release-notes/index.md
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/about/release-notes/index.md
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/about/release-notes/index.md
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/about/release-notes/index.md
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Apply suggestion from @jgerh
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address sarahyurick feedback and remove last divider
Remove ID Field Standardization entries from non-fern 26.02 release
notes since PR #1390 is a 26.04 feature. Remove fused iterate-extract
entry from same section (26.04 via PR #1458). Remove remaining
horizontal divider in custom.mdx.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: revert all changes to non-fern release notes
Restore docs/about/release-notes/index.md to staging state; this PR
should not modify the non-fern release notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
* docs: update Fern docs for filter/modifier directory reorganization (PR #1472) (#1685)
Update import paths and release notes in Fern v26.04 docs to reflect
the DocumentFilter/DocumentModifier directory restructuring that avoids
eagerly importing heavy dependencies.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add AEGIS classifier GPU utilization note and release notes (Issue #878) (#1702)
* docs: add AEGIS classifier GPU utilization note and release notes (Issue #878)
Document confirmed full GPU utilization for the AEGIS safety classifier
on multi-GPU setups and add performance expectations note about the
LlamaGuard-7b generative model being slower than encoder-based classifiers.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address Greptile review feedback on AEGIS docs
Remove "now" from release notes to clarify this confirms existing
behavior rather than implying a bug fix. Link to the NVIDIA AEGIS model
page instead of Meta's base LlamaGuard-7b model.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: address PR feedback for FastText filter benchmarking release notes (#1690)
Add --fasttext-quality-model-path mention and model reference links
per reviewer suggestions from sarahyurick and greptile bot.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: add per-stage runtime environment documentation (PR #1623) (#1753)
* docs: add per-stage runtime environment documentation (PR #1623)
Add reference page, release note, and API reference updates for the
per-stage runtime_env feature that enables isolated Python dependencies
per pipeline stage using Ray's native runtime_env support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address PR feedback for per-stage runtime docs
- Clarify that pip/uv keys control the worker virtualenv installer,
not the local package manager
- Replace undefined stage references in example with full
RecordPackagingVersionStage definition from PR #1623 tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and docs for jusText OOM fix (PR #1534) (#1697)
Addresses PR #1697 review feedback:
- Condensed "What's New" entry to 1-2 lines (ayushdg verbosity feedback)
- Consolidated all bug fixes under single "## Bug Fixes" section
- Fixed "Audio Stage Name Propagation" being orphaned under "Dependency Updates"
- Kept feature in "What's New" and bug fix in "Bug Fixes" as separate concise entries
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add text embeddings guide and release notes for PR #1346 (#1687)
* docs: add text embeddings guide and release notes for PR #1346
Add Fern documentation for vLLM and Sentence Transformers embedding
support. Creates new Text Embeddings section with overview and vLLM
Embedder pages. Updates 26.04 release notes and expands semdedup page
with vLLM embedding example.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: deduplicate code snippets and fix placeholder model names
Replace duplicated vLLM Quick Start in embeddings overview and semdedup
page with cross-references to the canonical vllm-embedder page. Replace
placeholder "large-embedding-model" with consistent model identifiers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix internal stage reference for TextSemanticDeduplicationWorkflow
The workflow now uses VLLMEmbeddingModelStage internally, not
EmbeddingCreatorStage.
Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: address sarahyurick review feedback on embedding docs
- Fix "HuggingFace" to "Hugging Face" everywhere
- Remove vllm install instructions (included in text_cuda12)
- Fix "classe" typo to "classes" in release notes
- Update Setup column to recommend text_cuda12
- Position vLLM as recommended for semantic dedup
- Fix pretokenize recommendation (model-dependent, not universal)
- Remove vLLM vs ST comparison table per reviewer request
- Use correct model identifier google/embeddinggemma-300m
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes for vLLM setup race condition fix (PR #1590) (#1708)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add image reader Ray Data support docs (PR #1610) (#1731)
Document ImageReaderStage RayDataExecutor compatibility and fanout
behavior in tar archives loading guide, and add release note entry.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and docs for LSH memory config (PR #1603) (#1732)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and docs for math benchmarking support (PR #1604) (#1737)
Document S3 transport for CommonCrawlWARCReader, serialization fixes
for MathContentExtractor and CommonCrawlWARCReader, and boto3 dependency.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add release notes and docs for Ray 2.54 update (PR #1557) (#1734)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* ci: remove combined fern-docs-preview workflow (#1771)
Replace with the more secure two-part workflow split into
fern-docs-preview-build.yml and fern-docs-preview-comment.yml.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: update fern docs for RayDataExecutor promotion from experimental (PR #1619) (#1703)
* docs: update fern docs for RayDataExecutor promotion from experimental (PR #1619)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: add RayDataExecutor config table to execution-backends page
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: rename AudioBatch to AudioTask in audio curation docs (#1694)
* docs: rename AudioBatch to AudioTask across audio curation docs
Rename AudioBatch class/concept to AudioTask throughout the 26.04
documentation to reflect the upstream API rename. Updates navigation,
redirects, concepts, API reference, tutorials, and release notes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix style guide issues in audio curation docs
Replace latinisms (via, etc.) with plain English equivalents and fix
code formatting spacing in text-integration how-to.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix AudioTask API mismatches flagged in PR review
Address review comments from #1694:
- Fix process() return type in duration-calculation.mdx (returns single
AudioTask, not a list)
- Fix AudioTask construction in manifests-ingest.mdx (single dict per
task, not a list)
- Update "batch-level validation" wording to match single-entry model
- Correct asr-inference/index.mdx to state ASR stage defines
process_batch() as its canonical method
- Fix "file paths" plural to singular for single-entry AudioTask
- Replace invalid list comprehension over AudioTask.data dict keys in
asr-pipeline.mdx with validate() call
- Replace direct process() call on ASR stage with pipeline-driven pattern
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: address 9 open documentation issues (#1722)
* docs: address open documentation issues (#1264, #1326, #1546, #1547, #1548, #1549, #1550, #1551, #1552)
- Fix broken Installation and Configuration links in tutorials/README.md (#1264)
- Document pip install dependency conflict and recommend uv (#1326)
- Clarify that Curator uses Ray (not Dask) in migration guide and about page (#1546)
- Add architecture diagram section to README (#1547)
- Add Nemotron dataset usage section to README (#1548)
- Add data curation importance section to README (#1549)
- Add deep-dives section to fern docs (resource allocation, streaming, auto-balancing, throughput) (#1550)
- Add citation section to README (#1551)
- Add Updates/News section to README (#1552)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: move assets to fern/, revert docs/ changes
Images now live in fern/assets/images/ and README references updated.
Reverted docs/about/release-notes/migration-guide.md and
docs/admin/installation.md since docs/ is deprecated in favor of fern/.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: enrich deep-dive pages with slide content
Update resource allocation, streaming, and auto-balancing deep-dives
with concrete details from internal slides: ~5x CPU stage speedup,
20% streaming overlap improvement, before/after auto-balancing example
(1 vs 4 videos/s), and code examples matching actual API.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: add Berkeley lecture throughput example to README and deep-dives
Add illustrative pipeline example from UCB lecture (lang ID → tokenization
→ 5B model, 13,000s naive → ~1,000s with Curator) to the "Why Data
Curation?" section and throughput deep-dive. Update auto-balancing with
accurate tasks/s numbers and streaming with 99% GPU utilization stat.
Framed as illustrative, not a benchmark claim.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: move deep-dives into concepts as Scaling & Performance section
These are concept articles, not a new content type. Moved resource
allocation, streaming, auto-balancing, and throughput pages under
About > Concepts > Scaling & Performance. Removed the separate
Deep Dives nav section and index page.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Address PR review comments on concept docs
- resource-allocation: Use class-level resources attr, fix .with_() API,
replace fractional GPU example with Resources(gpus=0.25), remove
unsourced 5x claim, remove deduplication from CPU-bound example
- streaming: Remove unsourced 20% claim, use quickstart-style example,
soften batch size trade-off language
- throughput: Clarify GPU memory math in example, emphasize streaming,
use RayClient/RaySlurmClient, add ScoreFilter default note, note
auto-balancing behavior
- auto-balancing: Remove untested stage_stats snippet, reference Ray
Dashboard instead
- tutorials/README: Add .html extensions to doc URLs for consistency
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Replace architecture diagram with version without red underline
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Replace architecture diagram with clean version
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add MegatronTokenizerWriter documentation for 26.04 (#1742)
* docs: add MegatronTokenizerWriter documentation for 26.04
Add Save and Export page for text curation documenting MegatronTokenizerWriter
(PR #1259), including configuration reference, output format details, and
pipeline examples. Update release notes with feature entry, add page to
navigation, and update related-tools to reflect Curator's tokenization
capability.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix output tree placeholders and .idx format description
Use distinct hash placeholders to clarify per-partition output files.
Correct .idx layout to show three separate contiguous arrays instead
of implying an interleaved per-record format.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add NeMo Data Designer (NDD) integration docs for 26.04 (#1743)
* docs: add NeMo Data Designer integration documentation for 26.04
Add fern/ docs for the NDD + Curator integration, covering
DataDesignerStage, NDD-backed Nemotron-CC stages, configuration
builder patterns, and local/remote inference setup. Includes
release note line items and nav config updates.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add missing import os in NDD code examples
Fixes Greptile review comments — two code snippets using
os.environ["NVIDIA_API_KEY"] were missing the os import.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(ndd): address review feedback on NDD docs
- Link to NDD docs in the intro for reader context
- Add callout linking to NDD config builder reference
- Include missing DataDesignerStage import in remote provider snippet
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs(ndd): define pipeline in NDD-backed stages example
The NDD-Backed Stages snippet in the Nemotron-CC page called
pipeline.add_stage(...) without constructing the Pipeline, so copying
it verbatim raised NameError. Add the Pipeline import and
instantiation to make the example self-contained.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: fix ImageReaderStage batch_size parameter name in v26.04 docs (#1837)
QA found that v26.04 image curation docs use batch_size when the
actual ImageReaderStage parameter is dali_batch_size, causing all
documented examples to fail with TypeError at runtime.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs: add ALM pipeline fern docs and release notes (PRs #1419, #1608, #1676) (#1738)
* docs: add ALM pipeline docs and release notes for PRs #1419, #1608, #1676
New fern pages for ALM data curation:
- Concept page: about/concepts/audio/alm-pipeline
- Tutorial: curate-audio/tutorials/alm
- Processing pages: curate-audio/process-data/alm/ (index, data-builder, overlap-filtering)
Updated existing pages:
- release-notes/index: added entries for ALM pipeline, AudioTask redesign, audio profiling
- version.yml: added ALM nav entries under concepts, tutorials, and process-data
- Audio concepts index: added ALM Pipeline card
- Audio tutorials index: added ALM Tutorial card
- Audio process-data index: added ALM Data Curation section with cards
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: fix incomplete output JSON examples and deduplicate release notes
- Add missing fields to output JSON examples in alm-pipeline.mdx and
alm.mdx: filtered_dur_list, total_dur_window, truncation_events,
and lost_no_spkr in stats
- Clarify PR #1608 release note to reference PR #1419 instead of
repeating Hydra configuration details
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs(alm): address reviewer feedback on ALM pipeline docs
- Loss Tracking prose: document lost_no_spkr and lost_next_seg_bm as
sub-categories of lost_win on the concept page
- Describe overlap filtering as nested all-pairs rather than
"consecutive pairs"; note the greedy removal rule
- Correct the overlap_percentage table: >=50% (not >50%), and clarify
that 100% still removes fully-contained duplicates
- Output JSON examples: surface the pre-filter windows field, add
lost_next_seg_bm, and add a note that real output carries additional
duration and diagnostic fields
- data-builder.mdx: add lost_next_seg_bm row; reword lost_no_spkr as a
sub-category of lost_win
- overlap-filtering.mdx: add manifest_filepath row and a note about the
other intermediate fields the stage writes
- Tutorial: expand Loss Statistics tuning table with the two new
sub-categories; switch sample-data command to run from the repo root
so the fixture path matches the in-repo README
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: remove duplicate PR #1608 release note entry
The "AudioBatch to AudioTask Redesign (PR #1608)" section duplicated the
more complete "Audio Task Redesign (PR #1608)" entry above it. Greptile
flagged this as a P1 issue on PR #1738.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* docs: warn users of Python 3.10 support removal in 26.06 (#1868)
* docs: warn users of Python 3.10 support removal in 26.06
Adds a deprecation notice to the 26.04 release notes, installation
guide, and deployment requirements so users upgrading to 26.06 are
prepared to move off Python 3.10.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* docs: soften Python 3.10 deprecation wording to allow 3.13+
Avoid enumerating "3.11 and 3.12 only" since Python 3.13 support in
26.06 is not yet confirmed. Rephrases the deprecation notices to point
to any newer supported version (3.11+).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* ci: refresh detect-secrets baseline for synthetic docs
Add the 9 placeholder API key examples in fern/versions/v26.04/pages/
curate-text/synthetic/ to the detect-secrets baseline as known false
positives so the secrets-detector check passes.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <llane@nvidia.com>
Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
1 parent 2420065 commit 27f063d
172 files changed
Lines changed: 26674 additions & 124 deletions
File tree
- .github/workflows
- config
- docs/about/release-notes
- fern
- assets/images
- versions
- v26.04/pages
- _images
- about
- concepts
- audio
- image
- text
- _images
- video
- _images
- release-notes
- admin
- deployment
- slurm
- integrations
- api-reference
- executors
- tasks
- curate-audio
- load-data
- process-data
- alm
- asr-inference
- audio-analysis
- quality-assessment
- text-integration
- tutorials
- curate-images
- load-data
- process-data
- embeddings
- filters
- tutorials
- curate-text
- load-data
- process-data
- content-processing
- deduplication
- embeddings
- language-management
- quality-assessment
- specialized-processing
- synthetic
- nemotron-cc
- tutorials
- curate-video
- load-data
- process-data
- tutorials
- _images
- pipeline-customization
- get-started
- reference
- infrastructure
- tutorials