Commit d95a3f7

sync with main #NO_PR
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
2 parents b60dc95 + 27f063d commit d95a3f7

492 files changed

Lines changed: 57692 additions & 6376 deletions

Lines changed: 313 additions & 0 deletions

---
name: nemo-curator-docs
description: Maintain the NeMo Curator Fern docs site by adding, updating, moving, or removing pages under fern/. Use for any documentation changes.
---

# NeMo Curator Docs Maintenance

Unified skill for adding, updating, moving, and removing pages on the NeMo Curator Fern documentation site.

## Scope Rule

**ALL docs edits happen under `fern/`.** The legacy `docs/` directory is deprecated; do not add or move content into it. Release notes, migration guides, and every new page belong under `fern/`.

## Layout at a Glance

```
fern/
├── fern.config.json              # Minimal Fern config (org + CLI version)
├── docs.yml                      # Site config: versions, tabs, redirects, libraries
├── versions/
│   ├── latest.yml                # Symlink → v26.02.yml (do not edit directly)
│   ├── v26.02.yml                # Nav tree for current train
│   ├── v26.02/pages/             # MDX content for current train
│   ├── v25.09.yml
│   └── v25.09/pages/
├── components/                   # Custom TSX components (footer, etc.)
├── assets/                       # Images, SVGs, favicon
├── substitute_variables.py       # CI: resolves {{ variables }} in MDX
└── AUTODOCS_GUIDE.md             # Library reference generation guide
```

**Current train:** `v26.02`. Default all new pages there unless the user specifies a version.

```
File system                               Published URL
───────────────────────────────────────   ────────────────────────────────────────
fern/versions/v26.02/pages/               docs.nvidia.com/nemo/curator/latest/
  └─ get-started/text.mdx                   └─ get-started/text
fern/versions/v26.02.yml ── nav for ──┐   docs.nvidia.com/nemo/curator/v26.02/
fern/versions/latest.yml ─ symlink ───┘     └─ get-started/text
fern/versions/v25.09/pages/               docs.nvidia.com/nemo/curator/v25.09/
  └─ get-started/text.mdx                   └─ get-started/text
```

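
As a rough illustration of the mapping above, the sketch below derives the published URL(s) for a page from its path. It assumes the slug mirrors the file path, which only holds when the nav file does not override `slug:`. The names `published_urls`, `BASE_URL`, and `CURRENT_TRAIN` are hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath

BASE_URL = "docs.nvidia.com/nemo/curator"  # prefix taken from the diagram above
CURRENT_TRAIN = "v26.02"                    # the train latest.yml points at


def published_urls(mdx_path: str) -> list[str]:
    """Return the URL(s) where a page under fern/versions/<train>/pages/ is served.

    Assumption: the slug mirrors the file path (no custom `slug:` in the nav).
    """
    parts = PurePosixPath(mdx_path).parts
    # Expected shape: fern / versions / <train> / pages / <...>.mdx
    if len(parts) < 5 or parts[:2] != ("fern", "versions") or parts[3] != "pages":
        raise ValueError(f"not a versioned page path: {mdx_path}")
    train = parts[2]
    slug = str(PurePosixPath(*parts[4:]).with_suffix(""))
    urls = [f"{BASE_URL}/{train}/{slug}"]
    if train == CURRENT_TRAIN:
        # The current train is also served under /latest/.
        urls.insert(0, f"{BASE_URL}/latest/{slug}")
    return urls
```

A page in the current train yields two URLs (`latest` and `v26.02`), while a page in an older train yields only its versioned URL.
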
## Operations

### Add a Page

1. Gather: page title, target section, filename (kebab-case `.mdx`), subdirectory under `fern/versions/v26.02/pages/`.
2. Create `fern/versions/v26.02/pages/<subdirectory>/<filename>.mdx`:

```mdx
---
description: "One-line SEO description"
categories: ["<category>"]
tags: ["<tag-1>", "<tag-2>"]
personas: ["<persona>"]
difficulty: "beginner"       # beginner | intermediate | advanced
content_type: "tutorial"     # tutorial | how-to | reference | concept | index
modality: "text-only"        # text-only | image-only | video-only | audio-only | universal
---

# <Page Title>

<content>
```

3. Add a nav entry in `fern/versions/v26.02.yml` under the correct section:

```yaml
- page: <Page Title>
  path: ./v26.02/pages/<subdirectory>/<filename>.mdx
  slug: <filename>
```

4. If this also applies to `latest`, no action is needed: `latest.yml` is a symlink to `v26.02.yml`.

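
The steps above can be sketched as two small string builders: one for a minimal page and one for the nav block. This is an illustration only; `render_page` and `nav_entry` are hypothetical helpers, and the kebab-case pattern is one assumed reading of "kebab-case" (lowercase letters and digits separated by single hyphens).

```python
import json
import re

# Assumed definition of kebab-case: lowercase alphanumerics joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def render_page(title: str, description: str) -> str:
    """Build a minimal MDX page: required `description` frontmatter plus an H1."""
    # json.dumps produces a double-quoted string, which is also valid YAML.
    return f"---\ndescription: {json.dumps(description)}\n---\n\n# {title}\n"


def nav_entry(title: str, subdirectory: str, filename: str, train: str = "v26.02") -> str:
    """Build the nav block to paste under the right section of <train>.yml."""
    if not KEBAB_CASE.fullmatch(filename):
        raise ValueError(f"filename must be kebab-case without extension: {filename!r}")
    return (
        f"- page: {title}\n"
        f"  path: ./{train}/pages/{subdirectory}/{filename}.mdx\n"
        f"  slug: {filename}\n"
    )
```

The output of `nav_entry` still has to be indented to match the section it is pasted into.
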
### Update a Page

1. Locate by path, title, or keyword (`grep -rn` in `fern/versions/v26.02/pages/`).
2. **Content only**: edit the MDX directly.
3. **Title change**: update the `# H1` heading in the MDX (plus any frontmatter text that mentions the old title) and the `- page:` name in `fern/versions/v26.02.yml`.
4. **Section move**: `git mv` the file, update its `path:` in the nav, and fix all incoming links.
5. **Slug change**: update `slug:` in the nav and add a redirect in `fern/docs.yml` so old URLs keep working.

### Remove a Page

1. Find incoming links: `grep -r "<filename>" fern/versions/v26.02/pages/ --include="*.mdx"`.
2. `git rm fern/versions/v26.02/pages/<subdirectory>/<filename>.mdx`.
3. Remove the `- page:` block from `fern/versions/v26.02.yml`. If it was the last page in a section, remove the `- section:` block.
4. Fix or remove all incoming links found in step 1.
5. Add a redirect in `fern/docs.yml` if the URL was public.

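
For step 1, a Python equivalent of the `grep` call may be handy when the results need to be processed further. This is a minimal sketch; `find_incoming_links` is a hypothetical helper that does a plain substring match, just like the `grep` invocation above.

```python
from pathlib import Path


def find_incoming_links(pages_dir: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every .mdx line containing needle."""
    hits = []
    for mdx in sorted(Path(pages_dir).rglob("*.mdx")):
        lines = mdx.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((mdx.relative_to(pages_dir).as_posix(), lineno, line.strip()))
    return hits
```

Calling it with `fern/versions/v26.02/pages` and the filename stem lists every line that would need fixing in step 4.
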
### Back-port to an Older Version

Only when explicitly asked. Repeat the operation in the corresponding `fern/versions/vXX.YY/` tree and `vXX.YY.yml` nav. MDX content often diverges between trains, so do not blindly copy.

### Worked Example: Adding a Page

Request: *"Add a how-to for benchmarking text pipelines under Curate Text."*

1. Create `fern/versions/v26.02/pages/curate-text/benchmarking.mdx`:

```mdx
---
description: "Benchmark text curation pipelines and interpret throughput and memory metrics"
categories: ["how-to"]
tags: ["text-curation", "benchmarking", "performance"]
personas: ["mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Benchmark Text Pipelines

<content>
```

2. Add a nav entry in `fern/versions/v26.02.yml` under the existing `Curate Text` section:

```yaml
- page: Benchmark Text Pipelines
  path: ./v26.02/pages/curate-text/benchmarking.mdx
  slug: benchmarking
```

3. Run `cd fern && fern check`, then `fern docs dev`, and verify the page renders at `/curate-text/benchmarking`.

### Worked Example: Renaming a Slug (with Redirect)

Request: *"Rename `/curate-text/benchmarking` to `/curate-text/performance`."*

1. Update `slug:` in `fern/versions/v26.02.yml`: `slug: performance`.
2. (Optional) `git mv` the MDX file if you want the filename to match the slug.
3. Add a redirect to `fern/docs.yml` so old links keep working:

```yaml
redirects:
  - source: "/nemo/curator/latest/curate-text/benchmarking"
    destination: "/nemo/curator/latest/curate-text/performance"
  - source: "/nemo/curator/v26.02/curate-text/benchmarking"
    destination: "/nemo/curator/v26.02/curate-text/performance"
```

4. Run `grep -rn "/curate-text/benchmarking" fern/versions/v26.02/pages/` and update any incoming links.

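
The redirect pairs in step 3 follow a regular pattern: one entry for `latest` and one for the versioned train. A minimal sketch that generates them is shown below; `slug_redirects` and `to_yaml` are hypothetical helpers, and the `/nemo/curator` prefix is taken from the example above.

```python
def slug_redirects(
    section: str,
    old_slug: str,
    new_slug: str,
    train: str = "v26.02",
    base: str = "/nemo/curator",
) -> list[tuple[str, str]]:
    """Return (source, destination) pairs for both `latest` and the versioned train."""
    return [
        (f"{base}/{version}/{section}/{old_slug}", f"{base}/{version}/{section}/{new_slug}")
        for version in ("latest", train)
    ]


def to_yaml(pairs: list[tuple[str, str]]) -> str:
    """Render the pairs in the same shape as the `redirects:` block in docs.yml."""
    lines = ["redirects:"]
    for source, destination in pairs:
        lines.append(f'  - source: "{source}"')
        lines.append(f'    destination: "{destination}"')
    return "\n".join(lines) + "\n"
```
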

---

## Content Guidelines

NeMo Curator uses **Fern-native MDX components directly** (unlike Dynamo, which converts GitHub callouts in CI). Do not use `> [!NOTE]` syntax; it will not render.

| Purpose | Component |
|---|---|
| Neutral aside | `<Note>...</Note>` |
| Helpful tip | `<Tip>...</Tip>` |
| Informational callout | `<Info>...</Info>` |
| Warning | `<Warning>...</Warning>` |
| Error / danger | `<Error>...</Error>` |
| Card grid on index pages | `<Cards>` with `<Card title="..." href="...">` children |

Images live in `fern/assets/` (shared) or `fern/versions/vXX.YY/pages/_images/` (version-scoped). Reference with root-relative paths.

Component examples:

```mdx
<Tip>
If `uv` is not installed, see the [Installation Guide](/admin/installation).
</Tip>

<Warning>
GPU-accelerated dedup requires CUDA {{ recommended_cuda }} or later.
</Warning>

<Cards>
  <Card title="Text Curation" href="/get-started/text">
    Set up and run text curation workflows.
  </Card>
  <Card title="Image Curation" href="/get-started/image">
    Set up and run image curation workflows.
  </Card>
</Cards>
```

## Frontmatter Fields

Required: `description`.
Optional but strongly preferred: `categories`, `tags`, `personas`, `difficulty`, `content_type`, `modality`. Existing pages in the same section are the best reference for valid values.

`title` is taken from the `- page:` entry in the nav file; the MDX file itself uses an `# H1` heading matching the page name.

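
The required/preferred split above can be checked with a few lines of Python before running `fern check`. This is a sketch under stated assumptions: `check_frontmatter` is a hypothetical helper, it only looks at top-level keys, and it deliberately avoids a YAML dependency, so it does not validate values.

```python
import re

# Frontmatter is the block between the first two `---` lines at the top of the file.
FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
TOP_LEVEL_KEY = re.compile(r"([A-Za-z_][\w-]*):")
PREFERRED = ("categories", "tags", "personas", "difficulty", "content_type", "modality")


def frontmatter_keys(mdx_text: str) -> set[str]:
    """Collect the top-level (unindented) keys of the frontmatter block."""
    match = FRONTMATTER.match(mdx_text)
    if not match:
        return set()
    keys = set()
    for line in match.group(1).splitlines():
        key = TOP_LEVEL_KEY.match(line)
        if key:
            keys.add(key.group(1))
    return keys


def check_frontmatter(mdx_text: str) -> tuple[list[str], list[str]]:
    """Return (errors, warnings): errors for required fields, warnings for preferred ones."""
    keys = frontmatter_keys(mdx_text)
    errors = [] if "description" in keys else ["missing required field: description"]
    warnings = [f"missing preferred field: {name}" for name in PREFERRED if name not in keys]
    return errors, warnings
```
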
## Variable Substitution

Tokens like `{{ product_name }}`, `{{ container_version }}`, `{{ current_release }}`, `{{ github_repo }}`, `{{ min_python_version }}` are resolved by `fern/substitute_variables.py` at CI time. Use them instead of hard-coding versions or URLs. The canonical list is in `DEFAULT_VARIABLES` at the top of that file.

Example in MDX:

```mdx
Install {{ product_name }} {{ current_release }} from {{ github_repo }}.
Requires Python {{ min_python_version }}+ and CUDA {{ recommended_cuda }}.
```

After substitution at CI time:

```
Install NeMo Curator 25.09 from https://github.com/NVIDIA-NeMo/Curator.
Requires Python 3.10+ and CUDA 12.0+.
```

To preview substitution locally:

```bash
python fern/substitute_variables.py versions/v26.02 --version 26.02 --dry-run
```

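
To make the token mechanics concrete, here is a minimal sketch of `{{ name }}` substitution. It is not the implementation in `fern/substitute_variables.py`; the `substitute` function and the sample values are assumptions (the values are copied from the sample output above, not from `DEFAULT_VARIABLES`). Unknown tokens are left untouched, which is consistent with the "shows literally on site" symptom in the Debugging table.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
TOKEN = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Illustrative values only, copied from the sample output above.
SAMPLE_VARIABLES = {
    "product_name": "NeMo Curator",
    "current_release": "25.09",
    "github_repo": "https://github.com/NVIDIA-NeMo/Curator",
    "min_python_version": "3.10",
}


def substitute(text: str, variables: dict[str, str]) -> str:
    """Replace known {{ tokens }}; leave unknown tokens exactly as written."""
    def replace(match: re.Match) -> str:
        return variables.get(match.group(1), match.group(0))

    return TOKEN.sub(replace, text)
```

Because unknown tokens pass through unchanged, a quick search for `{{` in the substituted output is enough to spot variables that are missing from the table.
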
## Validate

```bash
cd fern
fern check               # YAML + frontmatter validation
fern docs broken-links   # link check
fern docs dev            # localhost:3000 hot-reload preview
```

`fern check` must pass before commit. The broken-link check can be deferred locally but must pass in CI.

## Commit & Preview

```bash
git add fern/
git commit -s -m "docs: <add|update|remove> <page-title>"
```

PRs that touch `fern/**` get an automatic Fern preview URL posted as a comment by `.github/workflows/fern-docs-preview.yml`. No manual step needed.

```
                    ┌─ fern-docs-ci.yml         → fern check + autodocs
PR (touches fern/) ─┼─ fern-docs-preview.yml    → preview build
                    └─ fern-docs-preview-*.yml  → 🌿 preview URL comment

Merge to main       → NO publish. Site is unchanged.

Tag push (docs/v*)  → publish-fern-docs.yml → docs.nvidia.com/nemo/curator
```

## Publishing to Production

**Merging to `main` does NOT publish.** Production only updates when a tag matching `docs/v*` is pushed (or the workflow is manually dispatched from the **Actions** tab). Do not push tags unless the user asks.

The tag must be `docs/v<MAJOR>.<MINOR>.<PATCH>`: the `docs/v` prefix is required by the workflow trigger, and the semver suffix should match the docs release in `CHANGELOG.md`.

```bash
# Correct: triggers publish
git tag docs/v1.1.0
git push origin docs/v1.1.0

git tag docs/v1.2.0-rc1      # pre-release suffix is fine, still matches docs/v*
git push origin docs/v1.2.0-rc1

# Wrong: these will NOT trigger publish
git tag v1.1.0               # missing docs/ prefix
git tag docs/1.1.0           # missing v
git tag docs-v1.1.0          # wrong separator
```

URL → version mapping after publish:

```
docs.nvidia.com/nemo/curator/latest/...  → symlink to current train (v26.02 today)
docs.nvidia.com/nemo/curator/v26.02/...  → 26.02 train
docs.nvidia.com/nemo/curator/v25.09/...  → 25.09 train
```

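
A tag can be sanity-checked before pushing. The sketch below separates two questions: whether the tag would match the `docs/v*` trigger, and whether it follows the recommended `docs/v<MAJOR>.<MINOR>.<PATCH>` shape. Both helpers are hypothetical. `fnmatchcase` only approximates GitHub's tag filter (its `*` also matches `/`, which GitHub's does not), and the pre-release pattern is an assumption based on the `-rc1` example above.

```python
import re
from fnmatch import fnmatchcase

TRIGGER_GLOB = "docs/v*"
# Assumed shape: docs/v<MAJOR>.<MINOR>.<PATCH> with an optional pre-release suffix.
RECOMMENDED = re.compile(r"docs/v\d+\.\d+\.\d+(?:-[0-9A-Za-z.]+)?")


def triggers_publish(tag: str) -> bool:
    """Approximate check that the tag matches the workflow's docs/v* filter."""
    return fnmatchcase(tag, TRIGGER_GLOB)


def is_well_formed(tag: str) -> bool:
    """Check the tag against the recommended semver-style format."""
    return RECOMMENDED.fullmatch(tag) is not None
```

A tag such as `docs/vnext` would pass the first check but fail the second, which is the case worth catching before it reaches the publish workflow.
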
## Version Ship Checklist (when cutting a new train)

When the user ships a new version (e.g. `v26.04`):

1. Copy `fern/versions/v26.02/pages/` → `fern/versions/v26.04/pages/` and edit content.
2. Copy `fern/versions/v26.02.yml` → `fern/versions/v26.04.yml` and update all `./v26.02/` path prefixes.
3. Repoint the symlink: `ln -sf v26.04.yml fern/versions/latest.yml`.
4. Update the `versions:` list in `fern/docs.yml`: add the new display-name and mark older trains stable.
5. Add redirect rules in `fern/docs.yml` for `/nemo/curator/26.04/:path*` → `/nemo/curator/v26.04/:path*` (see existing patterns).
6. Align `display-name` strings with `CHANGELOG.md` and `nemo_curator/package_info.py`.

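
Steps 1 to 3 of the checklist are mechanical and can be sketched in Python. This is an illustration under stated assumptions: `cut_train` is a hypothetical helper, it rewrites path prefixes with a plain string replace, and steps 4 to 6 (editing `docs.yml` and aligning display names) remain manual.

```python
import os
import shutil
from pathlib import Path


def cut_train(versions_dir: str, old: str, new: str) -> None:
    """Copy pages and nav from the old train to the new one and repoint latest.yml."""
    versions = Path(versions_dir)

    # Step 1: copy the pages tree (parent directories are created as needed).
    shutil.copytree(versions / old / "pages", versions / new / "pages")

    # Step 2: copy the nav file, rewriting the ./<old>/ path prefixes.
    nav = (versions / f"{old}.yml").read_text(encoding="utf-8")
    (versions / f"{new}.yml").write_text(
        nav.replace(f"./{old}/", f"./{new}/"), encoding="utf-8"
    )

    # Step 3: repoint latest.yml with a relative target, like `ln -sf`.
    latest = versions / "latest.yml"
    if latest.is_symlink() or latest.exists():
        latest.unlink()
    os.symlink(f"{new}.yml", latest)
```

For the example in the checklist the call would be `cut_train("fern/versions", "v26.02", "v26.04")`, followed by a review of the copied content before committing.
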
## Debugging

| Symptom | Fix |
|---|---|
| `fern check` YAML error | 2-space indent; `- page:` inside `contents:`; `path:` is relative to the version YAML file |
| Page 404 in preview | `slug:` missing or duplicated in the same section; confirm in `vXX.YY.yml` |
| `{{ variable }}` shows literally on site | Not in `DEFAULT_VARIABLES` in `substitute_variables.py`; add it there |
| MDX parse error | Replace bare `<https://...>` with `[text](https://...)`; escape `<` in prose with `&lt;` or backticks |
| Old Sphinx URL breaks | Add a `redirects:` entry in `fern/docs.yml` |
| Library reference missing | Run `fern docs md generate` in `fern/` (see `fern/AUTODOCS_GUIDE.md`) |
| Broken image | Path is relative to the MDX file; check `fern/assets/` or `pages/_images/` exists |

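
The bare-autolink fix from the MDX parse error row can be applied mechanically. The sketch below is a hypothetical helper, not an existing tool in the repo. It uses the URL itself as the link text and, as a known limitation, does not skip fenced code blocks, so review the result before committing.

```python
import re

# Matches <http://...> or <https://...> autolinks, which MDX tries to parse as JSX.
BARE_AUTOLINK = re.compile(r"<(https?://[^\s>]+)>")


def fix_bare_autolinks(text: str) -> str:
    """Rewrite <https://...> as [https://...](https://...)."""
    return BARE_AUTOLINK.sub(r"[\1](\1)", text)
```
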
## Key References

| File | Purpose |
|---|---|
| `fern/docs.yml` | Site config, versions, redirects, libraries |
| `fern/versions/vXX.YY.yml` | Navigation tree for a version |
| `fern/versions/vXX.YY/pages/` | MDX content for a version |
| `fern/versions/latest.yml` | Symlink → current train's nav (do not edit) |
| `fern/components/` | Custom TSX (footer, release banner) |
| `fern/assets/` | Shared images, SVGs, favicon |
| `fern/substitute_variables.py` | Variable definitions + CI replacement |
| `fern/AUTODOCS_GUIDE.md` | Generating library reference MDX from source |
| `fern/README.md` | Full docs architecture guide |
| `.github/workflows/fern-docs-*.yml` | CI: validation, preview, publish |

---
