Commit efda041

docs: polish plugins dev note
1 parent 995a725 commit efda041

6 files changed

Lines changed: 39 additions & 76 deletions

docs/devnotes/posts/have-it-your-way.md

Lines changed: 14 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -45,42 +45,32 @@ config_builder.add_column(
4545
config_builder.add_processor(RobotSFTProcessor(output_column="messages"))
4646
```
4747

48-
That is the point of plugins: install a package, import its config classes, and keep the workflow declarative. The Isaac run reader, event labeler, and trainer-format processor own the custom parsing, labeling, validation, and export shape, while Data Designer still handles discovery, dependency ordering, model calls, previews, and output.
49-
50-
---
48+
That is the point of plugins: install a package, import its config classes, and keep the workflow declarative. The Isaac run reader, event labeler, and trainer-format processor own the project-specific parsing and trainer-facing shape. Data Designer still does the framework work, from component discovery and dependency ordering to model execution and output handling.
5149

5250
## **Customization Is the Normal Case**
5351

5452
![A confused engineer trying to fit custom building blocks into the wrong framework slots](assets/have-it-your-way/customization-blocks-confusion.png){ .devnote-section-graphic }
5553

56-
The mess usually starts innocently. A team defines a Data Designer config, then discovers that its seed data lives in an internal layout, its generated column needs a domain simulator, and its trainer expects a slightly different record shape. Someone writes a small reader beside the notebook. Someone patches a generator into a project folder. Someone adds a cleanup script after preview because the final export has one more organization-specific rule. Each choice is reasonable because every project has its own corpus, policy, ontology, simulator, and training stack.
54+
The mess usually starts innocently. A team defines a Data Designer config, then discovers that its seed data lives in an internal layout, its generated column needs a domain simulator, and its trainer expects a slightly different record shape. Someone writes a small reader beside the notebook. Someone patches a generator into a project folder. Someone adds a cleanup script after preview because the final export has one more organization-specific rule. Each choice is reasonable because every project brings a different corpus, policy model, domain vocabulary, or training stack.
5755

5856
The problem is that the custom behavior now lives around Data Designer instead of inside the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose. Plugins give that bespoke work a clean package boundary – a name, typed config, runtime implementation, entry point, and tests that travel together. Users still declare the dataset they want, but the local reader, domain generator, or trainer-format processor becomes a normal Data Designer component instead of another layer of glue.
5957

6058
<div class="devnote-clear"></div>
6159

62-
---
63-
6460
## **Where Plugins Fit**
6561

6662
The first plugin boundaries match the places where real projects most often need customization.
6763

68-
<div style="margin-left: 1.5rem;">
64+
**Seed reader plugins** bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.
6965

70-
<p>📥 <strong>Seed reader plugins</strong> bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.</p>
66+
**Column generator plugins** create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.
7167

72-
<p>🧬 <strong>Column generator plugins</strong> create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.</p>
68+
**Processor plugins** transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.
7369

74-
<p>🔧 <strong>Processor plugins</strong> transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.</p>
75-
76-
</div>
77-
78-
These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer should keep owning the pipeline responsibilities: validation, dependency resolution, batching, model calls, logging, previews, output handling. That split lets custom components use the normal workflow without moving orchestration into the project.
70+
These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer validates configs and resolves dependencies. It plans batches, runs models, records logs, shows previews, then writes the output. That split lets custom components use the normal workflow without moving orchestration into the project.
7971

8072
What about [custom columns](../../concepts/custom_columns.md)? Start with a custom column when you are prototyping column-generator behavior or need a one-off column that only one project uses. Custom columns keep the logic in a Python function inside the config, with declared dependencies and optional model access. When that logic needs a stable config schema, tests, packaging, docs, or reuse across teams, promote it to a column generator plugin.
8173

82-
---
83-
8474
## **Author a Plugin: From Glue Code to Seed Reader**
8575

8676
To make this concrete, let's walk through a full example. Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it. That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one.
@@ -109,7 +99,7 @@ class MarkdownSectionSeedSource(FileSystemSeedSource):
10999
seed_type: Literal["markdown-sections"] = "markdown-sections"
110100
```
111101

112-
The implementation class is where the old helper code should move. For a filesystem seed reader, Data Designer gives you a small interface instead of a blank page: implement `build_manifest(...)` to build a cheap index of candidate inputs, and implement `hydrate_row(...)` to turn each selected manifest row into one or more dataset rows. That split matters because Data Designer can sample, shuffle, partition, and batch against the lightweight manifest before paying the cost of reading files, parsing sections, or calling project-specific libraries. The parser can still be a normal helper function; the reader class is the framework boundary.
102+
The implementation class is where the old helper code should move. For a filesystem seed reader, Data Designer gives you a small interface instead of a blank page: implement `build_manifest(...)` to build a cheap index of candidate inputs, and implement `hydrate_row(...)` to turn each selected manifest row into one or more dataset rows. That split matters because Data Designer can plan work against the lightweight manifest before paying the cost of reading files, parsing sections, or calling project-specific libraries. The parser can still be a normal helper function; the reader class is the framework boundary.
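The manifest-then-hydrate split can be illustrated without any framework code. The following is a minimal sketch in plain Python, not Data Designer's actual interface: the free functions `build_manifest` and `hydrate_row` only borrow the method names from the text, and their signatures, the `relative_path` key, and the `## ` section rule are assumptions made for illustration.

```python
import random
import tempfile
from pathlib import Path


def build_manifest(root: Path, file_pattern: str = "*.md") -> list[dict[str, str]]:
    """Cheap pass: list candidate files by path only, without opening them."""
    return [
        {"relative_path": path.relative_to(root).as_posix()}
        for path in sorted(root.rglob(file_pattern))
    ]


def _section_row(relative_path: str, heading: str, body: list[str]) -> dict[str, str]:
    """Package one parsed section as a dataset row (hypothetical row shape)."""
    return {
        "relative_path": relative_path,
        "section": heading,
        "text": "\n".join(body).strip(),
    }


def hydrate_row(root: Path, manifest_row: dict[str, str]) -> list[dict[str, str]]:
    """Expensive pass: read one file and emit one row per '## ' section."""
    relative_path = manifest_row["relative_path"]
    text = (root / relative_path).read_text(encoding="utf-8")
    rows = []
    heading, body = None, []
    for line in text.splitlines():
        if line.startswith("## "):
            if heading is not None:
                rows.append(_section_row(relative_path, heading, body))
            heading, body = line[3:].strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        rows.append(_section_row(relative_path, heading, body))
    return rows


def demo() -> tuple[list[dict[str, str]], list[dict[str, str]]]:
    """Build a manifest for two files, select one entry, hydrate only that one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.md").write_text("## Intro\nhello\n## Usage\nrun it\n", encoding="utf-8")
        (root / "b.md").write_text("## Notes\nmisc\n", encoding="utf-8")
        manifest = build_manifest(root)
        # Selection happens on the cheap manifest; no file has been read yet.
        selected = random.Random(0).sample(manifest, k=1)
        rows = [row for entry in selected for row in hydrate_row(root, entry)]
    return manifest, rows


if __name__ == "__main__":
    manifest, rows = demo()
    print(len(manifest), "manifest entries;", len(rows), "hydrated rows")
```

In this sketch only the selected manifest entry is ever opened and parsed, which is the property the split is meant to preserve. Sampling stands in here for whatever planning the framework actually performs.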
113103

114104
```python
115105
# impl.py
@@ -141,7 +131,7 @@ class MarkdownSectionSeedReader(FileSystemSeedReader[MarkdownSectionSeedSource])
141131
context: SeedReaderFileSystemContext,
142132
) -> list[dict[str, str]]:
143133
# Fast path: enumerate candidate files and return cheap metadata.
144-
# Data Designer can index, sample, shuffle, and batch these rows.
134+
# Data Designer can plan work from this manifest before hydration.
145135
matched_paths = self.get_matching_relative_paths(
146136
context=context,
147137
file_pattern=self.source.file_pattern,
@@ -227,46 +217,33 @@ results = DataDesigner().preview(builder, num_records=5)
227217

228218
No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow.
229219

230-
---
231-
232220
## **Building the Plugin Ecosystem**
233221

234-
Reusable plugins also need a discovery layer. Once a plugin is useful beyond one project, users need a simple way to find the right package, install it, and get back to declaring datasets. That is why Data Designer includes a built-in NVIDIA plugin catalog and a CLI workflow for discovery and installation.
222+
Plugin authoring is one half of reuse. Distribution is the other. When a plugin is ready for more than one project, Data Designer provides a clear path through the CLI, with the NVIDIA catalog built in.
235223

236224
The NVIDIA catalog is backed by [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated home for first-party plugin packages, packaging examples, and plugin-specific docs. Keeping those packages outside the core repository lets them carry optional dependencies, target narrower use cases, and move at their own pace while still using the same plugin interface once installed.
237225

238-
For users, the catalog makes discovering and installing first-party plugins seamless. The common flow is intentionally short: list the compatible packages, search for what you need, and install the package by name or alias.
226+
For users, the first-party path is short: list what is available, search for what you need, and install by package name or alias.
239227

240228
```bash
241229
data-designer plugin list
242230
data-designer plugin search github
243231
data-designer plugin install github
244232
```
245233

246-
After installation, normal entry point discovery takes over. Import the plugin's config classes and keep building the same declarative workflow.
234+
After installation, there is no separate registration step. Data Designer discovers the package's entry points, so users import the plugin's config classes and keep building the same declarative workflow.
247235

248-
The same pattern works for teams and communities. A platform group can publish a catalog of approved internal plugins backed by an internal package index or direct package references. A community can publish a catalog for a domain or workflow. The catalog gives users a trusted path to the plugins they prefer, while plugin packages remain independently versioned and distributed.
236+
Catalogs are not limited to NVIDIA plugins. A platform group can publish a catalog of approved internal plugins backed by an internal package index or direct package references. A community can publish a catalog for a domain or workflow. The catalog gives users a trusted path to the plugins they prefer, while plugin packages remain independently versioned and distributed.
249237

250238
```bash
251239
data-designer plugin catalog add internal <catalog-url>
252240
data-designer plugin --catalog internal install <package-or-alias>
253241
```
254242

255-
That is the foundation for a richer Data Designer plugin ecosystem: the core framework provides the stable runtime, plugin authors provide specialized capabilities, and catalogs make those capabilities discoverable. For more information, see [Discover Plugins](../../plugins/discover.md).
256-
257-
---
243+
That gives Data Designer the foundation for a broader network of plugins: the core framework provides the stable runtime, plugin authors provide specialized capabilities, and catalogs make those capabilities discoverable. For more information, see [Discover Plugins](../../plugins/discover.md).
258244

259245
## **Where to Go Next**
260246

261-
Interested in building your own plugin? Here are some resources to get you started:
262-
263-
1. [Plugins overview](../../plugins/overview.md) — learn how plugins fit into Data Designer
264-
2. [Build Your Own](../../plugins/build_your_own.md) — follow the authoring guide for seed readers, column generators, and processors
265-
3. [Using Models in Plugins](../../plugins/models.md) — call configured models from plugin code
266-
4. [Markdown Section Seed Reader recipe](../../recipes/plugin_development/markdown_seed_reader.md) — study the complete version of the example from this post
267-
5. [Discover Plugins](../../plugins/discover.md) — learn how to discover and install plugins
268-
6. [DataDesignerPlugins on GitHub](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) — explore first-party plugin packages
247+
From here, the [plugins overview](../../plugins/overview.md) is the best orientation. [Build Your Own](../../plugins/build_your_own.md) and [Using Models in Plugins](../../plugins/models.md) cover the authoring details, and the [Markdown Section Seed Reader recipe](../../recipes/plugin_development/markdown_seed_reader.md) shows the complete version of the example from this post. For distribution, [Discover Plugins](../../plugins/discover.md) explains the CLI catalog flow, and [DataDesignerPlugins on GitHub](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) is where first-party plugin packages live.
269248

270249
Moving plugins out of experimental mode means Data Designer no longer has to predict every customization users will need. The framework provides the pipeline. Plugins supply the custom pieces.
271-
272-
🎨 🔌 Thanks for reading and happy plugin building!

fern/versions/latest.yml

Lines changed: 2 additions & 0 deletions
@@ -169,6 +169,8 @@ navigation:
169169
contents:
170170
- page: Overview
171171
path: ./latest/pages/devnotes/index.mdx
172+
- page: Have It Your Way
173+
path: ./v0.5.8/pages/devnotes/posts/have-it-your-way.mdx
172174
- page: VLM Long Document Understanding
173175
path: ./latest/pages/devnotes/posts/vlm-long-document-understanding.mdx
174176
- page: Push Datasets to Hugging Face Hub

fern/versions/latest/pages/devnotes/index.mdx

Lines changed: 8 additions & 0 deletions
@@ -9,6 +9,14 @@ import { BlogCard, BlogGrid } from "@/components/BlogCard";
99
Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.
1010

1111
<BlogGrid>
12+
<BlogCard
13+
href="/dev-notes/have-it-your-way"
14+
title="Have It Your Way"
15+
description="A plugin framework for the custom pieces every real project ends up needing."
16+
date="May 5, 2026"
17+
authors={["jgreco", "etramel"]}
18+
image={<img src="/assets/have-it-your-way/data-designer-plugins-hero.png" alt="" loading="lazy" />}
19+
/>
1220
<BlogCard
1321
href="/dev-notes/vlm-long-document-understanding"
1422
title="Training a VLM to Understand Long Documents"

fern/versions/v0.5.8.yml

Lines changed: 0 additions & 2 deletions
@@ -147,8 +147,6 @@ navigation:
147147
contents:
148148
- page: Overview
149149
path: ./v0.5.8/pages/devnotes/index.mdx
150-
- page: Have It Your Way
151-
path: ./v0.5.8/pages/devnotes/posts/have-it-your-way.mdx
152150
- page: Push Datasets to Hugging Face Hub
153151
path: ./v0.5.8/pages/devnotes/posts/push-datasets-to-hugging-face-hub.mdx
154152
- page: Text-to-SQL for Nemotron Super

fern/versions/v0.5.8/pages/devnotes/index.mdx

Lines changed: 0 additions & 8 deletions
@@ -9,14 +9,6 @@ import { BlogCard, BlogGrid } from "@/components/BlogCard";
99
Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.
1010

1111
<BlogGrid>
12-
<BlogCard
13-
href="/dev-notes/have-it-your-way"
14-
title="Have It Your Way"
15-
description="A plugin framework for the custom pieces every real project ends up needing."
16-
date="May 5, 2026"
17-
authors={["jgreco", "etramel"]}
18-
image={<img src="/assets/have-it-your-way/data-designer-plugins-hero.png" alt="" loading="lazy" />}
19-
/>
2012
<BlogCard
2113
href="/dev-notes/push-datasets-to-hugging-face-hub"
2214
title="Push Datasets to Hugging Face Hub"
