Commit ef761b8

docs: add "Have It Your Way" plugin dev note (#608)
* docs: add plugins dev note
* docs: mention custom columns

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

* docs: update plugins dev note
* docs: refine plugins dev note
* docs: link v0.6.0 release

---------

Signed-off-by: Johnny Greco <jogreco@nvidia.com>
1 parent 0fdea84 commit ef761b8

13 files changed

Lines changed: 612 additions & 19 deletions

File tree

docs/css/style.css

Lines changed: 52 additions & 0 deletions
@@ -124,6 +124,58 @@ h2 {
124124
margin-bottom: 0.2rem !important;
125125
}
126126

127+
.md-typeset .devnote-dek {
128+
border-left: 0.18rem solid #76B900;
129+
color: var(--md-default-fg-color);
130+
font-size: 1.05rem;
131+
font-weight: 500;
132+
line-height: 1.45;
133+
margin: 0.6rem 0 1rem;
134+
padding-left: 0.8rem;
135+
}
136+
137+
.md-typeset img.devnote-float-right,
138+
.md-typeset img.devnote-section-graphic {
139+
background: var(--md-default-bg-color);
140+
border: 0.05rem solid var(--md-default-fg-color--lightest);
141+
border-radius: 0.3rem;
142+
box-shadow: 0 0.25rem 0.8rem rgb(0 0 0 / 18%);
143+
}
144+
145+
.md-typeset img.devnote-float-right {
146+
float: right;
147+
width: min(42%, 28rem);
148+
max-width: 100%;
149+
height: auto;
150+
margin: 0 0 0.7rem 1rem;
151+
}
152+
153+
.md-typeset img.devnote-section-graphic {
154+
float: right;
155+
width: min(38%, 24rem);
156+
max-width: 100%;
157+
height: auto;
158+
margin: 0.1rem 0 0.7rem 1rem;
159+
}
160+
161+
.md-typeset .devnote-clear {
162+
clear: right;
163+
}
164+
165+
.md-post--excerpt .devnote-hide-in-index {
166+
display: none;
167+
}
168+
169+
@media screen and (max-width: 60em) {
170+
.md-typeset img.devnote-float-right,
171+
.md-typeset img.devnote-section-graphic {
172+
float: none;
173+
display: block;
174+
width: 100%;
175+
margin: 1rem 0;
176+
}
177+
}
178+
127179
/* Define the company grid layout */
128180

129181
#grid-container {
Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
1+
---
2+
date: 2026-05-05
3+
authors:
4+
- jgreco
5+
- etramel
6+
---
7+
8+
# **Have It Your Way: Customizing Data Designer with Plugins**
9+
10+
<p class="devnote-dek"><em>A plugin framework for the custom pieces every real project ends up needing</em></p>
11+
12+
![Data Designer plugin extensions](assets/have-it-your-way/data-designer-plugins-hero.png){ .devnote-float-right .devnote-hide-in-index }
13+
14+
Data Designer is built around a simple idea: describe the dataset you want, and let the framework handle execution. A config points to seed data, defines generated columns, picks models, and shapes the final records — no orchestration code required. [Data Designer plugins](../../plugins/overview.md) keep that promise when a project needs something custom.
15+
16+
As of Data Designer [v0.6.0](https://github.com/NVIDIA-NeMo/DataDesigner/releases/tag/v0.6.0), plugins are out of experimental mode and stable. They are the supported path for turning reusable project-specific logic into normal Data Designer components.
17+
18+
<!-- more -->
19+
20+
What does "something custom" actually look like? Picture a robotics team sitting on a pile of [Isaac Sim](https://developer.nvidia.com/isaac/sim)-generated warehouse runs, trying to turn robot poses, camera views, and event metadata into instruction data. With an internal simulation-log plugin, the user-facing part can still be this small:
21+
22+
```bash
uv pip install data-designer-isaac-logs
```
25+
26+
```python
from data_designer_isaac_logs.config import IsaacRunSeedSource
from data_designer_isaac_logs.config import WarehouseEventLabelColumnConfig
from data_designer_isaac_logs.config import RobotSFTProcessor

# Assumes config_builder is an existing Data Designer config builder instance.
config_builder.with_seed_dataset(
    IsaacRunSeedSource(
        run_dir="s3://warehouse-sim/rare-events/",
        streams=("robot_pose", "overhead_rgb", "event_log"),
        max_events=10_000,
    )
)
config_builder.add_column(
    WarehouseEventLabelColumnConfig(
        name="safety_instruction",
        pose_column="robot_pose",
        event_log_column="event_log",
    )
)
config_builder.add_processor(RobotSFTProcessor(output_column="messages"))
```
47+
48+
That is the point of plugins: install a package, import its config classes, and keep the workflow declarative. The Isaac run reader, event labeler, and trainer-format processor own the project-specific parsing and trainer-facing shape. Data Designer still does the framework work, from component discovery and dependency ordering to model execution and output handling.
49+
50+
---
51+
52+
## **Customization Is the Normal Case**
53+
54+
![A confused engineer trying to fit custom building blocks into the wrong framework slots](assets/have-it-your-way/customization-blocks-confusion.png){ .devnote-section-graphic }
55+
56+
The mess usually starts innocently. A team defines a Data Designer config, then discovers that its seed data lives in an internal layout, its generated column needs a domain simulator, and its trainer expects a slightly different record shape. Someone writes a small reader beside the notebook. Someone patches a generator into a project folder. Someone adds a cleanup script after preview because the final export has one more organization-specific rule. Each choice is reasonable because every project brings a different corpus, policy model, domain vocabulary, or training stack.
57+
58+
The problem is that the custom behavior now lives around Data Designer instead of inside the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose. Plugins give that bespoke work a clean package boundary: a name, typed config, runtime implementation, entry point, and tests that travel together. Users still declare the dataset they want, but the local reader, domain generator, or trainer-format processor becomes a normal Data Designer component instead of another layer of glue.
59+
60+
<div class="devnote-clear"></div>
61+
62+
---
63+
64+
## **Where Plugins Fit**
65+
66+
The first plugin boundaries match the places where real projects most often need customization.
67+
68+
<div style="margin-left: 1.5rem;">
69+
70+
<p>📥 <strong>Seed reader plugins</strong> bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.</p>
71+
72+
<p>🧬 <strong>Column generator plugins</strong> create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.</p>
73+
74+
<p>🔧 <strong>Processor plugins</strong> transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.</p>
75+
76+
</div>
77+
78+
These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer validates configs and resolves dependencies. It plans batches, runs models, records logs, shows previews, then writes the output. That split lets custom components use the normal workflow without moving orchestration into the project.
79+
80+
What about [custom columns](../../concepts/custom_columns.md)? Start with a custom column when you are prototyping column-generator behavior or need a one-off column that only one project uses. Custom columns keep the logic in a Python function inside the config, with declared dependencies and optional model access. When that logic needs a stable config schema, tests, packaging, docs, or reuse across teams, promote it to a column generator plugin.
81+
82+
---
83+
84+
## **Author a Plugin: From Glue Code to Seed Reader**
85+
86+
To make this concrete, let's walk through a full example. Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it. That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one.
87+
88+
A plugin packages that same helper as a small Python project:
89+
90+
- A user-facing config class describes the options.
91+
- An implementation class does the work.
92+
- A `Plugin` object connects the config to the implementation.
93+
- An entry point registers the plugin with Data Designer.
94+
95+
The config class declares the user-facing options. For a directory-backed reader, Data Designer's `FileSystemSeedSource` already has fields for `path`, `file_pattern`, and `recursive`, so we only need to define the seed type discriminator:
96+
97+
```python
# config.py
from __future__ import annotations

from typing import Literal

from data_designer.config.seed_source import FileSystemSeedSource


class MarkdownSectionSeedSource(FileSystemSeedSource):
    """Configure the markdown sections seed reader."""

    seed_type: Literal["markdown-sections"] = "markdown-sections"
```
111+
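To illustrate what a type discriminator buys you, here is a minimal, framework-free sketch. It is not Data Designer's actual mechanism: the two dataclasses, `REGISTRY`, and `parse_seed_source` are hypothetical stand-ins that only show how a literal `seed_type` value lets a loader pick the right config class from plain data.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins for config classes; not Data Designer types.
@dataclass
class LocalFileSource:
    path: str
    seed_type: str = "local-file"


@dataclass
class MarkdownSectionSource:
    path: str
    file_pattern: str = "*.md"
    recursive: bool = True
    seed_type: str = "markdown-sections"


# Map each discriminator value to the class that declares it.
REGISTRY = {cls.seed_type: cls for cls in (LocalFileSource, MarkdownSectionSource)}


def parse_seed_source(raw: dict[str, Any]) -> Any:
    """Pick the config class from raw["seed_type"] and build an instance."""
    try:
        cls = REGISTRY[raw["seed_type"]]
    except KeyError as exc:
        raise ValueError(f"unknown seed_type: {raw.get('seed_type')!r}") from exc
    return cls(**raw)
```

With this shape, a serialized config such as `{"seed_type": "markdown-sections", "path": "docs/"}` can be turned back into the matching class without the caller naming that class.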
112+
The implementation class is where the old helper code should move. For a filesystem seed reader, Data Designer gives you a small interface instead of a blank page: implement `build_manifest(...)` to build a cheap index of candidate inputs, and implement `hydrate_row(...)` to turn each selected manifest row into one or more dataset rows. That split matters because Data Designer can plan work against the lightweight manifest before paying the cost of reading files, parsing sections, or calling project-specific libraries. The parser can still be a normal helper function; the reader class is the framework boundary.
113+
114+
```python
# impl.py
from __future__ import annotations

from pathlib import Path
from typing import Any, ClassVar

from data_designer.engine.resources.seed_reader import (
    FileSystemSeedReader,
    SeedReaderFileSystemContext,
)

from data_designer_markdown_sections.config import MarkdownSectionSeedSource


class MarkdownSectionSeedReader(FileSystemSeedReader[MarkdownSectionSeedSource]):
    output_columns: ClassVar[list[str]] = [
        "relative_path",
        "file_name",
        "section_index",
        "section_header",
        "section_content",
    ]

    def build_manifest(
        self,
        *,
        context: SeedReaderFileSystemContext,
    ) -> list[dict[str, str]]:
        # Fast path: enumerate candidate files and return cheap metadata.
        matched_paths = self.get_matching_relative_paths(
            context=context,
            file_pattern=self.source.file_pattern,
            recursive=self.source.recursive,
        )
        return [
            {"relative_path": relative_path, "file_name": Path(relative_path).name}
            for relative_path in matched_paths
        ]

    def hydrate_row(
        self,
        *,
        manifest_row: dict[str, Any],
        context: SeedReaderFileSystemContext,
    ) -> list[dict[str, Any]]:
        # Expensive path: hydrate only the selected manifest rows.
        # This is where parsing, fan-out, and source-specific cleanup belong.
        relative_path = str(manifest_row["relative_path"])
        file_name = str(manifest_row["file_name"])
        with context.fs.open(relative_path, "r", encoding="utf-8") as handle:
            markdown_text = handle.read()

        # extract_markdown_sections() is a plain helper defined elsewhere in
        # the package; it yields (section_header, section_content) pairs.
        return [
            {
                "relative_path": relative_path,
                "file_name": file_name,
                "section_index": section_index,
                "section_header": section_header,
                "section_content": section_content,
            }
            for section_index, (section_header, section_content) in enumerate(
                extract_markdown_sections(markdown_text)
            )
        ]
```
180+
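The reader calls `extract_markdown_sections`, which this excerpt does not define. The sketch below is one possible implementation, not the recipe's actual code. It assumes sections are delimited by ATX headings (`#` through `######`), that text before the first heading belongs to a section with an empty header, and that `#` lines inside fenced code blocks are not headings.

```python
import re

# Hypothetical helper; the real recipe may parse markdown differently.
_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
_FENCE = "`" * 3  # fence marker, built this way to keep the example embeddable


def extract_markdown_sections(markdown_text: str) -> list[tuple[str, str]]:
    """Split markdown into (header, content) pairs at ATX headings."""
    sections: list[tuple[str, str]] = []
    header = ""  # text before the first heading gets an empty header
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current section unless it is an empty preamble.
        if header or any(line.strip() for line in body):
            sections.append((header, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        if line.lstrip().startswith(_FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            header, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections
```

Keeping the parser as a pure function of a string means it can be unit tested without a filesystem or a Data Designer context, while the reader class stays a thin adapter.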
181+
The same rule applies to column generators and processors: choose the closest base class, keep options on the config object, implement the narrow runtime method, and leave orchestration out of the plugin.
182+
183+
Two small files connect the plugin to Data Designer — a `Plugin` descriptor that names the config and implementation, and a Python entry point that exposes them at install time:
184+
185+
```python
# plugin.py
from data_designer.plugins import Plugin, PluginType

plugin = Plugin(
    config_qualified_name="data_designer_markdown_sections.config.MarkdownSectionSeedSource",
    impl_qualified_name="data_designer_markdown_sections.impl.MarkdownSectionSeedReader",
    plugin_type=PluginType.SEED_READER,
)
```
195+
196+
```toml
# pyproject.toml
[project.entry-points."data_designer.plugins"]
markdown-sections = "data_designer_markdown_sections.plugin:plugin"
```
201+
202+
After that, users do not import engine internals or run registration code. They import the config class and use it:
203+
204+
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner
from data_designer_markdown_sections.config import MarkdownSectionSeedSource

builder = dd.DataDesignerConfigBuilder()
builder.with_seed_dataset(
    MarkdownSectionSeedSource(
        path="docs/",
        file_pattern="*.md",
    )
)
builder.add_column(
    dd.LLMTextColumnConfig(
        name="question",
        model_alias="nvidia-text",
        prompt="Write a question about this section: {{ section_content }}",
    )
)

results = DataDesigner().preview(builder, num_records=5)
```
226+
227+
No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow.
228+
229+
---
230+
231+
## **Building the Plugin Ecosystem**
232+
233+
Reusable plugins also need a discovery layer. Once a plugin is useful beyond one project, users need a simple way to find the right package, install it, and get back to declaring datasets. That is why Data Designer includes a built-in NVIDIA plugin catalog and a CLI workflow for discovery and installation.
234+
235+
The NVIDIA catalog is backed by [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated home for first-party plugin packages, packaging examples, and plugin-specific docs. Keeping those packages outside the core repository lets them carry optional dependencies, target narrower use cases, and move at their own pace while still using the same plugin interface once installed.
236+
237+
For users, the first-party path is short: list what is available, search for what you need, and install by package name or alias.
238+
239+
```bash
data-designer plugin list
data-designer plugin search <keyword>
data-designer plugin install <package-name>
```
244+
245+
After installation, there is no separate registration step. Data Designer discovers the package's entry points, so users import the plugin's config classes and keep building the same declarative workflow.
246+
247+
Catalogs are not limited to NVIDIA plugins. A platform group can publish a catalog of approved internal plugins backed by an internal package index or direct package references. A community can publish a catalog for a domain or workflow. The catalog gives users a trusted path to the plugins they prefer, while plugin packages remain independently versioned and distributed.
248+
249+
```bash
data-designer plugin catalog add <catalog-name> <catalog-url>
data-designer plugin --catalog <catalog-name> install <package-name>
```
253+
254+
This provides a foundation for a rich Data Designer plugin ecosystem: the core framework provides the stable runtime, plugin authors provide specialized capabilities, and catalogs make those capabilities discoverable. For more information, see [Discover Plugins](../../plugins/discover.md).
255+
256+
---
257+
258+
## **Where to Go Next**
259+
260+
Interested in building your own plugin? Here are some resources to get you started:
261+
262+
1. [Plugins overview](../../plugins/overview.md) — learn how plugins fit into Data Designer
263+
2. [Build Your Own](../../plugins/build_your_own.md) — follow the authoring guide for seed readers, column generators, and processors
264+
3. [Using Models in Plugins](../../plugins/models.md) — call configured models from plugin code
265+
4. [Markdown Section Seed Reader recipe](../../recipes/plugin_development/markdown_seed_reader.md) — study the complete version of the example from this post
266+
5. [Discover Plugins](../../plugins/discover.md) — learn how to discover and install plugins
267+
6. [DataDesignerPlugins on GitHub](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) — explore first-party plugin packages
268+
269+
Moving plugins out of experimental mode means Data Designer no longer has to predict every customization users will need. The framework provides the pipeline. Plugins supply the custom pieces.
270+
271+
🎨 🔌 Thanks for reading and happy plugin building!

fern/versions/v0.5.8.yml

Lines changed: 2 additions & 0 deletions
@@ -147,6 +147,8 @@ navigation:
147147
contents:
148148
- page: Overview
149149
path: ./v0.5.8/pages/devnotes/index.mdx
150+
- page: Have It Your Way
151+
path: ./v0.5.8/pages/devnotes/posts/have-it-your-way.mdx
150152
- page: Push Datasets to Hugging Face Hub
151153
path: ./v0.5.8/pages/devnotes/posts/push-datasets-to-hugging-face-hub.mdx
152154
- page: Text-to-SQL for Nemotron Super

fern/versions/v0.5.8/pages/devnotes/index.mdx

Lines changed: 8 additions & 0 deletions
@@ -9,6 +9,14 @@ import { BlogCard, BlogGrid } from "@/components/BlogCard";
99
Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.
1010

1111
<BlogGrid>
12+
<BlogCard
13+
href="/dev-notes/have-it-your-way"
14+
title="Have It Your Way"
15+
description="A plugin framework for the custom pieces every real project ends up needing."
16+
date="May 5, 2026"
17+
authors={["jgreco", "etramel"]}
18+
image={<img src="/assets/have-it-your-way/data-designer-plugins-hero.png" alt="" loading="lazy" />}
19+
/>
1220
<BlogCard
1321
href="/dev-notes/push-datasets-to-hugging-face-hub"
1422
title="Push Datasets to Hugging Face Hub"
