Skip to content

Commit 2eaf07e

Browse files
eric-trameljohnnygreco
authored andcommitted
docs: polish plugin dev note
- Refine the intro, customization, and authoring sections - Add wrapped devnote imagery and shared image styling - Link readers to plugin implementation references
1 parent 063ab8c commit 2eaf07e

3 files changed

Lines changed: 181 additions & 43 deletions

File tree

docs/css/style.css

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -124,6 +124,54 @@ h2 {
124124
margin-bottom: 0.2rem !important;
125125
}
126126

127+
.md-typeset .devnote-dek {
128+
border-left: 0.18rem solid #76B900;
129+
color: var(--md-default-fg-color);
130+
font-size: 1.05rem;
131+
font-weight: 500;
132+
line-height: 1.45;
133+
margin: 0.6rem 0 1rem;
134+
padding-left: 0.8rem;
135+
}
136+
137+
.md-typeset img.devnote-float-right,
138+
.md-typeset img.devnote-section-graphic {
139+
background: var(--md-default-bg-color);
140+
border: 0.05rem solid var(--md-default-fg-color--lightest);
141+
border-radius: 0.3rem;
142+
box-shadow: 0 0.25rem 0.8rem rgb(0 0 0 / 18%);
143+
}
144+
145+
.md-typeset img.devnote-float-right {
146+
float: right;
147+
width: min(42%, 28rem);
148+
max-width: 100%;
149+
height: auto;
150+
margin: 0 0 0.7rem 1rem;
151+
}
152+
153+
.md-typeset img.devnote-section-graphic {
154+
float: right;
155+
width: min(38%, 24rem);
156+
max-width: 100%;
157+
height: auto;
158+
margin: 0.1rem 0 0.7rem 1rem;
159+
}
160+
161+
.md-typeset .devnote-clear {
162+
clear: right;
163+
}
164+
165+
@media screen and (max-width: 60em) {
166+
.md-typeset img.devnote-float-right,
167+
.md-typeset img.devnote-section-graphic {
168+
float: none;
169+
display: block;
170+
width: 100%;
171+
margin: 1rem 0;
172+
}
173+
}
174+
127175
/* Define the company grid layout */
128176

129177
#grid-container {
566 KB
Loading

docs/devnotes/posts/have-it-your-way.md

Lines changed: 133 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -7,21 +7,45 @@ authors:
77

88
# **Have It Your Way: Customizing Data Designer with Plugins**
99

10-
*A plugin framework for the custom pieces every real project ends up needing*
10+
<p class="devnote-dek"><em>A plugin framework for the custom pieces every real project ends up needing</em></p>
11+
12+
![Data Designer plugin extensions](assets/data-designer-plugins/data-designer-plugins-hero.png){ .devnote-float-right }
1113

1214
Data Designer is built around a simple idea: describe the dataset you want, and let the framework handle execution. A config points to seed data, defines generated columns, picks models, and shapes the final records — no orchestration code required.
1315

14-
That separation works well, right up until the point your project needs something specific to its corpus, its domain, or its training stack.
16+
[Data Designer plugins](../../plugins/overview.md) keep that promise when a project needs something custom. Suppose a robotics team has [Isaac Sim](https://developer.nvidia.com/isaac/sim)-generated warehouse runs and wants to turn robot poses, camera views, and event metadata into instruction data. With an internal simulation-log plugin, the user-facing part can still be this small:
1517

16-
Your seed data is not in a neat Parquet file — it lives behind an internal API, in a document store, or across a filesystem layout only your team understands. The value you need in a column does not come from a simple prompt; it comes from a simulator, a rules engine, a geospatial library, a retrieval stack, or a model call wrapped in domain-specific validation. And the final records do not quite match the shape your training pipeline expects, so there is one more transformation before the dataset is actually usable.
18+
```bash
19+
uv pip install data-designer-isaac-logs
20+
```
1721

18-
Data Designer now ships a plugin framework for exactly that layer of customization. Package the custom behavior, install it, import its config class, and use it in a normal workflow. Data Designer discovers the plugin automatically, validates it with the same config system, wires it through the registries, and runs it on the standard execution path.
22+
```python
23+
from data_designer_isaac_logs.config import IsaacRunSeedSource
24+
from data_designer_isaac_logs.config import WarehouseEventLabelColumnConfig
25+
from data_designer_isaac_logs.config import RobotSFTProcessor
1926

20-
<!-- more -->
27+
builder.with_seed_dataset(
28+
IsaacRunSeedSource(
29+
run_dir="s3://warehouse-sim/rare-events/",
30+
streams=("robot_pose", "overhead_rgb", "event_log"),
31+
max_events=10_000,
32+
)
33+
)
34+
builder.add_column(
35+
WarehouseEventLabelColumnConfig(
36+
name="safety_instruction",
37+
pose_column="robot_pose",
38+
event_log_column="event_log",
39+
)
40+
)
41+
builder.add_processor(RobotSFTProcessor(output_column="messages"))
42+
```
43+
44+
That is the point of plugins: install a package, import its config classes, and keep the workflow declarative. The Isaac run reader, event labeler, and trainer-format processor own the custom parsing, labeling, validation, and export shape, while Data Designer still handles discovery, dependency ordering, model calls, previews, and output.
2145

22-
![Data Designer plugin extensions](assets/data-designer-plugins/data-designer-plugins-hero.png){ width=100% }
46+
<!-- more -->
2347

24-
This is the practical problem plugins solve: no core library can predict every data source, generator, validator, simulator, processor, or output format users will need. The goal is not to absorb every specialized capability into Data Designer itself. It's to make custom behavior easy to package and reuse, without turning every project into a tangle of project-specific glue code.
48+
<div class="devnote-clear"></div>
2549

2650
!!! tip "TL;DR - What plugins give you"
2751

@@ -39,23 +63,37 @@ This is the practical problem plugins solve: no core library can predict every d
3963

4064
## **Customization Is the Normal Case**
4165

42-
Synthetic data work rarely stays generic for long. A useful dataset usually reflects the system it is meant to improve: product taxonomy, compliance rules, document structure, simulator outputs, training format, evaluation policy, privacy constraints, task distribution, and other domain-specific requirements.
43-
44-
Those details are where the value comes from. They are also where framework boundaries get tested.
66+
![A confused engineer trying to fit custom building blocks into the wrong framework slots](assets/data-designer-plugins/customization-blocks-confusion.png){ .devnote-section-graphic }
4567

46-
Without a plugin boundary, customization tends to leak into notebooks and wrapper scripts. Someone patches a file reader for one corpus. Someone else copies a generator into a project folder. A formatting step lives after generation because it did not fit anywhere else. The pipeline works, but the behavior is not really part of the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose.
68+
The mess usually starts innocently. A team defines a Data Designer config, then discovers that its seed data lives in an internal layout, its generated column needs a domain simulator, and its trainer expects a slightly different record shape. Someone writes a small reader beside the notebook. Someone patches a generator into a project folder. Someone adds a cleanup script after preview because the final export has one more organization-specific rule. Each choice is reasonable because every project has its own corpus, policy, ontology, simulator, and training stack.
4769

48-
Plugins put that work behind a reusable package boundary. A custom capability gets a name, a typed config, a runtime implementation, and tests that travel with the package. Users still declare the dataset they want; the custom logic stays behind a clean interface.
70+
The problem is that the custom behavior now lives around Data Designer instead of inside the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose. Plugins give that bespoke work a clean package boundary: a name, typed config, runtime implementation, entry point, and tests that travel together. Users still declare the dataset they want, but the local reader, domain generator, or trainer-format processor becomes a normal Data Designer component instead of another layer of glue.
4971

50-
The result is straightforward for users. Once installed, a seed reader for your internal corpus is just another seed source. A generator backed by a domain simulator uses the normal column path. A processor that emits your team's SFT format runs as a normal processor.
72+
<div class="devnote-clear"></div>
5173

5274
---
5375

54-
## **From Glue Code to a Capability**
76+
## **Where Plugins Fit**
77+
78+
The first plugin boundaries match the places where real projects most often need customization.
79+
80+
<div style="margin-left: 1.5rem;">
81+
82+
<p>📥 <strong>Seed reader plugins</strong> bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.</p>
83+
84+
<p>🧬 <strong>Column generator plugins</strong> create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.</p>
85+
86+
<p>🔧 <strong>Processor plugins</strong> transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.</p>
87+
88+
</div>
89+
90+
These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer should keep owning the pipeline responsibilities: validation, dependency resolution, batching, model calls, logging, previews, output handling. That split lets custom components use the normal workflow without moving orchestration into the project.
91+
92+
---
5593

56-
Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it.
94+
## **Author a Plugin: From Glue Code to Seed Reader**
5795

58-
That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one.
96+
Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it. That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one.
5997

6098
A plugin packages the same idea as a small Python project:
6199

@@ -64,16 +102,90 @@ A plugin packages the same idea as a small Python project:
64102
- A `Plugin` object connects the config to the implementation.
65103
- A Python entry point exposes the plugin to Data Designer.
66104

105+
The implementation class is where the old helper code should move. For a filesystem seed reader, Data Designer gives you a small interface instead of a blank page: implement `build_manifest(...)` to build a cheap index of candidate inputs, and implement `hydrate_row(...)` to turn each selected manifest row into one or more dataset rows. That split matters because Data Designer can sample, shuffle, partition, and batch against the lightweight manifest before paying the cost of reading files, parsing sections, or calling project-specific libraries. The parser can still be a normal helper function; the reader class is the framework boundary.
106+
107+
```python
108+
# impl.py
109+
from __future__ import annotations
110+
111+
from pathlib import Path
112+
from typing import Any, ClassVar
113+
114+
from data_designer.engine.resources.seed_reader import (
115+
FileSystemSeedReader,
116+
SeedReaderFileSystemContext,
117+
)
118+
119+
from data_designer_markdown_sections.config import MarkdownSectionSeedSource
120+
121+
122+
class MarkdownSectionSeedReader(FileSystemSeedReader[MarkdownSectionSeedSource]):
123+
output_columns: ClassVar[list[str]] = [
124+
"relative_path",
125+
"file_name",
126+
"section_index",
127+
"section_header",
128+
"section_content",
129+
]
130+
131+
def build_manifest(
132+
self,
133+
*,
134+
context: SeedReaderFileSystemContext,
135+
) -> list[dict[str, str]]:
136+
# Fast path: enumerate candidate files and return cheap metadata.
137+
# Data Designer can index, sample, shuffle, and batch these rows.
138+
matched_paths = self.get_matching_relative_paths(
139+
context=context,
140+
file_pattern=self.source.file_pattern,
141+
recursive=self.source.recursive,
142+
)
143+
return [
144+
{"relative_path": relative_path, "file_name": Path(relative_path).name}
145+
for relative_path in matched_paths
146+
]
147+
148+
def hydrate_row(
149+
self,
150+
*,
151+
manifest_row: dict[str, Any],
152+
context: SeedReaderFileSystemContext,
153+
) -> list[dict[str, Any]]:
154+
# Expensive path: hydrate only the selected manifest rows.
155+
# This is where parsing, fan-out, and source-specific cleanup belong.
156+
relative_path = str(manifest_row["relative_path"])
157+
file_name = str(manifest_row["file_name"])
158+
with context.fs.open(relative_path, "r", encoding="utf-8") as handle:
159+
markdown_text = handle.read()
160+
161+
return [
162+
{
163+
"relative_path": relative_path,
164+
"file_name": file_name,
165+
"section_index": section_index,
166+
"section_header": section_header,
167+
"section_content": section_content,
168+
}
169+
for section_index, (section_header, section_content) in enumerate(
170+
extract_markdown_sections(markdown_text)
171+
)
172+
]
173+
```
174+
175+
The class should own only the domain-specific behavior: how to find candidate files, how to parse them, and what rows it emits. Let Data Designer keep owning attachment, sampling, shuffling, batching, DuckDB registration, dependency resolution, and execution. The same rule applies to column generators and processors: choose the closest base class, keep options on the config object, implement the narrow runtime method, and leave orchestration out of the plugin.
176+
67177
The entry point exposes the plugin package to Data Designer:
68178

69179
```toml
180+
# pyproject.toml
70181
[project.entry-points."data_designer.plugins"]
71182
markdown-sections = "data_designer_markdown_sections.plugin:plugin"
72183
```
73184

74185
The plugin object tells Data Designer what kind of extension this is and where to find the config and implementation:
75186

76187
```python
188+
# plugin.py
77189
from data_designer.plugins import Plugin, PluginType
78190

79191
plugin = Plugin(
@@ -108,31 +220,13 @@ builder.add_column(
108220
results = DataDesigner().preview(builder, num_records=5)
109221
```
110222

111-
No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow.
112-
113-
---
114-
115-
## **Where Plugins Fit**
116-
117-
The first plugin boundaries match the places where real projects most often need customization.
118-
119-
**Seed reader plugins** bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.
120-
121-
**Column generator plugins** create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.
122-
123-
**Processor plugins** transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.
124-
125-
These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer should keep owning the pipeline responsibilities: validation, dependency resolution, batching, model calls, logging, previews, output handling.
126-
127-
That split lets custom components use the normal workflow without moving orchestration into the project.
223+
No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow. For the same package shape applied to other extension points, see the [Build Your Own plugin guide](../../plugins/build_your_own.md#implementation-patterns), [Column Generators](../../code_reference/engine/column_generators.md), and [Engine Processors](../../code_reference/engine/processors.md) documentation.
128224

129225
---
130226

131227
## **Start Local, Share When Useful**
132228

133-
A plugin does not need to start as a public package. Most should start locally.
134-
135-
Start with a local Python package and install it in editable mode:
229+
A plugin does not need to start as a public package. Most should start locally. Start with a local Python package and install it in editable mode:
136230

137231
```bash
138232
uv pip install -e .
@@ -148,13 +242,9 @@ It is useful for the broader community too. If you build a plugin that should be
148242

149243
## **A Repository for First-Party Plugins**
150244

151-
We also created [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated repository for NVIDIA-maintained plugins.
152-
153-
The core Data Designer repo should stay focused on the framework: the config API, engine execution, model integration, validation behavior, and the stable plugin interface. Plugin packages often have different needs. They may depend on optional libraries, target narrower use cases, or move at a different release pace than the core library.
154-
155-
The plugin repo separates those packages from the core release cycle. It is where we will publish NVIDIA-maintained plugins, recommended packaging examples, and plugin-specific docs as the catalog grows. The packages install separately, but they use the same plugin interface once installed.
245+
We also created [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated repository for NVIDIA-maintained plugins. It is where we will publish first-party plugin packages, recommended packaging examples, and plugin-specific docs as the catalog grows.
156246

157-
This keeps the core lean while specialized packages evolve around a stable extension surface.
247+
The split keeps the core Data Designer repo focused on the framework: the config API, engine execution, model integration, validation behavior, and stable plugin interface. Plugin packages can depend on optional libraries, target narrower use cases, and move at a different release pace, while still installing separately and using the same plugin interface once installed.
158248

159249
---
160250

@@ -183,4 +273,4 @@ For implementation details, see:
183273

184274
Moving plugins out of experimental mode means Data Designer no longer has to predict every customization users will need. The framework provides the pipeline. Plugins supply the custom pieces.
185275

186-
👋 Thanks for reading and happy plugin building!
276+
🎨🔌 Thanks for reading and happy plugin building!

0 commit comments

Comments
 (0)