docs: add "Have It Your Way" plugin dev note

johnnygreco · johnnygreco · commit 2bb1d6c3f800 · 2026-05-06T22:15:13.000Z
Introduces a dev note covering Data Designer's plugin framework:
seed reader, column generator, and processor extension points,
with packaging and entry-point examples.
diff --git a/docs/devnotes/posts/assets/data-designer-plugins/data-designer-plugins-hero.png b/docs/devnotes/posts/assets/data-designer-plugins/data-designer-plugins-hero.png
diff --git a/docs/devnotes/posts/have-it-your-way.md b/docs/devnotes/posts/have-it-your-way.md
@@ -0,0 +1,186 @@
+---
+date: 2026-05-05
+authors:
+  - jgreco
+  - etramel
+---
+
+# **Have It Your Way: Customizing Data Designer with Plugins**
+
+*A plugin framework for the custom pieces every real project ends up needing*
+
+Data Designer is built around a simple idea: describe the dataset you want, and let the framework handle execution. A config points to seed data, defines generated columns, picks models, and shapes the final records — no orchestration code required.
+
+That separation works well, right up until the point your project needs something specific to its corpus, its domain, or its training stack.
+
+Your seed data is not in a neat Parquet file — it lives behind an internal API, in a document store, or across a filesystem layout only your team understands. The value you need in a column does not come from a simple prompt; it comes from a simulator, a rules engine, a geospatial library, a retrieval stack, or a model call wrapped in domain-specific validation. And the final records do not quite match the shape your training pipeline expects, so there is one more transformation before the dataset is actually usable.
+
+Data Designer now ships a plugin framework for exactly that layer of customization. Package the custom behavior, install it, import its config class, and use it in a normal workflow. Data Designer discovers the plugin automatically, validates it with the same config system, wires it through the registries, and runs it on the standard execution path.
+
+<!-- more -->
+
+![Data Designer plugin extensions](assets/data-designer-plugins/data-designer-plugins-hero.png){ width=100% }
+
+This is the practical problem plugins solve: no core library can predict every data source, generator, validator, simulator, processor, or output format users will need. The goal is not to absorb every specialized capability into Data Designer itself. It's to make custom behavior easy to package and reuse, without turning every project into a tangle of project-specific glue code.
+
+!!! tip "TL;DR - What plugins give you"
+
+    1. Plugins expose custom behavior through the same Data Designer config and runtime paths as built-in components.
+
+    2. A plugin is a Python package discovered through the `data_designer.plugins` entry point group. Once installed, there is no manual registration step in user code.
+
+    3. Plugin configs use the same typed config model and serialization behavior as core config types. The engine receives an implementation through the plugin registry.
+
+    4. Plugins can start as a local editable install, move to an internal package index, and later be published publicly.
+
+    5. NVIDIA-maintained plugins now live in [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), separate from the core repo and installed as packages.
+
+---
+
+## **Customization Is the Normal Case**
+
+Synthetic data work rarely stays generic for long. A useful dataset usually reflects the system it is meant to improve: product taxonomy, compliance rules, document structure, simulator outputs, training format, evaluation policy, privacy constraints, task distribution, and other domain-specific requirements.
+
+Those details are where the value comes from. They are also where framework boundaries get tested.
+
+Without a plugin boundary, customization tends to leak into notebooks and wrapper scripts. Someone patches a file reader for one corpus. Someone else copies a generator into a project folder. A formatting step lives after generation because it did not fit anywhere else. The pipeline works, but the behavior is not really part of the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose.
+
+Plugins put that work behind a reusable package boundary. A custom capability gets a name, a typed config, a runtime implementation, and tests that travel with the package. Users still declare the dataset they want; the custom logic stays behind a clean interface.
+
+The result is straightforward for users. Once installed, a seed reader for your internal corpus is just another seed source. A generator backed by a domain simulator uses the normal column path. A processor that emits your team's SFT format runs as a normal processor.
+
+---
+
+## **From Glue Code to a Capability**
+
+Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it.
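+
+As a rough illustration, the one-off helper might look like the sketch below. Everything in it is hypothetical: the function names, the `file` and `section_title` columns, and the heading-based splitting rule are assumptions, not code from any Data Designer package. It returns plain dictionaries rather than a DataFrame to stay dependency-free, and it does not skip `#` lines inside fenced code blocks.
+
+```python
+import re
+from pathlib import Path
+
+# Matches ATX-style markdown headings such as "# Title" or "### Subsection".
+HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
+
+
+def split_sections(text: str) -> list[dict[str, str]]:
+    """Split markdown text into sections, one per heading (hypothetical rule)."""
+    sections: list[dict[str, str]] = []
+    title, lines = "", []
+
+    def flush() -> None:
+        # Skip an empty preamble; keep any section with a title or real content.
+        if title or any(line.strip() for line in lines):
+            sections.append(
+                {"section_title": title, "section_content": "\n".join(lines).strip()}
+            )
+
+    for line in text.splitlines():
+        match = HEADING.match(line)
+        if match:
+            flush()
+            title, lines = match.group(2).strip(), []
+        else:
+            lines.append(line)
+    flush()
+    return sections
+
+
+def read_markdown_sections(path: str, file_pattern: str = "*.md") -> list[dict[str, str]]:
+    """Walk a directory tree and return one row per markdown section."""
+    rows: list[dict[str, str]] = []
+    for file in sorted(Path(path).rglob(file_pattern)):
+        for section in split_sections(file.read_text(encoding="utf-8")):
+            rows.append({"file": str(file), **section})
+    return rows
+```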
+
+That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one.
+
+A plugin packages the same idea as a small Python project:
+
+- A user-facing config class describes the options.
+- An implementation class does the work.
+- A `Plugin` object connects the config to the implementation.
+- A Python entry point exposes the plugin to Data Designer.
+
+The entry point is declared in the package's `pyproject.toml`:
+
+```toml
+[project.entry-points."data_designer.plugins"]
+markdown-sections = "data_designer_markdown_sections.plugin:plugin"
+```
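+
+For context, entry points are standard Python packaging metadata, so installed plugins can be listed with the standard library alone. The sketch below assumes Python 3.10 or newer and only reads the metadata. It is not Data Designer's own discovery code, and `list_plugin_entry_points` is a hypothetical helper name.
+
+```python
+from importlib.metadata import entry_points
+
+# Entry point group named in this post.
+PLUGIN_GROUP = "data_designer.plugins"
+
+
+def list_plugin_entry_points(group: str = PLUGIN_GROUP) -> dict[str, str]:
+    """Map each installed entry point name to its 'module:attribute' target."""
+    # The group= keyword selects one group; an unknown group yields no entries.
+    return {ep.name: ep.value for ep in entry_points(group=group)}
+
+
+if __name__ == "__main__":
+    for name, target in list_plugin_entry_points().items():
+        print(f"{name} -> {target}")
+```
+
+Calling `load()` on an entry point imports the object it names, which for the declaration above would be the `plugin` object in `data_designer_markdown_sections.plugin`.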
+
+The plugin object tells Data Designer what kind of extension this is and where to find the config and implementation:
+
+```python
+from data_designer.plugins import Plugin, PluginType
+
+plugin = Plugin(
+    config_qualified_name="data_designer_markdown_sections.config.MarkdownSectionSeedSource",
+    impl_qualified_name="data_designer_markdown_sections.impl.MarkdownSectionSeedReader",
+    plugin_type=PluginType.SEED_READER,
+)
+```
+
+After that, users do not import engine internals or run registration code. They import the config class and use it:
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+from data_designer_markdown_sections.config import MarkdownSectionSeedSource
+
+builder = dd.DataDesignerConfigBuilder()
+builder.with_seed_dataset(
+    MarkdownSectionSeedSource(
+        path="docs/",
+        file_pattern="*.md",
+    )
+)
+builder.add_column(
+    dd.LLMTextColumnConfig(
+        name="question",
+        model_alias="nvidia-text",
+        prompt="Write a question about this section: {{ section_content }}",
+    )
+)
+
+results = DataDesigner().preview(builder, num_records=5)
+```
+
+No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow.
+
+---
+
+## **Where Plugins Fit**
+
+The first plugin boundaries match the places where real projects most often need customization.
+
+**Seed reader plugins** bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.
+
+**Column generator plugins** create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.
+
+**Processor plugins** transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.
+
+These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer should keep owning the pipeline responsibilities: validation, dependency resolution, batching, model calls, logging, previews, and output handling.
+
+That split lets custom components use the normal workflow without moving orchestration into the project.
+
+---
+
+## **Start Local, Share When Useful**
+
+A plugin does not need to start as a public package. Most should start locally.
+
+Create a local Python package and install it in editable mode:
+
+```bash
+uv pip install -e .
+```
+
+That is enough for Data Designer to discover the entry point. You can iterate on the config class and implementation while testing the plugin in a real preview loop. When the shape stabilizes, the same package can move to an internal index, a GitHub repo, or PyPI.
+
+This is useful inside teams. A data platform group can maintain seed readers for internal systems. An applied science group can maintain generators for its domain. A training group can maintain processors that emit exactly the record shapes its trainers consume. Everyone else installs a package and uses typed configs in the same workflow they already know.
+
+It is useful for the broader community too. If you build a plugin that should be discoverable by other Data Designer users, publish it and follow the instructions in [Available Plugins](../../plugins/available.md) to request a catalog listing.
+
+---
+
+## **A Repository for First-Party Plugins**
+
+We also created [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated repository for NVIDIA-maintained plugins.
+
+The core Data Designer repo should stay focused on the framework: the config API, engine execution, model integration, validation behavior, and the stable plugin interface. Plugin packages often have different needs. They may depend on optional libraries, target narrower use cases, or move at a different release pace than the core library.
+
+The plugin repo separates those packages from the core release cycle. It is where we will publish NVIDIA-maintained plugins, recommended packaging examples, and plugin-specific docs as the catalog grows. The packages install separately, but they use the same plugin interface once installed.
+
+This keeps the core lean while specialized packages evolve around a stable extension surface.
+
+---
+
+## **Start with One Capability**
+
+If you have custom Data Designer code that keeps getting copied between projects, it is a strong candidate for a plugin.
+
+Pick one capability. Give it a typed config. Write the implementation behind the matching plugin boundary. Add an `assert_valid_plugin(...)` test so structural problems surface early:
+
+```python
+from data_designer.engine.testing import assert_valid_plugin
+from data_designer_markdown_sections.plugin import plugin
+
+assert_valid_plugin(plugin)
+```
+
+Then run a tiny `preview` before you trust it in a larger generation job.
+
+For implementation details, see:
+
+- [Plugins overview](../../plugins/overview.md)
+- [Build Your Own](../../plugins/build_your_own.md)
+- [Using Models in Plugins](../../plugins/models.md)
+- [Available Plugins](../../plugins/available.md)
+- [Markdown Section Seed Reader recipe](../../recipes/plugin_development/markdown_seed_reader.md)
+
+Moving plugins out of experimental mode means Data Designer no longer has to predict every customization users will need. The framework provides the pipeline. Plugins supply the custom pieces.
+
+👋 Thanks for reading and happy plugin building!
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -103,6 +103,7 @@ nav:
   - Dev Notes:
       # NOTE: Order is most recent -> oldest (so sidebar shows recent first!)
       - devnotes/index.md
+      - Have It Your Way: devnotes/posts/have-it-your-way.md
       - VLM Long Document Understanding: devnotes/posts/vlm-long-document-understanding.md
       - Push Datasets to Hugging Face Hub: devnotes/posts/push-datasets-to-hugging-face-hub.md
       - "Text-to-SQL for Nemotron Super": devnotes/posts/text-to-sql.md