Commit 982ce79

feat: add processor plugin support (#299)
* feat: add processor plugin support

  Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation).

  - Add PluginType.PROCESSOR with processor_type discriminator
  - Create processor_types.py for ProcessorConfigT with plugin injection
  - Register plugin processors in engine ProcessorRegistry
  - Use RLock in PluginRegistry to prevent deadlocks during discovery
  - Add demo package: data-designer-demo-processors
  - Update processor and plugin documentation

* test: add processor plugin registration test

  Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

  - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig)
  - Delete demo/ directory with regex_filter and semantic_dedup plugins
  - Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
1 parent f07624b commit 982ce79

21 files changed

Lines changed: 354 additions & 37 deletions

docs/concepts/processors.md

Lines changed: 30 additions & 0 deletions
@@ -138,6 +138,36 @@ builder.add_processor(
 
 Processors execute in the order they're added. Plan accordingly when one processor's output affects another.
 
+## Processor Plugins
+
+You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). A processor plugin is a Python package that provides:
+
+- A **config class** inheriting from `ProcessorConfig` with a `processor_type: Literal["your-type"]` discriminator
+- An **implementation class** inheriting from `Processor` that overrides the desired callback methods
+- A **`Plugin` instance** connecting the two
+
+Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors.
+
+```python
+from my_processor_plugin.config import MyProcessorConfig
+
+builder.add_processor(
+    MyProcessorConfig(
+        name="my_processor",
+        # ... plugin-specific parameters ...
+    )
+)
+```
+
+**Entry point configuration** in `pyproject.toml`:
+
+```toml
+[project.entry-points."data_designer.plugins"]
+my-processor = "my_plugin.plugin:my_processor_plugin"
+```
+
+See the [plugins overview](../plugins/overview.md) for the full guide on creating plugins.
+
 ## Configuration Parameters
 
 ### Common Parameters

docs/plugins/overview.md

Lines changed: 23 additions & 3 deletions
@@ -7,12 +7,11 @@
 
 Plugins are Python packages that extend Data Designer's capabilities without modifying the core library. Similar to [VS Code extensions](https://marketplace.visualstudio.com/vscode) and [Pytest plugins](https://docs.pytest.org/en/stable/reference/plugin_list.html), the plugin system empowers you to build specialized extensions for your specific use cases and share them with the community.
 
-**Current capabilities**: Data Designer supports two plugin types:
+**Current capabilities**: Data Designer supports three plugin types:
 
 - **Column Generator Plugins**: Custom column types you pass to the config builder's [add_column](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method.
 - **Seed Reader Plugins**: Custom seed dataset readers that let you load data from new sources (e.g., databases, cloud storage, custom formats).
-
-**Coming soon**: Plugin support for processors, validators, and more!
+- **Processor Plugins**: Custom processors that transform data before batches, after batches, or after generation completes. Pass them to the config builder's [add_processor](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_processor) method.
 
 ## How do you use plugins?
 
@@ -39,9 +38,11 @@ Each plugin has three components, and we recommend organizing them into separate
 - **`config.py`** -- Configuration class defining user-facing parameters
     - Column generator plugins: inherit from `SingleColumnConfig` with a `column_type` discriminator
     - Seed reader plugins: inherit from `SeedSource` with a `seed_type` discriminator
+    - Processor plugins: inherit from `ProcessorConfig` with a `processor_type` discriminator
 - **`impl.py`** -- Implementation class containing the core logic
     - Column generator plugins: inherit from `ColumnGeneratorFullColumn` or `ColumnGeneratorCellByCell`
     - Seed reader plugins: inherit from `SeedReader`
+    - Processor plugins: inherit from `Processor` and override callback methods (`process_before_batch`, `process_after_batch`, `process_after_generation`)
 - **`plugin.py`** -- A `Plugin` instance that connects the config and implementation classes
 
 ### 2. Package Your Plugin
@@ -61,4 +62,23 @@ Each plugin has three components, and we recommend organizing them into separate
 - Publish to PyPI or another package index to make it installable by anyone via `pip install`
 - This step is only needed if you want others outside your environment to use the plugin
 
+**Example entry point for a processor plugin:**
+
+```toml
+[project.entry-points."data_designer.plugins"]
+my-processor = "my_plugin.plugin:my_processor_plugin"
+```
+
+Where `my_processor_plugin` is a `Plugin` instance with `plugin_type=PluginType.PROCESSOR`:
+
+```python
+from data_designer.plugins.plugin import Plugin, PluginType
+
+my_processor_plugin = Plugin(
+    config_qualified_name="my_plugin.config.MyProcessorConfig",
+    impl_qualified_name="my_plugin.impl.MyProcessor",
+    plugin_type=PluginType.PROCESSOR,
+)
+```
+
 **Ready to get started?** See the [Example Plugin](example.md) for a complete walkthrough of creating a column generator plugin.

packages/data-designer-config/src/data_designer/config/base.py

Lines changed: 18 additions & 1 deletion
@@ -8,7 +8,7 @@
 
 from abc import ABC, abstractmethod
 
-from pydantic import BaseModel, ConfigDict
+from pydantic import BaseModel, ConfigDict, Field
 
 
 class ConfigBase(BaseModel):
@@ -66,3 +66,20 @@ def side_effect_columns(self) -> list[str]:
         List of column names that this column will create as a side effect. Empty list
         indicates no side effect columns. Override in subclasses to specify side effects.
         """
+
+
+class ProcessorConfig(ConfigBase, ABC):
+    """Abstract base class for all processor configuration types.
+
+    Processors are transformations that run at different stages of the generation
+    pipeline. They can modify, reshape, or augment the dataset.
+
+    Attributes:
+        name: Unique name of the processor, used to identify the processor in results
+            and to name output artifacts on disk.
+    """
+
+    name: str = Field(
+        description="The name of the processor, used to identify the processor in the results and to write the artifacts to disk.",
+    )
+    processor_type: str

packages/data-designer-config/src/data_designer/config/config_builder.py

Lines changed: 2 additions & 1 deletion
@@ -28,7 +28,8 @@
 from data_designer.config.exportable_config import ExportableConfigBase
 from data_designer.config.mcp import ToolConfig
 from data_designer.config.models import ModelConfig, load_model_configs
-from data_designer.config.processors import ProcessorConfigT, ProcessorType, get_processor_config_from_kwargs
+from data_designer.config.processor_types import ProcessorConfigT
+from data_designer.config.processors import ProcessorType, get_processor_config_from_kwargs
 from data_designer.config.sampler_constraints import (
     ColumnConstraintT,
     ColumnInequalityConstraint,

packages/data-designer-config/src/data_designer/config/data_designer_config.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 from data_designer.config.exportable_config import ExportableConfigBase
 from data_designer.config.mcp import ToolConfig
 from data_designer.config.models import ModelConfig
-from data_designer.config.processors import ProcessorConfigT
+from data_designer.config.processor_types import ProcessorConfigT
 from data_designer.config.sampler_constraints import ColumnConstraintT
 from data_designer.config.seed import SeedConfig

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+from typing_extensions import TypeAlias
+
+from data_designer.config.processors import DropColumnsProcessorConfig, SchemaTransformProcessorConfig
+from data_designer.plugin_manager import PluginManager
+
+plugin_manager = PluginManager()
+
+ProcessorConfigT: TypeAlias = DropColumnsProcessorConfig | SchemaTransformProcessorConfig
+ProcessorConfigT = plugin_manager.inject_into_processor_config_type_union(ProcessorConfigT)
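The new `data_designer.config.processor_types` module starts from a union of the built-in processor configs and then asks the plugin manager to widen it with plugin-provided config classes. The diff does not show the body of `add_plugin_types_to_union`, so the following is only a minimal sketch of the general technique, assuming the union is rebuilt from its existing members plus the extra types. The function `add_types_to_union` and the three stand-in classes are hypothetical and are not part of Data Designer.

```python
from typing import Union, get_args


# Stand-in classes (hypothetical); the real configs are pydantic models.
class DropColumnsConfig: ...
class SchemaTransformConfig: ...
class RegexFilterConfig: ...  # plays the role of a plugin-provided config


def add_types_to_union(base_union, extra_types):
    """Return a union containing the members of `base_union` plus `extra_types`.

    Assumed behavior for illustration: existing members keep their order and
    duplicates are skipped, so calling this twice gives the same result.
    """
    # get_args() returns () for a plain class, so fall back to the class itself.
    members = list(get_args(base_union)) or [base_union]
    for extra in extra_types:
        if extra not in members:
            members.append(extra)
    return Union[tuple(members)]


BuiltInConfigT = Union[DropColumnsConfig, SchemaTransformConfig]
ExtendedConfigT = add_types_to_union(BuiltInConfigT, [RegexFilterConfig])
print(get_args(ExtendedConfigT))
```

Because the union is widened at import time, any model field annotated with the resulting type can accept plugin configs without the core package naming them explicitly.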

packages/data-designer-config/src/data_designer/config/processors.py

Lines changed: 1 addition & 26 deletions
@@ -4,14 +4,12 @@
 from __future__ import annotations
 
 import json
-from abc import ABC
 from enum import Enum
 from typing import Any, Literal
 
 from pydantic import Field, field_validator
-from typing_extensions import TypeAlias
 
-from data_designer.config.base import ConfigBase
+from data_designer.config.base import ProcessorConfig
 from data_designer.config.errors import InvalidConfigError
 
 
@@ -27,26 +25,6 @@ class ProcessorType(str, Enum):
     SCHEMA_TRANSFORM = "schema_transform"
 
 
-class ProcessorConfig(ConfigBase, ABC):
-    """Abstract base class for all processor configuration types.
-
-    Processors are transformations that run at different stages of the generation
-    pipeline. They can modify, reshape, or augment the dataset.
-
-    The processor implementation determines which stages it handles by overriding
-    the appropriate callback methods (process_before_batch, process_after_batch, process_after_generation).
-
-    Attributes:
-        name: Unique name of the processor, used to identify the processor in results
-            and to name output artifacts on disk.
-    """
-
-    name: str = Field(
-        description="The name of the processor, used to identify the processor in the results and to write the artifacts to disk.",
-    )
-    processor_type: str
-
-
 def get_processor_config_from_kwargs(processor_type: ProcessorType, **kwargs: Any) -> ProcessorConfig:
     """Create a processor configuration from a processor type and keyword arguments.
 
@@ -129,6 +107,3 @@ def validate_template(cls, v: dict[str, Any]) -> dict[str, Any]:
             if "not JSON serializable" in str(e):
                 raise InvalidConfigError("Template must be JSON serializable")
         return v
-
-
-ProcessorConfigT: TypeAlias = DropColumnsProcessorConfig | SchemaTransformProcessorConfig

packages/data-designer-config/src/data_designer/plugin_manager.py

Lines changed: 11 additions & 0 deletions
@@ -76,3 +76,14 @@ def inject_into_seed_source_type_union(self, seed_source_type: type[TypeAlias])
         """
         seed_source_type = self._plugin_registry.add_plugin_types_to_union(seed_source_type, PluginType.SEED_READER)
         return seed_source_type
+
+    def inject_into_processor_config_type_union(self, processor_config_type: type[TypeAlias]) -> type[TypeAlias]:
+        """Inject plugins into the processor config type.
+
+        Args:
+            processor_config_type: The processor config type to inject plugins into.
+
+        Returns:
+            The processor config type with plugins injected.
+        """
+        return self._plugin_registry.add_plugin_types_to_union(processor_config_type, PluginType.PROCESSOR)

packages/data-designer-config/src/data_designer/plugins/plugin.py

Lines changed: 3 additions & 0 deletions
@@ -20,13 +20,16 @@
 class PluginType(str, Enum):
     COLUMN_GENERATOR = "column-generator"
     SEED_READER = "seed-reader"
+    PROCESSOR = "processor"
 
     @property
     def discriminator_field(self) -> str:
         if self == PluginType.COLUMN_GENERATOR:
             return "column_type"
         elif self == PluginType.SEED_READER:
             return "seed_type"
+        elif self == PluginType.PROCESSOR:
+            return "processor_type"
         else:
             raise ValueError(f"Invalid plugin type: {self.value}")
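To make the effect of the new enum member concrete, here is a standalone copy of `PluginType` as it reads after this change, followed by a lookup of the discriminator in a plain dictionary. The enum body mirrors the diff. The `payload` dictionary and the `"regex-filter"` value are assumptions made up for illustration, not data from the repository.

```python
from enum import Enum


class PluginType(str, Enum):
    COLUMN_GENERATOR = "column-generator"
    SEED_READER = "seed-reader"
    PROCESSOR = "processor"

    @property
    def discriminator_field(self) -> str:
        # Each plugin type names the config field that identifies the concrete class.
        if self == PluginType.COLUMN_GENERATOR:
            return "column_type"
        elif self == PluginType.SEED_READER:
            return "seed_type"
        elif self == PluginType.PROCESSOR:
            return "processor_type"
        else:
            raise ValueError(f"Invalid plugin type: {self.value}")


# Hypothetical serialized processor config; the field values are made up.
payload = {"name": "my_processor", "processor_type": "regex-filter"}

field = PluginType("processor").discriminator_field
print(field)           # processor_type
print(payload[field])  # regex-filter
```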

packages/data-designer-config/src/data_designer/plugins/registry.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 class PluginRegistry:
     _instance = None
     _plugins_discovered = False
-    _lock = threading.Lock()
+    _lock = threading.RLock()
 
     _plugins: dict[str, Plugin] = {}
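The commit message attributes the `Lock` to `RLock` switch to deadlocks during discovery, but the diff does not show which call path re-enters the lock. The sketch below only demonstrates the general difference, assuming a registry whose discovery method calls another method that takes the same lock on the same thread. `DemoRegistry` and its methods are hypothetical and are not the real `PluginRegistry`.

```python
import threading


class DemoRegistry:
    """Hypothetical registry whose methods share one re-entrant lock."""

    _lock = threading.RLock()

    def __init__(self) -> None:
        self._plugins: dict[str, object] = {}

    def register(self, name: str, plugin: object) -> None:
        with self._lock:
            self._plugins[name] = plugin

    def discover(self) -> None:
        # Holds the lock and then calls register(), which acquires it again.
        # With a plain threading.Lock this nested acquire would block forever.
        with self._lock:
            self.register("demo", object())


registry = DemoRegistry()
registry.discover()

# A non-blocking second acquire on the same thread shows the difference directly.
plain, reentrant = threading.Lock(), threading.RLock()
plain.acquire()
plain_second = plain.acquire(blocking=False)          # False: already held
plain.release()
reentrant.acquire()
reentrant_second = reentrant.acquire(blocking=False)  # True: same thread may re-enter
reentrant.release()
reentrant.release()
print(plain_second, reentrant_second)
```

An `RLock` must be released once per acquire, which is why the sketch releases it twice.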
