Skip to content

Commit 0f3a511

Browse files
davanstrienlhoestq
andauthored
Add DataDesigner integration for synthetic dataset generation (#2119)
* Add DataDesigner integration for synthetic dataset generation * Add documentation for DataDesigner integration with Hugging Face Inference Providers * Add DataDesigner section to Inference Providers documentation * fix order * Update model configuration in DataDesigner integration example * Update docs/inference-providers/_toctree.yml Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update docs/inference-providers/integrations/datadesigner.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update docs/inference-providers/integrations/index.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Update docs/inference-providers/integrations/index.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --------- Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
1 parent 2604ea3 commit 0f3a511

3 files changed

Lines changed: 114 additions & 0 deletions

File tree

docs/inference-providers/_toctree.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,8 @@
3838
title: Overview
3939
- local: integrations/adding-integration
4040
title: Add Your Integration
41+
- local: integrations/datadesigner
42+
title: NeMo Data Designer
4143
- local: integrations/macwhisper
4244
title: MacWhisper
4345
- local: integrations/opencode
Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
# NeMo Data Designer
2+
3+
[DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets while maintaining control over field relationships and data quality.
4+
5+
## Overview
6+
7+
DataDesigner supports OpenAI-compatible endpoints, making it easy to use any model available through Hugging Face Inference Providers for synthetic data generation.
8+
9+
## Prerequisites
10+
11+
- DataDesigner installed (`pip install data-designer`)
12+
- A Hugging Face account with [API token](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained) (needs "Make calls to Inference Providers" permission)
13+
14+
## Configuration
15+
16+
### 1. Set your HF token
17+
18+
```bash
19+
export HF_TOKEN="hf_your_token_here"
20+
```
21+
22+
### 2. Configure HF as a provider
23+
24+
```python
25+
from data_designer.essentials import (
26+
CategorySamplerParams,
27+
DataDesigner,
28+
DataDesignerConfigBuilder,
29+
LLMTextColumnConfig,
30+
ModelConfig,
31+
ModelProvider,
32+
SamplerColumnConfig,
33+
SamplerType,
34+
)
35+
36+
# Define HF Inference Provider (OpenAI-compatible)
37+
hf_provider = ModelProvider(
38+
name="huggingface",
39+
endpoint="https://router.huggingface.co/v1",
40+
provider_type="openai",
41+
api_key="HF_TOKEN", # Reads from environment variable
42+
)
43+
44+
# Define a model available via HF Inference Providers
45+
hf_model = ModelConfig(
46+
alias="hf-gpt-oss",
47+
model="openai/gpt-oss-120b",
48+
provider="huggingface",
49+
)
50+
51+
# Create DataDesigner with HF provider
52+
data_designer = DataDesigner(model_providers=[hf_provider])
53+
config_builder = DataDesignerConfigBuilder(model_configs=[hf_model])
54+
```
55+
56+
### 3. Generate synthetic data
57+
58+
```python
59+
# Add a sampler column
60+
config_builder.add_column(
61+
SamplerColumnConfig(
62+
name="category",
63+
sampler_type=SamplerType.CATEGORY,
64+
params=CategorySamplerParams(
65+
values=["Electronics", "Books", "Clothing"],
66+
),
67+
)
68+
)
69+
70+
# Add an LLM-generated column
71+
config_builder.add_column(
72+
LLMTextColumnConfig(
73+
name="product_name",
74+
model_alias="hf-gpt-oss",
75+
prompt="Generate a creative product name for a {{ category }} item.",
76+
)
77+
)
78+
79+
# Preview the generated data
80+
preview = data_designer.preview(config_builder=config_builder, num_records=5)
81+
preview.display_sample_record()
82+
83+
# Access the DataFrame
84+
df = preview.dataset
85+
print(df)
86+
```
87+
88+
## Using Different Models
89+
90+
You can use any model available through [Inference Providers](https://huggingface.co/models?inference_provider=all). Simply update the `model` field:
91+
92+
```python
93+
# Use a different model
94+
hf_model = ModelConfig(
95+
alias="hf-olmo",
96+
model="allenai/OLMo-3-7B-Instruct",
97+
provider="huggingface",
98+
)
99+
```
100+
101+
## Resources
102+
103+
- [DataDesigner Documentation](https://nvidia-nemo.github.io/DataDesigner/)
104+
- [GitHub Repository](https://github.com/NVIDIA-NeMo/DataDesigner)
105+
- [Available Models on Inference Providers](https://huggingface.co/models?inference_provider=all&pipeline_tag=text-generation)

docs/inference-providers/integrations/index.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ This table lists _some_ tools, libraries, and applications that work with Huggin
1818
| Integration | Description | Resources |
1919
| ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
2020
| [CrewAI](https://www.crewai.com/) | Framework for orchestrating AI agent teams | [Official docs](https://docs.crewai.com/en/concepts/llms#hugging-face) |
21+
| [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) | Synthetic dataset generation framework | [HF docs](./datadesigner) |
2122
| [GitHub Copilot Chat](https://docs.github.com/en/copilot) | AI pair programmer in VS Code | [HF docs](./vscode) |
2223
| [fast-agent](https://fast-agent.ai/) | Flexible framework building MCP/ACP powered Agents, Workflows and evals | [Official docs](https://fast-agent.ai/models/llm_providers/#hugging-face) |
2324
| [Haystack](https://haystack.deepset.ai/) | Open-source LLM framework for building production applications | [Official docs](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator) |
@@ -71,6 +72,12 @@ LLM application frameworks and orchestration platforms.
7172
- [PydanticAI](https://ai.pydantic.dev/) - Framework for building AI agents with Python ([Official docs](https://ai.pydantic.dev/models/huggingface/))
7273
- [smolagents](https://huggingface.co/docs/smolagents) - Framework for building LLM agents with tool integration ([Official docs](https://huggingface.co/docs/smolagents/reference/models#smolagents.InferenceClientModel))
7374

75+
### Synthetic Data
76+
77+
Tools for creating synthetic datasets.
78+
79+
- [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) - NVIDIA NeMo framework for synthetic data generation ([HF docs](./datadesigner))
80+
7481
<!-- ## Add Your Integration
7582
7683
Building something with Inference Providers? [Let us know](./adding-integration) and we'll add it to the list. -->

0 commit comments

Comments
 (0)