Commit cebfb0e

nabinchha, greptile-apps[bot], and davanstrien authored
docs: Added starter dev notes on push to hugging face hub (#355)
* Added starter dev notes on push to huggingface hub
* fix: move excerpt marker to intro and remove redundant markers

  Move the single `<!-- more -->` to after the intro paragraph for a shorter blog teaser and remove the 6 redundant markers throughout the post.
* Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* docs: add HF ecosystem context to push-to-hub dev notes (#474)
  * docs: add HF ecosystem context to push-to-hub dev notes

    Add section on what datasets get on the Hub (Dataset Viewer, streaming, Viewer API), link to Hub search for DataDesigner datasets, and note that private datasets can be flipped to public.
  * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md

    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  * fix: remove doubled library: prefix in Hub search URL

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Update date
* fix date for text-to-sql
* update hero images
* updates

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
1 parent 6ef4953 commit cebfb0e

8 files changed

Lines changed: 329 additions & 2 deletions

docs/devnotes/.authors.yml

Lines changed: 4 additions & 0 deletions
@@ -39,3 +39,7 @@ authors:
     name: Nabin Mulepati
     description: Researcher at NVIDIA
     avatar: https://avatars.githubusercontent.com/u/5551931?v=4
+  davanstrien:
+    name: Daniel van Strien
+    description: Machine Learning Librarian at Hugging Face
+    avatar: https://avatars.githubusercontent.com/u/8995957?v=4
docs/devnotes/posts/push-datasets-to-hugging-face-hub.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@

---
date: 2026-04-16
authors:
  - nmulepati
  - davanstrien
---

# **Push Datasets to Hugging Face Hub**

You just generated 10k multilingual greetings (or some other cool dataset). Now what: email a Parquet file?
Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚢.

<!-- more -->

![Push to Hub Hero](assets/push-datasets-to-hugging-face-hub/push-to-hub-hero.png){ width=100% }

---

Here's the full flow: build a multilingual greeting dataset with a conversation
training processor, generate it, and push it to the Hub in one go:

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Sample one language per record; drop=True keeps this helper column out of the final dataset
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

# Generate a greeting in the sampled language
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)
# Generate the agent's reply to that greeting
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings
```

---
## Two Ways In, Same Outcome

**From results** (the happy path): you just ran `.create()`, you have the
results object, call `.push_to_hub()` on it.

**From a folder** (the "I closed my notebook" path): you saved artifacts to
disk earlier and want to push them later:

```python
from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)
```

---
## What You Get on the Hub

Once pushed, your dataset is live in the Hugging Face ecosystem:

- **Dataset Viewer**: browsable in the browser immediately. Each processor
  config shows up as a separate subset tab (more on this in
  [Processors Get First-Class Treatment](#processors-get-first-class-treatment)).
- **Streaming**: the data is stored as Parquet, so consumers can stream it
  without downloading the full dataset first:

    ```python
    from datasets import load_dataset

    ds = load_dataset("my-org/multilingual-greetings", "conversations", split="train", streaming=True)
    ```

- **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer/)**: row
  pagination, text search, column statistics, and parquet shard URLs with no
  extra setup.
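
As a rough illustration of the Dataset Viewer API mentioned above, the sketch below builds request URLs for a few of its public endpoints (`/rows`, `/search`, `/statistics`, `/parquet`). The repo id, config, and split are the placeholder values from this post, and `viewer_url` is a hypothetical helper. The code only builds URLs; fetching them is left to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

# Public base URL of the Dataset Viewer API
VIEWER_BASE = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params: object) -> str:
    """Build a Dataset Viewer API URL (hypothetical helper, no network access)."""
    return f"{VIEWER_BASE}/{endpoint}?{urlencode(params)}"


repo = "my-org/multilingual-greetings"  # placeholder repo id from this post
common = {"dataset": repo, "config": "conversations", "split": "train"}

rows_url = viewer_url("rows", **common, offset=0, length=100)  # row pagination
search_url = viewer_url("search", **common, query="hola")      # text search
stats_url = viewer_url("statistics", **common)                 # column statistics
parquet_url = viewer_url("parquet", dataset=repo)              # parquet shard URLs

for url in (rows_url, search_url, stats_url, parquet_url):
    print(url)
```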

---
## What Gets Uploaded

![Push to Hub Pipeline](assets/push-datasets-to-hugging-face-hub/push-to-hub-pipeline.png)

Everything. The upload pipeline runs in this order:

```
1. README.md             ← auto-generated dataset card
2. data/*.parquet        ← your main dataset (remapped from parquet-files/)
3. images/*              ← if you have image columns (skipped otherwise)
4. {processor}/*         ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json         ← paths rewritten to match HF repo layout
```

Each step is its own commit on the HF repo, so you get a clean history.
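
To make the ordering concrete, here is a small sketch that derives an upload plan from a local artifact folder. It only computes `(source, path_in_repo)` pairs in the order listed above. The function name `plan_upload` and the assumption that each processor lives in its own subfolder of `processors-files/` are illustrative, not Data Designer's actual implementation.

```python
import tempfile
from pathlib import Path


def plan_upload(dataset_dir: str) -> list[tuple[str, str]]:
    """Return (source, path_in_repo) steps in upload order (illustrative only)."""
    root = Path(dataset_dir)
    steps = [("<generated dataset card>", "README.md")]
    steps.append(("parquet-files", "data"))
    if (root / "images").is_dir():  # skipped when there are no image columns
        steps.append(("images", "images"))
    processors = root / "processors-files"
    if processors.is_dir():  # one top-level repo folder per processor
        for name in sorted(p.name for p in processors.iterdir() if p.is_dir()):
            steps.append((f"processors-files/{name}", name))
    steps.append(("builder_config.json", "builder_config.json"))
    steps.append(("metadata.json", "metadata.json"))
    return steps


# Demo on a throwaway folder that mimics the layout described in this post
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "parquet-files").mkdir()
    (Path(tmp) / "processors-files" / "conversations").mkdir(parents=True)
    plan = plan_upload(tmp)

for source, target in plan:
    print(f"{source} -> {target}")
```

Each entry in the plan would correspond to one commit on the Hub repo.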

This is especially nice for large datasets. Data Designer writes output in
batched parquet partitions: generate 100k records and you'll have dozens of
parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`.
Manually uploading all of that, organizing it into the right HF repo structure,
writing the dataset card YAML configs, and rewriting metadata paths would be
tedious and error-prone. `push_to_hub` handles the whole thing in one call:
folder uploads, path remapping, config registration, and dataset card generation.

Re-pushing to the same `repo_id` updates the existing repo, so there is no need
to delete and recreate it.

---
## Processors Get First-Class Treatment

![Schema Transform for Conversation Training](assets/push-datasets-to-hugging-face-hub/push-to-hub-schema-transform.png)

Notice the `SchemaTransformProcessorConfig` in the example above. That's doing
the heavy lifting: it takes the raw `greeting` and `response` columns and
reshapes each row into an OpenAI-style `messages` array:

```python
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)
```

The template is Jinja2 all the way down. Keys become columns in the output, and
values get rendered per row with the actual column data. The template dict must
be JSON-serializable: strings, lists, nested objects, all fair game. So you can
build arbitrarily complex conversation schemas (multi-turn, system prompts,
tool calls) just by adding more entries to the `messages` list.
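
To illustrate what "rendered per row" means, here is a toy sketch that walks a template dict and fills in `{{ column }}` placeholders from a single row. It deliberately uses a small regex substitution as a stand-in for Jinja2, and `render_template` is a hypothetical helper, not the processor's actual code. The system message is an added example entry.

```python
import json
import re

# Matches simple {{ column_name }} placeholders (a stand-in for real Jinja2 rendering)
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(node, row: dict):
    """Recursively render {{ column }} placeholders in a JSON-like template."""
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), node)
    if isinstance(node, list):
        return [render_template(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render_template(value, row) for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


template = {
    "messages": [
        {"role": "system", "content": "You are a friendly multilingual agent."},
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
row = {"greeting": "Hola!", "response": "Hola! Estoy bien, gracias."}

rendered = render_template(template, row)
print(json.dumps(rendered, ensure_ascii=False, indent=2))
```

The top-level key (`messages`) becomes the output column, and adding a turn is just one more dict in the list.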

The processor runs after each batch and writes its output to a separate parquet
file alongside the main dataset. The main dataset (`data/`) still has the raw
columns; the processor output is an *additional* view, not a replacement.

**When you push to hub, each processor gets its own top-level directory and its
own HF dataset config.** So the `conversations` processor from our example ends
up like this on HF:

```
my-org/multilingual-greetings/
├── README.md
├── data/
│   ├── batch_00000.parquet      ← raw columns (greeting, response)
│   └── batch_00001.parquet
├── conversations/
│   ├── batch_00000.parquet      ← transformed (messages array)
│   └── batch_00001.parquet
├── builder_config.json
└── metadata.json
```

The dataset card YAML frontmatter registers each processor as its own named
config:

```yaml
configs:
- config_name: data
  data_files: "data/*.parquet"
  default: true
- config_name: conversations
  data_files: "conversations/*.parquet"
```

So consumers grab exactly the format they need:

```python
from datasets import load_dataset

# Raw columns: good for analysis
df = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format: ready for fine-tuning
df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(df_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}
```

The Quick Start section in the generated README includes these snippets
automatically, with one `load_dataset` call per processor.

**Metadata paths are rewritten too.** Local paths like
`processors-files/conversations/batch_00000.parquet` become
`conversations/batch_00000.parquet` so file references in the metadata match
the actual HF repo structure.

If there are no processors, all of this is silently skipped: no empty
directories, no phantom configs.
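
The path rewriting and config registration described above could be sketched roughly as follows. Both `to_repo_path` and `build_configs` are hypothetical helpers written for this illustration. They mirror the behavior the post describes and are not taken from the Data Designer source.

```python
def to_repo_path(local_path: str) -> str:
    """Map a local artifact path to its location in the Hub repo."""
    if local_path.startswith("parquet-files/"):
        return "data/" + local_path[len("parquet-files/"):]
    if local_path.startswith("processors-files/"):
        # processors-files/<name>/... becomes <name>/...
        return local_path[len("processors-files/"):]
    return local_path  # anything else keeps its relative path


def build_configs(processor_names: list[str]) -> list[dict]:
    """Build the `configs` entries for the dataset card frontmatter."""
    configs = [{"config_name": "data", "data_files": "data/*.parquet", "default": True}]
    for name in processor_names:
        configs.append({"config_name": name, "data_files": f"{name}/*.parquet"})
    return configs


print(to_repo_path("processors-files/conversations/batch_00000.parquet"))
print(to_repo_path("parquet-files/batch_00000.parquet"))
print(build_configs(["conversations"]))
print(build_configs([]))  # no processors: only the default config remains
```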

---
## The Auto-Generated Dataset Card

This is the fun part. The upload generates a full Hugging Face dataset card from
your run metadata. It pulls from `metadata.json` and `builder_config.json` to
build:

- A **Quick Start** section with `load_dataset` code (including processor subsets)
- A **Dataset Summary** with record count, column count, completion %
- A **Schema & Statistics** table with per-column type, uniqueness, null rate, and token stats
- **Generation Details**: how many columns of each config type
- A **Citation** block so people can cite your dataset

Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed. These tags make your
dataset discoverable in [Hub search](https://huggingface.co/datasets?library=datadesigner&sort=trending),
so you can browse all Data Designer datasets in one place.
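
For a sense of how the tag merging and size bucketing might work, here is a minimal sketch. The bucket labels follow the Hub's standard `size_categories` values. The helper names, the de-duplication of tags, and the choice to treat each upper bound as exclusive are assumptions made for this example.

```python
DEFAULT_TAGS = ["synthetic", "datadesigner"]

# (exclusive upper bound, Hub size_categories label)
SIZE_BUCKETS = [
    (10**3, "n<1K"),
    (10**4, "1K<n<10K"),
    (10**5, "10K<n<100K"),
    (10**6, "100K<n<1M"),
    (10**7, "1M<n<10M"),
    (10**8, "10M<n<100M"),
    (10**9, "100M<n<1B"),
    (10**10, "1B<n<10B"),
    (10**11, "10B<n<100B"),
    (10**12, "100B<n<1T"),
]


def size_category(num_records: int) -> str:
    """Pick the size bucket for a record count (boundary handling is an assumption)."""
    for upper_bound, label in SIZE_BUCKETS:
        if num_records < upper_bound:
            return label
    return "n>1T"


def merge_tags(user_tags=None) -> list[str]:
    """Default tags first, then user tags, skipping duplicates."""
    merged = list(DEFAULT_TAGS)
    for tag in user_tags or []:
        if tag not in merged:
            merged.append(tag)
    return merged


print(size_category(10_000))
print(merge_tags(["greetings", "multilingual", "conversation"]))
```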

The template lives at `packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md` if you
want to see the Jinja2 source.

---
## Auth

Token resolution follows the standard `huggingface_hub` chain:

1. Explicit `token=` parameter
2. `HF_TOKEN` env var
3. Cached creds from `hf auth login`

If none of those work, you get a clear error telling you what to do.
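
The resolution order above can be pictured with a short sketch. `resolve_token` is a hypothetical function, and the `read_cached` callback stands in for however the cached login is actually read. The real lookup is handled by `huggingface_hub` and may differ in detail.

```python
import os
from typing import Callable, Optional


def resolve_token(
    token: Optional[str] = None,
    read_cached: Callable[[], Optional[str]] = lambda: None,
) -> str:
    """Resolve a Hub token: explicit argument, then HF_TOKEN, then cached login."""
    resolved = token or os.environ.get("HF_TOKEN") or read_cached()
    if not resolved:
        raise RuntimeError(
            "No Hugging Face token found. Pass token=..., set HF_TOKEN, "
            "or run `hf auth login`."
        )
    return resolved
```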

---
## Reproducible Pipelines: The Round-Trip

![Round-Trip Reproducibility](assets/push-datasets-to-hugging-face-hub/push-to-hub-round-trip.png){ width="800" }

Here's the payoff: every dataset you push includes `builder_config.json`, the
full synthetic data generation (SDG) pipeline definition. Anyone (including
future-you) can recreate the exact same pipeline from the Hugging Face URL:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)
```

That's it. One call. `from_config` accepts a raw URL, a local file path, a dict,
or a YAML string. When you hand it a Hugging Face Hub URL, it auto-rewrites the
blob URL to a raw URL behind the scenes so the fetch just works (same trick for
GitHub blob URLs).
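
A blob-to-raw rewrite of this kind might look like the sketch below. The post does not say which raw URL form Data Designer targets. This example assumes `/blob/` is swapped for `/raw/` on the Hub and that GitHub URLs move to `raw.githubusercontent.com`. `blob_to_raw` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url: str) -> str:
    """Rewrite a Hugging Face or GitHub blob URL to a raw-content URL."""
    parts = urlsplit(url)
    if parts.netloc == "huggingface.co" and "/blob/" in parts.path:
        # Assumption: the Hub serves raw file content under /raw/<revision>/
        return urlunsplit(parts._replace(path=parts.path.replace("/blob/", "/raw/", 1)))
    if parts.netloc == "github.com" and "/blob/" in parts.path:
        # github.com/<owner>/<repo>/blob/<ref>/<path>
        #   -> raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>
        path = parts.path.replace("/blob/", "/", 1)
        return urlunsplit(parts._replace(netloc="raw.githubusercontent.com", path=path))
    return url  # already raw, or not a recognized host


print(blob_to_raw(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
))
```

URLs that are already raw, or that point at other hosts, pass through unchanged.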

The loaded config builder comes back fully hydrated: columns, model configs,
constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

```python
from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)
```

So the full loop is: **design → generate → push → share URL → recreate → iterate**.
The `builder_config.json` on Hugging Face *is* the reproducibility artifact.

---
## Gotchas

- **`repo_id` must be `username/dataset-name`** (an organization name works as
  the namespace too), with exactly one slash. The client validates this before
  hitting the network.
- **`description` is required.** It's the prose that appears right under the
  title on the dataset card. Make it good.
- **`private=True`** if you don't want the world to see your dataset yet. You
  can flip it to public later from the dataset settings page.
- **Metadata paths get rewritten.** Local paths like `parquet-files/batch_00000.parquet`
  become `data/batch_00000.parquet` in the uploaded `metadata.json` so references
  stay valid on HF.
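
The `repo_id` check from the first gotcha is easy to picture as a pre-flight validation. The sketch below is a guess at its shape, with a hypothetical function name and error message. The client's real validation may be stricter.

```python
def validate_repo_id(repo_id: str) -> tuple[str, str]:
    """Split 'namespace/name' and reject anything without exactly one slash."""
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Invalid repo_id {repo_id!r}: expected 'username/dataset-name' "
            "with exactly one slash."
        )
    return parts[0], parts[1]


print(validate_repo_id("my-org/multilingual-greetings"))
```

Running a check like this locally means a typo in the repo name fails fast, before any upload starts.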

docs/devnotes/posts/text-to-sql.md

Lines changed: 4 additions & 2 deletions
@@ -8,12 +8,14 @@ authors:
 
 # **Engineering an Enterprise-Grade Text-to-SQL Dataset with NeMo Data Designer**
 
-![Text-to-SQL Synthetic Data Pipeline](assets/text-to-sql/text-to-sql-pipeline.jpg){ width=800 }
-
 While LLMs have mastered generic coding, Text-to-SQL remains one of the most challenging frontiers in enterprise AI. In many ways this is due to (i) SQL tasks relying on both code and data and (ii) real-world data and databases being quite messy. Focusing on careful data design that accounts for real-world diversity and complexity, we built a [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) pipeline that includes conditional sampling, three-stage LLM generation, code validators, and multi-dimensional judge scoring to generate reasoning-heavy text-to-SQL samples across PostgreSQL, MySQL, and SQLite, and automatically filter down to the highest quality 96.5k records. Each sample pairs a natural-language prompt and a fully synthetic database schema context with a target SQL query. To improve robustness and mimic the messiness of production databases, the pipeline injects distractor tables and columns into the schema context, forcing the model to learn to ignore irrelevant schema elements. The final dataset is validated and filtered through per-dialect syntax validators and five LLM-as-a-critic judges.
 
 <!-- more -->
+<div style="text-align: center;" markdown>
+
+![Text-to-SQL Synthetic Data Pipeline](assets/text-to-sql/text-to-sql-pipeline.jpg){ width=800 }
 
+</div>
 ---
 
 ## **The "Real-World" Gap: Why Academic Data Wasn't Enough**

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ nav:
   - Dev Notes:
     # NOTE: Order is most recent -> oldest (so sidebar shows recent first!)
     - devnotes/index.md
+    - Push Datasets to Hugging Face Hub: devnotes/posts/push-datasets-to-hugging-face-hub.md
     - "Text-to-SQL for Nemotron Super": devnotes/posts/text-to-sql.md
     - "Async All the Way Down": devnotes/posts/async-all-the-way-down.md
     - Owning the Model Stack: devnotes/posts/owning-the-model-stack.md
