Commit 031ad32

Added starter dev notes on push to huggingface hub
1 parent a101760 commit 031ad32

7 files changed: +306 -0 lines
docs/devnotes/.authors.yml (4 additions, 0 deletions)

```diff
@@ -19,3 +19,7 @@ authors:
     name: Dhruv Nathawani
     description: Researcher at NVIDIA
     avatar: https://avatars.githubusercontent.com/u/128275431?v=4
+  nmulepati:
+    name: Nabin Mulepati
+    description: Researcher at NVIDIA
+    avatar: https://avatars.githubusercontent.com/u/5551931?v=4
```
4 image files added (390 KB, 463 KB, 508 KB, 508 KB): the screenshots embedded in the post below.
docs/devnotes/posts/push-datasets-to-hugging-face-hub.md (301 additions, 0 deletions)
---
date: 2026-02-26
authors:
  - nmulepati
---

# **Push Datasets to Hugging Face Hub**

![Push to Hub Hero](images/push-to-hub-hero.png)

You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file?
Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚢.

Here's the full flow — build a multilingual greeting dataset with a conversation
training processor, generate it, and push it to the Hub in one go:

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
        drop=True,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual greeting in {{ language }}.",
    )
)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        model_alias="nvidia-text",
        prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.",
    )
)

# Reshape into an OpenAI-style conversation training format
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)

results = data_designer.create(config_builder, num_records=10_000)

# Ship it:
url = results.push_to_hub(
    "my-org/multilingual-greetings",
    "10k synthetic agent/user conversations across 5 languages.",
    tags=["greetings", "multilingual", "conversation"],
)
print(url)  # https://huggingface.co/datasets/my-org/multilingual-greetings
```

<!-- more -->

---

## Two Ways In - Same Outcome

**From results** (the happy path) — you just ran `.create()`, you have the
results object, call `.push_to_hub()` on it.

**From a folder** (the "I closed my notebook" path) — you saved artifacts to
disk earlier and want to push them later:

```python
from data_designer.integrations.huggingface import HuggingFaceHubClient

url = HuggingFaceHubClient.push_to_hub_from_folder(
    dataset_path="./my-saved-dataset",
    repo_id="my-org/multilingual-greetings",
    description="10k synthetic agent/user conversations across 5 languages.",
)
```

---
## What Gets Uploaded

![Push to Hub Pipeline](images/push-to-hub-pipeline.png)

Everything. The upload pipeline runs in this order:

```
1. README.md            ← auto-generated dataset card
2. data/*.parquet       ← your main dataset (remapped from parquet-files/)
3. images/*             ← if you have image columns (skipped otherwise)
4. {processor}/*        ← processor outputs (remapped from processors-files/)
5. builder_config.json
6. metadata.json        ← paths rewritten to match HF repo layout
```

Each step is its own commit on the HF repo, so you get a clean history.

This is especially nice for large datasets. Data Designer writes output in
batched parquet partitions — generate 100k records and you'll have dozens of
parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`.
Manually uploading all of that, organizing it into the right HF repo structure,
writing the dataset card YAML configs, and rewriting metadata paths would be
tedious and error-prone. `push_to_hub` handles the whole thing in one call —
folder uploads, path remapping, config registration, dataset card generation,
all of it.

Re-pushing to the same `repo_id` updates the existing repo — no need to delete
and recreate.
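
After a push, it's easy to sanity-check what actually landed: `HfApi().list_repo_files(..., repo_type="dataset")` from `huggingface_hub` returns a flat file listing, and a few lines of stdlib Python can group it into the layout above. The `group_by_top_level` helper and the hard-coded listing are illustrative, not part of Data Designer:

```python
from collections import defaultdict

def group_by_top_level(paths: list[str]) -> dict[str, list[str]]:
    """Group repo file paths by their top-level folder ('.' for root files)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for p in paths:
        top = p.split("/", 1)[0] if "/" in p else "."
        groups[top].append(p)
    return dict(groups)

# A listing like the one HfApi().list_repo_files(repo_id, repo_type="dataset")
# would return for the example repo:
listing = [
    "README.md",
    "data/batch_00000.parquet",
    "data/batch_00001.parquet",
    "conversations/batch_00000.parquet",
    "builder_config.json",
    "metadata.json",
]
print(group_by_top_level(listing)["."])
# ['README.md', 'builder_config.json', 'metadata.json']
```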

---
## Processors Get First-Class Treatment

![Schema Transform for Conversation Training](images/push-to-hub-schema-transform.png)

Notice the `SchemaTransformProcessorConfig` in the example above. That's doing
the heavy lifting — it takes the raw `greeting` and `response` columns and
reshapes each row into an OpenAI-style `messages` array:

```python
config_builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="conversations",
        template={
            "messages": [
                {"role": "user", "content": "{{ greeting }}"},
                {"role": "assistant", "content": "{{ response }}"},
            ]
        },
    )
)
```

The template is Jinja2 all the way down. Keys become columns in the output,
values get rendered per-row with the actual column data. The template dict must
be JSON-serializable — strings, lists, nested objects, all fair game. So you can
build arbitrarily complex conversation schemas (multi-turn, system prompts,
tool calls) just by adding more entries to the `messages` list.
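
To make the per-row mechanics concrete, here's a self-contained sketch of what "rendered per-row" means: walk a JSON-serializable template and fill every string against the row. It uses a regex stand-in for the real Jinja2 rendering, and `render_str`/`render_template` are hypothetical helpers, not Data Designer API:

```python
import json
import re

def render_str(s: str, row: dict) -> str:
    # Regex stand-in for Jinja2: substitute {{ column }} with row values
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), s)

def render_template(node, row: dict):
    """Recursively render every string in a JSON-serializable template."""
    if isinstance(node, str):
        return render_str(node, row)
    if isinstance(node, list):
        return [render_template(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render_template(value, row) for key, value in node.items()}
    return node  # numbers, bools, None pass through untouched

template = {
    "messages": [
        {"role": "user", "content": "{{ greeting }}"},
        {"role": "assistant", "content": "{{ response }}"},
    ]
}
row = {"greeting": "Hey there!", "response": "Hi! How can I help?"}
print(json.dumps(render_template(template, row)))
```

Because the walk recurses through lists and dicts, deeper structures (multi-turn transcripts, system prompts, tool-call objects) render the same way.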

The processor runs after each batch and writes its output to a separate parquet
file alongside the main dataset. The main dataset (`data/`) still has the raw
columns — the processor output is an *additional* view, not a replacement.

**When you push to hub, each processor gets its own top-level directory and its
own HF dataset config.** So the `conversations` processor from our example ends
up like this on HF:

```
my-org/multilingual-greetings/
├── README.md
├── data/
│   ├── batch_00000.parquet    ← raw columns (greeting, response)
│   └── batch_00001.parquet
├── conversations/
│   ├── batch_00000.parquet    ← transformed (messages array)
│   └── batch_00001.parquet
├── builder_config.json
└── metadata.json
```

The dataset card YAML frontmatter registers each processor as its own named
config:

```yaml
configs:
  - config_name: data
    data_files: "data/*.parquet"
    default: true
  - config_name: conversations
    data_files: "conversations/*.parquet"
```

So consumers grab exactly the format they need:

```python
from datasets import load_dataset

# Raw columns — good for analysis
df = load_dataset("my-org/multilingual-greetings", "data", split="train")

# Conversation format — ready for fine-tuning
df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train")
print(df_conv[0])
# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'},
#               {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]}
```

The Quick Start section in the generated README includes these snippets
automatically — one `load_dataset` call per processor.

**Metadata paths are rewritten too.** Local paths like
`processors-files/conversations/batch_00000.parquet` become
`conversations/batch_00000.parquet` so file references in the metadata match
the actual HF repo structure.
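
The remapping described here is simple prefix surgery. A hypothetical sketch of the rule (`parquet-files/` becomes `data/`, and the `processors-files/` prefix is dropped); this is an illustration, not Data Designer's actual code:

```python
def remap_path(local_path: str) -> str:
    """Rewrite a local artifact path to its Hugging Face repo location."""
    if local_path.startswith("parquet-files/"):
        # Main dataset partitions live under data/ on the Hub
        return "data/" + local_path[len("parquet-files/"):]
    if local_path.startswith("processors-files/"):
        # Each processor gets its own top-level directory
        return local_path[len("processors-files/"):]
    return local_path  # images/, builder_config.json, etc. keep their paths

print(remap_path("parquet-files/batch_00000.parquet"))
# data/batch_00000.parquet
print(remap_path("processors-files/conversations/batch_00000.parquet"))
# conversations/batch_00000.parquet
```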

If there are no processors, all of this is silently skipped — no empty
directories, no phantom configs.

---
## The Auto-Generated Dataset Card

This is the fun part. The upload generates a full Hugging Face dataset card from
your run metadata. It pulls from `metadata.json` and `builder_config.json` to
build:

- A **Quick Start** section with `load_dataset` code (including processor subsets)
- A **Dataset Summary** with record count, column count, completion %
- A **Schema & Statistics** table — per-column type, uniqueness, null rate, token stats
- **Generation Details** — how many columns of each config type
- A **Citation** block so people can cite your dataset

Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed.
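
For a sense of how that auto-computation could look, here's a hypothetical helper mapping record counts onto the Hub's `size_categories` buckets. The bucket labels follow the Hub convention; the function itself is a sketch, not Data Designer's code:

```python
def size_category(num_records: int) -> str:
    """Map a record count onto a Hugging Face Hub size_categories tag."""
    bounds = [
        (1_000, "n<1K"),
        (10_000, "1K<n<10K"),
        (100_000, "10K<n<100K"),
        (1_000_000, "100K<n<1M"),
        (10_000_000, "1M<n<10M"),
    ]
    for upper, label in bounds:
        if num_records < upper:
            return label
    # The Hub's buckets continue upward (10M<n<100M, ...); truncated here
    return "n>10M"

print(size_category(10_000))  # 10K<n<100K, for the example dataset above
```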

The template lives at `integrations/huggingface/dataset_card_template.md` if you
want to see the Jinja2 source.

---

## Auth

Token resolution follows the standard `huggingface_hub` chain:

1. Explicit `token=` parameter
2. `HF_TOKEN` env var
3. Cached creds from `hf auth login`

If none of those work, you get a clear error telling you what to do.
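
The chain can be sketched in a few lines. This illustrates the resolution order only; in practice `huggingface_hub.get_token()` is the real helper that reads both the env var and cached credentials, and `resolve_token` is a hypothetical name:

```python
import os
from typing import Optional

def resolve_token(explicit_token: Optional[str] = None) -> str:
    """Illustrative resolution chain; real resolution is huggingface_hub's job."""
    # 1. Explicit token= parameter wins
    if explicit_token:
        return explicit_token
    # 2. HF_TOKEN environment variable
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    # 3. Credentials cached by `hf auth login`
    try:
        from huggingface_hub import get_token
        cached = get_token()
        if cached:
            return cached
    except ImportError:
        pass
    raise ValueError(
        "No Hugging Face token found. Pass token=, set HF_TOKEN, "
        "or run `hf auth login`."
    )
```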

---
## Reproducible Pipelines — The Round-Trip

![Round-Trip Reproducibility](images/push-to-hub-round-trip.png){ width="800" }

Here's the payoff: every dataset you push includes `builder_config.json` — the
full SDG pipeline definition. Anyone (including future-you) can recreate the
exact same pipeline from the Hugging Face URL:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder.from_config(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
)
```

That's it. One line. `from_config` accepts a raw URL, a local file path, a dict,
or a YAML string. When you hand it a Hugging Face Hub URL, it auto-rewrites the
blob URL to a raw URL behind the scenes so the fetch just works (same trick for
GitHub blob URLs).
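
The rewrite itself is mostly string surgery: Hugging Face serves raw file bytes at `/resolve/` instead of `/blob/`, and GitHub raw files live on `raw.githubusercontent.com`. A hypothetical version of that trick (not necessarily the library's code):

```python
def to_raw_url(url: str) -> str:
    """Illustrative rewrite of Hub/GitHub blob URLs to fetchable raw URLs."""
    if "huggingface.co" in url:
        # HF serves raw file contents at /resolve/ instead of /blob/
        return url.replace("/blob/", "/resolve/")
    if url.startswith("https://github.com/") and "/blob/" in url:
        # GitHub raw files: swap host and drop the /blob/ segment
        rewritten = url.replace("https://github.com/", "https://raw.githubusercontent.com/")
        return rewritten.replace("/blob/", "/")
    return url

print(to_raw_url(
    "https://huggingface.co/datasets/my-org/multilingual-greetings/blob/main/builder_config.json"
))
# https://huggingface.co/datasets/my-org/multilingual-greetings/resolve/main/builder_config.json
```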

The loaded config builder comes back fully hydrated — columns, model configs,
constraints, seed config, all of it. You can inspect it, tweak it, and re-run:

```python
from data_designer.interface import DataDesigner

# Maybe bump the count or swap a model
results = DataDesigner().create(config_builder, num_records=50_000)

# And push the new version right back
results.push_to_hub(
    "my-org/multilingual-greetings-v2",
    "50k version with the same pipeline.",
)
```

So the full loop is: **design → generate → push → share URL → recreate → iterate**.
The `builder_config.json` on Hugging Face *is* the reproducibility artifact.

---

## Gotchas

- **`repo_id` must be `username/dataset-name`** — exactly one slash. The client
  validates this before hitting the network.
- **`description` is required** — it's the prose that appears right under the
  title on the dataset card. Make it good.
- **`private=True`** if you don't want the world to see your dataset yet.
- **Metadata paths get rewritten** — local paths like `parquet-files/batch_00000.parquet`
  become `data/batch_00000.parquet` in the uploaded `metadata.json` so references
  stay valid on HF.
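
The `repo_id` rule is cheap to replicate client-side before you ever call `push_to_hub`. A hypothetical pre-flight validator (`validate_repo_id` is an illustration, not the client's actual function):

```python
def validate_repo_id(repo_id: str) -> None:
    """Illustrative pre-flight check mirroring the 'exactly one slash' rule."""
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"repo_id must look like 'username/dataset-name', got {repo_id!r}"
        )

validate_repo_id("my-org/multilingual-greetings")  # fine
try:
    validate_repo_id("just-a-name")
except ValueError as e:
    print(e)
```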

mkdocs.yml (1 addition, 0 deletions)

```diff
@@ -69,6 +69,7 @@ nav:
       - Design Principles: devnotes/posts/design-principles.md
       - RQA Dataset: devnotes/posts/rqa.md
       - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
+      - Push Datasets to Hugging Face Hub: devnotes/posts/push-datasets-to-hugging-face-hub.md

 theme:
   name: material
```