Commit 236f62b
feat: add HuggingFace Hub integration for dataset publishing (#275)
* feat: add push_to_hub integration for HuggingFace datasets
Implement HuggingFace Hub integration to upload DataDesigner datasets:
- Add HuggingFaceHubClient with upload_dataset method
- Upload main parquet files to data/ subset
- Upload processor outputs to data/{processor_name}/ subsets
- Generate dataset card from metadata.json with column statistics
- Include sdg.json and metadata.json configuration files
- Comprehensive validation and error handling
- Add push_to_hub() method to DatasetCreationResults
* feat: improve push_to_hub with logging, path mapping, and config definitions
- Add progress logging with emojis following codebase style
- Add repository exists check before creation
- Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/)
- Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links
- Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs
- Set data config as default configuration
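The path mapping and explicit configs described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the prefix rewrites (`parquet-files/` → `data/`, `processors-files/{name}/` → `{name}/`) and the default `data` config come from the commit message, while the function names `to_hub_path` and `build_configs_frontmatter` and the `*.parquet` glob are assumptions. The `configs` / `config_name` / `data_files` / `default` keys follow the Hugging Face dataset card YAML convention.

```python
from pathlib import PurePosixPath

# Directory names quoted in the commit message; the helpers below are hypothetical.
MAIN_LOCAL_DIR = "parquet-files"
PROCESSORS_LOCAL_DIR = "processors-files"
MAIN_CONFIG = "data"


def to_hub_path(local_path: str) -> str:
    """Map a local artifact path to its assumed location in the Hub repository."""
    parts = PurePosixPath(local_path).parts
    if parts and parts[0] == MAIN_LOCAL_DIR:
        # parquet-files/... -> data/...
        return str(PurePosixPath(MAIN_CONFIG, *parts[1:]))
    if len(parts) >= 2 and parts[0] == PROCESSORS_LOCAL_DIR:
        # processors-files/{name}/... -> {name}/...
        return str(PurePosixPath(*parts[1:]))
    # Anything else (e.g. metadata.json) keeps its path.
    return local_path


def build_configs_frontmatter(processor_names: list[str]) -> str:
    """Build YAML frontmatter with one config per subset; 'data' is the default."""
    lines = [
        "---",
        "configs:",
        f"- config_name: {MAIN_CONFIG}",
        f'  data_files: "{MAIN_CONFIG}/*.parquet"',
        "  default: true",
    ]
    for name in processor_names:
        lines += [f"- config_name: {name}", f'  data_files: "{name}/*.parquet"']
    lines.append("---")
    return "\n".join(lines)
```

Declaring each subset as its own config keeps the main dataset and the processor outputs from being loaded under a single schema, which is the mismatch the commit message refers to.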
* feat: add optional description parameter to push_to_hub
- Add description parameter to push_to_hub() for custom dataset card content
- Description appears after NeMo Data Designer intro section
- Update dataset card template to conditionally render custom description
- Add tests for with/without custom description scenarios
* feat: make description required and enhance dataset card design
- Make description parameter required in push_to_hub()
- Improve dataset card layout with flexbox header (title + right-aligned tagline)
- Add horizontal dividers between sections for visual separation
- Add emoji icons to section headers for better readability
- Move About NeMo Data Designer section after Citation
- Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About
- Update all tests to provide required description parameter
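The required description and the fixed section order can be illustrated with a small renderer. This is a sketch under stated assumptions, not the template shipped in this commit: `render_card`, `SECTION_ORDER`, and the plain `##` headings are hypothetical, and the emoji icons and flexbox header are omitted. Only the section order, the dividers between sections, and the required description are taken from the commit message.

```python
# Section order listed in the commit message (Description is always rendered first).
SECTION_ORDER = [
    "Quick Start",
    "Dataset Summary",
    "Schema & Statistics",
    "Generation Details",
    "Citation",
    "About NeMo Data Designer",
]


def render_card(description: str, sections: dict[str, str]) -> str:
    """Render a dataset card body: description first, then known sections in order."""
    if not description.strip():
        # The description is required, so an empty one is rejected up front.
        raise ValueError("description is required")
    blocks = [f"## Description\n\n{description.strip()}"]
    for title in SECTION_ORDER:
        body = sections.get(title)
        if body:
            blocks.append(f"## {title}\n\n{body.strip()}")
    # A horizontal divider separates consecutive sections.
    return "\n\n---\n\n".join(blocks)
```

Sections missing from the input are skipped, and the order of keys in `sections` does not affect the output.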
* fix license headers
* remove modality detection
* break up upload_dataset
* make token private
* HuggingFace -> Hugging Face
* remove inline imports
* simplify tests + remove create pr option for simplicity
* Update packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* use consistent indentation
* fix temp file clean up
* huggingface hub already a dep in engine
* add missing spaces
* reuse vars from artifact_storage.py
* pull out hf hub datasets url to constants
* HuggingfaceUploadError -> HuggingFaceHubClientUploadError
* defer to hfhub repo validation
* Update packages/data-designer/src/data_designer/integrations/huggingface/client.py
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* Update packages/data-designer/src/data_designer/interface/results.py
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* Update packages/data-designer/src/data_designer/integrations/huggingface/client.py
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* allow custom tags
* change sdg.json -> builder_config.json
---------
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
1 parent 13c4ade commit 236f62b
File tree
10 files changed, +1564 -5 lines changed
- packages
  - data-designer-config/src/data_designer/config/utils
  - data-designer-engine/src/data_designer/engine/dataset_builders
  - data-designer
    - src/data_designer
      - integrations/huggingface
      - interface
    - tests/integrations/huggingface
Lines changed: 2 additions & 0 deletions
Lines changed: 8 additions & 5 deletions
Lines changed: 7 additions & 0 deletions