
Commit 236f62b

feat: add HuggingFace Hub integration for dataset publishing (#275)
* feat: add push_to_hub integration for HuggingFace datasets

  Implement HuggingFace Hub integration to upload DataDesigner datasets:
  - Add HuggingFaceHubClient with upload_dataset method
  - Upload main parquet files to data/ subset
  - Upload processor outputs to data/{processor_name}/ subsets
  - Generate dataset card from metadata.json with column statistics
  - Include sdg.json and metadata.json configuration files
  - Comprehensive validation and error handling
  - Add push_to_hub() method to DatasetCreationResults

* feat: improve push_to_hub with logging, path mapping, and config definitions
  - Add progress logging with emojis following codebase style
  - Add repository exists check before creation
  - Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/)
  - Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links
  - Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs
  - Set data config as default configuration

* feat: add optional description parameter to push_to_hub
  - Add description parameter to push_to_hub() for custom dataset card content
  - Description appears after NeMo Data Designer intro section
  - Update dataset card template to conditionally render custom description
  - Add tests for with/without custom description scenarios

* feat: make description required and enhance dataset card design
  - Make description parameter required in push_to_hub()
  - Improve dataset card layout with flexbox header (title + right-aligned tagline)
  - Add horizontal dividers between sections for visual separation
  - Add emoji icons to section headers for better readability
  - Move About NeMo Data Designer section after Citation
  - Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About
  - Update all tests to provide required description parameter

* fix license headers
* remove modality detection
* break up upload_dataset
* make token private
* HuggingFace -> Hugging Face
* remove inline imports
* simplify tests + remove create PR option for simplicity
* Update packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* use consistent indentation
* fix temp file cleanup
* huggingface hub already a dep in engine
* add missing spaces
* reuse vars from artifact_storage.py
* pull out HF Hub datasets URL to constants
* HuggingfaceUploadError -> HuggingFaceHubClientUploadError
* defer to hfhub repo validation
* Update packages/data-designer/src/data_designer/integrations/huggingface/client.py
  Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* Update packages/data-designer/src/data_designer/interface/results.py
  Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* Update packages/data-designer/src/data_designer/integrations/huggingface/client.py
  Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
* allow custom tags
* change sdg.json -> builder_config.json

---------

Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
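One of the changes above rewrites metadata.json paths from the local artifact layout (parquet-files/, processors-files/{name}/) to the Hugging Face Hub repo layout (data/, {name}/). The sketch below illustrates that mapping; the function name map_local_path_to_hub is hypothetical and not part of the actual implementation.

```python
# Illustrative sketch of the metadata.json path rewrite described in the
# commit message. The constants mirror artifact_storage.py; the function
# name is an assumption for this example only.

FINAL_DATASET_FOLDER_NAME = "parquet-files"
PROCESSORS_OUTPUTS_FOLDER_NAME = "processors-files"


def map_local_path_to_hub(path: str) -> str:
    """Translate a local artifact path into its Hugging Face Hub repo path."""
    if path.startswith(FINAL_DATASET_FOLDER_NAME + "/"):
        # parquet-files/... -> data/...
        return "data/" + path[len(FINAL_DATASET_FOLDER_NAME) + 1 :]
    if path.startswith(PROCESSORS_OUTPUTS_FOLDER_NAME + "/"):
        # processors-files/{name}/... -> {name}/...
        return path[len(PROCESSORS_OUTPUTS_FOLDER_NAME) + 1 :]
    # Config files such as metadata.json stay at the repo root.
    return path
```

Mapping the main dataset to a data/ subset while each processor output gets its own subset is what makes the explicit YAML-frontmatter configs necessary: the main dataset and processor outputs have different schemas, so they cannot share one auto-detected config.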
1 parent 13c4ade commit 236f62b

File tree

10 files changed: +1564 −5 lines


packages/data-designer-config/src/data_designer/config/utils/constants.py

Lines changed: 2 additions & 0 deletions
@@ -372,3 +372,5 @@ class NordColor(Enum):
 LOCALES_WITH_MANAGED_DATASETS = list[str](NEMOTRON_PERSONAS_DATASET_SIZES.keys())
 
 NEMOTRON_PERSONAS_DATASET_PREFIX = "nemotron-personas-dataset-"
+
+HUGGINGFACE_HUB_DATASET_URL_PREFIX = "https://huggingface.co/datasets/"
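The new constant centralizes the Hub datasets URL (per the "pull out HF Hub datasets URL to constants" change). A minimal sketch of how such a constant composes a dataset link; the build_dataset_url helper is illustrative, not code from this commit:

```python
# The constant added in this diff; the helper below is an assumption
# showing how it would typically be used to build a public dataset URL.
HUGGINGFACE_HUB_DATASET_URL_PREFIX = "https://huggingface.co/datasets/"


def build_dataset_url(repo_id: str) -> str:
    """Return the public Hub URL for a dataset repo id like 'org/name'."""
    return HUGGINGFACE_HUB_DATASET_URL_PREFIX + repo_id
```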

packages/data-designer-engine/src/data_designer/engine/dataset_builders/artifact_storage.py

Lines changed: 8 additions & 5 deletions
@@ -24,7 +24,10 @@
 logger = logging.getLogger(__name__)
 
 BATCH_FILE_NAME_FORMAT = "batch_{batch_number:05d}.parquet"
-SDG_CONFIG_FILENAME = "sdg.json"
+SDG_CONFIG_FILENAME = "builder_config.json"
+METADATA_FILENAME = "metadata.json"
+FINAL_DATASET_FOLDER_NAME = "parquet-files"
+PROCESSORS_OUTPUTS_FOLDER_NAME = "processors-files"
 
 
 class BatchStage(StrEnum):
@@ -37,10 +40,10 @@ class BatchStage(StrEnum):
 class ArtifactStorage(BaseModel):
     artifact_path: Path | str
     dataset_name: str = "dataset"
-    final_dataset_folder_name: str = "parquet-files"
+    final_dataset_folder_name: str = FINAL_DATASET_FOLDER_NAME
     partial_results_folder_name: str = "tmp-partial-parquet-files"
     dropped_columns_folder_name: str = "dropped-columns-parquet-files"
-    processors_outputs_folder_name: str = "processors-files"
+    processors_outputs_folder_name: str = PROCESSORS_OUTPUTS_FOLDER_NAME
 
     @property
     def artifact_path_exists(self) -> bool:
@@ -72,7 +75,7 @@ def final_dataset_path(self) -> Path:
     @property
     def metadata_file_path(self) -> Path:
-        return self.base_dataset_path / "metadata.json"
+        return self.base_dataset_path / METADATA_FILENAME
 
     @property
     def partial_results_path(self) -> Path:
@@ -259,7 +262,7 @@ def write_metadata(self, metadata: dict) -> Path:
         """
         self.mkdir_if_needed(self.base_dataset_path)
         with open(self.metadata_file_path, "w") as file:
-            json.dump(metadata, file, indent=4, sort_keys=True)
+            json.dump(metadata, file, indent=2, sort_keys=True)
         return self.metadata_file_path
 
     def update_metadata(self, updates: dict) -> Path:
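The write_metadata change switches the serialization from indent=4 to indent=2 while keeping sorted keys. A self-contained stand-in for the method's behavior after this change; the free-function shape and directory handling here are simplifications for the sketch, not the actual class method:

```python
# Simplified stand-in for ArtifactStorage.write_metadata after this commit:
# metadata is written with two-space indentation and sorted keys, which keeps
# the file compact and deterministic across runs.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

METADATA_FILENAME = "metadata.json"


def write_metadata(base_dataset_path: Path, metadata: dict) -> Path:
    """Write metadata.json under base_dataset_path and return its path."""
    base_dataset_path.mkdir(parents=True, exist_ok=True)
    metadata_file_path = base_dataset_path / METADATA_FILENAME
    with open(metadata_file_path, "w") as file:
        json.dump(metadata, file, indent=2, sort_keys=True)
    return metadata_file_path


with TemporaryDirectory() as tmp:
    path = write_metadata(Path(tmp), {"num_records": 100, "columns": ["text"]})
    text = path.read_text()
```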
packages/data-designer/src/data_designer/integrations/huggingface/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from data_designer.integrations.huggingface.client import HuggingFaceHubClient, HuggingFaceHubClientUploadError
+from data_designer.integrations.huggingface.dataset_card import DataDesignerDatasetCard
+
+__all__ = ["HuggingFaceHubClient", "HuggingFaceHubClientUploadError", "DataDesignerDatasetCard"]
