feat(results): add export() method and --output-format CLI flag#540
przemekboruta wants to merge 6 commits into NVIDIA-NeMo:main
Conversation
Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files.
Greptile Summary
| Filename | Overview |
|---|---|
| packages/data-designer/src/data_designer/interface/results.py | Adds export() method with jsonl/csv/parquet support via streaming; format inferred from extension or explicit arg; schema unification for parquet; well-structured with good test coverage. |
| packages/data-designer/src/data_designer/cli/controllers/generation_controller.py | Early format validation added at top of run_create() before any generation work; export called after generation with path inferred from output_format; success message moved after export step. |
| packages/data-designer/src/data_designer/cli/commands/create.py | Adds --output-format / -f option with click.Choice validation derived from SUPPORTED_EXPORT_FORMATS; cleanly forwards to controller. |
| packages/data-designer/tests/interface/test_results.py | Comprehensive export tests covering all formats, multi-batch streaming, schema unification, extension inference, format override, empty dir, and Path/str return type. |
| packages/data-designer/tests/cli/controllers/test_generation_controller.py | New tests validate early exit on bad format, export called with correct path, and export failure exits with code 1. |
| packages/data-designer/tests/cli/commands/test_create_command.py | Existing delegation tests updated with output_format=None; new test verifies output_format="jsonl" forwarded correctly to controller. |
| packages/data-designer/tests/cli/test_main.py | Existing dispatch test updated with output_format=None; no new logic. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as create_command (CLI)
    participant Ctrl as GenerationController
    participant DD as DataDesigner
    participant Results as DatasetCreationResults
    participant FS as Filesystem
    User->>CLI: data-designer create config.yaml -f jsonl
    Note over CLI: click.Choice validates format
    CLI->>Ctrl: run_create(..., output_format="jsonl")
    Note over Ctrl: Early validation against SUPPORTED_EXPORT_FORMATS
    Ctrl->>DD: create(config, num_records)
    DD-->>Ctrl: DatasetCreationResults
    Ctrl->>Results: load_dataset() → count records
    Ctrl->>Results: load_analysis()
    Ctrl->>Results: export(base_path/"dataset.jsonl")
    Note over Results: Infers format from .jsonl extension
    loop each batch_*.parquet
        Results->>FS: read_parquet(batch_file)
        Results->>FS: append to output file
    end
    Results-->>Ctrl: Path("dataset.jsonl")
    Ctrl->>User: print success + artifact/export paths
```
Path: packages/data-designer/src/data_designer/cli/controllers/generation_controller.py
Line: 174-175
Comment:
**`export()` call omits explicit `format=` argument**
`results.export(export_path)` is called without passing `format=output_format`. The method correctly infers the format from the `.{output_format}` extension because the path is constructed as `f"dataset.{output_format}"`, so this works today. Explicitly passing the already-validated format avoids relying on extension inference and makes the intent clearer at the call site.
```suggestion
results.export(export_path, format=output_format)
except Exception as e:
```
Reviews (6). Last reviewed commit: "feat(results): stream batch files in exp…"
andreatgretel
left a comment
Thanks for this @przemekboruta - export() fills a real gap and the core implementation is clean. A few things to tighten up before merging:
- Per CONTRIBUTING.md, features should have an associated issue. Could you open one retroactively and link it here with `Closes #NNN`?
- The success message / export ordering and double dataset load need fixing (see inline comments)
- Need controller-level tests for the new `output_format` path - happy path, bad format rejection, and export failure
- A couple of style guide items flagged inline (error types, import placement)
Left details in inline comments. Nice catch on the lazy import fix in the third commit.
…, import hygiene

- Derive `SUPPORTED_EXPORT_FORMATS` from `get_args(ExportFormat)` so the two can't drift apart
- Replace `ValueError` with `InvalidFileFormatError` in `export()` — consistent with project error conventions
- Add `date_format="iso"` to `to_json()` for consistent datetime serialization across formats
- Add `click.Choice(SUPPORTED_EXPORT_FORMATS)` to `--output-format` CLI option for parse-time validation, better `--help` output, and tab completion
- Fix double `load_dataset()` in `run_create`: inline `len()` so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in test_results.py (json, Path, lazy already imported)
- Add controller-level tests for `output_format` happy path, bad format rejection, and export failure
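The "can't drift apart" point works by deriving the runtime tuple from the typing `Literal`, so there is a single source of truth. A minimal sketch (names match the PR; the literal values are assumed):

```python
from typing import Literal, get_args

# Single source of truth: the Literal type drives the runtime tuple,
# so adding a format to one place updates both.
ExportFormat = Literal["jsonl", "csv", "parquet"]
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)
```

Type checkers validate `format: ExportFormat` annotations statically, while `SUPPORTED_EXPORT_FORMATS` feeds runtime checks and `click.Choice`.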
hey @andreatgretel! thanks for the review. All points seem to be addressed right now.
Issue #566 has been triaged. The linked issue check is being re-evaluated.
andreatgretel
left a comment
Thanks @przemekboruta, all review comments are addressed - approving!
We should be able to merge soon. Seeking one more approval - cc @NVIDIA-NeMo/data-designer-reviewers.
Heads up on OOM risk with large datasets

The proposed `export()` materialises the full dataset in memory via `load_dataset()` before writing. The data is already on disk as individual batch files (`batch_*.parquet` under `artifact_storage.final_dataset_path`), so export can instead stream them one at a time, keeping peak memory proportional to a single batch regardless of dataset size.

One nuance for the parquet path: batch files can have slightly mismatched schemas (e.g., a column inferred as `int64` in one batch and `float64` in another), so the schemas should be unified before writing a single output file.

Example implementation sketch:

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

SUPPORTED_FORMATS = {"jsonl", "csv", "parquet"}


def export(self, path: str | Path) -> Path:
    output = Path(path)
    fmt = output.suffix.lstrip(".")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported file extension {output.suffix!r}. "
            f"Use one of: {', '.join(SUPPORTED_FORMATS)}"
        )
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("*.parquet"))
    if not batch_files:
        raise FileNotFoundError("No batch parquet files found.")
    if fmt == "jsonl":
        _export_jsonl(batch_files, output)
    elif fmt == "csv":
        _export_csv(batch_files, output)
    elif fmt == "parquet":
        _export_parquet(batch_files, output)
    return output


def _export_jsonl(batch_files: list[Path], output: Path) -> None:
    with output.open("w") as f:
        for batch_file in batch_files:
            chunk = pd.read_parquet(batch_file)
            f.write(chunk.to_json(orient="records", lines=True))
            f.write("\n")


def _export_csv(batch_files: list[Path], output: Path) -> None:
    for i, batch_file in enumerate(batch_files):
        chunk = pd.read_parquet(batch_file)
        chunk.to_csv(output, mode="a", header=(i == 0), index=False)


def _export_parquet(batch_files: list[Path], output: Path) -> None:
    schemas = [pq.read_schema(f) for f in batch_files]
    # Permissive promotion handles type drift such as int64 vs float64
    # (promote_options requires pyarrow >= 14).
    unified = pa.unify_schemas(schemas, promote_options="permissive")
    with pq.ParquetWriter(output, unified) as writer:
        for batch_file in batch_files:
            table = pq.read_table(batch_file)
            table = table.cast(unified)
            writer.write_table(table)
```

Usage:

```python
results.export("output.jsonl")
results.export("output.csv")
results.export("output.parquet")
```
…atasets

- Rewrite `export()` to read batch parquet files one at a time instead of materialising the full dataset via `load_dataset()`; peak memory is now proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; `format=` parameter kept as an explicit override (e.g. writing `.txt` as JSONL)
- `_export_parquet` unifies schemas across batches (`pa.unify_schemas`) to handle type drift (e.g. `int64` vs `float64` in the same column)
- Drop `format=` from the controller's `export()` call — path already carries the correct extension
- Rewrite export tests around real batch parquet files (`stub_batch_dir` fixture); add tests for multi-batch output, schema unification, unknown extension, empty batch directory, and explicit format override
Hey @nabinchha! I totally agree with your proposal and I've implemented it. Looking forward to your feedback. PR description updated.
Summary
Closes #566
- `DatasetCreationResults.export(path, format=)` supporting jsonl, csv, and parquet formats
- `--output-format` / `-f` flag to the `data-designer create` CLI command with `click.Choice` validation; writes `dataset.<format>` alongside the parquet batch files
- `format=` is an optional explicit override (e.g. write a `.txt` file as JSONL)
- `SUPPORTED_EXPORT_FORMATS` is derived from `get_args(ExportFormat)` so the Literal and the tuple cannot drift apart
- Raises `InvalidFileFormatError` on unsupported formats (consistent with project error conventions)
- `date_format="iso"` for consistent datetime serialization across formats

Implementation notes
Streaming export (no OOM risk)
`export()` never materialises the full dataset in memory. Instead of calling `load_dataset()`, it reads the individual batch parquet files from `artifact_storage.final_dataset_path` one at a time and appends each to the output file, keeping peak memory proportional to a single batch regardless of dataset size:

- `pyarrow.parquet.ParquetWriter` is used to write each batch as a row group into one output file; schemas are unified with `pa.unify_schemas` upfront to handle type drift (e.g. `int64` vs `float64` in the same column across batches)

Format inference
CLI:
Test plan
- `test_export_writes_file` — parametrized over all 3 formats
- `test_export_jsonl_content` — each line is valid JSON, all records present
- `test_export_csv_content` — single header row, full record count
- `test_export_parquet_content` — DataFrame round-trip across two batches
- `test_export_infers_format_from_extension` — format omitted, inferred from `.jsonl`
- `test_export_explicit_format_overrides_extension` — `.txt` written as JSONL
- `test_export_parquet_schema_unification` — two batches with diverging column types merged correctly
- `test_export_unknown_extension_raises` — raises `InvalidFileFormatError`
- `test_export_unsupported_explicit_format_raises` — raises `InvalidFileFormatError`
- `test_export_no_batch_files_raises` — raises `ArtifactStorageError`
- `test_export_returns_path_object` — returns `Path` for str input
- `test_run_create_with_output_format_happy_path` — export called with correct path
- `test_run_create_invalid_output_format_exits` — bad format exits before generation starts
- `test_run_create_export_failure_exits` — export exception exits with code 1
- Existing delegation tests updated for the `output_format` parameter