
feat(results): add export() method and --output-format CLI flag#540

Open
przemekboruta wants to merge 6 commits into NVIDIA-NeMo:main from przemekboruta:feat/dataset-export

Conversation

@przemekboruta
Contributor

@przemekboruta przemekboruta commented Apr 13, 2026

Summary

Closes #566

  • Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet formats
  • Adds --output-format / -f flag to the data-designer create CLI command with click.Choice validation; writes dataset.<format> alongside the parquet batch files
  • Output format is inferred from the file extension by default; format= is an optional explicit override (e.g. write a .txt file as JSONL)
  • SUPPORTED_EXPORT_FORMATS is derived from get_args(ExportFormat) so the Literal and the tuple cannot drift apart
  • Unsupported format raises InvalidFileFormatError (consistent with project error conventions)
  • JSONL export uses date_format="iso" for consistent datetime serialization across formats
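
The Literal-to-tuple derivation mentioned above is a standard typing pattern. A minimal sketch — the names mirror the PR, but the exact definitions here are illustrative, not the project source:

```python
from typing import Literal, get_args

# Single source of truth: the tuple is derived from the Literal, so adding a
# new format to ExportFormat automatically updates SUPPORTED_EXPORT_FORMATS.
ExportFormat = Literal["jsonl", "csv", "parquet"]
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)
```

The same tuple can then feed click.Choice, so the CLI, the type hints, and the runtime validation all agree by construction.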

Implementation notes

Streaming export (no OOM risk)

export() never materialises the full dataset in memory. Instead of calling load_dataset(), it reads the individual batch parquet files from artifact_storage.final_dataset_path one at a time and appends each to the output file, keeping peak memory proportional to a single batch regardless of dataset size:

  • JSONL / CSV — each batch is appended to the output file; CSV header is written only on the first batch
  • Parquet — pyarrow.parquet.ParquetWriter is used to write each batch as a row group into one output file; schemas are unified upfront with pa.unify_schemas to handle type drift (e.g. int64 vs float64 in the same column across batches)
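
The append-per-batch idea for the naturally appendable formats can be sketched with the standard library alone (the real code reads each batch with pandas; plain dicts stand in for batches here, and stream_jsonl is a hypothetical helper, not the project API):

```python
import json
from pathlib import Path
from typing import Iterable

def stream_jsonl(batches: Iterable[list[dict]], output: Path) -> None:
    # Only one batch is ever held in memory; each record is appended as one
    # JSON line, so peak memory is proportional to a single batch.
    with output.open("w") as f:
        for batch in batches:
            for record in batch:
                f.write(json.dumps(record))
                f.write("\n")
```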

Format inference

results.export("output.jsonl")          # inferred from extension
results.export("output.csv")            # inferred from extension
results.export("output.parquet")        # inferred from extension
results.export("output.txt", format="jsonl")  # explicit override
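
The inference itself is a one-liner over pathlib; a hedged sketch (infer_format is a hypothetical helper illustrating the idea, not the project API):

```python
from __future__ import annotations

from pathlib import Path

def infer_format(path: str | Path) -> str:
    # Strip the leading dot from the suffix; lowercase so "OUTPUT.JSONL" works.
    suffix = Path(path).suffix.lstrip(".").lower()
    if suffix not in ("jsonl", "csv", "parquet"):
        raise ValueError(f"cannot infer export format from suffix {suffix!r}")
    return suffix
```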

CLI:

data-designer create config.yaml --output-format jsonl
data-designer create config.yaml -n 500 -f csv

Test plan

  • test_export_writes_file — parametrized over all 3 formats
  • test_export_jsonl_content — each line is valid JSON, all records present
  • test_export_csv_content — single header row, full record count
  • test_export_parquet_content — DataFrame round-trip across two batches
  • test_export_infers_format_from_extension — format omitted, inferred from .jsonl
  • test_export_explicit_format_overrides_extension — .txt written as JSONL
  • test_export_parquet_schema_unification — two batches with diverging column types merged correctly
  • test_export_unknown_extension_raises — raises InvalidFileFormatError
  • test_export_unsupported_explicit_format_raises — raises InvalidFileFormatError
  • test_export_no_batch_files_raises — raises ArtifactStorageError
  • test_export_returns_path_object — returns Path for str input
  • test_run_create_with_output_format_happy_path — export called with correct path
  • test_run_create_invalid_output_format_exits — bad format exits before generation starts
  • test_run_create_export_failure_exits — export exception exits with code 1
  • Existing CLI delegation tests updated for new output_format parameter

Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
@przemekboruta przemekboruta requested a review from a team as a code owner April 13, 2026 19:26
@greptile-apps
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds DatasetCreationResults.export() supporting jsonl, csv, and parquet via memory-efficient batch streaming, along with a --output-format / -f CLI flag. Both previously-flagged issues (early format validation and docstring accuracy) are resolved in this revision — validation now fires at the top of run_create() before any generation work, and the docstring correctly names InvalidFileFormatError. Test coverage is thorough across all formats and edge cases.

Confidence Score: 5/5

Safe to merge — all P0/P1 concerns from prior rounds are addressed, and the single remaining note is a minor style suggestion.

Both previously-raised blocking issues (format validation ordering, docstring mismatch) are now fixed. The one remaining comment is a P2 style preference about passing format= explicitly at the call site — it does not affect correctness. All new code paths have dedicated parametrized tests.

No files require special attention.

Important Files Changed

  • packages/data-designer/src/data_designer/interface/results.py — adds export() with jsonl/csv/parquet support via streaming; format inferred from extension or explicit arg; schema unification for parquet; well-structured with good test coverage
  • packages/data-designer/src/data_designer/cli/controllers/generation_controller.py — early format validation added at the top of run_create() before any generation work; export called after generation with the path inferred from output_format; success message moved after the export step
  • packages/data-designer/src/data_designer/cli/commands/create.py — adds the --output-format / -f option with click.Choice validation derived from SUPPORTED_EXPORT_FORMATS; cleanly forwards to the controller
  • packages/data-designer/tests/interface/test_results.py — comprehensive export tests covering all formats, multi-batch streaming, schema unification, extension inference, format override, empty dir, and Path/str return type
  • packages/data-designer/tests/cli/controllers/test_generation_controller.py — new tests validate early exit on bad format, export called with the correct path, and export failure exiting with code 1
  • packages/data-designer/tests/cli/commands/test_create_command.py — existing delegation tests updated with output_format=None; new test verifies output_format="jsonl" is forwarded correctly to the controller
  • packages/data-designer/tests/cli/test_main.py — existing dispatch test updated with output_format=None; no new logic

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as create_command (CLI)
    participant Ctrl as GenerationController
    participant DD as DataDesigner
    participant Results as DatasetCreationResults
    participant FS as Filesystem

    User->>CLI: data-designer create config.yaml -f jsonl
    Note over CLI: click.Choice validates format
    CLI->>Ctrl: run_create(..., output_format="jsonl")
    Note over Ctrl: Early validation against SUPPORTED_EXPORT_FORMATS
    Ctrl->>DD: create(config, num_records)
    DD-->>Ctrl: DatasetCreationResults
    Ctrl->>Results: load_dataset() → count records
    Ctrl->>Results: load_analysis()
    Ctrl->>Results: export(base_path/"dataset.jsonl")
    Note over Results: Infers format from .jsonl extension
    loop each batch_*.parquet
        Results->>FS: read_parquet(batch_file)
        Results->>FS: append to output file
    end
    Results-->>Ctrl: Path("dataset.jsonl")
    Ctrl->>User: print success + artifact/export paths
Inline comment on packages/data-designer/src/data_designer/cli/controllers/generation_controller.py (lines 174-175):

**`export()` call omits explicit `format=` argument**

`results.export(export_path)` is called without passing `format=output_format`. The method correctly infers the format from the `.{output_format}` extension because the path is constructed as `f"dataset.{output_format}"`, so this works today. Explicitly passing the already-validated format avoids relying on extension inference and makes the intent clearer at the call site.

```suggestion
                results.export(export_path, format=output_format)
            except Exception as e:
```




@andreatgretel andreatgretel left a comment


Thanks for this @przemekboruta - export() fills a real gap and the core implementation is clean. A few things to tighten up before merging:

  • Per CONTRIBUTING.md, features should have an associated issue. Could you open one retroactively and link it here with Closes #NNN?
  • The success message / export ordering and double dataset load need fixing (see inline comments)
  • Need controller-level tests for the new output_format path - happy path, bad format rejection, and export failure
  • A couple of style guide items flagged inline (error types, import placement)

Left details in inline comments. Nice catch on the lazy import fix in the third commit.

…, import hygiene

- Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart
- Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions
- Add date_format="iso" to to_json() for consistent datetime serialization across formats
- Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time
  validation, better --help output, and tab completion
- Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in test_results.py (json, Path, lazy already imported)
- Add controller-level tests for output_format happy path, bad format rejection, and export failure
@przemekboruta
Contributor Author

hey @andreatgretel! thanks for the review. All points should be addressed now.

@github-actions
Contributor

Issue #566 has been triaged. The linked issue check is being re-evaluated.

andreatgretel previously approved these changes Apr 22, 2026

@andreatgretel andreatgretel left a comment


Thanks @przemekboruta, all review comments are addressed - approving!

We should be able to merge soon. Seeking one more approval - cc @NVIDIA-NeMo/data-designer-reviewers.

@nabinchha
Contributor

Heads up on OOM risk with large datasets

The proposed export() method should avoid calling load_dataset() internally. Today, load_dataset() materializes the entire dataset into a single pandas DataFrame via pd.read_parquet() -- for a large partitioned dataset, this will OOM.

The data is already on disk as individual batch files (batch_00000.parquet, batch_00001.parquet, ...), so export() can stream batch-by-batch into a single output file without ever holding the full dataset in memory:

  • JSONL / CSV -- iterate over batch files, append each chunk to the output file. These formats are naturally appendable, so this produces one valid file with memory proportional to a single batch.
  • Parquet -- use pyarrow.parquet.ParquetWriter to write each batch as a row group into one output file.

One nuance for the parquet path: batch files can have slightly mismatched schemas (e.g., a column inferred as int64 in one batch and float64 in another). The fix is cheap -- read just the metadata from each batch file upfront with pq.read_schema(), merge with pa.unify_schemas(), and cast each batch to the unified schema before writing.

Example implementation sketch:

from __future__ import annotations

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


SUPPORTED_FORMATS = {"jsonl", "csv", "parquet"}


def export(self, path: str | Path) -> Path:
    output = Path(path)
    fmt = output.suffix.lstrip(".")

    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported file extension {output.suffix!r}. "
            f"Use one of: {', '.join(SUPPORTED_FORMATS)}"
        )

    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("*.parquet"))

    if not batch_files:
        raise FileNotFoundError("No batch parquet files found.")

    if fmt == "jsonl":
        _export_jsonl(batch_files, output)
    elif fmt == "csv":
        _export_csv(batch_files, output)
    elif fmt == "parquet":
        _export_parquet(batch_files, output)

    return output


def _export_jsonl(batch_files: list[Path], output: Path) -> None:
    with output.open("w") as f:
        for batch_file in batch_files:
            chunk = pd.read_parquet(batch_file)
            f.write(chunk.to_json(orient="records", lines=True))
            f.write("\n")


def _export_csv(batch_files: list[Path], output: Path) -> None:
    for i, batch_file in enumerate(batch_files):
        chunk = pd.read_parquet(batch_file)
        chunk.to_csv(output, mode="a", header=(i == 0), index=False)


def _export_parquet(batch_files: list[Path], output: Path) -> None:
    schemas = [pq.read_schema(f) for f in batch_files]
    unified = pa.unify_schemas(schemas)

    with pq.ParquetWriter(output, unified) as writer:
        for batch_file in batch_files:
            table = pq.read_table(batch_file)
            table = table.cast(unified)
            writer.write_table(table)

Usage:

results.export("output.jsonl")
results.export("output.csv")
results.export("output.parquet")

…atasets

- Rewrite export() to read batch parquet files one at a time instead of
  materialising the full dataset via load_dataset(); peak memory is now
  proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; format= parameter
  kept as an explicit override (e.g. writing .txt as JSONL)
- _export_parquet unifies schemas across batches (pa.unify_schemas) to
  handle type drift (e.g. int64 vs float64 in the same column)
- Drop format= from the controller's export() call — path already carries
  the correct extension
- Rewrite export tests around real batch parquet files (stub_batch_dir
  fixture); add tests for multi-batch output, schema unification, unknown
  extension, empty batch directory, and explicit format override
@przemekboruta
Contributor Author

Hey @nabinchha! I totally agree with your proposal and I've implemented it. Looking forward to your feedback.

PR description updated

@przemekboruta przemekboruta requested a review from nabinchha April 22, 2026 19:05


Development

Successfully merging this pull request may close these issues.

feat: add export() method and --output-format CLI flag to DatasetCreationResults

3 participants