
feat(results): add export() method and --output-format CLI flag#540

Open
przemekboruta wants to merge 6 commits into NVIDIA-NeMo:main from przemekboruta:feat/dataset-export

Conversation

@przemekboruta
Contributor

@przemekboruta przemekboruta commented Apr 13, 2026

Summary

Closes #566

  • Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet formats
  • Adds --output-format / -f flag to the data-designer create CLI command with click.Choice validation; writes dataset.<format> alongside the parquet batch files
  • Output format is inferred from the file extension by default; format= is an optional explicit override (e.g. write a .txt file as JSONL)
  • SUPPORTED_EXPORT_FORMATS is derived from get_args(ExportFormat) so the Literal and the tuple cannot drift apart
  • Unsupported format raises InvalidFileFormatError (consistent with project error conventions)
  • JSONL export uses date_format="iso" for consistent datetime serialization across formats
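
The Literal-to-tuple derivation mentioned above is a standard typing pattern. A minimal sketch — the names mirror the PR, but the exact definitions here are illustrative, not the project source:

```python
from typing import Literal, get_args

# Single source of truth: the tuple is derived from the Literal, so adding a
# new format to ExportFormat automatically updates SUPPORTED_EXPORT_FORMATS.
ExportFormat = Literal["jsonl", "csv", "parquet"]
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)
```

The same tuple can then feed click.Choice, so the CLI, the type hints, and the runtime validation all agree by construction.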

Implementation notes

Streaming export (no OOM risk)

export() never materialises the full dataset in memory. Instead of calling load_dataset(), it reads the individual batch parquet files from artifact_storage.final_dataset_path one at a time and appends each to the output file, keeping peak memory proportional to a single batch regardless of dataset size:

  • JSONL / CSV — each batch is appended to the output file; CSV header is written only on the first batch
  • Parquet — pyarrow.parquet.ParquetWriter is used to write each batch as a row group into one output file; schemas are unified upfront with pa.unify_schemas to handle type drift (e.g. int64 vs float64 in the same column across batches)
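
The append-per-batch idea for the naturally appendable formats can be sketched with the standard library alone (the real code reads each batch with pandas; plain dicts stand in for batches here, and stream_jsonl is a hypothetical helper, not the project API):

```python
import json
from pathlib import Path
from typing import Iterable

def stream_jsonl(batches: Iterable[list[dict]], output: Path) -> None:
    # Only one batch is ever held in memory; each record is appended as one
    # JSON line, so peak memory is proportional to a single batch.
    with output.open("w") as f:
        for batch in batches:
            for record in batch:
                f.write(json.dumps(record))
                f.write("\n")
```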

Format inference

results.export("output.jsonl")          # inferred from extension
results.export("output.csv")            # inferred from extension
results.export("output.parquet")        # inferred from extension
results.export("output.txt", format="jsonl")  # explicit override
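
The inference itself is a one-liner over pathlib; a hedged sketch (infer_format is a hypothetical helper illustrating the idea, not the project API):

```python
from __future__ import annotations

from pathlib import Path

def infer_format(path: str | Path) -> str:
    # Strip the leading dot from the suffix; lowercase so "OUTPUT.JSONL" works.
    suffix = Path(path).suffix.lstrip(".").lower()
    if suffix not in ("jsonl", "csv", "parquet"):
        raise ValueError(f"cannot infer export format from suffix {suffix!r}")
    return suffix
```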

CLI:

data-designer create config.yaml --output-format jsonl
data-designer create config.yaml -n 500 -f csv

Test plan

  • test_export_writes_file — parametrized over all 3 formats
  • test_export_jsonl_content — each line is valid JSON, all records present
  • test_export_csv_content — single header row, full record count
  • test_export_parquet_content — DataFrame round-trip across two batches
  • test_export_infers_format_from_extension — format omitted, inferred from .jsonl
  • test_export_explicit_format_overrides_extension — .txt written as JSONL
  • test_export_parquet_schema_unification — two batches with diverging column types merged correctly
  • test_export_unknown_extension_raises — raises InvalidFileFormatError
  • test_export_unsupported_explicit_format_raises — raises InvalidFileFormatError
  • test_export_no_batch_files_raises — raises ArtifactStorageError
  • test_export_returns_path_object — returns Path for str input
  • test_run_create_with_output_format_happy_path — export called with correct path
  • test_run_create_invalid_output_format_exits — bad format exits before generation starts
  • test_run_create_export_failure_exits — export exception exits with code 1
  • Existing CLI delegation tests updated for new output_format parameter

Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
@przemekboruta przemekboruta requested a review from a team as a code owner April 13, 2026 19:26
@greptile-apps
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds DatasetCreationResults.export() supporting jsonl, csv, and parquet via memory-efficient batch streaming, along with a --output-format / -f CLI flag. Both previously-flagged issues (early format validation and docstring accuracy) are resolved in this revision — validation now fires at the top of run_create() before any generation work, and the docstring correctly names InvalidFileFormatError. Test coverage is thorough across all formats and edge cases.

Confidence Score: 5/5

Safe to merge — all P0/P1 concerns from prior rounds are addressed, and the single remaining note is a minor style suggestion.

Both previously-raised blocking issues (format validation ordering, docstring mismatch) are now fixed. The one remaining comment is a P2 style preference about passing format= explicitly at the call site — it does not affect correctness. All new code paths have dedicated parametrized tests.

No files require special attention.

Important Files Changed

  • packages/data-designer/src/data_designer/interface/results.py — adds export() with jsonl/csv/parquet support via streaming; format inferred from extension or explicit arg; schema unification for parquet; well-structured with good test coverage
  • packages/data-designer/src/data_designer/cli/controllers/generation_controller.py — early format validation added at the top of run_create() before any generation work; export called after generation with the path inferred from output_format; success message moved after the export step
  • packages/data-designer/src/data_designer/cli/commands/create.py — adds the --output-format / -f option with click.Choice validation derived from SUPPORTED_EXPORT_FORMATS; cleanly forwards to the controller
  • packages/data-designer/tests/interface/test_results.py — comprehensive export tests covering all formats, multi-batch streaming, schema unification, extension inference, format override, empty dir, and Path/str return type
  • packages/data-designer/tests/cli/controllers/test_generation_controller.py — new tests validate early exit on bad format, export called with the correct path, and export failure exiting with code 1
  • packages/data-designer/tests/cli/commands/test_create_command.py — existing delegation tests updated with output_format=None; new test verifies output_format="jsonl" is forwarded correctly to the controller
  • packages/data-designer/tests/cli/test_main.py — existing dispatch test updated with output_format=None; no new logic

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as create_command (CLI)
    participant Ctrl as GenerationController
    participant DD as DataDesigner
    participant Results as DatasetCreationResults
    participant FS as Filesystem

    User->>CLI: data-designer create config.yaml -f jsonl
    Note over CLI: click.Choice validates format
    CLI->>Ctrl: run_create(..., output_format="jsonl")
    Note over Ctrl: Early validation against SUPPORTED_EXPORT_FORMATS
    Ctrl->>DD: create(config, num_records)
    DD-->>Ctrl: DatasetCreationResults
    Ctrl->>Results: load_dataset() → count records
    Ctrl->>Results: load_analysis()
    Ctrl->>Results: export(base_path/"dataset.jsonl")
    Note over Results: Infers format from .jsonl extension
    loop each batch_*.parquet
        Results->>FS: read_parquet(batch_file)
        Results->>FS: append to output file
    end
    Results-->>Ctrl: Path("dataset.jsonl")
    Ctrl->>User: print success + artifact/export paths
Inline comment on packages/data-designer/src/data_designer/cli/controllers/generation_controller.py (lines 174-175):

**`export()` call omits explicit `format=` argument**

`results.export(export_path)` is called without passing `format=output_format`. The method correctly infers the format from the `.{output_format}` extension because the path is constructed as `f"dataset.{output_format}"`, so this works today. Explicitly passing the already-validated format avoids relying on extension inference and makes the intent clearer at the call site.

```suggestion
                results.export(export_path, format=output_format)
            except Exception as e:
```




@andreatgretel andreatgretel left a comment


Thanks for this @przemekboruta - export() fills a real gap and the core implementation is clean. A few things to tighten up before merging:

  • Per CONTRIBUTING.md, features should have an associated issue. Could you open one retroactively and link it here with Closes #NNN?
  • The success message / export ordering and double dataset load need fixing (see inline comments)
  • Need controller-level tests for the new output_format path - happy path, bad format rejection, and export failure
  • A couple of style guide items flagged inline (error types, import placement)

Left details in inline comments. Nice catch on the lazy import fix in the third commit.

…, import hygiene

- Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart
- Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions
- Add date_format="iso" to to_json() for consistent datetime serialization across formats
- Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time
  validation, better --help output, and tab completion
- Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in test_results.py (json, Path, lazy already imported)
- Add controller-level tests for output_format happy path, bad format rejection, and export failure
@przemekboruta
Contributor Author

hey @andreatgretel! thanks for the review. All points should be addressed now.

@github-actions
Contributor

Issue #566 has been triaged. The linked issue check is being re-evaluated.

andreatgretel previously approved these changes Apr 22, 2026

@andreatgretel andreatgretel left a comment


Thanks @przemekboruta, all review comments are addressed - approving!

We should be able to merge soon. Seeking one more approval - cc @NVIDIA-NeMo/data-designer-reviewers.

@nabinchha
Contributor

Heads up on OOM risk with large datasets

The proposed export() method should avoid calling load_dataset() internally. Today, load_dataset() materializes the entire dataset into a single pandas DataFrame via pd.read_parquet() -- for a large partitioned dataset, this will OOM.

The data is already on disk as individual batch files (batch_00000.parquet, batch_00001.parquet, ...), so export() can stream batch-by-batch into a single output file without ever holding the full dataset in memory:

  • JSONL / CSV -- iterate over batch files, append each chunk to the output file. These formats are naturally appendable, so this produces one valid file with memory proportional to a single batch.
  • Parquet -- use pyarrow.parquet.ParquetWriter to write each batch as a row group into one output file.

One nuance for the parquet path: batch files can have slightly mismatched schemas (e.g., a column inferred as int64 in one batch and float64 in another). The fix is cheap -- read just the metadata from each batch file upfront with pq.read_schema(), merge with pa.unify_schemas(), and cast each batch to the unified schema before writing.

Example implementation sketch:

from __future__ import annotations

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


SUPPORTED_FORMATS = {"jsonl", "csv", "parquet"}


def export(self, path: str | Path) -> Path:
    output = Path(path)
    fmt = output.suffix.lstrip(".")

    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported file extension {output.suffix!r}. "
            f"Use one of: {', '.join(SUPPORTED_FORMATS)}"
        )

    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("*.parquet"))

    if not batch_files:
        raise FileNotFoundError("No batch parquet files found.")

    if fmt == "jsonl":
        _export_jsonl(batch_files, output)
    elif fmt == "csv":
        _export_csv(batch_files, output)
    elif fmt == "parquet":
        _export_parquet(batch_files, output)

    return output


def _export_jsonl(batch_files: list[Path], output: Path) -> None:
    with output.open("w") as f:
        for batch_file in batch_files:
            chunk = pd.read_parquet(batch_file)
            f.write(chunk.to_json(orient="records", lines=True))
            f.write("\n")


def _export_csv(batch_files: list[Path], output: Path) -> None:
    for i, batch_file in enumerate(batch_files):
        chunk = pd.read_parquet(batch_file)
        chunk.to_csv(output, mode="a", header=(i == 0), index=False)


def _export_parquet(batch_files: list[Path], output: Path) -> None:
    schemas = [pq.read_schema(f) for f in batch_files]
    unified = pa.unify_schemas(schemas)

    with pq.ParquetWriter(output, unified) as writer:
        for batch_file in batch_files:
            table = pq.read_table(batch_file)
            table = table.cast(unified)
            writer.write_table(table)

Usage:

results.export("output.jsonl")
results.export("output.csv")
results.export("output.parquet")

…atasets

- Rewrite export() to read batch parquet files one at a time instead of
  materialising the full dataset via load_dataset(); peak memory is now
  proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; format= parameter
  kept as an explicit override (e.g. writing .txt as JSONL)
- _export_parquet unifies schemas across batches (pa.unify_schemas) to
  handle type drift (e.g. int64 vs float64 in the same column)
- Drop format= from the controller's export() call — path already carries
  the correct extension
- Rewrite export tests around real batch parquet files (stub_batch_dir
  fixture); add tests for multi-batch output, schema unification, unknown
  extension, empty batch directory, and explicit format override
@przemekboruta
Contributor Author

Hey @nabinchha! I totally agree with your proposal and I've implemented it. Looking forward to your feedback.

PR description updated

@przemekboruta przemekboruta requested a review from nabinchha April 22, 2026 19:05


Development

Successfully merging this pull request may close these issues.

feat: add export() method and --output-format CLI flag to DatasetCreationResults

3 participants