Expose WriteParams (rows/file size, storage version, stable row ids) via LanceWriteParams #15

@LuciferYang

Summary

Expose the most useful upstream WriteParams fields through a new LanceWriteParams C struct and a lance_dataset_write_with_params() overload. Follow-up to the core lance_dataset_write() ticket.

Motivation

The base lance_dataset_write() always uses the default write parameters. Production use cases need control over file and row-group sizes, the Lance file format version, and opt-in features such as stable row ids. Exposing these avoids having to reopen and rewrite datasets just to tune their output shape.

This addresses the dataset-write half of #7. Extending lance_write_fragments with the same LanceWriteParams struct is a natural follow-up and is tracked in #7 itself.

Proposed API

C:

#include <stdbool.h>  /* bool */
#include <stdint.h>   /* uint64_t, int32_t */

typedef struct {
    uint64_t    max_rows_per_file;     /* 0 = default; upstream type is usize */
    uint64_t    max_rows_per_group;    /* 0 = default; upstream type is usize */
    uint64_t    max_bytes_per_file;    /* 0 = default (~90 GB upstream); upstream type is usize */
    const char* data_storage_version;  /* NULL = default; parsed via LanceFileVersion::from_str
                                          (e.g. "2.0", "2.1", "stable", "legacy") */
    bool        enable_stable_row_ids; /* default false */
} LanceWriteParams;

int32_t lance_dataset_write_with_params(
    const char* uri,
    const struct ArrowSchema* schema,
    struct ArrowArrayStream* stream,
    LanceWriteMode mode,
    const LanceWriteParams* params, /* NULL = defaults */
    const char* const* storage_opts,
    LanceDataset** out_dataset       /* NULL = discard */
);

lance_dataset_write (from the base ticket) becomes a thin delegator that forwards params = NULL. This is additive — no #[deprecated] marker; both forms remain supported indefinitely.
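
To make the "0 / NULL = default" convention concrete, here is a minimal sketch of how a possibly-NULL params pointer could be resolved into effective values. The helper name sketch_effective_params and the SKETCH_DEFAULT_* constants are hypothetical and not part of the proposed API; the numeric defaults are assumptions for illustration only (the issue itself only states ~90 GB for max_bytes_per_file), and the real resolution would happen on the Rust side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Struct as proposed in this issue. */
typedef struct {
    uint64_t    max_rows_per_file;
    uint64_t    max_rows_per_group;
    uint64_t    max_bytes_per_file;
    const char* data_storage_version;
    bool        enable_stable_row_ids;
} LanceWriteParams;

/* ASSUMPTION: illustrative default values, not read from the binding. */
#define SKETCH_DEFAULT_ROWS_PER_FILE  (UINT64_C(1024) * 1024)
#define SKETCH_DEFAULT_ROWS_PER_GROUP UINT64_C(1024)
#define SKETCH_DEFAULT_BYTES_PER_FILE (UINT64_C(90) * 1024 * 1024 * 1024)

/* Hypothetical helper: NULL pointer or zero-valued size fields fall back
   to the defaults; non-zero fields are passed through unchanged. */
LanceWriteParams sketch_effective_params(const LanceWriteParams* params) {
    LanceWriteParams eff = {0};  /* NULL version string, stable row ids off */
    if (params != NULL) {
        eff = *params;
    }
    if (eff.max_rows_per_file == 0) {
        eff.max_rows_per_file = SKETCH_DEFAULT_ROWS_PER_FILE;
    }
    if (eff.max_rows_per_group == 0) {
        eff.max_rows_per_group = SKETCH_DEFAULT_ROWS_PER_GROUP;
    }
    if (eff.max_bytes_per_file == 0) {
        eff.max_bytes_per_file = SKETCH_DEFAULT_BYTES_PER_FILE;
    }
    /* Postcondition: every size limit is now non-zero. */
    assert(eff.max_rows_per_file != 0 && eff.max_rows_per_group != 0);
    return eff;
}
```

A caller that only wants to cap rows per file would zero-initialize the struct (LanceWriteParams p = {0};), set p.max_rows_per_file, and leave everything else at its default. A side effect of the flat-struct design is that 0 cannot be requested as an explicit limit.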

C++: overload lance::Dataset::write(...) accepting a lance::WriteParams struct.

Upstream mapping

Confirmed against upstream lance::dataset::WriteParams:

  • max_rows_per_file: usize
  • max_rows_per_group: usize
  • max_bytes_per_file: usize
  • data_storage_version: Option<LanceFileVersion> — enum; parse the C string via LanceFileVersion::from_str.
  • enable_stable_row_ids: bool

Fields intentionally not exposed from C at this stage: commit_handler, progress, write_progress, base_store_params, external_blob_mode, blob/base internals. These are Rust-side advanced options not appropriate for a thin FFI layer.
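
One consequence of the mapping above: the C fields are uint64_t while the upstream fields are usize, so on a 32-bit target a caller can pass a value that does not fit. The issue does not say how that case is handled; the sketch below assumes it should be rejected rather than silently truncated, and shows the check in C using size_t as a stand-in for usize. The helper name sketch_u64_to_size is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring a checked u64 -> usize conversion.
   Returns true and stores the value only when it fits in size_t;
   returns false and leaves *out untouched otherwise. */
bool sketch_u64_to_size(uint64_t value, size_t* out) {
    assert(out != NULL);
    if (value > (uint64_t)SIZE_MAX) {
        return false;  /* would truncate on this target */
    }
    *out = (size_t)value;
    return true;
}
```

On 64-bit targets the check can never fail, so it only matters for 32-bit builds. Under this assumption, a failed conversion would presumably surface as LANCE_ERR_INVALID_ARGUMENT, the same code the tests below expect for an invalid version string.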

Tests

  • Override max_rows_per_file → verify file count via fragment enumeration.
  • Override max_rows_per_group → verify row-group layout (via footer inspection helper if available, else behavioural check).
  • Override max_bytes_per_file with a small value → verify multiple files are produced for a sufficiently large input.
  • Override data_storage_version with "2.0" and "2.1" → verify accepted; invalid string → LANCE_ERR_INVALID_ARGUMENT.
  • Toggle enable_stable_row_ids → verify the produced fragments carry stable row-id metadata.
  • NULL params pointer → defaults used (delegator path).
  • C/C++ integration covers the params overload.
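
For the max_rows_per_file test, the expected fragment count can be computed up front and compared against the enumerated fragments. A minimal sketch, assuming the writer fills each file to exactly max_rows_per_file rows before starting the next one and that the byte limit is not reached first; the helper name sketch_expected_file_count is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical test helper: number of files expected when total_rows rows
   are split into files of at most max_rows_per_file rows each. */
uint64_t sketch_expected_file_count(uint64_t total_rows,
                                    uint64_t max_rows_per_file) {
    assert(max_rows_per_file > 0);  /* pass the resolved limit, not the 0 sentinel */
    if (total_rows == 0) {
        return 0;
    }
    /* Ceiling division written so that it cannot overflow. */
    return (total_rows - 1) / max_rows_per_file + 1;
}
```

For example, writing 1001 rows with max_rows_per_file = 100 should yield 11 files under these assumptions, the last one holding a single row.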

Locked design decisions

  • Flat struct, not a builder handle (write is one-shot).
  • data_storage_version typed as const char* parsed via upstream FromStr — accepts "2.0", "2.1", "stable", "legacy", etc.

Related

  • Depends on the base lance_dataset_write() ticket being merged first.
  • Cross-reference: "Allow specify lance file version" (#7) requests the same capability for lance_write_fragments; a follow-up can reuse LanceWriteParams there.
  • Part of the plan in docs/plan-write-path.md.
