Summary
Expose the most useful upstream WriteParams fields through a new LanceWriteParams C struct and a lance_dataset_write_with_params() overload. Follow-up to the core lance_dataset_write() ticket.
Motivation
The base lance_dataset_write() uses default write parameters. Production use cases need control over file and group sizes, the Lance file format version, and opt-in features such as stable row IDs. Exposing these avoids having to reopen and rewrite datasets to tune output shape.
This addresses the dataset-write half of #7. Extending lance_write_fragments with the same LanceWriteParams struct is a natural follow-up and is tracked in #7 itself.
Proposed API
C:
typedef struct {
    uint64_t max_rows_per_file;        /* 0 = default; upstream type is usize */
    uint64_t max_rows_per_group;       /* 0 = default; upstream type is usize */
    uint64_t max_bytes_per_file;       /* 0 = default (~90 GB upstream); upstream type is usize */
    const char *data_storage_version;  /* NULL = default; parsed via LanceFileVersion::from_str (e.g. "2.0", "2.1", "stable", "legacy") */
    bool enable_stable_row_ids;        /* default false */
} LanceWriteParams;

int32_t lance_dataset_write_with_params(
    const char *uri,
    const struct ArrowSchema *schema,
    struct ArrowArrayStream *stream,
    LanceWriteMode mode,
    const LanceWriteParams *params,    /* NULL = defaults */
    const char *const *storage_opts,
    LanceDataset **out_dataset         /* NULL = discard */
);
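Because every field treats zero, NULL, or false as "use the default", a caller only has to name the fields it wants to override. The following is a minimal caller-side sketch under that assumption. The struct is copied locally so the snippet compiles on its own, and small_files_params is a hypothetical helper, not part of the proposed API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the proposed struct so the sketch compiles on its own;
 * real code would include the library header instead. */
typedef struct {
    uint64_t max_rows_per_file;
    uint64_t max_rows_per_group;
    uint64_t max_bytes_per_file;
    const char *data_storage_version;
    bool enable_stable_row_ids;
} LanceWriteParams;

/* Hypothetical helper: override the per-file row limit and the storage
 * version, leaving every other field at its "use the default" value. */
static LanceWriteParams small_files_params(uint64_t rows_per_file)
{
    /* Designated initializers zero every member that is not named, so
     * max_rows_per_group and max_bytes_per_file stay 0 (= default) and
     * enable_stable_row_ids stays false. */
    LanceWriteParams p = {
        .max_rows_per_file = rows_per_file,
        .data_storage_version = "2.1",
    };
    return p;
}
```

The resulting struct would be passed by address as the params argument. The version string must stay alive for the duration of the call, which a string literal guarantees.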
lance_dataset_write (from the base ticket) becomes a thin delegator that forwards params = NULL. This is additive — no #[deprecated] marker; both forms remain supported indefinitely.
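The delegation described above might look roughly like the sketch below. It is written in C for illustration only; the #[deprecated] remark suggests the real FFI layer is implemented in Rust. The base function's signature (the same arguments minus params), the LanceWriteMode variant names, the zero success code, and the stub body are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins so the sketch compiles without the real headers. */
struct ArrowSchema;
struct ArrowArrayStream;
typedef struct LanceDataset LanceDataset;
typedef struct LanceWriteParams LanceWriteParams;
/* Variant names are hypothetical; the ticket only names the type. */
typedef enum {
    LANCE_WRITE_CREATE,
    LANCE_WRITE_APPEND,
    LANCE_WRITE_OVERWRITE
} LanceWriteMode;

/* Test hook: 1 if the last call received params == NULL, 0 if it did
 * not, -1 if the function was never called. */
static int last_params_was_null = -1;

/* Stub standing in for the real implementation. */
int32_t lance_dataset_write_with_params(
    const char *uri,
    const struct ArrowSchema *schema,
    struct ArrowArrayStream *stream,
    LanceWriteMode mode,
    const LanceWriteParams *params,
    const char *const *storage_opts,
    LanceDataset **out_dataset)
{
    (void)uri; (void)schema; (void)stream; (void)mode;
    (void)storage_opts; (void)out_dataset;
    last_params_was_null = (params == NULL);
    return 0; /* 0 = success is an assumption of this sketch */
}

/* Base entry point: same arguments minus params (assumed signature). */
int32_t lance_dataset_write(
    const char *uri,
    const struct ArrowSchema *schema,
    struct ArrowArrayStream *stream,
    LanceWriteMode mode,
    const char *const *storage_opts,
    LanceDataset **out_dataset)
{
    return lance_dataset_write_with_params(
        uri, schema, stream, mode, /* params = */ NULL,
        storage_opts, out_dataset);
}
```

Keeping all logic in the params variant means the two entry points cannot drift apart: any default applied for a NULL params pointer is applied identically on both paths.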
C++: overload lance::Dataset::write(...) accepting a lance::WriteParams struct.
Upstream mapping
Confirmed against upstream lance::dataset::WriteParams:
max_rows_per_file: usize
max_rows_per_group: usize
max_bytes_per_file: usize
data_storage_version: Option<LanceFileVersion> — enum; parse the C string via LanceFileVersion::from_str.
enable_stable_row_ids: bool
Fields intentionally not exposed from C at this stage: commit_handler, progress, write_progress, base_store_params, external_blob_mode, blob/base internals. These are Rust-side advanced options not appropriate for a thin FFI layer.
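Since the C struct uses uint64_t while upstream uses usize, the binding has to substitute defaults for zero and reject values that do not fit the platform word size. Below is a hedged sketch of that resolution step, with size_t standing in for usize. ResolvedSizes, resolve_one, and resolve_sizes are hypothetical names, and the defaults are passed in by the caller rather than hard-coded, because the sketch makes no claim about the exact upstream values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the proposed struct so the sketch compiles on its own. */
typedef struct {
    uint64_t max_rows_per_file;
    uint64_t max_rows_per_group;
    uint64_t max_bytes_per_file;
    const char *data_storage_version;
    bool enable_stable_row_ids;
} LanceWriteParams;

/* Hypothetical holder for the three size knobs in the native word size. */
typedef struct {
    size_t max_rows_per_file;
    size_t max_rows_per_group;
    size_t max_bytes_per_file;
} ResolvedSizes;

/* 0 selects the fallback; a value too large for size_t is rejected. */
static bool resolve_one(uint64_t requested, size_t fallback, size_t *out)
{
    if (requested == 0) {
        *out = fallback;
        return true;
    }
    if (requested > (uint64_t)SIZE_MAX) {
        return false; /* only reachable where size_t is narrower than 64 bits */
    }
    *out = (size_t)requested;
    return true;
}

/* Returns false if any field overflows size_t. A NULL params pointer
 * yields the defaults unchanged. */
bool resolve_sizes(const LanceWriteParams *params,
                   const ResolvedSizes *defaults,
                   ResolvedSizes *out)
{
    if (params == NULL) {
        *out = *defaults;
        return true;
    }
    return resolve_one(params->max_rows_per_file,
                       defaults->max_rows_per_file,
                       &out->max_rows_per_file)
        && resolve_one(params->max_rows_per_group,
                       defaults->max_rows_per_group,
                       &out->max_rows_per_group)
        && resolve_one(params->max_bytes_per_file,
                       defaults->max_bytes_per_file,
                       &out->max_bytes_per_file);
}
```

A false return would presumably map to an invalid-argument error in the real binding; the ticket does not specify this.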
Tests
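Override max_rows_per_file → verify file count via fragment enumeration.
Override max_rows_per_group → verify row-group layout (via footer inspection helper if available, else behavioural check).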
Override max_bytes_per_file with a small value → verify multiple files are produced for a sufficiently large input.
Override data_storage_version with "2.0" and "2.1" → verify accepted; invalid string → LANCE_ERR_INVALID_ARGUMENT.
Toggle enable_stable_row_ids → verify the produced fragments carry stable row-id metadata.
NULL params pointer → defaults used (delegator path).
C/C++ integration covers the params overload.
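For the max_rows_per_file test, the expected file count can be computed with a ceiling division. This is a sketch under the assumption that a single write into an empty dataset is split only by the row limit, so the byte limit never triggers first; expected_file_count is a hypothetical test helper.

```c
#include <stdint.h>

/* Expected number of data files when total_rows rows are split into
 * files of at most max_rows_per_file rows each. Returns 0 when there
 * is nothing to write or when the limit is 0 ("use the default"), in
 * which case the test cannot predict a count from this field alone. */
uint64_t expected_file_count(uint64_t total_rows, uint64_t max_rows_per_file)
{
    if (total_rows == 0 || max_rows_per_file == 0) {
        return 0;
    }
    /* Ceiling division written so it cannot overflow, unlike
     * (total_rows + max_rows_per_file - 1) / max_rows_per_file. */
    return total_rows / max_rows_per_file
         + (total_rows % max_rows_per_file != 0 ? 1u : 0u);
}
```

A test would compare this value with the number of fragments enumerated from the written dataset.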
Locked design decisions
Flat struct, not a builder handle (write is one-shot).
data_storage_version typed as const char* parsed via upstream FromStr — accepts "2.0", "2.1", "stable", "legacy", etc.
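The contract for the version string (NULL selects the default, a recognised spelling is accepted, anything else is an invalid argument) can be sketched as follows. This is illustrative only: the real parsing is delegated to upstream FromStr, whose accepted set may be larger than the four spellings named in this ticket, and the numeric error-code values and the check_storage_version name are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Numeric values are hypothetical; only the error name appears in the ticket. */
#define LANCE_OK 0
#define LANCE_ERR_INVALID_ARGUMENT 1

/* Spellings named in the ticket. The authoritative list is whatever
 * upstream LanceFileVersion::from_str accepts. */
static const char *const known_versions[] = {"2.0", "2.1", "stable", "legacy"};

/* Hypothetical validator mirroring the documented contract. */
int32_t check_storage_version(const char *version)
{
    if (version == NULL) {
        return LANCE_OK; /* NULL = use the default version */
    }
    for (size_t i = 0; i < sizeof known_versions / sizeof known_versions[0]; i++) {
        if (strcmp(version, known_versions[i]) == 0) {
            return LANCE_OK;
        }
    }
    return LANCE_ERR_INVALID_ARGUMENT;
}
```

Passing a string rather than a C enum keeps the ABI stable when upstream adds new format versions, at the cost of moving validation to run time.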
Related
Depends on the base lance_dataset_write() ticket being merged first.
Cross-reference: #7 ("Allow specify lance file version") requests the same capability for lance_write_fragments; a follow-up can reuse LanceWriteParams there.
docs/plan-write-path.md.