feat: add LanceWriteParams and lance_dataset_write_with_params#20
Conversation
…rrayStream
Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.
Two FFI details:
- mode is received as i32 and validated via LanceWriteMode::from_raw;
accepting the enum directly would be UB for out-of-range values.
- The stream is consumed via ArrowArrayStreamReader::from_raw right
after the NULL check so the "consumed on any return" contract holds.
Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.
Closes lance-format#14.
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is private. rustdoc -D warnings (the Rustdoc CI job) flags this as `private_intra_doc_links`. Rewording to describe the validation behavior without naming the private function. No API changes.
Extends the write surface with a LanceWriteParams struct covering the most
useful fields from upstream's WriteParams: max_rows_per_file,
max_rows_per_group, max_bytes_per_file, data_storage_version, and
enable_stable_row_ids. lance_dataset_write is now a thin delegator that
forwards params = NULL to lance_dataset_write_with_params — both remain
part of the ABI, no deprecations.
data_storage_version takes a string ("2.0", "2.1", "stable", "legacy",
etc.) parsed via lance_file::version::LanceFileVersion::from_str; invalid
values are rejected with LANCE_ERR_INVALID_ARGUMENT. Adds lance-file as a
main dependency (it was previously dev-only).
C++ gets a lance::WriteParams struct and a new Dataset::write overload
accepting it; the three-argument form delegates to the new one with
default params.
Tests: 5 new Rust unit tests — NULL-params-behaves-like-plain,
max_rows_per_file splits fragments (100 rows / 20 → at least 5), known
storage versions accepted ("2.0"/"2.1"/"stable"), invalid storage version
rejected, stable_row_ids toggle accepted.
Closes lance-format#15.
|
will rebase after #16 merged |
jja725
left a comment
There was a problem hiding this comment.
Nice clean layering — lance_dataset_write becomes a thin delegator and the new _with_params variant exposes only the knobs that make sense at the C boundary. A few inline notes; one carries forward from the discussion on #16.
| storage_opts: *const *const c_char, | ||
| out_dataset: *mut *mut LanceDataset, | ||
| ) -> Result<i32> { | ||
| if uri.is_null() || schema.is_null() || stream.is_null() { |
There was a problem hiding this comment.
Same stream-consumption ordering issue as discussed on #16 — when uri or schema is NULL with a valid stream, this returns before from_raw is called, so the stream's release callback is never invoked. The C caller, trusting the documented "stream is consumed regardless of return code" contract, leaks it.
Whichever PR lands first should pull stream.is_null() out and consume the stream before validating the rest. Flagging here too in case #20 lands first.
There was a problem hiding this comment.
Pulled the stream NULL check ahead of from_raw in _with_params (and dropped the uri/schema checks below it), so the stream is now consumed before any other validation can short-circuit. Mirrors the pattern that landed for plain lance_dataset_write in #16.
| })?; | ||
| target.data_storage_version = Some(version); | ||
| } | ||
| target.enable_stable_row_ids = params.enable_stable_row_ids; |
There was a problem hiding this comment.
Subtle: every other field uses 0 / NULL as a "keep upstream default" sentinel, but enable_stable_row_ids is unconditionally assigned. With bool there's no available sentinel, but the consequence is that any caller that zero-initializes LanceWriteParams (the documented way to keep defaults) will explicitly set enable_stable_row_ids = false. That happens to match upstream's current default, so it's silent today — but if upstream ever flips the default to true, every C caller silently regresses.
Options:
- Add an
enable_stable_row_ids_setcompanion flag (clunky). - Use a
int8_twith-1 = default, 0 = false, 1 = true(more verbose but matches the rest of the struct's pattern). - Document explicitly that this field is "strictly an override, no default sentinel" so future readers don't assume parity.
My preference is the doc-only fix unless upstream's default is actually expected to change.
There was a problem hiding this comment.
Went with the doc-only fix as you suggested. Both the Rust struct and the C/C++ headers now call out that enable_stable_row_ids has no default sentinel and is forwarded verbatim — so a future upstream flip is at least visible to anyone reading the struct, not silent.
| /// fields are no-ops so upstream defaults flow through. | ||
| unsafe fn apply_write_params(target: &mut WriteParams, params: &LanceWriteParams) -> Result<()> { | ||
| if params.max_rows_per_file > 0 { | ||
| target.max_rows_per_file = params.max_rows_per_file as usize; |
There was a problem hiding this comment.
Minor: params.max_rows_per_file as usize truncates on 32-bit targets if the caller passes a value > u32::MAX. Realistic use cases (rows-per-file in the millions, bytes-per-file ~90 GB default) wouldn't hit this, but a try_into().map_err(...) would surface the truncation cleanly instead of silently wrapping. Same applies to max_rows_per_group and max_bytes_per_file below.
There was a problem hiding this comment.
Good catch. Replaced the three as usize casts with a u64_to_usize(v, field) helper that uses usize::try_from and surfaces the offending field name + value as LANCE_ERR_INVALID_ARGUMENT on overflow. No realistic 64-bit caller hits it, but a 32-bit cross-build now fails loudly instead of wrapping.
|
@jja725 Thanks for your review. I will address the issues in this PR after the prerequisite PR is merged. |
|
@LuciferYang after this PR, we can make 0.1.0 release since we reached most of the roadmap items. |
|
Let me get this PR ready |
Resolves conflicts with upstream/main and addresses three review comments on PR lance-format#20: - r3140805119: stream consumption ordering — `_with_params` now consumes the stream via `from_raw` before validating uri/schema/mode, matching the pattern upstream landed for `lance_dataset_write` in lance-format#16. The documented "stream is consumed regardless of return code" contract now holds on every error path. - r3140805123: `enable_stable_row_ids` documented as "strictly an override, no default sentinel" in both Rust and C/C++ headers, so future readers don't assume defaults flow through. - r3140805126: replaced silent `as usize` casts with `try_into`-based `u64_to_usize` helper, surfacing 32-bit truncation cleanly via LANCE_ERR_INVALID_ARGUMENT instead of wrapping. Other merge integrations: - Bumped lance deps to 4.0.1 (matching upstream/main); kept lance-file for LanceFileVersion parsing. - Adopted upstream's RAII StreamGuard / SchemaGuard pattern in the C++ Dataset::write wrappers. - Adopted upstream's int32_t-mode FFI signature for ABI safety against -fshort-enums. - Append _with_params test suite onto upstream's expanded test coverage.
|
Thanks for the review! Pushed 63da60d which:
Re #16's stream-ordering issue: now applied here too, so both |
## Summary Bring the README roadmap in line with what actually shipped on `main`: - Fix the **header path links** at the top — `include/lance.h` / `include/lance.hpp` 404'd after #26 (Phase 4 packaging) moved the headers under `include/lance/`. - **Phase 3 → Dataset write**: flip `[ ]` → `[x]` (`lance_dataset_write` from #16, `lance_dataset_write_with_params` from #20). - **Phase 3 → Version management**: flip `[ ]` → `[x]` — list (`lance_dataset_versions`, #18), rollback (`lance_dataset_restore`, #18), and version checkout (`lance_dataset_open(uri, opts, version)`, documented in #19). - **Additional features**: add `Substrait filter pushdown` row pointing at `lance_scanner_set_substrait_filter` (#25). No source changes; docs only. ## Test plan - [x] Markdown renders correctly (links resolve to `include/lance/lance.h` and `include/lance/lance.hpp`) - [x] `git diff origin/main` shows only the four targeted edits
Summary
LanceWriteParams(max_rows_per_file,max_rows_per_group,max_bytes_per_file,data_storage_version,enable_stable_row_ids) andlance_dataset_write_with_params. Passparams = NULLto inherit upstream defaults.lance_dataset_writeis now a thin delegator to_with_paramswithparams = NULL. Its FFI signature and ABI are unchanged vsupstream/main.lance::WriteParamsstruct and a 5-arglance::Dataset::write(...)overload; the existing 4-arg overload delegates to it with a default-constructedWriteParams{}.Motivation
lance_dataset_writeuses upstream's defaults for everything. Production callers need to tune output file/group sizes and the Lance file format version; stable row ids are also a common opt-in. This exposes those knobs without touching the simple-path API.Closes #15.
Design notes
data_storage_versiontakes a string ("2.0","2.1","stable","legacy", ...) parsed viaLanceFileVersion::from_str. Empty or invalid strings →LANCE_ERR_INVALID_ARGUMENT.0as a "keep upstream default" sentinel.enable_stable_row_idsisbooland so has no default sentinel — whatever the caller writes is forwarded verbatim. Today upstream's default isfalse, so a zero-initLanceWriteParamsis a no-op; struct-level docs in Rust and C/C++ headers call this out so a future upstream flip won't be silent.WriteParamshas more advanced fields (commit handlers, progress callbacks, blob/base internals) intentionally not exposed — they belong in Rust, not a thin C FFI.Review fixes (jja725's pass)
_with_params): the stream is now consumed viafrom_rawbefore validatinguri/schema/mode, so the documented "consumed regardless of return code" contract holds on every error path. Mirrors what landed for plainlance_dataset_writein feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream #16.enable_stable_row_idssemantics: doc-only fix as suggested — Rust struct, C header, and C++ header all call out that this field is strictly an override.u64 → usizetruncation on 32-bit: replaced silentas usizecasts with au64_to_usize(v, field)helper that usesusize::try_fromand surfaces the offending field name + value asLANCE_ERR_INVALID_ARGUMENTon overflow.Merge with
upstream/mainResolved conflicts in
Cargo.toml,include/lance/lance.{h,hpp},src/writer.rs,tests/c_api_test.rs, andtests/cpp/{test_c_api.c,test_cpp_api.cpp}. Adopted upstream's:lance4.0.1 dependency set (addedlance-fileforLanceFileVersionparsing).int32_t modeFFI signature (ABI-safe against-fshort-enumson the consumer side).StreamGuard/SchemaGuardpattern in the C++Dataset::writewrappers.RwLock<Arc<Dataset>>shape forLanceDataset.inner.Diff vs
upstream/main: 466 insertions across 5 files (Cargo.toml,include/lance/lance.h,include/lance/lance.hpp,src/writer.rs,tests/c_api_test.rs).Test plan
cargo test— 103 passed, 0 failed. 8 new tests for_with_params: NULL params behaves like plainwrite,max_rows_per_filesplits fragments (100 / 20 → ≥5),max_rows_per_groupplumbed,max_bytes_per_fileplumbed, known versions ("2.0"/"2.1"/"stable") accepted, empty version rejected, invalid version rejected,enable_stable_row_idstoggle accepted.cargo clippy --all-targets -- -D warningsclean.cargo fmt --checkclean.RUSTDOCFLAGS="-D warnings" cargo doc --no-depsclean.cargo test --test compile_and_run_test -- --ignored— C and C++ integration tests pass.