Skip to content

feat: add LanceWriteParams and lance_dataset_write_with_params#20

Merged
jja725 merged 4 commits intolance-format:mainfrom
LuciferYang:feat/issue-15-write-params
Apr 28, 2026
Merged

feat: add LanceWriteParams and lance_dataset_write_with_params#20
jja725 merged 4 commits intolance-format:mainfrom
LuciferYang:feat/issue-15-write-params

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Apr 24, 2026

Summary

  • Adds LanceWriteParams (max_rows_per_file, max_rows_per_group, max_bytes_per_file, data_storage_version, enable_stable_row_ids) and lance_dataset_write_with_params. Pass params = NULL to inherit upstream defaults.
  • lance_dataset_write is now a thin delegator to _with_params with params = NULL. Its FFI signature and ABI are unchanged vs upstream/main.
  • C++ gets a lance::WriteParams struct and a 5-arg lance::Dataset::write(...) overload; the existing 4-arg overload delegates to it with a default-constructed WriteParams{}.

Motivation

lance_dataset_write uses upstream's defaults for everything. Production callers need to tune output file/group sizes and the Lance file format version; stable row ids are also a common opt-in. This exposes those knobs without touching the simple-path API.

Closes #15.

Design notes

  • data_storage_version takes a string ("2.0", "2.1", "stable", "legacy", ...) parsed via LanceFileVersion::from_str. Empty or invalid strings → LANCE_ERR_INVALID_ARGUMENT.
  • Numeric fields use 0 as a "keep upstream default" sentinel.
  • enable_stable_row_ids is bool and so has no default sentinel — whatever the caller writes is forwarded verbatim. Today upstream's default is false, so a zero-init LanceWriteParams is a no-op; struct-level docs in Rust and C/C++ headers call this out so a future upstream flip won't be silent.
  • Lance's WriteParams has more advanced fields (commit handlers, progress callbacks, blob/base internals) intentionally not exposed — they belong in Rust, not a thin C FFI.

Review fixes (jja725's pass)

  • Stream consumption ordering (feat: add LanceWriteParams and lance_dataset_write_with_params #20 (comment) on _with_params): the stream is now consumed via from_raw before validating uri/schema/mode, so the documented "consumed regardless of return code" contract holds on every error path. Mirrors what landed for plain lance_dataset_write in feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream #16.
  • enable_stable_row_ids semantics: doc-only fix as suggested — Rust struct, C header, and C++ header all call out that this field is strictly an override.
  • u64 → usize truncation on 32-bit: replaced silent as usize casts with a u64_to_usize(v, field) helper that uses usize::try_from and surfaces the offending field name + value as LANCE_ERR_INVALID_ARGUMENT on overflow.

Merge with upstream/main

Resolved conflicts in Cargo.toml, include/lance/lance.{h,hpp}, src/writer.rs, tests/c_api_test.rs, and tests/cpp/{test_c_api.c,test_cpp_api.cpp}. Adopted upstream's:

  • lance 4.0.1 dependency set (added lance-file for LanceFileVersion parsing).
  • int32_t mode FFI signature (ABI-safe against -fshort-enums on the consumer side).
  • RAII StreamGuard / SchemaGuard pattern in the C++ Dataset::write wrappers.
  • RwLock<Arc<Dataset>> shape for LanceDataset.inner.

Diff vs upstream/main: 466 insertions across 5 files (Cargo.toml, include/lance/lance.h, include/lance/lance.hpp, src/writer.rs, tests/c_api_test.rs).

Test plan

  • cargo test103 passed, 0 failed. 8 new tests for _with_params: NULL params behaves like plain write, max_rows_per_file splits fragments (100 / 20 → ≥5), max_rows_per_group plumbed, max_bytes_per_file plumbed, known versions ("2.0" / "2.1" / "stable") accepted, empty version rejected, invalid version rejected, enable_stable_row_ids toggle accepted.
  • cargo clippy --all-targets -- -D warnings clean.
  • cargo fmt --check clean.
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps clean.
  • cargo test --test compile_and_run_test -- --ignored — C and C++ integration tests pass.

…rrayStream

Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.

Two FFI details:
  - mode is received as i32 and validated via LanceWriteMode::from_raw;
    accepting the enum directly would be UB for out-of-range values.
  - The stream is consumed via ArrowArrayStreamReader::from_raw right
    after the NULL check so the "consumed on any return" contract holds.

Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.

Closes lance-format#14.
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is
private. rustdoc -D warnings (the Rustdoc CI job) flags this as
`private_intra_doc_links`. Rewording to describe the validation behavior
without naming the private function.

No API changes.
Extends the write surface with a LanceWriteParams struct covering the most
useful fields from upstream's WriteParams: max_rows_per_file,
max_rows_per_group, max_bytes_per_file, data_storage_version, and
enable_stable_row_ids. lance_dataset_write is now a thin delegator that
forwards params = NULL to lance_dataset_write_with_params — both remain
part of the ABI, no deprecations.

data_storage_version takes a string ("2.0", "2.1", "stable", "legacy",
etc.) parsed via lance_file::version::LanceFileVersion::from_str; invalid
values are rejected with LANCE_ERR_INVALID_ARGUMENT. Adds lance-file as a
main dependency (it was previously dev-only).

C++ gets a lance::WriteParams struct and a new Dataset::write overload
accepting it; the three-argument form delegates to the new one with
default params.

Tests: 5 new Rust unit tests — NULL-params-behaves-like-plain,
max_rows_per_file splits fragments (100 rows / 20 → at least 5), known
storage versions accepted ("2.0"/"2.1"/"stable"), invalid storage version
rejected, stable_row_ids toggle accepted.

Closes lance-format#15.
@LuciferYang LuciferYang marked this pull request as draft April 24, 2026 11:08
@LuciferYang
Copy link
Copy Markdown
Contributor Author

will rebase after #16 merged

Copy link
Copy Markdown
Collaborator

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean layering — lance_dataset_write becomes a thin delegator and the new _with_params variant exposes only the knobs that make sense at the C boundary. A few inline notes; one carries forward from the discussion on #16.

Comment thread src/writer.rs Outdated
storage_opts: *const *const c_char,
out_dataset: *mut *mut LanceDataset,
) -> Result<i32> {
if uri.is_null() || schema.is_null() || stream.is_null() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same stream-consumption ordering issue as discussed on #16 — when uri or schema is NULL with a valid stream, this returns before from_raw is called, so the stream's release callback is never invoked. The C caller, trusting the documented "stream is consumed regardless of return code" contract, leaks it.

Whichever PR lands first should pull stream.is_null() out and consume the stream before validating the rest. Flagging here too in case #20 lands first.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled the stream NULL check ahead of from_raw in _with_params (and dropped the uri/schema checks below it), so the stream is now consumed before any other validation can short-circuit. Mirrors the pattern that landed for plain lance_dataset_write in #16.

Comment thread src/writer.rs
})?;
target.data_storage_version = Some(version);
}
target.enable_stable_row_ids = params.enable_stable_row_ids;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle: every other field uses 0 / NULL as a "keep upstream default" sentinel, but enable_stable_row_ids is unconditionally assigned. With bool there's no available sentinel, but the consequence is that any caller that zero-initializes LanceWriteParams (the documented way to keep defaults) will explicitly set enable_stable_row_ids = false. That happens to match upstream's current default, so it's silent today — but if upstream ever flips the default to true, every C caller silently regresses.

Options:

  • Add an enable_stable_row_ids_set companion flag (clunky).
  • Use a int8_t with -1 = default, 0 = false, 1 = true (more verbose but matches the rest of the struct's pattern).
  • Document explicitly that this field is "strictly an override, no default sentinel" so future readers don't assume parity.

My preference is the doc-only fix unless upstream's default is actually expected to change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with the doc-only fix as you suggested. Both the Rust struct and the C/C++ headers now call out that enable_stable_row_ids has no default sentinel and is forwarded verbatim — so a future upstream flip is at least visible to anyone reading the struct, not silent.

Comment thread src/writer.rs Outdated
/// fields are no-ops so upstream defaults flow through.
unsafe fn apply_write_params(target: &mut WriteParams, params: &LanceWriteParams) -> Result<()> {
if params.max_rows_per_file > 0 {
target.max_rows_per_file = params.max_rows_per_file as usize;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: params.max_rows_per_file as usize truncates on 32-bit targets if the caller passes a value > u32::MAX. Realistic use cases (rows-per-file in the millions, bytes-per-file ~90 GB default) wouldn't hit this, but a try_into().map_err(...) would surface the truncation cleanly instead of silently wrapping. Same applies to max_rows_per_group and max_bytes_per_file below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Replaced the three as usize casts with a u64_to_usize(v, field) helper that uses usize::try_from and surfaces the offending field name + value as LANCE_ERR_INVALID_ARGUMENT on overflow. No realistic 64-bit caller hits it, but a 32-bit cross-build now fails loudly instead of wrapping.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

LuciferYang commented Apr 27, 2026

@jja725 Thanks for your review. I will address the issues in this PR after the prerequisite PR is merged.

@jja725
Copy link
Copy Markdown
Collaborator

jja725 commented Apr 27, 2026

@LuciferYang after this PR, we can make 0.1.0 release since we reached most of the roadmap items.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

Let me get this PR ready

Resolves conflicts with upstream/main and addresses three review comments
on PR lance-format#20:

- r3140805119: stream consumption ordering — `_with_params` now consumes
  the stream via `from_raw` before validating uri/schema/mode, matching
  the pattern upstream landed for `lance_dataset_write` in lance-format#16. The
  documented "stream is consumed regardless of return code" contract now
  holds on every error path.
- r3140805123: `enable_stable_row_ids` documented as "strictly an
  override, no default sentinel" in both Rust and C/C++ headers, so
  future readers don't assume defaults flow through.
- r3140805126: replaced silent `as usize` casts with `try_into`-based
  `u64_to_usize` helper, surfacing 32-bit truncation cleanly via
  LANCE_ERR_INVALID_ARGUMENT instead of wrapping.

Other merge integrations:
- Bumped lance deps to 4.0.1 (matching upstream/main); kept lance-file
  for LanceFileVersion parsing.
- Adopted upstream's RAII StreamGuard / SchemaGuard pattern in the C++
  Dataset::write wrappers.
- Adopted upstream's int32_t-mode FFI signature for ABI safety against
  -fshort-enums.
- Append _with_params test suite onto upstream's expanded test coverage.
@LuciferYang
Copy link
Copy Markdown
Contributor Author

Thanks for the review! Pushed 63da60d which:

  • Merges upstream/main (10 commits — substrait, package distribution, etc.) and resolves conflicts. Adopted upstream's RAII StreamGuard/SchemaGuard and int32_t mode ABI in the process.
  • Addresses the three inline notes (replies on each).

Re #16's stream-ordering issue: now applied here too, so both lance_dataset_write and _with_params consume the stream before any other validation. PTAL.

@LuciferYang LuciferYang marked this pull request as ready for review April 28, 2026 03:21
@jja725 jja725 self-requested a review April 28, 2026 05:29
Copy link
Copy Markdown
Collaborator

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jja725 jja725 merged commit 8561bd9 into lance-format:main Apr 28, 2026
9 checks passed
jja725 added a commit that referenced this pull request Apr 28, 2026
## Summary

Bring the README roadmap in line with what actually shipped on `main`:

- Fix the **header path links** at the top — `include/lance.h` /
`include/lance.hpp` 404'd after #26 (Phase 4 packaging) moved the
headers under `include/lance/`.
- **Phase 3 → Dataset write**: flip `[ ]` → `[x]` (`lance_dataset_write`
from #16, `lance_dataset_write_with_params` from #20).
- **Phase 3 → Version management**: flip `[ ]` → `[x]` — list
(`lance_dataset_versions`, #18), rollback (`lance_dataset_restore`,
#18), and version checkout (`lance_dataset_open(uri, opts, version)`,
documented in #19).
- **Additional features**: add `Substrait filter pushdown` row pointing
at `lance_scanner_set_substrait_filter` (#25).

No source changes; docs only.

## Test plan

- [x] Markdown renders correctly (links resolve to
`include/lance/lance.h` and `include/lance/lance.hpp`)
- [x] `git diff origin/main` shows only the four targeted edits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose WriteParams (rows/file size, storage version, stable row ids) via LanceWriteParams

2 participants