feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream#16
feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream#16LuciferYang wants to merge 2 commits intolance-format:mainfrom
Conversation
…rrayStream
Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.
Two FFI details:
- mode is received as i32 and validated via LanceWriteMode::from_raw;
accepting the enum directly would be UB for out-of-range values.
- The stream is consumed via ArrowArrayStreamReader::from_raw right
after the NULL check so the "consumed on any return" contract holds.
Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.
Closes lance-format#14.
b3e813c to
8a9325e
Compare
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is private. rustdoc -D warnings (the Rustdoc CI job) flags this as `private_intra_doc_links`. Rewording to describe the validation behavior without naming the private function. No API changes.
jja725
left a comment
There was a problem hiding this comment.
LGTM overall. A few minor inline notes around stream-consumption ordering on the FFI error paths.
| storage_opts: *const *const c_char, | ||
| out_dataset: *mut *mut LanceDataset, | ||
| ) -> Result<i32> { | ||
| if uri.is_null() || schema.is_null() || stream.is_null() { |
There was a problem hiding this comment.
The function-level doc and the comment on line 119 both say the stream is consumed "before any other fallible validation" / "on every return path." But this NULL check is itself fallible validation, and it runs before the from_raw consumption — so when uri or schema is NULL with a valid stream, the call returns -1 without ever invoking the stream's release callback. A C caller trusting the documented contract won't release it either → resource leak.
Suggested fix — pull the stream check out and consume first:
if stream.is_null() {
return Err(lance_core::Error::InvalidInput {
source: "stream must not be NULL".into(),
location: snafu::location!(),
});
}
// Consume the stream first — released when `reader` drops on any error path.
let reader = unsafe { ArrowArrayStreamReader::from_raw(stream) }.map_err(|e| {
lance_core::Error::InvalidInput {
source: e.to_string().into(),
location: snafu::location!(),
}
})?;
if uri.is_null() || schema.is_null() {
return Err(lance_core::Error::InvalidInput {
source: "uri and schema must not be NULL".into(),
location: snafu::location!(),
});
}This matches what the existing comment claims the code does.
|
|
||
| // Pull the stream's schema so we can pass it to the C API. | ||
| ArrowSchema schema = {}; | ||
| if (stream->get_schema(stream, &schema) != 0) { |
There was a problem hiding this comment.
Same FFI-contract concern as src/writer.rs:112. The wrapper's doc says "Treat the stream as consumed once this call returns or throws — do not reuse it." But if get_schema fails here, we throw before ever calling lance_dataset_write, so the stream is never consumed by the Rust side. A caller following the doc won't release it either → leak.
Either release the stream in this branch:
if (stream->release) stream->release(stream);
throw Error(...);…or relax the doc to say the caller retains ownership when get_schema itself fails.
| } | ||
|
|
||
| #[test] | ||
| fn test_dataset_write_null_args_return_error() { |
There was a problem hiding this comment.
This test checks rc == -1 and the error code, but it doesn't observe whether the FFI_ArrowArrayStream's release callback was actually invoked. The Rust-side Drop of stream_a releases it on scope exit, which masks the leak locally — but the FFI contract is broken for C callers (see comment on src/writer.rs:112).
Worth wrapping the stream with a release-counter and asserting it fires exactly once, even when other args are NULL. A test along those lines would have caught the consumption-order bug.
Summary
lance_dataset_write(uri, schema, stream, mode, storage_opts, out_dataset)— writes anArrowArrayStreaminto a Lance dataset with a committed manifestLanceWriteModecoversCREATE/APPEND/OVERWRITEout_datasethands back an openLanceDataset*at the new version so callers don't need to reopenlance::Dataset::write(...)static method inlance.hppMotivation
Until now the C/C++ path only produced uncommitted fragment files (#5).
lance_dataset_writecloses the primary write path and unblocks the rest of Phase 3 (delete, update, merge-insert, schema evolution), which all need a way to create a dataset first.Notes
modeis received asint32_tand validated viaLanceWriteMode::from_raw. Accepting the enum directly would be UB for out-of-range values from C.ArrowArrayStreamReader::from_rawimmediately after the NULL check, so the "consumed on any return" contract holds on every error path.Test plan
cargo test— 55 integration tests, 11 new (CREATE/APPEND/OVERWRITE happy paths, OVERWRITE on a missing path, CREATE on an existing path, schema mismatches, empty stream, NULL args, invalid mode,out_datasetpropagation)cargo clippy --all-targets -- -D warningscleancargo fmt --checkcleancargo test --test compile_and_run_test -- --ignored— C and C++ scan→write round-trips passCloses #14.