Skip to content

feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream#16

Open
LuciferYang wants to merge 2 commits intolance-format:mainfrom
LuciferYang:feat/issue-14-dataset-write
Open

feat: add lance_dataset_write for create/append/overwrite from ArrowArrayStream#16
LuciferYang wants to merge 2 commits intolance-format:mainfrom
LuciferYang:feat/issue-14-dataset-write

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

Summary

  • Adds lance_dataset_write(uri, schema, stream, mode, storage_opts, out_dataset) — writes an ArrowArrayStream into a Lance dataset with a committed manifest
  • LanceWriteMode covers CREATE / APPEND / OVERWRITE
  • Optional out_dataset hands back an open LanceDataset* at the new version so callers don't need to reopen
  • Matching lance::Dataset::write(...) static method in lance.hpp

Motivation

Until now the C/C++ path only produced uncommitted fragment files (#5). lance_dataset_write closes the primary write path and unblocks the rest of Phase 3 (delete, update, merge-insert, schema evolution), which all need a way to create a dataset first.

Notes

  • mode is received as int32_t and validated via LanceWriteMode::from_raw. Accepting the enum directly would be UB for out-of-range values from C.
  • The stream is consumed via ArrowArrayStreamReader::from_raw immediately after the NULL check, so the "consumed on any return" contract holds on every error path.

Test plan

  • cargo test — 55 integration tests, 11 new (CREATE/APPEND/OVERWRITE happy paths, OVERWRITE on a missing path, CREATE on an existing path, schema mismatches, empty stream, NULL args, invalid mode, out_dataset propagation)
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean
  • cargo test --test compile_and_run_test -- --ignored — C and C++ scan→write round-trips pass

Closes #14.

…rrayStream

Writes an ArrowArrayStream into a Lance dataset with a committed manifest.
A mode enum (CREATE / APPEND / OVERWRITE) and an optional out_dataset that
returns the open dataset at the new version (so callers don't need to
reopen). Structure follows fragment_writer.rs: schema fail-fast, storage
options pass-through, thread-local errors. C++ gets a scoped WriteMode
enum and a lance::Dataset::write() static method that reads the stream's
schema automatically.

Two FFI details:
  - mode is received as i32 and validated via LanceWriteMode::from_raw;
    accepting the enum directly would be UB for out-of-range values.
  - The stream is consumed via ArrowArrayStreamReader::from_raw right
    after the NULL check so the "consumed on any return" contract holds.

Tests: 11 new Rust unit tests (CREATE/APPEND/OVERWRITE happy paths,
OVERWRITE-on-missing, CREATE-on-existing, schema mismatches, empty
stream, NULL args, invalid mode, out_dataset propagation). The ignored
C and C++ integration tests now do a scan->write round-trip.

Closes lance-format#14.
@LuciferYang LuciferYang force-pushed the feat/issue-14-dataset-write branch from b3e813c to 8a9325e Compare April 24, 2026 08:02
The LanceWriteMode doc referenced `LanceWriteMode::from_raw`, which is
private. rustdoc -D warnings (the Rustdoc CI job) flags this as
`private_intra_doc_links`. Rewording to describe the validation behavior
without naming the private function.

No API changes.
@LuciferYang
Copy link
Copy Markdown
Contributor Author

cc @jja725 #15 depend on this one

Copy link
Copy Markdown
Collaborator

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. A few minor inline notes around stream-consumption ordering on the FFI error paths.

Comment thread src/writer.rs
storage_opts: *const *const c_char,
out_dataset: *mut *mut LanceDataset,
) -> Result<i32> {
if uri.is_null() || schema.is_null() || stream.is_null() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function-level doc and the comment on line 119 both say the stream is consumed "before any other fallible validation" / "on every return path." But this NULL check is itself fallible validation, and it runs before the from_raw consumption — so when uri or schema is NULL with a valid stream, the call returns -1 without ever invoking the stream's release callback. A C caller trusting the documented contract won't release it either → resource leak.

Suggested fix — pull the stream check out and consume first:

if stream.is_null() {
    return Err(lance_core::Error::InvalidInput {
        source: "stream must not be NULL".into(),
        location: snafu::location!(),
    });
}

// Consume the stream first — released when `reader` drops on any error path.
let reader = unsafe { ArrowArrayStreamReader::from_raw(stream) }.map_err(|e| {
    lance_core::Error::InvalidInput {
        source: e.to_string().into(),
        location: snafu::location!(),
    }
})?;

if uri.is_null() || schema.is_null() {
    return Err(lance_core::Error::InvalidInput {
        source: "uri and schema must not be NULL".into(),
        location: snafu::location!(),
    });
}

This matches what the existing comment claims the code does.

Comment thread include/lance.hpp

// Pull the stream's schema so we can pass it to the C API.
ArrowSchema schema = {};
if (stream->get_schema(stream, &schema) != 0) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same FFI-contract concern as src/writer.rs:112. The wrapper's doc says "Treat the stream as consumed once this call returns or throws — do not reuse it." But if get_schema fails here, we throw before ever calling lance_dataset_write, so the stream is never consumed by the Rust side. A caller following the doc won't release it either → leak.

Either release the stream in this branch:

if (stream->release) stream->release(stream);
throw Error(...);

…or relax the doc to say the caller retains ownership when get_schema itself fails.

Comment thread tests/c_api_test.rs
}

#[test]
fn test_dataset_write_null_args_return_error() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test checks rc == -1 and the error code, but it doesn't observe whether the FFI_ArrowArrayStream's release callback was actually invoked. The Rust-side Drop of stream_a releases it on scope exit, which masks the leak locally — but the FFI contract is broken for C callers (see comment on src/writer.rs:112).

Worth wrapping the stream with a release-counter and asserting it fires exactly once, even when other args are NULL. A test along those lines would have caught the consumption-order bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add lance_dataset_write() for create/append/overwrite from ArrowArrayStream

2 participants