Summary
Add the core FFI API for writing Lance datasets from Arrow record batch streams. Closes the Dataset write row in the README Phase 3 roadmap.
Motivation
Today C/C++ callers can write fragment files (via lance_write_fragments) but cannot create a full dataset with a committed manifest. This blocks the natural C/C++ ingestion path and leaves every Phase 3 downstream feature (delete, update, merge-insert, schema evolution) without a way to set up a dataset from C/C++.
Proposed API
C:
typedef enum {
    LANCE_WRITE_CREATE = 0,
    LANCE_WRITE_APPEND = 1,
    LANCE_WRITE_OVERWRITE = 2,
} LanceWriteMode;

/* out_dataset: if non-NULL, on success receives an open LanceDataset* at the
   newly-committed version (caller must lance_dataset_close it). Pass NULL to discard. */
int32_t lance_dataset_write(
    const char* uri,
    const struct ArrowSchema* schema,
    struct ArrowArrayStream* stream,
    LanceWriteMode mode,
    const char* const* storage_opts,
    LanceDataset** out_dataset
);
C++:
namespace lance {
enum class WriteMode { Create = 0, Append = 1, Overwrite = 2 };

class Dataset {
public:
    // Returns an open Dataset at the new version; throws lance::Error on failure.
    static Dataset write(
        const std::string& uri,
        ArrowArrayStream& stream,
        WriteMode mode,
        const std::unordered_map<std::string, std::string>& storage_opts = {});
};
}  // namespace lance
Rust side: LanceWriteMode is #[repr(C)] with pinned discriminants.
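Pinned discriminants only help if the C header and the Rust enum stay in step. As a minimal sketch, the C side could guard the values with compile-time checks. The range-check helper `lance_write_mode_is_valid` is hypothetical and not part of this proposal; it illustrates one way to reject out-of-range integers before they reach Rust, where materializing a `#[repr(C)]` enum from an undefined discriminant is undefined behavior.

```c
#include <assert.h> /* static_assert (C11) */

/* Mirrors the enum from the proposed C header. */
typedef enum {
    LANCE_WRITE_CREATE = 0,
    LANCE_WRITE_APPEND = 1,
    LANCE_WRITE_OVERWRITE = 2,
} LanceWriteMode;

/* Compilation fails if anyone reorders or renumbers the modes. */
static_assert(LANCE_WRITE_CREATE == 0, "CREATE must stay 0");
static_assert(LANCE_WRITE_APPEND == 1, "APPEND must stay 1");
static_assert(LANCE_WRITE_OVERWRITE == 2, "OVERWRITE must stay 2");

/* Hypothetical helper (an assumption, not part of the proposal):
   returns 1 if raw names a defined write mode, 0 otherwise. */
static int lance_write_mode_is_valid(int raw) {
    return raw >= LANCE_WRITE_CREATE && raw <= LANCE_WRITE_OVERWRITE;
}
```

A C caller can pass any integer where a `LanceWriteMode` is expected, so a check of this kind may be worth applying at the FFI boundary before converting to the Rust enum.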
Upstream mapping
Confirmed against upstream lance source:
pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    dest: impl Into<WriteDestination<'_>>,
    params: Option<WriteParams>,
) -> Result<Self>
Reuses from the existing crate: helpers::parse_storage_options, ArrowArrayStreamReader::from_raw, and the schema-validation block from src/fragment_writer.rs.
Semantics
NULL-argument contract:
- uri, schema, stream: required; NULL → LANCE_ERR_INVALID_ARGUMENT.
- storage_opts: may be NULL (= no options).
- out_dataset: may be NULL (= discard the returned open dataset).
- The stream is consumed by the call (as in lance_write_fragments). On any return code (success or error), the caller must not use the stream again, because ArrowArrayStreamReader::from_raw takes ownership unconditionally.
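The ownership rule above can be illustrated with a small mock. This is a sketch, not the real implementation: `mock_dataset_write`, the `MOCK_*` codes, and the counting stream are all hypothetical, and the order of validation versus release is an assumption. The struct layouts follow the Arrow C data and C stream interface specifications. The mock releases any non-NULL stream before returning, including on the argument-error path.

```c
#include <stddef.h>
#include <stdint.h>

struct ArrowArray; /* opaque here: the sketch never produces batches */

/* Layout per the Arrow C data interface. */
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};

/* Layout per the Arrow C stream interface. */
struct ArrowArrayStream {
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};

/* Placeholder codes for this sketch only; real values come from the crate's header. */
enum { MOCK_OK = 0, MOCK_ERR_INVALID_ARGUMENT = 1 };

/* Hypothetical stand-in for lance_dataset_write. It writes nothing; it only
   models the contract: required arguments are checked, and a non-NULL stream
   is released on every return path. */
static int32_t mock_dataset_write(const char* uri,
                                  const struct ArrowSchema* schema,
                                  struct ArrowArrayStream* stream) {
    if (stream == NULL) {
        return MOCK_ERR_INVALID_ARGUMENT; /* nothing to take ownership of */
    }
    int32_t rc = (uri == NULL || schema == NULL) ? MOCK_ERR_INVALID_ARGUMENT
                                                 : MOCK_OK;
    if (stream->release != NULL) {
        stream->release(stream); /* consumed even when rc is an error */
    }
    return rc;
}

/* Test double: counts how many times a stream has been released. */
static int g_release_calls = 0;

static void counting_release(struct ArrowArrayStream* stream) {
    g_release_calls++;
    stream->release = NULL; /* the spec requires marking the struct as released */
}

static struct ArrowArrayStream make_counting_stream(void) {
    struct ArrowArrayStream stream = {0};
    stream.release = counting_release;
    return stream;
}
```

After either outcome, a caller can observe that `release` is NULL on its stream struct, which is the Arrow convention for "already released"; calling `get_next` or `release` again at that point would be a bug on the caller's side.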
Commit semantics:
- CREATE on existing path → LANCE_ERR_DATASET_ALREADY_EXISTS.
- APPEND with a schema that does not match the existing dataset → LANCE_ERR_INVALID_ARGUMENT with a diff message (pattern from fragment_writer.rs).
- Concurrent commit conflict → LANCE_ERR_COMMIT_CONFLICT (already in the error-code enum; surfaced from lance_core::Error::CommitConflict).
- Empty stream → empty commit succeeds; dataset exists with 0 rows.
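One consequence of these return codes, combined with the consumed-stream rule, is that a caller who wants to retry after a commit conflict has to build a fresh stream first. The sketch below shows one possible caller-side policy. The `SKETCH_*` numeric values are placeholders (only the error names come from this proposal), and `WriteAction` and `next_write_action` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Placeholder numeric values; the real codes are defined by the crate's header. */
enum {
    SKETCH_OK = 0,
    SKETCH_ERR_INVALID_ARGUMENT = 1,
    SKETCH_ERR_DATASET_ALREADY_EXISTS = 2,
    SKETCH_ERR_COMMIT_CONFLICT = 3,
};

/* What the caller should do after one write attempt (hypothetical type). */
typedef enum {
    WRITE_DONE,
    WRITE_RETRY_WITH_FRESH_STREAM,
    WRITE_GIVE_UP,
} WriteAction;

/* Maps a return code to the next step. Only a commit conflict is treated as
   retryable, and only while attempts remain; the old stream is already
   consumed, so a retry needs a newly created stream. */
static WriteAction next_write_action(int32_t rc, int attempts_made, int max_attempts) {
    switch (rc) {
    case SKETCH_OK:
        return WRITE_DONE;
    case SKETCH_ERR_COMMIT_CONFLICT:
        return attempts_made < max_attempts ? WRITE_RETRY_WITH_FRESH_STREAM
                                            : WRITE_GIVE_UP;
    default:
        /* Invalid argument, dataset already exists, and anything else:
           retrying the same call would not change the outcome. */
        return WRITE_GIVE_UP;
    }
}
```

Whether a retry is appropriate depends on the write mode and the application; this policy is an assumption rather than something the API prescribes.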
Tests
- CREATE happy path → verify count_rows and schema via reopen.
- APPEND happy path → rows accumulate.
- OVERWRITE happy path → start from a CREATEd dataset containing batch B_old, OVERWRITE with disjoint B_new, verify row count matches B_new alone and contents are B_new (not B_old ∪ B_new).
- CREATE on existing path → LANCE_ERR_DATASET_ALREADY_EXISTS.
- APPEND with mismatched schema (missing column / wrong type / wrong nullability / wrong field order) → LANCE_ERR_INVALID_ARGUMENT.
- Empty stream → success, dataset exists with 0 rows.
- NULL uri / schema / stream → LANCE_ERR_INVALID_ARGUMENT.
- out_dataset propagation: when non-NULL, returned handle reports correct count_rows without reopen; when NULL, no handle is produced.
- C/C++ integration: round-trip via tests/cpp/test_c_api.c / test_cpp_api.cpp.
Locked design decisions
- LanceWriteMode is a #[repr(C)] enum with values 0 = CREATE, 1 = APPEND, 2 = OVERWRITE, matching upstream WriteMode.
- out_dataset output parameter is included (optional via NULL) to propagate the open dataset returned by upstream Dataset::write, saving callers a reopen.
- Empty stream is a valid input (empty commit succeeds).
Related
- Follow-up: B2 will expose additional write parameters (file size, group size, storage version, stable row ids).
- Part of the plan in docs/plan-write-path.md.
- Sibling tickets: A1 (versions list), A2 (restore), A3 (checkout docs), B2 (write params).