Add lance_dataset_write() for create/append/overwrite from ArrowArrayStream #14

@LuciferYang

Summary

Add the core FFI API for writing Lance datasets from Arrow record batch streams. Closes the Dataset write row in the README Phase 3 roadmap.

Motivation

Today C/C++ callers can write fragment files (via lance_write_fragments) but cannot create a full dataset with a committed manifest. This blocks the natural C/C++ ingestion path and leaves every Phase 3 downstream feature (delete, update, merge-insert, schema evolution) without a way to set up a dataset from C/C++.

Proposed API

C:

typedef enum {
    LANCE_WRITE_CREATE    = 0,
    LANCE_WRITE_APPEND    = 1,
    LANCE_WRITE_OVERWRITE = 2,
} LanceWriteMode;

/* out_dataset: if non-NULL, on success receives an open LanceDataset* at the
   newly-committed version (caller must lance_dataset_close it). Pass NULL to discard. */
int32_t lance_dataset_write(
    const char* uri,
    const struct ArrowSchema* schema,
    struct ArrowArrayStream* stream,
    LanceWriteMode mode,
    const char* const* storage_opts,
    LanceDataset** out_dataset
);

C++:

namespace lance {
enum class WriteMode { Create = 0, Append = 1, Overwrite = 2 };
class Dataset {
 public:
  // Returns an open Dataset at the new version; throws lance::Error on failure.
  static Dataset write(
      const std::string& uri,
      ArrowArrayStream& stream,
      WriteMode mode,
      const std::unordered_map<std::string, std::string>& storage_opts = {});
};
}  // namespace lance

Rust side: LanceWriteMode is #[repr(C)] with pinned discriminants.
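
To illustrate why the pinned discriminants matter on the C side, here is a minimal sketch. The enum is copied from the proposal above; `lance_write_mode_from_int` is a hypothetical helper, not part of the proposed API. It assumes the Rust side receives `LanceWriteMode` by value as a `#[repr(C)]` enum, in which case an out-of-range value would be undefined behavior in Rust, so callers that derive the mode from untrusted integers (config files, bindings) may want to check it first.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    LANCE_WRITE_CREATE    = 0,
    LANCE_WRITE_APPEND    = 1,
    LANCE_WRITE_OVERWRITE = 2,
} LanceWriteMode;

/* Compile-time guard: fails the build if the header drifts from the
   discriminants pinned on the Rust side. */
static_assert(LANCE_WRITE_CREATE == 0, "CREATE must be 0");
static_assert(LANCE_WRITE_APPEND == 1, "APPEND must be 1");
static_assert(LANCE_WRITE_OVERWRITE == 2, "OVERWRITE must be 2");

/* Hypothetical helper (not in the proposal): converts a raw integer to a
   LanceWriteMode, rejecting anything outside the three declared values.
   Returns true and sets *out on success; leaves *out untouched on failure. */
bool lance_write_mode_from_int(int32_t raw, LanceWriteMode* out) {
    switch (raw) {
    case LANCE_WRITE_CREATE:
    case LANCE_WRITE_APPEND:
    case LANCE_WRITE_OVERWRITE:
        *out = (LanceWriteMode)raw;
        return true;
    default:
        return false;
    }
}
```

A caller would run the conversion before passing the mode to lance_dataset_write and report its own error when the helper returns false.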

Upstream mapping

Confirmed against the upstream lance source (signature of Dataset::write):

pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    dest: impl Into<WriteDestination<'_>>,
    params: Option<WriteParams>,
) -> Result<Self>

Reused from the existing crate: helpers::parse_storage_options, ArrowArrayStreamReader::from_raw, and the schema-validation block from src/fragment_writer.rs.

Semantics

NULL-argument contract:

  • uri, schema, stream — required; NULL → LANCE_ERR_INVALID_ARGUMENT.
  • storage_opts — may be NULL (= no options).
  • out_dataset — may be NULL (= discard the returned open dataset).
  • stream is consumed by the call (as with lance_write_fragments). Whatever the return code, success or error, the caller must not use the stream afterwards: ArrowArrayStreamReader::from_raw takes ownership unconditionally.

Commit semantics:

  • CREATE on existing path → LANCE_ERR_DATASET_ALREADY_EXISTS.
  • APPEND with schema not matching the existing dataset → LANCE_ERR_INVALID_ARGUMENT with a diff message (pattern from fragment_writer.rs).
  • Concurrent commit conflict → LANCE_ERR_COMMIT_CONFLICT (already in the enum; surfaced from lance_core::Error::CommitConflict).
  • Empty stream → empty commit succeeds; dataset exists with 0 rows.

Tests

  • CREATE happy path → verify count_rows and schema via reopen.
  • APPEND happy path → rows accumulate.
  • OVERWRITE happy path → start from a CREATEd dataset containing batch B_old, OVERWRITE with disjoint B_new, verify row count matches B_new alone and contents are B_new (not B_old ∪ B_new).
  • CREATE on existing path → LANCE_ERR_DATASET_ALREADY_EXISTS.
  • APPEND with mismatched schema (missing column / wrong type / wrong nullability / wrong field order) → LANCE_ERR_INVALID_ARGUMENT.
  • Empty stream → success, dataset exists with 0 rows.
  • NULL uri / schema / stream → LANCE_ERR_INVALID_ARGUMENT.
  • out_dataset propagation: when non-NULL, returned handle reports correct count_rows without reopen; when NULL, no handle is produced.
  • C/C++ integration: round-trip via tests/cpp/test_c_api.c / test_cpp_api.cpp.

Locked design decisions

  • LanceWriteMode is a #[repr(C)] enum with values 0 = CREATE, 1 = APPEND, 2 = OVERWRITE, matching upstream WriteMode.
  • out_dataset output parameter is included (optional via NULL) to propagate the open dataset returned by upstream Dataset::write, saving callers a reopen.
  • Empty stream is a valid input (empty commit succeeds).
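
For the empty-stream case, a test needs a producer that yields a schema and then ends immediately. The following is a minimal sketch of such a producer written directly against the Arrow C stream interface. The struct layouts are taken from the Arrow specification; `make_empty_stream` and the single nullable int64 column named "id" are illustrative assumptions, not part of the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layouts from the Arrow C data / C stream interface specs. Real projects
   would take these from an Arrow header; the guards let both coexist. */
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
#define ARROW_FLAG_NULLABLE 2
struct ArrowSchema {
    const char* format; const char* name; const char* metadata;
    int64_t flags; int64_t n_children;
    struct ArrowSchema** children; struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};
struct ArrowArray {
    int64_t length, null_count, offset, n_buffers, n_children;
    const void** buffers;
    struct ArrowArray** children; struct ArrowArray* dictionary;
    void (*release)(struct ArrowArray*);
    void* private_data;
};
#endif
#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};
#endif

static void leaf_release(struct ArrowSchema* s) { s->release = NULL; }

/* Releases the struct schema together with the children it allocated. */
static void schema_release(struct ArrowSchema* s) {
    for (int64_t i = 0; i < s->n_children; ++i) {
        struct ArrowSchema* c = s->children[i];
        if (c->release) c->release(c);
        free(c);
    }
    free(s->children);
    s->release = NULL;
}

/* Schema: struct<id: int64 nullable> (the column is an assumption). */
static int empty_get_schema(struct ArrowArrayStream* st, struct ArrowSchema* out) {
    (void)st;
    struct ArrowSchema* child = calloc(1, sizeof *child);
    struct ArrowSchema** children = malloc(sizeof *children);
    if (!child || !children) { free(child); free(children); return ENOMEM; }
    child->format = "l";
    child->name = "id";
    child->flags = ARROW_FLAG_NULLABLE;
    child->release = leaf_release;
    children[0] = child;
    memset(out, 0, sizeof *out);
    out->format = "+s";
    out->name = "";
    out->n_children = 1;
    out->children = children;
    out->release = schema_release;
    return 0;
}

/* End of stream is signalled by success plus an already-released array. */
static int empty_get_next(struct ArrowArrayStream* st, struct ArrowArray* out) {
    (void)st;
    memset(out, 0, sizeof *out);  /* out->release == NULL */
    return 0;
}

static const char* empty_get_last_error(struct ArrowArrayStream* st) {
    (void)st;
    return NULL;
}

static void empty_stream_release(struct ArrowArrayStream* st) { st->release = NULL; }

/* Hypothetical helper: fills `out` with a stream that yields no batches. */
void make_empty_stream(struct ArrowArrayStream* out) {
    assert(out != NULL);
    out->get_schema = empty_get_schema;
    out->get_next = empty_get_next;
    out->get_last_error = empty_get_last_error;
    out->release = empty_stream_release;
    out->private_data = NULL;
}
```

A stream built this way, passed with LANCE_WRITE_CREATE, is the kind of input the empty-stream test would exercise; the expected result per this issue is a successful commit and a dataset with 0 rows.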

Related

  • Follow-up: B2 will expose additional write parameters (file size, group size, storage version, stable row ids).
  • Part of the plan in docs/plan-write-path.md.
  • Sibling tickets: A1 (versions list), A2 (restore), A3 (checkout docs), B2 (write params).
