Skip to content

Commit 8561bd9

Browse files
authored
feat: add LanceWriteParams and lance_dataset_write_with_params (#20)
## Summary - Adds `LanceWriteParams` (`max_rows_per_file`, `max_rows_per_group`, `max_bytes_per_file`, `data_storage_version`, `enable_stable_row_ids`) and `lance_dataset_write_with_params`. Pass `params = NULL` to inherit upstream defaults. - `lance_dataset_write` is now a thin delegator to `_with_params` with `params = NULL`. Its FFI signature and ABI are unchanged vs `upstream/main`. - C++ gets a `lance::WriteParams` struct and a 5-arg `lance::Dataset::write(...)` overload; the existing 4-arg overload delegates to it with a default-constructed `WriteParams{}`. ## Motivation `lance_dataset_write` uses upstream's defaults for everything. Production callers need to tune output file/group sizes and the Lance file format version; stable row ids are also a common opt-in. This exposes those knobs without touching the simple-path API. Closes #15. ## Design notes - `data_storage_version` takes a string (`"2.0"`, `"2.1"`, `"stable"`, `"legacy"`, ...) parsed via `LanceFileVersion::from_str`. Empty or invalid strings → `LANCE_ERR_INVALID_ARGUMENT`. - Numeric fields use `0` as a "keep upstream default" sentinel. - `enable_stable_row_ids` is `bool` and so has **no default sentinel** — whatever the caller writes is forwarded verbatim. Today upstream's default is `false`, so a zero-init `LanceWriteParams` is a no-op; struct-level docs in Rust and C/C++ headers call this out so a future upstream flip won't be silent. - Lance's `WriteParams` has more advanced fields (commit handlers, progress callbacks, blob/base internals) intentionally not exposed — they belong in Rust, not a thin C FFI. ## Review fixes (jja725's pass) - **Stream consumption ordering** (#20 (comment) on `_with_params`): the stream is now consumed via `from_raw` before validating `uri`/`schema`/`mode`, so the documented "consumed regardless of return code" contract holds on every error path. Mirrors what landed for plain `lance_dataset_write` in #16. - **`enable_stable_row_ids` semantics**: doc-only fix as suggested — Rust struct, C header, and C++ header all call out that this field is strictly an override. - **`u64 → usize` truncation on 32-bit**: replaced silent `as usize` casts with a `u64_to_usize(v, field)` helper that uses `usize::try_from` and surfaces the offending field name + value as `LANCE_ERR_INVALID_ARGUMENT` on overflow. ## Merge with `upstream/main` Resolved conflicts in `Cargo.toml`, `include/lance/lance.{h,hpp}`, `src/writer.rs`, `tests/c_api_test.rs`, and `tests/cpp/{test_c_api.c,test_cpp_api.cpp}`. Adopted upstream's: - `lance` 4.0.1 dependency set (added `lance-file` for `LanceFileVersion` parsing). - `int32_t mode` FFI signature (ABI-safe against `-fshort-enums` on the consumer side). - RAII `StreamGuard` / `SchemaGuard` pattern in the C++ `Dataset::write` wrappers. - `RwLock<Arc<Dataset>>` shape for `LanceDataset.inner`. Diff vs `upstream/main`: 466 insertions across 5 files (`Cargo.toml`, `include/lance/lance.h`, `include/lance/lance.hpp`, `src/writer.rs`, `tests/c_api_test.rs`). ## Test plan - `cargo test` — **103 passed**, 0 failed. 8 new tests for `_with_params`: NULL params behaves like plain `write`, `max_rows_per_file` splits fragments (100 / 20 → ≥5), `max_rows_per_group` plumbed, `max_bytes_per_file` plumbed, known versions (`"2.0"` / `"2.1"` / `"stable"`) accepted, empty version rejected, invalid version rejected, `enable_stable_row_ids` toggle accepted. - `cargo clippy --all-targets -- -D warnings` clean. - `cargo fmt --check` clean. - `RUSTDOCFLAGS="-D warnings" cargo doc --no-deps` clean. - `cargo test --test compile_and_run_test -- --ignored` — C and C++ integration tests pass.
1 parent 82efe6b commit 8561bd9

5 files changed

Lines changed: 466 additions & 9 deletions

File tree

Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ crate-type = ["cdylib", "staticlib", "rlib"]
2020
[dependencies]
2121
lance = { version = "4.0.1", features = ["substrait"] }
2222
lance-core = "4.0.1"
23+
lance-file = "4.0.1"
2324
lance-index = "4.0.1"
2425
lance-io = "4.0.1"
2526
lance-linalg = "4.0.1"

include/lance/lance.h

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -581,6 +581,48 @@ int32_t lance_dataset_write(
581581
LanceDataset** out_dataset
582582
);
583583

584+
/**
585+
* Tunable parameters for lance_dataset_write_with_params. Numeric fields
586+
* default-out via 0; `data_storage_version` defaults out via NULL.
587+
*
588+
* Note: `enable_stable_row_ids` is a `bool` and therefore has no default
589+
* sentinel — callers that zero-initialize this struct end up explicitly
590+
* setting it to false (which matches upstream's current default).
591+
*/
592+
typedef struct LanceWriteParams {
593+
/* Soft cap on rows per data file. 0 = default. */
594+
uint64_t max_rows_per_file;
595+
/* Soft cap on rows per row group. 0 = default. */
596+
uint64_t max_rows_per_group;
597+
/* Soft cap on bytes per data file (~90 GB upstream default). 0 = default. */
598+
uint64_t max_bytes_per_file;
599+
/* Lance file format version, e.g. "2.0", "2.1", "stable", "legacy".
600+
* NULL = default. Invalid strings → LANCE_ERR_INVALID_ARGUMENT. */
601+
const char* data_storage_version;
602+
/* Opt into stable row ids (better for compaction at a small write cost).
603+
* Strictly an override — see struct-level note above. */
604+
bool enable_stable_row_ids;
605+
} LanceWriteParams;
606+
607+
/**
608+
* Same as lance_dataset_write but takes a LanceWriteParams for tuning the
609+
* output shape. Pass `params` = NULL to use defaults (equivalent to calling
610+
* lance_dataset_write directly).
611+
*
612+
* @return 0 on success, -1 on error. See lance_dataset_write for the error
613+
* code list; invalid `data_storage_version` also returns
614+
* LANCE_ERR_INVALID_ARGUMENT.
615+
*/
616+
int32_t lance_dataset_write_with_params(
617+
const char* uri,
618+
const struct ArrowSchema* schema,
619+
struct ArrowArrayStream* stream,
620+
int32_t mode,
621+
const LanceWriteParams* params,
622+
const char* const* storage_opts,
623+
LanceDataset** out_dataset
624+
);
625+
584626
#ifdef __cplusplus
585627
} /* extern "C" */
586628
#endif

include/lance/lance.hpp

Lines changed: 44 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@
1919

2020
#include <cstdint>
2121
#include <memory>
22+
#include <optional>
2223
#include <stdexcept>
2324
#include <string>
2425
#include <utility>
@@ -103,6 +104,22 @@ enum class WriteMode : int32_t {
103104
Overwrite = LANCE_WRITE_OVERWRITE,
104105
};
105106

107+
/// Tunable parameters for Dataset::write. Numeric fields default-out via 0;
108+
/// `data_storage_version` defaults out via `std::nullopt`.
109+
///
110+
/// `enable_stable_row_ids` has no default sentinel — whatever value the
111+
/// caller writes is forwarded to upstream. Today this matches upstream's
112+
/// default (`false`), so a default-constructed WriteParams is a no-op; if
113+
/// upstream ever changes its default, callers must set this field explicitly.
114+
struct WriteParams {
115+
uint64_t max_rows_per_file = 0;
116+
uint64_t max_rows_per_group = 0;
117+
uint64_t max_bytes_per_file = 0;
118+
/// Lance file format version, e.g. "2.0", "2.1", "stable", "legacy".
119+
std::optional<std::string> data_storage_version;
120+
bool enable_stable_row_ids = false;
121+
};
122+
106123
// ─── Dataset ─────────────────────────────────────────────────────────────────
107124

108125
class Dataset {
@@ -145,14 +162,26 @@ class Dataset {
145162
WriteMode mode,
146163
const std::vector<std::pair<std::string, std::string>>& storage_opts = {}) {
147164

165+
return write(uri, stream, mode, WriteParams{}, storage_opts);
166+
}
167+
168+
/// Same as the four-argument `write` but tunes the output via `params`.
169+
/// Pass a default-constructed `WriteParams{}` to inherit upstream defaults.
170+
static Dataset write(
171+
const std::string& uri,
172+
ArrowArrayStream* stream,
173+
WriteMode mode,
174+
const WriteParams& params,
175+
const std::vector<std::pair<std::string, std::string>>& storage_opts = {}) {
176+
148177
if (stream == nullptr) {
149178
throw Error(LANCE_ERR_INVALID_ARGUMENT, "stream must not be null");
150179
}
151180

152-
// RAII guard for the stream. Until `lance_dataset_write` is called,
153-
// any exception (failed `get_schema`, `std::bad_alloc` while building
154-
// `kv`, etc.) must release the stream. After that call Rust owns it,
155-
// so we `disarm()` immediately before invoking the C API.
181+
// RAII guard for the stream. Until `lance_dataset_write_with_params`
182+
// is called, any exception (failed `get_schema`, `std::bad_alloc`
183+
// while building `kv`, etc.) must release the stream. After that call
184+
// Rust owns it, so we `disarm()` immediately before invoking the C API.
156185
struct StreamGuard {
157186
ArrowArrayStream* s;
158187
bool armed = true;
@@ -219,17 +248,26 @@ class Dataset {
219248
const char* const* opts_ptr =
220249
storage_opts.empty() ? nullptr : kv.data();
221250

251+
LanceWriteParams c_params = {};
252+
c_params.max_rows_per_file = params.max_rows_per_file;
253+
c_params.max_rows_per_group = params.max_rows_per_group;
254+
c_params.max_bytes_per_file = params.max_bytes_per_file;
255+
c_params.data_storage_version =
256+
params.data_storage_version ? params.data_storage_version->c_str() : nullptr;
257+
c_params.enable_stable_row_ids = params.enable_stable_row_ids;
258+
222259
// The C API consumes the stream on every return path, so disarm the
223260
// guard before calling. After this point the stream pointer is logically
224261
// owned by Rust and any C++-side exception must not re-release it.
225262
stream_guard.disarm();
226263

227264
LanceDataset* out = nullptr;
228-
int32_t rc = lance_dataset_write(
265+
int32_t rc = lance_dataset_write_with_params(
229266
uri.c_str(),
230267
&schema,
231268
stream,
232269
static_cast<int32_t>(mode),
270+
&c_params,
233271
opts_ptr,
234272
&out);
235273
if (rc != 0) check_error();
@@ -240,7 +278,7 @@ class Dataset {
240278
// thread-local code is `LANCE_OK` on this path (rc == 0).
241279
if (!out) {
242280
throw Error(LANCE_ERR_INTERNAL,
243-
"lance_dataset_write returned success with null out_dataset");
281+
"lance_dataset_write_with_params returned success with null out_dataset");
244282
}
245283
return Dataset(out);
246284
}

src/writer.rs

Lines changed: 122 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
//! dataset with a committed manifest rather than uncommitted fragment files.
99
1010
use std::ffi::c_char;
11+
use std::str::FromStr;
1112
use std::sync::{Arc, RwLock};
1213

1314
use arrow::ffi::FFI_ArrowSchema;
@@ -17,6 +18,7 @@ use arrow_schema::Schema as ArrowSchema;
1718
use lance::Dataset;
1819
use lance::dataset::{WriteMode as LanceWriteModeUpstream, WriteParams};
1920
use lance_core::Result;
21+
use lance_file::version::LanceFileVersion;
2022
use lance_io::object_store::{ObjectStoreParams, StorageOptionsAccessor};
2123

2224
use crate::dataset::LanceDataset;
@@ -74,6 +76,36 @@ impl From<LanceWriteMode> for LanceWriteModeUpstream {
7476
}
7577
}
7678

79+
/// Tunable parameters for `lance_dataset_write_with_params`.
80+
///
81+
/// Numeric fields use `0` as a "keep upstream default" sentinel, and
82+
/// `data_storage_version` uses NULL the same way. The struct is `#[repr(C)]`
83+
/// and ABI-stable within a minor version.
84+
///
85+
/// `enable_stable_row_ids` is a `bool` and therefore has **no default
86+
/// sentinel** — whatever the caller writes is forwarded verbatim to upstream.
87+
/// Callers that zero-initialize this struct (the documented way to inherit
88+
/// other defaults) end up explicitly setting `enable_stable_row_ids = false`.
89+
/// This matches upstream's current default, so the behavior is identical
90+
/// today; if upstream ever changes that default, callers must set this field
91+
/// explicitly to follow.
92+
#[repr(C)]
93+
pub struct LanceWriteParams {
94+
/// Soft cap on rows per data file. `0` uses upstream's default.
95+
pub max_rows_per_file: u64,
96+
/// Soft cap on rows per row group. `0` uses upstream's default.
97+
pub max_rows_per_group: u64,
98+
/// Soft cap on bytes per data file (~90 GB by default). `0` uses upstream's default.
99+
pub max_bytes_per_file: u64,
100+
/// Lance file format version string, e.g. `"2.0"`, `"2.1"`, `"stable"`,
101+
/// `"legacy"`. NULL uses upstream's default. Invalid strings are rejected
102+
/// with `LANCE_ERR_INVALID_ARGUMENT`.
103+
pub data_storage_version: *const c_char,
104+
/// Strictly an override (no default sentinel — see struct-level docs).
105+
/// Whatever value the caller writes is forwarded to upstream.
106+
pub enable_stable_row_ids: bool,
107+
}
108+
77109
/// Write an Arrow record batch stream to a Lance dataset at `uri`, committing a manifest.
78110
///
79111
/// - `uri`: Dataset URI (`file://`, `s3://`, `memory://`, ...). Must not be NULL or empty.
@@ -88,6 +120,8 @@ impl From<LanceWriteMode> for LanceWriteModeUpstream {
88120
/// version on success (caller closes). Pass NULL to discard. On error
89121
/// `*out_dataset` is untouched — do not read or free it.
90122
///
123+
/// Equivalent to `lance_dataset_write_with_params(..., params = NULL, ...)`.
124+
///
91125
/// Returns 0 on success, -1 on error.
92126
#[unsafe(no_mangle)]
93127
pub unsafe extern "C" fn lance_dataset_write(
@@ -97,9 +131,36 @@ pub unsafe extern "C" fn lance_dataset_write(
97131
mode: i32,
98132
storage_opts: *const *const c_char,
99133
out_dataset: *mut *mut LanceDataset,
134+
) -> i32 {
135+
unsafe {
136+
lance_dataset_write_with_params(
137+
uri,
138+
schema,
139+
stream,
140+
mode,
141+
std::ptr::null(),
142+
storage_opts,
143+
out_dataset,
144+
)
145+
}
146+
}
147+
148+
/// Same as `lance_dataset_write` but takes a `LanceWriteParams` for tuning the
149+
/// output shape. Pass `params = NULL` to use upstream defaults.
150+
#[unsafe(no_mangle)]
151+
pub unsafe extern "C" fn lance_dataset_write_with_params(
152+
uri: *const c_char,
153+
schema: *const FFI_ArrowSchema,
154+
stream: *mut FFI_ArrowArrayStream,
155+
mode: i32,
156+
params: *const LanceWriteParams,
157+
storage_opts: *const *const c_char,
158+
out_dataset: *mut *mut LanceDataset,
100159
) -> i32 {
101160
ffi_try!(
102-
unsafe { write_dataset_inner(uri, schema, stream, mode, storage_opts, out_dataset) },
161+
unsafe {
162+
write_dataset_inner(uri, schema, stream, mode, params, storage_opts, out_dataset)
163+
},
103164
neg
104165
)
105166
}
@@ -109,6 +170,7 @@ unsafe fn write_dataset_inner(
109170
schema: *const FFI_ArrowSchema,
110171
stream: *mut FFI_ArrowArrayStream,
111172
mode: i32,
173+
params: *const LanceWriteParams,
112174
storage_opts: *const *const c_char,
113175
out_dataset: *mut *mut LanceDataset,
114176
) -> Result<i32> {
@@ -198,13 +260,20 @@ unsafe fn write_dataset_inner(
198260
storage_options_accessor: Some(Arc::new(StorageOptionsAccessor::with_static_options(opts))),
199261
..ObjectStoreParams::default()
200262
});
201-
let params = WriteParams {
263+
let mut write_params = WriteParams {
202264
mode: mode.into(),
203265
store_params,
204266
..WriteParams::default()
205267
};
268+
if !params.is_null() {
269+
// SAFETY: `params` is non-NULL (checked above) and the caller
270+
// guarantees it points to a properly-initialized `LanceWriteParams`
271+
// valid for the duration of this call. We borrow by shared reference
272+
// and only read; the borrow is consumed before any await point.
273+
unsafe { apply_write_params(&mut write_params, &*params)? };
274+
}
206275

207-
let dataset = block_on(Dataset::write(reader, uri_str, Some(params)))?;
276+
let dataset = block_on(Dataset::write(reader, uri_str, Some(write_params)))?;
208277

209278
if !out_dataset.is_null() {
210279
let handle = LanceDataset {
@@ -221,3 +290,53 @@ unsafe fn write_dataset_inner(
221290

222291
Ok(0)
223292
}
293+
294+
/// Apply caller-provided overrides onto an `lance::WriteParams`. Zero / NULL
295+
/// fields are no-ops so upstream defaults flow through.
296+
unsafe fn apply_write_params(target: &mut WriteParams, params: &LanceWriteParams) -> Result<()> {
297+
if params.max_rows_per_file > 0 {
298+
target.max_rows_per_file = u64_to_usize(params.max_rows_per_file, "max_rows_per_file")?;
299+
}
300+
if params.max_rows_per_group > 0 {
301+
target.max_rows_per_group = u64_to_usize(params.max_rows_per_group, "max_rows_per_group")?;
302+
}
303+
if params.max_bytes_per_file > 0 {
304+
target.max_bytes_per_file = u64_to_usize(params.max_bytes_per_file, "max_bytes_per_file")?;
305+
}
306+
if !params.data_storage_version.is_null() {
307+
// SAFETY: `data_storage_version` is non-NULL (checked above) and the
308+
// `apply_write_params` caller (`unsafe fn` contract) guarantees it
309+
// points to a NUL-terminated C string valid for the duration of the
310+
// outer FFI call. `parse_c_string` reads by shared reference and the
311+
// returned borrow is consumed before this function returns.
312+
//
313+
// `parse_c_string` returns `None` only for NULL input, which the
314+
// outer check already ruled out. `.filter` lets an empty C string
315+
// also fail presence, producing the clearer message below instead
316+
// of relying on `FromStr`'s generic "unknown version" path.
317+
let s = unsafe { helpers::parse_c_string(params.data_storage_version)? }
318+
.filter(|s| !s.is_empty())
319+
.ok_or_else(|| lance_core::Error::InvalidInput {
320+
source: "data_storage_version must not be an empty string".into(),
321+
location: snafu::location!(),
322+
})?;
323+
let version =
324+
LanceFileVersion::from_str(s).map_err(|e| lance_core::Error::InvalidInput {
325+
source: format!("invalid data_storage_version {s:?}: {e}").into(),
326+
location: snafu::location!(),
327+
})?;
328+
target.data_storage_version = Some(version);
329+
}
330+
target.enable_stable_row_ids = params.enable_stable_row_ids;
331+
Ok(())
332+
}
333+
334+
/// Narrow `u64 -> usize` with an explicit error on overflow (32-bit targets).
335+
/// Realistic write tunings fit in `usize` on every supported target, but a
336+
/// silent `as` cast would wrap on a 32-bit host.
337+
fn u64_to_usize(v: u64, field: &'static str) -> Result<usize> {
338+
usize::try_from(v).map_err(|_| lance_core::Error::InvalidInput {
339+
source: format!("{field}={v} exceeds usize::MAX on this target").into(),
340+
location: snafu::location!(),
341+
})
342+
}

0 commit comments

Comments
 (0)