fix: propagate update_columns offsets and partial last_updated for RewriteColumns (#6650)
## Summary
* Fixes #6505
* `FileFragment::update_columns` still returns `Result<(Fragment, Vec<u32>)>`
(public shape unchanged). The new `update_columns_with_offsets` returns
`FragmentUpdateColumnsResult` (fragment, `fields_modified`,
`matched_offsets: RoaringBitmap`) for callers that need physical row
indices for stable row-id metadata.
* `HashJoiner::matched_join_rows` returns a boolean mask of hash hits; it is
used by `update_columns_with_offsets` and covered by `test_matched_join_rows`.
* `Operation::Update`: optional `updated_fragment_offsets:
Option<UpdatedFragmentOffsets>`
where `UpdatedFragmentOffsets` wraps `HashMap<u64, RoaringBitmap>`
(newtype with
`Default`, `PartialEq`, manual `DeepSizeOf`). `None` means the caller
did not supply
offsets.
* Proto (`transaction.proto`): backward-compatible `map<uint64,
UInt32List> updated_fragment_offsets = 9`
on `Update`; serde round-trip preserves semantics.
* `build_manifest`: when stable row IDs are enabled, `update_mode ==
RewriteColumns`, and `Some(UpdatedFragmentOffsets(..))` includes a
non-empty bitmap for a fragment, it calls
`refresh_row_latest_update_meta_for_partial_frag_rewrite_cols` for those
offsets only. Unmatched rows and untouched fragments are left unchanged.
* JNI / Java: `FragmentUpdateResult` includes matched row offsets; the
2-arg constructor
`(FragmentMetadata, long[])` delegates to the 3-arg form with an empty
offset array for
compatibility. JNI uses `update_columns_with_offsets`.
* Python: `update_columns` binding correctly destructures the
`(Fragment, Vec<u32>)` tuple.
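
The mask-to-offsets step described in the bullets above can be sketched roughly as follows. This is a minimal std-only illustration, not the code in this change: `matched_mask` and `mask_to_offsets` are hypothetical names, the join keys are assumed to be plain `i64` values, and `BTreeSet<u32>` stands in for `RoaringBitmap`.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for a hash-join probe: for every row on the
/// fragment (probe) side, report whether its key exists on the build side.
fn matched_mask(build_keys: &[i64], probe_keys: &[i64]) -> Vec<bool> {
    let build: HashSet<i64> = build_keys.iter().copied().collect();
    probe_keys.iter().map(|k| build.contains(k)).collect()
}

/// Turn the boolean mask into physical row offsets within the fragment.
/// `BTreeSet<u32>` is an assumption standing in for `RoaringBitmap`.
fn mask_to_offsets(mask: &[bool]) -> BTreeSet<u32> {
    mask.iter()
        .enumerate()
        .filter(|(_, hit)| **hit)
        .map(|(i, _)| i as u32)
        .collect()
}
```

The position of each `true` entry in the mask is the physical row index in the fragment, which is the information the stable row-id metadata needs.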
## Root cause
For `Operation::Update` with `RewriteColumns`, commits could advance the
dataset version
without advancing `_row_last_updated_at_version` for the rows that were
actually rewritten.
`update_columns` did not report which physical offsets matched, and
`build_manifest` had no
per-fragment offset map to drive the partial refresh. Without that
information the transaction
layer cannot distinguish which rows changed, so the version metadata is
not updated.
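
To make the intended behaviour concrete, here is a minimal sketch of a partial refresh, assuming the per-row last-updated versions of one fragment are held in a plain slice. The function name `refresh_last_updated` and the slice representation are hypothetical; the real logic lives in `refresh_row_latest_update_meta_for_partial_frag_rewrite_cols`, and `BTreeSet<u32>` again stands in for `RoaringBitmap`.

```rust
use std::collections::BTreeSet;

/// Advance the last-updated version only for rows whose physical offset
/// was matched by the update; every other row keeps its previous version.
/// Out-of-range offsets are ignored in this sketch.
fn refresh_last_updated(
    last_updated: &mut [u64],
    matched: &BTreeSet<u32>,
    new_version: u64,
) {
    for &offset in matched {
        if let Some(slot) = last_updated.get_mut(offset as usize) {
            *slot = new_version;
        }
    }
}
```

With an empty offset set the slice is left untouched, which corresponds to the pre-fix situation in which no offsets reached `build_manifest`.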
## Implementation notes
* `RoaringBitmap` iteration is ascending and duplicate-free, so the redundant
`sort` / `dedup` calls made when building proto lists or offset vectors
from bitmaps were removed.
* Call sites that do not populate offsets use `updated_fragment_offsets:
None`.
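
The first note can be illustrated with any ordered set type. The sketch below uses `BTreeSet<u32>` as a std-only stand-in for `RoaringBitmap` (an assumption made for this example), and `offsets_to_list` is a hypothetical helper name.

```rust
use std::collections::BTreeSet;

/// Build the offset list straight from the set's iterator. Because the
/// iterator already yields ascending, unique values, no extra `sort` or
/// `dedup` pass is required afterwards.
fn offsets_to_list(offsets: &BTreeSet<u32>) -> Vec<u32> {
    offsets.iter().copied().collect()
}
```

Running `sort_unstable` and `dedup` on the returned vector would leave it unchanged, which is why those passes could be dropped.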
## Why the protobuf field exists
lance-spark passes `Transaction` through JNI as a protobuf blob: Java
builds a `Transaction`
proto, Rust deserializes it and runs `build_manifest`. Without
`updated_fragment_offsets` on
the wire, the decoded `Operation::Update` would always have
`updated_fragment_offsets: None`
even when matched offsets were computed on the JVM side, and the partial
refresh in
`build_manifest` would silently do nothing.
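
The round trip can be sketched as below. This is an illustration under stated assumptions, not the serde code in this change: `Offsets` stands in for `UpdatedFragmentOffsets` (with `BTreeSet<u32>` in place of `RoaringBitmap`), `WireMap` stands in for the generated `map<uint64, UInt32List>` type, and decoding an empty map as `None` is an assumed convention, since a proto map field carries no presence bit.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Stand-in for `UpdatedFragmentOffsets`: fragment id -> matched offsets.
type Offsets = HashMap<u64, BTreeSet<u32>>;
/// Stand-in for the proto field `map<uint64, UInt32List>`.
type WireMap = BTreeMap<u64, Vec<u32>>;

/// Encode: `None` becomes an empty map on the wire.
fn to_wire(offsets: Option<&Offsets>) -> WireMap {
    let mut wire = WireMap::new();
    if let Some(map) = offsets {
        for (frag_id, set) in map {
            wire.insert(*frag_id, set.iter().copied().collect());
        }
    }
    wire
}

/// Decode: an empty map is read back as `None` (assumed convention).
fn from_wire(wire: &WireMap) -> Option<Offsets> {
    if wire.is_empty() {
        return None;
    }
    let mut map = Offsets::new();
    for (frag_id, list) in wire {
        map.insert(*frag_id, list.iter().copied().collect());
    }
    Some(map)
}
```

Under this assumed convention a `Some` holding an empty map also decodes as `None`; both cases mean there is nothing to refresh.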
## Test plan
* `cargo test -p lance test_matched_join_rows` —
`HashJoiner::matched_join_rows`.
* `cargo test -p lance
test_build_manifest_partial_last_updated_rewrite_columns_stable_row_ids`
— `Dataset::commit` -> `build_manifest`: two fragments, partial
`update_columns_with_offsets`, `Operation::Update` with `RewriteColumns`
and an offset
map; asserts matched vs unmatched vs untouched row version metadata.
* `cargo test -p lance test_fragment_update` — fragment path with
`Operation::Update` and
offsets.
* `cargo test -p lance --tests` (or at least `cargo check -p lance
--tests`) and
`cargo check --manifest-path java/lance-jni/Cargo.toml`.
The `pylance` crate is excluded from the root workspace; validate Python
bindings in the
usual `maturin` / CI flow if you touch `python/`.
## Compatibility
* Rust: `update_columns` signature unchanged;
`update_columns_with_offsets` is additive.
* Java: 2-arg `FragmentUpdateResult` constructor preserved.
* Proto: field 9; older clients ignore unknown fields.
---------
Co-authored-by: Jing chen He <jingh@adobe.com>