
Appending dataset corrupts blob columns #6381

@eddyxu

Description


Repro

When appending rows to an existing Lance dataset that has large_binary blob columns (with lance-encoding:blob metadata), the appended fragment stores those columns as struct<position: uint64, size: uint64> instead of large_binary. This happens even when the correct schema is passed explicitly with data_storage_version="stable".

Consequences:

- ds.take_blobs() fails on rows from the appended fragment
- ds.optimize.compact_files() fails with a schema mismatch between the original and appended fragments
- The dataset looks fine superficially (row counts are correct, to_table() works), but blob reads are broken

Repro: write a dataset with blob columns using mode="overwrite", then append more rows with mode="append" using the same schema. Inspect ds.schema.field("rgb").type: it reports large_binary (from the first fragment), but rows in the second fragment are actually struct-encoded.
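
A minimal sketch of the repro steps described above, not the reporter's actual script. The `id` column, the row count, the 1 KiB payloads, and the helper names `make_table` and `repro` are assumptions made for illustration. The `take_blobs("rgb", indices=...)` call assumes a recent pylance signature; older releases took the row ids as the first argument. `pyarrow` and `lance` are imported inside the functions so the definitions load even where Lance is not installed.

```python
# Blob marker named in the report; Lance reads it from the field metadata.
BLOB_META = {"lance-encoding:blob": "true"}


def make_table(n_rows, start=0):
    """Build a small table with an id column and a large_binary blob column."""
    import pyarrow as pa

    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("rgb", pa.large_binary(), metadata=BLOB_META),
    ])
    ids = list(range(start, start + n_rows))
    # 1 KiB of a repeated byte per row; the payload content is arbitrary.
    blobs = [bytes([i % 256]) * 1024 for i in ids]
    return pa.table({"id": ids, "rgb": blobs}, schema=schema)


def repro(uri, n_rows=4):
    """Write one fragment with overwrite, append a second, then read blobs."""
    import lance

    first = make_table(n_rows)
    lance.write_dataset(first, uri, schema=first.schema,
                        mode="overwrite", data_storage_version="stable")

    second = make_table(n_rows, start=n_rows)
    ds = lance.write_dataset(second, uri, schema=second.schema,
                             mode="append", data_storage_version="stable")

    # Dataset-level schema, which the report says still shows large_binary.
    print(ds.schema.field("rgb").type)

    # Rows n_rows..2*n_rows-1 live in the appended fragment.
    appended = list(range(n_rows, 2 * n_rows))
    # Assumed signature (recent pylance); per the report this call fails.
    return ds.take_blobs("rgb", indices=appended)
```

Calling `repro("/tmp/blob_append.lance")` (the path is a placeholder) on an affected version should, per the report, print large_binary and then fail inside take_blobs.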

Workaround: Always write all rows in a single write_dataset(mode="overwrite") call. Never use mode="append" with blob columns.
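
One way to enforce the workaround in calling code is a small guard, sketched here under stated assumptions. `has_blob_columns` and `guarded_write` are hypothetical helpers, not part of Lance. The guard assumes pyarrow exposes field metadata with bytes keys and that all batches can be held in memory and concatenated before a single overwrite.

```python
# pyarrow returns field metadata with bytes keys and values.
BLOB_KEY = b"lance-encoding:blob"


def has_blob_columns(schema):
    """Return True if any field carries the lance-encoding:blob metadata key."""
    return any(
        field.metadata is not None and BLOB_KEY in field.metadata
        for field in schema
    )


def guarded_write(tables, uri, mode="overwrite"):
    """Concatenate all tables and write them in one call.

    Hypothetical wrapper: refuses mode="append" when the schema has blob
    columns, following the workaround in this issue.
    """
    if not tables:
        raise ValueError("no tables to write")
    if mode == "append" and has_blob_columns(tables[0].schema):
        raise ValueError(
            "refusing mode='append' on a dataset with blob columns; "
            "write all rows in a single overwrite instead"
        )

    # Imported lazily so the guard above works without Lance installed.
    import lance
    import pyarrow as pa

    combined = pa.concat_tables(tables)
    return lance.write_dataset(combined, uri, schema=combined.schema,
                               mode=mode, data_storage_version="stable")
```

The guard only covers writes that go through this wrapper, and it does nothing for datasets that already contain an appended fragment.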

Labels: bug (Something isn't working), format (File format)
