Repro
When appending rows to an existing Lance dataset that has large_binary blob columns (with lance-encoding:blob metadata), the appended fragment stores those columns as struct<position: uint64, size: uint64> instead of large_binary. This happens even when the correct schema is passed explicitly together with data_storage_version="stable".
Consequences:
- ds.take_blobs() fails on rows from the appended fragment
- ds.optimize.compact_files() fails with a schema mismatch between the original and appended fragments
- The dataset looks fine superficially (row counts are correct, to_table() works), but blob reads are broken for the appended rows
Repro: write a dataset with blob columns using mode="overwrite", then append more rows with mode="append" using the same schema. Inspect ds.schema.field("rgb").type (rgb being the blob column here): it still reports large_binary, taken from the first fragment, but the rows in the second fragment are actually struct-encoded.
Workaround: Always write all rows in a single write_dataset(mode="overwrite") call. Never use mode="append" with blob columns.