docs/src/config.md
Lines changed: 45 additions & 1 deletion
@@ -514,4 +514,48 @@ A column is treated as blob v2 when the Arrow field carries `ARROW:extension:nam
514
514
515
515
Filter pushdown for SQL `WHERE` is disabled on blob v2 tables; Spark evaluates predicates after the scan. Zonemap-based fragment pruning still runs.
516
516
517
-
The connector does not materialize blob bytes on read; queries against descriptor fields fetch metadata only.
517
+
The connector does not materialize blob bytes on read; queries against descriptor fields fetch metadata only. See [Blob v2 Writes](#blob-v2-writes) below for the write path.
518
+
519
+
## Blob v2 Writes
520
+
521
+
To write blob v2 columns, set `file_format_version` to `2.2` or higher and set
522
+
`<column>.lance.encoding = blob` in `TBLPROPERTIES`.
523
+
524
+
Spark still sees the column as `BINARY` when writing. The connector converts that binary
525
+
value into the Arrow blob write struct during encoding.
526
+
527
+
On reads, blob v2 columns are exposed as descriptor structs. See
528
+
[Blob v2 Reads](#blob-v2-reads). For writes, `INSERT` and DataFrame append still take
529
+
`BINARY`.
530
+
531
+
```sql
CREATE TABLE lance.mydb.users (
    id INT NOT NULL,
    content BINARY
) USING lance
TBLPROPERTIES (
    -- Store the `content` column with blob encoding.
    'content.lance.encoding' = 'blob',
    -- 2.2 or higher selects blob v2.
    'file_format_version' = '2.2'
);
```
541
+
542
+
With `file_format_version = '2.2'` or higher, blob columns are written using blob v2
543
+
encoding and `ARROW:extension:name = lance.blob.v2` metadata.
544
+
545
+
With an older version, or when `file_format_version` is not set, blob columns use the
546
+
legacy v1 encoding with `lance-encoding:blob = true` metadata.
547
+
548
+
Blob encoding requires a numeric `file_format_version`, such as `2.2`.
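
The version rule above can be sketched in a few lines of Python. This is an illustration only, not the connector's code: the function names `parse_version` and `blob_field_metadata` are hypothetical, and raising `ValueError` for a non-numeric version is an assumption about how the numeric requirement is enforced. The metadata keys and values are the ones quoted in the text.

```python
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a numeric version such as "2.2" into a tuple like (2, 2)."""
    parts = text.strip().split(".")
    if not all(p.isdigit() for p in parts):
        # Assumption: non-numeric values are rejected for blob columns.
        raise ValueError(f"file_format_version must be numeric, got {text!r}")
    nums = tuple(int(p) for p in parts)
    # Pad a bare major version ("2") to (2, 0) so comparisons stay well defined.
    return nums + (0,) * (2 - len(nums))


def blob_field_metadata(file_format_version: Optional[str]) -> Dict[str, str]:
    """Return the field metadata a blob column would carry for this version."""
    if file_format_version is None:
        # Unset version: legacy v1 encoding.
        return {"lance-encoding:blob": "true"}
    if parse_version(file_format_version) >= (2, 2):
        # 2.2 or higher: blob v2 encoding.
        return {"ARROW:extension:name": "lance.blob.v2"}
    # Older numeric version: legacy v1 encoding.
    return {"lance-encoding:blob": "true"}
```

Comparing versions as integer tuples rather than strings keeps a value such as `2.10` ordered after `2.2`.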
549
+
550
+
Blob v2 writes must go through the catalog path. Use SQL DDL with `TBLPROPERTIES`, as
Lance supports large string columns for storing very large text data. By default, Arrow uses `Utf8` (VarChar) type with 32-bit offsets, which limits total string data to 2GB per batch. For columns containing very large strings (e.g., document content, base64-encoded data), you can use `LargeUtf8` (LargeVarChar) with 64-bit offsets.