
[WIP] [core] Introduce BLOB_REF for shared blob data #7602

Open
leaves12138 wants to merge 1 commit into apache:master from leaves12138:ai_blob_ref

Conversation


@leaves12138 (Contributor) commented Apr 7, 2026

Purpose

This PR introduces BLOB_REF for sharing blob data across tables without duplicating payloads in Paimon-managed storage.

Changes

  • add the BLOB_REF type and wire it through API, format, Arrow, Flink, Spark and Hive type conversions
  • serialize BLOB_REF values as BlobReference metadata instead of inline blob payloads
  • resolve blob references lazily on read, preferring direct URI reads and falling back to metadata lookup by table/row/field
  • keep the fallback path streaming instead of buffering the whole blob into memory
  • add fieldId to blob references for better schema evolution compatibility during fallback lookup
  • avoid dereferencing blob payloads in InternalRowToSizeVisitor
  • explicitly reject nested BLOB_REF in schema validation, since read-time resolution currently only supports top-level BLOB_REF
  • add unit tests for blob reference serialization, fallback streaming, size estimation, schema validation and fallback lookup
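To make the reference-instead-of-payload idea concrete, here is a minimal sketch of what serializing BLOB_REF metadata could look like. The field names, byte layout, and class name below are illustrative assumptions for this sketch, not Paimon's actual BlobReference format; the point is only that the file stores a URI plus table/row/field lookup keys (including the fieldId added for schema evolution), never the blob payload itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical blob reference: location metadata instead of the payload.
final class BlobRefSketch {

    final String uri;  // direct location of the shared blob, preferred on read
    final long rowId;  // fallback lookup key: which row holds the blob
    final int fieldId; // stable field id, robust to schema evolution on fallback

    BlobRefSketch(String uri, long rowId, int fieldId) {
        this.uri = uri;
        this.rowId = rowId;
        this.fieldId = fieldId;
    }

    // Serialize only the reference metadata (not the blob bytes) for storage.
    byte[] toBytes() {
        byte[] uriBytes = uri.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + uriBytes.length + 8 + 4);
        buf.putInt(uriBytes.length).put(uriBytes).putLong(rowId).putInt(fieldId);
        return buf.array();
    }

    static BlobRefSketch fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] uriBytes = new byte[buf.getInt()];
        buf.get(uriBytes);
        String uri = new String(uriBytes, StandardCharsets.UTF_8);
        return new BlobRefSketch(uri, buf.getLong(), buf.getInt());
    }
}
```

A reader would first try the uri directly and only fall back to the rowId/fieldId metadata lookup when the direct read fails, which is why both sets of keys are carried.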

Testing

Passed:

  • mvn -pl paimon-common -am -DfailIfNoTests=false -Dcheckstyle.skip -Dspotless.check.skip -Denforcer.skip -Dtest=BlobReferenceTest,BlobReferenceBlobTest,InternalRowToSizeVisitorTest test


@leaves12138 (author) left a comment


I found a few runtime gaps where BLOB_REF support is still incomplete.

if (blobFields.contains(fieldName)) {
    return toBlobType(logicalType);
}
if (blobRefFields.contains(fieldName)) {

This wires the schema option through catalog translation, but the runtime source path still only treats BLOB as a special binary column. FileStoreSourceSplitReader.blobFieldIndex(...) only checks DataTypeRoot.BLOB, so BLOB_REF rows still go through plain FlinkRowData and the engine sees serialized BlobReference bytes instead of the dereferenced payload (and blob-as-descriptor will not apply either). Could we extend that reader path as well?
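The gap described here comes down to a type-root check that accepts only one of the two blob roots. A minimal, self-contained sketch of the suggested fix follows; the reduced TypeRoot enum and method shape are stand-ins for Paimon's DataTypeRoot and FileStoreSourceSplitReader.blobFieldIndex(...), used only to show the one-line nature of the change.

```java
import java.util.List;

// Stand-in for Paimon's DataTypeRoot, reduced to what the sketch needs.
enum TypeRoot { INT, BLOB, BLOB_REF }

class BlobFieldIndexSketch {
    // Returns the index of the first blob-like column, or -1 if none.
    // The fix: accept BLOB_REF wherever BLOB is accepted, so the
    // blob-aware row wrapper is selected for both type roots.
    static int blobFieldIndex(List<TypeRoot> fieldTypes) {
        for (int i = 0; i < fieldTypes.size(); i++) {
            TypeRoot root = fieldTypes.get(i);
            if (root == TypeRoot.BLOB || root == TypeRoot.BLOB_REF) {
                return i;
            }
        }
        return -1;
    }
}
```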


Addressed in 915465dc44. FileStoreSourceSplitReader now treats BLOB_REF the same as BLOB when selecting the blob-aware row wrapper, so the Flink source path no longer returns raw serialized BlobReference bytes and blob-as-descriptor applies consistently.

field.dataType() instanceof org.apache.spark.sql.types.BinaryType,
"The type of blob field must be binary");
type = new BlobType();
} else if (blobRefFields.contains(name)) {

Same concern on the Spark side: adding the catalog/type mapping here is not enough by itself. SparkInternalRow.blobFields(...) still only collects DataTypeRoot.BLOB, so reads return serialized reference bytes, and SparkInternalRowWrapper#getBlob still only recognizes BlobDescriptor, so V2 writes wrap BLOB_REF bytes as BlobData and then fail in BinaryWriter#serializeBlobReference. Could we update those runtime wrappers too?
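The failure mode here is that a binary column can now hold two different encodings, so any getBlob-style wrapper must dispatch on which one it received rather than assume a descriptor. The sketch below illustrates that dispatch shape only; the leading tag byte is an assumption of this sketch, not Paimon's actual encoding or the real BlobUtils.fromBytes(...) contract.

```java
// Illustrative dispatch: a blob-typed binary value may be either an inline
// descriptor or a serialized reference, and the reader must tell them apart.
class BlobBytesDispatchSketch {
    static final byte TAG_DESCRIPTOR = 0; // hypothetical marker for inline descriptors
    static final byte TAG_REFERENCE = 1;  // hypothetical marker for serialized references

    static String classify(byte[] bytes) {
        switch (bytes[0]) {
            case TAG_DESCRIPTOR:
                return "descriptor"; // resolve payload directly
            case TAG_REFERENCE:
                return "reference";  // resolve lazily via the reference lookup
            default:
                throw new IllegalArgumentException(
                        "unknown blob encoding tag: " + bytes[0]);
        }
    }
}
```

The bug described in the comment is exactly the missing second branch: treating every value as a descriptor wraps reference bytes as plain blob data, which then fails downstream in the writer.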


Addressed in 915465dc44. SparkInternalRow.blobFields(...) now includes BLOB_REF, and both SparkInternalRowWrapper#getBlob and SparkRow#getBlob now decode through BlobUtils.fromBytes(...) with the BlobReferenceLookup resolver, so the V1/V2 write paths no longer wrap BLOB_REF bytes as plain BlobData.

}

@Override
public FieldWriter visit(BlobRefType blobRefType) {

This adds the ORC field writer, but OrcTypeUtil.convertToOrcType(...) still only has a case BLOB branch. That means an ORC table with a BLOB_REF column never reaches this writer because schema conversion fails first. I think OrcTypeUtil needs the same BLOB_REF -> binary mapping.
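The missing piece is a schema-conversion branch, which for both blob roots is the same mapping to the format's binary type. The following self-contained sketch shows the shape of that branch; the enum and the string type names are stand-ins, not OrcTypeUtil's actual API.

```java
// Stand-in type roots; OrcTypeUtil.convertToOrcType(...) switches on
// Paimon's real DataTypeRoot instead.
enum SketchTypeRoot { INT, STRING, BLOB, BLOB_REF }

class ToOrcTypeSketch {
    static String convert(SketchTypeRoot root) {
        switch (root) {
            case INT:
                return "int";
            case STRING:
                return "string";
            case BLOB:     // payload stored inline as binary
            case BLOB_REF: // serialized reference metadata, also binary on disk
                return "binary";
            default:
                throw new UnsupportedOperationException("unsupported type: " + root);
        }
    }
}
```

Without the BLOB_REF case, conversion throws before the new field writer is ever reached, which is the failure the comment describes.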


Addressed in 915465dc44. OrcTypeUtil.convertToOrcType(...) now maps BLOB_REF to ORC binary before the writer path, and I added OrcTypeUtilTest coverage for the new type.

}

@Override
public UpdaterFactory visit(BlobRefType blobRefType) {

Likewise for parquet, the reader updater is mirrored here, but the write/schema side still only switches on BLOB (ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil). With the current diff a parquet table containing BLOB_REF is still unsupported. Should those code paths be updated together?


Addressed in 915465dc44. I updated ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil so BLOB_REF is handled as reference bytes end-to-end on the parquet schema/write/read path. While touching the format stack I also filled the same schema/read/write gap for Avro.

@leaves12138 changed the title from "[core] Introduce BLOB_REF for shared blob data" to "[WIP] [core] Introduce BLOB_REF for shared blob data" on Apr 7, 2026