
[WIP] [core] Introduce BLOB_REF for shared blob data #7602

Open
leaves12138 wants to merge 1 commit into apache:master from leaves12138:ai_blob_ref

Conversation


@leaves12138 (Contributor) commented Apr 7, 2026

Purpose

This PR introduces BLOB_REF for sharing blob data across tables without duplicating payloads in Paimon-managed storage.

Changes

  • add the BLOB_REF type and wire it through API, format, Arrow, Flink, Spark and Hive type conversions
  • serialize BLOB_REF values as BlobReference metadata instead of inline blob payloads
  • resolve blob references lazily on read, preferring direct URI reads and falling back to metadata lookup by table/row/field
  • keep the fallback path streaming instead of buffering the whole blob into memory
  • add fieldId to blob references for better schema evolution compatibility during fallback lookup
  • avoid dereferencing blob payloads in InternalRowToSizeVisitor
  • explicitly reject nested BLOB_REF in schema validation, since read-time resolution currently only supports top-level BLOB_REF
  • add unit tests for blob reference serialization, fallback streaming, size estimation, schema validation and fallback lookup
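To make the reference-instead-of-payload idea concrete, here is a minimal sketch of what serializing BLOB_REF metadata could look like. The field names, byte layout, and class name below are illustrative assumptions for this sketch, not Paimon's actual BlobReference format; the point is only that the file stores a URI plus table/row/field lookup keys (including the fieldId added for schema evolution), never the blob payload itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical blob reference: location metadata instead of the payload.
final class BlobRefSketch {

    final String uri;  // direct location of the shared blob, preferred on read
    final long rowId;  // fallback lookup key: which row holds the blob
    final int fieldId; // stable field id, robust to schema evolution on fallback

    BlobRefSketch(String uri, long rowId, int fieldId) {
        this.uri = uri;
        this.rowId = rowId;
        this.fieldId = fieldId;
    }

    // Serialize only the reference metadata (not the blob bytes) for storage.
    byte[] toBytes() {
        byte[] uriBytes = uri.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + uriBytes.length + 8 + 4);
        buf.putInt(uriBytes.length).put(uriBytes).putLong(rowId).putInt(fieldId);
        return buf.array();
    }

    static BlobRefSketch fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] uriBytes = new byte[buf.getInt()];
        buf.get(uriBytes);
        String uri = new String(uriBytes, StandardCharsets.UTF_8);
        return new BlobRefSketch(uri, buf.getLong(), buf.getInt());
    }
}
```

A reader would first try the uri directly and only fall back to the rowId/fieldId metadata lookup when the direct read fails, which is why both sets of keys are carried.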

Testing

Passed:

  • mvn -pl paimon-common -am -DfailIfNoTests=false -Dcheckstyle.skip -Dspotless.check.skip -Denforcer.skip -Dtest=BlobReferenceTest,BlobReferenceBlobTest,InternalRowToSizeVisitorTest test


@leaves12138 (author) left a comment


I found a few runtime gaps where BLOB_REF support is still incomplete.

if (blobFields.contains(fieldName)) {
    return toBlobType(logicalType);
}
if (blobRefFields.contains(fieldName)) {

This wires the schema option through catalog translation, but the runtime source path still only treats BLOB as a special binary column. FileStoreSourceSplitReader.blobFieldIndex(...) only checks DataTypeRoot.BLOB, so BLOB_REF rows still go through plain FlinkRowData and the engine sees serialized BlobReference bytes instead of the dereferenced payload (and blob-as-descriptor will not apply either). Could we extend that reader path as well?
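The gap described here comes down to a type-root check that accepts only one of the two blob roots. A minimal, self-contained sketch of the suggested fix follows; the reduced TypeRoot enum and method shape are stand-ins for Paimon's DataTypeRoot and FileStoreSourceSplitReader.blobFieldIndex(...), used only to show the one-line nature of the change.

```java
import java.util.List;

// Stand-in for Paimon's DataTypeRoot, reduced to what the sketch needs.
enum TypeRoot { INT, BLOB, BLOB_REF }

class BlobFieldIndexSketch {
    // Returns the index of the first blob-like column, or -1 if none.
    // The fix: accept BLOB_REF wherever BLOB is accepted, so the
    // blob-aware row wrapper is selected for both type roots.
    static int blobFieldIndex(List<TypeRoot> fieldTypes) {
        for (int i = 0; i < fieldTypes.size(); i++) {
            TypeRoot root = fieldTypes.get(i);
            if (root == TypeRoot.BLOB || root == TypeRoot.BLOB_REF) {
                return i;
            }
        }
        return -1;
    }
}
```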


Addressed in 915465dc44. FileStoreSourceSplitReader now treats BLOB_REF the same as BLOB when selecting the blob-aware row wrapper, so the Flink source path no longer returns raw serialized BlobReference bytes and blob-as-descriptor applies consistently.

field.dataType() instanceof org.apache.spark.sql.types.BinaryType,
"The type of blob field must be binary");
type = new BlobType();
} else if (blobRefFields.contains(name)) {

Same concern on the Spark side: adding the catalog/type mapping here is not enough by itself. SparkInternalRow.blobFields(...) still only collects DataTypeRoot.BLOB, so reads return serialized reference bytes, and SparkInternalRowWrapper#getBlob still only recognizes BlobDescriptor, so V2 writes wrap BLOB_REF bytes as BlobData and then fail in BinaryWriter#serializeBlobReference. Could we update those runtime wrappers too?
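The failure mode here is that a binary column can now hold two different encodings, so any getBlob-style wrapper must dispatch on which one it received rather than assume a descriptor. The sketch below illustrates that dispatch shape only; the leading tag byte is an assumption of this sketch, not Paimon's actual encoding or the real BlobUtils.fromBytes(...) contract.

```java
// Illustrative dispatch: a blob-typed binary value may be either an inline
// descriptor or a serialized reference, and the reader must tell them apart.
class BlobBytesDispatchSketch {
    static final byte TAG_DESCRIPTOR = 0; // hypothetical marker for inline descriptors
    static final byte TAG_REFERENCE = 1;  // hypothetical marker for serialized references

    static String classify(byte[] bytes) {
        switch (bytes[0]) {
            case TAG_DESCRIPTOR:
                return "descriptor"; // resolve payload directly
            case TAG_REFERENCE:
                return "reference";  // resolve lazily via the reference lookup
            default:
                throw new IllegalArgumentException(
                        "unknown blob encoding tag: " + bytes[0]);
        }
    }
}
```

The bug described in the comment is exactly the missing second branch: treating every value as a descriptor wraps reference bytes as plain blob data, which then fails downstream in the writer.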


Addressed in 915465dc44. SparkInternalRow.blobFields(...) now includes BLOB_REF, and both SparkInternalRowWrapper#getBlob and SparkRow#getBlob now decode through BlobUtils.fromBytes(...) with the BlobReferenceLookup resolver, so the V1/V2 write paths no longer wrap BLOB_REF bytes as plain BlobData.

}

@Override
public FieldWriter visit(BlobRefType blobRefType) {

This adds the ORC field writer, but OrcTypeUtil.convertToOrcType(...) still only has a case BLOB branch. That means an ORC table with a BLOB_REF column never reaches this writer because schema conversion fails first. I think OrcTypeUtil needs the same BLOB_REF -> binary mapping.
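The missing piece is a schema-conversion branch, which for both blob roots is the same mapping to the format's binary type. The following self-contained sketch shows the shape of that branch; the enum and the string type names are stand-ins, not OrcTypeUtil's actual API.

```java
// Stand-in type roots; OrcTypeUtil.convertToOrcType(...) switches on
// Paimon's real DataTypeRoot instead.
enum SketchTypeRoot { INT, STRING, BLOB, BLOB_REF }

class ToOrcTypeSketch {
    static String convert(SketchTypeRoot root) {
        switch (root) {
            case INT:
                return "int";
            case STRING:
                return "string";
            case BLOB:     // payload stored inline as binary
            case BLOB_REF: // serialized reference metadata, also binary on disk
                return "binary";
            default:
                throw new UnsupportedOperationException("unsupported type: " + root);
        }
    }
}
```

Without the BLOB_REF case, conversion throws before the new field writer is ever reached, which is the failure the comment describes.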


Addressed in 915465dc44. OrcTypeUtil.convertToOrcType(...) now maps BLOB_REF to ORC binary before the writer path, and I added OrcTypeUtilTest coverage for the new type.

}

@Override
public UpdaterFactory visit(BlobRefType blobRefType) {

Likewise for parquet, the reader updater is mirrored here, but the write/schema side still only switches on BLOB (ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil). With the current diff a parquet table containing BLOB_REF is still unsupported. Should those code paths be updated together?


Addressed in 915465dc44. I updated ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil so BLOB_REF is handled as reference bytes end-to-end on the parquet schema/write/read path. While touching the format stack I also filled the same schema/read/write gap for Avro.

@leaves12138 changed the title from "[core] Introduce BLOB_REF for shared blob data" to "[WIP] [core] Introduce BLOB_REF for shared blob data" on Apr 7, 2026