[WIP] [core] Introduce BLOB_REF for shared blob data #7602
leaves12138 wants to merge 1 commit into apache:master
Conversation
leaves12138 left a comment:
I found a few runtime gaps where BLOB_REF support is still incomplete.
if (blobFields.contains(fieldName)) {
    return toBlobType(logicalType);
}
if (blobRefFields.contains(fieldName)) {
This wires the schema option through catalog translation, but the runtime source path still only treats BLOB as a special binary column. FileStoreSourceSplitReader.blobFieldIndex(...) only checks DataTypeRoot.BLOB, so BLOB_REF rows still go through plain FlinkRowData and the engine sees serialized BlobReference bytes instead of the dereferenced payload (and blob-as-descriptor will not apply either). Could we extend that reader path as well?
Addressed in 915465dc44. FileStoreSourceSplitReader now treats BLOB_REF the same as BLOB when selecting the blob-aware row wrapper, so the Flink source path no longer returns raw serialized BlobReference bytes and blob-as-descriptor applies consistently.
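The fix described above amounts to widening a type-root check so both blob variants select the blob-aware wrapper. A minimal self-contained sketch of that idea (the enum and helper below are simplified stand-ins, not Paimon's actual `DataTypeRoot` or `FileStoreSourceSplitReader`):

```java
// Simplified stand-in for Paimon's DataTypeRoot; only the variants needed here.
enum DataTypeRoot { INT, VARCHAR, BLOB, BLOB_REF }

class BlobFieldSelector {

    // A field qualifies for the blob-aware row wrapper if it is BLOB or BLOB_REF;
    // checking only BLOB was the gap that leaked raw serialized reference bytes.
    static boolean isBlobLike(DataTypeRoot root) {
        return root == DataTypeRoot.BLOB || root == DataTypeRoot.BLOB_REF;
    }

    // Returns the index of the first blob-like field, or -1 if none.
    static int blobFieldIndex(DataTypeRoot[] fieldTypes) {
        for (int i = 0; i < fieldTypes.length; i++) {
            if (isBlobLike(fieldTypes[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

With this shape, BLOB_REF columns route through the same wrapper as BLOB, so dereferencing and blob-as-descriptor behave identically for both.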
field.dataType() instanceof org.apache.spark.sql.types.BinaryType,
"The type of blob field must be binary");
type = new BlobType();
} else if (blobRefFields.contains(name)) {
Same concern on the Spark side: adding the catalog/type mapping here is not enough by itself. SparkInternalRow.blobFields(...) still only collects DataTypeRoot.BLOB, so reads return serialized reference bytes, and SparkInternalRowWrapper#getBlob still only recognizes BlobDescriptor, so V2 writes wrap BLOB_REF bytes as BlobData and then fail in BinaryWriter#serializeBlobReference. Could we update those runtime wrappers too?
Addressed in 915465dc44. SparkInternalRow.blobFields(...) now includes BLOB_REF, and both SparkInternalRowWrapper#getBlob and SparkRow#getBlob now decode through BlobUtils.fromBytes(...) with the BlobReferenceLookup resolver, so the V1/V2 write paths no longer wrap BLOB_REF bytes as plain BlobData.
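The decode path described in this reply can be illustrated with a small self-contained sketch. The marker byte, the `Blob` types, and the lookup function below are illustrative assumptions, not the actual `BlobUtils.fromBytes(...)` / `BlobReferenceLookup` contract:

```java
import java.util.function.Function;

// Simplified stand-ins for the blob abstractions; real Paimon types differ.
interface Blob { byte[] payload(); }

record InlineBlob(byte[] payload) implements Blob {}
record ResolvedBlob(byte[] payload) implements Blob {}

class BlobDecoder {

    // Assumption for this sketch: serialized references start with a marker byte.
    static final byte REF_MARKER = (byte) 0xFE;

    // Instead of unconditionally wrapping bytes as inline blob data (the old bug),
    // first check whether the bytes are a serialized reference and resolve it
    // through the supplied lookup before handing the payload to the engine.
    static Blob fromBytes(byte[] bytes, Function<byte[], byte[]> referenceLookup) {
        if (bytes.length > 0 && bytes[0] == REF_MARKER) {
            byte[] refKey = java.util.Arrays.copyOfRange(bytes, 1, bytes.length);
            return new ResolvedBlob(referenceLookup.apply(refKey));
        }
        return new InlineBlob(bytes);
    }
}
```

The key point is the branch: reference bytes must be recognized and resolved, never re-wrapped as plain blob data, which is what previously caused the failure in `BinaryWriter#serializeBlobReference`.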
}

@Override
public FieldWriter visit(BlobRefType blobRefType) {
This adds the ORC field writer, but OrcTypeUtil.convertToOrcType(...) still only has a case BLOB branch. That means an ORC table with a BLOB_REF column never reaches this writer because schema conversion fails first. I think OrcTypeUtil needs the same BLOB_REF -> binary mapping.
Addressed in 915465dc44. OrcTypeUtil.convertToOrcType(...) now maps BLOB_REF to ORC binary before the writer path, and I added OrcTypeUtilTest coverage for the new type.
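The schema-conversion gap is easy to see in isolation: if the type switch has no BLOB_REF branch, conversion throws before the field writer is ever consulted. A self-contained sketch of the mapping (the enum and string type names are stand-ins, not ORC's `TypeDescription` API):

```java
// Simplified stand-in for Paimon's type roots.
enum PaimonTypeRoot { INT, VARCHAR, BLOB, BLOB_REF }

class OrcTypeMapping {

    // Sketch of the fix: BLOB and BLOB_REF both map to ORC binary, so a
    // BLOB_REF column survives schema conversion and reaches the field writer.
    static String toOrcTypeName(PaimonTypeRoot root) {
        switch (root) {
            case INT:
                return "int";
            case VARCHAR:
                return "string";
            case BLOB:
            case BLOB_REF:
                // Reference bytes are stored as plain binary in the file format.
                return "binary";
            default:
                throw new UnsupportedOperationException("Unsupported type: " + root);
        }
    }
}
```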
}

@Override
public UpdaterFactory visit(BlobRefType blobRefType) {
Likewise for parquet, the reader updater is mirrored here, but the write/schema side still only switches on BLOB (ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil). With the current diff a parquet table containing BLOB_REF is still unsupported. Should those code paths be updated together?
Addressed in 915465dc44. I updated ParquetSchemaConverter, ParquetRowDataWriter, and ParquetReaderUtil so BLOB_REF is handled as reference bytes end-to-end on the parquet schema/write/read path. While touching the format stack I also filled the same schema/read/write gap for Avro.
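The visitor shape quoted in the diff above generalizes across the format stack: BLOB and BLOB_REF should dispatch to the same bytes-based handling on both sides. A minimal visitor sketch (all type and visitor names here are simplified stand-ins, not the Parquet reader's actual `UpdaterFactory` types):

```java
// Minimal visitor mirroring the reader-updater shape from the diff.
interface TypeVisitor<R> {
    R visitInt();
    R visitBlob();
    R visitBlobRef();
}

class BytesColumnPicker implements TypeVisitor<String> {

    @Override
    public String visitInt() {
        return "int32-updater";
    }

    // BLOB and BLOB_REF share the bytes-based updater: both are read as binary
    // columns; only the interpretation of the bytes differs downstream, which
    // is why every switch on BLOB in the schema/write/read path needed a
    // matching BLOB_REF branch.
    @Override
    public String visitBlob() {
        return "bytes-updater";
    }

    @Override
    public String visitBlobRef() {
        return "bytes-updater";
    }
}
```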
Purpose
This PR introduces BLOB_REF for sharing blob data across tables without duplicating payloads in Paimon-managed storage.
Changes
- Introduce the BLOB_REF type and wire it through API, format, Arrow, Flink, Spark and Hive type conversions
- Store BLOB_REF values as BlobReference metadata instead of inline blob payloads
- Add fieldId to blob references for better schema evolution compatibility during fallback lookup
- Handle BLOB_REF in InternalRowToSizeVisitor
- Disallow nested BLOB_REF in schema validation, since read-time resolution currently only supports top-level BLOB_REF
Testing
Passed:
mvn -pl paimon-common -am -DfailIfNoTests=false -Dcheckstyle.skip -Dspotless.check.skip -Denforcer.skip -Dtest=BlobReferenceTest,BlobReferenceBlobTest,InternalRowToSizeVisitorTest test
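The fieldId mentioned in the Changes list is what makes reference resolution robust to schema evolution. A self-contained sketch of that idea (the record layout and lookup method are illustrative assumptions, not Paimon's actual BlobReference serialization):

```java
// Illustrative shape of a blob reference carrying a stable fieldId.
record BlobRef(String fileUri, long offset, long length, int fieldId) {}

class RefLookup {

    // During schema evolution, column positions can shift between the writer's
    // and reader's schemas; resolving by the stable fieldId instead of by
    // position keeps the fallback lookup pointed at the right column.
    static int resolveIndex(int[] tableFieldIds, BlobRef ref) {
        for (int i = 0; i < tableFieldIds.length; i++) {
            if (tableFieldIds[i] == ref.fieldId()) {
                return i;
            }
        }
        return -1; // field no longer present in the current schema
    }
}
```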