Skip to content

[core] FormatTable supports Blob Format #8191

Open
steFaiz wants to merge 2 commits into apache:master from steFaiz:format_table_blob

steFaiz commented Jun 10, 2026

Contributor

Purpose

Support the Blob format in FormatTable.
The goal is to replace the object store with Paimon on DFS, unifying storage engines. Consider this scenario:

  1. Users parse big videos, splitting each one into hundreds of images.
  2. This is always done in a UDF: the input is a video and the output is a JSON map of <ImageIdentifier, ImageURL>. The results are exported to structured storage, e.g. ODPS.
  3. Image splitting and upload are done within the UDF. Previously those images were uploaded to OSS. Now we can use a Paimon FormatTable to store them, and we can get the BlobDescriptor easily via BlobConsumers.

The key advantages are:

  1. Partition-level management: drop/overwrite partitions to manage blob lifecycle natively
  2. Drastically fewer files: N blobs packed into one file instead of N separate objects.
  3. BlobDescriptor output: each written blob returns a descriptor (path + offset + length) that downstream structured tables (e.g., ODPS) can consume via UDF for random access.
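
To make the packing and random-access points concrete, here is a minimal sketch of the idea. It is not Paimon's implementation: `SketchDescriptor` and `SketchBlobPack` are hypothetical names, the sketch uses the local file system instead of DFS, and it ignores whatever headers or index the real blob file format writes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a blob descriptor; not Paimon's BlobDescriptor class.
final class SketchDescriptor {
    final Path path;   // packed file that holds the blob
    final long offset; // first byte of the blob inside that file
    final int length;  // number of bytes in the blob

    SketchDescriptor(Path path, long offset, int length) {
        this.path = path;
        this.offset = offset;
        this.length = length;
    }
}

final class SketchBlobPack {
    // Appends every blob to a single file and records where each one landed.
    static List<SketchDescriptor> pack(Path file, List<byte[]> blobs) {
        List<SketchDescriptor> descriptors = new ArrayList<>();
        long offset = 0;
        try (OutputStream out = Files.newOutputStream(file)) {
            for (byte[] blob : blobs) {
                out.write(blob);
                descriptors.add(new SketchDescriptor(file, offset, blob.length));
                offset += blob.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return descriptors;
    }

    // Random access: seek to the blob and read only its bytes.
    static byte[] read(SketchDescriptor d) {
        try (RandomAccessFile in = new RandomAccessFile(d.path.toFile(), "r")) {
            byte[] bytes = new byte[d.length];
            in.seek(d.offset);
            in.readFully(bytes);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A downstream consumer that holds only a descriptor can call `SketchBlobPack.read(descriptor)` without scanning the other blobs in the same file.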

Restriction

Currently, a Blob FormatTable may have only one non-partition column.

Tests

See org.apache.paimon.table.format.FormatTableBlobTest

enum Format {
ORC,
PARQUET,
BLOB,

Contributor


Adding BLOB here also exposes format-table projection paths. For a table like (payload BLOB, ds INT) PARTITIONED BY (ds), projecting only ds makes FormatReadBuilder remove partition columns before creating the file reader, so the projectedRowType passed to BlobFileFormat is empty. BlobFileFormat currently requires a BLOB field and throws, whereas other format tables can satisfy partition-only projections. Please handle this case, for example by reading only the blob file metadata to get the row count and then appending partition columns, or by adding an explicit supported projection path with a test.
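
One possible shape for the partition-only path suggested above, as a rough sketch: once the row count is known, the reader can emit that many rows containing only the partition values and never touch the blob bytes. `PartitionOnlyRowsSketch` is a hypothetical helper, rows are modeled as plain `Object[]` rather than Paimon's row types, and how the row count is obtained from the blob file metadata is assumed, not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; rows are plain Object[] instead of Paimon row types.
final class PartitionOnlyRowsSketch {
    // Builds rowCount rows that carry only the partition values.
    // rowCount is assumed to come from the blob file's metadata.
    static List<Object[]> rows(int rowCount, Object... partitionValues) {
        List<Object[]> rows = new ArrayList<>(rowCount);
        for (int i = 0; i < rowCount; i++) {
            // Each row gets its own copy so callers can mutate rows safely.
            rows.add(partitionValues.clone());
        }
        return rows;
    }
}
```

For the `(payload BLOB, ds INT)` example, a file with three blobs in some partition `ds` would yield three single-field rows, each holding that `ds` value.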

}
if (writer instanceof FileAwareFormatWriter) {
FileAwareFormatWriter fileAwareFormatWriter = (FileAwareFormatWriter) writer;
fileAwareFormatWriter.setFile(path);

Contributor


setFile(path) is not enough for the withBlobConsumer path. BlobFormatWriter invokes the consumer while writing and the emitted descriptor points at this target path, but this writer is backed by a TwoPhaseOutputStream, so the target file is not visible until FormatTableCommit commits it; if a later write/commit fails, abort()/FormatTableCommit.abort() discards it anyway. This violates the TableWrite.withBlobConsumer contract that these files are left for the caller to clean up, and leaves already-emitted descriptors dangling. Please either make the consumer path use visible/non-deleted files like SingleFileWriter does with deleteFileUponAbort(), or defer/avoid emitting descriptors until the file has actually been committed, and add a failure-path test.
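
The "defer emitting descriptors until commit" option could look roughly like the sketch below. It is an illustration only: `DeferredDescriptorEmitter` is a hypothetical class, it uses `java.util.function.Consumer` rather than Paimon's own consumer interface, and the commit/abort hooks are assumed to be called by the table commit code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical wrapper: buffers descriptors while writing and forwards them
// to the user's consumer only after the commit is known to have succeeded.
final class DeferredDescriptorEmitter<T> implements Consumer<T> {
    private final Consumer<T> downstream;
    private final List<T> pending = new ArrayList<>();

    DeferredDescriptorEmitter(Consumer<T> downstream) {
        this.downstream = downstream;
    }

    // Called by the writer for every blob written; nothing is exposed yet.
    @Override
    public void accept(T descriptor) {
        pending.add(descriptor);
    }

    // Assumed hook: invoked once the target files are visible.
    void onCommitSucceeded() {
        pending.forEach(downstream);
        pending.clear();
    }

    // Assumed hook: invoked on abort, so no dangling descriptor escapes.
    void onAbort() {
        pending.clear();
    }
}
```

The trade-off is that descriptors are held in memory until commit, and the caller receives them later than with a consumer that fires during the write.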

Contributor Author


@JingsongLi Thanks for your review! This scenario is a bit tricky.
Currently FormatTable on DFS uses RENAME to implement two-phase commit, so the path that was set is not real yet; it only exists after commit. In that case, if the commit fails and is aborted, it is meaningless to retain the written files, because they are in a temp dir and their paths do not match the paths stored in the BlobDescriptors.
(In Python, however, no two-phase commit is implemented, so I still retain written files on abort.)
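
The rename-based behavior described above can be sketched as follows. This is a simplified local-file-system illustration, not the actual TwoPhaseOutputStream code; `RenameCommitSketch` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of commit-by-rename on a local file system.
final class RenameCommitSketch {
    private final Path tempPath;   // where bytes actually land while writing
    private final Path targetPath; // the path a descriptor would point at

    RenameCommitSketch(Path tempPath, Path targetPath) {
        this.tempPath = tempPath;
        this.targetPath = targetPath;
    }

    // Phase one: data goes to the temp path; targetPath does not exist yet.
    void write(byte[] data) {
        try {
            Files.write(tempPath, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Phase two: the rename makes targetPath visible.
    void commit() {
        try {
            Files.move(tempPath, targetPath, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Abort: the temp file is dropped, so targetPath never comes to exist.
    void abort() {
        try {
            Files.deleteIfExists(tempPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Any descriptor handed out between `write` and `commit` refers to `targetPath`, which is why it cannot be resolved until the commit succeeds and dangles if `abort` is called instead.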

Here are my thoughts:

  1. Maybe we could explicitly warn users that in FormatTable, returned BlobDescriptors are only valid after commit? Or maybe introduce a PendingBlobDescriptor for format tables, identical to BlobDescriptor except that BlobRef could warn users that the descriptor is still pending, rather than throwing a path-not-exists error.
  2. I think "visible after commit" is acceptable for batch scenarios. For example, in Spark/Ray the FormatTable commit is part of the job, so exported descriptors become visible only after the job has finished successfully.
  3. Or maybe we do not use two-phase commit for BlobFormatTables and just filter out the broken files on read?

Thanks again for your review! I'll close this PR and find another way if you think this scenario is not suitable for Paimon FormatTable.

@JingsongLi

Contributor

@steFaiz Why not just use a Paimon table to store objects?

steFaiz commented Jun 11, 2026

Contributor Author

Why not just use a Paimon table to store objects?

@JingsongLi Thanks for your question! Let me explain. My scenario is:

  • A Spark/Flink UDF takes images as input and immediately outputs a JSON Map<String, BlobDescriptor>, i.e. each image (blob) is written out and the UDF directly produces the descriptor (path + offset + length) for downstream (ODPS). Previously this was done by uploading each image to an individual OSS file; I'm trying to replace OSS with Paimon directly on DFS.

Why is an append table not suitable?

  • With a Paimon table, each UDF instance needs to commit once, on close(). For Spark jobs, there may be hundreds of concurrent commits! A format table's commit is pretty lightweight by comparison.

I'm exploring using Paimon FormatTable to replace OSS, acting just as an archive for blobs. Users always refer to blobs by descriptor only (not by full scan) and can take advantage of Paimon's blob packing, partition management and table management.
