[spark] supports updating blobs through DataEvolution MergeInto by steFaiz · Pull Request #8175 · apache/paimon

steFaiz · 2026-06-09T03:01:09Z

Purpose

Part of #7881

Adds support for the following Spark statement:

MERGE INTO t
USING s
ON ...
WHEN MATCHED THEN UPDATE SET t.raw_blob = s.raw_blob

where raw_blob is a blob column whose values are stored in BlobFormat files.

Implementation

Introduce several marker columns during data evolution:

update columns..., _ROW_ID, _FIRST_ROW_ID, marker columns...

This is needed because Spark only allows literal columns of basic types, i.e. a BlobPlaceholder literal is not allowed.
Each blob column has one marker column, which indicates whether to write the blob value or BlobPlaceholder.INSTANCE.
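
The marker idea can be sketched roughly as follows. This is only an illustration, not the writer code from this PR: `BlobMarkerSketch`, `PLACEHOLDER` and `resolve` are hypothetical names, and the sketch assumes a marker value of true means "write the placeholder for this column".

```java
/** Illustrative only: not Paimon's writer code. */
final class BlobMarkerSketch {

    /** Hypothetical stand-in for BlobPlaceholder.INSTANCE. */
    static final Object PLACEHOLDER = new Object();

    /**
     * Resolves what to write for each blob column of one output row.
     *
     * blobValues[i] is the value Spark produced for blob column i, and
     * writePlaceholder[i] is that column's marker (assumption: true means
     * the placeholder is written instead of the value).
     */
    static Object[] resolve(byte[][] blobValues, boolean[] writePlaceholder) {
        if (blobValues.length != writePlaceholder.length) {
            throw new IllegalArgumentException("one marker per blob column is required");
        }
        Object[] out = new Object[blobValues.length];
        for (int i = 0; i < blobValues.length; i++) {
            out[i] = writePlaceholder[i] ? PLACEHOLDER : blobValues[i];
        }
        return out;
    }
}
```

Because the marker is a plain boolean, Spark can carry it through the plan as an ordinary column, and the placeholder object only appears on the write side.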

Side Effects

In conflict detection, change the current checkRowIdExistence to:
a. If the new files are normal files, their row range must exactly match an existing one.
b. If the new files are special storage files, their row range must be a sub-range of an existing one.
Change the semantics of all-placeholder:
If all blob records at the same row id are placeholders, the value is now treated as NULL (previously this was illegal).
Also fix a bug in the current DataEvolutionMergeInto implementation in the following situation: updating different columns under different match conditions.

WHEN MATCHED AND condition1 THEN UPDATE SET col1 = ...
WHEN MATCHED AND condition2 THEN UPDATE SET col2 = ...
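
The two conflict-detection rules above (exact match for normal files, sub-range for special storage files) can be sketched with a sorted map from first row id to row count. This is a minimal sketch under assumptions: `RowRangeCheckSketch` and its method names are hypothetical, existing ranges are assumed not to overlap, and a range equal to an existing one is treated as a valid sub-range.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative only: not the checkRowIdExistence implementation. */
final class RowRangeCheckSketch {

    // firstRowId -> rowCount of existing files (assumed non-overlapping)
    private final NavigableMap<Long, Long> existingRanges = new TreeMap<>();

    void addExisting(long firstRowId, long rowCount) {
        existingRanges.put(firstRowId, rowCount);
    }

    /** Rule a: a normal file must match an existing range exactly. */
    boolean matchesExactly(long firstRowId, long rowCount) {
        Long existing = existingRanges.get(firstRowId);
        return existing != null && existing == rowCount;
    }

    /** Rule b: a special storage file must lie inside one existing range. */
    boolean coveredByExisting(long firstRowId, long rowCount) {
        // The only candidate is the range starting at or before firstRowId.
        Map.Entry<Long, Long> floor = existingRanges.floorEntry(firstRowId);
        if (floor == null) {
            return false;
        }
        long existingEnd = floor.getKey() + floor.getValue(); // exclusive end
        return firstRowId + rowCount <= existingEnd;
    }
}
```

With existing ranges [0, 100) and [100, 150), a normal file covering [0, 100) passes rule a, a special storage file covering [110, 130) passes rule b, and a file covering [90, 110) fails both because it straddles two ranges.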

Tests

JingsongLi · 2026-06-10T08:46:15Z

+    // The final output is composed by updated columns, metadata columns and blob marker columns.
+    // Marker columns are used to mark whether a blob field should be written with placeholder
+    val rawBlobUpdateColumns = updateColumnsSorted.filter(isRawBlobUpdateColumn)
+    val rawBlobMarkerNamesByColumn = rawBlobUpdateColumns.zipWithIndex.map {


The internal marker column names can collide with real target columns. For example, a table can legally have a column named __paimon_raw_blob_placeholder_0; if a MERGE updates that column and a raw BLOB in the same statement, mergeOutput will contain two attributes with the same name. Then reorderPartialWriteColumns selects by quoted name and writePartialFields resolves the marker with data.schema.fieldIndex, so Spark can either report an ambiguous reference or bind the user column as the boolean marker. Could we generate marker names that are guaranteed not to collide with the write columns/source output, or carry the marker attributes through by exprId instead of resolving them by name?

Thanks! Fixed. Picking new marker names now loops and increments the index until it finds names that do not collide with existing columns.
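
The loop-and-increment approach described in the reply could look roughly like this. It is a sketch, not the code from the PR: `MarkerNameSketch` and `pickMarkerNames` are hypothetical names, the prefix is taken from the example name in the review comment, and names are compared case-sensitively here, which may not match how Spark resolves column names in a given session.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: not the name generation code from this PR. */
final class MarkerNameSketch {

    // Prefix taken from the example name mentioned in the review comment.
    static final String PREFIX = "__paimon_raw_blob_placeholder_";

    /**
     * Picks one marker name per blob column, skipping any name already
     * used by the write columns or the source output.
     */
    static List<String> pickMarkerNames(int blobColumnCount, Set<String> takenNames) {
        Set<String> used = new HashSet<>(takenNames);
        List<String> markers = new ArrayList<>();
        int index = 0;
        for (int i = 0; i < blobColumnCount; i++) {
            String candidate = PREFIX + index++;
            // Set.add returns false when the name is already present.
            while (!used.add(candidate)) {
                candidate = PREFIX + index++;
            }
            markers.add(candidate);
        }
        return markers;
    }
}
```

Adding each chosen name to the used set also keeps the generated markers distinct from each other, not only from user columns.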

JingsongLi · 2026-06-11T06:52:30Z

-                                base.firstRowId(),
-                                base.rowCount()));
+            if (base.firstRowId() != null && !dedicatedStorageFile(base.fileName())) {
+                existingRanges.put(base.firstRowId(), base.rowCount());


Partition level has been removed, cc @leaves12138 to take a look.

leaves12138 · 2026-06-11T07:16:38Z

        }

-        Set<FileRowIdKey> existingIndex = new HashSet<>();
+        NavigableMap<Long, Long> existingRanges = new TreeMap<>();


Can you use RowRangeIndex? Add method containsExactly.

leaves12138 · 2026-06-11T07:17:33Z

-                    && rowCount == that.rowCount
-                    && Objects.equals(partition, that.partition);
-        }
+    private static boolean rowIdRangeCovered(


Can you add containsExactly to RowRangeIndex? This seems common

leaves12138 · 2026-06-11T08:38:22Z

        return Collections.unmodifiableList(ranges);
    }

+    public boolean contains(long start, long end) {


There is already a contains method in this class.

Thanks for the reminder! I noticed that this method was added recently on the master branch. I've rebased onto master and cleaned up my code.

steFaiz marked this pull request as draft June 9, 2026 03:01

steFaiz force-pushed the spark_de_update_blobs branch from a72f8c8 to b43f513 Compare June 9, 2026 10:09

steFaiz marked this pull request as ready for review June 10, 2026 04:12

steFaiz changed the title ~~[wip][spark] supports updating blobs through DataEvolution MergeInto~~ [spark] supports updating blobs through DataEvolution MergeInto Jun 10, 2026

JingsongLi reviewed Jun 10, 2026

JingsongLi reviewed Jun 11, 2026

leaves12138 reviewed Jun 11, 2026

steFaiz added 7 commits June 11, 2026 16:57

optim

7fedede

fix tests

42b57a2

add spark-4.0 codes

e06666b

fix comments

5fcc13d

fix tests

3faf1c8

fix comments

99e4d0c

fix comments

b806157

steFaiz force-pushed the spark_de_update_blobs branch from a4e21ac to b806157 Compare June 11, 2026 09:30

steFaiz wants to merge 7 commits into apache:master from steFaiz:spark_de_update_blobs


3 participants
