feat: implement native empty2null spark inner function by kazantsev-maksim · Pull Request #4683 · apache/datafusion-comet

kazantsev-maksim · 2026-06-18T14:49:54Z

Which issue does this PR close?

Part of: #4670

Rationale for this change

Empty2Null is an internal Spark expression that converts an empty string "" into null. The logic is trivial: if the value is null or a zero-length string, it returns null; otherwise it returns the string unchanged.
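The rule above can be sketched in a few lines of Rust. This is only an illustration on plain `Option<&str>` values, not the code from this PR: the names `empty2null` and `empty2null_column` and the per-value signature are assumptions, and a real native implementation would presumably operate on Arrow string arrays instead.

```rust
/// Illustrative sketch: returns `None` for a null or zero-length input,
/// otherwise returns the string unchanged.
/// `empty2null` is a hypothetical helper name, not Comet's actual function.
fn empty2null(value: Option<&str>) -> Option<&str> {
    value.filter(|s| !s.is_empty())
}

/// Applies the same rule to a whole column of nullable strings.
/// Also a hypothetical helper, shown only to illustrate column-wise use.
fn empty2null_column<'a>(column: &[Option<&'a str>]) -> Vec<Option<&'a str>> {
    column.iter().map(|v| empty2null(*v)).collect()
}
```

Calling `empty2null(Some(""))` and `empty2null(None)` both yield `None`, while `empty2null(Some("a"))` yields `Some("a")`.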

Purpose
The function is applied during partitioned file writes (Parquet, ORC, etc.), specifically to the partition columns. It is needed to keep Hive-style partitioning correct.

In Hive-style directory naming, an empty string and null are indistinguishable: both would produce a path like col1=, which is ambiguous and breaks reading the data back. To avoid this, Spark runs partition columns through Empty2Null before writing, so empty strings end up in the same default partition as null: col1=__HIVE_DEFAULT_PARTITION__
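To make the ambiguity concrete, here is a minimal sketch of how a partition directory name could be derived once empty strings are normalized to null. The helper name `partition_dir` is hypothetical; only the `col1=` form and the `__HIVE_DEFAULT_PARTITION__` name come from the description above. Real path generation also escapes special characters, which is omitted here.

```rust
/// Default partition name used for null partition values (from the PR description).
const HIVE_DEFAULT_PARTITION: &str = "__HIVE_DEFAULT_PARTITION__";

/// Hypothetical helper: builds a Hive-style directory name for one partition column.
/// Empty strings are treated like null, so both land in the default partition.
/// Escaping of special characters is deliberately omitted in this sketch.
fn partition_dir(column: &str, value: Option<&str>) -> String {
    let normalized = value.filter(|s| !s.is_empty());
    format!("{}={}", column, normalized.unwrap_or(HIVE_DEFAULT_PARTITION))
}
```

Under these assumptions, an empty string and a null value map to the same directory name, and a non-empty value such as `"a"` maps to `col1=a`.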

What changes are included in this PR?

How are these changes tested?

Added Rust unit tests.

comphead

Thanks @kazantsev-maksim. Can we investigate whether this function can be implemented through codegen functions rather than natively?

It doesn't seem to involve intensive computation, so a codegen implementation should be fine, I suppose. For a codegen example, see #4636.

kazantsev-maksim · 2026-06-19T15:56:23Z

Thanks @comphead, it works fine.

comphead · 2026-06-19T17:22:02Z

Thanks, starting CI.
@kazantsev-maksim I'm wondering whether it's possible to write a unit test for the function to make sure it works properly?

kazantsev-maksim · 2026-06-19T17:42:59Z

@comphead Spark uses this function to wrap string fields during partitioning; it cannot be called via Spark SQL, but we need this function to implement native partition writing.

comphead · 2026-06-20T18:23:31Z

> @comphead Spark uses this function to wrap string fields during partitioning; it cannot be called via Spark SQL, but we need this function to implement native partition writing.

I see. Perhaps we can still come up with a unit test?

An LLM suggested something like this:

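// Imports this snippet relies on, in addition to the test suite's own implicits.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.Row

test("SPARK-XXXXX: empty partition values are written as null partitions") {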
  withTempDir { dir =>
    val path = dir.getCanonicalPath

    Seq(
      (1, ""),
      (2, "a"),
      (3, null.asInstanceOf[String])
    ).toDF("id", "part")
      .write
      .partitionBy("part")
      .parquet(path)

    val fs = new Path(path).getFileSystem(spark.sessionState.newHadoopConf())
    val partitions = fs.listStatus(new Path(path))
      .filter(_.isDirectory)
      .map(_.getPath.getName)
      .sorted

    assert(partitions.contains("part=a"))

    // Empty string should not generate a dedicated partition directory.
    assert(!partitions.contains("part="))

    // Empty string and null should both map to the default partition.
    assert(partitions.count(_.startsWith("part=__HIVE_DEFAULT_PARTITION__")) == 1)

    checkAnswer(
      spark.read.parquet(path).groupBy("part").count(),
      Row(null, 2) :: Row("a", 1) :: Nil
    )
  }
}

Without a unit test it would be pretty complicated to ensure that no regressions have happened.

We also need to check that execution actually went through the native writer. There are a bunch of tests in src/main/scala/org/apache/spark/sql/comet/CometNativeWriteExec.scala that validate that the native writer is used.

Kazantsev Maksim and others added 30 commits December 14, 2025 16:24

impl map_from_entries

768b3e9

Revert "impl map_from_entries"

c68c342

This reverts commit 768b3e9.

Merge branch 'apache:main' into main

d887555

Merge branch 'apache:main' into main

231aa90

Merge branch 'apache:main' into main

9500bbb

Merge branch 'apache:main' into main

9577481

Merge branch 'apache:main' into main

3791557

Merge branch 'apache:main' into main

7c2f082

Merge branch 'apache:main' into main

609a605

Merge branch 'apache:main' into main

a151b2c

Merge branch 'apache:main' into main

ad3e7f5

Merge branch 'apache:main' into main

ea92e4b

Merge branch 'apache:main' into main

8dfeca3

Merge branch 'apache:main' into main

559741e

Merge branch 'apache:main' into main

ebda14e

Merge branch 'apache:main' into main

408152e

Merge branch 'apache:main' into main

d7857b2

Merge branch 'apache:main' into main

aef41be

Merge branch 'apache:main' into main

5ac1c58

Merge branch 'apache:main' into main

9ae8e23

Merge branch 'apache:main' into main

5ca3888

Merge branch 'apache:main' into main

160a817

Merge branch 'apache:main' into main

88fc313

Merge branch 'apache:main' into main

e14c180

Merge branch 'apache:main' into main

610a885

Merge branch 'apache:main' into main

f8acb2c

Merge branch 'apache:main' into main

ec94897

Merge branch 'apache:main' into main

43405e4

Merge branch 'apache:main' into main

47b4915

Merge branch 'apache:main' into main

26e2682

kazantsev-maksim and others added 20 commits April 17, 2026 21:42

Merge branch 'apache:main' into main

671412c

Merge branch 'apache:main' into main

c9f52d1

Merge branch 'apache:main' into main

67f72d9

Merge branch 'apache:main' into main

314e594

Merge branch 'apache:main' into main

ac8292f

Merge branch 'apache:main' into main

c9c140e

Merge branch 'apache:main' into main

decca58

Merge branch 'apache:main' into main

0919b33

Merge branch 'apache:main' into main

7495e21

Merge branch 'apache:main' into main

0a37a60

Merge branch 'apache:main' into main

abbba84

Merge branch 'apache:main' into main

6020560

Merge branch 'apache:main' into main

e2bdfb1

Merge branch 'apache:main' into main

3edfc33

Merge branch 'apache:main' into main

a39e860

Merge branch 'apache:main' into main

e88dd7b

Merge branch 'apache:main' into main

3e29d37

Merge branch 'apache:main' into main

4068359

Merge branch 'apache:main' into main

a3cb8de

Feat: add empty2Null inner spark function

cfc751a

kazantsev-maksim changed the title ~~Empty2null~~ feat: implement native empty2null spark inner function Jun 18, 2026

Merge branch 'main' into empty2null

30a92c0

comphead reviewed Jun 18, 2026

Kazantsev Maksim added 2 commits June 19, 2026 19:53

fix pr issues

9ba7a8e

Merge remote-tracking branch 'origin/empty2null' into empty2null

e5847ed

kazantsev-maksim requested a review from comphead June 19, 2026 15:56

feat: implement native empty2null spark inner function #4683
kazantsev-maksim wants to merge 59 commits into apache:main from kazantsev-maksim:empty2null

2 participants