Skip to content

feat: implement native empty2null spark inner function#4683

Open
kazantsev-maksim wants to merge 59 commits into
apache:mainfrom
kazantsev-maksim:empty2null
Open

feat: implement native empty2null spark inner function#4683
kazantsev-maksim wants to merge 59 commits into
apache:mainfrom
kazantsev-maksim:empty2null

Conversation

@kazantsev-maksim

@kazantsev-maksim kazantsev-maksim commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Part of: #4670

Rationale for this change

Empty2Null is an internal Spark expression that converts an empty string "" into null. The logic is trivial: if the value is null or a zero-length string, it returns null; otherwise it returns the string unchanged.

Purpose
The function is applied during partitioned file writes (parquet, orc, etc.) — specifically to the partition columns. The reason is the correctness of Hive-style partitioning.

In Hive-style directory naming, an empty string and null are indistinguishable: both would produce a path like col1=, which is ambiguous and breaks reading the data back. To avoid this, Spark runs partition columns through Empty2Null before writing, so empty strings end up in the same default partition as null: col1=__HIVE_DEFAULT_PARTITION__

What changes are included in this PR?

How are these changes tested?

Add rust unit tests

@kazantsev-maksim kazantsev-maksim changed the title Empty2null feat: implement native empty2null spark inner function Jun 18, 2026

@comphead comphead left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kazantsev-maksim can we investigate if this function can be implemented through codegen functions rather than native?

It doesn't seem to have intensive computations so codegen implementation should be fine I suppose. The example for codegen #4636

@kazantsev-maksim

Copy link
Copy Markdown
Contributor Author

Thanks @comphead, it's work fine.

@comphead

Copy link
Copy Markdown
Contributor

Thanks, starting CI
@kazantsev-maksim wondering, if its possible to write a unit test for the function to make sure it works properly?

@kazantsev-maksim

kazantsev-maksim commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

@comphead Spark uses this function to wrap string fields during partitioning; it cannot be called via Spark SQL, but we need this function to implement native partition writing.

@comphead

comphead commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

@comphead Spark uses this function to wrap string fields during partitioning; it cannot be called via Spark SQL, but we need this function to implement native partition writing.

I see, perhaps we can come up with unit test?

LLM suggests me something like

test("SPARK-XXXXX: empty partition values are written as null partitions") {
  withTempDir { dir =>
    val path = dir.getCanonicalPath

    Seq(
      (1, ""),
      (2, "a"),
      (3, null.asInstanceOf[String])
    ).toDF("id", "part")
      .write
      .partitionBy("part")
      .parquet(path)

    val fs = new Path(path).getFileSystem(spark.sessionState.newHadoopConf())
    val partitions = fs.listStatus(new Path(path))
      .filter(_.isDirectory)
      .map(_.getPath.getName)
      .sorted

    assert(partitions.contains("part=a"))

    // Empty string should not generate a dedicated partition directory.
    assert(!partitions.contains("part="))

    // Empty string and null should both map to the default partition.
    assert(partitions.count(_.startsWith("part=__HIVE_DEFAULT_PARTITION__")) == 1)

    checkAnswer(
      spark.read.parquet(path).groupBy("part").count(),
      Row(null, 2) :: Row("a", 1) :: Nil
    )
  }
}

Without unit test it would be pretty complicated to ensure no regressions happened

And we need to check that execution actually happened with native writer. There are bunch of tests in src/main/scala/org/apache/spark/sql/comet/CometNativeWriteExec.scala that validate native writer used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants