Skip to content

refactor(shim): add new shim infra with enableIfVer macro to avoid shims duplication#4657

Open
rluvaton wants to merge 5 commits into
apache:mainfrom
rluvaton:add-annotation-based-shims
Open

refactor(shim): add new shim infra with enableIfVer macro to avoid shims duplication#4657
rluvaton wants to merge 5 commits into
apache:mainfrom
rluvaton:add-annotation-based-shims

Conversation

@rluvaton

@rluvaton rluvaton commented Jun 15, 2026

Copy link
Copy Markdown
Member

Note

This does not remove all the shims in favor of the new infra as it will be very large PR
instead it moves some stuff to show how it is being used and the process of doing so

A small macro-annotation library that lets a single source file carry code for multiple
Spark versions. You annotate a member (def / class / object) with a version range, and at *
compile time* the macro keeps it, drops it, or tweaks it based on the version the build is
targeting.

The problem it solves

Version-specific code used to live in parallel source folders (src/main/spark-3.4/…, spark-3.5/…, spark-4.0/…) - so a whole file got copied per version just because one method differed, or because a type like WindowGroupLimitExec only exists in Spark 3.5+. @enableIfVer lets those differences live inline, in one file: no duplication, and the version logic sits right next to the code it gates.

How you use it

The targeted versions are supplied by the build per "dimension" (spark). You write a semver range

import org.apache.comet.enableIfVer

// keep this method only on Spark 3.5+
@enableIfVer(spark = ">=3.5.0")
def onlyOn35Plus(): Unit =
...

// two same-named methods, one per version range - exactly one survives per build
@enableIfVer(spark = "<3.5.0") def isWGL(p: SparkPlan): Boolean = false
@enableIfVer(spark = ">=3.5.0") def isWGL(p: SparkPlan): Boolean = p.isInstanceOf[WindowGroupLimitExec]

The four behaviors

Annotation On a non-matching version
@enableIfVer drops the member entirely
@implementIfVer keeps the class/object but empties its body and remove the inheritance (so a type that references 3.5-only symbols still exists on 3.4)
@enableOverrideIfVer strips override (for a member that only overrides on some versions)

Ranges

Matched by semver4j: full major.minor.patch versions with > >= < <= = !=, space = AND, || = OR, A - B hyphen ranges, x-ranges / partials and more (FYI, it supports package.json format)

What it gives us

  • No more per-version source-folder duplication for small differences.
  • One file, version logic inline - easier to read and maintain.
  • Compile-time safety: non-matching code (and its references to types that don't exist in that
    version) is removed before type-checking, so each build only ever sees valid code.
  • Extensible & general: any number of dimensions, combined with AND/OR, plus an explicit
    "not enabled" state - all from one unified @enableIfVer namespace.

Caveats

  • @enableIfVer annotates definitions only - not statements (e.g. a bare test(...) call). To
    gate a statement, wrap it in a def and annotate that.
  • It cannot be applied to a top-level class that has a companion object (a macro-paradise
    limitation). For whole top-level classes that are entirely version-specific, the version-specific
    source folder is still the right tool.

Apache Auron (Blaze)

most of the annotation code taken from Blaze/auron in sparkver.scala but updated the syntax of matching versions and replace @sparkverEnableMembers with @implementIfVer that remove the inheritance as well - useful for abstract class

Blaze uses / to split between supported versions so for example like this:

@sparkver("3.0 / 3.1 / 3.2 / 3.3 / 3.4 / 3.5 / 4.0 / 4.1")
override def simpleString(maxFields: Int): String =
  s"$nodeName (${basedHiveScan.simpleString(maxFields)})"

This however use semver match syntax that is more widely known so the same code for us will be:

@enableIfVer(spark = ">=3.0.0 <4.2.0")
override def simpleString(maxFields: Int): String =
  s"$nodeName (${basedHiveScan.simpleString(maxFields)})"

and if you want to automatically have this function for every spark version above 3.0 and not only until 4.1, you can do:

@enableIfVer(spark = ">=3.0.0")
override def simpleString(maxFields: Int): String =
  s"$nodeName (${basedHiveScan.simpleString(maxFields)})"

the syntax is matching the one used in package.json

also, Auron name is sparkver, I named enableIfVer instead so future shims (like iceberg/delta/etc) can use the same infra

@rluvaton rluvaton force-pushed the add-annotation-based-shims branch from 9f6e6c7 to 4e36d4f Compare June 15, 2026 14:01
@rluvaton rluvaton requested review from andygrove and comphead and removed request for comphead June 15, 2026 15:22
@rluvaton rluvaton requested review from comphead and mbutrovich June 15, 2026 17:13
@rluvaton

Copy link
Copy Markdown
Member Author

@andygrove can you please review? I think this will improve the scala code quality greatly

@rluvaton

Copy link
Copy Markdown
Member Author

Maybe @mbutrovich can look at it as well?

@andygrove

Copy link
Copy Markdown
Member

Thanks @rluvaton. I will take a look at this soon, but I am concerned that this is more complex than the current approach. How well does this approach work with IDEs, such as IntelliJ?

@wForget

wForget commented Jun 17, 2026

Copy link
Copy Markdown
Member

+0 for this, I prefer the separated source directories approach; introducing potentially incompatible classes or methods into a single source file seems more difficult to maintain.

@rluvaton

rluvaton commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @rluvaton. I will take a look at this soon, but I am concerned that this is more complex than the current approach.

I think this will:

Reduce line count

you don't need to duplicate shims code when they are the same

Make it easier to read

When you look at specific shim directory like spark 3.5 you should also pay attention to the shim directory spark 3.x instead of just looking at the shim code and understand right away

Adding shims from specific spark version and up

for example, HashPartitioningLike was added in spark 4 and backported to spark 3.5 and spark 3.4 and released in spark 3.5.1
in the current approach, how would you create a function that check if spark partitioning is hash based if the HashPartitioningLike was not backported to spark 3.4 which you still have shims for?

with this pr you could write:

@enableIfVer(spark = ">= 3.5.1")
def isHashPartitioningLike(partitioning: Partitioning): Boolean = {
  import org.apache.spark.sql.catalyst.plans.physical.HashPartitioningLike
  
  partitioning match {
    case hashLike: HashPartitioningLike => true
    case par => false
  }
}

@enableIfVer(spark = "< 3.5.1")
def isHashPartitioningLike(partitioning: Partitioning): Boolean = {
  import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning
  
  partitioning match {
    case hash: HashPartitioning => true
    case par => false
  }
}

@rluvaton

Copy link
Copy Markdown
Member Author

+0 for this, I prefer the separated source directories approach; introducing potentially incompatible classes or methods into a single source file seems more difficult to maintain.

you can use wildcard imports to get around it in most cases or scope the imports to the used code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants