Commit 57c471c

revert: default JVM UDF codegen dispatcher back to disabled
Flip spark.comet.exec.scalaUDF.codegen.enabled back to false and restore the experimental, disabled-by-default language across the regex/codegen docs and CometConf strings. With this default, the regex family (java engine path) and the DateFormat dispatcher fall back to Spark unless the user explicitly opts in. This keeps the engine=rust JVM-dispatcher fallthrough behavior introduced separately on this branch; only the codegen-enabled-by-default change is reverted.
1 parent 7f22f92 commit 57c471c
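
With the default reverted, keeping regex expressions or `ScalaUDF`s on the Comet path needs an explicit opt-in. The following is a minimal sketch of how the two settings named in the commit message could be rendered as `spark-submit` `--conf` arguments. The `CometOptIn` object and its members are hypothetical names used for illustration only; they are not part of Comet.

```scala
// Hypothetical helper (not part of Comet): renders the opt-in settings named in
// the commit message as spark-submit `--conf key=value` arguments.
object CometOptIn {
  // Config keys are taken from the commit; the values express an explicit opt-in.
  val overrides: Seq[(String, String)] = Seq(
    "spark.comet.exec.scalaUDF.codegen.enabled" -> "true", // default is false after this revert
    "spark.comet.exec.regexp.engine" -> "java"             // already the default engine
  )

  // Flatten each (key, value) pair into the two tokens "--conf" and "key=value".
  def submitArgs(settings: Seq[(String, String)] = overrides): Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(submitArgs().mkString(" "))
}
```

Setting `spark.comet.exec.regexp.engine` is redundant when the default engine is in use; it is listed only to make the pairing with the codegen flag explicit.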

8 files changed

Lines changed: 40 additions & 41 deletions

docs/source/user-guide/latest/compatibility/regex.md

Lines changed: 11 additions & 11 deletions
@@ -20,21 +20,21 @@ under the License.
  # Regular Expressions

  Comet provides two regexp engines for evaluating regular expressions: a **Rust engine** that uses the Rust
- [`regex`] crate natively, and a **Java engine** that runs Spark's own `doGenCode` for the
+ [`regex`] crate natively, and an experimental **Java engine** that runs Spark's own `doGenCode` for the
  expression inside Comet's Arrow-direct codegen dispatcher (the same dispatcher used by Comet's
  `ScalaUDF` codegen path). The engine is selected with `spark.comet.exec.regexp.engine`, which accepts:

- - `java` (default) — route through the Java engine for full Spark compatibility. Uses the codegen
- dispatcher (`spark.comet.exec.scalaUDF.codegen.enabled`, also default `true`); if that flag is
- disabled, regex expressions fall back to Spark with an explanatory message.
+ - `java` (default) — route through the Java engine for full Spark compatibility. Requires
+ `spark.comet.exec.scalaUDF.codegen.enabled=true`; otherwise regex expressions fall back to Spark with
+ an explanatory message.
  - `rust` — run the Rust engine when an expression has a native implementation. Setting this is itself
  the opt-in for the semantic differences between Java and Rust regex (no separate `allowIncompatible`
  flag needed). Expressions without a native Rust implementation (`regexp_extract`,
  `regexp_extract_all`, `regexp_instr`) fall through to the Java engine so users still get Comet
  acceleration with full Spark semantics.

- With pure defaults (`engine=java`, `scalaUDF.codegen.enabled=true`), all regex expressions run on
- the Comet path with full Spark compatibility.
+ The codegen dispatcher is experimental and disabled by default. With pure defaults
+ (`engine=java`, `scalaUDF.codegen.enabled=false`), all regex expressions fall back to Spark.

  ## Disabling Comet for individual regex expressions

@@ -54,7 +54,7 @@ the engine selector:

  ## Choosing an engine

- | | Rust engine | Java engine (default) |
+ | | Rust engine | Java engine (experimental, default) |
  | -------------------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
  | **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
  | **Feature coverage** | `rlike`, `regexp_replace`, `split` natively; `regexp_extract`, `regexp_extract_all`, `regexp_instr` via fallthrough | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
@@ -65,9 +65,9 @@ The **Rust engine** is faster but cannot match Java regex semantics for every pa
  choice is itself the opt-in, setting `spark.comet.exec.regexp.engine=rust` declares acceptance of those
  differences without a separate per-expression flag.

- The **Java engine** is the default. It is gated behind `spark.comet.exec.scalaUDF.codegen.enabled`
- (also default `true`) so the codegen dispatcher can be disabled globally without changing the regex
- engine selector.
+ The **Java engine** is the default but the underlying codegen dispatcher is experimental and gated behind
+ `spark.comet.exec.scalaUDF.codegen.enabled=true`; the behavior, configuration, and supported expressions
+ may change in future releases.

  ## Why the engines differ

@@ -129,7 +129,7 @@ shape and want to avoid the JNI overhead of the Java engine, switching to the Ru
  `allowIncompatible=true` is generally safe.

  For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
- defaults, use the Java engine.
+ defaults, use the experimental Java engine.

  [`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
  [`regex`]: https://docs.rs/regex/latest/regex/

docs/source/user-guide/latest/iceberg.md

Lines changed: 6 additions & 6 deletions
@@ -157,12 +157,12 @@ Iceberg ships several `ScalaUDF`s that surface in user queries and maintenance a
  (`INT_ORDERED_BYTES`, `LONG_ORDERED_BYTES`, ..., `INTERLEAVE_BYTES`) over the sort key columns
  during compaction.

- [Scala UDF and Java UDF Support](scala_java_udfs.md)
- (`spark.comet.exec.scalaUDF.codegen.enabled=true`, the default) routes these UDFs through native
- execution so the project, exchange, and sort operators around them stay on the Comet path
- end-to-end. Disabling the flag causes the enclosing operator to fall back to Spark, which forces
- a columnar-to-row roundtrip and demotes the surrounding shuffle from `CometExchange` to
- `CometColumnarExchange`.
+ By default these UDFs cause the enclosing operator to fall back to Spark, which forces a
+ columnar-to-row roundtrip and demotes the surrounding shuffle from `CometExchange` to
+ `CometColumnarExchange`. Enabling the experimental
+ [Scala UDF and Java UDF Support](scala_java_udfs.md) feature
+ (`spark.comet.exec.scalaUDF.codegen.enabled=true`) routes these UDFs through native execution so
+ the project, exchange, and sort operators around them stay on the Comet path end-to-end.

  ### Task input metrics

docs/source/user-guide/latest/scala_java_udfs.md

Lines changed: 3 additions & 1 deletion
@@ -23,11 +23,13 @@ Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](htt

  This page covers Spark's `ScalaUDF` (Scala `udf(...)`, `spark.udf.register(...)` over Scala or Java functional interfaces, and SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`). Other UDF kinds (Python / Pandas, Hive, aggregate) are out of scope and continue to fall back to Spark.

+ This feature is experimental and disabled by default.
+
  ## Configuration

  | Key | Default | Description |
  | ------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------ |
- | `spark.comet.exec.scalaUDF.codegen.enabled` | `true` | When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. |
+ | `spark.comet.exec.scalaUDF.codegen.enabled` | `false` | When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. |

  ## Supported

spark/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 9 additions & 9 deletions
@@ -365,14 +365,14 @@ object CometConf extends ShimCometConf {
  val COMET_SCALA_UDF_CODEGEN_ENABLED: ConfigEntry[Boolean] =
  conf("spark.comet.exec.scalaUDF.codegen.enabled")
  .category(CATEGORY_EXEC)
- .doc("Whether to route Spark `ScalaUDF` expressions through Comet's Arrow-direct codegen " +
- "dispatcher. When enabled (the default), a supported ScalaUDF is compiled into a " +
- "per-batch kernel that reads and writes Arrow vectors directly from native execution. " +
- "When disabled, plans containing a ScalaUDF fall back to Spark for the enclosing " +
- "operator. The same dispatcher backs `spark.comet.exec.regexp.engine=java` so the " +
- "regex family routes through it as well.")
+ .doc("Experimental. Whether to route Spark `ScalaUDF` expressions through Comet's " +
+ "Arrow-direct codegen dispatcher. When enabled, a supported ScalaUDF is compiled into " +
+ "a per-batch kernel that reads and writes Arrow vectors directly from native " +
+ "execution. When disabled, plans containing a ScalaUDF fall back to Spark for the " +
+ "enclosing operator. The same dispatcher backs `spark.comet.exec.regexp.engine=java` " +
+ "so the regex family routes through it as well.")
  .booleanConf
- .createWithDefault(true)
+ .createWithDefault(false)

  val REGEXP_ENGINE_RUST = "rust"
  val REGEXP_ENGINE_JAVA = "java"
@@ -384,8 +384,8 @@ object CometConf extends ShimCometConf {
  "Selects the engine used to evaluate Spark regular-expression expressions. " +
  s"`$REGEXP_ENGINE_JAVA` (default) routes through the Arrow-direct codegen dispatcher " +
  "so Spark's own `doGenCode` (backed by `java.util.regex.Pattern`) runs inside the " +
- "Comet pipeline; this falls back to Spark when " +
- s"${COMET_SCALA_UDF_CODEGEN_ENABLED.key}=false. `$REGEXP_ENGINE_RUST` runs the " +
+ s"Comet pipeline; this requires ${COMET_SCALA_UDF_CODEGEN_ENABLED.key}=true and " +
+ s"falls back to Spark otherwise. `$REGEXP_ENGINE_RUST` runs the " +
  "native DataFusion regexp engine when an implementation exists; setting this is " +
  "itself the opt-in for the semantic differences between Java and Rust regex. " +
  "Expressions without a native Rust implementation (`regexp_extract`, " +

spark/src/main/scala/org/apache/comet/serde/CometScalaUDF.scala

Lines changed: 2 additions & 2 deletions
@@ -43,8 +43,8 @@ import org.apache.comet.udf.codegen.CometScalaUDFCodegen
  * - Python / Pandas UDFs.
  * - Hive `GenericUDF` / `SimpleUDF`.
  *
- * Gated by [[CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED]] (default `true`). When disabled, plans
- * containing a `ScalaUDF` fall back to Spark for the enclosing operator.
+ * Gated by [[CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED]]. When disabled, plans containing a
+ * `ScalaUDF` fall back to Spark for the enclosing operator.
  *
  * [[emitJvmCodegenDispatch]] exposes the same closure-serialize + dispatcher-proto path to other
  * serdes that want to keep a built-in Spark expression inside the Comet pipeline when no native

spark/src/main/scala/org/apache/comet/serde/datetime.scala

Lines changed: 3 additions & 2 deletions
@@ -653,8 +653,9 @@ object CometDateFormat extends CometExpressionSerde[DateFormatClass] {
  "Format strings in a curated allow-list run natively via DataFusion's `to_char` for UTC " +
  "sessions. Other format strings (including non-literal formats), as well as non-UTC " +
  "sessions, route through Spark's own `DateFormatClass.doGenCode` via the Arrow-direct " +
- "codegen dispatcher (default). Set `spark.comet.exec.scalaUDF.codegen.enabled=false` to " +
- "fall back to Spark for those cases instead.")
+ "codegen dispatcher when `spark.comet.exec.scalaUDF.codegen.enabled=true`. When the " +
+ "codegen dispatcher is disabled (default) the operator falls back to Spark in those " +
+ "cases.")

  override def convert(
  expr: DateFormatClass,
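
The updated `CometDateFormat` doc string describes a three-way decision: native `to_char`, the codegen dispatcher, or Spark fallback. A minimal sketch of that decision as a pure function follows, assuming the rules are exactly as the doc string states. Every name here is hypothetical, and the allow-list contents are a placeholder because the real curated list does not appear in this diff.

```scala
// Hypothetical model of the DateFormat routing described in the doc string.
// This is not Comet's implementation; names and allow-list contents are made up.
object DateFormatRoutingSketch {
  sealed trait Route
  case object NativeToChar extends Route   // DataFusion `to_char`
  case object JvmDispatcher extends Route  // Spark's doGenCode via the codegen dispatcher
  case object SparkFallback extends Route  // enclosing operator falls back to Spark

  // Placeholder only: the real curated allow-list is not shown in the diff.
  val allowList: Set[String] = Set("yyyy-MM-dd")

  // `literalFormat` is None when the format argument is not a literal.
  def choose(
      literalFormat: Option[String],
      sessionIsUtc: Boolean,
      codegenEnabled: Boolean): Route =
    if (sessionIsUtc && literalFormat.exists(allowList.contains)) NativeToChar
    else if (codegenEnabled) JvmDispatcher
    else SparkFallback
}
```

Under this model the revert only changes the outcome for inputs that miss the native path: they now resolve to `SparkFallback` unless the flag is set explicitly.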

spark/src/main/scala/org/apache/comet/serde/strings.scala

Lines changed: 6 additions & 5 deletions
@@ -63,10 +63,10 @@ private object RegexpRoute {
  /**
  * Pick a route given the user's config and whether a native Rust implementation exists for the
  * expression. `engine=java` (default) routes through the codegen dispatcher when
- * [[CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED]] is true (also the default); otherwise Spark
- * fallback. `engine=rust` runs native if available, otherwise transparently falls back to the
- * JVM codegen dispatcher when it is enabled, and only declines (Spark fallback) when neither
- * the native path nor the dispatcher can serve the expression.
+ * [[CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED]] is true; otherwise Spark fallback.
+ * `engine=rust` runs native if available, otherwise transparently falls back to the JVM codegen
+ * dispatcher when it is enabled, and only declines (Spark fallback) when neither the native
+ * path nor the dispatcher can serve the expression.
  */
  def choose(exprName: String, hasNative: Boolean): RegexpRoute = {
  val engine = CometConf.COMET_REGEXP_ENGINE.get()
@@ -91,7 +91,8 @@ private object RegexpRoute {
  } else {
  Fallback(
  s"$exprName requires ${CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED.key}=true when " +
- s"${CometConf.COMET_REGEXP_ENGINE.key}=${CometConf.REGEXP_ENGINE_JAVA}.")
+ s"${CometConf.COMET_REGEXP_ENGINE.key}=${CometConf.REGEXP_ENGINE_JAVA}. " +
+ "The codegen dispatcher is experimental and disabled by default.")
  }

  case other => Fallback(s"Unknown ${CometConf.COMET_REGEXP_ENGINE.key}=$other")
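
The scaladoc on `RegexpRoute.choose` spells out the routing rules, but the hunks show only part of the method body. The sketch below restates those rules as a self-contained pure function so the effect of the new default is easy to check. It is a simplified model rather than the code in `strings.scala`: the object, the route names, the explicit `engine` and `codegenEnabled` parameters, and the fallback messages are assumptions made for illustration.

```scala
// Simplified, hypothetical model of the regex routing rules from the scaladoc.
// The real method reads its settings from CometConf; here they are parameters.
object RegexpRoutingSketch {
  sealed trait Route
  case object Native extends Route
  case object JvmDispatcher extends Route
  final case class SparkFallback(reason: String) extends Route

  def choose(engine: String, codegenEnabled: Boolean, hasNative: Boolean): Route =
    engine match {
      case "java" =>
        // The Java engine always needs the codegen dispatcher.
        if (codegenEnabled) JvmDispatcher
        else SparkFallback("java engine requires the codegen dispatcher to be enabled")
      case "rust" =>
        // Prefer the native path, then fall through to the dispatcher if enabled.
        if (hasNative) Native
        else if (codegenEnabled) JvmDispatcher
        else SparkFallback("no native implementation and the codegen dispatcher is disabled")
      case other =>
        SparkFallback(s"unknown engine: $other")
    }
}
```

With pure defaults after this commit (`engine=java`, dispatcher disabled), every call resolves to `SparkFallback`, which matches the updated statement in `regex.md`. With `engine=rust`, expressions that have a native implementation are unaffected by the flag.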

spark/src/test/scala/org/apache/comet/CometSqlFileTestSuite.scala

Lines changed: 0 additions & 5 deletions
@@ -99,11 +99,6 @@ class CometSqlFileTestSuite extends CometTestBase with AdaptiveSparkPlanHelper {
  /**
  * Pin the sentinel-query convention for fixtures that route an `expect_error` through the
  * codegen dispatcher. See [[ExpectError]] for the failure mode this guards against.
- *
- * Fires for fixtures that explicitly opt into the codegen dispatcher via `--CONFIG`. The
- * dispatcher now defaults to enabled, but most expression fixtures use their own native paths
- * and do not exercise the dispatcher, so we still key off the explicit opt-in rather than
- * forcing every fixture in the repo to add a sentinel.
  */
  private def requireSentinelForCodegenExpectError(
  relativePath: String,
