docs: introduce Arrow-native terminology in README and user guide

Comet's documentation conflates several distinct ideas under the word
'native': implementation language (Rust vs JVM), pipeline membership
(handled by Comet vs falls back to Spark), and data format (Arrow vs
Spark rows). The vocabulary clean-up in #4419 splits these out, and
this PR rolls in only the README and user-guide prose, with no code
or operator renames.

- 'Arrow-native' is now the term for the data-format property that
unifies the pipeline.
- 'Comet pipeline' replaces 'the native Comet path' / 'accelerated by
Comet' for membership.
- 'Rust-implemented' / 'native Rust' is used for the implementation-
language axis. Bare 'native execution' / 'runs natively' / 'the
native path' as vague adjectives are removed.

The biggest single rewrite is in user-guide/latest/understanding-comet-plans.md,
where the 'three kinds of nodes' framing becomes four (Arrow-native
Rust operators, Arrow-native JVM expressions, Arrow-native JVM
plumbing, Spark fallback).

Contributor-guide files, plugin-overview prose, and the about/ pages
are deliberately out of scope and will follow in a separate PR.
docs/source/user-guide/latest/scala_java_udfs.md (6 additions, 6 deletions)
@@ -19,24 +19,24 @@
# Scala UDF and Java UDF Support
-
Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) on the native Comet path. The presence of a UDF does not force the enclosing operator off the native path; surrounding native operators stay native.
+
Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) inside the Comet pipeline. The UDF body is JVM bytecode produced by Spark's whole-stage codegen, but it runs over Arrow batches alongside the surrounding Arrow-native operators rather than triggering a fallback to row-based Spark execution. The presence of a UDF does not force the enclosing operator out of the pipeline; surrounding Comet operators continue to run as usual.
This page covers Spark's `ScalaUDF` (Scala `udf(...)`, `spark.udf.register(...)` over Scala or Java functional interfaces, and SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`). Other UDF kinds (Python / Pandas, Hive, aggregate) are out of scope and continue to fall back to Spark.
This feature is experimental and disabled by default.
-
|`spark.comet.exec.scalaUDF.codegen.enabled`|`false`| When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. |
+
|`spark.comet.exec.scalaUDF.codegen.enabled`|`false`| When `true`, eligible `ScalaUDF`s run inside the Comet pipeline. When `false`, the enclosing operator falls back to Spark. |
## Supported
- User functions registered via `udf(...)`, `spark.udf.register(...)` (Scala or Java functional interfaces), or SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`.
- Complex input/output types with arbitrary nesting: `ArrayType`, `StructType`, `MapType`.
-
- Composition with other Catalyst expressions inside the argument tree (e.g. `myUdf(upper(s))` runs as one native unit).
+
- Composition with other Catalyst expressions inside the argument tree (e.g. `myUdf(upper(s))` runs as a single compiled kernel without leaving the Comet pipeline).
@@ -45,7 +45,7 @@ This feature is experimental and disabled by default.
- Table UDFs and generators.
- Python `@udf` and Pandas `@pandas_udf`.
- Hive `GenericUDF` and `SimpleUDF`.
-
- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; for native execution, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`).
+
- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; to keep the work in the Comet pipeline, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`).
- Trees whose total nested-field count (output plus all input columns the UDF tree references) exceeds `spark.sql.codegen.maxFields` (default 100). Comet refuses these at plan time and the operator falls back to Spark.
When a UDF is rejected, the reason surfaces through Comet's standard fallback diagnostics; the query still runs on Spark.
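
The `spark.sql.codegen.maxFields` limit described above can be illustrated with a minimal sketch. It assumes the count follows the recursive rule Spark's whole-stage codegen applies for this setting (one per leaf type, summed across struct fields, map key and value, and array elements); this page does not state that Comet counts exactly this way. The `Leaf`, `Array`, `Map`, and `Struct` classes and both function names are hypothetical stand-ins for illustration, not Spark or Comet APIs.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical, simplified stand-ins for Spark SQL data types.
@dataclass
class Leaf:
    name: str  # e.g. "int", "double", "string"


@dataclass
class Array:
    element: "DataType"


@dataclass
class Map:
    key: "DataType"
    value: "DataType"


@dataclass
class Struct:
    fields: List["DataType"]


DataType = Union[Leaf, Array, Map, Struct]


def nested_field_count(dt: DataType) -> int:
    """Count leaf fields recursively (assumed counting rule, see lead-in)."""
    if isinstance(dt, Struct):
        return sum(nested_field_count(f) for f in dt.fields)
    if isinstance(dt, Map):
        return nested_field_count(dt.key) + nested_field_count(dt.value)
    if isinstance(dt, Array):
        return nested_field_count(dt.element)
    return 1  # any non-nested type counts as a single field


def exceeds_max_fields(output: DataType, inputs: List[DataType],
                       max_fields: int = 100) -> bool:
    """True when output plus all referenced input columns exceed the limit."""
    total = nested_field_count(output) + sum(nested_field_count(c) for c in inputs)
    return total > max_fields


# The struct layout suggested above as a replacement for VectorUDT.
mllib_vector = Struct([Leaf("byte"), Leaf("int"),
                       Array(Leaf("int")), Array(Leaf("double"))])
```

Under this assumed rule the vector struct counts as four fields, so a UDF that takes one such column and returns another stays well under the default limit of 100. A wide struct with many leaves can cross the limit on its own.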