docs/source/user-guide/latest/iceberg.md (18 additions & 0 deletions)
@@ -146,6 +146,24 @@ The following scenarios will fall back to Spark's native Iceberg reader:
- Dynamic Partition Pruning under Adaptive Query Execution (non-AQE DPP is supported);
  see [#3510](https://github.com/apache/datafusion-comet/issues/3510)

### Iceberg UDFs

Iceberg ships several `ScalaUDF`s that surface in user queries and maintenance actions:

- `IcebergSpark.registerBucketUDF` and `registerTruncateUDF` register `bucket(N, col)` and
  `truncate(W, col)` for use in `SELECT` / `JOIN` / `WHERE` predicates that align with hidden
  partitioning.
- `RewriteDataFiles` with `sort-strategy=zorder` builds a tree of per-type ordered-bytes UDFs
  (`INT_ORDERED_BYTES`, `LONG_ORDERED_BYTES`, ..., `INTERLEAVE_BYTES`) over the sort key columns
  during compaction.
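
To make the ordered-bytes and interleaving idea behind the Z-order UDFs concrete, here is a minimal Python sketch. It is an illustration only, not Iceberg's implementation: the names `int_ordered_bytes`, `interleave_bytes`, and `z_order_key` are invented for this example, only 32-bit integers are handled, and every column is assumed to contribute the same number of bytes.

```python
import struct
from typing import Sequence


def int_ordered_bytes(value: int) -> bytes:
    """Encode a signed 32-bit int so that byte-wise comparison matches numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit integer")
    # Flipping the sign bit makes negatives sort before non-negatives;
    # big-endian layout puts the most significant byte first.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def interleave_bytes(columns: Sequence[bytes]) -> bytes:
    """Interleave bits round-robin: bit 0 of each column, then bit 1 of each, and so on."""
    if not columns:
        raise ValueError("at least one column is required")
    width = len(columns[0])
    if any(len(col) != width for col in columns):
        raise ValueError("this sketch assumes equal-width columns")
    out = 0
    for bit in range(width * 8):
        for col in columns:
            # Pick bits from most significant to least significant.
            out = (out << 1) | ((col[bit // 8] >> (7 - bit % 8)) & 1)
    return out.to_bytes(width * len(columns), "big")


def z_order_key(*values: int) -> bytes:
    """Hypothetical helper: build one sortable key from several int columns."""
    return interleave_bytes([int_ordered_bytes(v) for v in values])
```

Sorting rows by `z_order_key(a, b)` tends to keep rows that are close in both `a` and `b` near each other, which is the property the compaction sort relies on.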

By default these UDFs cause the enclosing operator to fall back to Spark, which forces a
columnar-to-row roundtrip and demotes the surrounding shuffle from `CometExchange` to
`CometColumnarExchange`. Enabling the experimental
[Scala UDF and Java UDF Support](scala_java_udfs.md) feature
(`spark.comet.exec.scalaUDF.codegen.enabled=true`) routes these UDFs through native execution so
the project, exchange, and sort operators around them stay on the Comet path end-to-end.

### Task input metrics

The native Iceberg reader populates Spark's task-level `inputMetrics.bytesRead` (visible in the Spark UI Stages tab) using the `bytes_read` counter from iceberg-rust's `ScanMetrics`. This counter includes bytes read from both data files and delete files.

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

# Scala UDF and Java UDF Support

With this feature enabled, Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) on the native Comet path. The presence of a UDF then does not force the enclosing operator off the native path; surrounding native operators stay native.

This page covers Spark's `ScalaUDF` (Scala `udf(...)`, `spark.udf.register(...)` over Scala or Java functional interfaces, and SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`). Other UDF kinds (Python / Pandas, Hive, aggregate) are out of scope and continue to fall back to Spark.

This feature is experimental and disabled by default.

| Config | Default | Description |
| ------ | ------- | ----------- |
| `spark.comet.exec.scalaUDF.codegen.enabled` | `false` | When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. |

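One way to switch the flag on is to pass it as a `--conf key=value` pair to `spark-submit`. The short Python sketch below only renders those arguments; the helper name `to_submit_args` and the dictionary name are hypothetical, and the configuration key and value are the only parts taken from this page.

```python
# Only the key and value below come from this page; everything else is illustrative.
COMET_SCALA_UDF_CONF = {"spark.comet.exec.scalaUDF.codegen.enabled": "true"}


def to_submit_args(conf):
    """Render a config mapping as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

The resulting list can be spliced into a `spark-submit` command line ahead of the application artifact. Setting the same key on a session builder or in `spark-defaults.conf` is an alternative.
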
## Supported

- User functions registered via `udf(...)`, `spark.udf.register(...)` (Scala or Java functional interfaces), or SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`.

[...]

The following are not supported and fall back to Spark:

- Aggregate UDFs (`ScalaAggregator`, `TypedImperativeAggregate`, the legacy `UserDefinedAggregateFunction`).
- Table UDFs and generators.
- Python `@udf` and Pandas `@pandas_udf`.
- Hive `GenericUDF` and `SimpleUDF`.
- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; for native execution, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`).
- Trees whose total nested-field count (output plus all input columns the UDF tree references) exceeds `spark.sql.codegen.maxFields` (default 100). Comet refuses these at plan time and the operator falls back to Spark.
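
The field-count limit in the last item can be estimated before running a query. The sketch below counts nested fields under one plausible reading of the rule: a struct sums its children, an array counts its element type, a map counts key plus value, and every other type counts as 1. The tuple-based schema encoding and the function names are invented for this example, and Comet's exact accounting may differ.

```python
MAX_FIELDS = 100  # default of spark.sql.codegen.maxFields, as stated above


def nested_field_count(dtype):
    """Count leaf fields of a type.

    A type is a string (leaf), ("struct", [types...]), ("array", element),
    or ("map", key, value). This encoding is hypothetical.
    """
    if isinstance(dtype, str):
        return 1
    kind = dtype[0]
    if kind == "struct":
        return sum(nested_field_count(field) for field in dtype[1])
    if kind == "array":
        return nested_field_count(dtype[1])
    if kind == "map":
        return nested_field_count(dtype[1]) + nested_field_count(dtype[2])
    raise ValueError(f"unknown type kind: {kind!r}")


def exceeds_limit(output_type, input_types, max_fields=MAX_FIELDS):
    """True when the output plus all referenced inputs exceed the field limit."""
    total = nested_field_count(output_type)
    total += sum(nested_field_count(t) for t in input_types)
    return total > max_fields
```

Under this counting, the `Struct<type, size, indices, values>` representation mentioned above contributes four fields.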

When a UDF is rejected, the reason surfaces through Comet's standard fallback diagnostics; the query still runs on Spark.

## Behavior

- Non-deterministic expressions referenced from the argument tree (`rand`, `uuid`, `monotonically_increasing_id`) produce per-partition sequences consistent with Spark.
- `TaskContext.get()` inside the user function returns the driving Spark task's context.
- The user function must be closure-serializable; a function that Spark can already ship to its executors works here unchanged.

## Known limitations

- Each query containing a `ScalaUDF` pays a one-time codegen cost on its first batch and reuses the compiled kernel for subsequent batches, matching Spark's whole-stage codegen behavior. Bytecode is deduplicated JVM-wide through the same `CodeGenerator` cache, so structurally identical queries within a session share the compiled class.
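
The compile-once, reuse-afterwards behavior can be pictured as a cache keyed by generated source text. The following is an analogy in plain Python, not Spark's or Comet's code: `generate_source` and `compile_kernel` are hypothetical names, and the cache size of 100 is an arbitrary choice for the sketch.

```python
from functools import lru_cache


def generate_source(op, constant):
    """Produce source text for a per-batch kernel; equal inputs yield equal text."""
    return f"def kernel(batch):\n    return [x {op} {constant!r} for x in batch]\n"


@lru_cache(maxsize=100)
def compile_kernel(source):
    """Compile the source once; later calls with identical text hit the cache."""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]
```

Two structurally identical queries produce the same source text, so the second lookup returns the already-compiled function instead of compiling again. Only the first batch of the first query pays the compilation cost.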