You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Check whether the PR requires updates to user-facing documentation in `docs/`:
246
+
Some user-facing docs are auto-generated from the serde. Others are hand-edited. Treat them differently.
247
247
248
-
-**Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented.
249
-
-**Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented.
250
-
-**Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added.
248
+
**Generated by `GenerateDocs`** — do NOT ask the contributor to edit these by hand. CI regenerates and publishes them on every merge to `main`:
251
249
252
-
If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.
- Configuration reference at `docs/source/user-guide/latest/configs.md`
252
+
253
+
For these, check the _source_ instead. Does the new or modified `CometExpressionSerde` provide accurate `getIncompatibleReasons()` and `getUnsupportedReasons()` strings? Each returned string is rendered as a bullet on the corresponding compat page. Common gaps to flag:
254
+
255
+
- Expression marked `Incompatible(Some("..."))` in `getSupportLevel` but `getIncompatibleReasons()` is empty, so the compat page shows it as supported with no caveats.
256
+
-`Unsupported(Some("..."))` for specific data types or argument shapes but no `getUnsupportedReasons()` to surface the limitation to users.
257
+
- Reason strings drifting from the `notes` argument passed to `Compatible` / `Incompatible` / `Unsupported`. They do not have to match exactly, but consistency helps users.
258
+
- Reason strings that are too terse to be useful in user-facing docs (a single word, no context, no link to a tracking issue when behavior is known to differ).
259
+
260
+
See `docs/source/contributor-guide/adding_a_new_expression.md` (sections "Documenting Incompatible and Unsupported Reasons") for the contract these methods follow.
261
+
262
+
**Hand-edited** — PR should update if relevant:
263
+
264
+
-`docs/source/user-guide/latest/expressions.md` — the supported-expressions list. New expressions belong here.
265
+
- Other `latest/compatibility/` pages such as `floating-point.md`, `operators.md`, `regex.md`, `scans.md`.
266
+
- Top-level user-guide pages such as `iceberg.md`, `installation.md`, `tuning.md` when the PR changes user-visible behavior.
267
+
268
+
If the PR adds a new expression but does not update `expressions.md`, flag that. If it touches incompatibility behavior, flag that the serde reasons should reflect the change.
253
269
254
270
### 8. Common Comet Review Issues
255
271
@@ -259,6 +275,7 @@ If the PR adds a new expression or operator but does not update the relevant doc
259
275
4.**Tests in wrong framework**: Expression tests should use Comet SQL Tests (`CometSqlFileTestSuite`) rather than adding to Comet Scala Tests like `CometExpressionSuite`. Suggest migration if the PR adds Comet Scala Tests for expressions that could use Comet SQL Tests instead.
260
276
5.**Stale native code**: PR might need `./mvnw install -pl common -DskipTests`
261
277
6.**Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible`
278
+
7.**Scalar function name collides with a DataFusion built-in**: If the PR registers a Spark function whose name is also defined by `datafusion-functions` (e.g. `levenshtein`, `concat`, `coalesce`, `sha2`, `regexp_replace`), check that the serde sets the return type explicitly via `scalarFunctionExprToProtoWithReturnType` rather than `scalarFunctionExprToProto` or the bare `CometScalarFunction(name)` shortcut. Without an explicit return type, the native planner consults DataFusion's UDF registry first for type resolution, and any arity or input-type difference between the Spark and DataFusion versions will fail native execution with `Error from DataFusion: Function 'X' expects N arguments but received M`. The Comet UDF is only swapped in _after_ DF's signature validation passes. See the "When to set the return type explicitly" section in `docs/source/contributor-guide/adding_a_new_expression.md`.
Copy file name to clipboardExpand all lines: docs/source/contributor-guide/adding_a_new_expression.md
+34Lines changed: 34 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -84,6 +84,40 @@ For simple scalar functions that map directly to a DataFusion function, you can
84
84
classOf[Cos] ->CometScalarFunction("cos")
85
85
```
86
86
87
+
#### When to set the return type explicitly
88
+
89
+
`CometScalarFunction(name)` and the lower-level `scalarFunctionExprToProto(name, args)` helper both produce a protobuf `ScalarFunc` message **without** a `return_type` field. That is fine when the function name does not collide with a DataFusion built-in, or when it does collide and the Spark and DataFusion versions take the same arity and types. In that case the native planner consults DataFusion's UDF registry only to resolve the return type, then swaps in Comet's UDF for execution.
90
+
91
+
It is **not** fine when the Spark function and the DataFusion built-in differ in arity or input types. The native planner calls `coerce_types` and `return_field_from_args` on DataFusion's UDF before Comet's UDF is selected, and a signature mismatch fails the query at execution time with an error like:
92
+
93
+
```
94
+
org.apache.comet.CometNativeException: Error from DataFusion:
95
+
Function 'levenshtein' expects 2 arguments but received 3.
96
+
```
97
+
98
+
The classic case is `levenshtein`. Spark accepts an optional 3rd `threshold` argument, DataFusion's built-in is 2-arg only, so the 3-arg form fails native execution unless the serde sets the return type explicitly. Other names that exist in both engines with potentially different signatures include `concat`, `coalesce`, `sha2`, and `regexp_replace`. If you are adding a function whose name is shared with `datafusion-functions`, check the upstream signature before deciding how to serialize.
99
+
100
+
To avoid the registry lookup, write a custom `CometExpressionSerde` and use `scalarFunctionExprToProtoWithReturnType`, passing the Spark expression's declared `dataType`:
When the return type is set on the proto, the native planner skips the registry lookup entirely and routes straight to the Comet UDF registered in `create_comet_physical_fun_with_eval_mode`.
120
+
87
121
#### Registering the Expression Handler
88
122
89
123
Once you've created your `CometExpressionSerde` implementation, register it in `QueryPlanSerde.scala` by adding it to the appropriate expression map (e.g., `mathExpressions`, `stringExpressions`, `predicateExpressions`, etc.):
0 commit comments