Skip to content

Commit d55fc9c

Browse files
authored
docs: Rewrite supported expressions page to show complete overview of what is and is not supported by Comet (#4550)
1 parent 10c2a6d commit d55fc9c

3 files changed

Lines changed: 684 additions & 383 deletions

File tree

docs/source/contributor-guide/adding_a_new_expression.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,10 @@ Before you start, have a look through [these slides](https://docs.google.com/pre
2727

2828
You may have a specific expression in mind that you'd like to add, but if not, you can review the [expression coverage document](spark_expressions_support.md) to see which expressions are not yet supported.
2929

30+
When you add or change an expression, update **both** the coverage checklist
31+
(`spark_expressions_support.md`) and the user-facing status in
32+
[Supported Spark Expressions](../user-guide/latest/expressions.md).
33+
3034
## Implementing the Expression
3135

3236
Once you have the expression you'd like to add, you should take inventory of the following:

docs/source/contributor-guide/spark_expressions_support.md

Lines changed: 45 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -106,9 +106,9 @@
106106
- [ ] percentile_approx
107107
- [ ] percentile_cont
108108
- [ ] percentile_disc
109-
- [ ] regr_avgx
110-
- [ ] regr_avgy
111-
- [ ] regr_count
109+
- [x] regr_avgx
110+
- [x] regr_avgy
111+
- [x] regr_count
112112
- [ ] regr_intercept
113113
- [ ] regr_r2
114114
- [ ] regr_slope
@@ -243,7 +243,7 @@
243243
- Spark 4.1.1 (audited 2026-05-27): `inputTypes` tightened to `Seq(ArrayType, IntegralType)` (analysis-time only); runtime unchanged.
244244
- [ ] sequence
245245
- [ ] shuffle
246-
- [ ] slice
246+
- [x] slice
247247
- [x] sort_array
248248
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
249249
- Spark 3.5.8 (audited 2026-05-27): baseline. `SortArray(base, ascendingOrder) extends BinaryExpression with ArraySortLike`; the second arg must be a `Literal(_: Boolean, BooleanType)`. Comet `CometSortArray` flags `Incompatible` under strict floating-point and falls back for nested arrays whose innermost element is `Struct` or `Null`.
@@ -267,7 +267,7 @@
267267
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
268268
- Spark 4.0.1 (audited 2026-05-27): `>>` parses to `ShiftRight`. Comet `CometShiftRight` mirrors the same operand-cast logic as `CometShiftLeft`.
269269
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
270-
- [ ] `>>>`
270+
- [x] `>>>`
271271
- [x] `^`
272272
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
273273
- Spark 3.5.8 (audited 2026-05-27): baseline. `BitwiseXor(left, right) extends BinaryArithmetic` over `IntegralType`. Comet routes via `CometBitwiseXor` to the proto's `bitwise_xor` binary expression.
@@ -311,8 +311,9 @@
311311

312312
### collection_funcs
313313

314-
- [ ] array_size
315-
- [ ] cardinality
314+
- [x] array_size
315+
- Native via `size`; returns -1 instead of NULL for NULL input (https://github.com/apache/datafusion-comet/issues/4560).
316+
- [x] cardinality
316317
- [x] concat
317318
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
318319
- Spark 3.5.8 (audited 2026-05-27): baseline. `Concat(children) extends ComplexTypeMergingExpression with QueryErrorsBase`; `allowedTypes = Seq(StringType, BinaryType, ArrayType)`; result type is the merged child type. Empty children is allowed and returns the empty string of the result type.
@@ -355,7 +356,7 @@
355356
- Spark 3.5.8 (audited 2026-05-27): `NullIf(left, right, replacement) extends RuntimeReplaceable with InheritAnalysisRules`; the analyzer rewrites to `If(EqualTo(left, right), Literal(null, left.dataType), left)`. Comet handles via `CometIf` plus `CometEqualTo`.
356357
- Spark 4.0.1 (audited 2026-05-27): identical to 3.5.8.
357358
- Spark 4.1.1 (audited 2026-05-27): identical to 3.5.8.
358-
- [ ] nullifzero
359+
- [x] nullifzero
359360
- [x] nvl
360361
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
361362
- Spark 3.5.8 (audited 2026-05-27): `Nvl(left, right, replacement) extends RuntimeReplaceable`; analyzer rewrites to `Coalesce(Seq(left, right))`. Comet handles via `CometCoalesce`.
@@ -371,7 +372,7 @@
371372
- Spark 3.5.8 (audited 2026-05-27): the `CASE WHEN ... THEN ...` SQL form lowers to `CaseWhen(branches: Seq[(Expression, Expression)], elseValue: Option[Expression])`. Spark evaluates left-to-right with short-circuit; result type is the merged branch type. Comet routes via `CometCaseWhen` to the native `CaseWhen` proto.
372373
- Spark 4.0.1 (audited 2026-05-27): adds the `withNewAlwaysEvaluatedInputs` optimizer hook; semantics unchanged.
373374
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
374-
- [ ] zeroifnull
375+
- [x] zeroifnull
375376

376377
### conversion_funcs
377378

@@ -486,7 +487,8 @@
486487
- Known divergence: Comet's native timezone parser does not accept Spark's legacy zone forms (`GMT+1`, `UTC+1`, three-letter abbreviations like `PST`). Such timezones throw a native parse error at execution.
487488
- [x] trunc
488489
- [ ] try_make_interval
489-
- [ ] try_make_timestamp
490+
- [x] try_make_timestamp
491+
- Native for valid inputs; returns wrong values for invalid inputs instead of NULL (https://github.com/apache/datafusion-comet/issues/4554).
490492
- [ ] try_to_date
491493
- [ ] try_to_time
492494
- [ ] try_to_timestamp
@@ -505,13 +507,13 @@
505507

506508
- [x] explode
507509
- Handled at the operator level as a `GenerateExec` (`CometExplodeExec`), not via the expression serde maps, so it is not auto-detected by the function-registry checkbox logic. Compatible for array inputs; map inputs fall back ([#2837](https://github.com/apache/datafusion-comet/issues/2837)).
508-
- [ ] explode_outer
510+
- [x] explode_outer
509511
- Same `CometExplodeExec` path as `explode`, but the `outer=true` case is `Incompatible` (empty arrays are not preserved as null outputs) and falls back unless `spark.comet.expr.allowIncompatible=true` ([datafusion#19053](https://github.com/apache/datafusion/issues/19053)).
510512
- [ ] inline
511513
- [ ] inline_outer
512514
- [x] posexplode
513515
- Handled at the operator level as a `GenerateExec` (`CometExplodeExec`), like `explode`. Compatible for array inputs; map inputs fall back ([#2837](https://github.com/apache/datafusion-comet/issues/2837)).
514-
- [ ] posexplode_outer
516+
- [x] posexplode_outer
515517
- Same `CometExplodeExec` path as `posexplode`, but the `outer=true` case is `Incompatible` and falls back unless `spark.comet.expr.allowIncompatible=true` ([datafusion#19053](https://github.com/apache/datafusion/issues/19053)).
516518
- [ ] stack
517519

@@ -556,7 +558,8 @@
556558

557559
### json_funcs
558560

559-
- [ ] from_json
561+
- [x] from_json
562+
- Partial native support, marked `Incompatible` (requires explicit schema).
560563
- [x] get_json_object
561564
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
562565
- Spark 3.5.8 (audited 2026-05-27): baseline. `BinaryExpression with ExpectsInputTypes with CodegenFallback`; `inputTypes = Seq(StringType, StringType) -> StringType`. Eval is inline and uses Jackson with `RawStyle` output. Foldable paths are parsed once. Returns NULL for invalid JSON, missing paths, or `JsonProcessingException`.
@@ -567,7 +570,8 @@
567570
- [ ] json_object_keys
568571
- [ ] json_tuple
569572
- [ ] schema_of_json
570-
- [ ] to_json
573+
- [x] to_json
574+
- Partial native support; options and map/array inputs fall back.
571575

572576
### lambda_funcs
573577

@@ -629,7 +633,7 @@
629633
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringToMap(text, pairDelim, keyValueDelim) extends TernaryExpression`; splits `text` on `pairDelim`, then each pair on `keyValueDelim` (default `","` and `":"`). Uses `ArrayBasedMapBuilder` for duplicate-key handling. Wired as `CometScalarFunction("str_to_map")`.
630634
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `StringTypeNonCSAICollation`; uses `CollationAwareUTF8String.splitSQL` with a `collationId`. Runtime unchanged for `UTF8_BINARY`.
631635
- Spark 4.1.1 (audited 2026-05-27): adds the `legacySplitTruncate` flag (driven by `spark.sql.legacy.truncateForEmptyRegexSplit`) to both `splitSQL` calls (https://github.com/apache/datafusion-comet/issues/4477). The Comet native impl does not honour this flag; behaviour matches the non-legacy default.
632-
- [ ] try_element_at
636+
- [x] try_element_at
633637

634638
### math_funcs
635639

@@ -681,7 +685,8 @@
681685
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.toDegrees, "DEGREES")` unchanged across versions.
682686
- [x] div
683687
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `IntegralDivide(left, right, evalMode)`. Non-decimal operands are cast to `DecimalType(19, 0)`; result is recomputed per `IntegralDivide.resultDecimalType`, wrapped in `CheckOverflow`, then cast to `Long`. ANSI overflow for `Long.MinValue div -1` and decimal-overflow ANSI cases are covered by existing tests.
684-
- [ ] e
688+
- [x] e
689+
- Foldable; rewritten to a literal by ConstantFolding (like `pi`).
685690
- [x] exp
686691
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(StrictMath.exp, "EXP")` unchanged. ULP-level differences vs DataFusion `exp` are possible but unflagged.
687692
- [x] expm1
@@ -729,7 +734,7 @@
729734
- See `misc_funcs / rand`.
730735
- [x] randn
731736
- See `misc_funcs / randn`.
732-
- [ ] random
737+
- [x] random
733738
- [ ] randstr
734739
- [x] rint
735740
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.rint, "ROUND")` with `funcName = "rint"`. Passthrough to DataFusion `rint` (round-half-to-even).
@@ -764,7 +769,7 @@
764769
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): rewrites to `Subtract(.., EvalMode.TRY)`. Integer path uses `checked_sub`; decimal uses `WideDecimalBinaryExpr` as needed.
765770
- [x] unhex
766771
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Unhex(child, failOnError)`. Spark 4.x widens input to `StringTypeWithCollation` and wraps the inner call in try/catch; Comet `CometUnhex` forwards `failOnError` to native `spark_unhex` but does not gate on collation.
767-
- [ ] uniform
772+
- [x] uniform
768773
- [x] width_bucket
769774
- Spark 3.5.8 (audited 2026-05-27): introduced; not available in 3.4.3.
770775
- Spark 4.0.1, 4.1.1 (audited 2026-05-27): same semantics; `NullIntolerant` -> `nullIntolerant: Boolean` refactor.
@@ -782,11 +787,15 @@
782787
- [ ] bitmap_construct_agg
783788
- [ ] bitmap_count
784789
- [ ] bitmap_or_agg
785-
- [ ] current_catalog
786-
- [ ] current_database
787-
- [ ] current_schema
788-
- [ ] current_user
789-
- [ ] equal_null
790+
- [x] current_catalog
791+
- Resolved to a literal by the analyzer (`ReplaceCurrentLike`).
792+
- [x] current_database
793+
- Resolved to a literal by the analyzer (`ReplaceCurrentLike`).
794+
- [x] current_schema
795+
- Alias of `current_database`; resolved to a literal by the analyzer.
796+
- [x] current_user
797+
- Resolved to a literal by the analyzer; same as `user`.
798+
- [x] equal_null
790799
- [ ] from_avro
791800
- [ ] from_protobuf
792801
- [ ] hll_sketch_estimate
@@ -819,7 +828,8 @@
819828
- [ ] schema_of_avro
820829
- [ ] schema_of_variant
821830
- [ ] schema_of_variant_agg
822-
- [ ] session_user
831+
- [x] session_user
832+
- Alias of `current_user`; resolved to a literal by the analyzer.
823833
- [x] spark_partition_id
824834
- Spark 3.4.3 (audited 2026-05-27): byte-for-byte identical to 4.1.1. `SparkPartitionID() extends LeafExpression with Nondeterministic`; returns the integer index of the partition being processed. Comet emits an empty `SparkPartitionId` proto.
825835
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
@@ -841,7 +851,8 @@
841851
- [ ] try_parse_json
842852
- [ ] try_reflect
843853
- [ ] try_variant_get
844-
- [ ] typeof
854+
- [x] typeof
855+
- Foldable; resolved to a literal before Comet sees the plan.
845856
- [x] user
846857
- Spark 3.4.3 (audited 2026-05-27): `CurrentUser() extends LeafExpression with Unevaluable`; the analyzer's `ResolveCurrentLike` rule replaces it with a `StringType` literal of the current user name before Comet sees the plan. No Comet serde needed; the literal flows through `CometLiteral`.
847858
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
@@ -940,8 +951,8 @@
940951
- Spark 3.5.8 (audited 2026-05-27): baseline. `Or(left, right) extends BinaryOperator with Predicate`; short-circuit left-to-right. Comet routes via `CometOr` to the proto's `or` binary expression.
941952
- Spark 4.0.1 (audited 2026-05-27): semantics unchanged.
942953
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
943-
- [ ] regexp
944-
- [ ] regexp_like
954+
- [x] regexp
955+
- [x] regexp_like
945956
- [x] rlike
946957
- See `string_funcs / regexp_replace` and the `CometRLike` notes (audited in PR #4461). Uses the Rust `regex` crate, which differs from Java's `Pattern` engine; requires `spark.comet.expression.regexp.allowIncompatible=true`.
947958

@@ -956,7 +967,7 @@
956967
- [x] character_length
957968
- [x] chr
958969
- [ ] collate
959-
- [ ] collation
970+
- [x] collation
960971
- [x] concat_ws
961972
- [x] contains
962973
- [x] decode
@@ -978,7 +989,7 @@
978989
- [x] lower
979990
- [x] lpad
980991
- [x] ltrim
981-
- [ ] luhn_check
992+
- [x] luhn_check
982993
- [ ] make_valid_utf8
983994
- [ ] mask
984995
- [x] octet_length
@@ -1006,7 +1017,7 @@
10061017
- [x] substr
10071018
- [x] substring
10081019
- [x] substring_index
1009-
- [ ] to_binary
1020+
- [x] to_binary
10101021
- [ ] to_char
10111022
- [ ] to_number
10121023
- [ ] to_varchar
@@ -1052,8 +1063,10 @@
10521063

10531064
- [ ] cume_dist
10541065
- [ ] dense_rank
1055-
- [ ] lag
1056-
- [ ] lead
1066+
- [x] lag
1067+
- Supported via `CometWindowExec` (operator level), not the expression serde.
1068+
- [x] lead
1069+
- Supported via `CometWindowExec` (operator level), not the expression serde.
10571070
- [ ] nth_value
10581071
- [ ] ntile
10591072
- [ ] percent_rank

0 commit comments

Comments
 (0)