Skip to content

Latest commit

 

History

History
679 lines (592 loc) · 28.7 KB

File metadata and controls

679 lines (592 loc) · 28.7 KB

Spark Expression Support

This page is the complete reference for how Apache Comet handles each Spark built-in expression. Comet accelerates expressions either with a native (Rust) implementation or by dispatching to a Spark-compatible codegen path. When an expression is not supported, Comet transparently falls back to Spark for that part of the plan; results are unaffected.

Expressions marked ✅ Supported are enabled by default and produce Spark-compatible results.

Some ✅ Supported expressions have specific incompatible cases that fall back to Spark by default. Those cases must be opted into per expression with spark.comet.expression.EXPRNAME.allowIncompatible=true (where EXPRNAME is the Spark expression class name, for example Cast). There is no global opt-in.

Most expressions can also be disabled with spark.comet.expression.EXPRNAME.enabled=false, where EXPRNAME is the Spark expression class name (for example Length or StartsWith). See the Comet Configuration Guide for the full list.

Status legend

Status Meaning
✅ Supported Comet produces Spark-compatible results by default. Some inputs or forms may fall back to Spark, and any incompatible behavior is opt-in (off by default).
🔜 Planned Intended; tracked by an open issue or pull request.

Not currently planned

Comet focuses acceleration on mainstream relational, string, datetime, math, and collection expressions. The following function families are not currently planned for native acceleration (they are not on the 1.0 roadmap): specialized functionality with narrow real-world analytics use and high implementation cost. They fall back to Spark and may be reconsidered based on demand:

  • Probabilistic sketches and approximate top-k (kll_sketch_*, hll_*, theta_*, count_min_sketch, bitmap_*, approx_top_k*): specialized data structures with exact-correctness traps.
  • Geospatial (st_*): brand-new Spark 4.1 functionality, specialized.
  • Avro / Protobuf codecs (from_avro, to_avro, from_protobuf, to_protobuf, schema_of_avro): format conversion belongs at the IO layer, not expression evaluation.
  • JVM reflection (java_method, reflect): niche, and they invoke arbitrary JVM methods (a security concern).
  • UTF-8 validation (is_valid_utf8, make_valid_utf8, validate_utf8, try_validate_utf8): niche Spark 4.x string-validation helpers.
  • Miscellaneous niche (histogram_numeric, version, sentences, quote): low-value or specialized functions with little benefit from native acceleration.

The file-metadata functions input_file_name, input_file_block_start, and input_file_block_length depend on scan-internal per-row file information rather than the expression layer; their support status is covered in the scan compatibility guide.

Note that approx_count_distinct, median, and mode are planned: they are mainstream (median and mode are exact aggregates). approx_percentile / percentile_approx are not currently planned because their approximate results cannot be made bit-identical to Spark.

The tables below list every Spark built-in expression with its current status.

agg_funcs

Function Status Notes
any
any_value
approx_count_distinct 🔜 tracking #4098
array_agg 🔜 Array aggregate (related to collect_list, #2524)
avg Interval types fall back
bit_and
bit_or
bit_xor
bool_and
bool_or
collect_list 🔜 #2524
collect_set
corr
count
count_if
covar_pop
covar_samp
every
first
first_value
grouping 🔜 Grouping indicator for ROLLUP/CUBE/GROUPING SETS
grouping_id 🔜 Grouping indicator for ROLLUP/CUBE/GROUPING SETS
kurtosis 🔜 tracking #4098
last
last_value
listagg 🔜 String aggregation
max
max_by 🔜 #3841
mean
median 🔜 tracking #4098
min
min_by 🔜 #3841
mode 🔜 #3970
percentile 🔜 #4542
percentile_cont 🔜 Percentile aggregate
percentile_disc 🔜 Percentile aggregate
regr_avgx Native: Spark rewrites to Average (tests in #4551)
regr_avgy Native: Spark rewrites to Average (tests in #4551)
regr_count Native: Spark rewrites to Count (tests in #4551)
regr_intercept 🔜 Falls back; can reuse covar_pop/var_pop accumulators (#4552)
regr_r2 🔜 Falls back; can reuse the corr accumulator (#4552)
regr_slope 🔜 Falls back; can reuse covar_pop/var_pop accumulators (#4552)
regr_sxx 🔜 Falls back; can reuse var_pop accumulator (#4552)
regr_sxy 🔜 Falls back; can reuse covar_pop accumulator (#4552)
regr_syy 🔜 Falls back; can reuse var_pop accumulator (#4552)
skewness 🔜 tracking #4098
some
std
stddev
stddev_pop
stddev_samp
string_agg 🔜 String aggregation (alias of listagg)
sum
try_avg 🔜 tracking #4098
try_sum 🔜 tracking #4098
var_pop
var_samp
variance

array_funcs

Function Status Notes
array
array_append
array_compact
array_contains NaN/signed-zero handling may differ (details)
array_distinct NaN/signed-zero handling may differ (details)
array_except Incompatible; falls back by default (details)
array_insert
array_intersect Incompatible; falls back by default (details)
array_join Incompatible; falls back by default (details)
array_max NaN ordering may differ (details)
array_min NaN ordering may differ (details)
array_position Binary/struct/map/null elements fall back
array_prepend 🔜 Sibling of array_append
array_remove
array_repeat
array_union NaN/signed-zero handling may differ (details)
arrays_overlap
arrays_zip
element_at MapType input falls back
flatten Binary/struct/map elements fall back
get
sequence
shuffle 🔜 Random array shuffle
slice Native (#4149)
sort_array Nested struct/null arrays fall back

bitwise_funcs

Function Status Notes
&
<<
>>
>>> Operator alias for shiftrightunsigned (Spark 4.0+)
^
bit_count
bit_get
getbit
shiftright
shiftrightunsigned
|
~

collection_funcs

Function Status Notes
array_size
cardinality MapType input falls back
concat Binary/array children fall back
reverse Binary-element arrays fall back (Incompatible) (details)
size MapType input falls back

conditional_funcs

Function Status Notes
coalesce
if
ifnull
nanvl
nullif
nullifzero Lowers to if/= (Spark 4.0+)
nvl
nvl2
when
zeroifnull Lowers to coalesce (Spark 4.0+)

conversion_funcs

The type-name conversion functions (bigint, binary, boolean, date, decimal, double, float, int, smallint, string, timestamp, tinyint) are SQL aliases for CAST(... AS <type>) and share the support and caveats of cast.

Function Status Notes
cast Some casts fall back; float-to-decimal is opt-in (details)

csv_funcs

Function Status Notes
from_csv
schema_of_csv
to_csv

datetime_funcs

Function Status Notes
add_months
convert_timezone
curdate Constant-folded to a literal (alias of current_date)
current_date Constant-folded to a literal before Comet sees the plan
current_time 🔜 Blocked on Spark 4.1 TIME type support (#4288)
current_timestamp Constant-folded to a literal before Comet sees the plan
current_timezone
date_add
date_diff
date_format
date_from_unix_date
date_part
date_sub
date_trunc
dateadd
datediff
datepart
day
dayname Abbreviated day name (Spark 4.0+)
dayofmonth
dayofweek
dayofyear
extract
from_unixtime
from_utc_timestamp Legacy zone forms fall back (Incompatible) (details)
hour
last_day
localtimestamp
make_date
make_dt_interval 🔜 #4541
make_interval 🔜 Produces legacy CalendarInterval; tracked by #4540
make_time 🔜 Spark 4.1 TIME type; tracked by #4288
make_timestamp
make_timestamp_ltz 2-arg TIME form falls back
make_timestamp_ntz 2-arg TIME form falls back
make_ym_interval 🔜 #4541
minute
month
monthname Abbreviated month name (Spark 4.0+)
months_between
next_day
now Constant-folded to a literal (alias of current_timestamp)
quarter
second
session_window 🔜 Time-window grouping; tracked by #4553
time_diff 🔜 Spark 4.1 TIME type; tracked by #4288
time_trunc 🔜 Spark 4.1 TIME type; tracked by #4288
timestamp_micros
timestamp_millis
timestamp_seconds
to_date Rewrites to Cast (or Cast(GetTimestamp) with a format) before Comet sees the plan
to_time 🔜 Spark 4.1 TIME type; tracked by #4288
to_timestamp Rewrites to Cast (or GetTimestamp with a format) before Comet sees the plan
to_timestamp_ltz Rewrites to to_timestamp (TimestampType)
to_timestamp_ntz Rewrites to to_timestamp (TimestampNTZType)
to_unix_timestamp
to_utc_timestamp Legacy zone forms fall back (Incompatible) (details)
trunc
try_make_interval 🔜 Produces legacy CalendarInterval; tracked by #4540
try_make_timestamp
try_to_date 🔜 Rewrites to Cast/GetTimestamp but currently falls back; tracked by #4556
try_to_time 🔜 Spark 4.1 TIME type; tracked by #4288
try_to_timestamp 🔜 Rewrites to Cast/GetTimestamp but currently falls back; tracked by #4556
unix_date
unix_micros
unix_millis
unix_seconds
unix_timestamp
weekday
weekofyear
window 🔜 Time-window grouping; tracked by #4553
window_time 🔜 Time-window grouping; tracked by #4553
year

generator_funcs

explode and posexplode are supported via CometExplodeExec (operator-level, not expression-level). The outer variants are wired but marked Incompatible; they require spark.comet.exec.explode.enabled=true and allowIncompatible.

Function Status Notes
explode via CometExplodeExec
explode_outer outer=true falls back (Incompatible) (audit)
inline 🔜 Operator-level generator (like explode)
inline_outer 🔜 Operator-level generator (like explode)
posexplode via CometExplodeExec
posexplode_outer outer=true falls back (Incompatible) (audit)
stack 🔜 Operator-level generator

hash_funcs

Function Status Notes
crc32
hash
md5
sha
sha1
sha2
xxhash64

json_funcs

Function Status Notes
from_json Falls back by default; opt-in via allowIncompatible (audit)
get_json_object Some inputs need allowIncompatible (audit)
json_array_length Single-quoted/trailing JSON needs allowIncompatible (audit)
json_object_keys
json_tuple 🔜 #3160
schema_of_json
to_json Options and map/array inputs fall back (audit)

lambda_funcs

Function Status Notes
aggregate
array_sort
exists
filter 🔜 General lambda not yet wired; the array_compact form is supported (#4224)
forall
map_filter
map_zip_with
reduce
transform
transform_keys
transform_values
zip_with

map_funcs

Function Status Notes
element_at MapType input falls back
map 🔜 Constructs a map
map_concat
map_contains_key
map_entries
map_from_arrays
map_from_entries BinaryType key/value falls back (Incompatible) (details)
map_keys
map_values
str_to_map
try_element_at Lowers to element_at; array input (MapType falls back)

math_funcs

Function Status Notes
%
* Interval multiplication falls back
+
-
/
abs Interval types fall back
acos
acosh
asin
asinh
atan
atan2
atanh
bin
bround
cbrt
ceil Two-arg form falls back
ceiling
conv
cos
cosh
cot
csc
degrees
div
e Folds to a literal (like pi)
exp
expm1
factorial
floor Two-arg form falls back
greatest
hex
hypot
least
ln
log
log10
log1p
log2
mod
negative
pi
pmod
positive
pow
power
radians
rand
randn
random Alias for rand (Spark 4.0+); seed must be a literal
randstr 🔜 Random string (Spark 4.0+)
rint
round Float/double inputs fall back
sec
shiftleft
sign
signum
sin
sinh
sqrt
tan
tanh
try_add Datetime/interval form falls back
try_divide
try_mod
try_multiply
try_subtract
unhex
uniform Constant-folded; literal arguments only (Spark 4.0+)
width_bucket

misc_funcs

Function Status Notes
aes_decrypt Routed through the JVM codegen dispatcher
aes_encrypt Routed through the JVM codegen dispatcher; nondeterministic IV by default
assert_true 🔜 Lowers to RaiseError, which falls back
current_catalog Resolved to a literal by the analyzer (ReplaceCurrentLike)
current_database Resolved to a literal by the analyzer (ReplaceCurrentLike)
current_schema Alias of current_database; resolved to a literal by the analyzer
current_user Resolved to a literal by the analyzer; same as user
equal_null Lowers to <=> (EqualNullSafe)
is_variant_null 🔜 tracking #4098
monotonically_increasing_id
parse_json 🔜 tracking #4098
raise_error 🔜 Raises a runtime error
rand Seed must be a literal
randn Seed must be a literal
schema_of_variant 🔜 tracking #4098
schema_of_variant_agg 🔜 tracking #4098
session_user Alias of current_user; resolved to a literal by the analyzer
spark_partition_id
to_variant_object 🔜 tracking #4098
try_aes_decrypt Routed through the JVM codegen dispatcher
try_parse_json 🔜 tracking #4098
try_variant_get 🔜 tracking #4098
typeof Foldable; resolved to a literal before Comet sees the plan
user Resolved to a literal by the Spark analyzer before reaching Comet
uuid 🔜 Nondeterministic random UUID
variant_get 🔜 tracking #4098

predicate_funcs

Function Status Notes
!
<
<=
<=>
=
==
>
>=
and
between
ilike
in
isnan
isnotnull
isnull
like
not
or
regexp Falls back by default; opt-in via allowIncompatible (details)
regexp_like Falls back by default; opt-in via allowIncompatible (details)
rlike Falls back by default; opt-in via allowIncompatible (details)

string_funcs

Function Status Notes
ascii
base64 🔜 Lowers to StaticInvoke(encode) (not allowlisted); falls back
bit_length
btrim
char
char_length
character_length
chr
collate 🔜 Spark collation (umbrella #2190)
collation Constant-folded to a literal (Spark 4.0+)
concat_ws
contains
decode
elt
encode 🔜 Lowers to StaticInvoke(encode) (not allowlisted); falls back
endswith
find_in_set
format_number
format_string
initcap
instr
lcase
left
len
length
levenshtein
locate
lower
lpad
ltrim
luhn_check Native via StaticInvoke (tests: luhn_check.sql)
mask 🔜 Data masking
octet_length
overlay
position
printf
regexp_count 🔜 tracking #4098
regexp_extract Routed through the JVM codegen dispatcher
regexp_extract_all Routed through the JVM codegen dispatcher
regexp_instr Routed through the JVM codegen dispatcher
regexp_replace
regexp_substr 🔜 tracking #4098
repeat
replace
right
rpad
rtrim
soundex
space
split
split_part 🔜 Lowers to element_at(StringSplitSQL(...)); StringSplitSQL falls back (#4561)
startswith
substr
substring
substring_index
to_binary Hex form accelerated; other formats fall back
to_char
to_number
to_varchar
translate Falls back by default; opt-in via allowIncompatible (#4463)
trim
try_to_binary 🔜 Lowers to TryEval(...), which falls back
try_to_number 🔜 TRY variant of to_number
ucase
unbase64
upper

struct_funcs

Function Status Notes
named_struct Duplicate field names fall back
struct

url_funcs

Function Status Notes
parse_url
try_url_decode
url_decode
url_encode

window_funcs

Window functions run via CometWindowExec. Window support is disabled by default due to known correctness issues (tracking #2721). When enabled, lag and lead are explicitly wired; aggregate window functions (count, min, max, sum) are also supported. Ranking functions (rank, dense_rank, row_number, ntile, percent_rank, cume_dist, nth_value) are not yet wired in the window serde and fall back to Spark.

Function Status Notes
cume_dist 🔜 Window function; tracked by #2721
dense_rank 🔜 Window function; tracked by #2721
lag via CometWindowExec
lead via CometWindowExec
nth_value 🔜 Window function; tracked by #2721
ntile 🔜 Window function; tracked by #2721
percent_rank 🔜 Window function; tracked by #2721
rank 🔜 Window function; tracked by #2721
row_number 🔜 Window function; tracked by #2721

xml_funcs

Function Status Notes
from_xml Spark 4.0+
schema_of_xml Spark 4.0+
to_xml Spark 4.0+
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number Alias of xpath_double
xpath_short
xpath_string

Beyond SQL functions

Comet also accelerates a number of Catalyst expressions that have no Spark SQL function name and therefore do not appear in the tables above. These arise from the DataFrame API, from SQL syntax other than function calls, or from the query optimizer. They include:

  • Operator and optimizer-injected expressions: runtime bloom-filter join probes (BloomFilterMightContain, BloomFilterAggregate), optimized IN sets (InSet), scalar subqueries (ScalarSubquery), and floating-point normalization (KnownFloatingPointNormalized).
  • Accessor expressions (subscript and field access, not functions): struct field access (col.field), array element access (arr[i]), and map value access (map[key]).
  • Internal decimal arithmetic: CheckOverflow, MakeDecimal, and UnscaledValue, which the analyzer inserts around decimal operations.
  • User-defined functions: Scala UDFs registered through the DataFrame or SQL API.
  • Structural expressions: aliases, attribute references, literals, sort orders, and CASE WHEN.

This list is illustrative, not exhaustive: the per-function tables are not the complete set of expressions Comet can accelerate.

See also