Skip to content

Commit 13f5d7b

Browse files
authored
docs: rework expression docs (source-of-truth status, per-category audits, refined status semantics) (#4568)
1 parent 575877d commit 13f5d7b

26 files changed

Lines changed: 1752 additions & 1360 deletions

.claude/skills/audit-comet-expression/SKILL.md

Lines changed: 17 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -579,16 +579,23 @@ After implementing tests, tell the user how to run them:
579579

580580
---
581581

582-
## Step 8: Update the Expression Support Doc
583-
584-
After completing the audit (whether or not tests were added), add sub-bullets under the expression's
585-
entry in `docs/source/contributor-guide/spark_expressions_support.md`.
586-
587-
Add one sub-bullet per Spark version checked, each including:
588-
589-
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1)
590-
- Today's date
591-
- A brief note for any version-specific finding (behavioral difference, known incompatibility); omit if nothing notable
582+
## Step 8: Record the Audit
583+
584+
After completing the audit (whether or not tests were added), record the findings on the
585+
expression's category page under `docs/source/contributor-guide/expression-audits/`.
586+
The page is named after the Spark function-registry category (e.g. `agg_funcs.md`,
587+
`string_funcs.md`, `datetime_funcs.md`). If you are unsure of the category, match the one
588+
used for the expression in `docs/source/user-guide/latest/expressions.md`.
589+
590+
- If the category page does not exist yet, create it with the ASF license header, a
591+
`# <category> Expression Audits` title, and the standard intro blockquote used by the
592+
other pages.
593+
- Add (or update) a `## <function_name>` section, keeping sections alphabetically ordered.
594+
- Under that heading, add one bullet per Spark version checked, each including:
595+
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1)
596+
- Today's date
597+
- A brief note for any version-specific finding (behavioral difference, known
598+
incompatibility); omit the note if nothing notable.
592599

593600
---
594601

.claude/skills/implement-comet-expression/SKILL.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ The contributor guide is the canonical reference. Read these before writing code
1212

1313
- `docs/source/contributor-guide/adding_a_new_expression.md` covers the Scala serde, protobuf, Rust scalar function flow, support levels, shims, and tests.
1414
- `docs/source/contributor-guide/sql-file-tests.md` describes the Comet SQL Tests format.
15-
- `docs/source/contributor-guide/spark_expressions_support.md` lists the coverage status for every expression.
15+
- `docs/source/user-guide/latest/expressions.md` lists the support status for every expression.
1616

1717
## Workflow
1818

.claude/skills/wire-datafusion-function/SKILL.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -109,7 +109,7 @@ Use `query expect_fallback(...)` for inputs where the serde returns `None`. Form
109109
Hand-curated (`make` does not regenerate these):
110110

111111
- `docs/source/user-guide/latest/expressions.md` — add `| <ExpressionClass> | \`<sql_name>\` |` to the matching category table, alphabetical.
112-
- `docs/source/contributor-guide/spark_expressions_support.md`flip `- [ ] <name>` to `- [x] <name>`.
112+
- `docs/source/user-guide/latest/expressions.md`update the function's Status from 🔜/💤 to ✅ (or ⚠️) in the matching category table.
113113

114114
Per-expression compatibility pages under `compatibility/expressions/*.md` are auto-generated from `getCompatibleNotes` / `getIncompatibleReasons` / `getUnsupportedReasons` — do not hand-edit.
115115

docs/source/contributor-guide/adding_a_new_expression.md

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -25,10 +25,9 @@ Before you start, have a look through [these slides](https://docs.google.com/pre
2525

2626
## Finding an Expression to Add
2727

28-
You may have a specific expression in mind that you'd like to add, but if not, you can review the [expression coverage document](spark_expressions_support.md) to see which expressions are not yet supported.
28+
You may have a specific expression in mind that you'd like to add, but if not, you can review [Supported Spark Expressions](../user-guide/latest/expressions.md) in the user guide to see which expressions are not yet supported. For deep-dive audit notes on expressions that are already supported, see the [Expression Audits](expression-audits/index.md) section.
2929

30-
When you add or change an expression, update **both** the coverage checklist
31-
(`spark_expressions_support.md`) and the user-facing status in
30+
When you add or change an expression, update its status in
3231
[Supported Spark Expressions](../user-guide/latest/expressions.md).
3332

3433
## Implementing the Expression
Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# agg_funcs Expression Audits
21+
22+
> Audit notes for expressions in this category that have been audited. Absence of an entry means the expression has not been audited yet, not that it is unsupported. See the user guide [Spark Expression Support] for current support status.
23+
24+
## any
25+
26+
- Spark 3.4.3 (audited 2026-05-26): registered as a SQL alias of `BoolOr`, which extends `RuntimeReplaceableAggregate` with `replacement = Max(child)`. Catalyst rewrites `any(x)` to `max(x)` before Comet sees the plan, so `any` is served by `CometMax` on a `BooleanType` column.
27+
- Spark 3.5.8 (audited 2026-05-26): identical to 3.4.3.
28+
- Spark 4.0.1 (audited 2026-05-26): identical to 3.4.3.
29+
30+
## avg
31+
32+
- Spark 3.4.3 (2026-05-26)
33+
- Spark 3.5.8 (2026-05-26): aggregate logic identical to 3.4.3
34+
- Spark 4.0.1 (2026-05-26): aggregate logic identical to 3.5.8; only `QueryContext` import path differs. `YearMonthIntervalType` and `DayTimeIntervalType` inputs (supported by Spark) fall back to Spark in Comet.
35+
36+
## bit_and
37+
38+
- Spark 3.4.3 (2026-05-26)
39+
- Spark 3.5.8 (2026-05-26)
40+
- Spark 4.0.1 (2026-05-26)
41+
42+
[Spark Expression Support]: ../../user-guide/latest/expressions.md
Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# array_funcs Expression Audits
21+
22+
> Audit notes for expressions in this category that have been audited. Absence of an entry means the expression has not been audited yet, not that it is unsupported. See the user guide [Spark Expression Support] for current support status.
23+
24+
## array
25+
26+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
27+
- Spark 3.5.8 (audited 2026-05-27): baseline. `CreateArray(children, useStringTypeWhenEmpty)`; element type is the common type of children. Comet routes via `CometCreateArray` (native `make_array`) and special-cases the empty-array case to dodge a known DataFusion `coerce_types` issue (#3338).
28+
- Spark 4.0.1 (audited 2026-05-27): semantics unchanged.
29+
- Spark 4.1.1 (audited 2026-05-27): adds `contextIndependentFoldable` override; runtime semantics unchanged.
30+
31+
## array_append
32+
33+
- Spark 3.4.3 (audited 2026-05-27): standalone `BinaryExpression`, evaluated directly. Comet routes via `CometArrayAppend`.
34+
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
35+
- Spark 4.0.1 (audited 2026-05-27): now `RuntimeReplaceable` and rewritten to `ArrayInsert(arr, Literal(-1), elem)`. `CometArrayAppend` is therefore unreachable; dispatch goes through `CometArrayInsert` (which carries its own `Incompatible` notes documented at the `array_insert` entry).
36+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
37+
38+
## array_compact
39+
40+
- Spark 3.4.3 (audited 2026-05-27): `RuntimeReplaceable` -> `ArrayFilter(arr, IsNotNull(lambda))`. Comet receives the rewritten form, dispatches through `CometArrayFilter`, which delegates back to `CometArrayCompact.convert` for the actual proto emission. The native path uses Comet's `spark_array_compact` UDF rather than DataFusion's `array_remove_all` because DataFusion 53 changed `array_remove_all`'s NULL semantics.
41+
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
42+
- Spark 4.0.1 (audited 2026-05-27): the replacement is wrapped in `KnownNotContainsNull(...)` (analysis-only hint, no semantic change).
43+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
44+
45+
## array_contains
46+
47+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
48+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayContains(left, right) extends BinaryExpression with NullIntolerant with Predicate`; `inputTypes` uses `findWiderTypeWithoutStringPromotionForTwo`. Wired as `CometScalarFunction("array_contains")`.
49+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait replaced by `nullIntolerant: Boolean`; `checkInputDataTypes` adopts `DataTypeUtils.sameType` (collation-aware in 4.x).
50+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
51+
- Known limitation: no NaN-canonicalization guard in `getSupportLevel`. For `Float`/`Double` arrays containing NaN, Spark's `SQLOrderingUtil` may produce different results than DataFusion's IEEE comparison (https://github.com/apache/datafusion-comet/issues/4481).
52+
53+
## array_distinct
54+
55+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
56+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayDistinct(child)` over `ArraySetLike`; uses `SQLOpenHashSet` so NaN and `+0.0`/`-0.0` are canonicalized. Wired as `CometScalarFunction("array_distinct")`.
57+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
58+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
59+
- Known divergence: DataFusion `array_distinct` uses hash-based equality without NaN/signed-zero canonicalization, so float/double arrays may produce different results (https://github.com/apache/datafusion-comet/issues/4481).
60+
61+
## array_except
62+
63+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
64+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayExcept(left, right) extends ArrayBinaryLike with ComplexTypeMergingExpression`; result preserves left-side first occurrences not present in right. Comet routes via `CometArrayExcept` and unconditionally flags `Incompatible` ("Null handling and ordering may differ from Spark"); also falls back for `BinaryType` / `StructType` element types.
65+
- Spark 4.0.1 (audited 2026-05-27): `nullIntolerant = true` moves into `ArrayBinaryLike`; the overflow path uses `arrayFunctionWithElementsExceedLimitError`.
66+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
67+
- Known divergence: same NaN/signed-zero canonicalization gap as `array_distinct` for float/double arrays (https://github.com/apache/datafusion-comet/issues/4481).
68+
69+
## array_insert
70+
71+
- Spark 3.4.3 audited 2026-04-02
72+
- Spark 3.5.8 audited 2026-04-02
73+
- Spark 4.0.1 audited 2026-04-02 (pos=0 error message differs from Spark)
74+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
75+
76+
## array_intersect
77+
78+
- Spark 3.4.3 audited 2026-04-24 (result element order may differ from Spark when the right array is longer than the left; DataFusion probes the longer side)
79+
- Spark 3.5.8 audited 2026-04-24 (same ordering incompatibility as 3.4.3)
80+
- Spark 4.0.1 audited 2026-04-24 (ordering incompatibility as above; collated strings now fall back to Spark)
81+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
82+
83+
## array_join
84+
85+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
86+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayJoin(array, delimiter, nullReplacement)`. Comet routes via `CometArrayJoin` to DataFusion's `array_to_string` and is unconditionally flagged `Incompatible` ("Null handling may differ from Spark", #3178).
87+
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `AbstractArrayType(StringTypeWithCollation(supportsTrimCollation = true))`; non-binary collations not propagated (https://github.com/apache/datafusion-comet/issues/2190).
88+
- Spark 4.1.1 (audited 2026-05-27): adds `contextIndependentFoldable` override; runtime unchanged.
89+
90+
## array_max
91+
92+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
93+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayMax(child) extends UnaryExpression with ImplicitCastInputTypes`; skips NULL elements; for float/double Spark's `SQLOrderingUtil` treats NaN as greater than any non-NaN. Wired as `CometScalarFunction("array_max")`.
94+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
95+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
96+
- Known divergence: DataFusion's `array_max` uses Arrow `partial_cmp`-based ordering, so float/double arrays containing NaN may produce different results (https://github.com/apache/datafusion-comet/issues/4482).
97+
98+
## array_min
99+
100+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
101+
- Spark 3.5.8 (audited 2026-05-27): mirror of `ArrayMax` with `evalInternal` returning the minimum. Same NULL-skip and NaN-ordering semantics. Wired as `CometScalarFunction("array_min")`.
102+
- Spark 4.0.1 (audited 2026-05-27): same trait refactor as `array_max`.
103+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
104+
- Known divergence: same NaN-handling gap as `array_max` (https://github.com/apache/datafusion-comet/issues/4482).
105+
106+
## array_position
107+
108+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
109+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayPosition(left, right)`; returns 1-based `LongType` position, 0 if not found, NULL if either input is NULL. `CometArrayPosition` falls back for all-foldable args (constant folding handles those) and for unsupported element types (binary/struct/map/null).
110+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
111+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
112+
113+
## array_remove
114+
115+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
116+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayRemove(left, right)`; removes all occurrences equal to `right`. Wired as `CometScalarFunction("array_remove")`. Falls back via `ArraysBase.isTypeSupported` for binary/struct/map/null child types.
117+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
118+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
119+
120+
## array_repeat
121+
122+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
123+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayRepeat(left, right) extends BinaryExpression with ExpectsInputTypes`; `inputTypes = Seq(AnyDataType, IntegerType)`. NULL count yields NULL; count <= 0 yields empty array; count > `MAX_ROUNDED_ARRAY_LENGTH` throws at runtime. Comet wraps the call in `CaseWhen(IsNotNull(right), array_repeat(...), null)`.
124+
- Spark 4.0.1 (audited 2026-05-27): error message uses `createArrayWithElementsExceedLimitError(prettyName, count)`; semantics unchanged.
125+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
126+
127+
## array_union
128+
129+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
130+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayUnion(left, right) extends ArrayBinaryLike with ComplexTypeMergingExpression`; result is left-side distinct elements followed by new right-side elements. Wired as `CometScalarFunction("array_union")`.
131+
- Spark 4.0.1 (audited 2026-05-27): `nullIntolerant = true` moves into `ArrayBinaryLike`; overflow path uses `arrayFunctionWithElementsExceedLimitError`.
132+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
133+
- Known divergence: same NaN/signed-zero canonicalization gap as `array_distinct` (https://github.com/apache/datafusion-comet/issues/4481). Result ordering versus DataFusion is also unverified; compare the `array_intersect` ordering caveat.
134+
135+
## arrays_overlap
136+
137+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
138+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArraysOverlap(left, right)`; three-valued logic (TRUE if any common non-null element, NULL if a null is present and no overlap is found in non-nulls, FALSE otherwise). Comet routes via `CometArraysOverlap` to the native `spark_arrays_overlap` UDF, which implements the same three-valued logic.
139+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
140+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
141+
142+
## arrays_zip
143+
144+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
145+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArraysZip(children, names)`; returns an array of structs, padding shorter inputs with NULL. Comet routes via `CometArraysZip` and rejects unsupported child element types (anything outside primitives, decimals, dates/timestamps, strings, binary, and nested arrays/structs of those).
146+
- Spark 4.0.1 (audited 2026-05-27): the length-mismatch error switches from `IllegalArgumentException` to `SparkIllegalArgumentException("_LEGACY_ERROR_TEMP_3235")`; runtime unchanged.
147+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
148+
149+
## element_at
150+
151+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
152+
- Spark 3.5.8 (audited 2026-05-27): baseline. `ElementAt(left, right, defaultValueOutOfBound, failOnError)`; group label `map_funcs`. Comet supports only `ArrayType` input; `MapType` input falls back.
153+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor; group label changes to `collection_funcs`; ANSI default flips to `true` so out-of-bound throws by default. Comet wires `failOnError` through to native `ListExtract`.
154+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
155+
156+
## flatten
157+
158+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
159+
- Spark 3.5.8 (audited 2026-05-27): baseline. `Flatten(child) extends UnaryExpression`; returns NULL if any inner sub-array is NULL. Comet routes via `CometFlatten` and falls back for child types containing `BinaryType` / `StructType` / `MapType` (limitation of `ArraysBase.isTypeSupported`).
160+
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
161+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
162+
163+
## get
164+
165+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
166+
- Spark 3.5.8 (audited 2026-05-27): baseline. `GetArrayItem(child, ordinal, failOnError)`; `inputTypes = Seq(AnyDataType, IntegralType)`. Comet routes via `CometGetArrayItem`, wiring `failOnError` through to the proto.
167+
- Spark 4.0.1 (audited 2026-05-27): semantics unchanged; ANSI default flips to `true`.
168+
- Spark 4.1.1 (audited 2026-05-27): `inputTypes` tightened to `Seq(ArrayType, IntegralType)` (analysis-time only); runtime unchanged.
169+
170+
## sort_array
171+
172+
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
173+
- Spark 3.5.8 (audited 2026-05-27): baseline. `SortArray(base, ascendingOrder) extends BinaryExpression with ArraySortLike`; the second arg must be a `Literal(_: Boolean, BooleanType)`. Comet `CometSortArray` flags `Incompatible` under strict floating-point and falls back for nested arrays whose innermost element is `Struct` or `Null`.
174+
- Spark 4.0.1 (audited 2026-05-27): trait set changes substantively: `ArraySortLike` and `NullIntolerant` are removed, `nullIntolerant = true` becomes an override, and `ascendingOrder` is widened to accept any foldable boolean (not just `Literal`). Comet's `CometSortArray` still requires a `Literal`, so the new foldable form falls back at convert time.
175+
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
176+
177+
[Spark Expression Support]: ../../user-guide/latest/expressions.md

0 commit comments

Comments
 (0)