Commit 754289a

docs: explain native vs codegen-dispatch implementation model and allowIncompatible opt-in (#4629)
1 parent 444e63e commit 754289a

2 files changed

Lines changed: 34 additions & 3 deletions

docs/source/user-guide/latest/compatibility/index.md

Lines changed: 27 additions & 0 deletions
@@ -30,3 +30,30 @@ This guide documents areas where Comet's behavior is known to differ from Spark.
 - **Expressions**: per-expression compatibility notes, including cast.
 - **JSON**: choosing between the native and Spark-compatible engines for JSON expressions.
 - **Spark versions**: version-specific known issues and limitations.
+
+## Native and codegen-dispatch implementations
+
+Some Spark expressions have two implementations in Comet:
+
+- A **codegen-dispatch** implementation that runs Spark's own generated code for the
+expression inside Comet's native pipeline (via the Arrow-direct codegen dispatcher). This
+produces byte-exact Spark results at the cost of one JNI round-trip per batch. It is gated
+globally by `spark.comet.exec.scalaUDF.codegen.enabled` (enabled by default); when the
+dispatcher is disabled, these expressions fall back to Spark.
+- A **native** (Rust / DataFusion) implementation that is faster, with no JNI overhead, but
+has known semantic differences from Spark for some inputs or patterns.
+
+Because the codegen-dispatch path matches Spark exactly, Comet uses it by **default**. The
+faster native path is **opt-in per expression** via that expression's
+`spark.comet.expression.<ExprClassName>.allowIncompatible=true` flag, which declares that you
+accept its differences from Spark. There is no global opt-in. When the native path is enabled
+but a specific input or pattern has no native implementation, Comet routes that case back
+through the codegen dispatcher rather than running something incompatible.
+
+This is the model behind the [regular expression](regex.md) and [JSON](json.md) families,
+which document their per-expression configs and the specific differences to expect.
+
+This is distinct from expressions that have **no** codegen-dispatch path: there, the
+incompatible cases fall back to Spark by default, and `allowIncompatible=true` runs the native
+(incompatible) path instead. `cast` is the main example; see the
+[expression reference](../expressions.md) for which expressions have incompatible cases.
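
The routing rules this hunk documents can be sketched as a small decision function. This is an illustrative model only, not Comet's planner code: `choose_path`, `allow_incompatible_key`, `is_true`, and the `native_covers_case` parameter are hypothetical names, and `RLike` stands in for a regex-family expression class (check `regex.md` for the actual per-expression configs). One branch is an assumption rather than something the text states: when the native path is enabled, the input has no native implementation, and the dispatcher is disabled, the sketch falls back to Spark.

```python
# Illustrative model of the documented routing rules; not Comet's actual code.
CODEGEN_FLAG = "spark.comet.exec.scalaUDF.codegen.enabled"  # documented default: enabled


def allow_incompatible_key(expr_class_name):
    """Build the per-expression opt-in key described in the docs."""
    return f"spark.comet.expression.{expr_class_name}.allowIncompatible"


def is_true(conf, key, default):
    """Read a boolean-like string setting from a plain dict of settings."""
    return str(conf.get(key, default)).lower() == "true"


def choose_path(expr_class_name, conf, has_codegen_dispatch, native_covers_case=True):
    """Return 'native', 'codegen-dispatch', or 'spark' for one expression case.

    For expressions without a codegen-dispatch path, this models only their
    incompatible cases (the ones that need the opt-in).
    """
    allow = is_true(conf, allow_incompatible_key(expr_class_name), "false")
    dispatcher_on = is_true(conf, CODEGEN_FLAG, "true")

    if has_codegen_dispatch:
        # Opted in and the native implementation handles this input or pattern.
        if allow and native_covers_case:
            return "native"
        # Default path, and also the route for inputs with no native implementation.
        # Assumption: with the dispatcher disabled, every such case goes to Spark.
        return "codegen-dispatch" if dispatcher_on else "spark"

    # No codegen-dispatch path (for example Cast): Spark unless opted in.
    return "native" if allow else "spark"


opted_in = {allow_incompatible_key("RLike"): "true"}
print(choose_path("RLike", {}, has_codegen_dispatch=True))        # codegen-dispatch
print(choose_path("RLike", opted_in, has_codegen_dispatch=True))  # native
print(choose_path("Cast", {}, has_codegen_dispatch=False))        # spark
```

Keeping the opt-in key per expression class mirrors the "no global opt-in" statement: accepting incompatibility for one expression never changes the path chosen for another.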

docs/source/user-guide/latest/expressions.md

Lines changed: 7 additions & 3 deletions
@@ -26,10 +26,14 @@ transparently falls back to Spark for that part of the plan; results are unaffec
 
 Expressions marked ✅ Supported are enabled by default and produce Spark-compatible results.
 
-Some ✅ Supported expressions have specific incompatible cases that fall back to Spark by
-default. Those cases must be opted into per expression with
+Some ✅ Supported expressions have specific incompatible cases that are not run by default.
+Those cases must be opted into per expression with
 `spark.comet.expression.EXPRNAME.allowIncompatible=true` (where `EXPRNAME` is the Spark
-expression class name, for example `Cast`). There is no global opt-in.
+expression class name, for example `Cast`). There is no global opt-in. By default such a case
+either falls back to Spark (for example `cast`) or, when the expression has a Spark-compatible
+codegen-dispatch implementation, runs through that instead (for example the regex and JSON
+families). See [Native and codegen-dispatch implementations](compatibility/index.md#native-and-codegen-dispatch-implementations)
+for how Comet chooses.
 
 Most expressions can also be disabled with `spark.comet.expression.EXPRNAME.enabled=false`, where
 `EXPRNAME` is the Spark expression class name (for example `Length` or `StartsWith`). See the
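
As a usage sketch for the two per-expression keys in this hunk, the helper below builds the settings and renders them as `spark-submit --conf` arguments. The function names `expression_conf` and `to_submit_args` are hypothetical, and the expression class names are the examples the text itself gives; nothing here validates that a given class name is one Comet recognizes.

```python
# Hypothetical helpers for assembling per-expression Comet settings.

def expression_conf(allow_incompatible=(), disabled=()):
    """Map Spark expression class names to the per-expression keys from the docs."""
    conf = {}
    for name in allow_incompatible:
        # Opt in to the incompatible cases of this one expression.
        conf[f"spark.comet.expression.{name}.allowIncompatible"] = "true"
    for name in disabled:
        # Turn off Comet's handling of this one expression.
        conf[f"spark.comet.expression.{name}.enabled"] = "false"
    return conf


def to_submit_args(conf):
    """Render settings as a flat list of spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = expression_conf(allow_incompatible=["Cast"], disabled=["Length"])
print(" ".join(to_submit_args(conf)))
```

Because there is no global opt-in, each expression has to be listed by name, which is why the helper takes explicit lists rather than a single switch.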
