Skip to content

Commit 9fbe7ce

Browse files
committed
docs: introduce Arrow-native terminology in README and user guide
Comet's documentation conflates several distinct ideas under the word 'native': implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and data format (Arrow vs Spark rows). The vocabulary clean-up in #4419 splits these out, and this PR rolls in only the README and user-guide prose, with no code or operator renames. - 'Arrow-native' is now the term for the data-format property that unifies the pipeline. - 'Comet pipeline' replaces 'the native Comet path' / 'accelerated by Comet' for membership. - 'Rust-implemented' / 'native Rust' is used for the implementation- language axis. Bare 'native execution' / 'runs natively' / 'the native path' as vague adjectives are removed. The biggest single rewrite is in user-guide/latest/understanding-comet-plans.md, where the 'three kinds of nodes' framing becomes four (Arrow-native Rust operators, Arrow-native JVM expressions, Arrow-native JVM plumbing, Spark fallback). Contributor-guide files, plugin-overview prose, and the about/ pages are deliberately out of scope and will follow in a separate PR.
1 parent a08cb4e commit 9fbe7ce

8 files changed

Lines changed: 144 additions & 106 deletions

File tree

README.md

Lines changed: 14 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -35,10 +35,12 @@ under the License.
3535

3636
<img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>
3737

38-
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
39-
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
40-
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
41-
Spark ecosystem without requiring any code changes.
38+
Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries
39+
**Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow
40+
columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native
41+
pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine)
42+
or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem
43+
without requiring any code changes.
4244

4345
**Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**
4446

@@ -58,17 +60,22 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut
5860

5961
## What Comet Accelerates
6062

61-
Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
62-
It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
63+
Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
64+
batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
65+
Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
6366

6467
- **Parquet scans**: native Parquet reader integrated with Spark's query planner
6568
- **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
6669
(see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
67-
- **Shuffle**: native columnar shuffle with support for hash and range partitioning
70+
- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust
71+
implementation paired with a JVM fallback for unsupported partition key types
6872
- **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array,
6973
map, JSON, hash, and predicate categories
7074
- **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
7175
- **Joins**: hash join, sort-merge join, and broadcast join
76+
- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline
77+
via Spark's whole-stage codegen (see the
78+
[Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))
7279

7380
For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
7481
and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.

docs/source/user-guide/latest/compatibility/expressions/cast.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,9 @@ Cast operations in Comet fall into three levels of support:
2424
- **C (Compatible)**: The results match Apache Spark
2525
- **I (Incompatible)**: The results may match Apache Spark for some inputs, but there are known issues where some inputs
2626
will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting
27-
`spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not
28-
recommended for production use.
29-
- **U (Unsupported)**: Comet does not provide a native version of this cast expression and the query stage will fall back to
27+
`spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run inside the Comet pipeline,
28+
but this is not recommended for production use.
29+
- **U (Unsupported)**: Comet does not provide an implementation of this cast expression and the query stage will fall back to
3030
Spark.
3131
- **N/A**: Spark does not support this cast.
3232

@@ -83,9 +83,9 @@ is the same regardless of the session timezone setting.
8383

8484
In Legacy mode, `CAST(date AS INT)`, `CAST(date AS LONG)`, and casts to all other numeric
8585
types (Boolean, Byte, Short, Float, Double, Decimal) always return `NULL`. Comet handles
86-
this by short-circuiting to a null literal during query planning, so no native execution
87-
is needed. In ANSI and Try modes, Spark rejects these casts at analysis time (before
88-
execution reaches Comet).
86+
this by short-circuiting to a null literal during query planning, so the cast is removed
87+
before reaching the runtime. In ANSI and Try modes, Spark rejects these casts at analysis
88+
time (before execution reaches Comet).
8989

9090
## String to Timestamp
9191

@@ -139,7 +139,7 @@ The result is always a wall-clock timestamp with no timezone conversion or DST a
139139
Casting a `DecimalType` with a negative scale to `StringType` is marked as incompatible when
140140
`spark.sql.legacy.allowNegativeScaleOfDecimal` is `false` (the default). When that config is
141141
disabled, Spark cannot create negative-scale decimals, so Comet falls back to avoid running
142-
native execution on unexpected inputs.
142+
its cast implementation on unexpected inputs.
143143

144144
When `spark.sql.legacy.allowNegativeScaleOfDecimal=true`, the cast is compatible. Comet matches
145145
Spark's behavior of using Java `BigDecimal.toString()` semantics, which produces scientific

docs/source/user-guide/latest/compatibility/scans.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -63,8 +63,8 @@ The following limitation may produce incorrect results without falling back to S
6363
The following limitation raises an error at scan time rather than falling back to Spark:
6464

6565
- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
66-
column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
67-
Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
66+
column (for example from `CAST(X'C1' AS STRING)`), but Comet's pipeline is built on Arrow,
67+
whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
6868
non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. Disable Comet for the
6969
query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
7070
See [#4121](https://github.com/apache/datafusion-comet/issues/4121).

docs/source/user-guide/latest/datasources.md

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -23,9 +23,10 @@
2323

2424
### Parquet
2525

26-
Parquet scans are performed natively by Comet if all data types in the schema are supported. When the scan
27-
falls back to Spark, enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into
28-
Arrow format, allowing native execution to happen after that, but the process may not be efficient.
26+
Parquet scans are read directly into Arrow by Comet's Rust scan when all data types in the schema are supported.
27+
When the scan falls back to Spark, enabling `spark.comet.convert.parquet.enabled` will immediately convert the
28+
data into Arrow format so the rest of the plan can stay in the Comet pipeline, though the conversion itself
29+
may not be efficient.
2930

3031
### Apache Iceberg
3132

@@ -35,17 +36,17 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
3536

3637
### CSV
3738

38-
Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
39-
are read natively for improved performance. This feature is experimental and performance benefits are
40-
workload-dependent.
39+
Comet provides an experimental Rust-implemented CSV scan. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV
40+
files are read directly into Arrow for improved performance. This feature is experimental and performance benefits
41+
are workload-dependent.
4142

4243
Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
43-
converted into Arrow format, allowing native execution to happen after that.
44+
converted into Arrow format so the rest of the plan can stay in the Comet pipeline.
4445

4546
### JSON
4647

47-
Comet does not provide native JSON scan, but when `spark.comet.convert.json.enabled` is enabled, data is immediately
48-
converted into Arrow format, allowing native execution to happen after that.
48+
Comet does not provide a Rust-implemented JSON scan, but when `spark.comet.convert.json.enabled` is enabled,
49+
data is immediately converted into Arrow format so the rest of the plan can stay in the Comet pipeline.
4950

5051
## Data Catalogs
5152

docs/source/user-guide/latest/iceberg.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@
2222
## Native Reader
2323

2424
Comet's native Iceberg reader relies on reflection to extract `FileScanTask`s from Iceberg, which are
25-
then serialized to Comet's native execution engine (see
25+
then serialized to DataFusion for execution (see
2626
[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
2727

2828
The example below uses Spark's package downloader to retrieve Comet $COMET_VERSION and Iceberg
@@ -157,12 +157,12 @@ Iceberg ships several `ScalaUDF`s that surface in user queries and maintenance a
157157
(`INT_ORDERED_BYTES`, `LONG_ORDERED_BYTES`, ..., `INTERLEAVE_BYTES`) over the sort key columns
158158
during compaction.
159159

160-
By default these UDFs cause the enclosing operator to fall back to Spark, which forces a
161-
columnar-to-row roundtrip and demotes the surrounding shuffle from `CometExchange` to
160+
By default these UDFs cause the enclosing operator to fall back to Spark, which forces an
161+
Arrow-to-row roundtrip and demotes the surrounding shuffle from `CometExchange` to
162162
`CometColumnarExchange`. Enabling the experimental
163163
[Scala UDF and Java UDF Support](scala_java_udfs.md) feature
164-
(`spark.comet.exec.scalaUDF.codegen.enabled=true`) routes these UDFs through native execution so
165-
the project, exchange, and sort operators around them stay on the Comet path end-to-end.
164+
(`spark.comet.exec.scalaUDF.codegen.enabled=true`) keeps these UDFs in the Comet pipeline so
165+
the project, exchange, and sort operators around them stay accelerated end-to-end.
166166

167167
### Task input metrics
168168

docs/source/user-guide/latest/scala_java_udfs.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,24 +19,24 @@
1919

2020
# Scala UDF and Java UDF Support
2121

22-
Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) on the native Comet path. The presence of a UDF does not force the enclosing operator off the native path; surrounding native operators stay native.
22+
Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) inside the Comet pipeline. The UDF body is JVM bytecode produced by Spark's whole-stage codegen, but it runs over Arrow batches alongside the surrounding Arrow-native operators rather than triggering a fallback to row-based Spark execution. The presence of a UDF does not force the enclosing operator out of the pipeline; surrounding Comet operators continue to run as usual.
2323

2424
This page covers Spark's `ScalaUDF` (Scala `udf(...)`, `spark.udf.register(...)` over Scala or Java functional interfaces, and SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`). Other UDF kinds (Python / Pandas, Hive, aggregate) are out of scope and continue to fall back to Spark.
2525

2626
This feature is experimental and disabled by default.
2727

2828
## Configuration
2929

30-
| Key | Default | Description |
31-
| ------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------ |
32-
| `spark.comet.exec.scalaUDF.codegen.enabled` | `false` | When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. |
30+
| Key | Default | Description |
31+
| ------------------------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
32+
| `spark.comet.exec.scalaUDF.codegen.enabled` | `false` | When `true`, eligible `ScalaUDF`s run inside the Comet pipeline. When `false`, the enclosing operator falls back to Spark. |
3333

3434
## Supported
3535

3636
- User functions registered via `udf(...)`, `spark.udf.register(...)` (Scala or Java functional interfaces), or SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`.
3737
- Scalar input/output types: `Boolean`, `Byte`, `Short`, `Int`, `Long`, `Float`, `Double`, `Decimal`, `String`, `Binary`, `Date`, `Timestamp`, `TimestampNTZ`.
3838
- Complex input/output types with arbitrary nesting: `ArrayType`, `StructType`, `MapType`.
39-
- Composition with other Catalyst expressions inside the argument tree (e.g. `myUdf(upper(s))` runs as one native unit).
39+
- Composition with other Catalyst expressions inside the argument tree (e.g. `myUdf(upper(s))` runs as a single compiled kernel without leaving the Comet pipeline).
4040
- Higher-order functions (`transform`, `filter`, `exists`, `aggregate`, `zip_with`, `map_filter`, `map_zip_with`, etc.) inside the argument tree.
4141

4242
## Not supported
@@ -45,7 +45,7 @@ This feature is experimental and disabled by default.
4545
- Table UDFs and generators.
4646
- Python `@udf` and Pandas `@pandas_udf`.
4747
- Hive `GenericUDF` and `SimpleUDF`.
48-
- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; for native execution, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`).
48+
- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; to keep the work in the Comet pipeline, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`).
4949
- Trees whose total nested-field count (output plus all input columns the UDF tree references) exceeds `spark.sql.codegen.maxFields` (default 100). Comet refuses these at plan time and the operator falls back to Spark.
5050

5151
When a UDF is rejected, the reason surfaces through Comet's standard fallback diagnostics; the query still runs on Spark.

docs/source/user-guide/latest/tuning.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -61,8 +61,8 @@ The valid pool types are:
6161
- `fair_unified` (default when `spark.memory.offHeap.enabled=true` is set)
6262
- `greedy_unified`
6363

64-
Both pool types are shared across all native execution contexts within the same Spark task. When
65-
Comet executes a shuffle, it runs two native execution contexts concurrently (e.g. one for
64+
Both pool types are shared across all DataFusion execution contexts within the same Spark task. When
65+
Comet executes a shuffle, it runs two DataFusion execution contexts concurrently (e.g. one for
6666
pre-shuffle operators and one for the shuffle writer). The shared pool ensures that the combined
6767
memory usage stays within the per-task limit.
6868

0 commit comments

Comments
 (0)