Skip to content

Commit 00897cc

Browse files
authored
docs: lead README with the Arrow-native framing (#4428)
1 parent 523ffb6 commit 00897cc

1 file changed

Lines changed: 12 additions & 7 deletions

File tree

README.md

Lines changed: 12 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -35,10 +35,12 @@ under the License.
3535

3636
<img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>
3737

38-
Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
39-
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
40-
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
41-
Spark ecosystem without requiring any code changes.
38+
Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries
39+
**Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow
40+
columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native
41+
pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine)
42+
or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem
43+
without requiring any code changes.
4244

4345
**Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**
4446

@@ -58,17 +60,20 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut
5860

5961
## What Comet Accelerates
6062

61-
Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
62-
It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
63+
Comet accelerates Spark workloads by replacing Spark operators and expressions with high-performance implementations that process Apache Arrow columnar data directly. Most operators are powered by native Rust execution built on Apache DataFusion, while others run efficiently in the JVM on Arrow batches. This unified columnar execution model keeps processing within the Comet engine end-to-end, reducing overhead and delivering faster, more efficient query execution without reverting to Spark's traditional row-based engine.
6364

6465
- **Parquet scans**: native Parquet reader integrated with Spark's query planner
6566
- **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
6667
(see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
67-
- **Shuffle**: native columnar shuffle with support for hash and range partitioning
68+
- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust
69+
implementation paired with a JVM fallback for unsupported partition key types
6870
- **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array,
6971
map, JSON, hash, and predicate categories
7072
- **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
7173
- **Joins**: hash join, sort-merge join, and broadcast join
74+
- **Scala/Java UDFs**: support for keeping Scala/Java scalar UDFs in the Comet pipeline
75+
via Spark's whole-stage codegen (see the
76+
[Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))
7277

7378
For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
7479
and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.

0 commit comments

Comments
 (0)