Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 12 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,12 @@ under the License.

<img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
[Apache DataFusion] query engine. Comet is designed to significantly enhance the
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.
Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries
**Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow
columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native
pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine)
or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem
without requiring any code changes.

**Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**

Expand All @@ -58,17 +60,20 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut

## What Comet Accelerates

Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
Comet accelerates Spark workloads by replacing Spark operators and expressions with high-performance implementations that process Apache Arrow columnar data directly. Most operators are powered by native Rust execution built on Apache DataFusion, while others run efficiently in the JVM on Arrow batches. This unified columnar execution model keeps processing within the Comet engine end-to-end, reducing overhead and delivering faster, more efficient query execution without reverting to Spark's traditional row-based engine.

- **Parquet scans**: native Parquet reader integrated with Spark's query planner
- **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
(see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
- **Shuffle**: native columnar shuffle with support for hash and range partitioning
- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we add a (not 100% Spark-compatible) round-robin partitioning solution? Should we skip that if it's opt-in?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, round-robin in native is not compatible with Spark's version, so it is opt-in. For the top level README it seems like too much information to try and include

implementation paired with a JVM fallback for unsupported partition key types
- **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array,
map, JSON, hash, and predicate categories
- **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
- **Joins**: hash join, sort-merge join, and broadcast join
- **Scala/Java UDFs**: support for keeping Scala/Java scalar UDFs in the Comet pipeline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about Python UDFs?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python UDF support did not merge yet

via Spark's whole-stage codegen (see the
[Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also add windows

For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.
Expand Down