docs: lead README with the Arrow-native framing #4428
Open
andygrove wants to merge 1 commit into
Rewrite the top two paragraphs of README.md so the value prop leads with the Arrow-native pipeline (operators, expressions, shuffle, and broadcast all in Apache Arrow columnar format) rather than 'native Rust implementations'. The accelerator list grows by one entry to mention the experimental Scala/Java UDF support; shuffle and 'What Comet Accelerates' wording is tightened to match. No other docs are touched in this PR. Contributor-guide and user-guide prose updates for the same vocabulary clean-up (apache#4419) will follow separately.
mbutrovich
reviewed
May 29, 2026
| It uses Apache Arrow for zero-copy data transfer between the JVM and native code. | ||
| Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow | ||
| batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches. | ||
| Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine. |
Contributor
Suggested change
| Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine. | |
| Either way, query execution stays in the Comet pipeline without falling back to Spark's row-based engine. |
mbutrovich
reviewed
May 29, 2026
| - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark | ||
| (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html)) | ||
| - **Shuffle**: native columnar shuffle with support for hash and range partitioning | ||
| - **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust |
Contributor
Didn't we add a (not 100% Spark-compatible) round-robin partitioning solution? Should we skip that if it's opt-in?
mbutrovich
reviewed
May 29, 2026
| map, JSON, hash, and predicate categories | ||
| - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses | ||
| - **Joins**: hash join, sort-merge join, and broadcast join | ||
| - **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline |
Contributor
We can drop "experimental" if #4514 merges first.
Which issue does this PR close?
Part of #4419 (the documentation-only phase, first slice).
Rationale for this change
Comet's documentation conflates several distinct ideas under the word "native": implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and data format (Arrow columnar vs Spark rows). Issue #4419 spells out a clearer vocabulary so the docs stop overloading "native" and can scale to a roadmap where some JVM code paths (today Scala UDF codegen; soon Arrow UDFs and hybrid impls) also live inside the Comet pipeline.
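As a rough illustration of why the single word "native" is ambiguous, the three ideas can be modeled as independent axes. This is a minimal sketch only: the class names, enum members, and example entries below are hypothetical and exist neither in the Comet codebase nor in #4419. They simply restate the distinctions drawn above.

```python
from dataclasses import dataclass
from enum import Enum


# Hypothetical names for illustration; none of these exist in Comet.
class Implementation(Enum):
    RUST = "Rust"
    JVM = "JVM"


class Pipeline(Enum):
    COMET = "handled by Comet"
    SPARK_FALLBACK = "falls back to Spark"


class DataFormat(Enum):
    ARROW_COLUMNAR = "Arrow columnar"
    SPARK_ROWS = "Spark rows"


@dataclass(frozen=True)
class CodePath:
    """One code path, described on all three axes at once."""
    name: str
    implementation: Implementation
    pipeline: Pipeline
    data_format: DataFormat

    def describe(self) -> str:
        return (
            f"{self.name}: {self.implementation.value} code, "
            f"{self.pipeline.value}, {self.data_format.value}"
        )


# Example entries are assumptions drawn from the wording in this PR.
EXAMPLES = [
    CodePath("DataFusion-backed operator", Implementation.RUST,
             Pipeline.COMET, DataFormat.ARROW_COLUMNAR),
    CodePath("Scala UDF codegen", Implementation.JVM,
             Pipeline.COMET, DataFormat.ARROW_COLUMNAR),
    CodePath("Spark fallback", Implementation.JVM,
             Pipeline.SPARK_FALLBACK, DataFormat.SPARK_ROWS),
]

if __name__ == "__main__":
    for path in EXAMPLES:
        print(path.describe())
```

The second entry is the case the old wording cannot express: JVM code that still stays inside the Comet pipeline and operates on Arrow batches.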
The original version of this PR rolled the new vocabulary into ~20 files at once. This revision narrows the scope to a single file, the top-level `README.md`, so reviewers can sign off on the framing first. The user-guide and contributor-guide sweeps will follow as separate PRs.
What changes are included in this PR?
`README.md` only. The two top paragraphs and the "What Comet Accelerates" list are rewritten.
No other docs files are touched. Operator renames (`CometExchange` → `CometNativeShuffleExchange`, etc.) and the user-guide / contributor-guide vocabulary sweep are explicitly out of scope here and will land in follow-on PRs.
How are these changes tested?
Documentation only. Verified that the README renders correctly on GitHub and that the new wording matches the vocabulary rules in #4419.