Skip to content

Latest commit

 

History

History
178 lines (127 loc) · 8.65 KB

File metadata and controls

178 lines (127 loc) · 8.65 KB

Installing DataFusion Comet

Prerequisites

Make sure the following requirements are met and software installed on your machine.

Supported Operating Systems

The published Comet jar files in Maven Central bundle native libraries for Linux only (amd64 and arm64). macOS users must build from source.

Operating System Published Maven Jars Build from Source
Linux (amd64) Yes Yes
Linux (arm64) Yes Yes
Apple macOS (Apple Silicon) No Yes

Supported Spark Versions

Comet $COMET_VERSION supports the following versions of Apache Spark. Refer to the Spark Version Compatibility page in the Compatibility Guide for more information, such as known limitations per Spark version.

We recommend only using Comet with Spark versions where we currently have both Comet and Spark tests enabled in CI. Other versions may work well enough for development and evaluation purposes.

Spark Version Java Version Scala Version Comet Tests in CI Spark SQL Tests in CI
3.4.3 11/17 2.12/2.13 Yes Yes
3.5.8 11/17 2.12/2.13 Yes Yes
4.0.2 17/21 2.13 Yes Yes
4.1.2 17/21 2.13 Yes Yes

Note that we do not test the full matrix of supported Java and Scala versions in CI for every Spark version.

Experimental support is provided for the following versions of Apache Spark and is intended for development/testing use only and should not be used in production yet.

Spark Version Java Version Scala Version Comet Tests in CI Spark SQL Tests in CI
4.2.0-preview4 17 2.13 No No

Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by Cloud Service Providers.

Using a Published JAR File

This documentation is for the current development version of Comet. Published jar files are only available for released versions. To use this version of Comet, see Building from source.

Comet jar files are available in Maven Central for amd64 and arm64 architectures for Linux. For Apple macOS, it is currently necessary to build from source.

For performance reasons, published Comet jar files target baseline CPUs available in modern data centers. For example, the amd64 build uses the x86-64-v3 target that adds CPU instructions (e.g., AVX2) common after 2013. Similarly, the arm64 build uses the neoverse-n1 target, which is a common baseline for ARM cores found in AWS (Graviton2+), GCP, and Azure after 2019. If the Comet library fails for SIGILL (illegal instruction), please open an issue on the GitHub repository describing your environment, and [build from source] for your target architecture.

Here are the direct links for downloading the Comet $COMET_VERSION jar file.

Building from source

Refer to the Building from source guide for instructions from building Comet from source, either from official source releases, or from the latest code in the GitHub repository.

Deploying to Kubernetes

See the Comet Kubernetes Guide guide.

Run Spark Shell with Comet enabled

Make sure SPARK_HOME points to the same Spark version as Comet was built for.

export COMET_JAR=spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar

$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.explainFallback.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g

Verify Comet enabled for Spark SQL query

Create a test Parquet source

scala> (0 until 10).toDF("a").write.mode("overwrite").parquet("/tmp/test")

Comet will log output similar to:

INFO core/src/lib.rs: Comet native library version $COMET_VERSION initialized
WARN CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
  Execute InsertIntoHadoopFsRelationCommand [COMET: Native support for operator DataWritingCommandExec is disabled. Set spark.comet.parquet.write.enabled=true to enable it.]
+- WriteFiles
   +-  LocalTableScan [COMET: Native support for operator LocalTableScanExec is disabled. Set spark.comet.exec.localTableScan.enabled=true to enable it.]

Query the data from the test source and check:

  • INFO message shows the native Comet library has been initialized.
  • The query plan reflects Comet operators being used for this query instead of Spark ones
scala> spark.read.parquet("/tmp/test").createOrReplaceTempView("t1")
scala> spark.sql("select * from t1 where a > 5").explain

Comet will log output similar to:

== Physical Plan ==
CometNativeColumnarToRow
+- CometFilter [a#6], (isnotnull(a#6) AND (a#6 > 5))
   +- CometNativeScan parquet [a#6] Batched: true, DataFilters: [isnotnull(a#6), (a#6 > 5)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/test], PartitionFilters: [], PushedFilters: [IsNotNull(a), GreaterThan(a,5)], ReadSchema: struct<a:int>

Additional Configuration

Depending on your deployment mode you may also need to set the driver & executor class path(s) to explicitly contain Comet otherwise Spark may use a different class-loader for the Comet components than its internal components which will then fail at runtime. For example:

--driver-class-path spark/target/comet-spark-spark4.1_2.13-$COMET_VERSION.jar

Some cluster managers may require additional configuration, see https://spark.apache.org/docs/latest/cluster-overview.html

Memory tuning

In addition to Apache Spark memory configuration parameters, Comet introduces additional parameters to configure memory allocation for native execution. See Comet Memory Tuning for details.