Commit 53a1418

Merge remote-tracking branch 'origin/main' into to_csv

Author: Kazantsev Maksim
2 parents: cf544c7 + 8dfeca3

54 files changed

Lines changed: 4728 additions & 1208 deletions

.github/workflows/pr_build_linux.yml

Lines changed: 2 additions & 0 deletions
@@ -122,6 +122,7 @@ jobs:
 org.apache.comet.exec.CometAsyncShuffleSuite
 org.apache.comet.exec.DisableAQECometShuffleSuite
 org.apache.comet.exec.DisableAQECometAsyncShuffleSuite
+org.apache.spark.shuffle.sort.SpillSorterSuite
 - name: "parquet"
 value: |
 org.apache.comet.parquet.CometParquetWriterSuite
@@ -160,6 +161,7 @@ jobs:
 value: |
 org.apache.comet.CometExpressionSuite
 org.apache.comet.CometExpressionCoverageSuite
+org.apache.comet.CometHashExpressionSuite
 org.apache.comet.CometTemporalExpressionSuite
 org.apache.comet.CometArrayExpressionSuite
 org.apache.comet.CometCastSuite

.github/workflows/pr_build_macos.yml

Lines changed: 2 additions & 0 deletions
@@ -85,6 +85,7 @@ jobs:
 org.apache.comet.exec.CometAsyncShuffleSuite
 org.apache.comet.exec.DisableAQECometShuffleSuite
 org.apache.comet.exec.DisableAQECometAsyncShuffleSuite
+org.apache.spark.shuffle.sort.SpillSorterSuite
 - name: "parquet"
 value: |
 org.apache.comet.parquet.CometParquetWriterSuite
@@ -123,6 +124,7 @@ jobs:
 value: |
 org.apache.comet.CometExpressionSuite
 org.apache.comet.CometExpressionCoverageSuite
+org.apache.comet.CometHashExpressionSuite
 org.apache.comet.CometTemporalExpressionSuite
 org.apache.comet.CometArrayExpressionSuite
 org.apache.comet.CometCastSuite
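
Both workflow files receive the same two suites in this commit. A minimal sketch of how one might check that the two lists stay in sync is shown below. The helper names `suites_in` and `suite_drift` are hypothetical and not part of the repository, and the regular expression assumes suite names start with `org.apache.` and end with `Suite`, as they do in the hunks above.

```python
import re

# Assumption: suite names look like org.apache.<package>.<Name>Suite,
# matching the entries visible in the two workflow diffs.
SUITE_PATTERN = re.compile(r"\borg\.apache\.[\w.]+Suite\b")


def suites_in(workflow_text: str) -> set[str]:
    """Return every fully qualified suite name mentioned in a workflow file."""
    return set(SUITE_PATTERN.findall(workflow_text))


def suite_drift(linux_text: str, macos_text: str) -> dict[str, set[str]]:
    """Report suites that appear in one workflow file but not the other."""
    linux = suites_in(linux_text)
    macos = suites_in(macos_text)
    return {"linux_only": linux - macos, "macos_only": macos - linux}
```

Reading the two YAML files from disk and passing their text to `suite_drift` is left to the caller; two empty sets would indicate that the Linux and macOS lists agree.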

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -18,3 +18,5 @@ apache-rat-*.jar
 venv
 dev/release/comet-rm/workdir
 spark/benchmarks
+.DS_Store
+comet-event-trace.json

README.md

Lines changed: 3 additions & 0 deletions
@@ -21,11 +21,14 @@ under the License.

 [![Apache licensed][license-badge]][license-url]
 [![Discord chat][discord-badge]][discord-url]
+[![Pending PRs][pending-pr-badge]][pending-pr-url]

 [license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg
 [license-url]: https://github.com/apache/datafusion-comet/blob/main/LICENSE.txt
 [discord-badge]: https://img.shields.io/discord/885562378132000778.svg?logo=discord&style=flat-square
 [discord-url]: https://discord.gg/3EAr4ZX6JK
+[pending-pr-badge]: https://img.shields.io/github/issues-search/apache/datafusion-comet?query=is%3Apr+is%3Aopen+draft%3Afalse+review%3Arequired+status%3Asuccess&label=Pending%20PRs&logo=github
+[pending-pr-url]: https://github.com/apache/datafusion-comet/pulls?q=is%3Apr+is%3Aopen+draft%3Afalse+review%3Arequired+status%3Asuccess+sort%3Aupdated-desc

 <img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/>
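
The new badge and link encode a GitHub search query in percent-encoded form, which is hard to read at a glance. The sketch below decodes the `q` parameter of the `pending-pr-url` added above using only the Python standard library. The helper name `search_terms` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Copied from the [pending-pr-url] definition in the README hunk.
PENDING_PR_URL = (
    "https://github.com/apache/datafusion-comet/pulls"
    "?q=is%3Apr+is%3Aopen+draft%3Afalse+review%3Arequired"
    "+status%3Asuccess+sort%3Aupdated-desc"
)


def search_terms(url: str, param: str = "q") -> list[str]:
    """Decode one query-string parameter and split it into search terms."""
    query = parse_qs(urlsplit(url).query)  # decodes %3A to ':' and '+' to ' '
    return query[param][0].split()
```

For the URL above this yields the terms `is:pr`, `is:open`, `draft:false`, `review:required`, `status:success`, and `sort:updated-desc`. The badge URL carries the same filter in a parameter named `query`, so passing `param="query"` would decode that one instead.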
benchmarks/pyspark/README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Shuffle Size Comparison Benchmark

Compares shuffle file sizes between Spark, Comet JVM, and Comet Native shuffle implementations.

## Prerequisites

- Apache Spark cluster (standalone, YARN, or Kubernetes)
- PySpark installed
- Comet JAR built

## Build Comet JAR

```bash
cd /path/to/datafusion-comet
make release
```

## Step 1: Generate Test Data

Generate test data with a realistic 50-column schema (nested structs, arrays, maps):

```bash
spark-submit \
  --master spark://master:7077 \
  --executor-memory 16g \
  generate_data.py \
  --output /tmp/shuffle-benchmark-data \
  --rows 10000000 \
  --partitions 200
```

### Data Generation Options

| Option               | Default    | Description                  |
| -------------------- | ---------- | ---------------------------- |
| `--output`, `-o`     | (required) | Output path for parquet data |
| `--rows`, `-r`       | 10000000   | Number of rows               |
| `--partitions`, `-p` | 200        | Number of output partitions  |
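
The options table maps directly onto a command-line parser. The sketch below shows one way such a parser could be declared with `argparse`. It is an illustration built only from the table; the actual `generate_data.py` is not shown in this diff and may be structured differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options listed in the table (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Generate shuffle benchmark test data"
    )
    parser.add_argument(
        "--output", "-o", required=True, help="Output path for parquet data"
    )
    parser.add_argument(
        "--rows", "-r", type=int, default=10_000_000, help="Number of rows"
    )
    parser.add_argument(
        "--partitions", "-p", type=int, default=200,
        help="Number of output partitions",
    )
    return parser
```

Calling `build_parser().parse_args(["-o", "/tmp/shuffle-benchmark-data"])` would fall back to the defaults from the table for rows and partitions.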

## Step 2: Run Benchmark

Run the benchmarks, then check the Spark UI for shuffle sizes:

```bash
SPARK_MASTER=spark://master:7077 \
EXECUTOR_MEMORY=16g \
./run_all_benchmarks.sh /tmp/shuffle-benchmark-data
```

Or run individual modes:

```bash
# Spark baseline
spark-submit --master spark://master:7077 \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode spark

# Comet JVM shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.shuffle.mode=jvm \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode jvm

# Comet Native shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.shuffle.mode=native \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode native
```
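
The three commands differ only in the `--mode` value and in whether the Comet JAR and its four `--conf` settings are present. The sketch below assembles the same argument lists programmatically, which may help when scripting repeated runs. The function name `spark_submit_args` and its default arguments are assumptions for illustration; the configuration keys and values are copied from the commands above.

```python
COMET_SHUFFLE_MANAGER = (
    "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager"
)


def spark_submit_args(
    mode: str,
    data_path: str,
    master: str = "spark://master:7077",
    comet_jar: str = "/path/to/comet.jar",  # placeholder path, as in the README
) -> list[str]:
    """Build the spark-submit argument list for one benchmark mode."""
    if mode not in ("spark", "jvm", "native"):
        raise ValueError(f"unknown mode: {mode}")

    args = ["spark-submit", "--master", master]
    if mode != "spark":
        # Comet modes add the JAR plus the four settings shown above.
        confs = {
            "spark.comet.enabled": "true",
            "spark.comet.exec.shuffle.enabled": "true",
            "spark.comet.shuffle.mode": mode,
            "spark.shuffle.manager": COMET_SHUFFLE_MANAGER,
        }
        args += ["--jars", comet_jar]
        for key, value in confs.items():
            args += ["--conf", f"{key}={value}"]
    args += ["run_benchmark.py", "--data", data_path, "--mode", mode]
    return args
```

The resulting list can be passed to `subprocess.run` without shell quoting concerns.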

## Checking Results

Open the Spark UI (default: http://localhost:4040) during each benchmark run to compare shuffle write sizes in the Stages tab.
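
The same numbers shown in the Stages tab can also be read from Spark's monitoring REST API, which the UI serves under `/api/v1`. The sketch below is a minimal example, assuming the stage objects returned by `/api/v1/applications/<app-id>/stages` expose a `shuffleWriteBytes` field; verify the field name against the Spark version in use. The helper names are hypothetical and not part of the benchmark scripts.

```python
import json
from urllib.request import urlopen


def total_shuffle_write_bytes(stages: list[dict]) -> int:
    """Sum shuffle write bytes over a list of stage records.

    Stages without the field (an assumption about older or partial
    records) are counted as zero.
    """
    return sum(stage.get("shuffleWriteBytes", 0) for stage in stages)


def fetch_stages(base_url: str, app_id: str) -> list[dict]:
    """Fetch stage records from a running application's REST endpoint."""
    url = f"{base_url}/api/v1/applications/{app_id}/stages"
    with urlopen(url) as response:
        return json.load(response)
```

With a benchmark running, `total_shuffle_write_bytes(fetch_stages("http://localhost:4040", app_id))` would give one figure per mode to compare. Retried stage attempts may appear as separate records, so the total can overcount if stages were re-executed.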
