Commit 6260665

feat: add standalone shuffle benchmark tool (#3752)

Parent: 19ab0e7

4 files changed: 826 additions & 14 deletions

native/Cargo.lock

Lines changed: 99 additions & 14 deletions (generated file; diff not rendered by default)

native/shuffle/Cargo.toml

Lines changed: 11 additions & 0 deletions
```diff
@@ -32,6 +32,7 @@ publish = false
 arrow = { workspace = true }
 async-trait = { workspace = true }
 bytes = { workspace = true }
+clap = { version = "4", features = ["derive"], optional = true }
 crc32c = "0.6.8"
 crc32fast = "1.3.2"
 datafusion = { workspace = true }
@@ -43,6 +44,8 @@ itertools = "0.14.0"
 jni = "0.21"
 log = "0.4"
 lz4_flex = { version = "0.13.0", default-features = false, features = ["frame"] }
+# parquet is only used by the shuffle_bench binary (shuffle-bench feature)
+parquet = { workspace = true, optional = true }
 simd-adler32 = "0.3.9"
 snap = "1.1"
 tokio = { version = "1", features = ["rt-multi-thread"] }
@@ -54,10 +57,18 @@ datafusion = { workspace = true, features = ["parquet_encryption", "sql"] }
 itertools = "0.14.0"
 tempfile = "3.26.0"
 
+[features]
+shuffle-bench = ["clap", "parquet"]
+
 [lib]
 name = "datafusion_comet_shuffle"
 path = "src/lib.rs"
 
+[[bin]]
+name = "shuffle_bench"
+path = "src/bin/shuffle_bench.rs"
+required-features = ["shuffle-bench"]
+
 [[bench]]
 name = "shuffle_writer"
 harness = false
```

native/shuffle/README.md

Lines changed: 41 additions & 0 deletions
All 41 changed lines are additions, appended after the existing crate description. Rendered:

## Shuffle Benchmark Tool

A standalone benchmark binary (`shuffle_bench`) is included for profiling shuffle write performance outside of Spark. It streams input data directly from Parquet files.

### Basic usage

```sh
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 \
  --codec lz4 \
  --hash-columns 0,3
```

### Options

| Option | Default | Description |
| --------------------- | -------------------------- | ------------------------------------------------------ |
| `--input` | _(required)_ | Path to a Parquet file or directory of Parquet files |
| `--partitions` | `200` | Number of output shuffle partitions |
| `--partitioning` | `hash` | Partitioning scheme: `hash`, `single`, `round-robin` |
| `--hash-columns` | `0` | Comma-separated column indices to hash on (e.g. `0,3`) |
| `--codec` | `lz4` | Compression codec: `none`, `lz4`, `zstd`, `snappy` |
| `--zstd-level` | `1` | Zstd compression level (1–22) |
| `--batch-size` | `8192` | Batch size for reading Parquet data |
| `--memory-limit` | _(none)_ | Memory limit in bytes; triggers spilling when exceeded |
| `--write-buffer-size` | `1048576` | Write buffer size in bytes |
| `--limit` | `0` | Limit rows processed per iteration (0 = no limit) |
| `--iterations` | `1` | Number of timed iterations |
| `--warmup` | `0` | Number of warmup iterations before timing |
| `--output-dir` | `/tmp/comet_shuffle_bench` | Directory for temporary shuffle output files |
### Profiling with flamegraph

```sh
cargo flamegraph --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec lz4
```
