Commit 6b134dd

feat: add Spark-compatible xxhash64 function (#21967)
## Which issue does this PR close?

- Partially addresses #14044

## Rationale for this change

Donate the `xxhash64` hash function from Comet so that other projects can benefit from it. The function was initially implemented in Comet by @advancedxy.

This is a continuation of #19627, which has gone stale. To keep the change focused and easier to review, this PR adds only `xxhash64`. Murmur3 will follow in a separate PR.

The first commit is the same as the version of #19627 that was previously approved. The second commit implements optimizations and bug fixes from the latest Comet version. The third commit is cleanup.

## What changes are included in this PR?

- Add `xxhash64(expr1, expr2, ...)` to `datafusion-spark`.
- Add Rust unit tests for primitives, boundary values, emoji/CJK strings, float `-0.0` normalization, dictionaries (with and without nulls), `FixedSizeBinary`, `Struct`, and `List`.
- Add sqllogictest coverage in `datafusion/sqllogictest/test_files/spark/hash/xxhash64.slt` with values verified against Spark.

## Are these changes tested?

Yes, both Rust unit tests and sqllogictest are included.

## Are there any user-facing changes?

A new `xxhash64` scalar function is available in `datafusion-spark`.
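For readers unfamiliar with what "Spark-compatible" implies here, the following is a minimal, dependency-free sketch of the idea, not the code added by this PR (which relies on the `twox-hash` crate). It assumes Spark's usual conventions: a starting seed of 42, each non-null argument hashed with the previous result as its seed, nulls leaving the running hash unchanged, and `-0.0` hashed like `0.0`. The names `xxh64`, `spark_style_xxhash64`, and `f64_hash_bytes` are hypothetical, and inputs are modelled as raw little-endian byte slices rather than Arrow arrays.

```rust
// XXH64 prime constants from the reference algorithm.
const P1: u64 = 0x9E37_79B1_85EB_CA87;
const P2: u64 = 0xC2B2_AE3D_27D4_EB4F;
const P3: u64 = 0x1656_67B1_9E37_79F9;
const P4: u64 = 0x85EB_CA77_C2B2_AE63;
const P5: u64 = 0x27D4_EB2F_1656_67C5;

// Read the first 8 bytes of `b` as a little-endian u64.
fn read_u64(b: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&b[..8]);
    u64::from_le_bytes(buf)
}

// Read the first 4 bytes of `b` as a little-endian u32, widened to u64.
fn read_u32(b: &[u8]) -> u64 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&b[..4]);
    u32::from_le_bytes(buf) as u64
}

fn round(acc: u64, input: u64) -> u64 {
    acc.wrapping_add(input.wrapping_mul(P2)).rotate_left(31).wrapping_mul(P1)
}

/// Plain XXH64 over a byte slice (hypothetical helper, not the PR's code).
pub fn xxh64(data: &[u8], seed: u64) -> u64 {
    let mut rest = data;
    let mut h = if data.len() >= 32 {
        // Four accumulator lanes consume 32-byte stripes.
        let mut v = [
            seed.wrapping_add(P1).wrapping_add(P2),
            seed.wrapping_add(P2),
            seed,
            seed.wrapping_sub(P1),
        ];
        while rest.len() >= 32 {
            for (i, lane) in v.iter_mut().enumerate() {
                *lane = round(*lane, read_u64(&rest[i * 8..]));
            }
            rest = &rest[32..];
        }
        let mut h = v[0].rotate_left(1)
            .wrapping_add(v[1].rotate_left(7))
            .wrapping_add(v[2].rotate_left(12))
            .wrapping_add(v[3].rotate_left(18));
        for lane in v.iter() {
            h = (h ^ round(0, *lane)).wrapping_mul(P1).wrapping_add(P4);
        }
        h
    } else {
        seed.wrapping_add(P5)
    };
    h = h.wrapping_add(data.len() as u64);
    // Tail: 8-byte words, then one optional 4-byte word, then single bytes.
    while rest.len() >= 8 {
        h = (h ^ round(0, read_u64(rest))).rotate_left(27).wrapping_mul(P1).wrapping_add(P4);
        rest = &rest[8..];
    }
    if rest.len() >= 4 {
        h = (h ^ read_u32(rest).wrapping_mul(P1)).rotate_left(23).wrapping_mul(P2).wrapping_add(P3);
        rest = &rest[4..];
    }
    for &b in rest {
        h = (h ^ (b as u64).wrapping_mul(P5)).rotate_left(11).wrapping_mul(P1);
    }
    // Final avalanche.
    h ^= h >> 33;
    h = h.wrapping_mul(P2);
    h ^= h >> 29;
    h = h.wrapping_mul(P3);
    h ^ (h >> 32)
}

/// Assumed Spark convention: seed 42, chain across arguments, skip nulls.
pub fn spark_style_xxhash64(columns: &[Option<&[u8]>]) -> i64 {
    let mut hash: u64 = 42;
    for col in columns {
        if let Some(bytes) = col {
            hash = xxh64(bytes, hash);
        }
    }
    hash as i64
}

/// Bytes to hash for an f64, with -0.0 normalized to 0.0 first.
pub fn f64_hash_bytes(x: f64) -> [u8; 8] {
    let normalized = if x == 0.0 { 0.0 } else { x }; // -0.0 == 0.0 is true
    normalized.to_bits().to_le_bytes()
}
```

Under these assumptions, a row such as `("a", NULL, 1.5)` would be hashed as `spark_style_xxhash64(&[Some(&b"a"[..]), None, Some(&f64_hash_bytes(1.5)[..])])`. Because each argument's hash seeds the next, argument order matters, and a row whose arguments are all null yields the seed itself.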
1 parent e1dc63e commit 6b134dd

6 files changed

Lines changed: 1658 additions & 2 deletions

File tree

Cargo.lock

Lines changed: 4 additions & 0 deletions
Generated file; diff not rendered.

datafusion/spark/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ rand = { workspace = true }
 serde_json = { workspace = true }
 sha1 = "0.11"
 sha2 = { workspace = true }
+twox-hash = "2.1"
 url = { workspace = true }

 [dev-dependencies]

datafusion/spark/src/function/hash/mod.rs

Lines changed: 6 additions & 2 deletions
@@ -18,6 +18,8 @@
 pub mod crc32;
 pub mod sha1;
 pub mod sha2;
+pub(crate) mod utils;
+pub mod xxhash64;

 use datafusion_expr::ScalarUDF;
 use datafusion_functions::make_udf_function;
@@ -26,16 +28,18 @@ use std::sync::Arc;
 make_udf_function!(crc32::SparkCrc32, crc32);
 make_udf_function!(sha1::SparkSha1, sha1);
 make_udf_function!(sha2::SparkSha2, sha2);
+make_udf_function!(xxhash64::SparkXxhash64, xxhash64);

 pub mod expr_fn {
     use datafusion_functions::export_functions;
     export_functions!(
         (crc32, "crc32(expr) - Returns a cyclic redundancy check value of the expr as a bigint.", arg1),
         (sha1, "sha1(expr) - Returns a SHA-1 hash value of the expr as a hex string.", arg1),
-        (sha2, "sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of expr. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.", arg1 arg2)
+        (sha2, "sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of expr. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.", arg1 arg2),
+        (xxhash64, "xxhash64(expr1, expr2, ...) - Returns a 64-bit hash value of the arguments using xxHash.", args)
     );
 }

 pub fn functions() -> Vec<Arc<ScalarUDF>> {
-    vec![crc32(), sha1(), sha2()]
+    vec![crc32(), sha1(), sha2(), xxhash64()]
 }
