Commit 52d3c07

Merge pull request #667 from dolthub/perf/blake3-chunk-hash
Replace SHA-512 with vendored portable BLAKE3 for chunk addresses
2 parents c19947d + f6ae6ad commit 52d3c07

14 files changed

Lines changed: 1535 additions & 102 deletions

.github/workflows/test.yml

Lines changed: 4 additions & 0 deletions
@@ -83,6 +83,10 @@ jobs:
     run: cd build && bash ../test/config_test.sh ./doltlite
     timeout-minutes: 2

+  - name: BLAKE3 known-answer test
+    run: cd build && bash ../test/blake3_kat_test.sh ./doltlite
+    timeout-minutes: 2
+
   - name: Remote integration tests
     run: cd build && bash ../test/remote_test.sh ./doltlite
     timeout-minutes: 5

LICENSE.md

Lines changed: 9 additions & 0 deletions
@@ -94,6 +94,15 @@ Non-public-domain code included in this respository includes:
     software found in the legacy autoconf/ directory and its
     subdirectories.

+  * The vendored BLAKE3 reference implementation under `ext/blake3/`,
+    used by the prolly tree's content-addressing layer. Upstream is
+    dual-licensed under Apache License 2.0 (with LLVM exception) or
+    CC0 1.0 Universal; DoltLite redistributes under Apache 2.0
+    (the project-wide license). Source: BLAKE3 v1.8.5 from
+    https://github.com/BLAKE3-team/BLAKE3. See `ext/blake3/LICENSE`
+    and `ext/blake3/README.md` for the full license texts and a
+    description of DoltLite-specific modifications.
+
 The following unix shell command can be run from the top-level
 of this source repository in order to remove all non-public-domain
 code:

ext/blake3/LICENSE

Lines changed: 331 additions & 0 deletions
Large diffs are not rendered by default.

ext/blake3/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# BLAKE3 (vendored)
+
+`prollyHashCompute` uses BLAKE3 to derive 20-byte content addresses for
+prolly chunks. The portable C reference implementation lives in this
+directory.
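
As a rough illustration of what a 20-byte content address is, the sketch below truncates a digest to its first 20 bytes. It is written in Python, not DoltLite's C, and `chunk_address`, `sha512_digest`, and `ADDRESS_LEN` are hypothetical names. BLAKE3 is not in Python's standard library, so SHA-512 (the hash this commit replaces) stands in as the default digest. How `prollyHashCompute` actually derives its 20 bytes is not visible in this commit view; truncation is an assumption.

```python
import hashlib
from typing import Callable

ADDRESS_LEN = 20  # bytes per chunk address, per the README text above


def sha512_digest(data: bytes) -> bytes:
    """Stand-in digest: SHA-512 from the standard library (NOT BLAKE3)."""
    return hashlib.sha512(data).digest()


def chunk_address(data: bytes,
                  digest_fn: Callable[[bytes], bytes] = sha512_digest) -> bytes:
    """Return the first ADDRESS_LEN bytes of the digest of `data`.

    Truncation is an assumption made for this sketch; it is not taken
    from the DoltLite sources.
    """
    digest = digest_fn(data)
    if len(digest) < ADDRESS_LEN:
        raise ValueError("digest is shorter than the address length")
    return digest[:ADDRESS_LEN]


if __name__ == "__main__":
    a = chunk_address(b"chunk payload")
    b = chunk_address(b"chunk payload")
    c = chunk_address(b"different payload")
    print(a.hex())
    # Identical content maps to the same address; different content does not.
    print(a == b, a == c)
```

For BLAKE3 itself, a shorter output is a prefix of a longer one, so requesting 20 bytes from the hasher and truncating the default 32-byte digest yield the same address.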
+
+## Provenance
+
+| | |
+|---|---|
+| Upstream | https://github.com/BLAKE3-team/BLAKE3 |
+| Version | 1.8.5 |
+| Files vendored | `blake3.h`, `blake3.c`, `blake3_portable.c` (verbatim from upstream `c/`) |
+| Files modified | `blake3_impl.h`: declarations for the SSE/AVX/NEON paths stripped, since DoltLite ships the portable implementation only |
+| Files DoltLite-original | `blake3_dispatch_portable.c`: replaces upstream `blake3_dispatch.c`. Removes runtime CPU feature detection; always calls the portable functions. ~30 lines. |
+| License | Apache 2.0 with LLVM exception, OR CC0 1.0 (dual-licensed by upstream). DoltLite redistributes under Apache 2.0 (project-wide). See `LICENSE`. |
+
+## Why portable, not SIMD
+
+DoltLite ships to platforms where the BLAKE3 SIMD paths are unavailable or would fragment the source set:
+
+- **WebAssembly**: no SSE/AVX/NEON intrinsics
+- **iOS / Android**: ARM64 NEON works there, but the same source must also compile on x86_64, and we want one source set
+- **Cross-platform consistency**: portable BLAKE3 produces identical hashes on every platform, by definition
+
+The portable path measures ~2.6× faster than the SHA-512 it replaced. SIMD would be another ~2× on top; that's a separate follow-up gated behind a compile-time flag.
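
The commit does not show how the ~2.6× figure was measured. The following Python sketch shows one way a throughput ratio between two hash functions could be taken. `throughput_mib_s` and `speedup` are hypothetical helpers, the 4096-byte payload is an arbitrary choice, and BLAKE2b stands in for the new hash because BLAKE3 is not in Python's standard library. Numbers from this script say nothing about DoltLite's portable C path.

```python
import hashlib
import time
from typing import Callable

Digest = Callable[[bytes], bytes]


def throughput_mib_s(digest_fn: Digest, payload: bytes, rounds: int = 200) -> float:
    """Hash `payload` `rounds` times and return the throughput in MiB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        digest_fn(payload)
    elapsed = time.perf_counter() - start
    total_mib = len(payload) * rounds / (1024 * 1024)
    # Guard against a zero-length interval on very coarse clocks.
    return total_mib / max(elapsed, 1e-9)


def speedup(new_fn: Digest, old_fn: Digest, payload: bytes, rounds: int = 200) -> float:
    """Return throughput(new_fn) / throughput(old_fn) on the same payload."""
    return (throughput_mib_s(new_fn, payload, rounds)
            / throughput_mib_s(old_fn, payload, rounds))


if __name__ == "__main__":
    payload = bytes(4096)  # hypothetical chunk size, chosen for illustration

    def sha512(data: bytes) -> bytes:
        return hashlib.sha512(data).digest()

    def blake2b(data: bytes) -> bytes:
        # Stand-in only: this is BLAKE2b, NOT the vendored BLAKE3.
        return hashlib.blake2b(data).digest()

    print(f"sha512 : {throughput_mib_s(sha512, payload):.0f} MiB/s")
    print(f"blake2b: {throughput_mib_s(blake2b, payload):.0f} MiB/s")
    print(f"ratio  : {speedup(blake2b, sha512, payload):.2f}x")
```

A measurement meant to back the README's claim would need to call the vendored C code directly, on payload sizes typical of prolly chunks.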
+
+## Updating
+
+To pull in a newer BLAKE3 release:
+
+1. Replace `blake3.h`, `blake3.c`, `blake3_portable.c` with their current upstream versions verbatim.
+2. Re-apply the `blake3_impl.h` modification: drop the SIMD-only declarations (everything under `#if defined(IS_X86)` and `#if BLAKE3_USE_NEON == 1`).
+3. Set `MAX_SIMD_DEGREE` to 1 unconditionally.
+4. Verify `blake3_dispatch_portable.c` still satisfies the function signatures declared in `blake3_impl.h`. If upstream adds a new dispatched function, mirror it in the shim.
+5. Run `bash test/blake3_kat_test.sh` to confirm hash output is unchanged.
+6. Update the version table above.
+
+If upstream ever changes the BLAKE3 algorithm itself (extremely unlikely; it would be BLAKE4), it'd require a `CHUNK_STORE_VERSION` bump for format compatibility.
+
+## Test vectors
+
+`test/blake3_kat_test.sh` runs a small KAT (Known Answer Test) suite against this vendored copy. Reference values were generated against the upstream-supplied libblake3 binary and cross-checked against the canonical empty-input vector published in the BLAKE3 spec.
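
The contents of `test/blake3_kat_test.sh` are not shown in this view. As a sketch of the known-answer idea only, the Python below compares a digest function's output against the published BLAKE3 hash of the empty input, checked over a 20-byte prefix to match the address length. `run_kat`, `VECTORS`, and `ADDRESS_LEN` are hypothetical names, and the optional `blake3` import refers to the third-party Python package, which is assumed to be installed separately. None of this is the mechanism the shell script uses.

```python
from typing import Callable, List, Sequence, Tuple

# Published BLAKE3 hash of the empty input (default 32-byte output).
BLAKE3_EMPTY_HEX = "af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262"

ADDRESS_LEN = 20  # compare only the bytes that would form a chunk address

# (input bytes, expected digest as hex)
VECTORS: List[Tuple[bytes, str]] = [(b"", BLAKE3_EMPTY_HEX)]


def run_kat(digest_fn: Callable[[bytes], bytes],
            vectors: Sequence[Tuple[bytes, str]] = VECTORS,
            prefix_len: int = ADDRESS_LEN) -> List[str]:
    """Return one message per mismatching vector; an empty list means pass."""
    failures = []
    for data, expected_hex in vectors:
        expected = bytes.fromhex(expected_hex)[:prefix_len]
        actual = digest_fn(data)[:prefix_len]
        if actual != expected:
            failures.append(
                f"input {data!r}: expected {expected.hex()}, got {actual.hex()}"
            )
    return failures


if __name__ == "__main__":
    try:
        from blake3 import blake3  # third-party package, assumed installed
    except ImportError:
        print("blake3 package not installed; skipping")
    else:
        failures = run_kat(lambda data: blake3(data).digest())
        print("PASS" if not failures else "\n".join(failures))
```

Returning a list of mismatches instead of raising on the first one lets a larger vector table report every failing input in a single run.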
