feat[vortex-cuda]: GPU FSST decompression kernel #7776
Conversation
Merging this PR will degrade performance by 10.6%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)
force-pushed from c982cd8 to 007bdab
force-pushed from a10b28e to 79d1f10
Very cool!
force-pushed from 79d1f10 to 46670e8
Thanks for the review! Addressed the comments.
0ax1
left a comment
Fantastic contribution, one question inline.
template <typename OffT>
__device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
                                          const OffT *__restrict codes_offsets,
                                          const uint64_t *__restrict symbols,
You likely want this in shared memory?
I tried that (see commit message) but it turned out to be slower.
How did you implement this?
This is what I tried:
diff --git a/vortex-cuda/kernels/src/fsst.cu b/vortex-cuda/kernels/src/fsst.cu
index 28ca66d2e..5ce282737 100644
--- a/vortex-cuda/kernels/src/fsst.cu
+++ b/vortex-cuda/kernels/src/fsst.cu
@@ -187,6 +187,17 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
const uint8_t *__restrict validity_bits, \
uint8_t *__restrict output_bytes, \
uint64_t num_strings) { \
+ __shared__ uint64_t sm_symbols[256]; \
+ __shared__ uint8_t sm_symbol_lengths[256]; \
+ for (uint32_t i = threadIdx.x; i < 256; i += blockDim.x) { \
+ sm_symbols[i] = symbols[i]; \
+ sm_symbol_lengths[i] = symbol_lengths[i]; \
+ } \
+ __syncthreads(); \
+ \
const uint64_t elements_per_block = (uint64_t)blockDim.x * ELEMENTS_PER_THREAD; \
const uint64_t block_start = (uint64_t)blockIdx.x * elements_per_block; \
const uint64_t block_end = (block_start + elements_per_block < num_strings) \
@@ -196,8 +207,8 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
for (uint64_t sid = block_start + threadIdx.x; sid < block_end; sid += blockDim.x) { \
fsst_decode_string<OffT>(codes_bytes, \
codes_offsets, \
- symbols, \
- symbol_lengths, \
+ sm_symbols, \
+ sm_symbol_lengths, \
output_offsets, \
validity_bits, \
output_bytes, \
// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
This looks unsound to me
Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.
Or just use u64, but that'd be a little wasteful.
I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: usize = 0;
    out.push(0u32);
    for &l in lens.as_slice::<P>() {
        acc += l as usize;
        out.push(acc as u32);
    }
    out
});
would be nice if this was computed in the kernel too?
How would you suggest doing this? The kernel needs to know the output offsets in order to decode, and I think the only thing that would work is an O(n^2) approach where every string recomputes the prefix sum of all preceding lengths. It's probably cheaper to execute this linear computation, with its serial dependencies, on the CPU. A GSST encoding would obviate the need for this, so I would punt on it for now.
force-pushed from 46670e8 to bebed3a
Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
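For illustration only, such an args struct might look roughly like the sketch below. The field names are taken from the kernel signature in the diff above; the struct name, field order, and exact pointer types are assumptions, not the PR's actual layout.

```rust
// Hypothetical shape for the kernel args struct on the Rust/FFI side.
// #[repr(C)] keeps the layout stable for the CUDA launch; all fields are
// device pointers or plain integers, so the struct is 8 * 8 = 64 bytes
// on a 64-bit target.
#[repr(C)]
pub struct FsstDecodeArgs {
    codes_bytes: *const u8,
    codes_offsets: *const u8, // reinterpreted to the unsigned-width variant
    symbols: *const u64,
    symbol_lengths: *const u8,
    output_offsets: *const u64,
    validity_bits: *const u8,
    output_bytes: *mut u8,
    num_strings: u64,
}

fn main() {
    // Sanity-check the layout assumption on this target.
    println!("size = {}", std::mem::size_of::<FsstDecodeArgs>());
}
```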
Summary
This commit implements on-GPU decompression of the existing FSST encoding. This kernel achieves ~42% max throughput utilization as compared to the `throughput_cuda` benchmark on a DGX Spark. CPU work is required to compute the output offsets.
The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible, up to u128 (st.global.v2.u64).
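The buffering scheme can be sketched on the host as follows. This is an illustrative model only, not the kernel itself: the function name is made up, and on the GPU the 24-byte buffer lives in three u64 registers with the 16-byte flush compiling to a single st.global.v2.u64 store.

```rust
// Illustrative sketch of the register-buffering idea: decoded symbol bytes
// (each symbol is at most 8 bytes) accumulate in a 24-byte buffer, and a
// full 16-byte (u128-sized) chunk is flushed as soon as one is available,
// so output stores are as wide as possible.
fn decode_with_wide_flushes(symbols: &[&[u8]], out: &mut Vec<u8>) {
    let mut buf = [0u8; 24]; // stands in for three u64 registers
    let mut filled = 0usize;
    for sym in symbols {
        buf[filled..filled + sym.len()].copy_from_slice(sym);
        filled += sym.len();
        // Flush the widest chunk once 16 bytes are buffered; on the GPU
        // this is the u128 (st.global.v2.u64) store path.
        if filled >= 16 {
            out.extend_from_slice(&buf[..16]);
            buf.copy_within(16..24, 0);
            filled -= 16;
        }
    }
    // Tail: emit the remaining bytes with narrower stores.
    out.extend_from_slice(&buf[..filled]);
}

fn main() {
    let mut out = Vec::new();
    decode_with_wide_flushes(&[b"hello", b"world!!", b"abcdefgh", b"xy"], &mut out);
    println!("{}", String::from_utf8_lossy(&out));
}
```

Because every symbol is at most 8 bytes, 24 bytes of buffer is enough headroom: the fill level is always below 16 before appending, so it can never exceed 23.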
The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it into shared memory measured ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations and the explicit shared copy adds bank-conflict latency on the warp-divergent symbols[code] reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.
Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.
Four kernel variants are generated for the unsigned widths of codes_offsets (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalent on the Rust side, so the bit pattern is preserved without copying.
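The signed-to-unsigned reinterpretation can be sketched roughly as below; the function name is illustrative and the real code may use a different casting utility, but the point is the same: i32 and u32 have identical size and alignment, so the slice can be viewed as unsigned in place with no copy.

```rust
// Hypothetical sketch: view a signed offsets buffer as unsigned in place,
// preserving the two's-complement bit pattern, so the CUDA side only needs
// the four unsigned kernel variants.
fn reinterpret_i32_as_u32(offsets: &[i32]) -> &[u32] {
    // SAFETY: i32 and u32 have identical size, alignment, and validity,
    // and the lifetime of the output is tied to the input borrow.
    unsafe { std::slice::from_raw_parts(offsets.as_ptr().cast::<u32>(), offsets.len()) }
}

fn main() {
    let signed: Vec<i32> = vec![0, 5, 17, 42];
    println!("{:?}", reinterpret_i32_as_u32(&signed));
}
```

Note the bit pattern, not the numeric value, is preserved: -1i32 reads back as u32::MAX, which is fine here because valid offsets are non-negative.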
Addresses: #6538
Testing
Unit tests against the CPU implementation on small and larger datasets.
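A CPU reference for such a cross-check might look roughly like the sketch below: each code indexes a 256-entry table of up-to-8-byte symbols, mirroring what the kernel's fsst_decode_string does per string. Escape-code and validity handling are omitted, and the names and toy table are illustrative, not the PR's actual test code.

```rust
// Minimal reference decoder sketch: expand each code into its symbol's
// bytes from a 256-entry table (symbols packed into 8-byte slots, with a
// parallel length table), concatenating into the output.
fn reference_decode(codes: &[u8], symbols: &[[u8; 8]; 256], lengths: &[u8; 256]) -> Vec<u8> {
    let mut out = Vec::new();
    for &code in codes {
        let len = lengths[code as usize] as usize;
        out.extend_from_slice(&symbols[code as usize][..len]);
    }
    out
}

fn main() {
    // Toy table: code 0 -> "ab", code 1 -> "c".
    let mut symbols = [[0u8; 8]; 256];
    let mut lengths = [0u8; 256];
    symbols[0][..2].copy_from_slice(b"ab");
    lengths[0] = 2;
    symbols[1][0] = b'c';
    lengths[1] = 1;
    let decoded = reference_decode(&[0, 1, 0], &symbols, &lengths);
    println!("{}", String::from_utf8_lossy(&decoded));
}
```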