
feat[vortex-cuda]: GPU FSST decompression kernel #7776

Open
asubiotto wants to merge 1 commit into develop from asubiotto/fsst-cuda

Conversation

@asubiotto
Contributor

Summary

This commit implements on-GPU decompression of the existing FSST encoding. The kernel achieves ~42% of the maximum throughput measured by the throughput_cuda benchmark on a DGX Spark. CPU work is required to compute the output offsets.

The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible up to u128 (st.global.v2.u64).
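The buffering scheme described above can be modeled on the host as follows. This is only an illustrative sketch, not the kernel itself: `StoreBuffer` and `decode_buffered` are hypothetical names, a plain byte array stands in for the three u64 registers, and a 16-byte `memcpy` stands in for the `st.global.v2.u64` wide store.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Illustrative host-side model of the buffered-store scheme (not the actual
// kernel): decoded symbol bytes accumulate in a small buffer standing in for
// the three u64 registers, and whenever 16+ bytes are pending one wide
// 16-byte store is emitted -- the host analogue of st.global.v2.u64.
struct StoreBuffer {
    uint8_t buf[24] = {};  // stands in for the three u64 registers (24 bytes)
    size_t filled = 0;     // bytes currently buffered
    uint8_t* out;          // destination, standing in for global memory

    explicit StoreBuffer(uint8_t* dst) : out(dst) {}

    // Append one decoded symbol (FSST symbols are 1..8 bytes).
    void push(const uint8_t* sym, size_t len) {
        std::memcpy(buf + filled, sym, len);
        filled += len;
        while (filled >= 16) {  // emit the widest aligned store available
            std::memcpy(out, buf, 16);
            out += 16;
            filled -= 16;
            std::memmove(buf, buf + 16, filled);
        }
    }

    // Drain any remaining tail bytes with a narrow store.
    void finish() {
        std::memcpy(out, buf, filled);
        filled = 0;
    }
};

// Decode a sequence of symbols through the buffered path.
std::string decode_buffered(const std::string* syms, size_t n) {
    std::string out(256, '\0');  // fixed scratch capacity for the sketch
    StoreBuffer sb(reinterpret_cast<uint8_t*>(&out[0]));
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        sb.push(reinterpret_cast<const uint8_t*>(syms[i].data()),
                syms[i].size());
        total += syms[i].size();
    }
    sb.finish();
    out.resize(total);
    return out;
}
```

The point of the scheme is that most global-memory traffic goes out in aligned 16-byte transactions rather than one narrow store per decoded symbol.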

The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it into shared memory measured ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations and the explicit shared copy adds bank-conflict latency on the warp-divergent symbols[code] reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.

Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.
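The work assignment behind that splits-style prototype can be sketched as follows. This is a hedged illustration of the idea only, not code in this PR; the names and the exact 32-byte chunk size are assumptions taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of splits-style work assignment: rather than one thread per string,
// each string is cut into fixed-size compressed chunks and one thread decodes
// each chunk, so long strings no longer serialize on a single thread.
constexpr size_t kChunkBytes = 32;  // ~32 compressed bytes per thread (assumed)

// Number of chunks (threads) a string of `compressed_len` bytes needs.
size_t chunks_for_string(size_t compressed_len) {
    return (compressed_len + kChunkBytes - 1) / kChunkBytes;  // ceil-divide
}

// Total threads to launch for a batch of compressed string lengths.
size_t total_chunks(const std::vector<size_t>& lens) {
    size_t total = 0;
    for (size_t len : lens) total += chunks_for_string(len);
    return total;
}
```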

Four kernel variants are generated for the unsigned widths of codes_offsets (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalent on the Rust side, so the bit pattern is preserved without copying.
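The reinterpretation works because, for non-negative values, the two's-complement bit pattern of a signed integer is identical to that of its unsigned counterpart. The PR does this on the Rust side; the following is only a minimal C++ sketch of the idea with an illustrative function name.

```cpp
#include <cassert>
#include <cstdint>

// Viewing a signed offsets buffer as its unsigned equivalent: valid offsets
// are non-negative, so each element's two's-complement bit pattern already
// equals its unsigned value -- a pointer reinterpretation suffices, with no
// copy and no per-element cast. (C++ permits aliasing between the signed and
// unsigned variants of the same integer type.)
const uint32_t* view_as_unsigned(const int32_t* offsets) {
    return reinterpret_cast<const uint32_t*>(offsets);
}
```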

Addresses: #6538

Testing

Unit tests against the CPU implementation on small and larger datasets.

Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
@asubiotto asubiotto added the changelog/performance A performance improvement label May 4, 2026
@codspeed-hq

codspeed-hq Bot commented May 4, 2026

Merging this PR will degrade performance by 10.6%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 1 regressed benchmark
✅ 1168 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode        Benchmark                            BASE      HEAD      Efficiency
Simulation  bitwise_not_vortex_buffer_mut[128]   246.1 ns  275.3 ns  -10.6%

Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)

Open in CodSpeed

@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from c982cd8 to 007bdab on May 4, 2026 12:29
@asubiotto asubiotto added changelog/feature A new feature and removed changelog/performance A performance improvement labels May 4, 2026
@asubiotto asubiotto requested review from 0ax1 and robert3005 May 4, 2026 12:33
@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch 3 times, most recently from a10b28e to 79d1f10 on May 4, 2026 12:44
Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
@a10y
Contributor

a10y commented May 4, 2026

Very cool!

@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from 79d1f10 to 46670e8 on May 5, 2026 10:57
@asubiotto
Contributor Author

Thanks for the review! Addressed the comments.

Contributor

@0ax1 0ax1 left a comment


Fantastic contribution, one question inline.

Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
template <typename OffT>
__device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
const OffT *__restrict codes_offsets,
const uint64_t *__restrict symbols,
Contributor


You likely want this in shared memory?

Contributor Author


I tried that (see commit message) but it turned out to be slower.

Contributor


How did you implement this?

Contributor Author


This is what I tried:

diff --git a/vortex-cuda/kernels/src/fsst.cu b/vortex-cuda/kernels/src/fsst.cu
index 28ca66d2e..5ce282737 100644
--- a/vortex-cuda/kernels/src/fsst.cu
+++ b/vortex-cuda/kernels/src/fsst.cu
@@ -187,6 +187,17 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
                                              const uint8_t *__restrict validity_bits,                        \
                                              uint8_t *__restrict output_bytes,                               \
                                              uint64_t num_strings) {                                         \
+        __shared__ uint64_t sm_symbols[256];                                                                 \
+        __shared__ uint8_t sm_symbol_lengths[256];                                                           \
+        for (uint32_t i = threadIdx.x; i < 256; i += blockDim.x) {                                           \
+            sm_symbols[i] = symbols[i];                                                                      \
+            sm_symbol_lengths[i] = symbol_lengths[i];                                                        \
+        }                                                                                                    \
+        __syncthreads();                                                                                     \
+                                                                                                             \
         const uint64_t elements_per_block = (uint64_t)blockDim.x * ELEMENTS_PER_THREAD;                      \
         const uint64_t block_start = (uint64_t)blockIdx.x * elements_per_block;                              \
         const uint64_t block_end = (block_start + elements_per_block < num_strings)                          \
@@ -196,8 +207,8 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
         for (uint64_t sid = block_start + threadIdx.x; sid < block_end; sid += blockDim.x) {                 \
             fsst_decode_string<OffT>(codes_bytes,                                                            \
                                      codes_offsets,                                                          \
-                                     symbols,                                                                \
-                                     symbol_lengths,                                                         \
+                                     sm_symbols,                                                             \
+                                     sm_symbol_lengths,                                                      \
                                      output_offsets,                                                         \
                                      validity_bits,                                                          \
                                      output_bytes,                                                           \

Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu

// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
Contributor


This looks unsound to me

Contributor Author


Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.

Contributor Author


Or just use u64, but that'd be a little wasteful.

Contributor Author


I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.

Comment on lines +90 to +99
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
let mut out = Vec::with_capacity(lens.len() + 1);
let mut acc: usize = 0;
out.push(0u32);
for &l in lens.as_slice::<P>() {
acc += l as usize;
out.push(acc as u32);
}
out
});
Contributor


Would be nice if this were computed in the kernel too?

Contributor Author


How would you suggest doing this? The kernel needs to know the output offsets before it can decode, and I think the only thing that would work is an O(n^2) recomputation of the prefix sum of all lengths for every string. It's probably cheaper to execute this linear, dependency-chained computation on the CPU. A GSST encoding would obviate the need for this, so I would punt on it.

Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>
@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from 46670e8 to bebed3a on May 5, 2026 13:18
@asubiotto
Contributor Author

Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
