feat[vortex-cuda]: GPU FSST decompression kernel #7776
Conversation
Merging this PR will degrade performance by 10.6%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)
force-pushed from c982cd8 to 007bdab
force-pushed from a10b28e to 79d1f10
Very cool!
force-pushed from 79d1f10 to 46670e8
Thanks for the review! Addressed the comments.
0ax1
left a comment
Fantastic contribution, one question inline.
template <typename OffT>
__device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
                                          const OffT *__restrict codes_offsets,
                                          const uint64_t *__restrict symbols,
You likely want this in shared memory?
I tried that (see commit message) but it turned out to be slower.
How did you implement this?
This is what I tried:
diff --git a/vortex-cuda/kernels/src/fsst.cu b/vortex-cuda/kernels/src/fsst.cu
index 28ca66d2e..5ce282737 100644
--- a/vortex-cuda/kernels/src/fsst.cu
+++ b/vortex-cuda/kernels/src/fsst.cu
@@ -187,6 +187,17 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
const uint8_t *__restrict validity_bits, \
uint8_t *__restrict output_bytes, \
uint64_t num_strings) { \
+ __shared__ uint64_t sm_symbols[256]; \
+ __shared__ uint8_t sm_symbol_lengths[256]; \
+ for (uint32_t i = threadIdx.x; i < 256; i += blockDim.x) { \
+ sm_symbols[i] = symbols[i]; \
+ sm_symbol_lengths[i] = symbol_lengths[i]; \
+ } \
+ __syncthreads(); \
+ \
const uint64_t elements_per_block = (uint64_t)blockDim.x * ELEMENTS_PER_THREAD; \
const uint64_t block_start = (uint64_t)blockIdx.x * elements_per_block; \
const uint64_t block_end = (block_start + elements_per_block < num_strings) \
@@ -196,8 +207,8 @@ __device__ inline void fsst_decode_string(const uint8_t *__restrict codes_bytes,
for (uint64_t sid = block_start + threadIdx.x; sid < block_end; sid += blockDim.x) { \
fsst_decode_string<OffT>(codes_bytes, \
codes_offsets, \
- symbols, \
- symbol_lengths, \
+ sm_symbols, \
+ sm_symbol_lengths, \
output_offsets, \
validity_bits, \
output_bytes, \
// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
This looks unsound to me
Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.
Or just use u64, but that'd be a little wasteful.
I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: usize = 0;
    out.push(0u32);
    for &l in lens.as_slice::<P>() {
        acc += l as usize;
        out.push(acc as u32);
    }
    out
});
would be nice if this was computed in the kernel too?
How would you suggest doing this? The kernel needs to know the output offsets in order to decode, and I think the only thing that would work is an O(n^2) approach where every string recomputes the prefix sum of all preceding lengths. It's probably cheaper to execute this linear computation, with its serial dependencies, on the CPU. A GSST encoding would obviate the need for this, so I would punt on it for now.
force-pushed from 46670e8 to bebed3a
Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
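For illustration only, such an args struct might look roughly like the sketch below. The field names are taken from the kernel signature in the diff above; the struct name, field order, and exact pointer types are assumptions, not the PR's actual layout.

```rust
// Hypothetical shape for the kernel args struct on the Rust/FFI side.
// #[repr(C)] keeps the layout stable for the CUDA launch; all fields are
// device pointers or plain integers, so the struct is 8 * 8 = 64 bytes
// on a 64-bit target.
#[repr(C)]
pub struct FsstDecodeArgs {
    codes_bytes: *const u8,
    codes_offsets: *const u8, // reinterpreted to the unsigned-width variant
    symbols: *const u64,
    symbol_lengths: *const u8,
    output_offsets: *const u64,
    validity_bits: *const u8,
    output_bytes: *mut u8,
    num_strings: u64,
}

fn main() {
    // Sanity-check the layout assumption on this target.
    println!("size = {}", std::mem::size_of::<FsstDecodeArgs>());
}
```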
Summary
This commit implements on-GPU decompression of the existing FSST encoding. This kernel achieves ~42% max throughput utilization as compared to the `throughput_cuda` benchmark on a DGX Spark. CPU work is required to compute the output offsets.
The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible, up to u128 (st.global.v2.u64).
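The buffering scheme can be sketched on the host as follows. This is an illustrative model only, not the kernel itself: the function name is made up, and on the GPU the 24-byte buffer lives in three u64 registers with the 16-byte flush compiling to a single st.global.v2.u64 store.

```rust
// Illustrative sketch of the register-buffering idea: decoded symbol bytes
// (each symbol is at most 8 bytes) accumulate in a 24-byte buffer, and a
// full 16-byte (u128-sized) chunk is flushed as soon as one is available,
// so output stores are as wide as possible.
fn decode_with_wide_flushes(symbols: &[&[u8]], out: &mut Vec<u8>) {
    let mut buf = [0u8; 24]; // stands in for three u64 registers
    let mut filled = 0usize;
    for sym in symbols {
        buf[filled..filled + sym.len()].copy_from_slice(sym);
        filled += sym.len();
        // Flush the widest chunk once 16 bytes are buffered; on the GPU
        // this is the u128 (st.global.v2.u64) store path.
        if filled >= 16 {
            out.extend_from_slice(&buf[..16]);
            buf.copy_within(16..24, 0);
            filled -= 16;
        }
    }
    // Tail: emit the remaining bytes with narrower stores.
    out.extend_from_slice(&buf[..filled]);
}

fn main() {
    let mut out = Vec::new();
    decode_with_wide_flushes(&[b"hello", b"world!!", b"abcdefgh", b"xy"], &mut out);
    println!("{}", String::from_utf8_lossy(&out));
}
```

Because every symbol is at most 8 bytes, 24 bytes of buffer is enough headroom: the fill level is always below 16 before appending, so it can never exceed 23.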
The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it into shared memory measured ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations and the explicit shared copy adds bank-conflict latency on the warp-divergent symbols[code] reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.
Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.
Four kernel variants are generated for the unsigned widths of codes_offsets (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalent on the Rust side, so the bit pattern is preserved without copying.
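The signed-to-unsigned reinterpretation can be sketched roughly as below; the function name is illustrative and the real code may use a different casting utility, but the point is the same: i32 and u32 have identical size and alignment, so the slice can be viewed as unsigned in place with no copy.

```rust
// Hypothetical sketch: view a signed offsets buffer as unsigned in place,
// preserving the two's-complement bit pattern, so the CUDA side only needs
// the four unsigned kernel variants.
fn reinterpret_i32_as_u32(offsets: &[i32]) -> &[u32] {
    // SAFETY: i32 and u32 have identical size, alignment, and validity,
    // and the lifetime of the output is tied to the input borrow.
    unsafe { std::slice::from_raw_parts(offsets.as_ptr().cast::<u32>(), offsets.len()) }
}

fn main() {
    let signed: Vec<i32> = vec![0, 5, 17, 42];
    println!("{:?}", reinterpret_i32_as_u32(&signed));
}
```

Note the bit pattern, not the numeric value, is preserved: -1i32 reads back as u32::MAX, which is fine here because valid offsets are non-negative.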
Addresses: #6538
Testing
Unit tests against the CPU implementation on small and larger datasets.
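A CPU reference for such a cross-check might look roughly like the sketch below: each code indexes a 256-entry table of up-to-8-byte symbols, mirroring what the kernel's fsst_decode_string does per string. Escape-code and validity handling are omitted, and the names and toy table are illustrative, not the PR's actual test code.

```rust
// Minimal reference decoder sketch: expand each code into its symbol's
// bytes from a 256-entry table (symbols packed into 8-byte slots, with a
// parallel length table), concatenating into the output.
fn reference_decode(codes: &[u8], symbols: &[[u8; 8]; 256], lengths: &[u8; 256]) -> Vec<u8> {
    let mut out = Vec::new();
    for &code in codes {
        let len = lengths[code as usize] as usize;
        out.extend_from_slice(&symbols[code as usize][..len]);
    }
    out
}

fn main() {
    // Toy table: code 0 -> "ab", code 1 -> "c".
    let mut symbols = [[0u8; 8]; 256];
    let mut lengths = [0u8; 256];
    symbols[0][..2].copy_from_slice(b"ab");
    lengths[0] = 2;
    symbols[1][0] = b'c';
    lengths[1] = 1;
    let decoded = reference_decode(&[0, 1, 0], &symbols, &lengths);
    println!("{}", String::from_utf8_lossy(&decoded));
}
```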