Commit 95c2a48
Add CUDA inflate (deflate decompression) kernel

Implements RFC 1951 deflate decompression as a Numba @cuda.jit kernel
for GPU-accelerated TIFF tile decoding. Each tile gets its own thread
block, so all tiles decompress in parallel.

Supports all three deflate block types:
- BTYPE=0: stored (no compression)
- BTYPE=1: fixed Huffman codes
- BTYPE=2: dynamic Huffman codes (most common in real files)
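
As a rough CPU-side illustration of how the block types above are told apart, the sketch below reads the 3-bit block header (BFINAL, then BTYPE) with an LSB-first bit reader, which is the bit order RFC 1951 specifies. `BitReader` and `read_block_header` are hypothetical names for this example, not the kernel's actual helpers.

```python
import zlib

class BitReader:
    """Minimal LSB-first bit reader (hypothetical helper, not from the commit)."""
    def __init__(self, data):
        self.data = data
        self.pos = 0  # absolute bit position in the stream

    def read(self, n):
        value = 0
        for i in range(n):
            byte = self.data[self.pos >> 3]
            value |= ((byte >> (self.pos & 7)) & 1) << i
            self.pos += 1
        return value

BLOCK_TYPES = {0: "stored", 1: "fixed", 2: "dynamic"}

def read_block_header(reader):
    """Return (is_final, block_type) for the block starting at reader.pos."""
    bfinal = reader.read(1)   # 1 bit: set on the last block of the stream
    btype = reader.read(2)    # 2 bits: 0, 1, or 2; 3 is reserved
    if btype == 3:
        raise ValueError("BTYPE=3 is reserved and must be rejected")
    return bool(bfinal), BLOCK_TYPES[btype]

# Usage: a raw deflate stream (wbits=-15) written at level 0 uses stored blocks.
comp = zlib.compressobj(0, zlib.DEFLATED, -15)
raw = comp.compress(b"hello") + comp.flush()
print(read_block_header(BitReader(raw)))
```

Rejecting BTYPE=3 explicitly matters on a GPU as much as on a CPU: an unhandled block type is one more way for a decode loop to spin without consuming input.
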
Uses a two-level Huffman decode:
- Fast path: 10-bit shared-memory lookup table (1024 entries)
- Slow path: overflow array scan for codes > 10 bits (up to 15)
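
The two-level scheme above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the kernel's code: it uses plain Python lists where the kernel uses shared memory, and `build_tables` and `decode_symbol` are hypothetical names. Codes are assigned canonically as in RFC 1951 section 3.2.2 and stored bit-reversed, so that the next 10 stream bits (LSB first) can index the fast table directly.

```python
FAST_BITS = 10   # fast-path table covers codes up to 10 bits (1024 entries)
MAX_BITS = 15    # longest code length deflate allows

def reverse_bits(code, length):
    out = 0
    for _ in range(length):
        out = (out << 1) | (code & 1)
        code >>= 1
    return out

def build_tables(lengths):
    """Build (fast, overflow) from per-symbol code lengths (0 = unused)."""
    bl_count = [0] * (MAX_BITS + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    next_code = [0] * (MAX_BITS + 1)
    code = 0
    for bits in range(1, MAX_BITS + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    fast = [(0, 0)] * (1 << FAST_BITS)  # (symbol, length); length 0 = miss
    overflow = []                       # (reversed_code, length, symbol)
    for sym, n in enumerate(lengths):
        if n == 0:
            continue
        rcode = reverse_bits(next_code[n], n)
        next_code[n] += 1
        if n <= FAST_BITS:
            # Replicate the entry for every value of the unused high bits.
            for idx in range(rcode, 1 << FAST_BITS, 1 << n):
                fast[idx] = (sym, n)
        else:
            overflow.append((rcode, n, sym))
    return fast, overflow

def decode_symbol(bits, fast, overflow):
    """bits holds the next 15 stream bits, first bit in the LSB."""
    sym, n = fast[bits & ((1 << FAST_BITS) - 1)]
    if n:
        return sym, n                   # fast path hit
    for rcode, n, sym in overflow:      # slow path: linear scan
        if bits & ((1 << n) - 1) == rcode:
            return sym, n
    raise ValueError("no Huffman code matches the input bits")
```

The caller advances its bit position by the returned length. Raising on a miss, rather than returning without consuming bits, is one way to guarantee the decode loop always makes progress or stops.
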
Fixes the infinite loop bug where 14-bit lit/len codes exceeded
the original 10-bit table size.

Tested: 100% pixel-exact match on Copernicus deflate+pred3 COG
(3600x3600, 16 tiles) vs CPU zlib.
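
A comparison of this kind could be structured roughly as below. This is a sketch, not the test that was actually run: `find_mismatched_tiles` is a hypothetical helper, and it assumes each tile is a zlib-wrapped stream that `zlib.decompress` accepts and that the kernel output has already been copied back to host memory. It compares the inflated bytes before the predictor is undone; if those match, applying the same predictor step to both yields identical pixels.

```python
import zlib

def find_mismatched_tiles(compressed_tiles, decoded_tiles):
    """Return indices of tiles whose decoded bytes differ from CPU zlib.

    compressed_tiles: list of bytes, one zlib stream per tile (assumed).
    decoded_tiles:    list of bytes-like outputs from the decoder under test.
    """
    mismatches = []
    for i, (raw, got) in enumerate(zip(compressed_tiles, decoded_tiles)):
        expected = zlib.decompress(raw)   # CPU reference result
        if bytes(got) != expected:
            mismatches.append(i)
    return mismatches

# Usage with synthetic data standing in for 16 tiles.
tiles = [bytes([i]) * 1000 for i in range(16)]
compressed = [zlib.compress(t) for t in tiles]
print(find_mismatched_tiles(compressed, tiles))
```
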
Performance: GPU inflate is ~20x slower than CPU zlib for this file
size (16 tiles). Deflate is inherently sequential per-stream, so
each thread block runs a long serial loop while most SMs sit idle.
The value is keeping data on GPU for end-to-end pipelines. For
files with hundreds of tiles, the parallelism would help more.

1 parent d69d34f
1 file changed, 451 additions, 3 deletions