Skip to content

perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659)#1665

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12
May 12, 2026
Merged

perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659)#1665
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #1659.

Summary

  • _try_nvcomp_from_device_bufs in xrspatial/geotiff/_gpu_decode.py now allocates a single contiguous cupy.empty(n * tile_bytes) buffer and derives per-tile device pointers as base_ptr + i * tile_bytes, matching the LZW (L1847), deflate (L1878), and host-buffer (L1114) paths in the same file.
  • Adds _check_gpu_memory on the new allocation so a borderline-OK GDS read surfaces a clear MemoryError before the cupy.empty call.
  • Adds five regression tests in test_nvcomp_from_device_bufs_single_alloc_1659.py: unsupported-codec short-circuit, missing-library short-circuit, memory-guard contract, structural single-buffer shape, real ZSTD round-trip through nvCOMP (when the lib is available).

Why

The previous pattern kept two copies of the decompressed data live on device for the lifetime of the trailing cupy.concatenate, and the concat itself was pure overhead the surrounding nvCOMP paths already avoid. The return contract is unchanged: a flat cupy.uint8 of length n * tile_bytes, which _apply_predictor_and_assemble consumes directly.

The microbenchmark in issue #1659 isolates the cost of the deleted alloc + concat steps. It does NOT measure full nvCOMP decompression latency; the proportion of end-to-end read_geotiff_gpu time the overhead represents depends on the target host. The function is also gated behind kvikio + nvCOMP + ZSTD, so most users will see no behavioural change.

Test plan

  • test_nvcomp_from_device_bufs_single_alloc_1659.py -- 5/5 pass on a CUDA 12 + cupy 13.6 + nvCOMP host (including the real ZSTD round-trip)
  • pytest xrspatial/geotiff/tests/ -- 1458 pass, 3 pre-existing unrelated matplotlib/py3.14 deepcopy failures in test_features.py::TestPalette
  • pytest xrspatial/geotiff/tests/ -k "gpu or zstd or compress" -- 440 pass on this host

The GDS+nvCOMP fast path allocated N separate cupy.empty(tile_bytes)
output buffers and ran cupy.concatenate(d_decomp_bufs) after the
decompress kernel returned. That kept two copies of the decompressed
data alive at once and added a serial concat the other nvCOMP paths in
this module already avoid (LZW at L1847, deflate at L1878, host-buffer
at L1114 all use a single cupy.empty + pointer-offset pattern).

The function now allocates one contiguous cupy.empty(n*tile_bytes) up
front and derives the per-tile device pointers as base + i*tile_bytes,
matching the surrounding code. The return contract is unchanged: a
single flat cupy.uint8 buffer of length n*tile_bytes that
_apply_predictor_and_assemble consumes directly.

Adds _check_gpu_memory on the new allocation so a borderline-OK GDS
read surfaces a clear MemoryError before the cupy.empty call instead of
an opaque CUDA OOM deeper in the function.

Tests in test_nvcomp_from_device_bufs_single_alloc_1659.py cover the
codec early-return, the missing-library early-return, the memory-guard
contract, the structural single-buffer return shape, and a real
end-to-end ZSTD round-trip through nvCOMP when the library is
available.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 13:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes the GeoTIFF GPU decode fast path by changing the nvCOMP-from-device-buffers (GDS) ZSTD decompression to write into a single contiguous CuPy output buffer (matching other nvCOMP paths in the module), reducing peak VRAM and avoiding a post-decode cupy.concatenate.

Changes:

  • Refactors _try_nvcomp_from_device_bufs to allocate one cupy.empty(n * tile_bytes) buffer and derive per-tile output pointers via base-pointer offsets.
  • Adds a GPU memory pre-check (_check_gpu_memory) before allocating the contiguous decompression buffer.
  • Adds a new regression test module covering short-circuits, the memory-guard contract, and a real nvCOMP ZSTD round-trip when available.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
xrspatial/geotiff/_gpu_decode.py Switches nvCOMP GDS decompression output to a single contiguous device buffer + pointer offsets; adds a pre-allocation memory guard call.
xrspatial/geotiff/tests/test_nvcomp_from_device_bufs_single_alloc_1659.py Adds regression tests for the single-buffer pattern and key short-circuit/contract behavior.
.claude/sweep-performance-state.csv Updates internal performance audit state for the geotiff module to include #1659.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

from __future__ import annotations

import importlib.util
import tempfile

def _fake_temp_size_fn(n, tile_bytes, opts, p_temp_size, total):
"""Stub for nvcompBatchedZstdDecompressGetTempSizeAsync."""
import ctypes
Comment on lines +1646 to 1648
return d_decomp
except Exception:
return None
Comment on lines +1594 to 1601
_check_gpu_memory(n * tile_bytes,
what="GDS+nvCOMP decompressed buffer")
d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
base_decomp_ptr = int(d_decomp.data.ptr)
decomp_offsets = (np.arange(n, dtype=np.uint64)
* np.uint64(tile_bytes))
d_decomp_ptrs = cupy.asarray(base_decomp_ptr + decomp_offsets)

- Hoist codec name resolution before VRAM allocation and memory guard so
  an unsupported codec short-circuits without burning an allocation.
- Re-raise MemoryError from the memory guard instead of swallowing it
  under the blanket Exception catch; the docstring already promises a
  clear MemoryError before the cupy.empty call.
- Update test_memory_guard_runs_with_full_decomp_size to assert the
  MemoryError is surfaced via pytest.raises.
- Drop unused tempfile + ctypes imports from the test module.
@brendancol brendancol merged commit 5f6095c into main May 12, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path)

2 participants