perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659) by brendancol · Pull Request #1665 · xarray-contrib/xarray-spatial

brendancol · 2026-05-12T12:51:53Z

Closes #1659.

Summary

_try_nvcomp_from_device_bufs in xrspatial/geotiff/_gpu_decode.py now allocates a single contiguous cupy.empty(n * tile_bytes) buffer and derives per-tile device pointers as base_ptr + i * tile_bytes, matching the LZW (L1847), deflate (L1878), and host-buffer (L1114) paths in the same file.
Adds _check_gpu_memory on the new allocation so a borderline-OK GDS read surfaces a clear MemoryError before the cupy.empty call.
Adds five regression tests in test_nvcomp_from_device_bufs_single_alloc_1659.py: unsupported-codec short-circuit, missing-library short-circuit, memory-guard contract, structural single-buffer shape, real ZSTD round-trip through nvCOMP (when the lib is available).

Why

The previous pattern kept two copies of the decompressed data live on device for the lifetime of the trailing cupy.concatenate, and the concat itself was pure overhead the surrounding nvCOMP paths already avoid. The return contract is unchanged: a flat cupy.uint8 of length n * tile_bytes, which _apply_predictor_and_assemble consumes directly.

The microbenchmark in issue #1659 isolates the cost of the deleted alloc + concat steps. It does NOT measure full nvCOMP decompression latency; the proportion of end-to-end read_geotiff_gpu time the overhead represents depends on the target host. The function is also gated behind kvikio + nvCOMP + ZSTD, so most users will see no behavioural change.

Test plan

test_nvcomp_from_device_bufs_single_alloc_1659.py -- 5/5 pass on a CUDA 12 + cupy 13.6 + nvCOMP host (including the real ZSTD round-trip)
pytest xrspatial/geotiff/tests/ -- 1458 pass, 3 pre-existing unrelated matplotlib/py3.14 deepcopy failures in test_features.py::TestPalette
pytest xrspatial/geotiff/tests/ -k "gpu or zstd or compress" -- 440 pass on this host

The GDS+nvCOMP fast path allocated N separate cupy.empty(tile_bytes) output buffers and ran cupy.concatenate(d_decomp_bufs) after the decompress kernel returned. That kept two copies of the decompressed data alive at once and added a serial concat the other nvCOMP paths in this module already avoid (LZW at L1847, deflate at L1878, host-buffer at L1114 all use a single cupy.empty + pointer-offset pattern). The function now allocates one contiguous cupy.empty(n*tile_bytes) up front and derives the per-tile device pointers as base + i*tile_bytes, matching the surrounding code. The return contract is unchanged: a single flat cupy.uint8 buffer of length n*tile_bytes that _apply_predictor_and_assemble consumes directly. Adds _check_gpu_memory on the new allocation so a borderline-OK GDS read surfaces a clear MemoryError before the cupy.empty call instead of an opaque CUDA OOM deeper in the function. Tests in test_nvcomp_from_device_bufs_single_alloc_1659.py cover the codec early-return, the missing-library early-return, the memory-guard contract, the structural single-buffer return shape, and a real end-to-end ZSTD round-trip through nvCOMP when the library is available.

Copilot

Pull request overview

Optimizes the GeoTIFF GPU decode fast path by changing the nvCOMP-from-device-buffers (GDS) ZSTD decompression to write into a single contiguous CuPy output buffer (matching other nvCOMP paths in the module), reducing peak VRAM and avoiding a post-decode cupy.concatenate.

Changes:

Refactors _try_nvcomp_from_device_bufs to allocate one cupy.empty(n * tile_bytes) buffer and derive per-tile output pointers via base-pointer offsets.
Adds a GPU memory pre-check (_check_gpu_memory) before allocating the contiguous decompression buffer.
Adds a new regression test module covering short-circuits, the memory-guard contract, and a real nvCOMP ZSTD round-trip when available.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`xrspatial/geotiff/_gpu_decode.py`	Switches nvCOMP GDS decompression output to a single contiguous device buffer + pointer offsets; adds a pre-allocation memory guard call.
`xrspatial/geotiff/tests/test_nvcomp_from_device_bufs_single_alloc_1659.py`	Adds regression tests for the single-buffer pattern and key short-circuit/contract behavior.
`.claude/sweep-performance-state.csv`	Updates internal performance audit state for the geotiff module to include #1659.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+from __future__ import annotations
+
+import importlib.util
+import tempfile


+
+def _fake_temp_size_fn(n, tile_bytes, opts, p_temp_size, total):
+    """Stub for nvcompBatchedZstdDecompressGetTempSizeAsync."""
+    import ctypes


+        return d_decomp
    except Exception:
        return None


+        _check_gpu_memory(n * tile_bytes,
+                          what="GDS+nvCOMP decompressed buffer")
+        d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
+        base_decomp_ptr = int(d_decomp.data.ptr)
+        decomp_offsets = (np.arange(n, dtype=np.uint64)
+                          * np.uint64(tile_bytes))
+        d_decomp_ptrs = cupy.asarray(base_decomp_ptr + decomp_offsets)



- Hoist codec name resolution before VRAM allocation and memory guard so an unsupported codec short-circuits without burning an allocation. - Re-raise MemoryError from the memory guard instead of swallowing it under the blanket Exception catch; the docstring already promises a clear MemoryError before the cupy.empty call. - Update test_memory_guard_runs_with_full_decomp_size to assert the MemoryError is surfaced via pytest.raises. - Drop unused tempfile + ctypes imports from the test module.

github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026

brendancol requested a review from Copilot May 12, 2026 13:03

Copilot started reviewing on behalf of brendancol May 12, 2026 13:04 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

brendancol merged commit 5f6095c into main May 12, 2026
11 checks passed

brendancol deleted the deep-sweep-performance-geotiff-2026-05-12 branch May 15, 2026 04:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659)#1665

perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659)#1665
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12

brendancol commented May 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

brendancol commented May 12, 2026

Summary

Why

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants