perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659)#1665
Merged
Merged
Conversation
The GDS+nvCOMP fast path allocated N separate cupy.empty(tile_bytes) output buffers and ran cupy.concatenate(d_decomp_bufs) after the decompress kernel returned. That kept two copies of the decompressed data alive at once and added a serial concat the other nvCOMP paths in this module already avoid (LZW at L1847, deflate at L1878, host-buffer at L1114 all use a single cupy.empty + pointer-offset pattern). The function now allocates one contiguous cupy.empty(n*tile_bytes) up front and derives the per-tile device pointers as base + i*tile_bytes, matching the surrounding code. The return contract is unchanged: a single flat cupy.uint8 buffer of length n*tile_bytes that _apply_predictor_and_assemble consumes directly. Adds _check_gpu_memory on the new allocation so a borderline-OK GDS read surfaces a clear MemoryError before the cupy.empty call instead of an opaque CUDA OOM deeper in the function. Tests in test_nvcomp_from_device_bufs_single_alloc_1659.py cover the codec early-return, the missing-library early-return, the memory-guard contract, the structural single-buffer return shape, and a real end-to-end ZSTD round-trip through nvCOMP when the library is available.
Contributor
There was a problem hiding this comment.
Pull request overview
Optimizes the GeoTIFF GPU decode fast path by changing the nvCOMP-from-device-buffers (GDS) ZSTD decompression to write into a single contiguous CuPy output buffer (matching other nvCOMP paths in the module), reducing peak VRAM and avoiding a post-decode cupy.concatenate.
Changes:
- Refactors
_try_nvcomp_from_device_bufsto allocate onecupy.empty(n * tile_bytes)buffer and derive per-tile output pointers via base-pointer offsets. - Adds a GPU memory pre-check (
_check_gpu_memory) before allocating the contiguous decompression buffer. - Adds a new regression test module covering short-circuits, the memory-guard contract, and a real nvCOMP ZSTD round-trip when available.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
xrspatial/geotiff/_gpu_decode.py |
Switches nvCOMP GDS decompression output to a single contiguous device buffer + pointer offsets; adds a pre-allocation memory guard call. |
xrspatial/geotiff/tests/test_nvcomp_from_device_bufs_single_alloc_1659.py |
Adds regression tests for the single-buffer pattern and key short-circuit/contract behavior. |
.claude/sweep-performance-state.csv |
Updates internal performance audit state for the geotiff module to include #1659. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from __future__ import annotations | ||
|
|
||
| import importlib.util | ||
| import tempfile |
|
|
||
| def _fake_temp_size_fn(n, tile_bytes, opts, p_temp_size, total): | ||
| """Stub for nvcompBatchedZstdDecompressGetTempSizeAsync.""" | ||
| import ctypes |
Comment on lines
+1646
to
1648
| return d_decomp | ||
| except Exception: | ||
| return None |
Comment on lines
+1594
to
1601
| _check_gpu_memory(n * tile_bytes, | ||
| what="GDS+nvCOMP decompressed buffer") | ||
| d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8) | ||
| base_decomp_ptr = int(d_decomp.data.ptr) | ||
| decomp_offsets = (np.arange(n, dtype=np.uint64) | ||
| * np.uint64(tile_bytes)) | ||
| d_decomp_ptrs = cupy.asarray(base_decomp_ptr + decomp_offsets) | ||
|
|
- Hoist codec name resolution before VRAM allocation and memory guard so an unsupported codec short-circuits without burning an allocation. - Re-raise MemoryError from the memory guard instead of swallowing it under the blanket Exception catch; the docstring already promises a clear MemoryError before the cupy.empty call. - Update test_memory_guard_runs_with_full_decomp_size to assert the MemoryError is surfaced via pytest.raises. - Drop unused tempfile + ctypes imports from the test module.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1659.
Summary
_try_nvcomp_from_device_bufsinxrspatial/geotiff/_gpu_decode.pynow allocates a single contiguouscupy.empty(n * tile_bytes)buffer and derives per-tile device pointers asbase_ptr + i * tile_bytes, matching the LZW (L1847), deflate (L1878), and host-buffer (L1114) paths in the same file._check_gpu_memoryon the new allocation so a borderline-OK GDS read surfaces a clearMemoryErrorbefore thecupy.emptycall.test_nvcomp_from_device_bufs_single_alloc_1659.py: unsupported-codec short-circuit, missing-library short-circuit, memory-guard contract, structural single-buffer shape, real ZSTD round-trip through nvCOMP (when the lib is available).Why
The previous pattern kept two copies of the decompressed data live on device for the lifetime of the trailing
cupy.concatenate, and the concat itself was pure overhead the surrounding nvCOMP paths already avoid. The return contract is unchanged: a flatcupy.uint8of lengthn * tile_bytes, which_apply_predictor_and_assembleconsumes directly.The microbenchmark in issue #1659 isolates the cost of the deleted alloc + concat steps. It does NOT measure full nvCOMP decompression latency; the proportion of end-to-end
read_geotiff_gputime the overhead represents depends on the target host. The function is also gated behindkvikio+nvCOMP+ ZSTD, so most users will see no behavioural change.Test plan
test_nvcomp_from_device_bufs_single_alloc_1659.py-- 5/5 pass on a CUDA 12 + cupy 13.6 + nvCOMP host (including the real ZSTD round-trip)pytest xrspatial/geotiff/tests/-- 1458 pass, 3 pre-existing unrelated matplotlib/py3.14 deepcopy failures intest_features.py::TestPalettepytest xrspatial/geotiff/tests/ -k "gpu or zstd or compress"-- 440 pass on this host