
Commit 5f6095c

perf(geotiff): single-buffer output in _try_nvcomp_from_device_bufs (#1659) (#1665)
* Replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (#1659)

  The GDS+nvCOMP fast path allocated N separate cupy.empty(tile_bytes) output
  buffers and ran cupy.concatenate(d_decomp_bufs) after the decompress kernel
  returned. That kept two copies of the decompressed data alive at once and
  added a serial concat the other nvCOMP paths in this module already avoid
  (LZW at L1847, deflate at L1878, host-buffer at L1114 all use a single
  cupy.empty + pointer-offset pattern).

  The function now allocates one contiguous cupy.empty(n * tile_bytes) up
  front and derives the per-tile device pointers as base + i * tile_bytes,
  matching the surrounding code. The return contract is unchanged: a single
  flat cupy.uint8 buffer of length n * tile_bytes that
  _apply_predictor_and_assemble consumes directly.

  Adds _check_gpu_memory on the new allocation so a borderline-OK GDS read
  surfaces a clear MemoryError before the cupy.empty call instead of an
  opaque CUDA OOM deeper in the function.

  Tests in test_nvcomp_from_device_bufs_single_alloc_1659.py cover the codec
  early-return, the missing-library early-return, the memory-guard contract,
  the structural single-buffer return shape, and a real end-to-end ZSTD
  round-trip through nvCOMP when the library is available.

* Address review comments on PR #1665

  - Hoist codec-name resolution before the VRAM allocation and memory guard
    so an unsupported codec short-circuits without burning an allocation.
  - Re-raise MemoryError from the memory guard instead of swallowing it under
    the blanket Exception catch; the docstring already promises a clear
    MemoryError before the cupy.empty call.
  - Update test_memory_guard_runs_with_full_decomp_size to assert the
    MemoryError is surfaced via pytest.raises.
  - Drop unused tempfile + ctypes imports from the test module.
1 parent 2d0e067 commit 5f6095c

3 files changed

Lines changed: 268 additions & 10 deletions
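The single-buffer replacement the commit message describes can be sketched on the CPU. This is a hypothetical illustration, not the library's code: NumPy stands in for CuPy (the allocation API has the same shape), the tile "decompress" is a plain copy, and the function names are mine.

```python
import numpy as np


def decode_concat_style(tiles, tile_bytes):
    # Old pattern: N small output buffers plus a trailing concatenate,
    # which keeps two copies of the decoded data alive at the peak.
    bufs = [np.empty(tile_bytes, dtype=np.uint8) for _ in tiles]
    for buf, t in zip(bufs, tiles):
        buf[:] = t  # stand-in for the decompress kernel writing one tile
    return np.concatenate(bufs)


def decode_single_buffer(tiles, tile_bytes):
    # New pattern: one contiguous buffer up front; each per-tile "pointer"
    # is just base + i * tile_bytes, expressed here as a slice view.
    n = len(tiles)
    out = np.empty(n * tile_bytes, dtype=np.uint8)
    for i, t in enumerate(tiles):
        out[i * tile_bytes:(i + 1) * tile_bytes] = t
    return out


tiles = [np.full(16, i, dtype=np.uint8) for i in range(4)]
a = decode_concat_style(tiles, 16)
b = decode_single_buffer(tiles, 16)
assert np.array_equal(a, b)  # same return contract, one allocation fewer
```

Both variants return the same flat uint8 buffer, which is why the commit can swap the implementation without touching callers.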

File tree

.claude/sweep-performance-state.csv
xrspatial/geotiff/_gpu_decode.py
test_nvcomp_from_device_bufs_single_alloc_1659.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
-geotiff,2026-05-10,SAFE,IO-bound,0,,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions."
+geotiff,2026-05-12,SAFE,IO-bound,0,1659,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds."
 glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
 hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
 hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
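The Pass 5 note above cites a microbenchmark of the alloc+concat overhead alone. A CPU-side sketch of how such a harness could look (a hypothetical illustration: NumPy stands in for CuPy, the function and harness names are mine, and the cited millisecond figures were measured on-GPU, so absolute numbers here will differ):

```python
import time

import numpy as np


def bench(fn, *args, reps=20):
    """Average wall-clock seconds per call over `reps` runs."""
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps


def per_tile_alloc(n, tile_bytes):
    # Old pattern: N small allocations plus a trailing serial concatenate.
    bufs = [np.empty(tile_bytes, dtype=np.uint8) for _ in range(n)]
    return np.concatenate(bufs)


def single_alloc(n, tile_bytes):
    # New pattern: one contiguous allocation, no concat.
    return np.empty(n * tile_bytes, dtype=np.uint8)


n, tile_bytes = 256, 65536
old_ms = bench(per_tile_alloc, n, tile_bytes) * 1e3
new_ms = bench(single_alloc, n, tile_bytes) * 1e3
print(f"per-tile alloc+concat {old_ms:.2f} ms vs single alloc {new_ms:.2f} ms")
```

Both paths return buffers of identical size and dtype, so the harness only isolates allocation and concatenation cost, mirroring the "alloc+concat overhead only, not full nvCOMP latency" caveat in the audit note.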

xrspatial/geotiff/_gpu_decode.py

Lines changed: 36 additions & 9 deletions
@@ -1561,34 +1561,59 @@ def gpu_decode_tiles_from_file(
 
 
 def _try_nvcomp_from_device_bufs(d_tiles, tile_bytes, compression):
-    """Run nvCOMP batch decompress on tiles already in GPU memory."""
+    """Run nvCOMP batch decompress on tiles already in GPU memory.
+
+    The decompressed output uses a single contiguous device buffer with
+    per-tile pointers derived as ``base_ptr + i * tile_bytes``. The previous
+    implementation allocated N separate ``cupy.empty(tile_bytes)`` buffers
+    and ran ``cupy.concatenate`` after the decompress kernel; that pattern
+    kept two copies of the decompressed data alive at once (the per-tile
+    buffers and the concatenated result) and ran a serial concat that the
+    rest of the GPU paths avoid. The other nvCOMP code paths in this module
+    (LZW at ~L1847, deflate at ~L1878, host-buffer at ~L1114) already use
+    the single-buffer pattern; this brings the GDS path in line with them.
+    See issue #1659.
+    """
     import ctypes
     import cupy
 
     lib = _get_nvcomp()
     if lib is None:
         return None
 
+    # Resolve the codec entry-point names before touching VRAM. An
+    # unsupported codec must return None without burning an allocation or
+    # running the memory guard.
+    fn_name = {50000: 'nvcompBatchedZstdDecompressGetTempSizeAsync'}.get(compression)
+    dec_name = {50000: 'nvcompBatchedZstdDecompressAsync'}.get(compression)
+    if fn_name is None:
+        return None
+
     class _NvcompDecompOpts(ctypes.Structure):
         _fields_ = [('backend', ctypes.c_int), ('reserved', ctypes.c_char * 60)]
 
     try:
         n = len(d_tiles)
-        d_decomp_bufs = [cupy.empty(tile_bytes, dtype=cupy.uint8) for _ in range(n)]
+        # Single contiguous output buffer. nvCOMP's batched decompress takes
+        # an array of per-tile device pointers; derive those from the base
+        # pointer + per-tile byte offsets so we never allocate N small
+        # buffers (one cupy.empty per tile is ~tens of microseconds each)
+        # and never run a trailing cupy.concatenate.
+        _check_gpu_memory(n * tile_bytes,
+                          what="GDS+nvCOMP decompressed buffer")
+        d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
+        base_decomp_ptr = int(d_decomp.data.ptr)
+        decomp_offsets = (np.arange(n, dtype=np.uint64)
+                          * np.uint64(tile_bytes))
+        d_decomp_ptrs = cupy.asarray(base_decomp_ptr + decomp_offsets)
 
         d_comp_ptrs = cupy.array([t.data.ptr for t in d_tiles], dtype=cupy.uint64)
-        d_decomp_ptrs = cupy.array([b.data.ptr for b in d_decomp_bufs], dtype=cupy.uint64)
         d_comp_sizes = cupy.array([t.size for t in d_tiles], dtype=cupy.uint64)
         d_buf_sizes = cupy.full(n, tile_bytes, dtype=cupy.uint64)
         d_actual = cupy.empty(n, dtype=cupy.uint64)
 
         opts = _NvcompDecompOpts(backend=0, reserved=b'\x00' * 60)
 
-        fn_name = {50000: 'nvcompBatchedZstdDecompressGetTempSizeAsync'}.get(compression)
-        dec_name = {50000: 'nvcompBatchedZstdDecompressAsync'}.get(compression)
-        if fn_name is None:
-            return None
-
         temp_fn = getattr(lib, fn_name)
         temp_fn.restype = ctypes.c_int
         temp_size = ctypes.c_size_t(0)
@@ -1621,7 +1646,9 @@ class _NvcompDecompOpts(ctypes.Structure):
         if int(cupy.any(d_statuses != 0)):
             return None
 
-        return cupy.concatenate(d_decomp_bufs)
+        return d_decomp
+    except MemoryError:
+        raise
     except Exception:
         return None
 
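The diff calls `_check_gpu_memory` before the large allocation, but that helper's body is not part of this commit. A minimal sketch of what such a guard might look like, under stated assumptions: the name, signature, and headroom factor here are guesses from the call site, and the free-bytes probe is injectable so the sketch runs without a GPU (a real guard would query something like `cupy.cuda.Device().mem_info`):

```python
def check_gpu_memory(required_bytes, what="tile buffer", free_bytes=None,
                     headroom=0.9):
    """Raise a clear MemoryError before a large device allocation.

    Hypothetical stand-in for the module's _check_gpu_memory; an early,
    labelled MemoryError beats an opaque CUDA OOM mid-decode.
    """
    if free_bytes is None:
        # Assumed probe on a real GPU host; not exercised in this sketch.
        import cupy
        free_bytes, _total = cupy.cuda.Device().mem_info
    if required_bytes > free_bytes * headroom:
        raise MemoryError(
            f"{what}: need {required_bytes} bytes, "
            f"only {free_bytes} free on device"
        )


# A 16 MiB request against a simulated 8 MiB of free VRAM fails early,
# and the error message carries the caller-supplied label:
try:
    check_gpu_memory(16 << 20, what="GDS+nvCOMP decompressed buffer",
                     free_bytes=8 << 20)
except MemoryError as e:
    assert "GDS+nvCOMP" in str(e)
```

The labelled `what` argument is what the new tests assert on, so the guard's error message doubles as its contract.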

test_nvcomp_from_device_bufs_single_alloc_1659.py

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
"""Regression tests for the single-buffer pattern in _try_nvcomp_from_device_bufs.

Issue #1659: ``_try_nvcomp_from_device_bufs`` used to allocate N separate
``cupy.empty(tile_bytes)`` output buffers and run ``cupy.concatenate`` after
the nvCOMP decompress kernel returned. That kept two copies of the
decompressed data alive at once and ran a serial concat the other nvCOMP
paths in this module already avoid. The fix matches the
single-contiguous-buffer + pointer-offset pattern used by the deflate / LZW /
host-buffer paths nearby.

These tests skip when CuPy + CUDA are not available. They also skip the
end-to-end nvCOMP integration check when ``kvikio`` or the nvCOMP shared
library are not installed, which is the common case on developer hosts;
the unit-level checks (contract + memory guard) run regardless.
"""
from __future__ import annotations

import importlib.util

import numpy as np
import pytest

from xrspatial.geotiff._gpu_decode import _try_nvcomp_from_device_bufs


def _gpu_available() -> bool:
    if importlib.util.find_spec("cupy") is None:
        return False
    try:
        import cupy
        return bool(cupy.cuda.is_available())
    except Exception:
        return False


def _nvcomp_available() -> bool:
    from xrspatial.geotiff._gpu_decode import _get_nvcomp
    return _get_nvcomp() is not None


@pytest.mark.skipif(not _gpu_available(), reason="cupy + CUDA required")
def test_unsupported_codec_short_circuits_before_allocation():
    """Non-ZSTD codecs must return None without allocating output buffers.

    Pins the early-return contract that lets the caller pick a different
    decoder when nvCOMP cannot handle this codec.
    """
    import cupy

    # Use Deflate (8), which is unsupported by this function (ZSTD only).
    d_tiles = [cupy.zeros(1024, dtype=cupy.uint8) for _ in range(4)]
    assert _try_nvcomp_from_device_bufs(d_tiles, 1024, 8) is None


@pytest.mark.skipif(not _gpu_available(), reason="cupy + CUDA required")
def test_no_nvcomp_lib_returns_none(monkeypatch):
    """When the nvCOMP library is missing, the function must return None.

    The caller relies on this signal to fall back to the bytes-based decode
    path. Without it, callers would hit a ctypes ``getattr`` AttributeError
    deeper in the function.
    """
    import cupy
    from xrspatial.geotiff import _gpu_decode

    monkeypatch.setattr(_gpu_decode, "_get_nvcomp", lambda: None)

    d_tiles = [cupy.zeros(1024, dtype=cupy.uint8)]
    assert _try_nvcomp_from_device_bufs(d_tiles, 1024, 50000) is None


@pytest.mark.skipif(not _gpu_available(), reason="cupy + CUDA required")
def test_memory_guard_runs_with_full_decomp_size(monkeypatch):
    """The single-buffer allocation must be size-checked before cupy.empty.

    The new pattern allocates one contiguous ``n * tile_bytes`` buffer
    instead of N small buffers. The OOM guard is what tells the caller
    early that the decode will not fit on the device; a regression that
    removed the guard would surface as an opaque CUDA OOM instead.
    """
    import cupy
    from xrspatial.geotiff import _gpu_decode

    seen = {"total_bytes": None, "what": None, "called": False}

    def fake_check(required_bytes, what="tile buffer"):
        seen["total_bytes"] = int(required_bytes)
        seen["what"] = what
        seen["called"] = True
        raise MemoryError("simulated OOM")

    # Pin _get_nvcomp to something truthy so the function does not bail
    # before reaching the allocation step. The fake check raises before
    # any nvCOMP call would happen, so the lib value never gets used.
    monkeypatch.setattr(_gpu_decode, "_get_nvcomp", lambda: object())
    monkeypatch.setattr(_gpu_decode, "_check_gpu_memory", fake_check)

    n_tiles = 8
    tile_bytes = 65536
    d_tiles = [cupy.zeros(128, dtype=cupy.uint8) for _ in range(n_tiles)]

    with pytest.raises(MemoryError):
        _try_nvcomp_from_device_bufs(d_tiles, tile_bytes, 50000)

    assert seen["called"], "_check_gpu_memory was not called"
    expected_bytes = n_tiles * tile_bytes
    assert seen["total_bytes"] == expected_bytes, (
        f"expected total {expected_bytes}, got {seen['total_bytes']}"
    )
    assert "decompressed" in seen["what"] or "nvCOMP" in seen["what"], (
        f"unhelpful 'what' label: {seen['what']!r}"
    )


@pytest.mark.skipif(
    not _gpu_available() or not _nvcomp_available(),
    reason="cupy + CUDA + nvCOMP shared lib required",
)
def test_zstd_decompress_roundtrip_returns_single_contiguous_buffer():
    """End-to-end: feed real ZSTD-compressed device buffers in, check the
    output is a single flat ``cupy.uint8`` array of length n*tile_bytes.

    This test confirms the return contract that ``_apply_predictor_and_assemble``
    depends on: ``out`` is the contiguous concatenation of the N decompressed
    tiles, not a list. The previous implementation returned the same shape but
    via ``cupy.concatenate``; the new one allocates the contig buffer up front
    and writes through per-tile pointers, so a regression that dropped the
    return value would surface here.
    """
    import cupy
    import zstandard as zstd

    rng = np.random.default_rng(seed=1659)
    tile_bytes = 4096
    n_tiles = 8

    cctx = zstd.ZstdCompressor()
    host_tiles = [rng.integers(0, 256, size=tile_bytes, dtype=np.uint8)
                  for _ in range(n_tiles)]
    compressed = [cctx.compress(t.tobytes()) for t in host_tiles]
    d_tiles = [cupy.asarray(np.frombuffer(c, dtype=np.uint8))
               for c in compressed]

    result = _try_nvcomp_from_device_bufs(d_tiles, tile_bytes, 50000)

    # nvCOMP may be present but mis-configured on the host (e.g. driver
    # version mismatch); skip rather than fail in that case so the test is
    # informative when run on a real GDS rig.
    if result is None:
        pytest.skip("nvCOMP returned None; library may be unusable on this host")

    assert isinstance(result, cupy.ndarray)
    assert result.dtype == cupy.uint8
    assert result.shape == (n_tiles * tile_bytes,)
    assert result.flags.c_contiguous

    # Decoded payload must match the original host tiles. The buffer is a
    # single flat array; tile i lives at offset i*tile_bytes.
    host_out = result.get()
    for i, expected in enumerate(host_tiles):
        decoded = host_out[i * tile_bytes:(i + 1) * tile_bytes]
        assert np.array_equal(decoded, expected), (
            f"tile {i} decoded payload differs from input"
        )


@pytest.mark.skipif(not _gpu_available(), reason="cupy + CUDA required")
def test_no_orphan_decomp_buffers_after_call(monkeypatch):
    """Earlier code held a Python list of N device buffers in scope
    alongside the concatenated result. The replacement allocates once
    and returns that one buffer.

    The check here is structural rather than numerical: after a successful
    call the only cupy ndarray the caller receives is ``result`` itself,
    and inspecting it confirms ``result.size == n_tiles * tile_bytes``.
    """
    import cupy
    from xrspatial.geotiff import _gpu_decode

    # Stub the nvCOMP entry points so the decompress "succeeds" without an
    # actual library. Force the function down the success branch, capture
    # the returned buffer, then verify shape + ownership.
    monkeypatch.setattr(_gpu_decode, "_get_nvcomp",
                        lambda: _FakeNvcompLib())

    n_tiles = 4
    tile_bytes = 2048
    d_tiles = [cupy.zeros(64, dtype=cupy.uint8) for _ in range(n_tiles)]
    result = _try_nvcomp_from_device_bufs(d_tiles, tile_bytes, 50000)

    # The fake lib reports success and zero-fills the output buffer; the
    # function returns the contiguous buffer as-is.
    assert result is not None
    assert isinstance(result, cupy.ndarray)
    assert result.size == n_tiles * tile_bytes
    assert result.flags.c_contiguous
    # The contract requires uint8 -- not a uint8 view of something else.
    assert result.dtype == cupy.uint8


class _FakeNvcompLib:
    """Stand-in for the nvCOMP CDLL handle used in tests.

    The real function calls ``getattr(lib, fn_name)`` for two entry points
    and invokes each as a ctypes function. We expose those entry-point
    names as Python callables that pretend the work succeeded.
    """

    def __getattr__(self, name):
        if name == 'nvcompBatchedZstdDecompressGetTempSizeAsync':
            return _fake_temp_size_fn
        if name == 'nvcompBatchedZstdDecompressAsync':
            return _fake_decompress_fn
        raise AttributeError(name)


def _fake_temp_size_fn(n, tile_bytes, opts, p_temp_size, total):
    """Stub for nvcompBatchedZstdDecompressGetTempSizeAsync."""
    # Write a tiny temp-size value into the caller's c_size_t.
    p_temp_size._obj.value = 1
    return 0


def _fake_decompress_fn(*args):
    """Stub for nvcompBatchedZstdDecompressAsync.

    The function's return-value test is ``s != 0``. We return 0 (success).
    The caller's d_statuses array is already zero from ``cupy.zeros``, so
    the post-decode any-nonzero check passes.
    """
    return 0
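The fake temp-size stub above writes back through `p_temp_size._obj.value`. That relies on a CPython ctypes detail: `ctypes.byref(x)` returns a lightweight CArgObject whose `_obj` attribute references the original `x`, which is what lets a plain Python callable standing in for a CDLL function reach the caller's out-parameter. A self-contained demonstration (the function name here is illustrative):

```python
import ctypes


def fake_get_temp_size(p_temp_size):
    # A real CDLL function would dereference the pointer; a Python fake
    # reaches through byref's _obj back-reference to the original object.
    p_temp_size._obj.value = 1
    return 0  # nvCOMP-style status code: 0 means success


temp_size = ctypes.c_size_t(0)
status = fake_get_temp_size(ctypes.byref(temp_size))
assert status == 0
assert temp_size.value == 1  # the write landed in the caller's c_size_t
```

Note this only works when the stub receives the CArgObject itself; if the real code passed `ctypes.pointer(...)` instead of `byref(...)`, the fake would need `p.contents.value` rather than `._obj.value`.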
