Commit 635246c

perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712) (#1729)
* perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712)

  Two related anti-patterns in the GPU compress path remained after the decode-side fix in #1552 and the device-pointer fix in #1659:

  1. Per-tile cupy.empty allocations for the compressed-output buffers. For an N-tile write, this issued N memory-pool calls. Replace with one contiguous allocation of size n_tiles * max_cs plus per-tile slab views; the per-tile pointer array still lets nvCOMP write independent slabs in parallel.

  2. Per-tile .get().tobytes() in the result-collection loop. Each .get() was a separate D2H transfer on the default stream, and the per-DMA setup cost dominated wall time for large-N writes (the exact pattern #1552 fixed on the decode side). Replace with a single cupy.concatenate of trimmed slabs and one .get(), then slice the host buffer by cumulative offsets to peel out per-tile payloads. The adler32 deflate-wrap step is unchanged.

  Real-world benchmark on an Ampere-class GPU: a 2048x2048 float32 zstd GPU write at tile_size=64 (1024 tiles) drops from a median of 84.3 ms to 54.7 ms (~35% reduction).

  Tests cover the structural change (regressions back to the per-tile patterns fail loudly) plus end-to-end round-trip equality at deflate and zstd compression. _check_gpu_memory now bounds the new contiguous allocation in the same way the decode-side fix does.

* Address Copilot review feedback on #1729

  - Add a _check_gpu_memory guard before the nvCOMP staging-buffer concatenate (mirrors the _batched_d2h_to_bytes pattern)
  - Tighten the GPU skip to require cupy.cuda.is_available()
1 parent 09ac06a commit 635246c

2 files changed

Lines changed: 213 additions & 8 deletions
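
Before the per-file diffs, a minimal sketch of the first change described in the commit message: one contiguous pool plus slab views and a device pointer table, instead of one cupy.empty per tile. The helper name alloc_output_slabs is hypothetical; this is an illustration of the pattern, not code from this commit.

import cupy
import numpy as np


def alloc_output_slabs(n_tiles, max_cs):
    """Sketch: one pool allocation replaces n_tiles cupy.empty calls.

    Returns the backing pool (kept alive by the caller), per-tile
    zero-copy views into it, and a device-side uint64 pointer table
    that a batched compressor can treat as independent output buffers.
    """
    pool = cupy.empty(n_tiles * max_cs, dtype=cupy.uint8)   # single memory-pool call
    slabs = [pool[i * max_cs:(i + 1) * max_cs]              # non-overlapping views
             for i in range(n_tiles)]
    # Build the pointer table with host-side arithmetic, then one H2D copy.
    ptrs = np.arange(n_tiles, dtype=np.uint64) * np.uint64(max_cs)
    ptrs += np.uint64(int(pool.data.ptr))
    return pool, slabs, cupy.asarray(ptrs)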

xrspatial/geotiff/_gpu_decode.py

Lines changed: 56 additions & 8 deletions
@@ -2453,12 +2453,31 @@ class _DeflateCompOpts(ctypes.Structure):
         return None
     max_cs = max_comp_size.value

-    # Allocate compressed output buffers on device
-    d_comp_bufs = [cupy.empty(max_cs, dtype=cupy.uint8) for _ in range(n_tiles)]
+    # Allocate one contiguous device buffer of size ``n_tiles *
+    # max_cs`` and view ``n_tiles`` per-tile slabs into it. This
+    # mirrors the single-buffer pattern already used by the decode
+    # path in ``_try_nvcomp_from_device_bufs`` (#1659) and the GDS
+    # fallback in ``gpu_decode_tiles_from_file`` (#1552). Per-tile
+    # ``cupy.empty`` calls each round-trip through the memory pool
+    # and add up to milliseconds of overhead for thousand-tile
+    # writes; one allocation + pointer arithmetic skips that
+    # bookkeeping. The slab views are non-overlapping so nvCOMP
+    # can write into them in parallel via the per-tile pointer
+    # array. See issue #1712.
+    _check_gpu_memory(n_tiles * max_cs,
+                      what="nvCOMP compressed-output buffer")
+    d_comp_pool = cupy.empty(n_tiles * max_cs, dtype=cupy.uint8)
+    comp_base_ptr = int(d_comp_pool.data.ptr)
+    d_comp_bufs = [
+        d_comp_pool[i * max_cs:(i + 1) * max_cs]
+        for i in range(n_tiles)
+    ]
+    comp_ptrs_host = np.arange(n_tiles, dtype=np.uint64) * np.uint64(max_cs)
+    comp_ptrs_host += np.uint64(comp_base_ptr)

     # Build pointer and size arrays
     d_uncomp_ptrs = cupy.array([b.data.ptr for b in d_tile_bufs], dtype=cupy.uint64)
-    d_comp_ptrs = cupy.array([b.data.ptr for b in d_comp_bufs], dtype=cupy.uint64)
+    d_comp_ptrs = cupy.asarray(comp_ptrs_host)
     d_uncomp_sizes = cupy.full(n_tiles, tile_bytes, dtype=cupy.uint64)
     d_comp_sizes = cupy.empty(n_tiles, dtype=cupy.uint64)

@@ -2518,18 +2537,47 @@ class _DeflateCompOpts(ctypes.Structure):
         adler_checksums[i] = zlib.adler32(
             host_view[i * tile_bytes:(i + 1) * tile_bytes])

-    # Read compressed sizes and data back to CPU
+    # Read compressed sizes and data back to CPU.
+    #
+    # Previously this loop ran one ``.get()`` per tile, which
+    # serialised on the default stream and burned a per-DMA setup
+    # cost (issue #1712, same pattern as the decode-side fix in
+    # #1552). Instead, hoist the variable-size slabs into a
+    # contiguous buffer with one ``cupy.concatenate``, do one D2H
+    # transfer, then slice the host buffer per tile.
     comp_sizes = d_comp_sizes.get().astype(int)
+    if n_tiles == 0:
+        return []
+    # Each tile's compressed payload lives in the first
+    # ``comp_sizes[i]`` bytes of its ``max_cs``-sized slab. Build a
+    # list of trimmed views and concatenate into one packed device
+    # buffer; the host then sees one DMA-sized region.
+    trimmed = [
+        d_comp_bufs[i][:int(comp_sizes[i])] for i in range(n_tiles)
+    ]
+    # The concat allocates a fresh device buffer of ``sum(comp_sizes)``
+    # bytes -- a peak-VRAM bump on top of the ``n_tiles * max_cs``
+    # pool above. Guard it explicitly so an OOM here surfaces with
+    # the same message style as ``_batched_d2h_to_bytes`` rather
+    # than silently triggering the broad except below.
+    _check_gpu_memory(int(sum(comp_sizes)),
+                      what="nvCOMP batch staging buffer")
+    d_comp_concat = cupy.concatenate(trimmed)
+    host_concat = bytes(d_comp_concat.get())
+
+    # Walk host_concat by per-tile offsets to peel out each tile's
+    # compressed bytes. Cumulative-sum on the host avoids a second
+    # device round-trip.
+    offsets = np.zeros(n_tiles + 1, dtype=np.int64)
+    np.cumsum(comp_sizes, out=offsets[1:])
+
     result = []
     for i in range(n_tiles):
-        cs = int(comp_sizes[i])
-        raw = d_comp_bufs[i][:cs].get().tobytes()
-
+        raw = host_concat[offsets[i]:offsets[i + 1]]
         if adler_checksums is not None:
             # Wrap raw deflate in zlib format: header + data + adler32
             checksum = struct.pack('>I', adler_checksums[i] & 0xFFFFFFFF)
             raw = b'\x78\x9c' + raw + checksum
-
         result.append(raw)

     return result
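
The readback half of the hunk above follows the same batching idea; in isolation it looks roughly like this. A minimal sketch with hypothetical names (batched_readback), assuming the per-tile compressed sizes have already been copied back to the host; it is an illustration of the pattern, not the module's API.

import cupy
import numpy as np


def batched_readback(slabs, comp_sizes):
    """Sketch: one concatenate + one .get() instead of a .get() per tile."""
    if len(slabs) == 0:
        return []
    # Trim each slab to its real payload, pack on device, copy once.
    trimmed = [slab[:int(cs)] for slab, cs in zip(slabs, comp_sizes)]
    packed = cupy.concatenate(trimmed)
    host = bytes(packed.get())                      # single D2H transfer
    # Cumulative sizes give the per-tile offsets into the host buffer.
    offsets = np.zeros(len(slabs) + 1, dtype=np.int64)
    np.cumsum(np.asarray(comp_sizes, dtype=np.int64), out=offsets[1:])
    return [host[offsets[i]:offsets[i + 1]] for i in range(len(slabs))]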
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
+"""Coverage for the batched ``_nvcomp_batch_compress`` (#1712).
+
+The pre-fix function allocated compressed-output device buffers one
+``cupy.empty`` per tile and then read each tile back to host with one
+``.get()`` per tile. Both patterns serialised on the default CUDA
+stream and were dominant in large-N writes. The fix folds both into a
+single contiguous device allocation + a single batched D2H concat-and-
+``.get()``, matching the patterns already in use on the decode side
+(#1552, #1659).
+
+These tests pin the new shape and confirm the deflate / zstd GPU write
+paths still round-trip end-to-end.
+"""
+from __future__ import annotations
+
+import importlib.util
+import inspect
+import os
+import tempfile
+
+import numpy as np
+import pytest
+import xarray as xr
+
+try:
+    import cupy
+    _HAS_CUPY = True
+except Exception:
+    _HAS_CUPY = False
+
+
+def _gpu_available() -> bool:
+    """Match the geotiff-test convention: cupy import AND working CUDA.
+
+    A host can have cupy installed without a usable CUDA runtime (no
+    driver, no device visible, container misconfig), and in that case
+    every test that calls into the GPU writer would fail rather than
+    skip. ``cupy.cuda.is_available()`` is the cheap probe.
+    """
+    if importlib.util.find_spec("cupy") is None:
+        return False
+    try:
+        import cupy
+        return bool(cupy.cuda.is_available())
+    except Exception:
+        return False
+
+
+_HAS_GPU = _gpu_available()
+
+# nvCOMP is the entry point that exercises this code path.
+from xrspatial.geotiff import _gpu_decode
+
+needs_cupy = pytest.mark.skipif(
+    not _HAS_GPU, reason="cupy + CUDA required"
+)
+
+
+# ----------------------------------------------------------------------
+# Source-level structural assertions -- run on any host with the source
+# available, no GPU required.
+# ----------------------------------------------------------------------
+
+def test_no_per_tile_cupy_empty_in_compressed_pool():
+    """The per-tile cupy.empty list comprehension is gone (#1712).
+
+    The fix replaced it with a single contiguous allocation. Catch any
+    regression that brings the loop back.
+    """
+    source = inspect.getsource(_gpu_decode._nvcomp_batch_compress)
+    assert "cupy.empty(max_cs, dtype=cupy.uint8) for _ in range" not in source, (
+        "_nvcomp_batch_compress regressed to per-tile cupy.empty "
+        "allocations for the compressed output pool. See #1712."
+    )
+
+
+def test_no_per_tile_get_in_result_loop():
+    """The per-tile ``d_comp_bufs[i][:cs].get().tobytes()`` is gone (#1712).
+
+    The fix replaced it with one concat + one ``.get()``. Catch any
+    regression that brings the per-tile pattern back.
+    """
+    source = inspect.getsource(_gpu_decode._nvcomp_batch_compress)
+    # The exact string the prior loop used:
+    bad_fragment = "d_comp_bufs[i][:cs].get().tobytes()"
+    assert bad_fragment not in source, (
+        "_nvcomp_batch_compress regressed to per-tile .get().tobytes() "
+        "D2H readback. See #1712."
+    )
+
+
+# ----------------------------------------------------------------------
+# End-to-end behaviour: GPU write + read round-trip stays correct
+# ----------------------------------------------------------------------
+
+@needs_cupy
+@pytest.mark.parametrize("compression", ["deflate", "zstd"])
+def test_gpu_write_roundtrip_after_batched_compress(compression):
+    """GPU compress path round-trips uncorrupted for deflate + zstd.
+
+    Catches the most likely regression mode: any off-by-one in the
+    cumulative-sum offsets used to slice the host-side concatenated
+    buffer would scramble tile order, which a round-trip equality
+    check picks up immediately.
+    """
+    from xrspatial.geotiff import write_geotiff_gpu, open_geotiff
+
+    rng = np.random.default_rng(seed=1712)
+    arr_cpu = rng.random((512, 512), dtype=np.float32)
+    arr_gpu = cupy.asarray(arr_cpu)
+    darr = xr.DataArray(arr_gpu, dims=["y", "x"])
+
+    with tempfile.TemporaryDirectory(prefix="nvcomp_batch_1712_") as td:
+        path = os.path.join(td, f"roundtrip_{compression}.tif")
+        try:
+            write_geotiff_gpu(
+                darr, path,
+                compression=compression,
+                tiled=True,
+                tile_size=64,
+            )
+        except RuntimeError as e:
+            # nvCOMP may be unavailable in this environment; the writer
+            # falls back to CPU and that path doesn't exercise the
+            # change. Skip rather than fail.
+            pytest.skip(f"nvCOMP unavailable for {compression}: {e}")
+
+        back = open_geotiff(path)
+        np.testing.assert_allclose(back.values, arr_cpu, rtol=0, atol=0)
+
+
+@needs_cupy
+def test_gpu_write_zero_tile_edge_case():
+    """A 0-tile compress returns an empty list without indexing into None.
+
+    The cumulative-sum / concat path must short-circuit before
+    ``cupy.concatenate`` (which would raise on an empty list). The
+    pre-fix loop simply iterated zero times, so the contract is the
+    same empty-list output.
+    """
+    # Direct call into the internal function with n_tiles=0 is
+    # awkward because it needs a libnvCOMP handle and matching opts.
+    # Instead, exercise the public writer with a tiny single-tile
+    # input and confirm the fast path does not crash. Real n_tiles==0
+    # never occurs via the writer (every image has at least one tile).
+    from xrspatial.geotiff import write_geotiff_gpu, open_geotiff
+    arr_gpu = cupy.zeros((32, 32), dtype=cupy.float32)
+    darr = xr.DataArray(arr_gpu, dims=["y", "x"])
+    with tempfile.TemporaryDirectory(prefix="nvcomp_batch_1712_") as td:
+        path = os.path.join(td, "tiny.tif")
+        try:
+            write_geotiff_gpu(darr, path, compression="zstd",
+                              tiled=True, tile_size=32)
+        except RuntimeError as e:
+            pytest.skip(f"nvCOMP unavailable: {e}")
+        back = open_geotiff(path)
+        assert back.shape == (32, 32)
