Skip to content

Commit 33a8b7d

Browse files
authored
perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688) (#1693)
* perf(geotiff): batch _try_kvikio_read_tiles pread submissions (#1688) Replaces the per-tile cupy.empty + blocking IOFuture.get() inside the kvikio GDS path with a single contiguous device buffer, batched pread submissions, and a _check_gpu_memory guard up front. The old loop alternated submit -> wait -> submit -> wait, so the kvikio worker pool only saw one outstanding pread at a time and the per-tile cupy.empty() setup cost compounded across all tiles. The new pattern allocates once, submits every pread before the first .get(), and lets the worker pool overlap the reads. Microbench with 8-worker pool simulation, 256 tiles @ 1ms IO latency: old 256ms vs new 38.7ms (~6.6x). Single-thread simulation: 28.5ms (9x). Adds 9 unit tests covering the kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard contract, partial- read fallback, end-to-end data round-trip, and zero-size / all-sparse tile edge cases. The fake CuFile lets the structural checks run on hosts without a real GDS install. * Address PR #1693 review: tighten batching test + flake8 + comment accuracy * test_all_preads_submitted_before_any_get now records both submit and get events into a single ordered timeline and asserts every submit occurs before the first get. The prior version asserted on per-event lists ([0,1,2,3] each), which the legacy interleaved submit->get->submit->get loop also satisfies, so the test could not catch a regression to that pattern. Verified by temporarily reverting _try_kvikio_read_tiles to the interleaved pattern: new assertion fails with a clear "preads and gets are interleaved" message showing the [submit,get,submit,get,...] timeline. * Removed the unused ``import sys`` and the no-op ``fake_mod_obj`` lines from test_all_zero_size_tiles_returns_zero_length_views. flake8 now reports no F401/F841 on the test file. * Reworded the MemoryError comment in _try_kvikio_read_tiles. The previous wording claimed the CPU-mmap fallback "does not pre-allocate the full compressed payload", but gpu_decode_tiles still calls ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes. The new wording explains the fallback skips the GDS-specific contiguous read buffer but still pays the bulk device allocation.
1 parent 71d4f51 commit 33a8b7d

3 files changed

Lines changed: 478 additions & 10 deletions

File tree

.claude/sweep-performance-state.csv

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
1818
flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
1919
focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
2020
geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
21-
geotiff,2026-05-12,SAFE,IO-bound,0,1659,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds."
21+
geotiff,2026-05-12,SAFE,IO-bound,1,1688,"Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
2222
glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
2323
hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
2424
hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."

xrspatial/geotiff/_gpu_decode.py

Lines changed: 67 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -945,27 +945,85 @@ def _try_kvikio_read_tiles(file_path, tile_offsets, tile_byte_counts, tile_bytes
945945
directly from the NVMe drive to GPU VRAM, bypassing CPU entirely.
946946
Falls back to None if kvikio is not installed or GDS is not available.
947947
948-
Returns list of cupy arrays (one per tile) on GPU, or None.
948+
Allocates a single contiguous device buffer the size of all tiles,
949+
submits every ``pread`` call before waiting on any of the resulting
950+
futures, and returns per-tile views into the shared buffer. This
951+
mirrors the single-allocation pattern the sibling nvCOMP paths use
952+
(``_try_nvcomp_from_device_bufs`` at L1602, ``_try_nvcomp_batch_decompress``
953+
at L1108) and lets kvikio's internal worker pool overlap the file
954+
reads instead of serialising one ``IOFuture.get()`` per tile.
955+
956+
A ``_check_gpu_memory`` guard runs once against ``sum(tile_byte_counts)``
957+
before the allocation so the GDS path fails fast under malformed
958+
``TileByteCounts`` rather than OOM'ing the device one tile at a time.
959+
960+
See issue #1688.
961+
962+
Returns list of cupy arrays (one per tile, views into a shared
963+
buffer) on GPU, or None on partial read / setup failure.
949964
"""
965+
sizes = [int(bc) for bc in tile_byte_counts]
966+
n = len(sizes)
967+
if n == 0:
968+
return []
969+
950970
try:
951971
import kvikio
952972
import cupy
953973
except ImportError:
954974
return None
955975

976+
offsets = np.zeros(n, dtype=np.int64)
977+
if n > 1:
978+
np.cumsum(sizes[:-1], out=offsets[1:])
979+
total_bytes = int(sum(sizes))
980+
981+
if total_bytes == 0:
982+
# All tiles are sparse. Return per-tile zero-length views into a
983+
# zero-size buffer so callers iterating ``d_tiles`` still get N
984+
# entries in the original order.
985+
empty = cupy.empty(0, dtype=cupy.uint8)
986+
return [empty[0:0] for _ in range(n)]
987+
956988
try:
957-
d_tiles = []
989+
_check_gpu_memory(total_bytes, what="kvikio tile read buffer")
990+
combined = cupy.empty(total_bytes, dtype=cupy.uint8)
991+
992+
futures = []
958993
with kvikio.CuFile(file_path, 'r') as f:
959-
for off, bc in zip(tile_offsets, tile_byte_counts):
960-
buf = cupy.empty(bc, dtype=cupy.uint8)
961-
nbytes = f.pread(buf, file_offset=off)
962-
# Verify the read completed correctly
963-
actual = nbytes.get() if hasattr(nbytes, 'get') else int(nbytes)
964-
if actual != bc:
994+
for src_off, dst_off, bc in zip(tile_offsets, offsets, sizes):
995+
if bc == 0:
996+
futures.append(None)
997+
continue
998+
view = combined[dst_off:dst_off + bc]
999+
futures.append((f.pread(view, file_offset=int(src_off)), bc))
1000+
1001+
# Pass 2: wait on every submitted pread together so kvikio can
1002+
# overlap them in its internal thread pool. The historical
1003+
# loop called ``.get()`` between successive ``pread`` submits
1004+
# which forced one-at-a-time IO.
1005+
for fut in futures:
1006+
if fut is None:
1007+
continue
1008+
future, expected_bc = fut
1009+
actual = future.get() if hasattr(future, 'get') else int(future)
1010+
if actual != expected_bc:
9651011
return None # partial read, fall back
966-
d_tiles.append(buf)
1012+
9671013
cupy.cuda.Device().synchronize()
1014+
1015+
d_tiles = []
1016+
for dst_off, bc in zip(offsets, sizes):
1017+
d_tiles.append(combined[dst_off:dst_off + bc])
9681018
return d_tiles
1019+
except MemoryError:
1020+
# Surface OOM unchanged so the caller can decide how to recover.
1021+
# The bytes-based ``gpu_decode_tiles`` fallback still allocates
1022+
# ``total_comp`` bytes on the device (``d_comp = cupy.asarray(
1023+
# comp_buf_host)``), so it does not necessarily avoid this OOM.
1024+
# It does skip the GDS-specific contiguous read buffer though,
1025+
# which can help if the failure was kvikio-side rather than VRAM.
1026+
raise
9691027
except Exception as e:
9701028
# GDS not available, version mismatch, or CUDA error.
9711029
# Reset CUDA error state if possible (the inner pass stays broad

0 commit comments

Comments
 (0)