Skip to content

Add dtype, compression_level, and VRT output to geotiff I/O (#1083)#1085

Merged
brendancol merged 13 commits intomasterfrom
issue-1083
Mar 30, 2026
Merged

Add dtype, compression_level, and VRT output to geotiff I/O (#1083)#1085
brendancol merged 13 commits intomasterfrom
issue-1083

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • open_geotiff(path, dtype='float32') casts on read, one tile at a time, so the full-precision buffer is never allocated. Works on eager, dask, and GPU paths. Float-to-int casts are rejected (lossy); everything else is allowed.
  • to_geotiff(data, path, compression_level=N) exposes the codec's speed/size knob (zstd 1-22, deflate 1-9, lz4 0-16). Threaded through to the writer and GPU compress paths. Codecs without levels ignore it silently.
  • to_geotiff(data, 'out.vrt') writes a {stem}_tiles/ directory of per-chunk GeoTIFFs plus a VRT index with relative paths. For dask inputs, tiles write one at a time via scheduler='synchronous', so peak memory is one chunk. Numpy inputs are sliced by tile_size. cog=True and overview_levels raise ValueError with VRT output.

Closes #1083.

Test plan

  • 14 tests for compression_level (round-trips, file size ordering, invalid level rejection, LZW ignore)
  • 11 tests for dtype (eager + dask, float narrowing, int narrowing, float-to-int rejection, nodata+dtype interaction)
  • 11 tests for VRT output (numpy + dask round-trips, tile naming, relative paths, cog/overview rejection, non-empty dir guard)
  • Full geotiff suite: 380 passed, 4 skipped, 0 regressions

Covers dtype on read, compression_level on write, and VRT tiled
output from to_geotiff for streaming dask writes.
Thread a compression_level: int | None = None parameter from the
public to_geotiff() API down to the compress() call in the writer.
Validation rejects out-of-range levels for deflate (1-9), zstd (1-22),
and lz4 (0-16); codecs without a level concept (lzw, packbits, jpeg)
accept any value and ignore it.
write_geotiff_gpu now accepts compression_level: int | None = None so
callers that pass the parameter through to_geotiff on the GPU path no
longer get an unexpected keyword argument error. The value is accepted
silently since nvCOMP does not currently expose level control.
…port

- Use keyword form compression_level=compression_level at the two
  _write_tiled call sites in write() for clarity and refactor safety.
- Update compress() docstring to cover all codecs with level support
  (deflate, zstd, lz4) instead of only mentioning deflate.
- Remove unused tempfile import from test_compression_level.py.
Adds `dtype=None` to `open_geotiff`, `read_geotiff_dask`, `read_geotiff_gpu`,
and `read_vrt`. When specified, each path casts the array to the target dtype
after nodata masking. Float-to-int casts raise ValueError to prevent
accidental fractional data loss. Resolves part of #1083.
The stripped-file fallback in read_geotiff_gpu returned before
reaching the dtype cast block. Add the cast before the early return
so the dtype parameter is honored for non-tiled GeoTIFFs on the
GPU path.
The dask path validated the dtype cast against the raw file dtype
(e.g. uint16), but nodata masking promotes integer arrays to float64
inside the delayed worker. Requesting dtype='int32' on a uint16 file
with nodata would pass validation then silently cast float64+NaN to
int32, producing garbage. Now compute the effective post-masking dtype
and validate against that.

Also remove the dead `dtype` positional parameter from
_delayed_read_window and add tests for the nodata+dtype interaction
on both eager and dask paths.
When to_geotiff receives a path ending in .vrt, write a directory of
tiled GeoTIFFs with a VRT index file instead of a monolithic TIFF.
This lets dask arrays stream to disk without materializing the full
array in RAM.

- Dask inputs get one tile per chunk, written in parallel via
  dask.delayed
- Numpy inputs are sliced by tile_size
- cog=True and overview_levels raise ValueError with VRT output
- Non-empty tiles directory raises FileExistsError
- 11 new tests cover numpy/dask round-trips, tile naming, relative
  paths, compression forwarding, and edge cases
- Use scheduler='synchronous' in dask.compute so tiles write one at a
  time, keeping peak memory to a single chunk instead of N_cores chunks
- Reject 3D arrays in both the dask and numpy VRT paths with a clear
  error message, since raw.chunks[0] would read the band dimension
  instead of rows for (bands, y, x) shaped arrays
- Remove unused gpu parameter from _write_vrt_tiled signature and
  call site; CuPy detection already happens inside _write_single_tile
Demonstrates dtype on read, compression_level on write, and VRT tiled
output using self-contained tempdir examples.
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Mar 30, 2026
os.path.relpath produces backslashes on Windows, but VRT XML
expects forward slashes for cross-platform portability.
@brendancol brendancol merged commit 443ed78 into master Mar 30, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dtype, compression_level, and VRT output to geotiff I/O

1 participant